I analyzed the lineups at the most popular nightclubs

Original link: https://dev.karltryggvason.com/how-i-analyzed-the-lineups-at-the-worlds-most-popular-nightclubs/

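## Clubster Analysis: A deep dive into the dance music scene

This project explores the global dance music club scene using data scraped from Resident Advisor (RA). A Python scraper built with Beautiful Soup, which respected RA's robots.txt, collected club, event and artist data for 2019 (the last full year before the pandemic). The data was then analyzed with Pandas and NetworkX to measure the similarity between clubs based on shared artists (using the Jaccard index) and to identify distinct "club communities". The results were surprising: although club bookings are often perceived as homogeneous, the average lineup overlap between clubs was only 1%, suggesting a more diverse scene than expected. Resident DJs also played a smaller role than expected, with most clubs favouring a rotating cast of artists. The data is visualized with D3, allowing interactive exploration and comparison of club clusters. The project combines data-driven storytelling with user exploration, offering filters and animations to aid understanding. It was built with Python, Pandas, NetworkX, D3 and React, and the complete project is available on GitHub, offering an engaging look at the world of dance music.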

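## Hacker News discussion: nightclub lineup analysis

A user (kalli) analyzed the lineups of popular nightclubs and visualized the data on [karltryggvason.com](https://karltryggvason.com). The project sparked debate about the data source and its accuracy. Many commenters pointed out inconsistencies, especially in how venues were classified as nightclubs: some of the listed venues are bars or small venues, and other venues are missing entirely. The core issue is the data source, Resident Advisor (RA). Users noted that RA mainly features independent promoters and smaller events, which puts large clubs that rely on broader promotion strategies or exclusive ticketing deals (such as AXS or Ticketmaster) at a disadvantage. The discussion also touched on the nature of club culture. Some argued that clubs prioritize atmosphere and artist quality over popularity, while others stressed the importance of atmosphere and the social experience. Many commenters shared their favourite clubs and artists and debated whether people go to clubs for the music itself or for the social scene. In the end, despite the limitations of the data, the project was praised for its visualization, and the author acknowledged the challenge of capturing the nuances of "culture" from raw data.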

Original text

A few years back I did a bit of dance music related data visualization over at Lazily Evaluated. My favourite was an analysis of clubs and their lineups using Resident Advisor (RA) data, which I called Clubster Analysis. I always wanted to dig into the technical aspects of gathering the data, analyzing it and building the charts and graphs to tell a story and give people insight. With this blog I now have the right venue for that kind of tech talk, so here goes.

## Data gathering

To visualize data, first you have to get some! For this purpose I wrote a little scraper in Python. I used Beautiful Soup to parse the html and grab the bits and pieces I was interested in.

My scraping of a few thousand pages didn’t put any considerable load on the RA servers. But in the age of overzealous AI scrapers it’s worth being polite, so I throttled according to their robots.txt. I also maintained a local cache of the html files I had already downloaded, so that I wouldn’t have to fetch the same data repeatedly (past lineups are unlikely to change after the fact) just because I discovered some bug or error in my parsing.
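The throttling and caching described above can be sketched roughly as follows. This is a minimal illustration, not the original scraper: the names `crawl_delay_from_robots` and `CachedFetcher`, the one-second fallback delay and the hashed file naming are all assumptions, and the actual download function is passed in so the sketch stays independent of any HTTP library.

```python
import hashlib
import time
import urllib.robotparser
from pathlib import Path
from typing import Callable, Optional


def crawl_delay_from_robots(robots_txt: str, user_agent: str = "*",
                            default: float = 1.0) -> float:
    """Return the Crawl-delay that robots.txt sets for user_agent, else default."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


class CachedFetcher:
    """Wrap a download function with an on-disk cache and a minimum delay."""

    def __init__(self, fetch: Callable[[str], str], cache_dir: Path,
                 delay_seconds: float) -> None:
        self.fetch = fetch  # hypothetical downloader: takes a URL, returns html
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.delay_seconds = delay_seconds
        self._last_request: Optional[float] = None

    def get(self, url: str) -> str:
        # One cache file per URL, named by a hash so any URL is a valid filename.
        name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
        path = self.cache_dir / name
        if path.exists():
            return path.read_text(encoding="utf-8")  # cache hit, no request

        # Cache miss: wait until delay_seconds have passed since the last request.
        if self._last_request is not None:
            wait = self.delay_seconds - (time.monotonic() - self._last_request)
            if wait > 0:
                time.sleep(wait)

        html = self.fetch(url)
        self._last_request = time.monotonic()
        path.write_text(html, encoding="utf-8")
        return html
```

With a setup like this, fixing a parsing bug only means re-reading the cached files; the server is contacted again only for pages that were never downloaded.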

The order I scraped in was:

1. Get the 20 most popular regions in RA (and then I dropped “Streamland”, which was a pandemic-era pseudo-region).
2. Fetch the most popular clubs and some related metadata for all of those regions.
3. For each club, get the lineups for every 2019 event of theirs (the last full year before the pandemic started).
4. Save the results to csv files.

## Cleanup, verification and analysis

I did some spot checks to verify that my parsing was working as I expected and added tests to make sure I handled edge cases and normalized artist names. There was a lot of variance in how dates were formatted, how artists were linked, etc.
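As an illustration of what normalizing artist names can look like, here is a small sketch. The specific rules (Unicode NFKC normalization, collapsing whitespace, case folding) are assumptions chosen for the example, not necessarily the rules the original parser applied, and `normalize_artist_name` is a hypothetical helper.

```python
import re
import unicodedata


def normalize_artist_name(raw: str) -> str:
    """Map differently formatted spellings of a name to one comparison key."""
    # NFKC folds compatibility characters, e.g. a non-breaking space becomes a space.
    name = unicodedata.normalize("NFKC", raw)
    # Collapse runs of whitespace and trim the ends.
    name = re.sub(r"\s+", " ", name).strip()
    # Case-insensitive key; keep the raw name separately for display.
    return name.casefold()


# Two spellings of the same made-up name end up with the same key.
assert normalize_artist_name("  Some\u00a0 DJ ") == normalize_artist_name("some dj")
```

Edge cases found during spot checks can then be pinned down as small tests against a function like this.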

After that I analyzed the data. I built one big table/dataframe in Pandas by joining all the info from the csv files. Then I calculated the similarity between each pair of clubs in the data set using the Jaccard index: consider all the artists that have played at two given clubs, and take the size of the intersection (the number of artists that featured in lineups at both clubs) over the size of the union (all the artists that performed at one or the other). As an example, if Club A had 100 artists booked, Club B had 100 artists, and they had 10 artists in common, the union would be 100 + 100 - 10 = 190 artists and the Jaccard index would be 10/190, or roughly 5%. This gives you a good way to compare large and small clubs and balances large and small lineups (some of the clubs have multiple rooms with very long events, others have one DJ playing in one room all night long once a week).
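The Jaccard calculation is short enough to sketch with plain Python sets. The function name and the placeholder artist labels are made up for illustration; the numbers reproduce the Club A / Club B example from the paragraph above.

```python
def jaccard_index(artists_a: set, artists_b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = artists_a | artists_b
    if not union:
        return 0.0  # two empty lineups: define the similarity as zero
    return len(artists_a & artists_b) / len(union)


# Club A books artists 0..99 and Club B books artists 90..189,
# so each club has 100 artists and exactly 10 are shared.
club_a = {f"artist_{i}" for i in range(100)}
club_b = {f"artist_{i}" for i in range(90, 190)}

similarity = jaccard_index(club_a, club_b)  # 10 / 190
print(f"{similarity:.1%}")
```

Because the denominator is the union, a huge club and a tiny club only score high if most of the combined roster is shared, which is what keeps big lineups from dominating.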

Based on the Jaccard index we can build a graph of all the clubs using NetworkX. Each club is a node, and the edge between two nodes is weighted by the similarity of those clubs. On top of the graph we run community detection to create clusters (hence the clubster name). This gives us a rough idea of which clubs are most similar, that is to say, have similar tastes in their bookings.
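A toy version of the graph and clustering step might look like the sketch below. The post does not say which community detection algorithm was used, so NetworkX's `greedy_modularity_communities` is an assumption here, and the six clubs and their lineups are invented purely so the result is easy to check by eye.

```python
from itertools import combinations

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities


def build_club_graph(lineups: dict) -> nx.Graph:
    """One node per club; an edge for every pair of clubs sharing an artist."""
    graph = nx.Graph()
    graph.add_nodes_from(lineups)
    for (club_a, artists_a), (club_b, artists_b) in combinations(lineups.items(), 2):
        shared = artists_a & artists_b
        if shared:
            # Edge weight is the Jaccard index of the two lineups.
            weight = len(shared) / len(artists_a | artists_b)
            graph.add_edge(club_a, club_b, weight=weight)
    return graph


# Invented lineups: two groups of clubs with one artist (a1) bridging them.
lineups = {
    "Club A": {"a1", "a2", "a3", "a4"},
    "Club B": {"a1", "a2", "a3", "a5"},
    "Club C": {"a2", "a3", "a4", "a5"},
    "Club D": {"b1", "b2", "b3", "b4"},
    "Club E": {"b1", "b2", "b3", "b5"},
    "Club F": {"b2", "b3", "b4", "a1"},
}

graph = build_club_graph(lineups)
communities = greedy_modularity_communities(graph, weight="weight")
clusters = {frozenset(members) for members in communities}
```

In this toy data the weak links through the shared artist are not enough to pull Club F away from Clubs D and E, so the clubs split into two communities of three.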

## Results

For the year 2019, there were 131 clubs in the data set with 8,502 events. There were 9,405 unique artists making up 30,482 individual bookings. This means that the average artist in the dataset was booked 3.24 times at those clubs in that year and the average event had about 3.6 artists on the lineup.

As a whole, out of 8,515 possible pairs of clubs, 3,716 pairs had some overlap in their bookings, and among those the average overlap was 1%. This was lower than I thought: the bookings at European clubs felt more homogeneous to me, but I suppose they book a lot of artists. It would be interesting to get more data, recent and historic, and see how this has evolved through time.
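The headline figures in this section can be cross-checked with a few lines of arithmetic. The counts are the ones reported above; the variable names are just for illustration.

```python
from math import comb

clubs = 131
events = 8502
unique_artists = 9405
bookings = 30482
overlapping_pairs = 3716

# Every unordered pair of distinct clubs: 131 choose 2.
possible_pairs = comb(clubs, 2)

bookings_per_artist = bookings / unique_artists
artists_per_event = bookings / events
share_with_overlap = overlapping_pairs / possible_pairs

print(possible_pairs)                 # 8515
print(f"{bookings_per_artist:.2f}")   # 3.24
print(f"{artists_per_event:.2f}")     # 3.59
print(f"{share_with_overlap:.1%}")    # 43.6%
```

So a bit under half of all club pairs share at least one artist, and the 1% average applies to the overlap within that subset.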

## Visualization

This was my first time using D3 to draw charts. There was a bit of a learning curve, as in earlier projects I had used higher-level charting libraries with simpler APIs. But with D3 you get a lot of control over how your charts look and behave, which I think I used to good effect in this instance.

My main goal was to visualize the clusters and to allow people to interact with the clubs. I coloured the clubs according to their clusters and sized them based on the number of followers they had on RA. I played around with the gravity and placement of the clusters, trying to find a balance that worked on different screen sizes as well as being a fair portrayal of the different communities.

Figure: Interactive comparison of two clubs from the dataset. Users can click and compare different clubs in the dataset.

I then did some scrollytelling to tell the story of the data, as I saw it, while the reader scrolls down the page. But I also added filters and interactivity for people to explore and see if they agree with my telling of the story or if they can find one of their own.

At the time I didn’t find any great React and D3 bridges, so it was a bit of a hassle getting the React components to play nice with the D3 graph, but in the end I was able to connect the two with createRef to the D3 svg component.

Besides the clustering, I looked into the “resident factor”: how often an artist was booked at a club repeatedly compared to all the one-offs. This was lower than expected. Most of these clubs were booking a constantly rotating assembly of talent, and residents don’t play as big a part as I would have thought.
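One simple way to put a number on this is to count how many club and artist combinations occur more than once. This is only a sketch of one possible definition: the post does not spell out the exact metric, so `resident_share` and its cutoff of "booked more than once at the same club" are assumptions.

```python
from collections import Counter
from typing import Iterable, Tuple


def resident_share(bookings: Iterable[Tuple[str, str]]) -> float:
    """Share of (club, artist) pairs that were booked more than once.

    bookings holds one (club, artist) tuple per individual booking.
    """
    counts = Counter(bookings)
    if not counts:
        return 0.0
    repeats = sum(1 for n in counts.values() if n > 1)
    return repeats / len(counts)


# Invented example: artist x plays Club A twice, everything else is a one-off.
bookings = [
    ("Club A", "x"),
    ("Club A", "x"),
    ("Club A", "y"),
    ("Club B", "x"),
]
share = resident_share(bookings)  # 1 repeat pair out of 3 distinct pairs
```

A stricter definition of a resident (say, at least monthly appearances) would only change the threshold in the comparison.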

Transitioning between the different sections of these graphs was one of my favourite parts. Seeing the clusters morph into dots and candlestick charts (and back again) was oddly satisfying. Took a lot of tweaking, but I think it really tied together the scrollytelling experience.

Figure: Transitioning between scenes in the Clubster Analysis visualization. D3 let me build neat animations to transition between charts.

I don’t think these transitions would have been possible with the higher level charting libraries I’d used previously. So the decision to go with D3 felt justified.

## Summary

This was a great pandemic project that combined web scraping, data analysis, and interactive visualization to explore the global dance music club scene. I learned me some D3 for the visualization, got better at doing cartesian graphing calculations in my head and learned about the underlying svg mechanics that power those graphs.

The results surprised me: despite my perceived homogeneity of European club bookings, an average overlap of only 1% between venues suggested a more diverse landscape than I expected. The diminished role of residents compared to one-off bookings also challenged my assumptions about how these clubs operate. For the storytelling, maintaining the balance between a narrative and letting users explore and decide for themselves was a fun challenge. I think these sorts of passion projects can give us deep insights into our world and culture.

The technical stack I worked with (Python, Pandas, NetworkX, D3, and React) proved powerful despite some integration challenges. The complete project is available on GitHub and you can explore the live interactive visualization yourself.

I had a lot of fun building this and am proud of the result. If you’re working on cultural data analysis, need help with web scraping and visualization, or just want to discuss interesting datasets, feel free to reach out.