As a basis for ground truth in musical artist similarity, we collected lists of artists represented in the personal collections of music listeners. Many peer-to-peer file sharing systems allow the list of files on a particular node to be queried. Since summer 2001, we (meaning Brian Whitman) have been running such queries on some 3,700 nodes of the OpenNap file sharing network and inferring, from the file names, the musical artists present in the collection at each node.
The data presented here reflects queries up until February 2002 and originally comprised about 1.6 million user-to-song relations. Regularization to remove misspellings and exclude unknown artists reduced this to the 317,470 user-to-song relations described below.
We defined a set of 400 highly represented artists, which we call the aset400. The list of artists is in aset400.txt.
One compact way to represent the data is as a similarity matrix giving a similarity between each pair of the 400 artists: a high similarity indicates a high likelihood that the two artists co-occur in user collections, and a low similarity a low likelihood. This 400x400 matrix is available here as aset-opennap-sim.txt. All 160,000 values are smaller than 1, except for those on the leading diagonal (each artist compared with itself).
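To make the idea of a co-occurrence similarity concrete, here is a minimal sketch. It is not the measure used to build aset-opennap-sim.txt (that is described in the paper); it substitutes a plain Jaccard overlap between the sets of collections that contain each artist. The function name `cooccurrence_similarity` and the four toy collections are hypothetical and are not taken from the OpenNap data.

```python
from itertools import combinations

def cooccurrence_similarity(collections):
    """Map each artist pair to the Jaccard overlap of the collections holding them.

    Assumption: `collections` maps a collection id to a list of artist names.
    This is a stand-in measure, not the one used for the published matrix.
    """
    # For each artist, record the set of collections it appears in.
    holders = {}
    for collection_id, artists in collections.items():
        for artist in set(artists):
            holders.setdefault(artist, set()).add(collection_id)

    sim = {}
    # An artist compared with itself gets the maximum similarity.
    for artist in holders:
        sim[(artist, artist)] = 1.0
    # Off-diagonal entries: shared collections over collections holding either.
    for a, b in combinations(sorted(holders), 2):
        shared = len(holders[a] & holders[b])
        either = len(holders[a] | holders[b])
        sim[(a, b)] = sim[(b, a)] = shared / either
    return sim

# Toy data for illustration only.
toy_collections = {
    "user1": ["Madonna", "The Beatles", "Shaggy"],
    "user2": ["Madonna", "The Beatles"],
    "user3": ["Madonna", "Shaggy"],
    "user4": ["The Beatles"],
}
sim = cooccurrence_similarity(toy_collections)
print(sim[("Madonna", "The Beatles")])  # 2 shared collections out of 4
```

With this stand-in measure the matrix is symmetric and the diagonal is 1. Off-diagonal values can also reach 1 if two artists always appear together, so it does not reproduce every property of the published matrix.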
Here is the data in various forms:
And, for relating this data to our 400-element artist set:
Statistic | Value |
Total song-in-collection observations | 317,470 |
Total collections | 3,245 |
Unique artists identified | 4,591 |
Unique songs identified | 65,047 |
Unique collection-artist relations | 176,113 |
Average songs/collection | 97.8 |
Average artists/collection | 54.3 |
Maximum songs by a single artist in one collection | 216 (the vast majority of collection-artist relations consist of a single song) |
Most popular song | 398 occurrences of "It Wasn't Me" by Shaggy |
Artist with the most songs in collections | 2,822 songs (0.89%) by The Beatles (in 589 collections) |
Artist appearing in most collections | 982 collections (30.3%) containing songs by Madonna |
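The statistics above can all be derived from a flat list of song-in-collection observations. The sketch below shows one way to do that, assuming each observation is a `(collection_id, artist, song)` tuple with no duplicates; that layout, the function name `summarize`, and the toy observations are assumptions for illustration and say nothing about how the released files are organized.

```python
from collections import Counter

def summarize(relations):
    """Summarize (collection_id, artist, song) tuples, one per observed song."""
    relations = list(relations)
    collections = {c for c, _, _ in relations}
    # Number of songs by each artist within each collection.
    songs_per_pair = Counter((c, a) for c, a, _ in relations)
    # Number of collections in which each (artist, song) occurs.
    song_counts = Counter((a, s) for _, a, s in relations)
    # Number of collections containing each artist (one count per pair key).
    collections_per_artist = Counter(a for _, a in songs_per_pair)
    return {
        "observations": len(relations),
        "collections": len(collections),
        "unique_artists": len(collections_per_artist),
        "unique_songs": len(song_counts),
        "collection_artist_relations": len(songs_per_pair),
        "avg_songs_per_collection": len(relations) / len(collections),
        "avg_artists_per_collection": len(songs_per_pair) / len(collections),
        "max_songs_one_artist_one_collection": max(songs_per_pair.values()),
        "most_popular_song": song_counts.most_common(1)[0],
        "artist_in_most_collections": collections_per_artist.most_common(1)[0],
    }

# Toy observations for illustration only.
toy_relations = [
    ("u1", "Madonna", "Vogue"),
    ("u1", "Madonna", "Music"),
    ("u1", "Shaggy", "It Wasn't Me"),
    ("u2", "Shaggy", "It Wasn't Me"),
    ("u2", "The Beatles", "Help!"),
    ("u3", "Madonna", "Vogue"),
    ("u3", "Shaggy", "It Wasn't Me"),
]
stats = summarize(toy_relations)
print(stats["avg_songs_per_collection"])
```

The two averages in the table follow from the totals in the same way: 317,470 / 3,245 is about 97.8 songs per collection, and 176,113 / 3,245 is about 54.3 artists per collection.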
More information about the similarity measure, and what we did with it, is in our ISMIR-02 paper, "The Quest for Ground Truth in Musical Artist Similarity".