
MusicSeer.com Data Statistics


The musicseer.com web site was used to run a web-based survey collecting subjective judgments about artist similarity. We wanted independent ground truth against which to validate our automatic musical artist similarity metrics. The site ran from March until October 2002. We describe the basic results in our ISMIR-02 paper, "The Quest for Ground Truth in Musical Artist Similarity"; see also the slides I presented at the conference.

We made the results of this survey available for others to use; you can download them from the local copy of the musicseer.com results page. In particular, that page is where you will find topset_to_sqlid, which maps the internal reference numbers to actual band names. (Unfortunately, these names are not canonicalized in the approved manner.) To map directly from the musicseer 'sqlid' reference numbers to indices into the aset400 list, you can use aset400.3-canon-musicseer.ids, a list of 400 sqlids corresponding to the aset400 artists; see the Matlab example on the metrics page.
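As a rough illustration of the sqlid-to-index lookup, here is a minimal Python sketch. It assumes, without having checked the actual file, that aset400.3-canon-musicseer.ids holds one integer sqlid per line, in aset400 order; the function name `sqlid_to_index` is made up for this example. Indices are 0-based here; add 1 for Matlab-style indexing.

```python
def sqlid_to_index(lines):
    """Map each sqlid to its 0-based position in the list.

    `lines` is any iterable of strings, e.g. an open file.
    Assumed format (not verified): one integer sqlid per line.
    """
    # Drop surrounding whitespace and skip blank lines.
    ids = [int(s) for s in (line.strip() for line in lines) if s]
    return {sqlid: index for index, sqlid in enumerate(ids)}


# Toy input standing in for the real file contents.
example = sqlid_to_index(["17\n", "4\n", "\n", "230\n"])
print(example)  # {17: 0, 4: 1, 230: 2}
```

With the real file, something like `sqlid_to_index(open("aset400.3-canon-musicseer.ids"))` should work if the format assumption holds.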

Note that musicseer.com is no longer controlled by us, and is currently run by one of those misdirected-web-page-scavenger organizations.

For interest, and for reference, here are some statistics regarding these data, specifically the musicseer-results-2002-10-15 dataset:

| | Game | Survey | Total | Notes |
|---|---|---|---|---|
| Raw users | 680 | 713 | 1,032 | overlap between survey and game |
| Raw judgments | 11,313 | 10,997 | 22,310 | |
| Raw triplets | 114,508 | 98,964 | 213,472 | Click on the count to download the complete set of triplets (with duplicates removed) |
| Filtered users | 602 | 541 | 842 | overlap |
| Filtered judgments | 9,828 | 7,276 | 17,104 | |
| Filtered triplets | 34,764 | 16,449 | 51,213 | Click on counts to download filtered lists in same format as main list |
| Known artists per filtered user | 18.89 | 19.41 | 16.18 | knowledgeable users are more likely to be in both subsets? |
| Known artists per filtered judgment | 5.54 | 4.26 | 4.99 | includes target and chosen, i.e. (#triplets / #judgments) + 2 |
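The "known artists per filtered judgment" figures can be reproduced from the filtered counts using the formula in the notes, (#triplets / #judgments) + 2. The short Python sketch below does that arithmetic. It also derives the user overlap by inclusion-exclusion, which assumes that the Total column counts distinct users; the function names are invented for this example.

```python
# Filtered counts copied from the statistics above: (judgments, triplets).
FILTERED = {
    "Game": (9828, 34764),
    "Survey": (7276, 16449),
    "Total": (17104, 51213),
}


def known_artists_per_judgment(judgments, triplets):
    """Average number of unchosen artists per judgment, plus target and chosen."""
    return triplets / judgments + 2


def user_overlap(game_users, survey_users, total_users):
    """Users in both subsets, assuming total_users counts distinct users."""
    return game_users + survey_users - total_users


for name, (judgments, triplets) in FILTERED.items():
    print(name, round(known_artists_per_judgment(judgments, triplets), 2))
# Game 5.54
# Survey 4.26
# Total 4.99

print("raw overlap:", user_overlap(680, 713, 1032))      # raw overlap: 361
print("filtered overlap:", user_overlap(602, 541, 842))  # filtered overlap: 301
```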

"Filtering" means removing all triplets in which the unchosen artist was never chosen by that particular user in a different trial, i.e. the cases in which we can't be sure that the user actually knew the unchosen artist. A choice is only meaningful when the user knew both the chosen and the unchosen artist.

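The filtering rule can be sketched in a few lines of Python. This is an illustration, not the code used to build the released lists. It assumes each triplet is available as a tuple (user, trial, target, chosen, unchosen); that layout, the trial identifier, and the function name `filter_triplets` are assumptions made for the example and may not match the downloadable files.

```python
from collections import defaultdict


def filter_triplets(triplets):
    """Keep triplets whose unchosen artist the same user chose in another trial."""
    # (user, artist) -> set of trial ids in which that user chose that artist.
    chosen_trials = defaultdict(set)
    for user, trial, target, chosen, unchosen in triplets:
        chosen_trials[(user, chosen)].add(trial)

    kept = []
    for user, trial, target, chosen, unchosen in triplets:
        # Trials other than this one in which the user picked the unchosen artist.
        other_trials = chosen_trials.get((user, unchosen), set()) - {trial}
        if other_trials:
            kept.append((user, trial, target, chosen, unchosen))
    return kept


# Toy data: user u1 chose B in trial 1 and C in trial 2; u2 only chose B.
toy = [
    ("u1", 1, "A", "B", "C"),  # kept: u1 chose C in trial 2
    ("u1", 1, "A", "B", "D"),  # dropped: u1 never chose D
    ("u1", 2, "E", "C", "B"),  # kept: u1 chose B in trial 1
    ("u2", 3, "A", "B", "C"),  # dropped: u2 never chose C
]
print(filter_triplets(toy))
```

On the toy data this keeps the first and third triplets only.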
There are 426 unique artist IDs in these data. It should have been 412, but some extras crept in.

I encountered the following issues constructing these results:


Last updated: 2003/08/07 03:41:06
Dan Ellis <dpwe@ee.columbia.edu>