In our ISMIR-2003 paper we considered the problem of evaluating and comparing different music similarity measures. Part of the problem is finding ground truth against which to evaluate the measures; even with ground truth in hand, we still need to define the figure of merit used to score a measure against it. We have come up with several measures directly related to the musicseer data, which consists of artist similarity judgments collected directly from users over the web:

- **Average Response Rank**: for the musicseer ground-truth data, what is the average rank under the new measure of the band that the users actually selected? Each ranking is normalized on a scale of 1 to 10 (for those cases where there were greater or fewer than 10 choices), so "random chance" gives an average rank of 5.5, and perfect agreement would give an average rank of 1.0.
- **Triplet agreement percentage**: for all "triplets" <target chosen not-chosen> from the musicseer data, what proportion of them are ordered the same way by the metric?
- **First place agreement percentage**: for each individual trial in the musicseer data, how often does the similarity metric's first choice of most similar artist match the one chosen by the user? This involves fewer comparisons than the triplet agreement, so it could be argued to be using the ground truth data less efficiently. However, we have plenty of musicseer trials to work with; a big plus for this measure is that, by treating each judgment as an independent trial, we can calculate significance for these results using a simple binomial model. Thus we can tell if two different results under this measure are *really* indicative of underlying differences in the techniques, or whether they can be explained by random variation.
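As a rough illustration of how these three figures of merit relate to one another, here is a minimal Python sketch on toy data. It is not the Matlab code used for the paper. The function names, the trial layout `(target, chosen, not_chosen)`, the linear mapping of ranks onto the 1 to 10 scale, and the handling of ties (counted against the measure) are all assumptions made for this example.

```python
def normalized_rank(rank, n_choices):
    """Map a 1-based rank among n_choices onto a 1..10 scale.

    A linear mapping is assumed here; it gives 5.5 for the average
    rank under random choice, whatever the number of choices.
    """
    if n_choices == 1:
        return 1.0
    return 1.0 + 9.0 * (rank - 1) / (n_choices - 1)


def score_trials(sim, trials):
    """Score a similarity matrix against survey-style trials.

    sim[t][a] is the similarity of artist a to target t (larger = closer).
    Each trial is (target, chosen, not_chosen), with not_chosen a list.
    Returns (average response rank, triplet agreement, first place agreement).
    """
    rank_sum = 0.0
    first_hits = 0
    triplet_hits = 0
    triplet_total = 0
    for target, chosen, not_chosen in trials:
        row = sim[target]
        # Rejected artists that the measure rates at least as similar as the
        # user's choice (ties count against the measure: an assumption).
        beaten_by = sum(1 for other in not_chosen if row[other] >= row[chosen])
        rank = 1 + beaten_by
        rank_sum += normalized_rank(rank, len(not_chosen) + 1)
        first_hits += (rank == 1)
        # One triplet <target chosen not-chosen> per rejected artist.
        triplet_hits += len(not_chosen) - beaten_by
        triplet_total += len(not_chosen)
    n = len(trials)
    return rank_sum / n, triplet_hits / triplet_total, first_hits / n


# Toy 4-artist similarity matrix and two made-up trials.
sim = [
    [1.0, 0.9, 0.2, 0.4],
    [0.9, 1.0, 0.3, 0.1],
    [0.2, 0.3, 1.0, 0.8],
    [0.4, 0.1, 0.8, 1.0],
]
trials = [
    (0, 1, [2, 3]),  # measure agrees: artist 1 is closest to artist 0
    (2, 1, [0, 3]),  # measure prefers artist 3, user chose artist 1
]

avg_rank, triplet_pct, first_pct = score_trials(sim, trials)
print(avg_rank, triplet_pct, first_pct)  # 3.25 0.75 0.5
```

In the second toy trial the user's choice is ranked 2nd of 3, which maps to 5.5 on the normalized scale, so the average response rank over both trials is 3.25.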

We also defined a metric to measure the similarity between any pair
of similarity measures (one of which could be a similarity measure derived
from the musicseer data, but that's not necessary). It calculates
a weighted score of the agreement between the first few artists rated
as most similar to each target artist in aset400.
We call it the **Top-N ranking agreement score**, and it is defined by:

where *s_i* is the score for artist

The Matlab script simvsgdtruth.m will compare a similarity matrix (e.g. a 400x400 matrix where each element is proportional to the similarity between the row and column artists) against musicseer survey-type data. Here's how it might be used:

    >> % Load the sim matrix
    >> ank = load('SIM-ank14C');
    >> % Read the musicseer data
    >> [tr, sg, uid, trg, cho, nch] = textread('musicseer-results-2002-10-15-nodups.txt','%d %c %s %d %d %d');
    >> % Choose just the survey data (unfiltered)
    >> Su = find(sg=='S');
    >> % Build the mapping to convert musicseer artist IDs to aset400 indices
    >> [name, sqlid] = textread('aset400.3-canon-musicseer.ids','%s %d');
    >> sql2topset = zeros(1,7000);
    >> % Make it so sql2topset(sql+2) will return the topset index, or 0 if not in aset400.
    >> % ("+2" is so that sql can be -1, which it sometimes is)
    >> sql2topset(sqlid+2) = 1:400;
    >> % OK, build the ground truth matrix: trial number, target, chosen, notchosen
    >> % for the unfiltered survey trials
    >> gdtrSu = [tr(Su),sql2topset(trg(Su)+2)',sql2topset(cho(Su)+2)',sql2topset(nch(Su)+2)'];
    >> % Now we can run the metric scoring:
    >> p = simvsgdtruth(ank, gdtrSu);
    10997 trials, 98964 triplets
    10905 valid trials (0 with empty notchosen), 19.73% first place agreement, avrank=4.314
    >>
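To give a feel for the binomial significance reasoning mentioned for first place agreement, the Python sketch below applies a normal approximation to the result above (19.73% over 10905 valid trials). It is an illustration only, not part of simvsgdtruth.m. The function names are made up, the second measure's 21% score is a hypothetical number chosen for the example, and the two-proportion test treats the two results as independent samples, which is a simplification when both measures are scored on the same trials.

```python
import math


def first_place_interval(p_hat, n_trials, z=1.96):
    """Approximate 95% interval for a binomial proportion (normal approximation)."""
    se = math.sqrt(p_hat * (1.0 - p_hat) / n_trials)
    return p_hat - z * se, p_hat + z * se


def two_proportion_z(p1, n1, p2, n2):
    """z statistic for the difference between two proportions.

    Assumes the two sets of trials are independent (a simplification).
    """
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p1 - p2) / se


# Figures reported by simvsgdtruth in the session above.
n_valid = 10905
p_ank = 0.1973

lo, hi = first_place_interval(p_ank, n_valid)
print("95%% interval: %.4f to %.4f" % (lo, hi))

# Hypothetical second measure scoring 21% on the same number of trials.
z = two_proportion_z(p_ank, n_valid, 0.21, n_valid)
print("z = %.2f, significant at 5%%: %s" % (z, abs(z) > 1.96))
```

With this many trials the interval is roughly plus or minus 0.75 percentage points, so under these assumptions a difference of a percentage point or more between two measures is unlikely to be explained by random variation alone.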

The Matlab script topNrankagree.m computes the top-N rank agreement score defined above. Here it is in use:

    >> % Load the 400x400 aset400 sim matrices
    >> playlst = load('SIM-aotm');
    >> collctn = load('SIM-opennap');
    >> % How well does playlst agree with collctn ground truth?
    >> topNrankagree(collctn,playlst)
    ans =
        0.2239
    >> % What about the other way around (playlst as ground truth)?
    >> topNrankagree(playlst,collctn)
    ans =
        0.2254
    >> % Note: 'tied' orderings are randomized, so there is a random component to the results:
    >> topNrankagree(playlst,collctn)
    ans =
        0.2273

Last updated: $Date: 2003/08/07 13:50:42 $

Dan Ellis <dpwe@ee.columbia.edu>