
Music Similarity Metrics

Comparing metrics against the musicseer ground truth

In our ISMIR-2003 paper we considered the problem of evaluating and comparing different music similarity measures. Part of the problem is finding the ground truth against which to evaluate the measures, but even with ground truth in hand we still need to define a figure of merit that uses it. We have come up with several measures directly related to the musicseer data, which collected artist similarity judgments directly from users over the web:

Average Response Rank: for the musicseer ground-truth data, what is the average rank, under the new measure, of the band that the users actually selected? Each ranking is normalized to a scale of 1 to 10 (for those cases where there were more or fewer than 10 choices), so random chance gives an average rank of 5.5, and perfect agreement gives an average rank of 1.0.
Triplet agreement percentage: for all "triplets" <target chosen not-chosen> from the musicseer data, what proportion does the metric order the same way, i.e. rate the chosen artist as more similar to the target than the not-chosen artist?
First place agreement percentage: for each individual trial in the musicseer data, how often does the similarity metric's first choice of most similar artist match the one chosen by the user? This involves fewer comparisons than the triplet agreement, so it could be argued to use the ground truth data less efficiently. However, we have plenty of musicseer trials to work with, and a big plus for this measure is that, by treating each judgment as an independent trial, we can calculate significance for these results using a simple binomial model. Thus we can tell whether two different results under this measure really indicate underlying differences in the techniques, or whether they can be explained by random variation.
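
The three measures above can be sketched in a few lines of Matlab. This is a minimal illustration, not the code in simvsgdtruth.m: the toy matrix S and the ground-truth rows are made up, the row layout [trial, target, chosen, notchosen] (one row per triplet) is assumed, rows with artists outside the set are assumed to have been dropped already, and the response rank is left unnormalized rather than rescaled to 1..10.

```matlab
% Toy similarity matrix (hypothetical); S(i,j) = similarity of artists i and j.
S = [1.0 0.8 0.3 0.1;
     0.8 1.0 0.6 0.2;
     0.3 0.6 1.0 0.7;
     0.1 0.2 0.7 1.0];

% Hypothetical ground truth, one row per triplet:
% [trial, target, chosen, notchosen]
gdtr = [1 1 2 3;    % trial 1: target 1, user chose 2 over 3
        1 1 2 4;    % trial 1: ... and over 4
        2 3 2 4;    % trial 2: target 3, user chose 2 over 4
        2 3 2 1];   % trial 2: ... and over 1

% Look up sim(target, chosen) and sim(target, notchosen) for every triplet.
simChosen = S(sub2ind(size(S), gdtr(:,2), gdtr(:,3)));
simNotCho = S(sub2ind(size(S), gdtr(:,2), gdtr(:,4)));

% Triplet agreement: the metric orders the pair the way the user did.
agree = simChosen > simNotCho;
tripletAgreement = mean(agree);

% Per-trial measures.
trials = unique(gdtr(:,1));
firstPlace = false(numel(trials), 1);
rawRank    = zeros(numel(trials), 1);
for t = 1:numel(trials)
    inTrial = (gdtr(:,1) == trials(t));
    % First place: the chosen artist beats every not-chosen alternative.
    firstPlace(t) = all(agree(inTrial));
    % Raw rank of the chosen artist: 1 + number of alternatives rated higher.
    rawRank(t) = 1 + sum(~agree(inTrial));
end
firstPlaceAgreement = mean(firstPlace);
avgRawRank = mean(rawRank);     % not rescaled to the 1..10 range

fprintf('triplet %.2f, first place %.2f, raw rank %.2f\n', ...
        tripletAgreement, firstPlaceAgreement, avgRawRank);
```

In this toy case the metric disagrees with the user on one triplet of trial 2, so that trial counts as a miss for first place agreement even though three of the four triplets agree.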

Comparing arbitrary similarity metrics against each other

We also defined a metric to measure the similarity between any pair of similarity measures (one of which could be a similarity measure derived from the musicseer data, but that's not necessary). It calculates a weighted score of the agreement between the first few artists rated as most similar to each target artist in aset400. We call it the Top-N ranking agreement score, and it is defined by:

$$ s_i = \sum_{r=1}^{N} (\alpha_r)^{r} \, (\alpha_c)^{k_r} $$

where s_i is the score for artist i, N is how many similar artists are considered in each case (we use 10), α_r is the decay constant for the reference ranking (we used 0.5^0.33), α_c is the decay constant for the candidate ranking (we used 0.5^0.67), and k_r is the rank under the candidate measure of the artist ranked r under the reference measure. The overall agreement score is obtained by averaging over all artists, and normalizing by the maximum ideal score (which is 0.999 using our values: with identical rankings k_r = r, each term becomes 0.5^r, and the sum over r = 1..10 is 1 - 0.5^10). Thus the score varies from near 0 for measures giving unrelated rankings to 1 for measures giving identical rankings (at least for the top N cases).
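
As a rough illustration of the definition above, the following Matlab sketch computes the score for two small matrices. It is not topNrankagree.m: the 4x4 matrices are hypothetical, N is reduced to 3 to fit the toy data, each artist is assumed to be excluded from its own neighbour list, and ties are broken by sort order rather than randomized.

```matlab
% Hypothetical toy similarity matrices (the real ones are 400x400).
ref  = [1.0 0.9 0.5 0.2;
        0.9 1.0 0.4 0.3;
        0.5 0.4 1.0 0.8;
        0.2 0.3 0.8 1.0];
cand = ref;                 % identical rankings, so the score should be 1

N       = 3;                % the page uses N = 10
alpha_r = 0.5^0.33;         % decay constant for the reference ranking
alpha_c = 0.5^0.67;         % decay constant for the candidate ranking

nArt = size(ref, 1);
s = zeros(nArt, 1);
for i = 1:nArt
    % Assumption: an artist is not counted as its own neighbour.
    others = setdiff(1:nArt, i);
    [~, refOrder]  = sort(ref(i, others),  'descend');
    [~, candOrder] = sort(cand(i, others), 'descend');
    % candRank(j) = rank of others(j) under the candidate measure.
    candRank = zeros(1, numel(others));
    candRank(candOrder) = 1:numel(others);
    for r = 1:N
        k_r = candRank(refOrder(r));   % candidate rank of the reference's r-th artist
        s(i) = s(i) + alpha_r^r * alpha_c^k_r;
    end
end

% Maximum ideal score: every k_r equals r.
maxScore = sum((alpha_r * alpha_c) .^ (1:N));
score = mean(s) / maxScore;
fprintf('Top-%d ranking agreement: %.4f\n', N, score);
```

Replacing cand with a matrix that orders neighbours differently lowers the score, since each displaced artist contributes a smaller power of α_c.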

Tools

The Matlab script simvsgdtruth.m will compare a similarity matrix (e.g. a 400x400 matrix where each element is proportional to the similarity between the row and column artists) against musicseer survey-type data. Here's how it might be used:


>> % Load the sim matrix
>> ank = load('SIM-ank14C');
>> % Read the musicseer data
>> [tr, sg, uid, trg, cho, nch] = textread('musicseer-results-2002-10-15-nodups.txt','%d %c %s %d %d %d');
>> % Choose just the survey data (unfiltered)
>> Su = find(sg=='S');
>> % Build the mapping to convert musicseer artist IDs to aset400 indices
>> [name, sqlid] = textread('aset400.3-canon-musicseer.ids','%s %d');
>> sql2topset = zeros(1,7000);
>> % Make it so sql2topset(sql+2) will return the topset index, or 0 if not in aset400.
>> % ("+2" is so that sql can be -1, which it sometimes is)
>> sql2topset(sqlid+2) = 1:400;
>> % OK, build the ground truth matrix: trial number, target, chosen, notchosen
>> % for the unfiltered survey trials
>> gdtrSu = [tr(Su),sql2topset(trg(Su)+2)',sql2topset(cho(Su)+2)',sql2topset(nch(Su)+2)'];
>> % Now we can run the metric scoring:
>> p = simvsgdtruth(ank, gdtrSu);
10997 trials, 98964 triplets
10905 valid trials (0 with empty notchosen), 19.73% first place agreement, avrank=4.314
>>
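
To give a feel for the binomial significance argument, the sketch below takes the figures printed in the session above (10905 valid trials, 19.73% first place agreement) and computes the standard error of that proportion, which comes out at roughly 0.4 percentage points. The second measure's score p2 is an invented value for illustration, and the two-proportion z statistic treats the two results as independent, which is only a rough approximation when both are scored on the same trials.

```matlab
% Figures reported by simvsgdtruth in the session above.
n = 10905;                       % valid trials
p = 0.1973;                      % first place agreement

% Binomial standard error of the proportion, and a normal-approximation
% 95% half-width around it.
se   = sqrt(p * (1 - p) / n);
ci95 = 1.96 * se;

% Hypothetical second measure scored on the same number of trials.
p2 = 0.21;                       % assumed value, for illustration only

% Two-proportion z statistic (assumes the two results are independent).
z = (p2 - p) / sqrt(p * (1 - p) / n + p2 * (1 - p2) / n);

fprintf('se = %.4f, 95%% half-width = %.4f, z = %.2f\n', se, ci95, z);
```

A |z| above about 1.96 would suggest, at the 5% level, that the difference is not just random variation between trials.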

The Matlab script topNrankagree.m computes the top-N rank agreement score defined above. Here it is in use:


>> % Load the 400x400 aset400 sim matrices
>> playlst = load('SIM-aotm');
>> collctn = load('SIM-opennap');
>> % How well does playlst agree with collctn ground truth?
>> topNrankagree(collctn,playlst)
ans =
    0.2239
>> % What about the other way around (playlst as ground truth)?
>> topNrankagree(playlst,collctn)
ans =
    0.2254
>> % Note: 'tied' orderings are randomized, so there is a random component to the results:
>> topNrankagree(playlst,collctn)
ans =
    0.2273

Last updated: 2003/08/07 13:50:42
Dan Ellis <dpwe@ee.columbia.edu>