The SecondHandSongs Dataset


Welcome to the SecondHandSongs dataset, the official list of cover songs within the Million Song Dataset.

UPDATE (25/03/2011): we added the SHS performance number where we have it. The format of the track ID lines changed slightly as a result: it is now tid / aid / performance.

The MSD team is proud to partner with the Second Hand Songs team in order to bring you the largest dataset of cover songs ever released for academic research. See below for details on getting the dataset and on how to perform experiments.

Please visit (and contribute to) the Second Hand Songs website, which is a great resource for the MIR community!

Description
Getting the dataset
General FAQ
Technical FAQ
Extreme Cover Songs
Publications

Description

The SHS dataset has a total of 18,196 tracks from the MSD, organized in "cliques", i.e. groups of versions of a single underlying musical work. Where possible, the cliques are referenced to "works" from the SHS site, see:
http://www.secondhandsongs.com/work/<work_ID>
If the work is not available on SHS, we use a negative number. Similarly, we provide the performance number when we have it; learn more about a performance at:
http://www.secondhandsongs.com/performance/<performance_ID>

The file format per line is:
# - comment, ignore
%a,b,c, title - beginning of a clique. a,b,c are work IDs (negative if not available)
TID<SEP>AID<SEP>perf - track ID from the MSD (plus artist ID and SHS performance)
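
The format above can be read with a few lines of Python. This is a minimal sketch, not an official parser: the function name `parse_shs`, the dictionary layout, and the assumption that the title follows the last work ID after a comma and an optional space are illustrative choices. It also assumes that work IDs and performance numbers are integers, and it tolerates track lines without a performance field in case a file predates the format change.

```python
import re

# Assumed shape of a clique header: '%' + comma-separated integers + ',' + title.
CLIQUE_RE = re.compile(r"^%(-?\d+(?:,-?\d+)*),\s?(.*)$")


def parse_shs(lines):
    """Parse SHS dataset lines into a list of cliques (hypothetical helper)."""
    cliques = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # blank line or comment
        if line.startswith("%"):
            match = CLIQUE_RE.match(line)
            if match is None:
                raise ValueError(f"unrecognized clique header: {line!r}")
            work_ids = [int(w) for w in match.group(1).split(",")]
            cliques.append(
                {"works": work_ids, "title": match.group(2).strip(), "tracks": []}
            )
            continue
        if not cliques:
            raise ValueError("track line found before any clique header")
        fields = line.split("<SEP>")
        if len(fields) < 2:
            raise ValueError(f"unrecognized track line: {line!r}")
        tid, aid = fields[0], fields[1]
        # The performance number is optional; keep None when it is absent.
        perf = int(fields[2]) if len(fields) > 2 and fields[2] else None
        cliques[-1]["tracks"].append((tid, aid, perf))
    return cliques
```

Any iterable of lines works as input, including an open file handle. Each returned dictionary holds the work IDs, the title, and the list of (track ID, artist ID, performance) tuples of one clique.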

It turns out that the MSD in many cases contains multiple versions of a given track by a given artist. We have attempted to remove such duplicates from the SHS dataset (since they are much too similar to be considered "covers"). This means that you should ignore known duplicates when testing. See the technical FAQ for the details.

We anticipate a growth in research that uses a set of known covers to tune parameters or otherwise train algorithms. Therefore, we have split the dataset into train and test as for other tasks. Performance should be reported on the "test" set, with system tuning performed only on the "train" portion. See the technical FAQ below for more details.

Getting the dataset

Here are the train set and the test set. They are already included in the GitHub repository.
The training set contains 4,128 cliques out of 5,854, and 12,960 tracks out of 18,196, which leaves 1,726 cliques and 5,236 tracks for the test set.

General FAQ

What is the relationship with the Million Song Dataset?

The SecondHandSongs dataset is an independent dataset, but it only references songs that exist in the Million Song Dataset (MSD). The data mostly comes from the Second Hand Songs website. It was created as a collaboration between SecondHandSongs.com and the Million Song Dataset team.

How was that dataset created?

Most of the data comes from the Second Hand Songs (SHS) database, the backbone of their website.
Note that the MSD team added some other known covers. Therefore, you could find a cover song in the list that is not (yet) on the SHS website.

What are the possible flaws?

We are waiting for your input! Potential flaws we have identified:

  1. Covers that are actually duplicates of the same song by the same artist
  2. String matching on artist names / titles from SecondHandSongs failed
  3. We missed covers (we did!!! and some on purpose)
  4. Info from SecondHandSongs.com is wrong

See the technical FAQ below for more details.

What is the licensing?

Put simply, it is for research only. You cannot use it to make money, start your own cover song website project, or do anything similar without the written consent of the Second Hand Songs team (SHS). The terms are similar to those of The Echo Nest data provided in the MSD. Also, SHS can advertise and refer to any work or publication based on this dataset.

How can I contribute to this dataset?

What do you do if you find a new cover song in the MSD? First, see if it is known on the SHS website, and add it if necessary. Then, send us the proper information, i.e. The Echo Nest track ID from the MSD and the SHS performance ID or URL. Please double check that it is indeed a cover -- don't assume that songs with the same title will always be covers, for instance.

How can I cite this dataset?
You should cite this publication [bib].
Additionally, you can mention / link to this web resource:

SecondHandSongs dataset, the official list of cover songs within the Million Song Dataset, 
available at: http://labrosa.ee.columbia.edu/millionsong/secondhand

Who can I contact for additional help?

Thierry Bertin-Mahieux is still a good first try. Otherwise, you can try the MSD mailing list. For a secondhandsongs.com specific question, please contact them directly.

Technical FAQ

(for MIR practitioners)
If all else fails, read the instructions. - Donald Knuth

How do I train / test / evaluate?

- during training, you have access to all features, but only for the training set
- during testing, for each track in each clique of the test set, query the MSD and rank the closest songs
- evaluate using the ground-truth tracks identified for the clique, with the actual evaluation measure based on a metric such as average precision (AP), or mean reciprocal rank (MRR)
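
The two measures named above can be sketched as follows. This is an illustrative implementation, not the official evaluation code: the function names are hypothetical, `ranked` is assumed to be a list of track IDs ordered from best to worst match with no repeats, and `relevant` is the set of ground-truth tracks of the query's clique (without the query itself).

```python
def average_precision(ranked, relevant):
    """Average precision of a ranked list of track IDs against the relevant set."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, tid in enumerate(ranked, start=1):
        if tid in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    # Relevant tracks that never appear in the ranking contribute zero.
    return total / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant track, or 0.0 if none is retrieved."""
    relevant = set(relevant)
    for rank, tid in enumerate(ranked, start=1):
        if tid in relevant:
            return 1.0 / rank
    return 0.0
```

Averaging these values over all test queries gives mean average precision and MRR, respectively.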


When testing using song A as a query, you should ignore all songs from the artist of A and all known duplicates of the cover songs (see official MSD duplicate list).
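
One possible reading of this rule, as a sketch: before scoring, remove from the ranked list every track by the query's artist and every known duplicate of a track in the query's clique. The function name and both lookup tables are assumptions. `artist_of` maps a track ID to an artist ID, and `duplicates` maps a track ID to its known duplicates. Building `duplicates` from the MSD duplicate list is left out here because that list's format is not described on this page.

```python
def filter_ranking(ranked, query_tid, clique_tids, artist_of, duplicates):
    """Remove tracks that should be ignored when scoring one query (hypothetical helper)."""
    query_artist = artist_of[query_tid]
    clique = set(clique_tids)

    # Known duplicates of clique members, minus the clique members themselves.
    ignored = set()
    for tid in clique:
        ignored.update(duplicates.get(tid, ()))
    ignored -= clique

    return [
        tid
        for tid in ranked
        if artist_of.get(tid) != query_artist and tid not in ignored
    ]
```

The query itself disappears from the list because it shares its own artist. Under this reading, other clique members by the same artist are dropped as well.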


Why this configuration? We assumed the following scenario: you are YouTube or iTunes, you already have all the songs in the world, and you have a set of identified covers to train on. Then, an artist comes to you, identifies his track in your collection, and asks you to find the covers.

How clean is the dataset?

Quite clean! SecondHandSongs.com is very clean: data is entered and verified by site staff. Most errors come from the string matching (artist names, song titles) and from errors within the MSD and The Echo Nest. For instance, two bands may share the same name, or the same artist may be spelled slightly differently and treated as two artists.
That said, the real issue is how much we left out; see below.

Why are work IDs missing for certain cliques?

We (the MSD team) had already identified some covers before starting our collaboration with the SecondHandSongs team. We decided to include them. Unfortunately, there are too many to add them all to secondhandsongs.com manually.

How come I can find tons of obvious covers that are left out?

We try to ignore MSD duplicates. In each clique, most tracks come from different artists. If they do come from the same artist, we made sure they had different song IDs and titles. If we had included these duplicates, the list would be longer!
We also skipped some "medleys", since it adds too much complication to have one track as part of multiple cliques (in both the evaluation and the way we created the cliques).
Also, SecondHandSongs does not know all the covers in the world.
It does mean that in your evaluation, your best match might actually be a good match even if it is not referenced in the dataset. That said, if your accuracy on a million songs comes down to a few position errors, you just solved the task!

Why not include the duplicate tracks?

Finding covers which are really duplicates is trivial. We don't want to be testing the (nearly) identical track five times over.

Extreme Cover Songs

Your algorithm already does super well? Try to play with these songs! (They were all excluded from the official list; find out why.)

TRVMOOV128F92E4D89
TRKVYPP128F9337A83
TRLJJLR128E07927B6
TREGTQA128F426C0C2
TRDJEFP128F933925A
TRJZRKL128F931369B
TRZXLTH128E078AF43
TRRVJWB128F426C0A9
TRFTSLG128F92F7204
TROFRYV128F1482A5E
TRVATYS128F9339241
TRSFNHG128F427DF40
TRKIDJH128F4298840
TRZYXXX128E078CE4A
TRHSZPV128F1459651
TRPAXPU12903CAA835
TRZRYDQ128F9341FDC
TRRKQAS128F42AE480
TRCZJGM128F4261493

Publications

Below is an informal list of published results on the SecondHandSongs dataset. It should be a subset of the MSD publications page. We will probably only report works that have a result using most of the million songs. If you think your work should be included, please send us an email! It will help everyone keep track of the state of the art.

  • Large-scale cover song recognition using hashed chroma landmarks, T. Bertin-Mahieux and D. Ellis, WASPAA 11 [pdf] [bib] [code]