We took a deeper look at the problem of duplicate songs in the MSD. Take a task like cover song recognition. If your algorithm performs incredibly well but keeps finding some unknown song A as the closest cover, ahead of the known cover B, it might get frustrating. Especially if it turns out that A is a duplicate of B!
Even for non-predictive tasks, duplicates can introduce bias. For instance, if you analyze the lyrics in the dataset, you would want to know whether some songs are over-represented. Music recommendation has the same problem: any algorithm would probably consider a song very similar to itself! (Actually, an algorithm that knows that playing two songs too similar to one another is bad would be awesome!) Therefore, we want to eliminate duplicates from the list of similar songs.
So, we need a list of duplicates. And if we want results to be comparable, we need an official list so that everyone uses the same one. How do we find most duplicates? Here, we define duplicates as songs with the same artist name or artist ID, and the same song title or song ID. Why? It is quite safe, in the sense that there are few false positives. The only wrong duplicates are live versions, cover versions by a group of artists that includes the original one, and remixes. For music recommendation, it makes sense to remove all of these. For lyrics analysis too. For cover song recognition, it's sad that we have to ignore some live versions.
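To make the matching rule concrete, here is a minimal sketch in Python. It is not the script used to build the official list. It assumes each track is a plain dict with hypothetical keys `track_id`, `artist_id`, `artist_name`, `song_id` and `title`, it compares strings exactly, and it assumes duplicates chain together (if A matches B and B matches C, all three land in one group). The track IDs and metadata in the sample are made up.

```python
from collections import defaultdict


def find_duplicate_groups(tracks):
    """Group tracks that share (artist name or artist ID) and (title or song ID).

    `tracks` is a list of dicts; the key names are assumptions for this sketch.
    Returns a list of sorted track-ID lists, one per group of two or more tracks.
    """
    # Union-find structure: every track starts as its own group.
    parent = {t["track_id"]: t["track_id"] for t in tracks}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Two tracks are duplicates if they share any (artist key, song key) pair.
    first_seen = {}
    for t in tracks:
        artist_keys = [("name", t["artist_name"]), ("id", t["artist_id"])]
        song_keys = [("title", t["title"]), ("id", t["song_id"])]
        for artist_key in artist_keys:
            for song_key in song_keys:
                key = (artist_key, song_key)
                if key in first_seen:
                    union(first_seen[key], t["track_id"])
                else:
                    first_seen[key] = t["track_id"]

    groups = defaultdict(list)
    for t in tracks:
        groups[find(t["track_id"])].append(t["track_id"])
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Made-up sample: T1 and T2 share a song ID, T3 shares artist name and title,
# T4 has the same title but a different artist, so it is left alone.
sample_tracks = [
    {"track_id": "T1", "artist_id": "AR1", "artist_name": "Cat Stevens",
     "song_id": "SO1", "title": "Wild World"},
    {"track_id": "T2", "artist_id": "AR1", "artist_name": "Cat Stevens",
     "song_id": "SO1", "title": "Wild World"},
    {"track_id": "T3", "artist_id": "AR2", "artist_name": "Cat Stevens",
     "song_id": "SO2", "title": "Wild World"},
    {"track_id": "T4", "artist_id": "AR3", "artist_name": "Norah Jones",
     "song_id": "SO3", "title": "Wild World"},
]

print(find_duplicate_groups(sample_tracks))  # [['T1', 'T2', 'T3']]
```

Exact string comparison keeps false positives low, which is the point of the rule, but it also means a title with a slightly different spelling slips through.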
Does that get us all the duplicates in the dataset? No chance! Here are a few cases I can think of: titles are spelled slightly differently, translated into another language, or written with foreign symbols, so The Echo Nest believes they are two different songs. Inconsistent band names create the same problem: The Echo Nest believes they are two bands. Versions with invited artists are endless trouble: how do you classify Cat Stevens in duo with Norah Jones featuring Aerosmith remixed by Puff Daddy using Abba samples? OK, I made this one up, but it's not that unrealistic!
There are duplicates in the dataset. There are duplicates in any real-world music database. And until fingerprinting and cover song recognition both work perfectly, that will remain the case. In the meantime, here is the official duplicate list of the Million Song Dataset. Also, here is the Python code used to generate it (careful, it is not efficient and takes an hour to run). Note that how to deal with these duplicates is task-specific.
Quick stats: we found 53,471 "song objects" with at least one duplicate, and a total of 131,661 tracks are duplicates of another one. Therefore, taking 1M - 131,661 + 53,471, what we have is closer to the 921,810 Song Dataset!
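The arithmetic can be checked in a few lines of Python. This sketch reads the 131,661 figure as the number of tracks that belong to some duplicate group, with each of the 53,471 groups collapsing to a single song; the variable names are only for illustration.

```python
total_tracks = 1_000_000     # tracks in the dataset
duplicate_tracks = 131_661   # tracks that belong to a duplicate group
song_objects = 53_471        # distinct songs behind those groups

# Remove every track in a duplicate group, then add back one song per group.
unique_songs = total_tracks - duplicate_tracks + song_objects
print(unique_songs)  # 921810
```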