Following up on our previous post, we worked to resolve some unexpected matches in the Taste Profile subset. The result matters beyond that subset, though: it is important for anyone working with the song or artist metadata in the MSD.
What was the issue? Some tracks were matched to the wrong songs. More details: in the Echo Nest world, a track is very close to a physical recording. It has audio features and some metadata that comes with it. In the MSD, all track-related information is *HIGHLY TRUSTED*.
Now, tracks are matched to songs (a song can encompass several tracks that are almost identical), and songs are matched to artists. When we built the MSD, we chose tracks, then took the song each was matched to (that's the 'song_id' field), then took the artist that song was matched to (the 'artist_id' field), and used that to get artist tags, similar artists, etc. Needless to say, if a track was matched to the wrong song, none of the information we just mentioned is trustworthy.
What's specific about Taste Profile? For technical reasons, the user data was matched to song metadata, not track metadata. This means that if the user data tells us the song is "Cool Song" by the artist "Bob", and we find a song with that metadata whose ID is SO123, we match "Bob - Cool Song" to SO123. Then, using the information in the MSD, we find a track whose ID is TR456 and whose song is SO123, and we decide that "Bob - Cool Song" is the song of track TR456. Of course, if TR456 was wrongly matched to SO123, it breaks! In our previous post, that's how a little-known band, 'Harmonia', ended up being taken for a popular Katy Perry track.
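To make the failure mode concrete, here is a minimal Python sketch of the two lookups just described. The dictionaries, names, and IDs are toy values made up for illustration; they are not real MSD entries or an MSD API.

```python
# Toy data only: these IDs and names are hypothetical, not real MSD entries.
# Song metadata as The Echo Nest knows it: song ID -> (artist name, title).
song_metadata = {"SO123": ("Bob", "Cool Song")}

# MSD track records: track ID -> fields stored with the track.
# TR456 is really a recording by "Alice", but it was wrongly linked to SO123.
msd_tracks = {
    "TR456": {"artist_name": "Alice", "title": "Other Tune", "song_id": "SO123"},
}


def song_for_listen(artist, title):
    """Taste Profile step: match user data to a song ID by metadata."""
    for song_id, meta in song_metadata.items():
        if meta == (artist, title):
            return song_id
    return None


def track_for_song(song_id):
    """MSD step: find a track whose 'song_id' field is the given song."""
    for track_id, fields in msd_tracks.items():
        if fields["song_id"] == song_id:
            return track_id
    return None


song_id = song_for_listen("Bob", "Cool Song")
track_id = track_for_song(song_id)
track = msd_tracks[track_id]
# The user listened to "Bob - Cool Song", but the audio behind TR456 is Alice's.
mismatch = (track["artist_name"], track["title"]) != song_metadata[song_id]
print(song_id, track_id, mismatch)
```

Both lookups succeed, which is exactly the problem: nothing in the chain notices that the track metadata and the song metadata disagree.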
What's the solution? We got a list of song metadata (thanks Paul!) for all the song IDs in the MSD. Then, we did some string comparison with the metadata we had from the tracks. This gave us potential mismatches. Finally, we went through that list by hand, to remove what looked like actual good matches.
So, what do we have? A list of (song ID, track ID) pairs that should not be trusted in the MSD. You can download the list here:
LIST OF SONG - TRACK MISMATCHES
In practice, it means that if you're using songs (e.g. to link to Rdio, or in the Taste Profile subset) or artists (e.g. similar artists, artist tags), you should probably ignore tracks in that list.
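In code, ignoring the listed tracks could look like the sketch below. The exact layout of the file is not described here, so the sketch assumes only that each line mentions one song ID and one track ID, and that MSD IDs are 'SO' or 'TR' followed by 16 uppercase letters or digits. The sample lines, IDs, and function names are made up; check the patterns against the real file before relying on this.

```python
import re

# Assumption: song IDs look like SO + 16 characters, track IDs like TR + 16.
SONG_RE = re.compile(r"\bSO[A-Z0-9]{16}\b")
TRACK_RE = re.compile(r"\bTR[A-Z0-9]{16}\b")


def load_mismatches(lines):
    """Collect (song ID, track ID) pairs from the lines of the mismatch list."""
    pairs = set()
    for line in lines:
        song = SONG_RE.search(line)
        track = TRACK_RE.search(line)
        if song and track:
            pairs.add((song.group(), track.group()))
    return pairs


def trusted_tracks(track_to_song, mismatches):
    """Keep only tracks whose (song, track) pair is not in the mismatch list."""
    return {
        track_id: song_id
        for track_id, song_id in track_to_song.items()
        if (song_id, track_id) not in mismatches
    }


# Hypothetical IDs and sample lines, for illustration only.
song_a, track_a = "SO" + "A" * 16, "TR" + "A" * 16
song_b, track_b = "SO" + "B" * 16, "TR" + "B" * 16
sample_lines = [
    f"{song_a} {track_a} Artist A - Title A != Artist B - Title B",
    "a line without any IDs is skipped",
]
track_to_song = {
    track_a: song_a,  # listed as a mismatch, so dropped
    track_b: song_b,  # not listed, so kept
}
mismatches = load_mismatches(sample_lines)
kept = trusted_tracks(track_to_song, mismatches)
print(kept)
```

With the real file you would pass an open file object to `load_mismatches` instead of `sample_lines`.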
How many wrong matches are there? We identified ~5.7K wrong matches, about 0.6% of the whole MSD, plus another ~13K matches that cannot be verified.
Some matches can't be verified? The Echo Nest data is not frozen like the MSD. It can evolve over time, to fix issues for instance. The tracks in the MSD were downloaded around Christmas 2010, and the song metadata in this cleanup was gathered around January 2012, a year later. Some songs were no longer known to The Echo Nest, so we have no metadata to compare with. In the file, those are the lines with ' - ', all grouped together at the end. What to do with those? Can you assume the tracks were fine when downloaded? Maybe. We don't have a good intuition on that yet, so it's your call.
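If you want to treat the unverifiable entries separately, one option is to split the list in two. The sketch below assumes each entry has already been parsed into a pair of 'artist - title' strings, one per side of the comparison, and that a side with no metadata is just a bare ' - '. The parsing step, the function names, and the second sample entry are assumptions, not taken from the actual file.

```python
def has_metadata(entry):
    """False when an 'artist - title' string is empty or only the separator."""
    return entry.strip() not in ("", "-")


def split_entries(pairs):
    """Separate comparable entries from those that cannot be verified."""
    comparable, unverifiable = [], []
    for song_entry, track_entry in pairs:
        if has_metadata(song_entry) and has_metadata(track_entry):
            comparable.append((song_entry, track_entry))
        else:
            unverifiable.append((song_entry, track_entry))
    return comparable, unverifiable


# First pair is taken from the examples in this post; the second is made up.
pairs = [
    ("Musiq - Solong", "Suthun Boy - Full Blown"),
    (" - ", "Bob - Cool Song"),
]
comparable, unverifiable = split_entries(pairs)
print(len(comparable), len(unverifiable))
```

Keeping the two groups apart lets you drop the confirmed mismatches while deciding separately how much to trust the rest.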
More on finding mismatches. Matching is not trivial, nor is comparing two sets of metadata, even when we're highly confident that they represent the same song. The code we used to compare artist names and titles is available (though it might be improved over time). We accepted a song - track match if either the artist name or the title was considered a match. That left us with ~6.5K possible errors. We got that number down to ~5.7K by going through the list manually.
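The acceptance rule (artist name OR title) can be sketched as follows. This is not the comparison code mentioned above: the normalization, the use of Python's `difflib.SequenceMatcher`, and the 0.8 threshold are stand-ins chosen for illustration, and the "Bob" entries are made up.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and turn everything but ASCII letters and digits into spaces."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def similar(a, b, threshold=0.8):
    """Stand-in string comparison; the 0.8 threshold is an arbitrary choice."""
    a, b = normalize(a), normalize(b)
    if not a or not b:
        return False
    return SequenceMatcher(None, a, b).ratio() >= threshold


def accept_match(song, track):
    """Accept a song - track pair if the artist names OR the titles match."""
    (song_artist, song_title), (track_artist, track_title) = song, track
    return similar(song_artist, track_artist) or similar(song_title, track_title)


# Same artist, different title: accepted because the artist names match.
same_artist = accept_match(("Bob", "Cool Song"), ("Bob", "Other Tune"))
# Neither artist nor title is close: flagged as a possible mismatch.
obvious_miss = accept_match(
    ("Cristian Paduraru", "Born Again"), ("Yespiring", "Journey Stages")
)
print(same_artist, obvious_miss)
```

An OR rule is deliberately lenient: it keeps the list of possible errors short enough to review by hand, at the cost of letting some wrong pairs through when only one field agrees.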
A few examples. There are obvious mismatches, e.g.
Cristian Paduraru - Born Again != Yespiring - Journey Stages
Musiq - Solong != Suthun Boy - Full Blown
Some are tougher calls. This one looks like it could be a match, with the artist and title confused.
Fussible - Trip To Ensenada != Nortec Collective - Fussible
Here is an example of what we accepted.
M.A.N.D.Y. - quiet - marlow feat.dehlia != Marlow - Quiet
F - n != Funky - 3 nochi I 2 dnia
The second case is interesting. We found a pattern in the possible mismatches: for songs whose title started with a number, the artist was shrunk to its first letter, and the title to the first letter after the number. Go figure...
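That pattern is regular enough to detect automatically. Here is a small sketch under our reading of it; the function names are made up, and details such as how punctuation after the number is handled are guesses.

```python
import re


def shrunk_form(artist, title):
    """Return the (artist, title) pair the pattern would produce, or None.

    Applies only when the title starts with a number: keep the first
    character of the artist and the first letter after the number.
    """
    match = re.match(r"\s*\d+\W*([^\W\d_])", title)
    if not match or not artist.strip():
        return None
    return artist.strip()[0], match.group(1)


def is_shrunk_match(short, full):
    """True if 'short' is what the pattern would make out of 'full'."""
    return shrunk_form(*full) == short


# The example from this post: "F - n" versus "Funky - 3 nochi I 2 dnia".
accepted = is_shrunk_match(("F", "n"), ("Funky", "3 nochi I 2 dnia"))
print(accepted)
```

Titles that do not start with a number return `None`, so they never count as a shrunk match.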
For those who are interested, here is the list of manually accepted matches.
Final words: first, we are sorry that these mismatches happened in the first place, and we wish we had found them before releasing the Taste Profile subset. Secondly, our solution is not perfect. I believe we found most mismatches, but I'm sure we missed some. Even when we looked at the list manually, I am convinced that we skipped good matches and accepted wrong ones. As usual when working with millions of records, please allow for some amount of noise in everything you do.
Happy Valentine's day everyone! Be nice, kiss a geek.