2008-02-22: We are interested in the common IDs provided by MusicBrainz, and plan to provide these for all our data. Please see their description of The MusicBrainz Database.
Here, for posterity, are my recommendations for normalizing the names of artists, albums, and tracks:
It is particularly important that artist names normalize to the same form, so they get the most severe normalization:
Other examples:
N'sync -> nsync
D'Angelo -> dangelo
R. Kelly -> r_kelly
P.J. Harvey -> pj_harvey
Run-D.M.C. -> run_dmc
The Presidents of the United States of America -> presidents_of_the_united_states_of_america
Finally, there are some special cases for semantically-equivalent names:
Bruce Springsteen and the E Street Band -> bruce_springsteen
Tom Petty and the Heartbreakers -> tom_petty
Bob Marley and the Wailers -> bob_marley
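As a rough illustration, here is a small Python sketch that reproduces the artist examples above. It is inferred only from those examples, not from the full rule list, so treat every step as an assumption: lowercase, drop a leading "The", delete apostrophes and periods, and turn any other run of non-alphanumeric characters into a single "_". The names `normalize_artist` and `SPECIAL_CASES` are hypothetical, and the special cases are handled by an explicit lookup table rather than a general "and the ..." rule, since the text presents them as individual exceptions.

```python
import re

# Hypothetical lookup table for the semantically-equivalent names listed above.
# Keys are the mechanically normalized forms; values are the preferred forms.
SPECIAL_CASES = {
    "bruce_springsteen_and_the_e_street_band": "bruce_springsteen",
    "tom_petty_and_the_heartbreakers": "tom_petty",
    "bob_marley_and_the_wailers": "bob_marley",
}


def normalize_artist(name: str) -> str:
    """Normalize an artist name (assumed rules, inferred from the examples)."""
    s = name.strip().lower()
    s = re.sub(r"^the\s+", "", s)        # drop a leading "The"
    s = re.sub(r"['.]", "", s)           # delete apostrophes and periods
    s = re.sub(r"[^a-z0-9]+", "_", s)    # any other run of characters -> "_"
    s = s.strip("_")                     # no leading or trailing separators
    return SPECIAL_CASES.get(s, s)       # apply the special cases last


if __name__ == "__main__":
    print(normalize_artist("Run-D.M.C."))                       # run_dmc
    print(normalize_artist("P.J. Harvey"))                      # pj_harvey
    print(normalize_artist("Tom Petty and the Heartbreakers"))  # tom_petty
```

Because the last substitution maps every remaining non-alphanumeric character to "_", a "+" can never survive in the output of this sketch, which matches the property the flat file naming relies on.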
Album and track names are normalized less aggressively, because matching slight variants from different sources has not turned out to be as important for them as it is for band names. (To be honest, mostly I'm just trying to come up with a set of explicit rules that are consistent with the data we already have.) The normalization rules are modified as follows:
The current versions of aset400 (the reference list of artist names) and uspop2002-aset (list of 8764 track names used in the ISMIR03 experiments) reflect these rules (with some laxity in the album names).
The idea is to update this entry as and when necessary.
2006-07-27: The uspop convention is to have a deep directory of files, structured as <artist_name>/<Album_Name>/01-Track_Name.mp3, but occasionally we want to have all the information in a single file name. Since "+" cannot occur in any of the normalized components (it's mapped to "_" according to the rules above), we'll use it as a separator. So the file names become <artist_name>+<Album_Name>+01-Track_Name.mp3.
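The conversion between the two layouts is mechanical. Below is a minimal Python sketch, assuming the three path components are already normalized; the function names `flat_name` and `split_flat_name` and the album and track names in the usage lines are made up for illustration.

```python
from pathlib import PurePosixPath

SEP = "+"  # safe separator: it cannot occur in a normalized component


def flat_name(deep_path: str) -> str:
    """Turn <artist>/<Album>/<NN-Track>.mp3 into <artist>+<Album>+<NN-Track>.mp3."""
    parts = PurePosixPath(deep_path).parts
    if len(parts) < 3:
        raise ValueError(f"expected artist/album/track, got {deep_path!r}")
    artist, album, track = parts[-3:]  # ignore any leading directories
    for component in (artist, album, track):
        if SEP in component:
            raise ValueError(f"{component!r} contains {SEP!r}; not normalized?")
    return SEP.join((artist, album, track))


def split_flat_name(flat: str) -> tuple[str, str, str]:
    """Recover (artist, album, track file) from a flat name."""
    artist, album, track = flat.split(SEP)  # raises ValueError unless 3 parts
    return artist, album, track


if __name__ == "__main__":
    flat = flat_name("r_kelly/Some_Album/01-Some_Track.mp3")
    print(flat)                   # r_kelly+Some_Album+01-Some_Track.mp3
    print(split_flat_name(flat))  # ('r_kelly', 'Some_Album', '01-Some_Track.mp3')
```

The check for "+" in each component is only a guard against un-normalized input; with properly normalized names it should never trigger.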
2007-06-27: In preparing some new files for the artist20 release, I ran into a lot of inconsistency in how album names were capitalized and expanded, so I added some more detail to the rules above.