
Text normalization conventions for artist, album, and track names


2008-02-22: We are interested in the common IDs provided by MusicBrainz, and plan to provide these for all our data. Please see their description of The MusicBrainz Database.

Here, for posterity, are my recommendations for normalizing the names of artists, albums, and tracks:

It is particularly important that variants of an artist name all normalize to the same form, so artist names get the most severe normalization:

  1. Names are all mapped to lower case
  2. Delete apostrophes ("'") and periods (".").
  3. Everything else except a-z 0-9 maps to "_". Multiple _'s in sequence fold into a single _. Leading and trailing _'s are dropped.
  4. Don't reorder proper names - it's just too hard, and there's no clear boundary between proper names and band names. No more deejay_alice, cyrus_billy_ray, amos_tori etc.
  5. Always drop a leading "the". the_beatles and the_verve were the only ones who escaped this in aset400, but uspop2002 had lots of *_the. I guess always drop a leading indefinite article too, although I think "A New Found Glory" (new_found_glory) is the only one.

Other examples:

   N'sync -> nsync
   D'Angelo -> dangelo
   R. Kelly -> r_kelly
   P.J. Harvey -> pj_harvey
   Run-D.M.C. -> run_dmc
   The Presidents of the United States of America ->
                                presidents_of_the_united_states_of_america
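
As a rough illustration, rules 1-3 and 5 above can be sketched in Python. This is only one reading of the rules, not the code actually used to build aset400 or uspop2002. The function name normalize_artist is made up for this sketch, and treating "an" as a leading article alongside "the" and "a" is an assumption (the only example given above is "A"). Rule 4 needs no code, since names are simply left in their original order.

```python
import re


def normalize_artist(name: str) -> str:
    """Normalize an artist name following rules 1-3 and 5 (hypothetical helper)."""
    s = name.lower()                   # rule 1: map to lower case
    s = re.sub(r"['.]", "", s)         # rule 2: delete apostrophes and periods
    s = re.sub(r"[^a-z0-9]+", "_", s)  # rule 3: everything else -> "_", runs folded
    s = s.strip("_")                   # rule 3: drop leading and trailing "_"
    # Rule 5: drop one leading article, but only if something follows it.
    # Including "an" here is an assumption, not something the rules spell out.
    s = re.sub(r"^(?:the|an|a)_(?=.)", "", s)
    return s


if __name__ == "__main__":
    for raw in ["N'sync", "R. Kelly", "Run-D.M.C.", "A New Found Glory"]:
        print(raw, "->", normalize_artist(raw))
```

The article is stripped after the other rules have run, so only one pattern on the already-normalized string is needed.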

Finally, there are some special cases for semantically-equivalent names:

   Bruce Springsteen and the E Street Band -> bruce_springsteen
   Tom Petty and the Heartbreakers -> tom_petty
   Bob Marley and the Wailers -> bob_marley
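
One simple way to handle such special cases is an explicit lookup table applied after the general rules. The sketch below assumes its input has already been normalized; the names ARTIST_ALIASES and canonical_artist are invented for illustration, and the table holds only the three cases listed above.

```python
# Hypothetical alias table: normalized long form -> canonical normalized form.
# Only the three special cases listed above are included.
ARTIST_ALIASES = {
    "bruce_springsteen_and_the_e_street_band": "bruce_springsteen",
    "tom_petty_and_the_heartbreakers": "tom_petty",
    "bob_marley_and_the_wailers": "bob_marley",
}


def canonical_artist(normalized: str) -> str:
    """Return the canonical form of an already-normalized artist name."""
    # Names with no entry pass through unchanged.
    return ARTIST_ALIASES.get(normalized, normalized)


if __name__ == "__main__":
    print(canonical_artist("bob_marley_and_the_wailers"))
    print(canonical_artist("r_kelly"))
```

Variants written with "&" normalize differently (the "&" becomes "_"), so they would need their own entries in such a table.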

Album and Track names are less aggressively normalized because it's not normally so important to match slight variants from different sources, as it has turned out to be for band names. (To be honest, mostly I'm just trying to come up with a set of explicit rules that are consistent with the data we already have.) The normalization rules are modified as follows:

The current versions of aset400 (the reference list of artist names) and uspop2002-aset (list of 8764 track names used in the ISMIR03 experiments) reflect these rules (with some laxity in the album names).

The idea is to update this entry as and when necessary.

2006-07-27: The uspop convention is to have a deep directory of files, structured as <artist_name>/<Album_Name>/01-Track_Name.mp3, but occasionally we want to have all the information in a single file name. Since "+" cannot occur in any of the normalized components (it's mapped to "_" according to the rules above), we'll use it as a separator. So the file names become <artist_name>+<Album_Name>+01-Track_Name.mp3.
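
A minimal sketch of converting between the two layouts, assuming forward-slash relative paths with exactly three components. The function names flatten and unflatten are made up here, and the path used is the placeholder pattern from the paragraph above, not a real file.

```python
from pathlib import PurePosixPath

SEP = "+"  # cannot occur in a normalized component, so it is a safe separator


def flatten(path: str) -> str:
    """Turn artist/Album/NN-Track.mp3 into artist+Album+NN-Track.mp3."""
    parts = PurePosixPath(path).parts
    if len(parts) != 3:
        raise ValueError(f"expected artist/album/track, got {path!r}")
    if any(SEP in p for p in parts):
        raise ValueError(f"component already contains {SEP!r}: {path!r}")
    return SEP.join(parts)


def unflatten(name: str) -> str:
    """Inverse of flatten: split on '+' and rebuild the directory path."""
    parts = name.split(SEP)
    if len(parts) != 3:
        raise ValueError(f"expected three '+'-separated parts, got {name!r}")
    return str(PurePosixPath(*parts))


if __name__ == "__main__":
    flat = flatten("artist_name/Album_Name/01-Track_Name.mp3")
    print(flat)
    print(unflatten(flat))
```

Rejecting components that already contain "+" keeps the mapping reversible if an un-normalized name slips through.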

2007-06-27: In preparing some new files for the artist20 release, I got caught up with a lot of inconsistent album naming capitalization and expansion, so I added some more detail to the rules above.


Last updated: $Date: 2007/06/27 22:05:51 $
Dan Ellis <dpwe@ee.columbia.edu>