The "uspop2002" Pop Music data set

2010-12-03: Please read the new entry in the errata concerning duplicate files identified in the original set. The current set of uspop files in use at LabROSA has only 8752 items, listed in this file.

2009-03-16: If you have a set of the uspop2002 DVDs, be sure to check our errata, which detail a problem causing NaNs in 24 of the data files.

For ISMIR-2003 we compared several acoustic-based similarity measures, using the ground-truth data described elsewhere on this site. For the Music IR white paper symposium at SIGIR-03, we recast this work as a paradigm for comparing Music IR work on a single dataset: although distributing the music itself is very problematic, sharing the derived feature representations (in our case, MFCCs) is both lower bandwidth and less likely to upset copyright holders. So we are trying to promote the sharing of this kind of data and, in particular, the reuse of our specific corpus.

This page defines the actual album tracks we used in that corpus. We used 400 artists chosen for popularity and for representation in our subjective data. Here are their canonical names: aset400.txt.

For each artist, we purchased at least one album, although some artists had as many as 10 discs (Queen) in our set. (Albums were chosen to get coverage of the tracks found in our OpenNap trawl.) Overall, this gave us a total of 706 albums and 8764 tracks from our 400 artists (it used to be 8772 tracks, but we found one disc -- pink_floyd/Delicate_Sound_of_Thunder_Disc_2_ -- that appeared twice under slightly different names). The file ntrack-nalbum-artist.txt gives the number of tracks (first column) and the number of albums (second column) for each artist (third column), sorted by decreasing track count. The file uspop-albums.txt lists all 706 albums, by our normalized artist and album title names (sorry, we don't have these resolved to catalog numbers).

The list of tracks is in uspop2002-aset.txt. Each line describes one track, with four fields separated by spaces: artist_name Album_Name TrackNumber Track_Name. (Here are the guidelines for name canonicalization.)
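
As a minimal sketch of reading this list in Python (not part of the original distribution): the names Track and parse_track_line are hypothetical, the sample line is composed from names that appear elsewhere on this page and is not copied from the file, and the sketch assumes that canonicalized names contain no spaces and that the track number parses as an integer.

```python
from typing import NamedTuple


class Track(NamedTuple):
    artist: str
    album: str
    number: int
    title: str


def parse_track_line(line: str) -> Track:
    """Split one line of the track list into its four space-separated fields."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    artist, album, number, title = fields
    # Assumption: the TrackNumber field is a plain decimal integer.
    return Track(artist, album, int(number), title)


# Illustrative line only (an assumption about how a real entry looks).
example = parse_track_line("pretenders Learning_To_Crawl 1 Middle_Of_The_Road")
print(example.artist, example.album, example.number, example.title)
```

Raising an error on lines that do not have exactly four fields makes any violation of the no-spaces assumption visible immediately, rather than silently shifting fields.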

We distribute the actual MFCC feature files we used (about 12 GB) on a set of 3 DVDs to interested researchers. Contact dpwe@ee.columbia.edu for more information.

Metadata

In addition to the (implicit) artist and album information, we also gathered the "style" tags from All Music Guide for each of the 400 artists represented in the collection. They are available in the file aset400-styles.txt, which consists of 400 lines, one for each artist in aset400.txt. Each line consists of a sequence of tokens separated by spaces; the first token is the artist name (the same as in aset400.txt). The remaining tags are the various styles from the AMG page, in the order they appear on that page, with spaces replaced by underscores ("_") to make them into single tokens. There are between one and ten styles defined for each artist. There are 251 unique style tokens, listed in amg-styles.txt along with the number of times they occur in aset400-styles.txt; 102 styles apply to only one artist in aset400, and 46 styles have 10 or more bands representing them. The most popular style is "Pop/Rock" with 115 representatives.
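
The format described above can be read with a few lines of Python. This is only a sketch: parse_styles and count_styles are hypothetical helper names, and the sample lines are made up to match the described format rather than taken from aset400-styles.txt.

```python
from collections import Counter


def parse_styles(lines):
    """Map each artist name to its list of style tokens, in page order."""
    styles = {}
    for line in lines:
        tokens = line.split()
        if tokens:  # skip blank lines
            styles[tokens[0]] = tokens[1:]
    return styles


def count_styles(styles):
    """Count how many times each style token occurs across all artists."""
    return Counter(style for tags in styles.values() for style in tags)


# Made-up sample lines in the described format (not real file contents).
sample = [
    "artist_one Pop/Rock Album_Rock",
    "artist_two Pop/Rock",
]
counts = count_styles(parse_styles(sample))
print(counts.most_common(1))  # the most frequent style token in the sample
```

Run over the real file (for example with open("aset400-styles.txt") as the lines argument), the same counting should correspond to the per-style totals listed in amg-styles.txt, assuming the file matches the description.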

For completeness, we also include the All Music genres in aset400-genres.txt. Of 400 artists, 293 fall into "Rock", with 9 other genres making up the rest. (Both aset400-genres.txt and aset400-styles.txt are in the same 'library' order as aset400.txt.)

How the data was generated

Uspop2002 was generated using the SPRACHcore software from ICSI. The actual feature extraction is done by the "feacalc" utility, which is part of SPRACHcore and can be downloaded from: http://www.icsi.berkeley.edu/~dpwe/projects/sprach/sprachcore.html.

The audio in our database is stored in stereo mp3 files encoded at 128 kbps. Prior to feature calculation, the mp3 files were decoded, downsampled by half to 22050 Hz, and mixed down to mono. The command used was:

mpg123 -m -2 -w $OUTPUT_WAV.wav $INPUT_MP3.mp3
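
To apply the same decoding step to a whole directory tree, something like the following Python sketch could be used. It is not the script we ran: mpg123_command and decode_all are hypothetical names, the directory layout is an assumption, and mpg123 must be installed for decode_all to do anything.

```python
import subprocess
from pathlib import Path


def mpg123_command(mp3_path, wav_path):
    """Build the argument list for the decode command shown above."""
    # -m: mix down to mono, -2: 2:1 downsampling, -w: write a WAV file
    return ["mpg123", "-m", "-2", "-w", str(wav_path), str(mp3_path)]


def decode_all(mp3_dir, wav_dir):
    """Decode every .mp3 under mp3_dir to a .wav at the same relative path in wav_dir."""
    mp3_dir = Path(mp3_dir)
    wav_dir = Path(wav_dir)
    for mp3 in sorted(mp3_dir.rglob("*.mp3")):
        wav = (wav_dir / mp3.relative_to(mp3_dir)).with_suffix(".wav")
        wav.parent.mkdir(parents=True, exist_ok=True)
        # check=True stops at the first file that mpg123 fails to decode
        subprocess.run(mpg123_command(mp3, wav), check=True)


print(" ".join(mpg123_command("input.mp3", "output.wav")))
```

Mirroring the relative path under wav_dir avoids collisions between tracks that share a file name on different albums.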

The feacalc command used to generate the MFCC features from the wav files is:

feacalc  -list $UTTERANCE_LIST  \
	 -filecmd "echo $DATAROOT/%u" \
	 -ipformat MSWAVE \
	 -samp 22050 \
	 -dith \
	 -hpfilter \
	 -nyq 8000 \
	 -opformat pfile \
	 -step 16 \
	 -window 32 \
	 -deltaorder 0 \
	 -ras no -plp no -dom cep -frq mel -filt tri -cepsord 20 \
	 -output $OUTPUT

where $UTTERANCE_LIST is a file containing the song names, one per line. This command was actually run once per artist, creating one file per artist that contained the MFCCs for all of that artist's songs concatenated together in the "pfile" format used by SPRACHcore. See the feacalc manpage for details. Essentially, the command computes 20 MFCCs on 32 ms windows every 16 ms, using dithering to prevent numerical problems, a simple high-pass filter to remove any DC offset, and triangular filters in the mel-frequency filterbank.
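
To make the framing parameters concrete, the following Python sketch works out the window and step sizes in samples and the approximate frame rate. The helper names are hypothetical, and the frame counts are approximate: how feacalc rounds the non-integer sample counts and treats the final partial window is not covered here.

```python
SAMPLE_RATE_HZ = 22050  # rate after the 2:1 downsampling
WINDOW_MS = 32          # -window 32
STEP_MS = 16            # -step 16

window_samples = SAMPLE_RATE_HZ * WINDOW_MS / 1000  # 705.6 samples per window
step_samples = SAMPLE_RATE_HZ * STEP_MS / 1000      # 352.8 samples per step
frames_per_second = 1000 / STEP_MS                  # 62.5 frames per second


def approx_frames(duration_s):
    """Approximate number of feature frames for audio of the given duration."""
    return int(duration_s * 1000 // STEP_MS)


def approx_duration_s(n_frames):
    """Approximate audio duration spanned by n_frames feature vectors."""
    return n_frames * STEP_MS / 1000


print(window_samples, step_samples, frames_per_second)
print(approx_frames(240))       # a four-minute song: about 15000 frames
print(approx_duration_s(2000))  # 2000 frames: about 32 seconds
```

With 20 coefficients per frame, a four-minute song therefore yields on the order of 300,000 values.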

The pfiles were then split into htk files, one per song, using the feacat command:

feacat -ipformat pfile -period 16 -opformat htk -olist $LIST_FILE $INPUT_FILE

where $LIST_FILE contains the file names of the destination files for each song, one per line. The output files are stored under uspop2002/artists.

To regenerate a single MFCC feature file in HTK format from the original MP3 audio, you could use the following more compact form:

# Set up options for automatic mpg123 decoding (hacky)
MP3_ROPTS="-2 -m"; export MP3_ROPTS      # extra options to mpg123
PCMFORMAT="R22C1FsEl"; export PCMFORMAT  # implicit format of raw pcm stream
feacalc -sr 22050 -nyq 8000 -dith -hpf -opf htk -delta 0 -plp no \
    -dom cep -com yes -frq mel -filt tri -win 32 -step 16 -cep 20 \
    pretenders--Learning_To_Crawl--Middle_Of_The_Road--1.mp3 -o motr.htk

The closest I have been able to get to calculating these within Matlab is to use my melfcc routine from the rastamat package as follows:

[d,sr] = mp3read(mp3file,0,1,2);   % downsample to mono/22050 on input
d = resample(d,1600,2205);         % further downsample to 16000
sr = 16000;
mfc2 = melfcc(d*5.21, sr, 'maxfreq', 8000, 'numcep', 20, ...
    'nbands', 22, 'fbtype', 'fcmel', 'dcttype', 1, 'usecmp', 1, ...
    'wintime', 0.032, 'hoptime', 0.016, 'preemph', 0, 'dither', 1);

How to use the data

The file uspop2002-aset-files.txt contains the filenames corresponding to the songs listed in uspop2002-aset.txt.

For Matlab users, this script might be useful for reading htk files: readhtk.m. Thanks to Mike Brookes, the author of the voicebox Matlab toolbox, for this script.
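
For users outside Matlab, a reader can be sketched in a few lines of Python using only the standard library. This is not a port of readhtk.m: read_htk is a hypothetical name, and the sketch assumes the standard HTK layout (a 12-byte big-endian header followed by uncompressed big-endian 32-bit floats), which should be checked against the files themselves.

```python
import struct


def read_htk(path):
    """Read an uncompressed HTK parameter file.

    Returns (frames, samp_period, parm_kind): frames is a list with one
    list of coefficients per frame, samp_period is in units of 100 ns.
    """
    with open(path, "rb") as f:
        # Header: nSamples and sampPeriod (int32), sampSize and parmKind (int16)
        n_frames, samp_period, samp_size, parm_kind = struct.unpack(
            ">iihh", f.read(12)
        )
        n_coefs = samp_size // 4  # assumption: 4-byte float32 coefficients
        frames = []
        for _ in range(n_frames):
            data = f.read(samp_size)
            if len(data) != samp_size:
                raise ValueError("file is shorter than its header claims")
            frames.append(list(struct.unpack(f">{n_coefs}f", data)))
    return frames, samp_period, parm_kind
```

The result has one row per frame, so it needs transposing if you want one column per frame. In the header's 100 ns units, a 16 ms frame period corresponds to a samp_period of 160000.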

See this quick example of using Matlab and netlab to do artist ID for an illustration of how the features can be used to train classifiers.

Inverting MFCCs back to audio

A lot of information is lost when converting audio to MFCCs - most notably, the pitch information. However, it is possible to resynthesize audio that approximates the signal that was originally analyzed by carefully inverting each stage of MFCC calculation. Here is some Matlab code to invert MFCCs to audio. The following example (similar to the one at the bottom of that page) will invert from the MFCC files in the uspop2002 set:

>> mm = readhtk('thompson_twins/Into_The_Gap/03-Hold_Me_Now.htk');
>> mm = mm';   % HTK files are transposed compared to what we want
>> sr = 16000;
>> [im,ispc] = invmelfcc(mm(:,1:2000), sr, 'maxfreq', sr/2, 'numcep', 20, ...
     'nbands', 22, 'fbtype', 'fcmel', 'dcttype', 1, 'usecmp', 1, ...
     'wintime', 0.032, 'hoptime', 0.016, 'preemph', 0, 'dither', 1);
>> % listen to the reconstruction
>> soundsc(im,sr)

Notes

There are a lot of live albums. The original album selection algorithm attempted to optimize the coverage of 'popular' tracks. This biased it in favor of live albums, which typically include most of the hits from recent studio albums.
We noticed the duplicate Pink Floyd album and removed it on 2005-05-12. Results prior to that simply had two instances of each of the 8 tracks on Disc 2 of The Delicate Sound of Thunder.

Referencing

If you make a publication using this data and would like to reference the source, you can refer to the following paper:

A. Berenzweig, B. Logan, D. Ellis, B. Whitman (2004). A large-scale evaluation of acoustic and subjective music-similarity measures: Computer Music Journal, 28(2), pp. 63-76, June 2004. (14pp)

... or you can simply reference this web page, e.g.:

D. Ellis, A. Berenzweig, B. Whitman (2003). The "uspop2002" Pop Music data set: Web resource, available: http://labrosa.ee.columbia.edu/projects/musicsim/uspop2002.html.

Acknowledgments

This project has received support from several places:

The original collection of CDs was devised and purchased by Brian Whitman and Steve Lawrence while they were at the NEC Research Institute in Princeton, New Jersey. The NEC Research Institute then generously donated them to LabROSA for use in academic Music IR research.
LabROSA's work on music topics and collaboration with Columbia's Computer Music Center are supported by a grant from Columbia's Academic Quality Fund.
LabROSA and the PI are supported by the National Science Foundation under Grant No. IIS-0238301. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

Last updated: $Date: 2010/08/29 06:20:03 $
Dan Ellis <dpwe@ee.columbia.edu>