Frequently Asked Questions

Good question, a lot of things. Principally, the dataset consists of almost all the information available through The Echo Nest API for one million popular tracks. This encompasses both metadata and audio analysis features. Each file is for one track which corresponds to one song, one release and one artist. All the information about these four items (track, song, release, artist) is in every file (which involves some redundancy, although the bulk of the data, relating to the audio analysis, is unique).
For an in-depth look, see the section describing all the fields you will find in one song file. Note that the data is quite complete, but some fields might be missing for some songs. For instance, we don't have the location for all artists.

Field list

Below is a list of all fields available in the files of the dataset. The same list with data from a specific song is available here. Another reference is the code display_song.py: if a field is displayed, the field exists and there should be a getter for it (if we forgot some in matlab or java, please let us know).
For the analysis fields, we suggest you first read The Echo Nest analyze documentation. The main audio features are 'segments_pitches' and 'segments_timbre'.

Field name	Type	Description	Link
analysis sample rate	float	sample rate of the audio used	url
artist 7digitalid	int	ID from 7digital.com or -1	url
artist familiarity	float	algorithmic estimation	url
artist hotttnesss	float	algorithmic estimation	url
artist id	string	Echo Nest ID	url
artist latitude	float	latitude
artist location	string	location name
artist longitude	float	longitude
artist mbid	string	ID from musicbrainz.org	url
artist mbtags	array string	tags from musicbrainz.org	url
artist mbtags count	array int	tag counts for musicbrainz tags	url
artist name	string	artist name	url
artist playmeid	int	ID from playme.com, or -1	url
artist terms	array string	Echo Nest tags	url
artist terms freq	array float	Echo Nest tags freqs	url
artist terms weight	array float	Echo Nest tags weight	url
audio md5	string	audio hash code
bars confidence	array float	confidence measure	url
bars start	array float	beginning of bars, usually on a beat	url
beats confidence	array float	confidence measure	url
beats start	array float	result of beat tracking	url
danceability	float	algorithmic estimation
duration	float	in seconds
end of fade in	float	seconds at the beginning of the song	url
energy	float	energy from listener point of view
key	int	key the song is in	url
key confidence	float	confidence measure	url
loudness	float	overall loudness in dB	url
mode	int	major or minor	url
mode confidence	float	confidence measure	url
release	string	album name
release 7digitalid	int	ID from 7digital.com or -1	url
sections confidence	array float	confidence measure	url
sections start	array float	largest grouping in a song, e.g. verse	url
segments confidence	array float	confidence measure	url
segments loudness max	array float	max dB value	url
segments loudness max time	array float	time of max dB value, i.e. end of attack	url
segments loudness max start	array float	dB value at onset	url
segments pitches	2D array float	chroma feature, one value per note	url
segments start	array float	musical events, ~ note onsets	url
segments timbre	2D array float	texture features (MFCC+PCA-like)	url
similar artists	array string	Echo Nest artist IDs (sim. algo. unpublished)	url
song hotttnesss	float	algorithmic estimation
song id	string	Echo Nest song ID
start of fade out	float	time in sec	url
tatums confidence	array float	confidence measure	url
tatums start	array float	smallest rhythmic element	url
tempo	float	estimated tempo in BPM	url
time signature	int	estimate of number of beats per bar, e.g. 4	url
time signature confidence	float	confidence measure	url
title	string	song title
track id	string	Echo Nest track ID
track 7digitalid	int	ID from 7digital.com or -1	url
year	int	song release year from MusicBrainz or 0	url
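
The 'key' and 'mode' fields in the table above are plain integers. Below is a minimal sketch of turning them into a readable label. It assumes the usual Echo Nest convention (pitch classes 0 to 11 starting at C, mode 1 for major and 0 for minor); the helper name key_label is hypothetical and not part of the dataset code.

```python
# Pitch-class names, assuming 0 = C, 1 = C#, ..., 11 = B (assumed convention).
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def key_label(key, mode):
    """Return a label such as 'A minor' from the integer 'key' and 'mode' fields.

    Hypothetical helper: assumes mode 1 means major and mode 0 means minor.
    """
    if key not in range(len(PITCH_CLASSES)):
        raise ValueError("key must be in 0..11, got %r" % (key,))
    if mode not in (0, 1):
        raise ValueError("mode must be 0 or 1, got %r" % (mode,))
    return "%s %s" % (PITCH_CLASSES[key], "major" if mode == 1 else "minor")


print(key_label(9, 0))  # A minor
```

Both estimates come with a confidence field ('key confidence', 'mode confidence'), so it is probably wise to check those before trusting the label.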

How did you choose the million tracks?

Choosing a million songs is surprisingly challenging. We followed these steps:

1) Getting the most 'familiar' artists according to The Echo Nest, then downloading as many songs as possible from each of them
2) Getting the 200 top terms from The Echo Nest, then using each term as a descriptor to find 100 artists, then downloading as many of their songs as possible
3) Getting the songs and artists from the CAL500 dataset
4) Getting 'extreme' songs from The Echo Nest search params, e.g. songs with highest energy, lowest energy, tempo, song hotttnesss, ...
5) A random walk along the similar artists links starting from the 100 most familiar artists

The number of songs was approximately 8950 after step 1), step 3) added around 15000 songs, and we had approx. 500000 songs before starting step 5). For more technical details, see "dataset creation" in the "code" tab.
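
To illustrate the last step, here is a minimal sketch of a random walk along similar-artist links. It is not the code that built the dataset (see "dataset creation" in the "code" tab for that); the function name, the restart rule at dead ends and the toy graph with made-up artist IDs are all assumptions for illustration.

```python
import random


def random_walk_artists(similar, seeds, n_steps, rng=None):
    """Collect artist IDs by walking along 'similar artists' links.

    similar: dict mapping an artist ID to a list of similar artist IDs.
    seeds:   artist IDs to start from (e.g. the most familiar artists).
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is repeatable
    seeds = list(seeds)
    visited = set(seeds)
    current = rng.choice(seeds)
    for _ in range(n_steps):
        neighbors = similar.get(current, [])
        if not neighbors:
            # Dead end: restart from a random seed (an assumed policy).
            current = rng.choice(seeds)
            continue
        current = rng.choice(neighbors)
        visited.add(current)
    return visited


# Toy graph with made-up artist IDs, for illustration only.
toy_similar = {
    "AR_A": ["AR_B", "AR_C"],
    "AR_B": ["AR_A", "AR_D"],
    "AR_C": ["AR_D"],
    "AR_D": [],
}
found = random_walk_artists(toy_similar, ["AR_A"], n_steps=50)
print(sorted(found))
```

In a real run, each newly visited artist would trigger a download of as many of their songs as possible.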

Statistics of the dataset

1,000,000 songs / files
273 GB of data
44,745 unique artists
7,643 unique terms (The Echo Nest tags)
2,321 unique musicbrainz tags
43,943 artists with at least one term
2,201,916 asymmetric similarity relationships
515,576 dated tracks starting from 1922
18,196 cover songs identified
11 pastries lost in a related hackday
99 bottles of beer on the wall

What MusicBrainz data is included?

Fields 'year', 'artist_mbtags' and 'artist_mbtags_count' were extracted from MusicBrainz. We used a local copy of the server (our version is this branch), with data dumps from December 4th, 2010. Note that the field 'artist_mbid' is provided by The Echo Nest API.
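
As a small illustration of how 'artist_mbtags' and 'artist_mbtags_count' might be used together, the sketch below picks the most frequently applied tag for an artist, for example as a single label per artist. It assumes the two fields are parallel arrays (tag i has count i); the helper name top_mbtag and the sample values are made up.

```python
def top_mbtag(mbtags, mbtags_count):
    """Return the tag with the highest count, or None if the artist has no tags.

    Assumes mbtags[i] was applied mbtags_count[i] times (parallel arrays).
    """
    if len(mbtags) != len(mbtags_count):
        raise ValueError("mbtags and mbtags_count must have the same length")
    if len(mbtags) == 0:
        return None
    # max() returns the first index among ties, so earlier tags win.
    best = max(range(len(mbtags)), key=lambda i: mbtags_count[i])
    return mbtags[best]


print(top_mbtag(["rock", "british", "pop"], [3, 1, 2]))  # rock
```

Artists without any MusicBrainz tag return None and would have to be left out of such a task.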

What is the difference between "terms" and "mbtags"?

"Terms" are the tags provided by The Echo Nest. They can come from a number of places, but mostly blogs as far as we understand.
"Mbtags" are musicbrainz tags, specifically applied by humans to a particular artist. This explains why there are fewer of them (see 'mbtags_count'), but they are usually very clean and informative. For instance, if you want to create a genre recognition task where classes are mutually exclusive, mbtags are likely to be more reliable than terms.

Why are there more artist names than artist IDs?

If you count the artist names, you will see that there are almost twice as many unique names as there are artist IDs in the dataset. The Echo Nest often associates songs with artist names such as "A feat. B" or "A / B" to artist A. In most cases, this is the reasonable thing to do. Just be careful when you use files indexed by artist ID (e.g. unique_artists.txt), since you will get only one of the artist names, chosen at random. If you try to do string matching, you probably need to consider all the names for an artist; you can use the SQLite database track_metadata.db to find them.
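
Here is a minimal sketch of such a lookup with Python's sqlite3 module. It assumes that track_metadata.db has a table named songs with columns artist_id and artist_name; to stay self-contained, it runs against a toy in-memory table with made-up rows rather than the real file.

```python
import sqlite3


def artist_names(conn, artist_id):
    """Return all distinct names attached to one artist ID, sorted."""
    rows = conn.execute(
        "SELECT DISTINCT artist_name FROM songs "
        "WHERE artist_id = ? ORDER BY artist_name",
        (artist_id,),
    )
    return [name for (name,) in rows]


# Toy in-memory stand-in for track_metadata.db; table layout is assumed
# and all IDs and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (track_id TEXT, artist_id TEXT, artist_name TEXT)")
conn.executemany(
    "INSERT INTO songs VALUES (?, ?, ?)",
    [
        ("TR0001", "AR_A", "A"),
        ("TR0002", "AR_A", "A feat. B"),
        ("TR0003", "AR_A", "A / B"),
        ("TR0004", "AR_B", "B"),
    ],
)
print(artist_names(conn, "AR_A"))
```

With the real database, you would replace ":memory:" by the path to track_metadata.db and skip the CREATE and INSERT statements.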

How can I get the dataset?

See tab Getting the dataset. There you will also find a subset to get you started quickly.

Why didn't you build an API?

APIs have a lot of merit, and most of this dataset was built using The Echo Nest API. But APIs have not solved all research problems, and there are several advantages to having a local copy of a fixed dataset. Everyone gets the same data and can report results on the same songs and features. Also, it facilitates downloading. Just because an API exists doesn't mean that everyone will go to the trouble of downloading a million songs.
That being said, this dataset is complementary to the several APIs we mention. That's why we made sure to provide enough metadata, such as the musicbrainz ID, so everyone can link this data to other existing resources.

Can I contact you privately to get the audio?

We don't have the audio at LabROSA, so there's nothing for us to give you. The features were mainly created in-house by content owners, using analysis code supplied to them by The Echo Nest.

You can, however, use services like 7digital or playme to preview a song; we provide demo code. Grooveshark is also very powerful (but is it legal? anyone?)
If you do have audio, you can analyze it and turn it into the same format as the rest of this dataset (see below in the FAQ).

Can I recover audio from the features?

Or "can I reverse engineer the features?". Well... you should try! Dan Ellis provided MATLAB code to do something like this; see the MatlabSrc directory on GitHub and his Matlab tutorial.
To be clear, the main audio features are 'loudness', 'pitches' and 'timbre' on the segment level. Segments are 'musical events', usually note onsets. There is a lot of information lost when going from audio to these features (even though they are quite large in size/quantity/dimension!)
For many tasks, an excerpt from the full track may be sufficient, and these are available via 7digital. Here is a python script to access them, and Dan provides a Matlab version here. Note that the 'Random track' box on the right side of this website uses the Python code in real time; there is nothing pre-loaded.

What is the format of the data files?

Each file represents one track with all the related information (artist information, release information, audio analysis of the track, etc). Usually one track is one song, although there are multiple versions of the same track for many songs -- extended versions, different releases, etc. -- each in its own file.
Remember that some songs are released many times with slight differences (e.g. US versus European release).

The actual file format of each file is HDF5. A schema of the inner organization of an HDF5 song file can be found here.

HDF what?

HDF5! Why? Because! (regarding this, all complaints should definitely go to tb2332@columbia.edu). In short, HDF5 is a format developed at NCSA (the National Center for Supercomputing Applications) to handle 1) large 2) heterogeneous 3) hierarchical datasets. The data can be compressed (10%-15% more than matfiles), and the I/O speed is still impressive. Also note that the core library comes free of charge and wrappers exist in most languages (see code tab).
Is it perfect? No. Does it make more sense than 1M zipped json files or matfiles? Yes. Note that the new matfile format (v7.3) is actually HDF5.
Still not happy? Here is python code or matlab code to transform the data into matfiles (the latter is less tested, non-ASCII strings seem wrongly encoded).

What are song / aggregate / summary files?

A "song file" refers to the typical HDF5 file containing information for only one song.

An "aggregate file" is also an HDF5 file that contains the information for several songs. These are useful if you do I/O intensive experiments, since they reduce the number of open/close file operations you need to perform.

A "summary file" is similar to an aggregate file, but contains just the metadata, i.e. we remove all the tables (analysis of bars, beats, segments, ..., artist similarity, tags). Useful if you want to quickly search the metadata, since a lot of space is saved! Check the scripts create_summary_file.py and create_aggregate_file.py. The summary file of the whole dataset is available (only 300 MB!): msd_summary_file.h5.

Note on summary files: if you're using the code display_song.py, you need the '-summary' flag to tell the code that some getters won't find their field, e.g. bars_start.
The dataset you received should contain one million song files. You can create aggregate and/or summary files using the python scripts.

Why aren't the strings displaying correctly?

All data was downloaded in utf-8 format and was saved in the HDF5 file in utf-8 format. We hoped that this would ensure that every name / release / title would display correctly if you set your display to utf-8, but in practice some strings with uncommon elements will never be correctly recorded. In particular, it seems difficult to tell MATLAB to read HDF5 as UTF-8 instead of Unicode. If you have a work-around, let us know!
Conclusion: use the strings like titles and artist names as indications, the real identifiers should be the musicbrainz ID or the Echo Nest ID.

How can I visualize a file?

A common first thing you might want to do is quickly glance at the numerical content of a data file. How you do this of course depends on the language you want to use. In each case, you'll need the HDF5 library installed (see the code tab). In python (with pytables installed), use display_song.py. In Java, you can use the great tool HDFView. MATLAB will also let you visualize the content, see for example the Matlab tutorial.

What are these weird file paths?

We could not put a million files in one folder. Even a few thousand files in one directory can slow disk accesses significantly. We based the directory structure on The Echo Nest track IDs, which are a kind of hash code. Echo Nest track IDs always take the form TR+LETTERS+LETTERS&NUMBERS. The directory path within the Million Song Dataset is given by the 3rd, 4th and 5th characters of the track ID, and the file itself is named after its track ID plus the extension ".h5". For example, MillionSong/data/A/D/H/TRADHRX12903CD3866.h5.
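
The mapping from track ID to file path can be sketched in a few lines of Python. The function name track_path and the default root "MillionSong/data" (taken from the example above) are assumptions; adjust the root to wherever your copy lives.

```python
def track_path(track_id, root="MillionSong/data"):
    """Build the path of a song file from its Echo Nest track ID."""
    if len(track_id) < 5 or not track_id.startswith("TR"):
        raise ValueError("unexpected track ID: %r" % (track_id,))
    # Characters 3, 4 and 5 (indices 2, 3, 4) name the three directory levels.
    parts = [root, track_id[2], track_id[3], track_id[4], track_id + ".h5"]
    return "/".join(parts)


print(track_path("TRADHRX12903CD3866"))
# MillionSong/data/A/D/H/TRADHRX12903CD3866.h5
```

The sketch joins with "/" for readability; on Windows you may prefer os.path.join.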

What can I do with this data?

Surprise us! But you might want to take a look at the tasks / demos tab to get inspired. That page includes snippets of code to help you crawl through the whole dataset.
This recent article provides a wide-ranging survey of existing Music Information Retrieval classification tasks.

How can I organize all this data?

Gordon! (more to come on this).
We have also created three SQLite databases:

track_metadata.db includes most of the metadata for each track. It is useful, for instance, to find all the tracks from a particular artist.
artist_term.db consists only of the metadata that applies at the artist level.
artist_similarity.db summarizes artist similarity. In particular, it only lists artists that actually are in the dataset.

For demos using these databases with SQL queries in python, see demo_*.py in the SQLite folder.
You can, of course, also create your own SQLite databases best suited to your own applications.

How was the dataset created?

We used The Echo Nest API and some information from a local copy of the musicbrainz server. Data was downloaded during December 2010. For more information on how we chose the tracks, see How did you choose the million tracks?.

Can I add my audio to the dataset?

In the sense of having your audio analyzed and put in HDF5 format like the rest of the dataset, the answer is mostly yes: the following python code should do the trick. Note that it uses pyechonest, and to have the full data ('year', 'mbtags' and 'mbtags_count'), you need a local musicbrainz server.
The code is not perfect for the following reasons: 1) it is unclear what happens if the song is not recognized by The Echo Nest fingerprinter and 2) even if it is recognized, if you upload audio associated with a song in the dataset, there is no guarantee you'll get the same track ID.

Will I get back the same data if I access The Echo Nest API?

Probably not, even though that's what was used to create this dataset! If you take a track ID and request it from The Echo Nest API, the info might not exactly match this dataset. Some fields are bound to change over time, such as artist familiarity and song hotttnesss. Some should be stable, like song title and artist name, but there is no guarantee. The analysis of the audio should also remain the same, but The Echo Nest updates their analyzer once in a while (to implement a better beat tracking or segmenter for instance).
Remember that the dataset was created / downloaded between December 18 and December 31 2010.

What are the licensing terms?

The Echo Nest data is released under the same terms of use as their API. For a more readable version of the TOS, please read their ground rules. Put simply, if you are a researcher and want to publish results on the dataset, you are fine. If you are a company and are concerned about experimenting on the dataset, send an email to Paul at The Echo Nest.
Regarding the MusicBrainz data contained in the dataset, the track 'year' is under public domain and the 'tags' and 'tag count' are under Attribution-NonCommercial-ShareAlike 2.0 license.
Regarding the SecondHandSongs dataset, see its webpage.
Regarding the musiXmatch dataset, see its webpage.
The code is released under the GNU General Public License.

Where can I get help?

Check to see if someone has brought up a similar issue in the forums; if not, try making your own posting. Then, send an email to one of the creators at LabROSA or The Echo Nest, see the contact us tab. Thierry Bertin-Mahieux is a good first try.

What other large datasets are available?

We mention some other sources of data related to Music Information Retrieval research. THIS IS NOT AN EXHAUSTIVE LIST! If you want your dataset to be included here, send me an email.

G. Tzanetakis maintains two datasets including the famous GTZAN, which is small by today's standards. Magnatagatune is one of the largest datasets that provides audio and tag information. Also note the very important RWC music database. Other tagging / genre datasets include CAL500 and the latin music database.
Regarding recommendation, the new standard is probably Yahoo Music Ratings. Paul Lamere also maintains a 2007 crawl of some Last.fm data.
For metadata, nothing compares to musicbrainz.
For structure analysis, Chris Harte annotated the Beatles, check his paper and write him to obtain the data.
Resources that we are less familiar with but are worth checking out include SALAMI, Codaich, soundsoftware, MusiClef 2011 and OMRAS2. Don't forget the actual The Echo Nest API.
On a more general machine learning note, infochimps is an incredible source of data. Similar are UCI repository and mldata. Also, to compare algorithms, websites like MLcomp, TunedIT and kaggle are extremely useful.

How can I get involved?

Simple! Use the dataset and share your code. We will be happy to put it on this website. You can also share this website using the buttons on the right menu.

Then, put the list of songs you use (defined, for example, by their Echo Nest ID) on your website when you publish something. With a simple list of IDs in a text file, it is very easy to recreate a test set in a few minutes, allowing others to make precise comparisons against your algorithm.
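
As a sketch of how little is needed, the following Python code writes a list of track IDs to a text file (one per line) and reads it back. The file name test_set.txt, the helper names and the second ID are made up for illustration; ignoring lines that start with '#' is just one possible convention.

```python
import os
import tempfile


def write_id_list(path, track_ids):
    """Write one ID per line, de-duplicated and sorted so the file is stable."""
    with open(path, "w", encoding="utf-8") as f:
        for tid in sorted(set(track_ids)):
            f.write(tid + "\n")


def read_id_list(path):
    """Read IDs back, ignoring blank lines and lines starting with '#'."""
    with open(path, encoding="utf-8") as f:
        lines = [line.strip() for line in f]
    return [line for line in lines if line and not line.startswith("#")]


# Round trip in a temporary directory; the second ID is made up.
ids = ["TRADHRX12903CD3866", "TRZZZZZ00000000000", "TRADHRX12903CD3866"]
with tempfile.TemporaryDirectory() as tmp:
    list_path = os.path.join(tmp, "test_set.txt")
    write_id_list(list_path, ids)
    test_set = read_id_list(list_path)
print(test_set)
```

Anyone with a copy of the dataset can then map each ID back to its file and rebuild exactly the same test set.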

Finally, talk to other researchers about using larger datasets. The GTZAN genre collection was amazingly useful, but we need to move on.

How can I extend the dataset?

If you have data that could be linked with the Million Song Dataset, we would love to hear from you! Examples include: another set of tags for artists or songs, new similarity relationships, download statistics from P2P networks, a new set of features, etc. Your contribution does not have to cover the full set of one million, and it can include new artists or songs: if the intersection is larger than 10K songs, it can be very helpful to researchers!
The goal is to make it easy to link different resources. For instance, if your information is on the artist level, we are experienced in matching against The Echo Nest artist ID and musicbrainz ID, and even 7digital ID. Write us!

How do I cite the dataset?

Check the Contact Us tab.

What publications use the dataset?

We try to maintain an informal list here: http://millionsongdataset.com/pages/publications
