Good question, a lot of things. Principally, the dataset consists of almost all the information available through The Echo Nest API for one million popular tracks. This encompasses both metadata and audio analysis features. Each file is for one track, which corresponds to one song, one release and one artist. All the information about these four items (track, song, release, artist) is in every file (which involves some redundancy, although the bulk of the data, relating to the audio analysis, is unique).
For an in-depth look, see the section describing all the fields you will find in one song file. Note that the data is quite complete, but some fields might be missing for some songs. For instance, we don't have the location for all artists.
Below is a list of all fields available in the files of the dataset. The same list with data from a specific song is available here. Another reference is the code display_song.py: if a field is displayed, the field exists and there should be a getter for it (if we forgot some in matlab or java, please let us know).
For the analysis fields, we suggest you first read The Echo Nest analyze documentation. The main audio features are 'segments_pitches' and 'segments_timbre'.
|field name||type||description||link|
|analysis sample rate||float||sample rate of the audio used||url|
|artist 7digitalid||int||ID from 7digital.com or -1||url|
|artist familiarity||float||algorithmic estimation||url|
|artist hotttnesss||float||algorithmic estimation||url|
|artist id||string||Echo Nest ID||url|
|artist location||string||location name|
|artist mbid||string||ID from musicbrainz.org||url|
|artist mbtags||array string||tags from musicbrainz.org||url|
|artist mbtags count||array int||tag counts for musicbrainz tags||url|
|artist name||string||artist name||url|
|artist playmeid||int||ID from playme.com, or -1||url|
|artist terms||array string||Echo Nest tags||url|
|artist terms freq||array float||Echo Nest tags freqs||url|
|artist terms weight||array float||Echo Nest tags weight||url|
|audio md5||string||audio hash code|
|bars confidence||array float||confidence measure||url|
|bars start||array float||beginning of bars, usually on a beat||url|
|beats confidence||array float||confidence measure||url|
|beats start||array float||result of beat tracking||url|
|end of fade in||float||seconds at the beginning of the song||url|
|energy||float||energy from listener point of view|
|key||int||key the song is in||url|
|key confidence||float||confidence measure||url|
|loudness||float||overall loudness in dB||url|
|mode||int||major or minor||url|
|mode confidence||float||confidence measure||url|
|release 7digitalid||int||ID from 7digital.com or -1||url|
|sections confidence||array float||confidence measure||url|
|sections start||array float||largest grouping in a song, e.g. verse||url|
|segments confidence||array float||confidence measure||url|
|segments loudness max||array float||max dB value||url|
|segments loudness max time||array float||time of max dB value, i.e. end of attack||url|
|segments loudness start||array float||dB value at onset||url|
|segments pitches||2D array float||chroma feature, one value per note||url|
|segments start||array float||musical events, ~ note onsets||url|
|segments timbre||2D array float||texture features (MFCC+PCA-like)||url|
|similar artists||array string||Echo Nest artist IDs (sim. algo. unpublished)||url|
|song hotttnesss||float||algorithmic estimation|
|song id||string||Echo Nest song ID|
|start of fade out||float||time in sec||url|
|tatums confidence||array float||confidence measure||url|
|tatums start||array float||smallest rhythmic element||url|
|tempo||float||estimated tempo in BPM||url|
|time signature||int||estimate of number of beats per bar, e.g. 4||url|
|time signature confidence||float||confidence measure||url|
|track id||string||Echo Nest track ID|
|track 7digitalid||int||ID from 7digital.com or -1||url|
|year||int||song release year from MusicBrainz or 0||url|
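To make the shape of 'segments pitches' concrete, here is a minimal sketch in plain Python. It assumes each row holds 12 chroma values ordered from C to B, as described in The Echo Nest analyze documentation; the helper name and the toy values are ours and do not come from the dataset.

```python
# Pitch classes in the order assumed for each 12-value chroma row.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def dominant_pitch_classes(segments_pitches):
    """Return the name of the strongest pitch class for every segment.

    `segments_pitches` is any sequence of rows with 12 floats each
    (hypothetical helper, not part of the dataset code).
    """
    names = []
    for chroma in segments_pitches:
        if len(chroma) != 12:
            raise ValueError("expected 12 chroma values per segment")
        strongest = max(range(12), key=lambda i: chroma[i])
        names.append(PITCH_CLASSES[strongest])
    return names


# Two made-up segments: the first peaks on C, the second on A.
toy_segments = [
    [1.0, 0.1, 0.2, 0.1, 0.6, 0.1, 0.1, 0.7, 0.1, 0.2, 0.1, 0.1],
    [0.2, 0.1, 0.1, 0.1, 0.5, 0.1, 0.1, 0.1, 0.1, 1.0, 0.1, 0.1],
]
print(dominant_pitch_classes(toy_segments))
```

With a real song file, the same function could be applied to the array returned by the segments_pitches getter.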
Choosing a million songs is surprisingly challenging. We followed these steps:
- Getting the most 'familiar' artists according to The Echo Nest, then downloading as many songs as possible from each of them
- Getting the 200 top terms from The Echo Nest, then using each term as a descriptor to find 100 artists, then downloading as many of their songs as possible
- Getting the songs and artists from the CAL500 dataset
- Getting 'extreme' songs from The Echo Nest search params, e.g. songs with highest energy, lowest energy, tempo, song hotttnesss, ...
- A random walk along the similar artists links starting from the 100 most familiar artists
The number of songs was approximately 8,950 after step 1), step 3) added around 15,000 songs, and we had approx. 500,000 songs before starting step 5). For more technical details, see "dataset creation" in the "code" tab.
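The random walk of the last step can be pictured with a toy sketch. This is not the code we ran: the graph, the artist names and the "restart from a seed on a dead end" rule are all assumptions made for illustration.

```python
import random


def random_walk(similar, seeds, steps, rng):
    """Collect the artists reached by a random walk over similarity links.

    `similar` maps an artist to a list of similar artists (toy data here),
    `seeds` are the starting artists, `rng` is a random.Random instance.
    """
    visited = set(seeds)
    current = rng.choice(sorted(seeds))
    for _ in range(steps):
        neighbours = similar.get(current, [])
        if not neighbours:
            # Dead end: restart from a seed (an assumption of this sketch).
            current = rng.choice(sorted(seeds))
            continue
        current = rng.choice(neighbours)
        visited.add(current)
    return visited


# Made-up similarity graph; links are asymmetric, as in the dataset.
toy_graph = {
    "artist_a": ["artist_b", "artist_c"],
    "artist_b": ["artist_d"],
    "artist_c": ["artist_a"],
    "artist_d": [],
}
found = random_walk(toy_graph, {"artist_a"}, steps=50, rng=random.Random(0))
print(sorted(found))
```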
- 1,000,000 songs / files
- 273 GB of data
- 44,745 unique artists
- 7,643 unique terms (The Echo Nest tags)
- 2,321 unique musicbrainz tags
- 43,943 artists with at least one term
- 2,201,916 asymmetric similarity relationships
- 515,576 dated tracks starting from 1922
- 18,196 cover songs identified
- 11 pastries lost in a related hackday
- 99 bottles of beer on the wall
"Terms" are the tags provided by The Echo Nest. They can come from a number of places, but mostly blogs as far as we understand.
"Mbtags" are musicbrainz tags, specifically applied by humans to a particular artist. This explains why there are fewer of them (see 'mbtags_count'), but they are usually very clean and informative. For instance, if you want to create a genre recognition task where classes are mutually exclusive, mbtags are likely to be more reliable than terms.
If you count the artist names, you will see that there are almost twice as many unique names as there are artist IDs in the dataset. The Echo Nest often associates songs with artist names such as "A feat. B" or "A / B" to artist A. In most cases, this is the reasonable thing to do. Just be careful when you use files indexed by artist ID (e.g. unique_artists.txt), since you will get one of the artist names at random. If you try to do string matching, you probably need to consider all the names for an artist; you can use the SQLite database track_metadata.db to find them.
See tab Getting the dataset. There you will also find a subset to get you started quickly.
APIs have a lot of merit, and most of this dataset was built using The Echo Nest API. But APIs have not solved all research problems, and there are several advantages to having a local copy of a fixed dataset. Everyone gets the same data and can report results on the same songs and features. Also, it facilitates downloading. Just because an API exists doesn't mean that everyone will go to the trouble of downloading a million songs.
That being said, this dataset is complementary with the several APIs we mention. That's why we made sure we provide enough metadata such as musicbrainz ID so everyone can link this data to other existing resources.
We don't have the audio at LabROSA, so there's nothing for us to give you. The features were mainly created in-house by content owners, using analysis code supplied to them by The Echo Nest.
You can, however, use services like 7digital or playme to preview a song; we provide demo code. Grooveshark is also very powerful (but is it legal? anyone?)
If you do have audio, you can analyze it and turn it into the same format as the rest of this dataset (see below in the FAQ).
Or "can I reverse engineer the features?". Well... you should try! Dan Ellis provided MATLAB code to do something like this; see the MatlabSrc directory on GitHub and his Matlab tutorial.
To be clear, the main audio features are 'loudness', 'pitches' and 'timbre' on the segment level. Segments are 'musical events', usually note onsets. There is a lot of information lost when going from audio to these features (even though they are quite large in size/quantity/dimension!)
For many tasks, an excerpt from the full track may be sufficient, and these are available via 7digital. Here is a python script to access them, and Dan provides a Matlab version here. Note that the 'Random track' box on the right side of this website uses the Python code in real time; there is nothing pre-loaded.
Each file represents one track with all the related information (artist information, release information, audio analysis of the track, etc). Usually one track is one song, although there are multiple versions of the same track for many songs -- extended versions, different releases, etc. -- each in its own file.
Remember that some songs are released many times with slight differences (e.g. US versus European release).
HDF5! Why? Because! (regarding this, all complaints should definitely go to email@example.com). In short, HDF5 is a format developed at NCSA (and used heavily by NASA) to handle 1) large 2) heterogeneous 3) hierarchical datasets. The data can be compressed (10%-15% more than matfiles), and the I/O speed is still impressive. Also note that the core library comes free of charge and wrappers exist in most languages (see code tab).
Is it perfect? No. Does it make more sense than 1M zipped json files or matfiles? Yes. Note that the new matfile format (v7.3) is actually HDF5.
Still not happy? Here is python code or matlab code to transform the data into matfiles (the latter is less tested, non-ASCII strings seem wrongly encoded).
A "song file" refers to the typical HDF5 file containing information for only one song.
An "aggregate file" is also an HDF5 file that contains the information for several songs. These are useful if you do I/O intensive experiments, since they reduce the number of open/close file operations you need to perform.
A "summary file" is similar to an aggregate file, but contains just the metadata, i.e. we remove all the tables (analysis of bars, beats, segments, ..., artist similarity, tags). Useful if you want to quickly search the metadata, since a lot of space is saved! Check the scripts create_summary_file.py and create_aggregate_file.py. The summary file of the whole dataset is available (only 300 MB!): msd_summary_file.h5.
Note on summary files: if you're using the code display_song.py, you need the '-summary' flag to tell the code that some getters won't find their field, e.g. bars_start.
The dataset you received should contain one million song files. You can create aggregate and/or summary files using the python scripts.
All data was downloaded in utf-8 format and was saved in the HDF5 file in utf-8 format. We hoped that this would ensure that every name / release / title will display correctly if you set your display to utf-8, but in practice some strings with uncommon elements will never be correctly recorded. In particular, it seems difficult to tell MATLAB to read HDF5 as UTF-8 instead of Unicode. If you have a work-around, let us know!
Conclusion: use strings like titles and artist names as indications; the real identifiers should be the musicbrainz ID or the Echo Nest ID.
A common first thing you might want to do is quickly glance at the numerical content of a data file. How you do this of course depends on the language you want to use. In each case, you'll need the HDF5 library installed (see the code tab). In python (with pytables installed), use display_song.py. In Java, you can use the great tool HDFView. MATLAB will also let you visualize the content, see for example the Matlab tutorial.
We could not put a million files in one folder. Even a few thousand files in one directory can slow disk accesses significantly. We based the directory structure on The Echo Nest track IDs which are a kind of hash code. Echo Nest track IDs always take the form TR+LETTERS+LETTERS&NUMBERS. The directory path within the Million Song Dataset is the 3rd, 4th, 5th letters from the track ID, and the file itself is named after its track ID + the extension ".h5". For example, MillionSong/data/A/D/H/TRADHRX12903CD3866.h5.
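The path rule above is easy to code. Here is a minimal sketch; the function name and the default root "MillionSong/data" (taken from the example path) are assumptions, so adapt them to where you unpacked the data.

```python
from pathlib import PurePosixPath


def track_path(track_id, root="MillionSong/data"):
    """Build the path of a song file from its Echo Nest track ID.

    Hypothetical helper: the 3rd, 4th and 5th characters of the ID
    name the three nested directories.
    """
    if len(track_id) < 5 or not track_id.startswith("TR"):
        raise ValueError(f"not an Echo Nest track ID: {track_id!r}")
    first, second, third = track_id[2], track_id[3], track_id[4]
    return PurePosixPath(root, first, second, third, track_id + ".h5")


print(track_path("TRADHRX12903CD3866"))
# MillionSong/data/A/D/H/TRADHRX12903CD3866.h5
```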
Gordon! (more to come on this).
We have also created three SQLite databases:
- track_metadata.db includes most of the metadata for each track. It is useful, for instance, to find all the tracks from a particular artist.
- artist_term.db consists only of the metadata that applies at the artist level.
- artist_similarity.db summarizes artist similarity. In particular, it only lists artists that actually are in the dataset.
For demos using these databases with SQL queries in python, see demo_*.py in the SQLite folder.
You can, of course, also create your own SQLite databases best suited to your own applications.
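As an illustration of the kind of query these databases allow, here is a minimal sketch using Python's built-in sqlite3 module. It assumes a table called songs with columns track_id, title, artist_id and artist_name; check the actual schema of track_metadata.db before relying on it. To stay runnable without the dataset, the sketch fills a small in-memory stand-in with made-up rows.

```python
import sqlite3


def artist_names(conn, artist_id):
    """Return every distinct artist name stored for one artist ID."""
    rows = conn.execute(
        "SELECT DISTINCT artist_name FROM songs "
        "WHERE artist_id = ? ORDER BY artist_name",
        (artist_id,),
    )
    return [name for (name,) in rows]


def make_demo_db():
    """Build an in-memory stand-in (assumed schema, made-up IDs)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE songs (track_id TEXT PRIMARY KEY, title TEXT, "
        "artist_id TEXT, artist_name TEXT)"
    )
    conn.executemany(
        "INSERT INTO songs VALUES (?, ?, ?, ?)",
        [
            ("TRDEMO1", "Song one", "ARDEMO1", "A"),
            ("TRDEMO2", "Song two", "ARDEMO1", "A feat. B"),
            ("TRDEMO3", "Song three", "ARDEMO2", "C"),
        ],
    )
    return conn


demo = make_demo_db()
print(artist_names(demo, "ARDEMO1"))
```

With the real file, you would replace make_demo_db() by sqlite3.connect("track_metadata.db"). The same pattern gives you all the names attached to one artist ID, which helps with string matching.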
We used The Echo Nest API and some information from a local copy of the musicbrainz server. Data was downloaded during December 2010. For more information on how we chose the tracks, see How did you choose the million tracks?.
In the sense of having your audio analyzed and put in HDF5 format like the rest of the dataset, the answer is mostly yes: the following python code should do the trick. Note that it uses pyechonest, and to have the full data ('year', 'mbtags' and 'mbtags_count'), you need a local copy of the musicbrainz server.
The code is not perfect for the following reasons: 1) it is unclear what happens if the song is not recognized by The Echo Nest fingerprinter and 2) even if it is recognized, if you upload audio associated with a song in the dataset, there is no guarantee you'll get the same track ID.
Probably not, even though that's what was used to create this dataset! If you take a track ID and request it from The Echo Nest API, the info might not exactly match this dataset. Some fields are bound to change over time, such as artist familiarity and song hotttnesss. Some should be stable, like song title and artist name, but there is no guarantee. The analysis of the audio should also remain the same, but The Echo Nest updates their analyzer once in a while (to implement a better beat tracking or segmenter for instance).
Remember that the dataset was created / downloaded between December 18 and December 31 2010.
Regarding the MusicBrainz data contained in the dataset, the track 'year' is in the public domain and the 'tags' and 'tag count' are under the Attribution-NonCommercial-ShareAlike 2.0 license.
Regarding the SecondHandSongs dataset, see its webpage.
Regarding the musiXmatch dataset, see its webpage.
The code is released under the GNU General Public License.
We mention some other sources of data related to Music Information Retrieval research. THIS IS NOT AN EXHAUSTIVE LIST! If you want your dataset to be included here, send me an email.
G. Tzanetakis maintains two datasets including the famous GTZAN, which is small by today's standards. Magnatagatune is one of the largest datasets that provides audio and tag information. Also note the very important RWC music database. Other tagging / genre datasets include CAL500 and the latin music database.
Regarding recommendation, the new standard is probably Yahoo Music Ratings. Paul Lamere also maintains a 2007 crawl of some Last.fm data.
For metadata, nothing compares to musicbrainz.
For structure analysis, Chris Harte annotated the Beatles, check his paper and write him to obtain the data.
Resources that we are less familiar with but are worth checking out include SALAMI, Codaich, soundsoftware, MusiClef 2011 and OMRAS2. Don't forget the actual The Echo Nest API.
On a more general machine learning note, infochimps is an incredible source of data. Similar are UCI repository and mldata. Also, to compare algorithms, websites like MLcomp, TunedIT and kaggle are extremely useful.
Simple! Use the dataset and share your code. We will be happy to put it on this website. You can also share this website using the buttons on the right menu.
Then, put the list of songs you use (defined, for example, by their Echo Nest ID) on your website when you publish something. With a simple list of IDs in a text file, it is very easy to recreate a test set in a few minutes, allowing others to make precise comparisons against your algorithm.
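Sharing such a list takes very little code. Here is a minimal sketch, assuming one ID per line in a UTF-8 text file; the file name, the helper names and the '#' comment convention are our own choices, not a format defined by the dataset.

```python
import tempfile
from pathlib import Path


def write_id_list(path, track_ids):
    """Write the IDs one per line, sorted and without duplicates."""
    lines = "".join(f"{tid}\n" for tid in sorted(set(track_ids)))
    Path(path).write_text(lines, encoding="utf-8")


def read_id_list(path):
    """Read the IDs back, skipping blank lines and '#' comments."""
    ids = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            ids.add(line)
    return ids


# Round trip in a temporary directory; only the first ID is a real example.
with tempfile.TemporaryDirectory() as tmp:
    list_file = Path(tmp) / "test_set_ids.txt"
    write_id_list(list_file, ["TRADHRX12903CD3866", "TRDEMO0000000000001"])
    test_set = read_id_list(list_file)

print(sorted(test_set))
```

Someone with the dataset can then keep only the files whose track ID is in test_set and reproduce your experiment on exactly the same songs.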
Finally, talk to other researchers about using larger datasets. The GTZAN genre collection was amazingly useful, but we need to move on.
If you have data that could be linked with the Million Song Dataset, we would love to hear from you! Examples include: another set of tags for artists or songs, new similarity relationships, download statistics from P2P networks, a new set of features, etc. Your contribution does not have to cover the full set of one million, and it can include new artists or songs: if the intersection is larger than 10K songs, it can be very helpful to researchers!
The goal is to make it easy to link different resources. For instance, if your information is on the artist level, we are experienced in matching against The Echo Nest artist ID and musicbrainz ID, and even 7digital ID. Write us!
We try to maintain an informal list here: http://labrosa.ee.columbia.edu/millionsong/pages/publications