Getting the dataset

The logistics of distributing a 300 GB dataset are a little more complicated than for smaller collections. We do, however, provide a directly downloadable subset for a quick look.

Before you start, you might want to review exactly what the dataset contains. Here is a page showing the contents of a single example file. You can download the corresponding raw HDF5 file here: TRAXLZU12903D05F94.h5.

You can download the whole dataset, but first check whether you know someone who already has it. The following universities should have a copy: Drexel, Ithaca College, QMUL, NYU, UCSD, UPF. LabROSA also has a number of portable drives that we may be able to send out on request.

infochimps / AWS

The whole dataset is available through infochimps: MILLION SONG DATASET.
The data is split into 26 main downloads (letters A-Z), one set of additional files (also available below from this page), and the subset (also available below). We recommend you extract the A-Z files to a folder 'millionsong/data' and the rest in 'millionsong/AdditionalFiles'.
See the MD5 checksums here (WARNING: they may no longer be correct).
As of August 2011, the dataset is also available as an Amazon Public Dataset, thanks to the leadership of Infochimps.  
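
Once a download finishes, it can be checked against the published MD5 value. The following is a minimal sketch using only the Python standard library; the function names are ours, and any file path or expected digest you pass in is your own input, not something defined by the dataset. Keep in mind the warning above: a mismatch may mean the published value is wrong rather than that the download is corrupt.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, expected_hex):
    """Compare a file's MD5 with a published value, ignoring
    surrounding whitespace and letter case in the published value."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

Call `matches_published_md5` with the path of a downloaded archive and the hex string copied from the MD5 list.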

MillionSongSubset

To let you get a feel for the dataset without committing to a full download, we also provide a subset consisting of 10,000 songs (1%, 1.8 GB) selected at random:
MILLION SONG SUBSET
It contains "additional files" (SQLite databases) in the same format as those for the full set, but referring only to the 10K song subset. Therefore, you can develop code on the subset, then port it to the full dataset.
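
Because the subset databases share their format with the full-set ones, a quick way to start is to inspect a database's schema before writing queries. The sketch below uses Python's standard `sqlite3` module and makes no assumption about table or column names; it simply reports whatever the file contains. The function name is ours, and the path you pass is whichever SQLite file you downloaded.

```python
import sqlite3


def list_tables(db_path):
    """Return {table name: [column names]} for a SQLite database file.

    Note: sqlite3.connect() creates an empty database if db_path does
    not exist, so check the path first if that matters to you.
    """
    conn = sqlite3.connect(db_path)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
        schema = {}
        for name in names:
            # PRAGMA table_info rows are (cid, name, type, notnull, default, pk).
            columns = conn.execute('PRAGMA table_info("{}")'.format(name))
            schema[name] = [col[1] for col in columns]
        return schema
    finally:
        conn.close()
```

Code that reads the schema this way on the subset should work unchanged when pointed at the corresponding full-dataset file.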

Additional Files

To help you get started, we provide some additional files, which are reverse indices of several types. These should come bundled with the core dataset.

  1. List of all track Echo Nest IDs. The format is: track id<SEP>song id<SEP>artist name<SEP>song title
    (Careful: the file is large to open in a web browser.)
  2. List of all artist IDs. The format is: artist id<SEP>artist mbid<SEP>track id<SEP>artist name
    The code to recreate that file is available here (and a faster version using the SQLite databases here).
  3. List of all unique artist terms (Echo Nest tags).
  4. List of all unique artist musicbrainz tags.
  5. List of the 515,576 tracks for which we have the year information, ordered by year.
  6. List of artists for which we know latitude and longitude.
  7. Summary file of the whole dataset, in the same HDF5 format as the regular files. It contains all the metadata but none of the arrays (audio analysis, similar artists, tags). Only 300 MB.
  8. SQLite database containing most metadata about each track (NEW VERSION 03/27/2011).
  9. SQLite database linking artist IDs to the tags (Echo Nest and musicbrainz ones).
  10. SQLite database containing similarity among artists.
The code to create these lists is usually available in one of the different /Tasks_Demos/ folders when you download the code.
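
The two `<SEP>`-delimited lists (items 1 and 2) can be read with a few lines of Python. This is a minimal sketch, not the code from /Tasks_Demos/: the field names are our own labels for the formats given above, and it assumes one record per line. Splitting is limited to the expected number of fields, so a stray `<SEP>` inside the last field (song title or artist name) is kept as part of that field; that is a design choice of this sketch.

```python
TRACK_FIELDS = ("track_id", "song_id", "artist_name", "song_title")
ARTIST_FIELDS = ("artist_id", "artist_mbid", "track_id", "artist_name")


def parse_sep_lines(lines, fields):
    """Yield one dict per non-empty line of a <SEP>-delimited list.

    `lines` is any iterable of strings, for example an open text file.
    `fields` names the columns in order (see the tuples above).
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # At most len(fields) - 1 splits: extra separators stay in the last field.
        parts = line.split("<SEP>", len(fields) - 1)
        if len(parts) != len(fields):
            raise ValueError("expected {} fields, got {}: {!r}".format(
                len(fields), len(parts), line))
        yield dict(zip(fields, parts))
```

For example, passing an open file handle for the track list together with `TRACK_FIELDS` yields dictionaries keyed by `track_id`, `song_id`, `artist_name` and `song_title`.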

UCI repository

Subsets of the data will be made available on the UCI Machine Learning Repository; for the moment, there is one. They are an easy way to get some of the Million Song Dataset data in a simple text file format. Please give us feedback on which subsets you would like to see on the repository. Of course, they are not intended to replace the full dataset!

  1. uci 1: year prediction. The features are the timbre average and covariance of every song; the target is the year. Note that the train/test split is now slightly different from the official one on GitHub, but this should not affect the results in a major way.
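
As an illustration of how such a text file might be read, here is a minimal sketch. It rests on assumptions that are not stated on this page and should be checked against the repository's description: that each line is comma-separated, that the year comes first, and that the features are the timbre averages followed by the unique covariance values. With 12 timbre dimensions, that would be 12 averages plus 12 × 13 / 2 = 78 covariances, or 90 features per song. The function name and the `n_dims` parameter are ours.

```python
def parse_year_line(line, n_dims=12):
    """Split one comma-separated line into (year, averages, covariances).

    Assumed layout (verify against the repository's description):
    year, n_dims timbre averages, n_dims * (n_dims + 1) / 2 covariances.
    """
    values = [float(v) for v in line.strip().split(",")]
    year = int(values[0])
    features = values[1:]
    n_cov = n_dims * (n_dims + 1) // 2
    if len(features) != n_dims + n_cov:
        raise ValueError("expected {} features, got {}".format(
            n_dims + n_cov, len(features)))
    return year, features[:n_dims], features[n_dims:]
```

The length check fails early if the real file layout differs from these assumptions, rather than silently mislabelling columns.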

Infobright

Infobright has ported most of the data to a relational database format. Depending on which part of the data you need, this might be a good solution. Questions about it should be addressed directly to Infobright.