Getting the dataset

The logistics of distributing a 300 GB dataset are a little more complicated than for smaller collections. We do, however, provide a directly-downloadable subset for a quick look.

Before you start, you might want to review exactly what the dataset contains. Here is a page showing the contents of a single example file. You can download the corresponding raw HDF5 file here: TRAXLZU12903D05F94.h5.
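If you want to poke at the example file programmatically rather than via the official wrappers, here is a minimal sketch using h5py. The group and field names (`metadata/songs`, `analysis/songs`, `segments_timbre`, etc.) reflect the MSD file layout as I understand it; treat them as assumptions and verify against the example-contents page above.

```python
# Minimal sketch of reading one MSD HDF5 track file with h5py (not the
# official hdf5_getters wrapper). Group/field names are assumptions based
# on the documented MSD layout; check them against the example page.
import h5py

def describe_track(path):
    """Return a few fields from a single MSD track file."""
    with h5py.File(path, "r") as h5:
        meta = h5["metadata"]["songs"][0]      # compound row of metadata
        analysis = h5["analysis"]["songs"][0]  # compound row of analysis
        return {
            "artist": meta["artist_name"].decode(),
            "title": meta["title"].decode(),
            "tempo": float(analysis["tempo"]),
            # per-segment arrays live alongside the compound tables:
            "n_segments": h5["analysis"]["segments_timbre"].shape[0],
        }

# Example usage against the downloadable file above:
# describe_track("TRAXLZU12903D05F94.h5")
```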

If you want the whole dataset, check to see if you know someone who has it already. The following universities should have a copy: Drexel, Ithaca College, QMUL, NYU, UCSD, UPF.


The dataset is available as an Amazon Public Dataset snapshot, which can easily be attached to an Amazon EC2 virtual machine so you can run your experiments in the cloud. You simply create an EBS volume from snapshot snap-5178cf30 (I believe this means your EC2 virtual machine has to be in the us-east-1 region).

For me, when I launch an EC2 virtual machine running Ubuntu, create an EBS volume from that snapshot, and attach the volume to the virtual machine, it appears as /dev/xvdf within Ubuntu. Then you just have to mount it:

ubuntu@ip-xxx:~$ sudo mkdir /mnt/snap
ubuntu@ip-xxx:~$ sudo mount -t ext4 /dev/xvdf /mnt/snap
ubuntu@ip-xxx:~$ ls /mnt/snap
AdditionalFiles  data  LICENSE  lost+found  README
ubuntu@ip-xxx:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.8G  808M  6.6G  11% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev            492M   12K  492M   1% /dev
tmpfs           100M  328K   99M   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            497M     0  497M   0% /run/shm
none            100M     0  100M   0% /run/user
/dev/xvdf       493G  272G  196G  59% /mnt/snap

The 493G partition at the end (of which only 272G is used) is the MSD data.

Note that although there's a free tier for EC2 instances, Amazon charges for EBS usage; this 500G volume costs something like $10/week for as long as it exists (whether or not it's attached to a running VM).
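The $10/week figure follows from EBS per-gigabyte pricing; the rate below (~$0.10 per GB-month) is an assumed era-appropriate price, not an official quote, so check Amazon's current pricing page:

```python
# Back-of-the-envelope EBS cost estimate. The per-GB-month price is an
# assumption; substitute Amazon's current rate for your region.
size_gb = 500
price_per_gb_month = 0.10          # assumed standard EBS rate
weekly = size_gb * price_per_gb_month * 12 / 52
print(f"~${weekly:.2f}/week")      # roughly in line with the $10/week above
```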



To let you get a feel for the dataset without committing to a full download, we also provide a subset consisting of 10,000 songs (1%, 1.8 GB) selected at random.
It contains "additional files" (SQLite databases) in the same format as those for the full set, but referring only to the 10K-song subset. You can therefore develop code on the subset, then port it to the full dataset.
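As a sketch of that develop-on-the-subset workflow, here is a query against the track metadata SQLite database. The table and column names (`songs`, `year`) match the shipped databases as I understand them, and the path is illustrative; run `.schema` on your copy to confirm:

```python
# Querying the subset's track_metadata.db (one of the "additional files").
# Table/column names are assumptions; verify with ".schema" if they differ.
import sqlite3

def tracks_with_year(db_path):
    """Return how many tracks in track_metadata.db have year information."""
    conn = sqlite3.connect(db_path)
    try:
        (n,) = conn.execute(
            "SELECT COUNT(*) FROM songs WHERE year > 0").fetchone()
        return n
    finally:
        conn.close()

# Illustrative path; adjust to wherever you unpacked the subset:
# print(tracks_with_year("AdditionalFiles/subset_track_metadata.db"))
```

The same function runs unchanged against the full dataset's database, which is the point of keeping the subset's files in the same format.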

Additional Files

To help you get started we provide some additional files which are reverse indices of several types. These should come bundled with the core dataset.

  1. List of all track Echo Nest IDs. The format is: track id<SEP>song id<SEP>artist name<SEP>song title
    (Careful: large to open in a web browser.)
  2. List of all artist IDs. The format is: artist id<SEP>artist mbid<SEP>track id<SEP>artist name
    The code to recreate that file is available here (a faster version using the SQLite databases is here).
  3. List of all unique artist terms (Echo Nest tags).
  4. List of all unique artist musicbrainz tags.
  5. List of the 515,576 tracks for which we have the year information, ordered by year.
  6. List of artists for which we know latitude and longitude.
  7. Summary file of the whole dataset: same HDF5 format as the regular files, but containing only metadata, with no arrays such as audio analysis, similar artists, or tags. Only 300 MB.
  8. SQLite database containing most metadata about each track (NEW VERSION 03/27/2011).
  9. SQLite database linking artist ID to the tags (Echo Nest and musicbrainz ones).
  10. SQLite database containing similarity among artists.
The code to create these lists is usually available in one of the /Tasks_Demos/ folders when you download the code.
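The text indices above use the literal string `<SEP>` as a field separator, so parsing them needs nothing beyond a string split. A minimal sketch (the sample line uses made-up IDs and names):

```python
# Parsing one of the MSD text indices: fields are separated by the literal
# string "<SEP>", e.g. track id<SEP>song id<SEP>artist name<SEP>song title.
def parse_sep_line(line):
    """Split one line of an MSD index file into its fields."""
    return line.rstrip("\n").split("<SEP>")

# Illustrative line with made-up values:
line = "TRAAAAA12903CC5359<SEP>SOAAAAA12A8C13366F<SEP>Some Artist<SEP>Some Title\n"
track_id, song_id, artist, title = parse_sep_line(line)
```

Splitting on a multi-character token (rather than a single character) avoids breaking on commas or tabs that may legitimately appear in artist names and song titles.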

UCI repository

Subsets of the data will be available on the UCI Machine Learning Repository; we have one for the moment. It is an easy way to get some of the Million Song Dataset data in a simple text file format. Please give us feedback on what subsets you would want to see on the repository. Of course, it is not intended to replace the full dataset!

  1. uci 1: year prediction; the features are the timbre average and covariance of every song, and the target is the year. Note that the train/test split is now slightly different from the official one on GitHub, but it should not affect the results in a major way.
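A hedged sketch of reading that UCI text format: each comma-separated line starts with the target year, followed by the timbre features (averages, then covariance entries). Check the UCI page for the exact attribute count and ordering in your copy; the sample line below uses made-up numbers:

```python
# Reading one line of the UCI year-prediction text format: first value is
# the year, the rest are timbre features. Field count/order is an
# assumption to verify against the UCI repository's description.
def parse_uci_line(line):
    """Return (year, features) from one line of the UCI text file."""
    values = line.strip().split(",")
    return int(float(values[0])), [float(v) for v in values[1:]]

# Illustrative line with made-up feature values:
year, feats = parse_uci_line("2001,1.0,-2.5,3.25\n")
```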


Infobright has ported most of the data to a relational database format. Depending on which part of the data you need, this might be a good solution. Questions about this should be addressed directly to Infobright.