Before you start, you might want to review exactly what the dataset contains. Here is a page showing the contents of a single example file. You can download the corresponding raw HDF5 file here: TRAXLZU12903D05F94.h5.
You can download the whole dataset, but first check whether someone you know already has it. The following universities should have a copy: Drexel, Ithaca College, QMUL, NYU, UCSD, UPF. LabROSA also has a number of portable drives that we may be able to send out on request.
The whole dataset is available on Amazon S3 in the bucket s3://tbmmsd/ .
The data is split into 26 main downloads (letters A-Z), one set of additional files (also available below from this page), and the subset (also available below). We recommend you extract the A-Z files to a folder 'millionsong/data' and the rest in 'millionsong/AdditionalFiles'.
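After extracting the A-Z files, a quick sanity check is to count how many .h5 files ended up under the data folder. The snippet below is a minimal sketch: the function name count_h5_files is made up for illustration, and it assumes only that track files end in .h5 (as in the example file above), not any particular subfolder layout.

```python
import os


def count_h5_files(root):
    """Count files ending in .h5 anywhere under root (hypothetical helper)."""
    total = 0
    # os.walk yields nothing if root does not exist, so the result is 0 then.
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += sum(1 for name in filenames if name.endswith(".h5"))
    return total
```

Calling count_h5_files('millionsong/data') on a complete extraction should report roughly one file per song, so a much lower number suggests that some of the 26 downloads are missing or were only partly extracted.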
The dataset is also available as an Amazon Public Dataset snapshot which can easily be attached to an Amazon EC2 virtual machine to run your experiments in the cloud. You simply create an EBS volume from the snapshot snap-5178cf30 (I think this means your EC2 virtual machine has to be in us-east-1).
For me, when I launch an EC2 virtual machine running Ubuntu, create an EBS volume from that snapshot, and attach the volume to the virtual machine, it appears as /dev/xvdf from within Ubuntu. Then you just have to mount it:
ubuntu@ip-xxx:~$ sudo mkdir /mnt/snap
ubuntu@ip-xxx:~$ sudo mount -t ext4 /dev/xvdf /mnt/snap
ubuntu@ip-xxx:~$ ls /mnt/snap
AdditionalFiles data LICENSE lost+found README
ubuntu@ip-xxx:~$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/xvda1 7.8G 808M 6.6G 11% /
none 4.0K 0 4.0K 0% /sys/fs/cgroup
udev 492M 12K 492M 1% /dev
tmpfs 100M 328K 99M 1% /run
none 5.0M 0 5.0M 0% /run/lock
none 497M 0 497M 0% /run/shm
none 100M 0 100M 0% /run/user
/dev/xvdf 493G 272G 196G 59% /mnt/snap
The 493G partition at the end (of which only 272G is used) is the MSD data.
Note that although there's a free tier for EC2 processors, Amazon charges for EBS usage; this 500G partition costs something like $10/week for the time it is in existence (whether or not it's attached to a live VM).
To let you get a feel for the dataset without committing to a full download, we also provide a subset consisting of 10,000 songs (1%, 1.8 GB) selected at random:
MILLION SONG SUBSET
It contains "additional files" (SQLite databases) in the same format as those for the full set, but referring only to the 10K song subset. Therefore, you can develop code on the subset, then port it to the full dataset.
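One way to keep code portable between the subset and the full dataset is to pass the database path in as a parameter and inspect the schema rather than hard-coding it. The snippet below is a minimal sketch using Python's standard sqlite3 module. The function name list_tables is made up for illustration, and no table or column names from the actual databases are assumed.

```python
import sqlite3


def list_tables(db_path):
    """Return {table name: [column names]} for an SQLite file (hypothetical helper)."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
        schema = {}
        for (table,) in rows:
            # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
            cols = conn.execute('PRAGMA table_info("%s")' % table).fetchall()
            schema[table] = [col[1] for col in cols]
        return schema
    finally:
        conn.close()
```

Running it once on a subset database and once on the matching full-dataset database should give the same tables and columns, which is what makes it possible to develop on the 10K songs first.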
To help you get started we provide some additional files which are reverse indices of several types. These should come bundled with the core dataset.
- List of all track Echo Nest IDs. The format is:
track id<SEP>song id<SEP>artist name<SEP>song title
(Careful: this file is large to open in a web browser.)
- List of all artist IDs. The format is:
artist id<SEP>artist mbid<SEP>track id<SEP>artist name
The code to recreate that file is available here (and a faster version using the SQLite databases here).
- List of all unique artist terms (Echo Nest tags).
- List of all unique artist musicbrainz tags.
- List of the 515,576 tracks for which we have the year information, ordered by year.
- List of artists for which we know latitude and longitude.
- Summary file of the whole dataset. It uses the same HDF5 format as the regular files and contains all the metadata, but none of the arrays such as audio analysis, similar artists and tags. Only 300 MB.
- SQLite database containing most metadata about each track (NEW VERSION 03/27/2011).
- SQLite database linking artist ID to the tags (Echo Nest and musicbrainz ones).
- SQLite database containing similarity among artists.
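The text lists above use the literal string <SEP> as a field separator, so they can be read with a plain string split. The snippet below is a minimal sketch for the track list format (track id<SEP>song id<SEP>artist name<SEP>song title). The names Track, parse_track_line and read_tracks are made up for illustration, and UTF-8 encoding of the file is an assumption.

```python
from collections import namedtuple

# Field order follows the track list format described above.
Track = namedtuple("Track", "track_id song_id artist_name title")


def parse_track_line(line):
    """Split one '<SEP>'-delimited line into a Track (hypothetical helper)."""
    # maxsplit=3 keeps any further '<SEP>' text inside the title field.
    fields = line.rstrip("\r\n").split("<SEP>", 3)
    if len(fields) != 4:
        raise ValueError("expected 4 fields, got %d" % len(fields))
    return Track(*fields)


def read_tracks(path):
    """Yield one Track per non-empty line; UTF-8 encoding is an assumption."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_track_line(line)
```

Because read_tracks is a generator, the file is processed line by line and never has to fit in memory at once. The artist list can be handled the same way with its own four field names.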
Subsets of the data will be available on the UCI Machine Learning Repository; we have one for the moment. It is an easy way to get some of the Million Song Dataset data in a simple text file format. Please give us feedback on which subsets you would like to see on the repository. Of course, it is not intended to replace the full dataset!
- uci 1: year prediction, features are timbre average and covariance of every song, target is the year. Note that the train/test split is now slightly different from the official one on GitHub, but it should not affect the results in a major way.
Infobright ported most of the data to a relational database format. Depending on what part of the data you need, this might be a good solution. Questions about this should be addressed directly to Infobright.
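As a rough illustration of how such a text file could be read, the sketch below parses one line into a target year and a feature list. It assumes each line is comma-separated with the year first and the timbre features after it. That layout is an assumption not stated on this page, so check it against the file you download. The function name parse_year_line is made up for illustration.

```python
def parse_year_line(line):
    """Parse 'year,f1,f2,...' into (year, [features]) (hypothetical helper).

    Assumes comma-separated values with the target year in the first column.
    """
    values = line.strip().split(",")
    year = int(values[0])
    features = [float(v) for v in values[1:]]
    return year, features
```

Keeping the target separate from the features at parse time makes it harder to leak the year into the inputs of a model by accident.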