GET ALL THE CODE FROM GITHUB (you don't need git installed).
The files are in HDF5 format (why? see the FAQ). Unless you use a recent MATLAB (2009b or later), you need the HDF5 library to use the dataset. Other requirements depend on the language you use; see the following sections.
For Ubuntu, as of 09/2011, install everything (including optional packages) with:
sudo apt-get install libhdf5-serial-1.8.4 libhdf5-serial-dev
python-tables python-tables-doc libjhdf5-java h5utils hdfview liblzo2-2
We believe that more than 90% of the researchers using the dataset will not need to understand the structure of the HDF5 files. Therefore, the wrappers below simply provide getters for the different fields saved in each file. If you really want to speed things up and are willing to work with HDF5 directly, we have some code to aggregate many files into one (thus saving on I/O operations!). Want to dig deeper? All our code is open source, and the HDF5 file descriptors are available in Python; see the later section on Dataset Creation.
We provide wrappers for Python, MATLAB and Java, plus notes on C/C++ and R.
The HDF5 wrapper in Python is PyTables. It relies on NumPy (which you probably already have). We also recommend matplotlib/pylab for visualization. Then, to access the fields in the HDF5 song files provided in the dataset, the file hdf5_getters.py should be enough:
h5 = hdf5_getters.open_h5_file_read(path to some file)
duration = hdf5_getters.get_duration(h5)
If you want a deeper understanding of the dataset, or want to combine many files into one, see the section on dataset creation below. All our work was done in python.
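For instance, here is a minimal sketch of batch-processing a list of song files with the getters above. It assumes hdf5_getters.py is on your PYTHONPATH; the helper name durations_of is ours for illustration, not part of the wrapper.

```python
def durations_of(getters, paths):
    """Return the duration of every song file in `paths`.

    `getters` is any module following the hdf5_getters interface
    (open_h5_file_read / get_duration); pass the real hdf5_getters.
    """
    durations = []
    for path in paths:
        h5 = getters.open_h5_file_read(path)
        try:
            durations.append(getters.get_duration(h5))
        finally:
            h5.close()  # always release the HDF5 handle
    return durations

# Typical use (sketch):
#   import hdf5_getters
#   durations_of(hdf5_getters, ['/path/to/some/file.h5'])
```

Closing each file handle matters when looping over many files: HDF5 keeps OS-level resources open per file.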
MATLAB requires no additional libraries (tested on Matlab 2009b). We created the basic getters through this class.
h5 = HDF5_Song_File_Reader(path to some file);
duration = h5.get_duration();
You might also be interested in more in-depth examples of using HDF5 with MATLAB, available here.
You need some JAR files, accessible here. Read the README file, but the class hdf5_getters.java should give you all you need. This code is not optimized at all; in particular, it copies data out of arrays, so you might want to access the data directly.
A C++ wrapper is available here. Look at the Makefile and "hdf5_display.cc" for a demo. The required libraries can be found here. The code could be better optimized, but it works.
We planned to release an R wrapper and looked at the default HDF5 library for R on Ubuntu. Unfortunately, it crashes on empty arrays, which occur when a track has no MusicBrainz tags, for instance. If any R specialist is willing to help us with this, please contact us!
Here is the Python code used to create the dataset. Note that we had unrestricted access to The Echo Nest API, but nothing special beyond that. We also installed a local copy of the MusicBrainz server.
To create the hdf5 files, we use these two python scripts:
hdf5_descriptors.py and hdf5_utils.py.
The first defines the structure of each HDF5 song file; the second puts the pieces together.
We provide a lot of code to get you started with different tasks, for instance getting a preview of a particular song or parsing the million files. Find these scripts here. If you want your code to be included, send us an email! You can also find code from people who have published using the dataset; we maintain an unofficial list of MSD publications.
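The dataset ships as one HDF5 file per track, spread over nested directories, so most of these tasks start by walking the tree. A minimal sketch (the function name and the example directory are ours, not from the official scripts):

```python
import os

def find_h5_files(basedir):
    """Recursively collect the paths of all .h5 song files under basedir."""
    paths = []
    for root, _dirs, files in os.walk(basedir):
        for name in files:
            if name.endswith('.h5'):
                paths.append(os.path.join(root, name))
    return sorted(paths)

# Typical use (hypothetical root directory):
#   for path in find_h5_files('/data/MillionSong/data'):
#       ...  # open with hdf5_getters and extract fields
```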