Find a song with a specific name or feature

You want all the songs from a specific artist or with a given tempo, and you want it fast! That is why we prepared some files to help you, which are reverse indices: given some feature, give me tracks. To illustrate this, we will search for all the songs from Radiohead.

There are two ways to go about doing this: using the HDF5 summary file, or the SQLite databases.

HDF5 summary file
Quick reminder: our song files in HDF5 format are made so that they can hold the info for more than one song. If you concatenate many songs into one file, we talk about 'aggregate files'. If you do it with only the metadata, we talk about 'summary' file. More on this in the FAQ.

The file msd_summary_file.h5 looks like any song file, except that it contains 1M songs (the whole dataset) excluding the analysis (beats, segments, ...), tags and similar artist. Because of this, it takes only 316Mb. It is perfect to find all artists where 'Radiohead' is the artist name.

Here is a slow but intuitive way of doing this, we simply query each of the 1M songs. The outputs are from iPython. It takes about 7 minutes to run.

In [42]: import hdf5_getters
In [43]: h5 = hdf5_getters.open_h5_file_read('msd_summary_file.h5')
In [44]: for k in range(1000000):
   ....:     a_name = hdf5_getters.get_artist_name(h5,k)
   ....:     if a_name == 'Radiohead':
   ....:         print hdf5_getters.get_track_id(h5,k)
TRMMNCI128F9310D00
TRMIPNA128F9310F08
TRMTOBP128E07822EF
TRWZFDM128F4252CF5
TRWPNPS128F9310DC8
TRWUHAU128F4259E6A
......

Using HDF5 capabilities and pytables wrapper, we can do the same under a minute. You can look at this tutorial or this one on pytables. You need to know that in each song file, there are 3 groups ('metadata', 'analysis' and 'musicbrainz') and each group contains a table called 'songs'. The path can be found in the code of hdf5_getters.py or by viewing the file with HDFVIEW. The new code gets the index of all rows in metadata/songs where the artist_name is Radiohead, than reads the corresponding track_id from the analysis/songs table. It takes less than a minute.

In [41]: h5 = hdf5_getters.open_h5_file_read('msd_summary_file.h5')
In [42]: idxs = h5.root.metadata.songs.getWhereList('artist_name=="Radiohead"')
In [43]: for idx in idxs:
   ....:     print h5.root.analysis.songs.cols.track_id[idx]

Numerical values are faster to compare. Also, there exist a notation shortcut when the data you want is in the same table as the feature you are testing. For instance, in a few seconds, here are the fastest tracks, according to The Echo Nest tempo measure, in the dataset:

In [70]: [row['track_id'] for row in h5.root.analysis.songs.where('tempo>280')]
Out[70]: 
['TRRQQTD12903CA2669',
 'TRBHQUV12903CFAFA9',
 'TRFWRVO128F425C4EF',
 'TRNPTWJ128F93136D2',
 'TRPPDKE128F930D9C0',
 'TRLPHPU12903CD8DAA']

SQLite databases
Another way to quickly search for tracks was to iterate over all the songs once and keep some metadata in a database for future reference. We did it (it takes 15-20 hours), the file is called track_metada.db, it is a SQLite database.

SQLite is awesome, it holds in one file, it has a wrapper in every language, it's almost as fast as MySQL and PgSQL. See the official SQLite website. In python, you need the module 'sqlite3' to query the database.

In track_metadata.db, we put one table called 'songs' with the following columns: (track_id, title, song_id, release, artist_id, artist_mbid, artist_name, duration, artist_familiarity, artist_hotttnesss, year). We could have put more columns, the code to recreate the database is here with a demo here.

Back to our problem, to get all the songs from Radiohead, the following command takes about 5 seconds!

In [1]: import sqlite3
In [2]: conn = sqlite3.connect('track_metadata.db')
In [3]: q = "SELECT track_id FROM songs WHERE artist_name='Radiohead'"
In [4]: res = conn.execute(q)
In [4]: res.fetchall()
Out[4]: 
[(u'TRESMVD128F9310D57',),
 (u'TRARTEH128F423DBC1',),
 (u'TRBGIZH128F4252D9E',),
 (u'TRTYSJW128F932E744',),
...

The SQL command in the previous call was quite straightforward. Yes, you might need to learn a little of the SQL language to use that solution, but it is clearly the fastest, tons of SQL tutorial exist online, and we prepared a demo for you with most queries you might need.

Approximate search
This part is not fully coded. But let's assume that your name is Paul, you love 'Emerson, Lake and Palmer', you believe we have it in our dataset but you're not sure of the spelling we used, what do you do?

There is no efficient way to do approximate search with the previous methods. In python, you can use the edit distance (this module here) to find the closest artist name to your query. But it is slow, even for 72K artist names!

The solution is to use Lucene, a powerful, open-source, search engine written in Java. It is not very hard to setup, our demo is on its way...!

In the meantime, you can get The Echo Nest song search API to find the artist ID.