Why a Million Song Dataset?

Submitted by dpwe on Tue, 02/08/2011 - 10:16

The idea for the Million Song Dataset came to us a couple of years ago when we were discussing possible ideas with the Echo Nest for an NSF GOALI (Grant Opportunities for Academic Liaison with Industry) grant. We were looking for an idea that wouldn't be possible without an academic-industrial collaboration, and that would appeal to the NSF as contributing to scientific progress.

One of the long-standing criticisms of academic music information research from our colleagues in the commercial sphere is that the ideas and techniques we develop simply aren't practical for real services, which must offer hundreds of thousands of tracks at a minimum. But, as academics, how can we develop scalable algorithms without the large-scale datasets to try them on? The idea of a "million song dataset" started as a flippant suggestion of what it would take to solve this problem. But the idea stuck - not only in the form of developing a very large, common dataset, but even in the specific scale of one million tracks.

There are several possible reasons why the community doesn't already have a dataset of this scale:

We all already have our favorite, personal datasets of hundreds or thousands of tracks, and to a large extent we're happy with the results we get from them.
Collecting the actual music for a dataset of more than a few hundred CDs (i.e. the kind of thing you can do by asking all your colleagues to lend you their collections) becomes something of a challenge.
The well-known antagonistic stance of the recording industry to the digital sharing of their data seems to doom any effort to share large music data collections.
It's simply a lot of work to manage all the details for this amount of data.

On the other hand, there are some obvious advantages to creating a large dataset:

A large dataset helps reveal problems with algorithm scaling that may not be so obvious or pressing when tested on small sets, but which are critical to real-world deployment.
Certain kinds of relatively-rare phenomena or patterns may simply not occur in small datasets, but may lead to exciting, novel discoveries from large collections.
A large dataset can be relatively comprehensive, encompassing various more specialized subsets. By having all subsets within a single universe, we can have standardized data fields, features, etc.
A single, multipurpose, freely-available dataset greatly promotes direct comparisons and interchange of ideas and results.

Of all these, the last is probably the one that matters most to me. Having a single, natural choice for a dataset to try out new algorithms, and hence an end to mutually-incomparable results on what should be basically the same task, would be a big step forward in bringing clarity and focus to progress in our research area.

Of course, there are some obvious limitations to this work. Because of copyright restrictions, we can't provide (and in fact do not ourselves have) the raw audio, only the derived Echo Nest features. When you calculate how long it would take to process a million tracks, though, you might think twice about wanting the audio: the dataset comprises 250 million seconds of data, or about 8 years of continuous audio. Even with 8 CPUs running a process that takes only one-tenth of real time, that's still over one month of continuous processing. So precalculated features are an attractive solution, and the Echo Nest's segment-level features (which describe the audio in terms of adaptively-sized chunks, placed to capture all the significant events in the sounds) are a very efficient way to capture audio features. They can be used to recreate approximations of more familiar low-level features such as MFCCs, but for many applications they actually offer a more promising foundation.
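For anyone who wants to check the back-of-the-envelope numbers above, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and the 365.25-day year is an assumption; the 250 million seconds, the 8 CPUs, and the one-tenth-of-real-time figure come from the paragraph above.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY  # assumption: average calendar year


def audio_years(total_audio_seconds):
    """Length of the audio if played back to back, in years."""
    return total_audio_seconds / SECONDS_PER_YEAR


def processing_days(total_audio_seconds, realtime_fraction, n_cpus):
    """Wall-clock days needed to process the audio.

    realtime_fraction is the processing time per second of audio
    (0.1 means the process runs in one-tenth of real time), and the
    work is assumed to split evenly across n_cpus.
    """
    cpu_seconds = total_audio_seconds * realtime_fraction
    return cpu_seconds / n_cpus / SECONDS_PER_DAY


total_seconds = 250_000_000  # figure quoted in the post

print(f"continuous audio: {audio_years(total_seconds):.1f} years")
print(f"processing time:  {processing_days(total_seconds, 0.1, 8):.1f} days")
```

This comes out at roughly 7.9 years of audio and a little over 36 days of processing, which matches the "about 8 years" and "over one month" estimates. The even split across CPUs is optimistic; any overhead for I/O or scheduling would only lengthen the run.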

So the big question is what happens now. We've put together a dataset that we hope will be comprehensive and detailed enough to support a very wide range of music information research tasks for today and for the future. Our hope is that the Million Song Dataset becomes the natural choice for researchers wanting to try out ideas and algorithms on data that is standardized, easily obtained, and relevant to both academia and industry. If we get it right, our field should be greatly strengthened through the use of a common dataset.

But for all this to come true, we need lots of people to start using the data. Naturally, we want all the effort we've put into creating this dataset to have as much positive impact as possible. So please let us know: what can we do to make the Million Song Dataset better? If you're thinking about trying it, what more could we provide to close the deal? If you've tried it, what were the things that you found difficult or obscure? What could we add to make it more useful, to make it a no-brainer to adopt?

Although we've developed the dataset independently up until now, we hope that, as more and more researchers get involved, this will become a true community effort. Our vision and hope is that many different individuals and groups will develop and contribute additional data, all referenced to the same underlying dataset, that can be shared to further improve the usefulness of the data, while preserving as far as possible the commonality and comparability of a single collection.

We hope that you'll see the Million Song Dataset as something that can really help your research, and that you will join in with making it a common standard for our field.

-- Dan Ellis <dpwe@ee.columbia.edu>
