The Million Song Dataset Challenge

Welcome to the MSD Challenge, the largest open offline music recommendation evaluation. This page gives some background information and pointers. To participate in the contest, see our Kaggle page.


The Million Song Dataset Challenge is an open, offline music recommendation evaluation:
- music recommendation: predict what people might want to listen to;
- open: everything is known about the songs (metadata, features, ...), anything can be used;
- offline: evaluation is done on a fixed set of actual listening data.
The MSD Challenge takes the form of a contest where anyone can predict what else the test users have listened to, using whatever techniques and data they choose. The best teams will be awarded prizes. The full details of the contest are available on Kaggle.

Important Dates
April 2012: launch of the contest
August 2012: submission period ends
October 2012: workshop / special session, awards
2013: second (and final) edition

Here is what you should look at in order to participate:
- Kaggle website
- AdMIRe 2012 paper
- Taste Profile subset
- Going from song IDs to track IDs

The challenge is administered by labs at UCSD and Columbia, helped by the members of the advisory committee. The main organizers are barred from winning any prize in the challenge.
Organizing Committee
Brian McFee, UCSD
Dan Ellis, Columbia University
Gert Lanckriet, UCSD
Advisory Committee
Thierry Bertin-Mahieux, Columbia University
Oscar Celma, Gracenote
J. Stephen Downie, University of Illinois at Urbana-Champaign
Douglas Eck, Google Research
Paul Lamere, The Echo Nest
Mark Levy,
Malcolm Slaney, Yahoo! Research
Julián Urbano, University Carlos III of Madrid

Why a contest?
Because we don't know yet what is useful for music recommendation. Pure collaborative filtering? Content-based recommendations? Metadata like years and nominal genre? There have been other "music" contests, e.g. the KDD Cup 2011, but they were closed: the metadata about the artists/songs was hidden and no audio features were available. We want to reproduce the challenge facing a music technology start-up: if you can crawl the web, pay humans, and analyze the audio, how do you best recommend songs to your listeners based on a few songs they have already played?
Who is organizing it?
Researchers from the Music Information Retrieval (MIR) community. This field draws on tools from machine learning, recommender systems, multimedia analysis, psychology, and more in order to manage music. For the curious, the main MIR conference is ISMIR.
What are the rules?
See Kaggle.
When will we be announcing the results?
The contest ends in August, and the main result will be announced then. However, NEMA will conduct additional analysis on the submissions, with the results to be presented at ISMIR 2012.
Where can I get help?
General questions should be sent to the MSD mailing list. Contest-specific questions, e.g. unclear rules, typos, etc., should be sent to Brian McFee. Data-specific questions that don't get answered on the mailing list can be sent to Thierry Bertin-Mahieux.

Data From Year 1
The first edition of the contest ended in August 2012. Here is the data from the challenge so you can reproduce the results.

  • The training set (~1M users) is still available, see the Taste Profile Subset.
  • The challenge data always comes in two parts: for a given user, half of their listening history is 'visible' and can be trained on, and the other half is 'hidden' (kept secret) and is used to measure performance.
  • The challenge on Kaggle had a public leaderboard where results were updated instantly. This can be considered the validation set. It contains 10K users.
  • The real, publication-worthy results were computed over a test set of 100K users.
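
The visible/hidden split described above can be illustrated with a small sketch. This is not the organizers' actual procedure: the function name `split_visible_hidden`, the random per-user shuffle, and the toy (user, song, play count) triplets are all assumptions made for illustration. It only shows the general idea of keeping half of each user's songs for training and holding out the rest for scoring.

```python
import random
from collections import defaultdict


def split_visible_hidden(triplets, seed=0):
    """Split each user's listening history into a visible and a hidden half.

    `triplets` is an iterable of (user, song, play_count) tuples.
    The random 50/50 split is an assumption, not the official procedure.
    """
    rng = random.Random(seed)

    # Group the listening events by user.
    by_user = defaultdict(list)
    for user, song, count in triplets:
        by_user[user].append((song, count))

    visible, hidden = {}, {}
    for user, items in by_user.items():
        items = sorted(items)      # fixed order so the shuffle is reproducible
        rng.shuffle(items)
        half = len(items) // 2
        visible[user] = items[:half]   # can be used for training
        hidden[user] = items[half:]    # held out to measure performance
    return visible, hidden


# Hypothetical toy data: (user, song, play_count)
triplets = [
    ("u1", "s1", 3), ("u1", "s2", 1), ("u1", "s3", 5), ("u1", "s4", 2),
    ("u2", "s2", 7), ("u2", "s5", 1),
]
visible, hidden = split_visible_hidden(triplets)
```

A recommender would then be trained on the full training set plus the `visible` halves, and its predictions for each user compared against that user's `hidden` songs.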

The data is available here: EvalDataYear1.