As of today, the subset of the MSD on the UC Irvine Machine Learning repository has been viewed more than 10,000 times!
YearPredictionMSD is extracted from the full Million Song Dataset: it contains the average and covariance of the timbre data, and the year a song was released. It is one of the largest regression datasets on the UCI repository.
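To make "average and covariance of the timbre data" concrete, here is a minimal sketch of how such a summary can be computed for one song. It assumes each song comes with a list of 12-dimensional timbre vectors, one per segment, which gives 12 averages plus 78 unique covariance entries, 90 features in total. The function name `timbre_summary`, the random stand-in data and the ordering of the covariance entries are illustrative assumptions, not necessarily the exact layout of the UCI file. It uses only the standard library; NumPy would be the more usual choice in practice.

```python
import random


def timbre_summary(timbre):
    """Summarize per-segment timbre vectors as one fixed-length feature list.

    `timbre` is a list of segments, each a list of `dim` floats.
    Returns the `dim` per-dimension averages followed by the upper triangle
    (diagonal included) of the sample covariance matrix, row by row.
    The ordering of the covariance entries is an assumption of this sketch.
    """
    n = len(timbre)
    if n < 2:
        raise ValueError("need at least two segments to estimate a covariance")
    dim = len(timbre[0])

    # Per-dimension average over all segments.
    means = [sum(seg[d] for seg in timbre) / n for d in range(dim)]

    # Unbiased sample covariance, keeping each symmetric pair only once.
    covs = []
    for i in range(dim):
        for j in range(i, dim):
            total = sum((seg[i] - means[i]) * (seg[j] - means[j]) for seg in timbre)
            covs.append(total / (n - 1))

    return means + covs


# Random stand-in for one song: 200 segments of 12 timbre values (not real MSD data).
random.seed(0)
fake_timbre = [[random.gauss(0.0, 1.0) for _ in range(12)] for _ in range(200)]
features = timbre_summary(fake_timbre)
print(len(features))  # 90 = 12 averages + 78 covariance entries
```

Whatever the length of the song, the result has the same size, which is what makes it convenient for standard regression algorithms; the order of the segments is discarded in the process.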
From the start, we believed that the MSD was an opportunity for our community to advertise its research topics to other connected fields. Since the UCI ML repo is viewed by a wide range of machine learning practitioners, we can safely say that most of these visitors were not from music information retrieval. And even if only half a percent of all these page views turn into a citation or some experiment on the data, it could be enough to raise awareness in some new conferences / communities.
Take-home message: if you have some machine learning task ready, including a dataset, post it on the UCI repo! If someone is interested in creating other subsets of the MSD, contact us, or simply go ahead!
As a side note, the goal of such a subset was never to replace the full MSD. If you are serious about Year Prediction from audio features, the features we provided in the subset are naive and restrictive. The main goal was for machine learning folks to demo their algorithms on music-related problems.
Finally, and unrelated, here is an interesting short presentation from Malcolm Slaney at ICASSP 2011 (pdf).