The musiXmatch Dataset


Welcome to the musiXmatch dataset, the official lyrics collection of the Million Song Dataset.

The MSD team is proud to partner with musiXmatch in order to bring you a large collection of song lyrics in bag-of-words format, for academic research. All of these lyrics are directly associated with MSD tracks: you can correlate them with all the data contained in the dataset such as similar artists, tags, years, audio features, etc.

The musiXmatch team was able to resolve over 77% of the MSD tracks; we provide the full mapping of MSD IDs to musiXmatch IDs. Of these, we are releasing lyrics for 237,662 tracks (erratum: we had announced 237,701). The other tracks were omitted for various reasons, including:
* diverse restrictions, including copyrights
* instrumental tracks
* duplicates within the MSD, which we skipped as much as possible
That said, with 237,662 bags-of-words, it is the largest clean lyrics collection available for research!

Description
Getting the dataset
musiXmatch API
FAQ
Work using the dataset

Description

The MXM dataset provides lyrics for many MSD tracks. The lyrics come in bag-of-words format: each track is described as the word-counts for a dictionary of the top 5,000 words across the set. Although copyright issues prevent us from distributing the full, original lyrics, we hope and believe that this format is for many purposes just as useful, and may be easier to use.

The dataset comes in two text files, describing training and test sets. The split follows the one used for tagging (see tagging test artists). There are 210,519 training bags-of-words and 27,143 testing ones. We also provide the full list of words with total counts across all tracks so you can measure the relative importance of the top 5,000.

The two text files are formatted as follows (per line):
# - comment, ignore
%word1,word2,... - list of top words, in popularity order
TID,MXMID,idx:cnt,idx:cnt,... - track ID from MSD, track ID from musiXmatch, then word index : word count (word index starts at 1!)
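
The format above can be read with a few lines of Python. This is a minimal sketch, not the official loader: the function name `parse_mxm_lines` and the sample lines (including the track ID and musiXmatch ID) are made up for illustration, and IDs are kept as strings to avoid assuming their type.

```python
def parse_mxm_lines(lines):
    """Parse lines in the text format described above.

    Returns (vocab, tracks), where tracks maps an MSD track ID to
    (musiXmatch ID, {word: count}).
    """
    vocab = []
    tracks = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comment or blank line
        if line.startswith("%"):
            vocab = line[1:].split(",")  # top words, in popularity order
            continue
        fields = line.split(",")
        tid, mxm_id = fields[0], fields[1]
        counts = {}
        for pair in fields[2:]:
            idx, cnt = pair.split(":")
            counts[vocab[int(idx) - 1]] = int(cnt)  # word index starts at 1
        tracks[tid] = (mxm_id, counts)
    return vocab, tracks


# Made-up sample, NOT real dataset content.
sample = [
    "# this is a comment",
    "%i,the,you",
    "TRAAAAA000000000000,1234567,1:6,3:2",
]
vocab, tracks = parse_mxm_lines(sample)
print(tracks["TRAAAAA000000000000"])
```

To read a real file, pass an open file object instead of the `sample` list. Since the top 5,000 words are the same in both files, the same function works for train and test.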

Getting the dataset

Here is the train file and here is the test file. The top 5,000 words are the same for both. That said, we strongly encourage you to use the SQLite version below: it is faster and more convenient.

To help you deal with this data, we also provide it as an SQLite database. The details can be found in the README, and the code to recreate it is this python code. You might also want to check this blog post.
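
As an illustration of why the SQLite version is convenient, here is a hedged sketch of a query. The schema is documented in the README, not on this page, so the table name `lyrics` and the columns `track_id`, `mxm_tid`, `word`, `count` and `is_test` are assumptions to check against the README. The rows are made up, and an in-memory database stands in for the downloaded file.

```python
import sqlite3

# ASSUMED schema (verify against the README):
#   lyrics(track_id, mxm_tid, word, count, is_test)
conn = sqlite3.connect(":memory:")  # stand-in for the downloaded database file
conn.execute(
    'CREATE TABLE lyrics '
    '(track_id TEXT, mxm_tid INT, word TEXT, "count" INT, is_test INT)'
)
# Made-up rows, NOT real dataset content.
conn.executemany(
    "INSERT INTO lyrics VALUES (?, ?, ?, ?, ?)",
    [
        ("TRAAAAA000000000000", 1, "love", 4, 0),
        ("TRAAAAA000000000000", 1, "you", 7, 0),
        ("TRBBBBB000000000000", 2, "love", 1, 1),
    ],
)


def tracks_with_word(conn, word):
    """Return (track_id, count) pairs for tracks containing `word`,
    highest count first."""
    rows = conn.execute(
        'SELECT track_id, "count" FROM lyrics '
        'WHERE word = ? ORDER BY "count" DESC, track_id',
        (word,),
    )
    return rows.fetchall()


love_tracks = tracks_with_word(conn, "love")
print(love_tracks)
```

With the real file, replace `":memory:"` with the path to the database and drop the `CREATE TABLE` and `INSERT` statements.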

The full list of 779K matches with musiXmatch is also provided; the format is described in the file header.

Then, we release the full list of stemmed words and the total word counts, i.e. all the words that were seen at least once. There are 498,134 unique words, for a total of 55,163,335 occurrences. The 5,000 words in the dataset account for 50,607,582 occurrences, so roughly 92%. NOTE 1: to choose our 5,000 words, we normalized the word counts by the number of word occurrences in each song. Thus, they are not simply the top 5,000 of this file. NOTE 2: the list is super noisy, and we know it! We made sure that the top 5,000 words were clean, but for the rest there is no guarantee whatsoever: the bottom of the list is a mess (punctuation signs, foreign symbols, words glued together... you name it, it's there).
P.S. thanks to Marc Brysbaert for his feedback and the request of this list.
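
The coverage figure and the effect of per-song normalization can be checked with a short Python sketch. The normalization used here (each word's count divided by the song's total word count, then summed over songs) is one reading of NOTE 1, not a confirmed specification, and the three toy bags are made up.

```python
from collections import Counter

# Figures quoted above.
total_occurrences = 55_163_335    # all 498,134 unique stemmed words
top5000_occurrences = 50_607_582  # the 5,000 words kept in the dataset
coverage = top5000_occurrences / total_occurrences
print(f"coverage: {coverage:.1%}")


def normalized_scores(bags):
    """Sum, over songs, of each word's share of that song's word count.

    ASSUMPTION: this is one interpretation of 'normalized by the number
    of word occurrences in each song'.
    """
    scores = Counter()
    for bag in bags:
        total = sum(bag.values())
        if total == 0:
            continue
        for word, cnt in bag.items():
            scores[word] += cnt / total
    return scores


# Made-up bags: one very repetitive song dominates the raw counts.
bags = [{"la": 98, "love": 2}, {"love": 1, "you": 1}, {"love": 3, "you": 1}]
raw = Counter()
for bag in bags:
    raw.update(bag)
scores = normalized_scores(bags)
print(raw.most_common(1)[0][0], scores.most_common(1)[0][0])
```

In this toy example the raw counts rank "la" first, while the normalized scores rank "love" first, which is why a normalized top 5,000 differs from the top 5,000 by raw count.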

Finally, if you work on visualizing lyrics, stemmed words are annoying, as Andrew Clegg pointed out to us. Therefore, here is a list of unstemmed words with their stemmed version. Of course, it is only one possible list, since the goal of stemming is to map many related words onto the same one. But it should still improve comprehension; for instance, it maps 'victori'->'victory'.

If you want to better understand our bag-of-words creation process, including the stemming, or if you want to complement the dataset from your own lyrics collection, use this python code.

musiXmatch API

A few words on the musiXmatch API, which you can use to complement the dataset.

You can access the API using these wrappers, including the python one I collaborated on.

FAQ

Why use bag-of-words and not the original lyrics?
The actual lyrics are protected by copyright and we do not have permissions to redistribute them. However, by releasing only the word counts over a finite dictionary, we are providing a statistical description which respects the authors' copyright, yet provides what we believe is enough data to perform a wide range of interesting research.

Why use stemming?
Stemming is very commonly used in this kind of text analysis task. For statistical purposes, it is more interesting to treat "cry", "cried", and "crying" as instances of the same thing, rather than treating them as distinct, unrelated tokens. We use a simple, well-known stemming algorithm, Porter2, which for this example maps all these words to "cri". We added extra pre-processing rules, meaning we don't follow the real Porter2 specs; see the next point, 'stemming++'.

stemming++
We added rules to the regular stemming algorithm. It might not have been a good idea, but the goal was to maximize the info held by a limited number of top words. Here is the full lyrics to bag-of-words algorithm. It explains why some punctuation is simply removed, and why "I'm" is never seen: it becomes "I am". Note the mistake we made with "n't " -> " not "; it should have been -> "n not ". This is why "can't" becomes "ca not". We hope none of this will cause too much trouble in your research. To get a better intuition, here is a list of 10K popular English words and how they would appear in the dataset. Thanks to C. Févotte for uncovering some of these issues.
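
To make the effect of these extra rules concrete, here is a minimal sketch of the pre-processing step only. It is not the released algorithm: it covers just the two contraction rules mentioned above, the "n't" rule is written in the form that drops the "n" (which is what the "ca not" outcome implies), the lowercasing and the punctuation list are assumptions, and the Porter2 stemming step is left out.

```python
def expand_and_strip(lyrics):
    """Expand two contractions and strip punctuation (illustration only)."""
    # Pad with spaces so rules ending in a space also match the last word.
    text = " " + lyrics.lower().replace("\n", " ") + " "
    text = text.replace("'m ", " am ")    # "i'm" -> "i am"
    text = text.replace("n't ", " not ")  # drops the "n": "can't" -> "ca not"
    for p in ",.;:?!\"'()":               # ASSUMED punctuation list
        text = text.replace(p, "")
    return text.split()


tokens = expand_and_strip("I'm sure you can't stop, don't go!")
print(tokens)
```

Here "can't" ends up as the two tokens "ca" and "not", and "don't" as "do" and "not", so a token such as "ca" in the dataset comes from this rule rather than from the lyrics themselves.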

Why a limited number of top words?
The most popular words provide the most relevant information in the lyrics. If we told you which single song includes the lyric "deathray", it wouldn't be very interesting from a statistical modeling point of view. But telling you which 157 songs include "silk" (one of the least-popular words we include) could actually be useful.
Note that stemming and limited dictionary also help us to respect copyrights.
P.S. if you do possess a deathray, please don't take it personally.

What is the licensing?
Research only, strictly non-commercial. Also, musiXmatch has the right to advertise and refer to any work derived from the dataset. For details, contact musiXmatch.

Why two files?
We believe that, for some applications, the use of a test set will be important and we want it to be standard. We used the same split as for automatic tagging. That said, if you concatenate the two files, you'll get back the full dataset.

How to cite the dataset?
You should cite this publication [bib].
Additionally, you can mention / link to this web resource:

musiXmatch dataset, the official lyrics collection for the Million Song Dataset, 
available at: http://millionsongdataset.com/musixmatch

What is the relationship between musiXmatch and the MSD?
musiXmatch generously donated data to create the musiXmatch dataset, referenced against the Million Song Dataset. The dataset was created by the MSD team with the approval of the final result by musiXmatch. musiXmatch is not responsible for any other part of the MSD project.

How was the dataset created?
musiXmatch provided a list of matches based on artist names and song titles for about 77% of the million tracks. We retrieved the lyrics for these tracks. Note that many lyrics were not available to us because of copyright issues. Also, mXm knows about some tracks without having their lyrics. Then, we removed the instrumental tracks (those with fewer than 3 words or specifically identified as instrumental by mXm). Finally, we removed the duplicates based on the official MSD duplicates list and the mXm track ID. We had access to the words in a random order, and we stemmed them using this Porter2 algorithm.
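
The filtering steps above can be sketched as follows. This is an illustration under stated assumptions, not the code we ran: the record layout (`msd_id`, `mxm_id`, `words`, `instrumental`), the function name `filter_tracks` and the sample data are hypothetical, and "fewer than 3 words" is read here as fewer than 3 word tokens.

```python
def filter_tracks(tracks, msd_duplicate_groups):
    """Drop instrumental tracks, then duplicates (hypothetical record layout).

    tracks: list of dicts with keys 'msd_id', 'mxm_id', 'words', 'instrumental'.
    msd_duplicate_groups: lists of MSD IDs that refer to the same song.
    """
    # Map every MSD ID in a duplicate group to one representative ID.
    canonical = {}
    for group in msd_duplicate_groups:
        for msd_id in group:
            canonical[msd_id] = group[0]

    kept, seen_msd, seen_mxm = [], set(), set()
    for t in tracks:
        # Instrumental: flagged as such, or fewer than 3 words (ASSUMED: tokens).
        if t["instrumental"] or len(t["words"]) < 3:
            continue
        rep = canonical.get(t["msd_id"], t["msd_id"])
        # Duplicate: same MSD duplicate group or same mXm track ID already kept.
        if rep in seen_msd or t["mxm_id"] in seen_mxm:
            continue
        seen_msd.add(rep)
        seen_mxm.add(t["mxm_id"])
        kept.append(t)
    return kept


# Made-up records, NOT real dataset content.
words = ["oh", "my", "love", "oh"]
tracks = [
    {"msd_id": "A", "mxm_id": 1, "words": words, "instrumental": False},
    {"msd_id": "B", "mxm_id": 2, "words": ["oh"], "instrumental": False},  # too short
    {"msd_id": "C", "mxm_id": 3, "words": words, "instrumental": True},    # flagged
    {"msd_id": "D", "mxm_id": 4, "words": words, "instrumental": False},   # MSD duplicate of A
    {"msd_id": "E", "mxm_id": 1, "words": words, "instrumental": False},   # same mXm ID as A
    {"msd_id": "F", "mxm_id": 5, "words": words, "instrumental": False},
]
kept_ids = [t["msd_id"] for t in filter_tracks(tracks, [["A", "D"]])]
print(kept_ids)
```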

Who can I contact for additional help?
Thierry Bertin-Mahieux is still a good first try. Otherwise, you can try the MSD mailing list. For a musiXmatch specific question, please contact them directly.

Work using the dataset

  • Last.fm blog, June 2011
  • CCA and a Multi-way Extension for Investigating Common Components between Audio, Lyrics and Tags, M. McVicar and T. De Bie, CMMR '12 [pdf]
  • Ranking lyrics for online search, R. Macrae and S. Dixon, ISMIR '12 [pdf]
  • Maximum marginal likelihood estimation for nonnegative dictionary learning in the Gamma-Poisson Model, O. Dikmen and C. Févotte, IEEE Transactions on Signal Processing, Oct. 2012 [pdf]
  • Learning the β-Divergence in Tweedie Compound Poisson Matrix Factorization Models, U. Simsekli, A. Cemgil and Y. Yilmaz, ICML 2013 [pdf]