The Echo Nest Taste Profile Subset

WARNING: there was a matching issue between the Taste Profile Subset and the MSD tracks; please read this blog post for details. A fix is now available: a list of song-track pairs that should not be trusted. Get it here.

Welcome to the Taste Profile subset, the official user dataset of the Million Song Dataset.

The Echo Nest is committed to giving back to the research community (for instance by creating the MSD!), and they prove it again by releasing the Taste Profile dataset. The dataset contains real user play counts from undisclosed partners, with all songs already matched to the MSD. If you were looking for the right collaborative filtering dataset with audio features, this might be for you! Plus, you can link the user data to lyrics, tags and Last.fm's similar songs, so you have many viewpoints for explaining the data.

Below you can download the subset that overlaps the MSD as a standalone file. Also, some users are already available through The Echo Nest API as "user catalogs". We provide the list of users and their corresponding catalog IDs, which you can read through The Echo Nest API. An example is shown below.

Finally, user anonymity is taken very seriously; you can read The Echo Nest's blog post about the data (and privacy in particular).

Some numbers
Description
Getting the dataset
FAQ
Challenge - More Data
Work using the dataset

Some numbers

Before you read the full description, you might want to know that the Taste Profile subset is big. How big? Below are some numbers:

  • 1,019,318 unique users
  • 384,546 unique MSD songs
  • 48,373,586 user - song - play count triplets
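
To put these counts in perspective, the short calculation below derives the average number of songs per user and the density of the user-song matrix. It is only a sketch: it uses the three figures listed above and assumes each (user, song) pair appears in at most one triplet, which this page does not state explicitly. The variable names are illustrative.

```python
# Figures quoted in the list above.
n_users = 1_019_318
n_songs = 384_546
n_triplets = 48_373_586

# Assumption: one triplet per (user, song) pair, so this is the mean
# number of distinct songs per user.
songs_per_user = n_triplets / n_users

# Fraction of all possible (user, song) cells that carry a play count.
density = n_triplets / (n_users * n_songs)

print(f"average songs per user: {songs_per_user:.1f}")
print(f"matrix density: {density:.4%}")
```

Under that assumption the average comes to roughly 47 songs per user, and only about 0.012% of the possible user-song pairs are observed, so sparse data structures are the natural fit.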

Description

First, you should read The Echo Nest's blog post about the data.

For the downloadable version, the format is straightforward: we provide (user, song, play count) triplets, and each line looks like this (tab-delimited):

b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOAKIMP12A8C130995	1
b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOAPDEY12A81C210A9	1
b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOBBMDR12A8C13253B	2
b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOBFNSP12AF72A0E22	1
b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOBFOVM12A58A7D494	1
b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOBNZDC12A6D4FC103	1
b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOBSUJE12A6D4F8CF5	2
b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOBVFZR12A6D4F8AE3	1
b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOBXALG12A8C13C108	1
b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOBXHDL12A81C204C0	1
b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOBYHAJ12A6701BF1D	1
b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOCNMUH12A6D4F6E6D	1
b80344d063b5ccb3212f76538f3d9e43d87dca9e	SODACBL12A8C13C273	1
b80344d063b5ccb3212f76538f3d9e43d87dca9e	SODDNQT12A6D4F5F7E	5
...
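
As a minimal sketch of reading this format, the function below splits each tab-delimited line into a (user, song, play count) record. The names `Triplet` and `parse_triplets` are hypothetical helpers introduced here for illustration; the sample lines are copied from the listing above.

```python
from typing import Iterable, Iterator, NamedTuple


class Triplet(NamedTuple):
    user_id: str
    song_id: str
    play_count: int


def parse_triplets(lines: Iterable[str]) -> Iterator[Triplet]:
    """Yield one Triplet per non-empty, tab-delimited line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        user_id, song_id, count = line.split("\t")
        yield Triplet(user_id, song_id, int(count))


# Three lines taken from the listing above.
sample = [
    "b80344d063b5ccb3212f76538f3d9e43d87dca9e\tSOAKIMP12A8C130995\t1\n",
    "b80344d063b5ccb3212f76538f3d9e43d87dca9e\tSOBBMDR12A8C13253B\t2\n",
    "b80344d063b5ccb3212f76538f3d9e43d87dca9e\tSODDNQT12A6D4F5F7E\t5\n",
]
triplets = list(parse_triplets(sample))
total_plays = sum(t.play_count for t in triplets)
print(triplets[0].song_id, triplets[0].play_count)
print(total_plays)
```

Because `parse_triplets` accepts any iterable of lines, an open file handle can be passed in place of `sample`.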

If you call The Echo Nest API, you can get the information for about 120K of these users.
CATALOG FOR 120K USERS
The first few lines are shown below; each catalog (name and ID) represents one user.

### f85c6de77b853f0b4d624a042129aee374db2637_tmp_catalog --> CACHGYH1332EB0628E
### 993c1bb7906374683bd517a55a500512c492cc94_tmp_catalog --> CAKCLXJ1332EB06A11
### 074c473776aa8742c442823fec1ee1a6a4c18599_tmp_catalog --> CARHWHW1332EB07121
### 05f16e747ce3f98f81a192ecc51f22bb7e6b27b3_tmp_catalog --> CAODWOL1332EB077C4
### 8481d9dc7640ba65fbff38ebd85c2c36f2a261dd_tmp_catalog --> CAAMNTA1332EB07EF3
### 95f502d804aa9fc2cc4278d5a0356c6fe90eabdc_tmp_catalog --> CAGSYUX1332EB085D6
### 332f3afa4f60d92629bce8d2216bc9fe53cd2c16_tmp_catalog --> CAXPSHX13330DD5544
....

See below for how to get the catalog data from the API.
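
The catalog list can be turned into a user-to-catalog lookup with a few lines of Python. This is a sketch that assumes every line follows the `### <user>_tmp_catalog --> <catalog ID>` pattern seen in the excerpt above; `parse_catalog_list` and `LINE_RE` are hypothetical names, and lines that do not match (such as the trailing dots) are skipped.

```python
import re
from typing import Dict, Iterable

# Pattern inferred from the excerpt above (an assumption, not a spec):
#   ### <hex user ID>_tmp_catalog --> <catalog ID>
LINE_RE = re.compile(r"^###\s+([0-9a-f]+)_tmp_catalog\s+-->\s+(\S+)\s*$")


def parse_catalog_list(lines: Iterable[str]) -> Dict[str, str]:
    """Map each user ID to its catalog ID, ignoring non-matching lines."""
    mapping: Dict[str, str] = {}
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            mapping[match.group(1)] = match.group(2)
    return mapping


# Lines copied from the excerpt above, plus the elision marker.
sample = [
    "### f85c6de77b853f0b4d624a042129aee374db2637_tmp_catalog --> CACHGYH1332EB0628E\n",
    "### 993c1bb7906374683bd517a55a500512c492cc94_tmp_catalog --> CAKCLXJ1332EB06A11\n",
    "....\n",
]
catalogs = parse_catalog_list(sample)
print(catalogs["f85c6de77b853f0b4d624a042129aee374db2637"])
```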

Getting the dataset

First, if you want to download the full subset as one file, here it is:
TRIPLETS FOR 1M USERS (~500MB)
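
Since the file is around 500MB, one reasonable approach is to stream it line by line rather than load it whole. The sketch below totals play counts per user and can skip songs flagged as unreliable. `plays_per_user` and `skip_songs` are hypothetical names; how you build `skip_songs` from the list of mismatched song-track pairs is left open, because the format of that list is not described on this page.

```python
from collections import Counter
from typing import Collection, Iterable


def plays_per_user(lines: Iterable[str],
                   skip_songs: Collection[str] = frozenset()) -> Counter:
    """Sum play counts per user, reading one tab-delimited line at a time."""
    totals: Counter = Counter()
    for line in lines:
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        user_id, song_id, count = parts
        if song_id in skip_songs:
            continue  # caller marked this song as not to be trusted
        totals[user_id] += int(count)
    return totals


# In-memory stand-in for the real file; lines follow the triplet format.
user = "b80344d063b5ccb3212f76538f3d9e43d87dca9e"
sample = [
    user + "\tSOAKIMP12A8C130995\t1\n",
    user + "\tSOBBMDR12A8C13253B\t2\n",
    user + "\tSODDNQT12A6D4F5F7E\t5\n",
]
all_plays = plays_per_user(sample)
# Illustrative only: this song ID is NOT taken from the real mismatch list.
filtered_plays = plays_per_user(sample, skip_songs={"SOBBMDR12A8C13253B"})
print(all_plays[user], filtered_plays[user])
```

With the real data, an open file handle on the downloaded triplets file would take the place of `sample`.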

Now, we show you how to get that information from The Echo Nest API, e.g. how to query the catalog of one of the 120K users we provide. Each user's play count catalog has an ID in the file above; for example, user f85c6de77b853f0b4d624a042129aee374db2637 (the first user in the file) has catalog ID CACHGYH1332EB0628E. Note that the session below reads a different catalog, CACNYVZ1332EB0BA9D, whose name corresponds to user 01056e159da428c96c7db9f11377dc8df430f2ba. We use Python and pyechonest (v. 4.2), and we assume your API key is already set.

In [6]: from pyechonest import catalog
In [7]: cat = catalog.Catalog('CACNYVZ1332EB0BA9D')
In [8]: cat.read()
Out[8]: 
{u'id': u'CACNYVZ1332EB0BA9D',
 u'items': [{u'artist_id': u'ARB6OGR1187FB4D43D',
             u'artist_name': u'M83',
             u'date_added': u'2011-10-23T15:59:59',
             u'foreign_id': u'CACNYVZ1332EB0BA9D:song:10286694_usercat',
             u'play_count': 1,
             u'request': {u'artist_id': u'ARB6OGR1187FB4D43D',
                          u'item_id': u'10286694_usercat',
                          u'song_id': u'SOFMYVK12A58A7A675'},
             u'song_id': u'SOFMYVK12A58A7A675',
             u'song_name': u'Skin Of The Night'},
            {u'artist_id': u'ARK9LNI1187FB4D116',
             u'artist_name': u'A*Teens',
             u'date_added': u'2011-10-23T15:59:59',
             u'foreign_id': u'CACNYVZ1332EB0BA9D:song:11559594_usercat',
             u'request': {u'artist_id': u'ARK9LNI1187FB4D116',
                          u'item_id': u'11559594_usercat',
                          u'song_id': u'SOIYYWE12AB0182FD8'},
             u'song_id': u'SOIYYWE12AB0182FD8',
             u'song_name': u'One Night In Bangkok'},
            ...................
            {u'artist_id': u'ARV9QVP1187FB54F24',
             u'artist_name': u'Booty Luv',
             u'date_added': u'2011-10-23T15:59:59',
             u'foreign_id': u'CACNYVZ1332EB0BA9D:song:3878364_usercat',
             u'play_count': 1,
             u'request': {u'artist_id': u'ARV9QVP1187FB54F24',
                          u'item_id': u'3878364_usercat',
                          u'song_id': u'SOHMQGF12A58A7BFD2'},
             u'song_id': u'SOHMQGF12A58A7BFD2',
             u'song_name': u'Boogie 2Nite'},
            {u'artist_id': u'ARMCO9E1187B9B7314',
             u'artist_name': u'Midnight Juggernauts',
             u'date_added': u'2011-10-23T15:59:59',
             u'foreign_id': u'CACNYVZ1332EB0BA9D:song:9884334_usercat',
             u'play_count': 1,
             u'request': {u'artist_id': u'ARMCO9E1187B9B7314',
                          u'item_id': u'9884334_usercat',
                          u'song_id': u'SOYTVDF12A8AE487E0'},
             u'song_id': u'SOYTVDF12A8AE487E0',
             u'song_name': u'Into The Galaxy (Album Version)'}],
 u'name': u'01056e159da428c96c7db9f11377dc8df430f2ba_tmp_catalog',
 u'start': 0,
 u'total': 22,
 u'type': u'song'}
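
The dictionary returned by `cat.read()` can be flattened back into the same (user, song, play count) triplets used by the downloadable file. The sketch below works on a plain dictionary shaped like the output above and makes no API calls. It assumes the user ID is the catalog name minus its `_tmp_catalog` suffix, and it leaves the count as `None` for items without a `play_count` key (as in the A*Teens entry above), since this page does not say what a missing count means. `catalog_to_triplets` is a hypothetical helper.

```python
from typing import Any, Dict, List, Optional, Tuple

SUFFIX = "_tmp_catalog"


def catalog_to_triplets(
    cat_data: Dict[str, Any],
    missing_count: Optional[int] = None,
) -> List[Tuple[str, str, Optional[int]]]:
    """Turn a catalog dictionary into (user, song, play count) tuples."""
    name = cat_data["name"]
    # Assumption: the catalog name is "<user ID>_tmp_catalog".
    user_id = name[:-len(SUFFIX)] if name.endswith(SUFFIX) else name
    triplets = []
    for item in cat_data["items"]:
        # Some items carry no play_count; use the caller's placeholder.
        count = item.get("play_count", missing_count)
        triplets.append((user_id, item["song_id"], count))
    return triplets


# Abbreviated copy of the output above (two of the 22 items).
cat_data = {
    "id": "CACNYVZ1332EB0BA9D",
    "name": "01056e159da428c96c7db9f11377dc8df430f2ba_tmp_catalog",
    "items": [
        {"song_id": "SOFMYVK12A58A7A675", "play_count": 1},
        {"song_id": "SOIYYWE12AB0182FD8"},
    ],
}
rows = catalog_to_triplets(cat_data)
for row in rows:
    print(row)
```

Passing `missing_count=0` or `missing_count=1` instead is a modeling choice to make explicitly, depending on how you want to treat songs listed without a count.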

The Echo Nest API

All this would not be feasible without the great API that started the whole MSD project; all the info is on The Echo Nest's Developer Center. If you work with music data, there is something useful for you there.

FAQ

What is the link between The Echo Nest and the Million Song Dataset?
The Echo Nest helped start the MSD project, and this dataset shows how much they care about it. That said, the MSD is an independent "open" project mostly maintained by LabROSA @ Columbia University. There is no official relation between LabROSA and The Echo Nest.

What is the licensing?
The licensing is the same as The Echo Nest API license.

How to cite the dataset?
You should cite this publication [bib].
Additionally, you can mention / link to this web resource:

The Echo Nest Taste profile subset, the official user data collection for the Million Song
Dataset, available at: http://millionsongdataset.com/tasteprofile

Challenge - More Data

The MSD Challenge was organized as a music recommendation contest on Kaggle. We provide the evaluation data from the 1st edition, which can be seen as an additional 110K users of data. More details are on our challenge page.

Work using the dataset

Publications using the dataset are listed below; they should be a subset of the MSD publications. If you think your work should be included, send us an email!

  • The Million Song Dataset Challenge, B. McFee, T. Bertin-Mahieux, D. Ellis and G. Lanckriet, AdMIRe '12 [pdf][bib]