
The Music-Speech Corpus

The "music-speech" corpus is a small collection of some 240 15-second extracts collected 'at random' from the radio by Eric Scheirer during his internship at Interval Research Corporation in the summer of 1996 under the supervision of Malcolm Slaney. This is the database used in:

  E. Scheirer & M. Slaney (1997)
  Construction And Evaluation Of A Robust Multifeature Speech/music
  Discriminator 
  Proc. ICASSP-97, Munich.
  http://cobweb.ecn.purdue.edu/~malcolm/interval/1996-085/SpeechMusicICASSP97.pdf

It was also used, for comparison, in the paper Gethin and I did:

  G. Williams & D. Ellis (1999)
  Speech/music discrimination based on posterior probability features 
  Proc. Eurospeech-99, Budapest. 
  ftp://ftp.icsi.berkeley.edu/pub/speech/papers/euro99-mussp.pdf

The data is broken up into training and test portions, and further categorized as containing speech, music (with or without vocals), speech over music, plus a few examples of 'other' (birdsong).

Gethin and I produced a complete lexical transcript of the set (including the Spanish utterances!); it is in wrdfile/musicspeech.ref. The wrdfile/ directory also includes a messy pile of derivative files that we used for various tests.

In 2001, Adam Berenzweig produced time-aligned labels for all the music data to distinguish between the vocals and musical accompaniment parts. This was used to train a vocals/music discriminator, described in:

  A.L. Berenzweig & D.P.W. Ellis (2001)
  Locating singing voice segments within music signals
  Proc. IEEE WASPAA, Mohonk NY, October 2001.
  http://www.ee.columbia.edu/~dpwe/pubs/waspaa01-singing.pdf

Label files for the speech and music portions only (not the m+s) are included in the mvlabfile directory, in "<start> <duration> <label>" format, where start and duration are in seconds. For the speech-only files, the label file is a single line: "0 15.000 speech". For the music files, there can be several lines indicating segments of "mus" (just instruments) and "vox" (vocals over instruments).
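As a rough illustration of reading the label format described above, here is a minimal Python sketch. It is not part of the distribution: the names Segment, parse_labels and vocal_fraction are hypothetical, and it assumes that each non-blank line holds exactly three whitespace-separated fields.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    """One labelled stretch of audio (hypothetical helper type)."""
    start: float      # seconds from the start of the file
    duration: float   # seconds
    label: str        # e.g. "speech", "mus" or "vox"


def parse_labels(text: str) -> List[Segment]:
    """Parse '<start> <duration> <label>' lines into Segment tuples.

    Assumes whitespace-separated fields; blank lines are skipped.
    """
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got: {line!r}")
        start, duration, label = fields
        segments.append(Segment(float(start), float(duration), label))
    return segments


def vocal_fraction(segments: List[Segment]) -> float:
    """Fraction of the labelled time that is marked 'vox'."""
    total = sum(s.duration for s in segments)
    vox = sum(s.duration for s in segments if s.label == "vox")
    return vox / total if total else 0.0
```

For a speech-only file, parse_labels("0 15.000 speech") yields a single 15-second segment labelled "speech". For a music file, vocal_fraction gives a quick summary of how much of the extract is sung.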

When preparing the labels, we found that files 2, 3, 12, 14, 17 and 21 in the test/music/novocals directory did in fact contain vocals (and are labelled accordingly).

Originally, there was a file 61 in the train/music directory, but it was a garbage file and has now been removed.

Here's a brief description of the full dataset::

  161239  ./wavfile                      Contains all 246 audio files,
                                         each in MSWAVE format, each 15s long

  13121   ./wavfile/test/speech          20 examples of speech alone
  13777   ./wavfile/test/music/novocals  21 examples of music without vocals
  13121   ./wavfile/test/music/vocals    20 examples of music with vocals

  39405   ./wavfile/train/speech         60 examples of speech alone
  40105   ./wavfile/train/music          60 examples of music (vocal+novocal)
  39081   ./wavfile/train/m+s            60 examples of speech-over-music
  2625    ./wavfile/train/other          4 examples of environmental sound

  764     ./mvlabfile                    Partial mirror of ./wavfile with
                                         speech/mus/vox labels (added Feb 2002)

  231     ./wrdfile                      Transcripts; musicspeech.ref is master

  2       ./list                         RANGES defines some utt ranges
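If you want to sanity-check a copy of the data against the listing above, something like the following Python sketch may help. It is only an illustration under stated assumptions: the function names are hypothetical, the audio files are assumed to carry a ".wav" extension, and they are assumed to be plain PCM WAV files that Python's standard wave module can read.

```python
import os
import wave
from typing import Dict


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def count_wavs(root: str) -> Dict[str, int]:
    """Count files ending in '.wav' in each directory under root.

    Keys are directory paths relative to root; directories without
    any '.wav' files are omitted. The extension is an assumption.
    """
    counts: Dict[str, int] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        n = sum(1 for name in filenames if name.lower().endswith(".wav"))
        if n:
            counts[os.path.relpath(dirpath, root)] = n
    return counts
```

Calling count_wavs("wavfile") from the top of the dataset should give per-directory counts to compare with the listing, and wav_duration_seconds can be used to confirm that each extract is about 15 seconds long.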

If you're interested in getting hold of this dataset for research use, please email me at dpwe@ee.columbia.edu.

Last updated: $Date: 2006/09/23 16:31:28 $
Dan Ellis <dpwe@ee.columbia.edu>