Separating Speech from Speech Noise

Separating speech in complex acoustic environments, such as picking out a single voice at a cocktail party, is extremely difficult. Many speech enhancement and separation techniques break down when the target and the interference share the same properties, as they do when both are speech. This project applies novel models, based on Computational Auditory Scene Analysis (CASA) and on trained models of the speech signal, to see how well speech can be separated from competing speech. In particular, our goal is to provide separations that are demonstrably of benefit to human listeners, hence our collaboration with perceptual experimentalists at EBIRE and Boston University.

Partners

East Bay Institute for Research and Education - Pierre Divenyi
Boston University - Barbara Shinn-Cunningham
Columbia University - Dan Ellis
Ohio State University - DeLiang Wang

Resources

A page of examples of very challenging acoustic environments
alignSpondee.tgz - a package of HTK scripts for making 1 ms-resolution alignments between experimental tokens and phone labels. We use "spondees" (two-syllable words with equal stress on both syllables, e.g. dog-house, fire-truck) to control for stress prosody in our experiments.

Project Reports

Theses

R. J. Weiss (2009)
Underdetermined Source Separation Using Speaker Subspace Models
Ph.D. Thesis, Columbia University Dept. of Electrical Engineering.

Related Publications

2009

J. B. Boldt and D. Ellis (2009) A Simple Correlation-Based Model of Intelligibility for Nonlinear Speech Enhancement and Separation. Proc. EUSIPCO'09, Glasgow, August 2009. (to appear)
R. Weiss and D. Ellis (2009) A Variational EM Algorithm for Learning Eigenvoice Parameters in Mixed Signals. Proc. ICASSP-09, pp. 113-116, Taiwan, April 2009.

2008

R. Weiss and D. Ellis (2008) Speech separation using speaker-adapted Eigenvoice speech models. Computer Speech and Language, accepted for publication. (18pp) DOI: 10.1016/j.csl.2008.03.003
R. Weiss, M. Mandel, D. Ellis (2008) Source Separation Based on Binaural Cues and Source Model Constraints. Proc. Interspeech-08, pp. 419-422, Brisbane, Australia, September 2008.
K. Hu, P. Divenyi, D. Ellis, Z. Jin, B. Shinn-Cunningham, D. Wang (2008) Preliminary Intelligibility Tests of a Monaural Speech Segregation System. Proc. SAPA-08, pp. 11-16, Brisbane, Australia, September 2008.
A. Lammert, D. Ellis, P. Divenyi (2008) Data-driven articulatory inversion incorporating articulator priors. Proc. SAPA-08, pp. 29-34, Brisbane, Australia, September 2008.
S. Ravuri and D. Ellis (2008) Stylization of Pitch with Syllable-Based Linear Segments. Proc. ICASSP-08, Las Vegas, April 2008, pp. 3985-3988.

2007

M. Mandel and D. Ellis (2007) EM localization and separation using interaural level and phase cues. Proc. IEEE Workshop on Apps. of Sig. Proc. to Acous. and Audio WASPAA-07, Mohonk NY, October 2007, pp. 275-278.
R. Weiss and D. Ellis (2007) Monaural speech separation using source-adapted models. Proc. IEEE Workshop on Apps. of Sig. Proc. to Acous. and Audio WASPAA-07, Mohonk NY, October 2007, pp. 114-117.
M. Athineos and D. Ellis (2007) Autoregressive Modeling of Temporal Envelopes. IEEE Tr. Signal Processing, vol. 55 no. 11, Nov 2007, pp. 5237-5245. (9pp)

2006

R. Weiss and D. Ellis (2006) Estimating single-channel source separation masks: Relevance Vector Machine classifiers vs. pitch-based masking. Proc. Workshop on Statistical and Perceptual Audition SAPA-06, pp. 31-36, Pittsburgh PA, Oct 2006. (6pp)
D. Ellis and R. Weiss (2006) Model-Based Monaural Source Separation Using a Vector-Quantized Phase-Vocoder Representation. Proc. ICASSP-06, Toulouse, May 2006, pp. V-957-960. (4pp)
M. Mandel, D. Ellis, and T. Jebara (2006) An EM algorithm for localizing multiple sound sources in reverberant environments. Proc. Neural Info. Proc. Sys., Vancouver CA, Dec 2006. (8pp)
M. Mandel and D. Ellis (2006) A probability model for interaural phase difference. Proc. Workshop on Statistical and Perceptual Audition SAPA-06, pp. 1-6, Pittsburgh PA, Oct 2006. (6pp)
D. Ellis (2006) Model-Based Scene Analysis. Chapter 4 of Computational Auditory Scene Analysis: Principles, Algorithms, and Applications, D. Wang & G. Brown, eds., Wiley/IEEE Press, pp. 115-146, 2006. (46pp)

Acknowledgment

This material is based in part upon work supported by the National Science Foundation under Grant No. IIS-05-35168. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

Last updated: $Date: 2005/08/09 03:26:12 $
Dan Ellis <dpwe@ee.columbia.edu>