[AFRL logo]

Speech Separation and Comprehension in Complex Acoustic Environments
Thu Nov 4 - Sun Nov 7, 2004
Montreal, Quebec
sponsored by the Air Force Office of Scientific Research and the National Science Foundation

[EBIRE logo]

The "Speech" in Speech Separation - A Primer

Presenters: Steve Greenberg, Rich Stern

(NB: Talk titles link to presentation slides.)


Steven Greenberg (The Speech Institute)

A Multi-Tier Theoretical Framework for Understanding Spoken Language

Spoken language is often viewed merely as sequences of words and phonemes. The listener's task is one of decoding the speech signal into its constituent elements derived from spectral decomposition of the acoustic signal. However, under acoustic interference, spectral decomposition is particularly challenging. Future-generation speech separation methods are likely to utilize a more comprehensive set of representational approaches than merely decoding words and phonemes. This presentation outlines a multi-tier theory of spoken language in which utterances are composed not only of words and phones, but also of syllables, articulatory-acoustic features and (most importantly) prosemes, which encapsulate the prosodic pattern in terms of prominence and accent. This multi-tier framework portrays pronunciation variation and the phonetic micro-structure of the utterance with far greater precision than the conventional lexico-phonetic approach, and thereby offers the prospect of improving machine-based recognition and separation systems in the future.
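The multi-tier idea described above can be sketched as a small data structure in which an utterance carries several linked levels of description (words, syllables, phones with articulatory-acoustic features, and a syllable-level prominence tier) rather than a flat phone string. This is an illustrative sketch only; all class and field names are hypothetical and are not taken from Greenberg's formulation.

```python
from dataclasses import dataclass, field

# Illustrative sketch of a multi-tier utterance representation.
# All names are hypothetical, not Greenberg's actual formalism.

@dataclass
class Phone:
    symbol: str           # e.g. "eh"
    features: list        # articulatory-acoustic features, e.g. ["vowel", "front"]

@dataclass
class Syllable:
    phones: list          # list of Phone
    prominence: float     # prosodic prominence on this syllable, 0..1
    accented: bool = False

@dataclass
class Word:
    orthography: str
    syllables: list       # list of Syllable

@dataclass
class Utterance:
    words: list           # list of Word

    def prosodic_contour(self):
        """Prominence of each syllable in order -- the prosodic pattern tier."""
        return [s.prominence for w in self.words for s in w.syllables]
```

A usage example: the word "seven" as two syllables, the first accented, yields the contour [0.9, 0.2].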

Papers:


Richard Stern (Carnegie Mellon University)

Signal processing for sound separation and robust representation

In recent years there has been renewed interest in the development of signal-processing approaches motivated by human auditory perception that provide a more robust representation of speech signals, facilitate the separation of competing streams of signals, or provide features that improve the robustness and recognition accuracy of speech recognition systems. While the literature of physiologically-motivated signal processing is daunting in its vastness, a number of common themes are frequently observed among the competing models and representations. This talk will review and comment on current trends and algorithms that have been proposed at both the peripheral and more central levels for general robust speech representation, signal separation, and representations for automatic speech recognition. We will discuss and comment on some of the important unresolved problems in physiologically-motivated speech representations, and we will speculate on some of the reasons why physiologically-motivated representations have up to now enjoyed only limited success in reducing error rates in automatic speech recognition.
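One common theme in peripheral auditory models of the kind the abstract mentions is a bank of bandpass filters whose center frequencies and bandwidths follow the cochlea's frequency resolution. The sketch below, assuming NumPy, implements a standard example of this idea: a gammatone filterbank with center frequencies spaced on the ERB-rate scale. It illustrates the general technique, not any specific algorithm from the talk.

```python
import numpy as np

def erb(f):
    """Equivalent rectangular bandwidth of the auditory filter at f Hz
    (Glasberg & Moore formula)."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_filterbank(signal, fs, n_channels=16, f_lo=100.0, f_hi=4000.0):
    """Filter `signal` (1-D array, sample rate fs) through an ERB-spaced
    bank of 4th-order gammatone filters; returns (center_freqs, outputs)."""
    # Center frequencies equally spaced on the ERB-rate scale.
    erb_lo = 21.4 * np.log10(4.37e-3 * f_lo + 1.0)
    erb_hi = 21.4 * np.log10(4.37e-3 * f_hi + 1.0)
    erb_pts = np.linspace(erb_lo, erb_hi, n_channels)
    cfs = (10.0 ** (erb_pts / 21.4) - 1.0) / 4.37e-3

    t = np.arange(int(0.025 * fs)) / fs          # 25 ms impulse response
    out = np.empty((n_channels, len(signal)))
    for i, fc in enumerate(cfs):
        b = 1.019 * erb(fc)                      # per-channel bandwidth
        # 4th-order gammatone: t^3 * exp(-2*pi*b*t) * cos(2*pi*fc*t)
        g = t ** 3 * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
        g /= np.sqrt(np.sum(g ** 2))             # unit-energy normalization
        out[i] = np.convolve(signal, g, mode="same")
    return cfs, out
```

Feeding a pure tone through the bank concentrates energy in the channel whose center frequency lies closest to the tone, which is the basic behavior such representations exploit for separating spectrally distinct sources.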

Relevant Material:

Other Material: