[AFRL logo]

Speech Separation and Comprehension in Complex Acoustic Environments
Thu Nov 4 - Sun Nov 7, 2004
Montreal, Quebec
sponsored by the Air Force Office of Scientific Research and the National Science Foundation

[EBIRE logo]

The "Speech" in Speech Separation - A Primer

Presenters: Steve Greenberg, Rich Stern

(NB: Talk titles link to presentation slides.)


Steven Greenberg (The Speech Institute)

A Multi-Tier Theoretical Framework for Understanding Spoken Language

Spoken language is often viewed merely as sequences of words and phonemes. The listener's task is one of decoding the speech signal into its constituent elements derived from spectral decomposition of the acoustic signal. However, under acoustic interference, spectral decomposition is particularly challenging. Future-generation speech separation methods are likely to utilize a more comprehensive set of representational approaches than merely decoding words and phonemes. This presentation outlines a multi-tier theory of spoken language in which utterances are composed not only of words and phones, but also of syllables, articulatory-acoustic features and (most importantly) prosemes, which encapsulate the prosodic pattern in terms of prominence and accent. This multi-tier framework portrays pronunciation variation and the phonetic micro-structure of the utterance with far greater precision than the conventional lexico-phonetic approach, and thereby offers the prospect of improving machine-based recognition and separation systems in the future.
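The multi-tier idea described above can be sketched as a small data structure in which an utterance carries several linked levels of description (words, syllables, phones with articulatory-acoustic features, and a syllable-level prominence tier) rather than a flat phone string. This is an illustrative sketch only; all class and field names are hypothetical and are not taken from Greenberg's formulation.

```python
from dataclasses import dataclass, field

# Illustrative sketch of a multi-tier utterance representation.
# All names are hypothetical, not Greenberg's actual formalism.

@dataclass
class Phone:
    symbol: str           # e.g. "eh"
    features: list        # articulatory-acoustic features, e.g. ["vowel", "front"]

@dataclass
class Syllable:
    phones: list          # list of Phone
    prominence: float     # prosodic prominence on this syllable, 0..1
    accented: bool = False

@dataclass
class Word:
    orthography: str
    syllables: list       # list of Syllable

@dataclass
class Utterance:
    words: list           # list of Word

    def prosodic_contour(self):
        """Prominence of each syllable in order -- the prosodic pattern tier."""
        return [s.prominence for w in self.words for s in w.syllables]
```

A usage example: the word "seven" as two syllables, the first accented, yields the contour [0.9, 0.2].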

Papers:


Richard Stern (Carnegie Mellon University)

Signal processing for sound separation and robust representation

In recent years there has been renewed interest in the development of signal-processing approaches motivated by human auditory perception that provide a more robust representation of speech signals, facilitate the separation of competing streams of signals, or provide features that improve the robustness and recognition accuracy of speech recognition systems. While the literature of physiologically-motivated signal processing is daunting in its vastness, a number of common themes are frequently observed among the competing models and representations. This talk will review and comment on current trends and algorithms that have been proposed at both the peripheral and more central levels for general robust speech representation, signal separation, and representations for automatic speech recognition. We will discuss and comment on some of the important unresolved problems in physiologically-motivated speech representations, and we will speculate on some of the reasons why physiologically-motivated representations have up to now enjoyed only limited success in reducing error rates in automatic speech recognition.
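One common theme in peripheral auditory models of the kind the abstract mentions is a bank of bandpass filters whose center frequencies and bandwidths follow the cochlea's frequency resolution. The sketch below, assuming NumPy, implements a standard example of this idea: a gammatone filterbank with center frequencies spaced on the ERB-rate scale. It illustrates the general technique, not any specific algorithm from the talk.

```python
import numpy as np

def erb(f):
    """Equivalent rectangular bandwidth of the auditory filter at f Hz
    (Glasberg & Moore formula)."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_filterbank(signal, fs, n_channels=16, f_lo=100.0, f_hi=4000.0):
    """Filter `signal` (1-D array, sample rate fs) through an ERB-spaced
    bank of 4th-order gammatone filters; returns (center_freqs, outputs)."""
    # Center frequencies equally spaced on the ERB-rate scale.
    erb_lo = 21.4 * np.log10(4.37e-3 * f_lo + 1.0)
    erb_hi = 21.4 * np.log10(4.37e-3 * f_hi + 1.0)
    erb_pts = np.linspace(erb_lo, erb_hi, n_channels)
    cfs = (10.0 ** (erb_pts / 21.4) - 1.0) / 4.37e-3

    t = np.arange(int(0.025 * fs)) / fs          # 25 ms impulse response
    out = np.empty((n_channels, len(signal)))
    for i, fc in enumerate(cfs):
        b = 1.019 * erb(fc)                      # per-channel bandwidth
        # 4th-order gammatone: t^3 * exp(-2*pi*b*t) * cos(2*pi*fc*t)
        g = t ** 3 * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
        g /= np.sqrt(np.sum(g ** 2))             # unit-energy normalization
        out[i] = np.convolve(signal, g, mode="same")
    return cfs, out
```

Feeding a pure tone through the bank concentrates energy in the channel whose center frequency lies closest to the tone, which is the basic behavior such representations exploit for separating spectrally distinct sources.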

Relevant Material:

Other Material: