[AFRL logo]

Speech Separation and Comprehension in Complex Acoustic Environments
Thu Nov 4 - Sun Nov 7, 2004
Montreal, Quebec
sponsored by the Air Force Office of Scientific Research and the National Science Foundation

[EBIRE logo]

Brief Introductory Remarks to Areas of Speech Separation: Objectives, Advantages, and Problems

Chair: Pierre Divenyi

Participants: A. Bronkhorst, M. Cooke, T.-W. Lee, R. Zatorre, D. Ellis, N. Durlach


Session overview:


A.W. Bronkhorst, TNO Human Factors, Soesterberg, The Netherlands

Human single-channel and spatial performance

Given that the peripheral auditory system has a tonotopic but no spatiotopic organization, one might think that it is not optimally suited for separating sound sources. However, spatial cues are of limited use in realistic environments where reflections and background noise occur: they can be ambiguous or even masked in such conditions, and separation therefore also relies on several other cues, in particular (single-channel) spectral and temporal information. The performance of human listeners in speech separation tasks is normally measured by scoring the percentage of correctly reproduced speech items. It is important to distinguish two different effects that can account for reduced performance in this task: (energetic) masking and stimulus ambiguity (also called informational masking). Energetic masking has been studied extensively; currently, there are models that can calculate its effect when the frequency spectra of the target speech and the interferer(s), and their interaural phase differences, are known. Interest in informational masking is more recent, and although it is still not well understood, it is clear that it depends in a fundamentally different way on binaural, spectral and temporal cues. For example, a significant binaural release from informational masking can occur in conditions where there is no release from energetic masking.
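The energetic-masking models mentioned above take band-level spectra of the target and interferer(s) as input. A minimal articulation-index-style sketch of that idea is shown below; it is a deliberate simplification, not any specific published model, and the function name and the flat default band weights are illustrative assumptions:

```python
import numpy as np

def band_audibility_index(target_db, masker_db, weights=None):
    """Articulation-index-style estimate of energetic masking.

    target_db, masker_db: per-band levels (dB) of the target speech and
    the interferer. Band SNRs are clipped to the conventional
    [-15, +15] dB range, mapped linearly to [0, 1] audibility, and
    combined with band-importance weights (uniform by default).
    """
    target_db = np.asarray(target_db, dtype=float)
    masker_db = np.asarray(masker_db, dtype=float)
    snr = np.clip(target_db - masker_db, -15.0, 15.0)
    audibility = (snr + 15.0) / 30.0  # 0 = fully masked, 1 = fully audible
    if weights is None:
        weights = np.full(audibility.shape, 1.0 / audibility.size)
    return float(np.sum(np.asarray(weights) * audibility))
```

Equal target and masker levels in every band give an index of 0.5; a target 15 dB or more above the masker everywhere gives 1.0. Real models, such as the Speech Intelligibility Index, additionally account for hearing thresholds and, in binaural extensions, for the interaural phase differences noted above.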

Literature:

  • Arbogast, T.L., Mason, C.R., and Kidd, G., Jr. (2002). "The effect of spatial separation on informational and energetic masking of speech," J. Acoust. Soc. Am. 112, 2086-2098.
  • Bronkhorst, A.W. (2000). "The cocktail party phenomenon: A review of research on speech intelligibility in multiple-talker conditions," Acustica united with Acta Acustica 86, 117-128.
  • Brungart, D.S., and Simpson, B.D. (2001). "Informational and energetic masking effects in the perception of multiple simultaneous talkers," J. Acoust. Soc. Am. 110, 2527-2538.
  • Freyman, R.L., Balakrishnan, U., and Helfer, K.S. (2001). "Spatial release from informational masking in speech recognition," J. Acoust. Soc. Am. 109, 2112-2122.
  • Zurek, P.M. (1993). "Binaural advantages and directional effects in speech intelligibility," in Acoustical Factors Affecting Hearing Aid Performance, G.A. Studebaker and I. Hochberg (Eds.), Allyn and Bacon, Boston, pp. 255-276.

Martin Cooke

Machine separation using human models

Practically all human-inspired computational approaches to the separation of speech and other sources take Bregman's auditory scene analysis (ASA) account as their guide. ASA distinguishes between primitive (bottom-up) and schema-driven (top-down) processes in perceptual auditory organisation. Computational ASA (CASA) has focussed almost exclusively on the former, yet there is good evidence that prior knowledge of the signals to be separated plays a large role. In this overview, I will review the principal bottom-up approaches and go on to examine ways to involve top-down information in the separation process. I will discuss a recent speech decoding architecture which efficiently integrates coherent but fragmentary evidence for speech in an arbitrarily complex background.
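As a concrete illustration of a primitive (bottom-up) grouping cue, harmonicity can be applied by testing whether each spectral component lies close to an integer multiple of a candidate fundamental. The sketch below is illustrative code of my own, not taken from any specific CASA system; the function name and the 4% tolerance are assumptions:

```python
def group_by_harmonicity(component_freqs, f0, tol=0.04):
    """Primitive grouping by harmonicity: a spectral component is assigned
    to the source with fundamental f0 if it lies within a relative
    tolerance of some integer multiple of f0; the rest are left ungrouped.
    """
    grouped, ungrouped = [], []
    for f in component_freqs:
        n = max(1, round(f / f0))  # nearest harmonic number
        if abs(f - n * f0) <= tol * n * f0:
            grouped.append(f)
        else:
            ungrouped.append(f)
    return grouped, ungrouped
```

With f0 = 100 Hz, components at 100, 205, and 300 Hz are grouped together (205 Hz lies within 4% of the second harmonic), while a component at 470 Hz is left for other sources. Schema-driven (top-down) processing would then operate on such coherent fragments.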


Te-Won Lee (Institute for Neural Computation, University of California, San Diego)

Blind Machine Separation

In recent years, blind source separation (BSS), or independent component analysis (ICA), has attracted growing interest because of its many potential practical applications and a growing number of proposed solutions. The methods most relevant to the speech separation problem perform multichannel blind deconvolution, in which, ideally, the room transfer functions are estimated so that their inverse can be applied to recover the source signals. I will briefly summarize approaches to multichannel blind deconvolution, along with their strengths and limitations. Other ICA-based methods relevant for speech separation learn a representation of speech signals and use that representation for separation in single-microphone or multiple-microphone settings. I will also summarize recent approaches to single-channel source separation.
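For the instantaneous (non-convolutive) special case, the core ICA idea can be sketched in a few lines: whiten the mixtures, then find the rotation that maximises the non-Gaussianity of the outputs, here measured by summed absolute excess kurtosis. This is a toy two-source illustration under my own simplifying assumptions, not the multichannel blind deconvolution methods discussed in the talk:

```python
import numpy as np

def ica_two_sources(x, n_angles=360):
    """Toy ICA for two instantaneously mixed signals (rows of x):
    whiten, then grid-search the rotation angle that maximises the
    summed absolute excess kurtosis of the rotated outputs.
    """
    x = x - x.mean(axis=1, keepdims=True)
    d, e = np.linalg.eigh(np.cov(x))
    z = e @ np.diag(1.0 / np.sqrt(d)) @ e.T @ x  # whitened mixtures
    best_y, best_score = z, -np.inf
    for theta in np.linspace(0.0, np.pi / 2, n_angles, endpoint=False):
        c, s = np.cos(theta), np.sin(theta)
        y = np.array([[c, -s], [s, c]]) @ z
        kurt = np.mean(y ** 4, axis=1) / np.mean(y ** 2, axis=1) ** 2 - 3.0
        score = np.abs(kurt).sum()
        if score > best_score:
            best_score, best_y = score, y
    return best_y  # sources recovered up to permutation, sign, and scale
```

Mixing, say, a sine wave with uniform noise and passing the mixtures through this function recovers both signals up to permutation and sign. Real room recordings are convolutive mixtures, which is exactly why the multichannel blind deconvolution framework described above estimates transfer functions rather than a single mixing matrix.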


Robert Zatorre (McGill)

Auditory cortex: Stimulus analyzer and generator of meaning


Dan Ellis (Columbia University)

Recognition and learning as tools of machine separation

The acoustic signal separation task presents two major problems. The first is inadequate constraints: even in the simplest separation problem of two signals mixed into two observations, there are infinitely many decompositions of the observations into candidate sources. We need additional constraints to guide the choice among these alternatives, and the richest approach to deriving such constraints is to observe examples of the outputs we expect and learn their properties. The second major problem is signal obliteration: the energy of one source can be completely overwhelmed by interference (at least over part of its extent), so that even if we know a lot about the interference and mixing characteristics, we still cannot reconstruct the obliterated portions. In this case, learning the characteristics of signals similar to the one we wish to extract can allow us to reconstruct these missing parts of the signal. This is equivalent to a system whose output is a more abstract description of the sources than a raw waveform, such as a classification into a set of discrete classes or states. This "recognition" approach may in many cases be a more appropriate formulation of the signal separation problem.
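The reconstruction-from-learned-models idea can be illustrated with a deliberately simple sketch: match a partially observed spectral frame against a codebook of clean speech spectra using only the reliable (unmasked) bins, then fill the obliterated bins from the best match. The function and codebook here are my own illustrative assumptions, not a system described in the talk:

```python
import numpy as np

def impute_masked_bins(frame, reliable, codebook):
    """Missing-data reconstruction: find the codebook entry closest to
    the frame over the reliable bins, and copy its values into the
    masked (unreliable) bins of the frame.
    """
    frame = np.asarray(frame, dtype=float)
    reliable = np.asarray(reliable, dtype=bool)
    codebook = np.asarray(codebook, dtype=float)
    # Distances are computed over reliable bins only.
    dists = ((codebook[:, reliable] - frame[reliable]) ** 2).sum(axis=1)
    best = codebook[np.argmin(dists)]
    out = frame.copy()
    out[~reliable] = best[~reliable]
    return out
```

In the "recognition" formulation, the system would output the index of the best-matching entry, a discrete class or state, rather than the reconstructed waveform-level frame.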


Nat Durlach (BU&MIT)

Exploiting Human-Machine (HM) Collaboration to Achieve Superior Source-Separation-and-Comprehension (SS&C)

Currently, humans enter into the development of more effective machine-only SS&C systems in two ways:

  1. They apply their general engineering abilities and resources to create useful SS&C machines;
  2. They acquire scientific knowledge about how (and how well) humans perform SS&C using their own biological apparatus, to serve as inspiration for machine design.

Relatively little attention is being given to serious deficiencies in human SS&C processing, to how these deficiencies relate to current deficiencies in machine-only SS&C processing, and to the exploration of collaborative HM systems that integrate humans and machines at the component level (to achieve 'supernormal' systems).

In this introductory presentation, we outline the complementarity of human and machine processing advantages/deficiencies and consider some of the opportunities and challenges associated with attempts to develop and test HM SS&C systems. The presentation then concludes with an outline of the topics to be focused upon in the session on these systems.