[AFRL logo]

Speech Separation and Comprehension in Complex Acoustic Environments
Thu Nov 4 - Sun Nov 7, 2004
Montreal, Quebec
sponsored by the Air Force Office of Scientific Research and the National Science Foundation

[EBIRE logo]

Exploiting Human-Machine (HM) Collaboration to Achieve Superior Source-Separation-and-Comprehension (SS&C) Systems

Chair: Nat Durlach

Participants: Barbara Shinn-Cunningham, Jay Desloge, Betty Tuller, Abeer Alwan, Sumit Basu



INTRODUCTION

Nat Durlach

Currently, humans contribute to the development of more effective machine-only SS&C systems in two ways:

  1. They apply their general engineering abilities and resources to create improved SS&C machines;
  2. They acquire scientific knowledge about how (and how well) humans perform SS&C with their own biological apparatus, and use this knowledge as inspiration for machine design.

Relatively little attention is being given to (a) serious deficiencies in human SS&C processing; (b) how these deficiencies in human processing relate to current deficiencies in machine-only SS&C processing; and (c) the exploration of collaborative HM systems that integrate humans and machines at the component level to achieve SS&C systems that are superior to both machine-only and human-only systems.

In this introductory presentation, we outline the complementary advantages and disadvantages of human and machine processing and consider some of the opportunities and challenges associated with attempts to develop and test HM SS&C systems. The presentation concludes with an outline of the topics on which the session will focus.


INTERFACING WITH THE MACHINE

Sumit Basu and Jay Desloge

Although the current performance of source separation and comprehension systems is still quite limited, there are a variety of ways in which they can interact with a human operator to enhance both human capabilities and machine operation. The use of machines to expand human capabilities falls roughly into four broad categories:

  1. Attentive filtering: finding things worthy of attention in complex auditory environments;
  2. Sensor multiplication: dealing with many more sensors than ears/eyes;
  3. Going beyond the human scale: micro- and macro-sized arrays, and sensors with super- or sub-human sensitivity in frequency and magnitude;
  4. Multimodal fusion: using other sensors, such as cameras and laser range finders, to guide beamforming and vice versa.

The use of human cognitive ability to enhance machine performance falls roughly into three broad categories (a sketch of the first category follows these lists):

  1. Focus control: concentrating machine attention on the most relevant sources;
  2. Environmental awareness: adjusting machine parameters to suit different environments;
  3. Calibration: setting up systems, particularly mobile systems, for operation.
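As a purely hypothetical illustration of the focus-control category, the sketch below has a human operator steer a delay-and-sum beamformer over a small linear microphone array. Every name and parameter in it is an assumption for illustration, not part of any system discussed in the talk.

  # Minimal sketch of "focus control": a human operator picks a look
  # direction, and a delay-and-sum beamformer steers a linear microphone
  # array toward it. All names and parameters are illustrative.
  import numpy as np

  SPEED_OF_SOUND = 343.0  # m/s

  def delay_and_sum(signals, mic_positions, look_angle_deg, fs):
      """Steer a linear array toward look_angle_deg (0 = broadside).

      signals:       (n_mics, n_samples) array of time-aligned recordings
      mic_positions: (n_mics,) mic coordinates along the array axis, meters
      """
      angle = np.deg2rad(look_angle_deg)
      # Far-field plane-wave model: per-mic arrival delay in seconds.
      delays = mic_positions * np.sin(angle) / SPEED_OF_SOUND
      n = signals.shape[1]
      freqs = np.fft.rfftfreq(n, d=1.0 / fs)
      out = np.zeros(n)
      for sig, tau in zip(signals, delays):
          # Undo each channel's delay (fractional, via FFT phase shift),
          # so energy from the chosen direction adds coherently.
          spectrum = np.fft.rfft(sig) * np.exp(2j * np.pi * freqs * tau)
          out += np.fft.irfft(spectrum, n)
      return out / signals.shape[0]

  # The human in the loop: the operator sweeps the look direction and
  # keeps whichever steering angle makes the target source most audible.
  fs = 16000
  mics = np.array([-0.05, 0.0, 0.05])      # 3 mics, 5 cm spacing
  recordings = np.random.randn(3, fs)      # stand-in for real array data
  focused = delay_and_sum(recordings, mics, look_angle_deg=20.0, fs=fs)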

In this presentation, we will develop these categories and go through a variety of motivating scenarios in which these human-machine interactions greatly improve combined system performance. We will then discuss how we can design interactive systems for such scenarios. This will lead us to new challenges in the sensing, control, and display of the vast array of information available to the systems.


ADAPTATION AND PERCEPTUAL LEARNING

Betty Tuller

In designing collaborative human-machine (HM) systems, it is necessary to consider how task performance is likely to depend on the amount and type of experience the human operator has with the system. For example, whereas the performance of an HM system designed to be as "natural" as possible may initially be better than that of an HM system designed to facilitate "supernormal" performance, after appropriate experience with the two systems the performance of the latter system may prove superior. Because such performance crossovers can have important practical consequences, it is essential that system designers include consideration of experiential factors (sensorimotor adaptation, perceptual and cognitive learning, training effects, etc.) when selecting a system design (or performing a system evaluation).

In this talk, I will first outline some of the classical results in the area of perceptual learning and training. In particular, I will discuss the classical learning curve: how it was derived, under what conditions, and whether it adequately describes how individuals adapt and learn. Following these comments, I will discuss how this picture is being modified by recent research that focuses on the processes involved in adaptation and learning. Specifically, I will address (1) individual differences in initial abilities and how these affect learning over time; (2) transfer of learning to novel contexts; and (3) variables that may facilitate learning and adaptation. The last two topics include treating variability as informative about task performance rather than simply as a source of noise.
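Although the talk does not commit to a particular model, the classical learning curve is often formalized as the power law of practice, in which time per trial falls as T(n) = a * n**(-b). The sketch below, using made-up data, shows how such a curve can be fit in log-log coordinates; all values are illustrative assumptions.

  # Illustrative only: the power law of practice, T(n) = a * n**(-b),
  # where T(n) is performance time on the n-th practice trial. Fitting
  # is linear in log-log space: log T = log a - b * log n.
  import numpy as np

  def fit_power_law(trial_times):
      """Least-squares fit of T(n) = a * n**(-b) to a sequence of trial times."""
      n = np.arange(1, len(trial_times) + 1)
      slope, intercept = np.polyfit(np.log(n), np.log(trial_times), 1)
      return np.exp(intercept), -slope  # (a, b)

  # Synthetic data: a = 10 s initial time, b = 0.4 learning rate, plus noise.
  rng = np.random.default_rng(0)
  trials = 10.0 * np.arange(1, 201) ** -0.4 * rng.lognormal(0, 0.05, 200)
  a, b = fit_power_law(trials)
  print(f"a = {a:.2f} s, b = {b:.2f}")  # should recover roughly (10, 0.4)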


MAGNIFICATION OF STREAM DIFFERENCES

Given access to separated speech streams, one can attempt to maintain the separation when presenting the streams to a human listener by magnifying the perceptual differences among them. In the last two talks of this session, we consider magnification of (1) voice differences and (2) spatial differences.

VOICE DIFFERENCES

Abeer Alwan

In this talk, we summarize natural differences among voices, discuss methods of transforming speech signals to accentuate voice differences, and speculate about the perceptual effects of such transformations. Implications for signal separation, intelligibility, and talker identification will be discussed.

Voices differ for physiological, behavioral, and linguistic reasons, and the differences are manifested in both the spectral and temporal domains. Because vocal-tract shape and geometry differ across talkers, so do the resonant frequencies of the vocal tract (the formant frequencies). For example, females have shorter vocal tracts than males and, hence, tend to have higher formant frequencies. Differences in the vocal folds are manifested as differences in fundamental frequency (F0); for example, children have a much higher F0 than adults. There may also be more subtle differences, such as in the locations of subglottal resonances. The way the vocal folds vibrate affects other voice-quality measures, such as the open quotient and spectral-tilt parameters of the underlying source model. Prosodic features (e.g., those related to a talker's pitch track) and lexical features (e.g., the relative frequency of disfluencies), together with articulatory timing information (e.g., phone durations), are also speaker specific. Differences in linguistic backgrounds and dialects can further affect the spectro-temporal patterns of speech signals. Both types of measures (temporal and spectral) are utilized by listeners and by machines in speaker-recognition tasks. While the perceptual effects of manipulating some of these spectral and temporal parameters are known (for example, the JNDs for formant frequencies), there has been no systematic study of the perceptual correlates of many other speaker-specific features.
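To make the source-filter account above concrete, the following hypothetical sketch synthesizes the "same" vowel for two talkers who differ in F0 (vocal-fold vibration rate) and formant frequencies (vocal-tract resonances). The parameter values are rough, textbook-style assumptions, not data from the talk.

  # Illustrative source-filter sketch: one vowel, two talkers who differ
  # in fundamental frequency (F0) and formant frequencies.
  import numpy as np
  from scipy.signal import lfilter

  FS = 16000  # sampling rate, Hz

  def synth_vowel(f0, formants, bandwidths, dur=0.5):
      """Impulse-train glottal source filtered by cascaded formant resonators."""
      n = int(FS * dur)
      # Source: impulse train at F0 (crude model of vocal-fold vibration).
      source = np.zeros(n)
      source[::int(FS / f0)] = 1.0
      out = source
      # Filter: one two-pole resonator per formant (vocal-tract resonance).
      for f, bw in zip(formants, bandwidths):
          r = np.exp(-np.pi * bw / FS)
          theta = 2 * np.pi * f / FS
          out = lfilter([1.0], [1.0, -2 * r * np.cos(theta), r * r], out)
      return out / np.max(np.abs(out))

  # Typical adult male /a/: lower F0, lower formants (longer vocal tract).
  male = synth_vowel(f0=120, formants=(700, 1200, 2500), bandwidths=(80, 90, 120))
  # Typical adult female /a/: higher F0 and formants (shorter vocal tract).
  female = synth_vowel(f0=220, formants=(850, 1450, 2900), bandwidths=(80, 90, 120))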


SPATIAL DIFFERENCES

Barbara Shinn-Cunningham

Spatial separation between competing sound sources is known to improve a listener's ability to detect and understand a source of interest. In designing a good human-machine interface, spatial cues should undoubtedly be one of the tools used to help the listener manage competition between sound sources. In fact, when designing a display that mixes competing sources for presentation to a human operator, one has the freedom to use spatial cues that do not occur naturally, perhaps enhancing the perceptual benefits of spatial separation. Because there appear to be multiple mechanisms through which spatial separation of competing sources benefits perception, the effect of magnifying spatial cues is likely to be complex and to depend on the number and nature of the competing sources presented to the listener. Moreover, few studies even begin to address how humans might use "enhanced" spatial cues. The goal of this presentation is to explore whether one might obtain even larger benefits from the spatial separation of acoustic sources by using a magnified range of spatial cues.
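As a hedged illustration of what "magnified" spatial cues might mean in practice, the sketch below lateralizes separated streams over headphones using interaural time and level differences (ITD/ILD) scaled beyond their natural range. The scaling scheme and all parameter values are assumptions for illustration only, not a method from the talk.

  # Hedged sketch of magnified spatial cues: lateralize a mono stream
  # over headphones with an interaural time difference (ITD) and an
  # interaural level difference (ILD), optionally scaled past the
  # natural range. All constants below are illustrative.
  import numpy as np

  FS = 44100
  MAX_NATURAL_ITD = 0.00066  # ~660 microseconds, roughly the human maximum

  def lateralize(mono, azimuth_deg, magnify=1.0):
      """Return (left, right) with ITD/ILD for a source at azimuth_deg.

      magnify > 1 stretches the cues beyond what natural acoustics produce.
      """
      frac = np.sin(np.deg2rad(azimuth_deg))      # -1 (left) .. +1 (right)
      itd_samples = int(round(MAX_NATURAL_ITD * FS * frac * magnify))
      gain = 10 ** (6.0 * frac * magnify / 20.0)  # ~6 dB broadband ILD, magnified
      left, right = mono / gain, mono * gain
      # Delay the far ear by |itd_samples|.
      pad = np.zeros(abs(itd_samples))
      if itd_samples > 0:   # source on the right: left ear lags
          left, right = np.concatenate([pad, left]), np.concatenate([right, pad])
      else:                 # source on the left: right ear lags
          left, right = np.concatenate([left, pad]), np.concatenate([pad, right])
      return left, right

  # Two competing streams, pushed to exaggerated opposite sides.
  a = np.random.randn(FS)                 # stand-ins for separated speech
  b = np.random.randn(FS)
  aL, aR = lateralize(a, azimuth_deg=30, magnify=2.0)
  bL, bR = lateralize(b, azimuth_deg=-30, magnify=2.0)
  mixL, mixR = aL + bL, aR + bR           # present over headphones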