[AFRL logo]

Speech Separation and Comprehension in Complex Acoustic Environments
Thu Nov 4 - Sun Nov 7, 2004
Montreal, Quebec
sponsored by the Air Force Office of Scientific Research and the National Science Foundation

[EBIRE logo]

Evaluation of Speech Separation Systems and Corpus Development

Chair: Dan Ellis

Participants: Martin Cooke, Alex Acero, Douglas Brungart, Lucas Parra (also representing Te-Won Lee)

(NB: Talk titles link to presentation slides.)

Outline

  • The Speech Recognition Experience (Alex Acero)
    • A brief history of evaluations in speech recognition
      • NIST/DARPA tasks and results
      • The Aurora tasks
    • Benefits of evaluation
      • identification of useful techniques
      • small improvements accumulate into large gains
      • sustained support
    • Negative results of evaluation
      • reduced diversity in systems
      • neglect of unevaluated aspects (noise robustness, computational efficiency)

  • Evaluating Blind Source Separation Systems (Lucas Parra and Te-Won Lee)
    • Tasks and metrics used in evaluating BSS/ICA algorithms (one common metric is sketched below)
      • examples and experiences
      • issues and tradeoffs
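
To make the metrics discussion concrete, here is a minimal sketch (Python/NumPy; the function name and the toy signals are illustrative, not taken from the talk) of a projection-based signal-to-interference ratio, one simple figure of merit for a separated output when the reference sources are known:

    import numpy as np

    def sir_db(estimate, target, interferer):
        """Signal-to-interference ratio (dB) of one separated output,
        scored by least-squares projection onto the known references."""
        def project(x, ref):
            # component of x explained by ref
            return (np.dot(x, ref) / np.dot(ref, ref)) * ref
        e_target = project(estimate, target)
        e_interf = project(estimate, interferer)
        return 10 * np.log10(np.sum(e_target**2) / np.sum(e_interf**2))

    # Toy check: an "estimate" that is mostly target plus 10% interferer leakage
    rng = np.random.default_rng(0)
    target, interf = rng.standard_normal(16000), rng.standard_normal(16000)
    print(f"{sir_db(target + 0.1 * interf, target, interf):.1f} dB")  # about 20 dB

Richer evaluation schemes decompose the estimate further (e.g. into target, interference, and artifact terms), but the underlying energy-ratio idea is the same.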

  • Evaluating Human Performance (Doug Brungart)
    • Scenarios for testing human speech separation
    • Metrics (a scoring sketch follows this list)
    • Issues in working with human subjects
      • training
      • fatigue
      • familiarity
      • ...
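
As a concrete example of such metrics, a small sketch (hypothetical Python; the keyword pairs are illustrative) of the keywords-correct scoring typical of CRM-style multi-talker listening tests, in which subjects report a colour/digit pair on each trial:

    def keywords_correct(responses, answers):
        """Proportion of trials in which both keywords were reported correctly."""
        hits = sum(r == a for r, a in zip(responses, answers))
        return hits / len(answers)

    answers   = [("blue", 2), ("red", 7), ("green", 4)]
    responses = [("blue", 2), ("red", 5), ("green", 4)]
    print(keywords_correct(responses, answers))  # 0.667 (2 of 3 trials)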

  • Speech Tasks for Human/Machine Comparisons (Martin Cooke)
    • The multispeaker-babble continuum as an evaluation paradigm for machine and human speech separation
    • The Grid Task
      • design goals
      • proposed data preparation
      • proposed tasks

  • Towards a Common Speech Separation Evaluation (Dan Ellis)
    • The importance of evaluation
      • ... to compare different approaches
      • ... to guide progress/optimization
      • ... to give confidence to funding bodies
    • Obstacles to establishing evaluation standards
      • disagreements over the task
      • disagreements over the metric
      • difficulties providing ground-truth data
    • Agreeing a task
      • what are the scenarios we'd like to solve?
    • Agreeing a metric
      • what are the important things to achieve?

  • Open discussion

References:


Martin Cooke

GRID: an audio-visual corpus for research in speech perception and automatic speech recognition

Microscopic models of speech perception predict listeners' responses to individual (usually noisy) speech tokens. Such models promise greater insights into human speech perception than their macroscopic cousins, which can only predict overall intelligibility. However, no collection of speech material suitable for joint work in modelling and perception testing exists at present. Corpora collected for speech perception tend to be too small to allow training of speech recognisers, while those used for ASR work are usually inappropriate for presentation to listeners. As a consequence, models of speech perception are typically based on tiny amounts of training material and non-state-of-the-art learning algorithms. The GRID corpus is a first step towards providing speech material suitable for both modelling and listening tests. GRID builds on the CRM corpus but corrects the latter's small size and lack of phonetic balance. At the same time, both audio and visual (facial) material will be collected, making up for the absence of large, affordable audio-visual corpora. Collection of the corpus is scheduled for Q4 2004, with analysis and release in Q1 and Q2 2005. In this talk, I'll describe the rationale for and detailed design of GRID, and outline progress on its collection.
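
For reference, the fixed six-position sentence frame of the GRID design (word classes as given in the eventual corpus description; this little generator is only an illustration) can be sketched as:

    import random

    # Word classes of the six-position GRID sentence frame,
    # e.g. "place blue at F 2 now" (W is excluded from the letter slot)
    COMMANDS     = ["bin", "lay", "place", "set"]
    COLOURS      = ["blue", "green", "red", "white"]
    PREPOSITIONS = ["at", "by", "in", "with"]
    LETTERS      = list("ABCDEFGHIJKLMNOPQRSTUVXYZ")
    DIGITS       = [str(d) for d in range(10)]
    ADVERBS      = ["again", "now", "please", "soon"]

    def grid_sentence():
        """Draw one sentence at random from the GRID grammar."""
        return " ".join(random.choice(c) for c in
                        (COMMANDS, COLOURS, PREPOSITIONS,
                         LETTERS, DIGITS, ADVERBS))

    print(grid_sentence())  # e.g. "set green by T 4 please"

The letter and digit slots, which offer the most alternatives, typically serve as the scored keywords in listening tests, while the remaining slots keep the grammar simple, mirroring the CRM design that GRID builds on.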