Speech Separation and Comprehension in Complex Acoustic Environments
Variations in Design and Performance of Sensing Arrays

Chairs: Steve Colburn, Te-Won Lee
Presenters (machine session): DeLiang Wang, Jay Desloge, Te-Won Lee, Jim Flanagan
(NB: Talk titles link to presentation slides.)
Presenters (perceptual session):
Monaural and Binaural Speech Separation

In this presentation, I will illustrate how to perform speech separation using perceptually based monaural and binaural analysis. For monaural separation, I'll describe algorithms based on auditory scene segmentation, pitch tracking, onset/offset analysis, and amplitude modulation analysis. For binaural separation, I'll present a supervised learning approach that estimates ideal binary time-frequency masks in the joint feature space of ITD (interaural time difference) and IID (interaural intensity difference); a toy sketch of this idea appears after the abstracts below. I will also discuss the relative strengths and weaknesses of monaural versus binaural processing, as well as microphone array techniques.

Directional multimicrophone arrays: a spatial-filtering approach to source separation

In this presentation, I will discuss the use of M-element microphone arrays to create adaptive spatial filters that extract specific sources from within complex acoustic environments. I will examine the performance realistically attainable with these systems, in terms of both source localization and source separation, and I will compare spatial filtering to other multi-sensor techniques (most notably independent component analysis, or ICA) to clarify the strengths and weaknesses of spatial filtering when applied to this task. A minimal beamforming sketch follows the abstracts below.

ICA-based Techniques for Single Channel and Multichannel Speech Separation

I will briefly summarize approaches to speech separation based on ICA techniques, including methods for multichannel blind deconvolution as well as recently proposed methods for single-channel blind source separation. I will illustrate the relevance of the machine learning framework for learning a representation of speech signals and other sounds. The use of a probabilistic graphical model allows a principled and systematic approach to the speech separation problem. A toy ICA sketch appears after the abstracts below.

Spatial Selectivity for Speech Separation

As capabilities advance for natural communication with complex systems, hands-free capture of sound grows in interest. Multimodal interfaces, mobile communication, and large-group conferencing are venues where hands/eyes-busy tasks are conducted, and where hand-held or body-worn microphones are inconvenient. Hands-free sound capture ideally requires accurate source localization (preferably in three dimensions) and good-quality transduction of the located source (again with three-dimensional selectivity). This talk outlines techniques, challenges, and the status of research on employing spatial selectivity for sound separation.
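To make the binaural part of the first abstract concrete, here is a minimal sketch of supervised binary-mask estimation in the joint (ITD, IID) feature space. Everything specific in it is an assumption, not taken from the talk: the source positions, the Gaussian feature distributions, and the nearest-class-mean classifier, which stands in for whatever supervised learner the actual system uses.

# Hedged sketch: classify each time-frequency unit as target-dominated (1)
# or interference-dominated (0) from its ITD/IID features, given training
# units labeled by the ideal binary mask. All numbers below are invented.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: target near 0 deg azimuth (ITD ~ 0 s, IID ~ 0 dB),
# interferer off to one side (ITD ~ 0.4 ms, IID ~ 6 dB). Hypothetical values.
n = 2000
target_feats = np.column_stack([rng.normal(0.0, 0.05e-3, n),
                                rng.normal(0.0, 1.0, n)])
interf_feats = np.column_stack([rng.normal(0.4e-3, 0.05e-3, n),
                                rng.normal(6.0, 1.0, n)])
X = np.vstack([target_feats, interf_feats])
y = np.concatenate([np.ones(n), np.zeros(n)])   # ideal-binary-mask labels

# "Training": per-class means in a rescaled space (ITD and IID have very
# different ranges, so normalize by the feature standard deviations).
scale = X.std(axis=0)
mu1 = (X[y == 1] / scale).mean(axis=0)
mu0 = (X[y == 0] / scale).mean(axis=0)

def estimate_mask(itd, iid):
    """Label each T-F unit 1 (keep) or 0 (discard) by nearest class mean."""
    f = np.column_stack([np.ravel(itd), np.ravel(iid)]) / scale
    d1 = np.sum((f - mu1) ** 2, axis=1)
    d0 = np.sum((f - mu0) ** 2, axis=1)
    return (d1 < d0).astype(int).reshape(np.shape(itd))

# Apply to a hypothetical grid of T-F units (64 channels x 50 frames).
itd_grid = rng.normal(0.2e-3, 0.2e-3, (64, 50))
iid_grid = rng.normal(3.0, 3.0, (64, 50))
mask = estimate_mask(itd_grid, iid_grid)
print("fraction of units assigned to target:", mask.mean())

The design choice worth noting is that classification happens independently per time-frequency unit, which is what makes the ideal-binary-mask formulation amenable to ordinary supervised learning.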
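For the spatial-filtering abstract, a delay-and-sum beamformer is the simplest instance of extracting a source by direction with an M-element array; it is not the adaptive design the talk discusses, just the baseline idea. The array geometry, sample rate, and look direction below are assumptions chosen for the demo.

# Hedged sketch: delay-and-sum beamforming with an M-element linear array.
# Signals from the look direction add coherently; noise averages down.
import numpy as np

fs = 16000.0               # sample rate (Hz) -- assumed
c = 343.0                  # speed of sound (m/s)
M = 8                      # number of microphones -- assumed
d = 0.04                   # inter-element spacing (m) -- assumed
theta = np.deg2rad(30.0)   # look direction relative to broadside -- assumed

mic_pos = np.arange(M) * d
delays = mic_pos * np.sin(theta) / c        # per-channel steering delays (s)

def delay_and_sum(x):
    """x: (M, T) microphone signals -> (T,) beamformed output.
    Fractional delays are applied as linear phase in the frequency domain."""
    T = x.shape[1]
    freqs = np.fft.rfftfreq(T, 1.0 / fs)
    X = np.fft.rfft(x, axis=1)
    # Advance each channel so the look-direction wavefront aligns, then average.
    X *= np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    return np.fft.irfft(X.mean(axis=0), n=T)

# Demo: a plane wave from the look direction plus white noise on each channel.
T = 4096
t = np.arange(T) / fs
src = np.sin(2 * np.pi * 500 * t)
x = np.stack([np.interp(t - tau, t, src) for tau in delays])  # delayed copies
x += 0.5 * np.random.default_rng(1).standard_normal(x.shape)
y = delay_and_sum(x)
print("input SNR proxy :", np.var(src) / np.var(x[0] - src))
print("output SNR proxy:", np.var(src) / np.var(y - src))

Running this shows the output SNR improving by roughly a factor of M over a single microphone, which is the basic budget that adaptive spatial filters then try to beat by placing nulls on interferers.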
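Finally, for the ICA abstract, the sketch below separates an instantaneous two-channel mixture with a symmetric FastICA-style iteration. This is deliberately the easy case: the talk covers multichannel blind deconvolution (convolutive mixing) and single-channel methods, which are substantially harder. The Laplacian sources, the mixing matrix, and the tanh nonlinearity are all assumptions for the demo.

# Hedged sketch: ICA on an instantaneous two-channel mixture (FastICA-style).
import numpy as np

rng = np.random.default_rng(2)

# Two super-Gaussian "speech-like" sources, mixed by an unknown matrix A.
T = 20000
S = rng.laplace(size=(2, T))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = A @ S

# Whiten the mixtures (zero mean, identity covariance).
X = X - X.mean(axis=1, keepdims=True)
evals, evecs = np.linalg.eigh(np.cov(X))
Z = (evecs @ np.diag(evals ** -0.5) @ evecs.T) @ X

# Symmetric FastICA iterations with a tanh nonlinearity:
#   W <- E[g(WZ) Z^T] - diag(E[g'(WZ)]) W, followed by decorrelation.
W = rng.standard_normal((2, 2))
for _ in range(200):
    G = np.tanh(W @ Z)
    W_new = (G @ Z.T) / T - np.diag((1 - G ** 2).mean(axis=1)) @ W
    # Symmetric decorrelation: W <- (W W^T)^(-1/2) W, done via SVD.
    U, _, Vt = np.linalg.svd(W_new)
    W = U @ Vt

S_hat = W @ Z
# Each recovered component should correlate strongly with exactly one
# original source, up to the usual permutation and sign ambiguities.
corr = np.corrcoef(np.vstack([S, S_hat]))[:2, 2:]
print(np.round(np.abs(corr), 2))

The permutation and sign ambiguities visible in the correlation check are inherent to ICA, and they are one reason the graphical-model framing mentioned in the abstract is attractive: it gives a principled place to encode prior knowledge about which source is which.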