
Speech Separation and Comprehension in Complex Acoustic Environments
Thu Nov 4 - Sun Nov 7, 2004
Montreal, Quebec
sponsored by the Air Force Office of Scientific Research and the National Science Foundation


Integration of Different Machine Approaches

Chairs: DeLiang Wang, Dan Ellis

Participants: Lawrence Saul, Lucas Parra, Les Atlas


Session overview:

  • Saul on the machine learning perspective
  • Atlas on the signal processing perspective
  • Parra on the microphone array perspective (including ICA)
  • Ellis on the CASA and ASR perspectives

Lucas Parra

Acoustic Source Separation with Microphone Arrays

Blind Source Separation (BSS) has received much attention in the context of acoustic mixtures. Most algorithms that separate convolutive mixtures exploit the spatial selectivity of an array of microphones, so it is natural to place convolutive BSS in the context of traditional beamforming. This talk will review different optimization criteria, including statistical independence, also known as independent component analysis (ICA). The talk will be biased toward frequency-domain implementations, as those tend to be the most efficient. Only algorithms that have shown significant results (i.e., 10-20 dB improvement) in real-world applications will be discussed.

Below are some relevant papers, which can be downloaded from http://newton.bme.columbia.edu/~lparra/publish/

  • Lucas Parra, Christopher Alvino, "Geometric Source Separation: Merging convolutive source separation with geometric beamforming", IEEE Transactions on Speech and Audio Processing, vol. 10, no. 6, pp. 352-362, Sept. 2002.
  • Lucas Parra, Clay Spence, "Convolutive blind source separation of non-stationary sources", IEEE Transactions on Speech and Audio Processing, pp. 320-327, May 2000.
  • Lucas Parra, Paul Sajda, "Blind Source Separation via Generalized Eigenvalue Decomposition", Journal of Machine Learning Research, vol. 4, pp. 1261-1269, 2003.
  • Craig Fancourt, Lucas Parra, "The Generalized Sidelobe Decorrelator", IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 167-170, Oct. 2001.
  • Craig Fancourt, Lucas Parra, "A comparison of decorrelation criteria for the blind source separation of non-stationary signals", IEEE Sensor Array and Multichannel Signal Processing Workshop, Rosslyn, VA, pp. 165-168, Aug. 2002.
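
As a rough illustration of one such optimization criterion, the sketch below (a minimal, hypothetical example, not code from the papers) separates a two-channel instantaneous mixture by generalized eigendecomposition of covariance matrices taken from two time segments, in the spirit of the Parra and Sajda paper above; the convolutive algorithms discussed in the talk apply this kind of joint decorrelation per frequency bin of an STFT.

    # Minimal sketch: BSS of non-stationary sources via generalized
    # eigendecomposition (cf. Parra & Sajda 2003), instantaneous case.
    # The convolutive case would run this per STFT frequency bin.
    import numpy as np
    from scipy.linalg import eigh

    def gevd_separate(x):
        """x: (channels, samples) mixtures.
        Non-stationarity makes the two segment covariances differ,
        and their generalized eigenvectors jointly diagonalize both."""
        x = x - x.mean(axis=1, keepdims=True)
        half = x.shape[1] // 2
        R1 = np.cov(x[:, :half])          # covariance of first segment
        R2 = np.cov(x[:, half:])          # covariance of second segment
        _, V = eigh(R1, R2)               # solves R1 v = lambda R2 v
        return V.T @ x                    # sources, up to scale and order

    # Toy demo: two sources with opposite power ramps, random 2x2 mixing
    rng = np.random.default_rng(0)
    n = 20000
    s = np.vstack([rng.normal(size=n) * np.linspace(0.2, 2.0, n),
                   rng.normal(size=n) * np.linspace(2.0, 0.2, n)])
    mixed = rng.normal(size=(2, 2)) @ s
    recovered = gevd_separate(mixed)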

Lawrence Saul

Machine Learning and Auditory Scene Analysis

How can we integrate the latest advances in machine learning into systems for auditory scene analysis and speech separation? The main challenge is to develop representations of the acoustic signal that can be analyzed by statistical learning algorithms. In this talk, I will describe some recently proposed models in machine learning for dimensionality reduction and sequence analysis and discuss their application to problems in multiple f_0 tracking, speaker separation, and acoustic modeling.
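
As a minimal, hypothetical sketch of the kind of representation this implies, the snippet below reduces magnitude-spectrogram frames to a low-dimensional feature space with ordinary PCA; the recently proposed models the abstract refers to are nonlinear and sequential, but the interface is the same: signal frames in, compact learned representation out.

    # Minimal sketch: dimensionality reduction of spectrogram frames.
    # PCA stands in here for the more recent nonlinear methods.
    import numpy as np

    def pca_reduce(frames, k):
        """frames: (n_frames, n_bins) magnitude-spectrogram rows.
        Returns the k-dimensional projection of each frame."""
        centered = frames - frames.mean(axis=0)
        _, _, Vt = np.linalg.svd(centered, full_matrices=False)
        return centered @ Vt[:k].T        # coordinates on top-k axes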


Les Atlas, University of Washington

Modulation Spectral Filtering: A New Tool for Acoustic Signal Separation

There is substantial evidence that commonality of modulation rates or frequencies provides an important cue for the perceptual grouping of multiple sound sources in both monaural and binaural perception. Unfortunately, this modulation concept has previously had little, if any, quantitative foundation. The elementary notions of frequency in the Fourier sense and the concepts of linear time-invariant filtering are very well defined; it is thus reasonable to expect analogous properties for modulation-frequency representations and modulation filters [1]. A correct and substantive definition of modulation-frequency filtering, with the suppression and distortion-free performance one normally expects of a filter, could be a key ingredient of sound separation systems.

A time-frequency approach can provide the start of a careful definition of modulation frequency. However, the conventional assumption of an incoherently detected, real and non-negative modulation envelope, as used by essentially all researchers, is incomplete [2]. With a more accurate foundation of coherent modulation detection, there is the potential to satisfy superposition and other properties in modulation filtering. Well-defined modulation spectra can then be viewed as a new and useful dimension in which to filter, complementing and potentially augmenting existing separation technologies. Demonstrations will include single-channel talker and music source separation, and remaining challenges will be discussed.
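
For orientation, a minimal sketch of the conventional, incoherent form of modulation filtering is given below: lowpass-filter each STFT bin's magnitude envelope across time and resynthesize with the original phase. This is precisely the detector the abstract argues is incomplete; the coherent detection of [2] would replace the magnitude operation. Function and parameter names are illustrative.

    # Minimal sketch: incoherent modulation lowpass filtering.
    # Coherent detection [2] would replace np.abs() below.
    import numpy as np
    from scipy.signal import stft, istft, butter, filtfilt

    def modulation_lowpass(x, fs, cutoff_hz=8.0, nperseg=512):
        f, t, X = stft(x, fs=fs, nperseg=nperseg)
        frame_rate = fs / (nperseg // 2)         # envelope sample rate (hop = nperseg/2)
        b, a = butter(2, cutoff_hz / (frame_rate / 2))
        env = filtfilt(b, a, np.abs(X), axis=1)  # filter along modulation frequency
        Y = np.maximum(env, 0.0) * np.exp(1j * np.angle(X))
        _, y = istft(Y, fs=fs, nperseg=nperseg)
        return y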

References

  1. L. Atlas, "Modulation Spectral Filtering of Speech", Proc. Eurospeech 2003, Geneva, Sept. 2003.
  2. L. Atlas, Q. Li, and J. Thompson, "Homomorphic Modulation Spectra", Proc. IEEE ICASSP 2004, Montreal, May 2004, pp. II-761-II-764.

Dan Ellis, Columbia University

Integrating CASA information with other signal separation techniques

Computational Auditory Scene Analysis (CASA) has been used broadly to refer to computer systems that try to duplicate the human ability to organize complex sound scenes into individual sources by directly modeling what is understood of how the auditory system achieves this task. In practice, this mainly means a collection of "CASA features" that attempt to capture the cues to sound organization identified by experimental psychologists: continuity, common onset, common periodicity, common modulation, and conformity to well-known patterns. A number of alternative approaches, including Independent Component Analysis (ICA), instead start from a purely theoretical analysis of the problem and make no claims of perceptual relevance. This talk will attempt to clarify the distinctions and links between these two approaches, and suggest ways in which CASA cues can be successfully integrated into more rigorous signal separation algorithms.
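
As a toy example of one such cue, the sketch below estimates the dominant periodicity of a single signal frame from its autocorrelation; frames (or, in a real system, individual frequency channels) that share a period would be grouped as one source, and the resulting grouping could supply a mask or constraint to an ICA-style separator. Names and ranges are illustrative.

    # Minimal sketch of the common-periodicity cue: pick the strongest
    # autocorrelation peak in a plausible pitch range (80-400 Hz here).
    import numpy as np

    def dominant_period_hz(frame, fs, fmin=80.0, fmax=400.0):
        """frame: 1-D signal segment, longer than fs/fmin samples."""
        ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
        lo, hi = int(fs / fmax), int(fs / fmin)
        lag = lo + np.argmax(ac[lo:hi])
        return fs / lag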
