Maximum-likelihood multi-channel speaker separation using factorial Hidden Markov Models.
Bhiksha Raj
(Mitsubishi Electric Research Labs [MERL])

In this talk I will present a speech-recognizer-based maximum-likelihood beamforming technique that can be used for both signal enhancement and speaker separation. The technique uses an HMM-based speech recognizer as a statistical model of the target signal to be enhanced or separated. The parameters of a filter-and-sum array processor are estimated to maximize the likelihood of its output, as measured by the speech recognizer. The filter-and-sum operation may be performed in either the time domain or the frequency domain. When used for speaker separation, the beamforming must be performed individually for each speaker. Since the competing signal in this case is also in-domain speech, the statistical model used for the beamforming is a factorial HMM formed from the HMM for the target and the HMMs for the competing speaker(s). This work was done jointly with Michael L. Seltzer of CMU and Manuel Jesus Reyes Gomez of Columbia University. The frequency-domain beamformer was developed principally by Michael Seltzer, with minimal encouragement from the presenter. Other contributors are Richard Stern of CMU and Dan Ellis of Columbia University.
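
For readers unfamiliar with the filter-and-sum operation mentioned in the abstract, the following is a minimal time-domain sketch in Python. It is an illustration only, not the implementation from the talk: the function name filter_and_sum and its arguments are hypothetical, and the filter taps are simply passed in. In the technique described above the taps would instead be estimated by maximizing the likelihood of the output under the recognizer's HMMs, a step this sketch does not attempt.

```python
def filter_and_sum(channels, filters):
    """Time-domain filter-and-sum over a microphone array (illustrative sketch).

    channels: list of M equal-length sample lists, one per microphone.
    filters:  list of M FIR tap lists, one per microphone (hypothetical values;
              the likelihood-based tap estimation is NOT shown here).
    Returns y[n] = sum over m and k of filters[m][k] * channels[m][n - k],
    truncated to the input length.
    """
    if len(channels) != len(filters):
        raise ValueError("one filter per microphone channel is required")
    n_samples = len(channels[0])
    output = [0.0] * n_samples
    for x_m, h_m in zip(channels, filters):
        for n in range(n_samples):
            for k, tap in enumerate(h_m):
                if n - k >= 0:  # samples before the start are treated as zero
                    output[n] += tap * x_m[n - k]
    return output


# Delay-and-sum is the special case where each filter is a scaled, shifted impulse.
# Here the first channel is delayed by one sample so that it lines up with the second.
aligned = filter_and_sum(
    channels=[[1, 2, 3, 4], [0, 1, 2, 3]],
    filters=[[0.0, 0.5], [0.5, 0.0]],
)
```

With general FIR filters in place of pure delays, each channel can be reshaped as well as time-aligned before the channels are summed, which is what gives the likelihood criterion something to optimize.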

Relevant material:

M. L. Seltzer and B. Raj, "Speech recognizer-based filter optimization for microphone array processing," IEEE Signal Processing Letters, vol. 10, no. 3, March 2003. (from http://www-2.cs.cmu.edu/~mseltzer/papers/index.html)
M. L. Seltzer and R. M. Stern, "Subband parameter optimization of microphone arrays for speech recognition in reverberant environments," Proceedings of the International Conference on Acoustics, Speech, and Signal Processing 2003, Hong Kong. (from http://www-2.cs.cmu.edu/~mseltzer/papers/index.html)