We're working with speech recognition on some very noisy speech files that were apparently recorded by a mic some distance from the speaker in a noisy office environment.
One special tool we have at our disposal is a trained, noise-robust pitch tracker derived from the RATS data, which consisted of pairs of files: a clean version, and a badly corrupted version that had been transmitted over a variety of noisy radio links. We used these to train our Subband Autocorrelation Classification (SAcC) pitch tracker, which does well in conditions much too noisy for conventional explicit pitch trackers. Informally, it seems that the RATS-trained pitch tracker does well on our noisy Babel data.
This could be used in several places (e.g., for improved voice activity detection), but one idea is to try to enhance speech by filtering to keep just the harmonics corresponding to the "known" pitch.
I've been thinking about how to do this ever since Avery Wang did it to separate vocals from accompaniment at CCRMA in 1996. Spectrogram masking is a bit too crude and messy; his technique was to "heterodyne" each harmonic down to DC, use a low-pass filter to extract its amplitude, then remodulate it back to its original position.
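In Matlab terms, the heterodyne step for a single harmonic looks something like the sketch below. This is only an illustration, not Avery's code: the fixed pitch f0, the harmonic index k, and the 50 Hz Butterworth low-pass bandwidth are all made-up values.

% Sketch: pull out the k-th harmonic of a fixed pitch f0 by shifting
% it to DC, low-pass filtering, and remodulating. Assumes d is the
% signal (column vector) and sr its sample rate.
f0 = 100; k = 3;                          % illustrative values
t = (0:length(d)-1)'/sr;
carrier = exp(-1i*2*pi*k*f0*t);           % shift the k*f0 component to DC
[b,a] = butter(4, 50/(sr/2));             % low-pass keeps the slow amplitude
env = filter(b, a, d.*carrier);           % complex amplitude envelope near DC
harmonic = 2*real(env.*conj(carrier));    % remodulate back up to k*f0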
Of course, a comb filter picks out all the harmonics in a single stroke, but it needs everything to be at a constant pitch. Rather than trying to demodulate each harmonic, my idea is to use time-varying resampling ("varispeed") to pitch-shift the target using the known pitch at each time so that the target pitch becomes constant. Then you can filter, and, in theory, un-resample to get back the original signal, now with the gaps between the harmonics "dug out".
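Concretely, a toy constant-pitch comb might look like this; the half-and-half feedforward tap weights are my choice here, not necessarily what pitchfilter uses, and x stands for the pitch-flattened signal:

% Sketch: feedforward comb that reinforces harmonics of f0 and nulls
% the frequencies halfway between them.
f0 = 100;                                 % the constant target pitch, Hz
T = round(sr/f0);                         % pitch period in samples
b = [0.5, zeros(1,T-1), 0.5];             % y(n) = (x(n) + x(n-T))/2
y = filter(b, 1, x);                      % peaks at k*f0, nulls at (k+1/2)*f0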
The problem was that I'd always gotten tied up trying to make sense of how to undo the time resampling, since it's a confusing operation. But for the ehist project on energy histogram equalization, I ended up implementing piecewise-linear mappings that are exactly invertible, which gave me the piece I needed for the pitch flattening.
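The trick is just to store the map as breakpoint lists; swapping the lists gives the exact inverse. A minimal sketch (the breakpoint values are made up; the only requirement is that both lists increase strictly):

tb = [0 1 2 3];                     % original-time breakpoints (seconds)
ub = [0 0.8 2.2 3];                 % corresponding warped-time breakpoints
warp   = @(t) interp1(tb, ub, t);   % original time -> warped time
unwarp = @(u) interp1(ub, tb, u);   % warped time -> original time
% unwarp(warp(t)) == t up to rounding, since each linear segment
% inverts exactly when the two breakpoint lists are swapped.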
The figure below shows a noisy OP1_204 test fragment (I started with the Wiener-filtered output), along with the pitch track and prob(vx) from SAcC-RATS. I tweaked the prior on voicing to make it guess at a pitch even when the evidence was weak, and I smoothed the prob(vx) with a 200 ms median filter.
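The smoothing amounts to something like this; pvx and the 100 frames/sec rate are assumed names and values for the tracker output, not checked against the SAcC code:

frame_rate = 100;                 % tracker frames per second (assumed)
win = round(0.2*frame_rate);      % 200 ms -> 20 frames
pvx_smooth = medfilt1(pvx, win);  % median-filter the voicing probability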
The next pane shows that signal after time-varying resampling to flatten the inferred pitch track to 100 Hz. We then put it through a comb filter to remove the noise in between the harmonics (third pane).
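My reconstruction of the flattening step, as a sketch: if the tracker gives f0 (strictly positive in every frame, thanks to the tweaked prior) at frame_rate frames per second, run warped time at rate du/dt = f0/f_target, so the resampled signal sits at a constant f_target. The actual pitchfilter code may differ in the details:

f_target = 100;                                 % flatten everything to 100 Hz
tb = (0:length(f0))'/frame_rate;                % frame-boundary times
ub = [0; cumsum(f0(:))/(f_target*frame_rate)];  % warped time: du/dt = f0/f_target
u_grid = (0:1/sr:ub(end))';                     % uniform sampling in warped time
t_of_u = interp1(ub, tb, u_grid);               % piecewise-linear unwarp of the grid
d_flat = interp1((0:length(d)-1)'/sr, d, t_of_u, 'linear', 0);
% After comb filtering, swapping tb and ub in the same interpolation
% takes the signal back to the original timebase exactly.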
The final pane shows this cleaned signal un-resampled to get back the original timebase. I also cross-fade between the pitch-filtered version and the original unprocessed signal, using the smoothed prob(vx) as the weighting.
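The cross-fade itself is simple; a sketch, where y_filt (the un-resampled filtered signal) and pvx_smooth (the smoothed per-frame voicing probability) are assumed names:

t_samp  = (0:length(d)-1)'/sr;                          % sample times
t_frame = (0:length(pvx_smooth)-1)'/frame_rate;         % frame times
w = interp1(t_frame, pvx_smooth, t_samp, 'linear', 0);  % per-sample weight
y = w.*y_filt + (1 - w).*d;      % voiced regions take the filtered version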
This all works, I think. Unfortunately, it doesn't sound that good (yet?). But I think it's the basis of a really nice story - if only we can get some kind of ASR gain out of it.
The Matlab code for this is at https://github.com/dpwe/pitchfilter . The figure below is created within the pitchfilter command, e.g.
>> [d,sr] = wavread('nr1ex_wiener-8k.wav');    % read the noisy (Wiener-filtered) input
>> y = pitchfilter(d,sr);                      % track pitch, flatten, comb-filter, un-flatten
>> wavwrite(y, sr, 'nr1ex_w+pitchfilt.wav');   % save the enhanced output