It is widely accepted that one of the key factors in human listeners' ability to separate a target voice from interference (e.g., the "cocktail party problem") is the harmonic structure of (voiced) speech, which allows the listener to follow the harmonics of a particular fundamental frequency contour to extract the speech. However, it is difficult to separate the contribution of harmonicity (a set of regularly spaced harmonics) from the looser property of spectral sparsity (i.e., a set of comodulated sinusoidal components that are spaced out in frequency, but not necessarily as multiples of a single fundamental), since all natural speech signals are harmonic. This research uses a precise speech analysis-synthesis mechanism (Hideki Kawahara's STRAIGHT) to create copies of speech tokens that are very close to the original, except that the harmonic components of the voiced portions have been modified in a number of ways to construct the closest inharmonic approximations to natural speech. These tokens can then be used in listening tests to probe the degree to which the harmonicity of the speech is needed for tasks like separation, or whether simply having a sparse, comodulated spectrum -- without precise harmonic arrangement -- is sufficient.
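To make the distinction between harmonicity and spectral sparsity concrete, here is a minimal sketch in plain Python. It is not the STRAIGHT-based method used in this work. The function names and the two manipulations (a constant frequency shift and a random per-component jitter) are assumptions chosen only for illustration. Both produce component sets that remain sparse in frequency but are no longer integer multiples of `f0`.

```python
import math
import random


def harmonic_freqs(f0, n):
    """Frequencies (Hz) of the first n harmonics: exact integer multiples of f0."""
    return [k * f0 for k in range(1, n + 1)]


def shifted_freqs(f0, n, shift_hz):
    """Hypothetical manipulation: add a constant offset to every component.

    The spacing between components is still f0, but the components are no
    longer integer multiples of f0.
    """
    return [k * f0 + shift_hz for k in range(1, n + 1)]


def jittered_freqs(f0, n, jitter_frac, seed=0):
    """Hypothetical manipulation: move each component above the fundamental
    by a random offset of at most +/- jitter_frac * f0."""
    rng = random.Random(seed)
    freqs = [f0]  # keep the lowest component in place
    for k in range(2, n + 1):
        freqs.append(k * f0 + rng.uniform(-jitter_frac, jitter_frac) * f0)
    return freqs


def synthesize(freqs, duration_s=0.5, sample_rate=16000):
    """Sum equal-amplitude sinusoids at the given frequencies.

    Returns a list of float samples scaled to stay within [-1, 1].
    """
    n_samples = int(duration_s * sample_rate)
    scale = 1.0 / len(freqs)
    return [
        scale * sum(math.sin(2.0 * math.pi * f * t / sample_rate) for f in freqs)
        for t in range(n_samples)
    ]


if __name__ == "__main__":
    print(harmonic_freqs(100.0, 4))
    print(shifted_freqs(100.0, 4, 30.0))
    print(jittered_freqs(100.0, 4, 0.3))
```

Real speech additionally has a time-varying fundamental and spectral envelope, which this static-tone sketch ignores.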
This page provides some examples to illustrate the results of our technique.
You can also read our Interspeech-12 (submitted) paper on this work.
Below are spectrograms of the inharmonic and [...]
"Woe betide the interviewee if he answered vaguely"
"This has been attributed to helium film flow in the vapor pressure thermometer"
These mixtures are intended for perceptual experiments in which the task is to understand the words from one speaker. Each sound example begins with a single utterance, followed by a mixture of two utterances, one of which is from the same speaker as the single utterance but is a different sentence. The listener uses the first utterance to "learn" which voice they are supposed to transcribe in the mixture. The examples start with the simulated whisper case (anticipated to be the hardest), followed by the inharmonic syntheses, and finally the harmonic reconstruction (anticipated to be the easiest).
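The layout of one such sound example can be sketched as follows. This is only an illustration under assumptions: the helper names `mix` and `build_trial`, the half-second silent gap, and the equal-level summation are hypothetical, not details taken from the experiments.

```python
def mix(a, b):
    """Sum two lists of samples, zero-padding the shorter one."""
    n = max(len(a), len(b))
    a = list(a) + [0.0] * (n - len(a))
    b = list(b) + [0.0] * (n - len(b))
    return [x + y for x, y in zip(a, b)]


def build_trial(cue, target, interferer, sample_rate=16000, gap_s=0.5):
    """Build one trial: cue utterance, silent gap, then the two-talker mixture.

    cue        -- single utterance from the target speaker
    target     -- a different sentence from the same speaker
    interferer -- an utterance from another speaker
    The gap length and equal-level mixing are assumptions for illustration.
    """
    gap = [0.0] * int(gap_s * sample_rate)
    return list(cue) + gap + mix(target, interferer)
```

For instance, `build_trial(cue, target, interferer)` with three lists of samples at 16 kHz would return the cue, 8000 samples of silence, and then the summed mixture.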
This work was supported in part by the National Science Foundation (NSF) via grant IIS-1117015, by Grants-in-Aid for Scientific Research 22650042 from JSPS, and by the Howard Hughes Medical Institute. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.