Underlying principles of a high-quality speech manipulation system STRAIGHT and its application to speech segregation
Hideki Kawahara and Toshio Irino
(Wakayama University and ATR, Japan)

A human is a highly nonlinear system, and it is generally difficult to understand its performance and functions solely from responses to idealized (simplified or synthetic) stimuli. A better way of dealing with such a complex nonlinear system is to use ecologically relevant stimuli. However, there is an obvious difficulty: ecologically relevant stimuli, in other words natural stimuli, usually do not allow precise control of physical parameters. Providing a means to resolve this difficulty was the primary motivation for developing STRAIGHT. The key concept of STRAIGHT is to represent signals in terms of parameters that embody essential aspects of auditory perception. In other words, special representations are sought for tones, noise and transients. For tones, a time-frequency representation that does not suffer from interference caused by periodic excitation is introduced. An F0-adaptive design of a complementary set of time windows effectively eliminates temporal variations in the power spectrum estimate, and spline-based F0-adaptive smoothing in the frequency domain eliminates variations due to harmonic structure. For transients, a fixed-point-based method for extracting and representing auditory events is introduced. STRAIGHT has so far been applied to test perceptual normalization of vowels, physical correlates of emotional speech perception, perception of speaking rate and so on, and has been shown to yield naturalness comparable to that of real speech. Linear and nonlinear manipulation procedures, as well as an auditory morphing procedure, have been developed to enable these studies. It will be interesting to test human speech segregation performance using STRAIGHT by introducing extra control dimensions. An outline of how to implement STRAIGHT-based sinusoidal synthesis will be introduced, and the extra dimensions that have to be added will be discussed. Finally, STRAIGHT also provides a means to test a hypothesis: that the human auditory system separates size and shape information of sounding objects. This hypothesis is a result of ongoing collaboration with Roy Patterson and leads to an algorithm for speech segregation based on the stabilized wavelet-Mellin transformation and event detection.
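
To make the idea of sinusoidal synthesis from an F0 trajectory and a smooth spectral envelope more concrete, the following is a minimal sketch of generic sum-of-harmonics synthesis. It is not the STRAIGHT implementation: the function name `synthesize_harmonics`, the `envelope(frame, freq_hz)` callback, the default sampling rate and frame period, and the treatment of unvoiced frames as silence are all assumptions made for illustration only.

```python
import math

def synthesize_harmonics(f0_track, envelope, fs=16000, frame_period=0.005):
    """Generic sum-of-harmonics synthesis (illustrative, not STRAIGHT itself).

    f0_track: F0 in Hz for each frame; 0 marks an unvoiced frame,
              which this sketch simply renders as silence.
    envelope: hypothetical callback (frame_index, frequency_hz) -> linear
              amplitude, standing in for an interference-free envelope.
    """
    hop = int(round(fs * frame_period))   # samples per frame
    n_samples = hop * len(f0_track)
    out = [0.0] * n_samples
    phase = 0.0                           # running phase of the fundamental (rad)
    for n in range(n_samples):
        frame = n // hop
        f0 = f0_track[frame]
        if f0 <= 0:
            continue                      # unvoiced: leave silence, hold phase
        # Accumulate phase so that F0 changes do not cause discontinuities.
        phase += 2.0 * math.pi * f0 / fs
        k = 1
        while k * f0 < fs / 2:            # all harmonics below Nyquist
            out[n] += envelope(frame, k * f0) * math.sin(k * phase)
            k += 1
    return out
```

Because each harmonic is generated explicitly, its amplitude, frequency and phase could in principle be exposed as separate control dimensions; which dimensions are actually needed for segregation experiments is left open here, as in the abstract.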

Related Material:
STRAIGHT: basic structure. (Speech Communication, 1999) http://www.sys.wakayama-u.ac.jp/~kawahara/PSSws/draft.pdf
Auditory morphing using STRAIGHT. (ICASSP'2003) http://www.sys.wakayama-u.ac.jp/~kawahara/PSSws/kwhrv2.pdf
Application to speech segregation. (Eurospeech'03) http://www.sys.wakayama-u.ac.jp/~kawahara/PSSws/ES030117.PDF
STRAIGHT information and demonstration page. http://www.sys.wakayama-u.ac.jp/~kawahara/PSSws/