Predicting Intelligibility of Enhanced Speech
Nonlinear speech enhancement can improve intelligibility even though speech quality (and SNR) remain poor. Since SNR will not reflect the enhancement, we have developed an alternative objective measure that better correlates with intelligibility for enhancement schemes such as time-frequency masking. We are releasing this simple Matlab implementation so that others can take advantage of this measure. The measure is simple to compute and understand, and correlates well with human performance under a range of different conditions.
Introduction
Speech enhancement aims to undo some of the effects of adding noise or interference to speech. These effects include reduced speech quality (how "clean" the speech sounds) and reduced intelligibility (how accurately the listener receives the message). Many traditional speech enhancement techniques (such as Wiener filtering) improve quality, but in most cases an improvement in intelligibility is much harder to achieve.
Recently, there have been many novel speech enhancement schemes proposed that use drastically nonlinear, time-varying processing such as binary masking of spectrograms. In some cases, these have been shown to be able to improve intelligibility, even though speech quality may not be much improved - a seemingly paradoxical situation.
Both speech quality and speech intelligibility are measured through listening tests, but such tests are slow and expensive, so there is a great need for objective measures that correlate well with subjective performance. Traditionally, signal-to-noise ratio (SNR) is the standard measure for enhancement: it is based on the squared difference between the original, pre-corruption speech signal and the output of the enhancement stage. Clearly, a high SNR will give speech that sounds like the original, achieving high quality and intelligibility. However, the case mentioned above, where intelligibility can be high while quality is quite low, will not be well reflected by SNR. In fact, for complex, nonlinear speech enhancement schemes, SNR is likely to be a very poor predictor of intelligibility, yet intelligibility is often the more important goal of enhancement.
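As a point of reference, the SNR of an enhancement output relative to the clean original can be computed in a few lines. This is a minimal sketch: the two signals are synthetic stand-ins (a tone plus an added interferer), and the sample rate is an assumption, not a value taken from the sound files on this page.

```matlab
% Sketch: SNR of an enhancement output relative to the clean reference.
% 'ref' and 'out' are synthetic stand-ins, not the sound files used on this page.
sr  = 16000;                          % assumed sample rate in Hz
t   = (0:sr-1)' / sr;                 % one second of sample times
ref = sin( 2*pi*440*t );              % stand-in for the clean original
out = ref + 0.1*cos( 2*pi*1000*t );   % stand-in for the enhancement output

err    = ref - out;                   % residual between reference and output
snr_db = 10*log10( sum( ref.^2 ) / sum( err.^2 ) );
disp( [ 'SNR (dB): ', num2str( snr_db ) ] );
```

Because the measure depends on the sample-by-sample residual, any change to the waveform fine structure lowers it, whether or not the change affects intelligibility.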
The purpose of this work is to come up with an alternative objective predictor of speech intelligibility. We propose Normalized Subband Envelope Correlation, or NSEC, as such a measure. NSEC is fairly simple to calculate: it consists of taking the time-frequency envelopes of the original speech and of the enhancement output under an auditory (Gammatone) filterbank, applying some compression and equalization, then simply correlating the two envelopes. Our observation is that if the envelopes of the signals are similar, intelligibility is preserved, even if the fine structure under the envelope is very different (implying poor SNR).
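The observation above can be illustrated with a toy example: two signals that share an amplitude envelope but have entirely different fine structure. This sketch is not the NSEC routine. It uses a single broadband envelope (frame-wise RMS over 20 ms frames) instead of Gammatone subbands, and the signals, the frame length, and the mean removal are all assumptions made for illustration.

```matlab
% Toy illustration: same envelope, different fine structure.
% Not the nsec routine; single broadband envelope, synthetic signals.
sr  = 16000;                            % assumed sample rate in Hz
t   = (0:sr-1)' / sr;                   % one second of sample times
env = 0.5 * ( 1 + sin( 2*pi*4*t ) );    % shared 4 Hz amplitude envelope
ref = env .* sin( 2*pi*500*t );         % "original": 500 Hz carrier
out = env .* sin( 2*pi*1700*t );        % "output": 1700 Hz carrier

% Waveform-domain SNR is poor because the carriers differ
snr_db = 10*log10( sum( ref.^2 ) / sum( ( ref - out ).^2 ) );

% Frame-wise RMS envelopes (20 ms frames, no overlap)
flen = round( 0.020 * sr );
nfr  = floor( length( ref ) / flen );
eref = sqrt( mean( reshape( ref( 1:nfr*flen ), flen, nfr ).^2, 1 ) );
eout = sqrt( mean( reshape( out( 1:nfr*flen ), flen, nfr ).^2, 1 ) );

% Normalized correlation of the mean-removed envelopes
eref = eref - mean( eref );
eout = eout - mean( eout );
env_corr = sum( eref .* eout ) / sqrt( sum( eref.^2 ) * sum( eout.^2 ) );

disp( [ 'SNR (dB): ', num2str( snr_db ), ...
        '   envelope correlation: ', num2str( env_corr ) ] );
```

The SNR here is negative while the envelope correlation is close to 1, which is the kind of disagreement between the two measures that motivates an envelope-based predictor.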
We provide small Matlab routines to calculate NSEC, presented below. A more complete description of our technique, including extensive comparisons with human intelligibility data collected at Oticon A/S, is available in our EUSIPCO-2009 submission:
Jesper B. Boldt and Daniel P. W. Ellis (2009), A SIMPLE CORRELATION-BASED MODEL OF INTELLIGIBILITY FOR NONLINEAR SPEECH ENHANCEMENT AND SEPARATION, submitted to EUSIPCO-2009.
Example usage
The code consists of a main routine, nsecgt, which takes two input waveforms (the clean original and the output of the enhancement scheme) and returns a single value that predicts speech intelligibility as a proportion (in the range 0..1). nsecgt simply calculates the gammatone-based time-frequency distributions of each sound using gammatonegram2, then passes them to nsec, which calculates the actual normalized subband envelope correlation based on the two time-frequency envelopes:
% We need Malcolm Slaney's Auditory Toolbox
% see http://cobweb.ecn.purdue.edu/~malcolm/interval/1998-010/
addpath('/Users/dpwe/matlab/audtbox')

% Read in the original, reference speech
% (wavread was removed from recent MATLAB releases; audioread is the replacement)
[ref,sr] = wavread( 'sounds/ref.wav' );

% Read in the output of an enhancement scheme applied to noisy
% speech. Here, it is the Ideal Binary Mask applied to the speech
% mixed with speech-shaped noise at -7.3dB SNR
[out,sr] = wavread( 'sounds/IBM-SSN_RC=-10_SNR=-7.30.wav' );

% Calculate the intelligibility prediction
disp( [ 'Calculated nsec: ', num2str( nsecgt( ref, out, sr ) ) ] );
Calculated nsec: 0.86015
How the measure is built up
The plots below show the three modifications applied in building up the envelopes that are correlated to predict intelligibility. First, we compare the unmodified subband envelopes of the clean original (T, left column) and of the noisy, enhanced signal (Y, right column). Then we show the effect of normalizing the contribution of each frequency channel, followed by the addition of amplitude compression, and finally high-pass filtering to remove the effects of constant offsets.
T = gammatonegram2( ref, sr, 80e-3, 40e-3, 16, 80, 8000 ).^2;
Y = gammatonegram2( out, sr, 80e-3, 40e-3, 16, 80, 8000 ).^2;

h = subplot( 4, 2, 1, 'Fontsize', 7 );
imagesc( nsec( T, 0, 1, 0 ) ); axis xy;
xlabel( 'Time' ); ylabel( 'Frequency' );
title( 'nsec( T, 0, 1, 0 ) = no modifications' );

h = subplot( 4, 2, 2, 'Fontsize', 7 );
imagesc( nsec( Y, 0, 1, 0 ) ); axis xy;
xlabel( 'Time' ); ylabel( 'Frequency' );
title( 'nsec( Y, 0, 1, 0 ) = no modifications' );

h = subplot( 4, 2, 3, 'Fontsize', 7 );
imagesc( nsec( T, 1, 1, 0 ) ); axis xy;
xlabel( 'Time' ); ylabel( 'Frequency' );
title( 'nsec( T, 1, 1, 0 ) = Frequency Normalization' );

h = subplot( 4, 2, 4, 'Fontsize', 7 );
imagesc( nsec( Y, 1, 1, 0 ) ); axis xy;
xlabel( 'Time' ); ylabel( 'Frequency' );
title( 'nsec( Y, 1, 1, 0 ) = Frequency Normalization' );

h = subplot( 4, 2, 5, 'Fontsize', 7 );
imagesc( nsec( T, 1, .15, 0 ) ); axis xy;
xlabel( 'Time' ); ylabel( 'Frequency' );
title( 'nsec( T, 1, .15, 0 ) = Frequency Normalization, compression' );

h = subplot( 4, 2, 6, 'Fontsize', 7 );
imagesc( nsec( Y, 1, .15, 0 ) ); axis xy;
xlabel( 'Time' ); ylabel( 'Frequency' );
title( 'nsec( Y, 1, .15, 0 ) = Frequency Normalization, compression' );

h = subplot( 4, 2, 7, 'Fontsize', 7 );
imagesc( nsec( T, 1, .15, 1 ) ); axis xy;
xlabel( 'Time' ); ylabel( 'Frequency' );
title( 'nsec( T, 1, .15, 1 ) = Frequency Normalization, compression, HP filtering (default)' );

h = subplot( 4, 2, 8, 'Fontsize', 7 );
imagesc( nsec( Y, 1, .15, 1 ) ); axis xy;
xlabel( 'Time' ); ylabel( 'Frequency' );
title( 'nsec( Y, 1, .15, 1 ) = Frequency Normalization, compression, HP filtering (default)' );
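The three modifications described above, followed by the final correlation, can be sketched on a synthetic envelope matrix. This is an illustration under stated assumptions, not the code inside nsec: the compression exponent 0.15 is taken from the calls above, but dividing each channel by its mean, using mean removal as the simplest stand-in for high-pass filtering, and the order of the steps are all assumptions. The helper names (norm_rows, compress, remove_dc, process, ncorr) are hypothetical.

```matlab
% Sketch of the envelope modifications on synthetic data (channels x frames).
% Helper names and the exact form of each step are assumptions, not nsec itself.
nch = 16;  nfr = 50;
[ f, n ] = ndgrid( 1:nch, 1:nfr );
T = ( 1 + cos( 2*pi*n/10 + f/3 ) ).^2 ./ f + 1e-3;   % stand-in "clean" energies
Y = T .* ( 1 + 0.5*sin( 2*pi*n/7 ) );                % stand-in "enhanced" energies

norm_rows = @(E) bsxfun( @rdivide, E, mean( E, 2 ) ); % per-channel normalization
compress  = @(E) E .^ 0.15;                           % amplitude compression
remove_dc = @(E) bsxfun( @minus, E, mean( E, 2 ) );   % crude high-pass: drop offset
process   = @(E) remove_dc( compress( norm_rows( E ) ) );

% Normalized correlation over all time-frequency cells
ncorr = @(A, B) sum( A(:) .* B(:) ) / sqrt( sum( A(:).^2 ) * sum( B(:).^2 ) );

c_self  = ncorr( process( T ), process( T ) );   % identical envelopes
c_cross = ncorr( process( T ), process( Y ) );   % distorted envelope
disp( [ 'self: ', num2str( c_self ), '   cross: ', num2str( c_cross ) ] );
```

With per-channel normalization in place, scaling the whole input by a constant leaves the processed envelope unchanged, so the correlation in this sketch does not depend on overall level.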
Intelligibility as a function of masking threshold
Here, we take a set of sounds that have been enhanced by binary masking based on (oracle) masks chosen by thresholding the target speech envelope at different levels, and applying those different masks to speech mixed with speech-shaped noise at different ratios. These data show that intelligibility can be high even for speech completely submerged in noise (original SNR=-60 dB), provided the mask preserves some of the envelope of the original speech (i.e. mask chosen to pass cells where the original target was above a threshold chosen relative to the long-term average energy of speech in that band, here called the RC value):
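The mask construction described above can be sketched as follows. This is a hedged illustration, not the procedure used to generate the sounds on this page: the energy matrices are small synthetic stand-ins, and the rule "pass a cell when the target energy exceeds the band's long-term average by more than RC dB" is a direct reading of the description above, with the variable names chosen here for illustration.

```matlab
% Sketch: oracle binary mask from the target's time-frequency energy.
% T and M are synthetic stand-ins (channels x frames), not gammatonegram output.
row = [ 1 10 100 10 1 0.1 ];          % one channel's target energy over time
T   = [ 1; 0.5; 2; 4 ] * row;         % 4 channels with different overall levels
M   = T + 5;                          % stand-in mixture energy: target plus noise

RC = -10;                             % threshold in dB relative to band average

band_avg_db = 10*log10( mean( T, 2 ) );   % long-term average energy per band
T_db        = 10*log10( T );              % target energy per cell in dB

% Pass cells where the target exceeds (band average + RC)
mask   = bsxfun( @gt, T_db, band_avg_db + RC );
masked = mask .* M;                   % apply the mask to the mixture

disp( mask );
```

In this sketch a very low threshold such as RC = -100 passes every cell, which would correspond to an all-one mask.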
% Calculate nsec vs. RC value.
% (The sounds used are not from the experiment by Kjems et al. 2009.)
SNR = [ -60 -9.8 -7.3 ];   % chosen to give unprocessed listener
                           % intelligibility of 0, 25, and 50%
RC = [ -100 -30 : 5 : 25 ];
for cSNR = 1 : 3;
    for cRC = 1 : length( RC )
        w{ cSNR, cRC } = wavread( [ 'sounds/', sprintf( 'IBM-SSN_RC=%d_SNR=%.2f.wav', RC( cRC ), SNR( cSNR ) ) ] );
        ns2( cSNR, cRC ) = nsecgt( ref, w{ cSNR, cRC } );
    end
end

subplot(111)
plot( -40, ns2( 3, 1 ), 'color', [ 0 1 0 ], 'MarkerEdgeColor', [ 0 1 0 ], 'Marker', 'o', 'linewidth', 2 );
hold on
plot( -40, ns2( 2, 1 ), 'color', [ 1 0 0 ], 'MarkerEdgeColor', [ 1 0 0 ], 'Marker', 'o', 'linewidth', 2 );
plot( -40, ns2( 1, 1 ), 'color', [ 0 0 0 ], 'MarkerEdgeColor', [ 0 0 0 ], 'Marker', 'o', 'linewidth', 2 );
plot( RC( 2:end ), ns2( 3, 2:end ), 'color', [ 0 1 0 ], 'linewidth', 2, 'Marker', 'o' );
plot( RC( 2:end ), ns2( 2, 2:end ), 'color', [ 1 0 0 ], 'linewidth', 2, 'Marker', 'o' );
plot( RC( 2:end ), ns2( 1, 2:end ), 'color', [ 0 0 0 ], 'linewidth', 2, 'Marker', 'o' );
hold off
xlim( [ -45, 35 ] );
legend( { '-7.3 dB SNR', '-9.8 dB SNR', '-60 dB SNR' }, 'Location', 'South' );
set( gca, 'Xticklabel', { 'aom', -30:10:30 } );
xlabel( 'RC = LC - SNR [dB]' );
ylabel( 'NSEC' );
grid on
Download
You can download the code for these examples here: nsec.tgz. If you want the data to reproduce the graphs shown on this page, you will additionally need sounds.tgz.
Acknowledgment
This work was supported by Oticon A/S, The Danish Agency for Science, Technology and Innovation, and by the NSF under grant no. IIS-0535168. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.
% Last updated: $Date: 2009/02/22 01:46:42 $ % Dan Ellis <dpwe@ee.columbia.edu> and Jesper Bünsow Boldt <jesper@bboldt.dk>