stnr

 Usage: stnr [-cfhqs] <filename>

-------------------------------

   -f : run on a list of files. <filename> must contain an ASCII list of
        filenames separated by whitespace.
   -q : quick SNR algorithm (default is slower but more precise).
   -s : compute SNR on a sub-band basis.
   -c : CODEC SNR algorithm. See Appendix 2 of this document.

This program estimates the speech to noise ratio (SNR) of a file, defined
as 10 log10(peak_speech_power / mean_noise_power), where power refers to
the variance of a signal computed over 20 ms windows. This program extends
the functionality of earlier SNR software internal to NIST written by Jon
Fiscus; the algorithm has not changed, and the documentation of this
earlier version is appended here for reference. 

It should be emphasized that the expression "signal-to-noise ratio,"
strictly speaking, makes sense only if the "signal" and the "noise" may be
measured separately. Once the two are combined, however, we must rely on
prior knowledge about the behavior of the signal and the noise to estimate
what the true SNR might have been. In the case of speech recordings, where the
"signal" and "noise" never existed independently of each other, the
situation becomes more complicated, since there is no strict number that
can be referred to as the "true" SNR, even in theory. In the absence of a
"true" SNR, therefore, our estimate becomes something of a definition.

This implies that direct comparisons between one SNR estimation program
and another may not be completely meaningful, since the definitions of SNR
may not agree. On the other hand, informal tests have shown that under most
circumstances this algorithm agrees with itself, i.e., it behaves as we
would expect for an estimate of SNR. Measuring the SNR of a "clean"
recording corrupted by white noise of power X, for example, usually yields
an SNR 3 dB higher than the same clean recording contaminated by noise of
power 2X.
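
The size of this step follows directly from the definition above: if the
peak speech power is unchanged, doubling the noise power lowers the SNR by
10 log10(2), or about 3 dB. A minimal Python check of that arithmetic is
sketched below; the function name snr_db and the power values are
illustrative only and are not part of stnr.

```python
import math

def snr_db(peak_speech_power, mean_noise_power):
    """SNR in dB: 10 log10(peak speech power / mean noise power)."""
    return 10.0 * math.log10(peak_speech_power / mean_noise_power)

# Hypothetical power values, chosen only for illustration.
speech = 1000.0
noise = 1.0

# Difference between the SNR at noise power X and at noise power 2X.
drop = snr_db(speech, noise) - snr_db(speech, 2.0 * noise)
print(round(drop, 2))  # 3.01
```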

The -s option allows users to measure speech to noise ratios in certain
frequency bands. This is equivalent to passing the speech file through a
filterbank and running the program on each of the filterbank outputs.
Users may design the filterbank to use by creating a file called
"filterbank" in their home directory. This file should have the form 

<number of bands>

<band 1 lower edge> <band 1 upper edge> <order of filter> <error weight 1>
<band 2 lower edge> <band 2 upper edge> <order of filter> <error weight 2>
....
<band n lower edge> <band n upper edge> <order of filter> <error weight n>

With these specifications, the program designs a set of FIR filters of the
desired order using the well-known Remez exchange algorithm. The <error
weight> allows the user to specify the ratio of passband to stopband
ripple in the design. Band edge frequencies should be specified between 0
and 0.5, where 0 is DC and 0.5 is half of the sampling rate (pi
radians/sample). The bands do not need to be specified in any particular
order, and they may overlap.
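
To make the file layout concrete, the sketch below parses a description in
the form given above and checks that each band edge lies between 0 and 0.5.
It is an illustration in Python, not the parser used by stnr; the function
name parse_filterbank, the validation rules, and the three-band example
values are all assumptions.

```python
def parse_filterbank(text):
    """Parse a band count followed by one 'lower upper order weight' row
    per band. Blank lines are skipped."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    n_bands = int(rows[0][0])
    bands = []
    for fields in rows[1:1 + n_bands]:
        lower, upper = float(fields[0]), float(fields[1])
        order, weight = int(fields[2]), float(fields[3])
        # Edges are normalized: 0 is DC, 0.5 is half the sampling rate.
        if not (0.0 <= lower < upper <= 0.5):
            raise ValueError("band edges must satisfy 0 <= lower < upper <= 0.5")
        bands.append((lower, upper, order, weight))
    if len(bands) != n_bands:
        raise ValueError("fewer band rows than declared")
    return bands

# Hypothetical three-band file contents; the values are illustrative only.
example = """3

0.00 0.10 64 1.0
0.10 0.25 64 1.0
0.25 0.50 64 2.0
"""
bands = parse_filterbank(example)
print(bands[0])  # (0.0, 0.1, 64, 1.0)
```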

The following appendix, taken from the documentation of the earlier
version of this code by Jon Fiscus, contains a more detailed explanation
of the SNR algorithm:

-------------------------------------------------------------------------
APPENDIX 1:     The NIST Signal to Noise Estimation Utility
-------------------------------------------------------------------------

An estimate of the signal to noise ratio (SNR) is an important quality
characterization of a speech file.  NIST has implemented a technique
suggested by Ned Neuberg, Jordan Cohen and others.  The utility
implements two techniques, 1) a "quick" estimation of the speech and
noise level, and 2) the mean noise and peak speech level SNR.  Both
techniques use an energy histogram computed over the entire file 
to characterize the levels.

A signal energy histogram is generated by computing the root mean
squared (RMS) power, in decibels, over a 20 ms. window and then
updating the appropriate histogram bin.  The window is then shifted by
10 ms. and the next power is computed.
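
The histogram construction can be sketched in Python as follows. This is
an illustration under stated assumptions, not the stnr source: power is
taken as the variance of each 20 ms frame (as in the definition at the top
of this document), the bin width of 1 dB is a guess, and frames with zero
power are skipped because they have no finite dB value.

```python
import math

def power_histogram(samples, sample_rate, bin_width_db=1.0,
                    window_ms=20.0, shift_ms=10.0):
    """Histogram of short-time power in dB. Returns {bin_index: count};
    bin k covers [k * bin_width_db, (k + 1) * bin_width_db)."""
    win = int(sample_rate * window_ms / 1000.0)
    hop = int(sample_rate * shift_ms / 1000.0)
    hist = {}
    for start in range(0, len(samples) - win + 1, hop):
        frame = samples[start:start + win]
        mean = sum(frame) / win
        power = sum((x - mean) ** 2 for x in frame) / win  # frame variance
        if power <= 0.0:
            continue  # constant frame: no finite dB value (assumed skipped)
        k = math.floor(10.0 * math.log10(power) / bin_width_db)
        hist[k] = hist.get(k, 0) + 1
    return hist

# Synthetic one-second signal at 8 kHz alternating between +50 and -50.
samples = [50.0 if n % 2 == 0 else -50.0 for n in range(8000)]
hist = power_histogram(samples, 8000)
print(hist)  # {33: 99}
```

Every frame of this test signal has variance 2500 (about 33.98 dB), so all
99 frames fall in the same bin.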

Quick Method
------------

After the histogram has been computed, the noise and speech levels are
estimated.  The first method, the "quick" method, is a crude
approximation of the SNR.  NIST does NOT endorse its use.  It is
simply included as a reference against alternative methods.  The
algorithm defines the noise level as the 15th percentile of the RMS
power histogram.  The 15th percentile is the point on the horizontal
axis of the histogram where the area to the left is 15% of the total
area.  Since the histogram is not a continuous function, but rather a
series of quantized intervals, the interval midpoint is used to
approximate the location of the actual percentile.  Using the same
technique for percentile approximation, the speech level is defined
as the 85th percentile of the RMS power histogram.  The noise level is
subtracted from the speech level to calculate the SNR.  The 15th and
85th percentiles have been chosen based on observations of typical
noise and speech power distributions.


                                            _-_
                                           /   \
                                          /     \   RMS Power Histogram
                                         /       \  /
                                        /         \/
                     _-_               /           \
                    /   \             /             \ 
                   /     \           /               \ 
                  /       \         /                 \
     __________--'         `-------'                   `--___
  --'------------------+------------------------+------------`--
                       NL                       SL
    |------15 %--------|
    |--------------------85 %-------------------|
                       |-----------SNR----------|
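
The quick method can be sketched in Python as below. The histogram is
represented as a mapping from bin index to count, and the 1 dB bin width,
the function names, and the example histogram are assumptions made for
illustration.

```python
def percentile_level(hist, fraction, bin_width_db=1.0):
    """Midpoint (in dB) of the bin in which the cumulative count first
    reaches the given fraction of the total count."""
    total = sum(hist.values())
    running = 0
    for k in sorted(hist):
        running += hist[k]
        if running >= fraction * total:
            return (k + 0.5) * bin_width_db
    raise ValueError("empty histogram")

def quick_snr(hist, bin_width_db=1.0):
    """Quick estimate: 85th percentile level minus 15th percentile level."""
    noise = percentile_level(hist, 0.15, bin_width_db)
    speech = percentile_level(hist, 0.85, bin_width_db)
    return speech - noise

# Hypothetical bimodal histogram: noise near 20 dB, speech near 50 dB.
hist = {19: 10, 20: 30, 21: 10, 49: 10, 50: 30, 51: 10}
print(quick_snr(hist))  # 30.0
```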

Second Method
-------------

In the second (preferred) method, the noise level is also subtracted
from the speech level in order to obtain the signal to noise ratio.
However, a less arbitrary technique is used to find the speech and
noise levels.  The noise level is defined to be the mean of the noise
power distribution.  Since the speech and noise distributions in the
RMS power histogram overlap, parameter estimation of the noise
distribution is accomplished by fitting, in the Chi-Square sense, a
raised cosine function to the left hand peak of the complete RMS
histogram.  The raised cosine function can be described
by the following parameters and function:

      bin[i + (Peak_location - Width/2 )]

               = Amplitude/2 + Amplitude/2 * cos( (2*Pi * i/Width) - Pi )

                     Where i varies from 0 to Width and

         Peak_location: the horizontal coordinate of the peak of the
                        cosine function (in number of bins),
         Amplitude:     the vertical coordinate of the peak of the
                        cosine function (in bin counts),
         Width:  the cosine function period (in number of bins).
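
As an illustration, the Python sketch below evaluates such a curve,
assuming the cosine completes exactly one period across Width bins (that
is, its argument is 2*Pi*i/Width - Pi), so that the curve is zero at both
edges and equal to Amplitude at Peak_location. The function name and the
parameter values are illustrative only.

```python
import math

def raised_cosine(peak_location, amplitude, width):
    """Return {bin: value} for one raised-cosine period centred on
    peak_location. Assumes the argument 2*pi*i/width - pi, i = 0..width."""
    start = peak_location - width // 2
    return {
        start + i: amplitude / 2
        + amplitude / 2 * math.cos(2 * math.pi * i / width - math.pi)
        for i in range(width + 1)
    }

curve = raised_cosine(peak_location=20, amplitude=100.0, width=8)
print(round(curve[20], 6))  # 100.0 at the peak
print(round(curve[18], 6))  # 50.0 halfway up the left flank
```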

In fitting the raised cosine function to the noise power
distribution, a guess is first made for the location, amplitude and width
of the leftmost peak, then a solution space search algorithm, "direct
search" by Hooke and Jeeves [1], is invoked to minimize the Chi-Squared
distance between the raised cosine function and the targeted noise
peak.  Once the best fit is found, the midpoint of the raised cosine
function is labeled as the mean noise power level.


                                            _-_
                                           /   \   RMS Power Histogram
                          raised Cosine   /     \   /
                            / Curve      /       \ /
                     _-_   /            /         \
                    /_-_\ /            /           \
                   //   \\            /             \ 
                  //     \\          /               \ 
                 //       \\        /                 \
     __________-/'         `\------'                   `--___
  --'----------'------+------`-------------------------------`--
                      NL
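
The fitting step might look roughly like the Python sketch below. It is
not the stnr implementation: it uses a simplified coordinate search in the
spirit of direct search [1] rather than the full procedure, the distance
is evaluated only over the bins covered by the model, and the floor of 1.0
in the denominator is an assumption added to avoid dividing by zero. All
names and step sizes are illustrative.

```python
import math

def raised_cosine(peak, amplitude, width):
    """One raised-cosine period of `width` bins centred on bin `peak`."""
    start = peak - width // 2
    return {start + i: amplitude / 2
            + amplitude / 2 * math.cos(2 * math.pi * i / width - math.pi)
            for i in range(width + 1)}

def chi_squared(hist, params):
    """Chi-squared-style distance over the bins covered by the model."""
    total = 0.0
    for b, expected in raised_cosine(*params).items():
        observed = hist.get(b, 0.0)
        total += (observed - expected) ** 2 / max(expected, 1.0)
    return total

def fit_noise_peak(hist, guess, amp_step=8.0):
    """Coordinate search from an initial (peak, amplitude, width) guess."""
    best, best_cost = tuple(guess), chi_squared(hist, guess)
    while amp_step >= 0.5:
        improved = False
        # Try each parameter in turn: peak by 1 bin, amplitude, width by 2.
        for index, step in ((0, 1), (1, amp_step), (2, 2)):
            for delta in (step, -step):
                trial = list(best)
                trial[index] += delta
                if trial[1] <= 0 or trial[2] < 2:
                    continue
                cost = chi_squared(hist, tuple(trial))
                if cost < best_cost:
                    best, best_cost, improved = tuple(trial), cost, True
                    break
        if not improved:
            amp_step /= 2.0  # refine the amplitude step once no move helps
    return best, best_cost

# Synthetic histogram: a noise hump at bin 20 and a separate speech hump.
hist = raised_cosine(20, 100.0, 8)
hist.update(raised_cosine(50, 300.0, 12))
params, cost = fit_noise_peak(hist, guess=(21, 80.0, 8))
print(params)  # (20, 100.0, 8)
```

On real histograms the noise and speech humps overlap and the counts are
noisy, so the result depends on the initial guess; this example uses
well-separated synthetic humps only to show the mechanics.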

Using the cosine function as an estimate of the noise power
distribution, the cosine function is subtracted from the complete RMS
power histogram in order to estimate the speech power distribution.
The speech level (in this case a peak speech level) is defined to be
the histogram bin midpoint where the 95th percentile occurs in the
speech power histogram.  As before, the noise level is subtracted from
the speech level to calculate the SNR.

                                           
                                            _-_
                                           /   \
                                          /     \
                                         /       \
                                        /         \
                                       /           \
                                      /             \ 
                                     /               \ 
                                    /                 \
     __________               -----'                   `--___
  --'----------`-------------'-----------------------+-------`--
                                                     SL
    |---------------------95 %-----------------------|
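
The subtraction and percentile steps can be sketched in Python as follows.
This is an illustration only: clamping negative differences to zero, the
1 dB bin width, and the example counts are assumptions, and the noise
level is taken here as the midpoint of the bin at the peak of the fitted
curve.

```python
def speech_level(hist, noise_model, fraction=0.95, bin_width_db=1.0):
    """Subtract the fitted noise curve from the histogram and return the
    midpoint (in dB) of the bin holding the given percentile of the rest."""
    residual = {b: max(count - noise_model.get(b, 0.0), 0.0)
                for b, count in hist.items()}
    total = sum(residual.values())
    if total <= 0.0:
        raise ValueError("nothing left after subtracting the noise model")
    running = 0.0
    for b in sorted(residual):
        running += residual[b]
        if running >= fraction * total:
            return (b + 0.5) * bin_width_db
    return (max(residual) + 0.5) * bin_width_db  # guard against rounding

# Hypothetical counts: noise near 20 dB, speech near 50 dB.
hist = {19: 10, 20: 30, 21: 10, 48: 5, 49: 20, 50: 50, 51: 20, 52: 5}
noise_model = {19: 10.0, 20: 30.0, 21: 10.0}  # stands in for the fitted curve
noise_level = 20.5                            # midpoint of the model's peak bin

snr = speech_level(hist, noise_model) - noise_level
print(snr)  # 31.0
```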
                     

Unlike the first method, the second method assumes the power
distribution of the entire file is a mixture of two distributions -
one for the speech and one for the noise.  If the means of the two
distributions are close to each other, as in a very noisy recording,
the estimate will be suspect. Other recordings that sometimes result in
unusual (not bi-modal) power distributions are telephone conversations
with an echo-cancelling device in operation, since these devices tend to
produce long periods of "silence" through squelching. In the next release
of this software, we hope to adapt the algorithm to handle such
unusual aspects of telephone-switchboard speech [see Appendix 2].


[1]  "Direct Search" Solution of Numerical and Statistical Problems,
      Robert Hooke and T. A. Jeeves, Journal ACM 1961 (p212-229)


---------------------------------------------------------------------
APPENDIX 2 : Adapting the SNR algorithm to the Switchboard Corpus
---------------------------------------------------------------------

The -c option invokes an adaptation of the algorithm described immediately
above in order to cope better with the peculiarities of the Switchboard
Corpus. The main peculiarities are long periods of "silence" (i.e. sample
values of 0) in some files, usually occurring when the other channel is
active; unusual pops and glitches due to telephone network noise or data
collection errors; and crosstalk. These anomalies, their consequences, and
the corresponding adaptation of the SNR algorithm are discussed below:

PROBLEM: Occasional long periods of "silence" (null samples). 

CONSEQUENCES: These lead to large spikes in the short-time energy
histogram at negative infinity dB.

MODIFICATION TO THE ALGORITHM: These "silence" portions, because they do
not contribute any information to the estimation of speech and noise
power, are completely ignored.


PROBLEM: Unusual pops and glitches from the telephone network or data
collection errors.

CONSEQUENCES: If the same glitch occurs frequently, it causes a spike in
the energy histogram. These spikes, if they are sufficiently high, may be
mistaken for the peak of the noise energy distribution. They may also be
mistaken for speech and corrupt the speech peak energy estimation.

MODIFICATION TO THE ALGORITHM: The complete short-time energy histogram is
3-point median filtered before computing the peak of the noise energy
distribution and locating the 95th percentile of the speech energy
distribution. 
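
A 3-point median filter over the histogram counts can be sketched in
Python as below. Leaving the two end bins unchanged is an assumption about
edge handling, and the example counts are invented for illustration.

```python
def median3(counts):
    """3-point median filter over a list of bin counts; the two end bins
    are left unchanged (an assumption about edge handling)."""
    out = list(counts)
    for i in range(1, len(counts) - 1):
        out[i] = sorted(counts[i - 1:i + 2])[1]
    return out

# Hypothetical histogram counts with a one-bin glitch spike of 40.
counts = [0, 2, 3, 40, 4, 5, 6, 5, 3]
print(median3(counts))  # [0, 2, 3, 4, 5, 5, 5, 5, 3]
```

The isolated spike is removed while the broader shape of the histogram is
largely preserved.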


PROBLEM: Crosstalk--viz., speech from one channel is sometimes audible
in the other.

CONSEQUENCES: If the crosstalk is sufficiently energetic and frequent, it
causes a hump in the short-time energy histogram of the file, so that the
distribution is no longer bi-modal as previously assumed. This means that
the curve-fitting algorithm to approximate the noise energy distribution
with a raised-cosine sometimes mistakes the crosstalk hump for the noise
hump. 

MODIFICATION TO THE ALGORITHM: The algorithm no longer attempts to fit a
raised-cosine to the noise distribution. Instead it starts at the left
side of the energy distribution, finds the first peak, and calls this the
"mean noise level." It then tries to find the first trough, the second
peak, and the second trough. Everything up to the first trough is
considered noise; everything between the first trough and the second
trough is considered crosstalk; and everything else is considered speech.
The second peak, if it exists, is reported as an estimate of the mean
crosstalk power. This latter estimate is not entirely reliable and is
included only as a gross estimate. A proper measurement of crosstalk power
would use some calculation of inter-channel correlation; this crude
method, however, may be used when only one channel is available for
analysis. 
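
The left-to-right walk over peaks and troughs might be sketched in Python
as follows. The local-extremum tests, the function name, and the example
counts are assumptions; a practical version would need some smoothing (for
instance the median filter described above) so that small ripples are not
taken for peaks.

```python
def find_extrema(counts):
    """Scan left to right and return the bin indices of
    (first peak, first trough, second peak, second trough).
    Entries that cannot be found are None."""
    result = []
    position = 1
    for want_peak in (True, False, True, False):
        found = None
        for i in range(position, len(counts) - 1):
            left, here, right = counts[i - 1], counts[i], counts[i + 1]
            if want_peak and here > left and here >= right:
                found = i
                break
            if not want_peak and here < left and here <= right:
                found = i
                break
        result.append(found)
        if found is None:
            break
        position = found + 1
    result += [None] * (4 - len(result))
    return tuple(result)

# Hypothetical counts: noise hump, smaller crosstalk hump, speech hump.
counts = [0, 5, 20, 5, 1, 3, 8, 3, 1, 10, 40, 10, 0]
peak1, trough1, peak2, trough2 = find_extrema(counts)
print(peak1, trough1, peak2, trough2)  # 2 4 6 8
```

In this example bins up to index 4 would be treated as noise, bins between
index 4 and index 8 as crosstalk, and the remaining bins as speech.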

                           * * *

Finally, it is noted that the SNR estimates of mu-law encoded speech
should be taken with a grain of salt inasmuch as they do not account for
the distortions incurred in the quantization process. That is to say, the
SNR estimate should be interpreted strictly as a ratio of peak (quantized)
speech to mean noise power, not as a measure of the fidelity of the mu-law
encoded speech to the "original" speech (a fidelity figure that in most
cases is much lower).