stnr

Usage: stnr [-fhqs]
-------------------------------
  -f : run on a list of files. The list file must contain an ASCII
       list of filenames separated by whitespace.
  -q : quick SNR algorithm (default is slower but more precise).
  -s : compute SNR on a sub-band basis.
  -c : CODEC SNR algorithm. See Appendix 2 of this document.

This program estimates the speech to noise ratio (SNR) of a file,
defined as

     10 log10 (peak_speech_power / mean_noise_power),

where power refers to the variance of a signal computed over 20 ms
windows. This program extends the functionality of earlier SNR
software internal to NIST written by Jon Fiscus; the algorithm has not
changed, and the documentation of this earlier version is appended
here for reference.

It should be emphasized that the expression "signal-to-noise ratio,"
strictly speaking, makes sense only if the "signal" and the "noise"
can be measured separately. Once the two are combined, we must rely on
prior knowledge about the behavior of the signal and the noise to
estimate what the true SNR might have been. In the case of speech
recordings, where the "signal" and "noise" never existed independently
of each other, the situation is more complicated, since there is no
number that can be referred to as the "true" SNR, even in theory.

In the absence of a "true" SNR, therefore, our estimate becomes
something of a definition. This implies that direct comparisons
between one SNR estimation program and another may not be completely
meaningful, since the definitions of SNR may not agree. On the other
hand, informal tests have shown that under most circumstances this
algorithm is self-consistent, i.e., it behaves as we would expect for
an estimate of SNR. Measuring the SNR of a "clean" recording corrupted
by white noise of power X, for example, usually yields an SNR 6 dB
higher than the same clean recording contaminated by noise of power 2X.

The -s option allows users to measure speech to noise ratios in
certain frequency bands.
This is equivalent to passing the speech file through a filterbank and
running the program on each of the filterbank outputs. Users may
design the filterbank by creating a file called "filterbank" in their
home directory. This file should have the form ....

With these specifications, the program designs a set of FIR filters of
the desired order using the well-known Remez exchange algorithm. The
specification also allows the user to set the ratio of passband to
stopband ripple in the design. Band edge frequencies should be
specified between 0 and 0.5, where 0 is DC and 0.5 is half of the
sampling rate (pi radians/sample). The bands do not need to be
specified in any particular order, and they may overlap.

The following appendix, taken from the documentation of the earlier
version of this code by Jon Fiscus, contains a more detailed
explanation of the SNR algorithm:

-------------------------------------------------------------------------
        APPENDIX 1: The NIST Signal to Noise Estimation Utility
-------------------------------------------------------------------------

An estimate of the signal to noise ratio (SNR) is an important quality
characterization of a speech file. NIST has implemented a technique
suggested by Ned Neuberg, Jordan Cohen and others. The utility
implements two techniques: 1) a "quick" estimation of the speech and
noise level, and 2) the mean noise and peak speech level SNR. Both
techniques use an energy histogram computed over the entire file to
characterize the levels.

A signal energy histogram is generated by computing the root mean
squared (RMS) power, in decibels, over a 20 ms window and then
updating the appropriate histogram bin. The window is then shifted by
10 ms and the next power is computed.

Quick Method
------------

After the histogram has been computed, the noise and speech levels are
estimated. The first method, the "quick" method, is a crude
approximation of the SNR. NIST does NOT endorse its use.
It is simply included as a reference against alternative methods. The
algorithm defines the noise level as the 15th percentile of the RMS
power histogram. The 15th percentile is the point on the horizontal
axis of the histogram where the area to the left is 15% of the total
area. Since the histogram is not a continuous function, but rather a
series of quantized intervals, the interval midpoint is used to
approximate the location of the actual percentile. Using the same
technique for percentile approximation, the speech level is defined as
the 85th percentile of the RMS power histogram. The noise level is
subtracted from the speech level to calculate the SNR. The 15th and
85th percentiles have been chosen based on observations of typical
noise and speech power distributions.

                                               _-_
                                              /   \
                                             /     \
     RMS Power Histogram                    /       \
             /                             /         \
           \/           _-_               /           \
                       /   \             /             \
                      /     \           /               \
                     /       \         /                 \
        __________--'         `-------'                   `--___
     --'------------------+------------------------+------------`--
                         NL                       SL
       |------15 %--------|
       |--------------------85 %-------------------|
                          |-----------SNR----------|

Second Method
-------------

In the second (preferred) method, the noise level is also subtracted
from the speech level in order to obtain the signal to noise ratio.
However, a less arbitrary technique is used to find the speech and
noise levels. The noise level is defined to be the mean of the noise
power distribution. Since the speech and noise distributions in the
RMS power histogram overlap, parameter estimation of the noise
distribution is accomplished by fitting, in the Chi-Square sense, a
raised cosine function to the left hand peak of the complete RMS
histogram.
The raised cosine function can be described by the following
parameters and function:

     bin[i + (Peak_location - Width/2)] =
          Amplitude/2 + Amplitude/2 * cos( (2*Pi * i/Width) - Pi )

Where i varies from 0 to Width and

     Peak_location: the horizontal coordinate of the peak of the
                    cosine function (in number of bins),
     Amplitude:     the vertical coordinate of the peak of the cosine
                    function (in bin counts),
     Width:         the cosine function period (in number of bins).

In fitting the raised cosine function to the noise power distribution,
a guess is first made for the location, amplitude and width of the
leftmost peak; then a solution space search algorithm, "direct search"
by Hooke and Jeeves [1], is invoked to minimize the Chi-Square
distance between the raised cosine function and the targeted noise
peak. Once the best fit is found, the midpoint of the raised cosine
function is labeled as the mean noise power level.

                                               _-_
                                              /   \
     RMS Power Histogram    raised Cosine    /     \
               /                / Curve     /       \
              /         _-_    /           /         \
                       /_-_\  /           /           \
                      //   \\            /             \
                     //     \\          /               \
                    //       \\        /                 \
        __________-/'         `\------'                   `--___
     --'----------'------+------`-------------------------------`--
                        NL

Using the cosine function as an estimate of the noise power
distribution, the cosine function is subtracted from the complete RMS
power histogram in order to estimate the speech power distribution.
The speech level (in this case a peak speech level) is defined to be
the histogram bin midpoint where the 95th percentile occurs in the
speech power histogram. As before, the noise level is subtracted from
the speech level to calculate the SNR.

                                               _-_
                                              /   \
                                             /     \
                                            /       \
                                           /         \
                                          /           \
                                         /             \
                                        /               \
                                       /                 \
        __________               -----'                   `--___
     --'----------`-------------'-----------------------+-------`--
                                                       SL
       |---------------------95 %-----------------------|

Unlike the first method, the second method assumes the power
distribution of the entire file is a mixture of two distributions: one
for the speech and one for the noise.
If the means of the two distributions are close to each other, as in a
very noisy recording, the estimate will be suspect. Other recordings
that sometimes result in unusual (not bi-modal) power distributions
are telephone conversations with an echo-cancelling device in
operation, since these devices tend to produce long periods of
"silence" through squelching. In the next release of this software, we
hope to adapt the algorithm to handle such unusual aspects of
telephone-switchboard speech [see Appendix 2].

[1] "Direct Search" Solution of Numerical and Statistical Problems,
    Robert Hooke and T. A. Jeeves, Journal of the ACM, 1961,
    pp. 212-229.

---------------------------------------------------------------------
   APPENDIX 2 : Adapting the SNR algorithm to the Switchboard Corpus
---------------------------------------------------------------------

The -c option invokes an adaptation of the algorithm described
immediately above in order to cope better with the peculiarities of
the Switchboard Corpus. The main peculiarities are long periods of
"silence" (i.e. sample values of 0) in some files, usually occurring
when the other channel is active; unusual pops and glitches due to
telephone network noise or data collection errors; and crosstalk.
These anomalies, their consequences, and the corresponding adaptation
of the SNR algorithm are discussed below:

PROBLEM: Occasional long periods of "silence" (null samples).

CONSEQUENCES: These lead to large spikes in the short-time energy
histogram at negative infinity dB.

MODIFICATION TO THE ALGORITHM: These "silence" portions, because they
do not contribute any information to the estimation of speech and
noise power, are completely ignored.

PROBLEM: Unusual pops and glitches from the telephone network or data
collection errors.

CONSEQUENCES: If the same glitch occurs frequently, it causes a spike
in the energy histogram. These spikes, if they are sufficiently high,
may be mistaken for the peak of the noise energy distribution.
They may also be mistaken for speech and corrupt the speech peak
energy estimation.

MODIFICATION TO THE ALGORITHM: The complete short-time energy
histogram is 3-point median filtered before computing the peak of the
noise energy distribution and locating the 95th percentile of the
speech energy distribution.

PROBLEM: Crosstalk--viz., speech from one channel is sometimes audible
in the other.

CONSEQUENCES: If the crosstalk is sufficiently energetic and frequent,
it causes a hump in the short-time energy histogram of the file, so
that the distribution is no longer bi-modal as previously assumed.
This means that the curve-fitting algorithm that approximates the
noise energy distribution with a raised cosine sometimes mistakes the
crosstalk hump for the noise hump.

MODIFICATION TO THE ALGORITHM: The algorithm no longer attempts to fit
a raised cosine to the noise distribution. Instead it starts at the
left side of the energy distribution, finds the first peak, and calls
this the "mean noise level." It then tries to find the first trough,
the second peak, and the second trough. Everything up to the first
trough is considered noise; everything between the first trough and
the second trough is considered crosstalk; and everything else is
considered speech. The second peak, if it exists, is reported as an
estimate of the mean crosstalk power. This latter estimate is not
entirely reliable and is included only as a gross estimate. A proper
measurement of crosstalk power would use some calculation of
inter-channel correlation; this crude method, however, may be used
when only one channel is available for analysis.

                              *  *  *

Finally, it is noted that the SNR estimates of mu-law encoded speech
should be taken with a grain of salt inasmuch as they do not account
for the distortions incurred in the quantization process.
That is to say, the SNR estimate should be interpreted strictly as a
ratio of peak (quantized) speech to mean noise power, not as a measure
of the fidelity of the mu-law encoded speech to the "original" speech
(a figure that in most cases is much lower).
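
As an informal illustration of the energy histogram and the "quick"
method described in Appendix 1, the following Python sketch may be
helpful. It is not the stnr source code. The function names, the
0.5 dB bin width, the use of 10 log10 of the frame variance as the
power measure, and the choice to skip all-zero frames (borrowed from
Appendix 2) are all assumptions made for the example.

```python
import numpy as np


def frame_power_db(samples, rate, win_s=0.020, hop_s=0.010):
    """Power (variance) in dB of 20 ms windows shifted by 10 ms."""
    x = np.asarray(samples, dtype=float)
    win = int(round(win_s * rate))
    hop = int(round(hop_s * rate))
    levels = []
    for start in range(0, len(x) - win + 1, hop):
        power = np.var(x[start:start + win])
        if power > 0.0:  # assumption: skip null frames (log of 0 is undefined)
            levels.append(10.0 * np.log10(power))
    return np.array(levels)


def percentile_bin_midpoint(counts, edges, pct):
    """Midpoint of the first bin where the cumulative count reaches pct percent."""
    cumulative = np.cumsum(counts)
    target = pct / 100.0 * cumulative[-1]
    idx = int(np.searchsorted(cumulative, target))
    return 0.5 * (edges[idx] + edges[idx + 1])


def quick_snr(samples, rate, bin_width_db=0.5):
    """Return (snr, noise_level, speech_level) in dB from the 15th/85th percentiles."""
    levels = frame_power_db(samples, rate)
    # Hypothetical binning: fixed-width bins covering the observed range.
    edges = np.arange(np.floor(levels.min()),
                      levels.max() + 2.0 * bin_width_db,
                      bin_width_db)
    counts, edges = np.histogram(levels, bins=edges)
    noise_level = percentile_bin_midpoint(counts, edges, 15)
    speech_level = percentile_bin_midpoint(counts, edges, 85)
    return speech_level - noise_level, noise_level, speech_level
```

For an array of samples recorded at 8 kHz, quick_snr(samples, 8000)
would return the quick SNR estimate together with the noise and speech
levels it was computed from. The preferred method would replace the
two percentile calls with the raised cosine fit described above.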