next up previous contents index
Next: 1.5 Recognition and Viterbi Decoding Up: 1 The Fundamentals of HTK Previous: 1.3 Output Probability Specification

1.4 Baum-Welch Re-Estimation


To determine the parameters of a HMM it is first necessary to make a rough guess at what they might be. Once this is done, more accurate (in the maximum likelihood sense) parameters can be found by applying the so-called Baum-Welch re-estimation  formulae.


Chapter 8 gives the formulae used in HTK in full detail. Here the basis of the formulae will be presented in a very informal way. Firstly, it should be noted that the inclusion of multiple data streams does not alter matters significantly since each stream is considered to be statistically independent. Furthermore, mixture components can be considered to be a special form of sub-state in which the transition probabilities are the mixture weights (see Fig. 1.5).

Thus, the essential problem is to estimate the means and variances of a HMM in which each state output distribution is a single component Gaussian, that is


If there was just one state j in the HMM, this parameter estimation would be easy. The maximum likelihood estimates of tex2html_wrap_inline19572 and tex2html_wrap_inline19574 would be just the simple averages, that is




In practice, of course, there are multiple states and there is no direct assignment of observation vectors to individual states because the underlying state sequence is unknown. Note, however, that if some approximate assignment of vectors to states could be made then equations 1.11 and 1.12 could be used to give the required initial values for the parameters. Indeed, this is exactly what is done in the HTK tool called HINIT . HINIT first divides the training observation vectors equally amongst the model states and then uses equations 1.11 and 1.12 to give initial values for the mean and variance of each state. It then finds the maximum likelihood state sequence using the Viterbi  algorithm described below, reassigns the observation vectors to states and then uses equations 1.11 and 1.12 again to get better initial values. This process is repeated until the estimates do not change.

Since the full likelihood of each observation sequence is based on the summation of all possible state sequences, each observation vector tex2html_wrap_inline19590 contributes to the computation of the maximum likelihood parameter values for each state j. In other words, instead of assigning each observation vector to a specific state as in the above approximation, each observation is assigned to every state in proportion to the probability of the model being in that state when the vector was observed. Thus, if tex2html_wrap_inline19594 denotes the probability of being in state j at time t then the equations 1.11 and 1.12 given above become the following weighted averages




where the summations in the denominators are included to give the required normalisation.

Equations 1.13 and 1.14 are the Baum-Welch re-estimation  formulae for the means and covariances of a HMM. A similar but slightly more complex formula can be derived for the transition probabilities (see chapter 8).

Of course, to apply equations 1.13 and 1.14, the probability of state occupation tex2html_wrap_inline19614 must be calculated. This is done efficiently using the so-called Forward-Backward   algorithm. Let the forward probabilitygif   tex2html_wrap_inline19616 for some model M with N states be defined as


That is, tex2html_wrap_inline19626 is the joint probability of observing the first t speech vectors and being in state j at time t. This forward probability can be efficiently calculated by the following recursion


This recursion depends on the fact that the probability of being in state j at time t and seeing observation tex2html_wrap_inline19640 can be deduced by summing the forward probabilities for all possible predecessor states i weighted by the transition probability tex2html_wrap_inline19644 . The slightly odd limits are caused by the fact that states 1 and N are non-emittinggif. The initial conditions for the above recursion are



for 1<j<N and the final condition is given by


Notice here that from the definition of tex2html_wrap_inline19660 ,


Hence, the calculation of the forward probability also yields the total likelihood  tex2html_wrap_inline19664 .

The backward probability  tex2html_wrap_inline19666 is defined as


As in the forward case, this backward probability can be computed efficiently using the following recursion


with initial condition given by


for 1<i<N and final condition given by


Notice that in the definitions above, the forward probability is a joint probability whereas the backward probability is a conditional probability. This somewhat asymmetric definition is deliberate since it allows the probability of state occupation to be determined by taking the product of the two probabilities. From the definitions,




where tex2html_wrap_inline19686 .

All of the information needed to perform HMM parameter re-estimation using the Baum-Welch algorithm  is now in place. The steps in this algorithm may be summarised as follows

  1. For every parameter vector/matrix requiring re-estimation, allocate storage for the numerator and denominator summations of the form illustrated by equations 1.13 and 1.14. These storage locations are referred to as  accumulatorsgif.
  2. Calculate the forward and backward probabilities for all states j and times t.
  3. For each state j and time t, use the probability tex2html_wrap_inline19696 and the current observation vector tex2html_wrap_inline19698 to update the accumulators for that state.
  4. Use the final accumulator values to calculate new parameter values.
  5. If the value of tex2html_wrap_inline19700 for this iteration is not higher than the value at the previous iteration then stop, otherwise repeat the above steps using the new re-estimated parameter values.

All of the above assumes that the parameters for a HMM are re-estimated from a single observation sequence, that is a single example of the spoken word. In practice, many examples are needed to get good parameter estimates. However, the use of multiple observation sequences adds no additional complexity to the algorithm. Steps 2 and 3 above are simply repeated for each distinct training sequence.

One final point that should be mentioned is that the computation of the forward and backward probabilities involves taking the product of a large number of probabilities. In practice, this means that the actual numbers involved become very small. Hence, to avoid numerical problems, the forward-backward computation is computed in HTK using log arithmetic .

The HTK program which implements the above algorithm is called HREST . In combination with the tool HINIT  for estimating initial values mentioned earlier, HREST allows isolated word HMMs to be constructed from a set of training examples using Baum-Welch re-estimation.

next up previous contents index
Next: 1.5 Recognition and Viterbi Decoding Up: 1 The Fundamentals of HTK Previous: 1.3 Output Probability Specification

ECRL HTK_V2.1: email