Steven R. Dunbar
Department of Mathematics
203 Avery Hall
University of Nebraska-Lincoln
Lincoln, NE 68588-0130
Probability Theory and Stochastic Processes
Algorithms for Hidden Markov Models
Mathematically Mature: may contain mathematics beyond calculus with proofs.
Recall that Problem 1 is the evaluation problem: given a model and a sequence of observations, how can we compute the probability that the model produces the sequence? We can also view the problem as: how do we “score” or evaluate the model? If we have several competing models, the solution of Problem 1 allows us to choose the model that best matches the observations.
Let $\lambda = (A, B, \pi)$ be a given model and let $\mathcal{O} = (\mathcal{O}_0, \mathcal{O}_1, \dots, \mathcal{O}_{T-1})$ be a sequence of observations. We want to find $P(\mathcal{O} \mid \lambda)$, that is, we want to find the likelihood of the observed sequence $\mathcal{O}$, given the model. Let $X = (x_0, x_1, \dots, x_{T-1})$ be a state sequence. Then by the definition of $B$ we have
$$P(\mathcal{O} \mid X, \lambda) = b_{x_0}(\mathcal{O}_0)\, b_{x_1}(\mathcal{O}_1) \cdots b_{x_{T-1}}(\mathcal{O}_{T-1})$$
and by the definition of $\pi$ and $A$ it follows that
$$P(X \mid \lambda) = \pi_{x_0}\, a_{x_0 x_1}\, a_{x_1 x_2} \cdots a_{x_{T-2} x_{T-1}},$$
so
$$P(\mathcal{O}, X \mid \lambda) = P(\mathcal{O} \mid X, \lambda)\, P(X \mid \lambda)$$
or in the traditional Hidden Markov Model notation
$$P(\mathcal{O}, X \mid \lambda) = \pi_{x_0} b_{x_0}(\mathcal{O}_0)\, a_{x_0 x_1} b_{x_1}(\mathcal{O}_1) \cdots a_{x_{T-2} x_{T-1}} b_{x_{T-1}}(\mathcal{O}_{T-1}).$$
By summing over all possible state sequences,
$$P(\mathcal{O} \mid \lambda) = \sum_X P(\mathcal{O}, X \mid \lambda) = \sum_X \pi_{x_0} b_{x_0}(\mathcal{O}_0)\, a_{x_0 x_1} b_{x_1}(\mathcal{O}_1) \cdots a_{x_{T-2} x_{T-1}} b_{x_{T-1}}(\mathcal{O}_{T-1}).$$
This direct computation requires about $2TN^T$ multiplications, meaning that the number of multiplications becomes infeasible as either $N$ or $T$ increases.
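To make the combinatorial explosion concrete, here is a minimal brute-force sketch in Python (the function name score_direct is illustrative, not from the references; $\pi$, $A$, $B$ are NumPy arrays indexed from $0$ as in the text). It enumerates all $N^T$ state sequences, so it is usable only for toy examples.

```python
import itertools

def score_direct(pi, A, B, obs):
    """Naive P(O | lambda): sum the joint probability over all N**T state paths.

    pi, A, B are NumPy arrays as in the text; obs is a list of symbol indices.
    """
    N = len(pi)
    total = 0.0
    for path in itertools.product(range(N), repeat=len(obs)):
        p = pi[path[0]] * B[path[0], obs[0]]        # pi_{x0} b_{x0}(O_0)
        for t in range(1, len(obs)):
            p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
        total += p
    return total
```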
However, an efficient algorithm can achieve the same result. The algorithm is the forward algorithm or alpha pass. For $t = 0, 1, \dots, T-1$ and $i = 0, 1, \dots, N-1$, define
$$\alpha_t(i) = P(\mathcal{O}_0, \mathcal{O}_1, \dots, \mathcal{O}_t, x_t = q_i \mid \lambda).$$
Then $\alpha_t(i)$ is the probability of the partial observation sequence up to time $t$, where the underlying Markov process is in state $q_i$ at time $t$. The crucial insight is that we can compute the $\alpha_t(i)$ recursively:
1. For $i = 0, 1, \dots, N-1$, let $\alpha_0(i) = \pi_i\, b_i(\mathcal{O}_0)$.
2. For $t = 1, \dots, T-1$ and $i = 0, 1, \dots, N-1$, compute
$$\alpha_t(i) = \left[ \sum_{j=0}^{N-1} \alpha_{t-1}(j)\, a_{ji} \right] b_i(\mathcal{O}_t).$$
3. Then
$$P(\mathcal{O} \mid \lambda) = \sum_{i=0}^{N-1} \alpha_{T-1}(i).$$
Proof. Step 2 of the Forward Algorithm. Note that this proof uses the traditional mathematical notation for conditional probability, suppressing the model dependency. Let $\mathcal{O}_0, \mathcal{O}_1, \dots, \mathcal{O}_t$ be a sequence of observations. Let $\alpha_t(i) = P(\mathcal{O}_0, \dots, \mathcal{O}_t, x_t = q_i)$, for $0 \le t \le T-1$ and $0 \le i \le N-1$. (The conditional probability of the Markov chain state at time $t$ given $\mathcal{O}_0, \dots, \mathcal{O}_t$ is then $\alpha_t(i)\big/\sum_j \alpha_t(j)$.) Conditioning on the state at time $t-1$,
$$\alpha_t(i) = \sum_{j=0}^{N-1} P(\mathcal{O}_0, \dots, \mathcal{O}_{t-1}, x_{t-1} = q_j, \mathcal{O}_t, x_t = q_i)$$
and note that
$$P(\mathcal{O}_0, \dots, \mathcal{O}_{t-1}, x_{t-1} = q_j, \mathcal{O}_t, x_t = q_i) = \alpha_{t-1}(j)\, a_{ji}\, b_i(\mathcal{O}_t),$$
where the preceding step used the Markov property and the fact that the observation $\mathcal{O}_t$ depends only on the state $x_t$. Summing over $j$ gives
$$\alpha_t(i) = \left[ \sum_{j=0}^{N-1} \alpha_{t-1}(j)\, a_{ji} \right] b_i(\mathcal{O}_t). \qquad \square$$
The forward algorithm requires about $N^2 T$ multiplications, a large improvement over the roughly $2TN^T$ multiplications for the naive approach. More precisely, the forward pass algorithm requires $N(N+1)(T-1) + N$ multiplications and $N(N-1)(T-1)$ additions.
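The recursion translates directly into code. The following is a minimal NumPy sketch of the alpha pass, assuming states and symbols are indexed from $0$ as in the text; the function name forward is illustrative.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Alpha pass: alpha[t, i] = P(O_0, ..., O_t, x_t = q_i | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                     # step 1: alpha_0(i) = pi_i b_i(O_0)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # step 2: sum_j alpha_{t-1}(j) a_ji, times b_i(O_t)
    return alpha                                      # step 3: P(O | lambda) = alpha[-1].sum()
```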
Problem 2 is the estimation problem or discovery problem in which we try to uncover the hidden state sequence. We usually use an optimality criterion to discriminate which sequence best matches the observations. Two optimality criteria are common, so the choice of criterion strongly influences the chosen algorithm and the state sequence revealed. For Hidden Markov Models we often want to maximize the expected number of correct states. In contrast, the Dynamic Programming solution finds the highest-scoring overall path. The two solutions are not necessarily the same.
The formal statement of Problem 2 is: Given the model $\lambda = (A, B, \pi)$ and a sequence of observations $\mathcal{O}$, find an optimal state sequence $X = (x_0, x_1, \dots, x_{T-1})$ for some useful definition of “optimal”. In other words, we want to uncover the hidden states of the Hidden Markov Model.
The following deﬁnes the backward algorithm or beta-pass. This is analogous to the forward algorithm described in the solution to the ﬁrst problem of Hidden Markov Models, except that it starts at the end and works back toward the beginning.
For $t = 0, 1, \dots, T-1$ and $i = 0, 1, \dots, N-1$, define
$$\beta_t(i) = P(\mathcal{O}_{t+1}, \mathcal{O}_{t+2}, \dots, \mathcal{O}_{T-1} \mid x_t = q_i, \lambda).$$
The crucial insight again is that we compute the $\beta_t(i)$ recursively:
1. For $i = 0, 1, \dots, N-1$, let $\beta_{T-1}(i) = 1$.
2. For $t = T-2, T-3, \dots, 0$ and $i = 0, 1, \dots, N-1$, compute
$$\beta_t(i) = \sum_{j=0}^{N-1} a_{ij}\, b_j(\mathcal{O}_{t+1})\, \beta_{t+1}(j).$$
Proof. Step 2 of the Backward Algorithm. Note that this proof uses the traditional mathematical notation for conditional probability, suppressing the model dependency. By the law of total probability, conditioning on the state at time $t+1$:
$$\begin{aligned}
\beta_t(i) &= P(\mathcal{O}_{t+1}, \dots, \mathcal{O}_{T-1} \mid x_t = q_i) \\
&= \sum_{j=0}^{N-1} P(x_{t+1} = q_j \mid x_t = q_i)\, P(\mathcal{O}_{t+1} \mid x_{t+1} = q_j)\, P(\mathcal{O}_{t+2}, \dots, \mathcal{O}_{T-1} \mid x_{t+1} = q_j) \\
&= \sum_{j=0}^{N-1} a_{ij}\, b_j(\mathcal{O}_{t+1})\, \beta_{t+1}(j).
\end{aligned}$$
This allows computation of $P(\mathcal{O} \mid \lambda)$ through
$$P(\mathcal{O} \mid \lambda) = \sum_{i=0}^{N-1} \pi_i\, b_i(\mathcal{O}_0)\, \beta_0(i). \qquad \square$$
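A matching sketch of the beta pass, under the same conventions as the forward sketch above:

```python
import numpy as np

def backward(pi, A, B, obs):
    """Beta pass: beta[t, i] = P(O_{t+1}, ..., O_{T-1} | x_t = q_i, lambda)."""
    T, N = len(obs), len(pi)
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0                                   # step 1: beta_{T-1}(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])  # step 2: sum_j a_ij b_j(O_{t+1}) beta_{t+1}(j)
    # termination: P(O | lambda) = sum_i pi_i b_i(O_0) beta_0(i)
    return beta
```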
For $t = 0, 1, \dots, T-1$ and $i = 0, 1, \dots, N-1$ define
$$\gamma_t(i) = P(x_t = q_i \mid \mathcal{O}, \lambda).$$
Since $\alpha_t(i)$ measures the relevant probability up to time $t$ and $\beta_t(i)$ measures the relevant probability after time $t$,
$$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{P(\mathcal{O} \mid \lambda)}.$$
Recall that we obtain the denominator $P(\mathcal{O} \mid \lambda)$ by summing $\alpha_{T-1}(i)$ over $i$, or alternatively using the last step in the proof of the Backward Algorithm. From the definition of $\gamma_t(i)$ it follows that the most likely state at time $t$ is the state $q_i$ for which $\gamma_t(i)$ is a maximum over the index $i$. Note that this criterion of individually most likely states is distinct from the Viterbi (Dynamic Programming) algorithm discussed later, which finds the single most likely overall path.
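A short sketch of this criterion, reusing the forward and backward functions from the sketches above (the name posterior_states is illustrative):

```python
import numpy as np

def posterior_states(pi, A, B, obs):
    """gamma[t, i] = alpha_t(i) beta_t(i) / P(O | lambda); pick argmax_i at each t."""
    alpha = forward(pi, A, B, obs)    # alpha-pass sketch above
    beta = backward(pi, A, B, obs)    # beta-pass sketch above
    prob_obs = alpha[-1].sum()        # P(O | lambda)
    gamma = alpha * beta / prob_obs
    return gamma, gamma.argmax(axis=1)
```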
Remark. Another approach to obtaining $P(\mathcal{O} \mid \lambda)$ is to combine both the forward and the backward approaches. Suppose that for some $t$ we have computed both $\alpha_t(i)$ and $\beta_t(i)$. Because $\alpha_t(i)\, \beta_t(i) = P(\mathcal{O}, x_t = q_i \mid \lambda)$,
$$P(\mathcal{O} \mid \lambda) = \sum_{i=0}^{N-1} \alpha_t(i)\, \beta_t(i) \quad \text{for any } t.$$
The advantage of the previous identity for determining $P(\mathcal{O} \mid \lambda)$ is that we can simultaneously compute the sequence of forward functions, the alpha passes starting at $t = 0$, and the sequence of backward functions, the beta passes starting at $t = T-1$. The parallel computations can stop when both $\alpha_t(i)$ and $\beta_t(i)$ are known for some common $t$.
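Under the conventions of the earlier sketches, the identity is a one-liner (function name illustrative); any meeting time $t$ gives the same value:

```python
import numpy as np

def prob_obs_at(alpha, beta, t):
    """P(O | lambda) computed from any meeting time t: sum_i alpha_t(i) * beta_t(i)."""
    return float(np.dot(alpha[t], beta[t]))
```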
As a toy example, take the Variable Factory introduced in the section Examples of Hidden Markov Models.
The sequence of Variable Factory states is a classic Markov chain with transition probability matrix
$$A = \begin{pmatrix} 0.9 & 0.1 \\ 0 & 1 \end{pmatrix}$$
and initial distribution $\pi = (0.8, 0.2)$.
The observations are the quality of the items produced in each state.
Then the conditional emission probabilities are
$$P(a \mid 0) = 0.99, \quad P(u \mid 0) = 0.01, \quad P(a \mid 1) = 0.96, \quad P(u \mid 1) = 0.04,$$
so the emission matrix is
$$B = \begin{pmatrix} 0.99 & 0.01 \\ 0.96 & 0.04 \end{pmatrix}.$$
The forward algorithm probabilities are:

| $t$ | 0 (a) | 1 (u) | 2 (a) |
| $\alpha_t(0)$ | 0.792 | 0.007128 | 0.006351 |
| $\alpha_t(1)$ | 0.192 | 0.010848 | 0.011098 |
The backward algorithm probabilities are:

| $t$ | 0 (a) | 1 (u) | 2 (a) |
| $\beta_t(0)$ | 0.012723 | 0.987 | 1 |
| $\beta_t(1)$ | 0.0384 | 0.96 | 1 |
The joint probabilities $\alpha_t(i)\, \beta_t(i) = P(\mathcal{O}, x_t = q_i \mid \lambda)$ are:

| $t$ | 0 (a) | 1 (u) | 2 (a) |
| $i = 0$ | 0.010077 | 0.007035 | 0.006351 |
| $i = 1$ | 0.007373 | 0.010414 | 0.011098 |

Each column sums to $P(\mathcal{O} \mid \lambda) = 0.017449$.
The gamma probabilities are:

| $t$ | 0 (a) | 1 (u) | 2 (a) |
| $\gamma_t(0)$ | 0.5775 | 0.4032 | 0.3640 |
| $\gamma_t(1)$ | 0.4225 | 0.5968 | 0.6360 |
Choosing the individually most likely state at each time, that is, maximizing $\gamma_t(i)$ over $i$ for each $t$, gives the state sequence $0, 1, 1$, i.e. factory good, not good, not good. Note that here all state transitions are valid, but valid transitions may not always result from application of this algorithm. This individually most likely state path is different from the most likely overall path exhaustively computed in the examples of Hidden Markov Models.
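The tables above can be checked numerically. This sketch reuses the forward and backward functions from the earlier sketches on the Variable Factory model, coding the symbols as 0 = a and 1 = u:

```python
import numpy as np

# Variable Factory model from above: state 0 = good, 1 = bad; symbol 0 = a, 1 = u.
pi = np.array([0.8, 0.2])
A = np.array([[0.9, 0.1],
              [0.0, 1.0]])
B = np.array([[0.99, 0.01],
              [0.96, 0.04]])
obs = [0, 1, 0]                         # observed qualities a, u, a

alpha = forward(pi, A, B, obs)          # forward sketch above
beta = backward(pi, A, B, obs)          # backward sketch above
gamma = alpha * beta / alpha[-1].sum()  # gamma_t(i) as defined in the text
print(alpha[-1].sum())                  # P(O | lambda), about 0.017449
print(gamma.argmax(axis=1))             # [0 1 1]: good, not good, not good
```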
The solution of Problem 3 attempts to optimize the model parameters to best represent how the observed sequence comes about. The observed sequence used to solve Problem 3 is a training sequence since it “trains” the model. This training problem is crucial for many applications of Hidden Markov Models since it produces the best models of real phenomena.
The formal statement of Problem 3 is: Given an observation sequence $\mathcal{O}$, determine by some means (physical knowledge, intuition, best guessing) the fixed dimensions $N$ and $M$, then find the model $\lambda = (A, B, \pi)$ that maximizes the probability of $\mathcal{O}$. The interpretation is training a model by adjusting the model parameters to best fit the observations. We can also view this as a discrete-time search in the parameter space represented by $A$, $B$, and $\pi$. The sizes of the matrices $A$ and $B$ are fixed but the algorithm will find the elements of $A$, $B$, and $\pi$, subject to the row stochastic condition. The fact that we can efficiently re-estimate the model itself at each step is one of the more amazing aspects of Hidden Markov Models.
For $t = 0, 1, \dots, T-2$ and $i, j \in \{0, 1, \dots, N-1\}$, define the xi passes as
$$\xi_t(i,j) = P(x_t = q_i, x_{t+1} = q_j \mid \mathcal{O}, \lambda),$$
so $\xi_t(i,j)$ is the probability of being in state $q_i$ at time $t$ and transitioning to state $q_j$ at time $t+1$, given the observations. The xi's in terms of $\alpha$, $\beta$, and the model parameters are
$$\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(\mathcal{O}_{t+1})\, \beta_{t+1}(j)}{P(\mathcal{O} \mid \lambda)}.$$
For $t = 0, 1, \dots, T-2$, the relation between $\gamma_t(i)$ and $\xi_t(i,j)$ is
$$\gamma_t(i) = \sum_{j=0}^{N-1} \xi_t(i,j).$$
Given the gammas and the xis, we can re-estimate the model as follows:
$$\bar{\pi}_i = \gamma_0(i), \quad 0 \le i \le N-1,$$
$$\bar{a}_{ij} = \frac{\sum_{t=0}^{T-2} \xi_t(i,j)}{\sum_{t=0}^{T-2} \gamma_t(i)}, \quad 0 \le i, j \le N-1,$$
$$\bar{b}_j(k) = \frac{\sum_{t \,:\, \mathcal{O}_t = k} \gamma_t(j)}{\sum_{t=0}^{T-1} \gamma_t(j)}, \quad 0 \le j \le N-1, \ 0 \le k \le M-1.$$
The numerator of the re-estimated $\bar{a}_{ij}$ gives the expected number of transitions from state $q_i$ to state $q_j$, while the denominator is the expected number of transitions from $q_i$ to any state. Then the ratio is the empirical probability of transitioning from state $q_i$ to state $q_j$, the desired value of $\bar{a}_{ij}$.
The numerator of the re-estimated $\bar{b}_j(k)$ gives the expected number of times the model is in state $q_j$ with observation $k$, while the denominator is the expected number of times the model is in state $q_j$. Then the ratio is the empirical probability of observing symbol $k$ given that the model is in state $q_j$, the desired value of $\bar{b}_j(k)$.
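The re-estimation formulas translate into the following NumPy sketch, reusing the forward and backward functions from the earlier sketches. The function name reestimate is illustrative, and the code follows the unscaled formulas, so it is suitable only for short observation sequences (see the scaling discussion below).

```python
import numpy as np

def reestimate(pi, A, B, obs):
    """One Baum-Welch update from the gamma and xi formulas above."""
    T, N = len(obs), len(pi)
    alpha, beta = forward(pi, A, B, obs), backward(pi, A, B, obs)
    prob_obs = alpha[-1].sum()                       # P(O | lambda)
    gamma = alpha * beta / prob_obs
    # xi[t, i, j] = alpha_t(i) a_ij b_j(O_{t+1}) beta_{t+1}(j) / P(O | lambda)
    xi = np.array([np.outer(alpha[t], B[:, obs[t + 1]] * beta[t + 1]) * A
                   for t in range(T - 1)]) / prob_obs
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):                      # sum gamma over times with O_t = k
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B
```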
Re-estimation is an iterative process. First we initialize $\lambda = (A, B, \pi)$ with a best guess, or if no reasonable guess is available, we choose random values such that $\pi_i \approx 1/N$, $a_{ij} \approx 1/N$, and $b_j(k) \approx 1/M$. It is critical that $A$, $B$, and $\pi$ are not exactly uniform, since exactly uniform values result in a local maximum from which the model cannot climb. As always, $\pi$, $A$, $B$ must be row stochastic. One way to accomplish this is to start with each row uniform, say $1/N$, select random values $\epsilon_i$ in the range $(0, 1/N)$, find the mean $\bar{\epsilon}$, and take $p_i = 1/N + \epsilon_i - \bar{\epsilon}$, so $\sum_i p_i = 1$ and $|\epsilon_i - \bar{\epsilon}| < 1/N$. Then the row is positive and row-stochastic. Another way to accomplish this (which is employed by Dr. Lin Himmelmann, in the initHMM function in the HMM R library package) is to take
$$A = \frac{1}{2} I + \frac{1}{2} \bar{J}$$
where $I$ is the $N \times N$ identity matrix and $\bar{J}$ is an $N \times N$ matrix with $1/N$ in every entry. The initial distribution and emission matrices are taken by Himmelmann to be uniform. Yet another way to accomplish this (which is employed by Mo Chen, in the hmmEm function in the Hidden Markov Model Toolbox Matlab library package) is to fill appropriately sized matrices and vectors with uniform random values and then normalize so the rows sum to $1$.
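As an illustration of the first perturbation recipe, here is a minimal NumPy sketch that returns a near-uniform row-stochastic matrix; the function name near_uniform_rows is illustrative:

```python
import numpy as np

def near_uniform_rows(n_rows, n_cols, rng=None):
    """Row-stochastic matrix with rows 1/n + (eps_i - mean(eps)), eps_i in (0, 1/n)."""
    if rng is None:
        rng = np.random.default_rng()
    eps = rng.uniform(0.0, 1.0 / n_cols, size=(n_rows, n_cols))
    rows = 1.0 / n_cols + eps - eps.mean(axis=1, keepdims=True)
    return rows  # each row sums to 1 and every entry stays strictly positive
```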
Summarize the iterative process as (a sketch of the loop appears below):
1. Initialize $\lambda = (A, B, \pi)$.
2. Compute $\alpha_t(i)$, $\beta_t(i)$, $\gamma_t(i)$, and $\xi_t(i,j)$.
3. Re-estimate the model $\lambda = (A, B, \pi)$ with the formulas above.
4. If $P(\mathcal{O} \mid \lambda)$ increases, repeat from step 2, stopping when the increase falls below a chosen threshold or after a maximum number of iterations.
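A minimal sketch of the whole loop, reusing the reestimate and forward functions from the sketches above; the stopping rule and tolerance are illustrative choices:

```python
import numpy as np

def train(pi, A, B, obs, max_iters=100, tol=1e-6):
    """Baum-Welch loop: re-estimate until P(O | lambda) stops improving."""
    old_prob = -np.inf
    for _ in range(max_iters):
        pi, A, B = reestimate(pi, A, B, obs)      # re-estimation sketch above
        prob = forward(pi, A, B, obs)[-1].sum()   # P(O | lambda) for the new model
        if prob - old_prob < tol:                 # stop when the climb levels off
            break
        old_prob = prob
    return pi, A, B
```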
In a mathematically dense pair of papers, Baum et al. show the Baum-Welch algorithm is basically a hill-climbing technique in the parameter space of the $A$ and $B$ matrices, so it converges at least to a local maximizer of the likelihood function. However, as with any hill-climbing algorithm, it is also prone to getting stuck in local maxima, especially if the data come from a Hidden Markov Model that is far from the initial guess. On the other hand, if the starting point of the iterations is sufficiently close to the true but unknown Hidden Markov Model, then the algorithm may converge to the true Hidden Markov Model. Note that the Baum-Welch algorithm is also referred to as an expectation maximization (EM) algorithm.
Even the toy Variable Factory example above shows that all computations involve multiple products of probabilities, so the products tend to $0$ quickly as $T$ increases. Therefore, floating-point underflow is a serious problem for computation. The basic scaling procedure multiplies $\alpha_t(i)$ by a scaling coefficient $c_t$ independent of $i$ but depending on $t$, with the goal of keeping the scaled values within the dynamic range of the computer for $t = 0, 1, \dots, T-1$. The solution is to scale all computations by the sum over $i$ of the $\alpha_t(i)$, while ensuring that all formulas remain valid. The scaling ensures that the set of rescaled values $\hat{\alpha}_t(i)$ sum over $i$ to $1$, so heuristically the $\hat{\alpha}_t(i)$ are on the order of $1/N$. Precisely, define $\hat{\alpha}_0(i) = c_0\, \alpha_0(i)$ with $c_0 = 1 \big/ \sum_{j=0}^{N-1} \alpha_0(j)$, and for $t \ge 1$ define
$$\tilde{\alpha}_t(i) = \sum_{j=0}^{N-1} \hat{\alpha}_{t-1}(j)\, a_{ji}\, b_i(\mathcal{O}_t), \qquad c_t = \frac{1}{\sum_{j=0}^{N-1} \tilde{\alpha}_t(j)}, \qquad \hat{\alpha}_t(i) = c_t\, \tilde{\alpha}_t(i).$$
The claim is that $\hat{\alpha}_t(i) = c_0 c_1 \cdots c_t\, \alpha_t(i)$, so each $\hat{\alpha}_t(i)$ is the properly normalized version of $\alpha_t(i)$.
Proof. The proof is by induction. To start, $\hat{\alpha}_0(i) = c_0\, \alpha_0(i)$ by definition. The induction assumption is that
$$\hat{\alpha}_{t-1}(j) = c_0 c_1 \cdots c_{t-1}\, \alpha_{t-1}(j)$$
holds for all $j$. Then by definition
$$\hat{\alpha}_t(i) = c_t\, \tilde{\alpha}_t(i) = c_t \sum_{j=0}^{N-1} \hat{\alpha}_{t-1}(j)\, a_{ji}\, b_i(\mathcal{O}_t) = c_0 c_1 \cdots c_t \sum_{j=0}^{N-1} \alpha_{t-1}(j)\, a_{ji}\, b_i(\mathcal{O}_t) = c_0 c_1 \cdots c_t\, \alpha_t(i),$$
hence the $\hat{\alpha}_t(i)$ are the properly scaled values of $\alpha_t(i)$ for all $t$. As a consequence,
$$1 = \sum_{j=0}^{N-1} \hat{\alpha}_{T-1}(j) = c_0 c_1 \cdots c_{T-1} \sum_{j=0}^{N-1} \alpha_{T-1}(j) = c_0 c_1 \cdots c_{T-1}\, P(\mathcal{O} \mid \lambda).$$
Combining the last two results gives
$$P(\mathcal{O} \mid \lambda) = \frac{1}{\prod_{t=0}^{T-1} c_t}. \qquad \square$$
Remark. The only real change to the algorithm because of scaling is the procedure for computing $P(\mathcal{O} \mid \lambda)$. We cannot merely sum up the scaled values $\hat{\alpha}_{T-1}(i)$, since they already sum to $1$. Additionally, to avoid underflow, instead compute
$$\log P(\mathcal{O} \mid \lambda) = -\sum_{t=0}^{T-1} \log c_t.$$
Thus, we can compute the log of $P(\mathcal{O} \mid \lambda)$, but not $P(\mathcal{O} \mid \lambda)$ itself, since the latter would be out of the dynamic range of the computer anyway.
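A sketch of the scaled alpha pass, following the definitions of $c_t$ and $\hat{\alpha}_t(i)$ above and returning the log likelihood; the function name forward_scaled is illustrative:

```python
import numpy as np

def forward_scaled(pi, A, B, obs):
    """Scaled alpha pass; returns alpha-hat and log P(O | lambda) = -sum_t log c_t."""
    T, N = len(obs), len(pi)
    alpha_hat = np.zeros((T, N))
    log_c = np.zeros(T)
    a = pi * B[:, obs[0]]                            # unscaled alpha_0
    for t in range(T):
        if t > 0:
            a = (alpha_hat[t - 1] @ A) * B[:, obs[t]]  # tilde-alpha_t from alpha-hat_{t-1}
        c = 1.0 / a.sum()                            # c_t rescales so the row sums to 1
        alpha_hat[t] = c * a
        log_c[t] = np.log(c)
    return alpha_hat, -log_c.sum()                   # log P(O | lambda)
```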
Remark. For convenience let $C_t = c_0 c_1 \cdots c_t$ and $D_{t+1} = c_{t+1} c_{t+2} \cdots c_{T-1}$, so $\hat{\alpha}_t(i) = C_t\, \alpha_t(i)$ and $\hat{\beta}_{t+1}(j) = D_{t+1}\, \beta_{t+1}(j)$, and $C_t D_{t+1} = c_0 c_1 \cdots c_{T-1}$ is independent of $t$. Then the estimation formula for the Markov state transition probabilities becomes
$$\bar{a}_{ij} = \frac{\sum_{t=0}^{T-2} \hat{\alpha}_t(i)\, a_{ij}\, b_j(\mathcal{O}_{t+1})\, \hat{\beta}_{t+1}(j)}{\sum_{t=0}^{T-2} \sum_{j=0}^{N-1} \hat{\alpha}_t(i)\, a_{ij}\, b_j(\mathcal{O}_{t+1})\, \hat{\beta}_{t+1}(j)},$$
unchanged in form, since the common factor $c_0 c_1 \cdots c_{T-1}$ cancels between numerator and denominator.
The last problem is to ﬁnd the most likely overall path given the observations (not the states that are individually most likely at each time). That is, choose the sequence of states whose conditional probability, given the sequence of signals, is maximal. The Dynamic Programming algorithm solves this problem. The Dynamic Programming algorithm is the forward algorithm with “sum” replaced by “max”.
The Dynamic Programming Algorithm only gives the optimal probability, not the path. To ﬁnd the most likely state path, one must also keep track of preceding states at each stage, tracing back from the highest-scoring ﬁnal state.
Proof. Let $X_k = (x_0, x_1, \dots, x_{k-1})$ be a vector of the first $k$ states. The goal is to find the specific sequence of states $X_T$ that maximizes $P(X_T \mid \mathcal{O})$. Because
$$P(X_T \mid \mathcal{O}) = \frac{P(X_T, \mathcal{O})}{P(\mathcal{O})}$$
and the denominator does not depend on the state sequence, this is equivalent to finding the sequence of states that maximizes $P(X_T, \mathcal{O})$.
To solve this maximization problem, for $t = 0, 1, \dots, T-1$ and $i = 0, 1, \dots, N-1$ let
$$\delta_t(i) = \max_{x_0, \dots, x_{t-1}} P(x_0, \dots, x_{t-1}, x_t = q_i, \mathcal{O}_0, \dots, \mathcal{O}_t).$$
Recursively solve for $\delta_t(i)$ with
$$\delta_0(i) = \pi_i\, b_i(\mathcal{O}_0), \qquad \delta_t(i) = \max_{0 \le j \le N-1} \left[ \delta_{t-1}(j)\, a_{ji} \right] b_i(\mathcal{O}_t). \qquad \square$$
Numerical underflow is again a concern with the Dynamic Programming algorithm, again because of the multiple products of probabilities. For the Dynamic Programming algorithm, avoid underflow by simply taking logarithms. Then the underflow-resistant version of Dynamic Programming is
$$\hat{\delta}_0(i) = \log \pi_i + \log b_i(\mathcal{O}_0), \qquad \hat{\delta}_t(i) = \max_{0 \le j \le N-1} \left[ \hat{\delta}_{t-1}(j) + \log a_{ji} \right] + \log b_i(\mathcal{O}_t).$$
This just finds the optimal probability; finding the optimal overall path requires recording the preceding states.
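A sketch of the underflow-resistant Dynamic Programming algorithm with the required state recording (backpointers); names are illustrative:

```python
import numpy as np

def viterbi_log(pi, A, B, obs):
    """Log-space dynamic programming with traceback for the best overall path."""
    T, N = len(obs), len(pi)
    with np.errstate(divide="ignore"):           # log 0 = -inf marks impossible moves
        log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    delta = np.zeros((T, N))
    prev = np.zeros((T, N), dtype=int)           # best predecessor of each state
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # scores[j, i] = delta_{t-1}(j) + log a_ji
        prev[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]             # best final state
    for t in range(T - 1, 0, -1):                # trace back through the backpointers
        path.append(int(prev[t][path[-1]]))
    return path[::-1]
```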
Once again, consider the case of the Variable Factory:
| $t$ | 0 (a) | 1 (u) | 2 (a) |
| $\delta_t(0)$ | 0.792 | 0.007128 | 0.006351 |
| $\delta_t(1)$ | 0.192 | 0.00768 | 0.0073728 |
Tracing back the source of the maximum in the final column, $\delta_2(1) = 0.0073728$, the maximum overall state sequence is $1, 1, 1$, the same as the exhaustive search found. This simple example required far fewer multiplications than the exhaustive listing of all possibilities. This is not surprising, since Dynamic Programming discards one of the two candidate paths into each state at every stage. Dynamic Programming achieves a similar efficiency for larger problems.
The algorithms in this section are adapted from Stamp and Rabiner. The proofs of the algorithms are adapted from Ross. Additional ideas are adapted from Rabiner [1, 2]. The schematic diagrams are adapted from Rabiner. The problems are adapted from Stamp.
Lawrence R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2):257–286, February 1989. hidden Markov model, speech recognition.
I check all the information on each page for correctness and typographical errors. Nevertheless, some errors may occur and I would be grateful if you would alert me to such errors. I make every reasonable eﬀort to present current and accurate information for public use, however I do not guarantee the accuracy or timeliness of information on this website. Your use of the information from this website is strictly voluntary and at your risk.
I have checked the links to external sites for usefulness. Links to external websites are provided as a convenience. I do not endorse, control, monitor, or guarantee the information contained in any external website. I don’t guarantee that the links are active at all times. Use the links here with the same caution as you would all information on the Internet. This website reﬂects the thoughts, interests and opinions of its author. They do not explicitly represent oﬃcial positions or policies of my employer.
Information on this website is subject to change without notice.
Steve Dunbar’s Home Page, http://www.math.unl.edu/~sdunbar1
Email to Steve Dunbar, sdunbar1 at unl dot edu
Last modiﬁed: Processed from LATEX source on May 9, 2017