Steven R. Dunbar
Department of Mathematics
203 Avery Hall
Lincoln, NE 68588-0130
http://www.math.unl.edu
Voice: 402-472-3731
Fax: 402-472-8466

Topics in
Probability Theory and Stochastic Processes
Steven R. Dunbar

__________________________________________________________________________

Algorithms for Hidden Markov Models

_______________________________________________________________________

Note: These pages are prepared with MathJax. MathJax is an open source JavaScript display engine for mathematics that works in all browsers. See http://mathjax.org for details on supported browsers, accessibility, copy-and-paste, and other features.

_______________________________________________________________________________________________ ### Rating

Mathematically Mature: may contain mathematics beyond calculus with proofs.

_______________________________________________________________________________________________ ### Section Starter Question

__________________________________________________________________________ ### Key Concepts

1. The eﬃcient algorithm called the forward algorithm ﬁnds $ℙ\left[\mathsc{𝒪}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\lambda \right]$. That is, the forward algorithm ﬁnds the likelihood of the observed sequence $\mathsc{𝒪}$, given the model.
2. The backward algorithm is analogous to the forward algorithm in the solution to the ﬁrst problem of Hidden Markov Models, except that it starts at the end and works back toward the beginning.
3. The Viterbi algorithm gives the individually most likely state ${q}_{i}$ at time $t$.
4. The training problem is crucial for most applications of hidden Markov models since it creates best models for real phenomena. The Baum-Welch algorithm solves the training problem.
5. The Dynamic Programming algorithm solves the problem of ﬁnding the sequence of states whose conditional probability given the sequence of signals is maximal. The Dynamic Programming algorithm is the forward algorithm with “sum” replaced by “max”.

__________________________________________________________________________ ### Vocabulary

1. The forward algorithm is the eﬃcient algorithm to ﬁnd $ℙ\left[\mathsc{𝒪}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\lambda \right]$, that is, to ﬁnd the likelihood of the observed sequence $\mathsc{𝒪}$, given the model.
2. The backward algorithm is analogous to the forward algorithm in the solution to the ﬁrst problem of Hidden Markov Models, except that it starts at the end and works back toward the beginning.
3. The Viterbi algorithm gives the individually most likely state ${q}_{i}$ at time $t$.
4. The observed sequence used to solve Problem 3 is a training sequence since it “trains” the model.
5. The Dynamic Programming algorithm solves the problem of ﬁnding the sequence of states whose conditional probability given the sequence of signals is maximal.

__________________________________________________________________________ ### Mathematical Ideas

#### Solution to Problem 1

Recall that Problem 1 is the evaluation problem: given a model and a sequence of observations, how can we compute the probability that the model produces the sequence. We can also view the problem as: how do we “score” or evaluate the model? If we think of the case in which we have several competing models, the solution of Problem 1 allows us to choose the model that best matches the observations.

Let $\lambda =\left(A,B,\pi \right)$ be a given model and let $\mathsc{𝒪}=\left({\mathsc{𝒪}}_{0},{\mathsc{𝒪}}_{1},\dots ,{\mathsc{𝒪}}_{T-1}\right)$ be a series of observations. We want to ﬁnd $ℙ\left[\mathsc{𝒪}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\lambda \right]$, that is, we want to ﬁnd the likelihood of the observed sequence $\mathsc{𝒪}$, given the model. Let $X=\left({x}_{0},{x}_{1},\dots ,{x}_{T-1}\right)$ be a state sequence. Then by the deﬁnition of $B$ we have

$ℙ\left[\mathsc{𝒪}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}X,\lambda \right]={b}_{{x}_{0}}\left({\mathsc{𝒪}}_{0}\right)\cdot {b}_{{x}_{1}}\left({\mathsc{𝒪}}_{1}\right)\cdots {b}_{{x}_{T-1}}\left({\mathsc{𝒪}}_{T-1}\right)$

and by the deﬁnition of $\pi$ and $A$ it follows that

$ℙ\left[X\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\lambda \right]={\pi }_{{x}_{0}}{a}_{{x}_{0},{x}_{1}}{a}_{{x}_{1},{x}_{2}}\cdots {a}_{{x}_{T-2},{x}_{T-1}}.$

Since

$ℙ\left[\mathsc{𝒪}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}X,\lambda \right]=ℙ\lambda \left[\mathsc{𝒪}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}X\right]=\frac{ℙ\lambda \left[\mathsc{𝒪}\cap X\right]}{ℙ\lambda \left[X\right]}$

we have

$ℙ\lambda \left[\mathsc{𝒪},X\right]=ℙ\lambda \left[\mathsc{𝒪}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}X\right]\cdot ℙ\lambda \left[X\right]$

or in the traditional Hidden Markov Model notation

$ℙ\left[\mathsc{𝒪},X\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\lambda \right]=ℙ\left[\mathsc{𝒪}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}X,\lambda \right]\cdot ℙ\left[X\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\lambda \right].$

By summing over all possible state sequences

$\begin{array}{llll}\hfill ℙ\left[\mathsc{𝒪}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\lambda \right]& =\sum _{X}ℙ\left[\mathsc{𝒪},X\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\lambda \right]\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & =\sum _{X}ℙ\left[\mathsc{𝒪}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}X,\lambda \right]ℙ\left[X\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\lambda \right]\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & =\sum _{X}{\pi }_{{x}_{0}}{b}_{{x}_{0}}\left({\mathsc{𝒪}}_{0}\right){a}_{{x}_{0},{x}_{1}}{b}_{{x}_{1}}\left({\mathsc{𝒪}}_{1}\right){a}_{{x}_{1},{x}_{2}}\cdots {a}_{{x}_{T-2},{x}_{T-1}}{b}_{{x}_{T-1}}\left({\mathsc{𝒪}}_{T-1}\right)\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\end{array}$

This direct computation requires about $2T{N}^{T}$ multiplications, meaning that the number of multiplications becomes infeasible as either $N$ or $T$ increases.

However, an eﬃcient algorithm can achieve the same result. The algorithm is the forward algorithm or alpha pass. For $t=0,1,2,\dots T-1$ and $i=0,1,\dots ,N-1$, deﬁne

${\alpha }_{t}\left(i\right)=ℙ\left[{\mathsc{𝒪}}_{0},{\mathsc{𝒪}}_{1},{\mathsc{𝒪}}_{2},\dots {\mathsc{𝒪}}_{t},{x}_{t}={q}_{i}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\lambda \right].$

Then $\alpha \left(i\right)$ is the probability of the partial observation of the sequence up to time $t$, where the underlying Markov process is in state ${q}_{i}$ at time $t$. The crucial insight is that we can compute the $\alpha \left(i\right)$ recursively. Figure 1: A schematic diagram of the event deﬁned by ${\alpha }_{2}\left(i\right)$

Algorithm 1 (Forward Algorithm).

1. For $i=0,1,\dots ,N-1$, let ${\alpha }_{0}\left(i\right)={\pi }_{i}{b}_{i}\left({\mathsc{𝒪}}_{0}\right)$.
2. For $t=1,2,\dots ,T-1$ and $i=0,1,\dots ,N-1$, compute
${\alpha }_{t}\left(i\right)=\left[\sum _{j=0}^{N-1}{\alpha }_{t-1}\left(j\right){a}_{ji}\right]{b}_{i}\left({\mathsc{𝒪}}_{t}\right).$

3. Then it follows that
$ℙ\left[\mathsc{𝒪}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\lambda \right]=\sum _{i=0}^{N-1}{\alpha }_{T-1}\left(i\right).$ Figure 2: Schematic diagram of the computations in the forward algorithm as a lattice in time and states.

Proof. Step 2 of the Forward Algorithm.

Note that this proof uses the traditional mathematical notation for conditional probability, suppressing the model dependency. Let $\mathsc{𝒪}=\left({\mathsc{𝒪}}_{0},{\mathsc{𝒪}}_{1},\dots ,{\mathsc{𝒪}}_{T-1}\right)$ be a sequence of observations. Let ${\left(\mathsc{𝒪}\right)}_{k}=\left({\mathsc{𝒪}}_{0},{\mathsc{𝒪}}_{1},\dots ,{\mathsc{𝒪}}_{k-1}\right)\right)$, for $k\le n$. Find the conditional probability of the Markov chain state at time $n$ given $\mathsc{𝒪}$. Let

${\alpha }_{t}\left(j\right)=ℙ\left[{\left(\mathsc{𝒪}\right)}_{t},{x}_{t}=j\right]$

and note that

$\begin{array}{llll}\hfill ℙ\left[{x}_{t}=j\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}{\left(\mathsc{𝒪}\right)}_{t}\right]& =\frac{ℙ\left[{x}_{t}=j,{\left(\mathsc{𝒪}\right)}_{t}\right]}{ℙ\left[{\left(\mathsc{𝒪}\right)}_{t}\right]}\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & =\frac{{\alpha }_{t}\left(j\right)}{\sum _{i}{\alpha }_{t}\left(i\right)}.\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & \phantom{\rule{2em}{0ex}}& \hfill \end{array}$

Now,

where the preceding step used that

The forward algorithm requires about ${N}^{2}T$ multiplications, a large improvement over the $2T{N}^{T}$ multiplications for the naive approach. More precisely, the forward pass algorithm requires $N\left(N+1\right)\left(T-1\right)+N$ multiplications and $N\left(N-1\right)\left(T-1\right)$ additions.

#### Solution to Problem 2

Problem 2 is the estimation problem or discovery problem in which we try to uncover the hidden state sequence. We usually use an optimality criterion to discriminate which sequence best matches the observations. Two optimality criteria are common, and so the choice of criterion is a strong inﬂuence on chosen algorithm and the state sequence revealed. For Hidden Markov Models we often want to maximize the expected number of correct states. In contrast, the Dynamic Programming solution ﬁnds the highest scoring overall path. The two solutions are not necessarily the same.

The formal statement of Problem 2 is: Given the model $\lambda =\left(A,B,\pi \right)$ and a sequence of observations $\mathsc{𝒪}$, ﬁnd an optimal state sequence for some useful deﬁnition of “optimal”. In other words, we want to uncover the hidden states of the Hidden Markov Model.

The following deﬁnes the backward algorithm or beta-pass. This is analogous to the forward algorithm described in the solution to the ﬁrst problem of Hidden Markov Models, except that it starts at the end and works back toward the beginning.

For $t=0,1,2,\dots T-1$ and $i=0,1,\dots ,N-1$, deﬁne

${\beta }_{t}\left(i\right)=ℙ\left[{\mathsc{𝒪}}_{t+1},{\mathsc{𝒪}}_{t+2},\dots {\mathsc{𝒪}}_{T-1}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}{x}_{t}={q}_{i},\lambda \right].$

The crucial insight again is that we compute the $\beta \left(i\right)$ recursively. Figure 3: A schematic diagram of the event deﬁned by ${\beta }_{1}\left(i\right)$

Algorithm 2 (Backward Algorithm).

1. For $i=0,1,\dots ,N-1$, let ${\beta }_{T-1}\left(i\right)=1$, .
2. For $t=T-2,T-3,\dots ,0$ and $i=0,1,\dots ,N-1$, compute
${\beta }_{t}\left(i\right)=\sum _{j=0}^{N-1}{a}_{ij}{b}_{j}\left({\mathsc{𝒪}}_{t+1}\right){\beta }_{t+1}\left(j\right).$ Figure 4: Schematic diagram of the computations in the backward algorithm.

Proof. Step 2 of the Backward Algorithm. Note that this proof uses the traditional mathematical notation for conditional probability, suppressing the model dependency. By the conditional probability law of total probability:

This allows computation of $ℙ\left[{\mathsc{𝒪}}_{T-1}\right]$ through

$\begin{array}{llll}\hfill ℙ\left[{\mathsc{𝒪}}_{T-1}\right]& =\sum _{i}ℙ\left[{\mathsc{𝒪}}_{0},{\mathsc{𝒪}}_{1},\dots ,{\mathsc{𝒪}}_{T-1}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}{x}_{1}={q}_{i}\right]{\pi }_{i}\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & =\sum _{i}ℙ\left[{\mathsc{𝒪}}_{0}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}{x}_{1}={q}_{i}\right]ℙ\left[{\mathsc{𝒪}}_{1},\dots ,{\mathsc{𝒪}}_{T-1}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}{x}_{1}={q}_{i},{\mathsc{𝒪}}_{0}\right]{\pi }_{i}\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & =\sum _{i}{b}_{i}\left({\mathsc{𝒪}}_{0}\right)ℙ\left[{\mathsc{𝒪}}_{1},\dots ,{\mathsc{𝒪}}_{T-1}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}{x}_{1}={q}_{i},\mathsc{𝒪}\right]{\pi }_{i}\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & =\sum _{i}{b}_{i}\left({\mathsc{𝒪}}_{0}\right){\beta }_{1}\left(i\right){\pi }_{i}.\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & \phantom{\rule{2em}{0ex}}& \hfill \end{array}$

For $t=0,1,2,\dots T-1$ and $i=0,1,\dots N-1$ deﬁne

${\gamma }_{t}\left(i\right)=ℙ\left[{x}_{t}={q}_{i}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\mathsc{𝒪},\lambda \right].$ Figure 5: A schematic diagram of the event deﬁned by ${\gamma }_{2}\left(i\right)$

Since ${\alpha }_{t}\left(i\right)$ measures the relevant probability up to time $t$ and ${\beta }_{t}\left(i\right)$ measures the relevant probability after time $t$

${\gamma }_{t}\left(i\right)=\frac{{\alpha }_{t}\left(i\right){\beta }_{t}\left(i\right)}{ℙ\left[\mathsc{𝒪}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\lambda \right]}.$

Recall that we obtain the denominator $ℙ\left[\mathsc{𝒪}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\lambda \right]$ by summing ${\alpha }_{T-1}\left(i\right)$ over $i$ or alternatively using the last step in the proof of the Backward Algorithm. From the deﬁnition of ${\gamma }_{t}\left(i\right)$ it follows that the most likely state at time $t$ is the state ${q}_{i}$ for which ${\gamma }_{i}\left(t\right)$ is a maximum over the index $i$. This is the Viterbi algorithm.

Algorithm 3 (Viterbi Algorithm). Using ${\gamma }_{t}\left(i\right)$, the individually most likely state ${i}_{t}$ at time $t$ is

${i}_{t}=arg\phantom{\rule{0.3em}{0ex}}\underset{0\le i\le N-1}{max}\left[{\gamma }_{t}\left(i\right)\right]$

for $0\le t\le T-1$.

Remark. Another approach to obtaining $ℙ\left[{\mathsc{𝒪}}_{0},{\mathsc{𝒪}}_{1},\dots ,{\mathsc{𝒪}}_{T-1}\right]$ is to combine both the forward and the backward approaches. Suppose that for some $k$ we have computed both ${\alpha }_{k}\left(j\right)$ and ${\beta }_{k}\left(j\right)$. Because

The advantage of the previous identity for determining $ℙ\left[{\mathsc{𝒪}}_{0},{\mathsc{𝒪}}_{1},\dots ,{\mathsc{𝒪}}_{T-1}\right]$ is to simultaneously compute the sequence of forward functions, the alpha-passes starting at ${\alpha }_{1}$, and the sequence of backward functions, the beta-passes, starting at ${\beta }_{T-1}$. The parallel computations can stop when both ${\alpha }_{k}$ and ${\beta }_{k}$ are known for some $k$.

#### Example of Viterbi Algorithm

As a toy example, consider the Variable Factory, considered in the section Examples of Hidden Markov Models..

• The factory starts in the good state with probability $0.8$ and the bad state with probability $0.2$.
• If the factory is in good state $0$ then, independent of the past, with probability $0.9$ it will be in state $0$ during the next period.
• With probability $0.1$ the factory will move from good state 0 to bad state 1.
• Once in state 1, it remains in that state forever.

The sequences of variable factory states is a classic Markov chain with transition probability matrix

$A=\phantom{\rule{3.26288pt}{0ex}}\left(\right)$ and initial distribution $\pi =\left(0.8,0.2\right)$.

The observations are the quality of the items produced in each state.

• Each item produced in state 0 is acceptable quality with probability $0.99$.
• Each item produced in state 1 is acceptable quality with probability $0.96$.

Then the conditional emission probabilities are:

$\begin{array}{llllllll}\hfill ℙ\left[a\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}0\right]& =0.99\phantom{\rule{2em}{0ex}}& \hfill ℙ\left[u\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}0\right]& =0.01\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill ℙ\left[a\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}1\right]& =0.96\phantom{\rule{2em}{0ex}}& \hfill ℙ\left[u\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}1\right]& =0.04\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & \phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}& \hfill \end{array}$

so the emission matrix is

$B=\phantom{\rule{3.26288pt}{0ex}}\left(\right)$ .

The forward algorithm probabilities are:

 ${\alpha }_{t}\left(i\right)$ 0 (a) 1 (u) 2 (a) 0 $\left(0.8\right)\left(0.99\right)$ $\left[\left(0.792\right)\left(0.9\right)$ $\left[\left(0.007128\right)\left(0.9\right)$ $=0.792$ $+\left(0.192\right)\left(0\right)\right]$ $+\left(0.010848\right)\left(0\right)\right]$ $×\left(0.01\right)$ $×\left(0.99\right)$ $=0.007128$ $=0.006351$ 1 $\left(0.2\right)\left(0.96\right)$ $\left[\left(0.792\right)\left(0.1\right)$ $\left[\left(0.007128\right)\left(0.1\right)$ $=0.192$ $+\left(0.192\right)\left(1\right)\right]$ $+\left(0.010848\right)\left(1\right)\right]$ $×\left(0.04\right)$ $×\left(0.96\right)$ $=0.010848$ $=0.011098$ $0.006351+0.011098$ $=0.01745$

The backward algorithm probabilities are:

 ${\beta }_{t}\left(i\right)$ 0 1 2 0 $\left(0.9\right)\left(0.01\right)\left(0.987\right)$ $\left(0.9\right)\left(0.99\right)\left(1\right)$ $1$ $+\left(0.1\right)\left(0.04\right)\left(0.96\right)$ $+\left(0.1\right)\left(0.96\right)\left(1\right)$ $=0.012723$ $=0.987$ 1 $\left(0\right)\left(0.01\right)\left(0.987\right)$ $\left(0\right)\left(0.99\right)\left(1\right)$ $1$ $+\left(1\right)\left(0.04\right)\left(0.96\right)$ $+\left(1\right)\left(0.96\right)\left(1\right)$ $=0.0384$ $=0.96$

The joint probabilities are:

 $\alpha \beta$ 0 1 2 $\left(0.792\right)\left(0.012723\right)$ $\left(0.0077128\right)\left(0.987\right)$ $\left(0.006351\right)\left(1\right)$ $=0.01007662$ $=0.007035$ $=0.006351$ $\left(0.192\right)\left(0.0384\right)$ $\left(0.010848\right)\left(0.96\right)$ $\left(0.011098\right)\left(1\right)$ $=0.007373$ $=0.010414$ $=0.011098$ $0.01745$ $0.01745$ $0.01745$

The gamma probabilities are:

 ${\gamma }_{t}\left(i\right)$ 0 1 2 0 $0.5775$ $0.4031$ $0.3640$ 1 $0.4225$ $0.5968$ $0.6360$

The Viterbi algorithm gives:

 ${\gamma }_{t}\left(i\right)$ 0 1 2 0 $0.5775$ $0.4031$ $0.3640$ 1 $0.4225$ $0.5968$ $0.6360$

At each time the state which is individually most likely is $\left(0,1,1\right)$, i.e. factory good, not good, not good. Note that here all state transitions are valid, but valid transitions may not always result from application of this algorithm. This individually most likely state path is diﬀerent from the most likely overall path $\left(1,1,1\right)$ exhaustively computed in the examples of Hidden Markov Models.

#### Solution to Problem 3

The solution of Problem 3 attempts to optimize the model parameters to best represent how the observed sequence comes about. The observed sequence used to solve Problem 3 is a training sequence since it “trains” the model. This training problem is crucial for many applications of Hidden Markov Models since it creates best models for real phenomena.

The formal statement of Problem 3 is: Given an observation sequence $\mathsc{𝒪}$, determine by some means (physical knowledge, intuition, best guessing) the ﬁxed dimensions $N$ and $M$, then ﬁnd the model $\lambda =\left(A,B,\pi \right)$ that maximizes the probability of $\mathsc{𝒪}$. The interpretation is training a model by adjusting the model parameters to best ﬁt the observations. We can also view this as discrete time search in the parameter space represented by $A$, $B$ and $\pi$. The sizes of the matrices $N$ and $M$ are ﬁxed but the algorithm will ﬁnd the elements of $A$, $B$, and $\pi$, subject to the row stochastic condition. The fact that we can eﬃciently re-estimate the model itself at each step is one of the more amazing aspects of Hidden Markov Models.

For $t=0,1,\dots ,T-2$ and $i,j\in \left\{0,1,\dots ,N-1\right\}$, deﬁne the xi-passes as

${\xi }_{t}\left(i,j\right)=ℙ\left[{x}_{t}={q}_{i},{x}_{t+1}={q}_{j}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\mathsc{𝒪},\lambda \right]$

so ${\xi }_{t}\left(i,j\right)$ is the probability of being in state ${q}_{i}$ at time $t$ and transitioning to state ${q}_{j}$ at time $t+1$, given the observations. The xi’s in terms of $\alpha$, $\beta$, $A$ and $B$ are

${\xi }_{t}\left(i,j\right)=\frac{{\alpha }_{t}\left(i\right){a}_{ij}{b}_{j}\left({\mathsc{𝒪}}_{t+1}\right){\beta }_{t+1}\left(j\right)}{ℙ\left[\mathsc{𝒪}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\lambda \right]}.$

For $t=0,1,\dots ,T-2$, the relation between ${\gamma }_{t}\left(i\right)$ and ${\xi }_{t}\left(i,j\right)$ is:

${\gamma }_{t}\left(i\right)=\sum _{j=0}^{N-1}{\xi }_{t}\left(i,j\right).$ Figure 6: Schematic diagram of the computations in the joint event algorithm as a lattice.

Given the gammas and the xis, we can re-estimate the model $\lambda =\left(A,B,\pi \right)$ as follows:

1. For $i=0,1,\dots ,N-1$, let ${\pi }_{i}={\gamma }_{0}\left(i\right)$.
2. For $i=0,1,\dots ,N-1$ and for $j=0,1,\dots ,N-1$ compute
 ${ā}_{ij}=\frac{\sum _{t=0}^{T-2}{\xi }_{t}\left(i,j\right)}{\sum _{t=0}^{T-2}{\gamma }_{t}\left(i\right)}.$ (1)
3. For $j=0,1,\dots ,N-1$ and for $k=0,1,\dots ,M-1$ compute
 ${\stackrel{̄}{b}}_{j}\left(k\right)=\frac{\sum _{t\in \left\{0,1,\dots ,T-1\right\},{\mathsc{𝒪}}_{t}=k}{\xi }_{t}\left(i,j\right)}{\sum _{t=0}^{T-1}{\gamma }_{t}\left(i\right)}.$ (2)

The numerator of the re-estimated ${ā}_{ij}$ gives the expected number of transitions from state ${q}_{i}$ to state ${q}_{j}$, while the denominator is the expected number of transitions from ${q}_{i}$ to any state. Then the ratio is the empirical probability of transitioning from state ${q}_{i}$ to state ${q}_{j}$, the desired value of ${ā}_{ij}$.

The numerator of the re-estimated ${\stackrel{̄}{b}}_{j}\left(k\right)$ gives the expected number of times the model is in state ${q}_{j}$ with observation $k$, while the denominator is the expected number of transitions from ${q}_{i}$. Then the ratio is the empirical probability of observing symbol $k$ given that the model is in state ${q}_{j}$, the desired value of ${b}_{j}\left(k\right)$.

Re-estimation is an iterative process. First we initialize $\lambda =\left(A,B,\pi \right)$ with a best guess, or if no reasonable guess is available, we choose random values such that ${\pi }_{i}\approx 1∕N$ and ${a}_{ij}\approx 1∕N$ and ${b}_{j}\left(k\right)\approx 1∕M$. It is critical that $A$, $B$ and $\pi$ are not uniform since exactly uniform values result in a local maximum from which the model cannot climb. As always, $A$, $B$, $\pi$ must be row stochastic. One way to accomplish this is to start with each row uniform, say $\pi =\left(1∕N,\dots ,1∕N\right)$, select $N$ random values in the range ${u}_{i}=\left(\frac{-1}{2N},\frac{1}{2N}\right)$, $i=0,\dots ,N-1$, ﬁnd the mean $ū=\frac{1}{N}\sum _{i=0}^{N-1}{u}_{i}$, and take ${u}_{i}-ū$, so $\frac{-1}{N}\le {u}_{i}-ū\le \frac{1}{N}$ and $\sum _{j=0}^{N-1}\left({u}_{i}-ū\right)=0$. Then $\pi =\left(\frac{1}{N}-\left({u}_{0}-ū\right),\dots \frac{1}{N}-\left({u}_{N-1}-ū\right)\right)$ is positive and row-stochastic. Another way to accomplish this (which is employed by Dr. Lin Himmelmann, in the initHMM function in the HMM R library package) is to take

$A=\frac{1}{2}{I}_{N}+\frac{1}{2N}{1}_{N}$

where ${I}_{N}$ is an $N×N$ identity matrix and ${1}_{N}$ is an $N×N$ matrix with $1$ in every entry. The initial distribution and emission matrices are taken by Himmelmann to be uniform. Yet another way to accomplish this (which is employed by Mo Chen, in the hmmEm function in the Hidden Markov Model Toolbox Matlab library package) is to ﬁll appropriately sized matrices and vectors with uniform random values and then normalize so the rows sum to $1$.

Summarize the iterative process as:

Algorithm 4 (Baum-Welch Algorithm).

1. Set a maximum number of iterations ${n}_{\text{max}}$, and initialize $\lambda =\left(A,B,\pi \right)$ with a best guess or a randomized row-stochastic initialization as above.
2. Using $\lambda =\left(A,B,\pi \right)$ compute ${\alpha }_{t}\left(i\right)$, ${\beta }_{t}\left(i\right)$, ${\gamma }_{t}\left(i,j\right)$, ${\gamma }_{t}\left(i\right)$ and $ℙ\left[\mathsc{𝒪}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\lambda \right]$.
3. Re-estimate a new model $\stackrel{̄}{\lambda }=\left(Ā,\stackrel{̄}{B},\stackrel{̄}{\pi }\right)$ with empirical probabilities (1) and (2)
4. If $ℙ\left[\mathsc{𝒪}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\stackrel{̄}{\lambda }\right]$ increases over $ℙ\left[\mathsc{𝒪}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\lambda \right]$ by at least some predetermined threshold or the predetermined maximum number of iterations ${n}_{\text{max}}$ has not been exceeded, reset the model $\left(A,B,\pi \right)=\left(Ā,\stackrel{̄}{B},\stackrel{̄}{\pi }\right)$ and go to step 2.
5. Else stop and output $\lambda =\left(A,B,\pi \right)$.

In a mathematically dense pair of papers, Baum et al. show the Baum-Welch algorithm is basically a hill-climbing technique in the parameter space of the $A$, $B$ matrices, so it converges at least to a local maximizer of the likelihood function. However, as with any hill-climbing algorithm, it is also prone to get stuck in local maxima, especially if the data are from a Hidden Markov Model that is far from the initial guess $\left({A}_{0},{B}_{0}\right)$. On the other hand, if the starting point of the iterations is suﬃciently close to the true but unknown Hidden Markov Model, then the algorithm may to converge to the true Hidden Markov Model. Note that the Baum-Welch algorithm is also referred to as an expectation maximization algorithm.

#### Scaling and Practical Considerations

Even the toy variable factory example above shows that all computations involve multiple products of probabilities, so the products tend to $0$ quickly as $T$ increases. Therefore, ﬂoating-point underﬂow is a serious problem for computation. The basic scaling procedure multiplies ${\alpha }_{t}\left(i\right)$ by a scaling coeﬃcient independent of $i$ but depending on $t$ with the goal of keeping the ${\alpha }_{t}\left(i\right)$ within the dynamic range of the computer for $0\le t\le T-1$. The solution is to scale all computations by the sum over $i$ of all ${\alpha }_{t}\left(i\right)$, while ensuring that all formulas remain valid. The scaling ensures that the set of rescaled ${\alpha }_{t}\left(i\right)$ sum over $i$ to $1$, so heuristically are on the order of $1∕N$.

Algorithm 5 (Scaling Procedure).

1. For $i=0,1,\dots ,N-1$, let ${\stackrel{̃}{\alpha }}_{0}\left(i\right)={\alpha }_{0}\left(i\right)$.
2. Let ${c}_{0}=1/\sum _{j=0}^{N-1}{\stackrel{̃}{\alpha }}_{0}\left(j\right)$
3. For $i=0,1,\dots ,N-1$, let $\stackrel{̂}{\alpha }\left(i\right)={c}_{0}{\stackrel{̃}{\alpha }}_{0}\left(i\right)$.
1. For $i=0,1,\dots ,N-1$, compute
${\stackrel{̃}{\alpha }}_{t}\left(i\right)=\left[\sum _{j=0}^{N-1}{\stackrel{̂}{\alpha }}_{t-1}\left(j\right){a}_{ji}\right]{b}_{i}\left({\mathsc{𝒪}}_{t}\right)$

2. Let ${c}_{t}=1/\sum _{j=0}^{N-1}{\stackrel{̃}{\alpha }}_{t}\left(j\right)$
3. By induction compute ${\stackrel{̂}{\alpha }}_{t}\left(i\right)={c}_{0}{c}_{1}\cdots {c}_{t}{\alpha }_{t}\left(i\right)$. Then ${\stackrel{̂}{\alpha }}_{t}\left(i\right)=\frac{{\alpha }_{t}\left(i\right)}{\sum _{j=0}^{N-1}{\alpha }_{t}\left(j\right)}$
4. Let $\begin{array}{llll}\hfill {\stackrel{̂}{\beta }}_{t}\left(i\right)& ={c}_{t}{\beta }_{t}\left(i\right)\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill {\stackrel{̂}{\gamma }}_{t}\left(i\right)& =\frac{{\stackrel{̂}{\alpha }}_{t}\left(i\right){\stackrel{̂}{\beta }}_{t}\left(i\right)}{ℙ\left[\mathsc{𝒪}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\lambda \right]}\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill {\stackrel{̂}{\xi }}_{t}\left(i,j\right)& ={\stackrel{̂}{\alpha }}_{t}\left(i\right){a}_{ij}{b}_{j}\left({\mathsc{𝒪}}_{t+1}\right){\stackrel{̂}{\beta }}_{t+1}\left(j\right)\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\end{array}$

Proof. The proof is by induction. To start, ${\stackrel{̂}{\alpha }}_{t}\left(i\right)={c}_{0}{\stackrel{̃}{\alpha }}_{t}\left(i\right)={c}_{0}{\alpha }_{t}\left(i\right)={\alpha }_{t}\left(i\right)∕\sum _{j=0}^{N-1}{\stackrel{̃}{\alpha }}_{0}\left(j\right)={\alpha }_{t}\left(i\right)∕\sum _{j=0}^{N-1}{\alpha }_{0}\left(j\right)$.

The induction assumption is

${\stackrel{̂}{\alpha }}_{t}\left(i\right)={c}_{0}{c}_{1}\cdots {c}_{t}{\alpha }_{t}\left(i\right).$

Then

$\begin{array}{llll}\hfill {\stackrel{̂}{\alpha }}_{t+1}\left(i\right)& ={c}_{t+1}{\stackrel{̃}{\alpha }}_{t+1}\left(i\right)\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & ={c}_{t+1}\sum _{j=0}^{N-1}{\stackrel{̂}{\alpha }}_{t}\left(j\right){a}_{ji}{b}_{i}\left({\mathsc{𝒪}}_{t+1}\right)\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & ={c}_{0}{c}_{1}\cdots {c}_{t}\sum _{j=0}^{N-1}{\alpha }_{t}\left(j\right){a}_{ji}{b}_{i}\left({\mathsc{𝒪}}_{t+1}\right)\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & ={c}_{0}{c}_{1}\cdots {c}_{t}{\alpha }_{t+1}\left(i\right)\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\end{array}$

Hence

${\stackrel{̂}{\alpha }}_{t}\left(i\right)={c}_{0}{c}_{1}\cdots {c}_{t}{\alpha }_{t}\left(i\right).$

holds for all $t$ by induction.

Then by deﬁnition

${\stackrel{̂}{\alpha }}_{t}\left(i\right)=\frac{{\alpha }_{t}\left(i\right)}{\sum _{j=0}^{N-1}{\alpha }_{t}\left(j\right)}$

hence the ${\stackrel{̂}{\alpha }}_{t}\left(i\right)$ are the properly scaled values of ${\alpha }_{t}\left(i\right)$ for all $t$. As a consequence

$\sum _{j=0}^{N-1}{\stackrel{̂}{\alpha }}_{T-1}\left(j\right)=1.$

Also,

$\begin{array}{llll}\hfill \sum _{j=0}^{N-1}{\stackrel{̂}{\alpha }}_{T-1}\left(j\right)& ={c}_{0}{c}_{1}\cdots {c}_{T-1}\sum _{j=0}^{N-1}{\alpha }_{t}\left(j\right)\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & ={c}_{0}{c}_{1}\cdots {c}_{T-1}ℙ\left[\mathsc{𝒪}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\lambda \right].\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\end{array}$

Combining the last two results gives

$ℙ\left[\mathsc{𝒪}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\lambda \right]=\frac{1}{{c}_{0}{c}_{1}\cdots {c}_{T-1}}.$

Remark. The only real change to the algorithm because of scaling is the procedure for computing $ℙ\left[\mathsc{𝒪}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\lambda \right]$. We cannot merely sum up the scaled values of ${\stackrel{̂}{\alpha }}_{t}\left(i\right)$ already. Additionally, to avoid underﬂow, instead compute

$log\left(ℙ\left[\mathsc{𝒪}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\lambda \right]\right)=-\sum _{j=0}^{T-1}log{c}_{j}.$

Thus, we can compute the log of $ℙ\left[\mathsc{𝒪}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\lambda \right]$, but not $ℙ\left[\mathsc{𝒪}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\lambda \right]$ itself, since it would be out of the dynamic range of the computer anyway.

Remark. For convenience let ${C}_{t}={c}_{0}{c}_{1}\cdots {c}_{t}$ and ${D}_{t}={c}_{t+1}\cdots {c}_{T-1}$ so ${\stackrel{̂}{\alpha }}_{t}\left(j\right)={C}_{t}{\alpha }_{t}\left(i\right)$ and ${\stackrel{̂}{\beta }}_{t+1}\left(j\right)={D}_{t+1}{\beta }_{t+1}\left(i\right)$. Then the estimation formula for the Markov state transition probabilities becomes

$\begin{array}{llll}\hfill {ā}_{ij}& =\frac{\sum _{t=0}^{T-2}{\stackrel{̂}{\gamma }}_{t}\left(i,j\right)}{\sum _{t=0}^{T-2}{\stackrel{̂}{\gamma }}_{t}\left(i\right)}\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & =\frac{\sum _{t=0}^{T-2}{\stackrel{̂}{\alpha }}_{t}\left(i\right){a}_{ij}{b}_{j}\left({\mathsc{𝒪}}_{t+1}\right){\stackrel{̂}{\beta }}_{t+1}\left(j\right)}{\sum _{t=0}^{T-2}\sum _{j=0}^{N}{\stackrel{̂}{\alpha }}_{t}\left(i\right){a}_{ij}{b}_{j}\left({\mathsc{𝒪}}_{t+1}\right){\stackrel{̂}{\beta }}_{t+1}\left(j\right)}\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & =\frac{\sum _{t=0}^{T-2}{C}_{t}{\alpha }_{t}\left(i\right){a}_{ij}{b}_{j}\left({\mathsc{𝒪}}_{t+1}\right){D}_{t+1}{\beta }_{t+1}\left(j\right)}{\sum _{t=0}^{T-2}\sum _{j=0}^{N}{C}_{t}{\alpha }_{t}\left(i\right){a}_{ij}{b}_{j}\left({\mathsc{𝒪}}_{t+1}\right){D}_{t+1}{\beta }_{t+1}\left(j\right)}\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & \phantom{\rule{2em}{0ex}}& \hfill \end{array}$

But

${C}_{t}{D}_{t+1}={c}_{0}{c}_{1}\cdots {c}_{t}\cdot {c}_{t+1}\cdots {c}_{T-1}={C}_{T-1}$

independent of $t$. Hence the terms ${C}_{t}{D}_{t+1}$ cancel out of both numerator and denominator, and so the scaled re-estimation formulas give the same re-estimations as equations (1) and (2).

#### Dynamic Programming Algorithm for the Overall Path Problem

The last problem is to ﬁnd the most likely overall path given the observations (not the states that are individually most likely at each time). That is, choose the sequence of states whose conditional probability, given the sequence of signals, is maximal. The Dynamic Programming algorithm solves this problem. The Dynamic Programming algorithm is the forward algorithm with “sum” replaced by “max”.

Algorithm 6 (Dynamic Programming).

1. For $i=0,1,\dots ,N-1$ let ${\delta }_{0}\left(i\right)={\pi }_{i}{b}_{i}\left({\mathsc{𝒪}}_{0}\right)$.
2. For $t=1,2,\dots ,T-1$ and $i=0,1,\dots ,N-1$, compute
${\delta }_{t}\left(i\right)=\underset{j\in \left\{0,\dots ,N-1\right\}}{max}\left[{\delta }_{t-1}\left(j\right){a}_{ji}{b}_{i}\left({\mathsc{𝒪}}_{t}\right)\right]$

3. The best overall path is
$\underset{j\in \left\{0,\dots ,N-1\right\}}{max}{\delta }_{T-1}\left(j\right).$

The Dynamic Programming Algorithm only gives the optimal probability, not the path. To ﬁnd the most likely state path, one must also keep track of preceding states at each stage, tracing back from the highest-scoring ﬁnal state.

Proof. Let ${X}_{k}$ be a vector of the ﬁrst $k$ states. The goal is to ﬁnd the speciﬁc sequence of states ${X}_{k}=\left\{{x}_{1},\dots ,{x}_{k}\right\}$ that maximizes $ℙ\left[{X}_{k}=\left\{{x}_{1},\dots ,{x}_{k}\right\}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\mathsc{𝒪}\right]$. Because

$ℙ\left[{X}_{k}=\left\{{x}_{1},\dots ,{x}_{k}\right\}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}\mathsc{𝒪}\right]=\frac{ℙ\left[{X}_{k}=\left\{{x}_{1},\dots ,{x}_{k}\right\},\mathsc{𝒪}\right]}{ℙ\left[\mathsc{𝒪}\right]}$

this is equivalent to ﬁnding the sequence of states that maximizes (make sure you have a right and consistent notation for the sequence of observations, watch subscripts k and n!) $ℙ\left[{X}_{k}=\left\{{x}_{1},\dots ,{x}_{k}\right\},\mathsc{𝒪}\right]$.

To solve this maximization problem, let for $k\le n$,

${V}_{k}\left(j\right)=\underset{{x}_{1},\dots ,{x}_{k-1}}{max}ℙ\left[{X}_{k}=\left\{{x}_{1},\dots ,{x}_{k-1}\right\},{X}_{k}={x}_{k},{\left(\mathsc{𝒪}\right)}_{k}\right].$

Recursively solve for ${V}_{k}\left(j\right)$ with

$\begin{array}{llll}\hfill {V}_{k}\left(j\right)& =\underset{i}{max}\underset{{x}_{1},\dots ,{x}_{k-2}}{max}ℙ\left[{X}_{k-2}=\left\{{x}_{1},\dots ,{x}_{k-2}\right\},{X}_{k-1}={q}_{i},{X}_{k}={q}_{j},{\left(\mathsc{𝒪}\right)}_{k}\right]\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & =\underset{i}{max}\underset{{x}_{1},\dots ,{x}_{k-2}}{max}ℙ\left[{X}_{k-2}=\left\{{x}_{1},\dots ,{x}_{k-2}\right\},{X}_{k-1}={q}_{i},{X}_{k}={q}_{j},{\left(\mathsc{𝒪}\right)}_{k-1},{\mathsc{𝒪}}_{k}\right]\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & =\underset{i}{max}\underset{{x}_{1},\dots ,{x}_{k-2}}{max}ℙ\left[{X}_{k-2}=\left\{{x}_{1},\dots ,{x}_{k-2}\right\},{X}_{k-1}={q}_{i},{\left(\mathsc{𝒪}\right)}_{k-1}\right]\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & \phantom{\rule{2em}{0ex}}×ℙ\left[{X}_{k}={q}_{j},{\mathsc{𝒪}}_{k}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}{X}_{k-1}={q}_{i}\right]\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & =\underset{i}{max}ℙ\left[{X}_{k}={q}_{j},{\mathsc{𝒪}}_{k}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}{X}_{k-1}={q}_{i}\right]\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & \phantom{\rule{2em}{0ex}}×\underset{{x}_{1},\dots ,{x}_{k-2}}{max}ℙ\left[{X}_{k-2}=\left\{{x}_{1},\dots ,{x}_{k-2}\right\},{X}_{k-1}={q}_{i},{\left(\mathsc{𝒪}\right)}_{k-1}\right]\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & =\underset{i}{max}{a}_{ij}\cdot p\left({\mathsc{𝒪}}_{k}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}j\right){V}_{k-1}\left(i\right)\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & =p\left({\mathsc{𝒪}}_{k}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}j\right)\underset{i}{max}{a}_{i,j}{V}_{k-1}\left(i\right)\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\end{array}$

Numerical underﬂow is again a concern with the Dynamic Programming algorithm, again since we have multiple products of probabilities. For the Dynamic Programming algorithm, avoid underﬂow by simply taking logarithms. Then the underﬂow-resistant version of Dynamic Programming is

Algorithm 7.

1. Let ${\stackrel{̂}{\delta }}_{0}\left(i\right)=log\left({\pi }_{i}{b}_{i}\left({\mathsc{𝒪}}_{0}\right)\right)$, for $i=0,1,\dots ,N-1$.
2. For $t=1,2,\dots ,T-1$ and $i=0,1,\dots ,N-1$, compute
${\stackrel{̂}{\delta }}_{t}\left(i\right)=\underset{j\in \left\{0,\dots ,N-1\right\}}{max}\left[{\stackrel{̂}{\delta }}_{t-1}\left(j\right)+log\left({a}_{ji}\right)+log\left[{b}_{i}\left({\mathsc{𝒪}}_{t}\right)\right]\right]$

3. The probability of the best overall path is
$\underset{j\in \left\{0,\dots ,N-1}{max}{\stackrel{̂}{\delta }}_{T-1}\left(j\right).$

This just ﬁnds the probability, to ﬁnd the optimal overall path requires recording the states.

#### Example of Dynamic Programming

Once again, consider the case of the Variable Factory:

$\begin{array}{llll}\hfill {\delta }_{0}\left(0\right)& ={\pi }_{0}{b}_{0}\left({\mathsc{𝒪}}_{0}\right)=\left(0.8\right)\left(0.99\right)=0.792\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill {\delta }_{0}\left(1\right)& ={\pi }_{1}{b}_{1}\left({\mathsc{𝒪}}_{0}\right)=\left(0.2\right)\left(0.96\right)=0.192\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & \phantom{\rule{2em}{0ex}}& \hfill \end{array}$

$\begin{array}{llll}\hfill {\delta }_{1}\left(0\right)& =max\left\{\begin{array}{cc}{\delta }_{0}\left(0\right){a}_{00}{b}_{0}\left({\mathsc{𝒪}}_{1}\right)\phantom{\rule{1em}{0ex}}\hfill & =\left(0.792\right)\left(0.9\right)\left(0.01\right)\hfill \\ \phantom{\rule{1em}{0ex}}\hfill & =0.007128\hfill \\ {\delta }_{0}\left(1\right){a}_{10}{b}_{0}\left({\mathsc{𝒪}}_{1}\right)\phantom{\rule{1em}{0ex}}\hfill & =\left(0.192\right)\left(0\right)\left(0.01\right)\hfill \\ \phantom{\rule{1em}{0ex}}\hfill & =0.0\hfill \\ \phantom{\rule{1em}{0ex}}\hfill \end{array}\right\\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill {\delta }_{1}\left(1\right)& =max\left\{\begin{array}{cc}{\delta }_{0}\left(0\right){a}_{01}{b}_{1}\left({\mathsc{𝒪}}_{1}\right)\phantom{\rule{1em}{0ex}}\hfill & =\left(0.792\right)\left(0.1\right)\left(0.04\right)\hfill \\ \phantom{\rule{1em}{0ex}}\hfill & =0.003168\hfill \\ {\delta }_{0}\left(1\right){a}_{11}{b}_{1}\left({\mathsc{𝒪}}_{1}\right)\phantom{\rule{1em}{0ex}}\hfill & =\left(0.192\right)\left(1\right)\left(0.04\right)\hfill \\ \phantom{\rule{1em}{0ex}}\hfill & =0.00768\hfill \\ \phantom{\rule{1em}{0ex}}\hfill \end{array}\right\\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & \phantom{\rule{2em}{0ex}}& \hfill \end{array}$

$\begin{array}{llll}\hfill {\delta }_{2}\left(0\right)& =max\left\{\begin{array}{cc}{\delta }_{1}\left(0\right){a}_{00}{b}_{0}\left({\mathsc{𝒪}}_{2}\right)\phantom{\rule{1em}{0ex}}\hfill & =\left(0.007128\right)\left(0.9\right)\left(0.99\right)\hfill \\ \phantom{\rule{1em}{0ex}}\hfill & =0.006351\hfill \\ {\delta }_{1}\left(1\right){a}_{10}{b}_{0}\left({\mathsc{𝒪}}_{2}\right)\phantom{\rule{1em}{0ex}}\hfill & =\left(0.00768\right)\left(0\right)\left(0.96\right)\hfill \\ \phantom{\rule{1em}{0ex}}\hfill & =0.0\hfill \\ \phantom{\rule{1em}{0ex}}\hfill \end{array}\right\\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill {\delta }_{2}\left(1\right)& =max\left\{\begin{array}{cc}{\delta }_{1}\left(0\right){a}_{01}{b}_{1}\left({\mathsc{𝒪}}_{2}\right)\phantom{\rule{1em}{0ex}}\hfill & =\left(0.007128\right)\left(0.1\right)\left(0.99\right)\hfill \\ \phantom{\rule{1em}{0ex}}\hfill & =0.0007057\hfill \\ {\delta }_{1}\left(1\right){a}_{11}{b}_{1}\left({\mathsc{𝒪}}_{2}\right)\phantom{\rule{1em}{0ex}}\hfill & =\left(0.00768\right)\left(1\right)\left(0.96\right)\hfill \\ \phantom{\rule{1em}{0ex}}\hfill & =0.007373\hfill \\ \phantom{\rule{1em}{0ex}}\hfill \end{array}\right\\phantom{\rule{2em}{0ex}}& \hfill & \phantom{\rule{2em}{0ex}}\\ \hfill & \phantom{\rule{2em}{0ex}}& \hfill \end{array}$

 ${\delta }_{t}\left(i\right)$ 0 (a) 1 (u) 2 (a) 0 0.792 0.007128 0.006351 1 0.192 0.00768 0.007373

Tracing back the source of the red maximum, the maximum overall state sequence was 111, same as exhaustive search. This simple example required only $18$ multiplications compared with $40$ for the exhaustive listing of all possibilities. This is not surprising, since Dynamic Programming throws out one path of two at every stage. Dynamic Programming achieves a similar eﬃciency for larger problems.

#### Sources

This algorithms in this section are adapted from Stamp,  and Rabiner . The proofs of the algorithms are adapted from Ross. Additional ideas are adapted from Rabiner, . The schematic diagrams are adapted from Rabiner, . The problems are adapted from Stamp, .

_______________________________________________________________________________________________ ### Algorithms, Scripts, Simulations

#### Scripts

__________________________________________________________________________ ### Problems to Work for Understanding

1. Compute the forward pass probabilities for the Paleontological Temperature Model (tree rings as signals from average annual temperatures) from the section on Examples of Hidden Markov Models.
2. Compute the backward pass probabilities for the Paleontological Temperature Model (tree rings as signals from average annual temperatures) from the section on Examples of Hidden Markov Models.
3. Compute the gamma probabilities for the Paleontological Temperature Model (tree rings as signals from average annual temperatures) from the section on Examples of Hidden Markov Models.
4. Use the Viterbi algorithm to ﬁnd the most likely states for the Paleontological Temperature Model (tree rings as signals from average annual temperatures) from the section on Examples of Hidden Markov Models.
5. Show that the forward pass algorithm requires $N\left(N+1\right)\left(T-1\right)+N$ multiplications and $N\left(N-1\right)\left(T-1\right)$ additions.

__________________________________________________________________________ ### References

   L. R. Rabiner and B. H. Juang. An Introduction to Hidden Markov Models. IEEE ASSP Magazine, pages 4–16, January 1986. hidden markov models.

   Lawrence R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(3):257–286, February 1989. hidden Markov model, speech recognition.

   Mark Stamp. A revealing introduction to hidden markov models. https://www.cs.sjsu.edu/ stamp/RUA/HMM.pdf, December 2015.

__________________________________________________________________________ __________________________________________________________________________

I check all the information on each page for correctness and typographical errors. Nevertheless, some errors may occur and I would be grateful if you would alert me to such errors. I make every reasonable eﬀort to present current and accurate information for public use, however I do not guarantee the accuracy or timeliness of information on this website. Your use of the information from this website is strictly voluntary and at your risk.

I have checked the links to external sites for usefulness. Links to external websites are provided as a convenience. I do not endorse, control, monitor, or guarantee the information contained in any external website. I don’t guarantee that the links are active at all times. Use the links here with the same caution as you would all information on the Internet. This website reﬂects the thoughts, interests and opinions of its author. They do not explicitly represent oﬃcial positions or policies of my employer.

Information on this website is subject to change without notice.