Steven R. Dunbar

Department of Mathematics

203 Avery Hall

University of Nebraska-Lincoln

Lincoln, NE 68588-0130

http://www.math.unl.edu

Voice: 402-472-3731

Fax: 402-472-8466

Topics in

Probability Theory and Stochastic Processes

Steven R. Dunbar

__________________________________________________________________________

Notation and Problems of Hidden Markov Models

_______________________________________________________________________

Note: These pages are prepared with MathJax. MathJax is an open source JavaScript display engine for mathematics that works in all browsers. See http://mathjax.org for details on supported browsers, accessibility, copy-and-paste, and other features.

_______________________________________________________________________________________________

Mathematically Mature: may contain mathematics beyond calculus with proofs.

_______________________________________________________________________________________________

What is meant by an ergodic Markov chain?

_______________________________________________________________________________________________

- For Hidden Markov Models, the ﬁrst problem is the evaluation problem: given a model and a sequence of observations, how can we compute the probability that the model produced the observed sequence? We can also view this problem as: how do we “score” or evaluate the model?
- Given the model $\lambda =\left(A,B,\pi \right)$ and a sequence of observations $\mathcal{\mathcal{O}}$, ﬁnd an optimal state sequence after deﬁning “optimal” in some way for the model. In other words, we want to uncover the hidden part of the Hidden Markov Model.
- The solution of Problem 3 attempts to optimize the model parameters to best describe how the observed sequence comes about. The observed sequence used to solve Problem 3 is a training sequence since it “trains” the model. This training problem is the crucial one for most applications of hidden Markov models since it creates best models for real phenomena.
- We usually use an optimality criterion to discriminate which sequence best matches the observations. Two optimality criteria are common, and the choice of criterion determines the algorithm and the consequent revealed state sequence.

__________________________________________________________________________

- The evaluation problem: given a model and a sequence of observations, how can we compute the probability that the model produced the observed sequence?
- The estimation problem or discovery problem is the attempt to uncover the hidden state sequence.
- The observed sequence used to solve Problem 3 is a training sequence since it “trains” the model.

__________________________________________________________________________

- The model has a ﬁnite number, say $N$, of states. Each state emits one of several possible signals having measurable, distinctive properties. In many practical situations, the states have a physical signiﬁcance.
- At each discrete clock time, $t$, the model enters a new state, or possibly remains in the current state. The entered state depends on a transition probability distribution based on the previous state through the Markov chain property, that is, independently of earlier Markov chain states and signals. Often the Markov chain is ergodic, so that every state can be reached from any other state. In some applications, other interconnections of states, e.g. increasing or left-to-right, are of interest.
- After each transition, the model emits one of $M$ observation output symbols or signals according to a probability distribution depending on the current state. Each state has a ﬁxed probability distribution for its signals regardless of how and when the chain enters the state. There are thus $N$ such observation probability distributions. Sometimes we say $M$ is the alphabet size. The observation symbols correspond to the physical output of the modeled system.

Alternative names for Hidden Markov Models, especially in communications engineering and signal processing, are Markov sources or probabilistic functions of Markov chains, emphasizing the emission of observation signals from the states.

Example. The “urn-and-ball” model is a standard example of the previous elements. The model has $N$ urns, each filled with a large number of colored balls. Each ball is one of $M$ possible colors. The urn-and-ball model generates an observation sequence by choosing one of the $N$ urns according to an initial probability distribution, selecting a ball from the initial urn, recording the color, replacing the ball, and then choosing a new urn according to the transition probability distribution associated with the current urn.

A Hidden Markov Model is a doubly stochastic process with an underlying stochastic process that is not observable but can only be glimpsed through another stochastic process that produces the sequence of observations. Each state has a probability distribution over the possible output tokens. Therefore the sequence of tokens generated by a Hidden Markov Model gives some information about the sequence of states.

The usual notation for a discrete-time discrete-observation Hidden Markov Model is

$$\begin{array}{ll}
T & = \text{length of the observation sequence}, \\
N & = \text{number of states in the model}, \\
M & = \text{number of observation symbols}, \\
Q & = \left\{q_0, q_1, \dots, q_{N-1}\right\} = \text{distinct states of the Markov process}, \\
X & = \left\{x_0, x_1, \dots, x_{T-1}\right\} = \text{sequence of states visited by the Markov chain}, \\
V & = \left\{0, 1, 2, \dots, M-1\right\} = \text{set of possible observations}, \\
A & = \left(a_{ij}\right), \text{ where } a_{ij} = \mathbb{P}\left[q_j \,|\, q_i\right], \text{ the state transition probability matrix}, \\
B & = \left(b_{ij}\right) = \left(b_i(j)\right), \text{ where } b_{ij} = \mathbb{P}\left[j \,|\, q_i\right], \text{ the observation probability matrix}, \\
\pi & = \left\{\pi_j\right\} = \text{initial state distribution at time } 0, \\
\mathcal{O} & = \left(\mathcal{O}_0, \mathcal{O}_1, \dots, \mathcal{O}_{T-1}\right) = \text{the observation sequence}.
\end{array}$$

Remark. The choice of indexing beginning at $0$ is arbitrary. It is chosen here to conform with Stamp [4] and is consistent with the initial value of dynamical systems at time $0$. It is also convenient for programming languages such as Python, Perl, and C that begin indexing at $0$.
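In code, the triple $\lambda = \left(A, B, \pi\right)$ amounts to two row-stochastic matrices and an initial distribution. A minimal sketch in plain Python; the numerical values are those of the variable factory example computed later in this section (states $0$ = good and $1$ = bad, observations $0 = a$ and $1 = u$), read off from the factors in Table 1:

```python
# lambda = (A, B, pi) for a model with N = 2 states and M = 2 symbols.
# The numerical values are those of the variable factory example used
# later in this section: states 0 (good) and 1 (bad); observations
# 0 (acceptable, a) and 1 (unacceptable, u).
A = [[0.9, 0.1],        # a_ij = P[q_j | q_i], an N x N matrix
     [0.0, 1.0]]
B = [[0.99, 0.01],      # b_i(j) = P[j | q_i], an N x M matrix
     [0.96, 0.04]]
pi = [0.8, 0.2]         # initial state distribution at time 0

# Every row of A and B, and pi itself, is a probability distribution,
# so each must sum to 1 (the matrices are row stochastic).
for row in A + B + [pi]:
    assert abs(sum(row) - 1.0) < 1e-12
```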

Using the Model, the process for generating an observation sequence $\mathcal{\mathcal{O}}=\left({\mathcal{\mathcal{O}}}_{0},{\mathcal{\mathcal{O}}}_{1},\dots {\mathcal{\mathcal{O}}}_{T-1}\right)$ is:

1. Set $t=0$.

2. Choose an initial state ${x}_{t}\in \left\{{q}_{0},\dots ,{q}_{N-1}\right\}$ according to the initial state distribution $\pi $.

3. Choose ${\mathcal{O}}_{t}=j$ according to ${b}_{i}\left(j\right)$, the symbol probability distribution in state $i={x}_{t}$.

4. Set $t=t+1$.

5. Choose a new state ${x}_{t}\in \left\{{q}_{0},\dots ,{q}_{N-1}\right\}$ according to the probability transition matrix $A$.

6. Return to step $3$ if $t<T$; otherwise terminate the process.
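The steps above translate directly into a short simulation. A sketch in Python; the function name is illustrative, and the model is passed in as nested lists:

```python
import random

def generate_observations(A, B, pi, T, seed=None):
    """Follow the generation steps above: draw an initial state from pi,
    then alternately emit a symbol from row x of B and move to a new
    state according to row x of A, for T clock ticks."""
    rng = random.Random(seed)
    N, M = len(A), len(B[0])
    observations = []
    # Steps 1-2: set t = 0 and draw the initial state from pi.
    x = rng.choices(range(N), weights=pi)[0]
    for t in range(T):
        # Step 3: emit observation O_t according to b_x(.), row x of B.
        observations.append(rng.choices(range(M), weights=B[x])[0])
        # Steps 4-5: advance the clock and transition via row x of A.
        x = rng.choices(range(N), weights=A[x])[0]
    return observations
```

For instance, with the variable factory parameters, `generate_observations(A, B, pi, 3)` returns a length-$3$ list of symbols from $\left\{0, \dots, M-1\right\}$.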

The independence assumptions in this notation are

$$\begin{array}{ll}
\mathbb{P}\left[\mathcal{O}_n = j \,|\, x_n = q_i\right] & = b_i(j) \\
\mathbb{P}\left[\mathcal{O}_n = j \,|\, x_0, \mathcal{O}_0, \dots, x_{n-1}, \mathcal{O}_{n-1}, x_n = q_i\right] & = b_i(j)
\end{array}$$

The matrix $A$ is an $N\times N$ row stochastic matrix with probabilities ${a}_{ij}$ independent of $t$, so we are assuming a stationary Markov process. The interpretation of $A$ is

$${a}_{ij}=\mathbb{P}\left[\text{state } q_j \text{ at } t+1 \,|\, \text{state } q_i \text{ at } t\right].$$

The standard notation in Hidden Markov Models for the $N\times M$ matrix $B$ is

$${b}_{i}\left(j\right)=\mathbb{P}\left[\text{observation } j \text{ at } t \,|\, \text{state } q_i \text{ at } t\right].$$

As with $A$, the matrix $B$ is row stochastic and the probabilities are independent of $t$. The triple $\lambda =\left(A,B,\pi \right)$ defines a Hidden Markov Model.

Although $\mathcal{O}=\left(\mathcal{O}_0,\mathcal{O}_1,\mathcal{O}_2,\dots ,\mathcal{O}_{n-1}\right)$ is not a Markov chain (explain why!), conditional on the current state ${x}_{n}$, the sequence $\mathcal{O}_n, x_{n+1}, \mathcal{O}_{n+1}, \dots $ of future signals and states is independent of the sequence ${x}_{0},\mathcal{O}_0,\dots ,{x}_{n-1},\mathcal{O}_{n-1}$.

As an example, for an example state sequence of length $4$

$$X=\left({x}_{0},{x}_{1},{x}_{2},{x}_{3}\right)$$

with corresponding observations

$$\mathcal{\mathcal{O}}=\left({\mathcal{\mathcal{O}}}_{0},{\mathcal{\mathcal{O}}}_{1},{\mathcal{\mathcal{O}}}_{2},{\mathcal{\mathcal{O}}}_{3}\right)$$

let ${\pi}_{{x}_{0}}$ be the probability of starting in state ${x}_{0}$, ${b}_{{x}_{0}}\left({\mathcal{O}}_{0}\right)$ the probability of initially observing ${\mathcal{O}}_{0}$, and ${a}_{{x}_{0},{x}_{1}}$ the probability of transitioning from state ${x}_{0}$ to state ${x}_{1}$. Continuing and multiplying the conditional probabilities, the probability of this state sequence with these observations is

$$\mathbb{P}\left[X,\mathcal{O}\right]={\pi}_{{x}_{0}}\,{b}_{{x}_{0}}\left({\mathcal{O}}_{0}\right)\,{a}_{{x}_{0},{x}_{1}}\,{b}_{{x}_{1}}\left({\mathcal{O}}_{1}\right)\,{a}_{{x}_{1},{x}_{2}}\,{b}_{{x}_{2}}\left({\mathcal{O}}_{2}\right)\,{a}_{{x}_{2},{x}_{3}}\,{b}_{{x}_{3}}\left({\mathcal{O}}_{3}\right).$$
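The same product of conditional probabilities works for a path of any length. A short Python sketch; the function name is illustrative, and the model is given as nested lists:

```python
def path_probability(A, B, pi, X, O):
    """Joint probability P[X, O] of a state sequence X with observation
    sequence O under lambda = (A, B, pi):
    pi_{x0} * b_{x0}(O_0) * a_{x0,x1} * b_{x1}(O_1) * ..."""
    p = pi[X[0]] * B[X[0]][O[0]]                 # start, then first emission
    for t in range(1, len(X)):
        p *= A[X[t - 1]][X[t]] * B[X[t]][O[t]]   # transition, then emit
    return p
```

With the variable factory parameters used later in this section (observations coded $a=0$, $u=1$), `path_probability(A, B, pi, [0, 0, 1], [0, 1, 0])` reproduces the value $0.000684288$ computed below.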

- Given the model $\lambda =\left(A,B,\pi \right)$ and a sequence of observations $\mathcal{O}$, find $\mathbb{P}\left[\mathcal{O}\,|\,\lambda \right]$. That is, we want to determine the likelihood of the observed sequence $\mathcal{O}$, given the model. Problem 1 is the evaluation problem: given a model and a sequence of observations, how can we compute the probability that the model produced the observed sequence? We can also view the problem as: how do we “score” or evaluate the model? If we think of the case in which we have several competing models, the solution of Problem 1 allows us to choose the model that best matches the observations.

- Given the model $\lambda =\left(A,B,\pi \right)$ and a sequence of observations $\mathcal{O}$, find an optimal state sequence after defining “optimal” in some way for the model. In other words, we want to uncover the hidden part of the Hidden Markov Model. Problem 2 is the estimation problem or discovery problem, in which we attempt to uncover the hidden state sequence. We usually use an optimality criterion to discriminate which sequence best matches the observations. Two optimality criteria are common, and so the choice of criterion strongly influences the revealed state sequence. See the next section for specific examples of the two optimality criteria.

- Given an observation sequence $\mathcal{O}$ and the dimensions $N$ and $M$, find the model $\lambda =\left(A,B,\pi \right)$ that maximizes the probability of $\mathcal{O}$. This can be interpreted as training a model to best fit the observed data. We can also view this as a search in the parameter space represented by $A$, $B$ and $\pi $. The solution of Problem 3 attempts to optimize the model parameters to best describe how the observed sequence comes about. The observed sequence used to solve Problem 3 is a training sequence since it “trains” the model. This training problem is the crucial one for most applications of hidden Markov models since it creates the best models for real phenomena.

Rabiner [2] credits Jack Ferguson of the Institute for Defense Analysis with introducing the three fundamental problems for Hidden Markov Models.

Remark. The notation $\mathbb{P}\left[\mathcal{O}\,|\,\lambda \right]$ is standard and traditional in the theory and application of Hidden Markov Models. It is similar to notation occurring in statistics with parametric tests, indicating that the model depends on the values of the parameters. The notation $\mathbb{P}\left[\mathcal{O}\,|\,\lambda \right]$ is not strictly a conditional probability in the mathematical sense. A mathematician’s typical notation would indicate the parameter dependence with subscripts, so we would write ${\mathbb{P}}_{\lambda}\left[\mathcal{O}\right]$ instead. However, the notation $\mathbb{P}\left[\mathcal{O}\,|\,\lambda \right]$ is so strongly embedded in the literature of Hidden Markov Models that I will continue to use it, except occasionally in proofs where mathematical conditional probabilities do occur and the notation would mix meanings.

Consider the variable factory example from the previous section. For brevity, represent the factory good state with ${q}_{0}=0$ and the bad state with ${q}_{1}=1$. Suppose that for some particular three periods, the observation of factory outputs is $a,u,a$. We can exhaustively compute the probabilities of this observation sequence from various state sequences; see Table 1. We want to determine the most likely state sequence of the Markov process over these three periods. We could reasonably define “most likely” as the state sequence with the highest probability from among all possible state sequences of length three. Later we will use Dynamic Programming (the Viterbi algorithm) to efficiently find this particular sequence. On the other hand, we might reasonably define “most likely” as the state sequence that maximizes the expected number of correct states. In the next section, we will use the state probabilities associated with Hidden Markov Models to find this sequence. The Dynamic Programming and Hidden Markov Model solutions are not necessarily the same.

As a single example, compute the probability of starting in state $0$, expressing observation $a$, transitioning from state $0$ to state $0$, expressing observation $u$, transitioning to state $1$, and ﬁnally expressing observation $a$.

$$\mathbb{P}\left[001\right]=\left(0.8\right)\left(0.99\right)\left(0.9\right)\left(0.01\right)\left(0.1\right)\left(0.96\right)=0.000684288.$$

Similarly, we can directly compute the probability of each of the remaining $7$ possible state sequences of length $3$. The results are in Table 1 with the probabilities in the last column normalized by the total probability of observing $\left(a,u,a\right)$ so they sum to $1$. To ﬁnd the optimal state sequence in the Dynamic Programming sense, choose the sequence with the highest probability, namely $111$.

To find the optimal sequence in the Hidden Markov Model sense, we choose the most probable state at each position. To this end, we sum the probabilities in Table 1 that have a $0$ in the first position. Doing so, we find the normalized probability of $0$ in the first position is $0.5774$, and hence the probability of a $1$ in the first position is the complementary probability $0.4226$. The Hidden Markov Model approach therefore chooses the first element of the sequence to be $0$. Repeating this calculation for each time step of the sequence gives the probabilities in Table 2. From Table 2, the optimal sequence in the Hidden Markov Model sense is $0,1,1$. The optimal Dynamic Programming solution differs from the optimal Hidden Markov Model sequence.

| State sequence | Product of factors | Probability | Normalized probability |
|---|---|---|---|
| $000$ | $\left(0.8\right)\left(0.99\right)\left(0.9\right)\left(0.01\right)\left(0.9\right)\left(0.99\right)$ | $0.006351$ | $0.3640$ |
| $001$ | $\left(0.8\right)\left(0.99\right)\left(0.9\right)\left(0.01\right)\left(0.1\right)\left(0.96\right)$ | $0.0006843$ | $0.03921$ |
| $010$ | $\left(0.8\right)\left(0.99\right)\left(0.1\right)\left(0.04\right)\left(0.0\right)\left(0.99\right)$ | $0$ | $0$ |
| $011$ | $\left(0.8\right)\left(0.99\right)\left(0.1\right)\left(0.04\right)\left(1.0\right)\left(0.96\right)$ | $0.0030413$ | $0.1743$ |
| $100$ | $\left(0.2\right)\left(0.96\right)\left(0.0\right)\left(0.01\right)\left(0.9\right)\left(0.99\right)$ | $0$ | $0$ |
| $101$ | $\left(0.2\right)\left(0.96\right)\left(0.0\right)\left(0.01\right)\left(0.1\right)\left(0.96\right)$ | $0$ | $0$ |
| $110$ | $\left(0.2\right)\left(0.96\right)\left(1.0\right)\left(0.04\right)\left(0.0\right)\left(0.99\right)$ | $0$ | $0$ |
| $111$ | $\left(0.2\right)\left(0.96\right)\left(1.0\right)\left(0.04\right)\left(1.0\right)\left(0.96\right)$ | $0.007373$ | $0.4225$ |
| sum |  | $0.01745$ | $1$ |

Table 1: Probability of each state sequence of length $3$ producing the observations $\left(a,u,a\right)$, with the last column normalized to sum to $1$.

| Period/Observation | 0 $\left(a\right)$ | 1 $\left(u\right)$ | 2 $\left(a\right)$ |
|---|---|---|---|
| $\mathbb{P}\left[0\right]$ | $0.5774$ | $0.4031$ | $0.3640$ |
| $\mathbb{P}\left[1\right]$ | $0.4226$ | $0.5968$ | $0.6360$ |

Table 2: Probability of each state at each period, given the observation sequence $\left(a,u,a\right)$.
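The exhaustive calculations in Tables 1 and 2 can be reproduced with a short script; a sketch in Python, assuming the variable factory parameters that appear as factors in Table 1:

```python
from itertools import product

# Variable factory parameters, read off from the factors in Table 1.
A = [[0.9, 0.1], [0.0, 1.0]]
B = [[0.99, 0.01], [0.96, 0.04]]
pi = [0.8, 0.2]
O = [0, 1, 0]                      # the observation sequence a, u, a

def joint(X):
    """P[X, O] = pi_{x0} b_{x0}(O_0) a_{x0,x1} b_{x1}(O_1) ..."""
    p = pi[X[0]] * B[X[0]][O[0]]
    for t in range(1, len(X)):
        p *= A[X[t - 1]][X[t]] * B[X[t]][O[t]]
    return p

probs = {X: joint(X) for X in product([0, 1], repeat=3)}
total = sum(probs.values())        # P[(a, u, a)], about 0.01745

# Dynamic Programming sense: the single state sequence of highest
# probability, namely 111.
dp_best = max(probs, key=probs.get)

# Hidden Markov Model sense: at each period choose the state with the
# larger summed (normalized) probability, giving 0, 1, 1 as in Table 2.
hmm_best = tuple(
    max((0, 1), key=lambda s, t=t: sum(p for X, p in probs.items() if X[t] == s))
    for t in range(3))
```

The script confirms `dp_best == (1, 1, 1)` and `hmm_best == (0, 1, 1)`, the two different optimal sequences found above.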

This section is adapted from: “A Revealing Introduction to Hidden Markov Models” by Mark Stamp, December 11, 2015; and “An Introduction to Hidden Markov Models”, by L. R. Rabiner and B. H. Juang. The calculations for the Variable Factory are extended from Sheldon M. Ross, Introduction to Probability Models, Section 4.11, pages 256-262, Academic Press, 2006, 9th Edition.

_______________________________________________________________________________________________

__________________________________________________________________________

- Verify all calculations in Table 1
- Verify all calculations in Table 2
- Create similar exhaustive state tables for the example of average annual temperatures with observations of tree-ring growth, using the observation sequence $S,M,S,L$.

__________________________________________________________________________

[1] L. R. Rabiner and B. H. Juang. An Introduction to Hidden Markov Models. IEEE ASSP Magazine, pages 4–16, January 1986.

[2] Lawrence R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2):257–286, February 1989.

[3] Sheldon M. Ross. Introduction to Probability Models. Academic Press, 9th edition, 2006.

[4] Mark Stamp. A Revealing Introduction to Hidden Markov Models. https://www.cs.sjsu.edu/~stamp/RUA/HMM.pdf, December 2015.

__________________________________________________________________________

__________________________________________________________________________

I check all the information on each page for correctness and typographical errors. Nevertheless, some errors may occur and I would be grateful if you would alert me to such errors. I make every reasonable eﬀort to present current and accurate information for public use, however I do not guarantee the accuracy or timeliness of information on this website. Your use of the information from this website is strictly voluntary and at your risk.

I have checked the links to external sites for usefulness. Links to external websites are provided as a convenience. I do not endorse, control, monitor, or guarantee the information contained in any external website. I don’t guarantee that the links are active at all times. Use the links here with the same caution as you would all information on the Internet. This website reﬂects the thoughts, interests and opinions of its author. They do not explicitly represent oﬃcial positions or policies of my employer.

Information on this website is subject to change without notice.

Steve Dunbar’s Home Page, http://www.math.unl.edu/~sdunbar1

Email to Steve Dunbar, sdunbar1 at unl dot edu

Last modified: Processed from LaTeX source on April 14, 2017