Steven R. Dunbar
Department of Mathematics
203 Avery Hall
University of Nebraska-Lincoln
Lincoln, NE 68588-0130
http://www.math.unl.edu
Voice: 402-472-3731
Fax: 402-472-8466

Topics in
Probability Theory and Stochastic Processes
Steven R. Dunbar

__________________________________________________________________________

Examples of Hidden Markov Models

_______________________________________________________________________

Note: These pages are prepared with MathJax. MathJax is an open source JavaScript display engine for mathematics that works in all browsers. See http://mathjax.org for details on supported browsers, accessibility, copy-and-paste, and other features.

_______________________________________________________________________________________________

Rating

Student: contains scenes of mild algebra or calculus that may require guidance.

_______________________________________________________________________________________________

Section Starter Question

__________________________________________________________________________

Key Concepts

  1. Hidden Markov Models are useful representations of situations ranging from bioinformatics to speech recognition and language processing.
  2. A Hidden Markov Model consists of a Markov chain among states, each of which emits a signal or observation. The states themselves are hidden.
  3. With Hidden Markov Models we usually have only the observations or signals, not all the information needed for a complete representation. From the observations, we wish to find the “most likely” states. The words “most likely” indicate that we must consider possible measures of optimality. So Hidden Markov Models present a modeling and statistical problem, and in some ways an inverse problem. That accounts for calling these Hidden Markov Models and not treating them simply from the point of view of Markov processes.

__________________________________________________________________________

Vocabulary

  1. In a Markov chain process, if each state emits a random signal or observation from a set of possible signals while the process states themselves are unobservable, then we say the process is a hidden Markov chain model.
  2. A standard mathematical example of a general Hidden Markov Model is an urn and ball model.

__________________________________________________________________________

Mathematical Ideas

Toy Examples of Hidden Markov Models

A Variable Factory

A production process in a factory is either in a good state (call it state 0) or in a poor state (state 1). If the process is in state 0 during some period then, independent of the past, with probability 0.9 it will be in state 0 during the next period and with probability 0.1 it will be in state 1. Once in state 1 it remains in that state forever. Suppose the factory produces a single item in each period and that each item produced when the process is in state 0 is of acceptable quality with probability 0.99, while each item produced when the process is in state 1 is of acceptable quality with probability 0.96.

If the status, either acceptable or unacceptable, of each successive item is observable, while the process states are unobservable, then we say the process is a hidden Markov chain model. The state of the factory is 0 or 1 and the signal is the quality of the item produced with value either a or u, depending on whether the item is acceptable or unacceptable. The transition probabilities of the underlying Markov chain are

\[
A = \begin{pmatrix} 0.9 & 0.1 \\ 0 & 1 \end{pmatrix},
\]

with rows and columns indexed by the states 0 and 1.

The signal probabilities are

\[
P(a \mid 0) = 0.99, \quad P(u \mid 0) = 0.01, \quad P(a \mid 1) = 0.96, \quad P(u \mid 1) = 0.04,
\]

so the emission matrix is

\[
B = \begin{pmatrix} 0.99 & 0.01 \\ 0.96 & 0.04 \end{pmatrix},
\]

with rows indexed by the states 0 and 1 and columns by the signals a and u.

Suppose that the probability of starting in state 0 is 0.8, so the initial distribution is π = [0.8, 0.2].

Suppose the observed sequence of three items is (a, u, a). Then for each possible state sequence, the joint probability of that state sequence and the observed sequence is in Table 1.


State sequence   Probability product                   Value

000              (0.8)(0.99)(0.9)(0.01)(0.9)(0.99)     0.006351048
001              (0.8)(0.99)(0.9)(0.01)(0.1)(0.96)     0.000684288
010              (0.8)(0.99)(0.1)(0.04)(0.0)(0.99)     0
011              (0.8)(0.99)(0.1)(0.04)(1.0)(0.96)     0.00304128
100              (0.2)(0.96)(0.0)(0.01)(0.9)(0.99)     0
101              (0.2)(0.96)(0.0)(0.01)(0.1)(0.96)     0
110              (0.2)(0.96)(1.0)(0.04)(0.0)(0.99)     0
111              (0.2)(0.96)(1.0)(0.04)(1.0)(0.96)     0.0073728

Table 1: Probability products for each state sequence, giving the joint probability of the state sequence and the observed sequence (a, u, a).

Notice that even without the zero entries, many of the products share common factors, so grouping those factors makes the calculation more efficient.
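The entries of Table 1 are easy to reproduce by brute-force enumeration. The following Python sketch (not part of the original notes) encodes the model parameters given above, computes the joint probability for each of the eight state sequences, and sums them.

    from itertools import product

    # Variable factory model: states 0 (good) and 1 (poor),
    # signals "a" (acceptable) and "u" (unacceptable).
    pi = [0.8, 0.2]                      # initial state distribution
    A = [[0.9, 0.1],                     # transition probabilities
         [0.0, 1.0]]
    B = [{"a": 0.99, "u": 0.01},         # emission probabilities, state 0
         {"a": 0.96, "u": 0.04}]         # emission probabilities, state 1

    obs = ["a", "u", "a"]                # observed sequence of item qualities

    total = 0.0
    for states in product([0, 1], repeat=len(obs)):
        # joint probability of this state sequence and the observations
        p = pi[states[0]] * B[states[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[states[t - 1]][states[t]] * B[states[t]][obs[t]]
        total += p
        print("".join(map(str, states)), p)

    print("P(O) =", total)               # sum over all state sequences

Summing the last column of Table 1 over all eight state sequences gives the total probability of the observed sequence, about 0.01745.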

A Paleontological Temperature Model

We want to determine the average annual temperature at a particular place on earth over a sequence of years in the distant past. For simplicity, we consider only two annual average temperatures, “hot” and “cold”. Suppose that modern evidence indicates the probability of a hot year followed by another hot year is 0.7 and the probability of a cold year followed by another cold year is 0.6, independent of the temperature in prior years. Assume that these probabilities held in the distant past as well. A probability transition matrix summarizing the information is

\[
A = \begin{pmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{pmatrix},
\]

with rows and columns indexed by the states H and C.

Also suppose that current research indicates a correlation between the size of tree growth rings and temperature. Again for simplicity, we consider only three different tree ring sizes, small designated as S, medium designated as M, and large designated as L, the observable signal of the average annual temperature. Suppose that based on available evidence, the probabilistic relationship between annual temperature and tree ring sizes is

\[
B = \begin{pmatrix} 0.1 & 0.4 & 0.5 \\ 0.7 & 0.2 & 0.1 \end{pmatrix},
\]

with rows indexed by the states H and C and columns by the ring sizes S, M, and L.

For this system, the state is the annual average temperature, either H or C. The transition from one state to another is a Markov chain. However, these are hidden states, since we can’t directly observe the temperature in the past. Although we can’t observe the state or temperature in the past, we can observe the size of tree rings. From this evidence, we would like to determine the most likely temperature state in past years.
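To make the phrase “most likely temperature state” concrete, here is a minimal Python sketch that enumerates every hidden temperature sequence for a short ring-size sequence and reports the most probable one. The observation sequence (S, M, S, L) and the initial distribution π = (0.6, 0.4) are hypothetical choices, since the text above specifies only A and B; for long sequences this exponential enumeration is replaced by the Viterbi algorithm.

    from itertools import product

    # Hidden states: temperatures H and C; observations: ring sizes S, M, L.
    temps = ["H", "C"]
    pi = {"H": 0.6, "C": 0.4}            # assumed initial distribution (not given above)
    A = {"H": {"H": 0.7, "C": 0.3},
         "C": {"H": 0.4, "C": 0.6}}
    B = {"H": {"S": 0.1, "M": 0.4, "L": 0.5},
         "C": {"S": 0.7, "M": 0.2, "L": 0.1}}

    obs = ["S", "M", "S", "L"]           # hypothetical observed ring sizes

    best_p, best_seq = 0.0, None
    for seq in product(temps, repeat=len(obs)):
        # joint probability of this temperature sequence and the ring sizes
        p = pi[seq[0]] * B[seq[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[seq[t - 1]][seq[t]] * B[seq[t]][obs[t]]
        if p > best_p:
            best_p, best_seq = p, seq

    print("most likely temperatures:", "".join(best_seq), "probability:", best_p)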

The occasionally cheating casino

In a hypothetical dishonest casino, the casino uses a fair die most of the time, but occasionally the casino secretly switches to a loaded die, and later the casino switches back to the fair die. A probabilistic process determines the switching back-and-forth from loaded die to fair die and back again after each toss of the die, with the switch from fair-to-loaded occurring with probability 0.05 and from loaded-to-fair with probability 0.1. In addition, assume that the loaded die will come up “six” with probability 0.5 and the remaining five numbers with probability 0.1 each. The transition matrix is

\[
A = \begin{pmatrix} 0.95 & 0.05 \\ 0.1 & 0.9 \end{pmatrix},
\]

with rows and columns indexed by the states F (fair) and L (loaded),

and the emission probability matrix is

\[
B = \begin{pmatrix} 1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 \\ 1/10 & 1/10 & 1/10 & 1/10 & 1/10 & 1/2 \end{pmatrix},
\]

with rows indexed by the states F and L and columns by the faces 1 through 6.

If you can see only the sequence of rolls (the sequence of observations or signals) you do not know which rolls used a loaded die and which used a fair die, because the casino hides the state. This is an example of a Hidden Markov Model.
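A short simulation makes the hidden structure concrete. The Python sketch below (assuming, arbitrarily, that the casino starts with the fair die) generates a sequence of rolls together with the fair/loaded states that the casino hides from the gambler.

    import random

    random.seed(17)                           # reproducible example run

    # States: "F" (fair die) and "L" (loaded die).
    A = {"F": {"F": 0.95, "L": 0.05},         # switching probabilities between tosses
         "L": {"F": 0.10, "L": 0.90}}
    faces = [1, 2, 3, 4, 5, 6]
    B = {"F": [1/6] * 6,                      # fair die
         "L": [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]} # loaded die favors "six"

    state = "F"                               # assume the casino starts with the fair die
    rolls, hidden = [], []
    for _ in range(20):
        rolls.append(random.choices(faces, weights=B[state])[0])
        hidden.append(state)
        state = random.choices(["F", "L"],
                               weights=[A[state]["F"], A[state]["L"]])[0]

    print("rolls :", rolls)                   # what the gambler sees
    print("states:", hidden)                  # what the casino hides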

Standard Mathematical Examples

Urn and ball model

A standard general mathematical example is an urn and ball model. There are N urns, each filled with colored balls, with M possible colors for the balls. Generate the observation sequence by initially choosing one of the N urns randomly according to an initial probability distribution, randomly selecting a ball, recording its color, replacing the ball, and then choosing a new urn according to a transition probability distribution associated with the current urn. Then at each time, the signal or observation is the color of the selected ball. The hidden states are the urns. The sketch after Figure 1 illustrates this procedure in code.



Figure 1: Schematic diagram of an urn and ball model with N = 3 urns and M = 6 colors.
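The generative procedure just described translates directly into code. The following Python sketch samples an observation sequence from an urn and ball model; the particular values of π, A, and B for N = 3 urns and M = 6 colors are invented for illustration.

    import random

    def sample_urn_ball(pi, A, B, colors, T, rng=random):
        """Generate T observations from an urn and ball hidden Markov model.

        pi : initial probabilities over the N urns
        A  : N x N urn-to-urn transition probabilities
        B  : N x M ball-color probabilities for each urn
        """
        N = len(pi)
        urns, balls = [], []
        urn = rng.choices(range(N), weights=pi)[0]        # choose the starting urn
        for _ in range(T):
            urns.append(urn)
            balls.append(rng.choices(colors, weights=B[urn])[0])  # draw, record, replace
            urn = rng.choices(range(N), weights=A[urn])[0]        # move to the next urn
        return urns, balls

    # Hypothetical model with N = 3 urns and M = 6 colors, as in Figure 1.
    pi = [0.5, 0.3, 0.2]
    A = [[0.6, 0.3, 0.1],
         [0.2, 0.6, 0.2],
         [0.1, 0.4, 0.5]]
    B = [[0.4, 0.2, 0.1, 0.1, 0.1, 0.1],
         [0.1, 0.1, 0.4, 0.2, 0.1, 0.1],
         [0.1, 0.1, 0.1, 0.1, 0.2, 0.4]]
    colors = ["red", "orange", "yellow", "green", "blue", "violet"]

    urns, balls = sample_urn_ball(pi, A, B, colors, T=10)
    print("hidden urns    :", urns)
    print("observed colors:", balls)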

Coin Flip Models

Consider the following coin tossing experiment: You are in a room with a curtain through which you cannot see what is happening. On the other side of the curtain a person is tossing a coin, or maybe one of several coins. The other person will not tell you exactly what is happening, only the result of each coin flip. This is a sequence of hidden coin flips. Thus you can only use the results of the coin tosses, say

𝒪 = HHHTTHHT

with H for heads and T for tails. Take first the case that the proportions of heads and tails are statistically equal, without any obvious pattern or organization to the occurrences of heads and tails. How do we build a Hidden Markov Model to best explain the observed sequence of heads and tails?



Figure 2: State diagram for the one coin model.

Figure 2 shows the first possible model. This simplest model, the “one-fair-coin model”, has two states, each directly associated with heads or tails. The probability of being in the state generating a head is 0.5, and likewise for the state generating a tail. This model is not truly hidden because each observation directly determines the state. It is a degenerate example of a Hidden Markov Model, exactly the same as the classic stochastic process of repeated Bernoulli trials.

A second possible Hidden Markov Model for the observations is a “two-fair-coins model”, see Figure 3. Associate each state with a fair coin, so the probability of generating a head in each state is p = 0.5. In this special case the probabilities of remaining in or leaving each of the two states form a probability transition matrix whose entries are unimportant, because the observable sequences generated in the two states are statistically indistinguishable. That means the two-fair-coins model is indistinguishable from the one-fair-coin model in a statistical sense, so this is another degenerate example of a Hidden Markov Model.



Figure 3: Two fair coin model

Other Hidden Markov Models which can account for an observed sequence with equal proportions of heads and tails are possible. Take the “two-compensating-biased-coins model” as a model of what happens behind the curtain. The model has two different states corresponding to two different coins. In one state the coin is biased toward heads, say with P(H) = p > 0.5. In the other state the coin is biased toward tails with P(H) = 1 − p < 0.5. The state transition probabilities all equal 0.5. See Figure 4. This could be accomplished by the person behind the curtain having two biased coins and a third fair coin, with the biased coins associated to the two faces of the fair coin respectively. The person behind the curtain flips the fair coin to decide which biased coin to use and then flips the chosen biased coin to generate the observed outcome. Note that with the complementary biased coin probabilities indicated in Figure 4, the long term averages of heads and tails would be statistically indistinguishable from either the one-fair-coin model or the two-fair-coins model.



Figure 4: Two compensating biased coins model

It is clear that other more complicated models with three or more coins could also be constructed. In this special case, where the proportions of heads and tails are statistically equal without any obvious pattern or organization to the occurrences of heads and tails, there would have to be some compelling physical reason to choose a multiple-hidden-state model over the simpler and equivalent one-fair-coin model.

Now in another direction, imagine that the observed sequence is a very long sequence of heads followed by a long sequence of tails of some random length, interrupted by a head, followed again by yet another long random sequence of tails. Then we might use a “two-biased-coins model”, with two biased coins and biased switching between the states of the coins, as a possible model for the observed sequence. Of course, such a sequence of many heads followed by many tails could conceivably come from a fair coin. The choice between a one-fair-coin model and a two-biased-coins model would be justified by the likelihoods of the observations under the models, or possibly by other external modeling considerations.



Figure 5: Two biased coin model

However, higher-order statistics of the two-biased-coins model, such as the probability of runs of heads, should be distinguishable from those of the one-fair-coin or two-fair-coins model.



Figure 6: Three coin model

Another Hidden Markov Model could be the “three-biased-coins model”. In the first state the coin is slightly biased toward heads, in the second state the coin is slightly biased toward tails, and in the third state the coin has some other distribution, maybe fair, maybe biased in some direction. A Markov chain determines the transition probabilities among the three states. See Figure 6. The sequence of observations depends on the biases and the transition probabilities. The simple statistics and higher-order statistics of the observations would be correspondingly influenced and would suggest the appropriateness of this choice of model.

Several important points emerge from the possibility of different models for the observed outputs of the coin tossing experiment behind the curtain.

Realistic Hidden Markov Models

CpG Islands

In the human genome the dinucleotide CG (frequently written CpG to distinguish it from the CG base-pair across two strands) is rarer than expected from the independent probabilities of C and G, for reasons of chemistry that transform the C into a T. For biologically important reasons, the chemical transformation is suppressed in short regions of the genome, such as around the promoters or start regions of many genes. In these regions, we see significantly more CpG dinucleotides than elsewhere. Such regions are called CpG islands. The islands are typically a few hundred to a few thousand bases long.

Given a short stretch of genomic sequence, how would we decide if it comes from a CpG island or not? Second, given a long piece of sequence, how would we find the CpG islands in it, if any exist? Here the model has two states, CpG island and non-island. In each state, the probability of emitting the CpG dinucleotide is different.

Profile HMMs in Bioinformatics

For a pair (or more) of proteins, an important question is: How are the proteins similar? The goals are to detect and measure overall similarity between protein amino acid sequences, to find proteins with similar functions in different organisms by finding similar subsequences of amino acids called “conserved sequences”, and to trace the evolution of conserved sequences. Alignment is the method for answering these questions.

There are two types of alignment: A global alignment is an alignment of the full length of two sequences. A local alignment is an alignment of part of one sequence to part of another sequence. For possibly distantly related sequences, it might be more sensible to make local alignments of subregions of high similarity, not the whole sequence. A sample toy alignment is in Figure 7.


AC- - AC- TGT
TAGACGGAGCT- TCAC

Figure 7: Toy example of gapped alignment of DNA sequences.

Alignment allows amino acid matches and mismatches along columns with corresponding scores based on chemistry and biology. In order to make alignments, we also allow introduction of gaps in either of the protein amino acid sequences. Introducing gaps when making alignments adds penalty scores.

A common task in bioinformatics is to obtain a cluster of related sequences, e.g. from a database, and then to align those sequences using multiple sequence alignment algorithms. The clustering reflects the insights of the biology community as to which proteins belong within the same family. The outcome of the clustering process is a set of distinct protein families. This is the first step in most phylogenetic analyses. Heuristic algorithms are generally used to create multiple sequence alignments. There are large databases of proteins and alignments, some created with Hidden Markov Models, some provide Hidden Markov Model data, see below. An example of an actual multiple sequence alignment (MSA) is in Figure 8.



Figure 8: Alignment of acidic ribosomal protein P0 from several organisms.

A profile HMM (pHMM) is a particular Hidden Markov Model whose states, signals, transition matrix, and emission matrix summarize a multiple sequence alignment. The goal is to use this Hidden Markov Model information about the MSA to align a new query sequence. Profile HMMs have three states for each alignment position, i.e. each column in the multiple sequence alignment. Three outcomes are possible when aligning each residue of the query sequence with the MSA: the residue matches an alignment column (a match state), the residue is an insertion relative to the alignment (an insert state), or an alignment column has no corresponding residue in the query (a delete state).

There are heuristic rules for assigning MSA columns as match states; for example, a column is a match column if fewer than half of its characters are gaps. The length of a pHMM is the number of columns in the MSA assigned to match states. Each match state in the pHMM has its corresponding set of emission probabilities, generated by counting the frequencies of each amino acid in the corresponding column. Insertions, i.e. portions of the query sequence that do not match anything in the multiple alignment, correspond to an insert state. As with the match states, each insert state has its own set of emission probabilities, typically generated from the distribution of amino acids over the entire MSA. A delete state is possible for each of the positions in the MSA. The delete state is an example of a silent state in the model, as it does not emit any residues. A small example of this construction appears in the sketch below.
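The column-by-column construction of match states and their emission probabilities can be sketched in a few lines of Python. The toy MSA below is invented for illustration, and the pseudocounts that a real profile HMM would add to avoid zero probabilities are omitted.

    from collections import Counter

    # A toy multiple sequence alignment: rows are aligned sequences, "-" is a gap.
    msa = ["VGA-HAGEY",
           "V----NVDE",
           "VEA-DVAGH",
           "VKG------",
           "VYS-TYETS"]
    nseqs, ncols = len(msa), len(msa[0])

    # Heuristic from the text: a column is a match state
    # if fewer than half of its characters are gaps.
    match_cols = [j for j in range(ncols)
                  if sum(row[j] == "-" for row in msa) < nseqs / 2]
    print("match columns:", match_cols, "  pHMM length:", len(match_cols))

    # Emission probabilities of each match state: residue frequencies in its column.
    for j in match_cols:
        residues = [row[j] for row in msa if row[j] != "-"]
        counts = Counter(residues)
        probs = {aa: c / len(residues) for aa, c in counts.items()}
        print("column", j, "->", probs)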

Let l denote the number of match locations. Then the associated profile HMM has 3l + 3 states in the underlying Markov process, namely l match states, l delete states, l + 1 insert states, and a begin state and an end state.



Figure 9: Schematic diagram of the transitions in a profile HMM

Thus, a pHMM typically has many more states than the previous examples of Hidden Markov Models, which have only a handful of states. The connections between the states are highly structured and more complicated than in the other examples. The set of emissions, 20 or fewer, is about the same order of magnitude as in the other examples.

A typical application of a profile HMM is the following: Start with a collection of protein families (clusters) \(F_1, \dots, F_k\), where all proteins within a family have the same length after assigning gaps as necessary. For each family \(F_i\), construct a corresponding profile HMM, \(\lambda(F_i)\) in the notation of the next section. The objective is to assign a newly sequenced protein to one of the k families. Then compute the likelihood \(P(\mathcal{O} \mid \lambda(F_i))\) of the gap-aligned new protein for each of the k profile HMMs. The new protein is then assigned to the family for which the likelihood is maximum. This is the “scoring” problem of Hidden Markov Models.

Language Analysis Translation

Language translation is a classic application of Hidden Markov Models, originating with Cave and Neuwirth; see [4] for history and additional details. Suppose you do not understand English, but you do know something about punctuation. You obtain a large body of English text, such as the “Brown Corpus”. Henry Kučera and W. Nelson Francis at Brown University compiled The Brown University Standard Corpus of Present-Day American English as a general corpus (text collection) in the field of corpus linguistics. The Brown Corpus contains 500 samples of English-language text, totaling roughly one million words, compiled from works published in the United States in 1961. The Brown Corpus is one of many such corpora available through the Natural Language Toolkit; see nltk.org. With knowledge of Hidden Markov Models, but no knowledge of English, you would like to determine some basic properties of this mysterious writing system. Can you partition the characters into sets so that characters in each set are “different” in some statistically significant way?

First remove all punctuation and numbers and convert all letters to lower case. This leaves 26 distinct letters and the space, for a total of 27 symbols. The observations are the series of characters found in the resulting text. Then test the hypothesis that English text has an underlying Markov chain with two states. For each of these two hidden states, assume that the 27 symbols appear according to a fixed probability distribution. This sets up a Hidden Markov Model with N = 2 and M = 27 where the state transition probabilities and the observation probabilities from each state are unknown.
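The preprocessing step is simple to express in code. The Python sketch below reduces raw text to the 27-symbol alphabet; the handling of the spaces left behind by removed characters is a reasonable guess, since the case study does not spell it out. The full Brown Corpus is available through the Natural Language Toolkit (nltk.corpus.brown), but a literal string is enough to show the idea.

    import string

    def to_symbols(text):
        """Reduce raw English text to the 27-symbol alphabet (a-z and the space)."""
        text = text.lower()
        # keep only lowercase letters and spaces; drop punctuation and digits
        kept = [c for c in text if c in string.ascii_lowercase or c == " "]
        # collapse runs of spaces left behind by removed characters
        out = []
        for c in kept:
            if c == " " and out and out[-1] == " ":
                continue
            out.append(c)
        return "".join(out)

    sample = "The Brown Corpus contains 500 samples of English-language text."
    print(to_symbols(sample))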

Results of a case study [4], using the first 50,000 observations from the Brown Corpus (letters converted to lower case plus the space), are in Table 2. Without any assumption about the nature of the two states, the results sort into two familiar categories! The probabilities tell us that one hidden state contains the vowels while the other hidden state contains the consonants. Interestingly, space behaves more like a vowel and “y” like a consonant. The Hidden Markov Model “deduces” the statistically significant difference between vowels and consonants without knowing anything about the English language.


Letter    State 0    State 1

a         0.13845    0.00075
b         0.00000    0.02311
c         0.00062    0.05614
d         0.00000    0.06937
e         0.21404    0.00000
f         0.00000    0.03559
g         0.00081    0.02724
h         0.00066    0.07278
i         0.12275    0.00000
j         0.00000    0.00365
k         0.00182    0.00703
l         0.00049    0.07231
m         0.00000    0.03889
n         0.00000    0.11461
o         0.13156    0.00000
p         0.00040    0.03674
q         0.00000    0.00153
r         0.00000    0.10225
s         0.00000    0.11042
t         0.01102    0.14392
u         0.04508    0.00000
v         0.00000    0.01621
w         0.00000    0.02303
x         0.00000    0.00447
y         0.00019    0.02587
z         0.00000    0.00110
space     0.33211    0.01298

Table 2: Emission probabilities of letters from the two states, from [4].

Cave and Neuwirth obtain further results by allowing more than two hidden states. They are able to obtain and sensibly interpret the results for models with up to 12 hidden states. This example has further applications to automatic language translation.

This example suggests Hidden Markov Models may be applicable to cryptanalysis. In fact, a Hidden Markov Model has been applied to “secret messages” such as Hamptonese, the Voynich Manuscript and the “Kryptos” sculpture at the CIA headquarters but without too much success, [4]. Partly the reasons for success or failure depend on the quality of the transcriptions and partly on the assumptions that the cipher text is a plaintext in an unknown language, and not steganography, or even just babbling (glossolalia).

Speech Recognition

A classic example and practical application of Hidden Markov Models is speech recognition, especially isolated word recognition. Hidden Markov Models were developed in the 1960s and 1970s for satellite communication. Andrew Viterbi made a key contribution to the theory in 1967. They were later adapted for language analysis, translation and speech recognition in the 1970s and 1980s by Bell Labs and IBM [2]. Interest in HMMs for speech recognition seems to have peaked in the late 1980s.

Speech recognition takes place in several steps:

  1. Feature analysis – a spectral or temporal analysis of the speech signal to decompose the continuous sound sample into discrete observations of speech sounds for the Hidden Markov Model.
  2. Unit matching – the speech signal parts are matched to words or phonemes with a Hidden Markov Model.
  3. Lexical analysis – if the units are phonemes, combine the units into recognized words with either a deterministic or a probabilistic finite state network. If the units are words, this step can generally be eliminated.
  4. Syntactic analysis – with a grammar, group words into proper sequences. For single words like “yes” or “no”, or for digit sequences, this step is minimal or completely eliminated.
  5. Semantic analysis – interpret the words or word sequences for the task model.

Concentrating on the second step of unit matching, assume we have a vocabulary of V words to recognize. We have a training set of L tokens of each word, and also an independent observation set. We use the observations from the set of L tokens to estimate the optimum parameters for each word, creating model \(\lambda_v\) for the vth vocabulary word, \(1 \le v \le V\). For each unknown word with observation sequence \(\mathcal{O} = \mathcal{O}_0 \mathcal{O}_1 \cdots \mathcal{O}_{T-1}\) and for each word model \(\lambda_v\), we calculate \(P_v = P(\mathcal{O} \mid \lambda_v)\). We choose the word whose model probability is highest,

\[
v^{*} = \arg\max_{1 \le v \le V} P_v .
\]

For example, we could train a Hidden Markov Model, say \(\lambda_0\), to recognize the spoken word “no” and train another Hidden Markov Model, say \(\lambda_1\), to recognize the spoken word “yes”. (This is the step we will later call the solution to Problem 3.) Then given an unknown spoken word, we can use the Hidden Markov Model to score this word against \(\lambda_0\) and also against \(\lambda_1\) to decide if the spoken word is more likely “no”, “yes” or neither. (This is the problem we will later call Problem 1.) Notice that this application does not uncover the hidden states (which we will later call Problem 2) but such information might provide additional insight into the underlying speech model.
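As a sketch of this scoring step, the Python code below computes \(P_v = P(\mathcal{O} \mid \lambda_v)\) with the forward algorithm (the solution to the scoring problem treated later) and picks the word with the larger likelihood. The two word models and the three-symbol observation alphabet are made-up stand-ins for models trained on real speech features.

    def forward_likelihood(obs, pi, A, B):
        """P(O | lambda) for a discrete HMM, computed with the forward algorithm."""
        N = len(pi)
        alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
        for o in obs[1:]:
            alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                     for j in range(N)]
        return sum(alpha)

    # Two made-up two-state word models over a 3-symbol acoustic alphabet {0, 1, 2};
    # in practice each would be trained from the L spoken tokens of its word.
    lam_no  = ([0.7, 0.3],
               [[0.8, 0.2], [0.3, 0.7]],
               [[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]])
    lam_yes = ([0.4, 0.6],
               [[0.5, 0.5], [0.2, 0.8]],
               [[0.2, 0.5, 0.3], [0.4, 0.4, 0.2]])

    obs = [0, 0, 2, 1]                        # observation sequence of an unknown word
    scores = {name: forward_likelihood(obs, *lam)
              for name, lam in [("no", lam_no), ("yes", lam_yes)]}
    print(scores)
    print("recognized as:", max(scores, key=scores.get))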

The Hidden Markov Model approach to speech recognition is very efficient. For isolated word recognition with the Viterbi algorithm, a vocabulary of V = 100 words with an N = 5 state model and 40 observations takes about \(10^5\) computations (additions and multiplications) for a single word recognition.
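The figure of about \(10^5\) operations is consistent with the usual cost estimate for the Viterbi recursion, roughly \(N^2\) additions and multiplications per observation per word model:

\[
N^{2} \, T \, V = 5^{2} \cdot 40 \cdot 100 = 10^{5}.
\]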

It is hard to determine what current (2017) speech recognition applications are based on. Explanations are clouded with buzzwords and hype, with no theory. Common buzzwords surrounding current (2017) speech recognition are “artificial intelligence”, “machine learning”, “neural networks”, and “deep learning”, but there does not seem to be a connection to Hidden Markov Models.

Sources

The variable factory example is adapted from Sheldon M. Ross, Introduction to Probability Models, Section 4.11, pages 256-262, Academic Press, 2006, 9th Edition.

The paleontological temperature model is adapted from “A Revealing Introduction to Hidden Markov Models”, by Mark Stamp.

The cheating casino and CpG Islands example is adapted from Biological Sequence Analysis, by R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Chapter 3, pages 46-79.

The urn and ball model is adapted from “An Introduction to Hidden Markov Models”, by L.R. Rabiner and B. H. Juang, 1986.

The language analysis example is adapted from “A Revealing Introduction to Hidden Markov Models”, by Mark Stamp.

The speech recognition example is adapted from “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition” by L. R. Rabiner.

_______________________________________________________________________________________________

Algorithms, Scripts, Simulations

Algorithm

Scripts

__________________________________________________________________________

Problems to Work for Understanding

__________________________________________________________________________

Reading Suggestion:

References

[1]   L. R. Rabiner and B. H. Juang. An Introduction to Hidden Markov Models. IEEE ASSP Magazine, pages 4–16, January 1986.

[2]   Lawrence R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2):257–286, February 1989.

[3]   Sheldon M. Ross. Introduction to Probability Models. Academic Press, 9th edition, 2006.

[4]   Mark Stamp. A Revealing Introduction to Hidden Markov Models. https://www.cs.sjsu.edu/~stamp/RUA/HMM.pdf, December 2015.

__________________________________________________________________________

Outside Readings and Links:

__________________________________________________________________________

I check all the information on each page for correctness and typographical errors. Nevertheless, some errors may occur and I would be grateful if you would alert me to such errors. I make every reasonable effort to present current and accurate information for public use, however I do not guarantee the accuracy or timeliness of information on this website. Your use of the information from this website is strictly voluntary and at your risk.

I have checked the links to external sites for usefulness. Links to external websites are provided as a convenience. I do not endorse, control, monitor, or guarantee the information contained in any external website. I don’t guarantee that the links are active at all times. Use the links here with the same caution as you would all information on the Internet. This website reflects the thoughts, interests and opinions of its author. They do not explicitly represent official positions or policies of my employer.

Information on this website is subject to change without notice.

Steve Dunbar’s Home Page, http://www.math.unl.edu/~sdunbar1

Email to Steve Dunbar, sdunbar1 at unl dot edu

Last modified: Processed from LaTeX source on April 12, 2017