Steven R. Dunbar
Department of Mathematics
203 Avery Hall
University of Nebraska-Lincoln
Lincoln, NE 68588-0130
Probability Theory and Stochastic Processes
Steven R. Dunbar
Examples of Hidden Markov Models
Student: contains scenes of mild algebra or calculus that may require guidance.
A production process in a factory is either in a good state (call it state 0) or in a poor state (state 1). If the process is in state 0 during some period then, independent of the past, with probability it will be in state 0 during the next period and with probability it will be in state 1. Once in state 1 it remains in that state forever. Suppose the factory produces a single item in each period and that each item produced when the process is in state 0 is of acceptable quality with probability , while each item produced when the process is in state 1 is of acceptable quality with probability .
If the status, either acceptable or unacceptable, of each successive item is observable, while the process states are unobservable, then we say the process is a hidden Markov chain model. The state of the factory is 0 or 1 and the signal is the quality of the item produced with value either or , depending on whether the item is acceptable or unacceptable. The transition probabilities of the underlying Markov chain are
The signal probabilities are
so the emission matrix is
Suppose that the probability of starting in state 0 is or .
Suppose a sequence of three observed articles are . Then given each possible state sequence, the probability of the corresponding observed sequence is in Table 1.
Notice that even without the entries some products can combine to make the calculations more eﬃcient.
We want to determine the average annual temperature at a particular place on earth over a sequence of years in the distant past. For simplicity, we consider that there were only annual average temperatures, “hot” and “cold”. Suppose that modern evidence indicates the probability of a hot year followed by another hot year is and the probability of a cold year followed by another cold year is , independent of the temperature in prior years. Assume that these probabilities held in the distant past as well. A probability transition matrix summarizing the information is
Also suppose that current research indicates a correlation between the size of tree growth rings and temperature. Again for simplicity, we consider only three diﬀerent tree ring sizes, small designated as , medium designated as , and large designated as , the observable signal of the average annual temperature. Suppose that based on available evidence, the probabilistic relationship between annual temperature and tree ring sizes is
For this system, the state is the annual average temperature, either or . The transition from one state to another is a Markov chain. However, these are hidden states, since we can’t directly observe the temperature in the past. Although we can’t observe the state or temperature in the past, we can observe the size of tree rings. From this evidence, we would like to determine the most likely temperature state in past years.
In a hypothetical dishonest casino, the casino uses a fair die most of the time, but occasionally the casino secretly switches to a loaded die, and later the casino switches back to the fair die. A probabilistic process determines the switching back-and-forth from loaded die to fair die and back again after each toss of the die, with the switch from fair-to-loaded occurring with probability and from loaded-to-fair with probability . In addition, assume that the loaded die will come up “six” with probability and the remaining ﬁve numbers with probability each. The transmission matrix is
and the emission probability matrix is
If you can see only the sequence of rolls (the sequence of observations or signals) you do not know which rolls used a loaded die and which used a fair die, because the casino hides the state. This is an example of a Hidden Markov Model.
A standard general mathematical example is an urn and ball model. Then are urns, each ﬁlled with colored balls with possible colors for the balls. Generate the observation sequence by initially choosing one of the urns, randomly according to an initial probability distribution, randomly selecting a ball, recording its color, replacing the ball, and then choosing a new urn according to a transition probability distribution associated with the current urn. Then at each time, the signal or observation is the color of the selected ball. The hidden states are the urns.
Consider the following coin tossing experiment: You are in a room with a curtain through which you cannot see what is happening. On the other side of the curtain a person is tossing a coin, or maybe one of several coins. The other person will not tell us exactly what is happening, only the result of each coin ﬂip. This is a sequence of hidden coin ﬂips. Thus we can only use the results of the coin tosses, say
with for heads and for tails. Take ﬁrst the case that the proportion of heads and tails are equal, statistically speaking, without any obvious patterns or organization to occurrences of heads and tails. How do we build a Hidden Markov Model to best explain the observed sequence of heads and tails?
Figure 2 shows the ﬁrst possible model. This simplest “one-fair-coin model”, has two states, each state is directly associated with heads or tails. The probability of being in the state generating a head would be and equally for being in the state generating a tail. This model is not truly hidden because each observation directly deﬁnes the state. This is a degenerate example of a hidden Markov model which is exactly the same as the classic stochastic process of repeated Bernoulli trials.
A second possible Hidden Markov Model for the observations is a “two-fair-coin model”, see Figure 3. Associate each state with a fair coin, so the probability of generating a head in each state is . In this special case, called the “two-fair-coins model”, the probabilities associated with remaining in or leaving each of the two states form a probability transition matrix whose entries are unimportant because the observable sequences from the two-fair coins model are statistically indistinguishable in each of the states. That means this two-fair-coin model is indistinguishable from the one-fair-coin model in a statistical sense and so this is another degenerate example of a Hidden Markov Model.
Other Hidden Markov Models which can account for a observed sequence of equal proportions of heads and tails are possible. Take the “two-compensating-biased-coins model” as a model of what happens behind the curtain. The model has two diﬀerent states corresponding to diﬀerent coins. In one state, the coin is biased toward heads, say with . In the other state, the coin is biased towards tails with . The state transition probabilities all equal . See Figure 4. This could accomplished by the person behind the curtain having two biased coins and a third fair coin with the biased coins associated to the faces of the fair coin respectively. The person behind the barrier ﬂips the fair coin to decide which biased coin to use and then ﬂips the chosen biased coin to generate the observed outcome. Note that with complementary biased coin probabilities indicated in Figure 4 the long term averages of heads or tails would be statistically indistinguishable from either the one-fair-coin model or the two-fair-coin model.
It is clear that other more complicated models with three or more coins could also be constructed. In this special case that the proportion of heads and tails are equal statistically speaking, without any obvious patterns or organization to occurrences of heads and tails, there would have to be some compelling physical reason to choose a multiple-hidden-state model over the simpler and equivalent one-fair-coin model.
Now in another direction imagine that the observed sequence is a very long sequence of heads, then followed by another long sequence of tails of some random length, interrupted by a head, followed again by yet another long random sequence of tails Then we might use a “two-biased-coins model”. with two biased coins with biased switching between the states of the coins as a possible model for the observed sequence. Of course, such a sequence of many heads followed by many tails could conceivably come from a fair coin. The choice between a one-fair-coin or two-biased-coin model would be a choice justiﬁed by the likelihoods of the observations under the models, or possibly by other external modeling considerations.
However other higher-order statistics of the two-biased-coins model such as the probability of runs of heads, should be distinguishable from the one-fair-coin or the two-fair coin model.
Another Hidden Markov Model could be the “three-biased-coins model”. In the ﬁrst state, the coin is slightly biased to heads, in the second state the coin is slightly biased toward tails, in the third state the coin is some other distribution, maybe fair, maybe biased in some direction. A Markov chain determines transition probabilities among the three states. See Figure 6. The sequence of observations depends on the biases and the transition probabilities. The simple statistics and higher-order statistics of the observations would be correspondingly inﬂuenced and would suggest the appropriateness of this choice of model.
Several important points emerge from the possibility of diﬀerent models for the observed outputs of the coin tossing experiment behind the curtain.
In the human genome the dinucleotide CG (frequently written CpG to distinguish it from the CG base-pair across two strands) is rarer than expected from the independent probabilities of C and G, for reasons of chemistry that transform the C into a T. For biologically important reasons, the chemical transformation is suppressed in short regions of the genome, such as around the promoters or start regions of many genes. In these regions, we see signiﬁcantly more CpG dinucleotides than elsewhere. Such regions are called CpG islands. The islands are typically a few hundred to a few thousand bases long.
Given a short stretch of genomic sequence, how would we decide if it comes from a CpG island or not? Second, given a long piece of sequence, how would we ﬁnd the CpG islands in it, if any exist? Here, the Model has two states, CpG islands, and non-islands. In each state, the probabilities of expressing CpG base-pairs are diﬀerent.
For a pair (or more) of proteins, an important question is: How are the proteins similar? The goals are to detect and measure overall similarity between protein amino acid sequences, ﬁnd proteins with similar functions in diﬀerent organisms by ﬁnding similar subsequences of amino acids called “conserved sequences” and to detect conserved sequences and evolution of conserved sequences. Alignment is the method for answering these questions.
There are two types of alignment: A global alignment is an alignment of the full length of two sequences. A local alignment is an alignment of part of one sequence to part of another sequence. For possibly distantly related sequences, it might be more sensible to make local alignments of subregions of high similarity, not the whole sequence. A sample toy alignment is in Figure 7
Alignment allows amino acid matches and mismatches along columns with corresponding scores based on chemistry and biology. In order to make alignments, we also allow introduction of gaps in either of the protein amino acid sequences. Introducing gaps when making alignments adds penalty scores.
A common task in bioinformatics is to obtain a cluster of related sequences, e.g. from a database, and then to align those sequences using multiple sequence alignment algorithms. The clustering reﬂects the insights of the biology community as to which proteins belong within the same family. The outcome of the clustering process is a set of distinct protein families. This is the ﬁrst step in most phylogenetic analyses. Heuristic algorithms are generally used to create multiple sequence alignments. There are large databases of proteins and alignments, some created with Hidden Markov Models, some provide Hidden Markov Model data, see below. An example of an actual multiple sequence alignment (MSA) is in Figure 8.
A proﬁle HMM (pHMM) is a particular Hidden Markov Model with states, signals, transition matrix, and emission matrix summarizing a multiple sequence alignment. The goal is use this Hidden Markov Model information about the MSA to align a new query sequence. Proﬁle HMMs have three states for each alignment position, i.e. each column in the multiple sequence alignment. Three outcomes are possible when aligning each residue of the query sequence with the MSA:
There are heuristic rules assigning MSA columns as match states, for example, the MSA has a match column if less than half of the characters are gaps. The length of a pHMM is the number of columns in the MSA assigned to match states. Each match state in the pHMM has its corresponding set of emission probabilities, generated from counting the frequencies of each amino acid in the corresponding column. Insertions, i.e. portions of the query sequence that do not match anything in the multiple alignment, correspond to an insert state. As in the case of the match states, each insert state has its own set of emission probabilities. The insert state emission probabilities are typically generated using the distribution of amino acids over the entire MSA. A delete state is possible for each of the positions in the MSA. The delete state is an example of a silent state in the model, as it does not emit any residues.
Let denote the number of match locations. Then the associated proﬁle HMM has states in the underlying Markov process, namely:
Thus, a pHMM typically has many more states than the previous examples of Hidden Markov Models with only a handful of states. The connections between the states is highly structured and more complicated than the other examples of Hidden Markov Models. The set of emissions, or fewer, is about the same order of magnitude as in the other examples.
A typical application of a proﬁle HMM is the following: Start with collection of protein families (clusters) , where all proteins within a family have the same length after assigning gaps as necessary. For each family , construct a corresponding proﬁle HMM, in the notation of the next section. The objective is to assign a newly sequenced protein to one of the families. Then compute the likelihood of the gap-aligned new protein for each of the proﬁle HMMs. The new protein is then assigned to the family for which the likelihood is maximum. This is the “scoring” problem of Hidden Markov Models.
Language translation is a classic application of Hidden Markov Models, originating with Cave and Neuwirth, see  for history and additional details. Suppose you do not understand English, but you do know something about punctuation. You obtain a large body of English text, such as the “Brown Corpus”. Henry Kučera and W. Nelson Francis at Brown University compiled The Brown University Standard Corpus of Present-Day American English as a general corpus (text collection) in the ﬁeld of corpus linguistics. The Brown Corpus contains 500 samples of English-language text, totaling roughly one million words, compiled from works published in the United States in 1961. The Brown Corpus is one of many such corpuses, available through the Natural Language Toolkit, see nltk.org. With knowledge of Hidden Markov Models, but no knowledge of English, you would like to determine some basic properties of this mysterious writing system. Can you partition the characters into sets so that characters in each set are “diﬀerent” in some statistically signiﬁcant way?
First remove all punctuation and numbers and convert all letters to lower case. This leaves 26 distinct letters and the space, for a total of 27 symbols. The observations are the series of characters found in the resulting text. Then test the hypothesis that English text has an underlying Markov chain with two states. For each of these two hidden states, assume that the 27 symbols appear according to a ﬁxed probability distribution. This sets up a Hidden Markov Model with and where the state transition probabilities and the observation probabilities from each state are unknown.
Results of a case study,  using the ﬁrst observations from the Brown Corpus of letters converted to lower case and the space are in Table 2. Without having any assumption about the nature of the two states, the results sort into two familiar categories! The probabilities tell us that the one hidden state contains the vowels while the other hidden state contains the consonants. Interestingly, space is more like a vowel and “y” is a consonant. The Hidden Markov Model “deduces” the statistically signiﬁcant diﬀerence between vowels and consonants without knowing anything about the English language.
Cave and Neuwirth obtain further results by allowing more than two hidden states. They are able to obtain and sensibly interpret the results for models with up to hidden states. This example has further applications to automatic language translation.
This example suggests Hidden Markov Models may be applicable to cryptanalysis. In fact, a Hidden Markov Model has been applied to “secret messages” such as Hamptonese, the Voynich Manuscript and the “Kryptos” sculpture at the CIA headquarters but without too much success, . Partly the reasons for success or failure depend on the quality of the transcriptions and partly on the assumptions that the cipher text is a plaintext in an unknown language, and not steganography, or even just babbling (glossolalia).
A classic example and practical application of Hidden Markov Models is speech recognition, especially isolated word recognition. Hidden Markov Models were developed in the 1960s and 1970s for satellite communication. Andrew Viterbi made a key contribution to the theory in 1967. They were later adapted for language analysis, translation and speech recognition in the 1970s and 1980s by Bell Labs and IBM . Interest in HMMs for speech recognition seems to have peaked in the late 1980s.
Speech recognition takes place in several steps:
Concentrating on the second step of unit matching, assume we have a vocabulary of words to recognize. We have a training set of tokens of each word. We also have an independent observation set. We use the observations from the set of tokens to estimate the optimum parameters for each word, creating model for the th vocabulary word, . For each unknown word in the observation sequence and for each word model we calculate . We choose the word whose model probability is highest
For example, we could train an Hidden Markov Model, say to recognize the spoken word “no” and train another Hidden Markov Model, say to recognize the spoken word “yes”. (This is the step we will later call the solution to Problem 3.) Then given an unknown spoken word, we can use the Hidden Markov Model to score this word against and also against to decide if the spoken word is more likely “no”, “yes” or neither. (This is the problem we will later call Problem 1.) Notice that this application does not uncover the hidden states (which we will later call Problem 2) but such information might provide additional insight into the underlying speech model.
The Hidden Markov Model for speech recognition is very eﬃcient. For isolated word recognition with the Viterbi Algorithm, a vocabulary of words with an state model, and observations, it takes about computations (additions/multiplications) for a single word recognition.
It is hard to determine what current (2017) speech recognition applications are based on. Explanations are clouded with buzzwords and hype, with no theory. Common buzzwords surrounding current (2017) speech recognition are “artiﬁcial intelligence”, “machine learning”, “neural networks”, and “deep learning”, but there does not seem to be a connection to Hidden Markov Models.
The variable factory example is adapted from Sheldon M. Ross, Introduction to Probability Models, Section 4.11, pages 256-262, Academic Press, 2006, 9th Edition.
The paleontological temperature model is adapted from “A Revealing Introduction to Hidden Markov Models”, by Mark Stamp.
The cheating casino and CpG Islands example is adapted from Biological Sequence Analysis, by R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Chapter 3, pages 46-79.
The urn and ball model is adapted from “An Introduction to Hidden Markov Models”, by L.R. Rabiner and B. H. Juang, 1986.
The language analysis example is adapted from “A Revealing Introduction to Hidden Markov Models”, by Mark Stamp.
The speech recognition example is adapted from “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition” by L. R. Rabiner.
 Lawrence R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(3):257–286, February 1989. hidden Markov model, speech recognition.
I check all the information on each page for correctness and typographical errors. Nevertheless, some errors may occur and I would be grateful if you would alert me to such errors. I make every reasonable eﬀort to present current and accurate information for public use, however I do not guarantee the accuracy or timeliness of information on this website. Your use of the information from this website is strictly voluntary and at your risk.
I have checked the links to external sites for usefulness. Links to external websites are provided as a convenience. I do not endorse, control, monitor, or guarantee the information contained in any external website. I don’t guarantee that the links are active at all times. Use the links here with the same caution as you would all information on the Internet. This website reﬂects the thoughts, interests and opinions of its author. They do not explicitly represent oﬃcial positions or policies of my employer.
Information on this website is subject to change without notice.
Steve Dunbar’s Home Page, http://www.math.unl.edu/~sdunbar1
Email to Steve Dunbar, sdunbar1 at unl dot edu
Last modiﬁed: Processed from LATEX source on April 12, 2017