[an error occurred while processing this directive] [an error occurred while processing this directive]

### Mathematical Ideas

#### CpG Islands

In the human genome the dinucleotide CG (frequently written CpG to distinguish it from the CG base-pair across two strands) is rarer than would be expected from the independent probabilities of C and G, for reasons of chemistry that transform the C into a T. For biologically important reasons, the chemical transformation is suppressed in short regions of the genome, such as around the promoters or start regions of many genes. In these regions, we see many more CpG dinucleotides than elsewhere. Such regions are called CpG islands. They are typically a few hundred to a few thousand bases long.

Given a short stretch of genomic sequence, how would we decide if it comes from a CpG island or not? Second, given a long piece of sequence, how would we find the CpG islands in it, if there are any?

#### The occasionally cheating casino

In a certain dishonest casino, they use a fair die most of the time, but occasionally they switch to a loaded die, and later they will switch back to the fair die. In fact, let us assume that the loaded die will come up “six” with probability 0.5 and the remaining five numbers with probability 0.1 each. The switching back-and-forth from loaded die to fair die and back again is determined probabilistically after each toss of the die, with the switch from fair-to-loaded occurring with probability 0.05 and from loaded-to-fair with probability 0.1. The stochastic process of the rolls of the dice may be modeled as a Markov process with 12 states, six from the fair die and six from the loaded die. However, there is something hidden about the states in the process: If you can see just a sequence of rolls (the sequence of observations) you do not know which rolls used a loaded die and which used a fair die, because that is kept hidden by the casino, that is the state sequence is (partially) hidden! This is an example of a hidden Markov model.

We can formalize the notation for a hidden Markov model in the following way. We need to distinguish the set of observed symbols from the sequence of states of the Markov chain. In a sense to be made precise in a moment, we only get partial information about the states from the visible symbols. As typical, call the state sequence of a Markov chain , with the i-th state of the process . Then,also as usual, let A be the matrix of (stationary) state transition probabilities:

We will have a beginning position, which may in turn have a probability distribution a0k of being in state k.

Now we decouple the symbols b from the states k by introducing a new set of parameters ek(b), so that a state can produce a symbol from a fixed (or stationary) distribution over all possible symbols. That is, we define

the probability that symbol b is seen when in state k. These are known as the emission probabilities. For the case of the dice used in the cheating casino, the emission probabilities are all 0 or 1, for example if the die is loaded and shows “six”, the state is 6L and the symbol displayed is 6, so e6L(6) = 1, while e6L(1) = 0.

It is now easy to write down the joint probability of an observed sequence x and a state sequence :

#### The Viterbi algorithm for identifying the most probable path

Although it is no longer possible to tell what state the system is in by looking at the corresponding symbol, it is often the underlying states that we are interested in. To find out what the observation sequence “really is” by means of considering the underlying sequence is called decoding. The most common algorithm for doing so is the Viterbi algorithm, a dynamic programming algorithm.

There may be many state sequences that could give rise to any particular sequence of symbols. For example, at the cheating casino, we might observe the sequence (1, 2, 3, 4) of four rolls. Each of (1F , 2F , 3F , 4F ), (1L, 2L, 3L, 4L) and (1F , 2L, 3F , 4L) could emit the observed sequence. However they do so with very different probabilities. The last is unlikely because it is the product of multiple small probabilities of switching back forth between the dice. The second has smaller probability that the first because the probabilities of moving from each of the observed states to the next is much smaller for the loaded die than it is for the fair die. Hence among these three, the first is the most likely.

This most probable path is found recursively. Suppose the probability vk(i) of the most probable path ending in state k at position i is known for all the states k. Then the probabilities can be calculated for observation xi+1 at position i + 1 as

For simplicity, we say all sequences have to start in state 0, the begin state, so the initial condition is v0(0) = 1. Likewise, we assume that all sequences end in the state 0.

Algorithm: Viterbi

Initialization:
(i = 0)
Recursion:
(i = 1...L)
Termination:
Traceback:
(i = L...1):

There are some implementational issues for the Viterbi algorithm. The most severe practical problem is that multiplying many probabilities always yields small numbers that will give underflow errors. For this reason, the Viterbi algorithm should always be done using logarithms so the products becomes sums and the numbers stay reasonable.

This section is adapted from: Biological Sequence Analysis, by R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Chapter 3, pages 46-79.