Nucleotide frequencies in the human genome
| nucleotide | a | c | g | t |
| frequency (%) | 29.5 | 20.4 | 20.5 | 29.6 |
| dinucleotide | aa | ac | ag | at | ca | cc | cg | ct | ga | gc | gg | gt | ta | tc | tg | tt |
| observed frequency | 0.0978 | 0.0503 | 0.0699 | 0.0773 | 0.0725 | 0.0521 | 0.0098 | 0.07 | 0.0593 | 0.0426 | 0.0521 | 0.0505 | 0.0657 | 0.0594 | 0.0727 | 0.098 |
| expected if positions were independent | 0.0872 | 0.0604 | 0.0604 | 0.0873 | 0.0604 | 0.0418 | 0.0418 | 0.0605 | 0.0604 | 0.0418 | 0.0418 | 0.0605 | 0.0873 | 0.0605 | 0.0605 | 0.0875 |
Explanation for the low cg frequency: CG frequently mutates into TG
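The comparison behind the dinucleotide table can be sketched as follows. It assumes the second row of dinucleotide values is the product of the single-nucleotide frequencies, which the numbers appear to match up to rounding. The function and variable names are illustrative, not from the original slides.

```python
# Single-nucleotide frequencies from the table (percent converted to fractions).
mono = {"a": 0.295, "c": 0.204, "g": 0.205, "t": 0.296}

# Observed dinucleotide frequencies from the table.
observed = {
    "aa": 0.0978, "ac": 0.0503, "ag": 0.0699, "at": 0.0773,
    "ca": 0.0725, "cc": 0.0521, "cg": 0.0098, "ct": 0.0700,
    "ga": 0.0593, "gc": 0.0426, "gg": 0.0521, "gt": 0.0505,
    "ta": 0.0657, "tc": 0.0594, "tg": 0.0727, "tt": 0.0980,
}


def expected_if_independent(mono):
    """Dinucleotide frequencies if adjacent positions were independent."""
    return {x + y: mono[x] * mono[y] for x in mono for y in mono}


expected = expected_if_independent(mono)

# Ratio of observed to expected; values well below 1 indicate depletion.
ratio = {d: observed[d] / expected[d] for d in observed}
most_depleted = min(ratio, key=ratio.get)
print(most_depleted, round(ratio[most_depleted], 2))
```

Under this independence assumption, cg stands out as the most depleted pair, at roughly a quarter of its expected frequency.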
Numerous heuristics
Investigate two models of sequence
What can help us?
Suppose we have two coins, differently biased
| coin used | X | X | X | Y | Y | Y | X | X |
| result | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
Probability problem 1: likelihood of result given hidden information
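A minimal sketch of problem 1 for the coin table, assuming the flips are independent once the coin is known. The biases in `p_one` are hypothetical because the slides do not give them.

```python
# Hypothetical biases: probability that each coin shows 1 (not given in the source).
p_one = {"X": 0.7, "Y": 0.2}

# The hidden information (which coin was used) and the results, from the table.
coins = ["X", "X", "X", "Y", "Y", "Y", "X", "X"]
results = [1, 1, 0, 0, 0, 0, 1, 0]


def likelihood(results, coins, p_one):
    """P(results | coins): product of independent per-flip probabilities."""
    prob = 1.0
    for x, s in zip(results, coins):
        prob *= p_one[s] if x == 1 else 1.0 - p_one[s]
    return prob


print(likelihood(results, coins, p_one))
```

With the coin sequence known, the likelihood is just a product with one factor per flip.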
What if we don't have the s's, but just have the x's?
A probabilistic model for the s's
Markov models generate sequences over an alphabet

Alternative description/notation
Transition matrix T
Emission matrix E
Starting vector pi
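To make the (T, E, pi) notation concrete, here is a sketch that samples a hidden state sequence and an observed sequence from a two-state HMM for the coin example. The numeric values are hypothetical because the original matrices are not available; the convention assumed here is that rows of T and E sum to 1.

```python
import random

# Hypothetical two-state HMM for the two-coin example (illustrative values only).
states = ["X", "Y"]
pi = [0.5, 0.5]                    # pi[i] = P(first state is i)
T = [[0.9, 0.1], [0.2, 0.8]]       # T[i][j] = P(next state j | current state i)
E = [[0.3, 0.7], [0.8, 0.2]]       # E[i][x] = P(emit x | state i), x in {0, 1}


def sample_hmm(n, pi, T, E, rng):
    """Draw n (state, emission) pairs; returns (state indices, emissions)."""
    s_seq, x_seq = [], []
    s = rng.choices(range(len(pi)), weights=pi)[0]
    for _ in range(n):
        x = rng.choices(range(len(E[s])), weights=E[s])[0]
        s_seq.append(s)
        x_seq.append(x)
        # Move to the next hidden state according to row s of T.
        s = rng.choices(range(len(T[s])), weights=T[s])[0]
    return s_seq, x_seq


rng = random.Random(0)  # fixed seed so the run is repeatable
s_seq, x_seq = sample_hmm(8, pi, T, E, rng)
print([states[s] for s in s_seq])
print(x_seq)
```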
Probability problem 2: likelihood of the result x1...xn when the s's are hidden
Formally reduces to a sum of problem 1 over all possible s strings, weighted by the Markov model
This sum is very big, but can be cast in terms of matrix multiplies
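The reduction to matrix multiplies can be sketched with the forward recursion, checked against the explicit sum over all state strings. The model values are hypothetical, and the convention T[i][j] = P(next state j | state i) is an assumption.

```python
from itertools import product

# Hypothetical two-state HMM (illustrative values only).
pi = [0.5, 0.5]
T = [[0.9, 0.1], [0.2, 0.8]]
E = [[0.3, 0.7], [0.8, 0.2]]
x = [1, 1, 0, 0, 0, 0, 1, 0]


def forward_probability(x, pi, T, E):
    """P(x1...xn) via repeated vector-matrix products, O(n * states^2)."""
    n_states = len(pi)
    f = [pi[s] * E[s][x[0]] for s in range(n_states)]
    for obs in x[1:]:
        # One step: multiply by T, then weight by the emission probabilities.
        f = [sum(f[r] * T[r][s] for r in range(n_states)) * E[s][obs]
             for s in range(n_states)]
    return sum(f)


def brute_force_probability(x, pi, T, E):
    """The same quantity as an explicit sum over all states^n state strings."""
    total = 0.0
    for s in product(range(len(pi)), repeat=len(x)):
        p = pi[s[0]] * E[s[0]][x[0]]
        for k in range(1, len(x)):
            p *= T[s[k - 1]][s[k]] * E[s[k]][x[k]]
        total += p
    return total


print(forward_probability(x, pi, T, E))
print(brute_force_probability(x, pi, T, E))
```

The brute-force sum has 2^8 = 256 terms here and grows exponentially with n, while the recursion grows linearly.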
Two traditionally important auxiliary quantities
Can now calculate P(sk=s|x1...xn)
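Assuming the two auxiliary quantities are the usual forward and backward tables, the posterior P(sk=s|x1...xn) can be sketched as their product divided by P(x1...xn). The model values are hypothetical, and no rescaling is done, so this version would underflow on long sequences.

```python
# Hypothetical two-state HMM (illustrative values only).
pi = [0.5, 0.5]
T = [[0.9, 0.1], [0.2, 0.8]]
E = [[0.3, 0.7], [0.8, 0.2]]
x = [1, 1, 0, 0, 0, 0, 1, 0]


def forward(x, pi, T, E):
    """f[k][s] = P(x1...xk, sk = s) (k is 0-based in the code)."""
    n = len(pi)
    f = [[pi[s] * E[s][x[0]] for s in range(n)]]
    for obs in x[1:]:
        prev = f[-1]
        f.append([sum(prev[r] * T[r][s] for r in range(n)) * E[s][obs]
                  for s in range(n)])
    return f


def backward(x, T, E):
    """b[k][s] = P(x(k+1)...xn | sk = s); the last row is all ones."""
    n = len(T)
    b = [[1.0] * n]
    for obs in reversed(x[1:]):
        nxt = b[0]
        b.insert(0, [sum(T[s][r] * E[r][obs] * nxt[r] for r in range(n))
                     for s in range(n)])
    return b


def posteriors(x, pi, T, E):
    """post[k][s] = P(sk = s | x1...xn)."""
    f, b = forward(x, pi, T, E), backward(x, T, E)
    px = sum(f[-1])
    return [[f[k][s] * b[k][s] / px for s in range(len(pi))]
            for k in range(len(x))]


post = posteriors(x, pi, T, E)
for k, row in enumerate(post):
    print(k + 1, [round(p, 3) for p in row])
```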
Can we use P(sk=s|x1...xn) to infer a sequence of states?
Maximum likelihood problem 1: find the most likely state sequence s1...sn given x1...xn
It is really a tropicalized version of P(x1...xn): the sums are replaced by maxima
This is called the Viterbi algorithm
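A sketch of the Viterbi recursion in log space, where the forward algorithm's (sum, product) becomes (max, plus). The model values are hypothetical, and the code assumes every probability is strictly positive so that the logarithms are defined.

```python
import math

# Hypothetical two-state HMM (illustrative values only, all entries positive).
pi = [0.5, 0.5]
T = [[0.9, 0.1], [0.2, 0.8]]
E = [[0.3, 0.7], [0.8, 0.2]]
x = [1, 1, 0, 0, 0, 0, 1, 0]


def path_log_prob(s, x, pi, T, E):
    """log P(s, x) for one specific state string s."""
    lp = math.log(pi[s[0]]) + math.log(E[s[0]][x[0]])
    for k in range(1, len(x)):
        lp += math.log(T[s[k - 1]][s[k]]) + math.log(E[s[k]][x[k]])
    return lp


def viterbi(x, pi, T, E):
    """Return (most likely state path, its log joint probability)."""
    n = len(pi)
    v = [math.log(pi[s]) + math.log(E[s][x[0]]) for s in range(n)]
    back = []  # back[k][s] = best predecessor of state s at step k + 1
    for obs in x[1:]:
        ptr, new_v = [], []
        for s in range(n):
            best = max(range(n), key=lambda r: v[r] + math.log(T[r][s]))
            ptr.append(best)
            new_v.append(v[best] + math.log(T[best][s]) + math.log(E[s][obs]))
        back.append(ptr)
        v = new_v
    # Trace the pointers back from the best final state.
    last = max(range(n), key=lambda s: v[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, v[last]


path, score = viterbi(x, pi, T, E)
print(path, score)
```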
We can score sequences, we can tag states, but where do HMMs come from?
Maximum likelihood problem 2: find the model parameters that maximize P(x1...xn)
Nonlinear optimization problem with constraints: M >= 0, E >= 0, M 1 = 1, E 1 = 1 (M is the transition matrix, E the emission matrix, 1 the all-ones vector, so every row sums to 1)
Very commonly used in machine learning
An outgrowth of general considerations of gradient search for constrained polynomials
â = average number of times a is used
Baum-Welch EM algorithm
Converges to a local optimum
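One Baum-Welch iteration can be sketched as: compute the expected number of times each transition and emission is used, then renormalize each row. The starting values are hypothetical, a single short sequence is used, and no rescaling is done, so this is a toy version rather than a production implementation.

```python
# Hypothetical starting parameters (illustrative values only).
pi = [0.5, 0.5]
T = [[0.9, 0.1], [0.2, 0.8]]
E = [[0.3, 0.7], [0.8, 0.2]]
x = [1, 1, 0, 0, 0, 0, 1, 0]


def forward(x, pi, T, E):
    n = len(pi)
    f = [[pi[s] * E[s][x[0]] for s in range(n)]]
    for obs in x[1:]:
        prev = f[-1]
        f.append([sum(prev[r] * T[r][s] for r in range(n)) * E[s][obs]
                  for s in range(n)])
    return f


def backward(x, T, E):
    n = len(T)
    b = [[1.0] * n]
    for obs in reversed(x[1:]):
        nxt = b[0]
        b.insert(0, [sum(T[s][r] * E[r][obs] * nxt[r] for r in range(n))
                     for s in range(n)])
    return b


def baum_welch_step(x, pi, T, E):
    """One EM update; returns (new_pi, new_T, new_E, P(x) under the old model)."""
    n, m, length = len(pi), len(E[0]), len(x)
    f, b = forward(x, pi, T, E), backward(x, T, E)
    px = sum(f[-1])
    # gamma[k][s] = P(sk = s | x)
    gamma = [[f[k][s] * b[k][s] / px for s in range(n)] for k in range(length)]
    # Expected number of times each transition r -> s is used.
    A = [[0.0] * n for _ in range(n)]
    for k in range(length - 1):
        for r in range(n):
            for s in range(n):
                A[r][s] += f[k][r] * T[r][s] * E[s][x[k + 1]] * b[k + 1][s] / px
    # Expected number of times state s emits each symbol.
    B = [[0.0] * m for _ in range(n)]
    for k in range(length):
        for s in range(n):
            B[s][x[k]] += gamma[k][s]
    # Renormalize the expected counts so that each row sums to 1.
    new_T = [[a / sum(row) for a in row] for row in A]
    new_E = [[c / sum(row) for c in row] for row in B]
    return gamma[0][:], new_T, new_E, px


params = (pi, T, E)
likelihoods = []
for _ in range(5):
    new_pi, new_T, new_E, px = baum_welch_step(x, *params)
    likelihoods.append(px)
    params = (new_pi, new_T, new_E)
print(likelihoods)
```

The printed likelihoods should not decrease from one iteration to the next, but different starting values can end at different local optima.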
leading HMM for gene discovery

The maximum likelihood path in this HMM is equivalent to global alignment with affine gap penalties
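The alignment side of this equivalence can be sketched as a three-table dynamic program, with one table per state (match, gap in the second sequence, gap in the first), where additive scores play the role of log probabilities along the best path. The scoring values are hypothetical, and a gap of length L is assumed to cost `gap_open + (L - 1) * gap_extend`.

```python
NEG_INF = float("-inf")


def affine_gap_score(a, b, match=1.0, mismatch=-1.0, gap_open=-3.0, gap_extend=-1.0):
    """Best global alignment score of a and b with affine gap penalties."""
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]; X: a[i-1] aligned to a gap; Y: gap aligned to b[j-1].
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Staying in a gap state extends the gap; entering it opens a new one.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


print(affine_gap_score("ACGT", "ACGT"))
print(affine_gap_score("ACGT", "AT"))
```

The recursion has the same shape as the Viterbi recursion, with the three tables acting as the three hidden states.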

Can be generalized to sequence profiles