Announcements:
A fundamental random model in population genetics is the Wright-Fisher model (the Moran model is its overlapping-generations counterpart)
We output n samples from a population of size N that has evolved for T generations
Current = [0 0 ... 0]
for t=1 to T
    for i=1 to N
        Next[i] = Mutate( Current[ Random(N) ] )
    Current = Next
return Current[1:n]
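The haploid pseudocode above can be sketched in Python as follows. This is only an illustration: the genome representation (a tuple of mutation ids) and the `mutate` helper with its per-copy mutation probability `mu` are assumptions, since the notes do not specify what Mutate does.

```python
import random

def mutate(genome, mu, rng):
    """Hypothetical mutation step: with probability mu, append a new mutation id."""
    if rng.random() < mu:
        return genome + (rng.getrandbits(32),)
    return genome

def simulate_haploid(N, T, n, mu=0.01, seed=0):
    """Evolve N haploid individuals for T generations and return the first n."""
    rng = random.Random(seed)
    current = [()] * N  # everyone starts with the same (empty) ancestral genome
    for _ in range(T):
        # each individual of the next generation copies a uniformly random parent
        current = [mutate(current[rng.randrange(N)], mu, rng) for _ in range(N)]
    return current[:n]
```

Calling `simulate_haploid(N=50, T=100, n=5)` returns five genomes; samples that share a prefix of mutation ids share an ancestor that carried those mutations.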
At each generation, cells pick parents and copy their content with mutations
This is a haploid model. It can be extended to a diploid model:
Current = [0 0 ... 0]
for t=1 to T
    for i=1 to N
        Next[i] = Mutate( SEX(Current[Random(N)],Current[Random(N)]) )
    Current = Next
return Current[1:n]
Give n samples that have evolved for an infinite time in a population of size N
A compact representation of the ancestral tree of the samples:

The coalescent is a stochastic process generating such trees
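As an illustration, here is a minimal sketch of the standard (Kingman) coalescent for n samples. It assumes time is measured in units of N generations, so that each pair of lineages merges at rate 1. The function name and the event format are choices made for this example, not taken from the notes.

```python
import random

def kingman_coalescent(n, seed=0):
    """Return merge events (time, child_a, child_b, parent) for n sampled lineages."""
    rng = random.Random(seed)
    lineages = list(range(n))  # leaves are labeled 0..n-1
    next_id = n                # internal nodes get the following ids
    t = 0.0
    events = []
    while len(lineages) > 1:
        k = len(lineages)
        rate = k * (k - 1) / 2          # k choose 2 pairs, each merging at rate 1
        t += rng.expovariate(rate)      # waiting time to the next merge
        a, b = rng.sample(lineages, 2)  # the merging pair is uniform over pairs
        lineages.remove(a)
        lineages.remove(b)
        lineages.append(next_id)
        events.append((t, a, b, next_id))
        next_id += 1
    return events
```

The list of events is a compact description of the ancestral tree: n samples always produce n-1 merges, and the last event is the most recent common ancestor of the whole sample.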
Tree interpretations


Emile Zuckerkandl and Linus Pauling
Molecular clock
Mitochondrial DNA (mtDNA) is used in many systematic studies
Neutral mutations are useful for closely related species

Basic trees
Nodes labeled by intermediate species
Edges labeled by differences (phenetic) or `time' (cladistic)
Basic input data
computing distances in the tree

Additive tree reconstruction
One insight
Two significant insights
How do we tell if d is tree-additive?
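One standard test is the four-point condition: for every four leaves i, j, k, l, the two largest of the three sums d(i,j)+d(k,l), d(i,k)+d(j,l), d(i,l)+d(j,k) must be equal. The sketch below assumes d is given as a symmetric matrix (list of lists) with zero diagonal and non-negative entries. The function name and the tolerance parameter are illustrative.

```python
from itertools import combinations

def is_tree_additive(d, tol=1e-9):
    """Check the four-point condition on a symmetric distance matrix d."""
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]])
        # the two largest pairings must agree (up to numerical tolerance)
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True
```

The check examines all quadruples, so it costs O(n^4) time. That is fine for small examples but not for large data sets.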

In practice, d(i,j) is not the tree distance, and must be regularized
Approximate additive tree reconstruction
Usually done with least squares, though there are obvious variants using other distance metrics
Unweighted Pair Group Method with Arithmetic Mean
UPGMA(d,n)
Hierarchical clustering with
d(C1,C2) = avg of d(x1,x2) for x1 in C1, x2 in C2
Internal node has height = (1/2) * avg of d(x1,x2), so that leaf-to-leaf path lengths match the averaged distance
Leaves have height = 0
edge(i,j) = height of i - height of j
Hierarchical clustering with post-computed edge lengths
UPGMA assumes a constant molecular clock
If the mutation rate increases along any branch, it gets screwy
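A minimal Python sketch of the UPGMA procedure described above. It assumes d is a symmetric matrix (list of lists) indexed like names, and it places each internal node at half the average inter-cluster distance, the usual convention that makes leaf-to-leaf path lengths equal the averaged distance. The nested-tuple tree representation is an illustrative choice.

```python
from itertools import combinations

def upgma(d, names):
    """Return (tree, root_height, edges); tree is nested tuples of leaf names."""
    # each cluster: (subtree, member leaf indices, height of its root)
    clusters = [(names[i], [i], 0.0) for i in range(len(names))]
    edges = []  # (parent, child, length)

    def avg_dist(c1, c2):
        return sum(d[x][y] for x in c1[1] for y in c2[1]) / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        # merge the pair of clusters with the smallest average distance
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: avg_dist(clusters[p[0]], clusters[p[1]]))
        ca, cb = clusters[a], clusters[b]
        height = avg_dist(ca, cb) / 2
        parent = (ca[0], cb[0])
        # edge length = height of parent - height of child
        edges.append((parent, ca[0], height - ca[2]))
        edges.append((parent, cb[0], height - cb[2]))
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append((parent, ca[1] + cb[1], height))
    tree, _, root_height = clusters[0]
    return tree, root_height, edges
```

For clarity this recomputes average distances from the original matrix at every step. A practical implementation would update a cluster-distance matrix incrementally instead.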


Neighbor joining (Saitou and Nei) - deceptively simple, yet works well
Key idea: attempt to reward both closeness and distinctness
NJ(d,n)
    until there are 2 clusters
        u(i) = (1/(n-2)) * SUM_k d(i,k)    (recomputed each round, n = current number of clusters)
        pick i,j neighbors to merge minimizing d(i,j) - u(i) - u(j)
        create new node x with edge(x,i) and edge(x,j)
        edge(x,i) = (d(i,j) + u(i) - u(j))/2
        edge(x,j) = (d(i,j) + u(j) - u(i))/2
        d(x,k) = (d(i,k)+d(j,k)-d(i,j))/2
        remove i,j rows/columns from d, add row/column for x
    join the last 2 clusters i,j with edge(i,j) = d(i,j)
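The NJ steps can be sketched in Python as below. This is an illustrative version, not a tuned implementation: it assumes d is a symmetric matrix with zero diagonal indexed like names, recomputes u(i) at every round with n equal to the current number of clusters, and stops when two clusters remain, joining them with a final edge. Internal nodes are represented as tuples of the two nodes they join.

```python
def neighbor_joining(d, names):
    """Return the edges (node, node, length) of the neighbor-joining tree."""
    nodes = list(names)
    # dict-of-dicts copy of d so rows/columns can be added and removed
    dist = {a: {b: d[i][j] for j, b in enumerate(names)}
            for i, a in enumerate(names)}
    edges = []
    while len(nodes) > 2:
        n = len(nodes)
        u = {i: sum(dist[i][k] for k in nodes) / (n - 2) for i in nodes}
        # pick the pair minimizing d(i,j) - u(i) - u(j)
        pairs = ((a, b) for idx, a in enumerate(nodes) for b in nodes[idx + 1:])
        i, j = min(pairs, key=lambda p: dist[p[0]][p[1]] - u[p[0]] - u[p[1]])
        x = (i, j)  # new internal node
        edges.append((x, i, (dist[i][j] + u[i] - u[j]) / 2))
        edges.append((x, j, (dist[i][j] + u[j] - u[i]) / 2))
        dist[x] = {x: 0.0}
        for k in nodes:
            if k not in (i, j):
                dist[x][k] = dist[k][x] = (dist[i][k] + dist[j][k] - dist[i][j]) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [x]
    a, b = nodes
    edges.append((a, b, dist[a][b]))
    return edges
```

On a matrix that is exactly tree-additive, the recovered edge lengths match the generating tree. For example, a four-leaf tree with leaf edges 1, 2, 3, 4 and an internal edge of 5 is recovered with those same lengths.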
Each atom, i, has some discrete data associated with it
Examples
Character-based tree reconstruction
Edge weight: d(i,j)=#(differences between i and j)
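For concreteness, the edge weight above is a Hamming distance when the character data are aligned sequences of equal length. That representation is an assumption of this small sketch, and the function name is illustrative.

```python
def char_distance(a, b):
    """Number of positions at which two equal-length character sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))
```

For example, `char_distance("GATTACA", "GACTATA")` counts two differing positions, so the edge weight is 2.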
NOTE: character trees are not generally additive
Advantages
Biological motivation: each difference is an evolutionary event
The same evolutionary event should rarely happen independently in two different ancestors
Most believable answer has fewest events (can be rephrased as a max likelihood problem)
Violations/confounding circumstances:
Two sorts of variables under consideration:
This sub-problem isn't hard
This problem is hard
Typical example

Variations: