15

Trees: Evolution

instructor: Ross A. Lippert

http://www-math.mit.edu/~lippert/18.417/

Announcements:
  • Finish chapter 10


Evolution

A fundamental random model in population genetics is the Wright-Fisher model

We output n samples from a population of N evolving for time T

  Current = [0 0 ... 0]
  for t=1 to T
    for i=1 to N
       Next[i] = Mutate( Current[ Random(N) ] )
    Current = Next
  return Current[1:n]

At each generation, cells pick parents and copy their content with mutations
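The generation loop above can be sketched in Python. This is a minimal illustration, not the lecture's implementation: it assumes each individual is represented as a tuple of the mutation IDs it carries, that `Mutate` adds one fresh mutation with probability `mu`, and the names `simulate`, `mu`, and `seed` are hypothetical.

```python
import random

def simulate(N, T, n, mu=0.01, seed=None):
    """Evolve N haploid individuals for T generations; return n of them.

    Each individual is a tuple of mutation IDs (an assumed representation).
    """
    rng = random.Random(seed)
    next_id = 0
    current = [()] * N                          # everyone starts identical
    for _ in range(T):
        nxt = []
        for _ in range(N):
            child = current[rng.randrange(N)]   # pick a parent uniformly
            if rng.random() < mu:               # assumed Mutate: add a new mutation
                child = child + (next_id,)
                next_id += 1
            nxt.append(child)
        current = nxt                           # generations do not overlap
    return current[:n]

sample = simulate(N=50, T=200, n=5, mu=0.05, seed=1)
```

Two sampled individuals that share a long prefix of mutation IDs have a recent common ancestor, which is the structure the tree methods later in the lecture try to recover.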

This is a haploid model; it can be extended to a diploid model

  Current = [0 0 ... 0]
  for t=1 to T
    for i=1 to N
       Next[i] = Mutate( SEX(Current[Random(N)],Current[Random(N)]) )
    Current = Next
  return Current[1:n]

Coalescent analysis

Give n samples which were evolving for an infinite time in a population of size N

A compact representation of the ancestral tree of the samples:

The coalescent is a stochastic process generating such trees
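As a hedged illustration, the standard (Kingman) coalescent for a haploid population can be simulated backward in time: while k lineages remain, the waiting time to the next merger is exponential with rate k(k-1)/2, in units of N generations, and the merging pair is chosen uniformly at random. The nested-tuple tree encoding and the names `coalescent_tree` and `leaves` are assumptions made for this sketch.

```python
import random

def coalescent_tree(n, seed=None):
    """Return (tree, height) for n samples under Kingman's coalescent.

    Leaves are integers 0..n-1; an internal node is (left, right, time),
    with time measured in units of N generations.
    """
    rng = random.Random(seed)
    lineages = list(range(n))
    t = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        t += rng.expovariate(k * (k - 1) / 2)    # wait for the next merger
        i, j = sorted(rng.sample(range(k), 2))   # uniform random pair, i < j
        right = lineages.pop(j)                  # pop the larger index first
        left = lineages.pop(i)
        lineages.append((left, right, t))
    return lineages[0], t

def leaves(tree):
    """List the sample labels under a node."""
    if isinstance(tree, int):
        return [tree]
    left, right, _ = tree
    return leaves(left) + leaves(right)

tree, height = coalescent_tree(6, seed=0)
```

Under this model the expected height of the tree is 2(1 - 1/n), so it stays bounded however many samples are taken.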


Population

Tree interpretations


Molecular clock

Emile Zuckerkandl and Linus Pauling

Molecular clock

mitochondrial DNA (mtDNA) is used in many systematic studies

neutral mutations useful for closely related species


The Panda story

Read about it over here

Human mtDNA tree


Trees in biology

Basic trees

Nodes labeled by intermediate species

Edges labeled by differences (phenetic) or `time' (cladistic)

Basic input data


Distance-based tree reconstruction

computing distances in the tree

Additive tree reconstruction


Additive tree reconstruction

One insight

* trees can be reconstructed by repeatedly coalescing sibling leaves

Two significant insights

* leads to recursive method for tree reconstruction

How do we tell if d is tree-additive?
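One standard test is the four-point condition: a metric d is tree-additive exactly when, for every four leaves i, j, k, l, the two largest of the three sums d(i,j)+d(k,l), d(i,k)+d(j,l), d(i,l)+d(j,k) are equal. The sketch below assumes d is already a metric (symmetric, zero diagonal, triangle inequality) given as a list of lists; the name `is_additive` and the tolerance `tol` are hypothetical.

```python
from itertools import combinations

def is_additive(d, tol=1e-9):
    """Check the four-point condition on every quadruple of leaves."""
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]])
        if abs(sums[2] - sums[1]) > tol:   # the two largest sums must tie
            return False
    return True

# Distances read off the tree ((A:1,B:2):3,(C:4,D:5)), with the internal edge of length 3
tree_d = [[0, 3, 8, 9],
          [3, 0, 9, 10],
          [8, 9, 0, 9],
          [9, 10, 9, 0]]

# Same matrix with d(B,D) perturbed from 10 to 11
noisy_d = [[0, 3, 8, 9],
           [3, 0, 9, 11],
           [8, 9, 0, 9],
           [9, 11, 9, 0]]
```

The check costs O(n^4) time, which is fine for a classroom example but not for large inputs.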


Approximate Additive tree reconstruction

In practice, d(i,j) is not the tree distance, and must be regularized

Approximate additive tree reconstruction

Usually done with least squares, though there are obvious variants that use different d-metrics

Unweighted Pair Group Method with Arithmetic Mean

  UPGMA(d,n)
    Hierarchical clustering with
      d(C1,C2) = avg of d(x1,x2) for x1 in C1, x2 in C2
    Internal node merging C1,C2 has height = d(C1,C2)/2
    Leaves have height = 0
    edge(i,j) = height of i - height of j, for parent i and child j
Hierarchical clustering with post-computed edge lengths
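A minimal Python sketch of UPGMA follows. It assumes the usual convention that a merged node's height is half the average distance between its two clusters, so that the path between two leaves through that node equals the cluster distance. The function name `upgma` and the nested-tuple output `(left, right, height)` are illustrative choices.

```python
def upgma(d, names):
    """Return the UPGMA tree as nested tuples (left, right, height).

    d is a symmetric distance matrix (list of lists); names labels its rows.
    """
    # cluster id -> (subtree, number of leaves)
    clusters = {i: (names[i], 1) for i in range(len(names))}
    dist = {(i, j): d[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        (a, b), dab = min(dist.items(), key=lambda kv: kv[1])  # closest pair
        ta, sa = clusters.pop(a)
        tb, sb = clusters.pop(b)
        new = next_id
        next_id += 1
        for c in clusters:
            dac = dist.pop((min(a, c), max(a, c)))
            dbc = dist.pop((min(b, c), max(b, c)))
            # size-weighted mean equals the average over all leaf pairs
            dist[(c, new)] = (sa * dac + sb * dbc) / (sa + sb)
        del dist[(a, b)]
        clusters[new] = ((ta, tb, dab / 2), sa + sb)
    tree, _ = next(iter(clusters.values()))
    return tree

example = upgma([[0, 2, 6],
                 [2, 0, 6],
                 [6, 6, 0]], ["A", "B", "C"])
# example == ("C", ("A", "B", 1.0), 3.0)
```

Edge lengths are then read off afterward as the height of a parent minus the height of its child, with leaves at height 0.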


UPGMA problem case

UPGMA assumes a constant molecular clock

If the mutation rate increases along any branch, UPGMA can reconstruct the wrong topology

      A    B    C    D    E
  A   0    -    -    -    -
  B  20    0    -    -    -
  C  80   80    0    -    -
  D  60   60  100    0    -
  E  80   80  120   80    0

Neighbor Joining

Saitou and Nei - deceptively simple, yet works well

Key idea: attempt to reward both closeness and distinctness

  NJ(d,n)
    while more than 2 clusters remain
      u(i) = (1/(n-2)) * SUM_k d(i,k)     (recomputed for the current n clusters)
      pick i,j neighbors to merge minimizing d(i,j) - u(i) - u(j)
      create new node x with edge(x,i) and edge(x,j)
      edge(x,i) = (d(i,j) + u(i) - u(j))/2
      edge(x,j) = (d(i,j) + u(j) - u(i))/2
      d(x,k) = (d(i,k)+d(j,k)-d(i,j))/2   for every remaining k
      remove i,j rows/columns from d, add x, n = n-1
    join the last 2 clusters i,j with edge(i,j) = d(i,j)
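The procedure can be sketched in Python as below. This is an illustrative version under stated assumptions: u is recomputed for the current set of clusters on every pass, the last two clusters are joined directly, ties are broken by input order, and new internal nodes get hypothetical names `x0`, `x1`, ... (assumed not to clash with the input names). The example matrix is my reading of the five-taxon UPGMA problem case in this lecture.

```python
def neighbor_joining(d, names):
    """Return the NJ tree as a list of (node, node, length) edges."""
    nodes = list(names)
    # dict-of-dicts copy so rows and columns can be added and dropped
    D = {a: {b: d[i][j] for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    edges = []
    fresh = 0
    while len(nodes) > 2:
        n = len(nodes)
        u = {a: sum(D[a][k] for k in nodes) / (n - 2) for a in nodes}
        # pick the pair minimizing d(i,j) - u(i) - u(j)
        pairs = ((a, b) for x, a in enumerate(nodes) for b in nodes[x + 1:])
        i, j = min(pairs, key=lambda p: D[p[0]][p[1]] - u[p[0]] - u[p[1]])
        x = f"x{fresh}"
        fresh += 1
        edges.append((x, i, (D[i][j] + u[i] - u[j]) / 2))
        edges.append((x, j, (D[i][j] + u[j] - u[i]) / 2))
        D[x] = {x: 0.0}
        for k in nodes:
            if k not in (i, j):
                D[x][k] = D[k][x] = (D[i][k] + D[j][k] - D[i][j]) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [x]
    a, b = nodes
    edges.append((a, b, D[a][b]))
    return edges

names = ["A", "B", "C", "D", "E"]
d = [[0, 20, 80, 60, 80],
     [20, 0, 80, 60, 80],
     [80, 80, 0, 100, 120],
     [60, 60, 100, 0, 80],
     [80, 80, 120, 80, 0]]
edges = neighbor_joining(d, names)
leaf_len = {b: w for a, b, w in edges if b in names}
```

On this matrix the sketch joins A with B and D with E, and places C on a long branch of length 60. UPGMA, by contrast, would merge D with the A,B cluster, because 60 is the smallest remaining average distance.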

Character based methods

Each atom, i, has some discrete data associated with it

Examples

Character-based tree reconstruction

Edge weight: d(i,j)=#(differences between i and j)
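For concreteness, this edge weight is the Hamming distance between two character vectors. A small sketch, where the name `hamming` is an illustrative choice and the vectors are assumed to have equal length:

```python
def hamming(a, b):
    """Count positions at which two equal-length character vectors differ."""
    if len(a) != len(b):
        raise ValueError("character vectors must have equal length")
    return sum(x != y for x, y in zip(a, b))

example = hamming("ACGT", "ACCA")
```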

NOTE: character trees are not generally additive

Advantages


Parsimony problem

Biological motivation: each difference is an evolutionary event

The same evolutionary event should rarely happen independently in two different ancestors

Most believable answer has fewest events (can be rephrased as a max likelihood problem)

Violations/confounding circumstances:

Two sorts of variables under consideration:


Small parsimony problem

This sub-problem isn't hard

Can be solved with a dynamic program (Sankoff)
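The dynamic program can be sketched as follows for a single character on a rooted binary tree with a fixed topology. For each node and each possible state it keeps the minimum cost of the subtree below, given that the node carries that state. The nested-pair tree encoding, the name `sankoff`, and the default unit cost per change are assumptions of this sketch.

```python
def sankoff(tree, alphabet, cost=lambda a, b: 0 if a == b else 1):
    """Minimum parsimony score for one character on a rooted binary tree.

    tree is either a leaf state (an element of alphabet) or a pair
    (left_subtree, right_subtree).
    """
    INF = float("inf")

    def scores(node):
        # scores(node)[c] = min cost of the subtree if node is labeled c
        if not isinstance(node, tuple):
            return {c: (0 if c == node else INF) for c in alphabet}
        left, right = (scores(child) for child in node)
        return {
            c: min(left[a] + cost(c, a) for a in alphabet)
               + min(right[b] + cost(c, b) for b in alphabet)
            for c in alphabet
        }

    return min(scores(tree).values())

score = sankoff((("A", "C"), ("A", ("C", "C"))), "ACGT")
```

With several characters, the scores of the individual characters are summed. The work per character is proportional to the number of nodes times the square of the alphabet size.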


Large parsimony problem

This problem is hard

In practice, people use local search heuristics

Typical example

  1. Form initial guess by NJ
  2. Score the topology with small parsimony
  3. For each internal edge, evaluate local change:
  4. Take the best one and go to 2

Variations: