13

Filtration and BLAST

instructor: Ross A. Lippert

http://www-math.mit.edu/~lippert/18.417/

For class:
  • Reading : finish ch. 9
  • I added a recent paper by Miranda to the suffix array section


Inexact matching

Take a baby-step into inexact matching

Slight variation


Pigeon Hole principle

Good matches usually contain exact matches
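
The pigeonhole idea can be sketched in code. This is a minimal illustration, assuming the inexact match allows at most k mismatches and no gaps: split the pattern into k+1 pieces, so at least one piece must occur exactly, and verify only the positions those exact hits suggest. The function names are illustrative, not from the lecture.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def pigeonhole_candidates(pattern, text, k):
    """Start positions where pattern could occur in text with <= k mismatches."""
    m = len(pattern)
    if m <= k:
        raise ValueError("pattern must be longer than k")
    pieces = k + 1
    # Cut the pattern into k+1 non-empty pieces of nearly equal length.
    bounds = [i * m // pieces for i in range(pieces + 1)]
    candidates = set()
    for a, b in zip(bounds, bounds[1:]):
        piece = pattern[a:b]
        start = text.find(piece)
        while start != -1:
            cand = start - a  # where the whole pattern would begin
            if 0 <= cand <= len(text) - m:
                candidates.add(cand)
            start = text.find(piece, start + 1)
    return candidates

def k_mismatch_search(pattern, text, k):
    """Filter with exact piece matches, then verify each candidate."""
    m = len(pattern)
    return sorted(p for p in pigeonhole_candidates(pattern, text, k)
                  if hamming(pattern, text[p:p + m]) <= k)

print(k_mismatch_search("gattaca", "ccgatgacacc", 1))  # [2]
```

The filter never loses a true match: k mismatches can spoil at most k of the k+1 pieces.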


Ungapped extension
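
A minimal sketch of ungapped extension, assuming +1/-1 match/mismatch scoring and an X-drop stopping rule. The parameter names and the drop threshold are illustrative choices, not taken from the lecture. The test strings are the Q/T pair from the gapped-seeds part of these notes.

```python
def extend_right(q, t, qi, ti, match=1, mismatch=-1, x_drop=3):
    """Extend rightwards from (qi, ti) without gaps.

    Returns (best_score, best_length) of the best-scoring extension,
    stopping once the score falls more than x_drop below the best seen.
    """
    score = best = best_len = n = 0
    while qi + n < len(q) and ti + n < len(t):
        score += match if q[qi + n] == t[ti + n] else mismatch
        n += 1
        if score > best:
            best, best_len = score, n
        elif best - score > x_drop:
            break
    return best, best_len

def ungapped_extend(q, t, qi, ti, k, **kw):
    """Extend an exact seed q[qi:qi+k] == t[ti:ti+k] in both directions.

    Returns (score, q_start, t_start, length) of the extended segment pair.
    """
    right_score, right_len = extend_right(q, t, qi + k, ti + k, **kw)
    # Leftward extension is a rightward extension on the reversed prefixes.
    left_score, left_len = extend_right(q[:qi][::-1], t[:ti][::-1], 0, 0, **kw)
    seed_score = k * kw.get("match", 1)
    return (seed_score + left_score + right_score,
            qi - left_len, ti - left_len, left_len + k + right_len)

Q = "aatcttgcgagaccaatggcactt"
T = "cttcctgcgggacctatcccacaa"
# Seed "tgcg" occurs at 0-based offset 5 in both strings.
print(ungapped_extend(Q, T, 5, 5, 4))  # (10, 2, 2, 20)
```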


Gapped extension


BLAST

Basic Local Alignment Search Tool

successor to FASTA and FASTP

big claim to fame: use of Altschul-Karlin-Dembo statistics to quantify the significance of matches

biggest claim to fame: very fast (PatternHunter might be better today)


MSP statistics

An approximation to local alignment statistics

MSP result from Altschul-Karlin

E[Number of distinct MSPs with score >= S] = K N M exp{-zS}

z is a normalizing factor: the positive solution of SUM_{xy} p_x p_y exp{z s(x,y)} = 1
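
A small numerical sketch of these two formulas, using the convention in which z > 0 is the positive root of SUM_{xy} p_x p_y exp{z s(x,y)} = 1, so that exp{-zS} decays as S grows. The root is found by bisection. The constant K is not computed here; the value passed in below is a made-up placeholder, as are the function names and the sequence lengths.

```python
import math

def solve_z(p, score, lo=1e-6, hi=50.0, tol=1e-12):
    """Positive root z of sum_{x,y} p[x] p[y] exp(z * score(x, y)) = 1.

    Assumes the expected score per pair is negative and some score is
    positive, so that a unique positive root exists.
    """
    def f(z):
        return sum(p[x] * p[y] * math.exp(z * score(x, y))
                   for x in p for y in p) - 1.0
    if f(lo) >= 0 or f(hi) <= 0:
        raise ValueError("no positive root bracketed; check the scoring")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def expected_msps(K, N, M, z, S):
    """E[number of distinct MSPs with score >= S] = K N M exp(-z S)."""
    return K * N * M * math.exp(-z * S)

# Uniform DNA background, +1 for a match and -1 for a mismatch (assumed).
p = {c: 0.25 for c in "acgt"}
z = solve_z(p, lambda x, y: 1 if x == y else -1)
print(z)  # equals ln 3 for this scoring
print(expected_msps(K=0.1, N=100, M=1000, z=z, S=10))
```

For this scoring the equation reduces to 0.25 e^z + 0.75 e^{-z} = 1, whose positive root is z = ln 3.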

Chen-Stein result

Number of MSPs with score >= S is approximately Poisson distributed, with mean equal to the expectation above
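
Under the Poisson approximation, the expected count converts directly into a probability. A minimal sketch, where `expected` stands for the expected number of MSPs with score >= S and the function names are illustrative:

```python
import math

def prob_at_least_one(expected):
    """P(at least one MSP with score >= S) = 1 - exp(-expected)."""
    # expm1 keeps precision when `expected` is very small.
    return -math.expm1(-expected)

def poisson_pmf(k, expected):
    """P(exactly k MSPs with score >= S) under the Poisson approximation."""
    return math.exp(-expected) * expected ** k / math.factorial(k)

print(prob_at_least_one(0.01))
print(poisson_pmf(0, 0.01))
```

When the expected count is small, the probability of at least one such MSP is nearly equal to the expected count itself.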


The many faces of BLAST


Gapped seeds - the wave of the future

Bin Ma has an on-line chapter in this book

Forget about extensions and gaps: just find seeds to `hit' every MSP

Example: MSP of length 20 with greater than 70% identity

Q aatcttgcgagaccaatggcactt
T cttcctgcgggacctatcccacaa
= 001101111011110110011100

Fundamental problem: seeds too short=noise, seeds too long=lack of signal

look for all 4-mers --> look for 1111

look for a gapped seed --> look for 11011
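
A short sketch of what "look for 1111" and "look for 11011" mean on the example above. The match string has a 1 wherever Q and T agree. In a seed, 1 marks a position that must match and 0 a position that is ignored. The function names are illustrative.

```python
def match_string(q, t):
    """Binary string with '1' where q and t agree and '0' where they differ."""
    return "".join("1" if a == b else "0" for a, b in zip(q, t))

def seed_hits(seed, bits):
    """0-based offsets at which every '1' of the seed lines up with a '1' in bits."""
    care = [i for i, c in enumerate(seed) if c == "1"]
    return [off for off in range(len(bits) - len(seed) + 1)
            if all(bits[off + i] == "1" for i in care)]

Q = "aatcttgcgagaccaatggcactt"
T = "cttcctgcgggacctatcccacaa"
bits = match_string(Q, T)
print(bits)                      # 001101111011110110011100
print(seed_hits("1111", bits))   # [5, 10]
print(seed_hits("11011", bits))  # [2, 7, 12]
```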

The problem of detecting an MSP of length X with p percent identity is one of `hitting' binary strings with other binary strings

Consecutive (ungapped) seeds don't hit as many MSPs as gapped seeds
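
One way to check this kind of claim on small cases is brute force: enumerate every binary string of a given length with at least a given number of 1s and count how many each seed hits. This sketch assumes all such strings count equally, which is a simplification rather than the lecture's model. The function names are illustrative.

```python
from itertools import combinations

def is_hit(seed, zeros, length):
    """True if the seed hits the string of `length` whose 0s sit at `zeros`."""
    care = [i for i, c in enumerate(seed) if c == "1"]
    return any(all(off + i not in zeros for i in care)
               for off in range(length - len(seed) + 1))

def hit_count(seed, length, min_ones):
    """Return (hit, total) over all binary strings with >= min_ones ones."""
    hit = total = 0
    for n_zeros in range(length - min_ones + 1):
        # Each string is described by the set of positions holding a 0.
        for zeros in combinations(range(length), n_zeros):
            total += 1
            if is_hit(seed, set(zeros), length):
                hit += 1
    return hit, total

# Length 20 with more than 70% identity means at least 15 matches.
for seed in ("1111", "11011"):
    hit, total = hit_count(seed, 20, 15)
    print(seed, hit, total, hit / total)
```

Both seeds here have four required positions, so the comparison isolates the effect of the spacing.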