13
Filtration and BLAST
instructor: Ross A. Lippert
http://www-math.mit.edu/~lippert/18.417/
For class:
- Reading: finish ch. 9
- I added a recent paper by Miranda to the suffix array section
Inexact matching
Take a baby-step into inexact matching
m-mismatch k-word problem
- Input: P query string, T text string
- Output: All (i,j) where d(P(i:i+k),T(j:j+k)) <= m
Slight variation
S-scoring k-word problem
- Input: P query string, T text string, scoring matrix
- Output: All (i,j) where score(P(i:i+k),T(j:j+k)) >= S
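As a reference point, both problems can be solved by brute force over all word pairs. The sketch below assumes 0-based indices, words of exactly k letters, d = number of mismatching positions, and a scoring function s(x, y) in place of a matrix. The names `m_mismatch_pairs`, `s_scoring_pairs`, and `unit_score` are illustrative, not from the lecture.

```python
def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def word_score(a, b, s):
    """Sum of per-letter scores s(x, y) over two equal-length words."""
    return sum(s(x, y) for x, y in zip(a, b))


def m_mismatch_pairs(P, T, k, m):
    """All (i, j) with at most m mismatches between P[i:i+k] and T[j:j+k]."""
    return [(i, j)
            for i in range(len(P) - k + 1)
            for j in range(len(T) - k + 1)
            if hamming(P[i:i + k], T[j:j + k]) <= m]


def s_scoring_pairs(P, T, k, s, S):
    """All (i, j) with score(P[i:i+k], T[j:j+k]) >= S."""
    return [(i, j)
            for i in range(len(P) - k + 1)
            for j in range(len(T) - k + 1)
            if word_score(P[i:i + k], T[j:j + k], s) >= S]


def unit_score(x, y):
    """Assumed toy scoring: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1
```

This costs O(|P| |T| k) time, which is the cost that filtration tries to avoid.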
Pigeon Hole principle

Good matches usually contain exact matches
m-mismatch problem
- length s=floor(k/(m+1)) exact match must exist
- locate all s-words of P in T via exact matching
- extend left/right to find a k-word
- O(k^2) to do this right
- many methods don't bother to do this right
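The pigeonhole filter for the m-mismatch problem can be sketched as follows. This is a simplified version, not the O(k^2) extension mentioned above: it cuts each k-word of P into m+1 disjoint blocks of length s, looks each block up in a hash index of the s-words of T (standing in for any exact-matching method), and verifies every candidate by counting mismatches directly. The function name and the index structure are assumptions made for illustration.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def filtered_mismatch_pairs(P, T, k, m):
    """All (i, j) with at most m mismatches between P[i:i+k] and T[j:j+k]."""
    s = k // (m + 1)
    if s == 0:
        raise ValueError("pigeonhole filter needs k >= m + 1")

    # Exact-match index: every s-word of T mapped to its start positions.
    index = defaultdict(list)
    for q in range(len(T) - s + 1):
        index[T[q:q + s]].append(q)

    hits = set()
    for i in range(len(P) - k + 1):
        # With at most m mismatches, one of these m+1 disjoint blocks
        # must match exactly.
        for b in range(m + 1):
            off = b * s
            for q in index.get(P[i + off:i + off + s], ()):
                j = q - off  # start of the k-word in T implied by this seed
                if 0 <= j <= len(T) - k and hamming(P[i:i + k], T[j:j + k]) <= m:
                    hits.add((i, j))
    return sorted(hits)
```

Only positions sharing an exact s-word are ever verified, so the work depends on the number of seed hits rather than on |P| |T|.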
S-scoring problem
- every S-scoring k-word match contains some s-word pair with score at least T (a threshold, distinct from the text T)
- form the T-scoring neighborhood of the s-words of P
- locate all neighborhood words in T via exact matching
- extend left/right to find a k-word, or on to a local MSP
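Forming the neighborhood of a single s-word can be sketched as a branch-and-bound enumeration: grow candidate words letter by letter and prune any prefix that can no longer reach the threshold. The function name `neighborhood`, the `score` callback, and the toy +1/-1 scoring used in the usage note are assumptions; a real protein search would use a substitution matrix.

```python
def neighborhood(word, alphabet, score, threshold):
    """All words v of len(word) with sum(score(word[p], v[p])) >= threshold."""
    n = len(word)
    # best_rest[p] = best achievable score on positions p .. n-1
    best_rest = [0] * (n + 1)
    for p in range(n - 1, -1, -1):
        best_rest[p] = best_rest[p + 1] + max(score(word[p], c) for c in alphabet)

    found = []

    def grow(prefix, total):
        p = len(prefix)
        if p == n:
            if total >= threshold:
                found.append(prefix)
            return
        for c in alphabet:
            t = total + score(word[p], c)
            # Prune: even a perfect remainder cannot reach the threshold.
            if t + best_rest[p + 1] >= threshold:
                grow(prefix + c, t)

    grow("", 0)
    return found
```

For example, with `score = lambda x, y: 1 if x == y else -1`, `neighborhood("acg", "acgt", score, 1)` contains "acg" itself plus the nine words that differ from it in exactly one position.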

Ungapped extension
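A minimal sketch of ungapped extension from a seed, assuming an X-drop stopping rule (stop once the running score falls more than `xdrop` below the best score seen so far) and a per-letter scoring function. The function name, its parameters, and the return convention are illustrative assumptions.

```python
def extend_ungapped(P, T, i, j, s, score, xdrop):
    """Extend the seed P[i:i+s] / T[j:j+s] along its diagonal, without gaps.

    Returns (start_in_P, start_in_T, length, total_score) of the best
    segment pair found that contains the seed.
    """
    seed_score = sum(score(P[i + d], T[j + d]) for d in range(s))

    # Extend to the right of the seed.
    best_r = run = 0
    len_r = d = 0
    while i + s + d < len(P) and j + s + d < len(T):
        run += score(P[i + s + d], T[j + s + d])
        d += 1
        if run > best_r:
            best_r, len_r = run, d
        elif best_r - run > xdrop:
            break

    # Extend to the left of the seed.
    best_l = run = 0
    len_l = d = 0
    while i - d - 1 >= 0 and j - d - 1 >= 0:
        run += score(P[i - d - 1], T[j - d - 1])
        d += 1
        if run > best_l:
            best_l, len_l = run, d
        elif best_l - run > xdrop:
            break

    return (i - len_l, j - len_l, len_l + s + len_r,
            seed_score + best_l + best_r)
```

A larger `xdrop` lets the extension cross longer runs of mismatches at the cost of more work per seed.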

Gapped extension

BLAST
Basic Local Alignment Search Tool
successor to FASTA and FASTP
- sacrifices sensitivity: doesn't explore all possible alignments
- seed size: 9-11 bps or 3-6 AAs
- originally: returned maximally scoring pairs, now returns alignments
big claim to fame: use of Altschul-Karlin-Dembo statistics for
- judging the significance of a match
- setting (vaguely) justified thresholds
biggest claim to fame: very fast (PatternHunter might be better today)
- uses an AC-style (Aho-Corasick) FSM to find seed matches
- uses another exact matching technique to generate word neighborhoods
MSP statistics
An approximation to local alignment statistics
- two Bernoulli sequences of lengths N and M, with letter probabilities p_i
- per-letter scoring matrix s(x,y)
MSP result from Altschul-Karlin
E[Number of distinct MSPs with score >= S] = K N M exp{-zS}
z is a normalizing factor: the positive solution of SUM_{xy} p_x p_y exp{z s(x,y)} = 1
Chen-Stein result
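Number of MSPs with score >= S is approximately Poisson distributed, with mean equal to the expectation above
- so Prob[at least one such MSP] ~ 1 - exp{-K N M exp{-zS}}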
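The quantities above can be computed numerically. The sketch below takes z as the positive root of SUM_{xy} p_x p_y exp{z s(x,y)} = 1 (the usual Karlin-Altschul convention) and finds it by bisection, which requires a negative expected score and at least one positive score. K is assumed to be supplied by the caller, since computing it is more involved. All function names are illustrative.

```python
import math


def solve_z(p, s):
    """Positive root z of sum_{x,y} p[x] p[y] exp(z * s(x, y)) = 1."""
    pairs = [(p[x] * p[y], s(x, y)) for x in p for y in p]
    if sum(w * v for w, v in pairs) >= 0 or max(v for _, v in pairs) <= 0:
        raise ValueError("need negative expected score and a positive score")

    def f(z):
        return sum(w * math.exp(z * v) for w, v in pairs) - 1.0

    # f is convex with f(0) = 0 and f'(0) < 0, so it is negative on
    # (0, z*) and positive beyond z*. Bracket the root, then bisect.
    hi = 1.0
    while f(hi) <= 0:
        hi *= 2.0
    lo = hi
    while f(lo) > 0:
        lo /= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def expected_msps(K, N, M, z, S):
    """E[number of distinct MSPs with score >= S] = K N M exp(-z S)."""
    return K * N * M * math.exp(-z * S)


def prob_at_least_one(E):
    """Poisson approximation: 1 - exp(-E), computed stably for small E."""
    return -math.expm1(-E)
```

For uniform letter probabilities over four nucleotides with +1/-1 scoring, the equation reduces to u/4 + 3/(4u) = 1 with u = exp{z}, so z = ln 3.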
The many faces of BLAST
- BLASTn: nucleotide to nucleotide database
- BLASTp: protein to protein db
- BLASTx: translated nucleotide to protein db
- tBLASTn: protein to translated nucleotide db
- tBLASTx: translated nucleotide to translated nucleotide db
Gapped seeds - the wave of the future
Bin Ma has an on-line chapter in this book
Forget about extensions and gaps: just find seeds to `hit' every MSP
Example: MSP of length 20 with greater than 70% identity
| Q | a | a | t | c | t | t | g | c | g | a | g | a | c | c | a | a | t | g | g | c | a | c | t | t |
| T | c | t | t | c | c | t | g | c | g | g | g | a | c | c | t | a | t | c | c | c | a | c | a | a |
| = | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
Fundamental problem: seeds too short=noise, seeds too long=lack of signal
look for all 4-mers --> look for 1111
look for a gapped seed --> look for 11011
The problem of detecting an MSP of length X with p percent id is one of `hitting' binary strings with other binary strings
Consecutive (ungapped) seeds don't hit as many MSPs as well-chosen gapped seeds
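The `hitting' view can be made concrete with a short sketch: a seed is a string of 1s (positions that must match) and 0s (don't-care positions), and it hits a binary match string if it fits at some offset. The helper `count_hits` enumerates every match string of a given length with at least a given number of 1s, which is feasible only for short lengths. Both function names are assumptions made for illustration.

```python
from itertools import combinations


def seed_hits(seed, bits):
    """True if every '1' of the seed lines up with a '1' of bits at some offset."""
    need = [d for d, c in enumerate(seed) if c == "1"]
    return any(all(bits[o + d] == "1" for d in need)
               for o in range(len(bits) - len(seed) + 1))


def count_hits(seed, length, min_ones):
    """(hit, total) over all binary strings of this length with >= min_ones ones."""
    hit = total = 0
    for zeros in range(length - min_ones + 1):
        for positions in combinations(range(length), zeros):
            bits = ["1"] * length
            for q in positions:
                bits[q] = "0"
            total += 1
            hit += seed_hits(seed, bits)
    return hit, total
```

On the match row of the example above, `seed_hits("1111", "001101111011110110011100")` and `seed_hits("11011", "001101111011110110011100")` are both true. Calling `count_hits("1111", 20, 15)` and `count_hits("11011", 20, 15)` compares the two seeds over all length-20 match strings with at least 15 matches.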