12
Suffix Arrays and Burrows Wheeler transforms
instructor: Ross A. Lippert
http://www-math.mit.edu/~lippert/18.417/
| S= | a | c | a | t | a | g | g | a | g | a | c | a | t | a | c | g | a | $ |
| B= | a | g | g | $ | t | g | t | c | c | a | a | a | c | a | g | a | a | a |
Finishing suffix trees

structures:
Stree { string, root }
Node { start, depth, slink, parent, children }
Implementations
Complexity variations:
| children | build-time | search-time | space |
| linked list | O(|A||S|) | O(|A||P|) | O(|S|) |
| arrays | O(|A||S|) | O(|P|) | O(|A||S|) |
| maps/trees | O(log|A| |S|) | O(log|A| |P|) | O(|S|) |
Implementation hacks
- Storage of the suffix/node information in arrays
- Getting rid of the parent information
- Storing leaves in smaller datatypes
- (Kurtz) Implicit storing of starts when possible
- Replacing top of the tree with a big lookup
Known implementations
- REPuter (very efficient in time and memory)
- MUMmer (fairly efficient)
- strmat
- libstree
- SuffixTree from Haifa
Too much space? Try suffix arrays
Suffix array is the permutation of sorted suffixes Think of it as a depth-first `dump' of the suffix tree | | suffix | string | | 17 | $ | | 16 | a$ | | 9 | acatacga$ | | 0 | acataggagacatacga$ | | 13 | acga$ | | 7 | agacatacga$ | | 4 | aggagacatacga$ | | 11 | atacga$ | | 2 | ataggagacatacga$ | | 10 | catacga$ | | 1 | cataggagacatacga$ | | 14 | cga$ | | 15 | ga$ | | 8 | gacatacga$ | | 6 | gagacatacga$ | | 5 | ggagacatacga$ | | 12 | tacga$ | | 3 | taggagacatacga$ |
|
What can you do with a suffix array?
- Can't stream text -- no suffix links
- Can do pattern lookup in O(|P|log|S|) time
find(P):
i = 0
lo = 0, hi = length(A)
for 0<=i<length(P):
Binary search for x,y where P[i]=S[A[j]+i] for lo<=x<=j<y<=hi
lo = x, hi = y
return {A[lo],A[lo+1],...,A[hi-1]}
- Can do pattern lookup in O(|P|) time using a different trick (see below)
- Can build `in place' in O(|S| log|S|) time
- Alphabet independent
- Nice cache/paging properties
LCP enhancement
- More functionality with an lcp vector
- MUMs, MEMs, long common substrings or repeats
- Computing a complete lcp tree from the lcp vector is O(|S|)
- lcp tree's space can be traded efficiently against time
| | +32 | +16 | +8 | +4 | +2 | lcp(i,i+1) | suffix | string | | 0 | 0 | 0 | 0 | 0 | 0 | 17 | $ | | | | | | 1 | 16 | a$ | | | | | 2 | 5 | 9 | acatacga$ | | | | | | 2 | 0 | acataggagacatacga$ | | | | 1 | 1 | 1 | 13 | acga$ | | | | | | 2 | 7 | agacatacga$ | | | | | 1 | 1 | 4 | aggagacatacga$ | | | | | | 3 | 11 | atacga$ | | | 0 | 0 | 0 | 0 | 2 | ataggagacatacga$ | | | | | | 4 | 10 | catacga$ | | | | | 1 | 1 | 1 | cataggagacatacga$ | | | | | | 1 | 14 | cga$ | | | | 0 | 2 | 2 | 15 | ga$ | | | | | | 2 | 8 | gacatacga$ | | | | | 0 | 1 | 6 | gagacatacga$ | | | | | | 0 | 5 | ggagacatacga$ | | 0 | 0 | 0 | 0 | 2 | 12 | tacga$ | | | | | | 0 | 3 | taggagacatacga$ |
|
What can we do with less?
Obviously, suffix trees and suffix arrays are not space efficient
- O(n log(n)) number of permutations on n
- O(n log|A|) number of possible suffix arrays
Not all permutations are suffix arrays!
The structure of a suffix array permutation


The structure of a suffix array permutation
Bigger example:
cggcccctttcggggtgtgaccaggcgagtaaagctcgaaacggtaaaatcttgatttttcgctccggcggacagggggagcgctttggtcattccaggg$


An auxilliary permutation
The runs break up along character boundaries |A|+1 monotonic runs | | i | Ainv[A[i]-1] | A[i] | string (wrapped) | | 0 | 1 | 17 | $acataggagacatacga | | 1 | 12 | 16 | a$acataggagacatacg | | 2 | 13 | 9 | acatacga$acataggag | | 3 | 0 | 0 | acataggagacatacga$ | | 4 | 16 | 13 | acga$acataggagacat | | 5 | 14 | 7 | agacatacga$acatagg | | 6 | 17 | 4 | aggagacatacga$acat | | 7 | 9 | 11 | atacga$acataggagac | | 8 | 10 | 2 | ataggagacatacga$ac | | 9 | 2 | 10 | catacga$acataggaga | | 10 | 3 | 1 | cataggagacatacga$a | | 11 | 4 | 14 | cga$acataggagacata | | 12 | 11 | 15 | ga$acataggagacatac | | 13 | 5 | 8 | gacatacga$acatagga | | 14 | 15 | 6 | gagacatacga$acatag | | 15 | 6 | 5 | ggagacatacga$acata | | 16 | 7 | 12 | tacga$acataggagaca | | 17 | 8 | 3 | taggagacatacga$aca |
|
Burrows Wheeler Transform
Example: B=agg$tgtccaaacagaaa Monotonic runs can be `colored' by letters Requires additional O(|A|) storage for block starts of each letter | | i | S[A[i]-1] | Ainv[A[i]-1] | A[i] | string | | 0 | a | 1 | 17 | $ | | 1 | g | 12 | 16 | a$ | | 2 | g | 13 | 9 | acatacga$ | | 3 | $ | 0 | 0 | acataggagacatacga$ | | 4 | t | 16 | 13 | acga$ | | 5 | g | 14 | 7 | agacatacga$ | | 6 | t | 17 | 4 | aggagacatacga$ | | 7 | c | 9 | 11 | atacga$ | | 8 | c | 10 | 2 | ataggagacatacga$ | | 9 | a | 2 | 10 | catacga$ | | 10 | a | 3 | 1 | cataggagacatacga$ | | 11 | a | 4 | 14 | cga$ | | 12 | c | 11 | 15 | ga$ | | 13 | a | 5 | 8 | gacatacga$ | | 14 | g | 15 | 6 | gagacatacga$ | | 15 | a | 6 | 5 | ggagacatacga$ | | 16 | a | 7 | 12 | tacga$ | | 17 | a | 8 | 3 | taggagacatacga$ |
|
What good is just another string?
Two important queries: - occ(x,i) = the number of `x's before i
- find(x,i) = the location of the ith x
Pattern lookup comes down to doing 2|P| 'occ' operations
lookup(P):
lo = 0, hi = length(B)
i = length(P)
while i>0:
i = i-1
lo = block(P[i]) + occ(P[i],lo)
hi = block(P[i]) + occ(P[i],hi)
return hi-lo
Note: this style of lookup could be done on a suffix array | lookup of tag | | i | B[i] | Ainv[A[i]-1] | string | | 0 | a | 1 | $ | | 1 | g | 12 | a$ | | 2 | g | 13 | acatacga$ | | 3 | $ | 0 | acataggagacatacga$ | | 4 | t | 16 | acga$ | | 5 | g | 14 | agacatacga$ | | 6 | t | 17 | aggagacatacga$ | | 7 | c | 9 | atacga$ | | 8 | c | 10 | ataggagacatacga$ | | 9 | a | 2 | catacga$ | | 10 | a | 3 | cataggagacatacga$ | | 11 | a | 4 | cga$ | | 12 | c | 11 | ga$ | | 13 | a | 5 | gacatacga$ | | 14 | g | 15 | gagacatacga$ | | 15 | a | 6 | ggagacatacga$ | | 16 | a | 7 | tacga$ | | 17 | a | 8 | taggagacatacga$ |
|
Fast counts
Store subtotals for every W letters - Time = O(1 + W)
- Space = (1+4|A|/W)|S| bytes
- Bitwise parallelism allows hacks to let W~1000
Common pattern: support queries with auxilliary data structure which fits into arbitrarily small memory | | i | #$ | #a | #c | #g | #t | B[i] | | 0 | 0 | 0 | 0 | 0 | 0 | a | | 1 | | | | | | g | | 2 | | | | | | g | | 3 | 0 | 1 | 0 | 2 | 0 | $ | | 4 | | | | | | t | | 5 | | | | | | g | | 6 | 1 | 1 | 0 | 3 | 1 | t | | 7 | | | | | | c | | 8 | | | | | | c | | 9 | 1 | 1 | 2 | 3 | 2 | a | | 10 | | | | | | a | | 11 | | | | | | a | | 12 | 1 | 4 | 2 | 3 | 2 | c | | 13 | | | | | | a | | 14 | | | | | | g | | 15 | 1 | 5 | 3 | 4 | 2 | a | | 16 | | | | | | a | | 17 | | | | | | a |
|
Bit-parallelism for fast counting
Recall the example `agg$tgtccaaacagaaa'
| atgc | vector |
| a | 100000000111010111 |
| c | 000000011000100000 |
| g | 011001000000001000 |
| t | 000010100000000000 |
Larger example: 101010101110101000101011111101010101010100010100001000011111110010010111
| counts | 0 | | | 13 | | | 25 | | |
| words | 10101010 | 11101010 | 00101011 | 11110101 | 01010101 | 00010100 | 00100001 | 11111100 | 10010111 |
occ reduced to population counts
sum-up-to(i):
q = i/(3*8), r = i/8, s = 8*(r+1)-i
x = counts[q]
for 3*q<=i<r:
x = x + wordsum(words[i])
x = x + wordsum(words[r] >> s)
wordsum(w):
w = (w & 01010101) + ((w>>1) & 01010101)
w = (w & 00110011) + ((w>>2) & 00110011)
w = (w & 00001111) + ((w>>4) & 00001111)
return w
CSA/BWT state of the art
- min{ 2log|A|+2, (|A|+1)/2 } bits per character on genomic alphabets
- O(|S| log|S| log|A|) time to build in 2x memory (recent reports of faster times)
(2000) Grossi and Vitter introduce a general framework for CSAs (see their 2005 paper for a good survey)
(2000) Sadakane produces the first low memory construction scheme based on dynamic dictionaries (red-black trees)


Trade-off between leaf size and performance


Improvements and simplifications
Using a single dictionary + encoding:
| char | code |
| A | 0 |
| C | 01 |
| G | 011 |
| T | 0111 |
| N | 01111 |
- min{ 2log|A|+2, (|A|+1)/2 } bits per character on genomic alphabets
- O(|S| log|S| log|A|) time to build in 2x memory (recent reports of better complexities with large constant factors)
- O(|S|) time position recovery regardless of how many positions to recover
Finding matches
Matches can be quickly found by an exhaustive enumeration of intervals of the corresponding binary strings.
matches([loA hiA],[loB hiB],zeros):
if hiAn-loAn == 0 or hiBn-loBn == 0:
return
for bit=0,1
loAn=forward(loA,bit)
hiAn=forward(hiA,bit)
loBn=forward(loB,bit)
hiBn=forward(hiB,bit)
if bit == 0 and zeros == K:
report(loAn,hiAn,loBn,hiBn)
else:
matches(loAn,hiAn,loBn,hiBn,zeros+!bit)
Software and experiments
- BBBWTgen
creates backwards binary BWTs for nucleotide data. Uses a dynamic dictionary.
- MerMatches
finds k-mer matches between BWT pairs
- Positionator/PostMatches
recovers position information for the matches
On a Mac G5 with 1.8 Gb of available RAM
- 12 hours to index 1 strand of a mammalian genome
- 7 hours to compute positions of matching 20-mers between a pair
Conclusions
- whole-genome BWT indices can be constructed reasonably
- non-trivial comparative matchings can be computed
- large memory machines are not necessary for comparative genomics
- large memory machines can now do queries of fragment stores!