12

Suffix Arrays and Burrows Wheeler transforms

instructor: Ross A. Lippert

http://www-math.mit.edu/~lippert/18.417/




S=acataggagacatacga$
B=agg$tgtccaaacagaaa


Finishing suffix trees

structures:
  Stree { string, root }
  Node  { start, depth, slink, parent, children }

Implementations

Complexity variations:

childrenbuild-timesearch-timespace
linked listO(|A||S|)O(|A||P|)O(|S|)
arraysO(|A||S|)O(|P|)O(|A||S|)
maps/treesO(log|A| |S|)O(log|A| |P|)O(|S|)

Implementation hacks

Known implementations


Too much space? Try suffix arrays

Suffix array is the permutation of sorted suffixes

Think of it as a depth-first `dump' of the suffix tree

suffixstring
17$
16a$
9acatacga$
0acataggagacatacga$
13acga$
7agacatacga$
4aggagacatacga$
11atacga$
2ataggagacatacga$
10catacga$
1cataggagacatacga$
14cga$
15ga$
8gacatacga$
6gagacatacga$
5ggagacatacga$
12tacga$
3taggagacatacga$

What can you do with a suffix array?


LCP enhancement

  • More functionality with an lcp vector
  • MUMs, MEMs, long common substrings or repeats
  • Computing a complete lcp tree from the lcp vector is O(|S|)
  • lcp tree's space can be traded efficiently against time
+32+16+8+4+2lcp(i,i+1)suffixstring
00000017$
116a$
259acatacga$
20acataggagacatacga$
11113acga$
27agacatacga$
114aggagacatacga$
311atacga$
00002ataggagacatacga$
410catacga$
111cataggagacatacga$
114cga$
02215ga$
28gacatacga$
016gagacatacga$
05ggagacatacga$
0000212tacga$
03taggagacatacga$

What can we do with less?

Obviously, suffix trees and suffix arrays are not space efficient

Not all permutations are suffix arrays!


The structure of a suffix array permutation


The structure of a suffix array permutation

Bigger example:

cggcccctttcggggtgtgaccaggcgagtaaagctcgaaacggtaaaatcttgatttttcgctccggcggacagggggagcgctttggtcattccaggg$


An auxilliary permutation

The runs break up along character boundaries

|A|+1 monotonic runs

iAinv[A[i]-1]A[i]string (wrapped)
0 117$acataggagacatacga
11216a$acataggagacatacg
2139acatacga$acataggag
3 00acataggagacatacga$
41613acga$acataggagacat
5147agacatacga$acatagg
6174aggagacatacga$acat
7 911atacga$acataggagac
8102ataggagacatacga$ac
9 210catacga$acataggaga
10 31cataggagacatacga$a
11 414cga$acataggagacata
121115ga$acataggagacatac
13 58gacatacga$acatagga
14156gagacatacga$acatag
15 65ggagacatacga$acata
16 712tacga$acataggagaca
17 83taggagacatacga$aca

Burrows Wheeler Transform

Example: B=agg$tgtccaaacagaaa

Monotonic runs can be `colored' by letters

Requires additional O(|A|) storage for block starts of each letter

char$acgt
block0191216
iS[A[i]-1]Ainv[A[i]-1]A[i]string
0a 117$
1g1216a$
2g139acatacga$
3$ 00acataggagacatacga$
4t1613acga$
5g147agacatacga$
6t174aggagacatacga$
7c 911atacga$
8c102ataggagacatacga$
9a 210catacga$
10a 31cataggagacatacga$
11a 414cga$
12c1115ga$
13a 58gacatacga$
14g156gagacatacga$
15a 65ggagacatacga$
16a 712tacga$
17a 83taggagacatacga$

What good is just another string?

Two important queries:

  • occ(x,i) = the number of `x's before i
  • find(x,i) = the location of the ith x

Pattern lookup comes down to doing 2|P| 'occ' operations

  lookup(P):
    lo = 0, hi = length(B)
    i = length(P)
    while i>0:
      i = i-1
      lo = block(P[i]) + occ(P[i],lo)
      hi = block(P[i]) + occ(P[i],hi)
    return hi-lo
Note: this style of lookup could be done on a suffix array

char$acgt-
block019121618

lookup of
tag
$gat
lo012517
hi1816718
iB[i]Ainv[A[i]-1]string
0a 1$
1g12a$
2g13acatacga$
3$ 0acataggagacatacga$
4t16acga$
5g14agacatacga$
6t17aggagacatacga$
7c 9atacga$
8c10ataggagacatacga$
9a 2catacga$
10a 3cataggagacatacga$
11a 4cga$
12c11ga$
13a 5gacatacga$
14g15gagacatacga$
15a 6ggagacatacga$
16a 7tacga$
17a 8taggagacatacga$

Fast counts

Store subtotals for every W letters

  • Time = O(1 + W)
  • Space = (1+4|A|/W)|S| bytes
  • Bitwise parallelism allows hacks to let W~1000

Common pattern: support queries with auxilliary data structure which fits into arbitrarily small memory

i#$#a#c#g#tB[i]
000000a
1g
2g
301020$
4t
5g
611031t
7c
8c
911232a
10a
11a
1214232c
13a
14g
1515342a
16a
17a

Bit-parallelism for fast counting

Recall the example `agg$tgtccaaacagaaa'

atgcvector
a100000000111010111
c000000011000100000
g011001000000001000
t000010100000000000

Larger example: 101010101110101000101011111101010101010100010100001000011111110010010111

counts01325
words101010101110101000101011111101010101010100010100001000011111110010010111

occ reduced to population counts

  sum-up-to(i):
    q = i/(3*8), r = i/8, s = 8*(r+1)-i
    x = counts[q]
    for 3*q<=i<r:
      x = x + wordsum(words[i])
    x = x + wordsum(words[r] >> s)

  wordsum(w):
    w = (w & 01010101) + ((w>>1) & 01010101)
    w = (w & 00110011) + ((w>>2) & 00110011)
    w = (w & 00001111) + ((w>>4) & 00001111)
    return w

CSA/BWT state of the art

(2000) Grossi and Vitter introduce a general framework for CSAs (see their 2005 paper for a good survey)

(2000) Sadakane produces the first low memory construction scheme based on dynamic dictionaries (red-black trees)



Trade-off between leaf size and performance



Improvements and simplifications

Using a single dictionary + encoding:

charcode
A0
C01
G011
T0111
N01111


Finding matches

Matches can be quickly found by an exhaustive enumeration of intervals of the corresponding binary strings.

  matches([loA hiA],[loB hiB],zeros):
    if hiAn-loAn == 0 or hiBn-loBn == 0:
      return

    for bit=0,1
      loAn=forward(loA,bit)
      hiAn=forward(hiA,bit)
      loBn=forward(loB,bit)
      hiBn=forward(hiB,bit)

      if bit == 0 and zeros == K:
        report(loAn,hiAn,loBn,hiBn)
      else:
        matches(loAn,hiAn,loBn,hiBn,zeros+!bit)

Software and experiments

On a Mac G5 with 1.8 Gb of available RAM


Conclusions