
Suffix Trees: the mother of all string indices

instructor: Ross A. Lippert

http://www-math.mit.edu/~lippert/18.417/



Suffix trees

An interesting special case of the keyword tree.

Built naively as a keyword tree of all suffixes, it costs O(|S|^2) space and time; typical implementations avoid this.

Example word: banana$
Suffixes: banana$, anana$, nana$, ana$, na$, a$
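
The quadratic cost of the uncompressed keyword tree can be checked by counting its nodes. The sketch below is illustrative only: the function name suffix_trie_size and the dict-of-dicts trie are assumptions, not part of the lecture's code.

```python
def suffix_trie_size(s):
    """Count nodes (root included) in the uncompressed keyword tree of all suffixes of s."""
    root = {}
    count = 1  # the root
    for i in range(len(s)):
        node = root
        # Insert suffix s[i:] one character at a time.
        for ch in s[i:]:
            if ch not in node:
                node[ch] = {}
                count += 1
            node = node[ch]
    return count


if __name__ == "__main__":
    for word in ["banana$", "abcdefg$"]:
        print(word, suffix_trie_size(word))
```

Every distinct substring gets its own node, so a word with few repeats needs about |S|^2/2 nodes. For banana$ the count is 23 nodes for 7 characters.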


Keyword (AC) tree vs suffix tree


Compressed edges: O(|S|) space


Using the suffix tree like an FSM

Failures: F[node(xW)]==node(W), hence `suffix links'

Match down the edges of the tree

When mismatches encountered, follow links

  • skip-search speeds up the descent from the parent
  • total number of internal nodes traversed is O(d)
  • # skips, # failures <= # extensions
  • Scan time O(|A||P|) for pattern P

Uses of Suffix Trees

Typical hacks


A more complicated ST

acataggagacatacga$

Suffix tree complexity


Building a suffix tree

structures:
  Stree { string=string, root=Node }
  Node  { start=int, length=int, slink=NIL, parent=NIL, children=[] }

  remove-child(node,child-of-node): ...
  add-child(node,new-child):...

  fork-node(N,len):
    # place a new node on edge between N and N's parent
    X = node{start=N.start,length=len}
    # stitch it up
    add-child(N.parent,X)
    remove-child(N.parent,N)
    add-child(X,N)
    return X
   
  get-child(T,N,char):
    # return the child of N which extends N by character
    for X in N.children:
      if T.string[X.start+N.length]==char:
        return X
    return NIL

  build-stree(S):
    # a useful dollar at the end
    T.string = S + '$'
    # set up the root
    T.root.start = 0
    T.root.length = 0
    T.root.parent = T.root
    T.root.slink = T.root
    # add suffixes in one by one
    Len = length(T.string)
    for 0<= i < Len:
      # obtain the parent of the new leaf
      par = find-or-make-parent(T,i)
      # add the new leaf to the parent
      add-child(par,node{start=i,length=Len-i})
    return T

  find-or-make-parent(T,i):
    d = 0
    N = T.root
    while 1 :
      # match down the tree along the edge into N
      if d == N.length:
        Ch = get-child(T,N,T.string[i+N.length])
        if Ch == NIL:
          return N
        else:
          N = Ch
      else: # d < N.length
        if T.string[N.start+d] == T.string[i+d]:
          d = d+1
        else:
          N = fork-node(N,d)
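
A runnable Python sketch of the pseudocode above, for experimenting. The class name STree, the method names, and the contains search are assumptions added for illustration; the sketch also assumes the input contains no '$'. This is the simple quadratic-time construction, not the suffix-link version.

```python
class Node:
    def __init__(self, start, length, parent=None):
        self.start = start      # start of one suffix running through this node
        self.length = length    # string depth: characters from the root to here
        self.parent = parent
        self.children = []


class STree:
    def __init__(self, s):
        self.string = s + "$"   # a useful dollar at the end
        self.root = Node(0, 0)
        self.root.parent = self.root
        n = len(self.string)
        for i in range(n):      # add suffixes one by one
            par = self._find_or_make_parent(i)
            par.children.append(Node(i, n - i, par))

    def _get_child(self, node, ch):
        # Child of node whose edge starts with ch, or None.
        for x in node.children:
            if self.string[x.start + node.length] == ch:
                return x
        return None

    def _fork_node(self, node, length):
        # Place a new node on the edge between node and its parent.
        x = Node(node.start, length, node.parent)
        node.parent.children.remove(node)
        node.parent.children.append(x)
        x.children.append(node)
        node.parent = x
        return x

    def _find_or_make_parent(self, i):
        d, node = 0, self.root
        while True:
            if d == node.length:
                child = self._get_child(node, self.string[i + d])
                if child is None:
                    return node
                node = child
            elif self.string[node.start + d] == self.string[i + d]:
                d += 1
            else:
                node = self._fork_node(node, d)

    def contains(self, pattern):
        # Walk the pattern down the tree; True if it labels some path.
        d, node = 0, self.root
        while d < len(pattern):
            if d == node.length:
                node = self._get_child(node, pattern[d])
                if node is None:
                    return False
            elif self.string[node.start + d] != pattern[d]:
                return False
            else:
                d += 1
        return True


if __name__ == "__main__":
    t = STree("banana")
    print(t.contains("nan"), t.contains("nab"))
```

Each node stores (start, length) rather than an edge label, so the edge into a node X is string[X.start + X.parent.length : X.start + X.length].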

Modified find-or-make-parent:

  build-stree(S):
    ....
    last = T.root
    for 0<= i < Len:
      par = find-or-make-parent(T,last,i)
      add-child(par,node{start=i,length=Len-i})
      last = par
    return T

  find-or-make-parent(T,N,i):
    # guaranteed that we matched into N
    # so free match on the suffix of N
    d = max(0,N.length-1)
    # if N has no slink, we fix it here
    if N.slink == NIL:
      L = N.parent.slink
      # fast-match down the tree
      while d != L.length:
        if d > L.length:
          L = get-child(T,L,T.string[i+L.length])
        else:
          L = fork-node(L,d)
      # we've either found or created the suffix link
      # by the above
      N.slink = L
    N = N.slink
    # now we do the rest of the search
    while 1 :
      # match down the tree along the edge into N
      if d == N.length:
        Ch = get-child(T,N,T.string[i+N.length])
      ....
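
A self-contained Python sketch of the suffix-link version, for comparison with the simple one. It assumes the elided tail of find-or-make-parent is the same matching loop as in the first version, and that the input contains no '$'. The function names and the contains helper are illustrative assumptions.

```python
class Node:
    def __init__(self, start, length, parent=None):
        self.start = start      # start of one suffix running through this node
        self.length = length    # string depth from the root
        self.parent = parent
        self.slink = None       # suffix link, fixed lazily on the next insertion
        self.children = []


def get_child(text, node, ch):
    for x in node.children:
        if text[x.start + node.length] == ch:
            return x
    return None


def fork_node(node, length):
    # New node of depth `length` on the edge between node and its parent.
    x = Node(node.start, length, node.parent)
    node.parent.children.remove(node)
    node.parent.children.append(x)
    x.children.append(node)
    node.parent = x
    return x


def find_or_make_parent(text, node, i):
    # Suffix i-1 matched into node, so suffix i matches node.length-1 chars for free.
    d = max(0, node.length - 1)
    if node.slink is None:
        link = node.parent.slink
        # Fast-match: look only at the first character of each edge.
        while d != link.length:
            if d > link.length:
                link = get_child(text, link, text[i + link.length])
            else:
                link = fork_node(link, d)
        node.slink = link
    node = node.slink
    # The rest of the search compares character by character.
    while True:
        if d == node.length:
            child = get_child(text, node, text[i + d])
            if child is None:
                return node
            node = child
        elif text[node.start + d] == text[i + d]:
            d += 1
        else:
            node = fork_node(node, d)


def build_stree(s):
    text = s + "$"
    root = Node(0, 0)
    root.parent = root.slink = root
    last = root
    for i in range(len(text)):
        par = find_or_make_parent(text, last, i)
        par.children.append(Node(i, len(text) - i, par))
        last = par
    return text, root


def contains(text, root, pattern):
    d, node = 0, root
    while d < len(pattern):
        if d == node.length:
            node = get_child(text, node, pattern[d])
            if node is None:
                return False
        elif text[node.start + d] != pattern[d]:
            return False
        else:
            d += 1
    return True
```

Only the node returned for the previous suffix can lack a suffix link, which is why the link is repaired at the start of the next call rather than at creation time.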

Implementations

Complexity variations:

children      build-time       search-time      space
linked list   O(|A||S|)        O(|A||P|)        O(|S|)
arrays        O(|A||S|)        O(|P|)           O(|A||S|)
maps/trees    O(log|A| |S|)    O(log|A| |P|)    O(|S|)
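
The three rows correspond to three ways of storing a node's children. A minimal sketch of each container follows; the class names and the get/add interface are hypothetical. The array version assumes 8-bit character codes, and the sorted-list version stands in for a balanced map (its lookup is O(log|A|), but its insert shifts elements).

```python
import bisect


class ListChildren:
    """Unordered list: O(|A|) lookup, space proportional to the children present."""
    def __init__(self):
        self.items = []

    def get(self, ch):
        for c, child in self.items:
            if c == ch:
                return child
        return None

    def add(self, ch, child):
        self.items.append((ch, child))


class ArrayChildren:
    """One slot per alphabet symbol: O(1) lookup, O(|A|) space per node."""
    SIZE = 256  # assumption: 8-bit character codes

    def __init__(self):
        self.slots = [None] * self.SIZE

    def get(self, ch):
        return self.slots[ord(ch)]

    def add(self, ch, child):
        self.slots[ord(ch)] = child


class SortedChildren:
    """Sorted keys with binary search: O(log|A|) lookup."""
    def __init__(self):
        self.keys = []
        self.vals = []

    def get(self, ch):
        k = bisect.bisect_left(self.keys, ch)
        if k < len(self.keys) and self.keys[k] == ch:
            return self.vals[k]
        return None

    def add(self, ch, child):
        k = bisect.bisect_left(self.keys, ch)
        self.keys.insert(k, ch)
        self.vals.insert(k, child)
```

Any of these could replace the children list in a node, with get-child calling get on the first character of the edge.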

Implementation hacks

Known implementations