10

Exact Matching: Hash-tables and Automata

instructor: Ross A. Lippert

http://www-math.mit.edu/~lippert/18.417/

Announcements:
  • problem set due Friday
  • new problem set out
  • read ch. 9


The structure of matches

pairwise matches don't look like this

they look like this


Looking up words in large texts

Input: a large string S of length n and k short strings p_i, each of length m

Output: the locations of all occurrences in S of any p_i

Variations on this theme:


Hashtables

Requires a good hashing function on strings
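
As one possible illustration of the hashtable approach, the sketch below stores the k patterns in a Python dict (which relies on Python's built-in string hash) and looks up every length-m window of S. The function name find_all and the example strings are made up for illustration, and the sketch assumes all patterns share the same length m, as in the problem statement.

```python
def find_all(S, patterns):
    """Return {pattern: [start positions in S]} for equal-length patterns."""
    m = len(patterns[0])
    assert all(len(p) == m for p in patterns), "sketch assumes one common length m"
    hits = {p: [] for p in patterns}      # hashtable keyed by pattern
    for i in range(len(S) - m + 1):
        window = S[i:i + m]               # hashing the window costs O(m)
        if window in hits:
            hits[window].append(i)
    return hits


hits = find_all("actaacactaac", ["act", "taa", "ggg"])
```

Each of the n-m+1 windows costs O(m) to slice and hash, so this version runs in roughly O(mn) expected time; a rolling hash would be one way to reduce the per-window cost.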


Exact matching revisited

We can do better than O(mn) for finding all exact occurrences of a single pattern

Example: pattern actaac

Finite State Machine: Knuth-Morris-Pratt algorithm


FSM representation

For convenience we terminate the pattern with a nonsense character, $, that occurs in neither the pattern nor the text

The FSM fits into a small table:

i:     0   1   2   3   4   5   6
F[i]: -1   0   0  -1   1   0   2
S[i]:  a   c   t   a   a   c   $

F[i] = max f : f < i, S[0:f-1]==suffix_of(S[0:i-1]), and S[f]!=S[i] (F[i] = -1 if no such f exists)
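
The definition can be checked directly against the table for actaac$. The brute-force sketch below is far slower than the real preprocessing and is meant only to make the definition concrete. It treats S[f]!=S[i] as a requirement on f and uses -1 when no f qualifies; the function name is illustrative.

```python
def failure_by_definition(S):
    """F[i] = largest f < i with S[:f] a suffix of S[:i] and S[f] != S[i], else -1."""
    F = []
    for i in range(len(S)):
        best = -1
        for f in range(i):
            # S[:f] is the prefix of length f; S[i-f:i] is the suffix of S[:i] of length f
            if S[:f] == S[i - f:i] and S[f] != S[i]:
                best = f                  # f increases, so the last hit is the maximum
        F.append(best)
    return F


F = failure_by_definition("actaac$")
```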


Knuth-Morris-Pratt:

kmp-match(T,S)
  S = S+'$'
  kmp-process(S,F)
  j = 0
  for 0<= i < length(T):
    j = kmp-next(T[i],j,S,F)
    if S[j]=='$':
       report match ending at i
kmp-next(c,j,S,F)
  if j==-1 or S[j]==c:
    return j+1
  else:
    return kmp-next(c,F[j],S,F)
kmp-process(S,F)
  f = -1
  for 0<= i < length(S):
    # loop inv: f = max f : f < i, S[0:f-1]==S[i-f:i-1] (f = -1 when i = 0)
    if f==-1:
      F[i] = -1
    elif S[f]!=S[i]:
      F[i] = f
    else:
      F[i] = F[f]
    f = kmp-next(S[i],f,S,F)

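Preprocessing takes O(|P|) time and matching takes O(|S|) time, where P is the pattern and S is the text (S and T, respectively, in the pseudocode above)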
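
A runnable Python rendering of the pseudocode above, offered as a sketch rather than a reference implementation. The names use underscores instead of hyphens, kmp_next is written as a loop instead of a recursion, and the case f == -1 in kmp_process is handled explicitly by leaving F[i] at -1. It assumes the sentinel '$' occurs in neither the text nor the pattern.

```python
def kmp_next(c, j, S, F):
    """Advance from state j on character c, following failure links on mismatch."""
    while j != -1 and S[j] != c:
        j = F[j]
    return j + 1


def kmp_process(S):
    """Build the failure table F for the sentinel-terminated pattern S."""
    F = [-1] * len(S)
    f = -1                                # length of the current border, -1 at i = 0
    for i in range(len(S)):
        if f != -1:
            F[i] = f if S[f] != S[i] else F[f]
        f = kmp_next(S[i], f, S, F)
    return F


def kmp_match(T, P, sentinel="$"):
    """Return the end positions in T of all occurrences of P."""
    S = P + sentinel
    F = kmp_process(S)
    ends, j = [], 0
    for i, c in enumerate(T):
        j = kmp_next(c, j, S, F)
        if S[j] == sentinel:              # reached the end of the pattern
            ends.append(i)
    return ends


ends = kmp_match("actaactaac", "actaac")
```

The example text contains two overlapping occurrences of actaac, which the failure link from the sentinel state lets the matcher find without rescanning.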

Common variation: precompute kmp-next as a lookup table over every (state, character) pair, using O(|A||P|) space for alphabet A; each step then becomes a single table lookup, a constant-factor speedup.
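
A sketch of the lookup-table variation, under the assumptions that the alphabet is known in advance, that the text contains only alphabet characters, and that '$' is not among them. The helper names build_table and table_match are illustrative; kmp_next and kmp_process are repeated so the block runs on its own.

```python
def kmp_next(c, j, S, F):
    while j != -1 and S[j] != c:
        j = F[j]
    return j + 1


def kmp_process(S):
    F = [-1] * len(S)
    f = -1
    for i in range(len(S)):
        if f != -1:
            F[i] = f if S[f] != S[i] else F[f]
        f = kmp_next(S[i], f, S, F)
    return F


def build_table(P, alphabet, sentinel="$"):
    """delta[j][c] = state reached from state j on character c."""
    S = P + sentinel
    F = kmp_process(S)
    return [{c: kmp_next(c, j, S, F) for c in alphabet} for j in range(len(S))]


def table_match(T, P, alphabet):
    """Return end positions of P in T using one table lookup per character."""
    delta = build_table(P, alphabet)
    accept = len(P)                       # state sitting on the sentinel
    ends, j = [], 0
    for i, c in enumerate(T):
        j = delta[j][c]
        if j == accept:
            ends.append(i)
    return ends


ends = table_match("actaactaac", "actaac", "acgt")
```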


Matching multiple patterns:

Aho-Corasick: Rather than running one FSM per pattern, build a single FSM for all patterns and scan the text once.

Example strings: aggg$agcc$ac$cat$

cute animation found here

Building an AC Tree - implementation

The FSM again fits into a table:

i:     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
F[i]: 13  11   7   0   0  -1  -1   0  14  14  -1   0  14  -1   0   1   0
S[i]:  a   g   g   g   $   a   g   c   c   $   a   c   $   c   a   t   $

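F[i] serves two purposes. Where the tree branches, F[i] points to the next alternative character to try (F[0]=13, F[1]=11, F[2]=7 in the table, so f < i need not hold). Otherwise F[i] = f is the failure link: the state reached by the longest proper suffix of the text matched so far that is a prefix of some pattern (-1 at the root). Positions that repeat an already stored prefix (5, 6, 10) are never visited and keep -1.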


Pseudo-code for Aho Corasick

ac-match(T,P_1,P_2,...,P_N)
  S = P_1+'$'+...+P_N+'$'
  ac-process(S,F)
  then same as kmp-match
ac-next(c,j,S,F)
  same as kmp-next
ac-process(S,F)
  F[i] = -1 for all i
  i=0, j=0
  while i < length(S):
    if S[i]==S[j]:
      if S[i]=='$':
        j = -1
      j = j+1
      i = i+1
    else:
      if F[j]==-1:
        F[j] = i
      j = F[j]
  failures(0,-1,S,F)
failures(i,f,S,F)
  if F[i]==-1:
    F[i]=f
  else:
    failures(F[i],f,S,F)
  if S[i]!='$':
    failures(i+1,ac-next(S[i],f,S,F),S,F)

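Preprocessing takes O(|A||P|) time and matching takes O(|A||S|) time, where A is the alphabet, |P| is the total length of the patterns, and S is the text (T in the pseudocode above)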
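
A runnable Python rendering of the pseudocode above, as a sketch under stated assumptions: F starts out filled with -1, '$' occurs in neither the text nor the patterns, and the patterns are short enough for the recursive failures routine to stay within Python's recursion limit. Like the pseudocode, it reports a match only when the current state sits on a '$', so a pattern occurring strictly inside another pattern may be missed.

```python
def ac_next(c, j, S, F):
    """Same as kmp-next: follow branch/failure links until c matches or we fall off."""
    while j != -1 and S[j] != c:
        j = F[j]
    return j + 1


def failures(i, f, S, F, sentinel):
    """Fill in failure links depth-first; f is the failure state for position i."""
    if F[i] == -1:
        F[i] = f
    else:
        failures(F[i], f, S, F, sentinel)     # F[i] is a branch link: visit the sibling
    if S[i] != sentinel:
        failures(i + 1, ac_next(S[i], f, S, F), S, F, sentinel)


def ac_process(S, sentinel="$"):
    F = [-1] * len(S)
    i = j = 0
    while i < len(S):
        if S[i] == S[j]:
            if S[i] == sentinel:
                j = -1                        # next pattern restarts at the root
            j += 1
            i += 1
        else:
            if F[j] == -1:
                F[j] = i                      # open a new branch at position i
            j = F[j]
    failures(0, -1, S, F, sentinel)
    return F


def ac_match(T, patterns, sentinel="$"):
    """Return end positions in T where some pattern ends."""
    S = "".join(p + sentinel for p in patterns)
    F = ac_process(S, sentinel)
    ends, j = [], 0
    for i, c in enumerate(T):
        j = ac_next(c, j, S, F)
        if S[j] == sentinel:
            ends.append(i)
    return ends


F = ac_process("aggg$agcc$ac$cat$")
ends = ac_match("catagccac", ["aggg", "agcc", "ac", "cat"])
```

On the example strings the computed F agrees with the table given earlier, and the scan of catagccac reports the ends of cat, agcc, and ac.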

Common variation: build a lookup table for ac-next using O(|A||P|) space, which brings matching down to O(|S|) time.


Suffix trees

An interesting special case of the keyword tree.

Built naively as a keyword tree of all suffixes of P, it takes O(|P|^2) space and time; practical implementations avoid this.

Example word: banana$anana$nana$ana$na$a$
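
To make the keyword-tree view concrete, the sketch below generates the example word from banana and inserts every suffix into a plain nested-dict trie. This is the naive O(|P|^2) construction, an uncompressed suffix trie and not a linear-size suffix tree; the function names are illustrative.

```python
def suffix_word(w, sentinel="$"):
    """Concatenate all non-empty suffixes of w, each terminated by the sentinel."""
    w = w + sentinel
    return "".join(w[i:] for i in range(len(w) - 1))


def build_suffix_trie(w, sentinel="$"):
    """Insert every suffix of w+sentinel into a nested-dict trie (quadratic size)."""
    w = w + sentinel
    root = {}
    for i in range(len(w) - 1):
        node = root
        for c in w[i:]:
            node = node.setdefault(c, {})
    return root


def contains(root, q):
    """q is a substring of w exactly when it labels a path from the root."""
    node = root
    for c in q:
        if c not in node:
            return False
        node = node[c]
    return True


word = suffix_word("banana")
trie = build_suffix_trie("banana")
found = contains(trie, "nan")
```

A substring query costs time proportional to the query length, independent of the length of the indexed word.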


Uses of Suffix Trees