pairwise matches don't look like this

they look like this

Input: a large string S of length n and k short strings p_i of length m
Output: the locations of all occurrences in S of some p_i
Variations on this theme:

Hashing: requires a good hashing function on strings
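One way such a hash could be used, sketched under assumptions: a Rabin-Karp-style polynomial rolling hash (the notes do not name a specific scheme) hashes every length-m window of S and looks it up in a table of pattern hashes. The names `find_all`, `BASE`, and `MOD` are illustrative only, and every hash hit is re-checked against the actual strings so that collisions cannot produce false matches.

```python
BASE = 257             # assumed radix, chosen above the ASCII range
MOD = (1 << 61) - 1    # assumed large prime modulus

def find_all(S, patterns):
    """Return (position, pattern) pairs for every occurrence in S.

    All patterns are assumed to share one length m, as in the
    problem statement above.
    """
    if not patterns:
        return []
    m = len(patterns[0])
    assert all(len(p) == m for p in patterns)
    if m == 0 or m > len(S):
        return []

    def poly_hash(s):
        h = 0
        for ch in s:
            h = (h * BASE + ord(ch)) % MOD
        return h

    # Map hash value -> set of patterns with that hash.
    table = {}
    for p in patterns:
        table.setdefault(poly_hash(p), set()).add(p)

    top = pow(BASE, m - 1, MOD)   # weight of the character leaving the window
    h = poly_hash(S[:m])
    hits = []
    for i in range(len(S) - m + 1):
        if h in table:
            window = S[i:i + m]   # slice only on a hash hit
            if window in table[h]:
                hits.append((i, window))
        if i + m < len(S):
            # Slide the window: drop S[i], append S[i + m].
            h = ((h - ord(S[i]) * top) * BASE + ord(S[i + m])) % MOD
    return hits
```

Calling `find_all("acgtacgt", ["acg", "cgt"])` reports both patterns at each of their two positions. The expected running time is about n + km as long as collisions are rare, which is why the quality of the hash function matters.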
We can do better than O(mn) when finding all exact occurrences of one pattern
Example: pattern actaac

Finite State Machine: Knuth Morris Pratt algorithm
For convenience we use a nonsense character, $, to terminate the pattern
The FSM fits into a small table:
| i: | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
| F[i]: | -1 | 0 | 0 | -1 | 1 | 0 | 2 |
| S[i]: | a | c | t | a | a | c | $ |
F[i] = max f such that f < i, S[0:f-1] is a suffix of S[0:i-1], and S[f]!=S[i]; F[i] = -1 if no such f exists
kmp-match(T,S)
    S = S+'$'
    kmp-process(S,F)
    j = 0
    for 0<= i < length(T):
        j = kmp-next(T[i],j,S,F)
        if S[j]=='$':
            report match ending at i
kmp-next(c,j,S,F)
    if j==-1 or S[j]==c:
        return j+1
    else:
        return kmp-next(c,F[j],S,F)
kmp-process(S,F)
    f = -1
    for 0<= i < length(S):
        # loop inv: f = max f : f < i, S[0:f-1]==S[i-f:i-1]
        if f==-1 or S[f]!=S[i]:
            F[i] = f
        else:
            F[i] = F[f]
        f = kmp-next(S[i],f,S,F)
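For readers who want to run the pseudocode, here is a minimal Python sketch of the three routines. It follows the structure of the notes; the function names, the iterative form of `kmp_next`, returning F instead of filling an argument, and the assumption that `$` never occurs in the text are choices made for this sketch.

```python
def kmp_next(c, j, S, F):
    """Follow failure links from state j until character c can be consumed."""
    while j != -1 and S[j] != c:
        j = F[j]
    return j + 1

def kmp_process(S):
    """Build the failure table F for a pattern S that already ends in '$'."""
    F = [-1] * len(S)
    f = -1  # longest proper border length of S[:i]; -1 before the first step
    for i in range(len(S)):
        if f == -1 or S[f] != S[i]:
            F[i] = f
        else:
            F[i] = F[f]   # the same character would fail again, so skip ahead
        f = kmp_next(S[i], f, S, F)
    return F

def kmp_match(T, P):
    """Return the end index of every occurrence of P in T.

    Assumes '$' does not occur in T.
    """
    S = P + "$"
    F = kmp_process(S)
    ends = []
    j = 0
    for i, c in enumerate(T):
        j = kmp_next(c, j, S, F)
        if S[j] == "$":
            ends.append(i)
    return ends
```

On the example pattern, `kmp_process("actaac$")` reproduces the F row of the table above, and `kmp_match("actaactaac", "actaac")` finds the two overlapping occurrences.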
Preprocessing takes O(|P|) time for a pattern P; matching takes O(|S|) time for a text S (the pseudocode calls the pattern S and the text T)
Common variation: build a lookup table for kmp-next using |A||P| space, where A is the alphabet, for a constant-factor time improvement.
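The lookup-table variation might look like the sketch below: precompute the next state for every (state, character) pair so that matching does one table lookup per text character. The names `build_table` and `table_match` and the explicit `alphabet` argument are illustrative; the failure table is recomputed inline so the block stands alone, and `$` is assumed to be absent from both the alphabet and the text.

```python
def build_table(P, alphabet):
    """Return (delta, accept); delta[j][c] is the state reached from j on c."""
    S = P + "$"
    n = len(S)
    F = [-1] * n

    def step(c, j):
        # Same rule as kmp-next, written iteratively.
        while j != -1 and S[j] != c:
            j = F[j]
        return j + 1

    # Failure table, computed as in kmp-process.
    f = -1
    for i in range(n):
        F[i] = f if (f == -1 or S[f] != S[i]) else F[f]
        f = step(S[i], f)

    # One row per state, one entry per character: |A||P| entries in total.
    delta = [{c: step(c, j) for c in alphabet} for j in range(n)]
    return delta, n - 1   # state n-1 sits on '$', i.e. the whole pattern matched

def table_match(T, P, alphabet):
    """Return the end index of every occurrence of P in T."""
    delta, accept = build_table(P, alphabet)
    ends, j = [], 0
    for i, c in enumerate(T):
        # Characters outside the alphabet cannot extend any match.
        j = delta[j].get(c, 0)
        if j == accept:
            ends.append(i)
    return ends
```

The table trades memory for speed: the matching loop no longer follows failure links, at the cost of storing one entry per state and alphabet character. The alphabet is assumed to contain every character of the pattern.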
Aho-Corasick: Rather than matching one FSM per pattern, make a single FSM for all patterns, and scan the text once.
Example strings: aggg$agcc$ac$cat$

The FSM again fits into a table:
| i: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
| F[i]: | 13 | 11 | 7 | 0 | 0 | -1 | -1 | 0 | 14 | 14 | -1 | 0 | 14 | -1 | 0 | 1 | 0 |
| S[i]: | a | g | g | g | $ | a | g | c | c | $ | a | c | $ | c | a | t | $ |
F[i] = f : f < i, S[0:f-1]==suffix_of(S[0:i-1])
ac-match(T,P_1,P_2,...,P_N)
    S = P_1+'$'+...+P_N+'$'
    ac-process(S,F)
    then same as kmp-match
ac-next(c,j,S,F)
    same as kmp-next
ac-process(S,F)
    # F is assumed to start with every entry equal to -1
    i=0, j=0
    while i < length(S):
        if S[i]==S[j]:
            if S[i]=='$':
                j = -1
            j = j+1
            i = i+1
        else:
            if F[j]==-1:
                F[j] = i
            j = F[j]
    failures(0,-1,S,F)
failures(i,f,S,F)
    if F[i]==-1:
        F[i]=f
    else:
        failures(F[i],f,S,F)
    if S[i]!='$':
        failures(i+1,ac-next(S[i],f,S,F),S,F)
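A direct Python port of the pseudocode above, offered as a sketch rather than a reference implementation. The function names and the list-returning interface are assumptions of this sketch. As in the notes, the matcher reports a hit only when the current state sits on a `$`, so pattern sets in which one pattern is a prefix of, or is contained in, another may need extra handling that the notes do not spell out.

```python
def ac_next(c, j, S, F):
    """Same rule as kmp-next: follow links from j until c can be consumed."""
    while j != -1 and S[j] != c:
        j = F[j]
    return j + 1

def failures(i, f, S, F):
    """Fill the still-empty entries of F with failure links (recursive)."""
    if F[i] == -1:
        F[i] = f
    else:
        failures(F[i], f, S, F)   # visit the alternative branch first
    if S[i] != "$":
        failures(i + 1, ac_next(S[i], f, S, F), S, F)

def ac_process(S):
    """Build the link table F for S = P_1 + '$' + ... + P_N + '$'."""
    F = [-1] * len(S)
    i = j = 0
    # Pass 1: thread each pattern through the earlier ones; where it
    # diverges, F[j] records the position at which the new branch starts.
    while i < len(S):
        if S[i] == S[j]:
            if S[i] == "$":
                j = -1            # pattern finished: restart from the root
            j += 1
            i += 1
        else:
            if F[j] == -1:
                F[j] = i
            j = F[j]
    # Pass 2: the remaining entries become failure links.
    failures(0, -1, S, F)
    return F

def ac_match(T, patterns):
    """Return the end index in T of each match found ('$' must not occur in T)."""
    S = "".join(p + "$" for p in patterns)
    F = ac_process(S)
    ends, j = [], 0
    for i, c in enumerate(T):
        j = ac_next(c, j, S, F)
        if S[j] == "$":
            ends.append(i)
    return ends
```

Running `ac_process("aggg$agcc$ac$cat$")` reproduces the F row of the table above, which is a useful sanity check on the port. The recursion depth grows with the total pattern length, so a long pattern set would call for an iterative rewrite.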
Time to process is |A||P|, time to match is |A||S|
Common variation: build a lookup table for ac-next using |A||P| space for |S| time.
Suffix tree: an interesting special case of the keyword tree, where the keywords are all suffixes of one word.
Typically not implemented with O(|P|^2) space or time.
Example word: banana$anana$nana$ana$na$a$
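To make the keyword-tree view concrete, the sketch below builds a trie over all suffixes of a word the naive way, which is where the quadratic cost comes from. The nested-dict representation and the names `suffix_keywords`, `suffix_trie`, `contains`, and `count_nodes` are assumptions of this sketch, not the data structure used in the notes.

```python
def suffix_keywords(word):
    """Concatenate every suffix of word, each terminated by '$'."""
    return "".join(word[i:] + "$" for i in range(len(word)))

def suffix_trie(word):
    """Insert every '$'-terminated suffix of word into a trie of nested dicts."""
    root = {}
    for start in range(len(word)):
        node = root
        for ch in word[start:] + "$":
            node = node.setdefault(ch, {})
    return root

def contains(root, query):
    """True if query occurs somewhere in the indexed word."""
    node = root
    for ch in query:
        if ch not in node:
            return False
        node = node[ch]
    return True

def count_nodes(node):
    """Number of trie nodes, root included (can grow quadratically)."""
    return 1 + sum(count_nodes(child) for child in node.values())

trie = suffix_trie("banana")
print(suffix_keywords("banana"))
print(contains(trie, "nan"), contains(trie, "nab"))
```

Every substring of the word is a prefix of some suffix, so a substring query is a single walk from the root. The insertion loop touches one node per character of every suffix, which is the quadratic behaviour that practical suffix tree constructions avoid.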

