
This is an implementation of the Knuth-Morris-Pratt algorithm for finding copies of a given pattern as a contiguous subsequence of a larger text. Since KMP accesses the text only sequentially, it is natural to implement it in a way that allows the text to be an arbitrary iterator. After a preprocessing stage which takes time linear in the length of the pattern, each text symbol is processed in constant amortized time.

Python, 36 lines
# Knuth-Morris-Pratt string matching
# David Eppstein, UC Irvine, 1 Mar 2002

from __future__ import generators

def KnuthMorrisPratt(text, pattern):

    '''Yields all starting positions of copies of the pattern in the text.
Calling conventions are similar to string.find, but its arguments can be
lists or iterators, not just strings, it returns all matches, not just
the first one, and it does not need the whole text in memory at once.
Whenever it yields, it will have read the text exactly up to and including
the match that caused the yield.'''

    # allow indexing into pattern and protect against change during yield
    pattern = list(pattern)

    # build table of shift amounts
    shifts = [1] * (len(pattern) + 1)
    shift = 1
    for pos in range(len(pattern)):
        while shift <= pos and pattern[pos] != pattern[pos-shift]:
            shift += shifts[pos-shift]
        shifts[pos+1] = shift

    # do the actual search
    startPos = 0
    matchLen = 0
    for c in text:
        while matchLen == len(pattern) or \
              matchLen >= 0 and pattern[matchLen] != c:
            startPos += shifts[matchLen]
            matchLen -= shifts[matchLen]
        matchLen += 1
        if matchLen == len(pattern):
            yield startPos

An explanation of why this works can be found in many algorithms texts, for instance Cormen, Leiserson, Rivest, and Stein, _Introduction to Algorithms_.
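A quick usage sketch of the recipe (a hypothetical driver, not part of the original): the same function with the Python 2.2-era `__future__` import dropped so it runs on Python 3, followed by a couple of calls. Note that overlapping matches are reported, unlike a naive find-and-skip loop.

```python
# The recipe's function, ported to Python 3 (the __future__ import is only
# needed on Python 2.2, where generators were not yet a standard feature).
def KnuthMorrisPratt(text, pattern):
    '''Yields all starting positions of copies of pattern in text.'''
    # allow indexing into pattern and protect against change during yield
    pattern = list(pattern)

    # build table of shift amounts
    shifts = [1] * (len(pattern) + 1)
    shift = 1
    for pos in range(len(pattern)):
        while shift <= pos and pattern[pos] != pattern[pos - shift]:
            shift += shifts[pos - shift]
        shifts[pos + 1] = shift

    # do the actual search
    startPos = 0
    matchLen = 0
    for c in text:
        while matchLen == len(pattern) or \
              matchLen >= 0 and pattern[matchLen] != c:
            startPos += shifts[matchLen]
            matchLen -= shifts[matchLen]
        matchLen += 1
        if matchLen == len(pattern):
            yield startPos

print(list(KnuthMorrisPratt('abcabcab', 'abcab')))  # [0, 3]
print(list(KnuthMorrisPratt('aaaa', 'aa')))         # [0, 1, 2]
```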

8 comments

Alexander Semenov 19 years, 9 months ago

Benchmarks. It looks cool, but in my tests it is 500-700 times slower:

import random, time

bigstr = ''.join(map(lambda i:chr(random.randrange(32,127)), xrange(650000) ))
instr = ''.join(map(lambda i:chr(random.randrange(32,127)), xrange(3) ))
print 'Searching', instr, 'in', bigstr[:30]

t=time.clock()
i = 0
for f in KnuthMorrisPratt(bigstr,instr): i += 1
print 'Found',i,'times, in', time.clock()-t,'clocks'

t=time.clock()
i = 0
pos =0
while 1:
    pos = bigstr.find(instr)
    if pos < 0: break
    i += 1
    bigstr = bigstr[pos+1:]
print 'Found',i,'times, in', time.clock()-t,'clocks'

Output:

Searching 2SM in s<.z~GWKoA\NrDmxCzZ[[eRlXnRy(V
Found 3 times, in 5.43653341416 clocks
Found 3 times, in 0.0177835451154 clocks
David Eppstein (author) 19 years, 9 months ago

Re: benchmarks. Thanks for the comment. Partly, the slowdown you saw is the penalty for using interpreted code instead of built-in functions. But partly, you chose parameters that would make the built-ins look good and my KMP implementation look bad: your code pays a linear-time penalty every time it finds a match, while increasing the number of matches does not slow my function down much. Random data does not have a large number of matches, compared to most situations where one would use a string matching routine. If I change your test code only very slightly, to limit the character set in your random strings to four characters (increasing the number of matches, and also increasing the similarity to biological data), the statistics look very different:

Searching ! ! in "!""!  ""   "  !!!"  "  !"" !
Found 24136 times, in 6.455586 clocks
Found 24136 times, in 49.912182 clocks

My routine takes roughly the same time as I was getting on your original example, while using your loop with find and slice is now slower than my routine by a factor of eight.

Alexander Semenov 19 years, 9 months ago

Now I see: it's more effective when matches are frequent. But to be fair, the second loop must be changed to:

t=time.clock()
i = 0
pos = -1
while 1:
    pos = bigstr.find(instr, pos+1)
    if pos < 0: break
    i += 1
print 'Found',i,'times, in', time.clock()-t,'clocks'

to eliminate all the string reallocation.

Searching   ! in "!!    !!!" ! !! !  !  ""!"! "
Found 23909 times, in 6.90686190563 clocks
Found 23909 times, in 0.220056739055 clocks

still too slow...

David Eppstein (author) 19 years, 9 months ago

I agree that, as you found, with the addition of the pos argument, the built-in find is faster, and should be used when you are searching a string in-memory. If what you are searching is not a string, string.find() is not applicable, and this KMP code would be more useful.
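The non-string case can be sketched like this (the recipe's function ported to Python 3; `words` is a made-up token stream used only for illustration). Because the algorithm merely iterates over the text, the pattern and text can be sequences of arbitrary comparable items, something `str.find` cannot do.

```python
# The recipe's function, ported to Python 3 (the __future__ import dropped).
def KnuthMorrisPratt(text, pattern):
    pattern = list(pattern)
    shifts = [1] * (len(pattern) + 1)
    shift = 1
    for pos in range(len(pattern)):
        while shift <= pos and pattern[pos] != pattern[pos - shift]:
            shift += shifts[pos - shift]
        shifts[pos + 1] = shift
    startPos = 0
    matchLen = 0
    for c in text:
        while matchLen == len(pattern) or \
              matchLen >= 0 and pattern[matchLen] != c:
            startPos += shifts[matchLen]
            matchLen -= shifts[matchLen]
        matchLen += 1
        if matchLen == len(pattern):
            yield startPos

def words():
    # hypothetical lazy token stream; a real one might read from a file,
    # a socket, or a tokenizer, never holding the whole text in memory
    yield from ['to', 'be', 'or', 'not', 'to', 'be']

# The pattern here is a list of words, not a string.
print(next(KnuthMorrisPratt(words(), ['not', 'to', 'be'])))  # 3
```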

Alexander Semenov 16 years, 7 months ago

It can be much more flexible than find. It's worth noting that its biggest weak point, being interpreted code, brings in exchange a big advantage in flexibility: one can adapt this function for case-insensitive searches or for wildcard matching.
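One flexibility example along these lines: because the function accepts any iterator, case-insensitive search needs no change to the algorithm at all, only normalized inputs. The wrapper name `find_ci` is hypothetical. (Wildcards would need more care, since the shift table assumes element equality is transitive.)

```python
# The recipe's function, ported to Python 3.
def KnuthMorrisPratt(text, pattern):
    pattern = list(pattern)
    shifts = [1] * (len(pattern) + 1)
    shift = 1
    for pos in range(len(pattern)):
        while shift <= pos and pattern[pos] != pattern[pos - shift]:
            shift += shifts[pos - shift]
        shifts[pos + 1] = shift
    startPos = 0
    matchLen = 0
    for c in text:
        while matchLen == len(pattern) or \
              matchLen >= 0 and pattern[matchLen] != c:
            startPos += shifts[matchLen]
            matchLen -= shifts[matchLen]
        matchLen += 1
        if matchLen == len(pattern):
            yield startPos

def find_ci(text, pattern):
    # hypothetical helper: lowercase both streams lazily, then search as usual
    return KnuthMorrisPratt(map(str.lower, text), pattern.lower())

print(list(find_ci('Spam and SPAM', 'spam')))  # [0, 9]
```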

Przemek Drochomirecki 14 years, 8 months ago

try this :-)

bigstr = 'A'*10**7 + 'B'

instr = 'A'*10**6 + 'B'

zwhitchcox 6 years, 7 months ago

for c in text:
    while matchLen == len(pattern) or \
          matchLen >= 0 and pattern[matchLen] != c:
        startPos += shifts[matchLen]
        matchLen -= shifts[matchLen]
    matchLen += 1
    if matchLen == len(pattern):
        yield startPos

^ this code doesn't make any sense. "for c in text". If you're just going to iterate through every character in the text, why even bother with the shift indices.

Still trying to wrap my head around this algorithm. I could be missing something of course

zwhitchcox 6 years, 7 months ago

Ah never mind. I see now.