Welcome, guest | Sign In | My Account | Store | Cart

This is an implementation of the Knuth-Morris-Pratt algorithm for finding copies of a given pattern as a contiguous subsequence of a larger text. Since KMP accesses the text only sequentially, it is natural to implement it in a way that allows the text to be an arbitrary iterator. After a preprocessing stage which takes time linear in the length of the pattern, each text symbol is processed in constant amortized time.

Python, 36 lines
 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36``` ```# Knuth-Morris-Pratt string matching # David Eppstein, UC Irvine, 1 Mar 2002 from __future__ import generators def KnuthMorrisPratt(text, pattern): '''Yields all starting positions of copies of the pattern in the text. Calling conventions are similar to string.find, but its arguments can be lists or iterators, not just strings, it returns all matches, not just the first one, and it does not need the whole text in memory at once. Whenever it yields, it will have read the text exactly up to and including the match that caused the yield.''' # allow indexing into pattern and protect against change during yield pattern = list(pattern) # build table of shift amounts shifts = [1] * (len(pattern) + 1) shift = 1 for pos in range(len(pattern)): while shift <= pos and pattern[pos] != pattern[pos-shift]: shift += shifts[pos-shift] shifts[pos+1] = shift # do the actual search startPos = 0 matchLen = 0 for c in text: while matchLen == len(pattern) or \ matchLen >= 0 and pattern[matchLen] != c: startPos += shifts[matchLen] matchLen -= shifts[matchLen] matchLen += 1 if matchLen == len(pattern): yield startPos ```

An explanation of why this works can be found in many algorithms texts, for instance Cormen, Leiserson, Rivest, and Stein, _Introduction to Algorithms_.

Alexander Semenov 21 years, 3 months ago

benchmarks. It's looking cool, but in my tests 500-700 times slow:

``````import random, time

bigstr = ''.join(map(lambda i:chr(random.randrange(32,127)), xrange(650000) ))
instr = ''.join(map(lambda i:chr(random.randrange(32,127)), xrange(3) ))
print 'Searching', instr, 'in', bigstr[:30]

t=time.clock()
i = 0
for f in KnuthMorrisPratt(bigstr,instr): i += 1
print 'Found',i,'times, in', time.clock()-t,'clocks'

t=time.clock()
i = 0
pos =0
while 1:
pos = bigstr.find(instr)
if pos&lt;0:break
i += 1
bigstr = bigstr[pos+1:]
print 'Found',i,'times, in', time.clock()-t,'clocks'
``````

Output:

``````Searching 2SM in s&lt;.z~GWKoA\NrDmxCzZ[[eRlXnRy(V
Found 3 times, in 5.43653341416 clocks
Found 3 times, in 0.0177835451154 clocks
``````
David Eppstein (author) 21 years, 3 months ago

Re: benchmarks. Thanks for the comment. Partly, the slowdown you saw is the penalty for using interpreted code instead of built-in functions. But partly, you chose parameters that would make the built-ins look good and my KMP implementation look bad: your code pays a linear-time penalty every time it finds a match, while increasing the number of matches does not slow my function down much. Random data does not have a large number of matches, compared to most situations where one would use a string matching routine. If I change your test code only very slightly, to limit the character set in your random strings to four characters (increasing the number of matches, and also increasing the similarity to biological data), the statistics look very different:

``````Searching ! ! in "!""!  ""   "  !!!"  "  !"" !
Found 24136 times, in 6.455586 clocks
Found 24136 times, in 49.912182 clocks
``````

My routine takes roughly the same time as I was getting on your original example, while using your loop with find and slice is now slower than my routine by a factor of eight.

Alexander Semenov 21 years, 3 months ago

Now I see, it's more effective on often matches. But to be correct, second cycle must be changed to:

``````t=time.clock()
i = 0
pos = -1
while 1:
pos = bigstr.find(instr, pos+1)
if pos&lt;0:break
i += 1
print 'Found',i,'times, in', time.clock()-t,'clocks'
``````

to eliminate lots of string reallocating.

``````Searching   ! in "!!    !!!" ! !! !  !  ""!"! "
Found 23909 times, in 6.90686190563 clocks
Found 23909 times, in 0.220056739055 clocks
``````

still too slow...

David Eppstein (author) 21 years, 3 months ago

I agree that, as you found, with the addition of the pos argument, the built-in find is faster, and should be used when you are searching a string in-memory. If what you are searching is not a string, string.find() is not applicable, and this KMP code would be more useful.

Alexander Semenov 18 years, 1 month ago

It's can be much flexible then find. It's worth to notice, the biggest weak point - interpreted code gives in change big advantage in flexibility - one can change this function for case-insensetive searches or using wildcards.

Przemek Drochomirecki 16 years, 3 months ago

try this :-)

bigstr = 'A'10*7+'B'

instr = 'A'10*6+'B'

zwhitchcox 8 years, 1 month ago
``````for c in text:
while matchLen == len(pattern) or \
matchLen >= 0 and pattern[matchLen] != c:
startPos += shifts[matchLen]
matchLen -= shifts[matchLen]
matchLen += 1
if matchLen == len(pattern):
yield startPos
``````

^ this code doesn't make any sense. "for c in text". If you're just going to iterate through every character in the text, why even bother with the shift indices.

Still trying to wrap my head around this algorithm. I could be missing something of course

zwhitchcox 8 years, 1 month ago

Ah never mind. I see now.

 Created by David Eppstein on Fri, 1 Mar 2002 (PSF)