This is an implementation of the Knuth-Morris-Pratt algorithm for finding copies of a given pattern as a contiguous subsequence of a larger text. Since KMP accesses the text only sequentially, it is natural to implement it in a way that allows the text to be an arbitrary iterator. After a preprocessing stage which takes time linear in the length of the pattern, each text symbol is processed in constant amortized time.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 | # Knuth-Morris-Pratt string matching
# David Eppstein, UC Irvine, 1 Mar 2002
from __future__ import generators
def KnuthMorrisPratt(text, pattern):
'''Yields all starting positions of copies of the pattern in the text.
Calling conventions are similar to string.find, but its arguments can be
lists or iterators, not just strings, it returns all matches, not just
the first one, and it does not need the whole text in memory at once.
Whenever it yields, it will have read the text exactly up to and including
the match that caused the yield.'''
# allow indexing into pattern and protect against change during yield
pattern = list(pattern)
# build table of shift amounts
shifts = [1] * (len(pattern) + 1)
shift = 1
for pos in range(len(pattern)):
while shift <= pos and pattern[pos] != pattern[pos-shift]:
shift += shifts[pos-shift]
shifts[pos+1] = shift
# do the actual search
startPos = 0
matchLen = 0
for c in text:
while matchLen == len(pattern) or \
matchLen >= 0 and pattern[matchLen] != c:
startPos += shifts[matchLen]
matchLen -= shifts[matchLen]
matchLen += 1
if matchLen == len(pattern):
yield startPos
|
An explanation of why this works can be found in many algorithms texts, for instance Cormen, Leiserson, Rivest, and Stein, _Introduction to Algorithms_.
benchmarks. It's looking cool, but in my tests 500-700 times slow:
Output:
Re: benchmarks. Thanks for the comment. Partly, the slowdown you saw is the penalty for using interpreted code instead of built-in functions. But partly, you chose parameters that would make the built-ins look good and my KMP implementation look bad: your code pays a linear-time penalty every time it finds a match, while increasing the number of matches does not slow my function down much. Random data does not have a large number of matches, compared to most situations where one would use a string matching routine. If I change your test code only very slightly, to limit the character set in your random strings to four characters (increasing the number of matches, and also increasing the similarity to biological data), the statistics look very different:
My routine takes roughly the same time as I was getting on your original example, while using your loop with find and slice is now slower than my routine by a factor of eight.
Now I see, it's more effective on often matches. But to be correct, second cycle must be changed to:
to eliminate lots of string reallocating.
still too slow...
I agree that, as you found, with the addition of the pos argument, the built-in find is faster, and should be used when you are searching a string in-memory. If what you are searching is not a string, string.find() is not applicable, and this KMP code would be more useful.
It's can be much flexible then find. It's worth to notice, the biggest weak point - interpreted code gives in change big advantage in flexibility - one can change this function for case-insensetive searches or using wildcards.
try this :-)
bigstr = 'A'10*7+'B'
instr = 'A'10*6+'B'
^ this code doesn't make any sense. "for c in text". If you're just going to iterate through every character in the text, why even bother with the shift indices.
Still trying to wrap my head around this algorithm. I could be missing something of course
Ah never mind. I see now.