This is an implementation of the Knuth-Morris-Pratt algorithm for finding copies of a given pattern as a contiguous subsequence of a larger text. Since KMP accesses the text only sequentially, it is natural to implement it in a way that allows the text to be an arbitrary iterator. After a preprocessing stage which takes time linear in the length of the pattern, each text symbol is processed in constant amortized time.
    # Knuth-Morris-Pratt string matching
    # David Eppstein, UC Irvine, 1 Mar 2002

    from __future__ import generators

    def KnuthMorrisPratt(text, pattern):

        '''Yields all starting positions of copies of the pattern in the text.
        Calling conventions are similar to string.find, but its arguments can be
        lists or iterators, not just strings, it returns all matches, not just
        the first one, and it does not need the whole text in memory at once.
        Whenever it yields, it will have read the text exactly up to and including
        the match that caused the yield.'''

        # allow indexing into pattern and protect against change during yield
        pattern = list(pattern)

        # build table of shift amounts
        shifts = [1] * (len(pattern) + 1)
        shift = 1
        for pos in range(len(pattern)):
            while shift <= pos and pattern[pos] != pattern[pos-shift]:
                shift += shifts[pos-shift]
            shifts[pos+1] = shift

        # do the actual search
        startPos = 0
        matchLen = 0
        for c in text:
            while matchLen == len(pattern) or \
                  matchLen >= 0 and pattern[matchLen] != c:
                startPos += shifts[matchLen]
                matchLen -= shifts[matchLen]
            matchLen += 1
            if matchLen == len(pattern):
                yield startPos
An explanation of why this works can be found in many algorithms texts, for instance Cormen, Leiserson, Rivest, and Stein, _Introduction to Algorithms_.
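As a small illustration of the two stages, the sketch below splits the same logic into two helpers and runs it on a lazily generated text. The names `build_shifts`, `find_all`, and `chars` are hypothetical and chosen only for this example; they are not part of the recipe. The shift table can be read as follows: `shifts[k]` is how far the pattern slides forward when `k` symbols have matched and the next one fails (or when a full match has just been reported).

```python
def build_shifts(pattern):
    """Hypothetical helper: the shift table, built as in the preprocessing stage."""
    shifts = [1] * (len(pattern) + 1)
    shift = 1
    for pos in range(len(pattern)):
        # Grow the shift until the pattern agrees with its own shifted copy.
        while shift <= pos and pattern[pos] != pattern[pos - shift]:
            shift += shifts[pos - shift]
        shifts[pos + 1] = shift
    return shifts


def find_all(text, pattern):
    """Hypothetical helper: the search stage, reading text one symbol at a time."""
    pattern = list(pattern)
    shifts = build_shifts(pattern)
    start = 0      # text position where the current partial match begins
    matched = 0    # number of pattern symbols matched so far
    for c in text:
        while matched == len(pattern) or (matched >= 0 and pattern[matched] != c):
            start += shifts[matched]
            matched -= shifts[matched]
        matched += 1
        if matched == len(pattern):
            yield start


def chars(s, log):
    """Yield the symbols of s lazily, recording each index as it is consumed."""
    for i, ch in enumerate(s):
        log.append(i)
        yield ch


print(build_shifts("abab"))          # [1, 1, 2, 2, 2]

consumed = []
matches = find_all(chars("abababa", consumed), "abab")
first = next(matches)
print(first, consumed)               # 0 [0, 1, 2, 3]
rest = list(matches)
print(rest, len(consumed))           # [2] 7
```

After the first match is reported, only the four symbols that make up that match have been pulled from the generator, which is the behavior the docstring describes. The same calls work for lists or any other iterable of comparable items.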