I came up with this when I tried to implement an autocompletion feature. The problem is to find the common substring (starting from the beginning) of all strings in a given list.
I couldn't find an existing recipe that does this, but I wonder if I just didn't look hard enough.
    from functools import reduce  # reduce is no longer a builtin in Python 3

    def getcommonletters(strlist):
        return ''.join([x[0] for x in zip(*strlist)
                        if reduce(lambda a, b: (a == b) and a or None, x)])

    def findcommonstart(strlist):
        strlist = strlist[:]
        prev = None
        while True:
            common = getcommonletters(strlist)
            if common == prev:
                break
            strlist.append(common)
            prev = common
        return getcommonletters(strlist)
The code works as follows (getcommonletters):
the list of strings is converted to a list of tuples, each containing one character from every word at the same position. This is done by the builtin zip function. Example: zip('hippie', 'hippo') yields ('h','h'), ('i','i'), ('p','p'), ('p','p'), ('i','o').
the result is iterated over, and every tuple is checked for identical elements. This is done by the (equally acclaimed and hated) reduce function. If the test succeeds, one element of the tuple is added to the result list.
Now we have a string that contains, for every position, the letters that are identical in all strings.
Unfortunately, this is not enough (otherwise this would be a cool one-liner). If the strings in question had identical letters only at the beginning, we would be done. But consider the strings ('hip', 'hop'): the result would be 'hp', which is clearly not the common start of both strings.
But if we start over with the result string appended to the string list, for as long as the result string keeps changing, we get the one and only true result (a proof of that is left to the reader :-)
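To illustrate on the ('hip', 'hop') example, this is how the two functions behave:

    >>> getcommonletters(['hip', 'hop'])    # one pass keeps the stray 'p'
    'hp'
    >>> findcommonstart(['hip', 'hop'])     # iterating until stable
    'h'

After 'hp' is appended to the list, the second column of zip becomes ('i', 'o', 'p'), so only the leading 'h' survives the next pass.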
Please note that the above statement was wrong for the old code version:

    def findcommonstart(strlist):
        strlist = strlist[:]
        common = getcommonletters(strlist)
        strlist.append(common)
        return getcommonletters(strlist)
The lambda inside the reduce call does the same as the following regular function (which is, of course, more readable):

    def myeq(op1, op2):
        if op1 == op2:
            return op1
Please note that the algorithm works equally well for all iterable types, not only strings (but remove the join first); see the sketch below.
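As a sketch of that generalization (getcommonitems and _NOMATCH are names of my choosing): the (a == b) and a or None idiom silently drops positions whose common value is falsy (0, '', None), which never matters for single characters but does for general items, so this sketch compares against a sentinel instead:

    from functools import reduce

    _NOMATCH = object()  # sentinel: never compares equal to a real element

    def getcommonitems(seqlist):
        # keep a column's first element only if every element in the
        # column is equal; the explicit equality test avoids the falsy
        # pitfall of the (a == b) and a or None idiom
        return [col[0] for col in zip(*seqlist)
                if reduce(lambda a, b: a if a == b else _NOMATCH, col)
                is not _NOMATCH]

    getcommonitems([[0, 1, 7], [0, 2, 7]])   # -> [0, 7]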
Problem with the proof. I have a small problem with the proof you left to the reader. I even wonder whether one exists, considering an example where the first pass yields 'hpp'; appending that to the list, the next pass yields 'hp', which seems wrong.
You are right. I've changed the code to produce the right result in this kind of case (I hope I got it right this time).
How about this?
import operator not needed. And you don't even need the operator module.

I originally had a version that used it, but I changed it to use min() instead and forgot to remove the import.
That's nice. What about a combined approach?
The 'booltrigger' does what should be possible with a list comprehension anyway: stop the iteration.
Or you could use itertools. I believe that itertools.takewhile does something close enough to what you were trying to achieve with booltrigger.
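A minimal sketch of that takewhile idea (my own wording, not the poster's code; common_start is a name I chose): consume columns from zip while all characters in a column agree, then join the survivors:

    from itertools import takewhile

    def common_start(strlist):
        # take columns while every string has the same character there
        return ''.join(col[0] for col in
                       takewhile(lambda col: len(set(col)) == 1,
                                 zip(*strlist)))

Unlike the iterative recipe above, this stops at the first mismatching column, so no second pass is needed: common_start(['hip', 'hop']) returns 'h' directly.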
What about ... this:
Lazy 'allsame'
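A hedged sketch of what a lazy 'allsame' predicate could look like, assuming that is what the title refers to: it stops comparing as soon as the first mismatch appears, instead of materializing the whole column:

    def allsame(iterable):
        # lazily check that every element equals the first one;
        # all() short-circuits on the first mismatch
        it = iter(iterable)
        first = next(it)
        return all(x == first for x in it)

    # e.g. plugged into the takewhile version above:
    # takewhile(allsame, zip(*strlist))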
Know thy libraries! Hmm, that's why I thought this should have been solved somewhere already. I'm happy, though, that this was not the first comment :-) Just for the interested, posixpath.commonprefix looks like this:
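As best I recall, the Python 2.x-era posixpath implementation was this straightforward character-by-character walk (quoted from memory, so treat it as approximate):

    def commonprefix(m):
        "Given a list of pathnames, returns the longest common leading component"
        if not m: return ''
        prefix = m[0]
        for item in m:
            for i in range(len(prefix)):
                if prefix[:i+1] != item[:i+1]:
                    prefix = prefix[:i]
                    if i == 0:
                        return ''
                    break
        return prefix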
One-line version of this code. Not that it is very interesting, but I amused myself by creating a one-line version of findcommonstart, as follows:
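As a sketch (not necessarily the version that was posted), one single-expression findcommonstart, assuming a non-empty list, could find the index of the first mismatching column and slice:

    findcommonstart = lambda strlist: strlist[0][:next(
        (i for i, col in enumerate(zip(*strlist)) if len(set(col)) > 1),
        min(len(s) for s in strlist))]

    findcommonstart(['hippie', 'hippo'])   # -> 'hipp'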
Corrected (and optimized) version.
Avoid sort(). The sort() could be expensive and isn't really needed, since you're only interested in the first and last elements of the sorted sequence, i.e. the minimum and maximum of the sequence. So, change the sort-based lookup of the extremes to a min()/max() pair, as sketched below.
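A sketch of the suggested change, assuming the sort-based version bound the extremes like this (variable names are my guess):

    strlist = sorted(strlist)
    s1, s2 = strlist[0], strlist[-1]

    # suggested replacement: min()/max() each scan the list once (O(n))
    # instead of sorting it (O(n log n))
    s1, s2 = min(strlist), max(strlist)

The trick both versions rely on: the common prefix of all strings equals the common prefix of the lexicographic minimum and maximum.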
Binary search version. There are even fewer trips around the Python eval loop if the longest match is found using binary search instead of sequential comparison.
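A minimal sketch of that binary-search idea, searching for the longest prefix length on which the lexicographic extremes agree:

    def commonprefix(m):
        # sketch: the common prefix of all strings equals the common
        # prefix of min(m) and max(m); binary-search its length
        if not m:
            return ''
        s1, s2 = min(m), max(m)
        lo, hi = 0, len(s1)
        while lo < hi:
            mid = (lo + hi + 1) // 2    # probe the upper half first
            if s1[:mid] == s2[:mid]:
                lo = mid                # prefixes agree up to mid
            else:
                hi = mid - 1            # mismatch somewhere before mid
        return s1[:lo]

Each probe compares whole slices at C speed, so the number of trips around the Python eval loop is logarithmic in the prefix length rather than linear.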
Sort not bad. Why should sort() be slower than computing a min and a max? As far as my computational knowledge goes, in order to compute a minimum or maximum, the first thing to do is to sort anyway. Computing a separate min and max actually slows things down.
I've done some tests, and it looks like Raymond's version with sort() instead of min()/max() is indeed faster. (And, by the way, it is the quickest of all the algorithms shown.)
Correction. min()/max() indeed seems to be faster. I did the first test with presorted lists.
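A quick way to reproduce this kind of measurement (a sketch; the numbers will vary with data size, ordering, and Python version):

    import random
    import string
    import timeit

    words = [''.join(random.choices(string.ascii_lowercase, k=10))
             for _ in range(10000)]

    def extremes_sort(ws):
        s = sorted(ws)
        return s[0], s[-1]

    def extremes_minmax(ws):
        return min(ws), max(ws)

    print(timeit.timeit(lambda: extremes_sort(words), number=100))
    print(timeit.timeit(lambda: extremes_minmax(words), number=100))

On presorted input the adaptive sort does close to a single pass and can win; on shuffled input, the single-pass min()/max() usually does.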