I came up with this when I tried to implement an autocompletion feature. The problem is to find the common substring (starting from the beginning) of all strings in a given list.
I couldn't find an existing recipe that does this, but I wonder if I just didn't look hard enough.
    from functools import reduce  # reduce is no longer a builtin in Python 3

    def getcommonletters(strlist):
        return ''.join([x[0] for x in zip(*strlist)
                        if reduce(lambda a, b: (a == b) and a or None, x)])

    def findcommonstart(strlist):
        strlist = strlist[:]
        prev = None
        while True:
            common = getcommonletters(strlist)
            if common == prev:
                break
            strlist.append(common)
            prev = common
        return getcommonletters(strlist)
The code works as follows (getcommonletters):
the list of strings is converted to a list of tuples, each containing one character from every word at the same position. This is done by the builtin zip function. Example: zip('hippie', 'hippo') yields ('h','h'), ('i','i'), ('p','p'), ('p','p'), ('i','o').
the result is iterated over, and every tuple is checked for identical elements. This is done by the (equally acclaimed and hated) reduce function. If the test succeeds, one element of the tuple is added to the result list.
Now we have a string that contains, for every position, the letters that are identical in all strings.
Unfortunately, this is not enough (otherwise this would be a cool one-liner). If the strings in question had identical letters only at the beginning, we would be done. But consider the strings ('hip', 'hop'): the result would be 'hp', which is clearly not the common start of both strings.
But if we start over with the result string appended to the string list, for as long as the result string keeps changing, we get the one and only true result (a proof of that is left to the reader :-)
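To illustrate on the ('hip', 'hop') example, this is how the two functions behave:

    >>> getcommonletters(['hip', 'hop'])    # one pass keeps the stray 'p'
    'hp'
    >>> findcommonstart(['hip', 'hop'])     # iterating until stable
    'h'

After 'hp' is appended to the list, the second column of zip becomes ('i', 'o', 'p'), so only the leading 'h' survives the next pass.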
Please note that the above statement was wrong for the old code version:

    def findcommonstart(strlist):
        strlist = strlist[:]
        common = getcommonletters(strlist)
        strlist.append(common)
        return getcommonletters(strlist)
The lambda inside the reduce call does the same as the following regular function (which is, of course, more readable):

    def myeq(op1, op2):
        if op1 == op2:
            return op1
Please note that the algorithm works equally well for all iterable types, not only strings (but remove the join first); see the sketch below.
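As a sketch of that generalization (getcommonitems and _NOMATCH are names of my choosing): the (a == b) and a or None idiom silently drops positions whose common value is falsy (0, '', None), which never matters for single characters but does for general items, so this sketch compares against a sentinel instead:

    from functools import reduce

    _NOMATCH = object()  # sentinel: never compares equal to a real element

    def getcommonitems(seqlist):
        # keep a column's first element only if every element in the
        # column is equal; the explicit equality test avoids the falsy
        # pitfall of the (a == b) and a or None idiom
        return [col[0] for col in zip(*seqlist)
                if reduce(lambda a, b: a if a == b else _NOMATCH, col)
                is not _NOMATCH]

    getcommonitems([[0, 1, 7], [0, 2, 7]])   # -> [0, 7]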
Problem with the proof. I have a small problem with the proof you left to the reader. I even wonder whether one exists, considering an example where the first pass yields 'hpp'; appending that to the list, the next pass yields 'hp', which seems wrong.
You are right. I've changed the code to produce the right result in this kind of case (I hope I got it right this time).
How about this?
import operator not needed. And you don't even need the operator module.

I originally had a version that used it, but I changed it to use min() instead and forgot to remove the import.
That's nice. What about a combined approach?
The 'booltrigger' does what should be possible with a list comprehension anyway: stop the iteration.
Or you could use itertools. I believe that itertools.takewhile does something close enough to what you were trying to achieve with booltrigger.
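A minimal sketch of that takewhile idea (my own wording, not the poster's code; common_start is a name I chose): consume columns from zip while all characters in a column agree, then join the survivors:

    from itertools import takewhile

    def common_start(strlist):
        # take columns while every string has the same character there
        return ''.join(col[0] for col in
                       takewhile(lambda col: len(set(col)) == 1,
                                 zip(*strlist)))

Unlike the iterative recipe above, this stops at the first mismatching column, so no second pass is needed: common_start(['hip', 'hop']) returns 'h' directly.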
What about ... this:
Lazy 'allsame'
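A hedged sketch of what a lazy 'allsame' predicate could look like, assuming that is what the title refers to: it stops comparing as soon as the first mismatch appears, instead of materializing the whole column:

    def allsame(iterable):
        # lazily check that every element equals the first one;
        # all() short-circuits on the first mismatch
        it = iter(iterable)
        first = next(it)
        return all(x == first for x in it)

    # e.g. plugged into the takewhile version above:
    # takewhile(allsame, zip(*strlist))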
Know thy libraries! Hmm, that's why I thought this should have been solved somewhere already. I'm happy, though, that this was not the first comment :-) Just for the interested, posixpath.commonprefix looks like this:
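As best I recall, the Python 2.x-era posixpath implementation was this straightforward character-by-character walk (quoted from memory, so treat it as approximate):

    def commonprefix(m):
        "Given a list of pathnames, returns the longest common leading component"
        if not m: return ''
        prefix = m[0]
        for item in m:
            for i in range(len(prefix)):
                if prefix[:i+1] != item[:i+1]:
                    prefix = prefix[:i]
                    if i == 0:
                        return ''
                    break
        return prefix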
One-line version of this code. Not that it is very interesting, but I amused myself by creating a one-line version of findcommonstart, as follows:
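As a sketch (not necessarily the version that was posted), one single-expression findcommonstart, assuming a non-empty list, could find the index of the first mismatching column and slice:

    findcommonstart = lambda strlist: strlist[0][:next(
        (i for i, col in enumerate(zip(*strlist)) if len(set(col)) > 1),
        min(len(s) for s in strlist))]

    findcommonstart(['hippie', 'hippo'])   # -> 'hipp'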
Corrected (and optimized) version.
Avoid sort(). The sort() could be expensive and isn't really needed, since you're only interested in the first and last elements of the sorted sequence, i.e. the minimum and maximum of the sequence. So, change the sort-based lookup of the extremes to a min()/max() pair, as sketched below.
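A sketch of the suggested change, assuming the sort-based version bound the extremes like this (variable names are my guess):

    strlist = sorted(strlist)
    s1, s2 = strlist[0], strlist[-1]

    # suggested replacement: min()/max() each scan the list once (O(n))
    # instead of sorting it (O(n log n))
    s1, s2 = min(strlist), max(strlist)

The trick both versions rely on: the common prefix of all strings equals the common prefix of the lexicographic minimum and maximum.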
Binary search version. There are even fewer trips around the Python eval loop if the longest match is found using binary search instead of sequential comparison.
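A minimal sketch of that binary-search idea, searching for the longest prefix length on which the lexicographic extremes agree:

    def commonprefix(m):
        # sketch: the common prefix of all strings equals the common
        # prefix of min(m) and max(m); binary-search its length
        if not m:
            return ''
        s1, s2 = min(m), max(m)
        lo, hi = 0, len(s1)
        while lo < hi:
            mid = (lo + hi + 1) // 2    # probe the upper half first
            if s1[:mid] == s2[:mid]:
                lo = mid                # prefixes agree up to mid
            else:
                hi = mid - 1            # mismatch somewhere before mid
        return s1[:lo]

Each probe compares whole slices at C speed, so the number of trips around the Python eval loop is logarithmic in the prefix length rather than linear.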
Sort not bad. Why should sort() be slower than computing a min and a max? As far as my computational knowledge goes, in order to compute a minimum or maximum, the first thing to do is to sort anyway. Computing a separate min and max actually slows things down.
I've done some tests, and it looks like Raymond's version with sort() instead of min()/max() is indeed faster. (And, by the way, it is the quickest of all the algorithms shown.)
Correction. min()/max() indeed seems to be faster. I did the first test with presorted lists.
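A quick way to reproduce this kind of measurement (a sketch; the numbers will vary with data size, ordering, and Python version):

    import random
    import string
    import timeit

    words = [''.join(random.choices(string.ascii_lowercase, k=10))
             for _ in range(10000)]

    def extremes_sort(ws):
        s = sorted(ws)
        return s[0], s[-1]

    def extremes_minmax(ws):
        return min(ws), max(ws)

    print(timeit.timeit(lambda: extremes_sort(words), number=100))
    print(timeit.timeit(lambda: extremes_minmax(words), number=100))

On presorted input the adaptive sort does close to a single pass and can win; on shuffled input, the single-pass min()/max() usually does.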