ActiveState Code

Recipe 252177: Find the common beginning in a list of strings


I came up with this when I tried to implement some autocompletion feature. The problem is to find a common substring (beginning from the start) for all strings in a given list.

I couldn't find an existing recipe that is doing this task, but I'm wondering if I just didn't look hard enough

Python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
def getcommonletters(strlist):
    return ''.join([x[0] for x in zip(*strlist) \
                     if reduce(lambda a,b:(a == b) and a or None,x)])

def findcommonstart(strlist):
    strlist = strlist[:]
    prev = None
    while True:
        common = getcommonletters(strlist)
        if common == prev:
            break
        strlist.append(common)
        prev = common

    return getcommonletters(strlist)

Discussion

The code works as follows (commonletters):

  1. the list of strings is converted to a list of tuples. Each tuple contains one character from each word from the same position. This is done by the builtin 'zip' function. Example: zip('hippie','hippo') == \ (('h','h'),('i','i'),('p','p'),('p','p'),('i','o'))

  2. the result is iterated over and every tuple is checked if its elements are identical. This is done by the (equally aclaimed and hated) reduce function. If this test is successful, one tupleelement is added to the resultlist.

  3. Now, we have a string, that contains all letters that are identical in all strings for every position.

Unfortunatelly, this is not enough (otherwise, this would be a cool one liner). If the strings in question had only identical letters at the beginning, we were done. But consider the strings ('hip','hop'). The result would be 'hp', which is clearly not the common start of both strings.

But, if we start over with the resultstring added to the stringlist as long as the resultstring changes, we'll get the one and only true result (a proof of that is left to the reader :-)

PLEASE note that the above statement was wrong with the old codeversion: <pre> def findcommonstart(strlist): strlist = strlist[:] common = getcommonletters(strlist) strlist.append(common) return getcommonletters(strlist) </pre>

The lambda function inside the reduce function does the same as the following regular function (and is, of course, more readable):

<pre> def myeq(op1,op2): if op1 == op2: return op1 </pre>

Please note that the algorithm does work equally well for all iterable types, not only strings (but remove the 'join' first)

Comments

  1. 1. At 2:26 a.m. on 3 dec 2003, Vivian De Smedt said:

    problem with the proof. I have a small problem with the proof you let to the user.

    I'am even asking if it exist considering the example

    hipop
    
    hapap
    

    which leads to hpp and so become:

    hipop
    
    hapap
    
    hpp
    

    which leads to hp which seems wrong.

  2. 2. At 4:06 a.m. on 3 dec 2003, Stephan Diehl (the author) said:

    You are right. I've changed the code to get a right result in these kind of cases (hope, I got it right this time)

  3. 3. At 4:35 a.m. on 3 dec 2003, Yannick Loitiere said:

    How about this ?

    import operator
    
    def allsame(seq):
        "Determines whether all the elements in a sequence are the same."
    
        # Compare all the elements to the first in the sequence,
        # then do a logical and (min) to see if they all matched.
        return min([elem==seq[0] for elem in seq]+[True])
    
    def getcommonstart(seqlist):
        """Returns the common prefix of a list of sequences.
    
        Raises an exception if the list is empty.
        """
        # if not seqlist: return None
        m = [allsame(seq) for seq in zip(*seqlist)] # Map the matching elements
        m.append(False)                             # Guard in case all the sequences match
        return seqlist[0][0:m.index(False)]         # Truncate before first mismatch
    
  4. 4. At 4:40 a.m. on 3 dec 2003, Yannick Loitiere said:

    import operator not needed. And you don't even need the operator module.

    I originally had

    def allsame(seq):
       return reduce(operator.and_,[elem==seq[0] for elem in seq],True)
    

    but I changed it to use min() instead and forgot to remove the import.

  5. 5. At 4:59 a.m. on 3 dec 2003, Stephan Diehl (the author) said:

    That's nice. What about a combined approach?:

    def getcommonletters(strlist):
        def booltrigger(expr):
            if not booltrigger.truth: return False
            if not expr:booltrigger.truth = False
            return expr
        booltrigger.truth = True
        return ''.join([x[0] for x in zip(*strlist) if \
                    booltrigger(min([x[0]==elem for elem in x]))])
    

    The 'booltrigger' is doing what should be possible with list comprehension anyway: Stop the iteration.

  6. 6. At 5:29 a.m. on 3 dec 2003, Yannick Loitiere said:

    Or you could use itertools. I believe that itertools.takewhile does something close enough to what you were trying to achieve with booltrigger.

    import itertools
    
    def allsame(seq):
        return min([elem==seq[0] for elem in seq]+[True])
    
    def getcommonstart(seqlist):
        letters = itertools.izip(*seqlist)                 # let's be lazy
        common = itertools.takewhile(allsame, letters)     # still lazy
        return ''.join([letters[0] for letters in common]) # merge back
    
  7. 7. At 5:36 a.m. on 3 dec 2003, Gustavo Niemeyer said:

    What about ... <p>... this:</p>

    >>> import os
    >>> os.path.commonprefix(["hippie", "hippo"])
    'hipp'
    >>>
    
  8. 8. At 5:59 a.m. on 3 dec 2003, Yannick Loitiere said:

    Lazy 'allsame'

    def allsame(seq):
        return not seq or False not in itertools.imap(seq[0].__eq__, seq)
    
  9. 9. At 6:13 a.m. on 3 dec 2003, Stephan Diehl (the author) said:

    Know thy libraries! Hmm, that's why I thought this should have been solved somewhere already. I'm happy though that this was not the first comment :-) Just for the interested, the posixpath.commonprefix looks like this:

    def commonprefix(m):
        "Given a list of pathnames, returns the longest common leading component"
        if not m: return ''
        prefix = m[0]
        for item in m:
            for i in range(len(prefix)):
                if prefix[:i+1] != item[:i+1]:
                    prefix = prefix[:i]
                    if i == 0:
                        return ''
                    break
        return prefix
    
  10. 10. At 8:26 p.m. on 3 dec 2003, Alfred Milgrom said:

    One-line version of this code. Not that it is very interesting, but I amused myself by creating a one-line version of findcommonstart as follows:

    def findcommonstart(strlist):
        return strlist[0][:([min([x[0]==elem for elem in x]) \
                             for x in zip(*strlist)]+[0]).index(0)]
    <pre>
    

    </pre>

  11. 11. At 7:50 a.m. on 4 dec 2003, Stephan Diehl (the author) said:

    corrected (and optimized) version.

    def getcommonstart(seq):
        if not seq:return ""
        seq.sort()
        s1, s2 = seq[0], seq[-1]
        l = min(len(s1), len(s2))
        if l == 0 :
            return ""
        for i in xrange(l) :
            if s1[i] != s2[i] :
                return s1[0:i]
        return s1[0:l]
    
  12. 12. At 11:22 a.m. on 17 dec 2003, Michael Dyck said:

    Avoid sort(). The sort() could be expensive, and isn't really needed, since you're only interested in the first and last element of the sorted sequence, i.e. the minimum and maximum elements of the sequence. So, change

    seq.sort()
    s1, s2 = seq[0], seq[-1]
    

    to

    s1 = min(seq)
    s2 = max(seq)
    
  13. 13. At 10:23 a.m. on 18 dec 2003, Michael Dyck said:

    Avoid sort(). [Somehow, this comment went elsewhere in the comment-tree yesterday.]

    The sort() could be expensive, and isn't really needed, since you're only interested in the first and last element of the sorted sequence, i.e. the minimum and maximum elements of the sequence. So, change

    seq.sort()
    s1, s2 = seq[0], seq[-1]
    

    to

    s1 = min(seq)
    s2 = max(seq)
    
  14. 14. At 7:59 p.m. on 26 dec 2003, Raymond Hettinger said:

    Binary search version. There are even fewer trips around the Python eval loop if the longest match is found using binary search instead of sequential comparison.

    def commonprefix(m):
        "Given a list of pathnames, returns the longest common leading component"
        if not m: return ''
        a, b = min(m), max(m)
        lo, hi = 0, min(len(a), len(b))
        while lo &lt; hi:
            mid = (lo+hi)//2 + 1
            if a[lo:mid] == b[lo:mid]:
                lo = mid
            else:
                hi = mid - 1
        return a[:hi]
    
  15. 15. At 6:19 a.m. on 27 dec 2003, Stephan Diehl (the author) said:

    sort not bad. Why should sort be slower than computing a min and a max? As far as my computational knowledge goes, in order to compute some minimum or maximum, the first thing to do is to sort anyway. Computing separate min and max actually slows things down.

    I've done some tests and it looks like Raymonds version with sort instead of min,max is indeed faster. (And by the way, is the quickest of all the shown algorithms)

  16. 16. At 8:10 a.m. on 27 dec 2003, Stephan Diehl (the author) said:

    correction. max,min seems indeed to be faster. I did the first test with presorted lists.

Sign in to comment