Welcome, guest | Sign In | My Account | Store | Cart

Provides a do-what-I-mean function, dwim_match, that takes two strings and returns True or False; the former if the two strings 'match', the latter if they don't. The matching algorithm is taken from SWI-Prolog's dwim_match/2 function.

Python, 137 lines
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
#$Id: $

from sets import Set
import re

""" 
From the SWI-Prolog manual:
dwim_match(+Atom1, +Atom2)

       Succeeds if Atom1 matches Atom2 in `Do What I Mean' sense. Both
       Atom1 and Atom2 may also be integers or floats. The two atoms
       match if:

        o They are identical
        o They differ by one character (spy == spu)
        o One character is inserted/deleted (debug == deug)

        o Two characters are transposed (trace == tarce)

        o `Sub-words' are glued differently (existsfile == existsFile
        == exists_file)

        o Two adjacent sub words are transposed (existsFile == fileExists)

Thanks for M.L. Hetland for writing a Levenshtein Distance function in Py.
Thanks to Bijan Parsia for pointing me to SWI-Prolog's DWIM algo.
"""

def dwim_match(s1, s2, degree=1): #passing a degree arg sounds nice, but is
                                  #probably dumb... the rest of the code will
                                  #break if degree != 1... needs to be reworked
    #the easy, obvious case
    if s1 == s2:
        return True

    #covers these cases:
    # - one character of diff
    # - one character inserted/deleted
    if ld(s1, s2) == degree:
        return True

    #transposition is trickier since it's ld == 2; so, maybe:
    if ld(s1, s2) == 2:
        if len(s1) == len(s2):
            return True #this fails on "pat" and "atp"

    #the two subword cases: diff gluings; transp'd adjacents
    w1 = split_words(s1)
    w2 = split_words(s2)

    if w1 and w2: #diff gluings
        if len(w1) == len(w2):
            if Set(map(lambda s: s.lower(), w1)) == Set(map(lambda s: s.lower(), w2)):
                return True #this may cover both subword cases!?
                            #for now, let's say it does....
    #give up
    return False

def split_words(s, other=False):
    # we consider 4 word separator cases:
    #  "_", "-", "+", camelCase; short of running through a dictionary, I don't
    #  know how to do the no separator case: foobar...
    if "_" in s: sep = "_"
    if "-" in s: sep = "-"
    if "+" in s: sep = "+"
    if other and other in s:
        sep = other
    try:
        if sep:
            return s.split(sep)
    except UnboundLocalError:
        return case_splitter(s)

def case_splitter(s):
    pattern = re.compile(r"""([A-Z][a-z]*)""")
    def nullp(str):
        if str != "": return True
        else: return False
    return filter(nullp, pattern.split(s))

def ld(a, b): #stolen from m.l. hetland
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n,m)) space
        a,b = b,a
        n,m = m,n
    current = xrange(n+1)
    for i in xrange(1,m+1):
        previous, current = current, [i]+[0] * m
        for j in xrange(1, n+1):
            add, delete = previous[j] + 1, current[j-1] + 1
            change = previous[j-1]
            if a[j-1] != b[i-1]:
                change +=1
            current[j] = min(add, delete, change)
    return current[n]

if __name__ == "__main__":


    s1 = s2 = "foobar"

    print "Testing: %s, %s" % (s1, s2)
    print dwim_match(s1, s2)

    s1 = "spy"; s2 = "spu"
    print "Testing: %s, %s" % (s1, s2)
    print dwim_match(s1, s2)
    
    s1 = "debug"; s2 = "deug"
    print "Testing: %s, %s" % (s1, s2)
    print dwim_match(s1, s2)
    
    s1 = "file_exists"; s2 = "file-exists"

    print "Testing: %s, %s" % (s1, s2)
    print dwim_match(s1, s2)
    
    s1 = "file+exists"; s2 = "file-exists"

    print "Testing: %s, %s" % (s1, s2)
    print dwim_match(s1, s2)
    
    s1 = "Bartles"; s2 = "bartles"
    
    print "Testing: %s,  %s" % (s1, s2)
    print dwim_match(s1, s2)
    
    s1 = "fileExists"; s2 = "existsFile"

    print "Testing: %s, %s" % (s1, s2)
    print dwim_match(s1, s2)
    
    s1 = "bartles"; s2 = "james"

    print "Testing: %s, %s" % (s1, s2)
    print dwim_match(s1, s2)

This is useful for loosely matching user input against known good values, for example. It catches many simple input errors, including misspellings, different multiword gluing approaches, transpositions, etc.

It uses Magnus Lie Hetland's Levenshtein Distance implementation (with slight updatings). There is an issue with character transpositions; there's an obvious, brute force way of handling them, but I wanted something a bit more elegant, and I haven't really found it yet. Help appreciated.

1 comment

Etienne Labuschagne 17 years, 2 months ago  # | flag

How about using difflib?

import difflib

differ = difflib.SequenceMatcher()
differ.set_seqs('firstMethod','methodFirst')
match = differ.quick_ratio() > 0.8

I used this very successfully to match contact details with existing contacts in a contact database with the typical spelling errors, punctuation and spacing discrepancies associated with this type of data.

Created by Kendall Clark on Mon, 1 Dec 2003 (PSF)
Python recipes (4591)
Kendall Clark's recipes (1)

Required Modules

Other Information and Tasks