Welcome, guest | Sign In | My Account | Store | Cart

This is more an ease of use subclass of dict - rather then one that uses a lot of dict features.

If your requested item is in the dictionary (with a key that hashes the same) then it acts as a more or less normal dictionary.

If on the other hand you are looking for a string that is similar to a key in the dictionary, then the class iterates over all the keys and finds the one that is the closest match.

Python, 196 lines
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
"""Match items in a dictionary using fuzzy matching

Implemented for pywinauto.

This class uses difflib to match strings.
This class uses a linear search to find the items as it HAS to iterate over
every item in the dictionary (otherwise it would not be possible to know which
is the 'best' match).

If the exact item is in the dictionary (no fuzzy matching needed - then it
doesn't do the linear search and speed should be similar to standard Python
dictionaries.

>>> fuzzywuzzy = FuzzyDict({"hello" : "World", "Hiya" : 2, "Here you are" : 3})
>>> fuzzywuzzy['Me again'] = [1,2,3]
>>>
>>> fuzzywuzzy['Hi']
2
>>>
>>>
>>> # next one doesn't match well enough - so a key error is raised
...
>>> fuzzywuzzy['There']
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "pywinauto\fuzzydict.py", line 125, in __getitem__
    raise KeyError(
KeyError: "'There'. closest match: 'hello' with ratio 0.400"
>>>
>>> fuzzywuzzy['you are']
3
>>> fuzzywuzzy['again']
[1, 2, 3]
>>>
"""

__revision__ = "$Rev$"

import difflib

class FuzzyDict(dict):
    "Provides a dictionary that performs fuzzy lookup"
    def __init__(self, items = None, cutoff = .6):
        """Construct a new FuzzyDict instance

        items is an dictionary to copy items from (optional)
        cutoff is the match ratio below which mathes should not be considered
        cutoff needs to be a float between 0 and 1 (where zero is no match
        and 1 is a perfect match)"""
        super(FuzzyDict, self).__init__()

        if items:
            self.update(items)
        self.cutoff =  cutoff

        # short wrapper around some super (dict) methods
        self._dict_contains = lambda key: \
            super(FuzzyDict,self).__contains__(key)

        self._dict_getitem = lambda key: \
            super(FuzzyDict,self).__getitem__(key)

    def _search(self, lookfor, stop_on_first = False):
        """Returns the value whose key best matches lookfor

        if stop_on_first is True then the method returns as soon
        as it finds the first item
        """

        # if the item is in the dictionary then just return it
        if self._dict_contains(lookfor):
            return True, lookfor, self._dict_getitem(lookfor), 1

        # set up the fuzzy matching tool
        ratio_calc = difflib.SequenceMatcher()
        ratio_calc.set_seq1(lookfor)

        # test each key in the dictionary
        best_ratio = 0
        best_match = None
        best_key = None
        for key in self:

            # if the current key is not a string
            # then we just skip it
            try:
                # set up the SequenceMatcher with other text
                ratio_calc.set_seq2(key)
            except TypeError:
                continue

            # we get an error here if the item to look for is not a
            # string - if it cannot be fuzzy matched and we are here
            # this it is defintely not in the dictionary
            try:
            # calculate the match value
                ratio = ratio_calc.ratio()
            except TypeError:
                break

            # if this is the best ratio so far - save it and the value
            if ratio > best_ratio:
                best_ratio = ratio
                best_key = key
                best_match = self._dict_getitem(key)

            if stop_on_first and ratio >= self.cutoff:
                break

        return (
            best_ratio >= self.cutoff,
            best_key,
            best_match,
            best_ratio)


    def __contains__(self, item):
        "Overides Dictionary __contains__ to use fuzzy matching"
        if self._search(item, True)[0]:
            return True
        else:
            return False

    def __getitem__(self, lookfor):
        "Overides Dictionary __getitem__ to use fuzzy matching"
        matched, key, item, ratio = self._search(lookfor)

        if not matched:
            raise KeyError(
                "'%s'. closest match: '%s' with ratio %.3f"%
                    (str(lookfor), str(key), ratio))

        return item



if __name__ == '__main__':
    import unittest

    class FuzzyTestCase(unittest.TestCase):
        "Perform some tests"
        test_dict = {
            'Hiya'  : 1,
            u'hiy\xe4' : 2,
            'test3' : 3,
            1: 324}


        def testCreation_Empty(self):
            "Verify that not specifying any values creates an empty dictionary"
            fd = FuzzyDict()

            self.assertEquals(fd, {})

        def testCreation_Dict(self):
            "Test creating a fuzzy dict"
            fd = FuzzyDict(self.test_dict)
            self.assertEquals(fd, self.test_dict)
            self.assertEquals(self.test_dict['Hiya'], fd['hiya'])

            fd2 = FuzzyDict(self.test_dict, cutoff = .8)
            self.assertEquals(fd, self.test_dict)
            self.assertRaises(KeyError, fd2.__getitem__, 'hiya')


        def testContains(self):
            "Test checking if an item is in a FuzzyDict"
            fd = FuzzyDict(self.test_dict)

            self.assertEquals(True, fd.__contains__('hiya'))

            self.assertEquals(True, fd.__contains__(u'test3'))

            self.assertEquals(True, fd.__contains__(u'hiy\xe4'))

            self.assertEquals(False, fd.__contains__('FuzzyWuzzy'))

            self.assertEquals(True, fd.__contains__(1))

            self.assertEquals(False, fd.__contains__(23))


        def testGetItem(self):
            "Test getting items from a FuzzyDict"
            fd = FuzzyDict(self.test_dict)

            self.assertEquals(self.test_dict["Hiya"], fd['hiya'])
            self.assertRaises(KeyError, fd.__getitem__, 'FuzzyWuzzy')

            fd2 = FuzzyDict(self.test_dict, cutoff = .14)

            self.assertEquals(1, fd2['FuzzyWuzzy'])
            self.assertEquals(324, fd2[1])
            self.assertRaises(KeyError, fd2.__getitem__, 23)

    unittest.main()

I already had a function that used difflib - but when doing some refactoring - I thought that the syntax used in a dictionary would match better the way I was using it.

As such it might be useful to others out there :-)

This was implemented for pywinauto https://sourceforge.net/projects/pywinauto/

3 comments

Nicolas Lehuen 18 years ago  # | flag

High performance alternative. Hi,

If you need a similar data structure with high-performance, have a look at pytst, which includes a close_match method which is similar to this feature. pytst uses a Ternary Search Tree index and is implemented in C++, with a SWIG Python wrapper.

pytst also include high-performance algorithms for wildcard matches, prefix matches, etc.

Unfortunately, documentation is quite scarce. I'll work on it Real Soon Now.

http://nicolas.lehuen.com/index.php/Pytst

Grant Jenks 9 years, 6 months ago  # | flag

If you simply want to know the prev/next string in a sort order, the sortedcontainers module provides an API. It implements sorted list, sorted dict, and sorted set data types in pure-Python and is fast-as-C implementations (even faster!). Learn more about sortedcontainers, available on PyPI and github.

Muhammad Haziq Bin Hasbulah 8 years, 9 months ago  # | flag

i don't understand how it works. is this the expected output? *

....

Ran 4 tests in 0.004s

OK * can somebody help me.... i really need help