Welcome, guest | Sign In | My Account | Store | Cart

Space efficient, probabilistic set membership tester. Has no False Negatives but allows a rare False Positive.

Python, 70 lines
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
from array import array
from random import Random

class BloomFilter:
    # http://en.wikipedia.org/wiki/Bloom_filter

    def __init__(self, num_bits, num_probes, probe_func):
        self.num_bits= num_bits
        num_words = (num_bits + 31) // 32
        self.arr = array('L', [0]) * num_words
        self.num_probes = num_probes
        self.probe_func = get_probes

    def add(self, key):
        for i, mask in self.probe_func(self, key):
            self.arr[i] |= mask

    def match_template(self, bfilter):
        return (self.num_bits == bfilter.num_bits \
            and self.num_probes == bfilter.num_probes \
            and self.probe_func == bfilter.probe_func)

    def union(self, bfilter):
        if self.match_template(bfilter):
            self.arr = [a | b for a, b in zip(self.arr, bfilter.arr)]
        else:
            # Union b/w two unrelated bloom filter raises this
            raise ValueError("Mismatched bloom filters")

    def intersection(self, bfilter):
        if self.match_template(bfilter):
            self.arr = [a & b for a, b in zip(self.arr, bfilter.arr)]
        else:
            # Intersection b/w two unrelated bloom filter raises this
            raise ValueError("Mismatched bloom filters")

    def __contains__(self, key):
        return all(self.arr[i] & mask for i, mask in self.probe_func(self, key))

def get_probes(bfilter, key):
    hasher = Random(key).randrange
    for _ in range(bfilter.num_probes):
        array_index = hasher(len(bfilter.arr))
        bit_index = hasher(32)
        yield array_index, 1 << bit_index

if __name__ == '__main__':

    from random import sample
    from string import ascii_letters

    states = '''Alabama Alaska Arizona Arkansas California Colorado Connecticut
        Delaware Florida Georgia Hawaii Idaho Illinois Indiana Iowa Kansas
        Kentucky Louisiana Maine Maryland Massachusetts Michigan Minnesota
        Mississippi Missouri Montana Nebraska Nevada NewHampshire NewJersey
        NewMexico NewYork NorthCarolina NorthDakota Ohio Oklahoma Oregon
        Pennsylvania RhodeIsland SouthCarolina SouthDakota Tennessee Texas Utah
        Vermont Virginia Washington WestVirginia Wisconsin Wyoming'''.split()

    bf = BloomFilter(num_bits=1000, num_probes=14, probe_func=get_probes)
    for state in states:
        bf.add(state)

    m = sum(state in bf for state in states)
    print('%d true positives out of %d trials' % (m, len(states)))

    trials = 100000
    m = sum(''.join(sample(ascii_letters, 5)) in bf for i in range(trials))
    print('%d true negatives and %d false negatives out of %d trials'
          % (trials-m, m, trials))

The get_probes() method generates a series of hashed lookups into the bitarray. There are many ways to do this and it is possible to get a much faster hashing function. The pickle->sha512->int->divmod approach was chosen for this recipe because 1) it produces 512 uncorrelated bits, 2) it is easy to understand, and 3) it works in both Python2 and Python3.

probe_func is the input function for performing hashing. Currently it uses Random.randrange(), but can be changed to any implementation, like SHA for instance

The example code shows the names of fifty US states stored in a 1000 bit bloom filter using 14 probes. We verify that all 50 are found in the filter (guaranteed True Positives). We then try 100,000 random strings to see if they are in the filter. Typically, fewer than 1 in 10,000 return a False Positive.

Bloom filters are much more impressive with large datasets. For example, imagine the list of all correctly spelled words in the English language being stored in a bloom filter so a spell checker can be kept in memory and saving disk accesses only for likely misspellings.

5 comments

Kirk Strauser 12 years, 11 months ago  # | flag

Your false positive test at the end isn't quite right. First, %d false negatives should be %d false positives. Second, it's possible (if unlikely) that you could generate a valid state name, like 'Idaho'. I replaced the last four lines with:

trials = 100000
fp = 0
for trial in range(trials):
    while True:
        candidate = ''.join(sample(ascii_letters, 5))
        # If we accidentally found a real state, try again
        if candidate in states:
            continue
        if candidate in bf:
            fp += 1
        break
print('%d true negatives and %d false positives out of %d trials'
      % (trials-fp, fp, trials))

I think the odds of accidentally generating a valid state name here are only like 1:127,000,000, but it's easy enough to rule out.

Sundar Srinivasan (author) 12 years, 11 months ago  # | flag

That part is forked from here: http://code.activestate.com/recipes/577684-bloom-filter/, and I had noticed the same. I left it as it was because I assumed that since the odds were too low the author decided not to add the search in the list on every iteration. Anyway, thanks for adding the check; it made it better.

Kirk Strauser 12 years, 11 months ago  # | flag

I noticed that about 5 minutes after I posted. :-)

Dan Stromberg 12 years, 8 months ago  # | flag

I've taken this module and modified it. The main changes are:

  1. Constructor accepts ideal size and maximum error rate, and computes the rest from that
  2. Uses a better get_probes that gives a much more in-line-with-the-theory error rate
  3. Expanded the unit tests significantly
  4. Passes pylint

It's at http://stromberg.dnsalias.org/svn/bloom-filter/trunk/

Sundar Srinivasan (author) 12 years, 8 months ago  # | flag

@Dan, Looked at your code. Thanks for improving it.