Welcome, guest | Sign In | My Account | Store | Cart

Space efficient, probabilistic set membership tester. Has no False Negatives but allows a rare False Positive.

Python, 70 lines
 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70``` ```from array import array from random import Random class BloomFilter: # http://en.wikipedia.org/wiki/Bloom_filter def __init__(self, num_bits, num_probes, probe_func): self.num_bits= num_bits num_words = (num_bits + 31) // 32 self.arr = array('L', [0]) * num_words self.num_probes = num_probes self.probe_func = get_probes def add(self, key): for i, mask in self.probe_func(self, key): self.arr[i] |= mask def match_template(self, bfilter): return (self.num_bits == bfilter.num_bits \ and self.num_probes == bfilter.num_probes \ and self.probe_func == bfilter.probe_func) def union(self, bfilter): if self.match_template(bfilter): self.arr = [a | b for a, b in zip(self.arr, bfilter.arr)] else: # Union b/w two unrelated bloom filter raises this raise ValueError("Mismatched bloom filters") def intersection(self, bfilter): if self.match_template(bfilter): self.arr = [a & b for a, b in zip(self.arr, bfilter.arr)] else: # Intersection b/w two unrelated bloom filter raises this raise ValueError("Mismatched bloom filters") def __contains__(self, key): return all(self.arr[i] & mask for i, mask in self.probe_func(self, key)) def get_probes(bfilter, key): hasher = Random(key).randrange for _ in range(bfilter.num_probes): array_index = hasher(len(bfilter.arr)) bit_index = hasher(32) yield array_index, 1 << bit_index if __name__ == '__main__': from random import sample from string import ascii_letters states = '''Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware Florida Georgia Hawaii Idaho Illinois Indiana Iowa Kansas Kentucky Louisiana Maine Maryland Massachusetts Michigan Minnesota Mississippi Missouri Montana Nebraska Nevada NewHampshire NewJersey NewMexico NewYork NorthCarolina NorthDakota Ohio Oklahoma Oregon Pennsylvania RhodeIsland SouthCarolina SouthDakota Tennessee Texas Utah Vermont Virginia Washington WestVirginia Wisconsin Wyoming'''.split() bf = BloomFilter(num_bits=1000, num_probes=14, probe_func=get_probes) for state in states: bf.add(state) m = sum(state in bf for state in states) print('%d true positives out of %d trials' % (m, len(states))) trials = 100000 m = sum(''.join(sample(ascii_letters, 5)) in bf for i in range(trials)) print('%d true negatives and %d false negatives out of %d trials' % (trials-m, m, trials)) ```

The get_probes() method generates a series of hashed lookups into the bitarray. There are many ways to do this and it is possible to get a much faster hashing function. The pickle->sha512->int->divmod approach was chosen for this recipe because 1) it produces 512 uncorrelated bits, 2) it is easy to understand, and 3) it works in both Python2 and Python3.

probe_func is the input function for performing hashing. Currently it uses Random.randrange(), but can be changed to any implementation, like SHA for instance

The example code shows the names of fifty US states stored in a 1000 bit bloom filter using 14 probes. We verify that all 50 are found in the filter (guaranteed True Positives). We then try 100,000 random strings to see if they are in the filter. Typically, fewer than 1 in 10,000 return a False Positive.

Bloom filters are much more impressive with large datasets. For example, imagine the list of all correctly spelled words in the English language being stored in a bloom filter so a spell checker can be kept in memory and saving disk accesses only for likely misspellings.

Kirk Strauser 13 years ago

Your false positive test at the end isn't quite right. First, `%d false negatives` should be `%d false positives`. Second, it's possible (if unlikely) that you could generate a valid state name, like 'Idaho'. I replaced the last four lines with:

``````trials = 100000
fp = 0
for trial in range(trials):
while True:
candidate = ''.join(sample(ascii_letters, 5))
# If we accidentally found a real state, try again
if candidate in states:
continue
if candidate in bf:
fp += 1
break
print('%d true negatives and %d false positives out of %d trials'
% (trials-fp, fp, trials))
``````

I think the odds of accidentally generating a valid state name here are only like 1:127,000,000, but it's easy enough to rule out.

That part is forked from here: http://code.activestate.com/recipes/577684-bloom-filter/, and I had noticed the same. I left it as it was because I assumed that since the odds were too low the author decided not to add the search in the list on every iteration. Anyway, thanks for adding the check; it made it better.

Kirk Strauser 13 years ago

I noticed that about 5 minutes after I posted. :-)

Dan Stromberg 12 years, 8 months ago

I've taken this module and modified it. The main changes are:

1. Constructor accepts ideal size and maximum error rate, and computes the rest from that
2. Uses a better get_probes that gives a much more in-line-with-the-theory error rate
3. Expanded the unit tests significantly
4. Passes pylint
Sundar Srinivasan (author) 12 years, 8 months ago

@Dan, Looked at your code. Thanks for improving it.

 Created by Sundar Srinivasan on Wed, 4 May 2011 (MIT)