Space efficient, probabilistic set membership tester. Has no False Negatives but allows a rare False Positive.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 | from array import array
from random import Random
class BloomFilter:
# http://en.wikipedia.org/wiki/Bloom_filter
def __init__(self, num_bits, num_probes, probe_func):
self.num_bits= num_bits
num_words = (num_bits + 31) // 32
self.arr = array('L', [0]) * num_words
self.num_probes = num_probes
self.probe_func = get_probes
def add(self, key):
for i, mask in self.probe_func(self, key):
self.arr[i] |= mask
def match_template(self, bfilter):
return (self.num_bits == bfilter.num_bits \
and self.num_probes == bfilter.num_probes \
and self.probe_func == bfilter.probe_func)
def union(self, bfilter):
if self.match_template(bfilter):
self.arr = [a | b for a, b in zip(self.arr, bfilter.arr)]
else:
# Union b/w two unrelated bloom filter raises this
raise ValueError("Mismatched bloom filters")
def intersection(self, bfilter):
if self.match_template(bfilter):
self.arr = [a & b for a, b in zip(self.arr, bfilter.arr)]
else:
# Intersection b/w two unrelated bloom filter raises this
raise ValueError("Mismatched bloom filters")
def __contains__(self, key):
return all(self.arr[i] & mask for i, mask in self.probe_func(self, key))
def get_probes(bfilter, key):
hasher = Random(key).randrange
for _ in range(bfilter.num_probes):
array_index = hasher(len(bfilter.arr))
bit_index = hasher(32)
yield array_index, 1 << bit_index
if __name__ == '__main__':
from random import sample
from string import ascii_letters
states = '''Alabama Alaska Arizona Arkansas California Colorado Connecticut
Delaware Florida Georgia Hawaii Idaho Illinois Indiana Iowa Kansas
Kentucky Louisiana Maine Maryland Massachusetts Michigan Minnesota
Mississippi Missouri Montana Nebraska Nevada NewHampshire NewJersey
NewMexico NewYork NorthCarolina NorthDakota Ohio Oklahoma Oregon
Pennsylvania RhodeIsland SouthCarolina SouthDakota Tennessee Texas Utah
Vermont Virginia Washington WestVirginia Wisconsin Wyoming'''.split()
bf = BloomFilter(num_bits=1000, num_probes=14, probe_func=get_probes)
for state in states:
bf.add(state)
m = sum(state in bf for state in states)
print('%d true positives out of %d trials' % (m, len(states)))
trials = 100000
m = sum(''.join(sample(ascii_letters, 5)) in bf for i in range(trials))
print('%d true negatives and %d false negatives out of %d trials'
% (trials-m, m, trials))
|
The get_probes() method generates a series of hashed lookups into the bitarray. There are many ways to do this and it is possible to get a much faster hashing function. The pickle->sha512->int->divmod approach was chosen for this recipe because 1) it produces 512 uncorrelated bits, 2) it is easy to understand, and 3) it works in both Python2 and Python3.
probe_func is the input function for performing hashing. Currently it uses Random.randrange(), but can be changed to any implementation, like SHA for instance
The example code shows the names of fifty US states stored in a 1000 bit bloom filter using 14 probes. We verify that all 50 are found in the filter (guaranteed True Positives). We then try 100,000 random strings to see if they are in the filter. Typically, fewer than 1 in 10,000 return a False Positive.
Bloom filters are much more impressive with large datasets. For example, imagine the list of all correctly spelled words in the English language being stored in a bloom filter so a spell checker can be kept in memory and saving disk accesses only for likely misspellings.
Your false positive test at the end isn't quite right. First,
%d false negatives
should be%d false positives
. Second, it's possible (if unlikely) that you could generate a valid state name, like 'Idaho'. I replaced the last four lines with:I think the odds of accidentally generating a valid state name here are only like 1:127,000,000, but it's easy enough to rule out.
That part is forked from here: http://code.activestate.com/recipes/577684-bloom-filter/, and I had noticed the same. I left it as it was because I assumed that since the odds were too low the author decided not to add the search in the list on every iteration. Anyway, thanks for adding the check; it made it better.
I noticed that about 5 minutes after I posted. :-)
I've taken this module and modified it. The main changes are:
It's at http://stromberg.dnsalias.org/svn/bloom-filter/trunk/
@Dan, Looked at your code. Thanks for improving it.