Space efficient, probabilistic set membership tester. Has no False Negatives but allows a rare False Positive.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
from array import array from random import Random class BloomFilter: # http://en.wikipedia.org/wiki/Bloom_filter def __init__(self, num_bits, num_probes, probe_func): self.num_bits= num_bits num_words = (num_bits + 31) // 32 self.arr = array('L', ) * num_words self.num_probes = num_probes self.probe_func = get_probes def add(self, key): for i, mask in self.probe_func(self, key): self.arr[i] |= mask def match_template(self, bfilter): return (self.num_bits == bfilter.num_bits \ and self.num_probes == bfilter.num_probes \ and self.probe_func == bfilter.probe_func) def union(self, bfilter): if self.match_template(bfilter): self.arr = [a | b for a, b in zip(self.arr, bfilter.arr)] else: # Union b/w two unrelated bloom filter raises this raise ValueError("Mismatched bloom filters") def intersection(self, bfilter): if self.match_template(bfilter): self.arr = [a & b for a, b in zip(self.arr, bfilter.arr)] else: # Intersection b/w two unrelated bloom filter raises this raise ValueError("Mismatched bloom filters") def __contains__(self, key): return all(self.arr[i] & mask for i, mask in self.probe_func(self, key)) def get_probes(bfilter, key): hasher = Random(key).randrange for _ in range(bfilter.num_probes): array_index = hasher(len(bfilter.arr)) bit_index = hasher(32) yield array_index, 1 << bit_index if __name__ == '__main__': from random import sample from string import ascii_letters states = '''Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware Florida Georgia Hawaii Idaho Illinois Indiana Iowa Kansas Kentucky Louisiana Maine Maryland Massachusetts Michigan Minnesota Mississippi Missouri Montana Nebraska Nevada NewHampshire NewJersey NewMexico NewYork NorthCarolina NorthDakota Ohio Oklahoma Oregon Pennsylvania RhodeIsland SouthCarolina SouthDakota Tennessee Texas Utah Vermont Virginia Washington WestVirginia Wisconsin Wyoming'''.split() bf = BloomFilter(num_bits=1000, num_probes=14, probe_func=get_probes) for state in states: bf.add(state) m = sum(state in bf for state in states) print('%d true positives out of %d trials' % (m, len(states))) trials = 100000 m = sum(''.join(sample(ascii_letters, 5)) in bf for i in range(trials)) print('%d true negatives and %d false negatives out of %d trials' % (trials-m, m, trials))
The get_probes() method generates a series of hashed lookups into the bitarray. There are many ways to do this and it is possible to get a much faster hashing function. The pickle->sha512->int->divmod approach was chosen for this recipe because 1) it produces 512 uncorrelated bits, 2) it is easy to understand, and 3) it works in both Python2 and Python3.
probe_func is the input function for performing hashing. Currently it uses Random.randrange(), but can be changed to any implementation, like SHA for instance
The example code shows the names of fifty US states stored in a 1000 bit bloom filter using 14 probes. We verify that all 50 are found in the filter (guaranteed True Positives). We then try 100,000 random strings to see if they are in the filter. Typically, fewer than 1 in 10,000 return a False Positive.
Bloom filters are much more impressive with large datasets. For example, imagine the list of all correctly spelled words in the English language being stored in a bloom filter so a spell checker can be kept in memory and saving disk accesses only for likely misspellings.