Bloom Filters: The Magic Wand of Memory-Efficient Membership Testing
Introduction
Imagine you are a librarian managing thousands of books. When someone asks if a particular book exists in your collection, how do you answer without checking every shelf? Enter the Bloom Filter, your magical wand for efficient, space-saving membership testing.
A Bloom filter is a probabilistic data structure that answers a simple question: "Is this element part of the set?" It trades off a small chance of false positives for massive space efficiency and blazing speed.
In this blog, we’ll explore how Bloom filters work, look at a simple example, and implement one in Python. Let’s dive in!
How Does a Bloom Filter Work?
A Bloom filter uses:
- A bit array of size mmm (initialized to all 0s).
- kkk hash functions to map inputs to indices in the bit array.
Workflow:
-
Adding an element:
- Pass the element through kkk hash functions.
- Set the corresponding bits in the bit array to 1.
-
Checking membership:
- Pass the element through kkk hash functions.
- Check the bits at the hashed indices.
- If all bits are 1, the element might be in the set.
- If any bit is 0, the element is definitely not in the set.
Example: Adding Words to a Bloom Filter
Suppose we want to store these words:
"cat", "dog", "bird", "fish", "lion".
Hash Functions:
Let’s assume two simple hash functions:
- Hash1: f1(x)=sum of ASCII values of lettersmod mf_1(x) = \text{sum of ASCII values of letters} \mod mf1(x)=sum of ASCII values of lettersmodm
- Hash2: f2(x)=(sum of ASCII values×3)mod mf_2(x) = (\text{sum of ASCII values} \times 3) \mod mf2(x)=(sum of ASCII values×3)modm
Process:
-
Initialize a bit array of size m=10m = 10m=10:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
. -
Add "cat":
- Hash1("cat") → (99+97+116)mod 10=2(99 + 97 + 116) \mod 10 = 2(99+97+116)mod10=2.
- Hash2("cat") → ((99+97+116)×3)mod 10=6((99 + 97 + 116) \times 3) \mod 10 = 6((99+97+116)×3)mod10=6.
- Set bits at indices 2 and 6 to 1.
Bit array:[0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
.
-
Repeat for the other words. The final bit array looks like this:
[1, 0, 1, 1, 0, 1, 1, 0, 1, 0]
.
Pseudocode for a Simple Bloom Filter
Initialize bit_array of size m to 0
Choose k hash functions
# Add an element
function add(element):
for each hash_function in k:
index = hash_function(element) % m
bit_array[index] = 1
# Check membership
function check(element):
for each hash_function in k:
index = hash_function(element) % m
if bit_array[index] == 0:
return False # Definitely not in the set
return True # Might be in the set`
Python Code Implementation
import hashlib
class BloomFilter:
def __init__(self, size, num_hashes):
self.size = size # Size of the bit array
self.num_hashes = num_hashes # Number of hash functions
self.bit_array = [0] * size
def _hashes(self, item):
"""Generate multiple hash values for the item."""
hashes = []
for i in range(self.num_hashes):
# Use hashlib to create multiple hash functions
hash_val = int(hashlib.md5((item + str(i)).encode()).hexdigest(), 16)
hashes.append(hash_val % self.size)
return hashes
def add(self, item):
"""Add an item to the Bloom filter."""
for index in self._hashes(item):
self.bit_array[index] = 1
def check(self, item):
"""Check if an item might be in the Bloom filter."""
for index in self._hashes(item):
if self.bit_array[index] == 0:
return False # Definitely not in the set
return True # Might be in the set
# Example Usage
bf = BloomFilter(size=10, num_hashes=2)
# Adding words
words = ["cat", "dog", "bird", "fish", "lion"]
for word in words:
bf.add(word)
# Checking membership
print("Checking membership:")
print("cat:", bf.check("cat")) # True (might be in the set)
print("mantis:", bf.check("mantis")) # False (definitely not in the set)
print("panther:", bf.check("panther")) # True (might be in the set)
# Visualizing the bit array
print("\nBit array after adding elements:")
print(bf.bit_array)
Output of Program:
Checking membership: cat: True mantis: False panther: True Bit array after adding elements: [1, 1, 0, 1, 1, 1, 1, 1, 1, 0]
Application of Bloom Filters
- Web Crawling: Avoid revisiting URLs.
- Databases: Check existence of keys in a database cache.
- Networking: Prevent duplicate packets in routing.
- Blockchain: Efficient transaction lookup.