Bloom Filters: The Magic Wand of Memory-Efficient Membership Testing

You,system design

Introduction

Imagine you are a librarian managing thousands of books. When someone asks if a particular book exists in your collection, how do you answer without checking every shelf? Enter the Bloom Filter, your magical wand for efficient, space-saving membership testing.

A Bloom filter is a probabilistic data structure that answers a simple question: "Is this element part of the set?" It trades off a small chance of false positives for massive space efficiency and blazing speed.

In this blog, we’ll explore how Bloom filters work, look at a simple example, and implement one in Python. Let’s dive in!


How Does a Bloom Filter Work?

A Bloom filter uses:

  1. A bit array of size mmm (initialized to all 0s).
  2. kkk hash functions to map inputs to indices in the bit array.

Workflow:

  1. Adding an element:

    • Pass the element through kkk hash functions.
    • Set the corresponding bits in the bit array to 1.
  2. Checking membership:

    • Pass the element through kkk hash functions.
    • Check the bits at the hashed indices.
    • If all bits are 1, the element might be in the set.
    • If any bit is 0, the element is definitely not in the set.

Example: Adding Words to a Bloom Filter

Suppose we want to store these words:
"cat", "dog", "bird", "fish", "lion".

Hash Functions:

Let’s assume two simple hash functions:

  1. Hash1: f1(x)=sum of ASCII values of lettersmod  mf_1(x) = \text{sum of ASCII values of letters} \mod mf1​(x)=sum of ASCII values of lettersmodm
  2. Hash2: f2(x)=(sum of ASCII values×3)mod  mf_2(x) = (\text{sum of ASCII values} \times 3) \mod mf2​(x)=(sum of ASCII values×3)modm

Process:

  1. Initialize a bit array of size m=10m = 10m=10:
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0].

  2. Add "cat":

    • Hash1("cat") → (99+97+116)mod  10=2(99 + 97 + 116) \mod 10 = 2(99+97+116)mod10=2.
    • Hash2("cat") → ((99+97+116)×3)mod  10=6((99 + 97 + 116) \times 3) \mod 10 = 6((99+97+116)×3)mod10=6.
    • Set bits at indices 2 and 6 to 1.
      Bit array: [0, 0, 1, 0, 0, 0, 1, 0, 0, 0].
  3. Repeat for the other words. The final bit array looks like this:
    [1, 0, 1, 1, 0, 1, 1, 0, 1, 0].


Pseudocode for a Simple Bloom Filter

Initialize bit_array of size m to 0
Choose k hash functions

# Add an element
function add(element):
    for each hash_function in k:
        index = hash_function(element) % m
        bit_array[index] = 1

# Check membership
function check(element):
    for each hash_function in k:
        index = hash_function(element) % m
        if bit_array[index] == 0:
            return False  # Definitely not in the set
    return True  # Might be in the set`

Python Code Implementation

import hashlib

class BloomFilter:
    def __init__(self, size, num_hashes):
        self.size = size  # Size of the bit array
        self.num_hashes = num_hashes  # Number of hash functions
        self.bit_array = [0] * size

    def _hashes(self, item):
        """Generate multiple hash values for the item."""
        hashes = []
        for i in range(self.num_hashes):
            # Use hashlib to create multiple hash functions
            hash_val = int(hashlib.md5((item + str(i)).encode()).hexdigest(), 16)
            hashes.append(hash_val % self.size)
        return hashes

    def add(self, item):
        """Add an item to the Bloom filter."""
        for index in self._hashes(item):
            self.bit_array[index] = 1

    def check(self, item):
        """Check if an item might be in the Bloom filter."""
        for index in self._hashes(item):
            if self.bit_array[index] == 0:
                return False  # Definitely not in the set
        return True  # Might be in the set


# Example Usage
bf = BloomFilter(size=10, num_hashes=2)

# Adding words
words = ["cat", "dog", "bird", "fish", "lion"]
for word in words:
    bf.add(word)

# Checking membership
print("Checking membership:")
print("cat:", bf.check("cat"))  # True (might be in the set)
print("mantis:", bf.check("mantis"))  # False (definitely not in the set)
print("panther:", bf.check("panther"))  # True (might be in the set)

# Visualizing the bit array
print("\nBit array after adding elements:")
print(bf.bit_array)

Output of Program:

Checking membership: cat: True mantis: False panther: True Bit array after adding elements: [1, 1, 0, 1, 1, 1, 1, 1, 1, 0]


Application of Bloom Filters