Bloom Filters / Probabilistic Membership

Problem

Answering "is this element in the set?" normally means storing the set, a hash table or tree holding the keys themselves, so memory grows with both the number of elements and their size. When the set is large or lives behind slow storage, you can't keep it all in fast memory, and often the answer you most want is the negative one: "this key is definitely not here, so don't bother with the expensive lookup," where the lookup is a disk read, a network round trip, or a cache miss. Paying full-set memory to answer a mostly-negative question is the waste.

Solution

A Bloom filter answers set membership probabilistically using a fixed bit array and k independent hash functions, storing no keys at all. To insert an element, hash it k ways and set those k bits. To query, hash it k ways and check those bits: if any bit is 0 the element is definitely absent; if all are 1 it's probably present. The error is one-sided, no false negatives, and a false-positive rate you tune by sizing the array.

The payoff is that memory is independent of key size, you store bits rather than keys, and you trade a chosen false-positive rate for fewer bits: roughly 1.44 · log₂(1/p) bits per element, about 10 bits each for a 1% rate, regardless of whether the keys are 8 bytes or 8 kilobytes. You use it as a cheap pre-filter in front of the authoritative lookup. Query the filter first, and only on "probably present" do you pay for the real check. A false positive merely wastes one such check; it never returns wrong data, because the authoritative source still has the final say.

This is the mechanism the LSM and columnar pieces leaned on to skip files that can't contain a key.

The sizing math: for m bits and n elements the optimal number of hashes is k = (m/n)·ln 2, and the false-positive rate is approximately (1 − e^(−kn/m))^k. Standard Bloom filters don't support deletion, since clearing a bit could clear one shared with another element and create a false negative. Variants address this: a counting Bloom filter stores small counters instead of bits to allow deletes, a scalable Bloom filter grows as n grows, and cuckoo or quotient filters support deletion and are often more cache-friendly.

Tradeoffs

Property	Effect
Memory	Tiny and fixed, independent of key size: the whole appeal
Error model	One-sided: never a false negative, a tunable false-positive rate
Requires a backstop	Only useful when an authoritative check exists and a false positive just wastes one cheap-to-detect lookup
No enumeration	Can't list members or retrieve values, membership only
No deletion	Not in the standard form; needs a counting or cuckoo variant
Sizing	You must size for expected n; overfilling pushes the false-positive rate toward 1, degrading rather than failing
Net benefit	Wins only when negatives are common and the avoided lookup is expensive; if most queries hit or the backing lookup is cheap, it's pure overhead

Implementations

Minimal pseudocode

class BloomFilter:
    def __init__(self, m_bits, k_hashes):
        self.bits = bitarray(m_bits)
        self.m, self.k = m_bits, k_hashes

    def _positions(self, item):
        h1, h2 = hash1(item), hash2(item)
        return [(h1 + i * h2) % self.m for i in range(self.k)]  # double hashing

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        return all(self.bits[p] for p in self._positions(item))
        # False  -> definitely absent (skip the expensive lookup)
        # True   -> probably present (do the real check)

Cassandra and RocksDB SSTable filters

Each SSTable carries a Bloom filter over the keys it contains. On a read, the engine tests each SSTable's filter and skips every file whose filter says the key is absent, avoiding a disk read or block fetch per skipped file, which directly cuts the read amplification inherent in checking many SSTables. RocksDB keeps per-SSTable Bloom filters in its block-based table format and can add a prefix filter for range-prefix queries; Cassandra exposes bloom_filter_fp_chance per table to trade memory against false positives. Docs: https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter and https://cassandra.apache.org/doc/latest/cassandra/operating/bloom_filters.html.

Bitcoin SPV

A Simplified Payment Verification client stores no full blockchain and wants only the transactions relevant to its own addresses. Under BIP 37 it sends a Bloom filter to a full node rather than its exact addresses, and the node returns transactions matching the filter; the false positives even help privacy by blurring which addresses are truly the client's. BIP 37 turned out to have real privacy weaknesses and has largely been replaced by BIP 157/158 (Neutrino), which inverts the flow: the server builds a Golomb-coded set filter per block that clients download and test locally, a related compact-filter idea rather than a classic Bloom filter. BIP 37: https://github.com/bitcoin/bips/blob/master/bip-0037.mediawiki. BIP 158: https://github.com/bitcoin/bips/blob/master/bip-0158.mediawiki.

CDN cache admission

Caches get polluted by one-hit wonders, objects requested once and never again, which evict genuinely hot content. A common admission policy uses a Bloom filter to remember which keys have been seen before and admits an object to the cache only on its second request, so single-use objects never displace useful entries. The filter's small memory cost is what makes tracking the entire seen-key history feasible, and a false positive only means admitting one object slightly early, which is harmless. This "admit on second sight" pattern is widely used in CDN and large-scale caching tiers, sometimes with a counting variant or alongside a frequency sketch when the policy also needs to compare access frequencies.