Thundering Herd / Cache Stampede | Microservices Course

Problem

This is the failure mode the earlier pieces kept gesturing at when they mentioned coalescing fills. A hot key, one that absorbs a large share of the read traffic, expires or is evicted or invalidated. In the gap before anything repopulates it, every concurrent request for that key misses at the same instant and falls through to the origin together, so a key that was serving thousands of reads per second from memory suddenly sends thousands of identical reads to the store at once. The origin sees a spike it never sees in steady state, and the requests are redundant: they all compute the same value. If the origin slows under the surge, requests queue, more arrive behind them, and the herd grows instead of draining, a feedback loop that can take the store down.

Two shapes of this exist. One is a single hot key expiring (the classic stampede, also called a dogpile). The other is correlated expiry, where a batch of keys written together share a TTL and all expire in the same second, producing a synchronized wave of misses. A cold start is the same problem with an empty cache. The root cause in every case is that a miss is handled as "every requester independently goes and recomputes," when the behavior you want is "one requester recomputes and the rest wait or accept a stale value."

Solution

Either collapse the concurrent misses on a key into a single origin fetch, or avoid the synchronized miss in the first place. The techniques run from forceful to gentle.

Per-key locking. The first miss takes a lock on the key and does the recompute; everyone else finds the lock held and waits for the result rather than going to the origin. One origin hit. The lock needs its own expiry so a crashed holder doesn't wedge the key forever, and the waiting requests pay the full recompute latency.

Request coalescing (single-flight). Fold concurrent identical calls into one in-flight execution and hand every caller the shared result. Same effect as a lock without an external store, but scoped to one process, so behind N servers you can still get up to N simultaneous fills unless you coalesce across them with a distributed lock.

Stale-while-revalidate (serve-stale). Keep the value past its nominal expiry for a grace window. A request arriving in that window is served the stale value immediately while a single background refresh runs, so nobody blocks and the origin sees exactly one fetch. The cost is bounded staleness, capped by the grace window.

Probabilistic early expiration. Refresh a key slightly before it actually expires, with a probability that climbs as the deadline approaches and as the recompute gets more expensive, so the refreshes scatter across many earlier requests instead of all firing at the deadline. This is the XFetch approach from "Optimal Probabilistic Cache Stampede Prevention."

TTL jitter. Add randomness to each TTL so keys written together don't expire in lockstep. It costs nothing and kills correlated-expiry waves, though it does nothing for a single genuinely hot key.

The throughline: either elect one filler and make everyone else defer (locking, single-flight, leases), or remove the synchronized miss entirely (serve-stale, early expiration, jitter).

Tradeoffs

Technique	Effect
Per-key locking	Cuts the herd to one origin hit; needs a lock expiry to survive a dead holder, and waiters block for the recompute, raising tail latency
Request coalescing	Collapses duplicate calls with no external dependency; in-process only, so it doesn't coalesce across servers, and a slow or failed call is shared by every waiter
Stale-while-revalidate	Nobody blocks and the origin sees a single refresh; serves data stale for up to the grace window
Probabilistic early expiration	Spreads refreshes so the spike never forms; costs a few early recomputes and needs tuning to the recompute cost
TTL jitter	Trivial and kills correlated expiry; useless against one hot key
Leases	Combine election and a staleness bound in one mechanism; add protocol complexity at the cache layer

Implementations

Minimal pseudocode

# per-key lock: one filler, the rest wait or serve stale
def get(key):
    val = cache.get(key)
    if val and not val.expired():
        return val.data
    if cache.add(lock_key(key), ttl=5):        # only one caller wins
        fresh = origin.read(key)
        cache.set(key, fresh)
        cache.delete(lock_key(key))
        return fresh
    return val.data if val else wait_and_retry(key)   # others defer

# serve-stale: return the stale value, refresh once in the background
def get_swr(key):
    val = cache.get(key)
    if val and val.fresh():
        return val.data
    if val and val.within_grace():
        background(lambda: refresh(key))       # single async refresh
        return val.data                        # nobody blocks
    return refresh(key)                        # true miss, no stale copy

Facebook leases (memcached)

On a miss, memcached hands the first client a 64-bit lease token tied to the key, and the client must present that token to set the value back. A delete on the key invalidates outstanding tokens, so a late, reordered write fails its set rather than caching a stale value (the stale-set half of the problem). For the herd, the server issues a token at most once every ten seconds per key; other readers arriving in that window are told to wait briefly and retry, by which point the lease holder has usually filled the key in a few milliseconds, or they take data up to ten seconds stale if the application tolerates it. The store sees one read instead of a stampede. Paper: https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final170_update.pdf.

Varnish grace mode

Grace keeps an object serveable past its TTL for a configured window (set beresp.grace, with req.grace controlling it at lookup). A request that arrives after the TTL expires but inside the grace window is served the stale object as a cache hit while Varnish fires a single asynchronous fetch to refresh it, so concurrent clients all receive the stale copy and only one request reaches the backend. Once grace is exhausted, the refresh becomes synchronous, which is an ordinary miss. This is stale-while-revalidate at the edge; Cloudflare, Fastly, and Google Cloud CDN implement the same Cache-Control directive. Docs: https://info.varnish-software.com/blog/two-minute-tech-tuesdays-grace-mode.

single-flight in Go

golang.org/x/sync/singleflight coalesces duplicate work in-process. group.Do(key, fn) runs fn exactly once even when many goroutines call it concurrently with the same key, and every caller receives the one shared result and error; DoChan returns a channel form, and Forget(key) drops the in-flight record so the next call recomputes. It's the standard guard in front of an expensive lookup, used in Go's own DNS resolver. The caveats follow from its design: it only coalesces within a single process, and because the result is shared, a slow or failing fn is handed to every waiting caller. Docs: https://pkg.go.dev/golang.org/x/sync/singleflight.