Caching & Flow Control

Distributed Cache

A sharded, replicated in-memory cache: consistent-hashing routing, LRU eviction, replica failover, single-flight stampede control, and hot-key fan-out.

~35 min · intermediate

Problem & Requirements

Put a fast in-memory tier in front of a slow backing store (a database, an API) so most reads never reach it. The cache holds far less than the dataset, lives across many nodes, and has to keep serving when a node dies and when traffic concentrates on a handful of keys.

The application talks to the cache with the cache-aside pattern: on a read it checks the cache, and on a miss it loads from the backing store and populates the cache itself. The cache stays a dumb, authoritative-of-nothing store — it never reads the database, never knows the schema, and is always allowed to forget. That "always allowed to forget" property is what lets the whole thing be lossy, replicated loosely, and evicted aggressively without correctness consequences. The failure modes that do hurt — losing a node's worth of cache at once, or a stampede onto the backing store — are what the build steps target.

Functional

  • get(key) / set(key, value, ttl) / delete(key) over many nodes, with the application doing cache-aside loads on a miss.
  • A key maps to a node deterministically from the client, with no central coordinator on the hot path.
  • Surviving a node loss without a correctness bug (a cold shard is acceptable; serving stale-after-delete is not).

Non-functional (back-of-envelope, intra-DC fleet)

QuantityTargetWhy it drives the design
Aggregate throughput~1M ops/sForces client-side routing — no proxy or coordinator in the read path.
Cache-hit p99< 1 msOne network round trip (~0.5 ms) plus a hash-map lookup; no fan-out per get.
Hit ratio≥ 95%A 5%→6% miss-rate jump doubles nothing, but a stampede on one hot key can 100× backing-store load — hence single-flight.
Memory / node~64 GBFar below dataset size → eviction is the steady state, not the exception.
Value size~1 KB typicalKeeps a value in one packet; lets a node hold tens of millions of entries.
Rebalance on node changeMove ~1/N of keysThe whole reason to use a hash ring instead of hash(key) % N.

That last row is the quiet requirement that rules out the obvious design. hash(key) % N remaps almost every key when N changes, dumping the entire cache onto the backing store at once — which is why we reach for consistent hashing in step three.

Design

Five pieces, each a named principle applied to a concrete job:

  1. Client-side router — the client hashes the key onto a consistent-hashing ring of virtual nodes and talks straight to the owning node. Adding or removing a node moves only its neighbors' share of keys, not the whole keyspace. No coordinator sees the read.
  2. Per-node store with eviction — each node is an independent map with a memory cap and an eviction policy (LRU here). Nodes share nothing and don't know about each other; the ring is the only thing tying them together.
  3. Replication — each key is written to its primary node plus R−1 successors on the ring, so losing one node degrades to a warm replica instead of a cold-shard stampede. Reads can fall back to a replica when the primary is unreachable.
  4. Single-flight on miss — when a hot key expires, every concurrent reader misses at once and piles onto the backing store (thundering herd). Coalesce duplicate in-flight loads for the same key into one call.
  5. Hot-key fan-out — single-flight fixes the stampede on expiry; it doesn't fix a key whose steady-state traffic alone exceeds one node's capacity. Detect such keys and replicate them across many nodes (or cache them in the client briefly), spreading the load that consistent hashing otherwise pins to a single shard. See hot-key mitigation.

The contrast worth holding in mind: this is the Memcached model — independent nodes, routing pushed entirely to the client. Redis Cluster instead splits the keyspace into 16,384 fixed slots and lets the servers own slot assignment and replication, gossiping membership among themselves and redirecting a misrouted client. Same problems, opposite answer to "who knows where a key lives." The Scaling section returns to this.

Build it

1

Start with one node: a map with per-entry TTLs, and the cache-aside read helper that the application uses. The cache never touches the backing store on its own — the loader is supplied by the caller, which is what keeps the cache ignorant of the database. Everything later is about making this survive sharing, node loss, and traffic skew.

2

A node holding far less than the dataset must evict, and the steady state is "full." Track recency with an ordered map: a touched key moves to the most-recent end, and when the entry count crosses the cap the least-recent key is dropped. LRU is one point on the eviction-policy spectrum — cheap, and a good default when access has temporal locality. The cap is on entries here for clarity; production caches cap on bytes.

3

One node can't hold the dataset or the throughput. Spread keys across many nodes — but hash(key) % N remaps nearly everything when N changes, which would dump the whole cache onto the backing store at once. A consistent-hashing ring instead places each node at many points (virtual nodes) and routes a key to the next node clockwise, so adding or removing a node only moves the keys between it and its neighbor: about 1/N of the keyspace. Routing happens entirely in the client; no node coordinates with another.

4

With one copy per key, losing a node cold-misses its entire share at once — a self-inflicted stampede. Write each key to its primary plus the next R−1 distinct physical nodes clockwise, and on read fall back to a replica when the primary is unreachable. The cache stays lossy by design, so replicas can drift; what replication buys is that a single node failure degrades to a warm replica instead of a cold shard. The delete must hit all replicas, or a stale copy resurrects.

5

When a popular key expires, every concurrent reader misses in the same instant and all of them call the backing store — a thundering herd that can spike origin load by orders of magnitude. Coalesce them: the first caller for a key runs the loader while the rest wait on its result, so N simultaneous misses become one backing-store call. This is per-node and per-key, and it's the difference between a cache miss and a cache outage.

6

Single-flight stops the stampede at expiry, but it can't help a key whose steady-state read rate alone exceeds one node's capacity — consistent hashing pins that key to a single shard, and that shard melts. Detect hot keys with a lightweight per-node counter, and for a flagged key, write it to many nodes and route reads to a random one of them, so the traffic that hashing concentrated gets spread across the fleet. Reads also keep a brief client-local copy, the cheapest hot-key mitigation of all. Hot status is approximate and decays; the cost is extra writes and slightly staler reads for those few keys.

Tradeoffs

DecisionBuysCosts
Cache-aside, lossy by designCache stays simple and ignorant of the schema; can drop anything anytimeApplication owns load + invalidation logic; a delete must reach every copy
Consistent hashing over mod NNode change moves ~1/N of keys, not all of themUneven load without enough virtual nodes; client must know ring membership
Client-side routing (Memcached model)No coordinator on the read path; trivial to scale readsEvery client needs current membership; rebalancing is the client's problem
LRU evictionCheap, exploits temporal localityScan-heavy workloads pollute it; no scan resistance (vs. LRU-K / ARC)
Replication (R copies)A node loss degrades to a warm replica, not a cold shardR× the writes and memory; replicas drift since the cache is lossy
Single-flightOne backing-store call per key per miss instead of NPer-key contention; a slow loader blocks its followers
Hot-key fan-outSpreads load hashing would pin to one shardExtra writes + staler reads for hot keys; detection is approximate

Scaling it up

The toy is the Memcached model; here's where production caches diverge and get harder.

Server-owned routing (Redis Cluster). Instead of trusting every client to hold the ring, Redis Cluster splits keys into 16,384 fixed hash slots, assigns slots to primaries, and has the servers gossip membership among themselves. A client that asks the wrong node gets a MOVED/ASK redirect and updates its slot map lazily. This trades client simplicity for the ability to rebalance slots and fail over replicas without every client agreeing on the ring first — the cost is a real cluster protocol (gossip, epochs, failover voting) that the dumb-node model avoids entirely.

Memory-aware eviction and the real policies. Capping on entry count is a teaching simplification; production caches cap on bytes and account for slab/fragmentation overhead. Redis offers a menu (allkeys-lru, lfu, random, volatile-ttl) because LRU is wrong for scan-heavy or frequency-skewed workloads; LFU and segmented/scan-resistant variants (LRU-K, ARC, the W-TinyLFU used by Caffeine) exist precisely for the cases plain LRU pollutes.

Stampede control beyond single-flight. Single-flight handles concurrent misses on one node, but two further problems remain: synchronized expiry (thousands of keys with the same TTL all expiring together — fix with TTL jitter), and the latency cliff when even one loader is slow. Production answers add probabilistic early recomputation (refresh a key before expiry with rising probability) and serve-stale-while-revalidate, so a hot key is never simultaneously missing for all readers.

Invalidation is the hard part. Cache-aside makes correctness the application's job, and the classic race is real: reader loads a stale value and writes it to cache after a concurrent writer invalidated. Mitigations range from delete-after-write ordering, to short TTLs as a backstop, to versioned keys, to a change-data-capture stream that invalidates the cache from the database's write log. None are free, and the right one depends on how stale you can afford to be.

Hot-key detection at scale. A plain counter per node doesn't see a key that's hot across the fleet, and it grows unbounded. Production systems use sketches (count-min) with decay to bound memory, aggregate hot-key signals centrally, and combine server-side fan-out with client-side local caching (Memcached's "mcrouter" and the newer client-side caching in Redis 6+ both exist for this).

References