Multi-Tier Caching

Problem

No single cache can be both close to the user and large enough to hold everything. A cache near the user has the lowest latency but the least room, and it sees only one user's or one region's traffic, so it can't share work across everyone. A cache near the data is large and shared by all requests but sits a full network round-trip away and stays on the critical path. Pick one tier and you give up what the others provide: a big origin-side cache still pays the round-trip on every hit, and a tiny client cache misses constantly because it can neither hold much nor share across users. The layers also can't substitute for one another. A browser can't cache another user's data, a CDN can't hold an entire database, and a database buffer pool can't put bytes near the user. Each position on the path between user and data can absorb a different slice of the load, and no one position can absorb all of it.

Solution

Stack the caches and let each tier absorb what its position lets it absorb, passing only its misses to the next. Order them from the user inward: client, then CDN edge, then a shared application cache, then the database buffer pool, then disk. Each layer answers the requests it can and forwards the rest, so a request walks inward only as far as it has to.

The payoff is multiplicative. If tier i has hit rate hᵢ, the fraction of requests reaching the origin is the product of the miss rates ∏(1 − hᵢ), so even a few modest layers cut origin load by orders of magnitude rather than adding up linearly. Each tier trades capacity against proximity: the outer tiers are small, per-user or per-region, and closest to the user, while the inner tiers are large, shared, and farthest, so the hot set lives near the user and the long tail lives deep.

Every pattern in this series turns out to be one tier's policy. The browser and CDN run TTL with revalidation; the application cache runs cache-aside under an eviction policy; the buffer pool is write-behind over the data files. Multi-tier caching is the composition of all of them, and the hazards compose along with the benefits. Each layer can be stale independently, so freshness is the weakest link, and a change has to propagate through every tier, which is the cross-tier version of the invalidation problem. Each layer also needs its own stampede protection, because a miss at one tier arrives as a request at the next, so an unguarded expiry deep in the stack becomes a herd against the origin. The tiers are independently and eventually consistent rather than coordinated; you bound the divergence with per-tier TTLs and reach for explicit invalidation only where you must.

One more axis cuts across the stack: how a tier decides what to hold. A pull (demand-fill) tier caches an object on the miss that requests it, which is the web default. A push (proactive) tier pre-positions content before anyone asks, which is how a video stack keeps the catalog warm at the edge. Real stacks mix the two.

Tradeoffs

PropertyEffect
LatencyEach tier nearer the user removes a network hop, so an outer hit is the fastest possible answer
Origin offloadMiss rates multiply down the stack, so combined layers shed far more origin load than any single tier could
Capacity vs proximityOuter tiers are small and close, inner tiers large and far; the hot set sits near the user and the tail sits deep
StalenessEach tier can be stale on its own, so the freshest the system can be is bounded by the laggiest layer a request touches
InvalidationPurging one tier does not purge the others; propagating a change through every layer is the expensive case
StampedeA miss at one tier is a request to the next, so each layer needs its own coalescing or a herd cascades inward
Operational costMore layers mean more moving parts and more independent places a bug can serve stale data

Implementations

Minimal pseudocode

# a request walks the tiers; each answers a hit or forwards the miss
def get(key, tiers): # client -> cdn -> app -> db
for i, tier in enumerate(tiers):
val = tier.get(key)
if val is not None:
for upper in tiers[:i]: # backfill the tiers it skipped
upper.set(key, val)
return val
val = origin.read(key) # everything missed: hit the store
for tier in tiers:
tier.set(key, val)
return val

Netflix Open Connect (video streaming)

The tiers run client player buffer → an Open Connect Appliance embedded inside the user's ISP or sitting at an internet exchange → peer and regional fill → the S3 origin on AWS. The distinctive part is that the edge tier is push, not pull: rather than caching on a miss, Netflix predicts what members will watch and pushes the catalog to the appliances during off-peak fill windows, and an appliance prefers to fill from a neighboring appliance over reaching back toward origin, so at prime time nearly every byte is served from a box already inside or next to the viewer's network. A control plane on AWS steers each client to the best appliance, and the appliances themselves are commodity servers running FreeBSD and NGINX. Overview: https://openconnect.netflix.com/Open-Connect-Overview.pdf.

The web request (browser → CDN → origin)

The everyday multi-tier cache, coordinated entirely by HTTP headers. The browser keeps a private per-user cache, the CDN edge keeps a cache shared across a whole region, and the origin sits behind both, usually fronted by its own application cache (Redis) with the database buffer pool underneath. Cache-Control sets freshness per tier, a long immutable lifetime on content-hashed assets and a short or no-cache lifetime on HTML, while ETag with If-None-Match lets any tier revalidate with a cheap 304 instead of refetching the body. A change is pushed through by changing the versioned URL for the browser and issuing a surrogate-key purge at the CDN, the invalidation techniques from earlier in the series applied per layer. HTTP caching reference: https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching.

CPU cache hierarchy

The same idea in hardware, and the one that predates the rest. Between a core and main memory sit L1, L2, and L3 caches, each larger, slower, and farther from the core than the one before, holding a SRAM copy of recently used DRAM lines. A load checks L1, falls through to L2 and L3 on a miss, and reaches DRAM only when all three miss, the identical fall-through with the identical capacity-versus-proximity trade, decided in nanoseconds by the hardware rather than in software. Reference: https://en.wikipedia.org/wiki/CPU_cache.