Hot-Partition / Hot-Key Mitigation | Microservices Course

Problem

Partitioning balances key ranges, not per-key traffic, as the range-vs-hash piece noted. Even with keys spread perfectly across partitions, the requests hitting those keys are almost never uniform: one account has millions of followers, one tweet goes viral, one product trends, one counter is incremented by everyone. That key lives on a single partition on a single node, so all of its load lands on that one node no matter how many nodes the cluster has. Resharding doesn't rescue you, because you cannot split a single key across partitions, the key is the indivisible unit. The result is a saturated node beside idle ones, tail latency that spikes, throttled requests, and a bottleneck that adding capacity does nothing to fix, since the constraint is one key in one place.

It helps to separate two cases. A hot partition is a range or partition taking more than its share, often from monotonic keys or a popular range, and is sometimes curable by splitting and rebalancing. A hot key is a single key, which the partitioning scheme cannot divide, so it needs help from the application or the storage engine rather than from a better partition function.

Solution

Everything starts with detection: track request rate and bytes per key or per partition, usually by sampling, and flag the outliers. Once you know what's hot, the techniques fall into spreading the load, absorbing it, or isolating it.

Key salting (write sharding) turns one hot write key into many. Append a small bucket suffix so key becomes key#0 through key#N, which hash to different partitions, so writes that would have contended on one node now spread across N. The cost lands on reads, which must query all N variants and merge them. The sharded-counter pattern is exactly this: rather than one row everyone increments, keep N shard rows, increment a random shard per write, and sum all N to read, trading read amplification for write throughput.

Request coalescing and caching handle a hot read key, where the problem is many identical reads rather than contended writes. Coalescing collapses concurrent reads of the same key so one in-flight fetch serves all the waiters instead of each hitting the partition. Caching the hot value near the clients, in a CDN, a local cache, or a read replica, with a short freshness window, means most reads never reach the owning partition at all. A read-heavy hot key is better solved this way than by salting.

Dedicated handling gives the hot entity its own code path. Twitter's timeline is normally built fanout-on-write: a tweet is pushed into each follower's materialized timeline so reads are cheap, the denormalized read model from the CQRS piece. For an account with millions of followers that becomes a write hotspot, one tweet causing millions of inserts, so those accounts are switched to fanout-on-read: their tweets are not pushed, and each follower's timeline read merges the celebrity's recent tweets in at query time. Most users stay fanout-on-write, the heavy producers go fanout-on-read, and the threshold between them is a runtime config rather than a fixed design, so a user crossing it just flips a flag.

Adaptive capacity pushes the work into the storage engine. The system detects a hot partition and shifts throughput toward it, or splits the partition so frequently accessed items get their own, with no application change. Its ceiling is the single key: one item still can't exceed the throughput of one partition, so an extreme single-key hotspot falls back on application-level write sharding.

Tradeoffs

Property	Effect
Write spreading (salting)	One hot key becomes N spread keys; cost is read amplification, since every read scatter-gathers across all N
Read absorption (cache, coalesce)	Most reads never reach the partition; cost is staleness and cache coherence
Dedicated path (fanout-on-read)	Isolates the heavy producer; cost is two code paths and a merge at read time
Adaptive capacity	Automatic for hot partitions, no app change; cost is reaction lag and a hard ceiling at one partition / one item
Detection	Requires per-key telemetry; sampling has cost and lag, so the hotspot is seen late
Indivisibility	A single hot key can't be partition-split, so only application techniques reach it

Implementations

Minimal pseudocode

import random

# sharded counter: spread a hot write key across N shards
def incr(store, key, n_shards):
    shard = random.randrange(n_shards)
    store.add(f"{key}#{shard}", 1)            # contention spread over N partitions

def read_count(store, key, n_shards):
    return sum(store.get(f"{key}#{s}") for s in range(n_shards))   # scatter-gather

# request coalescing: collapse concurrent reads of one hot key
inflight = {}
def get_coalesced(store, key):
    if key in inflight:
        return inflight[key].result()         # share the in-flight read
    fut = inflight[key] = Future()
    try:
        fut.set_result(store.get(key))        # one real read serves every waiter
        return fut.result()
    finally:
        del inflight[key]

DynamoDB adaptive capacity

DynamoDB watches per-partition traffic and automatically shifts throughput toward the partitions taking more requests, and it will split a partition to isolate frequently accessed items onto their own partition, so a hot partition is absorbed without operator action. The limit is the single item: it is capped by one partition's maximum, so a genuine single-key hotspot still needs application-side write sharding on top. Docs: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-design.html.

Twitter celebrity fanout

Twitter materializes home timelines fanout-on-write, pushing each tweet into followers' timelines (in Redis) so the 75-to-1 read-heavy workload is served cheaply. Accounts above roughly ten thousand followers are moved off that path because the write amplification is too large, and their tweets are instead pulled and merged into each follower's timeline at read time. The hybrid keeps cheap reads for the common case while isolating the heavy producers, and the cutover threshold is tunable at runtime. Reference: https://highscalability.com/the-architecture-twitter-uses-to-deal-with-150m-active-users/.

Sharded counters

A single counter that everyone increments is the canonical write hotspot on one key. The fix is to keep N counter shards, increment a randomly chosen shard on each write so the contention spreads across partitions, and sum all N shards on read. Google documented this for Datastore and Firestore as the standard pattern, where it trades a more expensive read for the write throughput a single counted row could never sustain. Docs: https://cloud.google.com/datastore/docs/concepts/sharding-counters.