Consistent Hashing | Microservices Course

Problem

Data is spread across N nodes and something has to decide which node holds a given key. The obvious rule, node = hash(key) % N, assigns keys evenly and is cheap, but it depends on N. The moment a node is added or removed, N changes, the modulus changes, and almost every key now maps somewhere new. For a cache that means a near-total miss at once, with every relocated key falling through to the database in the same instant. For a stateful store it means physically moving nearly all the data to rebalance. Either way a one-node change forces a cluster-wide reshuffle, which is the opposite of what you want from adding capacity. The goal is a key-to-node mapping that barely moves when the node set changes.

Solution

Hash the keys and the nodes onto the same fixed circular space, a ring spanning 0 to 2^32 - 1 that wraps around. A key belongs to the first node found by walking clockwise from the key's position. Adding a node drops one new point onto the ring, and only the keys lying between that point and the previous node move onto it, roughly 1/N of the total, all taken from a single neighbor while every other key stays put. Removing a node hands its keys to the next node clockwise, again moving only that node's share. A membership change touches about 1/N of the keys instead of essentially all of them, which is the whole point.

The naive ring has two defects, both from placing one point per node. With few nodes the random positions leave uneven gaps, so some nodes own a much larger arc and take more load, which is skew. And when a node leaves, its entire arc lands on the one neighbor clockwise of it rather than spreading out. Virtual nodes fix both. Each physical node is hashed to many points on the ring, say 128 or 256 tokens, so it owns many small scattered arcs instead of one large one. The arcs average out and load evens across physical nodes, and when a node leaves, its many arcs fall to many different neighbors rather than dumping on one. The token count also lets you weight unequal hardware: give a larger machine more tokens and it takes a proportionally larger share.

Replication sits on top of this directly. To keep a key on N nodes, as the leaderless and Dynamo designs do, walk clockwise from the key and take the next N distinct physical nodes, skipping extra tokens of a node already chosen. That ordered set is the preference list for the key.

Tradeoffs

Property	Effect
Rebalancing cost	A membership change moves only about `1/N` of keys, from one neighbor: the main reason to do this
Load balance	Virtual nodes smooth skew; more tokens means flatter load but more ring metadata to track
Lookup	A sorted token list gives `O(log V)` routing for `V` total tokens, or `O(1)` with a precomputed slot table
Heterogeneity	Weight a node by giving it more tokens
Hot keys	A single popular key still lands on one node; the ring balances key ranges, not per-key traffic
Membership	The ring must be agreed across the cluster (gossip or a coordinator), and in some systems the token count per node is set at join time and awkward to change later

Implementations

Minimal pseudocode

import bisect, hashlib

def h(s):
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, vnodes=128):
        self.vnodes = vnodes
        self.tokens = []           # sorted hash positions on the ring
        self.owner = {}            # token -> physical node

    def add(self, node):
        for i in range(self.vnodes):
            t = h(f"{node}#{i}")   # many points per physical node
            bisect.insort(self.tokens, t)
            self.owner[t] = node

    def remove(self, node):
        self.tokens = [t for t in self.tokens if self.owner[t] != node]
        for t in [t for t in self.owner if self.owner[t] == node]:
            del self.owner[t]      # those arcs fall to the next token clockwise

    def route(self, key):
        i = bisect.bisect(self.tokens, h(key)) % len(self.tokens)  # next clockwise
        return self.owner[self.tokens[i]]

    def replicas(self, key, n):    # the preference list: next n distinct nodes
        i = bisect.bisect(self.tokens, h(key))
        out = []
        for j in range(len(self.tokens)):
            node = self.owner[self.tokens[(i + j) % len(self.tokens)]]
            if node not in out:
                out.append(node)
            if len(out) == n:
                break
        return out

DynamoDB and the Dynamo paper

The 2007 Dynamo paper is where consistent hashing was put into a production data store at scale: keys are partitioned across nodes on a ring, each key is replicated to the next N nodes clockwise (its preference list), and virtual nodes handle uneven hardware and smooth load. DynamoDB, the managed AWS product descended from it, hides the ring behind partition keys and automatic splitting, but the same idea is underneath. Paper: https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf.

Cassandra

Cassandra partitions data with a token ring and assigns each node a set of tokens, so a key's partitioner hash selects the node that owns the matching range, and replicas are the following nodes around the ring. Virtual nodes arrived in version 1.2 to automate token assignment and rebalancing; num_tokens controls how many tokens each node holds. Ring membership and token ownership propagate through a gossip protocol rather than a central coordinator. Docs: https://cassandra.apache.org/doc/latest/cassandra/architecture/dynamo.html.

memcached clients (ketama)

memcached servers don't coordinate; the consistent hashing lives in the client. The ketama scheme, used by libketama and many memcached client libraries, hashes both keys and server identities onto a ring so that adding or removing a cache server only invalidates the fraction of keys that move, instead of blowing away the whole cache as modulo would. This was the original motivating use case, where the cost of a reshuffle is a cache stampede onto the database. Reference: https://github.com/RJ/ketama.

Discord cluster routing

Discord routes a user's session to a specific backend node by hashing the relevant identity onto a consistent-hash ring, so that scaling the node pool up or down keeps a given user pinned to the same node wherever possible rather than reshuffling connections across the fleet. Holding that stickiness matters because each node carries live session and presence state, and relocating it on every membership change would cause reconnect storms. The ring membership is shared across nodes through gossip.