Rebalancing & Resharding

Problem

Partitioning decides which keys live on which node, but that assignment doesn't hold still. You add nodes when you need capacity, lose them to failure or scale-down, and watch partitions grow unevenly until one runs hot. Every one of these calls for moving data between nodes. The naive response, recompute the placement and copy everything to its new home, fails for two reasons. Moving real volumes of data takes minutes to hours, and during that window you either stop serving the data in flight, which is downtime, or serve it from two places at once, which is inconsistency. The move can also manufacture the very imbalance it was meant to cure: drop a busy partition's entire load onto one fresh node, or send reads to a node that hasn't finished receiving the data, and you've made things worse. The task is to change data placement while traffic keeps flowing and without creating a hotspot in the process.

Solution

Start by not rebalancing with hash mod N, which the consistent-hashing piece showed relocates nearly everything on any membership change. Use a scheme that bounds movement: consistent hashing, or a large fixed number of partitions assigned to nodes so the partition, not the key, is the unit of movement. Many systems pick a partition count up front and never change it, reassigning whole partitions to nodes as membership shifts, so a key never changes which partition owns it, only which node hosts that partition.

The move itself runs in the background and incrementally. Copy a partition's data to its destination while the source keeps serving every request. Because live writes keep arriving during that copy, the destination then follows the source's change stream, the same write-ahead log the WAL and replication pieces relied on, until it has caught up to the present. Only once the destination is current do you cut over, and the cutover is a single atomic flip of a routing table that records, per partition, which node is authoritative. A routing layer that lags reality is its own failure: it sends traffic to a node that no longer owns the data, so the switch and the routing update have to be consistent, often with a brief read-from-both or dual-write window so no in-flight write is dropped at the seam.

Throttling is what keeps rebalancing from becoming the hotspot. Copying data burns disk, network, and CPU on both nodes, so an unthrottled migration starves live traffic; production systems cap concurrent moves and bytes per second.

Resharding is the harder case, because you aren't only moving a partition but subdividing its keyspace, the dynamic split the range-partitioning piece described. Splitting one partition into several means producing new partitions whose ranges together cover the original, backfilling each from the source, keeping them current with live writes through the change stream, and then atomically switching routing from the one old partition to the several new ones. Merging is the same in reverse.

Tradeoffs

PropertyEffect
Availability during a moveBackground copy plus an atomic cutover keeps traffic flowing: the main reason to do this carefully
Data movedBounded to about 1/N under consistent hashing or fixed partitions; hash mod N moves nearly all of it
Speed vs interferenceThrottling trades how fast you rebalance against how much you disturb live traffic
Cutover consistencyIn-flight writes must be captured via a change stream or dual-write, or they're lost at the seam
CoordinationA controller must own the plan and the routing table; a stale router points traffic at the wrong node
Resharding costSplitting or merging partitions is harder than relocating them: backfill, catch-up, then an atomic routing switch
Hotspot avoidancePlacement and move scheduling must spread load, or rebalancing recreates the imbalance

Implementations

Minimal pseudocode

# move one partition from src to dst without halting traffic
def move_partition(p, src, dst, router, rate):
snapshot = src.snapshot(p) # consistent point-in-time copy
dst.bulk_load(snapshot, rate=rate) # background, throttled
while src.lag(p, dst) > 0: # follow live writes
dst.apply(src.changes_since(p, dst.offset)) # stream the WAL tail
with router.lock(p): # brief and atomic
if dst.caught_up(p):
router.set_owner(p, dst) # flip routing in one step
src.drop(p) # reclaim source after cutover
# reshard: split one partition's keyspace into two
def split_partition(p, router):
lo, hi = p.range
mid = midpoint(lo, hi)
a, b = new_partition(lo, mid), new_partition(mid, hi)
backfill(a, src=p); backfill(b, src=p) # snapshot + tail, as above
with router.lock(p):
router.replace(p, [a, b]) # one partition -> two, atomically

Vitess resharding

Vitess fronts sharded MySQL and reshards while online. It creates the target shards, then uses VReplication to stream both the initial snapshot and the ongoing binlog changes into them so the new shards track live writes, verifies the copied data, and switches traffic in stages, reads first and writes second, one shard at a time. The switch is reversible if something looks wrong, because the original shards keep their data until you tear them down. Splitting is by ranges of the sharding key. Docs: https://vitess.io/docs/reference/vreplication/reshard/.

Elasticsearch shard allocation

Elasticsearch fixes the number of primary shards per index at creation, so it relocates whole shards rather than resizing them: the allocation decider continuously moves shards across nodes to balance disk and load and to honor constraints like keeping a primary and its replica apart. A relocation streams the shard's segments to the target while the source keeps serving, then hands off, and the process is throttled by limits on concurrent recoveries and recovery bandwidth. Changing the shard count itself means reindexing or the split and shrink APIs, not allocation. Docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cluster.html.

Kafka partition reassignment

A Kafka topic's partition count is fixed and can only grow, and growing it breaks the key-to-partition mapping for future writes, so reassignment is mostly about moving existing partitions between brokers. You hand the reassignment tool a target replica layout; each new replica joins by fetching from the partition leader until it has caught up enough to enter the in-sync replica set, after which the old replicas are dropped and leadership is rebalanced. Replication quotas throttle the catch-up so it doesn't starve normal traffic, and Cruise Control automates generating balanced plans. Docs: https://kafka.apache.org/documentation/#basic_ops_partitionassignment.