Multi-Leader Replication | Microservices Course

Problem

With a single leader, every write travels through one node. A client far from that node pays a round trip on every write, and a client cut off from it by a network partition can't write at all until the partition heals or a failover moves the write path, which is itself disruptive. The same constraint bites whenever writers are spread out: regions on different continents, or laptops and phones that go offline and reconnect later. Funneling all of their writes through one always-reachable leader is either slow or impossible, and the leader-follower design has no answer for it because it permits exactly one writer.

Solution

Run several leaders, typically one per region or per device, each accepting writes locally and replicating them asynchronously to the others. Every leader also applies the changes arriving from its peers the way a follower does, so each one eventually holds every write. Local writes are fast and stay available during a partition, because a leader never has to reach the others to accept a write. The cost moves to reconciliation: two leaders can modify the same record concurrently without having seen each other, and when their streams meet, the system holds two divergent versions of one record and has to decide what the value is.

Telling a real conflict (two concurrent edits) from an ordinary update (one edit that already saw the other) needs causality tracking, usually a version vector that records how much of each leader's history a record reflects. Once a conflict is detected, resolution takes one of a few forms. Last-write-wins picks a winner by timestamp; it is trivial but silently drops a write and leans on clocks agreeing across machines. An application-supplied merge function combines the two versions with domain knowledge. A conflict-free replicated data type (CRDT) is built so any two versions merge to the same result regardless of order, which removes the decision for the data shapes it covers. Some systems decline to choose and instead keep both versions for the application or the user to settle, the way a calendar surfaces a double-booking rather than discarding one event.

Tradeoffs

Property	Effect
Write availability	Any region or device accepts writes, including while partitioned: the main reason to do this
Write latency	Writes commit locally, with no cross-region round trip
Conflict handling	Concurrent writes to the same data must be detected and resolved: the cost and the hard part
Consistency	Eventual; replicas diverge until their streams reconcile
Topology	Leaders form a graph (all-to-all, ring, star); more leaders means more replication paths and more places for conflicts to arise
Data model fit	Clean for commutative or CRDT-friendly data, painful for data with cross-record invariants like uniqueness or a non-negative balance

Implementations

Minimal pseudocode

# each leader accepts local writes and tags them with its version vector
def write(leader, key, value):
    leader.clock[leader.id] += 1
    record = Record(value, version=dict(leader.clock))
    leader.store[key] = record
    for peer in leader.peers:
        peer.send(key, record)              # async, to every other leader

# receiving a remote write: apply it, ignore it, or reconcile a conflict
def receive(leader, key, incoming):
    local = leader.store.get(key)
    if local is None or dominates(incoming.version, local.version):
        leader.store[key] = incoming        # incoming already saw local
    elif dominates(local.version, incoming.version):
        pass                                # local already saw incoming
    else:
        leader.store[key] = resolve(local, incoming)   # concurrent: conflict
    leader.clock = merge(leader.clock, incoming.version)

# dominates(a, b): a >= b on every entry -> a causally follows b
# resolve: last-write-wins, an app merge function, or a CRDT join

CouchDB and offline sync

CouchDB replication is multi-leader: any node, including a PouchDB instance running in a browser or on a phone, accepts writes locally and syncs bidirectionally with its peers whenever it has a connection. Each document carries a revision tree, and when two replicas edit the same document independently, replication keeps both revisions, deterministically marks one as the winner so reads keep working, and exposes the conflict so the application can merge and resolve it. This is the model under offline-first apps: write while disconnected, then sync and reconcile on reconnect. Docs: https://docs.couchdb.org/en/stable/replication/conflicts.html.

CockroachDB multi-region

CockroachDB runs active-active across regions, so every node accepts reads and writes and you can pin data to regions for locality. It is worth being precise about how, because it differs from the systems above: it does not reconcile conflicts after the fact. Each range of data is replicated by Raft consensus with a single leaseholder that orders writes, so two concurrent writes to the same key are serialized rather than merged. It buys multi-region write availability through consensus, paying a cross-region round trip for writes that can't be served locally instead of paying for conflict resolution. Ordering versus reconciling is the axis that separates it from CouchDB. Docs: https://www.cockroachlabs.com/docs/stable/multiregion-overview.html.

Conflict-free replicated data types

A CRDT is a data type whose merge is commutative, associative, and idempotent, so replicas that have received the same set of updates converge to the same state in any order, with no coordination and no resolution step. Counters, grow-only and observed-remove sets, last-write-wins registers, and editable text sequences all have CRDT forms. Riak exposed several as first-class types, and they sit under collaborative editors and many offline-sync stores. Reference: https://crdt.tech/.