Problem
The leader-follower and multi-leader designs both name certain nodes as writers and depend on failover to move the write path when one of them dies. Failover is the fragile part: a window where writes are refused while the system detects the failure, picks a replacement, and redirects traffic, with split-brain and lost writes waiting on either side of a misjudged timeout. The underlying issue is that some specific node has to be reachable for a write to proceed, so its failure stops writes until something promotes a stand-in. What you want is a design where no single node's failure stops writes and there is no promotion step to get wrong.
Solution
Make no node special. The client, or a coordinator acting for it, sends each write to several replicas at once and treats the write as done after enough of them acknowledge. Reads work the same way, querying several replicas in parallel. There is no leader to lose, so a downed node just means the client uses the others, with no failover and no promotion. The price is that replicas drift apart: a read and a write can touch different subsets of nodes, and a node that was down missed whatever was written while it was gone. Two mechanisms pull the replicas back together. Read repair notices during a read that some replicas returned a stale version and writes the current value back to them. Anti-entropy is a background process that compares replicas, usually by exchanging Merkle trees so only the differing ranges are examined, and copies over whatever is missing.
Whether a read can be trusted to see the latest write comes down to overlap. With N replicas for a key, requiring W acknowledgments to commit a write and R responses to serve a read, the condition W + R > N forces the read set and write set to share at least one node, so a read always reaches a replica that holds the most recent acknowledged write. This is a quorum. Loosening it to W + R <= N lowers latency and raises availability but allows stale reads. A sloppy quorum goes further: when a target replica is unreachable, the write lands on a substitute node that holds it temporarily and forwards it once the intended node recovers, a step called hinted handoff, which keeps writes flowing during a partition at the cost of weaker quorum guarantees.
Concurrent writes to one key still collide here, for the same reason they do under multi-leader: there is no single point that serializes them. Version vectors detect the concurrency, and resolution is either last-write-wins by timestamp or returning every concurrent version (siblings) for the application to merge.
Tradeoffs
| Property | Effect |
|---|---|
| Write availability | No node's failure stops writes, and there is no failover step: the main reason to do this |
| Tunable consistency | W and R trade read latency, write latency, and staleness against each other per workload, or per query |
| Read freshness | Guaranteed only when W + R > N; below that, reads can be stale |
| Conflict handling | Concurrent writes collide and need version vectors plus a resolution policy, as in multi-leader |
| Repair cost | Read repair and anti-entropy run continuously to fight drift; Merkle trees keep anti-entropy from rescanning everything |
| Consistency | Eventual; a replica is current only after a quorum read, read repair, or anti-entropy reaches it |
Implementations
Minimal pseudocode
# write: send to the N replicas, succeed once W acknowledgedef write(key, value, W):record = Record(value, version=bump_version(key))acks = 0for r in replicas_for(key): # N nodes, sent in parallelif r.store(key, record):acks += 1return acks >= W # quorum reached# read: query the N replicas, return newest, repair the stale onesdef read(key, R):responses = gather(replicas_for(key), until=R) # wait for R repliesnewest = max(responses, key=lambda x: x.version)for r in stale_replicas(responses, newest):r.store(key, newest) # read repairreturn newest if not concurrent(responses) else siblings(responses)# anti-entropy: background, compare replicas and sync only what differsdef anti_entropy(a, b):for key_range in merkle_diff(a.tree, b.tree):reconcile(a, b, key_range)
Amazon Dynamo
The 2007 Dynamo paper is where this design was assembled and named, and nearly every system here descends from it. It partitions keys across nodes with consistent hashing, replicates each key to N nodes, uses sloppy quorums with hinted handoff to stay writable during failures, tracks causality with vector clocks, and runs read repair plus Merkle-tree anti-entropy to converge. Note that Dynamo the paper is a separate thing from DynamoDB the AWS product, which shares lineage but exposes a managed API rather than these knobs. Paper: https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf.
Cassandra
Cassandra implements the Dynamo replication model with consistency set per query: a write or read names a level (ONE, QUORUM, ALL, and others) that fixes W or R against the replication factor N, so the same cluster serves strong and weak reads side by side. It runs read repair on the read path and Merkle-tree repair as anti-entropy (nodetool repair), with hinted handoff for short outages. One deliberate difference from the paper: Cassandra resolves conflicts by last-write-wins on per-cell timestamps rather than vector clocks, which simplifies reads at the cost of silently dropping a losing concurrent write. Docs: https://cassandra.apache.org/doc/latest/cassandra/architecture/dynamo.html.
Riak
Riak stays closest to the paper. It keeps causality with dotted version vectors, and on a conflict it returns the concurrent versions as siblings for the application to merge rather than discarding one, with optional CRDT data types that merge automatically. The same N, R, and W are configurable per bucket and per request. Docs: https://docs.riak.com/riak/kv/latest/learn/concepts/replication/index.html.