Problem
A single node is one point of failure and one point of truth. Replicate it for availability and the replicas can now disagree: a partition splits them, a leader crashes halfway through a write, messages arrive reordered, and two replicas end up holding different "latest" values with no way to tell which is correct. Picking the wrong one corrupts everything downstream that trusted it.
What's needed is for a set of nodes to agree on a single ordered history of operations, and to keep agreeing even while some nodes crash and the network drops, delays, and reorders messages, such that no two surviving nodes ever disagree about what was decided. That agreed log is what every linearizable store, lock service, and leader election sits on top of, because feeding the same ordered commands to every replica keeps them identical.
Solution
Require a majority quorum to accept each decision before it counts as committed, and require any later decision to consult a majority too. Because any two majorities share at least one node, that overlapping node carries every committed value forward, so a new leader can never miss something already decided. With 2f+1 nodes this tolerates f failures.
Elect a leader to order proposals, which avoids competing proposers stalling each other. The leader appends each client command to its log, replicates the entry to followers, and commits it once a majority has durably stored it, then applies the command to a deterministic state machine and answers the client. Replaying the same committed log through the same state machine on every node keeps the replicas identical, which is the replicated-state-machine construction.
Safety holds unconditionally: even during a partition, no two nodes decide differently, because progress requires a majority and a minority simply can't form one. Liveness is the part you give up. The FLP result shows that in a fully asynchronous system a single crash can prevent any deterministic protocol from ever deciding, so practical protocols guarantee safety always and make progress only once enough nodes can exchange messages within their timeouts. Randomized election timeouts keep the system from getting stuck retrying elections.
Tradeoffs
| Property | Effect |
|---|---|
| Fault tolerance | 2f+1 nodes tolerate f failures; a majority must stay alive and reachable, so quorum loss means unavailability |
| Latency | Every commit costs a majority round-trip, and all writes funnel through the leader |
| Partition behavior | The minority side stops accepting writes by design, choosing consistency over availability |
| Liveness | FLP forbids guaranteed termination under full asynchrony; bad timing can stall progress while safety still holds |
| Write scaling | A single group doesn't scale writes; you shard into many independent consensus groups to grow |
| Implementation risk | Hard to get right, and membership reconfiguration is especially error-prone |
Implementations
Minimal pseudocode (leader-based log replication)
# leader: order, replicate, commit on majority, then applydef propose(cmd):entry = Entry(term=current_term, index=log.next_index(), cmd=cmd)log.append(entry)replicate(entry) # send to followerswait_until(acked_by_majority(entry)) # majority has it durablycommit_index = entry.indexreturn apply(cmd) # deterministic state machine# follower: accept only from a current leader with a matching logdef on_append(entry, leader_term, prev):if leader_term < current_term: return reject()if not log_matches(prev): return reject() # leader backs up, retrieslog.append(entry)return ack(entry)
The two safety hinges: commit only after a majority stores an entry, and a follower accepts entries only from a current-term leader whose log lines up with its own.
etcd and Consul (Raft)
etcd is the Raft-based key-value store that holds Kubernetes cluster state: a leader replicates a command log to a majority of members and serves linearizable reads through a quorum check. Consul uses Raft the same way for its service catalog and KV store. Both pick an odd member count (commonly 3 or 5) so a majority is well-defined and they tolerate one or two failures.
Google Chubby (Paxos)
Google's coarse-grained lock service, built on Multi-Paxos. A Chubby cell is usually five replicas with an elected master that orders operations and replicates them via Paxos, exposing locks and a small file namespace that other systems use for leader election and configuration. It's the direct ancestor of ZooKeeper, and the paper that introduced it is where many engineers first met Paxos in production form.
CockroachDB
Distributed SQL that splits the keyspace into ranges, each replicated as its own Raft group across several nodes. Running thousands of independent Raft groups (multi-raft) is how it scales writes that a single group couldn't, and a per-range lease designates one replica to serve reads quickly without a quorum round-trip on every read.
TiKV
The distributed transactional key-value store under TiDB, also multi-raft: data is split into Regions, each a Raft group built on the raft-rs library, while a Placement Driver rebalances Regions across nodes as load and capacity shift. The same per-group consensus pattern as CockroachDB, in a different stack.
Kafka KRaft
KRaft replaces Kafka's former ZooKeeper dependency with a built-in Raft quorum of controller nodes that manage cluster metadata as a replicated log, generally available since Kafka 3.3 and the only mode after ZooKeeper's removal in 4.0. Note the scope: KRaft governs the controller and metadata plane, while ordinary topic-partition data replication still uses Kafka's in-sync-replica scheme rather than Raft.