Consensus

Problem

A single node is one point of failure and one point of truth. Replicate it for availability and the replicas can now disagree: a partition splits them, a leader crashes halfway through a write, messages arrive reordered, and two replicas end up holding different "latest" values with no way to tell which is correct. Picking the wrong one corrupts everything downstream that trusted it.

What's needed is for a set of nodes to agree on a single ordered history of operations, and to keep agreeing even while some nodes crash and the network drops, delays, and reorders messages. No two surviving nodes may ever disagree about what was decided. That agreed log is what every linearizable store, lock service, and leader election sits on top of, because feeding the same ordered commands to every replica keeps them identical.

Solution

Require a majority quorum to accept each decision before it counts as committed, and require any later decision to consult a majority too. Because any two majorities share at least one node, that overlapping node carries every committed value forward, so a new leader can never miss something already decided. With 2f+1 nodes this tolerates f crash failures, because the f+1 nodes that remain still form a majority.
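The overlap argument can be checked by brute force for small clusters. The following is a minimal sketch, not taken from any particular implementation; the function names are illustrative. It computes the majority size and tolerated failures for a cluster of n nodes and confirms that every pair of majority-sized quorums shares a node.

```python
from itertools import combinations

def majority(n: int) -> int:
    """Smallest number of nodes that is more than half of n."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Nodes that can fail while a majority is still available."""
    return n - majority(n)

def majorities_always_overlap(n: int) -> bool:
    """Brute-force check that every pair of majority-sized quorums shares a node."""
    quorums = [set(q) for q in combinations(range(n), majority(n))]
    return all(a & b for a in quorums for b in quorums)

for n in (3, 4, 5):
    print(n, majority(n), tolerated_failures(n), majorities_always_overlap(n))
# 3 2 1 True
# 4 3 1 True
# 5 3 2 True
```

Note that four nodes tolerate no more failures than three: the quorum grows from 2 to 3 while the tolerated failure count stays at 1.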

Elect a leader to order proposals, which avoids competing proposers stalling each other. The leader appends each client command to its log, replicates the entry to followers, and commits it once a majority has durably stored it, then applies the command to a deterministic state machine and answers the client. Replaying the same committed log through the same state machine on every node keeps the replicas identical, which is the replicated-state-machine construction.
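The replicated-state-machine idea is easy to see in miniature. The sketch below assumes a toy key-value state machine with three made-up operations (`set`, `incr`, `delete`); it is an illustration only and leaves out replication entirely. What it shows is that replicas applying the same committed log in the same order end up in the same state.

```python
from dataclasses import dataclass, field

@dataclass
class KVStateMachine:
    """Deterministic: the same commands in the same order give the same state."""
    data: dict = field(default_factory=dict)

    def apply(self, cmd: tuple):
        op, key, *rest = cmd
        if op == "set":
            self.data[key] = rest[0]
            return rest[0]
        if op == "incr":
            self.data[key] = self.data.get(key, 0) + 1
            return self.data[key]
        if op == "delete":
            return self.data.pop(key, None)
        raise ValueError(f"unknown op: {op}")

# The committed log that consensus has already agreed on (hypothetical contents).
committed_log = [("set", "x", 1), ("incr", "x"), ("set", "y", 5),
                 ("delete", "y"), ("incr", "x")]

replicas = [KVStateMachine() for _ in range(3)]
for replica in replicas:
    for cmd in committed_log:
        replica.apply(cmd)

print(replicas[0].data)  # {'x': 3}
```

The construction only works if `apply` is deterministic: a command that read the local clock or a random number would let replicas drift apart even with an identical log.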

Safety holds regardless of timing: even during a partition, no two nodes decide differently, because progress requires a majority and a minority simply can't form one. Liveness is the part you give up. The FLP result shows that in a fully asynchronous system no deterministic protocol can guarantee a decision if even one node may crash, so practical protocols guarantee safety always and make progress only once enough nodes can exchange messages within their timeouts. Randomized election timeouts make repeated split votes unlikely, so elections don't keep colliding and retrying.
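A rough sketch of randomized election timeouts follows. The 150 to 300 ms window is an assumed example range, and the helper names are hypothetical; real systems tune these values to their network. Each node draws its own timeout, so one node's timer usually fires well before the others and it can request votes uncontested.

```python
import random

def election_timeout(rng: random.Random, base_ms: float = 150.0,
                     spread_ms: float = 150.0) -> float:
    """Draw a timeout uniformly from the window starting at base_ms."""
    return base_ms + rng.random() * spread_ms

def first_candidate(timeouts: dict) -> str:
    """The node whose timer fires first is the one that starts an election."""
    return min(timeouts, key=timeouts.get)

rng = random.Random(7)  # seeded only so the example is repeatable
timeouts = {node: election_timeout(rng) for node in ("a", "b", "c")}
candidate = first_candidate(timeouts)
print(candidate, round(timeouts[candidate], 1))
```

If two timers happen to fire close together and the vote splits, every node draws a fresh timeout for the next round, so a repeat collision becomes less likely with each attempt.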

Tradeoffs

| Property | Effect |
| --- | --- |
| Fault tolerance | 2f+1 nodes tolerate f failures; a majority must stay alive and reachable, so quorum loss means unavailability |
| Latency | Every commit costs a majority round-trip, and all writes funnel through the leader |
| Partition behavior | The minority side stops accepting writes by design, choosing consistency over availability |
| Liveness | FLP forbids guaranteed termination under full asynchrony; bad timing can stall progress while safety still holds |
| Write scaling | A single group doesn't scale writes; you shard into many independent consensus groups to grow |
| Implementation risk | Hard to get right, and membership reconfiguration is especially error-prone |

Implementations

Minimal pseudocode (leader-based log replication)

# leader: order, replicate, commit on majority, then apply
def propose(cmd):
    entry = Entry(term=current_term, index=log.next_index(), cmd=cmd)
    log.append(entry)
    replicate(entry)                      # send to followers
    wait_until(acked_by_majority(entry))  # majority has it durably
    commit_index = entry.index
    return apply(cmd)                     # deterministic state machine

# follower: accept only from a current leader with a matching log
def on_append(entry, leader_term, prev):
    if leader_term < current_term: return reject()
    if not log_matches(prev): return reject()  # leader backs up, retries
    log.append(entry)
    return ack(entry)

The two safety hinges: commit only after a majority stores an entry, and a follower accepts entries only from a current-term leader whose log lines up with its own.
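The follower side of the pseudocode above can be made runnable as a small sketch. It goes beyond the pseudocode in a few assumed details, modeled loosely on Raft's AppendEntries check: `prev` is passed as an index and a term, entries arrive as a list, and a conflicting suffix is truncated before appending. The `Entry` and `Follower` classes are hypothetical and ignore persistence and networking.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    term: int
    cmd: str

@dataclass
class Follower:
    current_term: int = 0
    log: list = field(default_factory=list)  # log[i] holds the entry at index i + 1

    def on_append(self, leader_term, prev_index, prev_term, entries) -> bool:
        if leader_term < self.current_term:
            return False                      # reject a stale leader
        self.current_term = leader_term
        # Log-matching check: we must already hold the entry just before the new ones.
        if prev_index > 0 and (len(self.log) < prev_index
                               or self.log[prev_index - 1].term != prev_term):
            return False                      # leader backs up and retries
        for offset, entry in enumerate(entries):
            pos = prev_index + offset         # zero-based position in self.log
            if pos < len(self.log) and self.log[pos].term != entry.term:
                del self.log[pos:]            # drop the conflicting suffix
            if pos >= len(self.log):
                self.log.append(entry)
        return True

f = Follower(current_term=2,
             log=[Entry(1, "a"), Entry(1, "b"), Entry(2, "stale")])
first_try = f.on_append(3, prev_index=3, prev_term=3, entries=[Entry(3, "d")])
second_try = f.on_append(3, prev_index=2, prev_term=1,
                         entries=[Entry(3, "c"), Entry(3, "d")])
old_leader = f.on_append(1, prev_index=0, prev_term=0, entries=[Entry(1, "z")])
print(first_try, second_try, old_leader, [e.cmd for e in f.log])
# False True False ['a', 'b', 'c', 'd']
```

The first attempt fails because the follower's entry at index 3 has a different term. The leader retries from one entry earlier, and the follower replaces its uncommitted "stale" entry with the leader's entries.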

etcd and Consul (Raft)

etcd is the Raft-based key-value store that holds Kubernetes cluster state: a leader replicates a command log to a majority of members and serves linearizable reads through a quorum check. Consul uses Raft the same way for its service catalog and KV store. Both recommend an odd member count (commonly 3 or 5, tolerating one or two failures respectively), because an even count enlarges the quorum without tolerating any additional failure.

Google Chubby (Paxos)

Google's coarse-grained lock service, built on Multi-Paxos. A Chubby cell is usually five replicas with an elected master that orders operations and replicates them via Paxos, exposing locks and a small file namespace that other systems use for leader election and configuration. It inspired Apache ZooKeeper, which offers a similar coordination service on its own protocol (Zab), and the paper that introduced it is where many engineers first met Paxos in production form.

CockroachDB

Distributed SQL that splits the keyspace into ranges, each replicated as its own Raft group across several nodes. Running thousands of independent Raft groups (multi-raft) is how it scales writes that a single group couldn't, and a per-range lease designates one replica to serve reads quickly without a quorum round-trip on every read.
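To make the per-range idea concrete, here is a minimal sketch of routing a key to the consensus group that owns its range. The split points and group names are invented for illustration, and this is not CockroachDB's actual lookup code; it only shows the general technique of a sorted list of range start keys searched by bisection.

```python
from bisect import bisect_right

# Hypothetical split points: each range covers [start, next start).
range_starts = ["", "g", "n", "t"]
range_groups = ["group-1", "group-2", "group-3", "group-4"]

def group_for(key: str) -> str:
    """Return the consensus group whose range contains key."""
    i = bisect_right(range_starts, key) - 1
    return range_groups[i]

print(group_for("apple"), group_for("g"), group_for("melon"), group_for("zebra"))
# group-1 group-2 group-2 group-4
```

Writes to keys in different ranges go to different groups and therefore to different leaders, which is where the extra write throughput comes from.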

TiKV

The distributed transactional key-value store under TiDB, also multi-raft: data is split into Regions, each a Raft group built on the raft-rs library, while a Placement Driver rebalances Regions across nodes as load and capacity shift. The same per-group consensus pattern as CockroachDB, in a different stack.

Kafka KRaft

KRaft replaces Kafka's former ZooKeeper dependency with a built-in Raft quorum of controller nodes that manage cluster metadata as a replicated log. It was declared production-ready in Kafka 3.3 and is the only mode since ZooKeeper support was removed in 4.0. Note the scope: KRaft governs the controller and metadata plane, while ordinary topic-partition data replication still uses Kafka's in-sync-replica scheme rather than Raft.