Linearizability

Problem

A value replicated across machines has no agreed "now." A write returns success on one node while a read on another still returns the old value. Two clients reading at the same wall-clock moment disagree about what the value is. Ordering can even invert against reality: client A writes x=1, tells client B over a side channel, B reads and sees the prior value, so the store contradicts an ordering that already happened in the world.

Any code that assumes "once my write is acknowledged, every subsequent read sees it" is wrong on such a store. You cannot build a correct distributed lock, leader election, or unique-ID allocator on top of something that serves stale reads, because each of those depends on every participant agreeing on the single latest value.

Solution

Guarantee that each operation appears to take effect at one instant between its invocation and its response, and that those instants respect real time: if operation A returns before operation B starts, A is ordered before B. The consequence clients depend on is that a read reflects the most recently completed write, or a later one, never an older value. There is one total order of operations that every client agrees on, and it matches the wall-clock order of any two operations that don't overlap.

Operations that do overlap in real time (issued but not yet acknowledged) may be ordered either way; linearizability only constrains non-overlapping operations.

The mechanism is a single agreed sequencer. A consensus protocol (Raft, Paxos, ZAB) elects a leader that assigns each write a position in a replicated log and commits it to a majority before acknowledging. Reads either pass through that leader or confirm current leadership and commit position before returning, so a deposed or lagging replica can't answer with stale data.

Tradeoffs

PropertyEffect
Write latencyEvery write needs a majority round-trip before it can ack; at least one quorum RTT, more across regions
Read costA correct read can't trust a local replica; it needs leader confirmation or a quorum, so reads cost more than eventually-consistent ones
AvailabilityCAP forces the choice: a minority partition can't reach a quorum, so it must reject operations rather than risk stale answers. Consistency wins, availability loses
ThroughputOrdering flows through one leader, capping single-object write throughput at what one node can sequence. You can't shard a single key's writes
GeographyThe latency floor is the round-trip to the farthest quorum member, which makes global deployments slow
ScopeIt's a single-object guarantee and says nothing about atomic updates spanning multiple objects. That's serializability, a separate and stronger property

Implementations

Minimal pseudocode

# all ops for the key route through the elected leader
def write(key, value):
pos = leader.append(log, (key, value)) # assign next log position
wait_until_committed(pos) # replicated to a majority
return OK # ack only after commit
def read(key):
confirm_still_leader() # else redirect / step down
return apply(log.committed())[key] # newest committed value

The two load-bearing lines: acknowledge a write only after a majority has it, and confirm leadership before answering a read.

etcd

A Raft-backed key-value store. Writes go through the Raft leader and commit to a majority before returning. Reads are linearizable by default via a ReadIndex check that confirms the leader is still current and has applied all committed entries before answering. You can opt into faster serializable reads served from a local replica, which may be stale.

ZooKeeper

Uses the ZAB consensus protocol. Writes are linearizable, totally ordered through the leader. Reads are deliberately not linearizable by default: they're served from a local replica and can lag. A client that needs a linearizable read calls sync() first, which forces the replica to catch up to the leader before the read returns. A useful contrast to etcd's default.

Spanner reads

Spanner uses TrueTime, an API that exposes bounded clock uncertainty from GPS and atomic clocks. By assigning each transaction a timestamp and waiting out the uncertainty interval before committing, Spanner makes reads and writes externally consistent (linearizable) across a globally distributed database, which is otherwise the hardest place to get this guarantee.

Single-key DynamoDB

A strongly consistent read (ConsistentRead=true) on a single item is linearizable, served from the leader replica of that item's partition. The default read is eventually consistent and can return a stale value, trading the guarantee for lower latency and higher availability.