Copy-On-Write & Immutability

Problem

A logical write often touches several pages or structures, but storage only guarantees atomicity at the granularity of a single sector or page. A crash partway through leaves the data in a torn, half-applied state with no record of what was meant to happen. The naive fix, flushing every modified data page to its final location on each commit, is slow because those locations are scattered (random I/O), and it still gives no way to tell after a crash which writes actually landed. You need two guarantees, durability (a committed change survives a crash) and atomicity (a transaction applies fully or not at all), without paying random-write latency on the commit path.

Solution

Before touching the real data, append a record describing the intended change to a sequential log and flush that log to stable storage. The defining rule is in the name: the log record must reach disk before the data page it describes is allowed to be considered durable. Because the log is append-only it's sequential, so the fsync that makes a commit durable is fast. The actual data pages can be written lazily afterward, or reconstructed from the log if a crash beats them to disk.

After a crash, recovery scans the log and replays it: redo committed changes that hadn't yet reached the data files, and undo changes from transactions that never committed. The canonical algorithm is ARIES, which tags every record with a monotonic log sequence number (LSN), takes periodic checkpoints so recovery needn't start from the beginning of time, and recycles log space once a checkpoint covers it. A useful side effect: since the log is a complete, ordered description of every change, shipping it to another machine reproduces the same state, which is how most replication and change-data-capture is built.

Tradeoffs

PropertyEffect
Write amplificationData is written twice, once to the log and later to its data page
Commit latencyDominated by fsync; group commit batches many transactions into one flush to amortize it
Durability knobYou can relax the commit fsync (async commit) to trade a window of data loss for lower latency
Sequential I/OThe log is append-only, which suits both spinning disks and SSDs and avoids seek-bound random writes
Recovery timeBounded by the checkpoint interval: frequent checkpoints shorten recovery but raise steady-state I/O
Log retentionThe log must be kept until a checkpoint covers it; lagging checkpoints let it grow
ContentionA single shared log can become a concurrency bottleneck under heavy write load

Implementations

Minimal pseudocode

def write(txn, change):
rec = LogRecord(lsn=next_lsn(), txn=txn.id, change=change)
log.append(rec) # sequential, in memory buffer
apply_to_page_cache(change) # data page dirtied, not yet flushed
return rec.lsn
def commit(txn):
log.append(LogRecord(lsn=next_lsn(), txn=txn.id, type=COMMIT))
log.fsync() # the durability point
# data pages flush lazily later, or get replayed on recovery
def recover():
last_ckpt = find_last_checkpoint()
committed = set()
for rec in log.scan_from(last_ckpt): # redo pass
if rec.type == COMMIT:
committed.add(rec.txn)
elif rec.lsn > page_lsn(rec.change.page):
redo(rec.change) # change hadn't reached the page
for txn in in_progress_but_not(committed): # undo pass
rollback(txn)

PostgreSQL WAL

Every modification emits a WAL record with an LSN, written into 16 MB segment files under pg_wal/. A commit forces the WAL up to its LSN to flush, governed by synchronous_commit; checkpoints flush dirty shared-buffer pages so the redo start point advances. The same WAL stream feeds streaming replication and point-in-time recovery (archived WAL replayed on top of a base backup), and full_page_writes defends against torn pages by logging a full image of a page the first time it's touched after a checkpoint. Docs: https://www.postgresql.org/docs/current/wal-intro.html.

MySQL / InnoDB redo log

A fixed-size set of redo log files used circularly, recording physical page changes ahead of the pages themselves. innodb_flush_log_at_trx_commit selects the durability/latency point (1 flushes on every commit, 0 and 2 relax it). Recovery is ARIES-style, pairing the redo log with InnoDB's undo log for rollback and a doublewrite buffer to handle torn writes during page flushes. Docs: https://dev.mysql.com/doc/refman/en/innodb-redo-log.html.

etcd Raft log

Here the WAL is a consensus log rather than a local recovery aid (though it serves both). Each entry is a proposed state-machine command; a node appends it and fsyncs it before acknowledging, and an entry is only committed once a quorum of nodes has it durably. The ordered log defines a single agreed sequence of operations, which etcd's MVCC key-value store then applies in order. The same log lets a lagging or rejoining follower catch up by replaying entries. Source: https://github.com/etcd-io/raft.

Kafka — the WAL as the product

A Kafka partition is, on disk, an append-only log split into segments retained by time or size. Producers append records and get back an offset; consumers read forward from an offset and can rewind to replay history. The idea that quietly backs the engines above is here promoted to the central abstraction: the log is the durable source of truth, and many independent consumers replay it for event sourcing, stream processing, and cross-system data integration. Design write-up: https://kafka.apache.org/documentation/#design.