Problem
A logical write often touches several pages or structures, but storage only guarantees atomicity at the granularity of a single sector or page. A crash partway through leaves the data in a torn, half-applied state with no record of what was meant to happen. The naive fix, flushing every modified data page to its final location on each commit, is slow because those locations are scattered (random I/O), and it still gives no way to tell after a crash which writes actually landed. You need two guarantees, durability (a committed change survives a crash) and atomicity (a transaction applies fully or not at all), without paying random-write latency on the commit path.
Solution
Before touching the real data, append a record describing the intended change to a sequential log and flush that log to stable storage. The defining rule is in the name: the log record must reach disk before the data page it describes is allowed to be considered durable. Because the log is append-only it's sequential, so the fsync that makes a commit durable is fast. The actual data pages can be written lazily afterward, or reconstructed from the log if a crash beats them to disk.
After a crash, recovery scans the log and replays it: redo committed changes that hadn't yet reached the data files, and undo changes from transactions that never committed. The canonical algorithm is ARIES, which tags every record with a monotonic log sequence number (LSN), takes periodic checkpoints so recovery needn't start from the beginning of time, and recycles log space once a checkpoint covers it. A useful side effect: since the log is a complete, ordered description of every change, shipping it to another machine reproduces the same state, which is how most replication and change-data-capture is built.
Tradeoffs
| Property | Effect |
|---|---|
| Write amplification | Data is written twice, once to the log and later to its data page |
| Commit latency | Dominated by fsync; group commit batches many transactions into one flush to amortize it |
| Durability knob | You can relax the commit fsync (async commit) to trade a window of data loss for lower latency |
| Sequential I/O | The log is append-only, which suits both spinning disks and SSDs and avoids seek-bound random writes |
| Recovery time | Bounded by the checkpoint interval: frequent checkpoints shorten recovery but raise steady-state I/O |
| Log retention | The log must be kept until a checkpoint covers it; lagging checkpoints let it grow |
| Contention | A single shared log can become a concurrency bottleneck under heavy write load |
Implementations
Minimal pseudocode
def write(txn, change):rec = LogRecord(lsn=next_lsn(), txn=txn.id, change=change)log.append(rec) # sequential, in memory bufferapply_to_page_cache(change) # data page dirtied, not yet flushedreturn rec.lsndef commit(txn):log.append(LogRecord(lsn=next_lsn(), txn=txn.id, type=COMMIT))log.fsync() # the durability point# data pages flush lazily later, or get replayed on recoverydef recover():last_ckpt = find_last_checkpoint()committed = set()for rec in log.scan_from(last_ckpt): # redo passif rec.type == COMMIT:committed.add(rec.txn)elif rec.lsn > page_lsn(rec.change.page):redo(rec.change) # change hadn't reached the pagefor txn in in_progress_but_not(committed): # undo passrollback(txn)
PostgreSQL WAL
Every modification emits a WAL record with an LSN, written into 16 MB segment files under pg_wal/. A commit forces the WAL up to its LSN to flush, governed by synchronous_commit; checkpoints flush dirty shared-buffer pages so the redo start point advances. The same WAL stream feeds streaming replication and point-in-time recovery (archived WAL replayed on top of a base backup), and full_page_writes defends against torn pages by logging a full image of a page the first time it's touched after a checkpoint. Docs: https://www.postgresql.org/docs/current/wal-intro.html.
MySQL / InnoDB redo log
A fixed-size set of redo log files used circularly, recording physical page changes ahead of the pages themselves. innodb_flush_log_at_trx_commit selects the durability/latency point (1 flushes on every commit, 0 and 2 relax it). Recovery is ARIES-style, pairing the redo log with InnoDB's undo log for rollback and a doublewrite buffer to handle torn writes during page flushes. Docs: https://dev.mysql.com/doc/refman/en/innodb-redo-log.html.
etcd Raft log
Here the WAL is a consensus log rather than a local recovery aid (though it serves both). Each entry is a proposed state-machine command; a node appends it and fsyncs it before acknowledging, and an entry is only committed once a quorum of nodes has it durably. The ordered log defines a single agreed sequence of operations, which etcd's MVCC key-value store then applies in order. The same log lets a lagging or rejoining follower catch up by replaying entries. Source: https://github.com/etcd-io/raft.
Kafka — the WAL as the product
A Kafka partition is, on disk, an append-only log split into segments retained by time or size. Producers append records and get back an offset; consumers read forward from an offset and can rewind to replay history. The idea that quietly backs the engines above is here promoted to the central abstraction: the log is the durable source of truth, and many independent consumers replay it for event sourcing, stream processing, and cross-system data integration. Design write-up: https://kafka.apache.org/documentation/#design.