MVCC (Multi-Version Concurrency Control)

Problem

In a single-version store, a read and a concurrent write to the same row collide. Either the reader blocks the writer (reads take shared locks, writes wait) or the writer blocks the reader, so a read-heavy workload serializes against every writer and throughput collapses. A long-running read that scans many rows is worse: to get a consistent picture it would have to lock every row it touches for its entire duration, freezing out writers the whole time.

What you want is for readers to see a stable, consistent view of the database while writers keep modifying it, with neither side waiting on the other.

Solution

Stop overwriting data in place and keep multiple versions instead. Every write creates a new version of the row stamped with the transaction that produced it and marks the prior version as superseded, and every transaction reads against a snapshot: a consistent view defined by which transactions had committed at a chosen point. A read returns, for each row, the latest version visible to its snapshot, skipping versions written by transactions that committed after the snapshot or haven't committed at all. Because the old version stays put until no snapshot still needs it, a reader never locks a row and a writer never waits for a reader.

That delivers a consistent snapshot without read locks, and point-in-time reads fall out for free where versions are retained long enough, since any past state is just the set of versions visible as of that time. The price is storage and cleanup. Superseded versions pile up and must be garbage-collected once no active snapshot can see them, which is the recurring operational burden of every MVCC system. Concurrent writes to the same row still conflict: one of them aborts, typically under a first-committer-wins rule (or, in some engines, first-updater-wins). Snapshot isolation by itself also allows write skew, where two transactions read overlapping data, write disjoint rows, and together break an invariant that each preserved on its own, which is why full serializability is layered on top rather than coming from MVCC alone.
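The two failure modes in that last paragraph are easier to see side by side. The following is a minimal sketch, assuming a toy in-memory store in which writes are buffered and validated at commit time; `Store`, `Txn`, `ConflictError`, and the on-call example are hypothetical names chosen for illustration, not any engine's API.

```python
class ConflictError(Exception):
    """Raised when another transaction committed the same key first."""


class Store:
    def __init__(self, initial):
        self.commit_ts = 0                       # logical clock, bumped once per commit
        # key -> list of (commit_ts, value), oldest first
        self.versions = {k: [(0, v)] for k, v in initial.items()}

    def begin(self):
        return Txn(self, self.commit_ts)         # snapshot = everything committed so far


class Txn:
    def __init__(self, store, snapshot_ts):
        self.store = store
        self.snapshot_ts = snapshot_ts
        self.writes = {}                         # buffered until commit

    def read(self, key):
        if key in self.writes:                   # a transaction sees its own writes
            return self.writes[key]
        for ts, value in reversed(self.store.versions.get(key, [])):
            if ts <= self.snapshot_ts:           # newest version inside the snapshot
                return value
        return None

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # First-committer-wins: abort if any key we wrote has a version
        # committed after our snapshot was taken.
        for key in self.writes:
            chain = self.store.versions.get(key, [])
            if chain and chain[-1][0] > self.snapshot_ts:
                raise ConflictError(key)
        self.store.commit_ts += 1
        for key, value in self.writes.items():
            self.store.versions.setdefault(key, []).append((self.store.commit_ts, value))


def demo_write_conflict():
    store = Store({"x": 1})
    t1, t2 = store.begin(), store.begin()
    t1.write("x", 10)
    t2.write("x", 20)
    t1.commit()
    try:
        t2.commit()                              # same key, t1 got there first
    except ConflictError:
        return "t2 aborted"
    return "both committed"


def demo_write_skew():
    # Invariant the application wants: at least one doctor stays on call.
    store = Store({"alice_on_call": True, "bob_on_call": True})
    t1, t2 = store.begin(), store.begin()
    if t1.read("bob_on_call"):                   # t1 sees Bob on call, so Alice may leave
        t1.write("alice_on_call", False)
    if t2.read("alice_on_call"):                 # t2 sees Alice on call, so Bob may leave
        t2.write("bob_on_call", False)
    t1.commit()
    t2.commit()                                  # disjoint write sets: no conflict detected
    final = store.begin()
    return final.read("alice_on_call"), final.read("bob_on_call")
```

In this sketch `demo_write_conflict()` aborts the second writer, while `demo_write_skew()` commits both transactions and leaves nobody on call. The commit check only compares write sets, which is exactly the gap a serializable layer has to close by also tracking what each transaction read.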

Tradeoffs

Property	Effect
Non-blocking reads	Readers don't block writers and writers don't block readers, which is the main reason to use MVCC; read-heavy throughput stays high
Snapshots and time travel	A transaction sees a stable view, and as-of-timestamp queries fall out where old versions are kept
Storage overhead	Several versions per row accumulate until reclaimed
Garbage collection	Vacuuming, undo recycling, or TTL-based GC is mandatory background work and a real tuning surface
Write conflicts	When two transactions write the same row concurrently, one of them still aborts (first-committer-wins)
Isolation anomalies	Snapshot isolation permits write skew, so true serializability needs an extra layer

Implementations

Minimal pseudocode

# write appends a new version stamped with the writer's transaction id
def write(key, value, txn):
    mark_prior_version_expired(key, by=txn.id)
    versions[key].append(Version(value, created_by=txn.id, expired_by=None))

# read returns the newest version visible to the transaction's snapshot
def read(key, txn):
    for v in reversed(versions[key]):                 # newest first
        if visible(v, txn.snapshot):
            return v.value
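    return None                                       # no version of this key is visible

def visible(v, snap):
    created = committed_before(v.created_by, snap)    # creator is visible
    expired = (v.expired_by is not None               # a live version has no expirer
               and committed_before(v.expired_by, snap))
    return created and not expired                    # visible and not yet superseded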

# background: drop versions no active snapshot can still reach
def gc():
    for key, vs in versions.items():
        versions[key] = [v for v in vs if needed_by_active_snapshot(v)]

visible encodes the snapshot rule that makes reads consistent without locks; gc is the cost that rule imposes.
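The pseudocode above leaves its helpers undefined, so here is one way to fill them in and run it. This is a sketch under stated assumptions, not a reference implementation: a snapshot is modelled as the set of transaction ids committed when the transaction began, aborts and write-write conflict handling are left out, and `MVCCStore` and its method names are hypothetical.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class Version:
    value: object
    created_by: int
    expired_by: Optional[int] = None      # txn id that superseded this version


class MVCCStore:
    def __init__(self):
        self.versions = {}                # key -> list[Version], oldest first
        self.committed = set()            # ids of committed transactions
        self.active = {}                  # txn id -> snapshot (frozenset of committed ids)
        self._ids = count(1)

    def begin(self):
        txn_id = next(self._ids)
        self.active[txn_id] = frozenset(self.committed)
        return txn_id

    def commit(self, txn_id):
        self.committed.add(txn_id)
        del self.active[txn_id]

    def _sees(self, writer_id, txn_id):
        # A transaction sees its own writes plus whatever its snapshot contains.
        return writer_id == txn_id or writer_id in self.active[txn_id]

    def visible(self, v, txn_id):
        created = self._sees(v.created_by, txn_id)
        expired = v.expired_by is not None and self._sees(v.expired_by, txn_id)
        return created and not expired

    def write(self, key, value, txn_id):
        chain = self.versions.setdefault(key, [])
        if chain and chain[-1].expired_by is None:
            chain[-1].expired_by = txn_id             # mark prior version superseded
        chain.append(Version(value, created_by=txn_id))

    def read(self, key, txn_id):
        for v in reversed(self.versions.get(key, [])):    # newest first
            if self.visible(v, txn_id):
                return v.value
        return None

    def gc(self):
        # Keep a version if its creator is still uncommitted, if an active
        # snapshot can see it, or if a transaction starting now would see it.
        snapshots = list(self.active.values()) + [frozenset(self.committed)]

        def needed(v):
            if v.created_by not in self.committed:
                return True
            return any(
                v.created_by in s and (v.expired_by is None or v.expired_by not in s)
                for s in snapshots
            )

        for key, chain in self.versions.items():
            self.versions[key] = [v for v in chain if needed(v)]


store = MVCCStore()
setup = store.begin()
store.write("k", "old", setup)
store.commit(setup)

reader = store.begin()                    # snapshot taken before the update
writer = store.begin()
store.write("k", "new", writer)
store.commit(writer)

reader_view = store.read("k", reader)     # the reader keeps its original view
fresh_view = store.read("k", store.begin())

store.gc()
kept_while_reader_active = len(store.versions["k"])
store.commit(reader)
store.gc()
kept_after_reader_done = len(store.versions["k"])
```

The reader never takes a lock, yet it holds the old version hostage: the first `gc()` call has to keep both versions, and only after the reader finishes can the superseded one be dropped. That is the same dynamic that makes long-running transactions expensive in real MVCC systems.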

PostgreSQL

Postgres stores every row version inline in the heap, each tuple carrying xmin (the creating transaction) and xmax (the transaction that superseded or deleted it), and computes visibility by comparing those against a snapshot of in-progress and committed transaction ids. Updates and deletes leave dead tuples behind that VACUUM and autovacuum reclaim; let vacuuming fall behind and tables bloat, and because the transaction id is 32 bits, vacuuming is also what prevents id wraparound. Ordinary reads and writes never block each other.
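The xmin/xmax comparison can be sketched roughly as follows. This is a deliberately simplified model, assuming a snapshot made of a lower bound, an upper bound, and an in-progress list, plus a plain set standing in for the commit log. It ignores wraparound arithmetic, hint bits, subtransactions, command ids, row locks stored in xmax, and a transaction's view of its own changes. `Snapshot`, `HeapTuple`, and both functions are hypothetical names, not PostgreSQL internals.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    xmin: int                                  # every xid below this had finished
    xmax: int                                  # first xid not yet assigned at snapshot time
    xip: frozenset = field(default_factory=frozenset)   # xids running at snapshot time


@dataclass(frozen=True)
class HeapTuple:
    value: str
    xmin: int                                  # creating transaction
    xmax: int = 0                              # deleting or superseding transaction, 0 if none


def xid_in_snapshot(xid, snap, committed):
    """True if xid had already committed when the snapshot was taken."""
    if xid >= snap.xmax or xid in snap.xip:
        return False                           # started later, or still running back then
    return xid in committed                    # finished earlier; counts only if it committed


def tuple_visible(tup, snap, committed):
    if not xid_in_snapshot(tup.xmin, snap, committed):
        return False                           # creator is not visible
    if tup.xmax == 0:
        return True                            # never deleted or updated
    return not xid_in_snapshot(tup.xmax, snap, committed)


committed = {100, 101, 103}                    # stand-in for the commit log
snap = Snapshot(xmin=102, xmax=104, xip=frozenset({102}))

old_row = HeapTuple("v1", xmin=100, xmax=102)  # being updated by in-progress xid 102
new_row = HeapTuple("v2", xmin=102)            # the uncommitted replacement
deleted = HeapTuple("gone", xmin=100, xmax=101)
```

With this snapshot, `old_row` is still visible, `new_row` is not, and `deleted` is hidden because its deleter committed before the snapshot. The dead tuple `deleted` is the kind of row VACUUM can reclaim once no snapshot is old enough to see it.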

Oracle

Oracle keeps the current row in place and reconstructs older versions on demand from undo (rollback) segments, giving each query read consistency as of its start time even while other sessions commit. The tradeoff lives in the undo space: if a long-running query needs a version that undo has already recycled, it fails with the familiar "snapshot too old" error. It's a different storage strategy from Postgres's inline versions that reaches the same non-blocking-read guarantee.
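To make the undo-based approach concrete, here is a toy model, assuming one before-image per change and a fixed-size undo area that recycles its oldest record. It is not Oracle's block-level format. `UndoStore`, `SnapshotTooOld`, and the use of a simple counter as a system change number are illustrative assumptions.

```python
class SnapshotTooOld(Exception):
    """Toy stand-in for the "snapshot too old" error."""


class UndoStore:
    def __init__(self, undo_capacity):
        self.scn = 0                  # change counter, bumped on every write
        self.rows = {}                # key -> (value, scn of last change), updated in place
        self.undo = {}                # (key, scn) -> (previous value, previous scn)
        self.undo_order = []          # oldest undo record first, for recycling
        self.undo_capacity = undo_capacity

    def write(self, key, value):
        self.scn += 1
        # Save the before-image so older readers can rebuild it.
        self.undo[(key, self.scn)] = self.rows.get(key, (None, 0))
        self.undo_order.append((key, self.scn))
        while len(self.undo_order) > self.undo_capacity:
            self.undo.pop(self.undo_order.pop(0))    # recycle the oldest undo record
        self.rows[key] = (value, self.scn)           # overwrite the current row in place

    def read_as_of(self, key, query_scn):
        value, scn = self.rows.get(key, (None, 0))
        while scn > query_scn:        # row changed after the query started
            before = self.undo.get((key, scn))
            if before is None:        # the undo we need has been recycled
                raise SnapshotTooOld(key)
            value, scn = before       # step back one version
        return value


store = UndoStore(undo_capacity=2)
store.write("acct", 100)
query_scn = store.scn                 # a long-running query starts here
store.write("acct", 90)
consistent = store.read_as_of("acct", query_scn)

store.write("acct", 80)               # more churn recycles the undo the query needs
store.write("acct", 70)
try:
    store.read_as_of("acct", query_scn)
    too_old = False
except SnapshotTooOld:
    too_old = True
```

The first read walks one undo record back and sees the balance as of the query's start. After two more writes the needed before-image is gone and the same read fails, which is why undo sizing and retention have to account for the longest query the system is expected to run.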

CockroachDB

CockroachDB applies MVCC in its distributed key-value layer, where each key holds multiple versions tagged with hybrid-logical-clock timestamps, so a read at timestamp T returns the newest version at or below T. That yields consistent distributed snapshots and AS OF SYSTEM TIME queries, with old versions removed by a configurable GC TTL, and it handles clock skew by carrying an uncertainty interval and restarting a read when a version falls inside it.
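The read rule and the uncertainty restart can be sketched in a few lines. This is a simplified single-node model, assuming integer timestamps in place of hybrid logical clocks and an uncertainty window of `read_ts + max_offset`. `VersionedKV`, `UncertaintyRestart`, and `read_with_restarts` are hypothetical names and do not mirror CockroachDB's actual code.

```python
import bisect


class UncertaintyRestart(Exception):
    def __init__(self, retry_ts):
        super().__init__(f"retry at {retry_ts}")
        self.retry_ts = retry_ts


class VersionedKV:
    def __init__(self):
        self.stamps = {}      # key -> sorted list of version timestamps
        self.values = {}      # key -> values, parallel to self.stamps[key]

    def put(self, key, timestamp, value):
        stamps = self.stamps.setdefault(key, [])
        values = self.values.setdefault(key, [])
        i = bisect.bisect_left(stamps, timestamp)
        stamps.insert(i, timestamp)
        values.insert(i, value)

    def get(self, key, read_ts, uncertainty_limit):
        stamps = self.stamps.get(key, [])
        i = bisect.bisect_right(stamps, read_ts)     # stamps[:i] are at or below read_ts
        if i < len(stamps) and stamps[i] <= uncertainty_limit:
            # A version sits above read_ts but inside the uncertainty window:
            # on a skewed clock it may really have been written before the read.
            raise UncertaintyRestart(stamps[i])
        return self.values[key][i - 1] if i > 0 else None


def read_with_restarts(kv, key, read_ts, max_offset):
    limit = read_ts + max_offset      # window is fixed by the original timestamp
    while True:
        try:
            return kv.get(key, read_ts, limit), read_ts
        except UncertaintyRestart as restart:
            read_ts = restart.retry_ts    # retry above the uncertain version


kv = VersionedKV()
kv.put("k", 10, "a")
kv.put("k", 20, "b")
kv.put("k", 23, "c")

as_of_15 = kv.get("k", 15, 15)                       # historical read, no uncertainty
result = read_with_restarts(kv, "k", 18, max_offset=3)
```

A historical read with no uncertainty window simply returns the newest version at or below its timestamp. The read at 18 with a window of 3 finds the version at 20 inside the window, restarts at 20, and returns that version, since the one at 23 lies beyond the fixed limit of 21.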

TiDB

TiDB layers timestamp-ordered MVCC over its TiKV storage, with transaction timestamps handed out by a centralized timestamp oracle in the Placement Driver so every node agrees on ordering. A read at a given timestamp sees the matching versions, a background GC worker reclaims versions older than a safe point, and the same timestamp mechanism powers stale and historical reads against past states of the data.
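A rough sketch of how a single timestamp source and a GC safe point fit together, assuming a plain counter in place of the real oracle's physical-plus-logical timestamps and a single in-memory map in place of TiKV. `TimestampOracle`, `TimestampedStore`, and their methods are hypothetical names for illustration.

```python
import itertools


class TimestampOracle:
    """Single source of strictly increasing timestamps (toy stand-in for the oracle)."""

    def __init__(self):
        self._counter = itertools.count(1)

    def next_ts(self):
        return next(self._counter)


class TimestampedStore:
    def __init__(self, oracle):
        self.oracle = oracle
        self.versions = {}            # key -> list of (commit_ts, value), oldest first
        self.safe_point = 0           # reads below this are no longer guaranteed

    def put(self, key, value):
        commit_ts = self.oracle.next_ts()
        self.versions.setdefault(key, []).append((commit_ts, value))
        return commit_ts

    def get(self, key, read_ts):
        if read_ts < self.safe_point:
            raise ValueError("read timestamp is older than the GC safe point")
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= read_ts:  # newest version at or below the read timestamp
                return value
        return None

    def gc(self, safe_point):
        self.safe_point = safe_point
        for key, chain in self.versions.items():
            old = [v for v in chain if v[0] <= safe_point]
            new = [v for v in chain if v[0] > safe_point]
            # The newest version at or below the safe point is still readable,
            # so it survives; everything older than it is reclaimed.
            self.versions[key] = old[-1:] + new


store = TimestampedStore(TimestampOracle())
t1 = store.put("k", "v1")
t2 = store.put("k", "v2")
t3 = store.put("k", "v3")

historical = store.get("k", t2)       # stale read against a past state
store.gc(safe_point=t2)
remaining = len(store.versions["k"])
still_readable = store.get("k", t2)
try:
    store.get("k", t1)
    rejected = False
except ValueError:
    rejected = True
```

After `gc(safe_point=t2)` the store keeps two versions: the one a read at the safe point would return, and the newer one. A read at `t1` is rejected instead of silently returning wrong data, which is the contract the safe point exists to enforce.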