Problem
When data structures are shared and updated in place, two problems appear together. A reader traversing the structure while a writer modifies it can observe a half-applied change, so reads and writes need locks to coordinate, which serializes access and hurts concurrency. And an in-place update interrupted by a crash leaves the structure partially modified and possibly corrupt. On top of both, there's no cheap way to capture the state as of a moment in time, because the data that represented that moment has been overwritten.
Solution
Never overwrite existing data. To change something, allocate new nodes for the parts that changed plus every node on the path from there to the root, then publish the result by atomically swapping a single pointer, the root reference. The previous version stays intact and fully readable. This is path copying with structural sharing: only the nodes along the modified path are duplicated, while every untouched subtree is shared between the old and new versions, so a change costs work proportional to tree depth rather than tree size.
A reader captures the current root and traverses from there. It can't see a partial update, because an update isn't visible until its root swap completes, and that swap is a single atomic operation (a compare-and-swap in memory, or a torn-write-safe pointer update on disk). So readers need no lock against writers.
Several capabilities fall out of this directly. Holding onto an old root is a consistent point-in-time snapshot, taken in constant time. Keeping many old roots lets you query the database as of any past version, which is time travel. Readers and writers don't block each other because the writer builds a new tree off to the side and only the final swap is shared state, giving lock-free MVCC reads. Crash recovery is trivial: the new root only becomes durable after its nodes are written, so a crash mid-update simply leaves you on the old, still-valid root.
This generalizes the immutable-SSTable idea from the LSM piece, and reaches WAL's atomic-commit goal by a different route, swapping a pointer instead of logging an intent.
Tradeoffs
| Property | Effect |
|---|---|
| Reads | Lock-free, consistent, and snapshots are O(1) |
| Write amplification | Changing one leaf rewrites the entire root-to-leaf path |
| Durability | The new nodes and the root must be flushed before the swap counts as committed |
| Space reclamation | Old versions and orphaned nodes accumulate; freeing them needs reference counting, epoch-based reclamation, or explicit snapshot deletion, and this is the genuinely hard part |
| Write contention | The single shared root concentrates writer contention, which single-writer designs (LMDB) sidestep entirely |
| Fragmentation | New allocations scatter related data, hurting locality over time |
Implementations
Minimal pseudocode
def update(root, key, value):# returns a NEW root; the old tree is untouched and still validpath = []node = rootwhile not node.is_leaf:i = node.child_index(key)path.append((node, i))node = node.children[i]new_node = node.copy_with(key, value) # copy the leaffor parent, i in reversed(path): # copy the path upparent = parent.copy()parent.children[i] = new_nodenew_node = parentreturn new_node # the new rootdef commit(db, new_root):flush(new_root) # persist new nodes firstatomic_store(db.root_ptr, new_root) # single-word publishdef snapshot(db):return db.root_ptr # O(1): just keep the root
ZFS
A copy-on-write filesystem where no live block is ever overwritten. A modified block is written to a free location, which changes its parent block pointer, which propagates up the tree of block pointers to a single root called the uberblock, updated atomically. Because every pointer also carries a checksum, the tree is self-validating, and because on-disk state always reflects a complete version, ZFS has no fsck. Snapshots cost almost nothing: retain the block pointers of an old version, and blocks are freed only once no snapshot references them. Overview: https://openzfs.github.io/openzfs-docs/.
Btrfs
The Linux copy-on-write filesystem, built on B-trees. Modifying a node copies it and the path to the tree's root, then the superblock is pointed at the new root. Subvolumes and snapshots share extents and are tracked by a reference-counting structure that decides when an extent can be reclaimed. Docs: https://btrfs.readthedocs.io/.
Git objects
Git stores content-addressed immutable objects (blobs, trees, commits) named by their hash, so an object can never change without changing its name. You don't edit a commit; you create a new commit that points at new tree objects and at the parent commit, while every unchanged subtree is shared by hash. A branch is a mutable pointer to a commit, which is exactly the atomic root swap, and retained history is the set of old roots that makes past states reachable. Internals: https://git-scm.com/book/en/v2/Git-Internals-Git-Objects.
Datomic
An immutable database where facts (datoms) are only ever added, each stamped with a transaction time, never updated or deleted in place. The whole database is treated as an immutable value, so reading "as of time T" is just filtering by transaction time against the same accumulated data. Index trees share structure across versions, and a query holds a stable database value with no read locks. Architecture: https://docs.datomic.com/.
LMDB
A memory-mapped B+ tree using copy-on-write with MVCC and a single writer. A write transaction copies the pages it touches plus the path to the root, then commits by atomically updating one of two alternating meta pages, which is the durable root swap. Readers open a transaction pinned to the meta page current at that instant, so they see a stable snapshot, take no locks, and never block the writer. Pages from versions no longer referenced by any reader return to a free list for reuse. Design talk and docs: http://www.lmdb.tech/doc/.