Problem
Data outgrows memory, so it has to live on block storage that's read and written a page at a time. You still need point lookups, range scans, inserts, updates, and deletes, all with a bounded number of disk I/Os per operation, and all crash-safe. A flat sorted file gives fast lookups but every insert shifts the tail; an append-only unsorted file makes inserts cheap but lookups scan everything. Neither survives at scale.
Solution
A B-tree (B+ tree in practice) stores data in fixed-size pages, typically 4–16 KB to match the OS/disk block. It's a balanced tree with high fan-out: each interior page holds many keys and child pointers, so even a billion rows sit only three or four levels deep. A lookup walks root → leaf, touching one page per level.
In the B+ tree variant used everywhere, only leaf pages hold actual data; interior pages are pure routing keys, and leaves are linked left-to-right so a range scan reads one leaf then follows the link to the next. Updates happen in place: locate the leaf, modify it, write the page back. An insert that overflows a page splits it into two and pushes a separator key up; a delete that underflows a page may merge it with a sibling. A redo write-ahead log records page changes before they're applied, so a crash mid-write can be recovered.
A key design axis is whether the table itself is the tree. Index-organized engines store full rows in the primary-key B-tree's leaves. Heap + index engines keep rows in an unordered heap file and build separate B-trees that point at heap locations.
Tradeoffs
| Property | Effect |
|---|---|
| Reads | Low and predictable read amplification: one root-to-leaf path. This is the B-tree's main advantage |
| Range scans | Efficient via linked leaves, especially when rows are clustered by the scan key |
| Write path | In-place updates are random writes; each logical change dirties a page that must later be flushed |
| WAL cost | Every change is written twice (log, then page), and engines like InnoDB add a doublewrite buffer, raising write volume further |
| Space | Pages sit partly empty after splits (often ~70% full); fragmentation accumulates and needs occasional rebuilds |
| MVCC engines | Old row versions linger until a cleanup process reclaims them (Postgres VACUUM, InnoDB purge) |
| Concurrency | Page-level latching and locking is intricate and a classic contention point |
Against an LSM tree the contrast is direct: B-trees win on reads and range scans, LSM trees win on random-write throughput.
Implementations
Minimal pseudocode
def search(node, key):while not node.is_leaf:i = node.find_child_index(key) # binary search within the pagenode = read_page(node.children[i])return node.get(key) # None if absentdef insert(node, key, value):leaf = descend_to_leaf(node, key)leaf.put(key, value) # in-placewrite_page(leaf)if leaf.overflows():split(leaf) # may cascade splits up to the rootdef split(page):mid = page.middle_key()left, right = page.divide_at(mid)write_page(left); write_page(right)parent = page.parentparent.insert_separator(mid, right) # push median upwrite_page(parent)if parent.overflows():split(parent)
Real engines add page latching, the WAL, and free-space management around this skeleton.
InnoDB (MySQL) — index-organized
The table is a B+ tree keyed on the primary key; leaf pages hold the complete rows, so primary-key range scans are clustered and fast. Secondary indexes store the primary key as their row pointer, meaning a secondary lookup does two descents (index tree, then PK tree). Pages are 16 KB. Durability comes from the redo log plus a doublewrite buffer that guards against torn page writes; MVCC is served from an undo log, and a buffer pool caches hot pages. Docs: https://dev.mysql.com/doc/refman/en/innodb-storage-engine.html.
PostgreSQL — heap + B-tree
Rows live in an unordered heap file, addressed by TID (page number, item offset). Indexes are separate structures, the default being a B-tree implementing the Lehman-Yao concurrent variant (lock-free descent via high keys and right-links). Because the primary key isn't clustered by default, even a primary-key range scan may hit scattered heap pages. MVCC keeps superseded row versions in the heap itself, which VACUUM later reclaims; the WAL provides crash recovery and feeds replication. Source: https://github.com/postgres/postgres, tree under src/backend/access/nbtree.
SQLite — single-file embedded
The entire database is one file divided into pages (4 KB by default). Each table is a B-tree keyed by an integer rowid with rows stored in the leaves (index-organized, like InnoDB), and each explicit index is its own B-tree. Durability uses either a rollback journal or WAL mode. It's a library linked into the application, not a server, which is why it's the most deployed database engine in the world (phones, browsers, embedded devices). Source: https://www.sqlite.org/src.