Page-Oriented Storage | Microservices Course

Problem

Data outgrows memory, so it has to live on block storage that's read and written a page at a time. You still need point lookups, range scans, inserts, updates, and deletes, all with a bounded number of disk I/Os per operation, and all crash-safe. A flat sorted file gives fast lookups but every insert shifts the tail; an append-only unsorted file makes inserts cheap but lookups scan everything. Neither survives at scale.

Solution

A B-tree (B+ tree in practice) stores data in fixed-size pages, typically 4–16 KB to match the OS/disk block. It's a balanced tree with high fan-out: each interior page holds many keys and child pointers, so even a billion rows sit only three or four levels deep. A lookup walks root → leaf, touching one page per level.

In the B+ tree variant used everywhere, only leaf pages hold actual data; interior pages are pure routing keys, and leaves are linked left-to-right so a range scan reads one leaf then follows the link to the next. Updates happen in place: locate the leaf, modify it, write the page back. An insert that overflows a page splits it into two and pushes a separator key up; a delete that underflows a page may merge it with a sibling. A redo write-ahead log records page changes before they're applied, so a crash mid-write can be recovered.

A key design axis is whether the table itself is the tree. Index-organized engines store full rows in the primary-key B-tree's leaves. Heap + index engines keep rows in an unordered heap file and build separate B-trees that point at heap locations.

Tradeoffs

Property	Effect
Reads	Low and predictable read amplification: one root-to-leaf path. This is the B-tree's main advantage
Range scans	Efficient via linked leaves, especially when rows are clustered by the scan key
Write path	In-place updates are random writes; each logical change dirties a page that must later be flushed
WAL cost	Every change is written twice (log, then page), and engines like InnoDB add a doublewrite buffer, raising write volume further
Space	Pages sit partly empty after splits (often ~70% full); fragmentation accumulates and needs occasional rebuilds
MVCC engines	Old row versions linger until a cleanup process reclaims them (Postgres VACUUM, InnoDB purge)
Concurrency	Page-level latching and locking is intricate and a classic contention point

Against an LSM tree the contrast is direct: B-trees win on reads and range scans, LSM trees win on random-write throughput.

Implementations

Minimal pseudocode

def search(node, key):
    while not node.is_leaf:
        i = node.find_child_index(key)   # binary search within the page
        node = read_page(node.children[i])
    return node.get(key)                  # None if absent

def insert(node, key, value):
    leaf = descend_to_leaf(node, key)
    leaf.put(key, value)                  # in-place
    write_page(leaf)
    if leaf.overflows():
        split(leaf)                       # may cascade splits up to the root

def split(page):
    mid = page.middle_key()
    left, right = page.divide_at(mid)
    write_page(left); write_page(right)
    parent = page.parent
    parent.insert_separator(mid, right)   # push median up
    write_page(parent)
    if parent.overflows():
        split(parent)

Real engines add page latching, the WAL, and free-space management around this skeleton.

InnoDB (MySQL) — index-organized

The table is a B+ tree keyed on the primary key; leaf pages hold the complete rows, so primary-key range scans are clustered and fast. Secondary indexes store the primary key as their row pointer, meaning a secondary lookup does two descents (index tree, then PK tree). Pages are 16 KB. Durability comes from the redo log plus a doublewrite buffer that guards against torn page writes; MVCC is served from an undo log, and a buffer pool caches hot pages. Docs: https://dev.mysql.com/doc/refman/en/innodb-storage-engine.html.

PostgreSQL — heap + B-tree

Rows live in an unordered heap file, addressed by TID (page number, item offset). Indexes are separate structures, the default being a B-tree implementing the Lehman-Yao concurrent variant (lock-free descent via high keys and right-links). Because the primary key isn't clustered by default, even a primary-key range scan may hit scattered heap pages. MVCC keeps superseded row versions in the heap itself, which VACUUM later reclaims; the WAL provides crash recovery and feeds replication. Source: https://github.com/postgres/postgres, tree under src/backend/access/nbtree.

SQLite — single-file embedded

The entire database is one file divided into pages (4 KB by default). Each table is a B-tree keyed by an integer rowid with rows stored in the leaves (index-organized, like InnoDB), and each explicit index is its own B-tree. Durability uses either a rollback journal or WAL mode. It's a library linked into the application, not a server, which is why it's the most deployed database engine in the world (phones, browsers, embedded devices). Source: https://www.sqlite.org/src.