Storage Engines

Persistent Key-Value Store

An embedded, crash-safe KV store built on an LSM tree: WAL durability, memtable flushes, Bloom-filtered reads, compaction, and MVCC snapshots.

~30 min · Intermediate

Problem & Requirements

Build a single-node embedded store with the standard contract: put(key, value), get(key), delete(key), and ordered range scans. It has to survive a process crash or power loss without losing acknowledged writes, and it has to stay fast as the dataset grows past RAM.

The core tension is the read/write asymmetry. A naive on-disk B-tree gives cheap reads but pays a random write per update; a naive append-only log gives cheap writes but turns every read into a full scan. The whole design is about not picking one of those losing ends. We go with an LSM tree here because the workload is write-heavy; a read-heavy or range-scan-dominated workload would push you toward a B-tree instead, and that tradeoff is the first thing to internalize.

Functional

  • Point ops (get/put/delete) and ordered range scans over keys.
  • Crash recovery: any write that returned successfully is still present after a restart.
  • Deletes are real — a deleted key must not resurrect from an older on-disk segment.

Non-functional (back-of-envelope, commodity NVMe SSD)

QuantityTargetWhy it drives the design
Sustained writes~50–100k ops/sForces sequential I/O only; no random in-place updates on the write path.
Point read p99~1–5 msCaps how many on-disk segments a single get may touch → sparse index + Bloom filters.
Key / value size≤ a few hundred B / ≤ a few KBKeeps a record in one I/O; lets the memtable hold ~tens of thousands of entries.
Memtable flush threshold64 MBBounds recovery time — replay is at most one memtable of WAL.
Durabilityfsync per commit (group-committable)A returned put is on stable storage, not just in the page cache.
Recovery timesecondsEqual to replaying ≤ one memtable worth of WAL, not the whole dataset.

That recovery bound is the hidden requirement that shapes everything: it's why the memtable has a size cap, why we flush, and why the WAL can be truncated after a flush.

Design

Five components, each one a named principle applied to a concrete job:

  1. Write-ahead log (WAL) — every mutation is appended and fsync'd here before it's acknowledged. This is the only thing standing between an in-memory write and a power cut. See write-ahead logging.
  2. Memtable — a sorted in-memory map holding the most recent writes. Sorted (not a hash map) so it can be flushed to disk in key order and so range scans work. Part of the LSM tree structure.
  3. SSTables — immutable, sorted on-disk segment files. Once written they're never modified, only merged or deleted. Newer SSTables shadow older ones.
  4. Sparse index + Bloom filter, per SSTable — the sparse index turns "find this key in a 64 MB file" into one seek plus a short scan; the Bloom filter lets a get skip a file entirely when the key is definitely absent. Without the filter, a miss would touch every segment on disk.
  5. Compaction — a background merge of overlapping SSTables that drops shadowed values and tombstones, reclaiming space and bounding read fan-out. The "and then it gets harder" half of an LSM tree.

The write path is: WAL append → fsync → memtable insert → ack. When the memtable hits 64 MB it becomes immutable, a new one takes over, and the old one is flushed to a fresh SSTable (after which its WAL segment can be dropped). The read path checks the live memtable, then any flushing memtable, then SSTables newest-to-oldest, stopping at the first hit — with the Bloom filter pruning files before any disk read.

Concurrent readers get consistent snapshots through MVCC: every write carries a monotonic sequence number, and a reader pinned to sequence s ignores anything written after s. That's the last build step.

Build it

1

Start with the contract and nothing else: a dict. Correct, fast, and it loses everything the instant the process dies. Every later step exists to fix a specific way this version fails — first durability, then the fact that it can't outgrow RAM.

2

Append every mutation to a log and fsync before returning, then replay it on startup. Now an acknowledged put survives a crash — the in-memory map is just a cache of the log. Deletes are logged as tombstones, not absences, so a replay can tell "deleted" apart from "never written." This is plain write-ahead logging; the record format is length-prefixed so the replay can find frame boundaries.

3

A WAL-backed dict still has to fit the whole dataset in RAM. Cap it. Writes land in a sorted memtable; when it exceeds the threshold, freeze it and write its contents — in key order — to an immutable SSTable on disk. After the flush the WAL can be truncated, which is exactly what keeps recovery bounded to one memtable. The sorted order is what makes the on-disk file scannable and mergeable later. This is the heart of the LSM tree.

4

A key can live in the memtable or in any SSTable, and the newest copy wins. So get checks the memtable first, then SSTables newest-to-oldest, and stops at the first match — including a tombstone, which it reports as "not found" rather than continuing to an older live value. That stop-on-tombstone rule is what makes deletes actually delete. (The naive scan here is what the next two steps make fast.)

5

A get for an absent key currently scans every SSTable to the end — the read fan-out grows with your data and blows the p99 budget. Attach a Bloom filter to each SSTable, built at flush time from its keys. A negative answer is exact, so a get skips any file whose filter says the key is absent, and only the filters that say "maybe" cost a disk read. This is the single change that keeps reads cheap as segment count climbs.

6

Updates and deletes never rewrite old SSTables, so dead bytes pile up and the segment count keeps growing — both reads (more files to maybe-check) and disk usage suffer. Compaction merges several SSTables into one: walk them in key order, keep only the newest version of each key, and drop tombstones once no older segment can contradict them. It's a sequential merge-sort write, the same cheap I/O pattern as a flush, and it's the work that turns a pile of segments back into a bounded, sorted set. See the compaction half of LSM trees.

7

A range scan that runs while a flush or compaction swaps the segment list can see a half-applied state. Fix it with MVCC: stamp every write with a monotonic sequence number, store (value, seq), and let a reader pin a snapshot sequence s. The reader then ignores any record with seq > s, so concurrent writers never perturb an in-flight scan — and no locks are held across the read. This is the step that makes the store safe under real concurrency.

Tradeoffs

DecisionBuysCosts
LSM tree over B-treeSequential-only writes → high write throughput; cheap deletes via tombstonesRead amplification (a key may live in N segments); space amplification until compaction runs
fsync per commitA returned write is genuinely durableBounded by disk fsync latency; needs group commit to hit the throughput target
Bloom filter per SSTableNegative lookups skip files with no disk I/OMemory per table (~10 bits/key at 1% FP); filters must be rebuilt or loaded on open
CompactionBounds segment count and reclaims spaceBackground write I/O competing with foreground; tuning leveled vs. size-tiered is its own rabbit hole
MVCC sequence numbersLock-free consistent reads and snapshotsOld versions linger until compaction; sequence/timestamp bookkeeping on every record
Memtable size cap (64 MB)Recovery ≈ one memtable of WAL replayMore, smaller SSTables → more compaction pressure

Scaling it up

The toy above is the skeleton of RocksDB and LevelDB; here's where the real ones get harder.

Leveled compaction. Merging all SSTables on every compaction is quadratic. Production engines organize SSTables into levels (L0…Ln) with size ratios, and compact only overlapping key ranges between adjacent levels. This caps both read fan-out (one file per level past L0) and write amplification. The strategy choice — leveled vs. size-tiered vs. RocksDB's universal — is the main knob for trading write amp against space amp.

Block-based SSTables with a real index. A flat file scanned per lookup is too slow. Real SSTables are split into compressed blocks with a block index and a footer; the in-memory sparse index maps key ranges to block offsets, so a lookup is one index probe plus one block read. Add a block cache in front of it.

Group commit and WAL management. To hit 100k writes/s you can't fsync once per put. Batch concurrent writes into one fsync (group commit), and recycle/segment the WAL so truncation after a flush is O(1) rather than rewriting a file. RocksDB also supports a manifest log that records the SSTable set atomically so recovery doesn't have to scan the directory.

Column families and a shared WAL. Multiple logical keyspaces (RocksDB calls them column families) typically share one WAL for atomic cross-family writes but keep independent memtables and LSM levels, so each can be tuned and compacted on its own.

Crash safety of the manifest. The hard part isn't losing a put — the WAL handles that. It's making the set of live SSTables update atomically across flush, compaction, and crash, so you never recover into a state that references a half-written file or has lost a just-flushed one. That's what RocksDB's MANIFEST + VersionEdit machinery exists for, and it's worth reading before you trust your own.

References