Log-Structured Storage | Microservices Course

Problem

B-tree storage engines update data in place. Each write seeks to wherever the key already lives and overwrites it, which scatters small writes randomly across the disk. Spinning disks pay a seek penalty per write; SSDs suffer write amplification and wear because the device rewrites whole flash pages for tiny logical updates. Either way, a write-heavy workload wastes most of the device's sequential bandwidth.

Solution

Convert random writes into sequential writes by never updating in place.

Incoming writes go to two places: an append-only write-ahead log (WAL) on disk for durability, and an in-memory sorted structure called the memtable (usually a skiplist). When the memtable reaches a size threshold it's flushed to disk as an immutable, sorted file called an SSTable (sorted string table), then a fresh memtable takes over. SSTables are never modified after creation; a newer write to the same key just lands in a newer SSTable, and a delete writes a tombstone marker rather than removing anything.

Reads check the memtable first, then SSTables from newest to oldest, returning the first match. To stop SSTables from accumulating forever, a background process called compaction merges them: it combines sorted files, keeps only the latest version of each key, drops keys shadowed by tombstones, and writes out fewer, larger files.

Tradeoffs

Property	Effect
Write path	Cheap and sequential; this is the whole point
Write amplification	Compaction rewrites the same data several times before it settles
Read amplification	A key may not be in the memtable, so reads probe multiple SSTables. Mitigated by per-SSTable Bloom filters (skip files that can't contain the key) and block caches
Space amplification	Obsolete versions and tombstones occupy disk until compaction reclaims them
Range scans	More expensive than a B-tree's clustered layout, since the scan merges across SSTables
Ingest spikes	Heavy writing can outrun compaction, triggering write stalls / backpressure

The dominant tuning knob is compaction strategy. Leveled compaction keeps each level non-overlapping and roughly 10× the previous, giving low space and read amplification at the cost of high write amplification. Size-tiered compaction merges similarly-sized files, giving low write amplification but higher space and read amplification. Choosing between them is mostly a write-vs-read/space decision.

Implementations

Minimal pseudocode

class LSM:
    def __init__(self):
        self.memtable = SortedMap()      # e.g. skiplist
        self.sstables = []               # newest last
        self.wal = open("wal.log", "ab")

    def put(self, key, value):
        self.wal.append((key, value))    # durability first
        self.memtable[key] = value
        if self.memtable.size() > THRESHOLD:
            self.flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)

    def flush(self):
        sst = SSTable.write_sorted(self.memtable)  # immutable file
        self.sstables.append(sst)
        self.memtable = SortedMap()
        self.wal.truncate()

    def get(self, key):
        if key in self.memtable:
            return unwrap(self.memtable[key])
        for sst in reversed(self.sstables):        # newest first
            if sst.bloom.might_contain(key):       # skip misses cheaply
                v = sst.get(key)
                if v is not None:
                    return unwrap(v)
        return None

# background thread: merge sstables, drop shadowed keys + tombstones
def compact(sstables):
    return merge_sorted(sstables, keep="latest", drop_tombstones=True)

unwrap returns None if the stored value is a tombstone. Everything else is detail layered on this skeleton.

LevelDB (Google)

The reference implementation most others descend from, written by Sanjay Ghemawat and Jeff Dean. Single keyspace, leveled compaction across seven levels, Bloom filters, snapshots. It's embedded (a library, not a server) and ships inside Chrome's IndexedDB and Bitcoin Core. Source: https://github.com/google/leveldb.

RocksDB (Meta)

A LevelDB fork that became the de facto production LSM engine. It adds column families (independent keyspaces sharing a WAL), pluggable compaction (leveled, universal/size-tiered, FIFO), block cache, prefix Bloom filters, and extensive tuning knobs. It's used as the embedded engine under MyRocks (MySQL), TiKV, Kafka Streams' state stores, and many others. CockroachDB ran on RocksDB before replacing it with Pebble, a Go-native, RocksDB-compatible LSM they wrote to avoid the cgo boundary. RocksDB: https://github.com/facebook/rocksdb. Pebble: https://github.com/cockroachdb/pebble.

BadgerDB (Dgraph) — a variant worth knowing

A Go engine implementing the WiscKey design, which separates keys from values: the LSM tree stores only keys plus pointers, while values live in a separate append-only value log. Because compaction then moves only small key entries rather than full values, write amplification drops sharply for workloads with large values. The cost is an extra disk read per lookup (follow the pointer) and a separate garbage-collection process for the value log. It backs Dgraph. Source: https://github.com/dgraph-io/badger; WiscKey paper: https://www.usenix.org/conference/fast16/technical-sessions/presentation/lu.

Cassandra / ScyllaDB — LSM at the row level

Both store each table as memtable + commitlog + SSTables and expose the compaction strategy directly to operators: size-tiered (STCS) as the general default, leveled (LCS) for read-heavy workloads, and time-window (TWCS) for time-series data where old SSTables expire wholesale. ScyllaDB is a C++ reimplementation of Cassandra on a shard-per-core (Seastar) architecture, same LSM model underneath. Cassandra: https://github.com/apache/cassandra. ScyllaDB: https://github.com/scylladb/scylladb.