Storage Engines

Document Database

A schema-flexible document store with secondary indexes, walked from a single in-memory collection up through leader-follower and leaderless quorum replication.

~35 min · intermediate

Problem & Requirements

What I actually want is a place to throw JSON documents that don't all look the same, get one back by id instantly, and still query by some field other than the id without scanning everything. That's it. The interesting part isn't storing the documents — a map does that — it's that I refused to commit to a schema up front, and now every other feature has to cope with that decision.

The reason I'm reaching for schema-on-read is that I genuinely don't know the document shape yet, or it varies per record, or it'll change three times before launch and I don't want a migration each time. The cost I'm signing up for: the database can't validate or optimize around a shape it doesn't know, so the burden of "does this field exist, is it the type I expect" moves into read-time code. I'm fine with that trade for the flexibility — but I want to be honest that it is a trade, and the place it bites hardest is indexing, because an index over a field half my documents don't have is a weird object that I have to design deliberately.

Before I build anything, the numbers I'm holding in my head:

Quantity	Target	What it forces
Document size	~1–16 KB typical	One doc per I/O; indexes store ids, not whole docs
Indexed read p99	low single-digit ms	Query has to hit an index, not scan — so I need a planner that knows which indexes exist
Write cost	doc write + every affected index	Indexes aren't free; each one I add taxes every write to that collection
Replication factor N	3	Standard for surviving a node loss; sets the quorum math later
Quorum R, W	2, 2 (R + W > N)	The knob that decides whether a read sees the latest write
Field cardinality	wide range	A field with 3 distinct values is a bad index; the planner should still prefer a scan sometimes

That last row is the one I keep forgetting until it hurts: an index on a low-cardinality field (say status with three values) can be slower than a scan, so "use the index whenever one exists" is wrong, and I'll need the planner to be allowed to ignore an index. Holding that now so I don't design myself into a corner.

Design

Walking the pieces and naming what each one is, because once they're named the build order writes itself:

The collection is the document store — ids to arbitrary JSON, schema-on-read. It's the only thing that's authoritative. Everything else is derived from it and can be rebuilt.
Secondary indexes are the derived structures that make "find documents where field = x" fast instead of a scan. See secondary indexes. Because of schema-on-read, an index only contains the documents that actually have the field — a sparse index, and I have to remember that querying through it can never return the documents it doesn't know about.
A small query planner sits between the two. Given a query, it decides: is there an index on this field, and is using it actually worth it given cardinality? If not, scan. This is the component that keeps me honest about the "indexes aren't always faster" point above.
Replication is where the document model stops mattering and distributed-systems reality takes over. I'll build the leader-follower version first because it's what MongoDB's replica sets do and it's easier to reason about: one node takes writes, the rest copy them. Then I'll build the leaderless version, where any replica takes a write, because it trades simplicity for availability and it's the more interesting failure model.
Quorums are how the leaderless version stays useful: by requiring a write to land on W replicas and a read to consult R of them with R + W > N, a read is guaranteed to overlap the latest write. See quorums. This is also where I have to confront concurrent writes producing conflicting versions, which the single-leader model quietly avoided.

One thing I want to flag up front so I don't mislead myself: MongoDB is a clean leader-follower system (a replica set elects one primary; secondaries replicate its oplog). The pure leaderless-with-quorums model is really the Dynamo lineage — Cassandra, Riak. Couchbase sits in between: each shard (vBucket) has an active node and replicas, so it's leader-per-shard, but it exposes majority/quorum durability on top. I'm building both replication styles here because the principle list spans them, and the contrast is the whole lesson — but I won't pretend one real product embodies both.

Build it

Simplest thing that could possibly work, and it also happens to be the whole point: a map from id to a JSON document, where I never look at the document's shape on the way in. If a caller hands me a record with three fields and the next one with thirty, both go in. The find here is a full scan that reads each document at query time and just skips the ones missing the field — that "skip if absent" is schema-on-read in one line, and it's the behavior every later step has to preserve.

The find above is O(documents), which is fine for the demo and a disaster the moment a collection is real. I want field lookups to be O(log n) on the result set, so I'll keep a derived structure: for one field, a sorted map from value to the set of ids that have it. Sorted rather than a plain hash because I'll want range queries (age > 30) eventually and I don't want to rebuild for that. The thing I have to keep reminding myself: this index only contains documents that have the field, so it's sparse by construction — a query through it can't see the others, which is correct for "where age = 30" and a trap if I ever forget it. See secondary indexes.

Now I wire the index into the collection — but I refuse to write "if an index exists, use it," because that's the bug I called out in the requirements. A field like status with three values: the index hands back a third of the collection and then I read all of those documents anyway, which can be slower than a straight scan. So the planner makes a real choice: use the index only when it's selective enough, otherwise scan. The estimate here is crude (distinct-value count) and that's deliberate — a real cost model is a rabbit hole, and crude-but-present beats absent.

An index is a lie the moment a document changes and the index doesn't. So inserts, updates, and deletes all have to touch every index on the collection, and an update is specifically remove-old-then-add-new, because the indexed value itself may have changed. The wrinkle that schema-flexible documents force on me: a field can hold an array (tags: ["a","b"]), and the sensible behavior — the one MongoDB calls a multikey index — is to index each element so find(tags, "a") works. I'm handling that in add so it's a property of the index, not something every caller remembers.

Single node means a crash loses everything and a busy node has no help. The model I trust most to reason about is one leader that owns all writes, streaming its change log to followers that replay it — MongoDB's replica set, basically. Writes go to the leader; reads can go to a follower if I'm willing to accept that the follower might be slightly behind. That staleness is the entire tradeoff of leader-follower: I get read scaling and a standby for failover, and in exchange a read from a follower can miss a write that the leader already acknowledged. For a lot of document workloads that's an acceptable deal, as long as I expose the choice rather than hiding it.

The leader is a single point that has to be elected and failed-over, and during that window writes stall. So the other branch — the Dynamo lineage — drops the leader entirely: a client writes to several replicas directly, reads from several, and uses overlap to stay correct. The rule that makes it work is quorums: require W replicas to ack a write and R replicas to answer a read with R + W > N, and any read set is guaranteed to include at least one node that saw the latest write. The catch I now have to face, which the single leader hid from me, is that two clients can write the same key concurrently to different replicas — so I tag every value with a version, return the newest on read, and repair stale replicas in the background.

import copy, threading, uuid, random
from sortedcontainers import SortedDict

class Collection:
    def __init__(self):
        self._docs = {}                       # _id -> document (a dict)

    def insert(self, doc):
        doc = copy.deepcopy(doc)
        _id = doc.setdefault("_id", uuid.uuid4().hex)
        self._docs[_id] = doc
        return _id

    def get(self, _id):
        return self._docs.get(_id)            # by id: instant

    def find(self, field, value):
        # full scan. missing field == doesn't match, no error. that's the deal.
        return [d for d in self._docs.values() if d.get(field) == value]

Tradeoffs

Decision	What I get	What it costs me
Schema-on-read	Store anything, evolve shape with zero migrations	No server-side validation; type/shape checks move into read-time code; sparse indexes
Secondary indexes	Field queries in O(log n) instead of full scans	Every index taxes every write to the collection; storage for the derived structure
Cost-based planner	Avoids the low-cardinality index trap	A real cost model is hard; my crude one will misjudge skewed data
Multikey (array) indexes	`find` matches inside array fields	A doc with a big array writes many index entries; fan-out on update
Leader-follower	Simple to reason about; read scaling; clean failover	Follower reads can be stale; the leader election window stalls writes
Leaderless + quorums	No leader to elect; writes survive node loss; tunable R/W	Concurrent writes conflict; needs versioning + read repair; R+W tuning is on me

Scaling it up

Where the toy stops and the real document databases keep going.

Indexes on disk, not in a dict. My SortedDict is a stand-in for what MongoDB stores as a B-tree on disk through its storage engine (WiredTiger). The same read/write asymmetry from a storage-engine lesson applies: B-tree indexes give cheap point and range reads and pay a structured write per update, which is why the index tax in my tradeoff table is real and measurable. Compound indexes (multiple fields, order-sensitive) and partial indexes (index only docs matching a predicate) are the next two features, and both are just richer versions of the sparse index I already have.

A planner that actually plans. My selectivity check is a toy. Production planners keep statistics (histograms of value distribution), estimate the cost of each candidate plan, sometimes race plans against each other on a sample (MongoDB literally runs candidate plans briefly and caches the winner), and handle queries that touch several fields by intersecting indexes or picking a compound one. The principle I baked in — "having an index doesn't mean using it" — is the seed; the rest is cost modeling.

Replication's hard middle. Leader-follower hides conflicts but exposes failover: detecting a dead leader, electing a new one without two leaders briefly coexisting (split-brain), and not acknowledging a write that the new leader will lose. That's why MongoDB replica sets carry a real consensus-ish election protocol and write concerns (w: majority) that let a client demand a write reach a majority before it's acked — which is the leader-follower world quietly borrowing quorum thinking.

Conflict resolution in the leaderless world. My "highest version wins" is last-write-wins, and LWW silently drops a concurrent write — sometimes fine, sometimes data loss. The honest version uses version vectors to detect concurrency and either keeps both siblings for the application to merge (Riak's model) or uses CRDTs so merges are automatic and deterministic. Picking LWW is a real choice with a real cost, and I'd want it to be a conscious one per collection, not a default I forgot about.

Sharding, which I skipped entirely. I replicated but never partitioned. Real document stores hash or range-partition the key space across shards (MongoDB's chunks, Couchbase's vBuckets) and then replicate each shard, so the two axes — partition for capacity, replicate for availability — compose. A secondary index then has to be either local to each shard (cheap writes, scatter-gather reads) or global (cheap reads, expensive writes), and that choice is one of the genuinely hard ones in this whole space.

References

MongoDB replica sets and the oplog — the leader-follower model in production: https://www.mongodb.com/docs/manual/replication/
MongoDB indexing (compound, multikey, partial) — the real version of step 3–4: https://www.mongodb.com/docs/manual/indexes/
DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store (2007) — the source of leaderless quorums, versioning, and read repair: https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
Couchbase architecture (vBuckets, durability levels) — the leader-per-shard middle ground: https://docs.couchbase.com/server/current/learn/architecture-overview.html
Designing Data-Intensive Applications, ch. 5 — the clearest treatment of leader-based vs. leaderless replication and quorum math.
Foundations referenced inline: schema-on-read · secondary indexes · leader-follower replication · leaderless replication · quorums