Storage Engines

Time-Series Database

Ingest and query high-volume metrics: append-only columnar chunks, delta-of-delta + XOR compression, time-based block pruning, a label inverted index, and cardinality control.

~30 min · intermediate

Problem & Requirements

Ingest a high-volume stream of metric points and answer range queries over them. A point is a (series, timestamp, value) triple, where a series is a metric name plus a set of labels — cpu_usage{host="a",region="us-east"}. Writes arrive constantly, almost always with increasing timestamps, and they vastly outnumber reads. Queries ask for a time range across one or many series and usually aggregate the result (average, rate, percentile) rather than returning raw points.

The shape of the data is what makes a general-purpose store the wrong tool. Points are immutable once written, timestamps are nearly monotonic, values within a series change slowly, and a query touches a contiguous time slice of a known set of series. Those four facts justify the four principles this prototype is built on: append-only storage because nothing is ever updated in place, columnar storage because reading one column across many points compresses and scans far better than reading whole rows, time-based indexing because every query prunes by time first, and cardinality control because the number of distinct series — not the number of points — is what actually breaks the system.

Functional

write(metric, labels, timestamp, value) for ingest; query(metric, matchers, start, end) for range reads over matching series.
Append-only: a written point is never modified, only read or eventually dropped by retention.
Label-based selection: a query names a metric and a set of label matchers and gets back every series satisfying all of them.

Non-functional (back-of-envelope, single node)

Quantity	Target	What it forces
Ingest rate	~1M points/s	Write path must be append-only and allocation-light; no per-point index churn
Bytes per point	~16 raw → ~1.5 compressed	Delta-of-delta timestamps + XOR floats; raw storage is a non-starter at this rate
Active series (cardinality)	~10⁵–10⁶	The inverted index and per-series chunks scale with this, not with point count
Range-query latency	tens of ms for a series over hours	Time-based block pruning + columnar slices; never scan blocks outside the range
Block window	2 hours	Recent block is writable; older blocks seal to immutable, query-prunable units
Retention	tiered (hot → downsampled → dropped)	Old blocks compress harder or get rolled up; storage can't grow without bound

The trap hiding in this table is the cardinality row. Point volume is easy — it's a stream you compress and append. Cardinality is the killer: every distinct label combination is a new series with its own chunk and index entries, so a single unbounded label (a request_id, a raw URL) can spawn millions of series and exhaust memory while the point rate stays flat. Controlling that is step six, and it's the difference between a metrics system and an outage.

Design

Five components, each mapped to the principle it embodies:

Series identity — a metric plus its sorted labels canonicalizes to a stable series key, which maps to a small integer series id. Everything downstream keys on the id, not the string.
Per-series columnar chunks — each series stores timestamps and values as two parallel columns rather than a list of pairs. This is columnar storage: a query that only needs values never reads the timestamp column, and each column compresses on its own because adjacent entries are similar.
Compression — timestamps encode as delta-of-delta (regular intervals collapse to near-zero) and float values as XOR against the previous value (slowly-changing values share most bits). This is the Gorilla scheme, and it's what turns 16 bytes a point into roughly 1.5.
Time blocks — data is partitioned into fixed time windows. The newest block (the head) is writable; once a window closes it seals to an immutable block. A query intersects its time range with block ranges and skips every block that can't contain matching points — time-based indexing and append-only storage at block granularity.
Label inverted index — a posting list from each (label, value) pair to the set of series ids carrying it, so a multi-label query resolves to a set intersection. This is an ordinary inverted/secondary index; what's special here is that its size is governed by cardinality control, which caps distinct series and rejects unbounded labels before they spawn series.

Three real systems anchor the design and split along these lines. Prometheus's TSDB is the closest match to what's built here: 2-hour blocks with a writable head, a label inverted index, and Gorilla-derived compression. InfluxDB's TSM engine is columnar with its own time-partitioned, compressed format. TimescaleDB takes a different route — a Postgres extension whose hypertables partition a regular table into time-based chunks and apply columnar compression to old ones — which is worth holding as the "build it on an existing engine" counterpoint to "build it from scratch."

Build it

The starting point is a stream appended per series and never mutated. Points go into a per-series list, timestamps assumed roughly increasing, and a query filters by range. This is correct and it establishes the append-only contract that everything else preserves. The weakness it sets up: storing points as (ts, value) tuples interleaves two very different columns, which wrecks both compression and scan speed — the next step separates them.

Split each series into two parallel arrays: one column of timestamps, one of values. A range query binary-searches the sorted timestamp column for the slice bounds, then returns contiguous slices of both — no per-point tuple unpacking, and a query that only needs values can ignore the timestamp column entirely. The deeper payoff is for the next step: timestamps next to timestamps and values next to values are each highly similar to their neighbors, which is exactly what makes columnar storage compress.

Raw 8-byte timestamps and floats are unaffordable at a million points a second. Timestamps from a regular scrape interval barely change between points, so encode the delta of the delta — for a steady 10s interval every value after the first is zero. Float values in a series usually drift slowly, so XOR each against the previous one and the result is mostly zero bits, which pack tight. This is Facebook's Gorilla scheme; the encoder below shows the structure (the bit-level varint packing is sketched, since the principle is the encoding choice, not the bit twiddling).

A single ever-growing chunk per series can't be pruned, recovered, or aged out cleanly. Partition into fixed time windows. The current window is the writable head; when a point arrives past its end, the head seals into an immutable block and a new head opens. A query intersects its [start, end) with each block's range and touches only the overlapping ones, so a query for the last hour never looks at last week. Sealed blocks being immutable is append-only at block granularity and is what makes them safe to compress harder or drop — the basis of time-based indexing.

So far a query needs a known series id, but callers ask in labels: cpu_usage{host="a",region="us-east"}. Map each (label, value) pair to the set of series ids carrying it — a standard inverted index — and resolve a multi-label query by intersecting the posting lists. Canonicalizing the metric and sorted labels into one key gives each series a stable id the moment it's first seen. Reads now go labels → series ids → blocks, which is the full query path.

The index above grows with the number of distinct series, and that number is the real failure mode. A single unbounded label — a request id, a raw path, a user id — turns each request into a brand-new series with its own chunk and posting entries, and memory is gone while the point rate looks fine. Guard the creation path: cap total active series, and cap the distinct values any one label may take, rejecting a new series that would breach either. This is cardinality control, and it belongs at write time because once a high-cardinality series exists, the damage is already done.

import bisect, struct
from collections import defaultdict

class TSDB:
    def __init__(self):
        self._series = defaultdict(list)       # series_key -> [(ts, value), ...]

    def write(self, key, ts, value):
        self._series[key].append((ts, value))  # append-only; never update in place

    def query(self, key, start, end):
        pts = self._series[key]
        return [(t, v) for (t, v) in pts if start <= t < end]   # row scan

Tradeoffs

Decision	What it buys	What it costs
Append-only storage	Sequential writes, simple recovery, immutable blocks safe to compress/drop	No in-place edits or out-of-order corrections without rewriting a block
Columnar per-series layout	Column scans, per-column compression, ignore unread columns	Worse for "give me the whole row of one point"; reassembly on output
Delta-of-delta + XOR	~10× space reduction on regular metrics	Tied to mostly-regular timestamps and slowly-changing floats; irregular data compresses poorly
Time blocks	Query prunes by time; old blocks age out independently	Out-of-order or very late points fall outside the head and are awkward to place
Label inverted index	Multi-label selection as a set intersection	Index size scales with series count; intersection cost grows with matcher breadth
Cardinality control	Caps memory and index blowup at the source	Rejects or drops legitimate-looking series; tuning the caps is a judgment call

Scaling it up

The toy stops well short of the real systems. Several gaps matter.

The index has to live on disk and be mergeable. An in-memory posting-set map is fine for a demo and impossible at a million series across weeks of blocks. Prometheus persists each block with its own on-disk inverted index and posting lists, and a query fans out across block indexes and unions the results. Compaction merges adjacent blocks (and their indexes) into larger ones, the same merge pattern an LSM tree uses, which keeps the number of blocks a query must consult bounded.

Out-of-order and backfilled writes. The head-block model assumes time moves forward. Real ingest sees clock skew, retries, and deliberate backfill, so modern TSDBs accept a bounded out-of-order window into the head and reject or separately handle anything older. Getting this wrong silently drops data, so it's worth designing explicitly rather than discovering in production.

Downsampling and retention tiers. Keeping raw resolution forever is neither affordable nor useful. Production systems roll old blocks into lower-resolution rollups (one point per minute, then per hour) and tier storage from hot SSD to cold object storage, often dropping raw points entirely past a horizon. The query layer then has to pick the right resolution for the requested range.

Cardinality control that's smarter than a hard cap. A flat cap protects the node but bluntly rejects whole metrics. Real defenses add per-tenant limits, relabeling rules that drop or aggregate offending labels before ingest, and cardinality analysis that surfaces which label is exploding. The caps here are the floor, not the design.

Query execution beyond a single series. Aggregations across many series (a sum by (region) over thousands of series), rate calculations, and joins are a query engine in their own right. Prometheus's PromQL, InfluxDB's Flux, and TimescaleDB leaning on SQL are three different answers, and the execution model — vectorized over columnar chunks, pushed down to the storage layer — is where a lot of the real performance work lives.

This completes the storage-engine arc across these prototypes: the LSM key-value store, the distributed cache, the document database, and now the time-series engine. The natural next steps from here are a dedicated compaction/merge prototype (the on-disk index merging this lesson defers to Scaling), a query-execution prototype that builds the vectorized aggregation engine sketched above, and a write-ahead-log and recovery prototype shared across all of them. Any of those extends the same foundation set without re-treading this ground.

References

Prometheus TSDB format — head block, 2-hour blocks, on-disk index, compaction: https://github.com/prometheus/prometheus/blob/main/tsdb/docs/format/README.md
Pelkonen et al., Gorilla: A Fast, Scalable, In-Memory Time Series Database (2015) — the delta-of-delta + XOR compression: https://www.vldb.org/pvldb/vol8/p1816-teller.pdf
InfluxDB TSM storage engine — columnar time-structured merge files: https://docs.influxdata.com/influxdb/v2/reference/internals/storage-engine/
TimescaleDB hypertables and columnar compression — the "build on Postgres" approach: https://docs.timescale.com/use-timescale/latest/hypertables/
Foundations referenced inline: columnar storage · time-based indexing · cardinality control · append-only storage · secondary indexes