Time-Series Database
Ingest and query high-volume metrics: append-only columnar chunks, delta-of-delta + XOR compression, time-based block pruning, a label inverted index, and cardinality control.
~30 min · intermediate
Problem & Requirements
Ingest a high-volume stream of metric points and answer range queries over them. A point is a (series, timestamp, value) triple, where a series is a metric name plus a set of labels — cpu_usage{host="a",region="us-east"}. Writes arrive constantly, almost always with increasing timestamps, and they vastly outnumber reads. Queries ask for a time range across one or many series and usually aggregate the result (average, rate, percentile) rather than returning raw points.
The shape of the data is what makes a general-purpose store the wrong tool. Points are immutable once written, timestamps are nearly monotonic, values within a series change slowly, and a query touches a contiguous time slice of a known set of series. Those four facts justify the four principles this prototype is built on: append-only storage because nothing is ever updated in place, columnar storage because reading one column across many points compresses and scans far better than reading whole rows, time-based indexing because every query prunes by time first, and cardinality control because the number of distinct series — not the number of points — is what actually breaks the system.
Functional
write(metric, labels, timestamp, value)for ingest;query(metric, matchers, start, end)for range reads over matching series.- Append-only: a written point is never modified, only read or eventually dropped by retention.
- Label-based selection: a query names a metric and a set of label matchers and gets back every series satisfying all of them.
Non-functional (back-of-envelope, single node)
| Quantity | Target | What it forces |
|---|---|---|
| Ingest rate | ~1M points/s | Write path must be append-only and allocation-light; no per-point index churn |
| Bytes per point | ~16 raw → ~1.5 compressed | Delta-of-delta timestamps + XOR floats; raw storage is a non-starter at this rate |
| Active series (cardinality) | ~10⁵–10⁶ | The inverted index and per-series chunks scale with this, not with point count |
| Range-query latency | tens of ms for a series over hours | Time-based block pruning + columnar slices; never scan blocks outside the range |
| Block window | 2 hours | Recent block is writable; older blocks seal to immutable, query-prunable units |
| Retention | tiered (hot → downsampled → dropped) | Old blocks compress harder or get rolled up; storage can't grow without bound |
The trap hiding in this table is the cardinality row. Point volume is easy — it's a stream you compress and append. Cardinality is the killer: every distinct label combination is a new series with its own chunk and index entries, so a single unbounded label (a request_id, a raw URL) can spawn millions of series and exhaust memory while the point rate stays flat. Controlling that is step six, and it's the difference between a metrics system and an outage.
Design
Five components, each mapped to the principle it embodies:
- Series identity — a metric plus its sorted labels canonicalizes to a stable series key, which maps to a small integer series id. Everything downstream keys on the id, not the string.
- Per-series columnar chunks — each series stores timestamps and values as two parallel columns rather than a list of pairs. This is columnar storage: a query that only needs values never reads the timestamp column, and each column compresses on its own because adjacent entries are similar.
- Compression — timestamps encode as delta-of-delta (regular intervals collapse to near-zero) and float values as XOR against the previous value (slowly-changing values share most bits). This is the Gorilla scheme, and it's what turns 16 bytes a point into roughly 1.5.
- Time blocks — data is partitioned into fixed time windows. The newest block (the head) is writable; once a window closes it seals to an immutable block. A query intersects its time range with block ranges and skips every block that can't contain matching points — time-based indexing and append-only storage at block granularity.
- Label inverted index — a posting list from each
(label, value)pair to the set of series ids carrying it, so a multi-label query resolves to a set intersection. This is an ordinary inverted/secondary index; what's special here is that its size is governed by cardinality control, which caps distinct series and rejects unbounded labels before they spawn series.
Three real systems anchor the design and split along these lines. Prometheus's TSDB is the closest match to what's built here: 2-hour blocks with a writable head, a label inverted index, and Gorilla-derived compression. InfluxDB's TSM engine is columnar with its own time-partitioned, compressed format. TimescaleDB takes a different route — a Postgres extension whose hypertables partition a regular table into time-based chunks and apply columnar compression to old ones — which is worth holding as the "build it on an existing engine" counterpoint to "build it from scratch."
Build it
The starting point is a stream appended per series and never mutated. Points go into a per-series list, timestamps assumed roughly increasing, and a query filters by range. This is correct and it establishes the append-only contract that everything else preserves. The weakness it sets up: storing points as (ts, value) tuples interleaves two very different columns, which wrecks both compression and scan speed — the next step separates them.
Split each series into two parallel arrays: one column of timestamps, one of values. A range query binary-searches the sorted timestamp column for the slice bounds, then returns contiguous slices of both — no per-point tuple unpacking, and a query that only needs values can ignore the timestamp column entirely. The deeper payoff is for the next step: timestamps next to timestamps and values next to values are each highly similar to their neighbors, which is exactly what makes columnar storage compress.
Raw 8-byte timestamps and floats are unaffordable at a million points a second. Timestamps from a regular scrape interval barely change between points, so encode the delta of the delta — for a steady 10s interval every value after the first is zero. Float values in a series usually drift slowly, so XOR each against the previous one and the result is mostly zero bits, which pack tight. This is Facebook's Gorilla scheme; the encoder below shows the structure (the bit-level varint packing is sketched, since the principle is the encoding choice, not the bit twiddling).
A single ever-growing chunk per series can't be pruned, recovered, or aged out cleanly. Partition into fixed time windows. The current window is the writable head; when a point arrives past its end, the head seals into an immutable block and a new head opens. A query intersects its [start, end) with each block's range and touches only the overlapping ones, so a query for the last hour never looks at last week. Sealed blocks being immutable is append-only at block granularity and is what makes them safe to compress harder or drop — the basis of time-based indexing.
So far a query needs a known series id, but callers ask in labels: cpu_usage{host="a",region="us-east"}. Map each (label, value) pair to the set of series ids carrying it — a standard inverted index — and resolve a multi-label query by intersecting the posting lists. Canonicalizing the metric and sorted labels into one key gives each series a stable id the moment it's first seen. Reads now go labels → series ids → blocks, which is the full query path.
The index above grows with the number of distinct series, and that number is the real failure mode. A single unbounded label — a request id, a raw path, a user id — turns each request into a brand-new series with its own chunk and posting entries, and memory is gone while the point rate looks fine. Guard the creation path: cap total active series, and cap the distinct values any one label may take, rejecting a new series that would breach either. This is cardinality control, and it belongs at write time because once a high-cardinality series exists, the damage is already done.
import bisect, structfrom collections import defaultdictclass TSDB:def __init__(self):self._series = defaultdict(list) # series_key -> [(ts, value), ...]def write(self, key, ts, value):self._series[key].append((ts, value)) # append-only; never update in placedef query(self, key, start, end):pts = self._series[key]return [(t, v) for (t, v) in pts if start <= t < end] # row scan
Tradeoffs
| Decision | What it buys | What it costs |
|---|---|---|
| Append-only storage | Sequential writes, simple recovery, immutable blocks safe to compress/drop | No in-place edits or out-of-order corrections without rewriting a block |
| Columnar per-series layout | Column scans, per-column compression, ignore unread columns | Worse for "give me the whole row of one point"; reassembly on output |
| Delta-of-delta + XOR | ~10× space reduction on regular metrics | Tied to mostly-regular timestamps and slowly-changing floats; irregular data compresses poorly |
| Time blocks | Query prunes by time; old blocks age out independently | Out-of-order or very late points fall outside the head and are awkward to place |
| Label inverted index | Multi-label selection as a set intersection | Index size scales with series count; intersection cost grows with matcher breadth |
| Cardinality control | Caps memory and index blowup at the source | Rejects or drops legitimate-looking series; tuning the caps is a judgment call |
Scaling it up
The toy stops well short of the real systems. Several gaps matter.
The index has to live on disk and be mergeable. An in-memory posting-set map is fine for a demo and impossible at a million series across weeks of blocks. Prometheus persists each block with its own on-disk inverted index and posting lists, and a query fans out across block indexes and unions the results. Compaction merges adjacent blocks (and their indexes) into larger ones, the same merge pattern an LSM tree uses, which keeps the number of blocks a query must consult bounded.
Out-of-order and backfilled writes. The head-block model assumes time moves forward. Real ingest sees clock skew, retries, and deliberate backfill, so modern TSDBs accept a bounded out-of-order window into the head and reject or separately handle anything older. Getting this wrong silently drops data, so it's worth designing explicitly rather than discovering in production.
Downsampling and retention tiers. Keeping raw resolution forever is neither affordable nor useful. Production systems roll old blocks into lower-resolution rollups (one point per minute, then per hour) and tier storage from hot SSD to cold object storage, often dropping raw points entirely past a horizon. The query layer then has to pick the right resolution for the requested range.
Cardinality control that's smarter than a hard cap. A flat cap protects the node but bluntly rejects whole metrics. Real defenses add per-tenant limits, relabeling rules that drop or aggregate offending labels before ingest, and cardinality analysis that surfaces which label is exploding. The caps here are the floor, not the design.
Query execution beyond a single series. Aggregations across many series (a sum by (region) over thousands of series), rate calculations, and joins are a query engine in their own right. Prometheus's PromQL, InfluxDB's Flux, and TimescaleDB leaning on SQL are three different answers, and the execution model — vectorized over columnar chunks, pushed down to the storage layer — is where a lot of the real performance work lives.
This completes the storage-engine arc across these prototypes: the LSM key-value store, the distributed cache, the document database, and now the time-series engine. The natural next steps from here are a dedicated compaction/merge prototype (the on-disk index merging this lesson defers to Scaling), a query-execution prototype that builds the vectorized aggregation engine sketched above, and a write-ahead-log and recovery prototype shared across all of them. Any of those extends the same foundation set without re-treading this ground.
References
- Prometheus TSDB format — head block, 2-hour blocks, on-disk index, compaction: https://github.com/prometheus/prometheus/blob/main/tsdb/docs/format/README.md
- Pelkonen et al., Gorilla: A Fast, Scalable, In-Memory Time Series Database (2015) — the delta-of-delta + XOR compression: https://www.vldb.org/pvldb/vol8/p1816-teller.pdf
- InfluxDB TSM storage engine — columnar time-structured merge files: https://docs.influxdata.com/influxdb/v2/reference/internals/storage-engine/
- TimescaleDB hypertables and columnar compression — the "build on Postgres" approach: https://docs.timescale.com/use-timescale/latest/hypertables/
- Foundations referenced inline: columnar storage · time-based indexing · cardinality control · append-only storage · secondary indexes