Columnar Storage | Microservices Course

Problem

Analytical queries scan enormous numbers of rows but read only a few columns and aggregate them, things like the average of one column over a billion rows grouped by another. Row-oriented storage lays out all of a row's columns contiguously, so reading one column across the whole table drags every other column's bytes through disk I/O, memory bandwidth, and CPU cache, even though the query never uses them. The amount of data moved is dictated by the widest rows, not by the columns the query actually needs. Interleaving columns of different types on the same page also defeats compression, because adjacent bytes have unrelated types and distributions.

Solution

Store the values of each column contiguously instead of the fields of each row. A query then reads only the columns it references, so I/O scales with columns-touched over columns-total rather than with full row width. And because a column chunk holds values of a single type, frequently sorted or low-cardinality, it compresses far better with lightweight schemes: run-length encoding for repeats, dictionary encoding for low cardinality, delta and frame-of-reference for sorted numerics, bit-packing for small ranges. Smaller data means less I/O, and the engine can compute directly on the encoded form, aggregating over run lengths or filtering on dictionary codes, while vectorized execution streams tight typed arrays through SIMD instructions.

The physical layout splits the table into horizontal chunks, called row groups, stripes, or parts depending on the system, and within each chunk stores one contiguous segment per column. Each chunk carries metadata: min/max per column, null counts, sometimes Bloom filters or bitmap indexes. A predicate can then skip any chunk whose min/max can't satisfy it, so a filtered scan reads only the chunks that might match (data skipping / predicate pushdown). The cost is row reconstruction: assembling a full row means gathering the i-th value out of every column segment and decompressing each, which is why columnar layouts are poor for point lookups and single-row updates.

Most columnar stores handle writes the way the LSM piece described, accumulating immutable files and merging them in the background, and handle deletes with tombstones or delete vectors rather than in-place edits.

Tradeoffs

Property	Effect
Scans and aggregations	The central win: read only needed columns, with strong compression and vectorized CPU
Compression	Much better than row storage because each column is one type with local value distribution
Point lookups	Poor: reconstructing a single row touches and decompresses every column segment
Writes and updates	Poor: a row insert appends to every column, and updates usually mean rewrite-and-merge, not in-place edits
Table shape	Wide tables with selective column access benefit most; narrow transactional tables don't
Encoding cost	Extra CPU to encode on write and decode on read, normally repaid by the I/O savings

Implementations

Minimal pseudocode

# storage: one array per column, chunked, with per-chunk min/max
table = {
    "id":     [Chunk(values=[...], min=.., max=..), ...],
    "ts":     [Chunk(values=[...], min=.., max=..), ...],
    "amount": [Chunk(values=[...], min=.., max=..), ...],
}

def scan_sum(table, sum_col, filter_col, lo, hi):
    total = 0
    for c_idx, fchunk in enumerate(table[filter_col]):
        if fchunk.max < lo or fchunk.min > hi:
            continue                              # skip whole chunk
        schunk = table[sum_col][c_idx]            # only the 2 columns read
        for i, v in enumerate(fchunk.values):     # vectorizable inner loop
            if lo <= v <= hi:
                total += schunk.values[i]
    return total

Parquet

The open columnar file format that grew out of Google's Dremel work. A file is row groups, each holding one column chunk per column, each chunk split into pages with their own encoding (dictionary, RLE, delta, bit-packing). Per-chunk statistics (min/max, null count), optional Bloom filters, and page indexes drive skipping, and nested/semi-structured data is flattened with Dremel-style repetition and definition levels. It's the default interchange format for data lakes, read by Spark, Trino, DuckDB, and most query engines. Spec: https://parquet.apache.org/docs/.

ORC

The Optimized Row Columnar format from the Hive ecosystem. Data is divided into stripes (around 64 MB), each with separate streams per column plus lightweight indexes at the stripe and row-group (10k-row) level carrying min/max for predicate pushdown, built-in compression (Zlib or Zstd), and optional Bloom filters. It's tuned for Hive on Hadoop. Spec: https://orc.apache.org/specification/.

ClickHouse

A column-oriented OLAP database whose MergeTree engine stores each column in its own files, sorted by a primary key, in immutable parts that merge in the background much like an LSM tree. A sparse primary index plus granules and skip indexes (min/max, set membership, Bloom) prune data before reading, and a vectorized execution engine drives very fast scans and aggregations. Docs: https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree.

Apache Druid

A real-time analytics store whose segments are columnar and partitioned by time. Dimension columns carry compressed bitmap (Roaring) indexes for fast filtering, and optional ingest-time rollup pre-aggregates rows, making it well suited to slicing high-volume event and time-series data. Design: https://druid.apache.org/docs/latest/design/.

BigQuery

Google's serverless warehouse, storing data in a columnar format called Capacitor (the successor to the ColumnIO/Dremel layout) on the Colossus file system, with compute running as the Dremel execution engine. Storage and compute are decoupled, so a query fans out across many workers that scan only the referenced columns in parallel. Overview: https://cloud.google.com/bigquery/docs/storage_overview.

Snowflake

A cloud warehouse that stores tables as immutable, columnar micro-partitions (roughly 50–500 MB of uncompressed data each) on object storage, keeping per-partition min/max metadata so the query planner can prune partitions before scanning. Storage and compute are separated, with stateless virtual warehouses reading the same partitions, and automatic clustering reorganizes data to keep pruning effective. Overview: https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.