Denormalization For Read Paths | Microservices Course

Problem

A normalized schema stores each fact exactly once and reassembles query results with joins. That works well inside a single machine with a query optimizer, but the rows a join needs often live on different partitions or nodes, so producing one result means gathering pieces over the network, and the response is bounded by the slowest of those fetches. A read-heavy path that draws on many entities, rendering a home timeline or a product page, turns into many cross-partition round trips. Most distributed stores don't even offer general joins, because a join across arbitrary partitions can't be executed efficiently. Normalization minimizes write and storage cost (one copy, one place to update) and pushes the cost onto reads.

Solution

Duplicate the data and pre-shape it so each read touches one place. Rather than modeling around entities and their relationships, model around the queries the application actually issues: for every read, store a record already arranged as that query's answer, located on the partition the query will probe. The same fact then gets written into several records. Writes become more expensive and have to keep those copies aligned, but a read collapses to a single-partition lookup with no join.

The social feed is the textbook case. Computing a timeline at read time means joining "the accounts I follow" against "their recent posts" on every view, which is fan-out on read: costly per read and repeated for everyone who looks. The alternative precomputes it. When someone posts, the post is pushed into the stored timeline of each follower, which is fan-out on write, so a read just fetches one ready-made list. The economics flip: reads get cheap, writes get expensive, storage grows from duplication, and one case breaks badly. An account with a hundred million followers would trigger a hundred million writes per post. The usual fix is a hybrid, fanning out on write for ordinary accounts and falling back to fan-out on read (merge the few high-fan-out accounts in at query time) for the handful that would overwhelm the write path.

This is the same write-cost-for-read-cost trade the LSM and columnar pieces made, applied to data modeling, and it's effectively a hand-rolled materialized view that the application keeps current.

Tradeoffs

Property	Effect
Reads	Single-partition, join-free, fast and predictable: the entire point
Write amplification	One logical change updates every copy of the duplicated fact
Consistency	Copies can drift; you accept eventual consistency, and the application owns the invariants instead of the database
Storage	More, in proportion to how widely data is duplicated
Flexibility	The model is welded to the queries you knew in advance; a new access pattern needs a new structure and a backfill, and ad-hoc queries suffer
Skew	Hot keys and high-fan-out writers (the celebrity problem) need special-case handling

Implementations

Minimal pseudocode

# write: fan out one post into every follower's prebuilt timeline
def on_post(author_id, post):
    followers = get_followers(author_id)
    if len(followers) < FANOUT_LIMIT:                 # ordinary account
        for f in followers:
            timeline_store.prepend(f, post)           # duplicated per follower
    else:
        mark_high_fanout(author_id)                   # handled at read time

# read: one lookup, no join, then merge the few high-fanout accounts
def get_timeline(user_id):
    base = timeline_store.read(user_id)               # single-partition fetch
    hot = [recent_posts(a) for a in high_fanout_follows(user_id)]
    return merge_by_time(base, *hot)

DynamoDB single-table design

DynamoDB has no joins, so related entities are stored together and read in one call. The technique puts multiple entity types in a single table keyed by a composite of partition key and sort key, with generic overloaded key attributes so that, for example, an order and all its line items share one partition key and come back as a single item collection from one Query. Global secondary indexes provide alternate pre-shaped access paths. The schema is derived from the list of access patterns up front. Reference: https://aws.amazon.com/blogs/database/single-table-vs-multi-table-design-in-amazon-dynamodb/ and Alex DeBrie's "The DynamoDB Book."

Cassandra query-first modeling

Cassandra likewise offers no joins, so the modeling rule is one table per query. You pick the partition key so an entire query result sits in a single partition, order it with clustering columns, and duplicate the same data across several tables to serve different queries. Writing a fact means writing it to each table that needs it, coordinated in application logic or a logged batch. Guide: https://cassandra.apache.org/doc/latest/cassandra/data_modeling/intro.html.

Materialized feeds (Twitter, Instagram)

Large social products precompute per-user timelines with fan-out on write into a fast timeline store (Twitter's were famously Redis-backed), so reading a timeline is a single sequential fetch rather than a query-time join across everyone you follow. The known weak point is accounts with very large followings, where writing to every follower's timeline is infeasible, so those are excluded from write-time fan-out and merged in at read time, blending the two strategies. Background: https://blog.twitter.com/engineering/en_us/topics/infrastructure/2017/the-infrastructure-behind-twitter-scale.