Sagas (Long-Lived Transactions) | Microservices Course

Problem

A business operation spans several services, each with its own database, and runs for seconds to days: reserve inventory, charge the card, create the shipment, confirm the order. Wrapping that in one distributed transaction would hold locks for the entire duration and block whenever the coordinator failed, the same weaknesses two-phase commit (2PC) already showed. Worse, many participants, such as a payment gateway or a third-party shipping carrier, don't support 2PC at all, so a single atomic commit across them isn't even available.

The operation still needs to end in a consistent state: either the order is fully placed, or every partial effect it created along the way is undone. The challenge is getting all-or-nothing behavior without a global lock or a blocking coordinator holding everything together.

Solution

Replace the one big transaction with a sequence of local transactions, each committing independently in its own service, and give each step a compensating action that semantically reverses it. Run the steps in order; if step k fails, run the compensations for steps k-1 down to 1 to unwind the work already committed. Nothing holds a distributed lock and nothing blocks on a coordinator, because every step commits locally and immediately. The price is that intermediate states are visible (there is no isolation) and the operation reaches consistency eventually through compensation rather than atomically at one instant.

A compensation is not a rollback. A committed step's effects were visible to others, so you can't pretend it didn't happen; you issue a new action that undoes its effect in business terms, refunding a charge or releasing reserved stock. Some actions can't be cleanly undone, so you order the steps to put the hardest-to-reverse one (the pivot) after all the compensable steps. If a step fails before the pivot commits, the saga compensates backward; once the pivot has committed, the remaining steps are retried forward until they succeed.
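
The backward-then-forward rule can be sketched in a few lines of Python. This is a minimal illustration, not a production orchestrator: `run_with_pivot`, `StepFailed`, the toy steps, and the bounded `max_attempts` with a `"NEEDS_ATTENTION"` outcome are all assumptions made for the example (the text above describes retrying until success; a bound is used here only so the sketch terminates).

```python
class StepFailed(Exception):
    """Hypothetical error raised when a step's local transaction fails."""


def run_with_pivot(ctx, compensable, pivot, retriable, max_attempts=5):
    """Run (step, undo) pairs, then the pivot, then retriable steps.

    compensable: list of (step, undo) pairs that run before the pivot.
    pivot:       the hardest-to-reverse step; it has no compensation.
    retriable:   steps after the pivot that must eventually succeed.
    """
    undo_stack = []
    try:
        for step, undo in compensable:
            step(ctx)
            undo_stack.append(undo)
        pivot(ctx)                          # point of no return
    except StepFailed:
        for undo in reversed(undo_stack):   # backward recovery
            undo(ctx)
        return "ABORTED"

    for step in retriable:                  # forward recovery
        for _ in range(max_attempts):
            try:
                step(ctx)
                break
            except StepFailed:
                continue                    # a real engine would back off here
        else:
            return "NEEDS_ATTENTION"        # assumption: escalate, never undo
    return "COMMITTED"


# Toy steps that record what they did in ctx["trace"].
def reserve_stock(ctx):
    ctx["trace"].append("reserve_stock")

def release_stock(ctx):
    ctx["trace"].append("release_stock")

def capture_payment(ctx):
    if ctx.get("card_declined"):
        raise StepFailed("card declined")
    ctx["trace"].append("capture_payment")

def create_shipment(ctx):
    ctx["ship_attempts"] = ctx.get("ship_attempts", 0) + 1
    if ctx["ship_attempts"] < 3:            # carrier is down for two attempts
        raise StepFailed("carrier unavailable")
    ctx["trace"].append("create_shipment")


ok = {"trace": []}
declined = {"trace": [], "card_declined": True}
pairs = [(reserve_stock, release_stock)]
ok_result = run_with_pivot(ok, pairs, capture_payment, [create_shipment])
declined_result = run_with_pivot(declined, pairs, capture_payment, [create_shipment])
```

In the first run the shipment step fails twice after the pivot and is simply retried; in the second the pivot itself fails, so only the stock reservation is compensated.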

Making this reliable needs three things. First, a durable coordinator (a saga log or a workflow engine) records which steps completed, so a crash mid-saga resumes and runs the correct compensations rather than losing track. Second, every step and compensation must be idempotent, because crashes force retries and a repeated charge or a double refund is a real bug. Third, because there's no isolation, concurrent sagas can interleave and read each other's intermediate state, so you guard shared records with semantic locks such as a pending status. Coordination itself takes one of two forms: an explicit orchestrator that owns the control flow, which is easy to trace, or choreography, where each service reacts to others' events, which keeps coupling loose but gets hard to follow as it grows.
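
One common way to get idempotent steps is an idempotency key derived from the saga and the step. The sketch below is a hedged illustration: `PaymentService`, its in-memory dictionary, and the `saga_id:step` key format are hypothetical stand-ins for a real service and its database.

```python
class PaymentService:
    """Hypothetical in-memory stand-in for a payment service."""

    def __init__(self):
        self.net_charged = 0
        self._applied = {}                  # idempotency key -> stored result

    def _apply_once(self, key, delta):
        if key in self._applied:            # retry: return the first result
            return self._applied[key]
        # In a real service the effect and the key record would commit in
        # the same local transaction, so a crash cannot separate them.
        self.net_charged += delta
        self._applied[key] = {"key": key, "delta": delta}
        return self._applied[key]

    def charge(self, saga_id, amount):
        return self._apply_once(f"{saga_id}:charge", amount)

    def refund(self, saga_id, amount):
        return self._apply_once(f"{saga_id}:refund", -amount)


payments = PaymentService()
first = payments.charge("saga-42", 50)
retry = payments.charge("saga-42", 50)      # orchestrator retried after a crash
after_charge = payments.net_charged
payments.refund("saga-42", 50)
payments.refund("saga-42", 50)              # compensation retried as well
```

The retried charge and the retried refund are both absorbed, so the customer is charged once and refunded once no matter how many times the orchestrator repeats the calls.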

Tradeoffs

Property	Effect
No locks or blocking	Works across services and non-transactional resources with no distributed lock; this is the main reason to use a saga
No isolation	Intermediate states are visible and concurrent sagas can interleave badly, needing semantic locks and careful design
Compensation	Each step needs a semantic undo; an effect that can't be reversed must be the pivot or come after it, where failures are retried forward
Idempotency	Every step and compensation can be retried after a crash, so all must be safe to repeat
Consistency	Eventual, passing through visibly inconsistent intermediate states before it converges
Observability	Many steps across services; orchestration is far easier to trace and debug than choreography
Durable state	Requires a crash-recoverable coordinator or log to resume and compensate correctly
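
The semantic lock mentioned in the "No isolation" row can be as simple as a pending status on the shared record. The following is a minimal sketch under stated assumptions: the `orders` dictionary, the `_PENDING` suffix convention, and the `RecordBusy` error are invented for illustration.

```python
class RecordBusy(Exception):
    """Hypothetical error: another saga holds the semantic lock."""


orders = {"o-1": {"status": "APPROVED", "qty": 2}}


def begin_change(order_id, pending_status):
    """Mark the record as in flight; a real service would do this check and
    update atomically, for example in one conditional UPDATE."""
    order = orders[order_id]
    if order["status"].endswith("_PENDING"):
        raise RecordBusy(f"{order_id} is {order['status']}")
    order["status"] = pending_status


def end_change(order_id, final_status):
    """Release the lock by writing the final status."""
    orders[order_id]["status"] = final_status


begin_change("o-1", "CANCEL_PENDING")       # saga A starts cancelling
try:
    begin_change("o-1", "REVISE_PENDING")   # saga B arrives mid-flight
    saga_b_blocked = False
except RecordBusy:
    saga_b_blocked = True                   # B can retry later or fail fast

end_change("o-1", "CANCELLED")              # saga A finishes
```

The lock is only a convention in the data, so it costs nothing in the database, but every saga that touches the record has to honor it.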

Implementations

Minimal pseudocode (orchestrated saga)

# steps[i] is undone by compensations[i]; the six callables belong to the services.
steps         = [reserve_stock, charge_card, create_shipment]
compensations = [release_stock, refund_card, cancel_shipment]

class StepFailed(Exception):
    """Raised when a step's local transaction does not commit."""

def run_saga(ctx, log):
    done = []                               # indexes of steps that committed
    for i, step in enumerate(steps):
        try:
            step(ctx)                       # local commit, idempotent
        except StepFailed:
            for j in reversed(done):        # unwind committed steps in reverse
                compensations[j](ctx)       # semantic undo, idempotent
                log.write("COMPENSATED", j)
            return "ABORTED"
        log.write("STEP_DONE", i)           # durable progress marker
        done.append(i)
    return "COMMITTED"

The durable progress marker is what lets a crashed orchestrator restart, see which steps committed, and run exactly the right compensations instead of guessing.
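
What the restarted orchestrator does with those markers can be sketched as a small planning function. This is an illustration under assumptions: the log is taken to be a list of `(kind, step_index)` tuples using the marker names from the pseudocode, and `plan_recovery` with its `"resume"`/`"compensate"` outcomes is a hypothetical name and policy.

```python
def plan_recovery(entries):
    """Decide what a restarted orchestrator should do next.

    entries: durable log as a list of (kind, step_index) tuples, where kind
    is "STEP_DONE" or "COMPENSATED" (assumed record shape).
    """
    done = [i for kind, i in entries if kind == "STEP_DONE"]
    compensated = {i for kind, i in entries if kind == "COMPENSATED"}

    if compensated:
        # The crash happened mid-unwind: finish the compensations that were
        # not logged yet, newest step first.
        remaining = [i for i in reversed(done) if i not in compensated]
        return ("compensate", remaining)

    # The crash happened while moving forward: re-run from the first step
    # without a marker. That step may already have committed, which is safe
    # only because steps are idempotent.
    return ("resume", len(done))


forward_crash = plan_recovery([("STEP_DONE", 0), ("STEP_DONE", 1)])
unwind_crash = plan_recovery(
    [("STEP_DONE", 0), ("STEP_DONE", 1), ("COMPENSATED", 1)]
)
```

Because a marker is written after the step commits, the log can lag the real state by one step, which is another reason every step and compensation has to tolerate being repeated.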

Order and payment pipelines

E-commerce checkout is the canonical saga: reserve inventory, authorize and capture payment, allocate a shipment, confirm the order, with compensations to release inventory, void or refund the charge, and cancel the shipment. None of these resources participates in a shared transaction, so when every step is treated as compensable, a failure at the shipping step refunds the payment and releases the stock rather than rolling back a global transaction that never existed. Many designs instead make payment capture the pivot: everything before it is compensable, and once money has moved the saga drives forward to completion, retrying later steps such as shipment allocation rather than refunding.

Temporal and Cadence workflows

Cadence (originally from Uber) and Temporal (a fork of Cadence started by its original creators, now widely used) are durable workflow engines that persist a workflow's execution history, so a long-running orchestration survives process crashes and resumes exactly where it stopped. You write the saga as ordinary code that invokes activities; the engine retries failed activities according to their retry policy and re-drives the workflow after failures; and you express compensation as explicit cleanup that the workflow runs when a later step fails. They make orchestration-style sagas practical by solving the coordinator-durability problem the pattern depends on.
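
The idea of resuming from a persisted execution history can be shown with a toy model. The sketch below is not the Temporal or Cadence API: `Replayer`, `execute_activity`, and the list used as history are simplified, hypothetical stand-ins for the recorded-result replay those engines perform.

```python
class Replayer:
    """Toy engine: records activity results and replays them after a crash."""

    def __init__(self, history):
        self.history = history              # stands in for persisted history
        self._cursor = 0

    def execute_activity(self, fn, *args):
        if self._cursor < len(self.history):
            result = self.history[self._cursor]   # already ran: replay result
        else:
            result = fn(*args)                    # first time: really run it
            self.history.append(result)
        self._cursor += 1
        return result


calls = []                                  # side effects that actually ran

def reserve(order_id):
    calls.append("reserve")
    return f"res-{order_id}"

def charge(order_id):
    calls.append("charge")
    return f"pay-{order_id}"

def checkout(engine, order_id, crash_after_reserve=False):
    reservation = engine.execute_activity(reserve, order_id)
    if crash_after_reserve:
        raise RuntimeError("worker process died")
    payment = engine.execute_activity(charge, order_id)
    return reservation, payment


history = []
try:
    checkout(Replayer(history), "o-1", crash_after_reserve=True)
except RuntimeError:
    pass                                    # the worker is gone; history survives

result = checkout(Replayer(history), "o-1") # a new worker re-runs the code
```

The second run executes the same workflow code from the top, but the reservation is read back from history rather than repeated, so only the charge actually runs. This works only if the workflow code is deterministic, which is why such engines confine side effects to activities.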

Microservice checkout flows

Across order, payment, inventory, and shipping services, teams implement sagas either as event choreography, where each service emits and reacts to domain events, or as orchestration through a managed state machine such as AWS Step Functions or libraries like Eventuate Tram Sagas. Choreography keeps services decoupled but scatters the control flow across event handlers, while an orchestrator centralizes the sequence and compensation logic in one place that's easier to monitor, which is why larger flows tend to drift toward orchestration as they accumulate steps.
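
To make the contrast concrete, here is a hedged sketch of the choreographed variant using an in-memory event bus. `EventBus`, the event names, and the synchronous delivery are assumptions for illustration; a real system would use a message broker with asynchronous, at-least-once delivery.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical synchronous, in-memory event bus."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.history = []                   # every event type, in publish order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, order):
        self.history.append(event_type)
        for handler in list(self.handlers[event_type]):
            handler(order)


def wire_services(bus, card_ok):
    """Each service only knows which events it reacts to and emits."""
    # Inventory service
    bus.subscribe("OrderCreated", lambda o: bus.publish("StockReserved", o))
    bus.subscribe("PaymentFailed", lambda o: bus.publish("StockReleased", o))
    # Payment service
    bus.subscribe(
        "StockReserved",
        lambda o: bus.publish("PaymentCaptured" if card_ok else "PaymentFailed", o),
    )
    # Order service
    bus.subscribe("PaymentCaptured", lambda o: bus.publish("OrderConfirmed", o))
    bus.subscribe("StockReleased", lambda o: bus.publish("OrderRejected", o))


happy = EventBus()
wire_services(happy, card_ok=True)
happy.publish("OrderCreated", {"id": "o-1"})

failing = EventBus()
wire_services(failing, card_ok=False)
failing.publish("OrderCreated", {"id": "o-2"})
```

No function here contains the whole sequence: the happy path and the compensation path exist only as the combination of subscriptions, which is the traceability cost described above.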