API Orchestration

When You Click
"Buy Now" on Amazon

23 microservices. One saga. Under 100ms. The moment you place an order, a distributed saga starts — Stripe authorization to FedEx label, all or nothing.

23+
Microservices
99.9%
Uptime SLA
42ms
Avg Latency
3.2M
Calls / Day
The Story

One Click. 23 Services. Zero Inconsistency.

Every Amazon order runs a distributed saga — a chain of steps that either all commit, or all cleanly roll back. Here's what really happens.

🔐

You Click "Buy Now" — API Gateway Receives It

Your HTTPS POST hits Amazon's API Gateway. In milliseconds: JWT validated, idempotency key issued (so a network retry never double-charges you), rate limit checked (you're at 3 of 1000 rpm), all 23 circuit breakers verified closed. The saga starts.

💳

Stripe Authorizes — But Does Not Charge — Your Card

Stripe runs 300+ fraud signals in 800ms. Your card is authorized (funds ring-fenced) but NOT yet captured. If the warehouse is out of stock, Amazon voids the authorization — you never see a pending charge.

📦

Phoenix Warehouse PHX7 Reserves Your Stock

The nearest fulfillment center decrements the MacBook Pro counter atomically. If this saga fails later, a compensation fires: inventory.release(sku, qty). Stock is restored — Amazon can't oversell an item.

🚚

FedEx Label Created (UPS Fallback if FedEx Times Out)

FedEx API generates a tracking number. Timeout? Exponential backoff: retry at 1s, 2s, 4s with ±30% jitter. After 3 failures, the circuit breaker opens — Amazon's gateway auto-routes to UPS. You never know it happened.

💰

Now Stripe Captures the Payment

Only after inventory is reserved and a shipping label exists does Stripe capture your card. Authorize-then-capture: you're never charged for an order Amazon can't fulfill. The saga commits all 6 steps atomically.

23 Notifications Fire in Parallel

Confirmation email (SES), push notification (SNS), #orders Slack webhook, your registered endpoint — fired in parallel via async tasks. OpenTelemetry traces the full saga: 23 spans across 6 services, under 100ms total.

Live Simulation

Run the Order Saga

Watch a distributed Amazon-style order saga execute in real time — payment, inventory, shipping, and notifications, with random failures and automatic compensation.

Pick a scenario — see what the saga does

Choose an order outcome and see how Amazon's saga handles it — successfully or with a compensation rollback.

Amazon Saga Orchestrator — Live Simulator

Gateway
Payment
Inventory
Shipping
Notify
Complete
Order Details
Saga Log
Saga Dashboard
0
Sagas Run
Total Spans
Avg Latency
Error Rate
OrderAmountServicesSpansLatencyStatus
Classroom

6 Concepts Behind Every Amazon Order

Core ideas explained simply. Auto-advances every 6 seconds — or navigate yourself.

Slide 1 of 6 — Patterns

⛓️ The Distributed Saga

Traditional databases use ACID transactions — all-or-nothing. But an Amazon order touches Stripe, a warehouse system, FedEx, and SES. There's no single database to roll back across all four.

A saga breaks the transaction into local steps. Each step commits to its own service's database. If step N fails, steps N−1 down to 1 each run a compensating action that semantically reverses the effect.

  • Step 1 success → Step 2 starts
  • Step 3 fails → compensate Step 2, then Step 1
  • All compensations idempotent — safe to run twice
Slide 2 of 6 — Resilience

🔌 Circuit Breakers — Three States

Named after electrical circuit breakers. When a downstream service starts failing, you stop calling it instead of waiting for timeouts on every request.

  • CLOSED — normal operation, all requests pass through
  • OPEN — after N failures in T seconds, fail fast without calling the service. Zero wasted connections.
  • HALF-OPEN — after a timeout, send one test request. Success → CLOSED. Failure → OPEN again.

Amazon's gateway tracks error rates per service. FedEx at 50% errors in 10s? Circuit opens, UPS fallback activates automatically.

Slide 3 of 6 — Reliability

🔑 Idempotency Keys — Safe to Retry

Network fails. Did the Stripe charge go through? Retry safely with an idempotency key: order_id + ":" + step + ":" + attempt.

Stripe receives the retry, checks its key store (Redis, 24h TTL), finds the key already used, and returns the cached result — no second charge.

  • Every saga step gets a unique idempotency key
  • Compensation steps get their own keys: corrId + "-comp-pay"
  • Enables at-least-once delivery with exactly-once effect
Slide 4 of 6 — Observability

🔍 Distributed Tracing with OpenTelemetry

One Amazon order creates a trace_id (128-bit UUID). Every service creates a span with that trace_id as parent. OpenTelemetry injects context into HTTP headers:

traceparent: 00-{traceId}-{spanId}-01

The next service extracts it and continues the same trace. In Jaeger you see a flame chart: 23 spans across 6 services. Find the slow service in 10 seconds, not 10 hours.

  • Each span records: service name, duration, status, custom attributes
  • inject(headers) propagates context across HTTP boundaries
Slide 5 of 6 — Architecture

⚖️ Saga vs Two-Phase Commit (2PC)

2PC: coordinator sends PREPARE to all services. All respond READY. Coordinator sends COMMIT. One slow service → everyone waits. In microservices, each service owns its own DB. 2PC requires distributed locks across all DBs.

Saga: local commits only. No distributed locks. Scales horizontally. Tolerates partial failures. The tradeoff:

  • 2PC: strong isolation, but blocks on participant failure
  • Saga: eventual consistency, intermediate states are briefly visible
  • Amazon chooses saga — availability over strict isolation
Slide 6 of 6 — Failure Handling

↩️ Compensation Chains in Practice

A compensation is NOT a database rollback. It's a forward action with the semantic opposite effect: refund_payment() instead of undo_payment(). It runs in reverse step order:

  • Shipping failed? → cancel_shipment()
  • Then → release_inventory()
  • Then → void_stripe_authorization()

Each compensation must be idempotent — running it twice is safe. If a compensation itself fails after 3 retries, it goes to the Dead Letter Queue (DLQ) for human review.

1 / 6
Key Points

What This Means

The saga pattern, explained for everyone and implemented for engineers.

For Everyone
  • Your card isn't charged until Amazon confirms the box is reserved. Stripe authorizes (rings-fence funds) but captures only after inventory is secured — you're never charged for an item Amazon can't ship.
  • If FedEx is down, Amazon automatically switches to UPS. Circuit breakers monitor error rates. When FedEx hits 50% failures, the breaker opens and the fallback carrier activates — invisible to you.
  • You can never be charged twice for the same order. Every saga step uses an idempotency key. Network retries find the cached result and skip re-execution — exactly-once, always.
  • If anything goes wrong mid-order, everything rolls back automatically. Saga compensation runs in reverse: cancel label → release inventory → void payment auth. Your account is untouched.
For Engineers
  • mulberry32(hashStr(orderId)) — deterministic PRNG seeds failure injection from the order ID. Same order ID always fails at the same step in testing — reproducible chaos engineering.
  • Saga state machine: STARTED → PAYMENT_PENDING → PAYMENT_CONFIRMED → INVENTORY_RESERVED → SHIPPING_INITIATED → COMPLETED | COMPENSATING → FAILED. Persisted to DB on every transition.
  • Jitter formula: delay = min(base × 2^attempt × (0.7 + rand×0.6), max_delay) — prevents thundering herd when 1,000 correlated failures all retry simultaneously.
  • OpenTelemetry context propagation: inject(headers) writes traceparent into outbound HTTP headers. Downstream services call extract(headers) to continue the same trace — 23 spans, one root trace_id.
Production Code

Production Implementation

Real-world saga patterns with state machines, compensation chains, and distributed tracing.

Saga Orchestrator with State Machine (Python)
import enum
from dataclasses import dataclass, field

class SagaState(enum.Enum):
    STARTED = "STARTED"
    PAYMENT_PENDING = "PAYMENT_PENDING"
    PAYMENT_CONFIRMED = "PAYMENT_CONFIRMED"
    INVENTORY_RESERVED = "INVENTORY_RESERVED"
    SHIPPING_INITIATED = "SHIPPING_INITIATED"
    COMPLETED = "COMPLETED"
    COMPENSATING = "COMPENSATING"
    FAILED = "FAILED"

@dataclass
class SagaExecution:
    saga_id: str
    order_id: str
    state: SagaState = SagaState.STARTED
    completed_steps: list = field(default_factory=list)

    def persist(self, store):
        """Persist to DB -- survives orchestrator crashes."""
        store.save(self.saga_id, {
            "state": self.state.value,
            "completed_steps": self.completed_steps
        })

class SagaOrchestrator:
    TRANSITIONS = {
        SagaState.STARTED:            ("process_payment",    SagaState.PAYMENT_PENDING),
        SagaState.PAYMENT_PENDING:    ("confirm_payment",    SagaState.PAYMENT_CONFIRMED),
        SagaState.PAYMENT_CONFIRMED:  ("reserve_inventory",  SagaState.INVENTORY_RESERVED),
        SagaState.INVENTORY_RESERVED: ("initiate_shipping",  SagaState.SHIPPING_INITIATED),
        SagaState.SHIPPING_INITIATED: ("finalize",           SagaState.COMPLETED),
    }

    COMPENSATIONS = {
        "initiate_shipping": "cancel_shipment",
        "reserve_inventory": "release_inventory",
        "confirm_payment":   "refund_payment",
        "process_payment":   "void_authorization",
    }

    async def execute(self, saga: SagaExecution):
        while saga.state in self.TRANSITIONS:
            action, next_state = self.TRANSITIONS[saga.state]
            try:
                await getattr(self.services, action)(
                    saga.order_id,
                    idempotency_key=f"{saga.saga_id}:{action}"
                )
                saga.completed_steps.append(action)
                saga.state = next_state
                saga.persist(self.store)
            except Exception as e:
                saga.state = SagaState.COMPENSATING
                saga.persist(self.store)
                await self._compensate(saga)
                return saga
        return saga

    async def _compensate(self, saga: SagaExecution):
        """Walk backward through completed steps."""
        for step in reversed(saga.completed_steps):
            comp = self.COMPENSATIONS.get(step)
            if comp:
                try:
                    await getattr(self.services, comp)(
                        saga.order_id,
                        idempotency_key=f"{saga.saga_id}:{comp}"
                    )
                except Exception:
                    await self.store.send_to_dlq(saga.saga_id, comp)
        saga.state = SagaState.FAILED
        saga.persist(self.store)
Compensating Transaction Handler (Python)
import asyncio, logging, time
from dataclasses import dataclass
from typing import Callable, Awaitable

@dataclass
class CompensationStep:
    name: str
    execute: Callable[..., Awaitable]
    max_retries: int = 3
    base_delay: float = 1.0

class CompensationChain:
    def __init__(self, dlq_client):
        self.steps = []
        self.dlq = dlq_client
        self.log = logging.getLogger("compensation")

    def add(self, name, handler, max_retries=3):
        self.steps.append(CompensationStep(name, handler, max_retries))
        return self

    async def execute(self, saga_id: str, context: dict):
        for step in reversed(self.steps):
            await self._retry_step(saga_id, step, context)

    async def _retry_step(self, saga_id, step, context):
        for attempt in range(step.max_retries):
            try:
                idem_key = f"{saga_id}:comp:{step.name}:{attempt}"
                await step.execute(context=context, idempotency_key=idem_key)
                self.log.info(f"[{saga_id}] {step.name} compensated")
                return
            except Exception as e:
                jitter = 0.7 + 0.6 * (hash(idem_key) % 100) / 100
                delay = min(step.base_delay * (2 ** attempt) * jitter, 30.0)
                self.log.warning(f"[{saga_id}] {step.name} retry {attempt+1}: {e}")
                await asyncio.sleep(delay)

        # Exhausted retries -- dead letter queue
        self.log.critical(f"[{saga_id}] {step.name} -> DLQ")
        await self.dlq.publish({
            "saga_id": saga_id, "step": step.name,
            "context": context, "failed_at": time.time()
        })

# Usage
chain = CompensationChain(dlq_client=redis_dlq)
chain.add("cancel_shipment",   shipping_svc.cancel)
chain.add("release_inventory", inventory_svc.release)
chain.add("refund_payment",    payment_svc.refund)
await chain.execute("SAGA-12345", {"order_id": "ORD-2024-56789"})
Distributed Tracing with OpenTelemetry (Python)
from opentelemetry import trace
from opentelemetry.trace import StatusCode
from opentelemetry.propagate import inject
import time

tracer = trace.get_tracer("saga.orchestrator")

class TracedSagaOrchestrator:
    async def execute_saga(self, order_id, steps):
        with tracer.start_as_current_span(
            "saga.execute",
            attributes={"saga.order_id": order_id, "saga.steps": len(steps)}
        ) as parent:
            trace_id = format(parent.get_span_context().trace_id, '032x')
            completed = []

            for step in steps:
                try:
                    await self._traced_step(step, order_id)
                    completed.append(step)
                except Exception as e:
                    parent.set_status(StatusCode.ERROR, str(e))
                    parent.set_attribute("saga.failed_at", step["name"])
                    await self._compensate_traced(completed, order_id)
                    return {"status": "compensated", "trace_id": trace_id}

            parent.set_attribute("saga.status", "completed")
            return {"status": "completed", "trace_id": trace_id}

    async def _traced_step(self, step, order_id):
        with tracer.start_as_current_span(
            f"saga.step.{step['name']}",
            attributes={"service.name": step["service"]}
        ) as span:
            headers = {}
            inject(headers)  # propagate trace context across boundary
            start = time.monotonic()
            result = await step["handler"](
                order_id=order_id, headers=headers,
                deadline_ms=step.get("deadline_ms")
            )
            span.set_attribute("duration_ms", (time.monotonic() - start) * 1000)
            return result

    async def _compensate_traced(self, completed, order_id):
        with tracer.start_as_current_span("saga.compensate"):
            for step in reversed(completed):
                with tracer.start_as_current_span(
                    f"saga.compensate.{step['name']}"
                ) as comp_span:
                    try:
                        await step["compensate"](order_id=order_id)
                        comp_span.set_attribute("status", "success")
                    except Exception as e:
                        comp_span.set_status(StatusCode.ERROR, str(e))
                        comp_span.set_attribute("status", "dlq")
About

Built by Kieth Caballero

Senior Software Engineer specializing in distributed systems, API integration, and platform engineering.

KC
Kieth Caballero
Senior Software Engineer — Platform & API Integration

I design API integration layers with circuit breakers, exponential backoff, and saga orchestration for distributed transactions. At Witherite Law, I built a gateway handling 3.2M+ API calls/day across 25 services with 99.9% uptime. I use idempotency keys for deduplication, distributed tracing with OpenTelemetry, and compensating transactions for rollbacks when downstream services fail.

3.2M
API calls/day
42ms
Avg latency
99.9%
Uptime
25+
Integrations
6
Saga services