thenumerix

The Story

From Stale Batch Jobs to Real-time Pipelines

Why Netflix had to rebuild recommendations from the ground up — and how the architecture evolved.

🎬

The Stale Recommendation Problem

Netflix's original recommendation system ran batch jobs nightly. A user who watches three horror movies at 10 PM is still seeing comedy suggestions the next morning — because the batch job already ran. At 2.3 million events per second, waiting 24 hours means serving millions of stale, irrelevant recommendations.

The cost: Stale recommendations directly reduce watch-through rate. Every percentage point lost at this scale means millions of dollars in subscriber churn annually.

⚡

Apache Kafka: The Event Backbone

Kafka is not a message queue — it's a distributed commit log. Every click, play, pause, seek, and trailer view becomes a durably-stored event in a partitioned topic. Producers write to topics with idempotent exactly-once semantics. Consumer groups allow multiple independent services to process the same stream at their own pace without coordination.

Key design: Topics partitioned by user_id (128 partitions), replication factor 3 across availability zones. Compression type: Snappy. Retention: 7 days for raw events, compacted forever for user profile changelog.

⚙️

Apache Flink: Stateful Stream Processing

Flink consumes from Kafka and maintains stateful user preference profiles using session windows — groups of activity with a 30-minute idle gap. If you watch three movies in one evening, Flink sees that as one session and aggregates all signals together. Out-of-order events (mobile network delays) are handled with a 5-second watermark.

State backend: RocksDB with incremental checkpointing every 60 seconds. Exactly-once processing guaranteed by two-phase commit between Kafka and the state backend. Parallelism = number of Kafka partitions (128) for zero data redistribution overhead.

🧠

Hybrid Collaborative Filtering

Collaborative filtering (ALS) decomposes the user-item interaction matrix R ≈ U × V^T into k=128 latent factors that capture hidden taste dimensions — "likes complex narratives," "prefers fast-paced action," etc. The cold-start problem (new content with no views) is solved by blending: content-based filtering for cold items (<100 views), ALS for warm items (>100 views), with a weight that shifts continuously as view count grows.

Cosine similarity: Scores candidates by dot product of the 128-dim user vector against all item vectors. Top-N in milliseconds via optimized BLAS routines. Diversity constraint: max 3 items per genre in top-10 results.

🚀

Serving at Scale with Redis + FastAPI

Updated user feature vectors flow from Flink into Redis as 512-byte binary blobs (float32 × 128 dims). At serve time, the FastAPI recommendation service fetches the vector in under 1ms and computes dot products against all item factors loaded in memory. The entire recommendation cycle — from Kafka event to cached Redis update to API response — completes in under 50ms end-to-end.

Infrastructure: Kubernetes HPA scales pods 2× during peak evening hours (8–11 PM). Redis Cluster with 6 shards. FastAPI async I/O. Load balancer → recommendation service → Redis feature store → pre-loaded item factors in memory.

📊

The Outcome: Real Numbers

Switching from nightly batch to real-time streaming didn't just reduce latency — it demonstrably changed user behavior. The system now adapts to each viewing session as it happens, serving recommendations that reflect what users are in the mood for right now, not yesterday.

Results: +23% watch-through rate (users completing movies) · +18% next-day return rate · <50ms E2E latency at 2.3M events/sec · 99.99% delivery SLA with zero data loss via exactly-once semantics and dead-letter queues.

Live Demo

Netflix Recommendation Pipeline

Two ways to explore: ELI5 for intuition, Engineer mode for the full Kafka → Flink → ALS pipeline.

Pick a movie you love. The pipeline will find your top 3 matches using the same genre-vector cosine similarity that runs in production — and explain why each one was chosen.

Kafka Recommendation Engine — Live Simulator

Watch Events

Trailers

Profile Build

ML Score

Recommend

Serve

User Session

Stream Log

Recommendation Metrics

0

Events Ingested

0

Recs Served

—

Avg Latency

—

Pred. CTR

User	Watched	Trailers	Top Genre	Top Rec	Score	Time

The Classroom

A Guided Walkthrough

Auto-advances every 6 seconds. Pause anytime, or click the dots to jump to any concept.

Step 1 of 6

⏱️

Batch vs. Streaming

Batch processing collects data, waits, then processes it all at once — like doing laundry once a week. Streaming processes each event the moment it arrives — like washing one item as soon as it's dirty. For recommendations, the difference is whether users get suggestions based on what they watched tonight, or what they watched last week.

🔧 Technical detail: Batch jobs minimize infrastructure costs but introduce staleness proportional to batch interval. Streaming adds latency complexity (watermarks, checkpoints) but reduces recommendation lag from hours to milliseconds. At Netflix scale, even 1-hour stale recommendations cost millions in watch-time.

Step 2 of 6

📨

Apache Kafka Architecture

Kafka is a distributed commit log. Events are written to topics, divided into partitions for parallelism, and replicated across brokers for fault tolerance. Consumer groups allow multiple independent services to read the same events — the recommendation engine, the analytics service, and the fraud detector can all consume from the same stream simultaneously without interfering.

🔧 Technical detail: Our setup: 128 partitions (keyed by user_id for ordering guarantees), replication factor 3, acks=all for exactly-once producer semantics. Idempotent producers prevent duplicate events on network retries. Schema Registry enforces Avro contracts between producers and consumers — no silent schema drift.

Step 3 of 6

🕐

Event Time vs. Processing Time

A user on a spotty mobile connection watches a movie offline and syncs the event 10 minutes later. Should we process it as if it happened now, or when it actually happened? Event time (when the event occurred) gives correct aggregations. Watermarks tell Flink "I've seen all events up to time T minus 5 seconds" — so it can safely close windows without waiting forever for late arrivals.

🔧 Technical detail: We use BoundedOutOfOrderness watermark strategy with a 5-second tolerance. Events arriving more than 5 seconds late are routed to a side output (dead-letter stream) for separate reprocessing. Session windows close when there's been no activity for 30 minutes — capturing natural viewing bursts without arbitrary time boundaries.

Step 4 of 6

🗄️

Stateful Stream Processing

Flink maintains user preference state across events — it remembers that you watched Action movies yesterday so it can blend that with today's Drama session. This state lives in RocksDB (on-disk, efficient for millions of users) and is checkpointed to durable storage every 60 seconds. If a Flink task crashes, it restores from the last checkpoint with no data loss and exactly-once processing guarantees.

🔧 Technical detail: Exactly-once is achieved via a two-phase commit protocol between Flink's checkpointing and Kafka's transaction API. Pre-checkpoint: Flink flushes all in-flight state. On commit: offsets are written atomically. On failure: Flink rolls back to the checkpoint and Kafka consumers replay from the last committed offset. No duplicates, no gaps.

Step 5 of 6

❄️

The Cold-Start Problem

Collaborative filtering requires history: "users like you also watched..." But what about a brand-new movie with zero views? Or a new user with no watch history? The cold-start problem kills pure collaborative filtering. The solution: a hybrid model that uses content-based filtering (genre, director, cast metadata) when history is thin, and shifts toward collaborative filtering as more behavioral data accumulates.

🔧 Technical detail: Hybrid weight formula: w_collab = min(1.0, view_count / 100). Below 100 views, content-based dominates. Above 100, ALS dominates. New user cold-start: initialize genre weights from the first item in the onboarding flow, then update in real-time from every subsequent event. The pipeline continuously shifts the balance as data accumulates — no manual thresholds to tune.

Step 6 of 6

🔒

Zero Data Loss Guarantees

In distributed systems, things fail constantly — network partitions, process crashes, disk failures. The pipeline guarantees zero data loss through three layers: exactly-once Kafka producers (idempotent, atomic transactions), Flink checkpointing every 60 seconds with two-phase commit, and dead-letter queues for events that can't be processed (invalid schema, downstream timeouts). Every event is either processed exactly once, or captured for manual review — never silently dropped.

🔧 Technical detail: Producer config: enable.idempotence=true, acks=all, max.in.flight.requests.per.connection=1. Consumer: isolation.level=read_committed (only reads committed transactions). Flink: ExactlyOnce checkpoint mode with aligned barriers. Dead-letter topic: user.events.dlq — consumed by an alert pipeline that pages on-call when DLQ rate exceeds 0.01%.

Key Points

What Makes This System Different

Two perspectives — intuitive for anyone, rigorous for engineers.

For Everyone

⚡

Your Profile Updates in 500ms

Finish watching a sci-fi movie at 9 PM. Within half a second, your recommendation feed has already updated to reflect your current mood — before you even scroll down to see what's next.

🌎

2.3 Million Events Per Second — One Pipeline

Every play, pause, seek, and trailer view from every user worldwide flows through the same Kafka + Flink pipeline simultaneously. It scales horizontally — add more Kafka partitions and Flink parallelism, capacity doubles.

🎯

Why "Because You Watched…" Actually Works

The system doesn't just track genres — it builds a 128-dimensional taste vector that captures subtle patterns: "likes slow-burn thrillers set outside the US," "prefers ensemble casts in drama," etc. Genre is just the surface layer.

🆕

New Users Still Get Great Recommendations

The cold-start hybrid means even your very first session generates useful recommendations — the system uses content metadata to seed initial suggestions, then shifts toward behavioral data within the first few hours of watching.

For Engineers

Exactly-Once Semantics End-to-End

Idempotent Kafka producers (enable.idempotence=true) + Flink two-phase commit checkpointing + read_committed consumer isolation = zero duplicates, zero data loss across the entire pipeline — even through node failures and network partitions.

🧮

ALS: R ≈ U × V^T, k=128

Alternating Least Squares decomposes the sparse user-item matrix into two dense factor matrices. Scoring is a single dot product: scores = V @ u_i. k=128 balances expressiveness vs. memory. Regularization λ=0.01 prevents overfitting on power users with thousands of interactions.

🪟

30-Min Session Window + 5s Watermark

EventTimeSessionWindows(gap=30m) groups activity bursts naturally. WatermarkStrategy.forBoundedOutOfOrderness(5s) handles mobile network delays. Parallelism = Kafka partition count (128) for zero data redistribution overhead in the shuffle step.

Redis: 512 Bytes Per User Vector

User factor vectors stored as raw float32 bytes (128 dims × 4 bytes = 512 bytes/user). Redis GET latency: under 1ms P99. At 100M users: 48GB per cluster shard. TTL=86400s refreshed on every profile update — stale vectors automatically evict.

Production Code

Production Implementation

Real production code for the three core components of the recommendation pipeline.

Flink User Profile Aggregator (Python / PyFlink)

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import EventTimeSessionWindows
from pyflink.common.time import Time
from pyflink.common.watermark_strategy import WatermarkStrategy
import json, math

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000)  # exactly-once, 60s interval

# Consume user activity events from Kafka
kafka_source = KafkaSource.builder() \
    .set_bootstrap_servers("kafka:9092") \
    .set_topics("user.watch-events", "user.trailer-events") \
    .set_group_id("profile-aggregator-cg") \
    .set_value_only_deserializer(JsonRowDeserializationSchema()) \
    .build()

events = env.from_source(
    kafka_source,
    WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(5)),
    "kafka-events"
)

def compute_genre_profile(key, events, ctx):
    """Sliding 30-day window with exponential recency decay."""
    now = ctx.timestamp()
    HALF_LIFE_DAYS = 7
    genre_weights = {}

    for event in events:
        age_days = (now - event["timestamp"]) / 86_400_000
        decay = math.exp(-0.693 * age_days / HALF_LIFE_DAYS)
        weight = event["watch_pct"] * decay
        genre = event["genre"]
        genre_weights[genre] = genre_weights.get(genre, 0.0) + weight

    # Normalize to unit vector for cosine similarity downstream
    magnitude = math.sqrt(sum(v ** 2 for v in genre_weights.values()))
    if magnitude > 0:
        genre_weights = {k: v / magnitude for k, v in genre_weights.items()}

    return {
        "user_id": key,
        "genre_vector": genre_weights,
        "event_count": len(list(events)),
        "updated_at": now
    }

# Session window groups bursts of activity (gap = 30 min)
profiles = events \
    .key_by(lambda e: e["user_id"]) \
    .window(EventTimeSessionWindows.with_gap(Time.minutes(30))) \
    .apply(compute_genre_profile)

# Sink updated profiles to the feature store (Redis)
profiles.add_sink(
    RedisSink(redis_config, RedisMapper(
        key_fn=lambda p: f"user:{p['user_id']}:profile",
        value_fn=lambda p: json.dumps(p),
        ttl=86_400  # 24h TTL, refreshed on each update
    ))
)

env.execute("user-profile-aggregator")

Collaborative Filtering with Implicit ALS (Python)

import numpy as np
from scipy.sparse import csr_matrix
from implicit.als import AlternatingLeastSquares

def build_interaction_matrix(events, user_idx, item_idx):
    """Build user-item matrix: weight = completion% x recency decay."""
    rows, cols, vals = [], [], []
    now = max(e["timestamp"] for e in events)

    for e in events:
        uid = user_idx[e["user_id"]]
        iid = item_idx[e["content_id"]]
        age_days = (now - e["timestamp"]) / 86_400_000
        recency = np.exp(-0.693 * age_days / 7)  # half-life = 7 days
        weight = e["watch_pct"] * recency
        rows.append(uid)
        cols.append(iid)
        vals.append(weight)

    return csr_matrix(
        (vals, (rows, cols)),
        shape=(len(user_idx), len(item_idx))
    )

# Train ALS for implicit feedback (no explicit ratings)
model = AlternatingLeastSquares(
    factors=128,           # latent dimensions
    regularization=0.01,   # L2 penalty
    iterations=20,         # alternating steps
    use_gpu=False
)

interaction_matrix = build_interaction_matrix(events, user_idx, item_idx)
model.fit(interaction_matrix)

def recommend(user_id, n=20, already_watched=None):
    """Generate top-K recs via dot product of user x item factors."""
    uid = user_idx[user_id]
    user_factors = model.user_factors[uid]       # (128,)
    scores = model.item_factors.dot(user_factors) # (n_items,)

    # Zero out already-watched items
    if already_watched:
        for item_id in already_watched:
            scores[item_idx[item_id]] = -np.inf

    top_k = np.argsort(scores)[::-1][:n]
    return [(idx_to_item[i], float(scores[i])) for i in top_k]

Real-Time Recommendation Serving (Python / FastAPI)

from fastapi import FastAPI
from redis.asyncio import Redis
import numpy as np, json

app = FastAPI()
redis = Redis(host="redis", port=6379, decode_responses=True)

# Pre-loaded item factors (n_items x 128) from last ALS training
ITEM_FACTORS = np.load("/models/item_factors.npy")
ITEM_META = json.load(open("/models/item_meta.json"))

@app.get("/recommend/{user_id}")
async def recommend(user_id: str, n: int = 10):
    # 1. Load pre-computed user factors from Redis
    raw = await redis.get(f"user:{user_id}:factors")
    if not raw:
        return {"recs": [], "reason": "cold-start: no user factors"}
    user_factors = np.frombuffer(raw, dtype=np.float32)  # (128,)

    # 2. Score all candidate items via dot product
    scores = ITEM_FACTORS @ user_factors  # (n_items,)

    # 3. Filter already-watched content
    watched = await redis.smembers(f"user:{user_id}:watched")
    for item_id in watched:
        idx = ITEM_META["id_to_idx"].get(item_id)
        if idx is not None:
            scores[idx] = -np.inf

    # 4. Apply diversity constraint: max 3 per genre in top results
    ranked = np.argsort(scores)[::-1]
    results, genre_counts = [], {}

    for idx in ranked:
        if len(results) >= n:
            break
        meta = ITEM_META["items"][idx]
        genre = meta["genre"]
        if genre_counts.get(genre, 0) >= 3:
            continue
        # Boost new releases (< 30 days old)
        boost = 1.15 if meta.get("is_new_release") else 1.0
        genre_counts[genre] = genre_counts.get(genre, 0) + 1
        results.append({
            "item_id": meta["id"],
            "title": meta["title"],
            "genre": genre,
            "score": round(float(scores[idx]) * boost, 4),
            "reason": f"Similar to your {genre} preferences"
        })

    return {"user_id": user_id, "recs": results}

Apache Kafka Azure Event Hubs Apache Flink Spark Streaming Redis ClickHouse Delta Lake Schema Registry Avro / Protobuf Grafana Django Channels Celery Docker Kubernetes

Interview Talking Point

"I architect event-driven systems using Kafka with exactly-once semantics and stateful stream processing with Flink. For a Netflix-style recommendation engine, I built a pipeline where Kafka consumer groups ingest watch-history and trailer-completion events, Flink session windows aggregate user preference profiles in real-time, and a hybrid ALS + content-based model scores candidates. Completed trailers are weighted as high-intent signals for genre affinity. Redis serves the feature store with sub-ms lookup, and recommendations are served via API with <50ms E2E latency. I use checkpointing and dead-letter queues to guarantee zero data loss."

About

About This Project

Kieth Caballero

Data & ML Platform Engineer

"I build the pipelines that make real-time AI possible at scale."

2.3M

Events / sec

<50ms

E2E Latency

99.99%

Delivery SLA

6

Pipeline Stages

20

Movies Scored

GitHub LinkedIn

Real-time Streaming
Architecture

From Stale Batch Jobs to Real-time Pipelines

The Stale Recommendation Problem

Apache Kafka: The Event Backbone

Apache Flink: Stateful Stream Processing

Hybrid Collaborative Filtering

Serving at Scale with Redis + FastAPI

The Outcome: Real Numbers

Netflix Recommendation Pipeline

Because you watched…

Kafka Recommendation Engine — Live Simulator

A Guided Walkthrough

What Makes This System Different

Your Profile Updates in 500ms

2.3 Million Events Per Second — One Pipeline

Why "Because You Watched…" Actually Works

New Users Still Get Great Recommendations

Exactly-Once Semantics End-to-End

ALS: R ≈ U × V^T, k=128

30-Min Session Window + 5s Watermark

Redis: 512 Bytes Per User Vector

Production Implementation

Interview Talking Point

About This Project

Real-time StreamingArchitecture

From Stale Batch Jobs to Real-time Pipelines

The Stale Recommendation Problem

Apache Kafka: The Event Backbone

Apache Flink: Stateful Stream Processing

Hybrid Collaborative Filtering

Serving at Scale with Redis + FastAPI

The Outcome: Real Numbers

Netflix Recommendation Pipeline

Because you watched…

Kafka Recommendation Engine — Live Simulator

A Guided Walkthrough

What Makes This System Different

Your Profile Updates in 500ms

2.3 Million Events Per Second — One Pipeline

Why "Because You Watched…" Actually Works

New Users Still Get Great Recommendations

Exactly-Once Semantics End-to-End

ALS: R ≈ U × VT, k=128

30-Min Session Window + 5s Watermark

Redis: 512 Bytes Per User Vector

Production Implementation

Interview Talking Point

About This Project

Real-time Streaming
Architecture

ALS: R ≈ U × V^T, k=128