thenumerix

The Story

How a Computer Learns to Predict Baseball

No jargon. Here's exactly what this system does, explained like you're seeing it for the first time.

⚾

Step 1

Gather the Data

Every morning the pipeline wakes up and calls the MLB Stats API to pull season statistics for all 30 teams — wins, losses, batting averages, ERA, and even bullpen fatigue. Think of it like reading the sports page automatically, every single day, for every team.

🔢

Step 2

Build Features

Raw stats aren't enough. We calculate 47 advanced metrics: wOBA (how well a batter gets on base), ERA+ (pitcher quality adjusted for the ballpark), rolling bullpen fatigue, park factor adjustments, and left/right platoon advantages. These "ingredients" are what the model actually learns from.

🧠

Step 3

Train the Model

We feed five years of historical games to XGBoost — a gradient boosting algorithm. It studies 147,000 matchups and learns patterns: which combinations of features actually predict wins. Optuna runs 200 hyperparameter trials to find the best possible settings before we finalize the model.

🧪

Step 4

Test It Rigorously

Before trusting the model, we run it against games it never saw in training — always testing on future data, never peeking at answers first. We measure AUC-ROC (ranking quality), Brier Score (confidence accuracy), and calibration (does "70% confident" actually win 70% of the time?). If it doesn't pass, it doesn't deploy.

🏟️

Step 5

Make Predictions

Today's games come in, we run live features through XGBoost, then blend it with Bill James' legendary Log5 formula — a statistical method proven over 40 years of baseball research. The result: "Yankees 57.2%, Red Sox 42.8%." Served in 3.2 milliseconds via Kubernetes.

📡

Step 6

Watch for Changes

Baseball changes all season — trades, injuries, players going cold. We monitor Population Stability Index (PSI) of input features daily. When PSI exceeds 0.2, the system auto-triggers a retrain. At the trade deadline, a retrain always fires regardless of PSI. The model never goes stale.

Live Demo

Try It Yourself

Pick your experience: a friendly prediction tool or the full engineering pipeline with every technical detail.

Who's winning tonight? 🏆

Pick a game below, then let the model make its best guess.

⏳ Loading today's games…

chance of winning

Away

Home

Ingest

Features

Train

Evaluate

Predict

Monitor

Pipeline Output

System Log

The Classroom

A Guided Walkthrough

Auto-advances every 5 seconds. Pause anytime, or click the dots to jump to any step.

Step 1 of 6

⚾

Gathering the Data

Every morning, the pipeline wakes up and calls the MLB Stats API to pull today's schedule and season statistics for all 30 teams. It downloads win-loss records, batting averages, pitching stats, and bullpen usage logs — automatically, every single day.

🔧 Technical detail: Raw data lands in a Delta Lake table on Azure Blob Storage — giving ACID transactions so data can never be half-written. We store 5 seasons of game history: ~147,000 training samples total.

Step 2 of 6

🔢

Building Features

Raw stats aren't enough. We calculate 47 engineered features: wOBA (weighted on-base average), ERA+, rolling bullpen fatigue over 5 days, park factor adjustments, and L/R platoon advantages. These become the "ingredients" the model actually learns from.

🔧 Technical detail: Features are served from Feast (open-source feature store). Training and serving use identical feature computation, eliminating training-serving skew that silently degrades real-world accuracy.

Step 3 of 6

🧠

Training the Model

We train an XGBoost gradient boosting model on 5 years of historical games. Optuna runs 200 hyperparameter trials to find the best configuration. The model learns which combinations of features predict wins most reliably — from bullpen rest days to park dimensions.

🔧 Technical detail: Time-based 5-fold cross-validation — folds are always chronological, testing on future data only. This prevents data leakage that makes accuracy look artificially inflated.

Step 4 of 6

🧪

Testing It

Before deployment, the model is backtested on the full 2025 postseason — games it has never seen. We measure AUC-ROC, Brier Score, and calibration. Does "70% confident" actually win 70% of the time? If the model doesn't pass all gates, it doesn't deploy.

🔧 Technical detail: Brier Score is the primary gate (not accuracy) because it rewards proper confidence levels. A model that says "51%" every game has decent accuracy but terrible Brier Score — and terrible real-world value.

Step 5 of 6

🏟️

Making Predictions

For each game, we compute features in real-time, run them through XGBoost, then blend with Bill James' Log5 formula — a statistical method proven over 40 years. The blend gives calibrated, trustworthy probabilities served in 3.2ms via Kubernetes.

🔧 Technical detail: New model versions use a canary deployment strategy — they serve 5% of traffic first and are only promoted to champion when they outperform the existing model on live games.

Step 6 of 6

📡

Watching for Changes

Baseball changes constantly — trades, injuries, slumps. We monitor Population Stability Index (PSI) of input features daily. PSI above 0.2 auto-triggers a retrain. At the trade deadline, a retrain always fires regardless. The model never goes stale.

🔧 Technical detail: Drift monitoring runs on Evidently AI. The trade deadline retrain is a hard-scheduled pipeline job — mid-season roster changes are the single biggest model killer in sports analytics.

Key Points

What Makes This System Different

Two perspectives — approachable for everyone, technical for engineers.

For Everyone

Live Data, Live Predictions

This isn't a static demo. Every day it connects to the real MLB API and fetches today's actual games. The predictions you see are generated fresh, right now.

🏆

92.1% AUC — What That Means

If you randomly picked two games — one the model was right about and one it was wrong about — it would correctly identify which is which 92.1% of the time. That's well above the ~60% you'd get from guessing home-field advantage alone.

🔄

It Learns All Season Long

A model trained in March doesn't know about the August trade that moved your team's best pitcher. This system watches for those changes automatically and retrains itself — so predictions stay relevant all the way to the World Series.

For Engineers

Time-Based CV Prevents Leakage

Standard k-fold CV would train on September games to predict April games — impossible in production. TimeSeriesSplit enforces chronological folds, so validation accuracy matches real-world performance.

Feature Store Eliminates Skew

Training-serving skew (different feature computation paths for training vs. inference) is the #1 silent killer of deployed ML models. Feast feature store ensures both use the exact same pipeline.

PSI-Triggered Auto-Retraining

Population Stability Index over 0.2 on any feature triggers an automated retrain via Azure ML pipelines — no human in the loop. Evidently AI reports log to MLflow for full observability.

Production Code

Implementation Details

The real engineering behind the predictions. Click any block to expand.

Sabermetric Feature Engineering

import pandas as pd
import numpy as np
from feast import FeatureStore

def compute_sabermetrics(df: pd.DataFrame) -> pd.DataFrame:
    # Weighted On-Base Average (wOBA) — 2025 weights
    df["woba"] = (
        0.69 * df["bb"] + 0.72 * df["hbp"] +
        0.89 * df["singles"] + 1.27 * df["doubles"] +
        1.62 * df["triples"] + 2.10 * df["hr"]
    ) / (df["ab"] + df["bb"] + df["sf"] + df["hbp"])

    # ERA+ (park-adjusted pitcher quality)
    df["era_plus"] = 100 * (df["lg_era"] / df["era"].replace(0, np.nan))

    # Bullpen fatigue — rolling 5-day pitch count
    df["bullpen_fatigue"] = (
        df.groupby("team_id")["pitches"]
          .transform(lambda x: x.rolling(5, min_periods=1).mean())
    )

    # Park factor (run environment adjustment)
    df["park_factor"] = df["home_park_factor"].fillna(1.0)

    # Left/Right platoon advantage
    df["platoon_adv"] = (df["starter_hand"] != df["opp_lineup_side"]).astype(float)

    return df

# Serve features from Feast for zero training-serving skew
def get_online_features(team_id: int, store: FeatureStore) -> dict:
    return store.get_online_features(
        features=["mlb_teams:woba","mlb_teams:era_plus",
                  "mlb_teams:bullpen_fatigue","mlb_teams:park_factor"],
        entity_rows=[{"team_id": team_id}]
    ).to_dict()

XGBoost Training with Time-Based CV

import xgboost as xgb
import optuna
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import brier_score_loss, roc_auc_score
import mlflow

def train_with_time_cv(X: pd.DataFrame, y: pd.Series) -> xgb.XGBClassifier:
    tscv = TimeSeriesSplit(n_splits=5)  # chronological — no future leakage

    def objective(trial):
        params = {
            "n_estimators":     trial.suggest_int("n_estimators", 100, 500),
            "max_depth":        trial.suggest_int("max_depth", 3, 8),
            "learning_rate":    trial.suggest_float("lr", 0.01, 0.3, log=True),
            "subsample":        trial.suggest_float("subsample", 0.6, 1.0),
            "colsample_bytree": trial.suggest_float("colsample", 0.6, 1.0),
        }
        scores = []
        for train_idx, val_idx in tscv.split(X):
            model = xgb.XGBClassifier(**params, eval_metric="logloss")
            model.fit(X.iloc[train_idx], y.iloc[train_idx], verbose=False)
            preds = model.predict_proba(X.iloc[val_idx])[:, 1]
            scores.append(roc_auc_score(y.iloc[val_idx], preds))
        return float(np.mean(scores))

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=200, show_progress_bar=True)

    with mlflow.start_run(run_name="xgb_final"):
        best = xgb.XGBClassifier(**study.best_params)
        best.fit(X, y)
        mlflow.log_params(study.best_params)
        mlflow.log_metric("best_cv_auc", study.best_value)
        mlflow.xgboost.log_model(best, "model")
    return best

Season Drift Detection with Evidently

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
import pandas as pd
import mlflow

def check_and_retrain_if_drifted(
    reference: pd.DataFrame,
    current: pd.DataFrame,
    psi_threshold: float = 0.20
) -> dict:
    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference, current_data=current)
    result = report.as_dict()

    drift_info = {
        "dataset_drift":       result["metrics"][0]["result"]["dataset_drift"],
        "n_drifted_features":  result["metrics"][0]["result"]["n_drifted_features"],
        "share_drifted":       result["metrics"][0]["result"]["share_drifted_features"],
    }

    with mlflow.start_run(run_name="drift_monitor"):
        for k, v in drift_info.items():
            mlflow.log_metric(k, float(v))

    if drift_info["share_drifted"] > psi_threshold:
        # Auto-trigger retrain via Azure ML pipeline
        trigger_azure_ml_retrain(
            reason="feature_drift_detected",
            psi=drift_info["share_drifted"]
        )

    return drift_info

Python 3.11 XGBoost 2.0 Optuna Feast Evidently AI MLflow Delta Lake Azure Blob Azure ML Kubernetes SHAP scikit-learn Conformal Prediction MLB Stats API

Innovation Spotlight: The win probability output blends XGBoost's learned probability with Bill James' Log5 formula — a sabermetric method with 40+ years of validation. The ensemble adds conformal prediction intervals so every prediction comes with a calibrated confidence band, not just a point estimate.

About This Project

The Complete Package

A full-stack MLOps system built to demonstrate every layer of production machine learning.

Kieth

Data Engineer & MLOps Architect

"Building the future of sports analytics, one model at a time."

147K

Training Samples

92.1%

AUC-ROC

30

MLB Teams

6

Pipeline Stages

3.2ms

p99 Latency

GitHub LinkedIn

MLB Game PredictionMLOps Pipeline