MLB Analytics / MLOps
MLB Game Prediction
MLOps Pipeline
A production-grade machine learning system that predicts today's MLB games using sabermetrics,
XGBoost, and real-time data — monitored and retrained all season long.
The Story
How a Computer Learns to Predict Baseball
No jargon. Here's exactly what this system does, explained like you're seeing it for the first time.
⚾
Step 1
Gather the Data
Every morning the pipeline wakes up and calls the MLB Stats API to pull season statistics for all 30 teams: wins, losses, batting averages, ERA, and the pitch counts used to gauge bullpen fatigue. Think of it like reading the sports page automatically, every single day, for every team.
🔢
Step 2
Build Features
Raw stats aren't enough. We calculate 47 advanced metrics: wOBA (a batter's overall offensive value, with each way of reaching base weighted by how many runs it is worth), ERA+ (pitcher quality adjusted for the league and the ballpark), rolling bullpen fatigue, park factor adjustments, and left/right platoon advantages. These "ingredients" are what the model actually learns from.
🧠
Step 3
Train the Model
We feed five years of historical games to XGBoost, a gradient boosting algorithm. It studies 147,000 matchups and learns patterns: which combinations of features actually predict wins. Optuna runs 200 hyperparameter trials and keeps the best-scoring settings before we finalize the model.
🧪
Step 4
Test It Rigorously
Before trusting the model, we run it against games it never saw in training — always testing on future data, never peeking at answers first. We measure AUC-ROC (ranking quality), Brier Score (confidence accuracy), and calibration (does "70% confident" actually win 70% of the time?). If it doesn't pass, it doesn't deploy.
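The three checks named above can be computed with scikit-learn. The sketch below is illustrative only: the `evaluate_holdout` helper, the bin count, and the synthetic data are assumptions, not the project's actual evaluation code or pass/fail thresholds.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss, roc_auc_score


def evaluate_holdout(y_true: np.ndarray, y_prob: np.ndarray, n_bins: int = 5) -> dict:
    """Score predicted win probabilities against actual outcomes (hypothetical helper)."""
    # Observed win rate vs. mean predicted probability in each non-empty bin
    frac_wins, mean_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)
    return {
        "auc": float(roc_auc_score(y_true, y_prob)),        # ranking quality
        "brier": float(brier_score_loss(y_true, y_prob)),   # mean squared probability error
        "max_calibration_gap": float(np.max(np.abs(frac_wins - mean_pred))),
    }


# Synthetic stand-in for a held-out season: outcomes are drawn from the stated
# probabilities, so the predictions are calibrated by construction.
rng = np.random.default_rng(0)
y_prob = rng.uniform(0.3, 0.7, size=20_000)
y_true = (rng.uniform(size=y_prob.size) < y_prob).astype(int)
metrics = evaluate_holdout(y_true, y_prob)
print(metrics)
```

A Brier score of 0.25 is what a constant 50% guess earns, so a useful model has to land below that. A small calibration gap means a "70% confident" pick really does win about 70% of the time.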
🏟️
Step 5
Make Predictions
Today's games come in, we run live features through XGBoost, then blend it with Bill James' legendary Log5 formula — a statistical method proven over 40 years of baseball research. The result: "Yankees 57.2%, Red Sox 42.8%." Served in 3.2 milliseconds via Kubernetes.
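For readers who want the formula behind that step: Log5 estimates the chance that team A beats team B from each team's winning percentage, p = (pA - pA*pB) / (pA + pB - 2*pA*pB). The sketch below pairs it with a simple weighted average. The `blend` function and its 0.7 weight are assumptions for illustration; the source does not say how the two probabilities are combined.

```python
def log5(p_a: float, p_b: float) -> float:
    """Bill James' Log5: probability that team A beats team B, given each
    team's winning percentage. Undefined when both inputs are 0 or both are 1."""
    return (p_a - p_a * p_b) / (p_a + p_b - 2 * p_a * p_b)


def blend(p_model: float, p_log5: float, model_weight: float = 0.7) -> float:
    """Weighted average of the two estimates (the weight is an illustrative assumption)."""
    return model_weight * p_model + (1 - model_weight) * p_log5


p_log5 = log5(0.600, 0.500)                   # a .600 team facing a .500 team
p_home = blend(p_model=0.58, p_log5=p_log5)   # 0.58 is a made-up model output
print(f"Home {p_home:.1%}, Away {1 - p_home:.1%}")  # Home 58.6%, Away 41.4%
```

Log5 is symmetric: swapping the teams gives the complementary probability, so the two sides always sum to 100%.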
📡
Step 6
Watch for Changes
Baseball changes all season — trades, injuries, players going cold. We monitor Population Stability Index (PSI) of input features daily. When PSI exceeds 0.2, the system auto-triggers a retrain. At the trade deadline, a retrain always fires regardless of PSI. The model never goes stale.
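To make the PSI rule concrete, here is a minimal NumPy sketch for a single feature. It cuts bins at the reference window's deciles and sums (current% - reference%) * ln(current% / reference%) across bins. The function names, the bin count, and the empty-bin floor are assumptions; the production pipeline relies on Evidently rather than this code.

```python
import numpy as np


def population_stability_index(reference, current, n_bins: int = 10, eps: float = 1e-6) -> float:
    """PSI of one feature, with bins cut at quantiles of the reference sample."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior cut points only, so values beyond the reference range fall in the end bins
    cuts = np.quantile(reference, np.linspace(0, 1, n_bins + 1)[1:-1])
    ref_counts = np.bincount(np.searchsorted(cuts, reference, side="right"), minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(cuts, current, side="right"), minlength=n_bins)
    # Floor empty bins so the logarithm stays finite
    ref_pct = np.clip(ref_counts / reference.size, eps, None)
    cur_pct = np.clip(cur_counts / current.size, eps, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))


def should_retrain(psi_by_feature: dict, threshold: float = 0.2, trade_deadline: bool = False) -> bool:
    """Retrain if any feature drifts past the threshold, or always at the trade deadline."""
    return trade_deadline or any(psi > threshold for psi in psi_by_feature.values())


rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 5000)                                   # reference window
stable = population_stability_index(baseline, rng.normal(0.0, 1.0, 5000))
shifted = population_stability_index(baseline, rng.normal(1.0, 1.0, 5000))
print(f"stable PSI={stable:.3f}, shifted PSI={shifted:.3f}")
```

In this synthetic run the unchanged feature scores close to zero, while a one-standard-deviation shift lands well above the 0.2 trigger.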
Production Code
Implementation Details
The real engineering behind the predictions.
Sabermetric Feature Engineering
import pandas as pd
import numpy as np
from feast import FeatureStore


def compute_sabermetrics(df: pd.DataFrame) -> pd.DataFrame:
    # Weighted On-Base Average (wOBA), 2025 weights.
    # Simplified: "bb" is assumed to exclude intentional walks.
    plate_appearances = (df["ab"] + df["bb"] + df["sf"] + df["hbp"]).replace(0, np.nan)
    df["woba"] = (
        0.69 * df["bb"] + 0.72 * df["hbp"] +
        0.89 * df["singles"] + 1.27 * df["doubles"] +
        1.62 * df["triples"] + 2.10 * df["hr"]
    ) / plate_appearances

    # ERA+ (pitcher quality relative to league; 100 is average).
    # No park term is applied here, so lg_era is assumed to be park-adjusted upstream.
    df["era_plus"] = 100 * (df["lg_era"] / df["era"].replace(0, np.nan))

    # Bullpen fatigue: mean pitches over each team's last 5 rows
    # (assumes one row per team per day, sorted by date)
    df["bullpen_fatigue"] = (
        df.groupby("team_id")["pitches"]
        .transform(lambda x: x.rolling(5, min_periods=1).mean())
    )

    # Park factor (run environment adjustment); 1.0 means a neutral park
    df["park_factor"] = df["home_park_factor"].fillna(1.0)

    # Left/Right platoon advantage: 1.0 when starter hand and lineup side differ
    df["platoon_adv"] = (df["starter_hand"] != df["opp_lineup_side"]).astype(float)
    return df


# Serve the same feature definitions online to limit training-serving skew
def get_online_features(team_id: int, store: FeatureStore) -> dict:
    return store.get_online_features(
        features=["mlb_teams:woba", "mlb_teams:era_plus",
                  "mlb_teams:bullpen_fatigue", "mlb_teams:park_factor"],
        entity_rows=[{"team_id": team_id}]
    ).to_dict()
XGBoost Training with Time-Based CV
import numpy as np
import pandas as pd
import xgboost as xgb
import optuna
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import brier_score_loss, roc_auc_score
import mlflow
import mlflow.xgboost


def train_with_time_cv(X: pd.DataFrame, y: pd.Series) -> xgb.XGBClassifier:
    # Rows must already be sorted by game date: TimeSeriesSplit splits by row order
    tscv = TimeSeriesSplit(n_splits=5)  # chronological, no future leakage

    def objective(trial):
        # Trial names match XGBClassifier arguments so best_params can be reused directly
        params = {
            "n_estimators": trial.suggest_int("n_estimators", 100, 500),
            "max_depth": trial.suggest_int("max_depth", 3, 8),
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
            "subsample": trial.suggest_float("subsample", 0.6, 1.0),
            "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
        }
        scores = []
        for train_idx, val_idx in tscv.split(X):
            model = xgb.XGBClassifier(**params, eval_metric="logloss")
            model.fit(X.iloc[train_idx], y.iloc[train_idx], verbose=False)
            preds = model.predict_proba(X.iloc[val_idx])[:, 1]
            scores.append(roc_auc_score(y.iloc[val_idx], preds))
        return float(np.mean(scores))

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=200, show_progress_bar=True)

    # Refit on the full history with the best-scoring settings and log the run
    with mlflow.start_run(run_name="xgb_final"):
        best = xgb.XGBClassifier(**study.best_params, eval_metric="logloss")
        best.fit(X, y)
        mlflow.log_params(study.best_params)
        mlflow.log_metric("best_cv_auc", study.best_value)
        mlflow.xgboost.log_model(best, "model")
    return best
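
The "no future leakage" guarantee in the training code depends on row order, because TimeSeriesSplit knows nothing about dates. The toy example below uses a hypothetical 12-game index already sorted by date and shows that every validation fold sits strictly after its training fold.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

game_order = np.arange(12)  # stand-in for 12 games already sorted by date
folds = list(TimeSeriesSplit(n_splits=3).split(game_order))
for train_idx, val_idx in folds:
    print(f"train {train_idx.min()}-{train_idx.max()} -> validate {val_idx.min()}-{val_idx.max()}")
# train 0-2 -> validate 3-5
# train 0-5 -> validate 6-8
# train 0-8 -> validate 9-11
```

If the rows were shuffled before the split, later games could leak into training, so the feature table should be sorted by game date before calling train_with_time_cv.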
Season Drift Detection with Evidently
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
import pandas as pd
import mlflow


def check_and_retrain_if_drifted(
    reference: pd.DataFrame,
    current: pd.DataFrame,
    psi_threshold: float = 0.20
) -> dict:
    # Score every feature with PSI; a feature counts as drifted at or above psi_threshold
    report = Report(metrics=[
        DataDriftPreset(stattest="psi", stattest_threshold=psi_threshold)
    ])
    report.run(reference_data=reference, current_data=current)
    result = report.as_dict()
    # Key names follow the Evidently Report API used by these imports (0.4.x)
    summary = result["metrics"][0]["result"]                        # dataset-level summary
    by_column = result["metrics"][1]["result"]["drift_by_columns"]  # per-feature PSI
    drift_info = {
        "dataset_drift": summary["dataset_drift"],
        "n_drifted_features": summary["number_of_drifted_columns"],
        "share_drifted": summary["share_of_drifted_columns"],
        "max_psi": max(col["drift_score"] for col in by_column.values()),
    }
    with mlflow.start_run(run_name="drift_monitor"):
        for k, v in drift_info.items():
            mlflow.log_metric(k, float(v))
    if drift_info["max_psi"] > psi_threshold:
        # Auto-trigger retrain via Azure ML pipeline
        # (trigger_azure_ml_retrain is a project helper defined elsewhere)
        trigger_azure_ml_retrain(
            reason="feature_drift_detected",
            psi=drift_info["max_psi"]
        )
    return drift_info
Python 3.11
XGBoost 2.0
Optuna
Feast
Evidently AI
MLflow
Delta Lake
Azure Blob
Azure ML
Kubernetes
SHAP
scikit-learn
Conformal Prediction
MLB Stats API
Innovation Spotlight: The win probability output blends XGBoost's learned probability
with Bill James' Log5 formula — a sabermetric method with 40+ years of validation. The ensemble adds
conformal prediction intervals so every prediction comes with a calibrated confidence band, not just
a point estimate.