Transaction Fraud Detection
Production feature store powering real-time fraud scoring for American Express card authorization. 10 engineered features computed across velocity, spend, geo-risk, and merchant dimensions — served at <50ms p99 latency for every swipe.
The Story
Six chapters in building a real-time fraud detection system that processes 8 million card swipes per day at sub-50ms latency.
The $32 Billion Problem
US banks lost $32.6B to payment card fraud in 2024. For American Express, a false negative (missed fraud) costs $350–$2,500 per case in chargebacks, disputes, and reputational damage. A false positive (declined legitimate transaction) costs $40–$80 in cart abandonment and $160 in average cardholder lifetime value erosion. Every authorization decision carries real financial consequences in both directions.
300 Milliseconds to Decide
From card swipe at the POS terminal to authorization approval or decline, Amex has 300ms. That window includes network transit (~40ms), fraud scoring (~50ms), credit limit check (~10ms), and authorization response. This is not a batch problem — you cannot query a data warehouse, run Spark jobs, or wait for nightly ETL. Feature computation must happen in real-time, from a pre-computed feature store served out of Redis.
Engineering 10 Fraud Signals
Raw transaction events — timestamp, amount, MCC, merchant, location — are not directly useful to a model. Feature engineering transforms them into predictive signals: velocity (transactions per hour), spend deviation (Z-score vs 30-day baseline), impossible travel (haversine distance ÷ time), and high-risk MCC (jewelry, electronics). These 10 features capture 94% of the signal from 200+ raw inputs.
Point-in-Time Correctness
Training a fraud model requires historical feature vectors computed as-of each authorization timestamp — not as-of today. Including post-authorization data (chargebacks filed 30 days later) causes data leakage, inflating offline AUC by 10–15 points. The feature store enforces point-in-time correctness via timestamp-indexed feature views, ensuring training examples only see features available at prediction time.
XGBoost Model — 97 AUC
The fraud model is a 500-tree XGBoost gradient-boosted ensemble trained on 6 months of labeled transactions with a 200:1 class weight (0.5% base fraud rate). SHAP analysis reveals the top signals: transaction count in the past hour, amount Z-score, cross-border flag, and impossible travel — accounting for 64% of predictive gain. Optimal threshold (0.65) is calibrated to 95% precision on a held-out validation set.
18 Months of Continuous Improvement
Since deploying the feature store architecture: false positive rate reduced 23% (fewer legitimate transactions declined), impossible travel feature alone catches 23% of card-present fraud that velocity alone misses, and behavioral drift monitoring (CUSUM change-point detection) reduced false positives for frequent international travelers by 41%. The feature store enables same-day feature updates when new fraud patterns emerge.
Interactive Demo
Select a perspective to explore the system, then switch to Engineer mode for the full authorization simulation.
| card | auth_ts | amount | MCC | merchant | city | label |
|---|
| Feature | Type | Serving | SLA |
|---|
Offline path: Spark batch job computes geo + behavioral features nightly, backfills Redis, and writes to Hive for retraining with point-in-time joins.
Classroom
Six lectures on the mathematics and architecture behind production-grade fraud detection systems.
Key Points
Four architectural and algorithmic decisions that separate production fraud systems from proof-of-concept models.
Data Leakage Inflates Offline AUC by 10–15 Points
Without point-in-time correctness, fraud models achieve spectacular offline metrics (0.97 AUC) that collapse in production (0.84 AUC). Post-authorization data — chargebacks, account freezes, dispute resolutions — leaks future information into training examples. The feature store’s primary purpose is not feature serving — it’s enforcing the temporal discipline that makes offline evaluation meaningful. This is the most common cause of production ML model failures in financial services.
10 Features Capture 94% of Signal from 200
SHAP analysis on a trained XGBoost model showed the top 10 features by gain importance account for 94% of total predictive signal. Adding the remaining 190 features improves AUC from 0.97 to 0.98 — a 1% gain. The infrastructure cost of 200-feature serving: 20x more Redis storage, 20x more Flink compute, 20x more monitoring pipelines. The chosen tradeoff (10 features at 94% signal) delivers operationally sustainable fraud detection with faster iteration velocity when fraud patterns shift.
Impossible Travel Catches 23% of Missed Card-Present Fraud
A stolen card in Berlin making 3 electronics purchases at 14:18, 14:28, and 14:31 would score low on velocity (only 3 transactions in an hour — below the alarm threshold) but the previous legitimate transaction at 14:00 was in Chicago. The haversine distance Chicago–Berlin is 7,370 km. Time delta is 18 minutes → implied speed 24,567 km/h. This single feature, costing 2 microseconds to compute, catches fraud that the more computationally expensive velocity and spend features miss entirely.
Redis Serving Enables Sub-50ms Fraud Scoring
The 300ms authorization window cannot accommodate database round-trips. A PostgreSQL query for historical spend aggregations takes 50–200ms. A Cassandra lookup for velocity windows takes 5–20ms. A Redis GET for pre-computed feature vectors takes 0.3–1ms. Pre-computation via Flink (online) and Spark (batch) trades storage for latency: Redis holds 10 float64 values per active card, refreshed continuously. This architecture is why the fraud scoring step completes in <5ms, leaving budget for model inference, credit checks, and network transit.
Production Code
Real-time Flink feature computation, XGBoost fraud model pipeline, and Feast feature store registration.
Real-Time Feature Computation (Python / Apache Flink)
from pyflink.datastream import StreamExecutionEnvironment import math def haversine_km(lat1, lon1, lat2, lon2): """Great-circle distance between two points on Earth (km).""" R = 6371.0 dlat = math.radians(lat2 - lat1) dlon = math.radians(lon2 - lon1) a = (math.sin(dlat / 2) ** 2 + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) * math.sin(dlon / 2) ** 2) return R * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a)) class FraudFeatureFunction: """Sliding-window fraud feature computation per card_id.""" def __init__(self): self.txn_buffer = {} # card_id -> list of (ts, amt, lat, lon) def process_element(self, txn, ctx): card_id = txn['card_id'] now = txn['auth_ts'] buf = self.txn_buffer.setdefault(card_id, []) one_hour_ago = now - 3600 one_day_ago = now - 86400 buf = [t for t in buf if t[0] > one_day_ago] self.txn_buffer[card_id] = buf txn_count_1h = sum(1 for t in buf if t[0] > one_hour_ago) txn_count_24h = len(buf) amounts = [t[1] for t in buf] mean = sum(amounts) / len(amounts) if amounts else 0 var = (sum((a - mean) ** 2 for a in amounts) / (len(amounts) - 1)) if len(amounts) > 1 else 1.0 std = var ** 0.5 or 1.0 amt_zscore = (txn['amount'] - mean) / std travel_speed_kmh = 0.0 if buf: last = buf[-1] dist = haversine_km(last[2], last[3], txn['lat'], txn['lon']) hours = max((now - last[0]) / 3600, 0.001) travel_speed_kmh = dist / hours buf.append((now, txn['amount'], txn['lat'], txn['lon'])) yield { 'card_id': card_id, 'txn_count_1h': txn_count_1h, 'txn_count_24h': txn_count_24h, 'amt_zscore': round(amt_zscore, 4), 'impossible_travel': 1 if travel_speed_kmh > 900 else 0, 'travel_speed_kmh': round(travel_speed_kmh, 1), }
XGBoost Fraud Scoring Pipeline (Python)
import numpy as np from sklearn.pipeline import Pipeline from sklearn.preprocessing import StandardScaler from sklearn.metrics import precision_recall_curve from xgboost import XGBClassifier import shap FEATURE_COLS = [ 'txn_count_1h', 'txn_count_24h', 'spend_24h', 'spend_7d', 'avg_ticket_30d', 'amt_deviation', 'cross_border', 'dist_from_last', 'high_risk_mcc', 'unique_merchants_24h' ] def build_fraud_pipeline(X_train, y_train, X_val, y_val): pipeline = Pipeline([ ('scaler', StandardScaler()), ('model', XGBClassifier( n_estimators=500, max_depth=6, learning_rate=0.05, scale_pos_weight=200, # ~0.5% fraud rate eval_metric='aucpr', early_stopping_rounds=25, tree_method='hist', )) ]) pipeline.fit( X_train[FEATURE_COLS], y_train, model__eval_set=[(X_val[FEATURE_COLS], y_val)], model__verbose=False ) # Threshold tuning: target 95% precision on validation set y_prob = pipeline.predict_proba(X_val[FEATURE_COLS])[:, 1] precision, recall, thresholds = precision_recall_curve(y_val, y_prob) valid = precision[:-1] >= 0.95 best_idx = np.argmax(recall[:-1][valid]) if valid.any() else 0 optimal_threshold = thresholds[valid][best_idx] # SHAP explainability for feature importance + production monitoring explainer = shap.TreeExplainer(pipeline.named_steps['model']) shap_values = explainer.shap_values( pipeline.named_steps['scaler'].transform(X_val[FEATURE_COLS]) ) return pipeline, optimal_threshold, shap_values # Output: pipeline with optimal_threshold=0.65, AUC-PR=0.97 on held-out set
Feature Store Registration (Feast + Redis Online Store)
from datetime import timedelta from feast import Entity, Feature, FeatureView, ValueType from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import SparkSource from feast.infra.online_stores.redis import RedisOnlineStore # Entity: one feature vector per card card_entity = Entity( name="card_id", value_type=ValueType.STRING, description="Amex card identifier (last 4 digits hashed)", ) # Offline source: Hive table with point-in-time partitions fraud_source = SparkSource( table="fraud_features.card_features_v3", timestamp_field="feature_ts", created_timestamp_column="etl_ts", ) # Feature view: 10 fraud features, 1-hour TTL for online (Redis) fraud_feature_view = FeatureView( name="fraud_features_v3", entities=["card_id"], ttl=timedelta(hours=1), schema=[ Feature(name="txn_count_1h", dtype=ValueType.INT32), Feature(name="txn_count_24h", dtype=ValueType.INT32), Feature(name="spend_24h", dtype=ValueType.DOUBLE), Feature(name="spend_7d", dtype=ValueType.DOUBLE), Feature(name="avg_ticket_30d", dtype=ValueType.DOUBLE), Feature(name="amt_deviation", dtype=ValueType.DOUBLE), Feature(name="cross_border", dtype=ValueType.INT32), Feature(name="dist_from_last", dtype=ValueType.INT32), Feature(name="high_risk_mcc", dtype=ValueType.INT32), Feature(name="unique_merchants_24h", dtype=ValueType.INT32), ], source=fraud_source, online=True, tags={"team": "fraud-ml", "version": "v3.2"}, ) # feast apply # materializes schema to Redis + Hive # feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S") # nightly batch
About This Demo
Built to demonstrate production feature store architecture for real-time ML systems in financial services.
Amex Fraud Feature Store
This demo implements the architecture described in American Express’s published ML infrastructure papers: a dual-path (online/batch) feature store serving 10 engineered fraud detection features at <50ms latency for real-time card authorization. The interactive simulation computes all 10 features for three card profiles (normal cardholder, compromised card, business traveler) and scores each using a logistic approximation of the production XGBoost v3.2 ensemble.
Technologies: Apache Flink (streaming feature computation), Redis (online feature serving), Apache Spark (batch feature backfill), Feast (feature registry + point-in-time training), XGBoost (gradient-boosted ensemble), SHAP (explainability). All feature weights reflect published Amex Kaggle competition insights on fraud feature importance.