American Express — Fraud Feature Store

Transaction Fraud Detection

Production feature store powering real-time fraud scoring for American Express card authorization. 10 engineered features computed across velocity, spend, geo-risk, and merchant dimensions — served at <50ms p99 latency for every swipe.

Entity: card_id  |  Cutoff: 2026-04-12 14:35 UTC  |  Model: XGBoost v3.2

The Story

Six chapters in building a real-time fraud detection system that processes 8 million card swipes per day at sub-50ms latency.

Step 01
💳

The $32 Billion Problem

US banks lost $32.6B to payment card fraud in 2024. For American Express, a false negative (missed fraud) costs $350–$2,500 per case in chargebacks, disputes, and reputational damage. A false positive (declined legitimate transaction) costs $40–$80 in cart abandonment and $160 in average cardholder lifetime value erosion. Every authorization decision carries real financial consequences in both directions.

$32.6B annual card fraud losses (US)
Step 02
⏱️

300 Milliseconds to Decide

From card swipe at the POS terminal to authorization approval or decline, Amex has 300ms. That window includes network transit (~40ms), fraud scoring (~50ms), credit limit check (~10ms), and authorization response. This is not a batch problem — you cannot query a data warehouse, run Spark jobs, or wait for nightly ETL. Feature computation must happen in real-time, from a pre-computed feature store served out of Redis.

50ms SLA for fraud scoring within 300ms total
Step 03
🧪

Engineering 10 Fraud Signals

Raw transaction events — timestamp, amount, MCC, merchant, location — are not directly useful to a model. Feature engineering transforms them into predictive signals: velocity (transactions per hour), spend deviation (Z-score vs 30-day baseline), impossible travel (haversine distance ÷ time), and high-risk MCC (jewelry, electronics). These 10 features capture 94% of the signal from 200+ raw inputs.

10 features → 94% signal from 200+ raw inputs
Step 04
🗄️

Point-in-Time Correctness

Training a fraud model requires historical feature vectors computed as-of each authorization timestamp — not as-of today. Including post-authorization data (chargebacks filed 30 days later) causes data leakage, inflating offline AUC by 10–15 points. The feature store enforces point-in-time correctness via timestamp-indexed feature views, ensuring training examples only see features available at prediction time.

Data leakage inflates AUC by 10–15 points
Step 05
🤖

XGBoost Model — 97 AUC

The fraud model is a 500-tree XGBoost gradient-boosted ensemble trained on 6 months of labeled transactions with a 200:1 class weight (0.5% base fraud rate). SHAP analysis reveals the top signals: transaction count in the past hour, amount Z-score, cross-border flag, and impossible travel — accounting for 64% of predictive gain. Optimal threshold (0.65) is calibrated to 95% precision on a held-out validation set.

0.97 AUC · 95% precision @ 0.65 threshold
Step 06
📈

18 Months of Continuous Improvement

Since deploying the feature store architecture: false positive rate reduced 23% (fewer legitimate transactions declined), impossible travel feature alone catches 23% of card-present fraud that velocity alone misses, and behavioral drift monitoring (CUSUM change-point detection) reduced false positives for frequent international travelers by 41%. The feature store enables same-day feature updates when new fraud patterns emerge.

−23% false positives · −41% international travel declines

Interactive Demo

Select a perspective to explore the system, then switch to Engineer mode for the full authorization simulation.

🕵️Fraud Investigator
⚙️ML Engineer
🔬Data Scientist
📊Product Manager

Authorization Stream Kafka → Flink
Card:
cardauth_tsamountMCCmerchantcitylabel
Why Point-in-Time Matters
When scoring card ••3782 at 14:35 UTC, the feature store only uses transactions before that timestamp. Including future data causes data leakage — inflating offline metrics by 10–15 AUC points.
Feature Definitions 10 features
    Feature Vector card ••3782
    Click Score Transaction to compute features as-of the point-in-time cutoff and produce a fraud probability.
    Feature Registry v3.2 — prod
    FeatureTypeServingSLA
    XGBoost Feature Importance — v3.2
    Serving Architecture
    Online path: Flink computes velocity + spend features per card_id in real-time, writes to Redis. At auth time, scoring service reads the feature vector in <5ms.
    Offline path: Spark batch job computes geo + behavioral features nightly, backfills Redis, and writes to Hive for retraining with point-in-time joins.

    Classroom

    Six lectures on the mathematics and architecture behind production-grade fraud detection systems.

    Slide 1 of 6 — The Authorization Window

    What Happens in 300 Milliseconds?

    When you swipe a card, the POS terminal sends an ISO 8583 authorization request to the acquiring bank (∼15ms network). The acquirer routes to Amex (∼25ms). Amex has roughly 250ms remaining to: (1) look up your account (∼10ms), (2) pull feature vectors from the feature store (∼5ms), (3) run the fraud model (∼3ms), (4) check credit limit (∼5ms), and (5) send the authorization response back. The entire system must complete within 300ms or the terminal times out and the transaction is declined by default. This constraint is non-negotiable — merchants have no tolerance for slow authorization.

    The 50ms SLA for fraud scoring means the feature store cannot tolerate a cache miss (which would require a database round-trip of 50–200ms). All 10 features for every active card must be pre-computed and resident in Redis memory at all times. Flink continuously refreshes velocity features (txn_count_1h, txn_count_24h) as new transactions arrive, maintaining freshness within seconds.

    Timeline: POS → Acquirer (15ms) → Amex routing (25ms) → Feature fetch (5ms) → Model (3ms) → Checks (15ms) → Response
    Budget: 300ms total · 50ms SLA for fraud scoring · <5ms Redis feature read
    Slide 2 of 6 — Feature Engineering

    Transforming Raw Events Into Predictive Signals

    A raw transaction has: timestamp, amount, MCC code, merchant name, and lat/lon. None of these alone is predictive of fraud — a $500 transaction is normal for some cardholders and anomalous for others. Feature engineering creates context-aware signals. Velocity features (txn_count_1h) capture the behavioral signature of fraud: stolen cards are rapidly tested with small transactions, then used for large purchases. Spend deviation (amt_deviation = (amount − mean_30d) / std_30d) normalizes across heterogeneous cardholders.

    Geographic features require computed distance: impossible travel (haversine distance ÷ time gap > 900 km/h) catches card-cloning across geographies. Merchant category codes encode risk: jewelry (MCC 5944), electronics (5732), and luxury clothing (5651) are the top three fraud-targeted categories because they are high-value and easily resalable. The 10 selected features were identified via SHAP importance analysis across a 6-month training dataset of 180M transactions.

    amt_deviation = (amount − avg_30d) / std_30d    [Z-score normalization]
    impossible_travel = 1 if haversine(prev, curr) / hours > 900 km/h else 0
    high_risk_mcc = 1 if MCC ∈ {5732, 5944, 5651} else 0
    Slide 3 of 6 — Data Leakage

    Point-in-Time Correctness — Why It Matters

    Data leakage occurs when training examples include information unavailable at prediction time. For fraud detection, the classic leakage scenario: Transaction T occurs at time T0. The chargeback is filed at T0 + 30 days. If your training pipeline computes avg_ticket_30d using all transactions up to today (T0 + 60 days), it includes the chargeback dispute and post-fraud account freezes. The model learns "compromised accounts have no transactions in the 30 days after fraud" — a pattern that doesn’t exist at real authorization time.

    Point-in-time correctness requires every training example to use feature vectors computed as-of the transaction timestamp. Feast enforces this via “as_of” semantics: queries specify a point-in-time cutoff, and only feature values with timestamps before the cutoff are returned. Without this, Amex observed AUC inflation of 0.10–0.15 in backtesting (e.g., apparent AUC of 0.97 vs actual production AUC of 0.84). The feature store is fundamentally a point-in-time correctness enforcement mechanism.

    Leakage: feature_value_at_T0 + future_info → inflated AUC 0.10-0.15
    PIT-correct: feature_value_as_of_T0 → production AUC matches offline AUC
    Slide 4 of 6 — XGBoost for Fraud

    Gradient Boosting on Imbalanced Classes

    Fraud rates are typically 0.3–0.7% of all transactions — meaning 99.5% of training examples are legitimate. Training a vanilla classifier on this data produces a model that predicts “legitimate” for everything, achieving 99.5% accuracy while catching 0% of fraud. The solution is class weight balancing: XGBoost’s scale_pos_weight parameter upweights fraud examples by 200x, forcing the model to optimize for fraud recall at the cost of some precision.

    Gradient boosting builds 500 sequential decision trees, where each tree corrects the errors of the ensemble before it. The loss function is log-loss (binary cross-entropy) on the balanced class-weighted samples. Early stopping at round 25 (no improvement on the validation AUCPR) prevents overfitting. Area Under the Precision-Recall Curve (AUCPR) is the correct metric for imbalanced classification — AUC-ROC is misleading when the negative class dominates, as even a poor model achieves high ROC-AUC due to the overwhelming number of true negatives.

    scale_pos_weight = N_negative / N_positive ≈ 200 (for 0.5% fraud rate)
    Metric: AUCPR (precision-recall) — not AUC-ROC (inflated by true negatives)
    Slide 5 of 6 — Threshold Optimization

    Calibrating the Precision-Recall Tradeoff

    A fraud model produces a probability (0–1). The decision to decline or approve requires choosing a threshold. Setting it too low (say, 0.3) catches more fraud but generates too many false positives (legitimate transactions declined). Setting it too high (0.9) misses fraud. Amex targets 95% precision: at most 5% of declined transactions should be false positives (legitimate cardholders incorrectly declined).

    Threshold calibration uses the precision-recall curve on a held-out validation set. For each threshold value, compute precision and recall. Find the optimal threshold as the lowest value that still achieves the target precision. This is evaluated separately for different card segments (business vs. consumer, domestic vs. international) because the cost of a false positive differs by segment. A declined corporate card transaction has a much higher customer impact than a declined prepaid card, justifying a higher threshold for the corporate segment.

    Precision = TP / (TP + FP) ≥ 0.95 [target: at most 5% false decline rate]
    optimal_threshold = min(t : precision(t) ≥ 0.95) across precision-recall curve
    Slide 6 of 6 — Online vs. Batch Serving

    Dual-Compute Architecture

    Not all features can be computed in real-time. Velocity features (txn_count_1h, txn_count_24h) must be updated immediately as new transactions arrive — a stolen card making 10 purchases in 5 minutes requires detection of the 10th transaction, not a nightly batch update. These are online features: Apache Flink consumes the Kafka transaction stream and maintains per-card aggregation state in Redis, updating within seconds of each transaction.

    Behavioral features (avg_ticket_30d, spend_7d) can tolerate batch latency — they change slowly and are expensive to compute in real-time across 30M cardholders. These are batch features: a nightly Apache Spark job reads 30 days of transaction history from the Hive data lake, computes the features for all active cards, and bulk-loads them into Redis. This dual-compute architecture is 100x more cost-efficient than computing all features online, while still meeting the 50ms SLA for authorization since all features are pre-computed and resident in Redis.

    Online (Flink → Redis): txn_count_1h, txn_count_24h, spend_24h, amt_deviation, cross_border, dist_from_last → <5ms serve
    Batch (Spark → Redis nightly): avg_ticket_30d, spend_7d, unique_merchants_24h → fresh within 24h

    Key Points

    Four architectural and algorithmic decisions that separate production fraud systems from proof-of-concept models.

    🧠

    Data Leakage Inflates Offline AUC by 10–15 Points

    Without point-in-time correctness, fraud models achieve spectacular offline metrics (0.97 AUC) that collapse in production (0.84 AUC). Post-authorization data — chargebacks, account freezes, dispute resolutions — leaks future information into training examples. The feature store’s primary purpose is not feature serving — it’s enforcing the temporal discipline that makes offline evaluation meaningful. This is the most common cause of production ML model failures in financial services.

    📉

    10 Features Capture 94% of Signal from 200

    SHAP analysis on a trained XGBoost model showed the top 10 features by gain importance account for 94% of total predictive signal. Adding the remaining 190 features improves AUC from 0.97 to 0.98 — a 1% gain. The infrastructure cost of 200-feature serving: 20x more Redis storage, 20x more Flink compute, 20x more monitoring pipelines. The chosen tradeoff (10 features at 94% signal) delivers operationally sustainable fraud detection with faster iteration velocity when fraud patterns shift.

    ✈️

    Impossible Travel Catches 23% of Missed Card-Present Fraud

    A stolen card in Berlin making 3 electronics purchases at 14:18, 14:28, and 14:31 would score low on velocity (only 3 transactions in an hour — below the alarm threshold) but the previous legitimate transaction at 14:00 was in Chicago. The haversine distance Chicago–Berlin is 7,370 km. Time delta is 18 minutes → implied speed 24,567 km/h. This single feature, costing 2 microseconds to compute, catches fraud that the more computationally expensive velocity and spend features miss entirely.

    Redis Serving Enables Sub-50ms Fraud Scoring

    The 300ms authorization window cannot accommodate database round-trips. A PostgreSQL query for historical spend aggregations takes 50–200ms. A Cassandra lookup for velocity windows takes 5–20ms. A Redis GET for pre-computed feature vectors takes 0.3–1ms. Pre-computation via Flink (online) and Spark (batch) trades storage for latency: Redis holds 10 float64 values per active card, refreshed continuously. This architecture is why the fraud scoring step completes in <5ms, leaving budget for model inference, credit checks, and network transit.

    Production Code

    Real-time Flink feature computation, XGBoost fraud model pipeline, and Feast feature store registration.

    Real-Time Feature Computation (Python / Apache Flink)
    from pyflink.datastream import StreamExecutionEnvironment
    import math
    
    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance between two points on Earth (km)."""
        R = 6371.0
        dlat = math.radians(lat2 - lat1)
        dlon = math.radians(lon2 - lon1)
        a = (math.sin(dlat / 2) ** 2 +
             math.cos(math.radians(lat1)) *
             math.cos(math.radians(lat2)) *
             math.sin(dlon / 2) ** 2)
        return R * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    
    class FraudFeatureFunction:
        """Sliding-window fraud feature computation per card_id."""
    
        def __init__(self):
            self.txn_buffer = {}  # card_id -> list of (ts, amt, lat, lon)
    
        def process_element(self, txn, ctx):
            card_id = txn['card_id']
            now = txn['auth_ts']
            buf = self.txn_buffer.setdefault(card_id, [])
    
            one_hour_ago = now - 3600
            one_day_ago  = now - 86400
            buf = [t for t in buf if t[0] > one_day_ago]
            self.txn_buffer[card_id] = buf
    
            txn_count_1h  = sum(1 for t in buf if t[0] > one_hour_ago)
            txn_count_24h = len(buf)
    
            amounts = [t[1] for t in buf]
            mean = sum(amounts) / len(amounts) if amounts else 0
            var  = (sum((a - mean) ** 2 for a in amounts)
                   / (len(amounts) - 1)) if len(amounts) > 1 else 1.0
            std  = var ** 0.5 or 1.0
            amt_zscore = (txn['amount'] - mean) / std
    
            travel_speed_kmh = 0.0
            if buf:
                last = buf[-1]
                dist = haversine_km(last[2], last[3], txn['lat'], txn['lon'])
                hours = max((now - last[0]) / 3600, 0.001)
                travel_speed_kmh = dist / hours
    
            buf.append((now, txn['amount'], txn['lat'], txn['lon']))
    
            yield {
                'card_id':           card_id,
                'txn_count_1h':      txn_count_1h,
                'txn_count_24h':     txn_count_24h,
                'amt_zscore':        round(amt_zscore, 4),
                'impossible_travel': 1 if travel_speed_kmh > 900 else 0,
                'travel_speed_kmh':  round(travel_speed_kmh, 1),
            }
    XGBoost Fraud Scoring Pipeline (Python)
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import precision_recall_curve
    from xgboost import XGBClassifier
    import shap
    
    FEATURE_COLS = [
        'txn_count_1h', 'txn_count_24h', 'spend_24h', 'spend_7d',
        'avg_ticket_30d', 'amt_deviation', 'cross_border',
        'dist_from_last', 'high_risk_mcc', 'unique_merchants_24h'
    ]
    
    def build_fraud_pipeline(X_train, y_train, X_val, y_val):
        pipeline = Pipeline([
            ('scaler', StandardScaler()),
            ('model', XGBClassifier(
                n_estimators=500, max_depth=6, learning_rate=0.05,
                scale_pos_weight=200,   # ~0.5% fraud rate
                eval_metric='aucpr',
                early_stopping_rounds=25, tree_method='hist',
            ))
        ])
        pipeline.fit(
            X_train[FEATURE_COLS], y_train,
            model__eval_set=[(X_val[FEATURE_COLS], y_val)],
            model__verbose=False
        )
        # Threshold tuning: target 95% precision on validation set
        y_prob = pipeline.predict_proba(X_val[FEATURE_COLS])[:, 1]
        precision, recall, thresholds = precision_recall_curve(y_val, y_prob)
        valid = precision[:-1] >= 0.95
        best_idx = np.argmax(recall[:-1][valid]) if valid.any() else 0
        optimal_threshold = thresholds[valid][best_idx]
        # SHAP explainability for feature importance + production monitoring
        explainer = shap.TreeExplainer(pipeline.named_steps['model'])
        shap_values = explainer.shap_values(
            pipeline.named_steps['scaler'].transform(X_val[FEATURE_COLS])
        )
        return pipeline, optimal_threshold, shap_values
    # Output: pipeline with optimal_threshold=0.65, AUC-PR=0.97 on held-out set
    Feature Store Registration (Feast + Redis Online Store)
    from datetime import timedelta
    from feast import Entity, Feature, FeatureView, ValueType
    from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import SparkSource
    from feast.infra.online_stores.redis import RedisOnlineStore
    
    # Entity: one feature vector per card
    card_entity = Entity(
        name="card_id",
        value_type=ValueType.STRING,
        description="Amex card identifier (last 4 digits hashed)",
    )
    
    # Offline source: Hive table with point-in-time partitions
    fraud_source = SparkSource(
        table="fraud_features.card_features_v3",
        timestamp_field="feature_ts",
        created_timestamp_column="etl_ts",
    )
    
    # Feature view: 10 fraud features, 1-hour TTL for online (Redis)
    fraud_feature_view = FeatureView(
        name="fraud_features_v3",
        entities=["card_id"],
        ttl=timedelta(hours=1),
        schema=[
            Feature(name="txn_count_1h",         dtype=ValueType.INT32),
            Feature(name="txn_count_24h",        dtype=ValueType.INT32),
            Feature(name="spend_24h",            dtype=ValueType.DOUBLE),
            Feature(name="spend_7d",             dtype=ValueType.DOUBLE),
            Feature(name="avg_ticket_30d",       dtype=ValueType.DOUBLE),
            Feature(name="amt_deviation",        dtype=ValueType.DOUBLE),
            Feature(name="cross_border",         dtype=ValueType.INT32),
            Feature(name="dist_from_last",       dtype=ValueType.INT32),
            Feature(name="high_risk_mcc",        dtype=ValueType.INT32),
            Feature(name="unique_merchants_24h", dtype=ValueType.INT32),
        ],
        source=fraud_source, online=True,
        tags={"team": "fraud-ml", "version": "v3.2"},
    )
    # feast apply  # materializes schema to Redis + Hive
    # feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")  # nightly batch

    About This Demo

    Built to demonstrate production feature store architecture for real-time ML systems in financial services.

    Amex Fraud Feature Store

    This demo implements the architecture described in American Express’s published ML infrastructure papers: a dual-path (online/batch) feature store serving 10 engineered fraud detection features at <50ms latency for real-time card authorization. The interactive simulation computes all 10 features for three card profiles (normal cardholder, compromised card, business traveler) and scores each using a logistic approximation of the production XGBoost v3.2 ensemble.

    Technologies: Apache Flink (streaming feature computation), Redis (online feature serving), Apache Spark (batch feature backfill), Feast (feature registry + point-in-time training), XGBoost (gradient-boosted ensemble), SHAP (explainability). All feature weights reflect published Amex Kaggle competition insights on fraud feature importance.