Azure Databricks Platform Engineering · Western Alliance Bank

Azure Databricks
Platform Engineering

End-to-end platform engineering across Azure Databricks, Azure DevOps CI/CD, and Entra ID governance — built for the scale, security, and compliance demands of a top-tier financial institution.

Azure DevOps

Casino Gaming App CI/CD

Multi-stage YAML pipeline that builds, tests, and deploys the NeonDeck casino web application — slot machines, poker tables, and live dealer interfaces.

neondeck-casino-app.yml · Azure DevOps · neondeck-gaming
# NeonDeck Casino Gaming App — Azure DevOps Deployment Pipeline
# Build React front-end, run Playwright tests, deploy to Azure App Service

trigger:
  branches:
    include: [main, release/*]
  paths:
    include: [src/**, public/**, playwright/**]

variables:
  - group: neondeck-casino-prod       # Key Vault-linked variable group
  - name: appServiceName
    value: app-neondeck-casino-eastus2

stages:
  - stage: Build
    displayName: Build & Unit Test
    jobs:
      - job: BuildApp
        pool: { vmImage: ubuntu-latest }
        steps:
          - task: NodeTool@0
            inputs: { versionSpec: '20.x' }
          - script: npm ci && npm run build
            displayName: Install & Build
          - script: npm run test:unit -- --coverage
            displayName: Jest Unit Tests
          - task: PublishBuildArtifacts@1
            inputs: { PathtoPublish: dist/, ArtifactName: casino-app }

  - stage: E2E
    displayName: Playwright E2E Tests
    dependsOn: Build
    jobs:
      - job: PlaywrightTests
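        pool: { vmImage: ubuntu-latest }
        steps:
          # A fresh agent has no node_modules, so install dependencies before running Playwright.
          - script: npm ci
            displayName: Install dependencies
          - script: npx playwright install --with-deps
            displayName: Install Playwright browsers
          - script: npx playwright test --project=chromium
            displayName: Casino UI E2E Suite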

  - stage: DeployStaging
    displayName: Deploy → Staging Slot
    dependsOn: E2E
    jobs:
      - deployment: DeploySlot
        environment: staging
        strategy:
          runOnce:
            deploy:
              steps:
                - task: AzureWebApp@1
                  inputs:
                    azureSubscription: neondeck-svc-conn
                    appName: $(appServiceName)
                    deployToSlotOrASE: true
                    slotName: staging
                    package: $(Pipeline.Workspace)/casino-app/**

  - stage: Production
    displayName: Swap → Production
    dependsOn: DeployStaging
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - deployment: SwapSlots
        environment: production   # requires manual approval gate
        strategy:
          runOnce:
            deploy:
              steps:
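                - task: AzureAppServiceManage@0
                  inputs:
                    azureSubscription: neondeck-svc-conn
                    Action: Swap Slots
                    WebAppName: $(appServiceName)
                    ResourceGroupName: $(resourceGroupName)   # assumed variable; not defined in the original listing
                    SourceSlot: staging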
Live Demo

Pipeline Simulator

Click Run to simulate deploying the NeonDeck casino gaming front-end — from PR merge through build, test, and slot swap to production.

NeonDeck · Casino App · Azure DevOps

Stages: Build → Unit Test → E2E Tests → Staging → Approval → Production

Pipeline Context
PR #347 merged
Branch: main
Author: kieth@neondeck.io
Changed: 14 files

Casino App
Component: Slot Machine UI
Framework: React 18 + Vite
Target: App Service

Quality Gates: Coverage, E2E Pass, Lighthouse
Deployment Tracker: Deploys, Tests Passed, Coverage, Deploy Time
Run log columns: Run, Commit, Tests, Coverage, Slot, Status, Time
Interview: Azure DevOps Pipeline Administration

"At NeonDeck, I administered 200+ Azure DevOps pipelines across the casino gaming and content delivery platforms. I designed multi-stage YAML pipelines with staging slot deployments and approval gates requiring lead engineer sign-off before production swap. Build artifacts were published to Azure Artifacts, and every PR required passing Jest unit tests plus a Playwright E2E suite covering the slot machine, poker table, and live dealer UIs. I configured Key Vault-linked variable groups so zero secrets existed in pipeline definitions, and built service connections using Entra-registered service principals scoped to least-privilege resource groups."

Azure Databricks

Banking Analytics Platform

Medallion architecture on Azure Databricks processing daily ACH, wire transfer, and mortgage origination data — powering AML/BSA compliance, credit risk analytics, and DFAST/CCAR regulatory reporting for Western Alliance Bank.

Source: Core Banking + ADF (FIS · Azure Data Factory)
Bronze: Raw Transactions (Auto Loader · Parquet)
Silver: AML Screened (BSA Compliant · Dedup)
Gold: Regulatory + Risk (DFAST · Credit Risk · BSA)
wab_transaction_pipeline.py · Azure Databricks · Unity Catalog
# Delta Live Tables — Western Alliance Bank Analytics Platform
# Medallion architecture: Bronze → Silver → Gold
# Handles ACH, wire transfers, mortgage originations, and deposit activity

import dlt
from pyspark.sql import functions as F  # namespaced so sum/count do not shadow Python builtins

# -------- BRONZE: Raw ACH/Wire transactions from ADLS Gen2 --------
@dlt.table(
    name="bronze_transactions",
    comment="Raw ACH/wire/mortgage transactions from core banking system"
)
def bronze_transactions():
    # `spark` is provided by the Databricks pipeline runtime.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .option("cloudFiles.schemaLocation", "dbfs:/schemas/wab_txn")
        .load("abfss://raw@wabdatalake.dfs.core.windows.net/transactions/")
    )

# -------- SILVER: AML-screened, BSA-compliant, cleaned --------
@dlt.table(name="silver_transactions")
@dlt.expect_or_drop("valid_amount", "amount > 0")
@dlt.expect_or_drop("valid_account", "account_id IS NOT NULL")
@dlt.expect_or_drop("valid_routing", "LENGTH(routing_number) = 9")
def silver_transactions():
    return (
        dlt.read_stream("bronze_transactions")
        # Streaming dedup keeps state for every key seen; add a watermark to bound it.
        .dropDuplicates(["transaction_id"])
        # Placeholder: the original elides the AML model inference call here.
        # Swap this null column for the model's score column.
        .withColumn("aml_risk_score", F.lit(None).cast("double"))
        .withColumn("ctr_flag", F.when(F.col("amount") >= 10000, True).otherwise(False))
        .withColumn("sar_candidate", F.when(F.col("aml_risk_score") > 0.75, True).otherwise(False))
    )

# -------- GOLD: Regulatory reporting + credit risk analytics --------
@dlt.table(name="gold_daily_analytics")
def gold_daily_analytics():
    return (
        dlt.read("silver_transactions")
        .groupBy("transaction_date", "product_type", "region")
        .agg(
            F.sum("amount").alias("total_volume"),
            F.count("*").alias("transaction_count"),
            F.sum(F.when(F.col("ctr_flag"), 1).otherwise(0)).alias("ctr_filings"),
            F.sum(F.when(F.col("sar_candidate"), 1).otherwise(0)).alias("sar_candidates")
        )
    )
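The Silver-layer flags in the pipeline above can also be sketched in plain Python, which makes the thresholds easy to unit test outside Spark. This is a minimal sketch, not the pipeline's own code: the `Txn` dataclass and `flag_transaction` helper are hypothetical names, and only the two thresholds (10,000 for `ctr_flag`, 0.75 for `sar_candidate`) come from the listing.

```python
from dataclasses import dataclass

CTR_THRESHOLD = 10_000       # from the listing: amount >= 10000 sets ctr_flag
SAR_SCORE_THRESHOLD = 0.75   # from the listing: aml_risk_score > 0.75 sets sar_candidate


@dataclass
class Txn:
    """Hypothetical record shape; the real table has more columns."""
    transaction_id: str
    amount: float
    aml_risk_score: float


def flag_transaction(txn: Txn) -> dict:
    """Apply the same two comparisons the Silver table applies per row."""
    return {
        "transaction_id": txn.transaction_id,
        "ctr_flag": txn.amount >= CTR_THRESHOLD,               # inclusive boundary
        "sar_candidate": txn.aml_risk_score > SAR_SCORE_THRESHOLD,  # strict boundary
    }


flags = flag_transaction(Txn("t-1", 10_000.00, 0.40))
print(flags)
```

Note the asymmetry: the CTR comparison is inclusive while the SAR comparison is strict, so a transaction of exactly 10,000 is flagged but a score of exactly 0.75 is not.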
Live Demo

Banking Pipeline Simulator

Click Run to simulate processing a daily batch of ACH, wire, and mortgage transactions through the Bronze → Silver → Gold medallion layers — with AML screening and BSA compliance checks at each stage.

Western Alliance Bank · DLT Pipeline · Unity Catalog

Stages: Ingest → Bronze → Silver → Gold → Quality → Complete

Pipeline Context
Daily Batch
Transactions: 50,000
Source: ADLS Gen2
Format: Parquet (Auto Loader)

Product Mix: ACH, Wire Transfer, Mortgage
BSA Compliance: CTR Filings, SAR Candidates, Net Volume
Pipeline Metrics: Rows Processed, Data Quality, CTR Candidates, Pipeline Time
Interview: Azure Databricks Platform Engineering — Banking

"In my previous role I owned the Azure Databricks platform supporting the bank's Enterprise Data & Analytics function — including workspace management, cluster policies, Unity Catalog governance, and job orchestration for 50+ data engineering and ML workloads. Our Delta Live Tables pipeline processes 500K+ daily transactions — ACH, wire transfers, and mortgage originations — ingested via Auto Loader from ADLS Gen2. The Silver layer enforces BSA/AML rules inline: CTR flags on transactions ≥$10,000, SAR candidates scored by an AML ML model, and column-level security on PII fields (SSN, account numbers, routing numbers) enforced via Unity Catalog. Gold tables feed DFAST/CCAR regulatory reporting and the credit risk scorecard used by the commercial real estate team. I integrated Azure Key Vault for secret management across all cluster configurations, implemented Azure Monitor alerts on cluster health and job SLA breaches, and reduced cluster compute costs 38% by enforcing auto-termination policies and right-sizing instance types through Databricks cluster policies."

Entra ID

Developer RBAC & Identity Governance

Every developer who deploys across the three CI/CD pipelines has role-based access controlled by Microsoft Entra ID, with PIM elevation, conditional access, and just-in-time permissions.

Role           | DevOps Pipelines | Databricks Notebooks | Databricks Clusters | Key Vault Secrets | Production Deploy | Access Method
App Developer  | Run              | None                 | None                | None              | None              | Direct Entra Group
Data Engineer  | View             | Edit                 | Start/Stop          | None              | None              | Direct Entra Group
Lead Engineer  | Run + Edit       | Edit                 | Manage              | PIM 4hr           | PIM 4hr           | PIM Elevation
Platform Admin | Full             | Full                 | Full                | PIM 4hr           | PIM + Approval    | PIM + Manager Approval
Compliance     | Audit            | Audit                | None                | Read              | None              | Direct Entra Group
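The access matrix above can be expressed as a lookup that tooling or tests query. The following is a minimal sketch under stated assumptions: the dictionary layout, the short resource keys, and the helpers `needs_pim` and `has_standing_access` are all hypothetical; only the role names and cell values are transcribed from the table.

```python
# Cell values transcribed from the table; the key names are illustrative.
ACCESS = {
    "App Developer":  {"devops": "Run", "notebooks": "None", "clusters": "None",
                       "key_vault": "None", "prod_deploy": "None"},
    "Data Engineer":  {"devops": "View", "notebooks": "Edit", "clusters": "Start/Stop",
                       "key_vault": "None", "prod_deploy": "None"},
    "Lead Engineer":  {"devops": "Run + Edit", "notebooks": "Edit", "clusters": "Manage",
                       "key_vault": "PIM 4hr", "prod_deploy": "PIM 4hr"},
    "Platform Admin": {"devops": "Full", "notebooks": "Full", "clusters": "Full",
                       "key_vault": "PIM 4hr", "prod_deploy": "PIM + Approval"},
    "Compliance":     {"devops": "Audit", "notebooks": "Audit", "clusters": "None",
                       "key_vault": "Read", "prod_deploy": "None"},
}


def needs_pim(role: str, resource: str) -> bool:
    """True when the cell grants access only through a PIM elevation."""
    return ACCESS[role][resource].startswith("PIM")


def has_standing_access(role: str, resource: str) -> bool:
    """True when the role holds the permission without elevation."""
    level = ACCESS[role][resource]
    return level != "None" and not level.startswith("PIM")


print(needs_pim("Lead Engineer", "prod_deploy"))
print(has_standing_access("Data Engineer", "clusters"))
```

Keeping the matrix as data makes it possible to assert in CI that no role gains standing access to Key Vault secrets or production deploys beyond what the table allows.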
Policies

Conditional Access & PIM

Zero-trust policies enforce MFA, device compliance, and time-boxed elevated access across all three deployment targets.

Conditional Access
MFA required for all DevOps and Databricks access. Intune-compliant devices only for production environments.
PIM Elevation
Production deploy and Key Vault access require just-in-time elevation — 4-hour max window with justification.
SCIM Provisioning
Entra groups auto-sync to Databricks Unity Catalog via SCIM; Azure DevOps project permissions follow the same Entra groups through the organization's directory connection.
Access Reviews
Quarterly Entra access reviews for all elevated roles. Auto-revocation of stale assignments feeds SOC 2 evidence.
Key Vault Integration
Service principals for DevOps and Databricks authenticate via Key Vault-linked variable groups — no hard-coded secrets.
SSO Configuration
SAML/OIDC SSO for Databricks workspaces and Azure DevOps org — single identity plane across all three pipelines.
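The time-boxed elevation rule in the PIM policy above (4-hour maximum window, justification required) can be illustrated with a small check. This is a sketch of the rule only, not of how Entra PIM evaluates requests: `grant_window` and `is_active` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone

MAX_WINDOW = timedelta(hours=4)  # from the policy: 4-hour max window


def grant_window(requested: timedelta, justification: str) -> timedelta:
    """Return the approved window, capped at MAX_WINDOW; reject empty justifications."""
    if not justification.strip():
        raise ValueError("justification is required for elevation")
    if requested <= timedelta(0):
        raise ValueError("requested window must be positive")
    return min(requested, MAX_WINDOW)


def is_active(activated_at: datetime, window: timedelta, now: datetime) -> bool:
    """An elevation is active from activation up to, but not including, expiry."""
    return activated_at <= now < activated_at + window


start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
window = grant_window(timedelta(hours=8), "hotfix deploy")  # capped to 4 hours
print(window, is_active(start, window, start + timedelta(hours=3)))
```

An 8-hour request is silently capped here; a stricter variant could reject over-long requests instead, which is a policy choice the source does not specify.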
Live Demo

Developer Access Simulator

Click Run to simulate a developer requesting PIM elevation to deploy across all three CI/CD pipelines — watch conditional access checks, MFA verification, and role activation in real time.

NeonDeck · Entra ID · PIM Elevation

Stages: Sign In → MFA → Device Check → PIM Request → Approval → Activated

Identity Context
Developer
User: kieth@neondeck.io
Base Role: Lead Engineer
Groups: sg-eng-leads

Device: Compliance, MFA, Location
Elevated Roles: DevOps, Databricks, Key Vault
Access Governance: PIM Elevations, MFA Challenges, Roles Activated, Window
Interview: Microsoft Entra ID & Identity Governance

"I managed Microsoft Entra ID for NeonDeck's 3,400-user organization: configuring SAML/OIDC SSO for Databricks and Azure DevOps, deploying Conditional Access policies requiring MFA and Intune-compliant devices, and implementing PIM with approval workflows capping all elevated roles at 4-hour windows. I built the SCIM provisioning integration between Entra groups and Databricks Unity Catalog, reducing access provisioning from a 2-day manual ticket to 5 minutes automated. Quarterly Entra access reviews feed directly into our SOC 2 Type II audit evidence. Every developer who touches production, whether deploying the casino app via DevOps, running a DLT pipeline in Databricks, or rotating a Key Vault secret, must pass through this identity governance layer."

Architecture

Design Decisions & Tradeoffs

Every architectural choice has a cost. Here’s why we chose these patterns over the alternatives.

ARCHITECTURE SEPARATION
Why Three Separate Pipelines Instead of One Unified Pipeline?
Each pipeline has fundamentally different SLAs: CI/CD targets 15-min deploy cycles, ML pipelines run 2–4 hour training jobs, and Entra ID provisioning is event-driven (user joins → instant). A monolithic pipeline would bottleneck—ML training would block deployments. Separating them allows independent scaling, different retry policies, and purpose-built stages.
SHIFT-LEFT ECONOMICS
Why Shift-Left Security (SAST Before Build)?
Finding a SQL injection in code review costs roughly $200 to fix; finding it in production costs $20,000 or more once incident response, forensics, and notification are counted. SAST (Semgrep/CodeQL) runs in the PR check, before code merges, and catches about 85% of OWASP Top 10 vulnerabilities at near-zero marginal cost. The tradeoff: SAST adds 45–90 seconds to CI, which is negligible next to the cost of remediating the same issue after release.
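The cost argument above is simple expected-value arithmetic, sketched below. The per-fix costs ($200, $20,000) and the 85% catch rate are the figures quoted in the text; the count of 20 vulnerabilities is a purely hypothetical input chosen to make the arithmetic concrete, and `expected_cost` is an illustrative helper.

```python
def expected_cost(n_vulns: int, catch_rate: float,
                  review_cost: float = 200.0, prod_cost: float = 20_000.0) -> float:
    """Expected remediation cost when a fraction `catch_rate` is caught pre-merge."""
    caught = n_vulns * catch_rate     # fixed at code-review cost
    escaped = n_vulns - caught        # fixed at production cost
    return caught * review_cost + escaped * prod_cost


n = 20  # hypothetical number of vulnerabilities introduced over some period
without_sast = expected_cost(n, catch_rate=0.0)
with_sast = expected_cost(n, catch_rate=0.85)
print(f"without SAST: ${without_sast:,.0f}  with SAST: ${with_sast:,.0f}")
```

Under these assumptions the escaped 15% still dominates the total, which is why the catch rate matters far more than the review-time fix cost.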
DEPLOYMENT STRATEGY
Why Canary Deployment Instead of Blue-Green?
Blue-green requires 2× infrastructure at all times (expensive). Canary routes 5% of traffic to the new version, monitors error rates for 15 minutes, then gradually ramps to 100%. If the canary’s error rate exceeds the baseline by 2×, automatic rollback triggers. This saves 50% infra cost vs. blue-green while providing faster feedback than a full A/B test.
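The canary rule described above (ramp traffic in steps, roll back when the canary error rate exceeds twice the baseline) reduces to a short control loop. The sketch below is illustrative only: `run_canary` and the `observe` callback are hypothetical, and the step list mirrors the 5% start and 100% end mentioned in the text with assumed intermediate steps.

```python
from typing import Callable, Tuple

INCREMENTS = [5, 25, 50, 100]  # percent of traffic; intermediate steps are assumed
ROLLBACK_FACTOR = 2.0          # from the text: roll back above 2x baseline


def should_rollback(canary_err: float, baseline_err: float) -> bool:
    """True when the canary error rate exceeds the baseline by the rollback factor."""
    return canary_err > baseline_err * ROLLBACK_FACTOR


def run_canary(observe: Callable[[int], Tuple[float, float]]) -> Tuple[str, int]:
    """Step through INCREMENTS; `observe(pct)` returns (canary_err, baseline_err)
    measured after the monitoring window at that traffic percentage."""
    for pct in INCREMENTS:
        canary_err, baseline_err = observe(pct)
        if should_rollback(canary_err, baseline_err):
            return ("rolled_back", pct)
    return ("promoted", 100)


# Healthy rollout: the canary matches the baseline at every step.
print(run_canary(lambda pct: (0.01, 0.01)))
# Regression that only shows up under more load, at the 25% step.
print(run_canary(lambda pct: (0.05, 0.01) if pct == 25 else (0.01, 0.01)))
```

One edge case worth deciding explicitly: with a baseline of exactly zero, any nonzero canary error rate triggers a rollback under this rule.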
ML ISOLATION
Why Feature Branch Isolation for ML Experiments?
Data scientists need to test model changes without affecting production data pipelines. Each feature branch gets an isolated Databricks workspace with a snapshot of the feature store. This prevents the classic problem where one researcher’s data transformation corrupts another’s training data. The tradeoff: workspace provisioning adds 3–5 minutes to branch creation.
IDENTITY STANDARDS
Why SCIM Over Direct API Provisioning for Entra ID?
SCIM (System for Cross-domain Identity Management) is an open standard that decouples the identity provider from the application. If you switch from Entra ID to Okta, the application’s provisioning code doesn’t change—only the SCIM provider does. Direct API calls to Entra ID would be faster (one less abstraction layer) but create vendor lock-in.
Production Code

Production Implementation

Real configuration and code behind each of the three pipelines—copy-paste ready for your own projects.

Azure DevOps Pipeline YAML (azure-pipelines.yml)
trigger:
  branches:
    include: [main]
  paths:
    include: [src/**, Dockerfile]

variables:
  acrName:     crneondeck
  imageName:   casino-app
  aksCluster:  aks-neondeck-prod
  canaryPct:   5

stages:
# ── Build: Docker image with layer caching ──
- stage: Build
  jobs:
  - job: DockerBuild
    pool: { vmImage: ubuntu-latest }
    steps:
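    # Docker@2 ignores `arguments` for buildAndPush, so build and push run as separate steps.
    # --cache-from only helps if the :latest image is available locally (pull it first) or via BuildKit inline cache.
    - task: Docker@2
      displayName: Build image
      inputs:
        containerRegistry: $(acrName)
        repository:        $(imageName)
        command:           build
        Dockerfile:        Dockerfile
        tags: |
          $(Build.BuildId)
          latest
        arguments:         --cache-from $(acrName).azurecr.io/$(imageName):latest
    - task: Docker@2
      displayName: Push image
      inputs:
        containerRegistry: $(acrName)
        repository:        $(imageName)
        command:           push
        tags: |
          $(Build.BuildId)
          latest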

# ── SAST: Semgrep security scan with fail threshold ──
- stage: SAST
  dependsOn: Build
  jobs:
  - job: SemgrepScan
    steps:
    - script: |
        pip install semgrep
        semgrep --config=p/owasp-top-ten --config=p/typescript \
          --error --severity ERROR \
          --json --output semgrep-results.json \
          src/
      displayName: Semgrep OWASP Scan
    - task: PublishBuildArtifacts@1
      inputs: { PathtoPublish: semgrep-results.json, ArtifactName: sast-report }
      condition: always()

# ── Deploy: Canary rollout with health check ──
- stage: CanaryDeploy
  dependsOn: SAST
  jobs:
  - deployment: Canary
    environment: production
    strategy:
      canary:
        increments: [5, 25, 50, 100]
        deploy:
          steps:
          - script: |
              kubectl set image deployment/casino-app \
                casino-app=$(acrName).azurecr.io/$(imageName):$(Build.BuildId)
              kubectl rollout status deployment/casino-app --timeout=300s
        routeTraffic:
          steps:
          - script: |
              kubectl annotate ingress casino-app \
                nginx.ingress.kubernetes.io/canary-weight="$(strategy.increment)" --overwrite
        postRouteTraffic:
          steps:
          - script: |  # Health check: 15-min error-rate window
              # -G with --data-urlencode stops curl from globbing [] and {}; jq -r strips JSON quotes so bc can parse the value.
              BASELINE_ERR=$(curl -sG 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=rate(http_errors[15m])' | jq -r '.data.result[0].value[1]')
              CANARY_ERR=$(curl -sG 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=rate(http_errors{canary="true"}[15m])' | jq -r '.data.result[0].value[1]')
              if (( $(echo "$CANARY_ERR > $BASELINE_ERR * 2" | bc -l) )); then
                echo "##vso[task.logissue type=error]Canary error rate exceeds 2x baseline; rolling back"
                exit 1
              fi
        on:
          failure:
            steps:
            - script: kubectl rollout undo deployment/casino-app
              displayName: Rollback canary
Databricks ML Pipeline (Python + MLflow)
import mlflow
import mlflow.sklearn
from pyspark.sql import SparkSession
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from mlflow.tracking import MlflowClient

spark = SparkSession.builder.appName("wab-credit-risk").getOrCreate()
client = MlflowClient()

# ── 1. Load data from Delta Lake feature store ──
df = spark.read.format("delta").table("wab_catalog.gold.credit_risk_features")
feature_cols = ["transaction_amount", "acct_days_since_open", "avg_daily_balance",
                "overdraft_count_30d", "credit_utilization_pct"]
pdf = df.select(feature_cols + ["is_default"]).toPandas()

X = pdf[feature_cols]
y = pdf["is_default"]

# Hold out a stratified test set so metrics are not computed on the training rows.
# The 80/20 split and the seed are assumptions, not values from the original.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# ── 2. Train with MLflow tracking ──
mlflow.set_experiment("/wab/credit-risk-scorecard")
with mlflow.start_run(run_name="gbm-v2.4") as run:
    params = dict(n_estimators=500, max_depth=6, learning_rate=0.05,
                  subsample=0.8, min_samples_leaf=20)
    model = GradientBoostingClassifier(**params)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    new_f1 = f1_score(y_test, preds)

    # Log parameters and held-out metrics
    mlflow.log_params(params)
    mlflow.log_metric("precision", precision_score(y_test, preds))
    mlflow.log_metric("recall",    recall_score(y_test, preds))
    mlflow.log_metric("f1",        new_f1)
    mlflow.sklearn.log_model(model, "credit-risk-model")

# ── 3. Model registry: promote staging → production ──
model_name = "wab-credit-risk-scorecard"
model_uri  = f"runs:/{run.info.run_id}/credit-risk-model"
mv = mlflow.register_model(model_uri, model_name)

# Compare against current production.
# Registry stages are deprecated in recent MLflow releases in favor of aliases;
# they are kept here to match the original workflow.
prod_versions = client.get_latest_versions(model_name, stages=["Production"])
if prod_versions:
    prod_run = client.get_run(prod_versions[0].run_id)
    prod_f1  = prod_run.data.metrics["f1"]
    if new_f1 > prod_f1:
        client.transition_model_version_stage(
            model_name, mv.version, stage="Production",
            archive_existing_versions=True
        )
        print(f"Promoted v{mv.version}: F1 {new_f1:.4f} > {prod_f1:.4f}")
    else:
        client.transition_model_version_stage(
            model_name, mv.version, stage="Staging"
        )
        print(f"Kept in staging: F1 {new_f1:.4f} ≤ {prod_f1:.4f}")
else:
    client.transition_model_version_stage(
        model_name, mv.version, stage="Production"
    )
    print(f"First production model: v{mv.version}")
SCIM Provisioning Handler (Python)
# SCIM 2.0 User provisioning endpoint for Entra ID → internal systems
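from flask import Flask, request, jsonify
from pymongo import MongoClient
import uuid
from datetime import datetime

app = Flask(__name__)

# Assumed data store: the original uses `db` without defining it.
# The MongoDB URI and database name below are placeholders.
db = MongoClient("mongodb://localhost:27017")["identity"]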

SCIM_SCHEMA = "urn:ietf:params:scim:schemas:core:2.0:User"

def map_scim_to_internal(scim_user: dict) -> dict:
    """Map SCIM user attributes to internal schema."""
    name = scim_user.get("name", {})
    emails = scim_user.get("emails", [{}])
    primary_email = next(
        (e["value"] for e in emails if e.get("primary")),
        emails[0].get("value", "") if emails else ""
    )
    return {
        "external_id": scim_user.get("externalId"),
        "username":    scim_user.get("userName"),
        "email":       primary_email,
        "first_name":  name.get("givenName", ""),
        "last_name":   name.get("familyName", ""),
        "active":      scim_user.get("active", True),
    }


@app.route("/scim/v2/Users", methods=["POST"])
def create_user():
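    # silent=True returns None instead of raising on a malformed body
    scim_user = request.get_json(silent=True) or {}

    # Validate SCIM schema
    schemas = scim_user.get("schemas", [])
    if SCIM_SCHEMA not in schemas:
        return jsonify({
            "schemas": ["urn:ietf:params:scim:api:messages:2.0:Error"],
            "detail":  "Missing required schema",
            "status":  "400"  # RFC 7644 expresses status as a JSON string
        }), 400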

    internal = map_scim_to_internal(scim_user)

    # Conflict detection: check if user already exists
    existing = db.users.find_one({"email": internal["email"]})
    if existing:
        return jsonify({
            "schemas":  ["urn:ietf:params:scim:api:messages:2.0:Error"],
            "scimType": "uniqueness",  # RFC 7644 error type for duplicate resources
            "detail":   f"User {internal['email']} already exists",
            "status":   "409"
        }), 409

    # Provision in downstream systems
    user_id = str(uuid.uuid4())
    internal["id"] = user_id
    internal["created_at"] = datetime.utcnow().isoformat() + "Z"
    db.users.insert_one(internal)

    # SCIM-compliant response
    return jsonify({
        "schemas":    [SCIM_SCHEMA],
        "id":         user_id,
        "externalId": internal["external_id"],
        "userName":   internal["username"],
        "name":       {"givenName": internal["first_name"],
                       "familyName": internal["last_name"]},
        "emails":     [{"value": internal["email"], "primary": True}],
        "active":     internal["active"],
        "meta": {
            "resourceType": "User",
            "created":      internal["created_at"],
            "location":     f"/scim/v2/Users/{user_id}"
        }
    }), 201
Innovation

Innovation Spotlight

Forward-looking capabilities that move CI/CD from “it works” to “it learns.”

Pipeline Intelligence: Predictive Build Failure
By analyzing the last 500 builds, the system learns which file paths correlate with test failures. When a PR modifies high-risk files, it automatically expands the test matrix and notifies the on-call engineer, reducing escaped defects by 34%.
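The path-correlation idea above can be approximated with simple per-path failure rates over a recent build window. This is a toy sketch, not the system described: the build record shape, the `path_failure_rates` and `is_high_risk` helpers, and the 0.5 threshold are all assumptions, and the sample history is made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List


def path_failure_rates(builds: List[dict], window: int = 500) -> Dict[str, float]:
    """For each path, the fraction of recent builds touching it that failed."""
    touched: Counter = Counter()
    failed: Counter = Counter()
    for build in builds[-window:]:       # only the most recent `window` builds
        for path in set(build["paths"]):  # count each path once per build
            touched[path] += 1
            if build["failed"]:
                failed[path] += 1
    return {path: failed[path] / touched[path] for path in touched}


def is_high_risk(pr_paths: Iterable[str], rates: Dict[str, float],
                 threshold: float = 0.5) -> bool:
    """Flag a PR when any changed path has a failure rate at or above the threshold."""
    return any(rates.get(path, 0.0) >= threshold for path in pr_paths)


# Made-up history for illustration only.
history = [
    {"paths": ["src/slots.ts", "src/ui.ts"], "failed": True},
    {"paths": ["src/slots.ts"], "failed": True},
    {"paths": ["src/ui.ts"], "failed": False},
    {"paths": ["README.md"], "failed": False},
]
rates = path_failure_rates(history)
print(rates, is_high_risk(["src/slots.ts"], rates, threshold=0.75))
```

A real implementation would need to separate correlation from cause, since paths edited in large PRs inherit failures they did not trigger; this sketch ignores that.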
Chaos Engineering for CI/CD
The pipeline periodically injects controlled failures (network timeout during dependency fetch, OOM during build, certificate expiry during deploy) to validate that retry policies and circuit breakers actually work. This is “GameDay for your build system.”
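The fault-injection check described above can be demonstrated in miniature: inject a known number of timeouts into a step and assert that the retry policy absorbs them. The sketch is illustrative; `retry` and `flaky_step` are hypothetical helpers, and a real pipeline would inject faults at the network or agent level rather than in-process.

```python
from typing import Callable, Tuple


def retry(op: Callable[[], str], attempts: int = 3) -> str:
    """Call `op` up to `attempts` times, re-raising the last timeout if all fail."""
    last_error = None
    for _ in range(attempts):
        try:
            return op()
        except TimeoutError as err:
            last_error = err
    raise last_error


def flaky_step(fail_times: int) -> Tuple[Callable[[], str], dict]:
    """Build a step that raises an injected timeout on its first `fail_times` calls."""
    state = {"calls": 0}

    def op() -> str:
        state["calls"] += 1
        if state["calls"] <= fail_times:
            raise TimeoutError("injected network timeout during dependency fetch")
        return "fetched"

    return op, state


# Two injected failures are absorbed by a three-attempt policy.
step, state = flaky_step(fail_times=2)
result = retry(step, attempts=3)
print(result, state["calls"])
```

The useful assertion in a GameDay run is the negative case as well: three injected failures against a three-attempt policy should surface the error rather than hang or pass silently.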