Azure Databricks Platform Engineering · Western Alliance Bank

End-to-end platform engineering across Azure Databricks, Azure DevOps CI/CD, and Entra ID governance — built for the scale, security, and compliance demands of a top-tier financial institution.

Azure DevOps

Casino Gaming App CI/CD

Multi-stage YAML pipeline that builds, tests, and deploys the NeonDeck casino web application — slot machines, poker tables, and live dealer interfaces.

neondeck-casino-app.yml · Azure DevOps · neondeck-gaming

# NeonDeck Casino Gaming App — Azure DevOps Deployment Pipeline
# Build React front-end, run Playwright tests, deploy to Azure App Service

trigger:
  branches:
    include: [main, release/*]
  paths:
    include: [src/**, public/**, playwright/**]

variables:
  - group: neondeck-casino-prod       # Key Vault-linked variable group
  - name: appServiceName
    value: app-neondeck-casino-eastus2

stages:
  - stage: Build
    displayName: Build & Unit Test
    jobs:
      - job: BuildApp
        pool: { vmImage: ubuntu-latest }
        steps:
          - task: NodeTool@0
            inputs: { versionSpec: '20.x' }
          - script: npm ci && npm run build
            displayName: Install & Build
          - script: npm run test:unit -- --coverage
            displayName: Jest Unit Tests
          - task: PublishBuildArtifacts@1
            inputs: { PathtoPublish: dist/, ArtifactName: casino-app }

  - stage: E2E
    displayName: Playwright E2E Tests
    dependsOn: Build
    jobs:
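      - job: PlaywrightTests
        pool: { vmImage: ubuntu-latest }
        steps:
          - task: NodeTool@0
            inputs: { versionSpec: '20.x' }
          - script: npm ci   # each job starts on a fresh agent, so dependencies are reinstalled
          - script: npx playwright install --with-deps chromium
          - script: npx playwright test --project=chromium
            displayName: Casino UI E2E Suite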

  - stage: DeployStaging
    displayName: Deploy → Staging Slot
    dependsOn: E2E
    jobs:
      - deployment: DeploySlot
        environment: staging
        strategy:
          runOnce:
            deploy:
              steps:
                - task: AzureWebApp@1
                  inputs:
                    azureSubscription: neondeck-svc-conn
                    appName: $(appServiceName)
                    deployToSlotOrASE: true
                    slotName: staging
                    package: $(Pipeline.Workspace)/casino-app   # folder holding the built site; a wildcard matching many files fails

  - stage: Production
    displayName: Swap → Production
    dependsOn: DeployStaging
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - deployment: SwapSlots
        environment: production   # requires manual approval gate
        strategy:
          runOnce:
            deploy:
              steps:
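                - task: AzureAppServiceManage@0
                  inputs:
                    azureSubscription: neondeck-svc-conn
                    Action: Swap Slots
                    WebAppName: $(appServiceName)
                    ResourceGroupName: $(resourceGroupName)   # assumed to be defined in the variable group
                    SourceSlot: staging
                    SwapWithProduction: true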

Live Demo

Pipeline Simulator

Click Run to simulate deploying the NeonDeck casino gaming front-end — from PR merge through build, test, and slot swap to production.

NeonDeck · Casino App · Azure DevOps

Stages: Build → Unit Test → E2E Tests → Staging → Approval → Production

Pipeline Context

PR #347 merged
Branch: main
Author: kieth@neondeck.io
Changed: 14 files

Casino App

Component: Slot Machine UI
Framework: React 18 + Vite
Target: App Service

Quality Gates: Coverage, E2E Pass, Lighthouse

Deployment Tracker: Deploys, Tests Passed, Coverage, Deploy Time

Run	Commit	Tests	Coverage	Slot	Status	Time

Interview: Azure DevOps Pipeline Administration

"At NeonDeck, I administered 200+ Azure DevOps pipelines across the casino gaming and content delivery platforms. I designed multi-stage YAML pipelines with staging slot deployments and approval gates requiring lead engineer sign-off before production swap. Build artifacts were published to Azure Artifacts, and every PR required passing Jest unit tests plus a Playwright E2E suite covering the slot machine, poker table, and live dealer UIs. I configured Key Vault-linked variable groups so zero secrets existed in pipeline definitions, and built service connections using Entra-registered service principals scoped to least-privilege resource groups."

Azure Databricks

Banking Analytics Platform

Medallion architecture on Azure Databricks processing daily ACH, wire transfer, and mortgage origination data — powering AML/BSA compliance, credit risk analytics, and DFAST/CCAR regulatory reporting for Western Alliance Bank.

Source: Core Banking + ADF (FIS · Azure Data Factory)
Bronze: Raw Transactions (Auto Loader · Parquet)
Silver: AML Screened (BSA Compliant · Dedup)
Gold: Regulatory + Risk (DFAST · Credit Risk · BSA)

wab_transaction_pipeline.py · Azure Databricks · Unity Catalog

# Delta Live Tables — Western Alliance Bank Analytics Platform
# Medallion architecture: Bronze → Silver → Gold
# Handles ACH, wire transfers, mortgage originations, and deposit activity

import dlt
from pyspark.sql import functions as F  # namespaced so sum/count do not shadow Python builtins

# -------- BRONZE: Raw ACH/Wire transactions from ADLS Gen2 --------
@dlt.table(
    name="bronze_transactions",
    comment="Raw ACH/wire/mortgage transactions from core banking system"
)
def bronze_transactions():
    # `spark` is provided by the Databricks runtime; Auto Loader ingests new files incrementally
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .option("cloudFiles.schemaLocation", "dbfs:/schemas/wab_txn")
        .load("abfss://raw@wabdatalake.dfs.core.windows.net/transactions/")
    )

# -------- SILVER: AML-screened, BSA-compliant, cleaned --------
@dlt.table(name="silver_transactions")
@dlt.expect_or_drop("valid_amount", "amount > 0")
@dlt.expect_or_drop("valid_account", "account_id IS NOT NULL")
@dlt.expect_or_drop("valid_routing", "LENGTH(routing_number) = 9")
def silver_transactions():
    return (
        dlt.read_stream("bronze_transactions")
        # Streaming de-duplication keeps state for every key seen; a watermark would bound it
        .dropDuplicates(["transaction_id"])
        # Placeholder: the AML model inference is elided in the original.
        # Replace this null column with the model's scoring UDF.
        .withColumn("aml_risk_score", F.lit(None).cast("double"))
        .withColumn("ctr_flag", F.when(F.col("amount") >= 10000, True).otherwise(False))
        .withColumn("sar_candidate", F.when(F.col("aml_risk_score") > 0.75, True).otherwise(False))
    )

# -------- GOLD: Regulatory reporting + credit risk analytics --------
@dlt.table(name="gold_daily_analytics")
def gold_daily_analytics():
    return (
        dlt.read("silver_transactions")
        .groupBy("transaction_date", "product_type", "region")
        .agg(
            F.sum("amount").alias("total_volume"),
            F.count("*").alias("transaction_count"),
            F.sum(F.when(F.col("ctr_flag"), 1).otherwise(0)).alias("ctr_filings"),
            F.sum(F.when(F.col("sar_candidate"), 1).otherwise(0)).alias("sar_candidates")
        )
    )
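
The Silver-layer rules can be exercised without a cluster. The sketch below restates them in plain Python, assuming the same thresholds as the pipeline (amount >= 10000 for the CTR flag, score > 0.75 for SAR candidates). The `Txn` class, the `screen` function, and the caller-supplied `aml_risk_score` are hypothetical stand-ins, not part of the DLT pipeline.

```python
from dataclasses import dataclass
from typing import Optional

CTR_THRESHOLD = 10_000    # from the pipeline: amount >= 10000
SAR_SCORE_CUTOFF = 0.75   # from the pipeline: aml_risk_score > 0.75


@dataclass
class Txn:
    transaction_id: str
    account_id: Optional[str]
    routing_number: str
    amount: float
    aml_risk_score: float  # hypothetical: supplied by the caller, not a real model


def passes_expectations(t: Txn) -> bool:
    """Mirror the three expect_or_drop rules on silver_transactions."""
    return t.amount > 0 and t.account_id is not None and len(t.routing_number) == 9


def screen(txns):
    """De-duplicate on transaction_id, drop invalid rows, then add the two flags."""
    seen, out = set(), []
    for t in txns:
        if t.transaction_id in seen:
            continue  # keep only the first row per transaction_id
        seen.add(t.transaction_id)
        if not passes_expectations(t):
            continue  # expectations run on the de-duplicated output
        out.append({
            "transaction_id": t.transaction_id,
            "ctr_flag": t.amount >= CTR_THRESHOLD,
            "sar_candidate": t.aml_risk_score > SAR_SCORE_CUTOFF,
        })
    return out


batch = [
    Txn("t1", "A-1", "123456789", 12_500.0, 0.10),
    Txn("t1", "A-1", "123456789", 12_500.0, 0.10),  # duplicate id
    Txn("t2", None, "123456789", 50.0, 0.90),       # missing account
    Txn("t3", "A-2", "123456789", 900.0, 0.80),
]
result = screen(batch)
```

Here `t1` survives once with only the CTR flag set, `t2` is dropped, and `t3` is kept as a SAR candidate. A harness like this is useful for unit-testing rule changes before they reach the pipeline.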

Live Demo

Banking Pipeline Simulator

Click Run to simulate processing a daily batch of ACH, wire, and mortgage transactions through the Bronze → Silver → Gold medallion layers — with AML screening and BSA compliance checks at each stage.

Western Alliance Bank · DLT Pipeline · Unity Catalog

Stages: Ingest → Bronze → Silver → Gold → Quality → Complete

Pipeline Context

Daily Batch

Transactions: 50,000
Source: ADLS Gen2
Format: Parquet (Auto Loader)

Product Mix: ACH, Wire Transfer, Mortgage

BSA Compliance: CTR Filings, SAR Candidates, Net Volume

Pipeline Metrics: Rows Processed, Data Quality, CTR Candidates, Pipeline Time

Interview: Azure Databricks Platform Engineering — Banking

"In my previous role I owned the Azure Databricks platform supporting the bank's Enterprise Data & Analytics function, including workspace management, cluster policies, Unity Catalog governance, and job orchestration for 50+ data engineering and ML workloads. Our Delta Live Tables pipeline processed 500K+ daily transactions (ACH, wire transfers, and mortgage originations) ingested via Auto Loader from ADLS Gen2. The Silver layer enforced BSA/AML rules inline: CTR flags on transactions ≥$10,000, SAR candidates scored by an AML ML model, and column-level security on PII fields (SSN, account numbers, routing numbers) enforced via Unity Catalog. Gold tables fed DFAST/CCAR regulatory reporting and the credit risk scorecard used by the commercial real estate team. I integrated Azure Key Vault for secret management across all cluster configurations, implemented Azure Monitor alerts on cluster health and job SLA breaches, and reduced cluster compute costs 38% by enforcing auto-termination policies and right-sizing instance types through Databricks cluster policies."

Entra ID

Developer RBAC & Identity Governance

Every developer who deploys across the three CI/CD pipelines has role-based access controlled by Microsoft Entra ID, with PIM elevation, conditional access, and just-in-time permissions.

Role	DevOps Pipelines	Databricks Notebooks	Databricks Clusters	Key Vault Secrets	Production Deploy	Access Method
App Developer	Run	None	None	None	None	Direct Entra Group
Data Engineer	View	Edit	Start/Stop	None	None	Direct Entra Group
Lead Engineer	Run + Edit	Edit	Manage	PIM 4hr	PIM 4hr	PIM Elevation
Platform Admin	Full	Full	Full	PIM 4hr	PIM + Approval	PIM + Manager Approval
Compliance	Audit	Audit	None	Read	None	Direct Entra Group
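
To make the matrix checkable in code, the sketch below transcribes the table into a lookup and adds two helper predicates. The function names are hypothetical, and this models the policy only; it does not call Entra ID, Azure DevOps, or Databricks.

```python
# Column order follows the table above (the Access Method column is omitted).
RESOURCES = [
    "DevOps Pipelines",
    "Databricks Notebooks",
    "Databricks Clusters",
    "Key Vault Secrets",
    "Production Deploy",
]

# One row per role, values copied from the table.
ROLE_MATRIX = {
    "App Developer":  ["Run", "None", "None", "None", "None"],
    "Data Engineer":  ["View", "Edit", "Start/Stop", "None", "None"],
    "Lead Engineer":  ["Run + Edit", "Edit", "Manage", "PIM 4hr", "PIM 4hr"],
    "Platform Admin": ["Full", "Full", "Full", "PIM 4hr", "PIM + Approval"],
    "Compliance":     ["Audit", "Audit", "None", "Read", "None"],
}


def access_level(role: str, resource: str) -> str:
    """Return the table cell for a role/resource pair."""
    return ROLE_MATRIX[role][RESOURCES.index(resource)]


def can_access(role: str, resource: str) -> bool:
    """True when the role has any access at all to the resource."""
    return access_level(role, resource) != "None"


def requires_pim(role: str, resource: str) -> bool:
    """True when access exists only through a PIM elevation."""
    return access_level(role, resource).startswith("PIM")
```

For example, `requires_pim("Lead Engineer", "Production Deploy")` is true, while `can_access("Data Engineer", "Key Vault Secrets")` is false.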

Policies

Conditional Access & PIM

Zero-trust policies enforce MFA, device compliance, and time-boxed elevated access across all three deployment targets.

Conditional Access

MFA required for all DevOps and Databricks access. Intune-compliant devices only for production environments.

PIM Elevation

Production deploy and Key Vault access require just-in-time elevation — 4-hour max window with justification.

SCIM Provisioning

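Entra groups auto-sync to Databricks Unity Catalog via SCIM. Azure DevOps project permissions are granted to the same Entra groups through the organization's connected tenant rather than through SCIM.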

Access Reviews

Quarterly Entra access reviews for all elevated roles. Auto-revocation of stale assignments feeds SOC 2 evidence.

Key Vault Integration

Service principal credentials for DevOps and Databricks live in Key Vault and reach pipelines through Key Vault-linked variable groups, so no secrets are hard-coded.

SSO Configuration

SAML/OIDC SSO for Databricks workspaces and Azure DevOps org — single identity plane across all three pipelines.
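
The time-boxed elevation rule (4-hour maximum, justification required) can be modeled in a few lines. This is a sketch of the policy logic only, under the assumption that a request is rejected outright when it breaks either rule; the `Elevation` class is hypothetical and is not the PIM API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MAX_WINDOW = timedelta(hours=4)  # cap stated in the PIM policy above


@dataclass(frozen=True)
class Elevation:
    user: str
    role: str
    justification: str
    activated_at: datetime
    duration: timedelta

    def __post_init__(self):
        # Reject requests that break either policy rule.
        if not self.justification.strip():
            raise ValueError("justification is required")
        if self.duration > MAX_WINDOW:
            raise ValueError("elevation window exceeds 4 hours")

    def is_active(self, now: datetime) -> bool:
        """True while `now` falls inside the activation window."""
        return self.activated_at <= now < self.activated_at + self.duration


start = datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc)  # arbitrary example time
grant = Elevation("kieth@neondeck.io", "Production Deploy",
                  "hotfix rollout", start, timedelta(hours=4))
```

With this grant, `grant.is_active(start + timedelta(hours=3))` is true and the same check at exactly four hours is false, since the window is half-open.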

Live Demo

Developer Access Simulator

Click Run to simulate a developer requesting PIM elevation to deploy across all three CI/CD pipelines — watch conditional access checks, MFA verification, and role activation in real time.

NeonDeck · Entra ID · PIM Elevation

Steps: MFA → Device Check → PIM Request → Approval → Activated

Identity Context

Developer

User: kieth@neondeck.io
Base Role: Lead Engineer
Groups: sg-eng-leads

Device: Compliance, MFA, Location

Elevated Roles: DevOps, Databricks, Key Vault

Access Governance: PIM Elevations, MFA Challenges, Roles Activated, Window

Interview: Azure Entra ID & Identity Governance

"I managed Microsoft Entra ID for NeonDeck's 3,400-user organization: configuring SAML/OIDC SSO for Databricks and Azure DevOps, deploying Conditional Access policies requiring MFA and Intune-compliant devices, and implementing PIM with approval workflows capping all elevated roles at 4-hour windows. I built the SCIM provisioning integration between Entra groups and Databricks Unity Catalog, reducing the access provisioning process from a 2-day manual ticket to 5 minutes automated. Quarterly Entra access reviews fed directly into our SOC 2 Type II audit evidence. Every developer who touched production, whether deploying the casino app via DevOps, running a DLT pipeline in Databricks, or rotating a Key Vault secret, had to pass through this identity governance layer."

Architecture

Design Decisions & Tradeoffs

Every architectural choice has a cost. Here’s why we chose these patterns over the alternatives.

ARCHITECTURE SEPARATION

Why Three Separate Pipelines Instead of One Unified Pipeline?

Each pipeline has fundamentally different SLAs: CI/CD targets 15-min deploy cycles, ML pipelines run 2–4 hour training jobs, and Entra ID provisioning is event-driven (user joins → instant). A monolithic pipeline would bottleneck—ML training would block deployments. Separating them allows independent scaling, different retry policies, and purpose-built stages.

SHIFT-LEFT ECONOMICS

Why Shift-Left Security (SAST Before Build)?

Finding a SQL injection in code review costs $200 to fix. Finding it in production costs $20,000+ (incident response, forensics, notification). SAST (Semgrep/CodeQL) runs in the PR check—before code merges—catching 85% of OWASP Top 10 vulnerabilities at $0 marginal cost. The tradeoff: SAST adds 45–90 seconds to CI, but this is negligible compared to the deployment time it saves by catching issues early.
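
The arithmetic behind that claim can be made explicit. The sketch below uses only the figures quoted above ($200, $20,000, 85%) and adds two labeled assumptions: every vulnerability SAST misses reaches production, and the batch of 20 vulnerabilities is an illustrative number, not measured data.

```python
FIX_COST_IN_REVIEW = 200    # USD, figure quoted above
FIX_COST_IN_PROD = 20_000   # USD, lower bound quoted above
SAST_CATCH_PCT = 85         # percent, figure quoted above


def expected_cost(n_vulns: int, with_sast: bool) -> int:
    """Total remediation cost, assuming every missed vulnerability reaches production."""
    caught = n_vulns * SAST_CATCH_PCT // 100 if with_sast else 0
    missed = n_vulns - caught
    return caught * FIX_COST_IN_REVIEW + missed * FIX_COST_IN_PROD


# Illustrative batch of 20 vulnerabilities (hypothetical)
cost_with = expected_cost(20, with_sast=True)      # 17 caught in review, 3 reach production
cost_without = expected_cost(20, with_sast=False)  # all 20 reach production
```

Under these assumptions the totals are $63,400 with SAST and $400,000 without it.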

DEPLOYMENT STRATEGY

Why Canary Deployment Instead of Blue-Green?

Blue-green requires 2× infrastructure at all times (expensive). Canary routes 5% of traffic to the new version, monitors error rates for 15 minutes, then gradually ramps to 100%. If the canary’s error rate exceeds the baseline by 2×, automatic rollback triggers. This saves 50% infra cost vs. blue-green while providing faster feedback than a full A/B test.
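
The ramp-and-rollback rule can be sketched as a small decision loop. The 2× threshold comes from the paragraph above and the increments match the pipeline YAML further down; the `observe` callback is a hypothetical hook standing in for whatever metrics source reports the error rates after each monitoring window.

```python
INCREMENTS = [5, 25, 50, 100]   # traffic percentages, as in the canary strategy
ROLLBACK_FACTOR = 2.0           # roll back when canary errors exceed 2x baseline


def should_rollback(canary_err: float, baseline_err: float,
                    factor: float = ROLLBACK_FACTOR) -> bool:
    """True when the canary's error rate is more than `factor` times the baseline."""
    return canary_err > baseline_err * factor


def run_canary(observe, increments=INCREMENTS):
    """Walk the traffic increments, stopping at the first unhealthy one.

    `observe(pct)` must return (canary_err, baseline_err) measured after
    routing `pct` percent of traffic to the new version.
    """
    for pct in increments:
        canary_err, baseline_err = observe(pct)
        if should_rollback(canary_err, baseline_err):
            return ("rolled_back", pct)
    return ("promoted", increments[-1])


# Example: healthy at 5%, error rate spikes once 25% of traffic is routed.
def fake_metrics(pct):
    return (0.05, 0.02) if pct >= 25 else (0.02, 0.02)


outcome = run_canary(fake_metrics)
```

Note that with a baseline of zero, any canary error at all triggers a rollback; a production version would likely want a minimum-traffic floor before comparing rates.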

ML ISOLATION

Why Feature Branch Isolation for ML Experiments?

Data scientists need to test model changes without affecting production data pipelines. Each feature branch gets an isolated Databricks workspace with a snapshot of the feature store. This prevents the classic problem where one researcher’s data transformation corrupts another’s training data. The tradeoff: workspace provisioning adds 3–5 minutes to branch creation.

IDENTITY STANDARDS

Why SCIM Over Direct API Provisioning for Entra ID?

SCIM (System for Cross-domain Identity Management) is an open standard that decouples the identity provider from the application. If you switch from Entra ID to Okta, the application’s provisioning code doesn’t change—only the SCIM provider does. Direct API calls to Entra ID would be faster (one less abstraction layer) but create vendor lock-in.
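
The decoupling is visible in the payload itself: any SCIM 2.0 identity provider sends the same core User resource, so the receiving code never needs to know which provider is calling. A minimal sketch follows, with a hypothetical `parse_scim_user` helper and an example user that is invented for illustration.

```python
import json

SCIM_USER_SCHEMA = "urn:ietf:params:scim:schemas:core:2.0:User"


def parse_scim_user(body: str) -> dict:
    """Validate a SCIM 2.0 User resource and extract provider-neutral fields."""
    doc = json.loads(body)
    if SCIM_USER_SCHEMA not in doc.get("schemas", []):
        raise ValueError("not a SCIM 2.0 User resource")
    if not doc.get("userName"):
        raise ValueError("userName is required")
    return {
        "username": doc["userName"],
        "external_id": doc.get("externalId"),
        "active": doc.get("active", True),
    }


# Example payload (hypothetical user); the shape is the same regardless of provider.
payload = json.dumps({
    "schemas": [SCIM_USER_SCHEMA],
    "userName": "jdoe@example.com",
    "externalId": "00u1abcd",
    "name": {"givenName": "Jane", "familyName": "Doe"},
    "active": True,
})
user = parse_scim_user(payload)
```

Nothing in `parse_scim_user` refers to Entra ID or Okta, which is the portability the standard buys.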

Production Code

Production Implementation

Real configuration and code behind each of the three pipelines—copy-paste ready for your own projects.

Azure DevOps Pipeline YAML (azure-pipelines.yml)

trigger:
  branches:
    include: [main]
  paths:
    include: [src/**, Dockerfile]

variables:
  acrName:     crneondeck
  imageName:   casino-app
  aksCluster:  aks-neondeck-prod
  canaryPct:   5

stages:
# ── Build: Docker image with layer caching ──
- stage: Build
  jobs:
  - job: DockerBuild
    pool: { vmImage: ubuntu-latest }
    steps:
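    - task: Docker@2
      displayName: Build image
      inputs:
        containerRegistry: $(acrName)   # name of the registry service connection
        repository:        $(imageName)
        command:           build        # Docker@2 ignores `arguments` when command is buildAndPush
        Dockerfile:        Dockerfile
        tags:              $(Build.BuildId)
        arguments:         --cache-from $(acrName).azurecr.io/$(imageName):latest
    - task: Docker@2
      displayName: Push image
      inputs:
        containerRegistry: $(acrName)
        repository:        $(imageName)
        command:           push
        tags:              $(Build.BuildId)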

# ── SAST: Semgrep security scan with fail threshold ──
- stage: SAST
  dependsOn: Build
  jobs:
  - job: SemgrepScan
    steps:
    - script: |
        pip install semgrep
        semgrep --config=p/owasp-top-ten --config=p/typescript \
          --error --severity ERROR \
          --json --output semgrep-results.json \
          src/
      displayName: Semgrep OWASP Scan
    - task: PublishBuildArtifacts@1
      inputs: { PathtoPublish: semgrep-results.json, ArtifactName: sast-report }
      condition: always()

# ── Deploy: Canary rollout with health check ──
- stage: CanaryDeploy
  dependsOn: SAST
  jobs:
  - deployment: Canary
    environment: production
    strategy:
      canary:
        increments: [5, 25, 50, 100]
        deploy:
          steps:
          - script: |
              kubectl set image deployment/casino-app \
                casino-app=$(acrName).azurecr.io/$(imageName):$(Build.BuildId)
              kubectl rollout status deployment/casino-app --timeout=300s
        routeTraffic:
          steps:
          - script: |
              kubectl annotate ingress casino-app \
                nginx.ingress.kubernetes.io/canary-weight="$(strategy.increment)" --overwrite
        postRouteTraffic:
          steps:
          - script: |  # Health check: 15-min error-rate window
              PROM=http://prometheus:9090/api/v1/query
              # -G with --data-urlencode stops curl from globbing [] and {} and URL-encodes the PromQL; jq -r strips the quotes
              BASELINE_ERR=$(curl -sG "$PROM" --data-urlencode 'query=rate(http_errors[15m])' | jq -r '.data.result[0].value[1]')
              CANARY_ERR=$(curl -sG "$PROM" --data-urlencode 'query=rate(http_errors{canary="true"}[15m])' | jq -r '.data.result[0].value[1]')
              if (( $(echo "$CANARY_ERR > $BASELINE_ERR * 2" | bc -l) )); then
                echo "##vso[task.logissue type=error]Canary error rate 2x baseline — rolling back"
                exit 1
              fi
        on:
          failure:
            steps:
            - script: kubectl rollout undo deployment/casino-app
              displayName: Rollback canary

Databricks ML Pipeline (Python + MLflow)

import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient
from pyspark.sql import SparkSession
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

spark = SparkSession.builder.appName("wab-credit-risk").getOrCreate()
client = MlflowClient()

# ── 1. Load data from Delta Lake feature store ──
df = spark.read.format("delta").table("wab_catalog.gold.credit_risk_features")
feature_cols = ["transaction_amount", "acct_days_since_open", "avg_daily_balance",
                "overdraft_count_30d", "credit_utilization_pct"]
pdf = df.select(feature_cols + ["is_default"]).toPandas()

X = pdf[feature_cols]
y = pdf["is_default"]

# Hold out 20% so the logged metrics measure generalization, not training fit
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# ── 2. Train with MLflow tracking ──
mlflow.set_experiment("/wab/credit-risk-scorecard")
with mlflow.start_run(run_name="gbm-v2.4") as run:
    params = dict(n_estimators=500, max_depth=6, learning_rate=0.05,
                  subsample=0.8, min_samples_leaf=20)
    model = GradientBoostingClassifier(**params)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    new_f1 = f1_score(y_test, preds)

    # Log every hyperparameter and the held-out metrics
    mlflow.log_params(params)
    mlflow.log_metric("precision", precision_score(y_test, preds))
    mlflow.log_metric("recall",    recall_score(y_test, preds))
    mlflow.log_metric("f1",        new_f1)
    mlflow.sklearn.log_model(model, "credit-risk-model")

# ── 3. Model registry: promote staging → production ──
# Note: registry stages are deprecated in MLflow 2.9+, and Unity Catalog
# registries use aliases instead; this flow targets the workspace registry.
model_name = "wab-credit-risk-scorecard"
model_uri  = f"runs:/{run.info.run_id}/credit-risk-model"
mv = mlflow.register_model(model_uri, model_name)

# Compare against current production
prod_versions = client.get_latest_versions(model_name, stages=["Production"])
if prod_versions:
    prod_run = client.get_run(prod_versions[0].run_id)
    prod_f1  = prod_run.data.metrics["f1"]
    if new_f1 > prod_f1:
        client.transition_model_version_stage(
            model_name, mv.version, stage="Production",
            archive_existing_versions=True
        )
        print(f"Promoted v{mv.version}: F1 {new_f1:.4f} > {prod_f1:.4f}")
    else:
        client.transition_model_version_stage(
            model_name, mv.version, stage="Staging"
        )
        print(f"Kept in staging: F1 {new_f1:.4f} ≤ {prod_f1:.4f}")
else:
    client.transition_model_version_stage(
        model_name, mv.version, stage="Production"
    )
    print(f"First production model: v{mv.version}")

SCIM Provisioning Handler (Python)

# SCIM 2.0 User provisioning endpoint for Entra ID → internal systems
import os
import uuid
from datetime import datetime, timezone

from flask import Flask, request, jsonify
from pymongo import MongoClient

app = Flask(__name__)

# Assumption: the original never defines `db`. The find_one/insert_one calls suggest a
# MongoDB-style store, so one is wired up here; MONGO_URI and "identity" are placeholders.
db = MongoClient(os.environ["MONGO_URI"])["identity"]

SCIM_SCHEMA = "urn:ietf:params:scim:schemas:core:2.0:User"
SCIM_ERROR_SCHEMA = "urn:ietf:params:scim:api:messages:2.0:Error"


def scim_error(detail: str, status: int):
    """Build a SCIM error response; RFC 7644 expresses `status` as a JSON string."""
    body = {"schemas": [SCIM_ERROR_SCHEMA], "detail": detail, "status": str(status)}
    return jsonify(body), status


def map_scim_to_internal(scim_user: dict) -> dict:
    """Map SCIM user attributes to internal schema."""
    name = scim_user.get("name", {})
    emails = scim_user.get("emails", [{}])
    # Prefer the address marked primary, else fall back to the first one
    primary_email = next(
        (e["value"] for e in emails if e.get("primary")),
        emails[0].get("value", "") if emails else ""
    )
    return {
        "external_id": scim_user.get("externalId"),
        "username":    scim_user.get("userName"),
        "email":       primary_email,
        "first_name":  name.get("givenName", ""),
        "last_name":   name.get("familyName", ""),
        "active":      scim_user.get("active", True),
    }


@app.route("/scim/v2/Users", methods=["POST"])
def create_user():
    # silent=True turns a malformed body into None, which fails the schema check below
    scim_user = request.get_json(silent=True) or {}

    # Validate SCIM schema
    if SCIM_SCHEMA not in scim_user.get("schemas", []):
        return scim_error("Missing required schema", 400)

    internal = map_scim_to_internal(scim_user)

    # Conflict detection: check if user already exists
    if db.users.find_one({"email": internal["email"]}):
        return scim_error(f"User {internal['email']} already exists", 409)

    # Provision in downstream systems
    user_id = str(uuid.uuid4())
    internal["id"] = user_id
    # Timezone-aware replacement for the deprecated datetime.utcnow()
    internal["created_at"] = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
    db.users.insert_one(internal)

    # SCIM-compliant response
    return jsonify({
        "schemas":    [SCIM_SCHEMA],
        "id":         user_id,
        "externalId": internal["external_id"],
        "userName":   internal["username"],
        "name":       {"givenName": internal["first_name"],
                       "familyName": internal["last_name"]},
        "emails":     [{"value": internal["email"], "primary": True}],
        "active":     internal["active"],
        "meta": {
            "resourceType": "User",
            "created":      internal["created_at"],
            "location":     f"/scim/v2/Users/{user_id}"
        }
    }), 201

Innovation

Innovation Spotlight

Forward-looking capabilities that move CI/CD from “it works” to “it learns.”

Pipeline Intelligence: Predictive Build Failure

By analyzing the last 500 builds, the system learns which file paths correlate with test failures. When a PR modifies high-risk files, it automatically expands the test matrix and notifies the on-call engineer, reducing escaped defects by 34%.
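
One simple way to get a path-level risk signal of this kind is a per-file failure rate over the build history. The sketch below is an illustration of that idea, not the system described above: the function names, the `min_builds` floor, and the 0.3 risk threshold are all hypothetical, and the history is toy data.

```python
from collections import defaultdict


def path_failure_rates(builds, min_builds=5):
    """Compute failure rate per file path.

    `builds` is an iterable of (changed_paths, failed) pairs. Paths touched
    in fewer than `min_builds` builds are skipped as too noisy to score.
    """
    touched = defaultdict(int)
    failed = defaultdict(int)
    for paths, did_fail in builds:
        for path in set(paths):  # count each path once per build
            touched[path] += 1
            if did_fail:
                failed[path] += 1
    return {p: failed[p] / n for p, n in touched.items() if n >= min_builds}


def high_risk_paths(pr_paths, rates, threshold=0.3):
    """Return the PR's changed paths whose historical failure rate meets the threshold."""
    return sorted(p for p in pr_paths if rates.get(p, 0.0) >= threshold)


# Toy history: slots.ts failed 3 of 5 builds, README.md never failed.
history = (
    [(["src/slots.ts"], True)] * 3
    + [(["src/slots.ts"], False)] * 2
    + [(["README.md"], False)] * 5
)
rates = path_failure_rates(history)
risky = high_risk_paths(["src/slots.ts", "README.md", "src/new.ts"], rates)
```

A PR touching anything in `risky` would be the trigger for expanding the test matrix; unseen paths such as `src/new.ts` default to a rate of zero here, which a real system might treat more cautiously.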

Chaos Engineering for CI/CD

The pipeline periodically injects controlled failures (network timeout during dependency fetch, OOM during build, certificate expiry during deploy) to validate that retry policies and circuit breakers actually work. This is “GameDay for your build system.”
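
A deterministic version of this idea is easy to test: wrap a pipeline step so its first few calls fail, then check that the retry policy absorbs exactly that many failures. The names `InjectedFault`, `fail_first`, and `with_retries` are hypothetical, and the sketch omits backoff delays for brevity.

```python
class InjectedFault(Exception):
    """Raised by the fault injector, never by the real step."""


def fail_first(step, n_failures):
    """Wrap `step` so its first `n_failures` calls raise InjectedFault."""
    calls = {"count": 0}

    def wrapped():
        calls["count"] += 1
        if calls["count"] <= n_failures:
            raise InjectedFault(f"injected failure #{calls['count']}")
        return step()

    return wrapped


def with_retries(step, attempts=3):
    """Call `step` up to `attempts` times, re-raising the fault on the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except InjectedFault:
            if attempt == attempts:
                raise


def fetch_dependencies():
    # Stand-in for a real pipeline step such as a dependency fetch
    return "fetched"


# Two injected timeouts are absorbed by a three-attempt retry policy.
recovered = with_retries(fail_first(fetch_dependencies, 2), attempts=3)
```

If the injector fails three times against the same three-attempt policy, `InjectedFault` propagates, which is the signal that the retry budget is too small for that failure mode.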
