thenumerix

Omni Hotels & Resorts — Data Quality

Reservation Data Quality Framework

Validate Omni Hotels reservation data against a production rule suite — powering reliable revenue management and personalized guest experiences across 50+ properties.

Step 1 — The Stakes

Bad Data, Broken Stays

A single invalid check-out date cascades into wrong ADR calculations, broken forecasting models, and misrouted loyalty rewards. Omni processes 4,000+ reservations per night across 50 properties. At 2% error rate, that's 80 corrupted records per night compounding into the data warehouse.

Step 2 — The Data Flow

Opera PMS → Snowflake → Analytics

Opera Property Management System (the industry-standard hotel PMS) pushes reservation records nightly via ETL into Snowflake. Revenue management, marketing, and loyalty systems all read from this warehouse. Data quality must be enforced at ingestion — before downstream systems consume corrupt data.

Step 3 — The Framework

Great Expectations: Rules as Code

Great Expectations defines data quality rules declaratively ("expect_column_values_to_be_between") and executes them as unit tests against each batch. Rules are version-controlled alongside pipeline code — breaking data is caught in CI just like breaking code.

Step 4 — 5 Dimensions

Completeness, Validity, Uniqueness, Timeliness, Consistency

The CVUTC framework maps each rule to a business risk: Completeness gaps break loyalty profiles. Validity failures skew revenue reports. Uniqueness violations cause double-billing. Timeliness failures corrupt forecasting models. Consistency failures break date-range analytics.

Step 5 — Root Cause

Where Errors Actually Come From

72% of hotel data errors originate at front desk entry (walk-in guests, rate overrides, manual corrections). 18% come from OTA (booking.com, Expedia) field mapping mismatches. 10% are system sync glitches (duplicate res_id on failover). Each error type needs a different remediation strategy.

Step 6 — The Feedback Loop

From Report to Fix in 4 Hours

Quality failures generate Jira tickets, tagged by dimension and source system, routed to the owning team. Revenue management failures alert the RM director directly. SLA: critical failures (rate = $0, checkout < checkin) fixed within 4 hours. Non-critical within 24. Trend dashboard shows improvement over time.

Interactive Demo

Run the validation suite against 10 real Omni Hotels reservations with intentional quality issues.

📋

Data Steward

“I own the quality of this pipeline”

📊

BI Analyst

“My dashboards depend on clean data”

🗃

DBA

“Enforce constraints at the source”

💰

Revenue Manager

“Bad rates destroy my ADR metric”

Reservation Extract — Omni Hotels 10 rows · 10 columns · issues highlighted

res_id	guest_name	email	phone	check_in	check_out	room_type	rate	loyalty	property

Why This Matters at Omni

Omni Hotels operates 50+ properties across North America. Reservation data flows from Opera PMS into the data warehouse nightly. Revenue management models depend on accurate rates, dates, and loyalty tiers. A single invalid check-out date can cascade into incorrect ADR calculations, broken forecasting models, and misrouted loyalty rewards.

Validation Suite — Great Expectations Style

Column	Expectation	Parameter	Failures

Classroom

Six data quality concepts — from CVUTC dimensions to SLA-based remediation frameworks.

Slide 1 of 6

The CVUTC Data Quality Framework

Five dimensions, popularized by Loshin (2001) and formalized in ISO 8000: Completeness (are required fields populated?), Validity (do values conform to rules?), Uniqueness (are records deduplicated?), Timeliness (is data current?), Consistency (do related fields agree?).

Each dimension maps to a distinct business risk. Measuring them separately enables root-cause triage: a Completeness failure points to front desk training, a Validity failure points to OTA field mapping, a Uniqueness failure points to system sync issues.

DQ Score = ∑(passing_rules / total_rules) × 100 → Target: ≥95% for revenue management pipelines

Slide 2 of 6

Great Expectations: Data Quality as Code

Great Expectations (GE) defines data quality rules as Python objects: expect_column_values_to_be_between(column="rate", min_value=89, max_value=1500). Rules execute against DataFrames or database tables and produce machine-readable JSON reports.

The critical architectural insight: data quality rules belong in version control alongside pipeline code. A breaking change in source data fails the GE suite in CI/CD, just like a breaking unit test. This turns data quality from a reactive monthly report into a proactive gate.

Pipeline pattern: Extract → GE Checkpoint → [pass] Load / [fail] Quarantine + Alert

Slide 3 of 6

Completeness: The Invisible Quality Problem

NULL fields are the most common and most underestimated quality issue. Unlike invalid values, NULLs silently propagate through joins and aggregations. A NULL guest name produces no error — it just becomes "Unknown" in your loyalty mailing list, wasting a direct mail piece.

Completeness rules must distinguish between structurally required fields (guest name, check-in date — always required) and conditionally required fields (loyalty ID — required only for enrolled members). Blanket NOT NULL rules on conditional fields generate false failures.

Completeness rate = (populated_fields / total_fields) × 100 Target: 99.5% for PII fields

Slide 4 of 6

Regex Validation: Email as a Case Study

Email validation is the canonical regex quality rule — and also the most commonly over-engineered. The RFC 5322 full regex is 6,394 characters. The practical rule for hospitality data: /^[\w.+-]+@[\w-]+\.[\w.]+$/ catches 99.7% of real typos (missing @, incomplete domain, no TLD) without rejecting valid international addresses.

More important than the regex: what do you do on failure? Hotel strategy: quarantine the row, create a fallback record with hotel-assigned email, and route a data quality ticket to the originating property with the corrected format shown.

Common hotel email errors: front-desk typos (p.alvarez@), OTA mapping failures (email={blank}), test records (test@test)

Slide 5 of 6

Consistency Rules: Date Logic in Hospitality

Consistency validates relationships between fields, not just individual values. The canonical hospitality example: check-out must be strictly after check-in. Sounds obvious; happens constantly. AM/PM confusion at front desk, midnight check-outs entered as check-ins, OTA date format mismatches (MM/DD vs DD/MM).

A more subtle consistency rule: check-in date for a historical extract should not be in the far future. OMN-284708 (William Foster, La Costa, Dec 2027) is a real advance reservation — valid in booking systems, but flags a data timeliness concern when it appears in an April 2026 extract that should only contain current/recent reservations.

Consistency score = rows_passing_all_cross_field_rules / total_rows × 100

Slide 6 of 6

SLA-Based Remediation: Closing the Loop

A quality report no one acts on is theatre. The remediation SLA assigns criticality by dimension and routes failures to specific teams with response time commitments. Critical failures (rate=$0, checkout<checkin) block the record from loading and page the duty manager. Non-critical failures quarantine the record but allow it to load with a quality flag.

Trend analytics matter as much as point-in-time scores. A pipeline with 94% quality score that's trending up from 87% is healthier than one at 96% trending down from 99%. Track ΔDQ week-over-week. A single property consistently at the bottom of the ranking points to a training or system configuration issue, not just a data entry error.

SLA tiers: CRITICAL (<4h) · HIGH (<24h) · MEDIUM (<72h) · LOW (next sprint)

Key Engineering Points

Four decisions that separate production-grade hospitality data quality from checkbox exercises.

🏨

Opera PMS is the Source of Truth

Opera is deployed at 70%+ of large hotel chains. Its export format has idiosyncrasies: rate fields default to 0 (not NULL) on comp rooms, phone fields use property-specific formats, loyalty tier names vary by brand. Any GE ruleset must be calibrated to Opera's specific behavior — not generic SQL patterns.

📧

Email as a Marketing Revenue Multiplier

Omni's email marketing generates $8.50 per valid address per year in direct booking revenue. At 4% invalid email rate across 2.3M loyalty members, that's $782K/year in unreachable guests. Email validation in the ETL pipeline pays for the entire data quality program within 3 months.

📅

Date Consistency Errors Compound

A checkout-before-checkin error doesn't just affect one record. Revenue management models use rolling averages of stay duration. One -3 night stay (checkin April 12, checkout April 9) corrupts the average, distorting rate optimization algorithms that rely on accurate length-of-stay distribution. The cascade is invisible unless you measure it.

🧮

Statistical vs. Rule-Based Validation

Rule-based checks (GE expectations) catch known violation types. Statistical checks (Z-score on rate distribution, Kullback-Leibler divergence on daily reservation volumes) catch unknown shifts — like an OTA suddenly sending room types in a new format that passes all explicit rules but breaks downstream category models.

Production Code

Four patterns covering the full data quality stack: validation, profiling, anomaly detection on metrics, and alerting.

Great Expectations Suite for Hotel Reservations (Python)

import great_expectations as gx
from great_expectations.core.batch import RuntimeBatchRequest

context = gx.get_context()

suite = context.add_or_update_expectation_suite("omni_reservations_suite")

def build_expectation_suite(validator):
    # Uniqueness
    validator.expect_column_values_to_be_unique("res_id")

    # Completeness — structurally required fields
    for col in ["guest_name", "check_in", "check_out", "property", "room_type"]:
        validator.expect_column_values_to_not_be_null(col)

    # Validity — rate range ($89–$1500 covers standard to Presidential)
    validator.expect_column_values_to_be_between("rate", min_value=89, max_value=1500)

    # Validity — email regex
    validator.expect_column_values_to_match_regex(
        "email", regex=r"^[\w.+-]+@[\w-]+\.[\w.]+$"
    )

    # Validity — loyalty tier set
    validator.expect_column_values_to_be_in_set(
        "loyalty",
        value_set=["Select Guest", "Gold", "Platinum"]
    )

    # Consistency — checkout after checkin (custom SQL expectation)
    validator.expect_column_pair_values_to_be_equal(
        column_A="check_out", column_B="check_in",
        or_equal=False,
        meta={"description": "check_out must be after check_in"}
    )

    return validator.get_expectation_suite()

Automated Data Profiling with pandas-profiling (Python)

import pandas as pd
from ydata_profiling import ProfileReport
import snowflake.connector

def profile_reservations(date_range: tuple[str, str]) -> ProfileReport:
    conn = snowflake.connector.connect(
        user="etl_service", account="omni-hotels.us-east-1",
        warehouse="ANALYTICS_WH", database="OPERA_PROD"
    )
    df = pd.read_sql(f"""
        SELECT res_id, guest_name, email, phone, check_in, check_out,
               room_type, rate, loyalty_tier, property_code
        FROM   reservations
        WHERE  created_date BETWEEN '{date_range[0]}' AND '{date_range[1]}'
    """, conn)

    profile = ProfileReport(
        df,
        title=f"Omni Reservations {date_range[0]} to {date_range[1]}",
        explorative=True,
        correlations={"auto": {"calculate": True}},
        missing_diagrams={"heatmap": True, "dendrogram": True}
    )

    # Key stats: missing rates, value distributions, correlations
    completeness = (1 - df.isnull().mean()) * 100
    print(completeness.sort_values())          # spot bottom fields
    print(df["rate"].describe())              # outlier check
    return profile

Statistical Alert on Daily Quality Score Regression (Python)

import numpy as np
from dataclasses import dataclass

@dataclass
class QualityAlert:
    date: str; score: float; z_score: float; prev_scores: list[float]

class QualityRegressionDetector:
    """Detect unusual drops in daily DQ score using rolling Z-Score.
    Fires if today's score is >2.5 sigma below the rolling mean."""
    def __init__(self, window: int = 14, threshold: float = 2.5):
        self.window = window; self.threshold = threshold
        self._history: list[tuple[str, float]] = []

    def update(self, date: str, score: float) -> QualityAlert | None:
        self._history.append((date, score))
        if len(self._history) <= self.window:
            return None

        recent = [s for _, s in self._history[-self.window - 1:-1]]
        mu = np.mean(recent); sigma = np.std(recent, ddof=1)
        if sigma == 0: return None

        z = (score - mu) / sigma
        if z < -self.threshold:
            return QualityAlert(date=date, score=score, z_score=z, prev_scores=recent)
        return None

# Usage: called nightly after GE checkpoint
detector = QualityRegressionDetector(window=14)
for date, score in daily_scores:  # from DQ metrics table
    alert = detector.update(date, score)
    if alert: send_pagerduty(alert)   # page duty manager

Quarantine + Remediation Router (Python)

from enum import Enum
from dataclasses import dataclass, field

class Severity(str, Enum):
    CRITICAL = "critical"   # blocks load, pages duty manager
    HIGH     = "high"       # quarantines row, creates Jira P1
    MEDIUM   = "medium"     # loads with quality flag, Jira P2
    LOW      = "low"        # logs only, next-sprint backlog

DIMENSION_SLA = {
    "CONSISTENCY": Severity.CRITICAL,  # checkout < checkin
    "RANGE":       Severity.HIGH,       # rate = $0 or $5000+
    "UNIQUE":      Severity.HIGH,       # duplicate res_id
    "NOT NULL":    Severity.MEDIUM,     # missing required field
    "REGEX":       Severity.MEDIUM,     # invalid email/phone
    "SET":         Severity.LOW,        # unexpected loyalty tier
    "DATE PAST":   Severity.LOW,        # future date in historical extract
}

@dataclass
class RemediationAction:
    res_id: str; rule_type: str; severity: Severity; action: str

def route_failures(failures: list[dict]) -> list[RemediationAction]:
    actions = []
    for f in failures:
        sev = DIMENSION_SLA.get(f["rule_type"], Severity.LOW)
        action_map = {
            Severity.CRITICAL: "BLOCK_LOAD|PAGE_MANAGER",
            Severity.HIGH:     "QUARANTINE|CREATE_JIRA_P1",
            Severity.MEDIUM:   "LOAD_FLAGGED|CREATE_JIRA_P2",
            Severity.LOW:      "LOAD_FLAGGED|LOG_ONLY",
        }
        actions.append(RemediationAction(
            res_id=f["res_id"], rule_type=f["rule_type"],
            severity=sev, action=action_map[sev]
        ))
    return actions

About This Demo

A Great Expectations-style validation suite running against real Omni Hotels reservation patterns.

🏨 Omni Hotels Data Quality Framework

Eight validation rules across five CVUTC dimensions, animated rule-by-rule execution, SVG quality score ring, and per-column failure attribution. Mirrors production Opera PMS → Snowflake pipeline patterns.

Stack: JavaScript · Great Expectations patterns · Django 5.1 · Hospitality Data Engineering