Data Quality Monitoring: Anomaly Detection and Data Profiling

    Who this is for:

    Architecture / Concept Overview

    Lakehouse Monitoring attaches to Unity Catalog tables and continuously profiles data, compares distributions against baselines, and writes results to metric tables you can query and alert on.

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    flowchart LR
        classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        TBL[Unity Catalog Table] --> MONITOR[Lakehouse Monitor]
        MONITOR --> PROFILE[Profile Metrics<br/>Null rate · Distinct count · Distribution]
        MONITOR --> DRIFT[Drift Detection<br/>Baseline vs current comparison]
        MONITOR --> ANOMALY[Anomaly Detection<br/>Statistical outlier identification]
        PROFILE --> METRICS_TBL[Metric Tables<br/>Queryable results]
        DRIFT --> METRICS_TBL
        ANOMALY --> METRICS_TBL
        METRICS_TBL --> DASH[Quality Dashboard]
        METRICS_TBL --> ALERT[SQL Alerts]
        TBL:::storage
        MONITOR:::governance
        PROFILE:::processing
        DRIFT:::processing
        ANOMALY:::processing
        METRICS_TBL:::storage
        DASH:::serving
        ALERT:::serving

    *Figure 1 — Lakehouse Monitoring profiles tables, detects drift and anomalies, writes results to metric tables, and feeds dashboards and alerts.*
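To make the profile and drift boxes in Figure 1 concrete, here is a minimal, standalone Python sketch of the kind of metrics involved: null rate, distinct count, and a value distribution for one column, plus one possible baseline-versus-current comparison. It is an illustration only and does not reproduce how Lakehouse Monitoring computes its metrics. The function names (`profile_column`, `total_variation_distance`) and the sample data are hypothetical, and total variation distance is just one simple drift statistic chosen for this example.

```python
from collections import Counter
from typing import Any, Dict, Optional, Sequence


def _frequencies(values: Sequence[Any]) -> Dict[Any, float]:
    """Relative frequency of each value; empty input gives an empty dict."""
    n = len(values)
    if n == 0:
        return {}
    return {value: count / n for value, count in Counter(values).items()}


def profile_column(values: Sequence[Optional[Any]]) -> dict:
    """Compute simple profile metrics for one column (None marks a null)."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    return {
        "count": total,
        "null_rate": (total - len(non_null)) / total if total else 0.0,
        "distinct_count": len(set(non_null)),
        "distribution": _frequencies(non_null),
    }


def total_variation_distance(p: Dict[Any, float], q: Dict[Any, float]) -> float:
    """Half the L1 distance between two categorical distributions.

    0.0 means identical distributions, 1.0 means no overlap at all.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical sample data: a baseline window and a current window of a country column.
baseline = profile_column(["US", "US", "DE", "FR"])
current = profile_column(["US", "DE", "DE", None])

# Drift score between the two windows (illustrative statistic, see lead-in).
drift = total_variation_distance(baseline["distribution"], current["distribution"])
print(f"null_rate={current['null_rate']:.2f} drift={drift:.3f}")
```

In a real deployment these numbers would come from the monitor's metric tables rather than from hand-rolled code; the sketch only shows what a profile metric and a drift score represent.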

    Lakehouse Monitoring supports three monitor types; choose the one that matches the table's structure.

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    graph TD
        classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        MON[Monitor Types] --> SNAP[Snapshot<br/>Profile entire table at each run]
        MON --> TS[Time Series<br/>Track metrics over a time column]
        MON --> INF[Inference<br/>ML model prediction monitoring]
        SNAP --> SNAP_USE[Best for: dimension tables, lookup tables]
        TS --> TS_USE[Best for: event tables, fact tables with timestamps]
        INF --> INF_USE[Best for: ML prediction tables with ground truth]
        MON:::governance
        SNAP:::processing
        TS:::processing
        INF:::processing
        SNAP_USE:::serving
        TS_USE:::serving
        INF_USE:::serving

    *Figure 2 — Three monitor types: snapshot for static tables, time series for temporal data, inference for ML model monitoring.*
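The decision in Figure 2 can be written down as a small rule of thumb. The sketch below is a hypothetical helper, not part of any Databricks API: it assumes you know whether the table has a prediction column and whether it has a timestamp column, and it applies the "best for" guidance from the figure in that order.

```python
from enum import Enum
from typing import Optional


class MonitorType(Enum):
    SNAPSHOT = "snapshot"
    TIME_SERIES = "time_series"
    INFERENCE = "inference"


def suggest_monitor_type(
    timestamp_col: Optional[str] = None,
    prediction_col: Optional[str] = None,
) -> MonitorType:
    """Suggest a monitor type from the table's shape (illustrative heuristic)."""
    # ML prediction tables are the inference case in Figure 2.
    if prediction_col is not None:
        return MonitorType.INFERENCE
    # Event and fact tables with a time column fit a time series monitor.
    if timestamp_col is not None:
        return MonitorType.TIME_SERIES
    # Dimension and lookup tables are profiled in full at each run.
    return MonitorType.SNAPSHOT


# Hypothetical column names, for illustration only.
print(suggest_monitor_type())                                   # dimension table
print(suggest_monitor_type(timestamp_col="event_ts"))           # event table
print(suggest_monitor_type("scored_at", prediction_col="pred")) # prediction table
```

Treat the result as a starting point; a fact table with a timestamp can still be monitored as a snapshot if windowed metrics are not needed.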

    Key Terms

    Prerequisites and Setup

    • Unity Catalog enabled with managed or external Delta tables
    • MANAGE privilege on the table to create a monitor
    • CREATE TABLE privilege on the output schema for metric tables
    • A SQL warehouse (serverless recommended for monitor execution)

    Step-by-Step Implementation

      Configuration Reference

      Data Quality Monitoring: Anomaly Detection and Data Profiling configuration options
      | Setting | Options | Default | Notes |
      | --- | --- | --- | --- |
      | Monitor type | Snapshot, Time Series, Inference | Required | Match to table structure |
      | Granularities | 1 day, 1 week, 1 month, etc. | 1 day | Time windows for time series monitors |
      | Timestamp column | Any timestamp/date column | Required for time series | Column used for windowing |
      | Output schema | Any UC schema | Required | Where metric tables are stored |
      | Schedule | Cron expression | Manual | How often the monitor runs |
      | Baseline table | A Delta table | None | Optional reference for drift detection |
      | Slicing expressions | Column expressions | None | Group metrics by a dimension column |
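One way to keep these settings straight is to model them as a plain data structure and check the required fields before creating a monitor. The sketch below is a hypothetical `MonitorConfig` dataclass written for this article. It mirrors the table's settings and defaults but is not the Databricks SDK or API, and the table and schema names are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Monitor types from the configuration table (identifiers chosen for this sketch).
MONITOR_TYPES = {"snapshot", "time_series", "inference"}


@dataclass
class MonitorConfig:
    table_name: str
    monitor_type: str                       # required
    output_schema: str                      # required: where metric tables are stored
    timestamp_col: Optional[str] = None     # required for time series monitors
    granularities: List[str] = field(default_factory=lambda: ["1 day"])
    schedule_cron: Optional[str] = None     # None means manual runs
    baseline_table: Optional[str] = None    # optional reference for drift detection
    slicing_exprs: List[str] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Return a list of human-readable problems; an empty list means OK."""
        errors = []
        if self.monitor_type not in MONITOR_TYPES:
            errors.append(f"unknown monitor type: {self.monitor_type!r}")
        if not self.output_schema:
            errors.append("output_schema is required")
        if self.monitor_type == "time_series" and not self.timestamp_col:
            errors.append("timestamp_col is required for time series monitors")
        return errors


# Hypothetical names: a time series monitor that is missing its timestamp column.
cfg = MonitorConfig(
    table_name="main.sales.orders",
    monitor_type="time_series",
    output_schema="main.monitoring",
)
problems = cfg.validate()
print(problems)

# Supplying the timestamp column resolves the problem.
cfg.timestamp_col = "order_ts"
print(cfg.validate())
```

Validating up front like this catches the most common misconfiguration in the table, a time series monitor without a timestamp column, before any monitor run is attempted.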

      Monitoring, Cost, and Security Considerations

      Common Pitfalls and Recommended Patterns

        Frequently Asked Questions