Data Quality Monitoring: Anomaly Detection and Data Profiling
Who this is for:
Architecture / Concept Overview
Lakehouse Monitoring attaches to a Unity Catalog table and, on each scheduled or manually triggered run, profiles the data, compares distributions against a baseline, and writes the results to metric tables that you can query and alert on.
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
TBL[Unity Catalog Table] --> MONITOR[Lakehouse Monitor]
MONITOR --> PROFILE[Profile Metrics<br/>Null rate · Distinct count · Distribution]
MONITOR --> DRIFT[Drift Detection<br/>Baseline vs current comparison]
MONITOR --> ANOMALY[Anomaly Detection<br/>Statistical outlier identification]
PROFILE --> METRICS_TBL[Metric Tables<br/>Queryable results]
DRIFT --> METRICS_TBL
ANOMALY --> METRICS_TBL
METRICS_TBL --> DASH[Quality Dashboard]
METRICS_TBL --> ALERT[SQL Alerts]
TBL:::storage
MONITOR:::governance
PROFILE:::processing
DRIFT:::processing
ANOMALY:::processing
METRICS_TBL:::storage
DASH:::serving
ALERT:::serving
*Figure 1 — Lakehouse Monitoring profiles tables, detects drift and anomalies, writes results to metric tables, and feeds dashboards and alerts.*
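To make the three metric families in Figure 1 concrete, the following is a minimal pure-Python sketch of the kind of statistics involved: a null rate and distinct count for profiling, a relative change against a baseline for drift, and a z-score test for anomalies. The function names and the threshold of 3 standard deviations are hypothetical and chosen for illustration; this is not how Lakehouse Monitoring computes its metrics internally.

```python
from statistics import mean, stdev

def profile_column(values):
    """Return basic profile metrics for one column (None marks a null)."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    return {
        "count": total,
        "null_rate": (total - len(non_null)) / total if total else 0.0,
        "distinct_count": len(set(non_null)),
    }

def relative_drift(baseline, current):
    """Relative change of a metric compared with its baseline value."""
    if baseline == 0:
        return float("inf") if current != 0 else 0.0
    return abs(current - baseline) / abs(baseline)

def is_anomaly(history, latest, z_threshold=3.0):
    """Flag `latest` when it lies more than z_threshold standard
    deviations away from the mean of the historical values."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    sigma = stdev(history)
    if sigma == 0:
        return latest != history[0]
    return abs(latest - mean(history)) / sigma > z_threshold

# Illustrative data only: a baseline snapshot and a current snapshot.
baseline = profile_column(["a", "b", "c", "a", None])
current = profile_column(["a", "b", None, "a", None])
drift = relative_drift(baseline["null_rate"], current["null_rate"])

# Daily row counts (hypothetical) with one suspicious new value.
row_counts = [100, 102, 98, 101, 99]
spike = is_anomaly(row_counts, 150)
```

In a real deployment these values would come from the metric tables that the monitor writes, and a SQL alert would play the role of the `is_anomaly` check.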
Lakehouse Monitoring supports three monitor types. Choose the one that matches the table's structure.
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
MON[Monitor Types] --> SNAP[Snapshot<br/>Profile entire table at each run]
MON --> TS[Time Series<br/>Track metrics over a time column]
MON --> INF[Inference<br/>ML model prediction monitoring]
SNAP --> SNAP_USE[Best for: dimension tables, lookup tables]
TS --> TS_USE[Best for: event tables, fact tables with timestamps]
INF --> INF_USE[Best for: ML prediction tables with ground truth]
MON:::governance
SNAP:::processing
TS:::processing
INF:::processing
SNAP_USE:::serving
TS_USE:::serving
INF_USE:::serving
*Figure 2 — Three monitor types: snapshot for static tables, time series for temporal data, inference for ML model monitoring.*
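The decision in Figure 2 can be expressed as a small heuristic. The sketch below assumes you already know whether the table has a prediction column and a timestamp column; `suggest_monitor_type` is a hypothetical helper for illustration, not part of any Databricks API.

```python
from typing import Optional

def suggest_monitor_type(timestamp_col: Optional[str],
                         prediction_col: Optional[str]) -> str:
    """Map table characteristics to one of the three monitor types."""
    if prediction_col is not None:
        # ML prediction tables are covered by the inference type.
        return "inference"
    if timestamp_col is not None:
        # Event and fact tables with timestamps suit time series.
        return "time_series"
    # Dimension and lookup tables are profiled as a whole.
    return "snapshot"

dimension_table = suggest_monitor_type(None, None)
event_table = suggest_monitor_type("event_ts", None)
prediction_table = suggest_monitor_type("scored_at", "prediction")
```

The column names `event_ts`, `scored_at`, and `prediction` are placeholders. Checking the prediction column first reflects the assumption that a prediction table should use the inference type even when it also carries a timestamp.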
Key Terms
Prerequisites and Setup
- Unity Catalog enabled with managed or external Delta tables
- `MANAGE` privilege on the table to create a monitor
- `CREATE TABLE` privilege on the output schema for metric tables
- A SQL warehouse (serverless recommended for monitor execution)
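The two privileges above could be granted with statements like the ones this sketch generates. It assumes the standard Unity Catalog `GRANT ... ON ... TO ...` form; the helper `build_grants`, the table and schema names, and the principal are all hypothetical, so check the exact syntax against your workspace before running the output.

```python
from typing import List

def build_grants(table: str, output_schema: str, principal: str) -> List[str]:
    """Build GRANT statements for the monitor prerequisites.

    Assumption: principals are quoted with backticks, as in
    Databricks SQL. Names are passed through without validation.
    """
    quoted = f"`{principal}`"
    return [
        f"GRANT MANAGE ON TABLE {table} TO {quoted}",
        f"GRANT CREATE TABLE ON SCHEMA {output_schema} TO {quoted}",
    ]

# Placeholder identifiers for illustration only.
statements = build_grants("main.sales.orders", "main.monitoring",
                          "data-quality-team")
for stmt in statements:
    print(stmt)
```

The printed statements can be pasted into a SQL editor by someone who is allowed to grant these privileges.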
Step-by-Step Implementation
Configuration Reference
| Setting | Options | Default | Notes |
|---|---|---|---|
| Monitor type | Snapshot, Time Series, Inference | Required | Match to table structure |
| Granularities | 1 day, 1 week, 1 month, etc. | 1 day | Time windows for time series monitors |
| Timestamp column | Any timestamp/date column | Required for time series | Column used for windowing |
| Output schema | Any UC schema | Required | Where metric tables are stored |
| Schedule | Cron expression | Manual | How often the monitor runs |
| Baseline table | A Delta table | None | Optional reference for drift detection |
| Slicing expressions | Column expressions | None | Group metrics by a dimension column |
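The settings in the table can be captured in a small configuration object that enforces the required fields and defaults before a monitor is created. This is a sketch under stated assumptions: `MonitorConfig` and its field names are hypothetical and do not mirror the Databricks SDK, and the rules encoded are only those listed in the table above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MONITOR_TYPES = {"snapshot", "time_series", "inference"}

@dataclass
class MonitorConfig:
    monitor_type: str                      # required
    output_schema: str                     # required, e.g. "catalog.schema"
    timestamp_col: Optional[str] = None    # required for time series
    granularities: List[str] = field(default_factory=lambda: ["1 day"])
    schedule_cron: Optional[str] = None    # None means manual runs
    baseline_table: Optional[str] = None   # optional drift reference
    slicing_exprs: List[str] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the config is usable."""
        errors = []
        if self.monitor_type not in MONITOR_TYPES:
            errors.append(f"unknown monitor type: {self.monitor_type!r}")
        if not self.output_schema:
            errors.append("output_schema is required")
        if self.monitor_type == "time_series" and not self.timestamp_col:
            errors.append("timestamp_col is required for time series monitors")
        return errors

# Placeholder names for illustration only.
ok = MonitorConfig("time_series", "main.monitoring", timestamp_col="event_ts")
bad = MonitorConfig("time_series", "main.monitoring")
ok_errors = ok.validate()
bad_errors = bad.validate()
```

Returning a list of errors instead of raising on the first one lets a caller report every misconfigured setting at once.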