Data Quality Expectations: Validating Data in Pipelines
Who this is for:
Architecture / Concept Overview
Expectations are declarative data quality rules attached to table definitions. Each rule is a SQL boolean expression that is evaluated against every record during processing. When a record violates a rule, the pipeline takes one of three actions: it logs the violation and keeps the row (warn), drops the violating row, or fails the pipeline update. Results are captured in the pipeline event log for monitoring and alerting.
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
SRC[Incoming Records]:::source --> E1[expect: log warning]:::governance
SRC --> E2[expect_or_drop: remove row]:::governance
SRC --> E3[expect_or_fail: halt pipeline]:::governance
E1 --> PASS[Passed Records]:::storage
E1 --> LOG1[Event Log: warning]:::governance
E2 --> PASS
E2 --> LOG2[Event Log: dropped]:::governance
E3 --> PASS
E3 --> FAIL[Pipeline Failure]:::source
*Three expectation modes and their impact on data flow.*
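The three modes can be modeled in plain Python to make their effect on data flow concrete. This is a minimal sketch of the semantics, not the pipeline runtime's implementation: the `Expectation` class, the `apply_expectations` function, and the sample records are hypothetical names chosen for illustration, and predicates are Python callables rather than SQL expressions.

```python
from dataclasses import dataclass
from typing import Callable


class ExpectationFailed(Exception):
    """Raised when a 'fail' expectation is violated (models a halted update)."""


@dataclass
class Expectation:
    name: str
    predicate: Callable[[dict], bool]  # True means the record is valid
    action: str = "warn"               # "warn", "drop", or "fail"


def apply_expectations(records, expectations):
    """Return (passed_records, metrics), mimicking the three expectation modes."""
    metrics = {e.name: {"passed": 0, "failed": 0} for e in expectations}
    passed = []
    for record in records:
        keep = True
        for e in expectations:
            if e.predicate(record):
                metrics[e.name]["passed"] += 1
                continue
            metrics[e.name]["failed"] += 1
            if e.action == "fail":
                raise ExpectationFailed(f"{e.name} violated by {record!r}")
            if e.action == "drop":
                keep = False  # row is removed from the output
            # "warn": the violation is counted, the row still passes through
        if keep:
            passed.append(record)
    return passed, metrics


# Hypothetical sample data: one row with a missing id, one with a negative amount.
records = [
    {"id": 1, "amount": 10},
    {"id": None, "amount": 5},
    {"id": 3, "amount": -1},
]
rules = [
    Expectation("non_negative_amount", lambda r: r["amount"] >= 0, "warn"),
    Expectation("has_id", lambda r: r["id"] is not None, "drop"),
]
passed, metrics = apply_expectations(records, rules)
print([r["id"] for r in passed])  # [1, 3]
print(metrics)
```

The row with the negative amount survives because its rule only warns, while the row with the missing id is removed. Both violations still appear in the metrics, which plays the role of the event log in this model.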
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
DQ[Data Quality Strategy]:::governance
DQ --> BRZ[Bronze Layer]:::storage
DQ --> SLV[Silver Layer]:::processing
DQ --> GLD[Gold Layer]:::serving
BRZ --> B1[expect: log schema issues]:::governance
SLV --> S1[expect_or_drop: remove invalid rows]:::governance
SLV --> S2[expect: warn on business rules]:::governance
GLD --> G1[expect_or_fail: enforce hard constraints]:::governance
*Recommended expectation strategy across medallion layers.*
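The layered strategy might look roughly like the following in Python. This is a sketch under assumptions: the table names, column names, and rule expressions are hypothetical, and the definitions are wrapped in a `define_pipeline` function (also a hypothetical name) that receives the runtime-provided `dlt` module and `spark` session as arguments, so the file can be imported outside a pipeline.

```python
def define_pipeline(dlt, spark):
    """Register one table per medallion layer.

    `dlt` and `spark` are the module and session supplied by the pipeline
    runtime; they are passed in explicitly so this file also imports cleanly
    outside a pipeline.
    """

    # Bronze: keep every row, only record schema problems.
    @dlt.table(name="orders_bronze")
    @dlt.expect("has_order_id", "order_id IS NOT NULL")
    def orders_bronze():
        return spark.read.table("raw_orders")  # hypothetical source table

    # Silver: drop structurally invalid rows, warn on a business rule.
    @dlt.table(name="orders_silver")
    @dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
    @dlt.expect("reasonable_amount", "amount BETWEEN 0 AND 100000")
    def orders_silver():
        return dlt.read("orders_bronze")

    # Gold: a violated hard constraint stops the update.
    @dlt.table(name="orders_per_customer")
    @dlt.expect_or_fail("has_customer", "customer_id IS NOT NULL")
    def orders_per_customer():
        return dlt.read("orders_silver").groupBy("customer_id").count()
```

In a pipeline source file this would be invoked once at module level, for example `import dlt` followed by `define_pipeline(dlt, spark)`. The same null check appears as a warning in bronze and as a drop rule in silver, so the bronze event log shows how many bad rows arrived before silver removes them.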
Key Terms
Prerequisites and Setup
- A Lakeflow Declarative Pipeline with at least one table defined.
- Unity Catalog enabled for the pipeline.
- Familiarity with SQL boolean expressions for defining quality rules.
Step-by-Step Implementation
Configuration Reference
| Expectation Type | Python Decorator | SQL Syntax | Behaviour |
|---|---|---|---|
| Warn | `@dlt.expect("name", "expr")` | `CONSTRAINT name EXPECT (expr)` | Logs the violation and passes the row through |
| Drop | `@dlt.expect_or_drop("name", "expr")` | `CONSTRAINT name EXPECT (expr) ON VIOLATION DROP ROW` | Drops the violating row and records the drop in the event log |
| Fail | `@dlt.expect_or_fail("name", "expr")` | `CONSTRAINT name EXPECT (expr) ON VIOLATION FAIL UPDATE` | Halts the pipeline update |
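For pipelines that generate SQL table definitions programmatically, the three rows above map onto a single clause shape. The helper below is a hypothetical sketch: `constraint_clause` and the `ON_VIOLATION` mapping are illustrative names, not a library API. It assumes the full SQL form names each rule with `CONSTRAINT <name>` ahead of `EXPECT`.

```python
# Suffix appended to the clause for each expectation type.
ON_VIOLATION = {
    "warn": "",
    "drop": " ON VIOLATION DROP ROW",
    "fail": " ON VIOLATION FAIL UPDATE",
}


def constraint_clause(name: str, expr: str, action: str = "warn") -> str:
    """Render one expectation as a SQL constraint clause."""
    if action not in ON_VIOLATION:
        raise ValueError(
            f"unknown action {action!r}; expected one of {sorted(ON_VIOLATION)}"
        )
    return f"CONSTRAINT {name} EXPECT ({expr}){ON_VIOLATION[action]}"


print(constraint_clause("valid_id", "id IS NOT NULL", "drop"))
# CONSTRAINT valid_id EXPECT (id IS NOT NULL) ON VIOLATION DROP ROW
```

Keeping the rule name, expression, and action as separate values makes it possible to store rules in one place and render them as either SQL clauses or Python decorators.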