Incremental and Streaming Workloads on Delta Lake
Who this is for:
Architecture / Concept Overview: Incremental and Streaming Workloads on Delta Lake
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
FILES[Cloud Files] -->|Auto Loader| BRONZE[Bronze Delta]
KAFKA[Kafka / Event Hub] -->|readStream| BRONZE
BRONZE -->|readStream + CDF| SILVER[Silver Delta]
SILVER -->|readStream + CDF| GOLD[Gold Delta]
GOLD -->|Serve| BI[BI / ML]
FILES:::source
KAFKA:::source
BRONZE:::ingestion
SILVER:::processing
GOLD:::storage
BI:::serving
*A streaming lakehouse pipeline chains Delta tables together: Bronze ingests from cloud files and message queues, and each downstream layer reads incrementally from the table above it via Change Data Feed.*
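The Bronze to Silver hop in the diagram can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the function names, table names, and checkpoint path are hypothetical, `spark` is assumed to be an active SparkSession on a runtime that supports `trigger(availableNow=True)`, and the sketch simply appends change rows (applying updates and deletes to Silver would need a merge step that is not shown).

```python
def cdf_reader_options(starting_version=None):
    """Build reader options for a streaming Change Data Feed read."""
    opts = {"readChangeFeed": "true"}
    if starting_version is not None:
        # Spark reader options are strings, so normalise ints here.
        opts["startingVersion"] = str(starting_version)
    return opts


def stream_silver_from_bronze(spark, bronze_table, silver_table,
                              checkpoint_path, starting_version=None):
    """Start an incremental Bronze -> Silver stream (hypothetical helper)."""
    changes = (
        spark.readStream
        .options(**cdf_reader_options(starting_version))
        .table(bronze_table)
    )
    # CDF rows carry _change_type, _commit_version and _commit_timestamp.
    # Drop pre-update images so each update appears only once downstream.
    cleaned = changes.filter("_change_type != 'update_preimage'")
    return (
        cleaned.writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)
        .toTable(silver_table)
    )
```

The same shape would apply to the Silver to Gold hop, with its own checkpoint location.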
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
TRIGGER[Trigger Modes] --> CONT[Default trigger, no interval]
TRIGGER --> ONCE[once / availableNow]
TRIGGER --> MICRO[processingTime, fixed-interval micro-batch]
CONT -->|Low latency, always-on| USE1[Real-time dashboards]
ONCE -->|Cost-efficient batch| USE2[Scheduled ETL]
MICRO -->|Balanced| USE3[Near-real-time pipelines]
TRIGGER:::governance
CONT:::processing
ONCE:::source
MICRO:::ingestion
USE1:::serving
USE2:::serving
USE3:::serving
*Different trigger modes let you balance latency requirements against compute cost.*
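One way to make the trigger choice explicit in code is a small helper like the one below. It is only a sketch: the mode names (`default`, `fixed_interval`, `available_now`) and both function names are made up for illustration, and `df` is assumed to be a streaming DataFrame. PySpark's `trigger()` expects exactly one keyword argument, so the default mode skips the call entirely.

```python
def trigger_kwargs(mode, interval="1 minute"):
    """Map a hypothetical mode name to DataStreamWriter.trigger() kwargs."""
    if mode == "default":
        # No explicit trigger: the next micro-batch starts as soon as
        # the previous one finishes.
        return {}
    if mode == "fixed_interval":
        return {"processingTime": interval}
    if mode == "available_now":
        # Process everything currently available, then stop.
        return {"availableNow": True}
    raise ValueError(f"unknown trigger mode: {mode!r}")


def start_stream(df, target_table, checkpoint_path,
                 mode="available_now", interval="1 minute"):
    """Start a streaming write to a Delta table with the chosen trigger."""
    writer = df.writeStream.option("checkpointLocation", checkpoint_path)
    kwargs = trigger_kwargs(mode, interval)
    if kwargs:
        writer = writer.trigger(**kwargs)
    return writer.toTable(target_table)
```

A scheduled job would typically use `available_now`, while an always-on cluster would use `default` or `fixed_interval` depending on how much latency is acceptable.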
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
AL[Auto Loader] -->|Discovers new files| NOTIFY[File Notification / Listing]
NOTIFY -->|Schema inference| SCHEMA[Schema Store]
SCHEMA -->|Parse & transform| WRITE[Write to Delta]
WRITE -->|Checkpoint| CP[Checkpoint Location]
AL:::ingestion
NOTIFY:::source
SCHEMA:::governance
WRITE:::storage
CP:::processing
*Auto Loader combines file discovery and schema inference in a single streaming source; paired with a checkpoint location, it writes each file to Delta exactly once.*
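The flow in the diagram might look like this in PySpark. Treat it as a sketch under assumptions: the helper names, paths, and table name are hypothetical, `spark` is assumed to be a Databricks SparkSession (the `cloudFiles` source is not available in open-source Spark), and JSON is chosen only as an example format.

```python
def auto_loader_options(file_format, schema_location, use_notifications=False):
    """Build the cloudFiles options for an Auto Loader read."""
    return {
        "cloudFiles.format": file_format,
        # Where Auto Loader persists the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
        # "true" switches from directory listing to file notifications.
        "cloudFiles.useNotifications": str(use_notifications).lower(),
    }


def ingest_to_bronze(spark, source_path, bronze_table, schema_location,
                     checkpoint_path, file_format="json"):
    """Start an Auto Loader stream into a Bronze Delta table (sketch)."""
    raw = (
        spark.readStream
        .format("cloudFiles")
        .options(**auto_loader_options(file_format, schema_location))
        .load(source_path)
    )
    return (
        raw.writeStream
        # The checkpoint records which files have already been processed.
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)
        .toTable(bronze_table)
    )
```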
Key Terms
Prerequisites and Setup
- Databricks workspace with Unity Catalog enabled
- A cluster running Databricks Runtime 13.3 LTS or later
- Source data landing in cloud storage or a streaming system (Kafka, Event Hubs)
- Target Delta tables with Change Data Feed enabled for downstream streaming reads
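For the last prerequisite, Change Data Feed is controlled by the `delta.enableChangeDataFeed` table property. A minimal sketch of turning it on for an existing table follows; the helper names and the table name in the usage note are placeholders, and `spark` is assumed to be an active SparkSession with permission to alter the table.

```python
def enable_cdf_sql(table_name):
    """Return the SQL statement that enables Change Data Feed on a table."""
    return (
        f"ALTER TABLE {table_name} "
        "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
    )


def enable_change_data_feed(spark, table_name):
    """Run the ALTER TABLE statement against the given session."""
    # Only changes committed after this point are recorded in the feed.
    spark.sql(enable_cdf_sql(table_name))
```

For example, `enable_change_data_feed(spark, "main.sales.bronze_orders")` would enable the feed on a hypothetical three-level Unity Catalog table name.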
Step-by-Step Implementation
Configuration Reference
| Property | Default | Description |
|---|---|---|
| cloudFiles.format | — | Source file format (for example json, csv, parquet, avro) |
| cloudFiles.schemaLocation | — | Path where Auto Loader stores the inferred schema |
| cloudFiles.useNotifications | false | Use cloud-native file notifications instead of directory listing |
| readChangeFeed | false | Read the table's Change Data Feed instead of its current rows |
| startingVersion | — | Table version at which the stream starts; when unset, the stream first emits the current table snapshot and then later changes. Set to latest to read only new changes |
| maxFilesPerTrigger | 1000 | Maximum number of new files processed per micro-batch |
| maxBytesPerTrigger | — | Soft limit on the bytes processed per micro-batch |
| trigger(availableNow=True) | — | Process all available data, then stop |
| trigger(processingTime=...) | — | Run micro-batches at a fixed interval |
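As an illustration of how the two throttling options in the table combine on a Delta streaming read, here is a small sketch. The helper names are hypothetical, `spark` is assumed to be an active SparkSession, and the byte limit is passed as a plain number of bytes.

```python
def rate_limit_options(max_files=None, max_bytes=None):
    """Build throttling options for a Delta streaming source."""
    opts = {}
    if max_files is not None:
        opts["maxFilesPerTrigger"] = str(max_files)
    if max_bytes is not None:
        # Soft limit: a micro-batch may slightly exceed this size.
        opts["maxBytesPerTrigger"] = str(max_bytes)
    return opts


def read_throttled(spark, table_name, max_files=None, max_bytes=None):
    """Return a streaming DataFrame over a Delta table with rate limits."""
    return (
        spark.readStream
        .options(**rate_limit_options(max_files, max_bytes))
        .table(table_name)
    )
```

Calling `read_throttled(spark, "bronze.events", max_files=200, max_bytes=1024**3)` would cap each micro-batch at roughly 200 files or 1 GiB, whichever limit is reached first.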