Structured Streaming: Real-Time Data Processing with Spark
Who this is for:
Architecture / Concept Overview: Structured Streaming: Real-Time Data Processing with Spark
Structured Streaming processes data in micro-batches (or continuously), reading new data from sources, applying transformations, and writing results to sinks. The engine manages offsets and checkpoints to guarantee exactly-once processing across failures.
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
KF[Kafka]:::source --> SS[Structured Streaming Engine]:::processing
AL[Auto Loader]:::source --> SS
DT[Delta Table CDF]:::source --> SS
SS --> CK[Checkpoint Store]:::governance
SS --> D1[Delta Lake Sink]:::storage
SS --> D2[Kafka Sink]:::serving
SS --> D3[Console / Memory Sink]:::serving
*Structured Streaming reads from multiple sources, processes through the Spark engine, and writes to sinks with checkpoint-based exactly-once semantics.*
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
TRIG[Trigger Modes]:::processing
TRIG --> PT[processingTime - fixed interval]:::processing
TRIG --> AN[availableNow - process all then stop]:::ingestion
TRIG --> CONT[continuous - experimental low-latency]:::serving
TRIG --> ONCE[once - deprecated, use availableNow]:::source
*Trigger modes control how frequently Structured Streaming processes data.*
Key Terms
Prerequisites and Setup
- Databricks Runtime 13.3 LTS or later.
- Write access to a checkpoint location in cloud storage.
- A streaming source (Kafka, Auto Loader, Delta table with CDF, or rate source for testing).
Step-by-Step Implementation
Configuration Reference
| Parameter | Description | Default |
|---|---|---|
trigger(processingTime=...) | Micro-batch interval | None (runs as fast as possible) |
trigger(availableNow=True) | Process all available data then stop | false |
outputMode | append, complete, or update | append |
checkpointLocation | Path for checkpoint state | Required |
spark.sql.streaming.stateStore.providerClass | State store implementation | HDFSBackedStateStoreProvider |
spark.sql.streaming.noDataProgressEventInterval | Interval for no-data progress events | 10000ms |