Incremental and Streaming Workloads on Delta Lake

    Who this is for:

    Architecture / Concept Overview: Incremental and Streaming Workloads on Delta Lake

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    flowchart LR
        classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        FILES[Cloud Files] -->|Auto Loader| BRONZE[Bronze Delta]
        KAFKA[Kafka / Event Hub] -->|readStream| BRONZE
        BRONZE -->|readStream + CDF| SILVER[Silver Delta]
        SILVER -->|readStream + CDF| GOLD[Gold Delta]
        GOLD -->|Serve| BI[BI / ML]
        FILES:::source
        KAFKA:::source
        BRONZE:::ingestion
        SILVER:::processing
        GOLD:::storage
        BI:::serving

    *A streaming lakehouse pipeline chains Delta tables together: each downstream layer reads incrementally from the table upstream of it via Change Data Feed.*
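
One hop of the chain (for example Bronze to Silver) might look like the sketch below. It assumes an existing `spark` session and that Change Data Feed is already enabled on the source table; the function names, table names, and checkpoint path are hypothetical placeholders, not part of the original pipeline.

```python
def cdf_read_options(starting_version=None):
    """Options for a streaming read of a Delta table's Change Data Feed."""
    opts = {"readChangeFeed": "true"}
    if starting_version is not None:
        opts["startingVersion"] = str(starting_version)
    return opts


def stream_layer(spark, source_table, target_table, checkpoint_path,
                 starting_version=None):
    """Append new and updated rows from source_table to target_table.

    `spark` is an existing SparkSession; nothing is created here.
    """
    changes = (
        spark.readStream
        .options(**cdf_read_options(starting_version))
        .table(source_table)
    )
    # Change Data Feed adds metadata columns such as _change_type.
    # This sketch keeps inserts and post-update images and ignores deletes.
    rows = (
        changes
        .filter("_change_type IN ('insert', 'update_postimage')")
        .drop("_change_type", "_commit_version", "_commit_timestamp")
    )
    return (
        rows.writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)
        .toTable(target_table)
    )
```

Appending post-update images is the simplest possible sink. A layer that must apply updates and deletes in place would more likely merge each micro-batch into the target, which this sketch does not attempt.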

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    graph TD
        classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        TRIGGER[Trigger Modes] --> CONT[processingTime]
        TRIGGER --> ONCE[once / availableNow]
        TRIGGER --> MICRO[Fixed interval micro-batch]
        CONT -->|Low latency, always-on| USE1[Real-time dashboards]
        ONCE -->|Cost-efficient batch| USE2[Scheduled ETL]
        MICRO -->|Balanced| USE3[Near-real-time pipelines]
        TRIGGER:::governance
        CONT:::processing
        ONCE:::source
        MICRO:::ingestion
        USE1:::serving
        USE2:::serving
        USE3:::serving

    *Different trigger modes let you balance latency requirements against compute cost.*
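
As a rough illustration of that trade-off, the sketch below maps a latency profile to the keyword arguments accepted by `writeStream.trigger()`. The profile names `"scheduled"` and `"interval"`, the helper names, and the default interval are assumptions made for this example only.

```python
def trigger_kwargs(mode, interval="1 minute"):
    """Map a (hypothetical) profile name to writeStream.trigger() arguments."""
    if mode == "scheduled":
        # Process everything currently available, then stop.
        return {"availableNow": True}
    if mode == "interval":
        # Start a micro-batch at a fixed interval while the query keeps running.
        return {"processingTime": interval}
    raise ValueError(f"unknown trigger mode: {mode!r}")


def start_stream(df, target_table, checkpoint_path, mode="scheduled"):
    """Start a streaming write of `df` (a streaming DataFrame) to a Delta table."""
    return (
        df.writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(**trigger_kwargs(mode))
        .toTable(target_table)
    )
```

A scheduled job would call `start_stream(df, ..., mode="scheduled")` and let the cluster shut down afterwards, while an always-on pipeline would use `mode="interval"` on a long-running cluster.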

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    flowchart LR
        classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        AL[Auto Loader] -->|Discovers new files| NOTIFY[File Notification / Listing]
        NOTIFY -->|Schema inference| SCHEMA[Schema Store]
        SCHEMA -->|Parse & transform| WRITE[Write to Delta]
        WRITE -->|Checkpoint| CP[Checkpoint Location]
        AL:::ingestion
        NOTIFY:::source
        SCHEMA:::governance
        WRITE:::storage
        CP:::processing

    *Auto Loader combines file discovery, schema inference, and exactly-once writes in a single streaming source.*
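
The flow in the diagram could be wired up roughly as follows. This is a minimal sketch that assumes a Databricks runtime (the `cloudFiles` source is not available in open-source Spark) and an existing `spark` session; the helper names and every path or table name passed in are hypothetical.

```python
def auto_loader_options(file_format, schema_location, use_notifications=False):
    """Build the Auto Loader reader options for one source directory."""
    return {
        "cloudFiles.format": file_format,
        # Where Auto Loader stores the schema it infers.
        "cloudFiles.schemaLocation": schema_location,
        # "false" means directory listing, "true" means file notifications.
        "cloudFiles.useNotifications": str(use_notifications).lower(),
    }


def ingest_to_bronze(spark, source_path, bronze_table, schema_location,
                     checkpoint_path, file_format="json"):
    """Stream newly arrived files from source_path into a Bronze Delta table."""
    raw = (
        spark.readStream
        .format("cloudFiles")
        .options(**auto_loader_options(file_format, schema_location))
        .load(source_path)
    )
    return (
        raw.writeStream
        # The checkpoint records which files were already processed.
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)
        .toTable(bronze_table)
    )
```

Keeping the option builder separate from the stream definition makes it easy to check the configuration without a cluster.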

    Key Terms

    Prerequisites and Setup

    • Databricks workspace with Unity Catalog enabled
    • A cluster running Databricks Runtime 13.3 LTS or later
    • Source data landing in cloud storage or a streaming system (Kafka, Event Hubs)
    • Target Delta tables with Change Data Feed enabled for downstream streaming reads
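
The last prerequisite can be met by setting a table property. The sketch below builds the statement and runs it on an existing `spark` session; the helper names and the table name `main.sales.orders_bronze` are hypothetical and used only for illustration.

```python
def enable_cdf_statement(table_name):
    """Build the ALTER TABLE statement that turns on Change Data Feed."""
    return (
        f"ALTER TABLE {table_name} "
        "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
    )


def enable_cdf(spark, table_name):
    """Run the statement on an existing SparkSession (not created here)."""
    spark.sql(enable_cdf_statement(table_name))


# Hypothetical three-level Unity Catalog name, for illustration only.
print(enable_cdf_statement("main.sales.orders_bronze"))
```

Changes are recorded only from the table version at which the property is enabled, so it is worth setting before downstream streams are first started.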

    Step-by-Step Implementation

      Configuration Reference

      Incremental and Streaming Workloads on Delta Lake configuration options
      Property                       Default   Description
      cloudFiles.format              -         Source file format (json, csv, parquet, avro)
      cloudFiles.schemaLocation      -         Path to store inferred schema for Auto Loader
      cloudFiles.useNotifications    false     Use cloud-native file notifications instead of directory listing
      readChangeFeed                 false     Enable CDF-based streaming read
      startingVersion                latest    CDF stream start version
      maxFilesPerTrigger             1000      Throttle files processed per micro-batch
      maxBytesPerTrigger             -         Throttle bytes processed per micro-batch
      trigger(availableNow=True)     -         Process all available data then stop
      trigger(processingTime=...)    -         Fixed-interval micro-batches
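
To show how the two throttling properties from the table might be supplied, the sketch below builds reader options for a Delta streaming source. The helper name and the example limits are assumptions; suitable values depend on cluster size and file layout.

```python
def rate_limit_options(max_files=None, max_bytes=None):
    """Build throttling options for a Delta streaming read.

    Only the limits that are explicitly given are set, so the
    source's defaults apply to anything left as None.
    """
    opts = {}
    if max_files is not None:
        if int(max_files) <= 0:
            raise ValueError("max_files must be a positive integer")
        opts["maxFilesPerTrigger"] = str(int(max_files))
    if max_bytes is not None:
        # Passed through unchanged as a string.
        opts["maxBytesPerTrigger"] = str(max_bytes)
    return opts


def throttled_read(spark, source_table, max_files=None, max_bytes=None):
    """Open a streaming read of source_table with the given rate limits."""
    return (
        spark.readStream
        .options(**rate_limit_options(max_files, max_bytes))
        .table(source_table)
    )
```

Smaller limits give shorter, more predictable micro-batches at the cost of more batches overall.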

      Monitoring, Cost, and Security Considerations

      Common Pitfalls and Recommended Patterns

        Frequently Asked Questions