Data Engineering with Lakeflow
Who this is for:
Architecture / Concept Overview: Data Engineering with Lakeflow
Lakeflow brings three core capabilities under one roof: Lakeflow Connect for ingestion, Lakeflow Declarative Pipelines (formerly Delta Live Tables) for transformation, and Lakeflow Jobs for orchestration. Together they form an end-to-end data engineering stack built natively on the Lakehouse.
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
S1[Cloud Storage]:::source --> C[Lakeflow Connect]:::ingestion
S2[Databases]:::source --> C
S3[SaaS APIs]:::source --> C
S4[Kafka / Event Hubs]:::source --> C
C --> T[Declarative Pipelines]:::processing
T --> L[Unity Catalog / Delta Lake]:::storage
L --> J[Lakeflow Jobs]:::serving
J --> D[Dashboards & ML]:::serving
*Lakeflow end-to-end pipeline: sources flow through Connect, are transformed by Declarative Pipelines, stored in Delta Lake, and orchestrated by Jobs.*
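
The ordering in the diagram can be made concrete with a small plain-Python sketch. It does not call any Lakeflow API: it copies the diagram's edges into a dependency map and derives a run order in which every stage follows its upstream stages. The `DEPENDENCIES` map and the `run_order` helper are illustrative names, not part of the product.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key depends on the stages in its value set (edges copied from the diagram).
DEPENDENCIES = {
    "Lakeflow Connect": {"Cloud Storage", "Databases", "SaaS APIs", "Kafka / Event Hubs"},
    "Declarative Pipelines": {"Lakeflow Connect"},
    "Unity Catalog / Delta Lake": {"Declarative Pipelines"},
    "Lakeflow Jobs": {"Unity Catalog / Delta Lake"},
    "Dashboards & ML": {"Lakeflow Jobs"},
}


def run_order(dependencies):
    """Return stage names so that every stage appears after its upstream stages."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(DEPENDENCIES)
for position, stage in enumerate(order, start=1):
    print(position, stage)
```

The four sources may come out in any order relative to each other, but they always precede Lakeflow Connect. The sketch only illustrates dependency-driven ordering in general terms.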
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
LF[Lakeflow Platform]:::processing
LF --> LC[Lakeflow Connect]:::ingestion
LF --> LDP[Declarative Pipelines]:::processing
LF --> LJ[Lakeflow Jobs]:::serving
LC --> MC[Managed Connectors]:::ingestion
LC --> SC[Standard Connectors]:::ingestion
LDP --> ST[Streaming Tables]:::storage
LDP --> MV[Materialized Views]:::storage
LJ --> SCHED[Schedules & Triggers]:::serving
LJ --> CF[Control Flow]:::serving
*Lakeflow component hierarchy showing the three pillars and their sub-capabilities.*
Key Terms
Prerequisites and Setup
- A Databricks workspace on AWS, Azure, or GCP with Unity Catalog enabled.
- A cluster or SQL warehouse running Databricks Runtime 13.3 LTS or later.
- CREATE TABLE and CREATE SCHEMA permissions in your target catalog.
- Network access to the data sources you plan to ingest from (firewall rules, Private Link, etc.).
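
As a rough sketch of the permissions prerequisite, the helper below builds the Unity Catalog `GRANT` statements a workspace admin might run. The catalog, schema, and principal names are hypothetical placeholders. The `USE CATALOG` and `USE SCHEMA` grants are included on the assumption that the principal does not already hold them. The function only produces SQL strings; it does not execute anything.

```python
def grant_statements(catalog, schema, principal):
    """Build Unity Catalog GRANT statements for the permissions listed above."""
    target = f"`{principal}`"  # backticks quote principals containing special characters
    return [
        # Assumed to be needed: access to the parent catalog.
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {target}",
        f"GRANT CREATE SCHEMA ON CATALOG {catalog} TO {target}",
        # Assumed to be needed: access to the target schema.
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO {target}",
        f"GRANT CREATE TABLE ON SCHEMA {catalog}.{schema} TO {target}",
    ]


# Hypothetical names, for illustration only.
statements = grant_statements("main", "lakeflow_demo", "data-engineers")
for statement in statements:
    print(statement + ";")
```

Each statement can be pasted into a SQL editor or notebook cell by someone with the authority to grant privileges on the catalog.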
Step-by-Step Implementation
Configuration Reference
| Parameter | Description | Default |
|---|---|---|
| cloudFiles.format | File format for Auto Loader (json, csv, parquet, avro) | Required |
| cloudFiles.schemaLocation | Path to store inferred schema | Required |
| cloudFiles.maxFilesPerTrigger | Max files per micro-batch | 1000 |
| pipelines.maxFlowRetryAttempts | Retry attempts for failed flows | 2 |
| spark.databricks.delta.optimizeWrite.enabled | Auto-optimize write file sizes | true |
| spark.databricks.delta.autoCompact.enabled | Auto-compact small files | false |
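
To show how the Auto Loader rows of the table might be applied, here is a minimal sketch that validates the two required options and fills in the documented default for `cloudFiles.maxFilesPerTrigger`. The helper name `auto_loader_options` and the schema path are hypothetical. The code runs in plain Python and only prepares the options dictionary.

```python
REQUIRED_KEYS = ("cloudFiles.format", "cloudFiles.schemaLocation")
DEFAULTS = {"cloudFiles.maxFilesPerTrigger": "1000"}  # default taken from the table above


def auto_loader_options(user_options):
    """Merge user options over the defaults and check that required keys are set."""
    missing = [key for key in REQUIRED_KEYS if key not in user_options]
    if missing:
        raise ValueError("missing required Auto Loader options: " + ", ".join(missing))
    merged = {**DEFAULTS, **user_options}
    # Reader options are passed as strings, so normalize every value.
    return {key: str(value) for key, value in merged.items()}


options = auto_loader_options({
    "cloudFiles.format": "json",
    # Hypothetical location, replace with a path you control.
    "cloudFiles.schemaLocation": "/Volumes/main/lakeflow_demo/_schemas/orders",
})
print(options)

# On Databricks, the dictionary could then be handed to the streaming reader:
# spark.readStream.format("cloudFiles").options(**options).load(source_path)
```

Validating up front means a missing `cloudFiles.format` or `cloudFiles.schemaLocation` is reported before the stream starts, not at read time.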