Lakeflow Jobs: Orchestrating Data and AI Workloads
Who this is for:
Architecture / Concept Overview
A Lakeflow Job is a directed acyclic graph (DAG) of tasks. Each task runs a notebook, Python script, SQL query, Declarative Pipeline, dbt project, or JAR on the compute configured for it (for example, a job cluster or serverless compute), while condition and for-each tasks add branching and looping. The Jobs scheduler handles dependency resolution, retries, parameterisation, and notifications.
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
TRIG[Trigger: Schedule / API / File]:::source --> JOB[Lakeflow Job]:::processing
JOB --> T1[Task: Ingest]:::ingestion
T1 --> T2[Task: Transform]:::processing
T1 --> T3[Task: Enrich]:::processing
T2 --> T4[Task: Aggregate]:::storage
T3 --> T4
T4 --> T5[Task: Train ML Model]:::serving
T4 --> T6[Task: Refresh Dashboard]:::serving
T5 --> T7[Task: Notify]:::governance
T6 --> T7
*A multi-task Lakeflow Job with parallel and sequential task dependencies.*
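The dependency structure in the diagram can be sketched in plain Python to show which tasks are eligible to run together. This is only an illustration of DAG ordering using the standard library's `graphlib`, not a description of how the Jobs scheduler is implemented; the task keys (`ingest`, `transform`, and so on) are hypothetical names derived from the diagram labels.

```python
from graphlib import TopologicalSorter

# Each task key maps to the set of task keys it depends on,
# mirroring the edges in the diagram above (hypothetical names).
DEPENDS_ON = {
    "ingest": set(),
    "transform": {"ingest"},
    "enrich": {"ingest"},
    "aggregate": {"transform", "enrich"},
    "train_ml_model": {"aggregate"},
    "refresh_dashboard": {"aggregate"},
    "notify": {"train_ml_model", "refresh_dashboard"},
}


def execution_waves(depends_on):
    """Group tasks into waves; tasks in the same wave have no mutual dependencies."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave finished to unlock successors
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(DEPENDS_ON), start=1):
        print(number, wave)
```

Tasks within a wave (for example `transform` and `enrich`) correspond to the parallel branches in the diagram, and each later wave starts only after everything it depends on has finished.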
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
TT[Task Types]:::processing
TT --> NB[Notebook Task]:::processing
TT --> PY[Python Script Task]:::processing
TT --> SQL[SQL Task]:::storage
TT --> PL[Pipeline Task]:::ingestion
TT --> DBT[dbt Task]:::processing
TT --> JAR[JAR Task]:::source
TT --> COND[Condition Task]:::governance
TT --> LOOP[For-Each Task]:::serving
*Supported task types in Lakeflow Jobs.*
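When a job is defined as JSON, each task type in the diagram corresponds to one field on the task object. The sketch below maps the diagram labels to the field names as I understand them from the Jobs API 2.1 task schema, and checks that a task dictionary declares exactly one of them. The helper `task_type_of` is hypothetical, and the field names should be confirmed against the API reference for your workspace.

```python
# Diagram label -> task field name (assumed from the Jobs API 2.1 task schema).
TASK_TYPE_FIELDS = {
    "Notebook Task": "notebook_task",
    "Python Script Task": "spark_python_task",
    "SQL Task": "sql_task",
    "Pipeline Task": "pipeline_task",
    "dbt Task": "dbt_task",
    "JAR Task": "spark_jar_task",
    "Condition Task": "condition_task",
    "For-Each Task": "for_each_task",
}


def task_type_of(task: dict) -> str:
    """Return the diagram label for a task dict, requiring exactly one type field."""
    present = [label for label, field in TASK_TYPE_FIELDS.items() if field in task]
    if len(present) != 1:
        raise ValueError(
            f"expected exactly one task type field, found {len(present)}: {present}"
        )
    return present[0]


if __name__ == "__main__":
    example = {"task_key": "aggregate", "sql_task": {}}  # hypothetical task
    print(task_type_of(example))
```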
Key Terms
Prerequisites and Setup
- Databricks workspace with permissions to create and manage jobs.
- Notebooks, scripts, or pipelines that the job tasks will execute.
- For scheduled jobs: appropriate compute permissions (job clusters or serverless).
Step-by-Step Implementation
Configuration Reference
| Parameter | Description | Default |
|---|---|---|
| schedule.quartz_cron_expression | Quartz cron expression for automatic runs | None (manual) |
| max_concurrent_runs | Maximum number of concurrent runs of the same job | 1 |
| timeout_seconds | Maximum duration in seconds before the run is cancelled | 0 (no timeout) |
| max_retries | Number of retries for a failed task | 0 |
| retry_on_timeout | Whether to retry a task that timed out | false |
| min_retry_interval_millis | Minimum delay between retries, in milliseconds | 0 |
| run_if | Task execution condition: ALL_SUCCESS, AT_LEAST_ONE_SUCCESS, NONE_FAILED, ALL_DONE, AT_LEAST_ONE_FAILED, ALL_FAILED | ALL_SUCCESS |
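To show how these parameters fit together, here is a minimal sketch of a job settings payload built as a Python dictionary and serialised to JSON. It assumes the Jobs API 2.1 layout as I understand it, where the schedule, `max_concurrent_runs`, and `timeout_seconds` sit at job level and the retry settings and `run_if` sit on individual tasks. The job name, task keys, and notebook paths are hypothetical, and compute configuration is omitted for brevity.

```python
import json


def build_job_settings() -> dict:
    """Return an illustrative job settings payload (all names are hypothetical)."""
    return {
        "name": "example_nightly_job",
        "schedule": {
            # Quartz fields: second minute hour day-of-month month day-of-week
            "quartz_cron_expression": "0 0 2 * * ?",  # every day at 02:00
            "timezone_id": "UTC",
        },
        "max_concurrent_runs": 1,
        "timeout_seconds": 3600,  # cancel the run after one hour
        "tasks": [
            {
                "task_key": "ingest",
                "notebook_task": {"notebook_path": "/Workspace/example/ingest"},
                "max_retries": 2,
                "min_retry_interval_millis": 60000,  # wait at least 60 s between retries
                "retry_on_timeout": False,
            },
            {
                "task_key": "notify",
                "depends_on": [{"task_key": "ingest"}],
                # ALL_DONE runs this task whether or not its dependencies succeeded
                "run_if": "ALL_DONE",
                "notebook_task": {"notebook_path": "/Workspace/example/notify"},
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_job_settings(), indent=2))
```

Setting `run_if` to `ALL_DONE` on the final task is a common choice for notification steps, since it lets the task report failures as well as successes.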