Optimising Delta Lake: Liquid Clustering, Data Skipping, and VACUUM
Who this is for:
Architecture / Concept Overview: Optimising Delta Lake: Liquid Clustering, Data Skipping, and VACUUM
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
WRITE[New Data Written] -->|Small Files| TABLE[Delta Table]
TABLE -->|OPTIMIZE| COMPACT[Compacted Files]
COMPACT -->|Liquid Clustering| CLUSTERED[Clustered Layout]
CLUSTERED -->|Data Skipping Stats| QUERY[Query Engine]
TABLE -->|VACUUM| CLEAN[Remove Orphan Files]
WRITE:::ingestion
TABLE:::source
COMPACT:::processing
CLUSTERED:::storage
QUERY:::serving
CLEAN:::governance
*The optimisation lifecycle: writes produce small files, OPTIMIZE compacts and clusters them, and VACUUM cleans up obsolete versions.*
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
LAYOUT[Data Layout Strategies] --> PART[Partitioning]
LAYOUT --> ZORDER[Z-Order]
LAYOUT --> LC[Liquid Clustering]
PART -->|Static, coarse-grained| LIMIT1[High cardinality = too many partitions]
ZORDER -->|Requires OPTIMIZE, static keys| LIMIT2[Cannot change keys without rewrite]
LC -->|Adaptive, incremental| ADV[Flexible keys, no partition overhead]
LAYOUT:::governance
PART:::source
ZORDER:::ingestion
LC:::storage
LIMIT1:::source
LIMIT2:::ingestion
ADV:::serving
*Liquid clustering supersedes static partitioning and Z-ordering, offering adaptive data layout without manual tuning.*
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
Q[Query with Filter] -->|Check stats| SKIP{Data Skipping}
SKIP -->|Min/Max mismatch| PRUNE[Skip File]
SKIP -->|Min/Max overlap| READ[Read File]
READ -->|Return rows| RESULT[Query Result]
Q:::serving
SKIP:::processing
PRUNE:::source
READ:::storage
RESULT:::serving
*Data skipping uses per-file column statistics to prune irrelevant files at query planning time.*
Key Terms
Prerequisites and Setup
- Databricks Runtime 13.3 LTS or later (liquid clustering requires 13.3+)
- A Delta table with sufficient data volume to benefit from optimisation (1 GB+)
MODIFYprivilege for running OPTIMIZE and VACUUM- Awareness of current table properties: partitioning, file count, and average file size
Step-by-Step Implementation
Configuration Reference
| Property | Default | Description |
|---|---|---|
delta.autoOptimize.optimizeWrite | true | Coalesces small output files during writes |
delta.autoOptimize.autoCompact | true | Triggers background compaction after writes |
delta.targetFileSize | 256mb (managed) | Target file size for OPTIMIZE |
delta.tuneFileSizesForRewrites | true | Adjusts target size for rewrite-heavy tables |
delta.deletedFileRetentionDuration | 7 days | Minimum file age before VACUUM eligibility |
delta.enableDeletionVectors | true | Enables soft-delete row markers |
delta.dataSkippingStatsColumns | 32 | Number of leading columns to collect stats for |
spark.databricks.delta.vacuum.parallelDelete.enabled | true | Parallel file deletion during VACUUM |