Optimising Delta Lake: Liquid Clustering, Data Skipping, and VACUUM

Who this is for:

Architecture / Concept Overview: Optimising Delta Lake: Liquid Clustering, Data Skipping, and VACUUM

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
  classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  WRITE[New Data Written] -->|Small Files| TABLE[Delta Table]
  TABLE -->|OPTIMIZE| COMPACT[Compacted Files]
  COMPACT -->|Liquid Clustering| CLUSTERED[Clustered Layout]
  CLUSTERED -->|Data Skipping Stats| QUERY[Query Engine]
  TABLE -->|VACUUM| CLEAN[Remove Orphan Files]
  WRITE:::ingestion
  TABLE:::source
  COMPACT:::processing
  CLUSTERED:::storage
  QUERY:::serving
  CLEAN:::governance

*The optimisation lifecycle: writes produce small files, OPTIMIZE compacts and clusters them, and VACUUM cleans up obsolete versions.*
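As a rough sketch of that lifecycle, the helper below builds the two maintenance statements in the order the diagram shows. The function name `maintenance_statements` and the table name `main.sales.orders` are hypothetical; in a notebook each string would typically be passed to `spark.sql(...)`. The default of 168 hours matches a 7-day retention window.

```python
def maintenance_statements(table: str, retain_hours: int = 168) -> list[str]:
    """Return the SQL for one maintenance pass: compact first, then clean up."""
    if retain_hours < 0:
        raise ValueError("retain_hours must be non-negative")
    return [
        # Rewrites small files into larger ones (and clusters them if keys are set).
        f"OPTIMIZE {table}",
        # Deletes files no longer referenced by the table and older than the window.
        f"VACUUM {table} RETAIN {retain_hours} HOURS",
    ]


# Hypothetical table name, used only for illustration.
statements = maintenance_statements("main.sales.orders")
for sql in statements:
    print(sql)
```

Running OPTIMIZE before VACUUM means the small files it replaces become eligible for removal once they age past the retention window.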

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
  classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  LAYOUT[Data Layout Strategies] --> PART[Partitioning]
  LAYOUT --> ZORDER[Z-Order]
  LAYOUT --> LC[Liquid Clustering]
  PART -->|Static, coarse-grained| LIMIT1[High cardinality = too many partitions]
  ZORDER -->|Requires OPTIMIZE, static keys| LIMIT2[Cannot change keys without rewrite]
  LC -->|Adaptive, incremental| ADV[Flexible keys, no partition overhead]
  LAYOUT:::governance
  PART:::source
  ZORDER:::ingestion
  LC:::storage
  LIMIT1:::source
  LIMIT2:::ingestion
  ADV:::serving

*Liquid clustering replaces static partitioning and Z-ordering: clustering keys can be changed without rewriting existing data, and the layout is maintained incrementally.*
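To make the "flexible keys" point concrete, here is a minimal sketch that builds the `ALTER TABLE ... CLUSTER BY` statement used to set or change clustering keys. The helper name `cluster_by_statement` and the table and column names are hypothetical. The limit of four keys reflects the documented maximum for liquid clustering; treat it as an assumption to verify against your runtime version.

```python
MAX_CLUSTERING_KEYS = 4  # assumed limit; check the documentation for your runtime


def cluster_by_statement(table: str, keys: list[str]) -> str:
    """Build the statement that sets, changes, or removes clustering keys."""
    if not keys:
        # An empty key list turns clustering off for future writes.
        return f"ALTER TABLE {table} CLUSTER BY NONE"
    if len(keys) > MAX_CLUSTERING_KEYS:
        raise ValueError(f"at most {MAX_CLUSTERING_KEYS} clustering keys are supported")
    return f"ALTER TABLE {table} CLUSTER BY ({', '.join(keys)})"


# Hypothetical table and columns, used only for illustration.
print(cluster_by_statement("main.sales.orders", ["customer_id", "order_date"]))
print(cluster_by_statement("main.sales.orders", []))
```

Changing keys this way affects how newly written and newly optimised data is laid out; existing files are not rewritten by the statement itself.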

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
  classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  Q[Query with Filter] -->|Check stats| SKIP{Data Skipping}
  SKIP -->|Min/Max mismatch| PRUNE[Skip File]
  SKIP -->|Min/Max overlap| READ[Read File]
  READ -->|Return rows| RESULT[Query Result]
  Q:::serving
  SKIP:::processing
  PRUNE:::source
  READ:::storage
  RESULT:::serving

*Data skipping uses per-file column statistics to prune irrelevant files at query planning time.*
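The pruning decision in the diagram can be illustrated with a small pure-Python model. This is a simplified sketch, not the engine's implementation: the `FileStats` class, the `files_to_read` function, and the file names are all hypothetical, and only a single integer column with a range filter is modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Per-file min/max statistics for one column (simplified)."""
    path: str
    min_value: int
    max_value: int


def files_to_read(files: list[FileStats], lo: int, hi: int) -> list[str]:
    """Keep only files whose [min, max] range overlaps the filter range [lo, hi]."""
    return [f.path for f in files if f.max_value >= lo and f.min_value <= hi]


files = [
    FileStats("part-000.parquet", 1, 100),
    FileStats("part-001.parquet", 101, 200),
    FileStats("part-002.parquet", 150, 300),
]

# Filter: column BETWEEN 120 AND 160
selected = files_to_read(files, 120, 160)
print(selected)  # ['part-001.parquet', 'part-002.parquet']
```

The first file is skipped because its maximum (100) is below the filter's lower bound. The other two overlap each other on 150 to 200, so both must be read; the narrower and less overlapping the per-file ranges, the more files can be pruned, which is why clustering improves skipping.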

Key Terms

Prerequisites and Setup

Databricks Runtime 13.3 LTS or later (liquid clustering requires 13.3+)
A Delta table with sufficient data volume to benefit from optimisation (1 GB+)
MODIFY privilege for running OPTIMIZE and VACUUM
Awareness of current table properties: partitioning, file count, and average file size
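One way to get a feel for the last prerequisite is to compute the average file size from the `numFiles` and `sizeInBytes` columns that `DESCRIBE DETAIL` returns. The sketch below works on a plain dictionary so it runs anywhere; the helper name `average_file_size_mb` and the sample numbers are hypothetical.

```python
def average_file_size_mb(detail: dict) -> float:
    """Average data file size in MiB, from a DESCRIBE DETAIL row as a dict."""
    num_files = detail["numFiles"]
    if num_files == 0:
        return 0.0
    return detail["sizeInBytes"] / num_files / (1024 * 1024)


# Sample values for illustration. In a notebook, a row could be obtained with
# something like: spark.sql("DESCRIBE DETAIL <table>").first().asDict()
detail = {"numFiles": 4000, "sizeInBytes": 8 * 1024**3, "partitionColumns": []}

avg_mb = average_file_size_mb(detail)
print(f"{detail['numFiles']} files, average {avg_mb:.2f} MiB")
```

An average of a few MiB across thousands of files, as in this sample, suggests the table has a small-file problem that compaction would address.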

Step-by-Step Implementation

Configuration Reference

Optimising Delta Lake: Liquid Clustering, Data Skipping, and VACUUM configuration options
| Property | Default | Description |
| --- | --- | --- |
| `delta.autoOptimize.optimizeWrite` | `true` | Coalesces small output files during writes |
| `delta.autoOptimize.autoCompact` | `true` | Triggers background compaction after writes |
| `delta.targetFileSize` | `256mb` (managed) | Target file size for OPTIMIZE |
| `delta.tuneFileSizesForRewrites` | `true` | Adjusts target size for rewrite-heavy tables |
| `delta.deletedFileRetentionDuration` | `7 days` | Minimum file age before VACUUM eligibility |
| `delta.enableDeletionVectors` | `true` | Enables soft-delete row markers |
| `delta.dataSkippingNumIndexedCols` | `32` | Number of leading columns to collect stats for |
| `spark.databricks.delta.vacuum.parallelDelete.enabled` | `true` | Parallel file deletion during VACUUM |
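The effect of `delta.deletedFileRetentionDuration` can be sketched with a simplified model: a file becomes a VACUUM candidate only once it has been unreferenced for longer than the retention window. The function `vacuum_candidates`, the file names, and the timestamps are hypothetical, and the real command works from the transaction log and storage listing rather than a dictionary.

```python
from datetime import datetime, timedelta, timezone


def vacuum_candidates(
    removed_files: dict[str, datetime],
    now: datetime,
    retention: timedelta = timedelta(days=7),
) -> list[str]:
    """Return files that were logically removed before the retention cutoff."""
    cutoff = now - retention
    return [path for path, removed_at in removed_files.items() if removed_at < cutoff]


now = datetime(2024, 1, 15, tzinfo=timezone.utc)
removed_files = {
    "old.parquet": datetime(2024, 1, 1, tzinfo=timezone.utc),      # 14 days ago
    "recent.parquet": datetime(2024, 1, 12, tzinfo=timezone.utc),  # 3 days ago
}

candidates = vacuum_candidates(removed_files, now)
print(candidates)  # ['old.parquet']
```

Shortening the window frees storage sooner but also shortens how far back time travel and long-running readers can safely reach.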

Monitoring, Cost, and Security Considerations

Common Pitfalls and Recommended Patterns

Frequently Asked Questions

Related Topics