Optimising Delta Lake: Liquid Clustering, Data Skipping, and VACUUM

    Who this is for:

    Architecture / Concept Overview: Optimising Delta Lake: Liquid Clustering, Data Skipping, and VACUUM

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%% flowchart LR classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED WRITE[New Data Written] -->|Small Files| TABLE[Delta Table] TABLE -->|OPTIMIZE| COMPACT[Compacted Files] COMPACT -->|Liquid Clustering| CLUSTERED[Clustered Layout] CLUSTERED -->|Data Skipping Stats| QUERY[Query Engine] TABLE -->|VACUUM| CLEAN[Remove Orphan Files] WRITE:::ingestion TABLE:::source COMPACT:::processing CLUSTERED:::storage QUERY:::serving CLEAN:::governance

    *The optimisation lifecycle: writes produce small files, OPTIMIZE compacts and clusters them, and VACUUM cleans up obsolete versions.*

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%% graph TD classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED LAYOUT[Data Layout Strategies] --> PART[Partitioning] LAYOUT --> ZORDER[Z-Order] LAYOUT --> LC[Liquid Clustering] PART -->|Static, coarse-grained| LIMIT1[High cardinality = too many partitions] ZORDER -->|Requires OPTIMIZE, static keys| LIMIT2[Cannot change keys without rewrite] LC -->|Adaptive, incremental| ADV[Flexible keys, no partition overhead] LAYOUT:::governance PART:::source ZORDER:::ingestion LC:::storage LIMIT1:::source LIMIT2:::ingestion ADV:::serving

    *Liquid clustering supersedes static partitioning and Z-ordering, offering adaptive data layout without manual tuning.*

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%% flowchart LR classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED Q[Query with Filter] -->|Check stats| SKIP{Data Skipping} SKIP -->|Min/Max mismatch| PRUNE[Skip File] SKIP -->|Min/Max overlap| READ[Read File] READ -->|Return rows| RESULT[Query Result] Q:::serving SKIP:::processing PRUNE:::source READ:::storage RESULT:::serving

    *Data skipping uses per-file column statistics to prune irrelevant files at query planning time.*

    Key Terms

    Prerequisites and Setup

    • Databricks Runtime 13.3 LTS or later (liquid clustering requires 13.3+)
    • A Delta table with sufficient data volume to benefit from optimisation (1 GB+)
    • MODIFY privilege for running OPTIMIZE and VACUUM
    • Awareness of current table properties: partitioning, file count, and average file size

    Step-by-Step Implementation

      Configuration Reference

      Optimising Delta Lake: Liquid Clustering, Data Skipping, and VACUUM configuration options
      PropertyDefaultDescription
      delta.autoOptimize.optimizeWritetrueCoalesces small output files during writes
      delta.autoOptimize.autoCompacttrueTriggers background compaction after writes
      delta.targetFileSize256mb (managed)Target file size for OPTIMIZE
      delta.tuneFileSizesForRewritestrueAdjusts target size for rewrite-heavy tables
      delta.deletedFileRetentionDuration7 daysMinimum file age before VACUUM eligibility
      delta.enableDeletionVectorstrueEnables soft-delete row markers
      delta.dataSkippingStatsColumns32Number of leading columns to collect stats for
      spark.databricks.delta.vacuum.parallelDelete.enabledtrueParallel file deletion during VACUUM

      Monitoring, Cost, and Security Considerations

      Common Pitfalls and Recommended Patterns

        Frequently Asked Questions