Partitioning and Bucketing for Spark Performance

Who this is for:

Architecture / Concept Overview: Partitioning and Bucketing for Spark Performance

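Partitioning divides a table into directories based on column values, which enables partition pruning: a query that filters on the partition column skips every directory that cannot match. Classic Spark bucketing hashes rows into a fixed number of files; on Databricks Delta tables the comparable role is played by Z-ORDER, which co-locates related values within files so that Delta's per-file min/max statistics can drive data skipping. Liquid clustering is Databricks' newer approach, which replaces both partitioning and Z-ORDER on a table and aims to deliver the benefits of each.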

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
    classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    Q[Query: WHERE date = '2025-06' AND region = 'EMEA']:::source
    Q --> PP[Partition Pruning]:::processing
    PP --> |Skip 90% of directories| DS[Data Skipping via Z-ORDER stats]:::processing
    DS --> |Skip 80% of remaining files| READ[Read only relevant files]:::storage
    READ --> RES[Query Result]:::serving

*Partitioning prunes directories and Z-ORDER data skipping prunes files; together they can sharply reduce I/O.*
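The two-stage filtering in the diagram can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not how Spark or Delta implement it: the `DataFile` class, the file list, and the `region` min/max values are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """One data file plus the min/max statistics kept for a column (hypothetical)."""
    partition_date: str  # value of the partition column (the directory)
    region_min: str      # smallest `region` value stored in the file
    region_max: str      # largest `region` value stored in the file


def prune_partitions(files, date):
    """Partition pruning: keep only files in the matching directory."""
    return [f for f in files if f.partition_date == date]


def skip_files(files, region):
    """Data skipping: drop files whose [min, max] range cannot contain `region`."""
    return [f for f in files if f.region_min <= region <= f.region_max]


# Illustrative file listing; a well-clustered table has narrow min/max ranges.
files = [
    DataFile("2025-05", "AMER", "EMEA"),
    DataFile("2025-06", "AMER", "APAC"),
    DataFile("2025-06", "EMEA", "EMEA"),
    DataFile("2025-06", "LATAM", "LATAM"),
]

after_pruning = prune_partitions(files, "2025-06")
to_read = skip_files(after_pruning, "EMEA")
print(len(files), len(after_pruning), len(to_read))
```

The narrower each file's min/max range, the more files the second stage can rule out, which is why clustering related values into the same files matters.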

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
    classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    STRAT[Data Layout Strategies]:::processing
    STRAT --> PART[Hive-Style Partitioning]:::storage
    STRAT --> ZORD[Z-ORDER Clustering]:::processing
    STRAT --> LIQ[Liquid Clustering]:::serving
    PART --> P1[Directories by column values]:::storage
    PART --> P2[Best for low-cardinality columns]:::storage
    PART --> P3[Requires OPTIMIZE + VACUUM]:::storage
    ZORD --> Z1[Co-locates data within files]:::processing
    ZORD --> Z2[Works with any cardinality]:::processing
    ZORD --> Z3[Applied via OPTIMIZE ZORDER BY]:::processing
    LIQ --> L1[Automatic data layout]:::serving
    LIQ --> L2[No manual OPTIMIZE needed]:::serving
    LIQ --> L3[Incremental clustering]:::serving

*Three data layout strategies from traditional to modern.*
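As a rough sketch of how each strategy is expressed, the helpers below build the corresponding Databricks SQL statements as strings. The table name `main.sales.events` and its columns are hypothetical, and the statements are only generated here, not executed; in a notebook they would typically be passed to `spark.sql(...)`.

```python
# Hypothetical column list used by both CREATE TABLE variants.
COLUMNS = "event_date DATE, region STRING, amount DOUBLE"


def partitioned_table_ddl(table, partition_col):
    """Hive-style partitioning: one directory per distinct value of the column."""
    return (
        f"CREATE TABLE {table} ({COLUMNS}) "
        f"USING DELTA PARTITIONED BY ({partition_col})"
    )


def zorder_statement(table, cols):
    """Z-ORDER: applied when OPTIMIZE rewrites files (not on partition columns)."""
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(cols)})"


def liquid_table_ddl(table, cols):
    """Liquid clustering: declared with CLUSTER BY instead of PARTITIONED BY."""
    return (
        f"CREATE TABLE {table} ({COLUMNS}) "
        f"USING DELTA CLUSTER BY ({', '.join(cols)})"
    )


statements = [
    partitioned_table_ddl("main.sales.events", "event_date"),
    zorder_statement("main.sales.events", ["region"]),
    liquid_table_ddl("main.sales.events_liquid", ["event_date", "region"]),
]
for stmt in statements:
    print(stmt)  # in a Databricks notebook: spark.sql(stmt)
```

Note that a single table uses either partitioning with Z-ORDER or liquid clustering, not both, which is why the sketch targets two different table names.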

Key Terms

Prerequisites and Setup

Delta tables in Unity Catalog.
MODIFY permission on tables for running OPTIMIZE.
Understanding of your query access patterns (which columns appear in WHERE clauses and joins).
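One crude way to get a first picture of access patterns is to tally which columns appear in filter predicates across a sample of queries. The sketch below uses a simple regular expression rather than a SQL parser, ignores join conditions, and runs on made-up query text, so treat its counts as a starting point only.

```python
import re
from collections import Counter

# Matches a column name that follows WHERE/AND/OR and precedes a comparison.
# Deliberately simplistic: no subqueries, functions, or quoted identifiers.
PREDICATE = re.compile(
    r"\b(?:WHERE|AND|OR)\s+([A-Za-z_][\w.]*)\s*(?:=|<|>|IN\b|BETWEEN\b|LIKE\b)",
    re.IGNORECASE,
)


def filter_column_counts(queries):
    """Count how often each column is used in a filter predicate."""
    counts = Counter()
    for query in queries:
        counts.update(name.lower() for name in PREDICATE.findall(query))
    return counts


# Hypothetical query sample; in practice this might come from query history.
queries = [
    "SELECT * FROM events WHERE event_date = '2025-06-01' AND region = 'EMEA'",
    "SELECT count(*) FROM events WHERE region IN ('EMEA', 'APAC')",
    "SELECT * FROM events WHERE event_date >= '2025-01-01'",
]

counts = filter_column_counts(queries)
print(counts.most_common())
```

Columns that dominate the tally are the natural candidates for partitioning (if low-cardinality) or for Z-ORDER and clustering keys.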

Step-by-Step Implementation

Configuration Reference

Partitioning and Bucketing for Spark Performance configuration options
| Parameter | Description | Default |
| --- | --- | --- |
| `delta.autoOptimize.optimizeWrite` | Auto-optimize file sizes during writes | false |
| `delta.autoOptimize.autoCompact` | Auto-compact small files after writes | false |
| `delta.targetFileSize` | Target file size for OPTIMIZE | 1GB |
| `spark.databricks.delta.optimize.maxFileSize` | Max file size for OPTIMIZE | 1GB |
| `spark.databricks.delta.stats.skipping` | Enable data skipping via file statistics | true |
| `delta.deletedFileRetentionDuration` | Retention period for VACUUM | 7 days |
| `spark.sql.files.maxPartitionBytes` | Max bytes per read partition | 128MB |
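The parameters above fall into two groups: `delta.*` entries are table properties, while `spark.*` entries are session-level settings. A minimal sketch of applying each group is shown below; it only builds the SQL text, the table name `main.sales.events` is hypothetical, and the chosen values are examples rather than recommendations.

```python
# Table-level settings (set once per table with ALTER TABLE).
TABLE_PROPERTIES = {
    "delta.autoOptimize.optimizeWrite": "true",
    "delta.autoOptimize.autoCompact": "true",
    "delta.deletedFileRetentionDuration": "interval 7 days",
}

# Session-level settings (apply to the current Spark session only).
SESSION_CONFS = {
    "spark.sql.files.maxPartitionBytes": "134217728",  # 128 MB in bytes
    "spark.databricks.delta.stats.skipping": "true",
}


def set_tblproperties_sql(table, props):
    """Build one ALTER TABLE statement that sets all given table properties."""
    assignments = ", ".join(f"'{key}' = '{value}'" for key, value in props.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({assignments})"


def set_session_sql(confs):
    """Build one SET statement per session configuration."""
    return [f"SET {key} = {value}" for key, value in confs.items()]


alter_stmt = set_tblproperties_sql("main.sales.events", TABLE_PROPERTIES)
session_stmts = set_session_sql(SESSION_CONFS)
print(alter_stmt)
for stmt in session_stmts:
    print(stmt)  # in a Databricks notebook: spark.sql(stmt)
```

Keeping the two groups separate makes it clearer which settings persist with the table and which must be reapplied in every session or cluster configuration.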

Monitoring, Cost, and Security Considerations

Common Pitfalls and Recommended Patterns

Frequently Asked Questions

Related Topics