Check partition sizes for potential data skew

Who this is for:

Architecture / Concept Overview: Check partition sizes for potential data skew

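Every Spark application follows the same execution model: a driver process, running on the driver node, asks a cluster manager for resources and distributes work to executors on worker nodes. The work is organized as a DAG (directed acyclic graph) of stages, and each stage is split into tasks, one per partition. Because a stage finishes only when its slowest task does, a single oversized partition can hold up the whole stage, which is why partition sizes are the first thing to check when you suspect data skew.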

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
    classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    DRIVER[Driver - SparkContext]:::processing --> DAGSCH[DAG Scheduler]:::processing
    DAGSCH --> TASKSCH[Task Scheduler]:::processing
    TASKSCH --> CM[Cluster Manager]:::governance
    CM --> W1[Worker Node 1]:::serving
    CM --> W2[Worker Node 2]:::serving
    CM --> W3[Worker Node 3]:::serving
    W1 --> EX1[Executor 1]:::serving
    W2 --> EX2[Executor 2]:::serving
    W3 --> EX3[Executor 3]:::serving
    EX1 --> |Read/Write| CS[Cloud Storage]:::storage
    EX2 --> |Read/Write| CS
    EX3 --> |Read/Write| CS

*Spark execution flow: driver plans the work, cluster manager allocates resources, executors process data.*

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
    classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    APP[Spark Application]:::processing
    APP --> JOB1[Job 1 - triggered by action]:::processing
    APP --> JOB2[Job 2 - triggered by action]:::processing
    JOB1 --> S1[Stage 1 - Narrow transforms]:::ingestion
    JOB1 --> S2[Stage 2 - After shuffle]:::ingestion
    S1 --> T1[Task 1.1]:::serving
    S1 --> T2[Task 1.2]:::serving
    S1 --> T3[Task 1.3]:::serving
    S2 --> T4[Task 2.1]:::serving
    S2 --> T5[Task 2.2]:::serving

*Hierarchy: Application contains Jobs (one per action), Jobs contain Stages (split at shuffle boundaries), Stages contain Tasks (one per partition).*
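
Because each task handles exactly one partition, the time a stage takes is bounded below by its largest partition. The following is a minimal sketch of that effect, not a model of Spark's actual scheduler: it assumes a hypothetical cost per task (think seconds, proportional to partition size), a fixed number of task slots, and greedy assignment of each task to whichever slot frees up first. It ignores scheduling overhead, data locality, and speculative execution.

```python
import heapq


def estimate_stage_time(task_costs, slots):
    """Estimate stage wall time with greedy list scheduling.

    task_costs: hypothetical per-task costs, one per partition.
    slots: number of tasks that can run at the same time.
    """
    if slots < 1:
        raise ValueError("slots must be at least 1")
    # Min-heap of the time at which each slot becomes free.
    finish_times = [0.0] * slots
    heapq.heapify(finish_times)
    for cost in task_costs:
        earliest_free = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest_free + cost)
    # The stage ends when the last slot finishes.
    return max(finish_times)


# Same total work (80 units) in both cases, spread over 8 partitions.
balanced = [10.0] * 8
skewed = [5.0] * 7 + [45.0]  # one hot partition

balanced_time = estimate_stage_time(balanced, slots=4)
skewed_time = estimate_stage_time(skewed, slots=4)
print(balanced_time, skewed_time)
```

In this toy model the skewed layout takes more than twice as long even though the total work is identical, because three slots sit idle while the hot partition finishes.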

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
    classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    READ[Read Data]:::source --> FILTER[filter]:::processing
    FILTER --> SELECT[select]:::processing
    SELECT --> |Narrow: no shuffle| STAGE1_END[Stage 1 End]:::processing
    STAGE1_END --> |Shuffle| GROUPBY[groupBy + agg]:::ingestion
    GROUPBY --> SORT[orderBy]:::ingestion
    SORT --> WRITE[Write Output]:::storage

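*A simplified DAG: the narrow transformations (filter, select) run in one stage, and the shuffle required by groupBy starts the next. In practice a global orderBy triggers its own shuffle, which the diagram omits.*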

Key Terms

Prerequisites and Setup

A Databricks cluster with the Spark UI accessible.
A notebook attached to the cluster.

Step-by-Step Implementation
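
One way to check partition sizes from a notebook is to count rows per partition and compare the largest partition with the median. The sketch below is an illustration under stated assumptions, not a prescribed procedure: `partition_row_counts` and `skew_summary` are hypothetical helper names, row count is used as a stand-in for partition size in bytes, and the `threshold` of 2.0 is an arbitrary starting point you would tune. The only PySpark call used is `spark_partition_id()` from `pyspark.sql.functions`.

```python
from statistics import median


def partition_row_counts(df):
    """Return {partition_id: row_count} for a PySpark DataFrame.

    Assumes PySpark is available, as in a Databricks notebook.
    Partitions that hold zero rows do not appear in the result.
    """
    from pyspark.sql import functions as F  # lazy import: only needed here

    rows = (
        df.groupBy(F.spark_partition_id().alias("partition_id"))
        .count()
        .collect()
    )
    return {row["partition_id"]: row["count"] for row in rows}


def skew_summary(counts, threshold=2.0):
    """Summarize partition sizes and flag skew.

    counts: mapping of partition id to row count.
    threshold: hypothetical cutoff for the max-to-median ratio.
    """
    sizes = sorted(counts.values())
    if not sizes:
        raise ValueError("no partitions to summarize")
    mid = median(sizes)
    largest = sizes[-1]
    ratio = largest / mid if mid else float("inf")
    return {
        "partitions": len(sizes),
        "min": sizes[0],
        "median": mid,
        "max": largest,
        "max_to_median": ratio,
        "skewed": ratio > threshold,
    }


# Made-up counts for illustration: partition 3 is far larger than the rest.
example_counts = {0: 1000, 1: 1100, 2: 950, 3: 9000}
summary = skew_summary(example_counts)
print(summary)
```

In a notebook attached to the cluster, the two helpers would be combined as `skew_summary(partition_row_counts(df))`, where `df` is the DataFrame under inspection. The Stages tab of the Spark UI offers a complementary view: its per-task summary metrics show the minimum, median, and maximum for values such as task duration and shuffle read size, and a large gap between median and maximum points to the same problem.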

Configuration Reference

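Configuration options relevant to checking partition sizes for data skew

| Parameter | Description | Default (open-source Spark) |
| --- | --- | --- |
| `spark.driver.memory` | Driver JVM heap size | 1g |
| `spark.executor.memory` | Executor JVM heap size | 1g |
| `spark.executor.cores` | Number of cores per executor | 1 on YARN; all available cores in standalone mode |
| `spark.sql.shuffle.partitions` | Number of partitions after a shuffle in DataFrame and SQL operations | 200 |
| `spark.default.parallelism` | Default number of partitions for RDD operations | Depends on the operation and cluster manager; typically total executor cores (at least 2) |
| `spark.memory.fraction` | Fraction of (heap minus 300 MiB reserved) shared by execution and storage | 0.6 |
| `spark.memory.storageFraction` | Fraction of the `spark.memory.fraction` region protected from eviction for cached data | 0.5 |
| `spark.sql.adaptive.enabled` | Enable Adaptive Query Execution (AQE) for runtime plan optimization | true (since Spark 3.2) |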
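
The two memory settings above interact, and a worked calculation makes the relationship concrete. The sketch below follows the formula in the Spark tuning guide, where the unified region is (heap minus 300 MiB reserved) times `spark.memory.fraction`, and `spark.memory.storageFraction` marks the share of that region where cached blocks are protected from eviction. The function name is hypothetical, and the figures are approximate because the heap the JVM actually reports is usually somewhat smaller than the configured value.

```python
RESERVED_MIB = 300  # memory Spark sets aside before applying spark.memory.fraction


def unified_memory_mib(heap_mib, memory_fraction=0.6, storage_fraction=0.5):
    """Approximate Spark's unified memory regions, in MiB.

    heap_mib: JVM heap size (for example spark.executor.memory) in MiB.
    memory_fraction: value of spark.memory.fraction.
    storage_fraction: value of spark.memory.storageFraction.
    """
    unified = max(heap_mib - RESERVED_MIB, 0) * memory_fraction
    storage_protected = unified * storage_fraction
    return {
        "unified": unified,
        "storage_protected": storage_protected,
        # Execution can borrow beyond this when storage is not using its share.
        "execution_initial": unified - storage_protected,
    }


# Defaults from the table: a 1g heap with fractions 0.6 and 0.5.
regions = unified_memory_mib(1024)
print({name: round(value, 1) for name, value in regions.items()})
```

With the defaults, a 1g executor leaves only about 434 MiB for execution and storage combined, so a single skewed partition that needs more than that during a shuffle or aggregation is likely to spill to disk.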

Monitoring, Cost, and Security Considerations

Common Pitfalls and Recommended Patterns

Frequently Asked Questions

Related Topics