Check partition sizes for potential data skew
Who this is for:
Architecture / Concept Overview: Check partition sizes for potential data skew
Every Spark application follows the same execution model: a driver process runs on the master node, communicates with a cluster manager to acquire resources, and distributes work to executors on worker nodes. The work is organized as a DAG (Directed Acyclic Graph) of stages and tasks.
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
DRIVER[Driver - SparkContext]:::processing --> DAGSCH[DAG Scheduler]:::processing
DAGSCH --> TASKSCH[Task Scheduler]:::processing
TASKSCH --> CM[Cluster Manager]:::governance
CM --> W1[Worker Node 1]:::serving
CM --> W2[Worker Node 2]:::serving
CM --> W3[Worker Node 3]:::serving
W1 --> EX1[Executor 1]:::serving
W2 --> EX2[Executor 2]:::serving
W3 --> EX3[Executor 3]:::serving
EX1 --> |Read/Write| CS[Cloud Storage]:::storage
EX2 --> |Read/Write| CS
EX3 --> |Read/Write| CS
*Spark execution flow: driver plans the work, cluster manager allocates resources, executors process data.*
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
APP[Spark Application]:::processing
APP --> JOB1[Job 1 - triggered by action]:::processing
APP --> JOB2[Job 2 - triggered by action]:::processing
JOB1 --> S1[Stage 1 - Narrow transforms]:::ingestion
JOB1 --> S2[Stage 2 - After shuffle]:::ingestion
S1 --> T1[Task 1.1]:::serving
S1 --> T2[Task 1.2]:::serving
S1 --> T3[Task 1.3]:::serving
S2 --> T4[Task 2.1]:::serving
S2 --> T5[Task 2.2]:::serving
*Hierarchy: Application contains Jobs (one per action), Jobs contain Stages (split at shuffle boundaries), Stages contain Tasks (one per partition).*
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
READ[Read Data]:::source --> FILTER[filter]:::processing
FILTER --> SELECT[select]:::processing
SELECT --> |Narrow: no shuffle| STAGE1_END[Stage 1 End]:::processing
STAGE1_END --> |Shuffle| GROUPBY[groupBy + agg]:::ingestion
GROUPBY --> SORT[orderBy]:::ingestion
SORT --> WRITE[Write Output]:::storage
*A DAG showing two stages separated by a shuffle boundary at the groupBy operation.*
Key Terms
Prerequisites and Setup
- A Databricks cluster with the Spark UI accessible.
- A notebook attached to the cluster.
Step-by-Step Implementation
Configuration Reference
| Parameter | Description | Default |
|---|---|---|
spark.driver.memory | Driver JVM heap size | 1g |
spark.executor.memory | Executor JVM heap size | 1g |
spark.executor.cores | Number of cores per executor | 1 |
spark.sql.shuffle.partitions | Partitions after a shuffle | 200 |
spark.default.parallelism | Default RDD parallelism | Total executor cores |
spark.memory.fraction | Fraction of heap for execution and storage | 0.6 |
spark.memory.storageFraction | Fraction of spark.memory.fraction reserved for caching | 0.5 |
spark.sql.adaptive.enabled | Enable AQE for runtime plan optimization | true |