Check partition sizes for potential data skew

    Who this is for:

    Architecture / Concept Overview: Check partition sizes for potential data skew

    Every Spark application follows the same execution model: a driver process runs on the master node, communicates with a cluster manager to acquire resources, and distributes work to executors on worker nodes. The work is organized as a DAG (Directed Acyclic Graph) of stages and tasks.

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%% flowchart LR classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED DRIVER[Driver - SparkContext]:::processing --> DAGSCH[DAG Scheduler]:::processing DAGSCH --> TASKSCH[Task Scheduler]:::processing TASKSCH --> CM[Cluster Manager]:::governance CM --> W1[Worker Node 1]:::serving CM --> W2[Worker Node 2]:::serving CM --> W3[Worker Node 3]:::serving W1 --> EX1[Executor 1]:::serving W2 --> EX2[Executor 2]:::serving W3 --> EX3[Executor 3]:::serving EX1 --> |Read/Write| CS[Cloud Storage]:::storage EX2 --> |Read/Write| CS EX3 --> |Read/Write| CS

    *Spark execution flow: driver plans the work, cluster manager allocates resources, executors process data.*

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%% graph TD classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED APP[Spark Application]:::processing APP --> JOB1[Job 1 - triggered by action]:::processing APP --> JOB2[Job 2 - triggered by action]:::processing JOB1 --> S1[Stage 1 - Narrow transforms]:::ingestion JOB1 --> S2[Stage 2 - After shuffle]:::ingestion S1 --> T1[Task 1.1]:::serving S1 --> T2[Task 1.2]:::serving S1 --> T3[Task 1.3]:::serving S2 --> T4[Task 2.1]:::serving S2 --> T5[Task 2.2]:::serving

    *Hierarchy: Application contains Jobs (one per action), Jobs contain Stages (split at shuffle boundaries), Stages contain Tasks (one per partition).*

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%% flowchart LR classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED READ[Read Data]:::source --> FILTER[filter]:::processing FILTER --> SELECT[select]:::processing SELECT --> |Narrow: no shuffle| STAGE1_END[Stage 1 End]:::processing STAGE1_END --> |Shuffle| GROUPBY[groupBy + agg]:::ingestion GROUPBY --> SORT[orderBy]:::ingestion SORT --> WRITE[Write Output]:::storage

    *A DAG showing two stages separated by a shuffle boundary at the groupBy operation.*

    Key Terms

    Prerequisites and Setup

    • A Databricks cluster with the Spark UI accessible.
    • A notebook attached to the cluster.

    Step-by-Step Implementation

      Configuration Reference

      Check partition sizes for potential data skew configuration options
      ParameterDescriptionDefault
      spark.driver.memoryDriver JVM heap size1g
      spark.executor.memoryExecutor JVM heap size1g
      spark.executor.coresNumber of cores per executor1
      spark.sql.shuffle.partitionsPartitions after a shuffle200
      spark.default.parallelismDefault RDD parallelismTotal executor cores
      spark.memory.fractionFraction of heap for execution and storage0.6
      spark.memory.storageFractionFraction of spark.memory.fraction reserved for caching0.5
      spark.sql.adaptive.enabledEnable AQE for runtime plan optimizationtrue

      Monitoring, Cost, and Security Considerations

      Common Pitfalls and Recommended Patterns

        Frequently Asked Questions