Analyze query plan for shuffle

    Who this is for:

    Architecture / Concept Overview: Analyze query plan for shuffle

    Spark performance bottlenecks typically fall into three categories: redundant reads (addressed by caching), expensive joins (addressed by broadcast joins when one side is small), and unnecessary or oversized shuffles (addressed by partition and shuffle tuning). Identifying which category a slow query belongs to determines which technique to apply.

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    flowchart LR
        classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        SLOW[Slow Query]:::source --> DIAG{Diagnose Bottleneck}:::processing
        DIAG --> IO[I/O Bound: Repeated Scans]:::storage
        DIAG --> SHUF[Shuffle Bound: Large Data Exchange]:::processing
        DIAG --> SKEW[Skew Bound: Uneven Task Distribution]:::governance
        IO --> CACHE[Solution: Cache/Persist]:::serving
        SHUF --> BC[Solution: Broadcast Join]:::serving
        SHUF --> AQE[Solution: AQE + Partition Tuning]:::serving
        SKEW --> SALT[Solution: Salting / AQE Skew Join]:::serving

    *Performance diagnosis: identify the bottleneck category, then apply the appropriate optimization.*
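One way to check whether a query is shuffle bound is to look for `Exchange` operators in its physical plan, since each one marks a point where data is redistributed across the cluster. The sketch below is illustrative only: the plan text is a hypothetical sample in the style of `df.explain()` output, and `find_shuffles` is a helper name introduced here, not part of Spark.

```python
import re

# Hypothetical physical plan text, in the style printed by df.explain().
SAMPLE_PLAN = """\
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [customer_id#10], [customer_id#42], Inner
   :- Sort [customer_id#10 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(customer_id#10, 200), ENSURE_REQUIREMENTS, [plan_id=101]
   :     +- FileScan parquet [customer_id#10]
   +- Sort [customer_id#42 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(customer_id#42, 200), ENSURE_REQUIREMENTS, [plan_id=102]
         +- FileScan parquet [customer_id#42]
"""

# Match shuffle Exchange nodes only. The lookbehind skips operators whose
# names merely end in "Exchange", such as BroadcastExchange or ReusedExchange.
_EXCHANGE = re.compile(r"(?<![A-Za-z])Exchange\s+(\w+)")


def find_shuffles(plan_text):
    """Return the partitioning scheme named on each shuffle Exchange node."""
    return [match.group(1) for match in _EXCHANGE.finditer(plan_text)]


if __name__ == "__main__":
    shuffles = find_shuffles(SAMPLE_PLAN)
    print(f"{len(shuffles)} shuffle(s): {shuffles}")
```

Two `hashpartitioning` exchanges under a `SortMergeJoin`, as in this sample, indicate that both sides of the join are shuffled. With AQE enabled, the final plan may wrap exchanges in query stage nodes, so this text match is a rough first check rather than a complete parser.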

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    graph TD
        classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        JOIN[Join Strategies]:::processing
        JOIN --> BHJ[Broadcast Hash Join]:::serving
        JOIN --> SMJ[Sort-Merge Join]:::processing
        JOIN --> SHJ[Shuffle Hash Join]:::processing
        BHJ --> B1[Small table broadcast to all executors]:::serving
        BHJ --> B2[No shuffle required]:::serving
        BHJ --> B3[Best for small dimension tables]:::serving
        SMJ --> S1[Both sides shuffled and sorted]:::processing
        SMJ --> S2[Default for large-large joins]:::processing
        SMJ --> S3[Most expensive but most general]:::processing
        SHJ --> H1[Both sides shuffled, hash table built]:::processing
        SHJ --> H2[Good when one side fits in memory]:::processing

    *Spark join strategies compared by cost and generality: broadcast is cheapest, sort-merge is the most general.*
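The strategy Spark actually chose for a join appears in the physical plan as an operator name: `BroadcastHashJoin`, `SortMergeJoin`, or `ShuffledHashJoin`. The following is a minimal sketch that scans plan text for those names. The plan snippet is a hypothetical sample, and `join_strategies` and `JOIN_OPERATORS` are names introduced here for illustration.

```python
# Physical plan operator names mapped to the strategies described above.
JOIN_OPERATORS = {
    "BroadcastHashJoin": "broadcast hash join: small side sent to all executors",
    "SortMergeJoin": "sort-merge join: both sides shuffled and sorted",
    "ShuffledHashJoin": "shuffle hash join: both sides shuffled, hash table built",
}

# Hypothetical plan text, in the style printed by df.explain().
BROADCAST_PLAN = """\
== Physical Plan ==
+- BroadcastHashJoin [region_id#7], [region_id#31], Inner, BuildRight, false
   :- FileScan parquet [region_id#7,amount#8]
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, int, false]))
      +- FileScan parquet [region_id#31,name#32]
"""


def join_strategies(plan_text):
    """Return the join operator names found in the plan, in order of appearance."""
    found = []
    for line in plan_text.splitlines():
        for operator in JOIN_OPERATORS:
            if operator in line:
                found.append(operator)
    return found


if __name__ == "__main__":
    for operator in join_strategies(BROADCAST_PLAN):
        print(f"{operator}: {JOIN_OPERATORS[operator]}")
```

If the plan shows `SortMergeJoin` where one side is known to be small, a broadcast can be requested explicitly in PySpark by wrapping the small DataFrame in `pyspark.sql.functions.broadcast(...)` before the join.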

    Key Terms

    Prerequisites and Setup

    • A Databricks cluster with the Spark UI accessible.
    • Tables to query and analyze in Unity Catalog.
    • Familiarity with the Spark UI (Jobs, Stages, Tasks tabs).

    Step-by-Step Implementation

      Configuration Reference

      Analyze query plan for shuffle: configuration options

      Parameter | Description | Default
      spark.sql.autoBroadcastJoinThreshold | Max table size for auto-broadcast | 10MB
      spark.sql.shuffle.partitions | Initial shuffle partition count | 200
      spark.sql.adaptive.enabled | Enable AQE | true
      spark.sql.adaptive.coalescePartitions.enabled | AQE partition coalescing | true
      spark.sql.adaptive.advisoryPartitionSizeInBytes | Target partition size after AQE coalescing | 64MB
      spark.sql.adaptive.skewJoin.enabled | AQE skew join optimization | true
      spark.sql.adaptive.skewJoin.skewedPartitionFactor | Skew detection factor | 5
      spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes | Skew detection threshold | 256MB
      spark.databricks.io.cache.enabled | Enable Delta disk caching | false
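The two skew settings work together: per the Spark documentation, a partition is treated as skewed when its size exceeds both the skew factor times the median partition size and the skew threshold in bytes. The sketch below applies that rule to a list of partition sizes using the defaults from the table. The function name and the sample sizes are hypothetical, and the code is a simplification for reasoning about the settings, not Spark's implementation.

```python
from statistics import median

MB = 1024 * 1024


def skewed_partitions(sizes_bytes, factor=5, threshold_bytes=256 * MB):
    """Return indexes of partitions that exceed both skew criteria.

    A partition counts as skewed only if it is larger than
    factor * median size AND larger than threshold_bytes.
    """
    if not sizes_bytes:
        return []
    median_size = median(sizes_bytes)
    return [
        index
        for index, size in enumerate(sizes_bytes)
        if size > factor * median_size and size > threshold_bytes
    ]


if __name__ == "__main__":
    # Hypothetical shuffle partition sizes, e.g. read from the Spark UI Stages tab.
    sizes = [60 * MB, 64 * MB, 70 * MB, 58 * MB, 900 * MB]
    print(skewed_partitions(sizes))
```

In this sample the median is 64MB, so the 900MB partition exceeds both 5 x 64MB = 320MB and the 256MB threshold. A 300MB partition in its place would pass the threshold but not the factor test, and would not be flagged.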

      Monitoring, Cost, and Security Considerations

      Common Pitfalls and Recommended Patterns

        Frequently Asked Questions