Only when you call an action does Spark execute the plan

    Who this is for:

    Architecture / Concept Overview: Only when you call an action does Spark execute the plan

    Spark distributes computation across a cluster by dividing data into partitions and processing them in parallel across worker nodes. The driver program coordinates the work, while executors on worker nodes perform the actual data processing.

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    flowchart LR
      classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
      classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
      classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
      classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
      classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
      classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
      CODE[User Code]:::source --> DRIVER[Driver Program]:::processing
      DRIVER --> |Plan & Schedule| CM[Cluster Manager]:::governance
      CM --> E1[Executor 1]:::serving
      CM --> E2[Executor 2]:::serving
      CM --> E3[Executor 3]:::serving
      E1 --> STORE[Cloud Storage / Delta Lake]:::storage
      E2 --> STORE
      E3 --> STORE

    *Spark's distributed execution model: the driver plans work, the cluster manager allocates resources, and executors process data in parallel.*
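The driver/executor split described above can be mimicked in plain Python. This is only a toy sketch, not Spark code: the function names (`split_into_partitions`, `process_partition`, `run_job`) are hypothetical, and a thread pool stands in for executors, which in real Spark are separate JVM processes on worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Driver side: divide the data into roughly equal partitions."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def process_partition(partition):
    """Executor side: each worker handles one partition independently."""
    return sum(x * x for x in partition)


def run_job(records, num_partitions=3):
    """Coordinate the work: partition, process in parallel, combine."""
    partitions = split_into_partitions(records, num_partitions)
    # The thread pool is a stand-in for executors (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_results = list(pool.map(process_partition, partitions))
    # The driver combines the per-partition results into the final answer.
    return sum(partial_results)


total = run_job(list(range(10)))
print(total)
```

The point of the sketch is that each partition is processed without knowledge of the others, so adding workers lets more partitions run at the same time.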

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    graph TD
      classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
      classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
      classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
      classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
      classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
      classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
      SPARK[Apache Spark]:::processing
      SPARK --> SQL[Spark SQL & DataFrames]:::processing
      SPARK --> SS[Structured Streaming]:::ingestion
      SPARK --> MLLIB[MLlib]:::serving
      SPARK --> GX[GraphX]:::source
      SQL --> BATCH[Batch Analytics]:::storage
      SS --> RT[Real-Time Processing]:::ingestion
      MLLIB --> MLMOD[ML Pipelines]:::serving
      GX --> GRAPH[Graph Computation]:::source

    *Spark's unified engine supports four major workload types.*

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    flowchart LR
      classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
      classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
      classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
      classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
      classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
      classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
      T[Transformations - Lazy]:::processing --> T1[filter]:::processing
      T --> T2[select]:::processing
      T --> T3[groupBy]:::processing
      T --> T4[join]:::processing
      T --> T5[withColumn]:::processing
      A[Actions - Eager]:::serving --> A1[count]:::serving
      A --> A2[show]:::serving
      A --> A3[collect]:::serving
      A --> A4[write]:::serving
      A --> A5[save]:::serving

    *Spark operations are either lazy transformations, which only add steps to a query plan, or eager actions, which trigger execution of that plan.*
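To make the lazy/eager distinction concrete, here is a minimal toy model in plain Python. It is not the Spark API: `ToyFrame` and its `log` attribute are hypothetical names invented for this sketch. The method names `filter`, `select`, `count`, and `collect` are borrowed from the diagram above, but their signatures here are simplified assumptions.

```python
class ToyFrame:
    """Toy stand-in for a DataFrame: transformations are recorded, not run."""

    def __init__(self, rows, plan=(), log=None):
        self._rows = rows
        self._plan = plan                       # ordered list of pending steps
        self.log = [] if log is None else log   # one entry per real execution

    # Transformations: return a new ToyFrame with a longer plan, touch no data.
    def filter(self, predicate):
        return ToyFrame(self._rows, self._plan + (("filter", predicate),), self.log)

    def select(self, *columns):
        return ToyFrame(self._rows, self._plan + (("select", columns),), self.log)

    def _execute(self):
        """Run every recorded step in order; only actions call this."""
        self.log.append("executed")
        rows = self._rows
        for op, arg in self._plan:
            if op == "filter":
                rows = [r for r in rows if arg(r)]
            elif op == "select":
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows

    # Actions: trigger execution of the whole plan and return a result.
    def count(self):
        return len(self._execute())

    def collect(self):
        return self._execute()


rows = [
    {"name": "a", "amount": 5},
    {"name": "b", "amount": 50},
    {"name": "c", "amount": 500},
]
df = ToyFrame(rows)
big = df.filter(lambda r: r["amount"] > 10).select("name")
runs_before_action = len(big.log)   # nothing has executed yet
result = big.collect()              # the action runs the plan
runs_after_action = len(big.log)
print(runs_before_action, runs_after_action, result)
```

Chaining `filter` and `select` only grows the recorded plan; the data is first touched when `collect` is called. Real Spark additionally optimizes the plan before running it, which this sketch does not attempt.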

    Key Terms

    Prerequisites and Setup

    • A Databricks cluster running any supported Databricks Runtime.
    • A notebook attached to the cluster (Python, SQL, Scala, or R).
    • In Databricks notebooks, a SparkSession is pre-configured and available as the variable spark.

    Step-by-Step Implementation

      Configuration Reference

      Configuration options that affect how Spark executes the plan
      | Parameter | Description | Default |
      | --- | --- | --- |
      | spark.sql.shuffle.partitions | Number of partitions after a shuffle | 200 |
      | spark.sql.adaptive.enabled | Enable Adaptive Query Execution | true |
      | spark.default.parallelism | Default number of partitions for RDD operations | Total executor cores |
      | spark.sql.files.maxPartitionBytes | Max bytes per partition when reading files | 128 MB |
      | spark.serializer | Serializer for data exchange | JavaSerializer |
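As a rough illustration of what spark.sql.files.maxPartitionBytes implies, the sketch below estimates how many input partitions a single splittable file would need at the 128 MB default. The helper `estimate_input_partitions` is hypothetical, and the arithmetic is a deliberate simplification: it treats the setting as a plain upper bound on partition size and ignores the other factors Spark weighs when splitting files, so treat the result as a ballpark figure only.

```python
# Default taken from the table above: 128 MB, interpreted here as 128 * 1024 * 1024 bytes.
MAX_PARTITION_BYTES = 128 * 1024 * 1024


def estimate_input_partitions(file_size_bytes, max_partition_bytes=MAX_PARTITION_BYTES):
    """Ballpark partition count for one splittable file (simplified assumption)."""
    if file_size_bytes <= 0:
        return 0
    # Integer ceiling division: every started chunk needs its own partition.
    return -(-file_size_bytes // max_partition_bytes)


one_gib = 1024 ** 3
estimate = estimate_input_partitions(one_gib)
print(estimate)
```

Under these assumptions, halving the limit roughly doubles the number of read tasks, which is the trade-off to keep in mind when tuning this setting.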

      Monitoring, Cost, and Security Considerations

      Common Pitfalls and Recommended Patterns

        Frequently Asked Questions