Only when you call an action does Spark execute the plan
Who this is for:
Architecture / Concept Overview: Only when you call an action does Spark execute the plan
Spark distributes computation across a cluster by dividing data into partitions and processing them in parallel across worker nodes. The driver program coordinates the work, while executors on worker nodes perform the actual data processing.
```mermaid
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
CODE[User Code]:::source --> DRIVER[Driver Program]:::processing
DRIVER --> |Plan & Schedule| CM[Cluster Manager]:::governance
CM --> E1[Executor 1]:::serving
CM --> E2[Executor 2]:::serving
CM --> E3[Executor 3]:::serving
E1 --> STORE[Cloud Storage / Delta Lake]:::storage
E2 --> STORE
E3 --> STORE
```
*Spark's distributed execution model: the driver plans work, the cluster manager allocates resources, and executors process data in parallel.*
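
The division of labor in the diagram can be imitated in plain Python. The sketch below is a toy model, not Spark code: `split_into_partitions`, `process_partition`, and `run_job` are hypothetical names chosen for illustration, and a thread pool stands in for the executors. It shows only the shape of the idea: partition the data, process the partitions in parallel, and combine the partial results on the "driver".

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions roughly equal contiguous chunks."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `remainder` partitions each take one extra element.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def process_partition(partition):
    """Work done by one 'executor': a partial sum of squares."""
    return sum(x * x for x in partition)


def run_job(data, num_partitions=3):
    """'Driver' role: partition the data, fan out the work, combine results."""
    partitions = split_into_partitions(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_sums = list(pool.map(process_partition, partitions))
    return sum(partial_sums)


total = run_job(list(range(10)), num_partitions=3)
print(total)  # 285
```

In a real cluster the partitions live on different machines and the cluster manager decides where executors run; the thread pool here only mimics the fan-out and combine pattern.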
```mermaid
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
SPARK[Apache Spark]:::processing
SPARK --> SQL[Spark SQL & DataFrames]:::processing
SPARK --> SS[Structured Streaming]:::ingestion
SPARK --> MLLIB[MLlib]:::serving
SPARK --> GX[GraphX]:::source
SQL --> BATCH[Batch Analytics]:::storage
SS --> RT[Real-Time Processing]:::ingestion
MLLIB --> MLMOD[ML Pipelines]:::serving
GX --> GRAPH[Graph Computation]:::source
```
*Spark's unified engine supports four major workload types.*
```mermaid
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
T[Transformations - Lazy]:::processing --> T1[filter]:::processing
T --> T2[select]:::processing
T --> T3[groupBy]:::processing
T --> T4[join]:::processing
T --> T5[withColumn]:::processing
A[Actions - Eager]:::serving --> A1[count]:::serving
A --> A2[show]:::serving
A --> A3[collect]:::serving
A --> A4[write]:::serving
A --> A5[save]:::serving
```
*Spark operations are either lazy transformations (build a plan) or eager actions (trigger execution).*
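
To make the lazy/eager split concrete, here is a minimal pure-Python sketch. `LazyFrame` is a hypothetical toy class, not part of PySpark, and it leaves out everything a real engine does (optimization, partitioning, distribution). It assumes only the behavior described above: transformations record a step in a plan, and an action runs the whole plan.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: records steps, runs them only on an action."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of (kind, function) steps, not yet executed

    # Transformations: return a new LazyFrame and touch no data.
    def filter(self, predicate):
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def with_column(self, name, fn):
        def add_column(row):
            return {**row, name: fn(row)}
        return LazyFrame(self._rows, self._plan + (("map", add_column),))

    # Actions: execute the recorded plan and return a concrete result.
    def collect(self):
        rows = iter(self._rows)
        for kind, fn in self._plan:
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return list(rows)

    def count(self):
        return len(self.collect())


calls = []  # records which rows the predicate has actually seen


def is_large(row):
    calls.append(row["id"])  # side effect makes execution visible
    return row["amount"] > 100


orders = LazyFrame([
    {"id": 1, "amount": 50},
    {"id": 2, "amount": 150},
    {"id": 3, "amount": 300},
])

# Two transformations: only the plan grows, no row is read.
planned = orders.filter(is_large).with_column("tax", lambda r: r["amount"] // 10)
print(len(calls))       # 0

# The action triggers execution of the whole plan.
print(planned.count())  # 2
print(len(calls))       # 3
```

The same pattern applies to the operations in the diagram: chaining `filter`, `select`, or `withColumn` on a Spark DataFrame only extends the plan, and nothing is read or computed until an action such as `count`, `show`, or `collect` is called.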
Key Terms
Prerequisites and Setup
- A Databricks cluster running any supported Databricks Runtime.
- A notebook attached to the cluster (Python, SQL, Scala, or R).
- The `spark` session is pre-configured in Databricks notebooks.
Step-by-Step Implementation
Configuration Reference
| Parameter | Description | Default |
|---|---|---|
| `spark.sql.shuffle.partitions` | Number of partitions after a shuffle | 200 |
| `spark.sql.adaptive.enabled` | Enable Adaptive Query Execution | true |
| `spark.default.parallelism` | Default number of partitions for RDD operations | Total executor cores |
| `spark.sql.files.maxPartitionBytes` | Max bytes per partition when reading files | 128 MB |
| `spark.serializer` | Serializer for data exchange | JavaSerializer |
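
As a rough illustration of how `spark.sql.files.maxPartitionBytes` shapes a read, the sketch below estimates the number of input partitions for a single splittable file. `estimate_read_partitions` is a hypothetical helper, and the calculation is a deliberate simplification: it ignores file-open cost, small-file packing, and the core-count adjustments that Spark also applies, so treat the result as a ballpark figure rather than what Spark will report.

```python
DEFAULT_MAX_PARTITION_BYTES = 128 * 1024 * 1024  # 128 MB, the default in the table


def estimate_read_partitions(file_size_bytes,
                             max_partition_bytes=DEFAULT_MAX_PARTITION_BYTES):
    """Ballpark number of input partitions for one splittable file.

    Simplified assumption: the file is cut into chunks of at most
    max_partition_bytes each, and nothing else influences the split.
    """
    if file_size_bytes <= 0:
        return 0
    # Integer ceiling division avoids floating-point rounding.
    return (file_size_bytes + max_partition_bytes - 1) // max_partition_bytes


one_gib = 1024 ** 3
print(estimate_read_partitions(one_gib))                    # 8
print(estimate_read_partitions(one_gib, 64 * 1024 * 1024))  # 16
```

In a notebook, session-level SQL settings such as `spark.sql.shuffle.partitions` can be changed at runtime with `spark.conf.set("spark.sql.shuffle.partitions", "64")`. Settings like `spark.serializer` are typically fixed when the cluster starts, so they usually belong in the cluster's Spark configuration instead.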