Only when you call an action does Spark execute the plan

Who this is for:

Architecture / Concept Overview: Only when you call an action does Spark execute the plan

Spark distributes computation across a cluster by dividing data into partitions and processing them in parallel across worker nodes. The driver program coordinates the work, while executors on worker nodes perform the actual data processing.

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
    classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    CODE[User Code]:::source --> DRIVER[Driver Program]:::processing
    DRIVER --> |Plan & Schedule| CM[Cluster Manager]:::governance
    CM --> E1[Executor 1]:::serving
    CM --> E2[Executor 2]:::serving
    CM --> E3[Executor 3]:::serving
    E1 --> STORE[Cloud Storage / Delta Lake]:::storage
    E2 --> STORE
    E3 --> STORE

*Spark's distributed execution model: the driver plans work, the cluster manager allocates resources, and executors process data in parallel.*
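
The driver/executor split described above can be illustrated with a small toy model. This is a sketch in plain Python, not Spark code: a thread pool stands in for the executors, and the helper names (`split_into_partitions`, `process_partition`, `run_job`) are hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Divide rows into roughly equal, contiguous partitions."""
    size, extra = divmod(len(rows), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions each take one additional row.
        end = start + size + (1 if i < extra else 0)
        partitions.append(rows[start:end])
        start = end
    return partitions


def process_partition(partition):
    """Work one 'executor' does on its partition: a partial sum of squares."""
    return sum(x * x for x in partition)


def run_job(rows, num_partitions=3):
    """The 'driver': split the data, hand out partitions, combine the results."""
    partitions = split_into_partitions(rows, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_results = list(pool.map(process_partition, partitions))
    return partitions, sum(partial_results)


partitions, total = run_job(list(range(10)), num_partitions=3)
print([len(p) for p in partitions], total)
```

Each partition is processed independently and only the small partial results travel back to the coordinating function, which loosely mirrors how executors return results to the driver.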

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
    classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    SPARK[Apache Spark]:::processing
    SPARK --> SQL[Spark SQL & DataFrames]:::processing
    SPARK --> SS[Structured Streaming]:::ingestion
    SPARK --> MLLIB[MLlib]:::serving
    SPARK --> GX[GraphX]:::source
    SQL --> BATCH[Batch Analytics]:::storage
    SS --> RT[Real-Time Processing]:::ingestion
    MLLIB --> MLMOD[ML Pipelines]:::serving
    GX --> GRAPH[Graph Computation]:::source

*Spark's unified engine supports four major workload types.*

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
    classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    T[Transformations - Lazy]:::processing --> T1[filter]:::processing
    T --> T2[select]:::processing
    T --> T3[groupBy]:::processing
    T --> T4[join]:::processing
    T --> T5[withColumn]:::processing
    A[Actions - Eager]:::serving --> A1[count]:::serving
    A --> A2[show]:::serving
    A --> A3[collect]:::serving
    A --> A4[write]:::serving
    A --> A5[save]:::serving

*Spark operations are either lazy transformations (build a plan) or eager actions (trigger execution).*
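
To make the lazy/eager distinction concrete, here is a minimal sketch in plain Python. It is a toy model, not the Spark API: the `LazyFrame` class and its `explain` method are hypothetical stand-ins that only mimic the idea that transformations record steps while actions run them.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: transformations record steps, actions run them."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of (name, function) steps, not yet executed

    # Transformations: return a new LazyFrame and touch no data.
    def filter(self, predicate):
        step = ("filter", lambda rows: (r for r in rows if predicate(r)))
        return LazyFrame(self._rows, self._plan + (step,))

    def select(self, *columns):
        step = ("select", lambda rows: ({c: r[c] for c in columns} for r in rows))
        return LazyFrame(self._rows, self._plan + (step,))

    def explain(self):
        """List the recorded steps without running them."""
        return [name for name, _ in self._plan]

    # Actions: execute the recorded plan and return a result.
    def collect(self):
        rows = iter(self._rows)
        for _, step_fn in self._plan:
            rows = step_fn(rows)
        return list(rows)

    def count(self):
        return len(self.collect())


calls = []  # records every row the predicate actually inspects


def is_adult(row):
    calls.append(row["name"])
    return row["age"] >= 18


people = [
    {"name": "Ana", "age": 34},
    {"name": "Ben", "age": 12},
    {"name": "Cy", "age": 51},
]

adults = LazyFrame(people).filter(is_adult).select("name")
calls_before_action = len(calls)  # still 0: only a plan exists so far
result = adults.collect()         # the action runs filter, then select
calls_after_action = len(calls)   # every input row has now been inspected
print(adults.explain(), calls_before_action, result, calls_after_action)
```

Chaining `filter` and `select` never calls `is_adult`; only `collect` does. Real Spark goes further by optimizing the recorded plan before running it, which this sketch does not attempt.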

Key Terms

Prerequisites and Setup

A Databricks cluster running any supported Databricks Runtime.
A notebook attached to the cluster (Python, SQL, Scala, or R).
The `spark` session object is already created and configured in Databricks notebooks, so you do not need to build one yourself.

Step-by-Step Implementation

Configuration Reference

Configuration options that affect how Spark executes a plan
Parameter	Description	Default
`spark.sql.shuffle.partitions`	Number of partitions after a shuffle	200
`spark.sql.adaptive.enabled`	Enable Adaptive Query Execution	true
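`spark.default.parallelism`	Default number of partitions for RDD operations when no parent RDD determines it	Total executor cores (at least 2) on most cluster managers
`spark.sql.files.maxPartitionBytes`	Max bytes per partition when reading files	128 MB
`spark.serializer`	Class used to serialize objects sent over the network or cached in serialized form	`org.apache.spark.serializer.JavaSerializer`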
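
As a rough illustration of what these defaults imply, the sketch below does back-of-the-envelope arithmetic in plain Python. It is an approximation under stated assumptions, not Spark's actual planning logic: it ignores file boundaries, other settings such as `spark.sql.files.openCostInBytes`, and any partition coalescing done by Adaptive Query Execution. The function names, the 10 GiB input size, and the 16-core cluster are hypothetical.

```python
import math

MAX_PARTITION_BYTES = 128 * 1024 * 1024  # default spark.sql.files.maxPartitionBytes
SHUFFLE_PARTITIONS = 200                 # default spark.sql.shuffle.partitions


def estimate_input_partitions(total_bytes, max_partition_bytes=MAX_PARTITION_BYTES):
    """Rough estimate: one read partition per max_partition_bytes of input."""
    return max(1, math.ceil(total_bytes / max_partition_bytes))


def task_waves(num_tasks, total_cores):
    """Rounds of tasks needed if each core runs one task at a time."""
    return math.ceil(num_tasks / total_cores)


ten_gib = 10 * 1024 ** 3  # hypothetical input size
read_partitions = estimate_input_partitions(ten_gib)
shuffle_waves = task_waves(SHUFFLE_PARTITIONS, total_cores=16)  # hypothetical cluster
print(read_partitions, shuffle_waves)
```

Estimates like these are only a starting point for tuning; the Spark UI shows the task counts a real job produced.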

Monitoring, Cost, and Security Considerations

Common Pitfalls and Recommended Patterns

Frequently Asked Questions

Related Topics