Working with DataFrames and the Spark SQL API
Who this is for:
Architecture / Concept Overview: Working with DataFrames and the Spark SQL API
DataFrames are distributed collections of rows with named, typed columns. Under the hood, both the DataFrame API and Spark SQL compile to the same logical plan, optimized by Catalyst, and executed by the same engine.
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
PY[Python DataFrame API]:::source --> LP[Logical Plan]:::processing
SQL[Spark SQL]:::source --> LP
SC[Scala DataFrame API]:::source --> LP
LP --> CAT[Catalyst Optimizer]:::processing
CAT --> PP[Physical Plan]:::processing
PP --> PHOTON[Photon / Spark Engine]:::serving
PHOTON --> RES[Results / Delta Tables]:::storage
*Both the DataFrame API and Spark SQL produce the same optimized execution plan.*
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
OPS[DataFrame Operations]:::processing
OPS --> SEL[Selection: select, filter, where]:::processing
OPS --> PROJ[Projection: withColumn, drop, alias]:::processing
OPS --> AGG[Aggregation: groupBy, agg, pivot]:::ingestion
OPS --> JOIN[Joins: join, crossJoin]:::serving
OPS --> WIN[Windows: over, partitionBy]:::storage
OPS --> IO[I/O: read, write, saveAsTable]:::source
*Major categories of DataFrame operations.*
Key Terms
Prerequisites and Setup
- A Databricks notebook attached to a running cluster.
- Sample data in Unity Catalog or cloud storage.
Step-by-Step Implementation
Configuration Reference
| Parameter | Description | Default |
|---|---|---|
spark.sql.shuffle.partitions | Partitions created during shuffles (joins, aggregations) | 200 |
spark.sql.adaptive.enabled | Enable runtime query plan optimization | true |
spark.sql.adaptive.coalescePartitions.enabled | Automatically coalesce small partitions | true |
spark.sql.autoBroadcastJoinThreshold | Max table size for automatic broadcast join | 10MB |
spark.sql.files.maxPartitionBytes | Max bytes per file partition | 128MB |