Working with DataFrames and the Spark SQL API

Who this is for:

Architecture / Concept Overview: Working with DataFrames and the Spark SQL API

DataFrames are distributed collections of rows with named, typed columns. Under the hood, both the DataFrame API and Spark SQL compile to the same logical plan, optimized by Catalyst, and executed by the same engine.

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%% flowchart LR classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED PY[Python DataFrame API]:::source --> LP[Logical Plan]:::processing SQL[Spark SQL]:::source --> LP SC[Scala DataFrame API]:::source --> LP LP --> CAT[Catalyst Optimizer]:::processing CAT --> PP[Physical Plan]:::processing PP --> PHOTON[Photon / Spark Engine]:::serving PHOTON --> RES[Results / Delta Tables]:::storage

*Both the DataFrame API and Spark SQL produce the same optimized execution plan.*

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%% graph TD classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED OPS[DataFrame Operations]:::processing OPS --> SEL[Selection: select, filter, where]:::processing OPS --> PROJ[Projection: withColumn, drop, alias]:::processing OPS --> AGG[Aggregation: groupBy, agg, pivot]:::ingestion OPS --> JOIN[Joins: join, crossJoin]:::serving OPS --> WIN[Windows: over, partitionBy]:::storage OPS --> IO[I/O: read, write, saveAsTable]:::source

*Major categories of DataFrame operations.*

Key Terms

Prerequisites and Setup

A Databricks notebook attached to a running cluster.
Sample data in Unity Catalog or cloud storage.

Step-by-Step Implementation

Configuration Reference

Working with DataFrames and the Spark SQL API configuration options
Parameter	Description	Default
`spark.sql.shuffle.partitions`	Partitions created during shuffles (joins, aggregations)	200
`spark.sql.adaptive.enabled`	Enable runtime query plan optimization	true
`spark.sql.adaptive.coalescePartitions.enabled`	Automatically coalesce small partitions	true
`spark.sql.autoBroadcastJoinThreshold`	Max table size for automatic broadcast join	10MB
`spark.sql.files.maxPartitionBytes`	Max bytes per file partition	128MB

Working with DataFrames and the Spark SQL API

Architecture / Concept Overview: Working with DataFrames and the Spark SQL API

Key Terms

Prerequisites and Setup

Step-by-Step Implementation

Configuration Reference

Monitoring, Cost, and Security Considerations

Common Pitfalls and Recommended Patterns

Frequently Asked Questions

Working with DataFrames and the Spark SQL API

Architecture / Concept Overview: Working with DataFrames and the Spark SQL API

Key Terms

Prerequisites and Setup

Step-by-Step Implementation

Configuration Reference

Monitoring, Cost, and Security Considerations

Common Pitfalls and Recommended Patterns

Frequently Asked Questions

Related Topics