Convert to pandas for detailed manipulation
Running your first Spark query on Databricks requires only an attached cluster and a single line of SQL or Python — the platform handles distributed execution, optimisation, and result rendering automatically. Within minutes you can query sample datasets, create tables, and see results visualised inline.
Who this is for:
Part of the Getting Started with Databricks section of the Databricks tutorial series.
Architecture / Concept Overview: Convert to pandas for detailed manipulation
When you submit a query in Databricks, the Spark engine parses it, optimises the execution plan via Catalyst, and distributes work across cluster workers. You interact with a high-level API (SQL or DataFrames) and Spark handles parallelism, fault tolerance, and memory management behind the scenes.
*Figure 1 — Query execution: your code is optimised, distributed across workers, and results are collected.*
*Figure 2 — Two equivalent query interfaces: SQL and DataFrames both execute through the same optimised engine.*
*Figure 3 — Lazy evaluation: transformations queue up until an action triggers execution.*
Key Terms
Prerequisites and Setup
- A Databricks notebook attached to a running cluster
- A cluster with at least one worker node, or a single-node cluster, which is enough for small queries
- No additional libraries needed — Spark is pre-configured in every cluster
- Access to the `samples` catalog (available by default in all workspaces)
Step-by-Step Implementation
Configuration Reference
| Parameter | Description | Recommended for First Queries |
|---|---|---|
| Cluster workers | Parallel execution units | 1-2 for sample data |
| Spark version | Runtime version | Latest LTS |
| Photon acceleration | Vectorised engine | Enable (default on newer runtimes) |
| spark.sql.shuffle.partitions | Parallelism for shuffles | 8 for small data (default 200) |
| display() max rows | Rows shown in output | 1000 (default) |
| Query result cache | Cache repeated queries | Enabled by default |
Monitoring, Cost, and Security Considerations
Monitoring
Check the Spark UI (accessible from cell output or cluster page) to see job stages, task distribution, and execution times. Watch for skewed partitions where one task takes much longer than others.
Cost Optimisation
Use .limit() when exploring large tables — avoid scanning entire datasets unnecessarily. Cache frequently reused DataFrames with .cache() to avoid recomputation. Run simple queries on single-node clusters to minimise cost.
Security and Governance
Queries respect Unity Catalog permissions — you only see tables you have access to. Spark logs are visible to cluster owners. Avoid printing sensitive data from queries to notebook outputs that may be shared.
Common Pitfalls and Recommended Patterns
- Calling `.collect()` on large DataFrames: this pulls all data to the driver and can cause out-of-memory errors
- Not using `display()`: `df.show()` truncates output; `display()` provides scrollable tables and charts
- Forgetting that Spark is lazy: nothing executes until you call an action (show, write, count, collect)
- Running `count()` before filtering: on large tables, count scans everything; filter first for better performance
- Using Python loops to process rows: use DataFrame transformations instead; they execute in parallel across the cluster
- Not leveraging sample datasets: `samples.nyctaxi.trips` and other built-in datasets are perfect for learning without setup
Frequently Asked Questions
What is the difference between SQL and DataFrame queries?
They produce identical results. SQL is more accessible for analysts; DataFrames offer more programmatic flexibility. Under the hood, both are planned by the same Catalyst optimizer and run on the same execution engine (Photon, where it is enabled).
Why does my first query take longer than expected?
The first query on a new cluster incurs JVM warm-up time and metadata loading. Subsequent queries on the same cluster are significantly faster.
How large a dataset can Spark handle?
Spark is designed for petabyte-scale data. It distributes processing across cluster workers and can scale horizontally by adding nodes. Sample datasets are small, but the same code scales to billions of rows.
Do I need to understand distributed computing?
Not for basic queries. Spark abstracts distribution — you write standard SQL or DataFrame code and Spark handles parallelism. Understanding partitioning helps for performance tuning on large datasets.
Can I use pandas syntax instead of Spark DataFrames?
Yes. Databricks supports the Pandas API on Spark (`import pyspark.pandas as ps`), which provides pandas-compatible syntax that executes on Spark under the hood.