Convert to pandas for detailed manipulation

Running your first Spark query on Databricks requires only an attached cluster and a single line of SQL or Python — the platform handles distributed execution, optimisation, and result rendering automatically. Within minutes you can query sample datasets, create tables, and see results visualised inline.

    Part of the Getting Started with Databricks section of the Databricks tutorial series.

    Architecture / Concept Overview: Convert to pandas for detailed manipulation

    When you submit a query in Databricks, the Spark engine parses it, optimises the execution plan via Catalyst, and distributes work across cluster workers. You interact with a high-level API (SQL or DataFrames) and Spark handles parallelism, fault tolerance, and memory management behind the scenes.

    [Diagram: Your Query → Catalyst Optimizer → Execution Plan → Driver Node → Worker 1, Worker 2, ... Worker N → Collected Results]

    *Figure 1 — Query execution: your code is optimised, distributed across workers, and results are collected.*

    [Diagram: SQL API and DataFrame API → Spark SQL Engine → Photon Engine → Delta Lake]

    *Figure 2 — Two equivalent query interfaces: SQL and DataFrames both execute through the same optimised engine.*
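As a rough illustration of Figure 2, the sketch below expresses one aggregation through both interfaces. It assumes the `spark` session that Databricks notebooks predefine, and it assumes that `samples.nyctaxi.trips` exposes `pickup_zip` and `fare_amount` columns. The function names are made up for this example.

```python
TRIPS = "samples.nyctaxi.trips"  # built-in sample table; column names below are assumed


def avg_fare_by_zip_sql(spark):
    """SQL interface: average fare per pickup ZIP code."""
    return spark.sql(
        f"""
        SELECT pickup_zip, AVG(fare_amount) AS avg_fare
        FROM {TRIPS}
        GROUP BY pickup_zip
        """
    )


def avg_fare_by_zip_df(spark):
    """DataFrame interface: the same aggregation built from method calls."""
    return (
        spark.table(TRIPS)
        .groupBy("pickup_zip")
        .avg("fare_amount")  # produces a column named "avg(fare_amount)"
        .withColumnRenamed("avg(fare_amount)", "avg_fare")
    )
```

In a notebook, `display(avg_fare_by_zip_sql(spark))` and `display(avg_fare_by_zip_df(spark))` should show the same rows, possibly in a different order.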

    [Diagram: Lazy Evaluation → transformations (filter, groupBy, join) → Action: show/collect/write → Spark Executes All]

    *Figure 3 — Lazy evaluation: transformations queue up until an action triggers execution.*
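The lazy behaviour in Figure 3 can be sketched as follows. This is a minimal example, assuming the notebook's predefined `spark` session and `trip_distance` and `pickup_zip` columns in the sample table. The function name and default values are illustrative only.

```python
def count_long_trips_by_zip(spark, min_distance=10.0, rows=5):
    """Build a plan lazily, then trigger it with a single action."""
    # Defining the source reads no data yet.
    trips = spark.table("samples.nyctaxi.trips")
    # Transformations: these only extend the logical plan.
    long_trips = trips.filter(f"trip_distance > {min_distance}")
    by_zip = long_trips.groupBy("pickup_zip").count()
    # take() is an action: only now does Spark optimise and run the whole plan.
    return by_zip.take(rows)
```

Until `take()` is called, the cell returns almost immediately because nothing has been computed.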

    Prerequisites and Setup

    • A Databricks notebook attached to a running cluster
    • A cluster with at least one worker node; a single-node cluster is enough for small queries
    • No additional libraries needed — Spark is pre-configured in every cluster
    • Access to the samples catalog (available by default in all workspaces)

    Step-by-Step Implementation
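One possible sequence for a first session is sketched below: query the sample data, save a bounded slice as a table, and read it back. It assumes the notebook's predefined `spark` session and permission to create tables. The target name `main.default.trips_sample` is hypothetical, so substitute a catalog and schema you can write to.

```python
def first_queries(spark, target_table="main.default.trips_sample"):
    """Query a sample table, save a small slice as a new table, and read it back."""
    # Step 1: query the built-in sample dataset (column names are assumed).
    trips = spark.sql(
        "SELECT pickup_zip, trip_distance, fare_amount "
        "FROM samples.nyctaxi.trips WHERE trip_distance > 0"
    )
    # Step 2: create a table from a bounded slice of the result.
    trips.limit(1000).write.mode("overwrite").saveAsTable(target_table)
    # Step 3: read the new table back so it can be displayed.
    return spark.table(target_table)
```

In a notebook cell, `display(first_queries(spark))` would render the saved rows as an inline table.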

      Configuration Reference

      Convert to pandas for detailed manipulation configuration options
      | Parameter | Description | Recommended for First Queries |
      | --- | --- | --- |
      | Cluster workers | Parallel execution units | 1-2 for sample data |
      | Spark version | Runtime version | Latest LTS |
      | Photon acceleration | Vectorised engine | Enable (default on newer runtimes) |
      | spark.sql.shuffle.partitions | Parallelism for shuffles | 8 for small data (default 200) |
      | display() max rows | Rows shown in output | 1000 (default) |
      | Query result cache | Cache repeated queries | Enabled by default |
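Of these options, `spark.sql.shuffle.partitions` is the one most easily changed from a notebook. A minimal sketch, assuming the predefined `spark` session; the helper name is illustrative:

```python
def set_shuffle_partitions(spark, partitions=8):
    """Set spark.sql.shuffle.partitions for this session and return the old value."""
    key = "spark.sql.shuffle.partitions"
    previous = spark.conf.get(key)
    spark.conf.set(key, str(partitions))
    return previous
```

Keeping the previous value makes it easy to restore the setting with `spark.conf.set("spark.sql.shuffle.partitions", previous)` before running larger jobs.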

      Monitoring, Cost, and Security Considerations

      Monitoring

      Check the Spark UI (accessible from cell output or cluster page) to see job stages, task distribution, and execution times. Watch for skewed partitions where one task takes much longer than others.
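As a complement to the Spark UI, a quick check of how rows are spread across a DataFrame's partitions might look like the sketch below. It relies on Spark SQL's built-in `spark_partition_id()` function; the helper name is made up, and the result describes the DataFrame's current partitioning rather than every stage of a job.

```python
def rows_per_partition(df):
    """Count the rows in each partition of df, largest partition first."""
    return (
        df.selectExpr("spark_partition_id() AS partition_id")
        .groupBy("partition_id")
        .count()
        .orderBy("count", ascending=False)
    )
```

If the first row's count is far larger than the rest, the data is probably skewed.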

      Cost Optimisation

      Use .limit() when exploring large tables — avoid scanning entire datasets unnecessarily. Cache frequently reused DataFrames with .cache() to avoid recomputation. Run simple queries on single-node clusters to minimise cost.
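The `.limit()` and `.cache()` advice can be combined in a small helper. This is a sketch under the assumption that the predefined `spark` session is available; the function name and the row cap are illustrative.

```python
def cached_sample(spark, table_name="samples.nyctaxi.trips", max_rows=1000):
    """Return a bounded, cached slice of a table for repeated exploration."""
    sample = spark.table(table_name).limit(max_rows).cache()
    # cache() is lazy: this action materialises the cached rows.
    sample.count()
    return sample
```

Call `sample.unpersist()` on the returned DataFrame once exploration is finished so the cached rows stop occupying cluster memory.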

      Security and Governance

      Queries respect Unity Catalog permissions — you only see tables you have access to. Spark logs are visible to cluster owners. Avoid printing sensitive data from queries to notebook outputs that may be shared.

      Common Pitfalls and Recommended Patterns

      • Calling .collect() on large DataFrames — this pulls all data to the driver and can cause out-of-memory errors
      • Not using display(): df.show() truncates output; display() provides scrollable tables and charts
      • Forgetting that Spark is lazy — nothing executes until you call an action (show, write, count, collect)
      • Running count() before filtering — on large tables, count scans everything; filter first for better performance
      • Using Python loops to process rows — use DataFrame transformations instead; they execute in parallel across the cluster
      • Not leveraging sample datasets — samples.nyctaxi.trips and other built-in datasets are perfect for learning without setup
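Several of these patterns can be seen together in one short sketch: filter before counting, use a column expression rather than a Python loop, and bound the result before collecting. It assumes the predefined `spark` session and `trip_distance`, `fare_amount`, and `pickup_zip` columns in the sample table. The function name is hypothetical.

```python
def fare_per_mile_preview(spark, min_distance=1.0, rows=20):
    """Return the filtered row count and a small preview of fare per mile."""
    trips = spark.table("samples.nyctaxi.trips")
    # Filter first so later steps touch fewer rows.
    filtered = trips.filter(f"trip_distance >= {min_distance}")
    matching = filtered.count()  # action on the filtered data only
    # A column expression replaces a row-by-row Python loop and runs on the workers.
    per_mile = filtered.selectExpr(
        "pickup_zip", "fare_amount / trip_distance AS fare_per_mile"
    )
    # Bound the result before bringing it to the driver.
    preview = per_mile.limit(rows).collect()
    return matching, preview
```

Because the result is capped with `limit()`, the `collect()` call stays safe even if the underlying table is large.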

      Frequently Asked Questions

      What is the difference between SQL and DataFrame queries?

      Equivalent queries produce identical results. SQL is more accessible for analysts; DataFrames offer more programmatic flexibility. Under the hood, both are planned by the same Catalyst optimizer and, where Photon is enabled, run on the same Photon engine.

      Why does my first query take longer than expected?

      The first query on a new cluster incurs JVM warm-up time and metadata loading. Subsequent queries on the same cluster are significantly faster.

      How large a dataset can Spark handle?

      Spark is designed for petabyte-scale data. It distributes processing across cluster workers and can scale horizontally by adding nodes. Sample datasets are small, but the same code scales to billions of rows.

      Do I need to understand distributed computing?

      Not for basic queries. Spark abstracts distribution — you write standard SQL or DataFrame code and Spark handles parallelism. Understanding partitioning helps for performance tuning on large datasets.

      Can I use pandas syntax instead of Spark DataFrames?

      Yes. Databricks supports the pandas API on Spark (import pyspark.pandas as ps), which provides pandas-compatible syntax that executes on Spark under the hood.
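The two routes to pandas-style code can be sketched as follows: staying on Spark through the pandas API, or converting a bounded slice to a real pandas DataFrame on the driver. The sketch assumes a Spark DataFrame with `pickup_zip` and `fare_amount` columns and a runtime based on Spark 3.2 or later; the function names and the row cap are illustrative.

```python
def mean_fare_by_zip_pandas_style(spark_df):
    """pandas-style groupby that still executes on Spark."""
    psdf = spark_df.pandas_api()  # Spark DataFrame -> pandas-on-Spark DataFrame
    return psdf.groupby("pickup_zip")["fare_amount"].mean()


def to_local_pandas(spark_df, max_rows=10_000):
    """Convert a bounded slice to a real pandas DataFrame on the driver."""
    # limit() keeps the conversion from pulling an entire large table into memory.
    return spark_df.limit(max_rows).toPandas()
```

The first function keeps the work distributed; the second suits detailed manipulation of a result that already fits comfortably in driver memory.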