Photon operators appear as "PhotonGroupingAgg", "PhotonScan", etc.

Photon is a C++ native vectorised query engine built into Databricks that accelerates SQL and DataFrame workloads by processing data in columnar batches instead of row-by-row. It replaces parts of the Spark SQL engine at runtime, which can make scan-heavy, aggregation, and join operations 3-8x faster without any code changes. Photon is always enabled on SQL warehouses and can optionally be enabled on clusters.

  • Understand how Photon differs from the standard Spark SQL engine
  • Learn which workloads benefit most from Photon acceleration
  • Enable and configure Photon on clusters and SQL warehouses

Who this is for: Data engineers, analysts, and platform administrators who want to accelerate SQL and DataFrame query performance.

Part of the Databricks Compute section of the Databricks tutorial series.

Architecture / Concept Overview

Photon sits below the Spark SQL optimiser as an alternative execution backend. When a query is submitted, the Catalyst optimiser generates a plan, and Photon-eligible operators (scans, filters, joins, aggregations) are offloaded to the native C++ engine. Non-supported operators fall back to the standard JVM-based Spark engine transparently. This hybrid execution means you get acceleration where Photon excels without losing compatibility.

[Diagram: SQL / DataFrame -> Catalyst Optimiser -> Photon C++ Engine or Spark JVM Engine -> Delta Lake]

*Catalyst routes Photon-eligible operators to the native C++ engine; unsupported operators fall back to Spark JVM.*
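
The routing described above can be pictured with a toy model. This is only a sketch of the idea, not the real Catalyst or Photon planner: the operator names and the eligibility set are assumptions taken from the prose (scans, filters, joins, aggregations), and a real plan is a tree rather than a flat list.

```python
# Toy model of hybrid execution. NOT the real Catalyst/Photon planner.
# The eligibility set is an assumption based on the operators named in the text.
PHOTON_ELIGIBLE = {"scan", "filter", "join", "aggregate"}


def assign_engines(operators):
    """Return (operator, engine) pairs for a flat list of plan operators."""
    return [
        (op, "photon" if op in PHOTON_ELIGIBLE else "jvm")
        for op in operators
    ]


# Hypothetical plan: a Python UDF sits between eligible operators.
plan = ["scan", "filter", "python_udf", "aggregate"]
for op, engine in assign_engines(plan):
    print(f"{op:<12} -> {engine}")
```

In this sketch only the unsupported operator falls back, which mirrors the point that a single unsupported step does not disable acceleration for the rest of the query.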

Photon processes data in columnar batches and uses SIMD instructions, which makes better use of CPU caches than the row-at-a-time processing of the JVM engine.

[Diagram: Spark JVM: Row 1, Row 2, Row 3 -> Process. Photon Native: Column Batch -> Vectorised SIMD]

*Spark processes rows one at a time; Photon processes columnar batches with SIMD for higher throughput.*
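
To make the layout difference concrete, here is a small pure-Python sketch. It illustrates only how the same data is arranged and traversed in each model. Pure Python has no SIMD, so it says nothing about Photon's actual speed, and the table and column names are invented for the example.

```python
# Illustrative only: shows row layout versus column layout, not real performance.
rows = [
    {"id": 1, "price": 10.0, "qty": 2},
    {"id": 2, "price": 4.5, "qty": 10},
    {"id": 3, "price": 7.25, "qty": 4},
]


def revenue_row_at_a_time(rows):
    """Row-oriented: visit one whole record, then move to the next."""
    total = 0.0
    for row in rows:
        total += row["price"] * row["qty"]
    return total


def to_column_batch(rows):
    """Pivot records into one list per column (a 'column batch')."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def revenue_columnar(batch):
    """Column-oriented: one tight loop over two same-typed columns."""
    return sum(p * q for p, q in zip(batch["price"], batch["qty"]))


batch = to_column_batch(rows)
print(revenue_row_at_a_time(rows))  # 94.0
print(revenue_columnar(batch))      # 94.0
```

The columnar version never touches the unused `id` column, and each loop reads values of a single type stored side by side, which is the property a native engine can exploit with SIMD.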

[Diagram: Table Scan -> Filter -> Aggregation -> Hash Join -> Result]

*Photon accelerates the most common SQL operators: scans, filters, aggregations, and hash joins.*

Key Terms

Photon
A C++ native vectorised query engine that accelerates SQL and DataFrame operations on Databricks.
Vectorised Execution
Processing data in columnar batches rather than row-by-row, enabling CPU cache efficiency and SIMD instructions.
SIMD
Single Instruction, Multiple Data; CPU instructions that process multiple data values in a single operation.
Catalyst Optimiser
Spark SQL's query planning and optimisation engine that generates physical execution plans.
Photon-Eligible
An operator or query pattern that Photon can execute natively instead of falling back to JVM Spark.

Prerequisites and Setup

  • A Databricks workspace with Photon-enabled runtimes available
  • Clusters running Databricks Runtime with Photon (or a SQL warehouse, which always has Photon)
  • Workloads using SQL or DataFrame API (PySpark, Spark SQL, Scala DataFrames)
  • Delta Lake tables for optimal scan performance

Step-by-Step Implementation

    Configuration Reference

    Photon configuration options

    | Setting | Description | Default |
    |---|---|---|
    | runtime_engine | Set to PHOTON to enable on clusters | STANDARD |
    | spark_version | Use Photon-enabled runtime (e.g., LATEST_LTS_PHOTON) | Required for Photon |
    | spark.databricks.photon.enabled | Runtime flag for Photon | true on Photon runtimes |
    | SQL warehouse Photon | Always enabled on SQL warehouses | Cannot be disabled |
    | Supported file formats | Delta Lake, Parquet | Best on Delta |
    | Supported operations | Scan, filter, project, aggregate, join, sort | Most SQL operators |
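
    As a rough sketch of how the runtime_engine setting might be used, the snippet below builds a cluster definition as a Python dictionary of the kind sent to the Databricks Clusters API. The cluster name, runtime version string, node type, and worker count are placeholders chosen for illustration, so substitute values that your own workspace lists.

```python
import json


def photon_cluster_spec(name, spark_version, node_type_id, num_workers):
    """Build a cluster definition that requests the Photon engine."""
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        "runtime_engine": "PHOTON",  # the table above lists STANDARD as the default
    }


spec = photon_cluster_spec(
    name="photon-demo",                # hypothetical cluster name
    spark_version="15.4.x-scala2.12",  # placeholder runtime version
    node_type_id="i3.xlarge",          # placeholder, cloud-specific node type
    num_workers=2,
)
print(json.dumps(spec, indent=2))
```

    Keeping the engine choice in one helper makes it easy to create otherwise identical Photon and non-Photon clusters for a side-by-side comparison.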

    Monitoring, Cost, and Security Considerations

    Monitoring

    Use the Spark UI and query profile to see which operators ran on Photon versus the JVM engine. Monitor Photon-specific metrics in system tables to track acceleration ratios across your workloads.

    Cost Optimisation

    - Photon-enabled clusters consume DBUs at a higher hourly rate than the same cluster without Photon. However, faster execution typically results in lower total cost because the cluster runs for less time.

    - Benchmark your specific workloads: Photon provides the most benefit on scan-heavy and aggregation-heavy queries.

    - For UDF-heavy workloads that Photon cannot accelerate, the higher DBU rate may not pay off.
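
    The trade-off in this list comes down to simple arithmetic. The sketch below assumes cost is DBU consumption rate x runtime x price per DBU, and every number in it is hypothetical, not a published Databricks rate.

```python
def job_cost(dbu_per_hour, runtime_hours, price_per_dbu):
    """Total cost of one run under the assumed cost model."""
    return dbu_per_hour * runtime_hours * price_per_dbu


def break_even_speedup(standard_dbu_per_hour, photon_dbu_per_hour):
    """Speedup above which the Photon run is cheaper, at equal price per DBU."""
    return photon_dbu_per_hour / standard_dbu_per_hour


# Hypothetical figures: Photon doubles the DBU rate but runs 4x faster.
standard = job_cost(dbu_per_hour=10.0, runtime_hours=2.0, price_per_dbu=0.5)
photon = job_cost(dbu_per_hour=20.0, runtime_hours=0.5, price_per_dbu=0.5)

print(standard)                        # 10.0
print(photon)                          # 5.0
print(break_even_speedup(10.0, 20.0))  # 2.0
```

    With these made-up numbers, any speedup above 2x lowers total cost, while a UDF-heavy job that barely speeds up would cost more on Photon.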

    Security and Governance

    - Photon does not change the security model; Unity Catalog access controls and encryption apply identically.

    - Photon operates within the same cluster isolation boundaries as the standard engine.

    - No additional permissions are needed to use Photon beyond the ability to create Photon-enabled clusters.

    Common Pitfalls and Recommended Patterns

    • Expecting Photon to accelerate Python UDFs: Photon runs native C++ SQL operators, not arbitrary Python code.
    • Not benchmarking before and after: always measure your actual workload rather than relying on published benchmarks.
    • Using non-Delta formats: Photon performs best on Delta Lake tables with optimised file sizes.
    • Forgetting to check the query plan: verify that Photon operators appear in your physical plan.
    • Disabling Photon to save DBU cost without measuring total runtime cost: faster queries often cost less overall.
    • Using Photon on streaming micro-batches that are already fast: the acceleration overhead may exceed the benefit on tiny datasets.
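
    Checking the query plan can be partly automated. The sketch below scans plan text for operator names that start with "Photon". The sample plan is hand-written for illustration and is not real output. On a cluster, the text printed by DataFrame.explain() could be pasted in instead.

```python
import re

# Hand-written sample plan, NOT real output from a cluster.
SAMPLE_PLAN = """\
== Physical Plan ==
Project [country, total]
+- BatchEvalPython [my_udf(country)]
   +- PhotonGroupingAgg(keys=[country], functions=[sum(amount)])
      +- PhotonScan parquet sales[country,amount]
"""

# Skip leading whitespace and tree-drawing characters, then capture the name.
OPERATOR = re.compile(r"^[\s:|+\-]*([A-Za-z]\w*)")


def split_operators(plan_text):
    """Return (photon_operators, other_operators) found in plan text."""
    photon, other = [], []
    for line in plan_text.splitlines():
        match = OPERATOR.match(line)
        if not match:
            continue  # header lines such as "== Physical Plan ==" and blanks
        name = match.group(1)
        (photon if name.startswith("Photon") else other).append(name)
    return photon, other


photon_ops, other_ops = split_operators(SAMPLE_PLAN)
print("Photon:", photon_ops)
print("Other: ", other_ops)
```

    An empty Photon list, or a large share of non-Photon operators, suggests that the query is mostly running on the standard engine.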

    Frequently Asked Questions

    Does Photon require code changes?

    No. Photon is a runtime engine replacement. Your existing SQL and DataFrame code runs unchanged; Photon transparently accelerates eligible operators.

    Can I use Photon with Python UDFs?

    Photon cannot execute Python UDFs. The UDF portion falls back to the JVM engine while Photon handles the surrounding SQL operators. For maximum benefit, minimise UDFs in favour of built-in SQL functions.

    Is Photon always faster?

    For scan-heavy, aggregation-heavy, and join-heavy workloads, usually yes. For workloads dominated by UDFs, complex ML transformations, or very small datasets, the improvement may be negligible.

    Does Photon cost more?

    Photon-enabled compute consumes DBUs at a higher rate, but queries typically run faster, so the total cost (DBU rate x runtime) is often lower. Benchmark your workloads to confirm.
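
    One way to benchmark is to time the same query several times on each configuration and compare medians. The harness below is a minimal sketch: run_query is a hypothetical callable, and fake_query is a stand-in so the example runs anywhere. On a cluster it might wrap a call that forces execution, such as spark.sql(query).collect().

```python
import statistics
import time


def median_runtime(run_query, repeats=5):
    """Call run_query `repeats` times and return the median wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


def fake_query():
    """Stand-in workload so the sketch runs without a cluster."""
    sum(i * i for i in range(100_000))


print(f"median runtime: {median_runtime(fake_query):.4f}s")
print(speedup(12.0, 4.0))  # 3.0
```

    Using the median rather than a single run reduces the influence of warm-up and caching effects on the comparison.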