Read a specific version
Who this is for:
Architecture / Concept Overview: Read a specific version
Apache Spark on Databricks runs on the Databricks Runtime, which includes a customized Spark distribution, optimized connectors, and the Photon vectorized engine. Clusters are managed through the workspace, with autoscaling, spot instance support, and automatic termination.
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
USER[User Code: Python/SQL/Scala/R]:::source --> API[DataFrame / Spark SQL API]:::processing
API --> CAT[Catalyst Optimizer]:::processing
CAT --> PHO[Photon Engine]:::processing
CAT --> SPARK[Spark Engine]:::processing
PHO --> EXE[Executors on Workers]:::serving
SPARK --> EXE
EXE --> DL[Delta Lake on Cloud Storage]:::storage
EXE --> UC[Unity Catalog]:::governance
*Spark on Databricks: user code flows through the Catalyst optimizer and optionally Photon to distributed executors.*
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
RT[Databricks Runtime]:::processing
RT --> SPARK[Apache Spark]:::processing
RT --> DELTA[Delta Lake]:::storage
RT --> PHOTON[Photon Engine]:::serving
RT --> ML[MLlib & MLflow]:::serving
RT --> LIBS[Pre-installed Libraries]:::source
RT --> OPT[Databricks Optimizations]:::processing
OPT --> AQE[Adaptive Query Execution]:::processing
OPT --> DIO[Optimized I/O]:::storage
OPT --> CACHE[Disk Caching]:::storage
*The Databricks Runtime bundles Spark with Delta Lake, Photon, and platform-specific optimizations.*
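To see which of these bundled optimizations are active on a given cluster, you can read the session configuration. The following is a minimal sketch, not an official utility: `read_engine_settings` is a hypothetical helper name, and it assumes it receives the `spark` session object that Databricks notebooks predefine. The two keys are the ones listed in the Configuration Reference table.

```python
def read_engine_settings(spark):
    """Return selected optimization flags visible to the current Spark session.

    `spark` is expected to be a SparkSession, for example the one that
    Databricks notebooks predefine. This helper name is illustrative only.
    """
    keys = {
        "adaptive_query_execution": "spark.sql.adaptive.enabled",
        "disk_cache": "spark.databricks.io.cache.enabled",
    }
    settings = {}
    for label, key in keys.items():
        # The second argument is the fallback returned when the key is not set.
        settings[label] = spark.conf.get(key, "unset")
    return settings
```

In a notebook, calling `read_engine_settings(spark)` returns a small dictionary of string values such as `"true"`, `"false"`, or `"unset"`, depending on the cluster.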
Key Terms
Prerequisites and Setup
- A Databricks workspace on AWS, Azure, or GCP.
- Permission to create clusters, or access to an existing shared cluster or SQL warehouse.
- Basic familiarity with Python, SQL, or Scala.
Step-by-Step Implementation
Configuration Reference
| Parameter | Description | Default |
|---|---|---|
| spark_version | Databricks Runtime version | Required |
| node_type_id | Instance type for cluster nodes | Required |
| autoscale.min_workers | Minimum worker count | 1 |
| autoscale.max_workers | Maximum worker count | 8 |
| autotermination_minutes | Idle time before cluster shuts down | 120 |
| runtime_engine | STANDARD or PHOTON | STANDARD |
| spark.sql.adaptive.enabled | Enable Adaptive Query Execution | true |
| spark.databricks.io.cache.enabled | Enable Delta disk cache | false |
| data_security_mode | SINGLE_USER, USER_ISOLATION, or NONE (no isolation) | SINGLE_USER |
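
The parameters in the table can be combined into a single cluster definition. The sketch below is one possible way to assemble such a definition as a Python dictionary; it is not taken from the original page. `build_cluster_spec` is a hypothetical helper, `cluster_name` is a field added here for illustration, and the runtime version and node type are left as placeholders because valid values depend on your workspace and cloud. Enabling the disk cache through `spark_conf` is likewise an illustrative choice, not a recommendation.

```python
import json


def build_cluster_spec(
    cluster_name,
    spark_version,
    node_type_id,
    min_workers=1,
    max_workers=8,
    autotermination_minutes=120,
    use_photon=False,
    data_security_mode="SINGLE_USER",
):
    """Assemble a cluster definition from the parameters in the table above.

    Hypothetical helper: defaults mirror the table's Default column.
    """
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {
        "cluster_name": cluster_name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "autotermination_minutes": autotermination_minutes,
        "runtime_engine": "PHOTON" if use_photon else "STANDARD",
        "data_security_mode": data_security_mode,
        # Spark configuration values are passed as strings.
        "spark_conf": {
            "spark.sql.adaptive.enabled": "true",
            "spark.databricks.io.cache.enabled": "true",
        },
    }


# Placeholder values: substitute a runtime version and node type
# that your workspace actually offers.
spec = build_cluster_spec(
    cluster_name="example-cluster",
    spark_version="<runtime-version>",
    node_type_id="<node-type>",
    use_photon=True,
)
print(json.dumps(spec, indent=2))
```

Keeping the definition in a plain dictionary makes it easy to review or version-control before submitting it through whichever cluster-creation interface you use.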