Ray on Databricks: Scaling ML Workloads
Who this is for:
Architecture / Concept Overview
Ray runs alongside Spark on the same Databricks cluster: the Ray head node starts on the Spark driver and Ray workers start on the Spark worker nodes, so the two share machines while Ray schedules its own tasks.
```mermaid
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
NOTEBOOK[Databricks Notebook] -->|Init| RAY_CLUSTER[Ray Cluster on Spark Nodes]
RAY_CLUSTER -->|Run| TUNE[Ray Tune - HPO]
RAY_CLUSTER -->|Run| TRAIN_R[Ray Train - Distributed Training]
RAY_CLUSTER -->|Run| DATA[Ray Data - Preprocessing]
TUNE -->|Log| MLF[MLflow]
TRAIN_R -->|Log| MLF
DATA -->|Feed| TRAIN_R
NOTEBOOK:::source
RAY_CLUSTER:::ingestion
TUNE:::processing
TRAIN_R:::processing
DATA:::storage
MLF:::governance
```
*Ray on Databricks: a Ray cluster bootstraps on Spark nodes, running Tune, Train, and Data workloads with MLflow integration.*
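The bootstrap step in the diagram can be sketched as a small context manager around `ray.util.spark`. This is a minimal sketch, not a definitive setup: the helper names `cluster_kwargs` and `ray_on_spark` are hypothetical, and the parameter names follow the configuration table in this document. Newer Ray releases may use different names for the same settings (for example `max_worker_nodes`), so check the signature of the installed version.

```python
from contextlib import contextmanager


def cluster_kwargs(num_worker_nodes, num_cpus_per_node=None, num_gpus_per_node=0):
    """Collect setup arguments, dropping unset values so Ray applies its defaults."""
    kwargs = {
        "num_worker_nodes": num_worker_nodes,
        "num_cpus_per_node": num_cpus_per_node,
        "num_gpus_per_node": num_gpus_per_node,
    }
    return {name: value for name, value in kwargs.items() if value is not None}


@contextmanager
def ray_on_spark(**kwargs):
    """Start a Ray cluster on the Spark nodes and tear it down on exit."""
    # Imported lazily so this module still loads where Ray is not installed.
    import ray
    from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster

    setup_ray_cluster(**kwargs)          # launch Ray nodes on the Spark cluster
    ray.init(ignore_reinit_error=True)   # connect this notebook to that cluster
    try:
        yield ray
    finally:
        ray.shutdown()                   # disconnect the notebook
        shutdown_ray_cluster()           # release the Spark resources


# Usage in a notebook cell (assumes a running multi-worker cluster):
# with ray_on_spark(**cluster_kwargs(num_worker_nodes=2, num_cpus_per_node=4)) as ray:
#     print(ray.cluster_resources())
```

Wrapping setup and shutdown together means the Spark resources held by Ray are released even if the workload raises.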
```mermaid
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
RAY[Ray Ecosystem] --> CORE[Ray Core - Tasks & Actors]
RAY --> TUNE_LIB[Ray Tune]
RAY --> TRAIN_LIB[Ray Train]
RAY --> DATA_LIB[Ray Data]
RAY --> SERVE_LIB[Ray Serve]
CORE --> REMOTE[ray.remote]
TUNE_LIB --> SEARCH[Search Algorithms]
TUNE_LIB --> SCHED[Schedulers - ASHA, PBT]
TRAIN_LIB --> TORCH_T[TorchTrainer]
TRAIN_LIB --> XGB_T[XGBoostTrainer]
DATA_LIB --> STREAM[Streaming Datasets]
RAY:::governance
CORE:::processing
TUNE_LIB:::serving
TRAIN_LIB:::storage
DATA_LIB:::ingestion
SERVE_LIB:::source
REMOTE:::processing
SEARCH:::serving
SCHED:::serving
TORCH_T:::storage
XGB_T:::storage
STREAM:::ingestion
```
*Ray ecosystem components available on Databricks.*
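To make the Ray Core branch of the diagram concrete, here is a minimal sketch of `ray.remote` tasks. The function names `square` and `parallel_squares` are hypothetical, and the sketch assumes a Ray cluster is already running and reachable from the notebook.

```python
def square(x):
    """Plain Python function; Ray can run it remotely without changes."""
    return x * x


def parallel_squares(values):
    """Run `square` as Ray tasks and collect the results in input order."""
    import ray  # lazy import: only needed when tasks are actually submitted

    remote_square = ray.remote(square)                    # wrap as a Ray task
    futures = [remote_square.remote(v) for v in values]   # schedule, non-blocking
    return ray.get(futures)                               # block until all finish


# Usage on a running Ray cluster:
# parallel_squares(range(8))
```

Each `.remote()` call returns a future immediately, so the tasks can run in parallel across the Ray workers; `ray.get` is the only blocking step.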
Key Terms
Prerequisites and Setup
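- Databricks Runtime for ML. Recent versions ship with Ray pre-installed; if `import ray` fails on an older one, install it with `%pip install "ray[default]"`.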
- A multi-worker cluster for distributed workloads.
- For GPU workloads: the GPU version of Databricks Runtime for ML, running on GPU instance types.
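The prerequisites above can be checked from a notebook before starting a Ray cluster. This is a minimal sketch: `check_prerequisites` is a hypothetical helper, and it assumes the cluster exposes the `DATABRICKS_RUNTIME_VERSION` environment variable, which Databricks clusters normally set.

```python
import importlib.util
import os


def check_prerequisites(env=None):
    """Report whether the environment looks ready for Ray on Databricks."""
    env = os.environ if env is None else env
    return {
        # Assumption: Databricks sets this variable on cluster nodes.
        "on_databricks": "DATABRICKS_RUNTIME_VERSION" in env,
        # True when the `ray` package can be imported in this environment.
        "ray_installed": importlib.util.find_spec("ray") is not None,
    }


# Usage in a notebook cell:
# print(check_prerequisites())
```

The check does not confirm that the runtime is the ML variant or that GPUs are attached; it only catches the two most common setup gaps.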
Step-by-Step Implementation
Configuration Reference
| Parameter | Default | Description |
|---|---|---|
| num_worker_nodes | — | Number of Spark workers to use as Ray nodes |
| num_cpus_per_node | all available | CPUs allocated per Ray worker |
| num_gpus_per_node | 0 | GPUs allocated per Ray worker |
| num_samples (Tune) | — | Total number of hyperparameter configurations to try |
| num_workers (Train) | — | Number of distributed training workers |
| scheduler | FIFOScheduler | Tune scheduler (ASHA, PBT, etc.) |
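The Tune and Train rows of the table map onto `TuneConfig` and `ScalingConfig`. The sketch below is illustrative only: `objective`, `build_tuner`, and `build_scaling_config` are hypothetical names, the search space and sample count are arbitrary, and the `ray.train` import path assumes a reasonably recent Ray 2.x release (older releases exposed `ScalingConfig` elsewhere).

```python
def objective(config):
    """Toy trainable: the score is highest when lr equals 0.01."""
    return {"score": -abs(config["lr"] - 0.01)}


def build_tuner(num_samples=8):
    """Build a Tuner that replaces the default FIFO scheduler with ASHA."""
    from ray import tune  # lazy imports: Ray is only needed when building
    from ray.tune.schedulers import ASHAScheduler

    return tune.Tuner(
        objective,
        param_space={"lr": tune.loguniform(1e-4, 1e-1)},
        tune_config=tune.TuneConfig(
            metric="score",
            mode="max",
            num_samples=num_samples,     # number of configurations to try
            scheduler=ASHAScheduler(),   # overrides the FIFOScheduler default
        ),
    )


def build_scaling_config(num_workers=2, use_gpu=False):
    """Describe how many distributed training workers Ray Train should start."""
    from ray.train import ScalingConfig

    return ScalingConfig(num_workers=num_workers, use_gpu=use_gpu)


# Usage on a running Ray cluster:
# results = build_tuner().fit()
# best = results.get_best_result()
# The scaling config is passed to a trainer such as TorchTrainer via
# its `scaling_config` argument.
```

Because this toy trainable reports a single final result, ASHA has nothing to stop early; a real training loop would report metrics every iteration so that weak trials can be cut short.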