Distributed Training with Ray, TorchDistributor, and DeepSpeed
Who this is for:
Architecture / Concept Overview: Distributed Training with Ray, TorchDistributor, and DeepSpeed
Distributed training on Databricks uses the cluster's multi-node architecture to parallelise training across GPUs.
```mermaid
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
DRIVER[Driver Node] -->|Launch| TD[TorchDistributor / Ray]
TD -->|Distribute| W1[Worker 0 - GPU]
TD -->|Distribute| W2[Worker 1 - GPU]
TD -->|Distribute| W3[Worker 2 - GPU]
W1 -->|Gradient Sync| NCCL[NCCL All-Reduce]
W2 -->|Gradient Sync| NCCL
W3 -->|Gradient Sync| NCCL
NCCL -->|Update| MODEL[Synchronised Model]
MODEL -->|Log| MLF[MLflow]
DRIVER:::source
TD:::ingestion
W1:::processing
W2:::processing
W3:::processing
NCCL:::storage
MODEL:::serving
MLF:::governance
```
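*Data-parallel distributed training: the driver launches the workers, each GPU computes gradients on its own data shard, NCCL all-reduce synchronises those gradients, and every replica applies the same update before the model is logged to MLflow.*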
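
The gradient synchronisation step in the diagram can be illustrated with a toy simulation. This is a minimal sketch in plain Python, not NCCL or PyTorch code: the one-parameter model, the dataset, and the helper names `local_gradient` and `all_reduce_mean` are all hypothetical and exist only for illustration. It assumes equally sized shards, in which case averaging the per-worker gradients reproduces the full-batch gradient.

```python
# Toy simulation of data-parallel gradient synchronisation (no GPUs or NCCL involved).
# Model: y_hat = w * x, loss = mean((w * x - y)^2), so dL/dw = mean(2 * x * (w * x - y)).

def local_gradient(w, shard):
    """Gradient of the mean squared error computed on one worker's data shard."""
    return sum(2.0 * x * (w * x - y) for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the mean of all values."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

# Hypothetical dataset generated by y = 3x, split evenly across three workers.
data = [(float(x), 3.0 * x) for x in range(1, 7)]
shards = [data[0:2], data[2:4], data[4:6]]

replicas = [0.0, 0.0, 0.0]  # each worker holds its own copy of the weight
learning_rate = 0.01

for step in range(200):
    # 1. Each worker computes a gradient on its own shard only.
    grads = [local_gradient(w, shard) for w, shard in zip(replicas, shards)]
    # 2. The all-reduce gives every worker the same averaged gradient.
    synced = all_reduce_mean(grads)
    # 3. Every replica applies the identical update, so the copies never diverge.
    replicas = [w - learning_rate * g for w, g in zip(replicas, synced)]

print(replicas)
```

Because every replica starts from the same weight and applies the same averaged gradient, the three copies stay identical throughout training, which is the property the "Synchronised Model" node in the diagram stands for.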
```mermaid
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
DIST[Distributed Strategies] --> DP[Data Parallelism]
DIST --> MP[Model Parallelism]
DIST --> ZeRO[ZeRO Optimisation]
DP --> DDP[PyTorch DDP]
DP --> TD_API[TorchDistributor]
DP --> RAY_TRAIN[Ray Train]
MP --> TENSOR[Tensor Parallelism]
MP --> PIPE[Pipeline Parallelism]
ZeRO --> Z1[Stage 1 - Optimizer States]
ZeRO --> Z2[Stage 2 - Gradients]
ZeRO --> Z3[Stage 3 - Parameters]
DIST:::governance
DP:::processing
MP:::serving
ZeRO:::storage
DDP:::ingestion
TD_API:::ingestion
RAY_TRAIN:::ingestion
TENSOR:::source
PIPE:::source
Z1:::source
Z2:::source
Z3:::source
```
*Distributed training strategy taxonomy: data parallelism, model parallelism, and DeepSpeed ZeRO stages.*
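
To make the three ZeRO stages concrete, the sketch below estimates per-GPU memory for model states only (parameters, gradients, optimizer states). It assumes mixed-precision training with Adam, using the accounting from the ZeRO paper: 2 bytes per parameter for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes for fp32 optimizer state. Activations, buffers, and fragmentation are ignored, and the function name `zero_model_state_gb` is hypothetical.

```python
def zero_model_state_gb(num_params, num_gpus, stage, optimizer_bytes_per_param=12):
    """Approximate per-GPU memory (GB) for model states under a given ZeRO stage.

    Assumes mixed-precision Adam: fp16 parameters (2 bytes), fp16 gradients
    (2 bytes), and fp32 master weights, momentum, and variance (12 bytes).
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2, or 3")

    params = 2.0 * num_params
    grads = 2.0 * num_params
    optim = float(optimizer_bytes_per_param) * num_params

    # Each stage shards one more component across the data-parallel GPUs.
    if stage >= 1:
        optim /= num_gpus   # Stage 1: optimizer states
    if stage >= 2:
        grads /= num_gpus   # Stage 2: gradients
    if stage >= 3:
        params /= num_gpus  # Stage 3: parameters

    return (params + grads + optim) / 1e9

# Illustrative figures: a 7.5B-parameter model on 64 GPUs.
for stage in range(4):
    print(stage, round(zero_model_state_gb(7.5e9, 64, stage), 1))
# 0 120.0
# 1 31.4
# 2 16.6
# 3 1.9
```

The estimate shows why the stages are cumulative: Stage 1 already removes most of the optimizer overhead, while Stage 3 is what lets the parameter footprint itself shrink with the number of GPUs.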
Key Terms
Prerequisites and Setup
- Databricks Runtime for ML (GPU variant).
- Multi-GPU or multi-node GPU cluster.
- For Ray: `ray` is pre-installed on ML Runtime.
- For DeepSpeed: install via `%pip install deepspeed`.
Step-by-Step Implementation
Configuration Reference
| Parameter | Default | Description |
|---|---|---|
| `num_processes` (TorchDistributor) | — | Total GPU processes across all nodes |
| `local_mode` | true | Restrict to driver node if true |
| `use_gpu` | true | Use GPU for training processes |
| `num_workers` (Ray) | — | Number of Ray training workers |
| `zero_optimization.stage` | 0 | DeepSpeed ZeRO stage (0, 1, 2, or 3) |
| `fp16.enabled` | false | Enable mixed precision via DeepSpeed |
| `gradient_accumulation_steps` | 1 | Accumulate gradients over multiple mini-batches |
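
The parameters above could be combined roughly as follows. This is a sketch under stated assumptions, not a reference setup: the helper names `build_deepspeed_config` and `launch_with_torch_distributor`, the training function `train_fn`, and the chosen values are hypothetical. The key `train_micro_batch_size_per_gpu` is not in the table; it is included because DeepSpeed needs a batch size setting. The `pyspark` import is deferred so the config helper can be run outside a Spark environment.

```python
import json

def build_deepspeed_config(zero_stage=2, use_fp16=True,
                           grad_accum_steps=4, micro_batch_size=8):
    """Build a DeepSpeed config dict from the parameters in the table above."""
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_optimization.stage must be 0, 1, 2, or 3")
    return {
        # Assumption: batch size per GPU; not listed in the reference table.
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        "fp16": {"enabled": use_fp16},
        "zero_optimization": {"stage": zero_stage},
    }

def launch_with_torch_distributor(train_fn, num_processes, ds_config):
    """Run a user-supplied training function across the cluster's GPUs.

    `train_fn` is a hypothetical function that accepts the config dict.
    """
    # Imported lazily so this module also loads where pyspark is unavailable.
    from pyspark.ml.torch.distributor import TorchDistributor

    distributor = TorchDistributor(
        num_processes=num_processes,  # total GPU processes across all nodes
        local_mode=False,             # False: use worker nodes, not only the driver
        use_gpu=True,
    )
    return distributor.run(train_fn, ds_config)

ds_config = build_deepspeed_config()
print(json.dumps(ds_config, indent=2))
```

With these example values, the effective global batch size would be `micro_batch_size * grad_accum_steps * num_processes`, so raising `gradient_accumulation_steps` is one way to keep the global batch size fixed when fewer GPUs are available.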