Verify GPU availability on a GPU cluster
Who this is for:
Architecture / Concept Overview
The ML Runtime extends the standard Databricks Runtime with GPU drivers, optimised libraries, and ML-specific integrations.
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
BASE[Standard Runtime] --> MLR[ML Runtime]
MLR --> LIBS[Pre-Installed Libraries]
MLR --> GPU[GPU Support / CUDA]
MLR --> INTEGRATIONS[Platform Integrations]
LIBS --> SKLEARN[scikit-learn]
LIBS --> PYTORCH[PyTorch]
LIBS --> TF[TensorFlow]
LIBS --> XGB[XGBoost / LightGBM]
LIBS --> HF[Hugging Face Transformers]
GPU --> CUDA[CUDA Toolkit]
GPU --> NCCL[NCCL]
GPU --> CUDNN[cuDNN]
INTEGRATIONS --> MLFLOW[MLflow Autologging]
INTEGRATIONS --> HOROVOD[Horovod / TorchDistributor]
INTEGRATIONS --> DELTA[Delta Lake ML I/O]
BASE:::source
MLR:::governance
LIBS:::processing
GPU:::storage
INTEGRATIONS:::serving
SKLEARN:::ingestion
PYTORCH:::ingestion
TF:::ingestion
XGB:::ingestion
HF:::ingestion
CUDA:::source
NCCL:::source
CUDNN:::source
MLFLOW:::serving
HOROVOD:::serving
DELTA:::serving
*ML Runtime layer cake: standard runtime, ML libraries, GPU drivers, and platform integrations.*
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
USER[Data Scientist] -->|Select| CLUSTER[Cluster Config]
CLUSTER -->|Choose| CPU_GPU{CPU or GPU?}
CPU_GPU -->|CPU| CPU_ML[ML Runtime CPU]
CPU_GPU -->|GPU| GPU_ML[ML Runtime GPU]
CPU_ML -->|Train| CLASSIC[Classic ML]
GPU_ML -->|Train| DL[Deep Learning / LLM]
USER:::source
CLUSTER:::ingestion
CPU_GPU:::processing
CPU_ML:::storage
GPU_ML:::storage
CLASSIC:::serving
DL:::serving
*Decision flow for selecting CPU vs. GPU ML Runtime based on workload type.*
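The decision flow above can be sketched as a small helper that maps a workload type to a runtime version string. This is only an illustration: the workload labels (`classic_ml`, `deep_learning`, `llm`) and the function name `choose_runtime` are hypothetical, and the `15.4.x` version is taken from the example strings in the configuration reference, not a recommendation.

```python
# Hypothetical workload labels, mirroring the two branches of the diagram.
CPU_WORKLOADS = {"classic_ml"}
GPU_WORKLOADS = {"deep_learning", "llm"}


def choose_runtime(workload, version="15.4.x"):
    """Return an ML Runtime version string for the given workload type."""
    if workload in GPU_WORKLOADS:
        flavor = "gpu"
    elif workload in CPU_WORKLOADS:
        flavor = "cpu"
    else:
        raise ValueError(f"unknown workload type: {workload!r}")
    # String layout follows the examples in the configuration reference.
    return f"{version}-{flavor}-ml-scala2.12"


print(choose_runtime("classic_ml"))
print(choose_runtime("llm"))
```

Raising on an unknown label is a deliberate choice here: silently defaulting to a GPU runtime would provision more expensive nodes than the workload may need.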
Key Terms
Prerequisites and Setup
- Workspace admin access to create or modify cluster policies.
- For GPU workloads, ensure your cloud account has GPU instance quota (e.g., `p3`, `g5` on AWS; `NC`/`ND` series on Azure).
- Select a runtime version that matches your library needs (check the release notes for included versions).
Step-by-Step Implementation
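As a starting point, the check itself might look like the following minimal sketch, run in a notebook cell attached to the cluster. It assumes PyTorch is importable (it is listed among the pre-installed libraries above) and falls back to `nvidia-smi -L` if that tool is on the `PATH`. The helper name `gpu_report` and the keys of the returned dictionary are illustrative, not part of any Databricks API.

```python
import shutil
import subprocess


def gpu_report():
    """Collect a best-effort summary of the GPUs visible to this Python process."""
    report = {
        "torch_installed": False,
        "cuda_available": False,
        "device_count": 0,
        "device_names": [],
        "nvidia_smi": None,
    }

    # Guard the import so the check still runs where PyTorch is missing.
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None:
        report["torch_installed"] = True
        report["cuda_available"] = bool(torch.cuda.is_available())
        if report["cuda_available"]:
            count = torch.cuda.device_count()
            report["device_count"] = count
            report["device_names"] = [
                torch.cuda.get_device_name(i) for i in range(count)
            ]

    # `nvidia-smi -L` prints one line per GPU that the driver can see.
    smi = shutil.which("nvidia-smi")
    if smi is not None:
        try:
            result = subprocess.run(
                [smi, "-L"], capture_output=True, text=True, timeout=30
            )
            report["nvidia_smi"] = result.stdout.strip() or result.stderr.strip()
        except (OSError, subprocess.SubprocessError):
            pass  # Leave the field as None if the tool cannot be run.

    return report


print(gpu_report())
```

A cell like this runs on the driver node, so it only reports the driver's GPUs. If `cuda_available` is `False` on a cluster that was meant to be GPU-backed, the runtime version and instance type in the cluster configuration are the first things to check.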
Configuration Reference
| Parameter | Default | Description |
|---|---|---|
| `spark_version` (ML) | — | ML Runtime version string, e.g., `15.4.x-cpu-ml-scala2.12` |
| `spark_version` (GPU) | — | GPU ML Runtime version string, e.g., `15.4.x-gpu-ml-scala2.12` |
| `node_type_id` | — | Instance type for workers (`i3.xlarge`, `g5.xlarge`, etc.) |
| `autoscale.min_workers` | 1 | Minimum number of worker nodes |
| `autoscale.max_workers` | 8 | Maximum number of worker nodes |
| `init_scripts` | `[]` | List of init script paths for custom setup |
| `spark.databricks.mlflow.autologging.enabled` | `true` | Enable or disable MLflow autologging |
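
To show how these parameters fit together, here is a hedged sketch that assembles them into a cluster specification as a Python dictionary and prints it as JSON. The function name `build_cluster_spec` and the cluster name `gpu-check-demo` are hypothetical, and the defaults simply repeat the table above. Treat the result as a draft to compare against your workspace's cluster policy, not as a complete specification.

```python
import json


def build_cluster_spec(spark_version, node_type_id,
                       min_workers=1, max_workers=8, autolog=True):
    """Assemble the parameters from the table into one cluster spec dict."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {
        "cluster_name": "gpu-check-demo",  # hypothetical name
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "autoscale": {
            "min_workers": min_workers,
            "max_workers": max_workers,
        },
        "init_scripts": [],
        # Spark conf values are passed as strings.
        "spark_conf": {
            "spark.databricks.mlflow.autologging.enabled": str(autolog).lower(),
        },
    }


spec = build_cluster_spec("15.4.x-gpu-ml-scala2.12", "g5.xlarge")
print(json.dumps(spec, indent=2))
```

Note that the GPU runtime string and a GPU instance type need to be chosen together: pairing `15.4.x-gpu-ml-scala2.12` with a CPU-only instance such as `i3.xlarge` leaves the cluster with no GPU to detect.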