Model Serving: Deploying Models as REST Endpoints
Who this is for:
Architecture / Concept Overview
Model Serving sits between the model registry and client applications, handling scaling, versioning, and traffic splitting.
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
UC[Unity Catalog Model] -->|Load| EP[Serving Endpoint]
EP -->|Route| V1[Served Entity v1 - 90%]
EP -->|Route| V2[Served Entity v2 - 10%]
V1 -->|Predict| CLIENT[Client Application]
V2 -->|Predict| CLIENT
EP -->|Log| INF[Inference Table - Delta]
INF -->|Monitor| MON[Lakehouse Monitor]
UC:::governance
EP:::serving
V1:::processing
V2:::processing
CLIENT:::source
INF:::storage
MON:::governance
*Model Serving architecture: Unity Catalog models are deployed to auto-scaling endpoints with traffic routing and inference logging.*
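The 90/10 split in the diagram can be illustrated with a small simulation. This is only a sketch of weighted routing in general, not how the serving endpoint is implemented internally. The function names `pick_served_entity` and `simulate` and the route labels `v1` and `v2` are hypothetical, chosen to mirror the diagram.

```python
import random
from collections import Counter


def pick_served_entity(routes, rng):
    """Pick one served entity name, weighted by its traffic percentage.

    routes: dict mapping a served entity name to an integer percentage.
    rng:    a random.Random instance, passed in so runs are reproducible.
    """
    total = sum(routes.values())
    if total != 100:
        raise ValueError(f"traffic percentages must sum to 100, got {total}")
    names = list(routes)
    weights = [routes[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def simulate(routes, n_requests, seed=0):
    """Route n_requests simulated requests and count hits per served entity."""
    rng = random.Random(seed)
    return Counter(pick_served_entity(routes, rng) for _ in range(n_requests))


if __name__ == "__main__":
    n = 10_000
    counts = simulate({"v1": 90, "v2": 10}, n)
    for name, count in sorted(counts.items()):
        print(f"{name}: {count / n:.1%} of requests")
```

Over many requests the observed share of each served entity approaches its configured percentage, which is why a small canary percentage needs enough traffic before its metrics become meaningful.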
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
ENDPOINT[Serving Endpoint] --> CONFIG[Configuration]
ENDPOINT --> TRAFFIC[Traffic Policy]
ENDPOINT --> MONITOR[Monitoring]
CONFIG --> SIZE[Workload Size]
CONFIG --> SCALE[Scale to Zero]
CONFIG --> GPU_CFG[GPU Serving]
TRAFFIC --> SPLIT[A/B Traffic Split]
TRAFFIC --> CANARY[Canary Rollout]
MONITOR --> LATENCY[Latency Metrics]
MONITOR --> ERRORS[Error Rate]
MONITOR --> INF_LOG[Inference Logging]
ENDPOINT:::governance
CONFIG:::processing
TRAFFIC:::serving
MONITOR:::storage
SIZE:::ingestion
SCALE:::ingestion
GPU_CFG:::ingestion
SPLIT:::source
CANARY:::source
LATENCY:::source
ERRORS:::source
INF_LOG:::source
*Serving endpoint components: configuration, traffic management, and monitoring.*
Key Terms
Prerequisites and Setup
- Premium or Enterprise Databricks workspace.
- A model registered in Unity Catalog.
- EXECUTE privilege on the registered model for serving.
- For GPU serving: GPU-capable workload types enabled in your region.
Step-by-Step Implementation
Configuration Reference
| Parameter | Default | Description |
|---|---|---|
| workload_size | Small | Compute tier: Small, Medium, Large |
| scale_to_zero_enabled | true | Scale down to zero replicas when idle |
| workload_type | CPU | Set to GPU_SMALL, GPU_MEDIUM, or GPU_LARGE for GPU serving |
| traffic_percentage | 100 | Percentage of traffic routed to each served entity |
| auto_capture_config.enabled | false | Enable inference table logging |
| environment_vars | {} | Environment variables passed to the serving container |
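
To show how the parameters in the table might fit together, here is a minimal sketch that assembles an endpoint configuration as a plain Python dictionary. The nesting (`served_entities`, `traffic_config.routes`, `auto_capture_config`) and the `served_model_name` route key reflect my understanding of the serving endpoints REST payload and should be checked against the API reference for your workspace. The helper name `build_endpoint_config`, the endpoint name, and the model name `main.ml.churn` are hypothetical.

```python
import json


def build_endpoint_config(
    endpoint_name,
    model_name,
    versions,
    *,
    workload_size="Small",
    workload_type="CPU",
    scale_to_zero_enabled=True,
    environment_vars=None,
    inference_table=None,
):
    """Assemble an endpoint payload from the parameters in the table above.

    versions:        dict mapping a model version string to its traffic percentage.
    inference_table: optional dict with catalog_name, schema_name, and
                     table_name_prefix (assumed field names); when given,
                     inference logging is enabled.
    """
    if sum(versions.values()) != 100:
        raise ValueError("traffic percentages must sum to 100")

    short_name = model_name.split(".")[-1]
    served_entities = []
    routes = []
    for version, percentage in versions.items():
        served_name = f"{short_name}-{version}"  # hypothetical naming scheme
        served_entities.append(
            {
                "name": served_name,
                "entity_name": model_name,
                "entity_version": version,
                "workload_size": workload_size,
                "workload_type": workload_type,
                "scale_to_zero_enabled": scale_to_zero_enabled,
                "environment_vars": dict(environment_vars or {}),
            }
        )
        routes.append(
            {"served_model_name": served_name, "traffic_percentage": percentage}
        )

    config = {
        "served_entities": served_entities,
        "traffic_config": {"routes": routes},
    }
    if inference_table is not None:
        config["auto_capture_config"] = {**inference_table, "enabled": True}
    return {"name": endpoint_name, "config": config}


if __name__ == "__main__":
    payload = build_endpoint_config(
        "churn-endpoint", "main.ml.churn", {"1": 90, "2": 10}
    )
    print(json.dumps(payload, indent=2))
```

Validating that the percentages sum to 100 before sending the request catches a common configuration mistake locally instead of at deployment time.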