Model Serving: Deploying Models as REST Endpoints

    Who this is for:

    Architecture / Concept Overview

    Model Serving sits between the model registry and client applications, handling scaling, versioning, and traffic splitting.

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    flowchart LR
        classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        UC[Unity Catalog Model] -->|Load| EP[Serving Endpoint]
        EP -->|Route| V1[Served Entity v1 - 90%]
        EP -->|Route| V2[Served Entity v2 - 10%]
        V1 -->|Predict| CLIENT[Client Application]
        V2 -->|Predict| CLIENT
        EP -->|Log| INF[Inference Table - Delta]
        INF -->|Monitor| MON[Lakehouse Monitor]
        UC:::governance
        EP:::serving
        V1:::processing
        V2:::processing
        CLIENT:::source
        INF:::storage
        MON:::governance

    *Model Serving architecture: Unity Catalog models are deployed to auto-scaling endpoints with traffic routing and inference logging.*
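The 90/10 split in the diagram can be illustrated with a small simulation. This is only a sketch of percentage-based routing in general, not how the serving endpoint implements it internally; the names `ROUTES`, `pick_served_entity`, and `simulate` are hypothetical and exist only for this example.

```python
import random
from collections import Counter

# Hypothetical served-entity names with the 90/10 weights from the diagram.
ROUTES = {"v1": 90, "v2": 10}


def pick_served_entity(routes, rng):
    """Pick one served entity with probability proportional to its traffic percentage."""
    if sum(routes.values()) != 100:
        raise ValueError("traffic percentages must sum to 100")
    names = list(routes)
    weights = [routes[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def simulate(routes, n_requests, seed=0):
    """Route n_requests simulated requests and count how many each entity receives."""
    rng = random.Random(seed)  # seeded so repeated runs give the same counts
    return Counter(pick_served_entity(routes, rng) for _ in range(n_requests))


if __name__ == "__main__":
    total = 10_000
    counts = simulate(ROUTES, total)
    for name, count in sorted(counts.items()):
        print(f"{name}: {count / total:.1%}")
```

Over many requests the observed shares approach the configured percentages, which is why a 10% canary needs enough traffic before its error rate and latency are meaningful.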

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    graph TD
        classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        ENDPOINT[Serving Endpoint] --> CONFIG[Configuration]
        ENDPOINT --> TRAFFIC[Traffic Policy]
        ENDPOINT --> MONITOR[Monitoring]
        CONFIG --> SIZE[Workload Size]
        CONFIG --> SCALE[Scale to Zero]
        CONFIG --> GPU_CFG[GPU Serving]
        TRAFFIC --> SPLIT[A/B Traffic Split]
        TRAFFIC --> CANARY[Canary Rollout]
        MONITOR --> LATENCY[Latency Metrics]
        MONITOR --> ERRORS[Error Rate]
        MONITOR --> INF_LOG[Inference Logging]
        ENDPOINT:::governance
        CONFIG:::processing
        TRAFFIC:::serving
        MONITOR:::storage
        SIZE:::ingestion
        SCALE:::ingestion
        GPU_CFG:::ingestion
        SPLIT:::source
        CANARY:::source
        LATENCY:::source
        ERRORS:::source
        INF_LOG:::source

    *Serving endpoint components: configuration, traffic management, and monitoring.*

    Key Terms

    Prerequisites and Setup

    • Premium or Enterprise Databricks workspace.
    • A model registered in Unity Catalog.
    • EXECUTE privilege on the registered model for serving.
    • For GPU serving: GPU-capable workload types enabled in your region.

    Step-by-Step Implementation

      Configuration Reference

      Model Serving configuration options

      Parameter                     Default   Description
      workload_size                 Small     Compute tier: Small, Medium, Large
      scale_to_zero_enabled         true      Scale down to zero replicas when idle
      workload_type                 CPU       Set to GPU_SMALL, GPU_MEDIUM, or GPU_LARGE for GPU serving
      traffic_percentage            100       Percentage of traffic routed to each served entity
      auto_capture_config.enabled   false     Enable inference table logging
      environment_vars              {}        Environment variables passed to the serving container
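To show how the options in the table fit together, the sketch below assembles them into a single endpoint configuration and validates it before anything is sent. The parameter names and defaults come from the table. The nesting (`served_entities`, `traffic_config.routes`, `auto_capture_config`) is an assumption modeled on the Databricks serving-endpoints REST API and should be checked against the current API reference. The helper `build_endpoint_config`, the endpoint name `demo-endpoint`, and the model name `main.ml.churn` are hypothetical.

```python
import json

VALID_SIZES = {"Small", "Medium", "Large"}
VALID_TYPES = {"CPU", "GPU_SMALL", "GPU_MEDIUM", "GPU_LARGE"}


def build_endpoint_config(endpoint_name, model_name, version_traffic,
                          workload_size="Small", workload_type="CPU",
                          scale_to_zero_enabled=True, environment_vars=None,
                          inference_logging=False):
    """Build a request body from the table's parameters (payload shape is assumed)."""
    if workload_size not in VALID_SIZES:
        raise ValueError(f"workload_size must be one of {sorted(VALID_SIZES)}")
    if workload_type not in VALID_TYPES:
        raise ValueError(f"workload_type must be one of {sorted(VALID_TYPES)}")
    if sum(version_traffic.values()) != 100:
        raise ValueError("traffic_percentage values must sum to 100")

    served_entities, routes = [], []
    for version, percentage in version_traffic.items():
        served_name = f"v{version}"  # hypothetical naming scheme for served entities
        served_entities.append({
            "name": served_name,
            "entity_name": model_name,
            "entity_version": str(version),
            "workload_size": workload_size,
            "workload_type": workload_type,
            "scale_to_zero_enabled": scale_to_zero_enabled,
            "environment_vars": dict(environment_vars or {}),
        })
        routes.append({"served_model_name": served_name,
                       "traffic_percentage": percentage})

    return {
        "name": endpoint_name,
        "config": {
            "served_entities": served_entities,
            "traffic_config": {"routes": routes},
            "auto_capture_config": {"enabled": inference_logging},
        },
    }


if __name__ == "__main__":
    body = build_endpoint_config("demo-endpoint", "main.ml.churn", {1: 90, 2: 10})
    print(json.dumps(body, indent=2))
```

Validating locally catches the most common mistakes, such as traffic percentages that do not sum to 100 or a misspelled workload type, before a request is made.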

      Monitoring, Cost, and Security Considerations

      Common Pitfalls and Recommended Patterns

        Frequently Asked Questions