Apply forecasting per SKU using Pandas UDF or applyInPandas

Databricks empowers retail and e-commerce organisations to unify customer, inventory, and transaction data on a single lakehouse platform — enabling real-time personalisation, demand forecasting, and supply chain optimisation at scale. Retailers replace fragmented point solutions with one governed environment for analytics and AI.

    Who this is for:

    Part of the How Databricks Can Help Your Business section of the Databricks tutorial series.

    Architecture / Concept Overview: Apply forecasting per SKU using Pandas UDF or applyInPandas

    Retailers generate massive volumes of data from point-of-sale systems, e-commerce platforms, loyalty programmes, supply chain systems, and marketing channels. The lakehouse unifies these streams to power real-time decisions across merchandising, marketing, and operations.

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    flowchart LR
        classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        POS[POS Systems] --> Ingest[Streaming Ingest]
        Web[E-Commerce Clicks] --> Ingest
        Loyalty[Loyalty Programme] --> Ingest
        Supply[Supply Chain] --> Ingest
        Ingest --> DL[(Delta Lake)]
        DL --> Recommend[Recommendations]
        DL --> Forecast[Demand Forecast]
        DL --> CustomerBI[Customer Analytics]
        class POS source
        class Web source
        class Loyalty source
        class Supply source
        class Ingest ingestion
        class DL storage
        class Recommend processing
        class Forecast governance
        class CustomerBI serving

    *Figure 1 — Retail data streams converge in the lakehouse to power personalisation, forecasting, and analytics.*

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    flowchart LR
        classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        Clicks[Clickstream] --> SessionID[Session Tracking]
        SessionID --> Profile[Customer Profile]
        Profile --> Features[(Feature Store)]
        Features --> RecModel[Recommendation Model]
        RecModel --> API[Serving Endpoint]
        API --> Website[Product Page]
        class Clicks source
        class SessionID ingestion
        class Profile processing
        class Features storage
        class RecModel governance
        class API serving
        class Website serving

    *Figure 2 — Real-time personalisation pipeline from clickstream to product recommendations.*

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    graph TD
        classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        Demand[Demand Forecasting]
        Demand --> Historical[Historical Sales]
        Demand --> Seasonal[Seasonality Models]
        Demand --> External[External Signals]
        Historical --> SKU[SKU-Level Forecast]
        Seasonal --> SKU
        External --> SKU
        SKU --> Replenish[Replenishment Orders]
        class Demand processing
        class Historical storage
        class Seasonal governance
        class External source
        class SKU serving
        class Replenish ingestion

    *Figure 3 — Demand forecasting combines historical patterns, seasonality, and external signals.*

    Key Terms

    Prerequisites and Setup

    • Databricks workspace with streaming capabilities enabled
    • Connections to e-commerce platform APIs (orders, products, customers)
    • POS system data feed (real-time or batch)
    • Cloud storage for clickstream event data
    • Product catalogue and inventory system access

    Step-by-Step Implementation
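A minimal sketch of the per-SKU pattern named in the title. It assumes a daily sales table with hypothetical columns `sku`, `ds` (date) and `units`; none of these names come from the tutorial. The model is a deliberately simple seasonal-naive forecast (repeat the last seven days), used here only for illustration and not a method this tutorial prescribes. `forecast_sku` is plain pandas, so it can be tested locally. `forecast_all_skus` shows how it might be handed to Spark's `groupBy(...).applyInPandas(...)`, which calls the function once per SKU group.

```python
import pandas as pd

HORIZON_DAYS = 28  # lower end of the 28-90 day horizon in the configuration reference
SEASON_DAYS = 7    # assumed weekly seasonality (illustrative choice)

# Output schema for applyInPandas, written as a DDL string.
FORECAST_SCHEMA = "sku string, ds date, yhat double"


def forecast_sku(pdf: pd.DataFrame) -> pd.DataFrame:
    """Seasonal-naive forecast for one SKU: repeat the last SEASON_DAYS of sales."""
    pdf = pdf.sort_values("ds")
    sku = pdf["sku"].iloc[0]
    last_season = pdf["units"].tail(SEASON_DAYS).astype(float).to_list()
    last_day = pd.Timestamp(pdf["ds"].iloc[-1])
    future = pd.date_range(last_day + pd.Timedelta(days=1),
                           periods=HORIZON_DAYS, freq="D")
    # Cycle through the most recent season to fill the whole horizon.
    yhat = [last_season[i % len(last_season)] for i in range(HORIZON_DAYS)]
    return pd.DataFrame({"sku": sku, "ds": future.date, "yhat": yhat})


def forecast_all_skus(sales_sdf):
    """Run forecast_sku once per SKU on a Spark DataFrame (requires pyspark)."""
    return sales_sdf.groupBy("sku").applyInPandas(forecast_sku, schema=FORECAST_SCHEMA)
```

Each SKU's history must fit in the memory of a single worker, because `applyInPandas` materialises the whole group as one pandas DataFrame. In practice `forecast_sku` would be replaced by whichever forecasting library the team has chosen, keeping the same input and output columns.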

      Configuration Reference

      Configuration options for applying forecasting per SKU using Pandas UDF or applyInPandas
      Parameter | Description | Recommended Value
      Streaming checkpoint | Location for stream state | Dedicated cloud storage path
      Recommendation refresh | How often to retrain | Weekly for collaborative filtering
      Demand forecast horizon | Prediction window | 28-90 days
      Clickstream retention | Raw event retention | 90 days bronze, aggregated in gold
      Serving endpoint size | Real-time inference compute | Medium (scale-to-zero for off-peak)
      Customer 360 refresh | Profile update frequency | Daily for batch, hourly for key metrics
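One way to keep these recommendations in code is a small typed settings object that fails fast when a value drifts outside the recommended range. The class name, field names and the example path below are hypothetical and are not Databricks settings. Only the numbers come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetailPipelineConfig:
    # Dedicated cloud storage path for stream state (supplied by the caller).
    checkpoint_location: str
    recommendation_refresh_days: int = 7            # weekly retraining
    forecast_horizon_days: int = 28                 # table recommends 28-90 days
    bronze_clickstream_retention_days: int = 90     # raw events kept 90 days in bronze
    customer360_batch_refresh_hours: int = 24       # daily batch refresh
    customer360_key_metric_refresh_hours: int = 1   # hourly for key metrics

    def __post_init__(self):
        if not self.checkpoint_location:
            raise ValueError("checkpoint_location must be set")
        if not 28 <= self.forecast_horizon_days <= 90:
            raise ValueError("forecast_horizon_days should be between 28 and 90")


# Example path is a placeholder, not a real location.
config = RetailPipelineConfig(checkpoint_location="/tmp/example-checkpoints")
```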

      Monitoring, Cost, and Security Considerations

      Monitoring

      Track recommendation model click-through rates and conversion impact. Monitor forecast accuracy (MAPE) weekly and retrain when accuracy degrades. Alert on clickstream pipeline lag — stale data means stale recommendations.
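The weekly accuracy check can be sketched as below. MAPE is the mean of |actual - forecast| / |actual|, expressed as a percentage. The retrain rule and its `tolerance_pct` default are illustrative assumptions, since the text does not say how much degradation should trigger retraining. Periods with zero actual sales are skipped here because the percentage error is undefined for them.

```python
def mape(actual, forecast):
    """Mean absolute percentage error in percent, skipping zero-actual periods."""
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when all actual values are zero")
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)


def needs_retrain(weekly_mape, baseline_mape, tolerance_pct=5.0):
    """Flag a retrain when this week's MAPE exceeds the baseline by more than
    tolerance_pct percentage points (the threshold is an assumed example)."""
    return weekly_mape > baseline_mape + tolerance_pct


# Two weeks of actuals against forecasts, each off by 10 percent.
this_week = mape([100, 200], [110, 180])
```

Slow-moving SKUs with many zero-sales days make MAPE unstable, so a scaled or absolute error metric may suit them better.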

      Cost Optimisation

      Use spot instances for batch demand forecasting jobs. Scale recommendation serving endpoints to zero during low-traffic hours. Cache popular recommendations to reduce inference calls. Archive clickstream data older than 90 days to cold storage.

      Security and Governance

      Mask customer PII in analytics tables — analysts should work with hashed identifiers. Comply with GDPR/CCPA by implementing data deletion pipelines for customer opt-out requests. Log all access to customer data for audit compliance.
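A minimal sketch of producing hashed identifiers for analytics tables. It uses a keyed hash (HMAC-SHA256) rather than a bare hash, which is an assumption on top of the text: a plain hash of an email address can be reversed by hashing candidate addresses, whereas a keyed hash cannot be without the secret. The function name and the key handling are hypothetical. In a real deployment the key would come from a secret store, not from source code.

```python
import hashlib
import hmac


def pseudonymise(customer_id: str, secret_key: bytes) -> str:
    """Return a stable, keyed SHA-256 digest of a customer identifier."""
    # Normalise first so "Ana@Example.com " and "ana@example.com" map together.
    normalised = customer_id.strip().lower()
    return hmac.new(secret_key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder key for illustration only.
example_key = b"replace-with-key-from-secret-store"
token = pseudonymise("Ana@Example.com ", example_key)
```

Because the same input always yields the same token, analysts can still join tables on the hashed column without seeing the underlying value.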

      Common Pitfalls and Recommended Patterns

      • Training recommendation models on stale data — retrain weekly with fresh interaction data
      • Not accounting for seasonality in demand forecasts — include holiday calendars and promotional events
      • Building separate customer profiles per channel — unify online and offline before analysis
      • Serving recommendations without a fallback — implement popularity-based defaults for cold-start users
      • Ignoring data freshness in dashboards — stale inventory data leads to poor replenishment decisions
      • Not A/B testing recommendation changes — measure business impact before full rollout
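The cold-start fallback from the list above can be sketched in a few lines. The data shapes are assumptions: `interactions` is a list of `(user_id, product_id)` pairs and `personalised` maps a user to model output. A user with no personalised results, or too few, is topped up from the most popular products.

```python
from collections import Counter


def top_popular(interactions, k):
    """Return the k most frequently interacted-with product ids."""
    counts = Counter(product for _, product in interactions)
    return [product for product, _ in counts.most_common(k)]


def recommend(user_id, personalised, popular, k=3):
    """Personalised results first, padded with popular items up to k."""
    recs = list(personalised.get(user_id, []))[:k]
    for product in popular:
        if len(recs) >= k:
            break
        if product not in recs:
            recs.append(product)
    return recs


interactions = [("u1", "a"), ("u2", "a"), ("u1", "b"),
                ("u3", "c"), ("u2", "b"), ("u3", "a")]
popular = top_popular(interactions, 3)
```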

      Frequently Asked Questions

      Can Databricks handle Black Friday traffic spikes?

      Yes. Auto-scaling clusters and serverless endpoints absorb traffic spikes elastically. Pre-warm serving endpoints before known peak events so that the first requests of the peak do not pay cold-start latency.

      How do we unify online and in-store customer identities?

      Use deterministic matching (email, loyalty ID) and probabilistic matching (device fingerprints, address) to create a unified customer graph in the gold layer.
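The deterministic half of this answer can be sketched with a union-find structure: any two records that share a normalised email or loyalty ID end up in the same customer cluster, including transitively. The record layout (`record_id`, `email`, `loyalty_id`) is hypothetical, and probabilistic matching is not covered here.

```python
def unify_identities(records):
    """Map each record_id to a cluster id using exact email / loyalty_id matches."""
    parent = {r["record_id"]: r["record_id"] for r in records}

    def find(x):
        # Follow parent links to the cluster root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the lexicographically smaller id as root so output is stable.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    first_seen = {}
    for record in records:
        for key in ("email", "loyalty_id"):
            value = record.get(key)
            if not value:
                continue
            token = (key, value.strip().lower())
            if token in first_seen:
                union(first_seen[token], record["record_id"])
            else:
                first_seen[token] = record["record_id"]
    return {record_id: find(record_id) for record_id in parent}


records = [
    {"record_id": "web-1", "email": "a@x.com"},
    {"record_id": "pos-1", "loyalty_id": "L1"},
    {"record_id": "app-1", "email": "A@X.com", "loyalty_id": "L1"},
    {"record_id": "web-2", "email": "b@x.com"},
]
clusters = unify_identities(records)
```

Here the app record links the web and point-of-sale records through its email and loyalty ID, so all three share one cluster while `web-2` stays separate.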

      What about real-time pricing optimisation?

      Databricks supports real-time feature computation and model serving that can power dynamic pricing. Combine demand elasticity models with inventory levels for automated price adjustments.

      How long does it take to see ROI from personalisation?

      Most retailers see measurable uplift in click-through and conversion rates within 4-8 weeks of deploying recommendation models. Revenue impact grows as models learn from more interaction data.

      Can we integrate with our existing marketing platforms?

      Yes. Use Delta Sharing or JDBC connections to push customer segments and scores to marketing automation platforms. Reverse ETL patterns send model outputs to operational systems.