Apply forecasting per SKU using Pandas UDF or applyInPandas

Databricks empowers retail and e-commerce organisations to unify customer, inventory, and transaction data on a single lakehouse platform — enabling real-time personalisation, demand forecasting, and supply chain optimisation at scale. Retailers replace fragmented point solutions with one governed environment for analytics and AI.

Who this is for:

Part of the How Databricks Can Help Your Business section of the Databricks tutorial series.

Architecture / Concept Overview: Apply forecasting per SKU using Pandas UDF or applyInPandas

Retailers generate massive volumes of data from point-of-sale systems, e-commerce platforms, loyalty programmes, supply chain systems, and marketing channels. The lakehouse unifies these streams to power real-time decisions across merchandising, marketing, and operations.

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
  classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  POS[POS Systems] --> Ingest[Streaming Ingest]
  Web[E-Commerce Clicks] --> Ingest
  Loyalty[Loyalty Programme] --> Ingest
  Supply[Supply Chain] --> Ingest
  Ingest --> DL[(Delta Lake)]
  DL --> Recommend[Recommendations]
  DL --> Forecast[Demand Forecast]
  DL --> CustomerBI[Customer Analytics]
  class POS source
  class Web source
  class Loyalty source
  class Supply source
  class Ingest ingestion
  class DL storage
  class Recommend processing
  class Forecast governance
  class CustomerBI serving

*Figure 1 — Retail data streams converge in the lakehouse to power personalisation, forecasting, and analytics.*

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
  classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  Clicks[Clickstream] --> SessionID[Session Tracking]
  SessionID --> Profile[Customer Profile]
  Profile --> Features[(Feature Store)]
  Features --> RecModel[Recommendation Model]
  RecModel --> API[Serving Endpoint]
  API --> Website[Product Page]
  class Clicks source
  class SessionID ingestion
  class Profile processing
  class Features storage
  class RecModel governance
  class API serving
  class Website serving

*Figure 2 — Real-time personalisation pipeline from clickstream to product recommendations.*

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
  classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
  Demand[Demand Forecasting]
  Demand --> Historical[Historical Sales]
  Demand --> Seasonal[Seasonality Models]
  Demand --> External[External Signals]
  Historical --> SKU[SKU-Level Forecast]
  Seasonal --> SKU
  External --> SKU
  SKU --> Replenish[Replenishment Orders]
  class Demand processing
  class Historical storage
  class Seasonal governance
  class External source
  class SKU serving
  class Replenish ingestion

*Figure 3 — Demand forecasting combines historical patterns, seasonality, and external signals.*

Key Terms

Prerequisites and Setup

Databricks workspace with streaming capabilities enabled
Connections to e-commerce platform APIs (orders, products, customers)
POS system data feed (real-time or batch)
Cloud storage for clickstream event data
Product catalogue and inventory system access

Step-by-Step Implementation
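
As a minimal sketch of the per-SKU pattern named in the title, the function below takes the sales history of one SKU as a pandas DataFrame and returns a daily forecast. The column names (`sku`, `ds`, `units`), the names `forecast_sku` and `forecast_all_skus`, the 8-week lookback, and the 28-day horizon are all assumptions for illustration. The seasonal-naive method (mean units per weekday) is only a stand-in for whatever forecasting model you actually use.

```python
import pandas as pd

HORIZON_DAYS = 28    # assumption: lower end of a 28-90 day forecast horizon
LOOKBACK_WEEKS = 8   # assumption: how much recent history to average over

# Output schema handed to Spark as a DDL string (column names are assumptions).
FORECAST_SCHEMA = "sku string, ds date, yhat double"


def forecast_sku(pdf: pd.DataFrame) -> pd.DataFrame:
    """Seasonal-naive forecast for ONE SKU: mean units per weekday.

    Spark calls this once per group, passing every row of that SKU.
    """
    sku = pdf["sku"].iloc[0]
    hist = pdf.assign(ds=pd.to_datetime(pdf["ds"])).sort_values("ds")

    # Keep only the most recent weeks of history.
    last_day = hist["ds"].max()
    recent = hist[hist["ds"] > last_day - pd.Timedelta(weeks=LOOKBACK_WEEKS)]

    weekday_mean = recent.groupby(recent["ds"].dt.dayofweek)["units"].mean()
    overall_mean = recent["units"].mean()  # used when a weekday has no history

    future = pd.date_range(last_day + pd.Timedelta(days=1),
                           periods=HORIZON_DAYS, freq="D")
    yhat = [float(weekday_mean.get(day.dayofweek, overall_mean)) for day in future]

    return pd.DataFrame({"sku": sku, "ds": future.date, "yhat": yhat})


def forecast_all_skus(sales_sdf):
    """Fan the pandas function out across SKUs on a Spark DataFrame."""
    return sales_sdf.groupBy("sku").applyInPandas(forecast_sku, schema=FORECAST_SCHEMA)
```

Because `forecast_sku` is plain pandas, it can be checked locally on a small sample before being run through `applyInPandas`. Each group is loaded into memory on a single executor, so this pattern suits many small series rather than a few very large ones.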

Configuration Reference

Apply forecasting per SKU using Pandas UDF or applyInPandas configuration options
Parameter	Description	Recommended Value
Streaming checkpoint	Location for stream state	Dedicated cloud storage path
Recommendation refresh	How often to retrain	Weekly for collaborative filtering
Demand forecast horizon	Prediction window	28-90 days
Clickstream retention	Raw event retention	90 days bronze, aggregated in gold
Serving endpoint size	Real-time inference compute	Medium (scale-to-zero for off-peak)
Customer 360 refresh	Profile update frequency	Daily for batch, hourly for key metrics

Monitoring, Cost, and Security Considerations

Monitoring

Track recommendation model click-through rates and conversion impact. Monitor forecast accuracy (MAPE) weekly and retrain when accuracy degrades. Alert on clickstream pipeline lag — stale data means stale recommendations.
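
The weekly accuracy check could look something like the sketch below. MAPE is computed in the usual way, skipping periods with zero actual demand because the percentage error is undefined there. The 20% retraining threshold and the function names are hypothetical and should be tuned to your own tolerance.

```python
RETRAIN_THRESHOLD_PCT = 20.0  # hypothetical threshold, not a recommended value


def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Periods where actual demand is zero are skipped (error is undefined).
    """
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    if not pairs:
        return float("nan")
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)


def needs_retrain(actual, forecast, threshold=RETRAIN_THRESHOLD_PCT):
    """True when accuracy has degraded past the threshold.

    A NaN MAPE (nothing to evaluate) compares as False, so no retrain is triggered.
    """
    return mape(actual, forecast) > threshold
```

For slow-moving SKUs with many zero-demand days, MAPE becomes unstable; a scaled or weighted error measure may be a better fit there.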

Cost Optimisation

Use spot instances for batch demand forecasting jobs. Scale recommendation serving endpoints to zero during low-traffic hours. Cache popular recommendations to reduce inference calls. Archive clickstream data older than 90 days to cold storage.

Security and Governance

Mask customer PII in analytics tables — analysts should work with hashed identifiers. Comply with GDPR/CCPA by implementing data deletion pipelines for customer opt-out requests. Log all access to customer data for audit compliance.
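
One way to produce the hashed identifiers mentioned above is a keyed hash, sketched here with Python's standard library. Using HMAC-SHA256 rather than a plain hash is an assumption of this sketch, chosen so identifiers cannot be recovered by hashing guessed emails. The function name `pseudonymise` is hypothetical, and the key should come from a secret store rather than source code.

```python
import hashlib
import hmac


def pseudonymise(customer_id: str, secret_key: bytes) -> str:
    """Return a stable, keyed SHA-256 digest of a customer identifier.

    The identifier is trimmed and lower-cased first so that the same
    customer always maps to the same token.
    """
    normalised = customer_id.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalised, hashlib.sha256).hexdigest()
```

Note that keyed hashing is pseudonymisation, not anonymisation: anyone holding the key can re-link tokens to customers, so the key itself needs access controls.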

Common Pitfalls and Recommended Patterns

Training recommendation models on stale data — retrain weekly with fresh interaction data
Not accounting for seasonality in demand forecasts — include holiday calendars and promotional events
Building separate customer profiles per channel — unify online and offline before analysis
Serving recommendations without a fallback — implement popularity-based defaults for cold-start users
Ignoring data freshness in dashboards — stale inventory data leads to poor replenishment decisions
Not A/B testing recommendation changes — measure business impact before full rollout
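
The cold-start fallback in the list above can be sketched as a simple fill-in step: personalised results come first, and popular items pad out the list when there are too few (or none). The function name, the dictionary-based lookup, and the default of five items are hypothetical.

```python
def recommend(user_id, personalised, popular, k=5):
    """Return up to k item IDs, falling back to popular items.

    personalised: dict mapping user_id -> ranked list of item IDs
    popular:      ranked list of globally popular item IDs
    """
    recs = list(personalised.get(user_id, []))  # empty for cold-start users
    seen = set(recs)
    # Append popular items the user has not already been recommended.
    recs.extend(item for item in popular if item not in seen)
    return recs[:k]
```

A brand-new user therefore receives the top `k` popular items, while a user with a short history gets their personalised items first.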

Frequently Asked Questions

Can Databricks handle Black Friday traffic spikes?

Yes. Auto-scaling clusters and serverless endpoints scale elastically with traffic. Pre-warm serving endpoints before known peak events to avoid cold-start latency.

How do we unify online and in-store customer identities?

Use deterministic matching (email, loyalty ID) and probabilistic matching (device fingerprints, address) to create a unified customer graph in the gold layer.
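
The deterministic half of that matching can be sketched with a small union-find: any two records that share a normalised email or loyalty ID end up in the same group. The record layout (dicts with `email` and `loyalty_id` keys) and the function name are assumptions, and probabilistic matching is out of scope for this sketch.

```python
def unify_identities(records):
    """Assign a group ID to each record; shared email or loyalty ID means same group.

    records: list of dicts with optional "email" and "loyalty_id" keys.
    Returns a list of group IDs aligned with the input order.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (field, normalised value) -> index of first record with it
    for idx, rec in enumerate(records):
        for field in ("email", "loyalty_id"):
            value = rec.get(field)
            if not value:
                continue
            key = (field, value.strip().lower())
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])  # merge the two groups
            else:
                first_seen[key] = idx

    return [find(i) for i in range(len(records))]
```

Matches chain transitively: an online record and an in-store record that never share a field directly are still unified if a third record links them.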

What about real-time pricing optimisation?

Databricks supports real-time feature computation and model serving that can power dynamic pricing. Combine demand elasticity models with inventory levels for automated price adjustments.

How long does it take to see ROI from personalisation?

Most retailers see measurable uplift in click-through and conversion rates within 4-8 weeks of deploying recommendation models. Revenue impact grows as models learn from more interaction data.

Can we integrate with our existing marketing platforms?

Yes. Use Delta Sharing or JDBC connections to push customer segments and scores to marketing automation platforms. Reverse ETL patterns send model outputs to operational systems.