Reducing Data Infrastructure Costs with the Lakehouse
The lakehouse architecture cuts infrastructure costs by storing all data in open formats on cheap cloud object storage while delivering warehouse-class query performance through optimised compute engines. Organisations typically reduce total cost of ownership by 30-60% compared to maintaining separate data lakes and warehouses.
Who this is for:
Part of the How Databricks Can Help Your Business section of the Databricks tutorial series.
Architecture / Concept Overview: Reducing Data Infrastructure Costs with the Lakehouse
Traditional architectures require separate storage for the data lake and data warehouse, with expensive ETL copying data between them. The lakehouse eliminates this duplication by serving both analytical and engineering workloads from one storage layer.
*Figure 1 — Single storage layer serves multiple compute tiers, each optimised for cost.*
*Figure 2 — Cost comparison: traditional lake + warehouse vs unified lakehouse.*
*Figure 3 — Key cost levers across storage, compute, and networking.*
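The duplication argument in this section can be turned into simple arithmetic. The sketch below compares monthly storage-side cost for a lake plus a warehouse copy against a single lakehouse storage layer. Every price, volume, and field name here is a hypothetical placeholder, and compute cost is deliberately left out, so treat it as a template for your own numbers rather than a benchmark.

```python
from dataclasses import dataclass


@dataclass
class StorageProfile:
    # All values are illustrative placeholders, not vendor prices.
    data_tb: float                 # logical data volume held in the lake (TB)
    object_price_per_tb: float     # object storage price ($ per TB-month)
    warehouse_price_per_tb: float  # warehouse storage price ($ per TB-month)
    warehouse_fraction: float      # share of lake data also copied to the warehouse
    etl_copy_cost: float           # monthly cost of ETL that copies lake -> warehouse


def traditional_monthly_cost(p: StorageProfile) -> float:
    """Lake storage + duplicated warehouse storage + the ETL that keeps them in sync."""
    lake = p.data_tb * p.object_price_per_tb
    warehouse_copy = p.data_tb * p.warehouse_fraction * p.warehouse_price_per_tb
    return lake + warehouse_copy + p.etl_copy_cost


def lakehouse_monthly_cost(p: StorageProfile) -> float:
    """One copy of the data on object storage, no lake -> warehouse copy job."""
    return p.data_tb * p.object_price_per_tb


profile = StorageProfile(
    data_tb=100,
    object_price_per_tb=23.0,
    warehouse_price_per_tb=40.0,
    warehouse_fraction=0.25,
    etl_copy_cost=800.0,
)
saving = traditional_monthly_cost(profile) - lakehouse_monthly_cost(profile)
print(f"Estimated monthly storage-side saving: ${saving:,.2f}")
```

Replacing the placeholder values with figures from your historical spend data gives a first-order estimate that can then be checked against actual bills.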
Key Terms
Prerequisites and Setup
- An existing cloud account with cost monitoring enabled (AWS Cost Explorer, Azure Cost Management, or GCP Billing)
- A Databricks workspace with system tables enabled for billing analysis
- Historical spend data from your current data platform for comparison
- Admin access to configure cluster policies and warehouse settings
Step-by-Step Implementation
Configuration Reference
| Parameter | Description | Recommended Value |
|---|---|---|
| Auto-termination | Idle shutdown for clusters | 10-15 minutes |
| Spot instance ratio | Proportion of spot workers | 80-100% for batch jobs |
| Warehouse auto-stop | Idle timeout for SQL warehouses | 5-10 minutes |
| Delta retention | Time travel history period | 7 days for most tables |
| Photon acceleration | Vectorised engine for SQL | Enable on all SQL workloads |
| Predictive optimisation | Automated table maintenance | Enable at workspace level |
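One way to keep these recommendations from drifting is to encode them and review actual settings against them. The sketch below does that in plain Python. The setting keys are hypothetical labels chosen for this example, not Databricks API field names, and the ranges simply mirror the table above.

```python
# Recommended ranges taken from the table above. Keys are illustrative labels,
# not real API field names.
RECOMMENDED_RANGES = {
    "auto_termination_minutes": (10, 15),
    "warehouse_auto_stop_minutes": (5, 10),
    "delta_retention_days": (7, 7),      # "7 days for most tables"
    "spot_worker_ratio": (0.8, 1.0),     # applies to batch jobs
}
REQUIRED_FLAGS = ("photon_enabled", "predictive_optimisation_enabled")


def review_settings(settings: dict) -> list:
    """Return a list of human-readable findings; an empty list means no deviations."""
    findings = []
    for key, (low, high) in RECOMMENDED_RANGES.items():
        value = settings.get(key)
        if value is None:
            findings.append(f"{key}: not set")
        elif not low <= value <= high:
            findings.append(f"{key}: {value} is outside the recommended {low}-{high}")
    for flag in REQUIRED_FLAGS:
        if not settings.get(flag, False):
            findings.append(f"{flag}: disabled")
    return findings


current = {
    "auto_termination_minutes": 120,
    "warehouse_auto_stop_minutes": 10,
    "delta_retention_days": 30,
    "spot_worker_ratio": 0.9,
    "photon_enabled": True,
}
for finding in review_settings(current):
    print(finding)
```

Tables with legitimate reasons for longer retention (audit or regulatory requirements, for example) would need an exception list, which this sketch does not model.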
Monitoring, Cost, and Security Considerations
Monitoring
Use system.billing.usage and system.billing.list_prices to build near-real-time cost dashboards (the billing tables are refreshed periodically rather than instantly). Track DBU consumption by team, workspace, and SKU. Alert on week-over-week spend increases above 20%.
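The week-over-week alert rule can be expressed in a few lines. The sketch below assumes weekly spend has already been aggregated, for instance by a query over the billing system tables, into (week label, cost) pairs; that input shape and the function name are assumptions made for this example.

```python
def week_over_week_alerts(weekly_spend, threshold=0.20):
    """Return (week, fractional_increase) for each week whose spend grew by more
    than `threshold` relative to the previous week.

    `weekly_spend` is a chronologically ordered list of (week_label, cost) pairs.
    """
    alerts = []
    for (_, previous), (week, current) in zip(weekly_spend, weekly_spend[1:]):
        if previous <= 0:
            continue  # no meaningful baseline to compare against
        change = (current - previous) / previous
        if change > threshold:
            alerts.append((week, round(change, 3)))
    return alerts


history = [("2024-W01", 1000.0), ("2024-W02", 1100.0), ("2024-W03", 1400.0)]
print(week_over_week_alerts(history))  # [('2024-W03', 0.273)]
```

Running the same function per team or per SKU, rather than on the workspace total, tends to make the source of an increase easier to find.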
Cost Optimisation
Consolidate under-utilised clusters into shared pools. Eliminate zombie clusters (running but unused). Use single-node clusters for small development workloads. Archive cold data to cheaper storage tiers with lifecycle policies.
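Zombie clusters are easier to eliminate once they are listed automatically. The sketch below flags clusters that are running but have shown no activity for longer than a chosen window. The record shape (name, state, last_activity) is a hypothetical inventory format for this example; in practice it would be populated from whatever cluster listing your platform provides.

```python
from datetime import datetime, timedelta


def find_zombie_clusters(clusters, now, max_idle=timedelta(hours=2)):
    """Return names of clusters that are RUNNING but idle for longer than `max_idle`."""
    zombies = []
    for cluster in clusters:
        if cluster["state"] != "RUNNING":
            continue  # terminated clusters accrue no compute cost
        if now - cluster["last_activity"] > max_idle:
            zombies.append(cluster["name"])
    return zombies


now = datetime(2024, 6, 1, 12, 0)
inventory = [
    {"name": "etl-nightly", "state": "TERMINATED", "last_activity": datetime(2024, 6, 1, 3, 0)},
    {"name": "adhoc-analytics", "state": "RUNNING", "last_activity": datetime(2024, 6, 1, 11, 30)},
    {"name": "forgotten-dev", "state": "RUNNING", "last_activity": datetime(2024, 5, 30, 9, 0)},
]
print(find_zombie_clusters(inventory, now))  # ['forgotten-dev']
```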
Security and Governance
Cluster policies prevent users from provisioning expensive instances without approval. Budget alerts ensure cost overruns are caught early. Service principals with scoped permissions prevent accidental resource creation.
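To make the cluster policy idea concrete, the sketch below defines a small policy and checks a requested cluster spec against it. The policy dictionary is modelled on the general shape of Databricks cluster policy definitions (attribute names mapped to rules such as range, allowlist, and fixed), but the evaluator is a simplified local stand-in written for this example, and the instance types and limits are placeholders. Check the official policy reference before relying on any attribute name.

```python
# Illustrative policy: caps idle time, restricts instance types, limits cluster size.
POLICY = {
    "autotermination_minutes": {"type": "range", "minValue": 10, "maxValue": 15},
    "node_type_id": {"type": "allowlist", "values": ["m5.large", "m5.xlarge"]},
    "num_workers": {"type": "range", "maxValue": 8},
}


def policy_violations(policy: dict, spec: dict) -> list:
    """Return a description of each rule in `policy` that `spec` breaks."""
    problems = []
    for attr, rule in policy.items():
        value = spec.get(attr)
        kind = rule["type"]
        if kind == "fixed":
            if value != rule["value"]:
                problems.append(f"{attr}: must be {rule['value']!r}")
        elif kind == "allowlist":
            if value not in rule["values"]:
                problems.append(f"{attr}: {value!r} is not an allowed value")
        elif kind == "range":
            if value is None:
                problems.append(f"{attr}: missing")
            elif "minValue" in rule and value < rule["minValue"]:
                problems.append(f"{attr}: {value} is below {rule['minValue']}")
            elif "maxValue" in rule and value > rule["maxValue"]:
                problems.append(f"{attr}: {value} is above {rule['maxValue']}")
    return problems


request = {"autotermination_minutes": 120, "node_type_id": "m5.24xlarge", "num_workers": 4}
for problem in policy_violations(POLICY, request):
    print(problem)
```

A check like this can run in a review pipeline before a cluster definition is submitted, so that oversized requests are caught before they incur cost.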
Common Pitfalls and Recommended Patterns
- Leaving auto-termination at the 120-minute default — 10-15 minutes is appropriate for most interactive workloads
- Using on-demand instances for ETL jobs that tolerate retries, when spot instances can cut compute cost by as much as 60-90% depending on instance type and region
- Never running VACUUM on Delta tables — old versions accumulate, inflating storage costs
- Keeping time travel retention at 30 days when 7 days suffices, since every extra day of retention keeps superseded data files in storage
- Provisioning large warehouses for small teams — start with Small and scale up only if queue times are unacceptable
- Ignoring inter-region data transfer costs — keep compute and storage in the same region
- Not reviewing system billing tables monthly — cost creep is invisible without regular analysis
- Running development workloads on production-sized clusters — use cluster policies to enforce appropriate sizing
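The auto-termination pitfall at the top of this list is easy to quantify. The sketch below estimates the cost of the idle tail, the time a cluster keeps running after its last command until the timeout fires. It assumes every session ends with the user simply walking away, and the session count and hourly rate are hypothetical inputs.

```python
def monthly_idle_tail_cost(sessions_per_month, idle_timeout_minutes, hourly_rate):
    """Cost of clusters sitting idle until auto-termination triggers.

    Assumes each session ends with one full idle timeout period being billed.
    """
    idle_hours = sessions_per_month * idle_timeout_minutes / 60
    return idle_hours * hourly_rate


# Hypothetical team: 60 interactive sessions a month on a cluster costing $6 per hour.
default_timeout = monthly_idle_tail_cost(60, 120, 6.0)
short_timeout = monthly_idle_tail_cost(60, 15, 6.0)
print(default_timeout, short_timeout)  # 720.0 90.0
```

Under these assumptions the shorter timeout removes most of the idle spend, at the price of an occasional cluster restart for users who return after a break.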
Frequently Asked Questions
How much can we realistically save by switching to a lakehouse?
Most organisations report 30-60% TCO reduction when consolidating from separate lake and warehouse solutions. Savings come from eliminating data duplication, reducing ETL complexity, and using spot compute.
Does cheaper mean slower?
No. The Photon engine and Delta Lake optimisations deliver warehouse-class performance on object storage. Many organisations see performance improvements alongside cost reduction.
Can we set per-team budgets?
Yes. Use Databricks account-level budgets with filters by workspace, SKU, or tag. Combine with cluster policies to enforce per-team compute limits.
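As a complement to platform budgets, per-team spend can also be checked from tagged usage data. The sketch below sums cost by a team tag and reports teams above their limit. The row shape (a cost value plus a tags dictionary), the tag key, and the budget figures are all assumptions made for this example.

```python
from collections import defaultdict


def spend_by_team(usage_rows, tag_key="team"):
    """Sum cost per team tag; rows without the tag are grouped under 'untagged'."""
    totals = defaultdict(float)
    for row in usage_rows:
        team = row.get("tags", {}).get(tag_key, "untagged")
        totals[team] += row["cost"]
    return dict(totals)


def over_budget(totals, budgets):
    """Return {team: overspend} for every team whose total exceeds its budget."""
    return {
        team: round(totals[team] - limit, 2)
        for team, limit in budgets.items()
        if totals.get(team, 0.0) > limit
    }


rows = [
    {"cost": 400.0, "tags": {"team": "analytics"}},
    {"cost": 250.0, "tags": {"team": "analytics"}},
    {"cost": 300.0, "tags": {"team": "ml"}},
    {"cost": 75.0, "tags": {}},
]
totals = spend_by_team(rows)
print(over_budget(totals, {"analytics": 500.0, "ml": 1000.0}))  # {'analytics': 150.0}
```

A large "untagged" total is itself a useful signal, since it shows how much spend cannot yet be attributed to any team.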
What about egress costs?
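Keep compute and storage in the same cloud region to avoid cross-region egress. Use private endpoints to keep traffic off the public internet and avoid internet egress charges, bearing in mind that private endpoints carry their own hourly and data-processing fees.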
How do we handle bursty workloads cost-effectively?
Serverless SQL warehouses scale to zero when idle and spin up in seconds for burst traffic. For batch workloads, auto-scaling clusters with spot instances handle variable load at minimal cost.
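The trade-off for bursty traffic can be estimated with a small model. The sketch below compares an always-on warehouse with one that scales to zero and pays only for busy hours plus the idle tail before each auto-stop. The hourly rates, burst counts, and the 730-hour month are hypothetical inputs; the scale-to-zero rate is deliberately set higher to reflect that serverless capacity may be priced differently.

```python
def always_on_cost(hourly_rate, hours_in_month=730):
    """Monthly cost of a warehouse that never stops."""
    return hourly_rate * hours_in_month


def scale_to_zero_cost(hourly_rate, busy_hours, bursts, auto_stop_minutes):
    """Monthly cost when the warehouse stops after each burst.

    Each burst is followed by one idle tail of `auto_stop_minutes` before shutdown.
    """
    idle_tail_hours = bursts * auto_stop_minutes / 60
    return hourly_rate * (busy_hours + idle_tail_hours)


fixed = always_on_cost(hourly_rate=4.0)
elastic = scale_to_zero_cost(hourly_rate=5.0, busy_hours=100, bursts=120, auto_stop_minutes=5)
print(fixed, elastic)  # 2920.0 550.0
```

The model also shows where scale-to-zero stops paying off: as busy hours approach the full month, the higher hourly rate dominates and the always-on option becomes cheaper.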