Reducing Data Infrastructure Costs with the Lakehouse

The lakehouse architecture cuts infrastructure costs by storing all data in open formats on low-cost cloud object storage while delivering warehouse-class query performance through optimised compute engines. Organisations that consolidate separate data lakes and warehouses commonly report total cost of ownership reductions of 30-60%.

Who this is for:

Part of the How Databricks Can Help Your Business section of the Databricks tutorial series.

Architecture / Concept Overview: Reducing Data Infrastructure Costs with the Lakehouse

Traditional architectures require separate storage for the data lake and data warehouse, with expensive ETL copying data between them. The lakehouse eliminates this duplication by serving both analytical and engineering workloads from one storage layer.

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
    classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    Raw[Raw Data] --> ObjStore[(Object Storage)]
    ObjStore --> Delta[(Delta Tables)]
    Delta --> Serverless[Serverless SQL]
    Delta --> Spot[Spot Clusters]
    Delta --> Jobs[Scheduled Jobs]
    class Raw source
    class ObjStore storage
    class Delta storage
    class Serverless serving
    class Spot processing
    class Jobs ingestion

*Figure 1 — Single storage layer serves multiple compute tiers, each optimised for cost.*

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
    classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    OldLake[(Data Lake $)] --> ETL1[ETL Copy $$]
    ETL1 --> OldWH[(Warehouse $$$)]
    OldWH --> BI[BI Queries]
    NewLake[(Lakehouse $)] --> SQL[SQL Engine]
    SQL --> BI2[BI Queries]
    class OldLake storage
    class ETL1 ingestion
    class OldWH governance
    class BI serving
    class NewLake storage
    class SQL processing
    class BI2 serving

*Figure 2 — Cost comparison: traditional lake + warehouse vs unified lakehouse.*
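The comparison in Figure 2 comes down to simple arithmetic. The sketch below shows one way to estimate the reduction; every monthly figure in it is a hypothetical placeholder rather than a benchmark, and the `MonthlyCosts` and `tco_reduction` names are invented for illustration. Substitute figures from your own billing exports.

```python
from dataclasses import dataclass


@dataclass
class MonthlyCosts:
    """Monthly platform costs in one currency (all values hypothetical)."""
    storage: float
    compute: float
    etl_copy: float = 0.0  # cost of copying data between lake and warehouse

    def total(self) -> float:
        return self.storage + self.compute + self.etl_copy


def tco_reduction(before: MonthlyCosts, after: MonthlyCosts) -> float:
    """Return the fractional cost reduction when moving from `before` to `after`."""
    if before.total() <= 0:
        raise ValueError("baseline cost must be positive")
    return (before.total() - after.total()) / before.total()


# Hypothetical inputs: the traditional setup pays for storage twice
# (lake and warehouse) and for the ETL jobs that copy data between them.
traditional = MonthlyCosts(storage=4_000, compute=12_000, etl_copy=4_000)
lakehouse = MonthlyCosts(storage=2_500, compute=9_500)

print(f"Estimated reduction: {tco_reduction(traditional, lakehouse):.0%}")
```

Splitting the baseline into storage, compute, and copy costs makes it easier to see which of the three actually drives the saving in a given environment.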

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
    classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    Budget[Cloud Budget]
    Budget --> Storage[Storage Costs]
    Budget --> Compute[Compute Costs]
    Budget --> Network[Networking]
    Storage --> Lifecycle[Lifecycle Policies]
    Compute --> Autoscale[Auto-scaling]
    Compute --> SpotInst[Spot Instances]
    Compute --> AutoTerm[Auto-termination]
    Network --> PrivLink[Same-region Access]
    class Budget source
    class Storage storage
    class Compute processing
    class Network ingestion
    class Lifecycle storage
    class Autoscale serving
    class SpotInst serving
    class AutoTerm serving
    class PrivLink governance

*Figure 3 — Key cost levers across storage, compute, and networking.*

Key Terms

Prerequisites and Setup

An existing cloud account with cost monitoring enabled (AWS Cost Explorer, Azure Cost Management, or GCP Billing)
A Databricks workspace with system tables enabled for billing analysis
Historical spend data from your current data platform for comparison
Admin access to configure cluster policies and warehouse settings

Step-by-Step Implementation

Configuration Reference

Reducing Data Infrastructure Costs with the Lakehouse configuration options
Parameter	Description	Recommended Value
Auto-termination	Idle shutdown for clusters	10-15 minutes
Spot instance ratio	Proportion of spot workers	80-100% for batch jobs
Warehouse auto-stop	Idle timeout for SQL warehouses	5-10 minutes
Delta retention	Time travel history period	7 days for most tables
Photon acceleration	Vectorised engine for SQL	Enable on all SQL workloads
Predictive optimisation	Automated table maintenance	Enable at workspace level
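
As an illustration of how the recommended values might be applied, the sketch below builds cluster and SQL warehouse payloads as plain Python dictionaries. The field names (`autotermination_minutes`, `aws_attributes`, `runtime_engine`, `auto_stop_mins`, and so on) follow the Databricks Clusters and SQL Warehouses REST APIs and should be verified against the current API reference. The resource names, node type, and runtime version are placeholder assumptions, and the AWS-specific block would differ on Azure or GCP.

```python
def cluster_spec(max_workers: int = 8, idle_minutes: int = 15) -> dict:
    """Build a cost-conscious cluster payload (AWS-flavoured sketch)."""
    return {
        "cluster_name": "batch-etl",              # hypothetical name
        "spark_version": "15.4.x-scala2.12",      # placeholder runtime version
        "node_type_id": "i3.xlarge",              # placeholder instance type
        "runtime_engine": "PHOTON",               # Photon acceleration
        "autoscale": {"min_workers": 1, "max_workers": max_workers},
        "autotermination_minutes": idle_minutes,  # table suggests 10-15
        "aws_attributes": {
            "first_on_demand": 1,                 # keep the driver on-demand
            "availability": "SPOT_WITH_FALLBACK", # spot workers, on-demand fallback
        },
    }


def sql_warehouse_spec(auto_stop_mins: int = 10) -> dict:
    """Build a SQL warehouse payload with a short idle timeout."""
    return {
        "name": "team-bi",                        # hypothetical name
        "cluster_size": "Small",
        "auto_stop_mins": auto_stop_mins,         # table suggests 5-10
        "enable_photon": True,
        "enable_serverless_compute": True,
        "min_num_clusters": 1,
        "max_num_clusters": 2,
    }


if __name__ == "__main__":
    import json
    print(json.dumps(cluster_spec(), indent=2))
    print(json.dumps(sql_warehouse_spec(), indent=2))
```

Keeping the payloads in code rather than setting them by hand in the UI makes the cost-related settings reviewable and repeatable across workspaces.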

Monitoring, Cost, and Security Considerations

Monitoring

Use system.billing.usage and system.billing.list_prices to build near-real-time cost dashboards. Track DBU consumption by team (for example, via custom tags), workspace, and SKU. Alert on week-over-week spend increases above 20%.
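
The week-over-week alert can be sketched in a few lines. The example assumes weekly spend totals have already been aggregated per team (for instance from system.billing.usage joined to system.billing.list_prices); the team names, figures, and the `week_over_week_alerts` helper are hypothetical.

```python
from typing import Dict, List


def week_over_week_alerts(
    weekly_spend: Dict[str, List[float]],
    threshold: float = 0.20,
) -> Dict[str, float]:
    """Return {team: fractional increase} where the latest week grew by more than `threshold`.

    `weekly_spend` maps a team to its weekly totals, oldest first.
    """
    alerts: Dict[str, float] = {}
    for team, series in weekly_spend.items():
        if len(series) < 2:
            continue  # need at least two weeks to compare
        previous, latest = series[-2], series[-1]
        if previous <= 0:
            continue  # no meaningful baseline to compare against
        change = (latest - previous) / previous
        if change > threshold:
            alerts[team] = change
    return alerts


# Hypothetical weekly totals per team, oldest week first.
spend = {
    "analytics": [1000.0, 1300.0],  # +30%, above the 20% threshold
    "ml": [2000.0, 2100.0],         # +5%, below the threshold
}
alerts = week_over_week_alerts(spend)
for team, change in alerts.items():
    print(f"{team}: spend up {change:.0%} week over week")
```

A relative threshold suits teams of different sizes, although very small baselines can trigger noisy alerts, so a minimum absolute spend may also be worth adding.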

Cost Optimisation

Consolidate under-utilised clusters into shared pools. Eliminate zombie clusters (running but unused). Use single-node clusters for small development workloads. Archive cold data to cheaper storage tiers with lifecycle policies.

Security and Governance

Cluster policies prevent users from provisioning expensive instances without approval. Budget alerts ensure cost overruns are caught early. Service principals with scoped permissions prevent accidental resource creation.
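
A minimal sketch of such a cluster policy follows, expressed as a Python dictionary that serialises to the policy definition JSON. The attribute paths and rule types (`allowlist`, `range`, `fixed`) follow the Databricks compute policy definition format and should be checked against the current reference. The instance types, limits, and the `team` tag are illustrative assumptions.

```python
import json
from typing import Iterable


def dev_cluster_policy(
    allowed_node_types: Iterable[str],
    team: str,
    max_workers: int = 4,
    max_idle_minutes: int = 15,
) -> dict:
    """Build a policy definition that caps size and idle time for development clusters."""
    return {
        # Only inexpensive instance types may be selected.
        "node_type_id": {"type": "allowlist", "values": list(allowed_node_types)},
        # Cap the autoscaling upper bound.
        "autoscale.max_workers": {"type": "range", "maxValue": max_workers},
        # Force a short idle timeout, defaulting to 10 minutes.
        "autotermination_minutes": {
            "type": "range",
            "minValue": 10,
            "maxValue": max_idle_minutes,
            "defaultValue": 10,
        },
        # Tag every cluster so spend can be attributed to a team.
        "custom_tags.team": {"type": "fixed", "value": team},
    }


policy = dev_cluster_policy(["m5.large", "m5.xlarge"], team="data-eng")
print(json.dumps(policy, indent=2))
```

Fixing a team tag in the policy ties governance back to monitoring, because every cluster created under the policy carries a tag that billing queries can group by.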

Common Pitfalls and Recommended Patterns

Leaving auto-termination at the 120-minute default: 10-15 minutes is appropriate for most interactive workloads
Using on-demand instances for ETL jobs that tolerate retries: spot instances can cost 60-90% less than on-demand
Never running VACUUM on Delta tables: files from old table versions accumulate, inflating storage costs
Keeping time travel retention at 30 days when 7 days suffices: every extra day of retention keeps superseded data files in storage
Provisioning large warehouses for small teams: start with Small and scale up only if queue times are unacceptable
Ignoring inter-region data transfer costs: keep compute and storage in the same region
Not reviewing system billing tables monthly: cost creep is invisible without regular analysis
Running development workloads on production-sized clusters: use cluster policies to enforce appropriate sizing
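
For the two Delta retention pitfalls in this list, the sketch below generates the SQL statements that set a 7-day retention window and then remove older files. It assumes the standard Delta Lake table property `delta.deletedFileRetentionDuration` and the `VACUUM ... RETAIN n HOURS` syntax; the table name and the `retention_statements` helper are hypothetical.

```python
from typing import List


def retention_statements(table: str, retention_days: int = 7) -> List[str]:
    """Return SQL that sets deleted-file retention and vacuums the table.

    Delta's default safety check rejects VACUUM with less than 7 days of
    retention, so this helper refuses shorter windows as well.
    """
    if retention_days < 7:
        raise ValueError("retention below 7 days risks breaking readers and time travel")
    hours = retention_days * 24
    return [
        f"ALTER TABLE {table} SET TBLPROPERTIES "
        f"('delta.deletedFileRetentionDuration' = 'interval {retention_days} days')",
        f"VACUUM {table} RETAIN {hours} HOURS",
    ]


# Hypothetical three-level table name.
for statement in retention_statements("main.sales.orders"):
    print(statement)
```

In a notebook, each statement could be passed to `spark.sql(...)`. Running them on a schedule keeps storage growth bounded without anyone having to remember to do it.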

Frequently Asked Questions

How much can we realistically save by switching to a lakehouse?

Most organisations report 30-60% TCO reduction when consolidating from separate lake and warehouse solutions. Savings come from eliminating data duplication, reducing ETL complexity, and using spot compute.

Does cheaper mean slower?

No. The Photon engine and Delta Lake optimisations deliver warehouse-class performance on object storage. Many organisations see performance improvements alongside cost reduction.

Can we set per-team budgets?

Yes. Use Databricks account-level budgets with filters by workspace, SKU, or tag. Combine with cluster policies to enforce per-team compute limits.

What about egress costs?

Keep compute and storage in the same cloud region to avoid cross-region egress. Use private endpoints to avoid public internet egress charges.

How do we handle bursty workloads cost-effectively?

Serverless SQL warehouses scale to zero when idle and spin up in seconds for burst traffic. For batch workloads, auto-scaling clusters with spot instances handle variable load at minimal cost.
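
A rough way to see the effect of scale-to-zero on a bursty workload is to compare billed hours. The model below is deliberately simple and rests on stated assumptions: a flat hourly rate, one idle-timeout window billed after each burst, and no startup overhead. The rate, burst count, and `monthly_compute_cost` helper are hypothetical.

```python
from typing import Optional


def monthly_compute_cost(
    active_hours_per_day: float,
    hourly_rate: float,
    idle_timeout_minutes: Optional[float] = None,
    bursts_per_day: int = 0,
    days: int = 30,
) -> float:
    """Estimate monthly cost; with `idle_timeout_minutes=None` the warehouse never stops."""
    if idle_timeout_minutes is None:
        billed_hours = 24.0  # always-on: billed around the clock
    else:
        # Each burst is followed by one idle-timeout window before auto-stop.
        idle_hours = bursts_per_day * idle_timeout_minutes / 60
        billed_hours = min(24.0, active_hours_per_day + idle_hours)
    return billed_hours * hourly_rate * days


# Hypothetical workload: 3 active hours per day spread over 6 bursts.
always_on = monthly_compute_cost(3, hourly_rate=10.0)
auto_stop = monthly_compute_cost(3, hourly_rate=10.0, idle_timeout_minutes=10, bursts_per_day=6)
print(f"always-on: {always_on:.0f}, auto-stop: {auto_stop:.0f}")
```

The model also shows why the idle timeout matters: with many short bursts, the trailing idle windows can account for a noticeable share of billed time.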