Reducing Data Infrastructure Costs with the Lakehouse

The lakehouse architecture cuts infrastructure costs by storing all data in open formats on cheap cloud object storage while delivering warehouse-class query performance through optimised compute engines. Organisations that consolidate separate data lakes and warehouses commonly report total cost of ownership reductions of 30-60%, with the actual figure depending on workload mix and on how much data was previously duplicated.

    Who this is for:

    Part of the How Databricks Can Help Your Business section of the Databricks tutorial series.

    Architecture / Concept Overview: Reducing Data Infrastructure Costs with the Lakehouse

    Traditional architectures require separate storage for the data lake and data warehouse, with expensive ETL copying data between them. The lakehouse eliminates this duplication by serving both analytical and engineering workloads from one storage layer.

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    flowchart LR
        classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        Raw[Raw Data] --> ObjStore[(Object Storage)]
        ObjStore --> Delta[(Delta Tables)]
        Delta --> Serverless[Serverless SQL]
        Delta --> Spot[Spot Clusters]
        Delta --> Jobs[Scheduled Jobs]
        class Raw source
        class ObjStore storage
        class Delta storage
        class Serverless serving
        class Spot processing
        class Jobs ingestion

    *Figure 1 — Single storage layer serves multiple compute tiers, each optimised for cost.*

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    flowchart LR
        classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        OldLake[(Data Lake $)] --> ETL1[ETL Copy $$]
        ETL1 --> OldWH[(Warehouse $$$)]
        OldWH --> BI[BI Queries]
        NewLake[(Lakehouse $)] --> SQL[SQL Engine]
        SQL --> BI2[BI Queries]
        class OldLake storage
        class ETL1 ingestion
        class OldWH governance
        class BI serving
        class NewLake storage
        class SQL processing
        class BI2 serving

    *Figure 2 — Cost comparison: traditional lake + warehouse vs unified lakehouse.*
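The comparison in Figure 2 can be turned into simple arithmetic. The sketch below is a minimal model, not a pricing tool: the `MonthlyCosts` class, its four cost components, and every number in it are hypothetical placeholders chosen for illustration, and should be replaced with figures from your own billing exports.

```python
from dataclasses import dataclass


@dataclass
class MonthlyCosts:
    """Monthly cost components in one currency unit (all figures hypothetical)."""
    lake_storage: float = 0.0
    warehouse_storage: float = 0.0
    etl_compute: float = 0.0
    query_compute: float = 0.0

    def total(self) -> float:
        return (self.lake_storage + self.warehouse_storage
                + self.etl_compute + self.query_compute)


def saving_fraction(before: MonthlyCosts, after: MonthlyCosts) -> float:
    """Fraction of the 'before' bill that disappears in the 'after' scenario."""
    return (before.total() - after.total()) / before.total()


# Illustrative inputs only: a lake plus a warehouse with an ETL copy between them...
traditional = MonthlyCosts(lake_storage=2_000, warehouse_storage=6_000,
                           etl_compute=4_000, query_compute=8_000)
# ...versus a single storage layer with no warehouse copy and no copy job.
lakehouse = MonthlyCosts(lake_storage=2_500, query_compute=7_500)

print(f"Traditional: {traditional.total():,.0f}")
print(f"Lakehouse:   {lakehouse.total():,.0f}")
print(f"Saving:      {saving_fraction(traditional, lakehouse):.0%}")
```

With these made-up inputs the saving comes out at 50%. The point of the model is the structure: the warehouse storage and ETL copy terms drop to zero, while lake storage and query compute remain.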

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    graph TD
        classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        Budget[Cloud Budget]
        Budget --> Storage[Storage Costs]
        Budget --> Compute[Compute Costs]
        Budget --> Network[Networking]
        Storage --> Lifecycle[Lifecycle Policies]
        Compute --> Autoscale[Auto-scaling]
        Compute --> SpotInst[Spot Instances]
        Compute --> AutoTerm[Auto-termination]
        Network --> PrivLink[Same-region Access]
        class Budget source
        class Storage storage
        class Compute processing
        class Network ingestion
        class Lifecycle storage
        class Autoscale serving
        class SpotInst serving
        class AutoTerm serving
        class PrivLink governance

    *Figure 3 — Key cost levers across storage, compute, and networking.*

    Key Terms

    Prerequisites and Setup

    • An existing cloud account with cost monitoring enabled (AWS Cost Explorer, Azure Cost Management, or GCP Billing)
    • A Databricks workspace with system tables enabled for billing analysis
    • Historical spend data from your current data platform for comparison
    • Admin access to configure cluster policies and warehouse settings

    Step-by-Step Implementation

      Configuration Reference

      Reducing Data Infrastructure Costs with the Lakehouse configuration options
      Parameter                Description                      Recommended Value
      Auto-termination         Idle shutdown for clusters       10-15 minutes
      Spot instance ratio      Proportion of spot workers       80-100% for batch jobs
      Warehouse auto-stop      Idle timeout for SQL warehouses  5-10 minutes
      Delta retention          Time travel history period       7 days for most tables
      Photon acceleration      Vectorised engine for SQL        Enable on all SQL workloads
      Predictive optimisation  Automated table maintenance      Enable at workspace level
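Two of the settings above (auto-termination and spot usage) can be expressed as a cluster policy definition. The sketch below builds such a definition as a Python dictionary and checks a cluster spec against it locally. It assumes an AWS workspace: the `aws_attributes.*` paths and the `SPOT_WITH_FALLBACK` value are AWS-specific, and keeping the first node on-demand is an added assumption rather than something the table prescribes. The `lookup` and `violations` helpers are hypothetical conveniences for a local pre-check, not part of any Databricks API.

```python
import json

# Limits come from the configuration table; attribute paths assume AWS.
cost_policy = {
    "autotermination_minutes": {
        "type": "range", "minValue": 10, "maxValue": 15, "defaultValue": 10,
    },
    "aws_attributes.availability": {"type": "fixed", "value": "SPOT_WITH_FALLBACK"},
    # Assumption: keep one node (the driver) on-demand so spot loss does not kill the job.
    "aws_attributes.first_on_demand": {"type": "fixed", "value": 1},
}


def lookup(spec: dict, path: str):
    """Resolve a dotted policy path such as 'aws_attributes.availability'."""
    value = spec
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def violations(spec: dict, policy: dict) -> list[str]:
    """Local pre-check only; the workspace enforces the real policy."""
    problems = []
    for path, rule in policy.items():
        actual = lookup(spec, path)
        if rule["type"] == "fixed" and actual != rule["value"]:
            problems.append(f"{path}: expected {rule['value']!r}, got {actual!r}")
        elif rule["type"] == "range":
            if actual is None or not rule["minValue"] <= actual <= rule["maxValue"]:
                problems.append(
                    f"{path}: {actual!r} outside {rule['minValue']}-{rule['maxValue']}")
    return problems


# Hypothetical cluster spec that ignores the recommendations.
risky_spec = {
    "autotermination_minutes": 120,
    "aws_attributes": {"availability": "ON_DEMAND", "first_on_demand": 1},
}

print(json.dumps(cost_policy, indent=2))
for problem in violations(risky_spec, cost_policy):
    print("violation:", problem)
```

The JSON printed by `json.dumps` is the part that would be pasted into a policy definition; the pre-check is only useful for catching mistakes in job specs before they are submitted.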

      Monitoring, Cost, and Security Considerations

      Monitoring

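      Use system.billing.usage and system.billing.list_prices to build cost dashboards that refresh as new usage records arrive. The billing tables lag actual usage by some hours, so treat them as near-real-time rather than live. Track DBU consumption by team, workspace, and SKU. Alert on week-over-week spend increases above 20%.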
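The 20% week-over-week rule is easy to state precisely in code. The sketch below assumes weekly spend per team has already been aggregated (in practice from the billing system tables); the `week_over_week_alerts` function, the input shape, and the sample numbers are all hypothetical.

```python
def week_over_week_alerts(weekly_spend: dict[str, list[float]],
                          threshold: float = 0.20):
    """Return (team, week_index, increase) for each jump above the threshold."""
    alerts = []
    for team, spend in weekly_spend.items():
        for i in range(1, len(spend)):
            previous, current = spend[i - 1], spend[i]
            if previous <= 0:
                continue  # no baseline to compare against
            increase = (current - previous) / previous
            if increase > threshold:
                alerts.append((team, i, round(increase, 3)))
    return alerts


# Made-up weekly totals, oldest week first.
spend = {
    "analytics": [1000.0, 1100.0, 1400.0],
    "ml": [500.0, 480.0, 590.0],
}

for team, week, increase in week_over_week_alerts(spend):
    print(f"{team}: week {week} up {increase:.1%} on the previous week")
```

Comparing each week with the one before it, rather than with a fixed baseline, keeps the alert sensitive to sudden jumps while tolerating slow organic growth.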

      Cost Optimisation

      Consolidate under-utilised clusters into shared pools. Eliminate zombie clusters (running but unused). Use single-node clusters for small development workloads. Archive cold data to cheaper storage tiers with lifecycle policies.
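Finding zombie clusters amounts to filtering an inventory for clusters that are running but have seen no activity for some time. The sketch below works on a hypothetical list of dictionaries. The field names `name`, `state`, and `last_activity` and the two-hour limit are assumptions for illustration, and a real inventory would come from your workspace's cluster listing.

```python
from datetime import datetime, timedelta


def find_zombie_clusters(clusters, now, idle_limit=timedelta(hours=2)):
    """Return (name, idle_time) for running clusters idle longer than idle_limit."""
    zombies = []
    for cluster in clusters:
        idle_for = now - cluster["last_activity"]
        if cluster["state"] == "RUNNING" and idle_for > idle_limit:
            zombies.append((cluster["name"], idle_for))
    # Longest-idle clusters first, since they waste the most.
    return sorted(zombies, key=lambda item: item[1], reverse=True)


# Hypothetical inventory snapshot.
now = datetime(2024, 1, 15, 12, 0)
inventory = [
    {"name": "etl-nightly", "state": "RUNNING",
     "last_activity": datetime(2024, 1, 15, 11, 50)},
    {"name": "adhoc-sandbox", "state": "RUNNING",
     "last_activity": datetime(2024, 1, 14, 18, 0)},
    {"name": "old-poc", "state": "TERMINATED",
     "last_activity": datetime(2024, 1, 1, 9, 0)},
]

for name, idle_for in find_zombie_clusters(inventory, now):
    print(f"{name}: idle for {idle_for}")
```

Only `adhoc-sandbox` is flagged here: `etl-nightly` was active ten minutes ago, and `old-poc` is already terminated and so incurs no compute cost.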

      Security and Governance

      Cluster policies prevent users from provisioning expensive instances without approval. Budget alerts ensure cost overruns are caught early. Service principals with scoped permissions prevent accidental resource creation.

      Common Pitfalls and Recommended Patterns

      • Leaving auto-termination at the 120-minute default: 10-15 minutes is appropriate for most interactive workloads
      • Using on-demand instances for ETL jobs that tolerate retries: spot instances are typically discounted 60-90% against on-demand prices
      • Never running VACUUM on Delta tables: data files from old table versions accumulate, inflating storage costs
      • Keeping time travel retention at 30 days when 7 days suffices: every extra day of retention keeps that day's superseded data files in storage
      • Provisioning large warehouses for small teams: start with Small and scale up only if queue times are unacceptable
      • Ignoring inter-region data transfer costs: keep compute and storage in the same region
      • Not reviewing system billing tables monthly: cost creep is invisible without regular analysis
      • Running development workloads on production-sized clusters: use cluster policies to enforce appropriate sizing
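The retention pitfall above can be made concrete with a back-of-the-envelope estimate. The sketch assumes a steady state in which VACUUM runs regularly, so the superseded files held at any time are roughly the daily rewritten volume multiplied by the retention window. The function name, the 200 GB/day churn, and the 0.023 per GB-month storage price are illustrative assumptions, not measured values.

```python
def retained_storage_cost(daily_rewritten_gb: float,
                          retention_days: int,
                          price_per_gb_month: float) -> float:
    """Approximate monthly cost of superseded files kept for time travel."""
    retained_gb = daily_rewritten_gb * retention_days
    return retained_gb * price_per_gb_month


# Hypothetical table: 200 GB of data files rewritten per day.
DAILY_CHURN_GB = 200
PRICE = 0.023  # assumed object storage price per GB-month

cost_30 = retained_storage_cost(DAILY_CHURN_GB, 30, PRICE)
cost_7 = retained_storage_cost(DAILY_CHURN_GB, 7, PRICE)

print(f"30-day retention: {cost_30:.2f} per month")
print(f" 7-day retention: {cost_7:.2f} per month")
print(f"Reduction:        {1 - cost_7 / cost_30:.0%}")
```

Under this linear model the cost of retained history scales directly with the retention window, so the relative reduction depends only on the two window lengths and not on the assumed price.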

      Frequently Asked Questions

      How much can we realistically save by switching to a lakehouse?

      Most organisations report 30-60% TCO reduction when consolidating from separate lake and warehouse solutions. Savings come from eliminating data duplication, reducing ETL complexity, and using spot compute.

      Does cheaper mean slower?

      Not necessarily. The Photon engine and Delta Lake optimisations are designed to deliver warehouse-class performance on object storage, and many organisations see performance improve alongside cost reduction.

      Can we set per-team budgets?

      Yes. Use Databricks account-level budgets with filters by workspace, SKU, or tag. Combine with cluster policies to enforce per-team compute limits.
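A per-team budget check can also be reproduced offline from tagged usage records. The sketch below is a minimal illustration with hypothetical row fields (`custom_tags`, `dbus`, `price_per_dbu`), a hypothetical `team` tag key, and invented budgets; it is not the Databricks budgets feature itself.

```python
from collections import defaultdict


def spend_by_team(usage_rows, tag_key="team"):
    """Sum cost per team tag; rows without the tag fall under 'untagged'."""
    totals = defaultdict(float)
    for row in usage_rows:
        team = row.get("custom_tags", {}).get(tag_key, "untagged")
        totals[team] += row["dbus"] * row["price_per_dbu"]
    return dict(totals)


def over_budget(totals, budgets):
    """Return the overspend amount for each team that exceeded its budget."""
    return {team: spent - budgets[team]
            for team, spent in totals.items()
            if team in budgets and spent > budgets[team]}


# Made-up usage records and budgets.
rows = [
    {"custom_tags": {"team": "analytics"}, "dbus": 400, "price_per_dbu": 0.5},
    {"custom_tags": {"team": "analytics"}, "dbus": 300, "price_per_dbu": 0.5},
    {"custom_tags": {"team": "ml"}, "dbus": 200, "price_per_dbu": 0.5},
    {"custom_tags": {}, "dbus": 50, "price_per_dbu": 0.5},
]
budgets = {"analytics": 300.0, "ml": 250.0}

totals = spend_by_team(rows)
print(totals)
print(over_budget(totals, budgets))
```

Surfacing the `untagged` bucket separately is deliberate: untagged spend cannot be attributed to any team budget, so it is worth tracking as a signal of missing tag enforcement.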

      What about egress costs?

      Keep compute and storage in the same cloud region to avoid cross-region egress. Use private endpoints to avoid public internet egress charges.

      How do we handle bursty workloads cost-effectively?

      Serverless SQL warehouses scale to zero when idle and spin up in seconds for burst traffic. For batch workloads, auto-scaling clusters with spot instances handle variable load at minimal cost.
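The benefit of auto-scaling for bursty load can be estimated by comparing node-hours. The sketch below contrasts a cluster sized for peak load with one that scales each hour to the load it sees. The hourly load profile, the per-node capacity, and both function names are hypothetical, and the model ignores scale-up latency and spot interruptions.

```python
import math


def node_hours_fixed(hourly_load, capacity_per_node):
    """Node-hours when the cluster is sized for the peak hour throughout."""
    peak_nodes = math.ceil(max(hourly_load) / capacity_per_node)
    return peak_nodes * len(hourly_load)


def node_hours_autoscaled(hourly_load, capacity_per_node, min_nodes=1):
    """Node-hours when the cluster resizes each hour to fit that hour's load."""
    return sum(max(min_nodes, math.ceil(load / capacity_per_node))
               for load in hourly_load)


# Made-up workload: units of work arriving in each of six hours.
load = [10, 10, 80, 200, 40, 10]
CAPACITY = 20  # assumed units of work one node handles per hour

fixed = node_hours_fixed(load, CAPACITY)
scaled = node_hours_autoscaled(load, CAPACITY)

print(f"Fixed (peak-sized): {fixed} node-hours")
print(f"Auto-scaled:        {scaled} node-hours")
print(f"Reduction:          {1 - scaled / fixed:.0%}")
```

The spikier the load profile, the larger the gap between the two figures; for a flat profile the two functions return the same value and auto-scaling brings no saving.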