What is a Data Lakehouse?

A data lakehouse is a single architecture that puts data-warehouse capabilities, ACID transactions, schema enforcement, and fast SQL, directly on top of low-cost data-lake storage using open table formats like Delta Lake. It removes the traditional split between a lake (cheap, flexible, but unreliable) and a warehouse (reliable, but rigid and siloed). After reading, you will know what defines a lakehouse, how it differs from the two-tier past, and how to build one on Databricks.

  • Define the lakehouse and the problems it solves versus separate lake and warehouse systems
  • Understand how open table formats add ACID transactions and governance to object storage
  • Build a minimal lakehouse with Bronze, Silver, and Gold Delta tables

Who this is for: Data engineers and architects deciding how to structure analytics and AI storage.

Part of the What is Databricks section in the Databricks tutorial series.

Architecture / Concept Overview: What is a Data Lakehouse?

A lakehouse unifies the two-tier "data lake plus data warehouse" pattern into one layer. Instead of landing raw data in a lake and then copying a curated subset into a separate warehouse, you keep one governed copy in open format and serve every workload, BI, data engineering, streaming, and AI, from it. The key enabler is a transactional table format (Delta Lake) that brings reliability and performance features to plain object storage.

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%% graph TD classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED Src[All Data Types]:::source --> Lake[(Object Storage)]:::storage Lake --> Delta[Delta Lake Table Layer]:::processing Gov[Unity Catalog Governance]:::governance -.governs.-> Delta Delta --> BI[BI and SQL]:::serving Delta --> DE[Data Engineering]:::serving Delta --> AI[Data Science and AI]:::serving

*One governed copy of data on object storage, made transactional by Delta Lake, serving BI, engineering, and AI workloads simultaneously.*

The contrast with the legacy two-tier stack is what makes the lakehouse compelling.

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%% flowchart LR classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef neutral fill:#2A2F3A,stroke:#7A828F,stroke-width:2px,rx:8,ry:8,color:#E0E6ED subgraph TwoTier [Legacy Two-Tier] L[(Data Lake)]:::storage --> ETL[Copy and ETL]:::ingestion --> W[(Warehouse)]:::neutral --> R1[BI Only]:::serving end subgraph House [Lakehouse] LH[(Delta Lakehouse)]:::storage --> R2[BI, AI, Streaming]:::serving end

*The legacy stack duplicates data from lake to warehouse before serving BI; the lakehouse serves all workloads from one copy.*

Key Terms

Data lake
Low-cost storage for raw data in any format, flexible but historically lacking transactions, schema enforcement, and fast query performance.
Data warehouse
A structured, high-performance system for SQL analytics that traditionally requires loading and duplicating data into a proprietary store.
Lakehouse
An architecture that delivers warehouse reliability and performance directly on lake storage via open transactional table formats.
Delta Lake
The open table format that adds ACID transactions, schema enforcement, time travel, and indexing to Parquet files on object storage.
ACID transactions
Guarantees (atomicity, consistency, isolation, durability) that keep concurrent reads and writes correct, which lakes traditionally lacked.
Schema enforcement and evolution
Rules that reject malformed writes while allowing controlled, intentional changes to a table's structure over time.

Prerequisites and Setup

  • A Databricks workspace with access to cloud object storage
  • Unity Catalog enabled with a catalog and schema you can write to
  • Familiarity with SQL and basic Python/PySpark
  • A landing location (volume or path) holding sample raw files

Step-by-Step Implementation

  1. Land raw data in Bronze

    Ingest source files as-is into a Bronze Delta table to preserve a faithful, replayable copy of the input.

    # Python cell - load raw files into a Bronze Delta table\nbronze = spark.read.format("json").load("/Volumes/shop/landing/events/")\nbronze.write.mode("append").saveAsTable("shop.web.events_bronze")
  2. Enforce quality into Silver

    Clean, deduplicate, and type-cast data into a Silver table, applying schema enforcement so bad records are caught early.

    -- SQL cell - conform Bronze into a typed Silver table\nCREATE OR REPLACE TABLE shop.web.events_silver AS\nSELECT\nCAST(event_id AS STRING)      AS event_id,\nCAST(ts AS TIMESTAMP)         AS event_time,\nlower(event_type)             AS event_type\nFROM shop.web.events_bronze\nWHERE event_id IS NOT NULL;
  3. Aggregate business metrics into Gold

    Produce curated, query-optimized Gold tables that BI tools and analysts consume directly.

    -- SQL cell - build a Gold aggregate\nCREATE OR REPLACE TABLE shop.web.daily_events_gold AS\nSELECT date(event_time) AS day, event_type, COUNT(*) AS events\nFROM shop.web.events_silver\nGROUP BY date(event_time), event_type;
  4. Use time travel for reliability

    Query a previous version of a table to audit changes or recover from a bad write, a capability unique to transactional table formats.

    -- SQL cell - read a previous table version (time travel)\nSELECT * FROM shop.web.events_silver VERSION AS OF 3;
  5. Optimize for fast reads

    Compact small files and cluster data so SQL scans stay fast as the table grows.

    -- SQL cell - compact and cluster the Gold table\nOPTIMIZE shop.web.daily_events_gold;

Configuration Reference

What is a Data Lakehouse? configuration options
Parameter / OptionTypeDefaultDescription
Table formatstringdeltaOpen transactional format underpinning the lakehouse
delta.appendOnlybooleanfalseRestricts a table to append-only writes for immutability
delta.deletedFileRetentionDurationinterval7 daysHow long old files are retained for time travel before VACUUM can remove them
Liquid clustering keyslistnoneColumns used to physically cluster data for faster filtered reads
Schema evolutionbooleanfalseWhether writes may add new columns automatically

Monitoring, Cost, and Security Considerations

Monitoring

Track table health (file counts, sizes, and operation history) through Delta's transaction log and system tables. Watching the operation history helps you spot small-file proliferation before it degrades query performance, which can otherwise slow scans by several multiples.

Cost Optimisation

Storing one copy instead of duplicating into a warehouse cuts both storage spend and ETL compute. Run OPTIMIZE and use clustering so queries read fewer bytes, and schedule VACUUM to remove obsolete files you no longer need for time travel.

Security and Governance

Govern lakehouse tables with Unity Catalog so a single permission model covers SQL, Python, and AI access. Apply row filters and column masks at the table level so every engine that reads the data inherits the same controls.

Common Pitfalls and Recommended Patterns

  • Reintroducing a separate warehouse: serve BI directly from Gold tables instead of copying out.
  • Neglecting file compaction: schedule OPTIMIZE to prevent the small-file problem that slows scans.
  • Disabling schema enforcement everywhere: keep it on so malformed data fails fast.
  • Never running VACUUM: unbounded retention inflates storage; balance retention with recovery needs.
  • Skipping a layered model: adopt Bronze/Silver/Gold so consumers always read curated data.

Frequently Asked Questions

How is a lakehouse different from a data lake?

A data lake is just storage; a lakehouse adds a transactional table layer (Delta Lake) on top, giving you ACID guarantees, schema enforcement, fast SQL, and governance that a bare lake cannot provide.

Do I still need a separate data warehouse?

Usually not. A lakehouse serves BI and SQL analytics directly from curated Gold tables, so you avoid maintaining and syncing a second proprietary system.

What makes Delta Lake transactional?

Delta keeps an ordered transaction log alongside the Parquet data files. The log records every commit atomically, enabling isolation between concurrent readers and writers and features like time travel.

Can I use engines other than Spark on a lakehouse?

Yes. Delta Lake is open, and the lakehouse supports SQL Warehouses and external engines that can read the open format, so you are not locked into a single tool.