What is a Data Lakehouse?

A data lakehouse is a single architecture that puts data-warehouse capabilities, ACID transactions, schema enforcement, and fast SQL, directly on top of low-cost data-lake storage using open table formats like Delta Lake. It removes the traditional split between a lake (cheap, flexible, but unreliable) and a warehouse (reliable, but rigid and siloed). After reading, you will know what defines a lakehouse, how it differs from the two-tier past, and how to build one on Databricks.

Define the lakehouse and the problems it solves versus separate lake and warehouse systems
Understand how open table formats add ACID transactions and governance to object storage
Build a minimal lakehouse with Bronze, Silver, and Gold Delta tables

Who this is for: Data engineers and architects deciding how to structure analytics and AI storage.

Part of the What is Databricks section in the Databricks tutorial series.

Architecture / Concept Overview: What is a Data Lakehouse?

A lakehouse unifies the two-tier "data lake plus data warehouse" pattern into one layer. Instead of landing raw data in a lake and then copying a curated subset into a separate warehouse, you keep one governed copy in open format and serve every workload, BI, data engineering, streaming, and AI, from it. The key enabler is a transactional table format (Delta Lake) that brings reliability and performance features to plain object storage.

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%% graph TD classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED Src[All Data Types]:::source --> Lake[(Object Storage)]:::storage Lake --> Delta[Delta Lake Table Layer]:::processing Gov[Unity Catalog Governance]:::governance -.governs.-> Delta Delta --> BI[BI and SQL]:::serving Delta --> DE[Data Engineering]:::serving Delta --> AI[Data Science and AI]:::serving

*One governed copy of data on object storage, made transactional by Delta Lake, serving BI, engineering, and AI workloads simultaneously.*

The contrast with the legacy two-tier stack is what makes the lakehouse compelling.

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%% flowchart LR classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED classDef neutral fill:#2A2F3A,stroke:#7A828F,stroke-width:2px,rx:8,ry:8,color:#E0E6ED subgraph TwoTier [Legacy Two-Tier] L[(Data Lake)]:::storage --> ETL[Copy and ETL]:::ingestion --> W[(Warehouse)]:::neutral --> R1[BI Only]:::serving end subgraph House [Lakehouse] LH[(Delta Lakehouse)]:::storage --> R2[BI, AI, Streaming]:::serving end

*The legacy stack duplicates data from lake to warehouse before serving BI; the lakehouse serves all workloads from one copy.*

Key Terms

Data lake: Low-cost storage for raw data in any format, flexible but historically lacking transactions, schema enforcement, and fast query performance.
Data warehouse: A structured, high-performance system for SQL analytics that traditionally requires loading and duplicating data into a proprietary store.
Lakehouse: An architecture that delivers warehouse reliability and performance directly on lake storage via open transactional table formats.
Delta Lake: The open table format that adds ACID transactions, schema enforcement, time travel, and indexing to Parquet files on object storage.
ACID transactions: Guarantees (atomicity, consistency, isolation, durability) that keep concurrent reads and writes correct, which lakes traditionally lacked.
Schema enforcement and evolution: Rules that reject malformed writes while allowing controlled, intentional changes to a table's structure over time.

Prerequisites and Setup

A Databricks workspace with access to cloud object storage
Unity Catalog enabled with a catalog and schema you can write to
Familiarity with SQL and basic Python/PySpark
A landing location (volume or path) holding sample raw files

Step-by-Step Implementation

Land raw data in Bronze

Ingest source files as-is into a Bronze Delta table to preserve a faithful, replayable copy of the input.

# Python cell - load raw files into a Bronze Delta table\nbronze = spark.read.format("json").load("/Volumes/shop/landing/events/")\nbronze.write.mode("append").saveAsTable("shop.web.events_bronze")

Enforce quality into Silver

Clean, deduplicate, and type-cast data into a Silver table, applying schema enforcement so bad records are caught early.

-- SQL cell - conform Bronze into a typed Silver table\nCREATE OR REPLACE TABLE shop.web.events_silver AS\nSELECT\nCAST(event_id AS STRING)      AS event_id,\nCAST(ts AS TIMESTAMP)         AS event_time,\nlower(event_type)             AS event_type\nFROM shop.web.events_bronze\nWHERE event_id IS NOT NULL;

Aggregate business metrics into Gold

Produce curated, query-optimized Gold tables that BI tools and analysts consume directly.

-- SQL cell - build a Gold aggregate\nCREATE OR REPLACE TABLE shop.web.daily_events_gold AS\nSELECT date(event_time) AS day, event_type, COUNT(*) AS events\nFROM shop.web.events_silver\nGROUP BY date(event_time), event_type;

Use time travel for reliability
Query a previous version of a table to audit changes or recover from a bad write, a capability unique to transactional table formats.
```
-- SQL cell - read a previous table version (time travel)\nSELECT * FROM shop.web.events_silver VERSION AS OF 3;
```
Optimize for fast reads
Compact small files and cluster data so SQL scans stay fast as the table grows.
```
-- SQL cell - compact and cluster the Gold table\nOPTIMIZE shop.web.daily_events_gold;
```

Configuration Reference

What is a Data Lakehouse? configuration options
Parameter / Option	Type	Default	Description
Table format	string	`delta`	Open transactional format underpinning the lakehouse
`delta.appendOnly`	boolean	false	Restricts a table to append-only writes for immutability
`delta.deletedFileRetentionDuration`	interval	7 days	How long old files are retained for time travel before VACUUM can remove them
Liquid clustering keys	list	none	Columns used to physically cluster data for faster filtered reads
Schema evolution	boolean	false	Whether writes may add new columns automatically

Monitoring, Cost, and Security Considerations

Monitoring

Track table health (file counts, sizes, and operation history) through Delta's transaction log and system tables. Watching the operation history helps you spot small-file proliferation before it degrades query performance, which can otherwise slow scans by several multiples.

Cost Optimisation

Storing one copy instead of duplicating into a warehouse cuts both storage spend and ETL compute. Run OPTIMIZE and use clustering so queries read fewer bytes, and schedule VACUUM to remove obsolete files you no longer need for time travel.

Security and Governance

Govern lakehouse tables with Unity Catalog so a single permission model covers SQL, Python, and AI access. Apply row filters and column masks at the table level so every engine that reads the data inherits the same controls.

Common Pitfalls and Recommended Patterns

Reintroducing a separate warehouse: serve BI directly from Gold tables instead of copying out.
Neglecting file compaction: schedule OPTIMIZE to prevent the small-file problem that slows scans.
Disabling schema enforcement everywhere: keep it on so malformed data fails fast.
Never running VACUUM: unbounded retention inflates storage; balance retention with recovery needs.
Skipping a layered model: adopt Bronze/Silver/Gold so consumers always read curated data.

Frequently Asked Questions

How is a lakehouse different from a data lake?

A data lake is just storage; a lakehouse adds a transactional table layer (Delta Lake) on top, giving you ACID guarantees, schema enforcement, fast SQL, and governance that a bare lake cannot provide.

Do I still need a separate data warehouse?

Usually not. A lakehouse serves BI and SQL analytics directly from curated Gold tables, so you avoid maintaining and syncing a second proprietary system.

What makes Delta Lake transactional?

Delta keeps an ordered transaction log alongside the Parquet data files. The log records every commit atomically, enabling isolation between concurrent readers and writers and features like time travel.

Can I use engines other than Spark on a lakehouse?

Yes. Delta Lake is open, and the lakehouse supports SQL Warehouses and external engines that can read the open format, so you are not locked into a single tool.