What is a Data Lakehouse?
A data lakehouse is a single architecture that puts data-warehouse capabilities, ACID transactions, schema enforcement, and fast SQL, directly on top of low-cost data-lake storage using open table formats like Delta Lake. It removes the traditional split between a lake (cheap, flexible, but unreliable) and a warehouse (reliable, but rigid and siloed). After reading, you will know what defines a lakehouse, how it differs from the two-tier past, and how to build one on Databricks.
- Define the lakehouse and the problems it solves versus separate lake and warehouse systems
- Understand how open table formats add ACID transactions and governance to object storage
- Build a minimal lakehouse with Bronze, Silver, and Gold Delta tables
Who this is for: Data engineers and architects deciding how to structure analytics and AI storage.
Part of the What is Databricks section in the Databricks tutorial series.
Architecture / Concept Overview: What is a Data Lakehouse?
A lakehouse unifies the two-tier "data lake plus data warehouse" pattern into one layer. Instead of landing raw data in a lake and then copying a curated subset into a separate warehouse, you keep one governed copy in open format and serve every workload, BI, data engineering, streaming, and AI, from it. The key enabler is a transactional table format (Delta Lake) that brings reliability and performance features to plain object storage.
*One governed copy of data on object storage, made transactional by Delta Lake, serving BI, engineering, and AI workloads simultaneously.*
The contrast with the legacy two-tier stack is what makes the lakehouse compelling.
*The legacy stack duplicates data from lake to warehouse before serving BI; the lakehouse serves all workloads from one copy.*
Key Terms
- Data lake
- Low-cost storage for raw data in any format, flexible but historically lacking transactions, schema enforcement, and fast query performance.
- Data warehouse
- A structured, high-performance system for SQL analytics that traditionally requires loading and duplicating data into a proprietary store.
- Lakehouse
- An architecture that delivers warehouse reliability and performance directly on lake storage via open transactional table formats.
- Delta Lake
- The open table format that adds ACID transactions, schema enforcement, time travel, and indexing to Parquet files on object storage.
- ACID transactions
- Guarantees (atomicity, consistency, isolation, durability) that keep concurrent reads and writes correct, which lakes traditionally lacked.
- Schema enforcement and evolution
- Rules that reject malformed writes while allowing controlled, intentional changes to a table's structure over time.
Prerequisites and Setup
- A Databricks workspace with access to cloud object storage
- Unity Catalog enabled with a catalog and schema you can write to
- Familiarity with SQL and basic Python/PySpark
- A landing location (volume or path) holding sample raw files
Step-by-Step Implementation
Land raw data in Bronze
Ingest source files as-is into a Bronze Delta table to preserve a faithful, replayable copy of the input.
# Python cell - load raw files into a Bronze Delta table\nbronze = spark.read.format("json").load("/Volumes/shop/landing/events/")\nbronze.write.mode("append").saveAsTable("shop.web.events_bronze")Enforce quality into Silver
Clean, deduplicate, and type-cast data into a Silver table, applying schema enforcement so bad records are caught early.
-- SQL cell - conform Bronze into a typed Silver table\nCREATE OR REPLACE TABLE shop.web.events_silver AS\nSELECT\nCAST(event_id AS STRING) AS event_id,\nCAST(ts AS TIMESTAMP) AS event_time,\nlower(event_type) AS event_type\nFROM shop.web.events_bronze\nWHERE event_id IS NOT NULL;Aggregate business metrics into Gold
Produce curated, query-optimized Gold tables that BI tools and analysts consume directly.
-- SQL cell - build a Gold aggregate\nCREATE OR REPLACE TABLE shop.web.daily_events_gold AS\nSELECT date(event_time) AS day, event_type, COUNT(*) AS events\nFROM shop.web.events_silver\nGROUP BY date(event_time), event_type;Use time travel for reliability
Query a previous version of a table to audit changes or recover from a bad write, a capability unique to transactional table formats.
-- SQL cell - read a previous table version (time travel)\nSELECT * FROM shop.web.events_silver VERSION AS OF 3;Optimize for fast reads
Compact small files and cluster data so SQL scans stay fast as the table grows.
-- SQL cell - compact and cluster the Gold table\nOPTIMIZE shop.web.daily_events_gold;
Configuration Reference
| Parameter / Option | Type | Default | Description |
|---|---|---|---|
| Table format | string | delta | Open transactional format underpinning the lakehouse |
delta.appendOnly | boolean | false | Restricts a table to append-only writes for immutability |
delta.deletedFileRetentionDuration | interval | 7 days | How long old files are retained for time travel before VACUUM can remove them |
| Liquid clustering keys | list | none | Columns used to physically cluster data for faster filtered reads |
| Schema evolution | boolean | false | Whether writes may add new columns automatically |
Monitoring, Cost, and Security Considerations
Monitoring
Track table health (file counts, sizes, and operation history) through Delta's transaction log and system tables. Watching the operation history helps you spot small-file proliferation before it degrades query performance, which can otherwise slow scans by several multiples.
Cost Optimisation
Storing one copy instead of duplicating into a warehouse cuts both storage spend and ETL compute. Run OPTIMIZE and use clustering so queries read fewer bytes, and schedule VACUUM to remove obsolete files you no longer need for time travel.
Security and Governance
Govern lakehouse tables with Unity Catalog so a single permission model covers SQL, Python, and AI access. Apply row filters and column masks at the table level so every engine that reads the data inherits the same controls.
Common Pitfalls and Recommended Patterns
- Reintroducing a separate warehouse: serve BI directly from Gold tables instead of copying out.
- Neglecting file compaction: schedule
OPTIMIZEto prevent the small-file problem that slows scans. - Disabling schema enforcement everywhere: keep it on so malformed data fails fast.
- Never running VACUUM: unbounded retention inflates storage; balance retention with recovery needs.
- Skipping a layered model: adopt Bronze/Silver/Gold so consumers always read curated data.
Frequently Asked Questions
How is a lakehouse different from a data lake?
A data lake is just storage; a lakehouse adds a transactional table layer (Delta Lake) on top, giving you ACID guarantees, schema enforcement, fast SQL, and governance that a bare lake cannot provide.
Do I still need a separate data warehouse?
Usually not. A lakehouse serves BI and SQL analytics directly from curated Gold tables, so you avoid maintaining and syncing a second proprietary system.
What makes Delta Lake transactional?
Delta keeps an ordered transaction log alongside the Parquet data files. The log records every commit atomically, enabling isolation between concurrent readers and writers and features like time travel.
Can I use engines other than Spark on a lakehouse?
Yes. Delta Lake is open, and the lakehouse supports SQL Warehouses and external engines that can read the open format, so you are not locked into a single tool.