Core Components of the Databricks Platform
The Databricks platform is made of a few interlocking components: the workspace, compute (clusters and SQL Warehouses), Delta Lake storage, Unity Catalog governance, Lakeflow for orchestration and ingestion, and Mosaic AI for machine learning. Understanding how these pieces connect lets you design end-to-end pipelines without guessing where each responsibility lives. After reading, you will be able to name each core component, explain its role, and wire them together for a working data flow.
- Identify each core component and the responsibility it owns
- Understand how compute, storage, and governance interact at run time
- Assemble a minimal pipeline that touches every major component
Who this is for: Data engineers and architects building a mental map of the platform's building blocks.
Part of the What is Databricks section in the Databricks tutorial series.
Architecture / Concept Overview: Core Components of the Databricks Platform
The components layer cleanly: the workspace is your entry point; compute executes work; Delta Lake holds data; Unity Catalog governs every asset; Lakeflow orchestrates ingestion and jobs; and Mosaic AI builds and serves models. Each component does one job well and connects through open interfaces, so you can adopt them incrementally.
*The workspace and Lakeflow drive compute, which reads and writes Delta Lake; Unity Catalog governs it all, and outputs feed Mosaic AI and BI.*
At run time, a job flows through these components in sequence, from orchestration to governed output.
*A scheduled job triggers compute, which authorizes against Unity Catalog before reading and writing governed Delta tables.*
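The ordering in that flow (authorize first, touch data second) can be made concrete with a toy model. The sketch below is plain Python and does not call any Databricks API: `ToyCatalog`, `run_job`, and the `etl-job` principal are hypothetical names invented for illustration, and the real permission check happens inside Unity Catalog, not in user code.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCatalog:
    """Stand-in for a governance layer: maps (principal, table) to privileges."""
    grants: dict = field(default_factory=dict)

    def grant(self, principal: str, table: str, privilege: str) -> None:
        self.grants.setdefault((principal, table), set()).add(privilege)

    def is_allowed(self, principal: str, table: str, privilege: str) -> bool:
        return privilege in self.grants.get((principal, table), set())


def run_job(catalog: ToyCatalog, principal: str, source: str, target: str) -> list:
    """Trace one job run: authorize every access first, then read and write."""
    steps = ["job triggered", "compute started"]
    for table, privilege in ((source, "SELECT"), (target, "MODIFY")):
        if not catalog.is_allowed(principal, table, privilege):
            steps.append(f"denied {privilege} on {table}")
            return steps  # stop before any data is touched
        steps.append(f"authorized {privilege} on {table}")
    steps += [f"read {source}", f"wrote {target}"]
    return steps


catalog = ToyCatalog()
catalog.grant("etl-job", "ops.telemetry.events_silver", "SELECT")
catalog.grant("etl-job", "ops.telemetry.events_gold", "MODIFY")

for step in run_job(catalog, "etl-job",
                    "ops.telemetry.events_silver",
                    "ops.telemetry.events_gold"):
    print(step)
```

A principal without grants stops at the first denied check, so the trace never reaches a read or write step.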
Key Terms
- Workspace: The collaborative environment (notebooks, queries, dashboards, repos, and settings) where teams build and manage assets.
- Compute: Execution resources: all-purpose and job clusters for engineering and ML, and SQL Warehouses for BI/SQL workloads.
- Delta Lake: The default storage layer providing ACID tables, time travel, and performance optimizations on object storage.
- Unity Catalog: The unified governance component managing permissions, lineage, discovery, and auditing across all data and AI assets.
- Lakeflow: Databricks' framework for ingestion, declarative pipelines, and job orchestration across the lakehouse.
- Mosaic AI: The component suite for building, fine-tuning, serving, and governing machine learning and generative AI models.
Prerequisites and Setup
- A Databricks workspace with Unity Catalog enabled
- Permission to create compute (a cluster and a SQL Warehouse)
- A catalog and schema to write into
- Basic familiarity with notebooks and SQL
Step-by-Step Implementation
Create compute
Provision an all-purpose cluster for development; it is the component that executes notebook and job code.
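Because the cluster definition handed to the CLI in this step is plain JSON, one option is to build and sanity-check it in Python first. The following is a minimal sketch under stated assumptions: `build_cluster_spec` and its checks are hypothetical helpers, not part of any Databricks library, and the field names are simply the ones used in this tutorial's CLI example.

```python
import json

# Fields this sketch insists on; this is the sketch's own rule, not an API contract.
REQUIRED_FIELDS = ("cluster_name", "spark_version", "node_type_id")


def build_cluster_spec(name, spark_version, node_type_id,
                       num_workers=1, autotermination_minutes=30):
    """Return a dict with the same fields as the CLI example in this step."""
    spec = {
        "cluster_name": name,
        "num_workers": num_workers,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "autotermination_minutes": autotermination_minutes,
    }
    missing = [key for key in REQUIRED_FIELDS if not spec[key]]
    if missing:
        raise ValueError(f"missing values for: {', '.join(missing)}")
    if num_workers < 0 or autotermination_minutes < 0:
        raise ValueError("num_workers and autotermination_minutes must be >= 0")
    return spec


# "managed-lts" and "standard" are the placeholder values used in this
# tutorial; substitute identifiers that are valid in your workspace.
spec = build_cluster_spec("dev", "managed-lts", "standard")
print(json.dumps(spec, indent=2))
```

The printed JSON can then be passed to the CLI's `--json` argument, which keeps the definition reviewable in version control.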
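```bash
# bash cell - create a small all-purpose cluster
# "managed-lts" and "standard" are placeholders: replace them with a runtime
# version and node type that are available in your workspace.
databricks clusters create --json '{
  "cluster_name": "dev",
  "num_workers": 1,
  "spark_version": "managed-lts",
  "node_type_id": "standard",
  "autotermination_minutes": 30
}'
```
Define governed storage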
Create a catalog and schema in Unity Catalog so every table you create is governed from the start.
```sql
-- SQL cell - governance namespace
CREATE CATALOG IF NOT EXISTS ops;
CREATE SCHEMA IF NOT EXISTS ops.telemetry;
```
Ingest with Lakeflow
Use a declarative pipeline (Lakeflow) so ingestion logic is managed, retried, and observable rather than hand-scheduled.
```python
# Python cell - a declarative Lakeflow streaming table
import dlt

# `spark` is provided by the pipeline runtime.
@dlt.table(name="events_bronze")
def events_bronze():
    # Auto Loader ("cloudFiles") incrementally picks up new JSON files.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/ops/landing/events/")
    )
```
Transform on compute into Delta
Refine data and write a governed Delta table that downstream consumers can trust.
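To see what the Gold query in this step computes, the same grouping can be reproduced on a few in-memory rows. This is a plain-Python illustration only: the sample rows are made up, `count_events_by_device` is a hypothetical helper, and on the platform the aggregation runs as SQL on compute rather than in driver-side Python.

```python
from collections import Counter

# Made-up sample rows standing in for ops.telemetry.events_silver.
events_silver = [
    {"device_id": "a1", "ts": 1},
    {"device_id": "b2", "ts": 2},
    {"device_id": "a1", "ts": 3},
]


def count_events_by_device(rows):
    """Mirror SELECT device_id, COUNT(*) AS event_count ... GROUP BY device_id."""
    counts = Counter(row["device_id"] for row in rows)
    # Sort only to make the output deterministic; SQL GROUP BY gives no order.
    return [
        {"device_id": device, "event_count": n}
        for device, n in sorted(counts.items(), key=lambda item: str(item[0]))
    ]


events_gold = count_events_by_device(events_silver)
print(events_gold)
```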
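```sql
-- SQL cell - curated table written to Delta
-- Assumes ops.telemetry.events_silver already exists (for example, a cleaned
-- table derived from events_bronze); this walkthrough does not create it.
CREATE OR REPLACE TABLE ops.telemetry.events_gold AS
SELECT device_id, COUNT(*) AS event_count
FROM ops.telemetry.events_silver
GROUP BY device_id;
```
Serve to AI or BI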
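Point a SQL Warehouse at the Gold table for BI, or track work against it with MLflow as a first step toward registering a model with Mosaic AI. The example here logs only a run metric (the Gold table's row count), not a model.
```python
# Python cell - record the Gold table's row count in an MLflow run
import mlflow

with mlflow.start_run():
    row_count = spark.table("ops.telemetry.events_gold").count()
    mlflow.log_metric("rows", row_count)
```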
Configuration Reference
| Parameter / Option | Type | Default | Description |
|---|---|---|---|
| Cluster mode | enum (all-purpose / job) | all-purpose | Interactive development vs scheduled, single-run job compute |
| Autoscaling workers | min/max integers | fixed | Range of workers a cluster scales between under load |
| SQL Warehouse size | enum | Small | Compute power for BI/SQL queries |
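| Catalog | string | workspace-dependent (hive_metastore on legacy setups) | Top level of the three-level namespace (catalog.schema.table); hive_metastore is the legacy metastore, not a Unity Catalog catalog |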
| Pipeline mode | enum (triggered / continuous) | triggered | Whether a Lakeflow pipeline runs on demand or continuously |
Monitoring, Cost, and Security Considerations
Monitoring
Each component exposes its own observability data: cluster event logs, SQL query history, pipeline event logs, and MLflow run tracking. Where system tables are enabled, they add a queryable, account-level view of usage, billing, and audit events, so you can monitor across components from one place instead of maintaining a separate dashboard for each.
Cost Optimisation
Compute is the main cost driver, so size clusters and warehouses to the workload and enable auto-termination/auto-stop. Use job clusters (which spin up and tear down per run) for scheduled work instead of leaving all-purpose clusters running.
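The gap between an always-on cluster and per-run job clusters is easy to estimate with simple arithmetic. The sketch below is illustrative only: `monthly_compute_hours` is a hypothetical helper, the hourly rate, run schedule, and startup time are made-up numbers, and real bills depend on DBU pricing plus cloud infrastructure charges.

```python
def monthly_compute_hours(runs_per_day, minutes_per_run, days=30,
                          always_on=False, startup_minutes=0):
    """Hours of compute billed per month under two simple usage models."""
    if always_on:
        # An all-purpose cluster left running accrues time around the clock.
        return 24 * days
    # A job cluster accrues time only while starting up and running.
    return runs_per_day * (minutes_per_run + startup_minutes) * days / 60


HOURLY_RATE = 2.0  # hypothetical blended cost per cluster-hour

always_on_hours = monthly_compute_hours(4, 15, always_on=True)
per_run_hours = monthly_compute_hours(4, 15, startup_minutes=5)

print(f"always-on: {always_on_hours} h, cost {always_on_hours * HOURLY_RATE}")
print(f"job clusters: {per_run_hours} h, cost {per_run_hours * HOURLY_RATE}")
```

With these assumed inputs (four 15-minute runs a day, 5 minutes of startup each), the job-cluster model bills 40 hours a month against 720 for the always-on cluster.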
Security and Governance
Unity Catalog is the single control point: grant access at catalog/schema/table scope, and rely on lineage to trace how data moves between components. Keep secrets in secret scopes and prefer service principals for automated jobs.
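Granting at catalog, schema, and table scope usually means three statements for a read-only consumer: usage on the catalog, usage on the schema, and `SELECT` on the table. The sketch below generates that SQL as strings; `read_only_grants` and the `analysts` group are hypothetical, and it assumes the catalog, schema, and table names are simple identifiers that need no quoting.

```python
def quote_principal(principal: str) -> str:
    """Wrap a user, group, or service principal name in backticks."""
    return "`" + principal.replace("`", "``") + "`"


def read_only_grants(catalog: str, schema: str, table: str, principal: str) -> list:
    """Build the GRANT statements a read-only consumer of one table needs."""
    p = quote_principal(principal)
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {p};",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO {p};",
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO {p};",
    ]


for statement in read_only_grants("ops", "telemetry", "events_gold", "analysts"):
    print(statement)
```

Generating the statements from one function keeps the three scopes consistent; the resulting SQL would be run in a SQL cell or on a SQL Warehouse by someone with permission to grant.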
Common Pitfalls and Recommended Patterns
- Using all-purpose clusters for production jobs: use job clusters to avoid idle cost and isolation issues.
- Bypassing Unity Catalog: assets created outside UC lose centralized governance and lineage.
- Hand-rolling orchestration: use Lakeflow for retries, dependencies, and observability instead of cron.
- Mixing dev and prod in one workspace without controls: separate by catalog and permissions.
- Ignoring lineage before changes: check dependencies in Unity Catalog to avoid breaking consumers.
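The last pitfall amounts to a graph question: starting from the table you plan to change, which assets sit downstream? The sketch below answers it with a breadth-first walk over a hand-written edge list. The `LINEAGE` dictionary, the fully qualified bronze name, and the `dashboard:device_activity` consumer are assumptions for illustration; in practice the edges would come from Unity Catalog lineage rather than being typed in.

```python
from collections import deque

# Hand-written edges: each key feeds the assets in its list.
LINEAGE = {
    "ops.telemetry.events_bronze": ["ops.telemetry.events_silver"],
    "ops.telemetry.events_silver": ["ops.telemetry.events_gold"],
    "ops.telemetry.events_gold": ["dashboard:device_activity"],  # hypothetical
}


def downstream(graph: dict, start: str) -> list:
    """Return every asset reachable from `start`, nearest consumers first."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


affected = downstream(LINEAGE, "ops.telemetry.events_silver")
print(affected)
```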
Frequently Asked Questions
What is the minimum set of components I need?
A workspace, some compute, a Unity Catalog catalog/schema, and Delta storage are enough to build and govern a basic pipeline; Lakeflow and Mosaic AI add orchestration and ML when you need them.
What is the difference between a cluster and a SQL Warehouse?
Clusters run general-purpose engineering and ML code (Python, Scala, SQL), while SQL Warehouses are specialized, autoscaling compute tuned for BI and SQL analytics.
Is Unity Catalog mandatory?
It is strongly recommended. You can technically use the legacy metastore, but Unity Catalog provides the unified governance, lineage, and discovery that modern deployments rely on.
How do these components scale independently?
Storage in Delta Lake grows on object storage independently of compute, and you can run multiple right-sized clusters and warehouses against the same governed data.