Core Components of the Databricks Platform

The Databricks platform is made of a few interlocking components: the workspace, compute (clusters and SQL Warehouses), Delta Lake storage, Unity Catalog governance, Lakeflow for orchestration and ingestion, and Mosaic AI for machine learning. Understanding how these pieces connect lets you design end-to-end pipelines without guessing where each responsibility lives. After reading, you will be able to name each core component, explain its role, and wire them together for a working data flow.

Identify each core component and the responsibility it owns
Understand how compute, storage, and governance interact at run time
Assemble a minimal pipeline that touches every major component

Who this is for: Data engineers and architects building a mental map of the platform's building blocks.

Part of the What is Databricks section in the Databricks tutorial series.

Architecture / Concept Overview: Core Components of the Databricks Platform

The components layer cleanly: the workspace is your entry point; compute executes work; Delta Lake holds data; Unity Catalog governs every asset; Lakeflow orchestrates ingestion and jobs; and Mosaic AI builds and serves models. Each component does one job well and connects through open interfaces, so you can adopt them incrementally.

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
WS[Workspace]:::processing --> Compute[Clusters and SQL Warehouses]:::processing
Lakeflow[Lakeflow Orchestration]:::ingestion --> Compute
Compute --> Delta[(Delta Lake Storage)]:::storage
UC[Unity Catalog]:::governance -.governs.-> Delta
UC -.governs.-> Compute
Delta --> Mosaic[Mosaic AI]:::serving
Delta --> SQLBI[SQL and BI]:::serving

*The workspace and Lakeflow drive compute, which reads and writes Delta Lake; Unity Catalog governs it all, and outputs feed Mosaic AI and BI.*

At run time, a job flows through these components in sequence, from orchestration to governed output.

*A scheduled job triggers compute, which authorizes against Unity Catalog before reading and writing governed Delta tables.*
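The run-time sequence can be sketched as a toy model in plain Python. Nothing here calls a Databricks API: the `GRANTS` table, the `etl_job` principal, and the `authorize` and `run_job` helpers are all hypothetical, and they exist only to show the order in which orchestration, compute, governance, and storage take part.

```python
# Toy model of the run-time sequence. Every name is hypothetical and
# nothing below talks to a real workspace.
GRANTS = {
    ("etl_job", "ops.telemetry.events_silver"): {"SELECT"},
    ("etl_job", "ops.telemetry.events_gold"): {"SELECT", "MODIFY"},
}


def authorize(principal: str, table: str, privilege: str) -> bool:
    """Return True if the principal holds the privilege on the table."""
    return privilege in GRANTS.get((principal, table), set())


def run_job(principal: str, source: str, target: str) -> list[str]:
    """Trace the steps a scheduled job takes, stopping at the first denial."""
    trace = ["orchestrator: trigger job", "compute: start"]
    for table, privilege in ((source, "SELECT"), (target, "MODIFY")):
        if not authorize(principal, table, privilege):
            trace.append(f"governance: deny {privilege} on {table}")
            return trace
        trace.append(f"governance: allow {privilege} on {table}")
    trace.append(f"storage: read {source}, write {target}")
    return trace


for step in run_job(
    "etl_job", "ops.telemetry.events_silver", "ops.telemetry.events_gold"
):
    print(step)
```

The point of the sketch is the ordering: the governance check sits between compute starting and any data being touched, so a missing grant stops the job before storage is read or written.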

Key Terms

Workspace: The collaborative environment (notebooks, queries, dashboards, repos, and settings) where teams build and manage assets.
Compute: Execution resources: all-purpose and job clusters for engineering and ML, and SQL Warehouses for BI/SQL workloads.
Delta Lake: The default storage layer providing ACID tables, time travel, and performance optimizations on object storage.
Unity Catalog: The unified governance component managing permissions, lineage, discovery, and auditing across all data and AI assets.
Lakeflow: Databricks' framework for ingestion, declarative pipelines, and job orchestration across the lakehouse.
Mosaic AI: The component suite for building, fine-tuning, serving, and governing machine learning and generative AI models.

Prerequisites and Setup

A Databricks workspace with Unity Catalog enabled
Permission to create compute (a cluster and a SQL Warehouse)
A catalog and schema to write into
Basic familiarity with notebooks and SQL

Step-by-Step Implementation

Create compute

Provision an all-purpose cluster for development; it is the component that executes notebook and job code.

# bash cell - create a small all-purpose cluster
# spark_version and node_type_id are placeholders; replace them with values valid in your workspace
databricks clusters create --json '{
  "cluster_name": "dev",
  "num_workers": 1,
  "spark_version": "managed-lts",
  "node_type_id": "standard",
  "autotermination_minutes": 30
}'
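If you generate cluster definitions from code rather than typing JSON by hand, a small builder can catch obvious mistakes before the request is sent. The sketch below only assembles and prints the payload; `cluster_spec` is a hypothetical helper, and the `managed-lts` and `standard` values are the same placeholders used in the CLI example, not real runtime or node identifiers.

```python
import json


def cluster_spec(
    name: str,
    spark_version: str,
    node_type_id: str,
    num_workers: int = 1,
    autotermination_minutes: int = 30,
) -> dict:
    """Build a cluster definition with the same fields as the CLI example."""
    if num_workers < 0:
        raise ValueError("num_workers must be zero or greater")
    if autotermination_minutes < 0:
        raise ValueError("autotermination_minutes must be zero or greater")
    return {
        "cluster_name": name,
        "num_workers": num_workers,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "autotermination_minutes": autotermination_minutes,
    }


# Placeholder values carried over from the CLI example; substitute real ones.
spec = cluster_spec("dev", spark_version="managed-lts", node_type_id="standard")
payload = json.dumps(spec, indent=2)
print(payload)
```

The resulting string could be passed to the `--json` argument shown above, which keeps one definition reusable across environments.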

Define governed storage

Create a catalog and schema in Unity Catalog so every table you create is governed from the start.

```
-- SQL cell - governance namespace
CREATE CATALOG IF NOT EXISTS ops;
CREATE SCHEMA IF NOT EXISTS ops.telemetry;
```
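Every governed table is addressed by a three-level name of the form catalog.schema.table. When those names are assembled in code, it helps to quote each part so that unusual characters do not break the statement. The following is a minimal sketch; `quote_identifier` and `qualified_name` are hypothetical helpers, not part of any Databricks library.

```python
def quote_identifier(part: str) -> str:
    """Backtick-quote one identifier, doubling any embedded backticks."""
    if not part:
        raise ValueError("identifier parts must be non-empty")
    return "`" + part.replace("`", "``") + "`"


def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Build a fully qualified catalog.schema.table name."""
    return ".".join(quote_identifier(p) for p in (catalog, schema, table))


print(qualified_name("ops", "telemetry", "events_gold"))
```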

Ingest with Lakeflow

Use a declarative pipeline (Lakeflow) so ingestion logic is managed, retried, and observable rather than hand-scheduled.

# Python cell - a declarative Lakeflow streaming table
import dlt

@dlt.table(name="events_bronze")
def events_bronze():
    # Auto Loader (cloudFiles) incrementally picks up new JSON files from the landing volume
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/ops/landing/events/")
    )

Transform on compute into Delta

Refine data and write a governed Delta table that downstream consumers can trust.

-- SQL cell - curated table written to Delta
-- Assumes ops.telemetry.events_silver already exists as a cleaned table derived from events_bronze
CREATE OR REPLACE TABLE ops.telemetry.events_gold AS
SELECT device_id, COUNT(*) AS event_count
FROM ops.telemetry.events_silver
GROUP BY device_id;
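To make the aggregation concrete, here is what the query computes, reproduced on a handful of in-memory rows. The sample data is invented for illustration, and plain Python stands in for Spark; only the `device_id` and `event_count` column names come from the SQL above.

```python
from collections import Counter

# Made-up sample rows standing in for ops.telemetry.events_silver.
events_silver = [
    {"device_id": "a1", "reading": 20.5},
    {"device_id": "b2", "reading": 19.0},
    {"device_id": "a1", "reading": 21.1},
    {"device_id": "a1", "reading": 20.9},
]

# Equivalent of SELECT device_id, COUNT(*) AS event_count ... GROUP BY device_id.
event_counts = Counter(row["device_id"] for row in events_silver)
events_gold = [
    {"device_id": device_id, "event_count": count}
    for device_id, count in sorted(event_counts.items())
]

print(events_gold)
```

Each input event contributes one to its device's count, so the gold table has one row per device regardless of how many events arrived.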

Serve to AI or BI

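Downstream consumers, whether Mosaic AI workloads or BI queries, read the curated gold table directly. The example below records the table's row count as an MLflow metric.

# Python cell - log the gold table's row count as an MLflow metric
import mlflow

with mlflow.start_run():
    mlflow.log_metric("rows", spark.table("ops.telemetry.events_gold").count())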

Configuration Reference

Configuration options for the core components

| Parameter / Option | Type | Default | Description |
| --- | --- | --- | --- |
| Cluster mode | enum (all-purpose / job) | all-purpose | Interactive development vs. scheduled job compute created per run |
| Autoscaling workers | min/max integers | fixed size | Range of workers a cluster scales between under load |
| SQL Warehouse size | enum | Small | Compute power for BI/SQL queries |
| Catalog | string | `hive_metastore` on legacy workspaces; the workspace catalog where Unity Catalog was enabled automatically | Top level of the three-level namespace (catalog.schema.table) |
| Pipeline mode | enum (triggered / continuous) | triggered | Whether a Lakeflow pipeline runs on demand or continuously |

Monitoring, Cost, and Security Considerations

Monitoring

Each component exposes its own observability signals: cluster event logs, SQL query history, pipeline event logs, and MLflow run tracking. Much of this telemetry, along with billing and audit data, is also queryable through Unity Catalog system tables, so you can monitor compute, storage, and AI workloads from one set of queries rather than maintaining a separate dashboard per component.

Cost Optimisation

Compute is the main cost driver, so size clusters and warehouses to the workload and enable auto-termination/auto-stop. Use job clusters (which spin up and tear down per run) for scheduled work instead of leaving all-purpose clusters running.
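A rough back-of-the-envelope calculation shows why this matters. The hourly rate and run durations below are invented for illustration, and the sketch applies one rate to both modes for simplicity; actual pricing depends on your cloud, instance type, and compute type.

```python
def monthly_cost(hourly_rate: float, hours_per_day: float, days: int = 30) -> float:
    """Cost of running compute for a given number of hours each day."""
    return hourly_rate * hours_per_day * days


HOURLY_RATE = 2.0  # hypothetical all-in cost per cluster-hour

# An all-purpose cluster left running around the clock.
always_on = monthly_cost(HOURLY_RATE, hours_per_day=24)

# A job cluster that exists only for a 90-minute nightly run.
per_run = monthly_cost(HOURLY_RATE, hours_per_day=1.5)

print(f"always-on: {always_on:.2f}")
print(f"per-run:   {per_run:.2f}")
print(f"saved:     {always_on - per_run:.2f}")
```

With these assumed numbers the per-run cluster costs 90.00 a month against 1440.00 for the always-on cluster, because compute is billed for the hours it exists, not the hours it does useful work.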

Security and Governance

Unity Catalog is the single control point: grant access at catalog/schema/table scope, and rely on lineage to trace how data moves between components. Keep secrets in secret scopes and prefer service principals for automated jobs.
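Granting at each scope usually means three statements: one for the catalog, one for the schema, and one for the table. The sketch below generates them as strings so they can be reviewed before being run in a SQL cell. The `grant_statements` helper and the `analysts` group are hypothetical, and the identifiers are assumed to be simple names that need no quoting.

```python
def grant_statements(principal: str, catalog: str, schema: str, table: str) -> list[str]:
    """Return the grants a principal needs to read one table."""
    target = f"`{principal}`"
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {target};",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO {target};",
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO {target};",
    ]


# "analysts" is a hypothetical group name.
for statement in grant_statements("analysts", "ops", "telemetry", "events_gold"):
    print(statement)
```

Reading a table requires access to its parent catalog and schema as well as the table itself, which is why the helper emits all three levels together.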

Common Pitfalls and Recommended Patterns

Using all-purpose clusters for production jobs: use job clusters to avoid idle cost and isolation issues.
Bypassing Unity Catalog: assets created outside UC lose centralized governance and lineage.
Hand-rolling orchestration: use Lakeflow for retries, dependencies, and observability instead of cron.
Mixing dev and prod in one workspace without controls: separate by catalog and permissions.
Ignoring lineage before changes: check dependencies in Unity Catalog to avoid breaking consumers.
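The lineage check in the last item amounts to walking a dependency graph downstream from the asset you plan to change. The sketch below does that over a hard-coded dictionary; the edges, the dashboard and model names, and the `affected_by` helper are all hypothetical, and in practice the graph would come from Unity Catalog lineage rather than being written by hand.

```python
from collections import deque

# Hypothetical lineage edges: asset -> assets that read from it.
DOWNSTREAM = {
    "ops.telemetry.events_bronze": ["ops.telemetry.events_silver"],
    "ops.telemetry.events_silver": ["ops.telemetry.events_gold"],
    "ops.telemetry.events_gold": [
        "dashboard:device_activity",
        "model:anomaly_scores",
    ],
}


def affected_by(asset: str) -> list[str]:
    """Breadth-first walk returning every asset downstream of the one given."""
    seen = {asset}
    queue = deque([asset])
    order = []
    while queue:
        for child in DOWNSTREAM.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(affected_by("ops.telemetry.events_silver"))
```

A change to the silver table in this example reaches the gold table, one dashboard, and one model, which is the list of consumers to notify or test before deploying.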

Frequently Asked Questions

What is the minimum set of components I need?

A workspace, some compute, a Unity Catalog catalog/schema, and Delta storage are enough to build and govern a basic pipeline; Lakeflow and Mosaic AI add orchestration and ML when you need them.

What is the difference between a cluster and a SQL Warehouse?

Clusters run general-purpose engineering and ML code (Python, Scala, SQL), while SQL Warehouses are specialized, autoscaling compute tuned for BI and SQL analytics.

Is Unity Catalog mandatory?

It is strongly recommended. You can technically use the legacy Hive metastore, but tables registered there sit outside Unity Catalog's unified governance, lineage, and discovery, which modern deployments rely on.

How do these components scale independently?

Storage in Delta Lake grows on object storage independently of compute, and you can run multiple right-sized clusters and warehouses against the same governed data.