Core Components of the Databricks Platform

The Databricks platform is made of a few interlocking components: the workspace, compute (clusters and SQL Warehouses), Delta Lake storage, Unity Catalog governance, Lakeflow for orchestration and ingestion, and Mosaic AI for machine learning. Understanding how these pieces connect lets you design end-to-end pipelines without guessing where each responsibility lives. After reading, you will be able to name each core component, explain its role, and wire them together for a working data flow.

  • Identify each core component and the responsibility it owns
  • Understand how compute, storage, and governance interact at run time
  • Assemble a minimal pipeline that touches every major component

Who this is for: Data engineers and architects building a mental map of the platform's building blocks.

Part of the What is Databricks section in the Databricks tutorial series.

Architecture / Concept Overview: Core Components of the Databricks Platform

The components layer cleanly: the workspace is your entry point; compute executes work; Delta Lake holds data; Unity Catalog governs every asset; Lakeflow orchestrates ingestion and jobs; and Mosaic AI builds and serves models. Each component does one job well and connects through open interfaces, so you can adopt them incrementally.

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
    classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    WS[Workspace]:::processing --> Compute[Clusters and SQL Warehouses]:::processing
    Lakeflow[Lakeflow Orchestration]:::ingestion --> Compute
    Compute --> Delta[(Delta Lake Storage)]:::storage
    UC[Unity Catalog]:::governance -.governs.-> Delta
    UC -.governs.-> Compute
    Delta --> Mosaic[Mosaic AI]:::serving
    Delta --> SQLBI[SQL and BI]:::serving

*The workspace and Lakeflow drive compute, which reads and writes Delta Lake; Unity Catalog governs it all, and outputs feed Mosaic AI and BI.*

At run time, a job flows through these components in sequence, from orchestration to governed output.

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
sequenceDiagram
    participant J as Lakeflow Job
    participant C as Compute
    participant D as Delta Lake
    participant U as Unity Catalog
    J->>C: Trigger task on cluster
    C->>U: Check permissions
    U-->>C: Authorize
    C->>D: Read source and write result
    D-->>C: Commit acknowledged
    C-->>J: Report success

*A scheduled job triggers compute, which authorizes against Unity Catalog before reading and writing governed Delta tables.*
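
The sequence above can be mimicked with a small, self-contained Python model. This is a toy sketch and not a Databricks API: the `GRANTS` and `TABLES` dictionaries, the `authorize` and `run_task` functions, and the principal name `etl_sp` are hypothetical stand-ins for Unity Catalog, Delta Lake, compute, and a service principal.

```python
# Toy model of the run-time sequence: job -> compute -> governance -> storage.
# Nothing here calls Databricks; every name is a hypothetical stand-in.

# Stand-in for Unity Catalog: (principal, table) -> set of privileges.
GRANTS = {
    ("etl_sp", "ops.telemetry.events_silver"): {"SELECT"},
    ("etl_sp", "ops.telemetry.events_gold"): {"SELECT", "MODIFY"},
}

# Stand-in for Delta Lake: table name -> list of rows.
TABLES = {
    "ops.telemetry.events_silver": [
        {"device_id": "a"},
        {"device_id": "a"},
        {"device_id": "b"},
    ],
}


def authorize(principal, table, privilege):
    """Compute asks the governance layer before touching a table."""
    if privilege not in GRANTS.get((principal, table), set()):
        raise PermissionError(f"{principal} lacks {privilege} on {table}")


def run_task(principal, source, target):
    """Compute step: authorize, read the source, write the result."""
    authorize(principal, source, "SELECT")
    authorize(principal, target, "MODIFY")
    counts = {}
    for row in TABLES[source]:
        counts[row["device_id"]] = counts.get(row["device_id"], 0) + 1
    TABLES[target] = [
        {"device_id": d, "event_count": n} for d, n in sorted(counts.items())
    ]
    return "SUCCESS"  # the status reported back to the job


status = run_task(
    "etl_sp", "ops.telemetry.events_silver", "ops.telemetry.events_gold"
)
print(status, TABLES["ops.telemetry.events_gold"])
```

The point of the model is the ordering: both authorization checks happen before any read or write, so a principal without grants fails before data is touched.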

Key Terms

Workspace
The collaborative environment (notebooks, queries, dashboards, repos, and settings) where teams build and manage assets.
Compute
Execution resources: all-purpose and job clusters for engineering and ML, and SQL Warehouses for BI/SQL workloads.
Delta Lake
The default storage layer providing ACID tables, time travel, and performance optimizations on object storage.
Unity Catalog
The unified governance component managing permissions, lineage, discovery, and auditing across all data and AI assets.
Lakeflow
Databricks' framework for ingestion, declarative pipelines, and job orchestration across the lakehouse.
Mosaic AI
The component suite for building, fine-tuning, serving, and governing machine learning and generative AI models.

Prerequisites and Setup

  • A Databricks workspace with Unity Catalog enabled
  • Permission to create compute (a cluster and a SQL Warehouse)
  • A catalog and schema to write into
  • Basic familiarity with notebooks and SQL

Step-by-Step Implementation

  1. Create compute

    Provision an all-purpose cluster for development; it is the component that executes notebook and job code.

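    # bash cell - create a small all-purpose cluster
    # "spark_version" and "node_type_id" are placeholders; list valid values with
    # `databricks clusters spark-versions` and `databricks clusters list-node-types`.
    databricks clusters create --json '{
      "cluster_name": "dev",
      "num_workers": 1,
      "spark_version": "managed-lts",
      "node_type_id": "standard",
      "autotermination_minutes": 30
    }'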
  2. Define governed storage

    Create a catalog and schema in Unity Catalog so every table you create is governed from the start.

    -- SQL cell - governance namespace
    CREATE CATALOG IF NOT EXISTS ops;
    CREATE SCHEMA IF NOT EXISTS ops.telemetry;
  3. Ingest with Lakeflow

    Use a declarative pipeline (Lakeflow) so ingestion logic is managed, retried, and observable rather than hand-scheduled.

    # Python cell - a declarative Lakeflow streaming table
    import dlt

    @dlt.table(name="events_bronze")
    def events_bronze():
        # Auto Loader ("cloudFiles") incrementally reads JSON files from the landing volume
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/Volumes/ops/landing/events/")
        )
  4. Transform on compute into Delta

    Refine data and write a governed Delta table that downstream consumers can trust.

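    -- SQL cell - curated table written to Delta
    -- Assumes a cleaned ops.telemetry.events_silver table already exists
    -- (for example, derived from events_bronze); it is not created in this tutorial.
    CREATE OR REPLACE TABLE ops.telemetry.events_gold AS
    SELECT device_id, COUNT(*) AS event_count
    FROM ops.telemetry.events_silver
    GROUP BY device_id;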
  5. Serve to AI or BI

    Register a model or point a SQL Warehouse at the Gold table to complete the flow through Mosaic AI or BI.

    # Python cell - start an MLflow run and log the Gold row count as a metric
    # (a stand-in for model logging; no model is trained or registered here)
    import mlflow

    with mlflow.start_run():
        mlflow.log_metric("rows", spark.table("ops.telemetry.events_gold").count())
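
Step 1 passes the cluster definition as inline JSON. One way to keep that payload readable is to build it in Python and serialize it. The sketch below only constructs and validates the JSON string; it does not call any Databricks API. The helper name `build_cluster_spec` is hypothetical, and `spark_version` and `node_type_id` are supplied by the caller because valid values depend on your cloud and workspace.

```python
import json


def build_cluster_spec(name, spark_version, node_type_id,
                       num_workers=1, autotermination_minutes=30):
    """Return a JSON string with the same fields used in step 1."""
    if num_workers < 0:
        raise ValueError("num_workers must be >= 0")
    if autotermination_minutes < 0:
        raise ValueError("autotermination_minutes must be >= 0")
    spec = {
        "cluster_name": name,
        "num_workers": num_workers,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "autotermination_minutes": autotermination_minutes,
    }
    return json.dumps(spec, indent=2)


# Placeholder values copied from step 1; replace them with values
# that are valid in your workspace.
payload = build_cluster_spec("dev", "managed-lts", "standard")
print(payload)
```

The resulting string can be written to a file or passed to the CLI's `--json` argument, which avoids hand-escaping quotes in the shell.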

Configuration Reference

Core Components of the Databricks Platform configuration options
| Parameter / Option | Type | Default | Description |
|---|---|---|---|
| Cluster mode | enum (all-purpose / job) | all-purpose | Interactive development vs scheduled, single-run job compute |
| Autoscaling workers | min/max integers | fixed | Range of workers a cluster scales between under load |
| SQL Warehouse size | enum | Small | Compute power for BI/SQL queries |
| Catalog | string | workspace-dependent (hive_metastore on legacy workspaces) | Top level of the Unity Catalog namespace |
| Pipeline mode | enum (triggered / continuous) | triggered | Whether a Lakeflow pipeline runs on demand or continuously |
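
The options listed above can be captured in a small typed structure so that invalid combinations fail early. This is a sketch only: the class and field names below are hypothetical and do not correspond to any Databricks SDK type.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ClusterMode(Enum):
    ALL_PURPOSE = "all-purpose"
    JOB = "job"


class PipelineMode(Enum):
    TRIGGERED = "triggered"
    CONTINUOUS = "continuous"


@dataclass(frozen=True)
class PlatformConfig:
    catalog: str
    cluster_mode: ClusterMode = ClusterMode.ALL_PURPOSE
    pipeline_mode: PipelineMode = PipelineMode.TRIGGERED
    min_workers: Optional[int] = None  # None means a fixed-size cluster
    max_workers: Optional[int] = None

    def __post_init__(self):
        # Autoscaling needs both bounds, and they must form a valid range.
        if (self.min_workers is None) != (self.max_workers is None):
            raise ValueError("set both min_workers and max_workers, or neither")
        if self.min_workers is not None and not (
            0 < self.min_workers <= self.max_workers
        ):
            raise ValueError("need 0 < min_workers <= max_workers")

    @property
    def autoscaling(self):
        return self.min_workers is not None


dev = PlatformConfig(catalog="ops")
prod = PlatformConfig(
    catalog="ops", cluster_mode=ClusterMode.JOB, min_workers=2, max_workers=8
)
print(dev.autoscaling, prod.autoscaling)
```

Using enums for the two mode options means a typo such as "continous" fails when the config is built rather than when the pipeline is deployed.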

Monitoring, Cost, and Security Considerations

Monitoring

Each component exposes its own observability: cluster event logs, SQL query history, pipeline event logs, and MLflow run tracking. Much of this, along with billing and audit data, can also be queried through system tables, which lets you monitor compute, storage, and AI from one place instead of maintaining per-component dashboards.

Cost Optimisation

Compute is the main cost driver, so size clusters and warehouses to the workload and enable auto-termination/auto-stop. Use job clusters (which spin up and tear down per run) for scheduled work instead of leaving all-purpose clusters running.
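
To see why idle time dominates, compare an always-on all-purpose cluster with a job cluster that exists only while the job runs. The figures below (hourly rate, run length, runs per day) are made-up inputs for illustration, not Databricks prices, and the model ignores startup time and any rate difference between compute types.

```python
def daily_cost(hourly_rate, hours_running):
    """Cost of keeping compute up for a number of hours."""
    return hourly_rate * hours_running


HOURLY_RATE = 2.0   # hypothetical cost per cluster-hour
RUN_HOURS = 0.5     # hypothetical length of one job run
RUNS_PER_DAY = 4    # hypothetical schedule

always_on = daily_cost(HOURLY_RATE, 24)
per_run = daily_cost(HOURLY_RATE, RUN_HOURS * RUNS_PER_DAY)
idle_share = 1 - per_run / always_on

print(f"always-on: {always_on:.2f}, per-run: {per_run:.2f}, "
      f"idle share: {idle_share:.0%}")
```

With these inputs, roughly 92% of the always-on spend pays for time when no job is running; the exact share depends entirely on your own schedule and rates.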

Security and Governance

Unity Catalog is the single control point: grant access at catalog/schema/table scope, and rely on lineage to trace how data moves between components. Keep secrets in secret scopes and prefer service principals for automated jobs.
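
Reading a table in Unity Catalog requires privileges at each level of the namespace: `USE CATALOG` on the catalog, `USE SCHEMA` on the schema, and `SELECT` on the table. The sketch below generates those statements for a given table and principal. The helper name `read_grants` and the group name `analysts` are hypothetical, and the output is plain SQL text that you would review and run yourself.

```python
def read_grants(table, principal):
    """Return the GRANT statements needed to read a three-level table name."""
    parts = table.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected catalog.schema.table")
    catalog, schema, _ = parts
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`;",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`;",
        f"GRANT SELECT ON TABLE {table} TO `{principal}`;",
    ]


# "analysts" is a hypothetical group name.
for stmt in read_grants("ops.telemetry.events_gold", "analysts"):
    print(stmt)
```

Granting to a group rather than to individual users keeps the number of statements small as teams change.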

Common Pitfalls and Recommended Patterns

  • Using all-purpose clusters for production jobs: use job clusters to avoid idle cost and isolation issues.
  • Bypassing Unity Catalog: assets created outside UC lose centralized governance and lineage.
  • Hand-rolling orchestration: use Lakeflow for retries, dependencies, and observability instead of cron.
  • Mixing dev and prod in one workspace without controls: separate by catalog and permissions.
  • Ignoring lineage before changes: check dependencies in Unity Catalog to avoid breaking consumers.
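
The lineage check in the last point amounts to a graph walk over downstream dependencies. The edges below are a hand-written stand-in for lineage you would look up in Unity Catalog; the placement of `events_bronze` in `ops.telemetry` and the dashboard and model names are hypothetical.

```python
from collections import deque

# Hypothetical lineage: upstream asset -> assets that read from it.
LINEAGE = {
    "ops.telemetry.events_bronze": ["ops.telemetry.events_silver"],
    "ops.telemetry.events_silver": ["ops.telemetry.events_gold"],
    "ops.telemetry.events_gold": [
        "dashboard:device_activity",
        "model:device_churn",
    ],
}


def downstream(asset):
    """Breadth-first walk returning every asset affected by a change."""
    seen, queue, order = {asset}, deque([asset]), []
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(downstream("ops.telemetry.events_silver"))
```

Running the walk before altering a table gives the list of consumers to notify or retest, including indirect ones that do not read the changed table directly.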

Frequently Asked Questions

What is the minimum set of components I need?

A workspace, some compute, a Unity Catalog catalog/schema, and Delta storage are enough to build and govern a basic pipeline; Lakeflow and Mosaic AI add orchestration and ML when you need them.

What is the difference between a cluster and a SQL Warehouse?

Clusters run general-purpose engineering and ML code (Python, Scala, SQL), while SQL Warehouses are specialized, autoscaling compute tuned for BI and SQL analytics.

Is Unity Catalog mandatory?

It is not strictly mandatory, but it is strongly recommended. You can still use the legacy Hive metastore, but Unity Catalog provides the unified governance, lineage, and discovery that modern deployments rely on.

How do these components scale independently?

Storage in Delta Lake grows on object storage independently of compute, and you can run multiple right-sized clusters and warehouses against the same governed data.