Parse the message value

Databricks connects to virtually any data source — cloud storage files, relational databases, streaming platforms, and SaaS APIs — through built-in connectors, JDBC/ODBC drivers, and partner integrations. The typical path is: register the connection in Unity Catalog, verify access, then read data into a notebook or pipeline.

Who this is for:

Part of the Getting Started with Databricks section of the Databricks tutorial series.

Architecture / Concept Overview: Parse the message value

Databricks can consume data from four main categories: file-based sources in cloud storage, relational databases via connectors, streaming platforms, and SaaS APIs. Unity Catalog governs all connections, ensuring credentials are managed securely and access is auditable.

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
    classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    Files[Cloud Storage Files] --> Databricks[Databricks]
    DBs[Relational Databases] --> Databricks
    Streams[Streaming Platforms] --> Databricks
    APIs[SaaS APIs] --> Databricks
    Databricks --> DL[(Delta Tables)]
    class Files source
    class DBs ingestion
    class Streams processing
    class APIs governance
    class Databricks processing
    class DL storage

*Figure 1 — Four categories of data sources that Databricks connects to natively.*

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
    classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    Cred[Storage Credential] --> ExtLoc[External Location]
    ExtLoc --> ExtTable[External Table]
    ExtTable --> UC[Unity Catalog]
    UC --> Users[Users Query Normally]
    class Cred governance
    class ExtLoc storage
    class ExtTable storage
    class UC governance
    class Users serving

*Figure 2 — Governed connection pattern: credentials → external locations → tables in Unity Catalog.*
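The chain in Figure 2 can be sketched as two Unity Catalog SQL statements issued from a notebook. This is a minimal sketch rather than a definitive setup: the location name, bucket URL, credential name, and table name are hypothetical, and it assumes a storage credential already exists and that `spark` is the session a Databricks notebook provides.

```python
def external_location_sql(location: str, url: str, credential: str) -> str:
    """Build the statement that binds a storage path to an existing credential."""
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS {location} "
        f"URL '{url}' "
        f"WITH (STORAGE CREDENTIAL {credential})"
    )


def external_table_sql(table: str, file_format: str, url: str) -> str:
    """Build the statement that registers files under that path as an external table."""
    return f"CREATE TABLE IF NOT EXISTS {table} USING {file_format} LOCATION '{url}'"


def register_governed_source(spark, location, url, credential, table, file_format="PARQUET"):
    """Run both statements in order; `spark` is the notebook's SparkSession."""
    spark.sql(external_location_sql(location, url, credential))
    spark.sql(external_table_sql(table, file_format, url))


# Hypothetical names, for illustration only.
print(external_location_sql("raw_sales", "s3://example-bucket/sales", "sales_cred"))
print(external_table_sql("main.raw.sales", "PARQUET", "s3://example-bucket/sales"))
```

Once the table is registered, users query `main.raw.sales` like any other table and never handle the underlying credential.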

%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
    classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
    Ingest[Ingestion Methods]
    Ingest --> COPY[COPY INTO]
    Ingest --> AutoLoader[Auto Loader]
    Ingest --> Stream[Structured Streaming]
    Ingest --> Upload[UI File Upload]
    COPY --> Batch[Batch - one-time load]
    AutoLoader --> Incremental[Incremental - new files]
    Stream --> RT[Real-time - continuous]
    Upload --> Quick[Quick exploration]
    class Ingest processing
    class COPY ingestion
    class AutoLoader ingestion
    class Stream serving
    class Upload source
    class Batch storage
    class Incremental storage
    class RT storage
    class Quick storage

*Figure 3 — Ingestion methods for different use cases: batch, incremental, real-time, and ad-hoc.*
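As an illustration of the incremental branch in Figure 3, the sketch below wires Auto Loader (the `cloudFiles` source) to a Delta table. It assumes a Databricks notebook where `spark` is predefined. The paths, the table name, and the choice to store the inferred schema alongside the checkpoint are illustrative assumptions, not requirements.

```python
def auto_loader_options(file_format: str, schema_location: str) -> dict:
    """Options for the cloudFiles source: input format and where to persist the inferred schema."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
    }


def ingest_new_files(spark, source_path, target_table, checkpoint_path, file_format="json"):
    """Load only files not yet seen, write them to a Delta table, then stop."""
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**auto_loader_options(file_format, checkpoint_path))
        .load(source_path)
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_path)  # remembers which files were processed
        .trigger(availableNow=True)                      # process the backlog, then finish
        .toTable(target_table)
    )


# Hypothetical location, for illustration only.
print(auto_loader_options("json", "s3://example-bucket/_checkpoints/events"))
```

In a notebook this might be called as `ingest_new_files(spark, "s3://example-bucket/events", "main.raw.events", "s3://example-bucket/_checkpoints/events")`. Rerunning it with the same checkpoint path picks up only files that arrived since the previous run.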

Key Terms

Prerequisites and Setup

A Databricks workspace with a running cluster or SQL warehouse
Cloud storage containing your data (or a relational database to connect to)
Permission to create external locations and storage credentials (for governed connections)
Knowledge of your data's file format (CSV, JSON, Parquet, etc.) or database connection details

Step-by-Step Implementation

Configuration Reference

Parse the message value configuration options
Data Source	Format/Connector	Best Ingestion Method
CSV/JSON/Parquet in cloud storage	spark.read / cloudFiles	Auto Loader for ongoing
Delta Lake files	spark.read.format("delta")	Direct read
PostgreSQL / MySQL	Lakehouse Federation	Foreign catalog
SQL Server	JDBC connector	spark.read.format("jdbc")
Kafka / Event Hubs	spark.readStream.format("kafka")	Structured Streaming
REST APIs	Python requests + spark	Custom ingestion notebook
Small local files	UI upload	One-time upload

Monitoring, Cost, and Security Considerations

Monitoring

Track Auto Loader ingestion metrics through the streaming query progress. Monitor Lakehouse Federation query latency (network round-trips add overhead). Check COPY INTO history to verify all expected files were processed.
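One way to act on streaming query progress is to pull the throughput fields out of each progress report and compare them. The sketch below assumes the report is available as a dictionary-like object (as `query.lastProgress` is in PySpark). The sample values are made up for illustration.

```python
def summarize_progress(progress: dict) -> dict:
    """Pick the throughput fields out of one streaming progress report."""
    return {
        "batch_id": progress.get("batchId"),
        "input_rows": progress.get("numInputRows", 0),
        "input_rows_per_sec": progress.get("inputRowsPerSecond", 0.0),
        "processed_rows_per_sec": progress.get("processedRowsPerSecond", 0.0),
    }


def is_falling_behind(progress: dict) -> bool:
    """True when data arrives faster than the query processes it."""
    summary = summarize_progress(progress)
    return summary["input_rows_per_sec"] > summary["processed_rows_per_sec"]


# Illustrative report; real values come from a running StreamingQuery.
sample = {
    "batchId": 12,
    "numInputRows": 500,
    "inputRowsPerSecond": 50.0,
    "processedRowsPerSecond": 125.0,
}
print(summarize_progress(sample))
print(is_falling_behind(sample))
```

In a notebook, `summarize_progress(query.lastProgress)` would be the typical call. `lastProgress` is `None` until the first micro-batch completes, so guard for that before calling. For COPY INTO, running `DESCRIBE HISTORY` on the target table lists each load as a table operation.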

Cost Optimisation

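Use Auto Loader's file notification mode (instead of directory listing) for storage locations with many files; it reduces list API calls against the storage service. Avoid scanning entire directories repeatedly; COPY INTO tracks processed files and skips them on later runs. If the same federated data is queried frequently, copy it into a Delta table rather than re-querying the remote database each time.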
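Both cost levers above come down to a few settings. The sketch below builds a COPY INTO statement, which is safe to rerun because files that were already loaded are skipped, and shows the Auto Loader option that switches from directory listing to file notifications. The table name and path are hypothetical, and notification mode needs additional cloud-side permissions that are not shown here.

```python
# Auto Loader option that enables file notification mode instead of directory listing.
NOTIFICATION_MODE_OPTIONS = {"cloudFiles.useNotifications": "true"}


def copy_into_sql(target_table, source_path, file_format, format_options=None):
    """Build a COPY INTO statement for files in cloud storage."""
    statement = f"COPY INTO {target_table} FROM '{source_path}' FILEFORMAT = {file_format}"
    if format_options:
        rendered = ", ".join(f"'{key}' = '{value}'" for key, value in format_options.items())
        statement += f" FORMAT_OPTIONS ({rendered})"
    return statement


# Hypothetical table and path, for illustration only.
print(copy_into_sql("main.raw.orders", "s3://example-bucket/orders", "CSV", {"header": "true"}))
```

Passing the resulting string to `spark.sql(...)` on a schedule loads only the files that have arrived since the previous run.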

Security and Governance

Never hardcode credentials in notebooks — use Databricks secret scopes. Register all production data access through Unity Catalog external locations. Grant the minimum required permissions on storage credentials. Rotate credentials on a regular schedule.
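To make the secret-scope advice concrete, here is a minimal sketch of a JDBC read from SQL Server where the credentials come from `dbutils.secrets.get` rather than from the notebook source. The scope name, key names, host, and table are hypothetical; `spark` and `dbutils` are assumed to be the objects a Databricks notebook provides.

```python
def sqlserver_jdbc_options(host, port, database, table, user, password):
    """Assemble the options for Spark's generic JDBC reader."""
    return {
        "url": f"jdbc:sqlserver://{host}:{port};databaseName={database}",
        "dbtable": table,
        "user": user,
        "password": password,
    }


def read_sqlserver_table(spark, dbutils, host, database, table, scope, port=1433):
    """Fetch credentials from a secret scope at run time, then read the table."""
    # Key names below are hypothetical; use whatever keys exist in your scope.
    user = dbutils.secrets.get(scope=scope, key="sqlserver-user")
    password = dbutils.secrets.get(scope=scope, key="sqlserver-password")
    options = sqlserver_jdbc_options(host, port, database, table, user, password)
    return spark.read.format("jdbc").options(**options).load()


# Placeholder values only; real credentials never appear in the notebook.
demo = sqlserver_jdbc_options("db.example.com", 1433, "sales", "dbo.orders", "<user>", "<password>")
print(demo["url"])
```

Because the values are resolved at run time, rotating a credential means updating the secret, not editing every notebook that uses it.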

Common Pitfalls and Recommended Patterns

Hardcoding cloud storage keys in notebooks — use secret scopes and storage credentials instead
Using inferSchema on large files without sampling — it reads the entire file; set a schema explicitly for production
Not setting checkpoints for streaming ingestion — without checkpoints, restarts reprocess all data
Querying federated databases for large table scans: federation is for targeted queries; for bulk loads, copy the data into Delta tables instead (COPY INTO applies only to files in cloud storage)
Ignoring file format mismatches — CSV files without headers or with inconsistent delimiters cause silent data corruption
Uploading large files through the UI — use cloud storage and Auto Loader for anything over a few hundred MB
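Two of the pitfalls above, schema inference on large files and silently mangled CSV rows, can be addressed together by declaring the schema up front and asking the reader to fail on malformed records. This is a sketch under assumptions: the column list is invented for illustration, and `spark` is the notebook's session.

```python
def ddl_schema(columns):
    """Turn (name, type) pairs into the DDL string accepted by DataFrameReader.schema."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


# Hypothetical columns, for illustration only.
ORDER_COLUMNS = [
    ("order_id", "BIGINT"),
    ("customer", "STRING"),
    ("amount", "DECIMAL(10,2)"),
]


def read_orders_csv(spark, path):
    """Read CSV with a declared schema; stop on rows that do not fit it."""
    return (
        spark.read.schema(ddl_schema(ORDER_COLUMNS))  # no inference pass over the data
        .option("header", "true")
        .option("mode", "FAILFAST")                   # raise instead of nulling bad rows
        .csv(path)
    )


print(ddl_schema(ORDER_COLUMNS))
```

`FAILFAST` is a deliberate trade-off: a single bad row stops the load, which is usually preferable in production to discovering nulls downstream. The default `PERMISSIVE` mode is more forgiving for exploration.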

Frequently Asked Questions

What file formats does Databricks support?

CSV, JSON, Parquet, ORC, Avro, Delta, text, binary, and XML. Delta and Parquet are preferred for performance; CSV and JSON are common for ingestion from external systems.

Can I connect to on-premises databases?

Yes, using a secure tunnel or VPN gateway that bridges your on-premises network with your cloud VPC. Databricks clusters in your VPC can then reach on-premises databases through the tunnel.

How do I handle schema changes in incoming files?

Auto Loader supports schema evolution: it detects new columns automatically and, depending on the mode, adds them to the schema, captures the unexpected data in a rescued data column, or fails the stream. Set cloudFiles.schemaEvolutionMode to choose the strategy.
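A small helper can make the chosen strategy explicit and catch typos early. The sketch below validates the mode against the values Auto Loader documents for `cloudFiles.schemaEvolutionMode` (`addNewColumns`, `rescue`, `failOnNewColumns`, `none`) and returns reader options. The helper name and the paths are hypothetical.

```python
SCHEMA_EVOLUTION_MODES = {"addNewColumns", "rescue", "failOnNewColumns", "none"}


def schema_evolution_options(file_format, schema_location, mode="addNewColumns"):
    """Build cloudFiles options with an explicit schema evolution strategy."""
    if mode not in SCHEMA_EVOLUTION_MODES:
        raise ValueError(
            f"Unknown schema evolution mode {mode!r}; expected one of {sorted(SCHEMA_EVOLUTION_MODES)}"
        )
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,  # where the tracked schema is stored
        "cloudFiles.schemaEvolutionMode": mode,
    }


# Hypothetical path, for illustration only.
print(schema_evolution_options("json", "s3://example-bucket/_schemas/events", "rescue"))
```

The returned dictionary can be passed to `spark.readStream.format("cloudFiles").options(**opts)`. With `rescue`, the table schema stays fixed and data that does not match it is kept in the rescued data column for later inspection.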

Is Lakehouse Federation a full ETL replacement?

No. Federation is for querying external data in place. For production analytics, ingest data into Delta tables for better performance, governance, and reliability.
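The split between querying in place and ingesting can be sketched with two statements: one that exposes a remote database as a foreign catalog, and one that copies a table from it into Delta. This assumes a Lakehouse Federation connection already exists. The catalog, connection, database, and table names are all hypothetical.

```python
def foreign_catalog_sql(catalog, connection, database):
    """Expose a remote database through an existing connection as a foreign catalog."""
    return (
        f"CREATE FOREIGN CATALOG IF NOT EXISTS {catalog} "
        f"USING CONNECTION {connection} "
        f"OPTIONS (database '{database}')"
    )


def materialize_sql(target_table, foreign_table):
    """Copy a federated table into a managed Delta table for repeated analytics."""
    return f"CREATE OR REPLACE TABLE {target_table} AS SELECT * FROM {foreign_table}"


# Hypothetical names, for illustration only.
print(foreign_catalog_sql("pg_sales", "pg_conn", "sales"))
print(materialize_sql("main.sales.orders", "pg_sales.public.orders"))
```

Running the second statement on a schedule is a simple full refresh. It suits small and medium tables, while very large tables usually call for an incremental ingestion approach instead.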

How much data can I upload through the UI?

UI uploads are limited to small files (typically under 2 GB). For larger datasets, use cloud storage with Auto Loader or COPY INTO.