Parse the message value
Databricks connects to virtually any data source — cloud storage files, relational databases, streaming platforms, and SaaS APIs — through built-in connectors, JDBC/ODBC drivers, and partner integrations. The typical path is: register the connection in Unity Catalog, verify access, then read data into a notebook or pipeline.
Who this is for:
Part of the Getting Started with Databricks section of the Databricks tutorial series.
Architecture / Concept Overview
Databricks can consume data from four main categories of sources: file-based sources in cloud storage, relational databases via connectors, streaming platforms, and SaaS APIs. Unity Catalog governs all connections, so credentials are managed securely and access is auditable.
*Figure 1 — Four categories of data sources that Databricks connects to natively.*
*Figure 2 — Governed connection pattern: credentials → external locations → tables in Unity Catalog.*
*Figure 3 — Ingestion methods for different use cases: batch, incremental, real-time, and ad-hoc.*
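The governed connection pattern can be sketched in a few lines of Python that build the Unity Catalog SQL statements. This is a minimal sketch, not a definitive setup: the location name, bucket URL, credential name, and principal are hypothetical, and it assumes a storage credential has already been created by an administrator.

```python
def external_location_sql(name: str, url: str, credential: str) -> str:
    """Build the statement that registers a cloud storage path in Unity Catalog."""
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS `{name}` "
        f"URL '{url}' "
        f"WITH (STORAGE CREDENTIAL `{credential}`)"
    )


def grant_read_files_sql(name: str, principal: str) -> str:
    """Build the statement that lets a user or group read files at the location."""
    return f"GRANT READ FILES ON EXTERNAL LOCATION `{name}` TO `{principal}`"


def register_location(spark, name: str, url: str, credential: str, principal: str) -> None:
    """Run both statements. `spark` is the session a Databricks notebook provides."""
    spark.sql(external_location_sql(name, url, credential))
    spark.sql(grant_read_files_sql(name, principal))
```

In a notebook, a call such as `register_location(spark, "raw_events", "s3://example-bucket/raw", "example_cred", "data_engineers")` would register the path and grant read access, provided the caller has permission to create external locations.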
Key Terms
Prerequisites and Setup
- A Databricks workspace with a running cluster or SQL warehouse
- Cloud storage containing your data (or a relational database to connect to)
- Permission to create external locations and storage credentials (for governed connections)
- Knowledge of your data's file format (CSV, JSON, Parquet, etc.) or database connection details
Step-by-Step Implementation
Configuration Reference
| Data Source | Format/Connector | Best Ingestion Method |
|---|---|---|
| CSV/JSON/Parquet in cloud storage | spark.read / cloudFiles | Auto Loader for ongoing ingestion |
| Delta Lake files | spark.read.format("delta") | Direct read |
| PostgreSQL / MySQL | Lakehouse Federation | Foreign catalog |
| SQL Server | JDBC connector | spark.read.format("jdbc") |
| Kafka / Event Hubs | spark.readStream.format("kafka") | Structured Streaming |
| REST APIs | Python requests + spark | Custom ingestion notebook |
| Small local files | UI upload | One-time upload |
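
As an illustration of the first row of the table, the following sketch shows one way to ingest files incrementally with Auto Loader. The paths and table name are assumptions supplied by the caller, and `spark` is the session that a Databricks notebook provides. The `cloudFiles` source is specific to Databricks, so this code does not run on open-source Spark.

```python
def auto_loader_options(file_format: str, schema_location: str) -> dict:
    """Options for the cloudFiles source: input format and where to keep the inferred schema."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
    }


def ingest_with_auto_loader(spark, source_path: str, target_table: str,
                            state_path: str, file_format: str = "json"):
    """Load new files from source_path into target_table, then stop.

    state_path is a hypothetical directory that holds both the inferred
    schema and the streaming checkpoint.
    """
    options = auto_loader_options(file_format, f"{state_path}/schema")
    stream = spark.readStream.format("cloudFiles").options(**options).load(source_path)
    return (
        stream.writeStream
        .option("checkpointLocation", f"{state_path}/checkpoint")  # remembers processed files
        .trigger(availableNow=True)  # process everything available, then finish
        .toTable(target_table)
    )
```

Because the checkpoint records which files have been processed, running the function again picks up only files that arrived since the previous run.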
Monitoring, Cost, and Security Considerations
Monitoring
Track Auto Loader ingestion metrics through the streaming query progress. Monitor Lakehouse Federation query latency (network round-trips add overhead). Check COPY INTO history to verify all expected files were processed.
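A small helper can reduce the streaming query progress to the few numbers worth tracking. This sketch assumes the progress is available as a dictionary, which is what `query.lastProgress` returns in PySpark (or `None` before the first batch completes). The choice of fields is illustrative.

```python
def summarize_progress(progress):
    """Pick a few ingestion metrics out of a streaming query progress dictionary."""
    if not progress:
        return None  # no batch has completed yet
    return {
        "batch_id": progress.get("batchId"),
        "input_rows": progress.get("numInputRows", 0),
        "rows_per_second": progress.get("processedRowsPerSecond", 0.0),
    }
```

In a notebook, `summarize_progress(query.lastProgress)` could be logged after each run, where `query` is the handle returned when the stream was started.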
Cost Optimisation
Use Auto Loader's file notification mode (instead of directory listing) for storage with many files, because it reduces the number of storage API calls. Avoid scanning entire directories repeatedly; COPY INTO tracks processed files and skips them on later runs. If federated query results are read frequently, store them in a local Delta table rather than querying the remote database each time.
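Switching between directory listing and file notification mode comes down to one Auto Loader option. The helper below is a sketch with a hypothetical function name. It assumes the workspace has the cloud permissions that file notification mode needs to set up its notification resources.

```python
def file_discovery_options(use_notifications: bool) -> dict:
    """Return the Auto Loader option that selects how new files are discovered."""
    return {"cloudFiles.useNotifications": "true" if use_notifications else "false"}
```

The resulting dictionary can be merged into the other `cloudFiles` options passed to `spark.readStream.format("cloudFiles").options(...)`.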
Security and Governance
Never hardcode credentials in notebooks — use Databricks secret scopes. Register all production data access through Unity Catalog external locations. Grant the minimum required permissions on storage credentials. Rotate credentials on a regular schedule.
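The secret scope advice might look like the following when reading a table over JDBC. The scope name, key names, and JDBC URL are hypothetical. `spark` and `dbutils` are the objects a Databricks notebook provides, passed in explicitly here so the function has no hidden dependencies.

```python
def jdbc_options(url: str, table: str, user: str, password: str) -> dict:
    """Standard options for Spark's JDBC data source."""
    return {"url": url, "dbtable": table, "user": user, "password": password}


def read_table_over_jdbc(spark, dbutils, url: str, table: str,
                         scope: str, user_key: str, password_key: str):
    """Read a database table, fetching credentials from a secret scope at run time."""
    user = dbutils.secrets.get(scope=scope, key=user_key)
    password = dbutils.secrets.get(scope=scope, key=password_key)
    return spark.read.format("jdbc").options(**jdbc_options(url, table, user, password)).load()
```

The credentials never appear in the notebook source, and rotating them only requires updating the secret scope.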
Common Pitfalls and Recommended Patterns
- Hardcoding cloud storage keys in notebooks — use secret scopes and storage credentials instead
- Using inferSchema on large files without sampling: it reads the entire file; set a schema explicitly for production
- Not setting checkpoints for streaming ingestion: without checkpoints, restarts reprocess all data
- Querying federated databases for large table scans: federation is for targeted queries; for bulk loads, ingest the data into Delta tables instead
- Ignoring file format mismatches — CSV files without headers or with inconsistent delimiters cause silent data corruption
- Uploading large files through the UI — use cloud storage and Auto Loader for anything over a few hundred MB
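Two of the pitfalls above, schema inference and format mismatches, can be avoided by declaring the schema and making malformed rows fail loudly. The following is a minimal sketch. The column list is a made-up example, and `spark` is the notebook-provided session.

```python
# Hypothetical schema, written as a DDL string so no extra imports are needed.
ORDERS_SCHEMA = "order_id BIGINT, customer_id BIGINT, amount DECIMAL(10,2), ordered_at TIMESTAMP"


def csv_read_options(has_header: bool = True, delimiter: str = ",") -> dict:
    """CSV reader options; FAILFAST raises an error on malformed rows instead of nulling them."""
    return {"header": str(has_header).lower(), "sep": delimiter, "mode": "FAILFAST"}


def read_orders_csv(spark, path: str):
    """Read CSV files with an explicit schema, skipping schema inference entirely."""
    return (
        spark.read.format("csv")
        .schema(ORDERS_SCHEMA)
        .options(**csv_read_options())
        .load(path)
    )
```

With an explicit schema there is no inference pass over the data, and a file with the wrong delimiter or a missing column raises an error at read time rather than producing silently corrupted rows.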
Frequently Asked Questions
What file formats does Databricks support?
CSV, JSON, Parquet, ORC, Avro, Delta, text, binary, and XML. Delta and Parquet are preferred for performance; CSV and JSON are common for ingestion from external systems.
Can I connect to on-premises databases?
Yes, using a secure tunnel or VPN gateway that bridges your on-premises network with your cloud VPC. Databricks clusters in your VPC can then reach on-premises databases through the tunnel.
How do I handle schema changes in incoming files?
Auto Loader supports schema evolution: it detects new columns automatically and, depending on the mode, can add them to the schema, capture the unexpected data in a rescued data column, or fail the stream. Configure cloudFiles.schemaEvolutionMode for your strategy.
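As a sketch of how that setting might be applied, the helper below validates the mode before building the options. The mode names are those documented for Auto Loader. The function name and the schema location path are assumptions.

```python
SCHEMA_EVOLUTION_MODES = ("addNewColumns", "rescue", "failOnNewColumns", "none")


def schema_evolution_options(mode: str, schema_location: str) -> dict:
    """Build Auto Loader options for a chosen schema evolution strategy."""
    if mode not in SCHEMA_EVOLUTION_MODES:
        raise ValueError(f"Unknown schema evolution mode: {mode!r}")
    return {
        "cloudFiles.schemaEvolutionMode": mode,
        "cloudFiles.schemaLocation": schema_location,  # where the tracked schema is stored
    }
```

For example, `schema_evolution_options("rescue", "/path/to/schema")` keeps the table schema fixed and routes unexpected fields into the rescued data column.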
Is Lakehouse Federation a full ETL replacement?
No. Federation is for querying external data in place. For production analytics, ingest data into Delta tables for better performance, governance, and reliability.
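One way to move from federated querying to ingestion is to copy the foreign table, or a filtered slice of it, into a managed table. This sketch assumes a foreign catalog already exists. The three-part table names are hypothetical, and on Databricks `saveAsTable` creates a Delta table by default.

```python
def materialize_foreign_table(spark, foreign_table: str, target_table: str, condition=None):
    """Copy rows from a federated table into a managed table, replacing earlier contents."""
    df = spark.table(foreign_table)
    if condition:
        df = df.filter(condition)  # optional SQL filter to limit the copied rows
    df.write.mode("overwrite").saveAsTable(target_table)
```

A call such as `materialize_foreign_table(spark, "pg_catalog_example.public.orders", "main.sales.orders", "region = 'EU'")` could run on a schedule so that analytics queries read the local copy.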
How much data can I upload through the UI?
UI uploads are limited to small files (typically under 2 GB). For larger datasets, use cloud storage with Auto Loader or COPY INTO.
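For the COPY INTO route, the statement can be assembled and run from a notebook as sketched below. The table name and source path are hypothetical. The sketch assumes the target Delta table already exists and that the source files are CSV with a header row.

```python
def copy_into_csv_sql(target_table: str, source_path: str) -> str:
    """Build a COPY INTO statement that loads CSV files with a header row."""
    return (
        f"COPY INTO {target_table} "
        f"FROM '{source_path}' "
        "FILEFORMAT = CSV "
        "FORMAT_OPTIONS ('header' = 'true')"
    )


def load_csv_files(spark, target_table: str, source_path: str) -> None:
    """Run the statement. `spark` is the notebook-provided session."""
    spark.sql(copy_into_csv_sql(target_table, source_path))
```

Because COPY INTO skips files it has already loaded, the same call can be repeated as new files land in the source path.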