Parse the message value

Databricks connects to virtually any data source — cloud storage files, relational databases, streaming platforms, and SaaS APIs — through built-in connectors, JDBC/ODBC drivers, and partner integrations. The typical path is: register the connection in Unity Catalog, verify access, then read data into a notebook or pipeline.

    Who this is for:

    Part of the Getting Started with Databricks section of the Databricks tutorial series.

    Architecture / Concept Overview: Parse the message value

    Databricks can consume data from four main categories: file-based sources in cloud storage, relational databases via connectors, streaming platforms, and SaaS APIs. Unity Catalog governs all connections, so credentials are managed securely and access is auditable.

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    flowchart LR
        classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        Files[Cloud Storage Files] --> Databricks[Databricks]
        DBs[Relational Databases] --> Databricks
        Streams[Streaming Platforms] --> Databricks
        APIs[SaaS APIs] --> Databricks
        Databricks --> DL[(Delta Tables)]
        class Files source
        class DBs ingestion
        class Streams processing
        class APIs governance
        class Databricks processing
        class DL storage

    *Figure 1 — Four categories of data sources that Databricks connects to natively.*

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    flowchart LR
        classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        Cred[Storage Credential] --> ExtLoc[External Location]
        ExtLoc --> ExtTable[External Table]
        ExtTable --> UC[Unity Catalog]
        UC --> Users[Users Query Normally]
        class Cred governance
        class ExtLoc storage
        class ExtTable storage
        class UC governance
        class Users serving

    *Figure 2 — Governed connection pattern: credentials → external locations → tables in Unity Catalog.*
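The chain in Figure 2 can be sketched as a small helper that renders the Unity Catalog SQL for each link. This is an illustrative sketch, not the tutorial's own code: the storage credential is assumed to exist already, and every name below (`sales_landing`, `sales_cred`, the bucket URL, the table, and the `analysts` group) is hypothetical.

```python
def governed_connection_sql(location_name, url, credential_name,
                            table_name, table_path, file_format, principal):
    """Render SQL for the chain: credential -> external location -> table -> grant."""
    return [
        # Bind a cloud storage path to an existing storage credential.
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS {location_name} "
        f"URL '{url}' WITH (STORAGE CREDENTIAL {credential_name})",
        # Register an external table over files under that location.
        f"CREATE TABLE IF NOT EXISTS {table_name} "
        f"USING {file_format} LOCATION '{table_path}'",
        # Let a group query the table without seeing any credentials.
        f"GRANT SELECT ON TABLE {table_name} TO `{principal}`",
    ]


# All values below are hypothetical placeholders.
statements = governed_connection_sql(
    location_name="sales_landing",
    url="s3://example-bucket/landing",
    credential_name="sales_cred",
    table_name="main.raw.orders",
    table_path="s3://example-bucket/landing/orders",
    file_format="PARQUET",
    principal="analysts",
)

for statement in statements:
    print(statement)
```

In a notebook, each statement could be passed to `spark.sql(...)` by a user who holds the privileges needed to create external locations.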

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    graph TD
        classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        Ingest[Ingestion Methods]
        Ingest --> COPY[COPY INTO]
        Ingest --> AutoLoader[Auto Loader]
        Ingest --> Stream[Structured Streaming]
        Ingest --> Upload[UI File Upload]
        COPY --> Batch[Batch - one-time load]
        AutoLoader --> Incremental[Incremental - new files]
        Stream --> RT[Real-time - continuous]
        Upload --> Quick[Quick exploration]
        class Ingest processing
        class COPY ingestion
        class AutoLoader ingestion
        class Stream serving
        class Upload source
        class Batch storage
        class Incremental storage
        class RT storage
        class Quick storage

    *Figure 3 — Ingestion methods for different use cases: batch, incremental, real-time, and ad-hoc.*
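For the batch branch of Figure 3, a `COPY INTO` statement can be assembled as shown below. This is a minimal sketch under stated assumptions: the helper name `copy_into_sql`, the target table, and the source path are hypothetical, and only the `FILEFORMAT` and `FORMAT_OPTIONS` clauses are covered.

```python
def copy_into_sql(target_table, source_path, file_format, format_options=None):
    """Render a COPY INTO statement for a one-time or repeated batch load."""
    sql = f"COPY INTO {target_table} FROM '{source_path}' FILEFORMAT = {file_format}"
    if format_options:
        # Options are rendered as quoted key/value pairs, e.g. 'header' = 'true'.
        rendered = ", ".join(f"'{key}' = '{value}'" for key, value in format_options.items())
        sql += f" FORMAT_OPTIONS ({rendered})"
    return sql


# Hypothetical table and path, used only for illustration.
example = copy_into_sql(
    "main.raw.orders",
    "s3://example-bucket/orders/",
    "CSV",
    {"header": "true"},
)
print(example)
```

Because `COPY INTO` tracks which files it has already loaded, re-running the same statement should skip files that were processed earlier.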

    Key Terms

    Prerequisites and Setup

    • A Databricks workspace with a running cluster or SQL warehouse
    • Cloud storage containing your data (or a relational database to connect to)
    • Permission to create external locations and storage credentials (for governed connections)
    • Knowledge of your data's file format (CSV, JSON, Parquet, etc.) or database connection details

    Step-by-Step Implementation

      Configuration Reference

      Parse the message value configuration options
      Data Source                       | Format/Connector                   | Best Ingestion Method
      ----------------------------------|------------------------------------|----------------------------------
      CSV/JSON/Parquet in cloud storage | spark.read / cloudFiles            | Auto Loader for ongoing ingestion
      Delta Lake files                  | spark.read.format("delta")         | Direct read
      PostgreSQL / MySQL                | Lakehouse Federation               | Foreign catalog
      SQL Server                        | JDBC connector                     | spark.read.format("jdbc")
      Kafka / Event Hubs                | spark.readStream.format("kafka")   | Structured Streaming
      REST APIs                         | Python requests + spark            | Custom ingestion notebook
      Small local files                 | UI upload                          | One-time upload
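As one illustration of the SQL Server row, the JDBC reader options can be built in a plain function and then handed to `spark.read`. This is a sketch, not a definitive implementation: the host, database, table, and credential values are hypothetical, and in practice the user and password would come from a secret scope rather than literals.

```python
def sql_server_jdbc_options(host, port, database, table, user, password):
    """Build the option dictionary for spark.read.format("jdbc")."""
    return {
        "url": f"jdbc:sqlserver://{host}:{port};databaseName={database}",
        "dbtable": table,
        "user": user,
        "password": password,
    }


def read_sql_server(spark, options):
    """Read a SQL Server table into a DataFrame (requires an active SparkSession)."""
    return spark.read.format("jdbc").options(**options).load()


# Hypothetical connection details, for illustration only.
options = sql_server_jdbc_options(
    host="sqlserver.example.internal",
    port=1433,
    database="sales",
    table="dbo.orders",
    user="reader",
    password="<from-secret-scope>",
)
print(options["url"])
```

In a notebook, `read_sql_server(spark, options)` would return the table as a DataFrame, assuming the cluster can reach the database over the network.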

      Monitoring, Cost, and Security Considerations

      Monitoring

      Track Auto Loader ingestion metrics through the streaming query progress. Monitor Lakehouse Federation query latency (network round-trips add overhead). Check COPY INTO history to verify all expected files were processed.
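A streaming query exposes its most recent progress report, which can be condensed into a few numbers for a dashboard or log line. The sketch below assumes the report is available as a dictionary (as `query.lastProgress` is in many PySpark versions) and uses the standard progress keys `batchId`, `numInputRows`, and `processedRowsPerSecond`; the sample values are made up for illustration.

```python
def summarize_progress(progress):
    """Condense a streaming progress report into a small summary dict."""
    if not progress:
        # No micro-batch has completed yet.
        return None
    return {
        "batch_id": progress.get("batchId"),
        "input_rows": progress.get("numInputRows", 0),
        "rows_per_second": progress.get("processedRowsPerSecond", 0.0),
    }


# Illustrative sample, not real output from a running stream.
sample_progress = {
    "batchId": 12,
    "numInputRows": 5000,
    "processedRowsPerSecond": 1250.0,
}
summary = summarize_progress(sample_progress)
print(summary)
```

With a live stream, the call would look like `summarize_progress(query.lastProgress)`, where `query` is the handle returned when the stream was started.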

      Cost Optimisation

      Use Auto Loader's file notification mode (instead of directory listing) for storage with many files — it reduces API calls. Avoid scanning entire directories repeatedly; COPY INTO tracks processed files. Cache federated query results locally if queried frequently.
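Switching Auto Loader from directory listing to file notification mode comes down to one reader option. The following sketch builds the options in a plain function; the paths are hypothetical, and it assumes the workspace already has the cloud permissions that notification mode needs.

```python
def auto_loader_options(file_format, schema_location, use_notifications):
    """Build Auto Loader reader options, optionally enabling file notification mode."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        # "true" selects file notification mode; "false" keeps directory listing.
        "cloudFiles.useNotifications": "true" if use_notifications else "false",
    }


def read_with_auto_loader(spark, source_path, options):
    """Create a streaming DataFrame over new files (requires an active SparkSession)."""
    return spark.readStream.format("cloudFiles").options(**options).load(source_path)


# Hypothetical schema location, for illustration only.
notification_options = auto_loader_options(
    file_format="json",
    schema_location="s3://example-bucket/_schemas/events",
    use_notifications=True,
)
print(notification_options)
```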

      Security and Governance

      Never hardcode credentials in notebooks — use Databricks secret scopes. Register all production data access through Unity Catalog external locations. Grant the minimum required permissions on storage credentials. Rotate credentials on a regular schedule.
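Reading credentials from a secret scope instead of hardcoding them might look like the sketch below. `dbutils.secrets.get(scope=..., key=...)` is the notebook utility for this; the scope and key names are hypothetical, and `dbutils` is passed in as a parameter so the function can be exercised outside a notebook.

```python
def jdbc_credentials_from_secrets(dbutils, scope, user_key, password_key):
    """Fetch database credentials from a Databricks secret scope."""
    return {
        "user": dbutils.secrets.get(scope=scope, key=user_key),
        "password": dbutils.secrets.get(scope=scope, key=password_key),
    }


# In a notebook, where `dbutils` is predefined, a call could look like this
# (scope and key names are hypothetical):
#
#   creds = jdbc_credentials_from_secrets(dbutils, "prod-db", "sql-user", "sql-password")
#   df = spark.read.format("jdbc").options(url=jdbc_url, dbtable="dbo.orders", **creds).load()
```

Secret values fetched this way are redacted when printed in notebook output, which is one more reason to prefer them over literals.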

      Common Pitfalls and Recommended Patterns

      • Hardcoding cloud storage keys in notebooks: use secret scopes and storage credentials instead
      • Using inferSchema on large files without sampling: it reads the entire file; set a schema explicitly for production
      • Not setting checkpoints for streaming ingestion: without checkpoints, restarts reprocess all data
      • Querying federated databases for large table scans: federation is for targeted queries; for bulk loads, ingest the data into Delta tables instead
      • Ignoring file format mismatches: CSV files without headers or with inconsistent delimiters cause silent data corruption
      • Uploading large files through the UI: use cloud storage and Auto Loader for anything over a few hundred MB
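The file format pitfall above can be caught early with a quick check on a sample of the file before ingestion. The sketch below is a plain-Python illustration, not a Databricks feature: it counts columns per row with the standard `csv` module and flags a sample whose rows disagree, which often points to a mixed or wrong delimiter. The function names and sample data are hypothetical.

```python
import csv
import io
from collections import Counter


def column_count_profile(sample_text, delimiter=","):
    """Count how many non-empty rows have each column count."""
    reader = csv.reader(io.StringIO(sample_text), delimiter=delimiter)
    return Counter(len(row) for row in reader if row)


def has_consistent_columns(sample_text, delimiter=","):
    """Return True when every non-empty row has the same number of columns."""
    return len(column_count_profile(sample_text, delimiter)) == 1


# Illustrative samples: the second one mixes ';' and ',' as delimiters.
clean_sample = "id,name\n1,alice\n2,bob\n"
mixed_sample = "id,name\n1;alice\n2,bob\n"

print(has_consistent_columns(clean_sample))
print(has_consistent_columns(mixed_sample))
```

A check like this only inspects the sample it is given, so it complements an explicit schema rather than replacing one.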

      Frequently Asked Questions

      What file formats does Databricks support?

      CSV, JSON, Parquet, ORC, Avro, Delta, text, binary, and XML. Delta and Parquet are preferred for performance; CSV and JSON are common for ingestion from external systems.

      Can I connect to on-premises databases?

      Yes, using a secure tunnel or VPN gateway that bridges your on-premises network with your cloud VPC. Databricks clusters in your VPC can then reach on-premises databases through the tunnel.

      How do I handle schema changes in incoming files?

      Auto Loader supports schema evolution. It detects new columns automatically and, depending on the mode, can add them to the schema, capture unexpected data in a rescued data column, or fail the stream. Set cloudFiles.schemaEvolutionMode to choose the strategy.
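A possible way to wire the schema evolution mode into an Auto Loader stream is sketched below. The mode names are the values accepted by `cloudFiles.schemaEvolutionMode`; the helper names, paths, table name, and the choice of JSON input are assumptions made for illustration.

```python
VALID_MODES = {"addNewColumns", "rescue", "failOnNewColumns", "none"}


def schema_evolution_options(mode, schema_location):
    """Build Auto Loader options for a chosen schema evolution mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"Unsupported schema evolution mode: {mode}")
    return {
        "cloudFiles.schemaEvolutionMode": mode,
        # Auto Loader stores the inferred schema and its changes here.
        "cloudFiles.schemaLocation": schema_location,
    }


def start_ingest(spark, source_path, target_table, checkpoint_path, options):
    """Start an incremental load into a Delta table (requires an active SparkSession)."""
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")  # assumed input format
        .options(**options)
        .load(source_path)
        .writeStream
        # The checkpoint records progress so restarts do not reprocess files.
        .option("checkpointLocation", checkpoint_path)
        .option("mergeSchema", "true")
        .trigger(availableNow=True)
        .toTable(target_table)
    )


# Hypothetical schema location, for illustration only.
evolution_options = schema_evolution_options("rescue", "s3://example-bucket/_schemas/events")
print(evolution_options)
```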

      Is Lakehouse Federation a full ETL replacement?

      No. Federation is for querying external data in place. For production analytics, ingest data into Delta tables for better performance, governance, and reliability.
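One simple way to move federated data into a Delta table is a `CREATE TABLE ... AS SELECT` from the foreign catalog. The sketch below only renders the SQL; the helper name, catalog, schema, and table names are hypothetical, and a scheduled or incremental pipeline may be a better fit for large or frequently changing sources.

```python
def ingest_from_foreign_catalog_sql(source_table, target_table, where=None):
    """Render a CTAS statement that copies a federated table into a managed table."""
    sql = f"CREATE OR REPLACE TABLE {target_table} AS SELECT * FROM {source_table}"
    if where:
        # An optional filter limits how much data is pulled from the source.
        sql += f" WHERE {where}"
    return sql


# Hypothetical foreign catalog and target table names.
ctas = ingest_from_foreign_catalog_sql(
    source_table="postgres_sales.public.orders",
    target_table="main.sales.orders",
    where="order_date >= '2024-01-01'",
)
print(ctas)
```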

      How much data can I upload through the UI?

      UI uploads are limited to small files (typically under 2 GB). For larger datasets, use cloud storage with Auto Loader or COPY INTO.