Ingesting from Databases and SaaS Applications
Who this is for:
Architecture / Concept Overview: Ingesting from Databases and SaaS Applications
Database and SaaS ingestion follows two broad patterns: managed CDC replication via Lakeflow Connect for supported sources, and standard ingestion using JDBC, Lakehouse Federation, or partner-built connectors for everything else.
```mermaid
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
PG[PostgreSQL]:::source --> MC[Managed CDC]:::ingestion
MY[MySQL]:::source --> MC
SS[SQL Server]:::source --> MC
MC --> BZ[Bronze Delta Tables]:::storage
OR[Oracle]:::source --> JD[JDBC Batch Read]:::ingestion
SF[Salesforce]:::source --> SM[Managed Connector]:::ingestion
HB[HubSpot]:::source --> PC[Partner Connector]:::ingestion
JD --> BZ
SM --> BZ
PC --> BZ
BZ --> SV[Silver Layer]:::processing
```
*Multiple ingestion pathways converge into the bronze layer of the Lakehouse.*
```mermaid
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
DEC[Choose Ingestion Strategy]:::processing
DEC --> Q1{Managed connector available?}:::governance
Q1 --> |Yes| MC[Use Lakeflow Connect]:::ingestion
Q1 --> |No| Q2{Need real-time CDC?}:::governance
Q2 --> |Yes| DEB[Debezium + Kafka]:::ingestion
Q2 --> |No| Q3{Query federation OK?}:::governance
Q3 --> |Yes| FED[Lakehouse Federation]:::serving
Q3 --> |No| JDBC[JDBC Batch Ingestion]:::source
```
*Decision tree for selecting the right database ingestion strategy.*
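The decision tree can also be written as a small function, which is handy when the choice has to be recorded per source in an inventory script. This is a minimal sketch: the function name `choose_ingestion_strategy` and its three boolean parameters are hypothetical and only mirror the three questions in the diagram.

```python
def choose_ingestion_strategy(
    managed_connector_available: bool,
    needs_realtime_cdc: bool,
    federation_acceptable: bool,
) -> str:
    """Walk the decision tree top to bottom and return the first strategy that applies.

    The parameter names are illustrative; the returned labels match the diagram nodes.
    """
    if managed_connector_available:
        return "Lakeflow Connect"
    if needs_realtime_cdc:
        return "Debezium + Kafka"
    if federation_acceptable:
        return "Lakehouse Federation"
    return "JDBC Batch Ingestion"


# Example: a source with no managed connector, no real-time requirement,
# and no tolerance for querying the source in place.
strategy = choose_ingestion_strategy(False, False, False)
```

The questions are evaluated in the same order as in the diagram, so a source with a managed connector always resolves to Lakeflow Connect regardless of the other answers.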
Key Terms
Prerequisites and Setup
- Unity Catalog enabled on the workspace.
- Network connectivity to source databases (firewall rules, VPC peering, Private Link).
- Source database credentials stored in Databricks Secrets.
- For managed CDC: source database configured for logical replication (e.g., `wal_level = logical` for PostgreSQL).
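Two of these prerequisites can be checked from a notebook before any pipeline is built. The sketch below assumes a secret scope whose keys are named `username` and `password` (both names are assumptions, as are the helper names `load_source_credentials` and `read_wal_level`). It uses `dbutils.secrets.get` to fetch credentials and the PostgreSQL function `current_setting` to read `wal_level` through a Spark JDBC query. Treat it as an illustration under those assumptions, not a required setup step.

```python
def load_source_credentials(dbutils, scope: str) -> dict:
    """Fetch source database credentials from a Databricks secret scope.

    The key names "username" and "password" are assumptions; use whatever
    keys were created in the scope.
    """
    return {
        "user": dbutils.secrets.get(scope=scope, key="username"),
        "password": dbutils.secrets.get(scope=scope, key="password"),
    }


def read_wal_level(spark, jdbc_url: str, credentials: dict) -> str:
    """Return the PostgreSQL wal_level setting, read over JDBC.

    current_setting() is used instead of SHOW because Spark wraps the
    "query" option in a subquery, which needs a plain SELECT.
    """
    row = (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("query", "SELECT current_setting('wal_level') AS wal_level")
        .options(**credentials)
        .load()
        .first()
    )
    return row["wal_level"]
```

In a notebook, `read_wal_level(spark, url, load_source_credentials(dbutils, "my-scope"))` should return `logical` once the source is configured for managed CDC; the scope name here is a placeholder.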
Step-by-Step Implementation
Configuration Reference
| Parameter | Scope | Description | Default |
|---|---|---|---|
| `numPartitions` | JDBC | Number of parallel read partitions | 1 |
| `fetchsize` | JDBC | JDBC fetch size (rows per round-trip) | 0 (driver default) |
| `queryTimeout` | JDBC | Query timeout in seconds | 0 (no timeout) |
| `pushDownPredicate` | JDBC | Push filter predicates to the source | true |
| `connection_name` | Managed Connector | Unity Catalog connection name | Required |
| `gateway_size` | Managed Connector | Ingestion compute size | SMALL |
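The JDBC parameters in the table map directly onto Spark JDBC reader options. The following sketch assembles them into an options dictionary; the helper names `build_jdbc_options` and `read_source_table` are hypothetical, and the `fetchsize` value of 1000 is simply an explicit choice for illustration. Spark requires `partitionColumn`, `lowerBound`, and `upperBound` whenever a partitioned read is requested, so the helper checks for them.

```python
from typing import Optional


def build_jdbc_options(
    url: str,
    table: str,
    user: str,
    password: str,
    partition_column: Optional[str] = None,
    lower_bound: Optional[int] = None,
    upper_bound: Optional[int] = None,
    num_partitions: int = 1,
    fetchsize: int = 1000,  # explicit illustrative value, not a documented default
    query_timeout: int = 0,
    push_down_predicate: bool = True,
) -> dict:
    """Build the option map for a Spark JDBC batch read."""
    options = {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "fetchsize": str(fetchsize),
        "queryTimeout": str(query_timeout),
        "pushDownPredicate": str(push_down_predicate).lower(),
    }
    if num_partitions > 1:
        # A parallel read needs a numeric, date, or timestamp column and its bounds.
        if partition_column is None or lower_bound is None or upper_bound is None:
            raise ValueError(
                "partition_column, lower_bound and upper_bound are required "
                "when num_partitions > 1"
            )
        options.update(
            {
                "partitionColumn": partition_column,
                "lowerBound": str(lower_bound),
                "upperBound": str(upper_bound),
                "numPartitions": str(num_partitions),
            }
        )
    return options


def read_source_table(spark, options: dict):
    """Run the batch read; `spark` is the active SparkSession."""
    return spark.read.format("jdbc").options(**options).load()
```

Note that `lowerBound` and `upperBound` only control how the partition ranges are split; they do not filter rows, so values outside the bounds still land in the first or last partition.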