Verify access to your landing zone
Who this is for:
Architecture / Concept Overview: Verify access to your landing zone
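Auto Loader discovers new files in one of two modes: directory listing or file notification. Directory listing finds new files by listing the source path, which needs no permissions beyond read access to the data. File notification mode instead subscribes to cloud-native storage events (Amazon SNS and SQS for S3, Azure Event Grid with Queue Storage for ADLS, Pub/Sub for GCS), so new files are discovered without repeatedly listing the directory, which scales better for large or high-volume paths.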
```mermaid
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
L1[Landing Zone - S3/ADLS/GCS]:::source --> FN{File Notification}:::ingestion
L1 --> DL{Directory Listing}:::ingestion
FN --> AL[Auto Loader]:::processing
DL --> AL
AL --> SC[Schema Checkpoint]:::governance
AL --> DT[Delta Table - Bronze]:::storage
DT --> DP[Downstream Pipelines]:::serving
```
*Auto Loader discovers files via notification or listing, infers schema, and writes to Delta Lake.*
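The flow in the diagram can be sketched in PySpark roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes a Databricks cluster where `spark` is the active session, JSON source files, and caller-supplied placeholder paths. The function names `discovery_options` and `start_bronze_stream`, and the `schema` and `checkpoint` subfolder layout, are illustrative assumptions.

```python
from typing import Dict


def discovery_options(use_notifications: bool) -> Dict[str, str]:
    """Return the Auto Loader option that selects the file discovery mode."""
    # "false" keeps the default directory listing mode; "true" switches to
    # file notification mode, which needs permission to create cloud resources.
    return {"cloudFiles.useNotifications": str(use_notifications).lower()}


def start_bronze_stream(spark, source_path: str, state_path: str,
                        table_name: str, use_notifications: bool = False):
    """Read new JSON files with Auto Loader and append them to a Delta table.

    `spark` is the active SparkSession on a Databricks cluster. The paths and
    the table name are hypothetical placeholders supplied by the caller.
    """
    reader = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        # The inferred schema is persisted here so restarts can reuse it.
        .option("cloudFiles.schemaLocation", f"{state_path}/schema")
        .options(**discovery_options(use_notifications))
    )
    return (
        reader.load(source_path)
        .writeStream
        .format("delta")
        # The checkpoint tracks which files have already been ingested.
        .option("checkpointLocation", f"{state_path}/checkpoint")
        .trigger(availableNow=True)  # process the current backlog, then stop
        .toTable(table_name)
    )
```

Keeping the discovery mode in its own small helper makes it easy to start with directory listing and switch to file notification later without touching the rest of the stream definition.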
```mermaid
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
AL[Auto Loader]:::ingestion
AL --> INF[Schema Inference]:::processing
AL --> EVO[Schema Evolution]:::processing
AL --> RES[Rescued Data Column]:::governance
AL --> CKP[Checkpoint Management]:::storage
INF --> |First batch| EVO
EVO --> |New columns| MERGE[Merge into target schema]:::storage
RES --> |Malformed rows| RESCUE[_rescued_data column]:::governance
```
*Auto Loader manages schema lifecycle: inference on first run, evolution on subsequent runs, and rescued data for malformed records.*
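The schema lifecycle above maps onto a handful of reader options plus a filter on the rescued data column. The sketch below is illustrative only: `schema_options` and `rescued_rows` are hypothetical helper names, the schema location is a placeholder, and it assumes schema inference is in use so that Auto Loader adds the `_rescued_data` column.

```python
from typing import Dict

# Evolution modes listed in the configuration reference for Auto Loader.
EVOLUTION_MODES = {"addNewColumns", "rescue", "failOnNewColumns", "none"}


def schema_options(schema_location: str,
                   evolution_mode: str = "addNewColumns") -> Dict[str, str]:
    """Build the options that control schema inference and evolution."""
    if evolution_mode not in EVOLUTION_MODES:
        raise ValueError(f"unknown schema evolution mode: {evolution_mode!r}")
    return {
        # Where the inferred schema and its later versions are persisted.
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.schemaEvolutionMode": evolution_mode,
        # Infer concrete types instead of treating every column as STRING.
        "cloudFiles.inferColumnTypes": "true",
    }


def rescued_rows(df):
    """Keep only rows where some data did not fit the current schema."""
    # Assumes `df` was read by Auto Loader with schema inference, so the
    # `_rescued_data` column is present.
    return df.filter("_rescued_data IS NOT NULL")
```

Inspecting `rescued_rows(df)` periodically is one way to notice type mismatches or unexpected fields before they affect downstream pipelines.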
Key Terms
Prerequisites and Setup
- Databricks Runtime 11.3 LTS or later.
- Read access to the cloud storage path containing source files.
- Write access to a checkpoint location and schema location in cloud storage.
- For file notification mode: permissions to create cloud resources (SQS queue, Event Grid subscription, or Pub/Sub subscription).
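The storage prerequisites above can be checked from a notebook before any stream is started. The following is a minimal sketch, assuming it runs on Databricks where `dbutils` is available. The function name `verify_landing_zone_access`, the probe file name `_access_probe`, and both paths are hypothetical and not part of any Databricks API.

```python
def verify_landing_zone_access(dbutils, source_path: str, state_path: str) -> dict:
    """Check read access to the source and write access to the state location.

    `dbutils` is the Databricks utilities object available in notebooks.
    `source_path` is the landing zone; `state_path` is where the checkpoint
    and schema will be stored. Both are placeholders supplied by the caller.
    """
    result = {"can_read": False, "can_write": False}

    # Read check: listing the source path fails if access is missing.
    try:
        dbutils.fs.ls(source_path)
        result["can_read"] = True
    except Exception as err:
        result["read_error"] = str(err)

    # Write check: create a small probe file, then remove it again.
    probe = f"{state_path.rstrip('/')}/_access_probe"
    try:
        dbutils.fs.put(probe, "ok", True)  # True = overwrite if it exists
        dbutils.fs.rm(probe)
        result["can_write"] = True
    except Exception as err:
        result["write_error"] = str(err)

    return result
```

In a notebook this might be called as `verify_landing_zone_access(dbutils, source_path, state_path)` with your own paths. Any `read_error` or `write_error` entry in the result carries the underlying storage error message.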
Step-by-Step Implementation
Configuration Reference
| Parameter | Description | Default |
|---|---|---|
| `cloudFiles.format` | Source file format (`json`, `csv`, `parquet`, `avro`, `orc`, `text`, `binaryFile`) | Required |
| `cloudFiles.schemaLocation` | Cloud path for persisting the inferred schema | Required (unless an explicit schema is provided) |
| `cloudFiles.inferColumnTypes` | Infer specific column types instead of defaulting to STRING | `false` |
| `cloudFiles.schemaEvolutionMode` | One of `addNewColumns`, `rescue`, `failOnNewColumns`, `none` | `addNewColumns` when no schema is provided, otherwise `none` |
| `cloudFiles.useNotifications` | Enable file notification mode | `false` |
| `cloudFiles.maxFilesPerTrigger` | Maximum number of files per micro-batch | 1000 |
| `cloudFiles.maxBytesPerTrigger` | Maximum bytes per micro-batch | None |
| `cloudFiles.includeExistingFiles` | Process files that existed before the stream started | `true` |
| `pathGlobFilter` | Glob pattern to filter files (e.g., `*.json`); a generic file-source option, set without the `cloudFiles.` prefix | None |
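Several of the parameters in the table can be assembled into a single options dictionary and passed to the reader in one call. The helper below is a hedged sketch: `build_reader_options` and `SUPPORTED_FORMATS` are hypothetical names, and the defaults simply mirror the table rather than any authoritative source.

```python
from typing import Dict, Optional

# Formats listed for cloudFiles.format in the configuration reference.
SUPPORTED_FORMATS = {"json", "csv", "parquet", "avro", "orc", "text", "binaryFile"}


def build_reader_options(
    file_format: str,
    schema_location: Optional[str] = None,
    max_files_per_trigger: int = 1000,
    max_bytes_per_trigger: Optional[str] = None,
    include_existing_files: bool = True,
) -> Dict[str, str]:
    """Assemble Auto Loader reader options as a string-to-string dictionary."""
    if file_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported cloudFiles.format: {file_format!r}")

    options = {
        "cloudFiles.format": file_format,
        "cloudFiles.maxFilesPerTrigger": str(max_files_per_trigger),
        "cloudFiles.includeExistingFiles": str(include_existing_files).lower(),
    }
    # Only needed when the schema is inferred rather than supplied explicitly.
    if schema_location is not None:
        options["cloudFiles.schemaLocation"] = schema_location
    # Optional byte cap per micro-batch, given as a byte string such as "10g".
    if max_bytes_per_trigger is not None:
        options["cloudFiles.maxBytesPerTrigger"] = max_bytes_per_trigger
    return options
```

On a cluster, the result could be applied with `spark.readStream.format("cloudFiles").options(**build_reader_options("json", schema_location))`, where `schema_location` is your own path. Building the dictionary separately keeps the configuration testable without a running Spark session.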