Verify access to your landing zone

    Who this is for:

    Architecture / Concept Overview: Verify access to your landing zone

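    Auto Loader discovers new files in one of two modes: directory listing or file notification. Directory listing finds new files by listing the source path, so it needs no additional cloud resources. File notification subscribes to storage events through cloud-native services (Amazon SNS and SQS, Azure Event Grid with Queue Storage, or Google Cloud Pub/Sub), which avoids repeated listing and scales better for large directories or high file volumes.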

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    flowchart LR
        classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        L1[Landing Zone - S3/ADLS/GCS]:::source --> FN{File Notification}:::ingestion
        L1 --> DL{Directory Listing}:::ingestion
        FN --> AL[Auto Loader]:::processing
        DL --> AL
        AL --> SC[Schema Checkpoint]:::governance
        AL --> DT[Delta Table - Bronze]:::storage
        DT --> DP[Downstream Pipelines]:::serving

    *Auto Loader discovers files via notification or listing, infers schema, and writes to Delta Lake.*
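
The two discovery modes described above differ by a single reader option. The following is a minimal sketch, not a definitive implementation: the helper names `discovery_options` and `read_landing_zone` and the `/tmp/example/_schemas` path are hypothetical, and `read_landing_zone` assumes a Databricks `SparkSession` is passed in.

```python
from typing import Dict


def discovery_options(file_format: str, schema_location: str,
                      use_notifications: bool = False) -> Dict[str, str]:
    """Build Auto Loader reader options for one of the two discovery modes."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        # "false" keeps directory listing; "true" switches to file notification.
        "cloudFiles.useNotifications": str(use_notifications).lower(),
    }


def read_landing_zone(spark, source_path: str, options: Dict[str, str]):
    """Return a streaming DataFrame over the landing zone.

    Assumes `spark` is a SparkSession on a Databricks cluster, where the
    "cloudFiles" source is available.
    """
    return spark.readStream.format("cloudFiles").options(**options).load(source_path)


# Hypothetical paths, for illustration only.
listing = discovery_options("json", "/tmp/example/_schemas")
notifying = discovery_options("json", "/tmp/example/_schemas", use_notifications=True)
```

Keeping the options in a plain dictionary makes it easy to start with directory listing and switch to file notification later without touching the rest of the pipeline.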

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    graph TD
        classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        AL[Auto Loader]:::ingestion
        AL --> INF[Schema Inference]:::processing
        AL --> EVO[Schema Evolution]:::processing
        AL --> RES[Rescued Data Column]:::governance
        AL --> CKP[Checkpoint Management]:::storage
        INF --> |First batch| EVO
        EVO --> |New columns| MERGE[Merge into target schema]:::storage
        RES --> |Malformed rows| RESCUE[_rescued_data column]:::governance

    *Auto Loader manages schema lifecycle: inference on first run, evolution on subsequent runs, and rescued data for malformed records.*
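
The schema lifecycle above is driven by a small set of reader options. The sketch below shows one way to assemble them and to inspect rescued records. The helper names `schema_options` and `rescued_rows` are hypothetical, the set of modes is taken from this section's configuration reference, and `rescued_rows` assumes a DataFrame read by Auto Loader that contains the `_rescued_data` column.

```python
from typing import Dict

# Evolution modes as listed in this section's configuration reference.
SCHEMA_EVOLUTION_MODES = {"addNewColumns", "rescue", "failOnNewColumns", "none"}


def schema_options(schema_location: str,
                   evolution_mode: str = "addNewColumns",
                   infer_column_types: bool = False) -> Dict[str, str]:
    """Build the schema-related Auto Loader options, rejecting unknown modes."""
    if evolution_mode not in SCHEMA_EVOLUTION_MODES:
        raise ValueError(f"unknown schema evolution mode: {evolution_mode!r}")
    return {
        # Where the inferred schema is persisted between runs.
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.schemaEvolutionMode": evolution_mode,
        # "false" leaves inferred columns as strings.
        "cloudFiles.inferColumnTypes": str(infer_column_types).lower(),
    }


def rescued_rows(df):
    """Return rows whose _rescued_data column is populated (requires PySpark)."""
    # Imported lazily so the option helper above loads without Spark installed.
    from pyspark.sql.functions import col
    return df.filter(col("_rescued_data").isNotNull())
```

Checking `rescued_rows` on the bronze table after a run is a quick way to see whether records are arriving that do not match the current schema.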

    Key Terms

    Prerequisites and Setup

    • Databricks Runtime 11.3 LTS or later.
    • Read access to the cloud storage path containing source files.
    • Write access to a checkpoint location and schema location in cloud storage.
    • For file notification mode: permissions to create the cloud resources it relies on (an SNS topic and SQS queue on AWS, an Event Grid subscription and storage queue on Azure, or a Pub/Sub topic and subscription on Google Cloud).
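
The storage prerequisites above can be checked before any stream is started. The following is a minimal sketch, assuming a filesystem helper with `ls`, `put`, and `rm` methods such as `dbutils.fs` in a Databricks notebook. The function name `verify_landing_zone_access` and the `_access_probe` file name are hypothetical.

```python
from typing import Dict


def verify_landing_zone_access(fs, source_path: str, checkpoint_path: str,
                               schema_path: str) -> Dict[str, bool]:
    """Check read access to the source and write access to the state locations.

    `fs` is any object exposing ls(path), put(path, contents, overwrite) and
    rm(path); in a Databricks notebook this is assumed to be `dbutils.fs`.
    """
    results: Dict[str, bool] = {}

    # Read check: listing the source path fails without list/read permission.
    try:
        fs.ls(source_path)
        results["read:" + source_path] = True
    except Exception:
        results["read:" + source_path] = False

    # Write check: create and then remove a small probe file in each location.
    for path in (checkpoint_path, schema_path):
        probe = path.rstrip("/") + "/_access_probe"
        try:
            fs.put(probe, "ok", True)
            fs.rm(probe)
            results["write:" + path] = True
        except Exception:
            results["write:" + path] = False

    return results
```

In a notebook, a call such as `verify_landing_zone_access(dbutils.fs, source, checkpoint, schema)` returns one boolean per check, so a failed permission shows up before the stream does. It does not cover the file notification permissions, which depend on the cloud provider.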

    Step-by-Step Implementation

      Configuration Reference

      Verify access to your landing zone configuration options
      | Parameter | Description | Default |
      |---|---|---|
      | cloudFiles.format | Source file format (json, csv, parquet, avro, orc, text, binaryFile) | Required |
      | cloudFiles.schemaLocation | Cloud path for persisting the inferred schema | Required (unless an explicit schema is provided) |
      | cloudFiles.inferColumnTypes | Infer specific column types instead of defaulting to STRING | false |
      | cloudFiles.schemaEvolutionMode | How new columns are handled: addNewColumns, rescue, failOnNewColumns, none | addNewColumns (none when an explicit schema is provided) |
      | cloudFiles.useNotifications | Enable file notification mode | false |
      | cloudFiles.maxFilesPerTrigger | Maximum number of files per micro-batch | 1000 |
      | cloudFiles.maxBytesPerTrigger | Maximum bytes per micro-batch | None |
      | cloudFiles.includeExistingFiles | Process files that existed before the stream started | true |
      | pathGlobFilter | Glob pattern to filter files (e.g., *.json); a generic file source option, set without the cloudFiles. prefix | None |
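
To show how the options in this table fit together, here is a hedged end-to-end sketch: a small validator for the two required options, and a function that starts a stream into a bronze table. The names `validate_options` and `start_bronze_stream` are hypothetical, the supported formats are copied from the table, and `start_bronze_stream` assumes a Databricks `SparkSession` plus a target table name and checkpoint path that you supply.

```python
from typing import Dict, List

# Formats as listed for cloudFiles.format in the configuration reference.
SUPPORTED_FORMATS = {"json", "csv", "parquet", "avro", "orc", "text", "binaryFile"}


def validate_options(options: Dict[str, str],
                     has_explicit_schema: bool = False) -> List[str]:
    """Return a list of problems with the required options (empty if none)."""
    problems: List[str] = []
    fmt = options.get("cloudFiles.format")
    if fmt is None:
        problems.append("cloudFiles.format is required")
    elif fmt not in SUPPORTED_FORMATS:
        problems.append(f"unsupported cloudFiles.format: {fmt}")
    if not has_explicit_schema and "cloudFiles.schemaLocation" not in options:
        problems.append("cloudFiles.schemaLocation is required without an explicit schema")
    return problems


def start_bronze_stream(spark, source_path: str, target_table: str,
                        checkpoint_path: str, options: Dict[str, str]):
    """Validate the options, then stream the landing zone into a Delta table."""
    problems = validate_options(options)
    if problems:
        raise ValueError("; ".join(problems))
    reader = spark.readStream.format("cloudFiles").options(**options)
    return (
        reader.load(source_path)
        .writeStream
        .option("checkpointLocation", checkpoint_path)
        # Process everything currently available, then stop.
        .trigger(availableNow=True)
        .toTable(target_table)
    )
```

Validating before calling Spark surfaces a missing format or schema location as a readable message rather than a stream failure. The `availableNow` trigger is one choice among several; a continuously running stream would omit it or use a processing-time trigger instead.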

      Monitoring, Cost, and Security Considerations

      Common Pitfalls and Recommended Patterns

        Frequently Asked Questions