Databricks on AWS Overview
Who this is for:
Architecture / Concept Overview
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
CP[Databricks Control Plane] -->|Manage| EC2[EC2 Clusters in Customer VPC]
S3_SRC[S3 Data Lake] -->|Read| EC2
KINESIS[Kinesis Data Streams] -->|Stream| EC2
GLUE[AWS Glue Catalog] -->|Metadata| EC2
EC2 -->|Write| S3_DL[Delta Lake on S3]
S3_DL -->|Serve| RS[Amazon Redshift Spectrum]
S3_DL -->|Serve| ATHENA[Amazon Athena]
CP:::processing
EC2:::processing
S3_SRC:::source
KINESIS:::source
GLUE:::governance
S3_DL:::storage
RS:::serving
ATHENA:::serving
*Databricks on AWS platform overview showing the managed control plane, customer VPC data plane, and AWS service integrations.*
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
PLATFORM[Databricks on AWS] --> WORKLOADS[Workload Types]
PLATFORM --> INFRA[Infrastructure]
PLATFORM --> GOV[Governance]
WORKLOADS --> DE[Data Engineering]
WORKLOADS --> DS[Data Science & ML]
WORKLOADS --> SQL[SQL Analytics]
INFRA --> EC2I[EC2 Instances]
INFRA --> S3I[S3 Storage]
INFRA --> VPCI[VPC Networking]
GOV --> IAM[IAM Roles & Policies]
GOV --> UC[Unity Catalog]
GOV --> KMS[KMS Encryption]
PLATFORM:::processing
WORKLOADS:::ingestion
INFRA:::storage
GOV:::governance
DE:::ingestion
DS:::processing
SQL:::serving
EC2I:::storage
S3I:::storage
VPCI:::storage
IAM:::governance
UC:::governance
KMS:::governance
*Databricks on AWS platform capabilities across workloads, infrastructure, and governance.*
Key Terms
Prerequisites and Setup
- An AWS account with administrative access (IAM, EC2, S3, VPC permissions)
- A Databricks account (sign up through AWS Marketplace or at accounts.cloud.databricks.com)
- AWS CLI configured with credentials for the target account
- Understanding of AWS IAM roles, trust policies, and instance profiles
- A VPC with private subnets in at least two Availability Zones
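The last prerequisite can be checked mechanically. The following is a minimal sketch, assuming subnet records shaped like the entries returned by boto3's `describe_subnets` call (only the `SubnetId` and `AvailabilityZone` keys are used). The function names and the sample subnet IDs are hypothetical, and the check does not confirm that the subnets are actually private, since that depends on their route tables.

```python
from collections import defaultdict


def azs_covered(subnets):
    """Group subnet IDs by the Availability Zone they live in."""
    by_az = defaultdict(list)
    for subnet in subnets:
        by_az[subnet["AvailabilityZone"]].append(subnet["SubnetId"])
    return dict(by_az)


def meets_az_requirement(subnets, minimum=2):
    """Return True when the subnets span at least `minimum` distinct AZs."""
    return len(azs_covered(subnets)) >= minimum


# Hypothetical sample data. In practice the list could come from
# boto3.client("ec2").describe_subnets(...)["Subnets"].
SAMPLE_SUBNETS = [
    {"SubnetId": "subnet-aaa", "AvailabilityZone": "us-east-1a"},
    {"SubnetId": "subnet-bbb", "AvailabilityZone": "us-east-1b"},
]

if __name__ == "__main__":
    print(meets_az_requirement(SAMPLE_SUBNETS))
```

Keeping the check as a pure function over plain dictionaries means it can be run against saved `describe_subnets` output without AWS credentials.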
Step-by-Step Implementation
Configuration Reference
| Component | Purpose | Default | Recommended |
|---|---|---|---|
| Cross-Account Role | Databricks manages EC2/VPC | required | Scoped to Databricks actions only |
| Root S3 Bucket | Workspace storage | required | Versioning + KMS encryption |
| Instance Profile | Cluster S3 access | none | One per workload/team |
| VPC | Network isolation | Databricks-managed | Customer-managed for production |
| Security Groups | Cluster firewall rules | Databricks-managed | Restrict to required ports only |
| Spot Instances | Cost optimization | on-demand | SPOT_WITH_FALLBACK for batch |
| Auto-termination | Idle cluster shutdown | 120 min | 30 min for dev clusters |
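
As an illustration of how the recommended values for Instance Profile, Spot Instances, and Auto-termination might come together, here is a hedged sketch that builds a cluster specification as a plain dictionary. The field names (`autotermination_minutes`, `aws_attributes.availability`, `first_on_demand`, `instance_profile_arn`) follow the Databricks Clusters API as commonly documented, but should be verified against the current API reference. The helper name `build_cluster_spec`, the worker count, the ARN, and the `<...>` placeholders are assumptions for illustration only.

```python
import json


def build_cluster_spec(name, instance_profile_arn, batch=True, dev=True):
    """Build a cluster spec dict that applies the table's recommendations."""
    return {
        "cluster_name": name,
        "spark_version": "<spark-version>",  # placeholder: pick a supported runtime
        "node_type_id": "<node-type-id>",    # placeholder: pick an EC2 instance type
        "num_workers": 2,                    # arbitrary size for the example
        # 30 min for dev clusters, otherwise the 120 min default from the table
        "autotermination_minutes": 30 if dev else 120,
        "aws_attributes": {
            # Spot with on-demand fallback for batch work, on-demand otherwise
            "availability": "SPOT_WITH_FALLBACK" if batch else "ON_DEMAND",
            # Assumption: keep the first node (the driver) on an on-demand instance
            "first_on_demand": 1,
            # One instance profile per workload or team
            "instance_profile_arn": instance_profile_arn,
        },
    }


if __name__ == "__main__":
    spec = build_cluster_spec(
        "etl-dev",
        "arn:aws:iam::123456789012:instance-profile/etl-team",  # hypothetical ARN
    )
    print(json.dumps(spec, indent=2))
```

The resulting JSON could be submitted through whichever client is already in use (REST call, CLI, or infrastructure-as-code tool); the sketch deliberately stops at building the payload so it stays runnable without a workspace.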