Databricks on AWS
Who this is for:
Architecture / Concept Overview: Databricks on AWS
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
A[Amazon Kinesis / MSK] -->|Stream| B[Databricks Workspace]
C[Amazon S3 Data Lake] -->|Batch| B
D[Amazon RDS / DynamoDB] -->|CDC| B
B -->|Transform| E[Delta Lake on S3]
E -->|Query| F[Amazon Redshift / Athena]
E -->|Govern| G[Unity Catalog]
A:::source
C:::source
D:::source
B:::processing
E:::storage
F:::serving
G:::governance
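*Databricks on AWS data pipeline: streaming, batch, and CDC ingestion from AWS-native sources into the Databricks workspace, transformation into Delta Lake on S3, then querying through Amazon Redshift or Athena and governance through Unity Catalog.*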
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
ACC[AWS Account] --> VPC[Customer VPC]
ACC --> IAM[Cross-Account IAM Role]
VPC --> PRIV[Private Subnets]
VPC --> NAT[NAT Gateway]
VPC --> SG[Security Groups]
IAM --> CP[Databricks Control Plane]
CP --> DP[Data Plane - EC2 Clusters]
DP --> S3[S3 Root Bucket]
DP --> PRIV
ACC:::source
VPC:::storage
IAM:::governance
PRIV:::serving
NAT:::ingestion
SG:::governance
CP:::processing
DP:::processing
S3:::storage
*AWS resource topology showing the cross-account trust relationship between the Databricks control plane and customer-managed VPC resources.*
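
The cross-account trust relationship in the diagram comes down to an IAM trust policy attached to the cross-account role. The sketch below builds such a policy document, assuming the role is assumed through `sts:AssumeRole` guarded by an external ID condition, which is the usual IAM pattern for third-party access. The function name `build_cross_account_trust_policy`, the principal ARN, and the external ID are illustrative placeholders, not values from this guide; take the real ones from your Databricks account console.

```python
import json


def build_cross_account_trust_policy(databricks_principal_arn: str, external_id: str) -> dict:
    """Return an IAM trust policy that lets one external principal assume the role.

    Both arguments are assumptions supplied by the caller: the principal ARN
    that Databricks documents for your deployment, and the external ID
    (typically your Databricks account ID).
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": databricks_principal_arn},
                "Action": "sts:AssumeRole",
                # The external ID condition protects against the confused-deputy problem.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


if __name__ == "__main__":
    policy = build_cross_account_trust_policy(
        databricks_principal_arn="arn:aws:iam::<DATABRICKS_AWS_ACCOUNT_ID>:root",  # placeholder
        external_id="<YOUR_DATABRICKS_ACCOUNT_ID>",  # placeholder
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON can be saved to a file and supplied as the trust policy when creating the role. The permissions policy attached to the role is a separate document and is not covered by this sketch.
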
Key Terms
Prerequisites and Setup
- An AWS account with permissions to create IAM roles, VPCs, S3 buckets, and EC2 instances
- AWS CLI installed and configured with credentials for that account
- A Databricks account (sign up via AWS Marketplace or directly at accounts.cloud.databricks.com)
- Familiarity with AWS IAM policies and trust relationships
- A dedicated VPC with at least two private subnets in different Availability Zones
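
The last prerequisite can be checked before any workspace is created. The sketch below takes subnet records shaped like the entries in the EC2 `DescribeSubnets` response (the `SubnetId`, `AvailabilityZone`, and `MapPublicIpOnLaunch` fields) and reports problems. The function name `check_subnet_azs` and the sample subnet IDs are hypothetical, and the public-IP check is only a heuristic: whether a subnet is truly private depends on its route table, which this sketch does not inspect.

```python
def check_subnet_azs(subnets: list[dict], min_azs: int = 2) -> list[str]:
    """Return a list of problems; an empty list means the check passed."""
    problems = []

    # Count distinct Availability Zones across the candidate subnets.
    azs = {s["AvailabilityZone"] for s in subnets}
    if len(azs) < min_azs:
        problems.append(
            f"subnets span {len(azs)} Availability Zone(s); at least {min_azs} required"
        )

    # Heuristic only: a subnet that auto-assigns public IPs is unlikely to be private.
    public = [s["SubnetId"] for s in subnets if s.get("MapPublicIpOnLaunch")]
    if public:
        problems.append(f"subnets auto-assign public IPs: {', '.join(public)}")

    return problems


if __name__ == "__main__":
    # Hypothetical sample data in the shape of a DescribeSubnets response.
    candidate = [
        {"SubnetId": "subnet-aaa", "AvailabilityZone": "us-east-1a", "MapPublicIpOnLaunch": False},
        {"SubnetId": "subnet-bbb", "AvailabilityZone": "us-east-1b", "MapPublicIpOnLaunch": False},
    ]
    print(check_subnet_azs(candidate) or "subnet layout looks OK")
```

In practice the records could come from `aws ec2 describe-subnets` output parsed as JSON; the function itself needs no AWS access, which keeps it easy to test.
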
Step-by-Step Implementation
Configuration Reference
| Parameter | Description | Default | Recommended |
|---|---|---|---|
| Cross-Account Role | ARN of the IAM role Databricks assumes to manage resources in your account | None (required) | Least-privilege policy |
| Root S3 Bucket | Workspace storage bucket | None (required) | SSE-KMS encryption enabled |
| VPC CIDR | IP range for Databricks VPC | /16 | /16 with room for growth |
| Subnet Size | CIDR block size of each private subnet | /24 | /24 or larger in each AZ |
| NAT Gateway | Outbound internet access for cluster nodes | None (required) | One NAT gateway per AZ for high availability |
| Security Group | Inbound/outbound rules for clusters | Databricks-managed | Allow only required ports |
| Instance Profile | IAM role for cluster EC2 nodes | none | Scoped to specific S3 paths |
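
The VPC CIDR and Subnet Size rows can be sanity-checked with a few lines of arithmetic. The sketch below uses Python's standard `ipaddress` module to confirm that each subnet sits inside the VPC range, is no smaller than /24, and does not overlap another subnet. It also computes usable addresses per subnet, given that AWS reserves five addresses in every subnet. The function names and the `10.0.x.x` ranges are illustrative assumptions, not values prescribed by this guide.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Number of addresses in a subnet that instances can actually use."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def validate_network_plan(vpc_cidr: str, subnet_cidrs: list[str], max_prefix: int = 24) -> list[str]:
    """Return a list of problems; an empty list means the plan is consistent."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = [ipaddress.ip_network(c) for c in subnet_cidrs]
    problems = []

    for s in subnets:
        if not s.subnet_of(vpc):
            problems.append(f"{s} is outside VPC range {vpc}")
        # A longer prefix means a smaller subnet (/26 is smaller than /24).
        if s.prefixlen > max_prefix:
            problems.append(f"{s} is smaller than /{max_prefix}")

    # Pairwise overlap check between subnets.
    for i, a in enumerate(subnets):
        for b in subnets[i + 1:]:
            if a.overlaps(b):
                problems.append(f"{a} overlaps {b}")

    return problems


if __name__ == "__main__":
    # Hypothetical plan: one /16 VPC with a /24 private subnet in each of two AZs.
    print(validate_network_plan("10.0.0.0/16", ["10.0.1.0/24", "10.0.2.0/24"]))
    print(usable_ips("10.0.1.0/24"))
```

For the hypothetical plan above, a /24 subnet holds 256 addresses, of which 251 remain after the AWS reservation. That number is the ceiling on cluster nodes that one subnet can host, and possibly lower if each node consumes more than one address.
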