Check for IP exhaustion in GKE
Who this is for:
Architecture / Concept Overview: Check for IP exhaustion in GKE
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
ISSUE[Issue Detected] -->|Categorize| TRIAGE{Triage}
TRIAGE -->|Provisioning| PROV[Workspace Deploy Failure]
TRIAGE -->|Compute| GKE_I[GKE / Cluster Issues]
TRIAGE -->|Network| NET_I[Connectivity Problems]
TRIAGE -->|IAM| IAM_I[Permission Errors]
TRIAGE -->|Storage| STOR_I[GCS Access Failures]
PROV --> LOGS[Check Cloud Audit Logs]
GKE_I --> LOGS
NET_I --> LOGS
IAM_I --> LOGS
STOR_I --> LOGS
LOGS --> FIX[Apply Resolution]
ISSUE:::source
TRIAGE:::ingestion
PROV:::processing
GKE_I:::processing
NET_I:::storage
IAM_I:::governance
STOR_I:::storage
LOGS:::serving
FIX:::serving
*Troubleshooting triage workflow for Databricks on GCP issues.*
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
CLUSTER_FAIL[Cluster Launch Failure] --> CHECK_IP{Subnet IP Available?}
CHECK_IP -->|No| EXPAND[Expand Subnet Range]
CHECK_IP -->|Yes| CHECK_SA{Service Account Valid?}
CHECK_SA -->|No| FIX_SA[Fix SA Permissions]
CHECK_SA -->|Yes| CHECK_QUOTA{GKE Quota Available?}
CHECK_QUOTA -->|No| REQ_QUOTA[Request Quota Increase]
CHECK_QUOTA -->|Yes| CHECK_FW{Firewall Allows Egress?}
CHECK_FW -->|No| FIX_FW[Update Firewall Rules]
CHECK_FW -->|Yes| CHECK_API{APIs Enabled?}
CHECK_API -->|No| ENABLE[Enable Required APIs]
CLUSTER_FAIL:::source
CHECK_IP:::ingestion
EXPAND:::serving
CHECK_SA:::governance
FIX_SA:::serving
CHECK_QUOTA:::processing
REQ_QUOTA:::serving
CHECK_FW:::storage
FIX_FW:::serving
CHECK_API:::processing
ENABLE:::serving
*Decision tree for diagnosing cluster launch failures on GCP.*
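The decision tree above checks five conditions in a fixed order and stops at the first one that fails. A minimal Python sketch of that ordering follows. The `Check` type, the `CHECKS` list, and the `first_remedy` function are hypothetical names introduced for illustration, and the pass/fail results are assumed to come from your own diagnostics (nothing here queries GCP).

```python
from typing import Dict, NamedTuple, Optional


class Check(NamedTuple):
    """One decision node from the diagram and the fix applied when it fails."""
    question: str
    remedy: str


# Order mirrors the diagram: subnet IPs, service account, quota, firewall, APIs.
CHECKS = [
    Check("Subnet IP available?", "Expand subnet range"),
    Check("Service account valid?", "Fix SA permissions"),
    Check("GKE quota available?", "Request quota increase"),
    Check("Firewall allows egress?", "Update firewall rules"),
    Check("APIs enabled?", "Enable required APIs"),
]


def first_remedy(results: Dict[str, bool]) -> Optional[str]:
    """Return the remedy for the first failing check, or None if all pass.

    `results` maps each question string to True (passes) or False (fails).
    A missing question raises KeyError so that no check is silently skipped.
    """
    for check in CHECKS:
        if not results[check.question]:
            return check.remedy
    return None


if __name__ == "__main__":
    # Hypothetical diagnostic results: only the quota check fails.
    observed = {check.question: True for check in CHECKS}
    observed["GKE quota available?"] = False
    print(first_remedy(observed))
```

Because the walk stops at the first failure, a cluster with both an exhausted subnet and a missing API reports only the subnet fix; rerun the checks after each resolution.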
Key Terms
Prerequisites and Setup
- Access to the GCP project hosting the Databricks workspace
- gcloud CLI authenticated with project viewer or owner permissions
- Access to the Databricks account console for workspace status
- Cloud Logging Logs Viewer role (roles/logging.viewer) to query audit logs
- Basic understanding of GKE, VPC networking, and IAM troubleshooting
Step-by-Step Implementation
Configuration Reference
| Error Symptom | Root Cause | Diagnostic | Resolution |
|---|---|---|---|
| Workspace status FAILED | Missing APIs or permissions | Check audit logs and API list | Enable APIs, fix IAM roles |
| Cluster stuck in PENDING | GKE node pool scaling failure | Check GKE cluster conditions | Verify quota, subnet IPs |
| Pod scheduling failures | IP exhaustion in pod range | Check secondary IP range usage | Expand pod CIDR or reduce pods per node |
| GCS access denied | Service account missing storage role | Test SA permissions | Add roles/storage.objectAdmin |
| BigQuery permission denied | Missing BQ IAM binding | Check SA roles | Add roles/bigquery.dataEditor |
| Network timeout to control plane | Firewall blocking egress 443 | Check firewall rules | Allow HTTPS egress to Databricks IPs |
| Workspace shows BANNED | Billing or policy violation | Contact Databricks support | Resolve billing or policy issue |
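The "IP exhaustion in pod range" row offers two resolutions, expanding the pod CIDR or reducing pods per node, and the arithmetic behind that trade-off can be sketched in Python. The sketch assumes GKE's documented sizing rule for VPC-native clusters: each node reserves a block of pod addresses at least twice its maximum pod count, rounded up to a power of two (for example, a /24 per node at 110 pods per node). The function names and the example CIDR are hypothetical, and the result is an upper bound on node count, not a measurement of current usage.

```python
import ipaddress


def per_node_prefix(max_pods_per_node: int) -> int:
    """Prefix length of the pod block reserved for each node.

    Assumption: the block holds at least 2 * max_pods_per_node addresses,
    rounded up to the next power of two (GKE allows 8 to 256 pods per node).
    """
    if not 8 <= max_pods_per_node <= 256:
        raise ValueError("max_pods_per_node must be between 8 and 256")
    addresses_needed = 2 * max_pods_per_node
    # (n - 1).bit_length() is ceil(log2(n)) for n >= 1, with no float rounding.
    host_bits = (addresses_needed - 1).bit_length()
    return 32 - host_bits


def max_nodes(pod_cidr: str, max_pods_per_node: int) -> int:
    """Upper bound on nodes that a secondary pod range can serve."""
    pod_range = ipaddress.ip_network(pod_cidr)
    node_prefix = per_node_prefix(max_pods_per_node)
    if node_prefix < pod_range.prefixlen:
        return 0  # the range is smaller than a single node's block
    return 2 ** (node_prefix - pod_range.prefixlen)


if __name__ == "__main__":
    # Hypothetical /20 secondary range used for pods.
    for pods in (110, 64, 32):
        print(pods, "pods/node ->", max_nodes("10.4.0.0/20", pods), "nodes max")
```

Under these assumptions, a /20 pod range supports at most 16 nodes at 110 pods per node but 64 nodes at 32 pods per node, which is why lowering the per-node pod limit can relieve exhaustion without changing the CIDR.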