Check for IP exhaustion in GKE

    Who this is for:

    Architecture / Concept Overview: Check for IP exhaustion in GKE

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    flowchart LR
        classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        ISSUE[Issue Detected] -->|Categorize| TRIAGE{Triage}
        TRIAGE -->|Provisioning| PROV[Workspace Deploy Failure]
        TRIAGE -->|Compute| GKE_I[GKE / Cluster Issues]
        TRIAGE -->|Network| NET_I[Connectivity Problems]
        TRIAGE -->|IAM| IAM_I[Permission Errors]
        TRIAGE -->|Storage| STOR_I[GCS Access Failures]
        PROV --> LOGS[Check Cloud Audit Logs]
        GKE_I --> LOGS
        NET_I --> LOGS
        IAM_I --> LOGS
        STOR_I --> LOGS
        LOGS --> FIX[Apply Resolution]
        ISSUE:::source
        TRIAGE:::ingestion
        PROV:::processing
        GKE_I:::processing
        NET_I:::storage
        IAM_I:::governance
        STOR_I:::storage
        LOGS:::serving
        FIX:::serving

    *Troubleshooting triage workflow for Databricks on GCP issues.*
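Every branch of the triage workflow ends at Cloud Audit Logs, so one way to make the "Categorize" step concrete is to turn a category into a Cloud Logging filter string. The sketch below is illustrative only: the `CATEGORY_SERVICE` mapping and the `audit_log_filter` helper are hypothetical names, and pairing each category with a single Google Cloud service is an assumption, not something the workflow prescribes.

```python
from typing import Dict, Optional

# Hypothetical mapping (an assumption): triage category -> the Google Cloud
# service whose audit log entries are a reasonable first place to look.
CATEGORY_SERVICE: Dict[str, Optional[str]] = {
    "Provisioning": None,  # workspace deploys touch several services, so no service filter
    "Compute": "container.googleapis.com",
    "Network": "compute.googleapis.com",
    "IAM": "iam.googleapis.com",
    "Storage": "storage.googleapis.com",
}


def audit_log_filter(category: str, min_severity: str = "ERROR") -> str:
    """Build a Cloud Logging filter string for one triage category."""
    if category not in CATEGORY_SERVICE:
        raise ValueError(f"unknown triage category: {category!r}")
    # Match any Cloud Audit Logs stream at or above the given severity.
    clauses = ['logName:"cloudaudit.googleapis.com"', f"severity>={min_severity}"]
    service = CATEGORY_SERVICE[category]
    if service is not None:
        clauses.append(f'protoPayload.serviceName="{service}"')
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(audit_log_filter("Compute"))
```

The resulting string can be pasted into Logs Explorer or passed as the filter argument to `gcloud logging read`. Adjust the service names to whatever your own investigation points at.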

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    graph TD
        classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        CLUSTER_FAIL[Cluster Launch Failure] --> CHECK_IP{Subnet IP Available?}
        CHECK_IP -->|No| EXPAND[Expand Subnet Range]
        CHECK_IP -->|Yes| CHECK_SA{Service Account Valid?}
        CHECK_SA -->|No| FIX_SA[Fix SA Permissions]
        CHECK_SA -->|Yes| CHECK_QUOTA{GKE Quota Available?}
        CHECK_QUOTA -->|No| REQ_QUOTA[Request Quota Increase]
        CHECK_QUOTA -->|Yes| CHECK_FW{Firewall Allows Egress?}
        CHECK_FW -->|No| FIX_FW[Update Firewall Rules]
        CHECK_FW -->|Yes| CHECK_API{APIs Enabled?}
        CHECK_API -->|No| ENABLE[Enable Required APIs]
        CLUSTER_FAIL:::source
        CHECK_IP:::ingestion
        EXPAND:::serving
        CHECK_SA:::governance
        FIX_SA:::serving
        CHECK_QUOTA:::processing
        REQ_QUOTA:::serving
        CHECK_FW:::storage
        FIX_FW:::serving
        CHECK_API:::processing
        ENABLE:::serving

    *Decision tree for diagnosing cluster launch failures on GCP.*
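The decision tree is an ordered list of checks in which the first failure determines the fix. A minimal sketch of that logic follows. The check keys and the `first_failed_check` helper are hypothetical names chosen for illustration, and the results are assumed to come from whatever diagnostics you run by hand or by script.

```python
from typing import Mapping, Optional

# Checks in the order the decision tree visits them, each paired with the
# remediation the tree names when that check fails.
CHECKS = [
    ("subnet_ip_available", "Expand subnet range"),
    ("service_account_valid", "Fix SA permissions"),
    ("gke_quota_available", "Request quota increase"),
    ("firewall_allows_egress", "Update firewall rules"),
    ("apis_enabled", "Enable required APIs"),
]


def first_failed_check(results: Mapping[str, bool]) -> Optional[str]:
    """Return the remediation for the first failing check, or None if all pass.

    A check that is missing from `results` is treated as failed, so an
    incomplete diagnosis never reports a clean bill of health.
    """
    for key, remediation in CHECKS:
        if not results.get(key, False):
            return remediation
    return None


if __name__ == "__main__":
    observed = {
        "subnet_ip_available": True,
        "service_account_valid": True,
        "gke_quota_available": False,
        "firewall_allows_egress": True,
        "apis_enabled": True,
    }
    print(first_failed_check(observed))
```

Keeping the checks in a list rather than nested `if` statements preserves the order shown in the diagram and makes it easy to add or reorder steps.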

    Key Terms

    Prerequisites and Setup

    • Access to the GCP project hosting the Databricks workspace
    • gcloud CLI authenticated with project viewer or owner permissions
    • Access to the Databricks account console for workspace status
    • Cloud Logging Viewer role to query audit logs
    • Basic understanding of GKE, VPC networking, and IAM troubleshooting

    Step-by-Step Implementation

      Configuration Reference

      Check for IP exhaustion in GKE configuration options
      | Error Symptom | Root Cause | Diagnostic | Resolution |
      |---|---|---|---|
      | Workspace status FAILED | Missing APIs or permissions | Check audit logs and API list | Enable APIs, fix IAM roles |
      | Cluster stuck in PENDING | GKE node pool scaling failure | Check GKE cluster conditions | Verify quota, subnet IPs |
      | Pod scheduling failures | IP exhaustion in pod range | Check secondary IP range usage | Expand pod CIDR or reduce pods per node |
      | GCS access denied | Service account missing storage role | Test SA permissions | Add roles/storage.objectAdmin |
      | BigQuery permission denied | Missing BQ IAM binding | Check SA roles | Add roles/bigquery.dataEditor |
      | Network timeout to control plane | Firewall blocking egress 443 | Check firewall rules | Allow HTTPS egress to Databricks IPs |
      | Workspace shows BANNED | Billing or policy violation | Contact Databricks support | Resolve billing or policy issue |
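For the pod range row, it helps to see why both listed resolutions work. In a VPC-native GKE cluster, each node reserves a fixed block from the pod secondary range, sized to hold at least twice its maximum pod count (a /24 for the default of 110 pods per node). The range can therefore run out of node-sized blocks long before every address is in use. The sketch below estimates how many nodes a pod range can hold. The function names are hypothetical, and the CIDRs and pods-per-node values are placeholders; read the real ones from your own subnet and cluster.

```python
import ipaddress


def node_pod_prefix(max_pods_per_node: int) -> int:
    """Prefix length of the pod block each node reserves.

    The block is the smallest power-of-two range holding at least twice
    the node's maximum pod count (110 pods -> 256 addresses -> /24).
    """
    if max_pods_per_node < 1:
        raise ValueError("max_pods_per_node must be positive")
    needed = 2 * max_pods_per_node
    host_bits = (needed - 1).bit_length()  # ceil(log2(needed)) without floats
    return 32 - host_bits


def max_nodes(pod_cidr: str, max_pods_per_node: int = 110) -> int:
    """Number of nodes the pod secondary range can supply with a block."""
    network = ipaddress.ip_network(pod_cidr)
    per_node = node_pod_prefix(max_pods_per_node)
    if per_node < network.prefixlen:
        return 0  # the range is smaller than a single node's block
    return 2 ** (per_node - network.prefixlen)


if __name__ == "__main__":
    # Placeholder values, not taken from any real workspace.
    for cidr, pods in [("10.4.0.0/14", 110), ("10.0.0.0/20", 110), ("10.0.0.0/20", 32)]:
        print(cidr, pods, max_nodes(cidr, pods))
```

Compare the result with the current node count. If the cluster is at or near the limit, new nodes cannot obtain a pod block and pods stay unschedulable. A larger pod CIDR raises the limit, and a lower pods-per-node setting shrinks each node's block so that the same range serves more nodes.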

      Monitoring, Cost, and Security Considerations

      Common Pitfalls and Recommended Patterns

        Frequently Asked Questions