Training Classic ML Models: Scikit-Learn, XGBoost, and LightGBM

    Who this is for:

    Architecture / Concept Overview

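    Classic ML training on Databricks combines single-node library training (scikit-learn, XGBoost, LightGBM) with distributed hyperparameter search (Hyperopt on Spark) and unified model management (MLflow experiments and Unity Catalog registration).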

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    flowchart LR
        classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        DATA[Delta Table] -->|Load| PANDAS[Pandas DataFrame]
        PANDAS -->|Split| TRAIN[Train / Validation / Test]
        TRAIN -->|Fit| MODEL[sklearn / XGBoost / LightGBM]
        MODEL -->|Tune| HPO[Hyperopt on Spark]
        HPO -->|Log| MLF[MLflow Experiment]
        MLF -->|Register| UC[Unity Catalog]
        UC -->|Deploy| EP[Serving Endpoint]
        DATA:::source
        PANDAS:::ingestion
        TRAIN:::processing
        MODEL:::processing
        HPO:::storage
        MLF:::governance
        UC:::governance
        EP:::serving

    *Classic ML workflow: data loading, training, distributed hyperparameter optimisation, and deployment.*
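The split, fit, and log stages of this workflow can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the helper names `split_indices` and `fit_and_log`, the `label` column, the 70/15/15 split, and the choice of `RandomForestClassifier` as a baseline are all hypothetical. The split helper uses only the standard library so it can be read in isolation; `sklearn.model_selection.train_test_split` is the more common choice in practice.

```python
import random


def split_indices(n_rows, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle row positions and cut them into train / validation / test."""
    if n_rows < 0 or val_frac < 0 or test_frac < 0 or val_frac + test_frac >= 1:
        raise ValueError("need n_rows >= 0 and 0 <= val_frac + test_frac < 1")
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)  # seeded, so the split is reproducible
    n_val = int(n_rows * val_frac)
    n_test = int(n_rows * test_frac)
    n_train = n_rows - n_val - n_test
    return (
        positions[:n_train],
        positions[n_train:n_train + n_val],
        positions[n_train + n_val:],
    )


def fit_and_log(pdf, label_col="label"):
    """Fit a baseline classifier on a pandas DataFrame and log the run to MLflow."""
    # Imported lazily so the split helper above has no third-party dependencies.
    import mlflow
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    train_idx, val_idx, _test_idx = split_indices(len(pdf))
    X, y = pdf.drop(columns=[label_col]), pdf[label_col]

    mlflow.sklearn.autolog()  # records parameters, training metrics, and the model
    with mlflow.start_run(run_name="rf_baseline"):
        model = RandomForestClassifier(n_estimators=100, random_state=42)
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        val_acc = accuracy_score(y.iloc[val_idx], model.predict(X.iloc[val_idx]))
        mlflow.log_metric("val_accuracy", val_acc)
    # The test rows stay untouched until a final model has been chosen.
    return model, val_acc
```

Keeping the test indices out of `fit_and_log` entirely is a deliberate choice here: the validation set drives tuning, and the test set is only scored once at the end.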

    %%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
    graph TD
        classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
        FRAMEWORK[Classic ML Frameworks] --> SKL[scikit-learn]
        FRAMEWORK --> XGB[XGBoost]
        FRAMEWORK --> LGBM[LightGBM]
        SKL --> ENSEMBLE[Ensemble Methods]
        SKL --> LINEAR[Linear Models]
        SKL --> PIPE[Pipeline API]
        XGB --> XGBC[XGBClassifier]
        XGB --> XGBR[XGBRegressor]
        XGB --> SPARK_XGB[Spark-Distributed XGBoost]
        LGBM --> LGBMC[LGBMClassifier]
        LGBM --> LGBMR[LGBMRegressor]
        LGBM --> LGBM_SPARK[LightGBM on Spark via SynapseML]
        FRAMEWORK:::governance
        SKL:::processing
        XGB:::serving
        LGBM:::storage
        ENSEMBLE:::ingestion
        LINEAR:::ingestion
        PIPE:::ingestion
        XGBC:::source
        XGBR:::source
        SPARK_XGB:::source
        LGBMC:::source
        LGBMR:::source
        LGBM_SPARK:::source

    *Classic ML framework hierarchy with single-node and distributed variants.*
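One way to picture the single-node branches of this hierarchy is a small lookup from framework and task to estimator class. The sketch below is illustrative only: the `ESTIMATORS` mapping and the `make_estimator` helper are hypothetical names, and random forests stand in for the scikit-learn branch as one example of its ensemble methods. The distributed variants (Spark-distributed XGBoost, SynapseML) are not covered.

```python
from importlib import import_module

# (framework, task) -> import path of a single-node estimator class.
ESTIMATORS = {
    ("sklearn", "classification"): "sklearn.ensemble.RandomForestClassifier",
    ("sklearn", "regression"): "sklearn.ensemble.RandomForestRegressor",
    ("xgboost", "classification"): "xgboost.XGBClassifier",
    ("xgboost", "regression"): "xgboost.XGBRegressor",
    ("lightgbm", "classification"): "lightgbm.LGBMClassifier",
    ("lightgbm", "regression"): "lightgbm.LGBMRegressor",
}


def make_estimator(framework, task, **params):
    """Import the estimator class on demand and instantiate it with params."""
    try:
        path = ESTIMATORS[(framework, task)]
    except KeyError:
        raise ValueError(f"unsupported combination: {framework!r}, {task!r}") from None
    module_name, class_name = path.rsplit(".", 1)
    # The library is only imported when one of its estimators is requested.
    cls = getattr(import_module(module_name), class_name)
    return cls(**params)
```

For example, `make_estimator("xgboost", "classification", n_estimators=200)` would return an `XGBClassifier`. Because all six classes follow the scikit-learn estimator interface, any of them can serve as the final step of a `sklearn.pipeline.Pipeline`.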

    Key Terms

    Prerequisites and Setup

    • Databricks Runtime for ML (all three libraries are pre-installed).
    • A CPU cluster (GPU is unnecessary for these algorithms).
    • Data that fits in driver memory as a Pandas DataFrame (up to ~100 GB with sufficient driver RAM).
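Whether a table fits in driver memory can be checked roughly before converting it. The sketch below measures a small sample with pandas and extrapolates to the full row count. It rests on several assumptions: `estimate_pandas_bytes` and `load_if_it_fits` are hypothetical helpers, `spark` is the `SparkSession` a Databricks notebook provides, and the `headroom` factor of 0.5 is an arbitrary safety margin rather than a documented threshold.

```python
def estimate_pandas_bytes(sample_bytes, sample_rows, total_rows):
    """Extrapolate the in-memory size of the full table from a measured sample."""
    if sample_rows <= 0:
        raise ValueError("sample_rows must be positive")
    return int(sample_bytes / sample_rows * total_rows)


def load_if_it_fits(spark, table_name, driver_ram_bytes,
                    sample_rows=10_000, headroom=0.5):
    """Convert a table to pandas only if the size estimate leaves headroom."""
    sdf = spark.read.table(table_name)
    sample = sdf.limit(sample_rows).toPandas()
    if sample.empty:
        return sample
    # deep=True counts the actual size of string/object columns.
    sample_bytes = int(sample.memory_usage(deep=True).sum())
    estimate = estimate_pandas_bytes(sample_bytes, len(sample), sdf.count())
    if estimate > driver_ram_bytes * headroom:
        raise MemoryError(
            f"{table_name}: estimated {estimate} bytes exceeds "
            f"{headroom:.0%} of driver RAM"
        )
    return sdf.toPandas()
```

The estimate is only as good as the sample: tables with skewed string lengths or sparse columns can be noticeably larger or smaller than the extrapolation suggests, and training itself needs memory beyond the DataFrame.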

    Step-by-Step Implementation

      Configuration Reference

      Training Classic ML Models: Scikit-Learn, XGBoost, and LightGBM configuration options
      Parameter                 | Default                   | Description
      n_estimators              | 100                       | Number of boosting rounds
      max_depth                 | 6 (XGB) / -1 (LGBM)       | Maximum tree depth; -1 means unlimited for LightGBM
      learning_rate             | 0.3 (XGB) / 0.1 (LGBM)    | Step size shrinkage to prevent overfitting
      subsample                 | 1.0                       | Fraction of samples used per tree
      colsample_bytree          | 1.0                       | Fraction of features used per tree
      num_leaves (LightGBM)     | 31                        | Maximum number of leaves per tree
      parallelism (SparkTrials) | Spark default parallelism | Number of concurrent Hyperopt trials across Spark workers
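The parameters above might be wired into a Hyperopt search along these lines. This is a sketch under assumptions, not a prescribed setup: the starting values are illustrative and passed explicitly so behaviour does not depend on each library's defaults, the search ranges are arbitrary, `parallelism=4` is a placeholder, and `cast_sampled` and `tune_xgb` are hypothetical helper names. Hyperopt, scikit-learn, and XGBoost are imported inside the function so the parameter dictionaries can be inspected without them.

```python
import math

# Illustrative starting values, set explicitly rather than left to library defaults.
SHARED = {"n_estimators": 100, "learning_rate": 0.1,
          "subsample": 1.0, "colsample_bytree": 1.0}
XGB_PARAMS = {**SHARED, "max_depth": 6}
LGBM_PARAMS = {**SHARED, "max_depth": -1, "num_leaves": 31}


def cast_sampled(params, int_keys=("n_estimators", "max_depth", "num_leaves")):
    """hp.quniform yields floats; the tree libraries expect ints for these keys."""
    return {k: int(v) if k in int_keys else v for k, v in params.items()}


def tune_xgb(X_train, y_train, X_val, y_val, max_evals=32, parallelism=4):
    """Search XGBoost hyperparameters with Hyperopt trials distributed on Spark."""
    from hyperopt import STATUS_OK, SparkTrials, fmin, hp, tpe
    from sklearn.metrics import log_loss
    from xgboost import XGBClassifier

    # Assumed search ranges; adjust to the dataset at hand.
    space = {
        "max_depth": hp.quniform("max_depth", 3, 10, 1),
        "learning_rate": hp.loguniform("learning_rate",
                                       math.log(0.01), math.log(0.3)),
        "subsample": hp.uniform("subsample", 0.5, 1.0),
        "colsample_bytree": hp.uniform("colsample_bytree", 0.5, 1.0),
    }

    def objective(sampled):
        # Sampled values override the explicit starting values.
        params = {**XGB_PARAMS, **cast_sampled(sampled)}
        model = XGBClassifier(**params).fit(X_train, y_train)
        loss = log_loss(y_val, model.predict_proba(X_val))
        return {"loss": loss, "status": STATUS_OK}

    # Each trial trains one single-node model on a Spark worker.
    trials = SparkTrials(parallelism=parallelism)
    best = fmin(fn=objective, space=space, algo=tpe.suggest,
                max_evals=max_evals, trials=trials)
    return cast_sampled(best)
```

Higher `parallelism` finishes the search sooner but gives the TPE algorithm fewer completed trials to learn from before proposing the next batch, so a value well below `max_evals` is usually the safer starting point.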

      Monitoring, Cost, and Security Considerations

      Common Pitfalls and Recommended Patterns

        Frequently Asked Questions