Documentation (English)

Tabular Regression - XGBoost

Predict house prices using XGBoost on the California Housing dataset

This case study demonstrates training an XGBoost regressor to predict median house prices in California. XGBoost (Extreme Gradient Boosting) is a powerful ensemble method that builds trees sequentially, with each tree correcting errors of previous trees, making it highly effective for regression tasks.

Dataset: California Housing

  • Source: Kaggle (California Housing Prices)
  • Type: Tabular regression
  • Size: 20,640 samples
  • Features: Location, demographics, housing characteristics
  • Target: Median house value (in $100,000s)
  • Challenge: Non-linear relationships, spatial patterns

Model Configuration

{
  "model": "xgboost",
  "category": "regression",
  "model_config": {
    "n_estimators": 500,
    "max_depth": 6,
    "learning_rate": 0.1,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "objective": "reg:squarederror",
    "random_state": 42
  }
}

Training Results

Learning Curve

Model performance improves with more trees:

No plot data available

Predicted vs Actual Prices

Model predictions closely match actual prices:

No plot data available

Feature Importance

Key factors driving house prices:

No plot data available

Residual Distribution

Error distribution should be centered at zero:

No plot data available

Performance by Price Range

Model accuracy varies across price ranges:

No plot data available

Prediction Intervals

Model uncertainty visualization:

No plot data available

Common Use Cases

  • Real Estate Valuation: Automated property appraisal
  • Demand Forecasting: Predict sales, inventory needs
  • Financial Modeling: Revenue prediction, risk assessment
  • Energy Consumption: Predict usage patterns
  • Manufacturing: Quality prediction, yield optimization
  • Healthcare: Patient length of stay, treatment costs
  • Marketing: Customer lifetime value prediction

Key Settings

Essential Parameters

  • n_estimators: Number of boosting rounds (100-1000)
  • max_depth: Maximum tree depth (3-10 typical)
  • learning_rate: Shrinkage (0.01-0.3, lower = more robust)
  • subsample: Row sampling ratio (0.5-1.0)
  • colsample_bytree: Feature sampling ratio (0.5-1.0)

Regularization

  • reg_alpha: L1 regularization (lasso)
  • reg_lambda: L2 regularization (ridge)
  • min_child_weight: Minimum sum of instance weights (hessian) needed in a child
  • gamma: Minimum loss reduction for split

Advanced Configuration

  • objective: Loss function (reg:squarederror, reg:logistic)
  • eval_metric: Evaluation metric (rmse, mae, mape)
  • early_stopping_rounds: Stop if no improvement
  • scale_pos_weight: Class weighting for imbalanced classification (not used in regression)

Performance Metrics

  • RMSE: 0.452 ($45,200 typical error)
  • MAE: 0.328 ($32,800 mean absolute error)
  • R² Score: 0.832 (83.2% variance explained)
  • MAPE: 14.2% (mean absolute percentage error)
  • Training Time: 12.3 seconds (500 trees)
  • Inference Speed: ~15,000 predictions/second
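The four error metrics above can be computed with scikit-learn; the values below are illustrative, not the case study's actual predictions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([4.50, 1.88, 2.75, 3.10])  # prices in $100,000s
y_pred = np.array([4.35, 1.95, 2.60, 3.30])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors
mae = mean_absolute_error(y_true, y_pred)           # average absolute error
r2 = r2_score(y_true, y_pred)                       # fraction of variance explained
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # percentage error
```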

Tips for Success

  1. Feature Engineering: Create interaction features, polynomial terms
  2. Missing Values: XGBoost handles them natively
  3. Feature Scaling: Not required for tree-based models
  4. Hyperparameter Tuning: Use grid search or Bayesian optimization
  5. Cross-Validation: Essential for reliable performance estimates
  6. Learning Rate: Lower rates with more trees often better
  7. Early Stopping: Monitor validation set to prevent overfitting

Example Scenarios

Scenario 1: Coastal High-Income Area

  • Features:
    • Median Income: 8.5 (× $10,000 = $85,000)
    • Longitude: -122.25
    • Latitude: 37.85
    • Average Rooms: 6.2
  • Prediction: $4.35 ($435,000)
  • Actual: $4.50 ($450,000)
  • Error: $15,000 (3.3%)

Scenario 2: Inland Mid-Range Area

  • Features:
    • Median Income: 3.8 (× $10,000 = $38,000)
    • Longitude: -119.80
    • Latitude: 36.75
    • Average Rooms: 4.8
  • Prediction: $1.95 ($195,000)
  • Actual: $1.88 ($188,000)
  • Error: $7,000 (3.7%)

Troubleshooting

Problem: Model overfitting validation data

  • Solution: Reduce max_depth, increase min_child_weight, add regularization

Problem: High error on expensive properties

  • Solution: Log transform target, add more high-value samples

Problem: Training too slow

  • Solution: Reduce n_estimators, use the histogram tree method (tree_method="hist"), subsample data

Problem: Predictions outside valid range

  • Solution: Apply output clipping, check for outliers in features
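A minimal clipping sketch; the bounds are an assumption (prices in $100,000s, here clamped to $15,000–$500,000) and should come from the training data's observed range:

```python
import numpy as np

def clip_predictions(pred, lo=0.15, hi=5.0):
    """Clamp predictions (in $100,000s) to a plausible price range."""
    return np.clip(pred, lo, hi)
```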

XGBoost vs Other Regressors

Model             | RMSE | Training Time | Interpretability
------------------|------|---------------|-----------------
Linear Regression | 0.74 | 0.1s          | High
Random Forest     | 0.51 | 45s           | Medium
XGBoost           | 0.45 | 12s           | Medium
Neural Network    | 0.48 | 180s          | Low
LightGBM          | 0.46 | 8s            | Medium

Next Steps

After training your XGBoost regressor, you can:

  • Deploy as API for real-time predictions
  • Create SHAP plots for model interpretability
  • Ensemble with other models for better accuracy
  • Add conformal prediction for uncertainty quantification
  • Export to production formats (ONNX, Treelite)
  • Build automated retraining pipeline
  • A/B test against current pricing models
