Tabular Regression - XGBoost
Predict house prices using XGBoost on California Housing dataset
This case study demonstrates training an XGBoost regressor to predict median house prices in California. XGBoost (Extreme Gradient Boosting) is a powerful ensemble method that builds trees sequentially, with each tree correcting errors of previous trees, making it highly effective for regression tasks.
Dataset: California Housing
- Source: Kaggle (California Housing Prices)
- Type: Tabular regression
- Size: 20,640 samples
- Features: Location, demographics, housing characteristics
- Target: Median house value (in $100,000s)
- Challenge: Non-linear relationships, spatial patterns
Model Configuration
{
"model": "xgboost",
"category": "regression",
"model_config": {
"n_estimators": 500,
"max_depth": 6,
"learning_rate": 0.1,
"subsample": 0.8,
"colsample_bytree": 0.8,
"objective": "reg:squarederror",
"random_state": 42
}
}Training Results
Learning Curve
Model performance improves with more trees:
Keine Plot-Daten verfügbar
Predicted vs Actual Prices
Model predictions closely match actual prices:
Keine Plot-Daten verfügbar
Feature Importance
Key factors driving house prices:
Keine Plot-Daten verfügbar
Residual Distribution
Error distribution should be centered at zero:
Keine Plot-Daten verfügbar
Performance by Price Range
Model accuracy varies across price ranges:
Keine Plot-Daten verfügbar
Prediction Intervals
Model uncertainty visualization:
Keine Plot-Daten verfügbar
Common Use Cases
- Real Estate Valuation: Automated property appraisal
- Demand Forecasting: Predict sales, inventory needs
- Financial Modeling: Revenue prediction, risk assessment
- Energy Consumption: Predict usage patterns
- Manufacturing: Quality prediction, yield optimization
- Healthcare: Patient length of stay, treatment costs
- Marketing: Customer lifetime value prediction
Key Settings
Essential Parameters
- n_estimators: Number of boosting rounds (100-1000)
- max_depth: Maximum tree depth (3-10 typical)
- learning_rate: Shrinkage (0.01-0.3, lower = more robust)
- subsample: Row sampling ratio (0.5-1.0)
- colsample_bytree: Feature sampling ratio (0.5-1.0)
Regularization
- reg_alpha: L1 regularization (lasso)
- reg_lambda: L2 regularization (ridge)
- min_child_weight: Minimum sum of instance weight
- gamma: Minimum loss reduction for split
Advanced Configuration
- objective: Loss function (reg:squarederror, reg:logistic)
- eval_metric: Evaluation metric (rmse, mae, r2)
- early_stopping_rounds: Stop if no improvement
- scale_pos_weight: For imbalanced data
Performance Metrics
- RMSE: 0.452 ($45,200 typical error)
- MAE: 0.328 ($32,800 mean absolute error)
- R² Score: 0.832 (83.2% variance explained)
- MAPE: 14.2% (mean absolute percentage error)
- Training Time: 12.3 seconds (500 trees)
- Inference Speed: ~15,000 predictions/second
Tips for Success
- Feature Engineering: Create interaction features, polynomial terms
- Missing Values: XGBoost handles them natively
- Feature Scaling: Not required for tree-based models
- Hyperparameter Tuning: Use grid search or Bayesian optimization
- Cross-Validation: Essential for reliable performance estimates
- Learning Rate: Lower rates with more trees often better
- Early Stopping: Monitor validation set to prevent overfitting
Example Scenarios
Scenario 1: Coastal High-Income Area
- Features:
- Median Income: $8.5 (x $10k)
- Longitude: -122.25
- Latitude: 37.85
- Average Rooms: 6.2
- Prediction: $4.35 ($435,000)
- Actual: $4.50 ($450,000)
- Error: $15,000 (3.3%)
Scenario 2: Inland Mid-Range Area
- Features:
- Median Income: $3.8
- Longitude: -119.80
- Latitude: 36.75
- Average Rooms: 4.8
- Prediction: $1.95 ($195,000)
- Actual: $1.88 ($188,000)
- Error: $7,000 (3.7%)
Troubleshooting
Problem: Model overfitting validation data
- Solution: Reduce max_depth, increase min_child_weight, add regularization
Problem: High error on expensive properties
- Solution: Log transform target, add more high-value samples
Problem: Training too slow
- Solution: Reduce n_estimators, use histogram-based method, subsample data
Problem: Predictions outside valid range
- Solution: Apply output clipping, check for outliers in features
XGBoost vs Other Regressors
| Model | RMSE | Training Time | Interpretability |
|---|---|---|---|
| Linear Regression | 0.74 | 0.1s | High |
| Random Forest | 0.51 | 45s | Medium |
| XGBoost | 0.45 | 12s | Medium |
| Neural Network | 0.48 | 180s | Low |
| LightGBM | 0.46 | 8s | Medium |
Next Steps
After training your XGBoost regressor, you can:
- Deploy as API for real-time predictions
- Create SHAP plots for model interpretability
- Ensemble with other models for better accuracy
- Add conformal prediction for uncertainty quantification
- Export to production formats (ONNX, Treelite)
- Build automated retraining pipeline
- A/B test against current pricing models