Gradient boosting & XGBoost
Gradient boosting is a powerful ensemble learning method that combines several “weak” learners (typically simple decision trees) to build a single “strong” learner. This technique belongs to a family of algorithms known as boosting, where predictors are trained sequentially rather than independently.
How Gradient Boosting Works
In gradient boosting, trees are built additively: each new tree is created after the previous ones, specifically to correct the deficiencies or mistakes of the trees that came before it. The process follows these core principles (a short code sketch follows the list):
- Sequential Improvement on Residuals: Rather than re-weighting data points like other boosting methods (e.g., AdaBoost), gradient boosting focuses on fitting the new predictor to the residual errors (the difference between the actual label and the current prediction) made by the previous ensemble.
- Gradient Minimisation: The “gradient” in the name refers to gradient descent: each new tree is fitted to the negative gradient of a loss function with respect to the current predictions, and for squared-error loss that negative gradient is exactly the residual.
- Aggregation: Unlike random forests, which train their trees independently and only average the results at the end, gradient boosting accumulates each tree’s scaled contribution as it goes, so the final prediction is the running sum of all the trees.
- Regularisation and Shrinkage: To prevent the model from fitting the training data too perfectly (overfitting), a learning rate (or shrinkage) parameter is used to scale the contribution of each new tree.
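The sketch below illustrates these principles for squared-error regression, using scikit-learn’s DecisionTreeRegressor as the weak learner. It is a minimal illustration under assumed settings (synthetic data, 100 trees of depth 3, a learning rate of 0.1), not a production implementation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data for illustration only.
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

n_trees = 100        # number of sequential weak learners (illustrative)
learning_rate = 0.1  # shrinkage: scales each tree's contribution
trees = []

# Start from a constant prediction (the mean of the targets).
prediction = np.full_like(y, y.mean(), dtype=float)

for _ in range(n_trees):
    # For squared-error loss, the negative gradient is simply the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=3)  # shallow "weak" learner
    tree.fit(X, residuals)
    # Aggregate along the way: add this tree's scaled contribution.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

def predict(X_new, base=y.mean()):
    """Final prediction = base value + sum of every tree's scaled contribution."""
    return base + learning_rate * sum(t.predict(X_new) for t in trees)

print("Training MSE:", np.mean((y - predict(X)) ** 2))
```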
XGBoost (eXtreme Gradient Boosting)
XGBoost is a popular and highly efficient implementation of gradient boosting that has gained fame for its performance in winning numerous machine learning competitions, such as those on Kaggle. It is specifically designed to be performant and scalable, often outperforming other models on tabular data.
Key features of XGBoost include:
- Similarity Score: XGBoost uses a specific metric called a similarity score to determine the best way to split a node in its decision trees; splits that group similar residuals together and separate large values from small ones yield a higher gain.
- Automatic Handling of Complexities: It automatically handles non-linearities and feature interactions (relationships between different input features) without the need for manual feature engineering.
- Pruning: To combat overfitting, XGBoost employs a pruning step. It calculates the “similarity gain” for a potential split and discards the split if the gain does not meet a certain threshold (the minimum split loss, controlled by the gamma parameter).
- Regularisation Hyperparameters: It includes specific parameters, such as lambda (λ), to penalise model complexity and prevent the leaf weights from becoming too large; the sketch after this list shows where these parameters are set.
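As a minimal usage sketch, the snippet below shows where these hyperparameters appear in the xgboost Python package (assuming it is installed alongside scikit-learn); the parameter values are placeholders rather than tuned recommendations. For squared-error regression, the similarity score works out to (sum of residuals)² divided by (number of residuals + λ), which is why increasing reg_lambda shrinks the gain of every candidate split.

```python
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only.
X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = xgb.XGBRegressor(
    n_estimators=200,    # number of boosted trees (illustrative)
    learning_rate=0.1,   # shrinkage applied to each tree's contribution
    max_depth=4,         # depth of each weak learner
    reg_lambda=1.0,      # lambda: L2 penalty on leaf weights
    gamma=0.1,           # minimum split loss a split must achieve to be kept
)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```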
Practical Considerations
While gradient boosting and XGBoost generally offer superior predictive performance compared to methods like random forests, they are sensitive to noise and outliers. They are also prone to overfitting if hyperparameters—such as the number of estimators, maximum tree depth, and learning rate—are not tuned carefully. Practitioners often use techniques like grid search or Bayesian optimisation to find the optimal combination of these settings for their specific dataset.
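As one possible approach, the sketch below runs a small grid search over three of the hyperparameters mentioned above using scikit-learn’s GridSearchCV; the grid values are illustrative and would normally be chosen to suit the dataset.

```python
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a real tabular dataset.
X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)

# A deliberately small grid over the hyperparameters discussed above.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    xgb.XGBRegressor(),
    param_grid,
    cv=5,                              # 5-fold cross-validation
    scoring="neg_mean_squared_error",  # lower MSE is better
)
search.fit(X, y)
print("Best parameters found:", search.best_params_)
```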
Analogy for Gradient Boosting: Imagine you are a golfer trying to get the ball into the hole. Your first swing (the first weak learner) gets the ball halfway there. Instead of starting over from the tee for your next shot, you walk to where the ball landed and take another swing aimed specifically at the remaining distance to the hole (the residual error). Each subsequent stroke is a “weak learner” that focuses only on correcting the mistake of the previous shot, gradually bringing the ball closer to the target until you finally sink the putt.