Random Forests

A Random Forest is a powerful and versatile machine learning model that belongs to a class of techniques known as Ensemble Learning. It functions by combining the predictions of multiple individual decision trees to create a “strong learner” that is often more accurate and robust than its constituent parts.

Here is a detailed breakdown of how random forests work and why they are used:

Core Mechanism: Bagging and Feature Randomness

Random forests are primarily built using two techniques to ensure the individual trees are diverse:

  • Bagging (Bootstrap Aggregating): The algorithm takes random samples of the data with replacement (meaning some instances may be sampled multiple times for one tree while others may not be sampled at all) and builds a separate decision tree on each of these subsets.
  • Feature Randomness: When growing a tree, the algorithm does not search for the best feature among all available features to split a node. Instead, it searches for the best feature among a random subset of features. This ensures that the trees are not too correlated, which helps reduce the overall variance of the ensemble. Both ideas are illustrated in the sketch after this list.
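A minimal sketch of these two ideas, assuming scikit-learn and a synthetic make_moons dataset (both are illustrative choices): a BaggingClassifier wrapped around decision trees that consider a random subset of features at each split behaves much like a RandomForestClassifier, which bundles both techniques into a single estimator.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

# Bagging spelled out: each tree is trained on a bootstrap sample (drawn with
# replacement), and each split considers only a random subset of features.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt"),  # feature randomness per split
    n_estimators=100,
    bootstrap=True,   # sample training instances with replacement
    random_state=42,
)
bag_clf.fit(X, y)

# RandomForestClassifier combines both ideas in a single, more convenient API.
rnd_clf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                 random_state=42)
rnd_clf.fit(X, y)
```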

Making Predictions

Once the forest is trained, it makes predictions by aggregating the outputs of all its individual trees:

  • For Classification: The final prediction is typically decided by majority vote (hard voting). If the individual trees can estimate class probabilities, the forest can instead use soft voting, which averages those probabilities and thereby gives more weight to highly confident votes.
  • For Regression: The final prediction is the average of the predictions from all the individual trees. Both aggregation modes are shown in the sketch after this list.
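A short sketch of the aggregation step, again assuming scikit-learn and synthetic data; note that scikit-learn's RandomForestClassifier.predict() uses soft voting (it averages the trees' class probabilities), so the manual hard vote below is only for comparison.

```python
import numpy as np
from sklearn.datasets import make_moons, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Soft voting: average the class probabilities, pick the most probable class.
print(rnd_clf.predict(X[:3]))
print(rnd_clf.predict_proba(X[:3]))

# Manual hard vote: majority class across the individual trees (labels are 0/1).
tree_preds = np.array([tree.predict(X[:3]) for tree in rnd_clf.estimators_])
print(np.round(tree_preds.mean(axis=0)).astype(int))

# Regression: the forest simply averages the trees' numeric predictions.
Xr, yr = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=42)
rnd_reg = RandomForestRegressor(n_estimators=100, random_state=42).fit(Xr, yr)
print(rnd_reg.predict(Xr[:3]))  # mean of the 100 trees' outputs
```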

Key Benefits

  • Reduced Overfitting: While a single decision tree is highly prone to overfitting (memorising the noise in a dataset), a random forest generalises better because the errors of individual trees tend to cancel each other out.
  • Robustness: By not being tied to one specific way of splitting the data, the model is less sensitive to limitations of a training set that might not be fully representative. (The improvement comes mainly from reduced variance; the ensemble's bias remains roughly that of its individual trees.)
  • Feature Importance: Random forests provide a handy way to understand which features actually matter. Scikit-Learn, for instance, computes a score for each feature based on how much the splits that use it reduce impurity on average across all the trees in the forest (a sketch follows this list).
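A brief sketch of reading these scores, assuming scikit-learn's bundled iris dataset as an illustrative example; feature_importances_ holds one impurity-based score per feature, scaled so the scores sum to 1.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(iris.data, iris.target)

# One score per feature; higher means the feature contributed more to the splits.
for name, score in zip(iris.feature_names, rnd_clf.feature_importances_):
    print(f"{name}: {score:.3f}")
```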

Practical Considerations

  • Hyperparameters: Users can tune several “knobs” before training, such as the number of trees in the forest, the maximum depth allowed for each tree, and the number of features considered for each split.
  • Efficiency: Because each tree in the forest is independent of the others, the trees can be trained in parallel across multiple CPU cores, making the model highly scalable.
  • Variations: A common variation is Extra-Trees (Extremely Randomized Trees), which uses random thresholds for each feature rather than searching for the best possible ones, making them even faster to train. A sketch of these options follows.
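A final sketch pulling these options together, assuming scikit-learn; the parameter values are illustrative, not recommendations.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

rnd_clf = RandomForestClassifier(
    n_estimators=200,     # number of trees in the forest
    max_depth=10,         # maximum depth allowed for each tree
    max_features="sqrt",  # number of features considered at each split
    n_jobs=-1,            # train the trees in parallel on all CPU cores
    random_state=42,
)
rnd_clf.fit(X, y)

# Extra-Trees: same API, but splits use random thresholds instead of searching
# for the best one, which typically makes training faster.
ext_clf = ExtraTreesClassifier(n_estimators=200, n_jobs=-1, random_state=42)
ext_clf.fit(X, y)
```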