Overfitting
When a model memorises training data instead of learning generalisable patterns.
Overfitting is when a model learns the training data too well — including all its quirks and noise — instead of learning the general pattern. It’s like memorising answers to practice questions instead of understanding the subject. You’ll ace the practice test but fail the real exam.
A Simple Analogy
Imagine you’re learning to recognise cats. If you only see 5 cats during training, you might conclude “cats are orange” because 3 of them happened to be orange. You’ve overfit to your small sample. When you see a black cat, you’d incorrectly say “not a cat.”
A well-trained model learns the general features of cats (pointy ears, whiskers, fur) rather than memorising the specific cats it saw.
How to Spot Overfitting
Your model is overfitting if:
- It scores very high on training data (e.g., 99%)
- It scores much lower on new test data (e.g., 70%)
The gap between these scores is the warning sign.
```python
# Check for overfitting: compare performance on seen vs unseen data
train_score = model.score(X_train, y_train)  # e.g. 0.99
test_score = model.score(X_test, y_test)     # e.g. 0.70
# A big gap between the two scores is the red flag
```
Why Does It Happen?
- Too little data — easier to memorise a small dataset
- Model too complex — a model with many parameters can memorise instead of learn
- Training too long — given enough time, the model starts fitting the noise
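The "model too complex" point can be seen in a tiny NumPy sketch (the data, polynomial degrees, and random seed here are all illustrative). A degree-9 polynomial has enough parameters to pass through every one of 10 noisy training points, while a straight line can only capture the underlying trend:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a simple underlying pattern: y = 2x + noise
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.3, size=10)
x_test = np.linspace(0.05, 0.95, 10)
y_test = 2 * x_test + rng.normal(0, 0.3, size=10)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

simple = np.polyfit(x_train, y_train, deg=1)    # matches the true pattern
complex_ = np.polyfit(x_train, y_train, deg=9)  # enough parameters to memorise

# The degree-9 fit threads through every training point (near-zero error),
# but on fresh test points it does worse than the straight line tends to.
print(mse(simple, x_train, y_train), mse(simple, x_test, y_test))
print(mse(complex_, x_train, y_train), mse(complex_, x_test, y_test))
```

The complex model's training error is essentially zero because it memorised the noise; its test error exposes that.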
How to Fix It
Get More Data
More examples make memorisation harder and force the model to find real patterns.
Simplify the Model
Use fewer features or a smaller model. A simpler model can’t memorise as much.
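As a hedged illustration of capping complexity (using scikit-learn's `DecisionTreeClassifier` on synthetic data; the dataset and depth limit are arbitrary choices, not a recipe): an unconstrained decision tree will grow until it classifies every training example perfectly, while `max_depth` limits how much it can memorise.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unconstrained tree: grows until it memorises the training set
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Constrained tree: max_depth caps how much detail it can memorise
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print(deep.score(X_train, y_train), deep.score(X_test, y_test))
print(shallow.score(X_train, y_train), shallow.score(X_test, y_test))
```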
Use Regularisation
Add a penalty that discourages the model from becoming too complex — see Regularisation.
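A minimal sketch of the idea, assuming scikit-learn's `Ridge` (the data shape and `alpha` value are illustrative): with almost as many features as samples, plain least squares can fit the training set nearly perfectly even though only one feature actually matters, while the ridge penalty shrinks the noise coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# 30 samples, 25 features, but only the first feature matters
X = rng.normal(size=(30, 25))
y = 3 * X[:, 0] + rng.normal(0, 0.5, size=30)
X_new = rng.normal(size=(30, 25))
y_new = 3 * X_new[:, 0] + rng.normal(0, 0.5, size=30)

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # alpha controls the penalty strength

# Plain regression fits the training data better (it has no penalty),
# but the penalised model generalises better to fresh data.
print(plain.score(X, y), plain.score(X_new, y_new))
print(ridge.score(X, y), ridge.score(X_new, y_new))
```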
Early Stopping
Stop training before the model starts memorising:
```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop when validation performance hasn't improved for 5 epochs
early_stop = EarlyStopping(monitor='val_loss', patience=5)
```
Dropout (for Neural Networks)
Randomly turn off (zero out) some neurons during each training step, so the network can't rely too heavily on any single path.
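A toy NumPy sketch of the mechanism ("inverted" dropout, the variant most frameworks use; the rate and array shapes here are illustrative). Surviving activations are rescaled so their expected value is unchanged, which is why nothing special is needed at test time:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate=0.5, training=True):
    """Inverted dropout: zero a random fraction of units during training
    and rescale the survivors so the expected activation is unchanged."""
    if not training:
        return activations  # at inference time, use the full network
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

a = np.ones(10)
print(dropout(a, rate=0.5))        # roughly half the units zeroed, rest scaled up
print(dropout(a, training=False))  # unchanged at inference
```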
The Opposite Problem: Underfitting
If your model performs poorly on both training and test data, it’s underfitting — the model is too simple to capture the pattern at all.
| Problem | Training Score | Test Score | Fix |
|---|---|---|---|
| Underfitting | Low | Low | More complex model |
| Overfitting | High | Low | Simpler model, more data |
| Good fit | High | High | You’re done! |