Overfitting
When a model memorises training data instead of learning generalisable patterns.
Overfitting is when a model learns the training data too well — including all its quirks and noise — instead of learning the general pattern. It’s like memorising answers to practice questions instead of understanding the subject. You’ll ace the practice test but fail the real exam.
A Simple Analogy
Imagine you’re learning to recognise cats. If you only see 5 cats during training, you might conclude “cats are orange” because 3 of them happened to be orange. You’ve overfit to your small sample. When you see a black cat, you’d incorrectly say “not a cat.”
A well-trained model learns the general features of cats (pointy ears, whiskers, fur) rather than memorising the specific cats it saw.
How to Spot Overfitting
Your model is overfitting if:
- It scores very high on training data (e.g., 99%)
- It scores much lower on new test data (e.g., 70%)
The gap between these scores is the warning sign.
```python
# Check for overfitting: compare performance on seen vs unseen data
train_score = model.score(X_train, y_train)  # e.g. 0.99
test_score = model.score(X_test, y_test)     # e.g. 0.70
# A big gap between the two scores is the red flag
```
Why Does It Happen?
- Too little data — easier to memorise a small dataset
- Model too complex — a model with many parameters can memorise instead of learn
- Training too long — given enough time, the model starts fitting the noise
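The "model too complex" point can be seen in a tiny NumPy sketch (the data, polynomial degrees, and random seed here are all illustrative). A degree-9 polynomial has enough parameters to pass through every one of 10 noisy training points, while a straight line can only capture the underlying trend:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a simple underlying pattern: y = 2x + noise
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.3, size=10)
x_test = np.linspace(0.05, 0.95, 10)
y_test = 2 * x_test + rng.normal(0, 0.3, size=10)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

simple = np.polyfit(x_train, y_train, deg=1)    # matches the true pattern
complex_ = np.polyfit(x_train, y_train, deg=9)  # enough parameters to memorise

# The degree-9 fit threads through every training point (near-zero error),
# but on fresh test points it does worse than the straight line tends to.
print(mse(simple, x_train, y_train), mse(simple, x_test, y_test))
print(mse(complex_, x_train, y_train), mse(complex_, x_test, y_test))
```

The complex model's training error is essentially zero because it memorised the noise; its test error exposes that.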
How to Fix It
Get More Data
More examples make memorisation harder and force the model to find real patterns.
Simplify the Model
Use fewer features or a smaller model. A simpler model can’t memorise as much.
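As a hedged illustration of capping complexity (using scikit-learn's `DecisionTreeClassifier` on synthetic data; the dataset and depth limit are arbitrary choices, not a recipe): an unconstrained decision tree will grow until it classifies every training example perfectly, while `max_depth` limits how much it can memorise.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unconstrained tree: grows until it memorises the training set
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Constrained tree: max_depth caps how much detail it can memorise
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print(deep.score(X_train, y_train), deep.score(X_test, y_test))
print(shallow.score(X_train, y_train), shallow.score(X_test, y_test))
```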
Use Regularisation
Add a penalty that discourages the model from becoming too complex — see Regularisation.
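A minimal sketch of the idea, assuming scikit-learn's `Ridge` (the data shape and `alpha` value are illustrative): with almost as many features as samples, plain least squares can fit the training set nearly perfectly even though only one feature actually matters, while the ridge penalty shrinks the noise coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# 30 samples, 25 features, but only the first feature matters
X = rng.normal(size=(30, 25))
y = 3 * X[:, 0] + rng.normal(0, 0.5, size=30)
X_new = rng.normal(size=(30, 25))
y_new = 3 * X_new[:, 0] + rng.normal(0, 0.5, size=30)

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # alpha controls the penalty strength

# Plain regression fits the training data better (it has no penalty),
# but the penalised model generalises better to fresh data.
print(plain.score(X, y), plain.score(X_new, y_new))
print(ridge.score(X, y), ridge.score(X_new, y_new))
```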
Early Stopping
Stop training before the model starts memorising:
```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop when validation performance hasn't improved for 5 epochs
early_stop = EarlyStopping(monitor='val_loss', patience=5)
```
Dropout (for Neural Networks)
Randomly turn off (zero out) some neurons during each training step, so the network can't rely too heavily on any single path.
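A toy NumPy sketch of the mechanism ("inverted" dropout, the variant most frameworks use; the rate and array shapes here are illustrative). Surviving activations are rescaled so their expected value is unchanged, which is why nothing special is needed at test time:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate=0.5, training=True):
    """Inverted dropout: zero a random fraction of units during training
    and rescale the survivors so the expected activation is unchanged."""
    if not training:
        return activations  # at inference time, use the full network
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

a = np.ones(10)
print(dropout(a, rate=0.5))        # roughly half the units zeroed, rest scaled up
print(dropout(a, training=False))  # unchanged at inference
```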
The Opposite Problem: Underfitting
If your model performs poorly on both training and test data, it’s underfitting — the model is too simple to capture the pattern at all.
| Problem | Training Score | Test Score | Fix |
|---|---|---|---|
| Underfitting | Low | Low | More complex model |
| Overfitting | High | Low | Simpler model, more data |
| Good fit | High | High | You’re done! |