Regularisation

Techniques to prevent overfitting by penalising model complexity.

Regularisation is a technique that prevents overfitting by discouraging the model from becoming too complex. It adds a “penalty” for complexity, so the model has to balance fitting the training data well against staying simple.

A Simple Analogy

Imagine you’re writing an essay and your teacher says “keep it under 500 words.” That word limit is like regularisation — it forces you to focus on what’s important rather than rambling on about every tiny detail.

Without the limit (no regularisation), you might write 5,000 words that perfectly address every nuance of the topic but miss the main point. With the limit, you have to prioritise.

How It Works

Normally, a model tries to minimise its errors on training data. With regularisation, we add an extra rule: “also keep your parameters small.”

Parameters are the numbers the model learns during training (like weights). When parameters get very large, it usually means the model is trying too hard to fit every little bump in the training data.
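To make that concrete, here is a rough sketch of an L2-regularised objective (not the exact formula any particular library uses): the ordinary training error plus a penalty that grows with the squared size of the parameters. The function name and the data shapes are purely illustrative.

import numpy as np

def ridge_loss(w, X, y, alpha):
    # Ordinary squared error on the training data
    mse = np.mean((X @ w - y) ** 2)
    # Extra penalty that grows with the squared size of the weights
    penalty = alpha * np.sum(w ** 2)
    return mse + penalty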

The Two Main Types

L2 Regularisation (Ridge)

Penalises parameters based on their squared values. This shrinks all parameters toward zero, but doesn’t make them exactly zero.

When to use: As a general-purpose default; it works well in most situations.

from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)  # alpha controls penalty strength
model.fit(X_train, y_train)

L1 Regularisation (Lasso)

Penalises parameters based on their absolute values. This can push some parameters all the way to zero, effectively removing those features.

When to use: When you suspect many features are irrelevant and want the model to ignore them.

from sklearn.linear_model import Lasso

model = Lasso(alpha=1.0)
model.fit(X_train, y_train)

Quick Comparison

Type       | What it does                 | Best for
L2 (Ridge) | Makes all parameters smaller | General use
L1 (Lasso) | Makes some parameters zero   | Feature selection
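One way to see the difference in practice is to fit both models on the same data and compare the learned coefficients. The snippet below is only an illustration on made-up data, so the exact numbers aren't important:

import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Made-up data: only the first three of ten features actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] + 2 * X[:, 1] + X[:, 2] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(ridge.coef_)  # every coefficient shrunk, but none exactly zero
print(lasso.coef_)  # coefficients for the irrelevant features pushed to zero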

Choosing the Penalty Strength

The alpha (or lambda) parameter controls how much to penalise complexity:

  • Too low: Not enough penalty, model may still overfit
  • Too high: Too much penalty, model becomes too simple (underfits)
  • Just right: Good balance between fitting data and staying simple

Use cross-validation to find the best value:

from sklearn.linear_model import RidgeCV

# Try several values, pick the best one automatically
model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0])
model.fit(X_train, y_train)

print(f"Best alpha: {model.alpha_}")

In Neural Networks

Regularisation in neural networks is often called weight decay:

# PyTorch: weight decay passed to the optimiser
import torch

optimizer = torch.optim.Adam(model.parameters(), weight_decay=0.01)

# Keras: L2 penalty attached to a layer's weights
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense

Dense(64, kernel_regularizer=regularizers.l2(0.01))
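If it helps to see what weight decay is doing, the sketch below writes the L2 penalty out by hand for a single PyTorch step; the tiny model and random batch are placeholders, and adding the penalty to the loss yourself has roughly the same effect as the weight_decay argument above.

import torch
import torch.nn as nn

# Placeholder model and batch, purely for illustration
model = nn.Linear(4, 1)
inputs, targets = torch.randn(8, 4), torch.randn(8, 1)

loss = nn.MSELoss()(model(inputs), targets)

# Hand-written L2 penalty: sum of squared parameters, scaled by 0.01
l2_penalty = sum((p ** 2).sum() for p in model.parameters())
(loss + 0.01 * l2_penalty).backward()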
