ReLU
ReLU (Rectified Linear Unit) is the most popular activation function in neural networks. It’s remarkably simple: if the input is positive, keep it; if negative, make it zero.
The Formula

f(x) = max(0, x)

That’s it. The simplest activation function that actually works well.
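A minimal NumPy sketch (the function name `relu` and the sample inputs are just for illustration):

```python
import numpy as np

def relu(x):
    # Keep positive values, clamp everything else to zero
    return np.maximum(0, x)

print(relu(np.array([-5, -1, 0, 1, 5, 100])))
# negatives become 0, positives pass through unchanged (see the table below)
```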
How It Works
| Input | Output |
|---|---|
| -5 | 0 |
| -1 | 0 |
| 0 | 0 |
| 1 | 1 |
| 5 | 5 |
| 100 | 100 |
Negative → Zero. Positive → Unchanged.
The shape is like a hockey stick lying flat: horizontal at zero for all negative inputs, then rising diagonally for positive inputs.
Why ReLU?
Before ReLU, neural networks used sigmoid or tanh activation functions. These caused a serious problem called the vanishing gradient.
The Vanishing Gradient Problem
When training deep networks with sigmoid/tanh:
- Gradients get multiplied together through layers
- The sigmoid derivative is at most 0.25, so each layer multiplies the gradient by a small fraction
- By the time you reach early layers, gradients are nearly zero
- Early layers stop learning → network doesn’t improve
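A rough numeric sketch of the effect (the 20-layer depth is an arbitrary choice for illustration): even in the best case, each sigmoid layer passes on at most a quarter of the gradient.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # peaks at 0.25 when x = 0

layers = 20
# Best case: every layer sits exactly at its point of maximum gradient
print(sigmoid_grad(0.0) ** layers)   # ~9e-13 -- effectively zero by the early layers
```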
How ReLU Fixes It
- For positive inputs, the gradient is always 1 (not a tiny fraction)
- Gradients don’t shrink as they flow backward
- Deep networks can actually learn
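Continuing the same toy calculation: the ReLU derivative for any active (positive) input is exactly 1, so stacking layers does not shrink the gradient.

```python
def relu_grad(x):
    # 1 for positive inputs, 0 for negative inputs
    return 1.0 if x > 0 else 0.0

layers = 20
# As long as the neurons on the path stay active, the product of derivatives stays 1
print(relu_grad(3.0) ** layers)   # 1.0
```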
ReLU vs Sigmoid vs Tanh
| Property | ReLU | Sigmoid | Tanh |
|---|---|---|---|
| Formula | max(0, x) | 1/(1+e⁻ˣ) | (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) |
| Output range | [0, ∞) | (0, 1) | (-1, 1) |
| Gradient (max) | 1 | 0.25 | 1 |
| Vanishing gradient? | Less prone | Yes | Yes |
| Computation | Very fast | Slower | Slower |
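A quick check of the "Gradient (max)" row, using the fact that sigmoid and tanh are steepest at x = 0 (the specific test points are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.0
print(sigmoid(x) * (1 - sigmoid(x)))    # 0.25 -- sigmoid's steepest point
print(1 - np.tanh(x) ** 2)              # 1.0  -- tanh's steepest point

relu_grad = lambda z: 1.0 if z > 0 else 0.0
print(relu_grad(2.0))                   # 1.0  -- ReLU gradient for any positive input
```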
Benefits of ReLU
- Simple — just a max operation, very fast to compute
- Sparse activation — many neurons output zero, which is efficient (see the sketch after this list)
- No vanishing gradient — gradient is 1 for positive values
- Biologically plausible — neurons either fire or don’t
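As a rough illustration of the sparsity point (the zero-mean random inputs are an assumption for the demo, not a claim about any particular network):

```python
import numpy as np

rng = np.random.default_rng(0)
pre_activations = rng.standard_normal(10_000)    # zero-mean inputs to an imaginary layer
activations = np.maximum(0, pre_activations)

print(np.mean(activations == 0))   # roughly 0.5 -- about half the neurons stay silent
```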
The Dying ReLU Problem
ReLU has one weakness: neurons can “die.”
If a neuron always receives negative input, it always outputs zero. With zero output, the gradient is also zero, so the weights never update. The neuron is stuck — permanently dead.
This can happen when:
- Learning rate is too high
- Bad weight initialisation
- Unlucky data distribution
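A toy sketch of a single dying neuron (the weights, bias, and inputs are made up for illustration): once the pre-activation is negative for every input it sees, the output and the gradient are both zero, so nothing ever updates.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

# Toy neuron: output = relu(w * x + b), with a badly negative bias
w, b = 0.5, -10.0
inputs = np.array([0.3, 1.2, 2.5, 0.8])

pre = w * inputs + b                       # negative for every input
out = relu(pre)                            # all zeros
grad_w = np.where(pre > 0, inputs, 0.0)    # ReLU's zero gradient blocks any update

print(out)      # [0. 0. 0. 0.]
print(grad_w)   # [0. 0. 0. 0.]  -> the weight never changes; the neuron is stuck
```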
ReLU Variants
Several variants fix the dying ReLU problem by keeping a small non-zero output (and therefore a non-zero gradient) for negative inputs:
Leaky ReLU
Instead of zero for negatives, use a small slope (typically 0.01):

f(x) = x if x > 0, else 0.01x
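A minimal NumPy sketch with the typical α = 0.01:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Positive inputs pass through; negative inputs keep a small non-zero slope
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-5.0, -1.0, 0.0, 1.0, 5.0])))
# negatives are scaled by 0.01 instead of being zeroed out
```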
Parametric ReLU (PReLU)
Like Leaky ReLU, but the slope α is learned during training:

f(x) = x if x > 0, else αx
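Since the slope is a trainable parameter, it is easiest to show with a framework; a minimal PyTorch sketch (PyTorch is just one option here, not something the text above prescribes):

```python
import torch
import torch.nn as nn

prelu = nn.PReLU()   # a single learnable slope, initialised to 0.25
x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])

# Negative inputs are scaled by the current slope; the slope itself is
# updated by backpropagation like any other weight during training.
print(prelu(x))
```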
ELU (Exponential Linear Unit)
Instead of a sharp corner at zero, the negative side curves smoothly toward −α (commonly α = 1):

f(x) = x if x > 0, else α(eˣ − 1)
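A NumPy sketch with the common default α = 1:

```python
import numpy as np

def elu(x, alpha=1.0):
    # Identity for positives; a smooth exponential approach to -alpha for negatives
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

print(elu(np.array([-5.0, -1.0, 0.0, 1.0, 5.0])))
# very negative inputs approach -1; positive inputs match plain ReLU
```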
GELU (Gaussian Error Linear Unit)
Used in transformers (like GPT). Smoothly gates values based on how positive they are:

f(x) = x · Φ(x)

where Φ is the cumulative distribution function of the standard normal distribution.
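A NumPy/SciPy sketch of the exact form (scipy.special.erf supplies the error function, from which the normal CDF follows):

```python
import numpy as np
from scipy.special import erf

def gelu(x):
    # Phi(x): the standard normal CDF, written via the error function
    phi = 0.5 * (1.0 + erf(x / np.sqrt(2.0)))
    return x * phi

print(gelu(np.array([-3.0, -1.0, 0.0, 1.0, 3.0])))
# large negatives are gated almost to zero; large positives pass through almost unchanged
```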
Quick Comparison
| Variant | Negative Side | Use Case |
|---|---|---|
| ReLU | Zero | Default choice, CNNs |
| Leaky ReLU | Small slope (0.01x) | When ReLUs are dying |
| PReLU | Learned slope | When you have lots of data |
| ELU | Smooth curve | Faster convergence |
| GELU | Smooth gate | Transformers, NLP |
When to Use ReLU
- Default choice for hidden layers in most networks
- Especially good for convolutional neural networks (CNNs)
- Use variants (Leaky, ELU) if you notice dying neurons
When NOT to Use ReLU
- Output layer for classification → use Softmax instead
- Output layer for regression → use linear (no activation)
- Recurrent networks (RNNs) → tanh often works better
Key Takeaways
- ReLU = max(0, x) — dead simple
- It largely solved the vanishing gradient problem that plagued early deep learning
- Fast to compute, works well in practice
- Watch out for dying neurons — use Leaky ReLU if needed
- Still the default activation function for most neural networks
See Also
- Perceptron — the basic neuron that uses activation functions
- Softmax — activation for multi-class output layers
- Deep Q Networks (DQN) — uses ReLU in hidden layers