ReLU
ReLU (Rectified Linear Unit) is the most popular activation function in neural networks. It’s remarkably simple: if the input is positive, keep it; if negative, make it zero.
The Formula

f(x) = max(0, x)

That’s it. The simplest activation function that actually works well.
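A minimal NumPy sketch (the function name `relu` and the sample inputs are just for illustration):

```python
import numpy as np

def relu(x):
    # Keep positive values, clamp everything else to zero
    return np.maximum(0, x)

print(relu(np.array([-5, -1, 0, 1, 5, 100])))
# negatives become 0, positives pass through unchanged (see the table below)
```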
How It Works
| Input | Output |
|---|---|
| -5 | 0 |
| -1 | 0 |
| 0 | 0 |
| 1 | 1 |
| 5 | 5 |
| 100 | 100 |
Negative → Zero. Positive → Unchanged.
The shape is like a hockey stick lying flat: horizontal at zero for all negative inputs, then rising diagonally for positive inputs.
Why ReLU?
Before ReLU, neural networks used sigmoid or tanh activation functions. These caused a serious problem called the vanishing gradient.
The Vanishing Gradient Problem
When training deep networks with sigmoid/tanh:
- Gradients get multiplied together through layers
- The sigmoid derivative is at most 0.25, so each layer multiplies the gradient by a small fraction
- By the time you reach early layers, gradients are nearly zero
- Early layers stop learning → network doesn’t improve
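A rough numeric sketch of the effect (the 20-layer depth is an arbitrary choice for illustration): even in the best case, each sigmoid layer passes on at most a quarter of the gradient.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # peaks at 0.25 when x = 0

layers = 20
# Best case: every layer sits exactly at its point of maximum gradient
print(sigmoid_grad(0.0) ** layers)   # ~9e-13 -- effectively zero by the early layers
```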
How ReLU Fixes It
- For positive inputs, the gradient is always 1 (not a tiny fraction)
- Gradients don’t shrink as they flow backward
- Deep networks can actually learn
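Continuing the same toy calculation: the ReLU derivative for any active (positive) input is exactly 1, so stacking layers does not shrink the gradient.

```python
def relu_grad(x):
    # 1 for positive inputs, 0 for negative inputs
    return 1.0 if x > 0 else 0.0

layers = 20
# As long as the neurons on the path stay active, the product of derivatives stays 1
print(relu_grad(3.0) ** layers)   # 1.0
```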
ReLU vs Sigmoid vs Tanh
| Property | ReLU | Sigmoid | Tanh |
|---|---|---|---|
| Formula | max(0, x) | 1/(1+e⁻ˣ) | (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) |
| Output range | [0, ∞) | (0, 1) | (-1, 1) |
| Gradient (max) | 1 | 0.25 | 1 |
| Vanishing gradient? | Less prone | Yes | Yes |
| Computation | Very fast | Slower | Slower |
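A quick check of the "Gradient (max)" row, using the fact that sigmoid and tanh are steepest at x = 0 (the specific test points are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.0
print(sigmoid(x) * (1 - sigmoid(x)))    # 0.25 -- sigmoid's steepest point
print(1 - np.tanh(x) ** 2)              # 1.0  -- tanh's steepest point

relu_grad = lambda z: 1.0 if z > 0 else 0.0
print(relu_grad(2.0))                   # 1.0  -- ReLU gradient for any positive input
```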
Benefits of ReLU
- Simple — just a max operation, very fast to compute
- Sparse activation — many neurons output zero, which is efficient (see the sketch after this list)
- No vanishing gradient — gradient is 1 for positive values
- Biologically plausible — neurons either fire or don’t
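As a rough illustration of the sparsity point (the zero-mean random inputs are an assumption for the demo, not a claim about any particular network):

```python
import numpy as np

rng = np.random.default_rng(0)
pre_activations = rng.standard_normal(10_000)    # zero-mean inputs to an imaginary layer
activations = np.maximum(0, pre_activations)

print(np.mean(activations == 0))   # roughly 0.5 -- about half the neurons stay silent
```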
The Dying ReLU Problem
ReLU has one weakness: neurons can “die.”
If a neuron always receives negative input, it always outputs zero. With zero output, the gradient is also zero, so the weights never update. The neuron is stuck — permanently dead.
This can happen when:
- Learning rate is too high
- Bad weight initialisation
- Unlucky data distribution
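A toy sketch of a single dying neuron (the weights, bias, and inputs are made up for illustration): once the pre-activation is negative for every input it sees, the output and the gradient are both zero, so nothing ever updates.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

# Toy neuron: output = relu(w * x + b), with a badly negative bias
w, b = 0.5, -10.0
inputs = np.array([0.3, 1.2, 2.5, 0.8])

pre = w * inputs + b                       # negative for every input
out = relu(pre)                            # all zeros
grad_w = np.where(pre > 0, inputs, 0.0)    # ReLU's zero gradient blocks any update

print(out)      # [0. 0. 0. 0.]
print(grad_w)   # [0. 0. 0. 0.]  -> the weight never changes; the neuron is stuck
```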
ReLU Variants
Several variants fix the dying ReLU problem by keeping a small non-zero output (and therefore a non-zero gradient) for negative inputs:
Leaky ReLU
Instead of zero for negatives, use a small slope (typically 0.01):

f(x) = x if x > 0, else 0.01x
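A minimal NumPy sketch with the typical α = 0.01:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Positive inputs pass through; negative inputs keep a small non-zero slope
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-5.0, -1.0, 0.0, 1.0, 5.0])))
# negatives are scaled by 0.01 instead of being zeroed out
```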
Parametric ReLU (PReLU)
Like Leaky ReLU, but the slope α is learned during training:

f(x) = x if x > 0, else αx
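Since the slope is a trainable parameter, it is easiest to show with a framework; a minimal PyTorch sketch (PyTorch is just one option here, not something the text above prescribes):

```python
import torch
import torch.nn as nn

prelu = nn.PReLU()   # a single learnable slope, initialised to 0.25
x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])

# Negative inputs are scaled by the current slope; the slope itself is
# updated by backpropagation like any other weight during training.
print(prelu(x))
```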
ELU (Exponential Linear Unit)
Instead of a sharp corner at zero, the negative side curves smoothly toward −α (commonly α = 1):

f(x) = x if x > 0, else α(eˣ − 1)
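A NumPy sketch with the common default α = 1:

```python
import numpy as np

def elu(x, alpha=1.0):
    # Identity for positives; a smooth exponential approach to -alpha for negatives
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

print(elu(np.array([-5.0, -1.0, 0.0, 1.0, 5.0])))
# very negative inputs approach -1; positive inputs match plain ReLU
```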
GELU (Gaussian Error Linear Unit)
Used in transformers (like GPT). Smoothly gates values based on how positive they are:

f(x) = x · Φ(x)

where Φ is the cumulative distribution function of the standard normal distribution.
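A NumPy/SciPy sketch of the exact form (scipy.special.erf supplies the error function, from which the normal CDF follows):

```python
import numpy as np
from scipy.special import erf

def gelu(x):
    # Phi(x): the standard normal CDF, written via the error function
    phi = 0.5 * (1.0 + erf(x / np.sqrt(2.0)))
    return x * phi

print(gelu(np.array([-3.0, -1.0, 0.0, 1.0, 3.0])))
# large negatives are gated almost to zero; large positives pass through almost unchanged
```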
Quick Comparison
| Variant | Negative Side | Use Case |
|---|---|---|
| ReLU | Zero | Default choice, CNNs |
| Leaky ReLU | Small slope (0.01x) | When ReLUs are dying |
| PReLU | Learned slope | When you have lots of data |
| ELU | Smooth curve | Faster convergence |
| GELU | Smooth gate | Transformers, NLP |
When to Use ReLU
- Default choice for hidden layers in most networks
- Especially good for convolutional neural networks (CNNs)
- Use variants (Leaky, ELU) if you notice dying neurons
When NOT to Use ReLU
- Output layer for classification → use Softmax instead
- Output layer for regression → use linear (no activation)
- Recurrent networks (RNNs) → tanh often works better
Key Takeaways
- ReLU = max(0, x) — dead simple
- It largely solved the vanishing gradient problem that plagued early deep learning
- Fast to compute, works well in practice
- Watch out for dying neurons — use Leaky ReLU if needed
- Still the default activation function for most neural networks
See Also
- Perceptron — the basic neuron that uses activation functions
- Softmax — activation for multi-class output layers
- Deep Q Networks (DQN) — uses ReLU in hidden layers