Gradient Descent
Gradient Descent Definition
Gradient descent is an iterative optimisation algorithm used to find the minimum of a Cost Function by repeatedly moving in the direction of steepest descent (opposite to the gradient). It’s the workhorse algorithm for training machine learning models, particularly neural networks.
Understanding the Update Rule
The following equations represent the fundamental gradient descent update rule:

$$w = w - \alpha \frac{\partial J(w,b)}{\partial w}$$
$$b = b - \alpha \frac{\partial J(w,b)}{\partial b}$$

Note: In these equations, = is assignment, not assertion, and both w and b must be updated simultaneously, i.e. the w in the second equation is the original value, not the value produced by the first equation.
Components:
- $w$: The weight parameter (or parameter vector) being optimised
- $\alpha$ (alpha): The learning rate, which controls how big each step is
- $J(w,b)$: The cost/loss function that measures how wrong our predictions are
- $\frac{\partial J(w,b)}{\partial w}$: The partial derivative (gradient) of the cost function with respect to $w$
- $b$: The bias parameter (another parameter being optimised)
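To make the rule concrete, here is a single update step with made-up numbers: suppose the current weight is $w = 3$, the learning rate is $\alpha = 0.1$, and the gradient at the current position is $\frac{\partial J}{\partial w} = 4$. Then:

$$w = 3 - 0.1 \times 4 = 2.6$$

The positive gradient says that increasing $w$ would increase the cost, so the update moves $w$ down, from 3 to 2.6.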
How It Works:
- Calculate the gradient: Find $\frac{\partial J(w,b)}{\partial w}$, which tells us the slope of the cost function at the current position. This indicates which direction increases the cost most steeply.
- Move opposite to the gradient: Since we want to minimise the cost, we move in the negative direction of the gradient (hence the minus sign).
- Scale by learning rate: Multiply by $\alpha$ to control step size (see the sketch after this list):
    - Too large $\alpha$: might overshoot the minimum
    - Too small $\alpha$: convergence will be very slow
- Update the parameter: Replace the old value of $w$ with the new value.
- Repeat: Continue this process until convergence (when changes become negligibly small) or for a fixed number of iterations.
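The effect of the learning rate is easy to see on a toy problem. The sketch below (my own illustration, not part of the implementation that follows) runs gradient descent on the one-dimensional cost $J(w) = (w - 2)^2$, whose gradient is $2(w - 2)$ and whose minimum is at $w = 2$, with three different values of $\alpha$:

```python
# Minimal sketch: gradient descent on J(w) = (w - 2)^2, minimum at w = 2.
# The gradient is dJ/dw = 2 * (w - 2).
def descend(alpha, w=0.0, steps=20):
    for _ in range(steps):
        w = w - alpha * 2 * (w - 2)  # update rule: w = w - alpha * dJ/dw
    return w

print(descend(0.01))  # too small: after 20 steps w is only ~0.66, far from 2
print(descend(0.1))   # reasonable: converges to ~1.98, very close to 2
print(descend(1.1))   # too large: every step overshoots; w ends up near -75
```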
Intuitive Analogy:
Imagine you’re blindfolded on a hillside trying to reach the valley (minimum). At each step:
- You feel the slope under your feet (calculate gradient)
- You take a step downhill (opposite to upward slope)
- The learning rate determines how big your steps are
- You repeat until you reach the bottom
Typical Implementation:
```python
# Micropip is used by the CodeEmitter Obsidian plugin to install packages
import micropip
await micropip.install('numpy')

import numpy as np

def compute_gradient(X, y, w, b):
    """
    Calculate the gradient (slope) of the cost function.
    This tells us which direction to adjust w and b.

    X: features (m x n) - input data WITHOUT bias column
    y: target values (m x 1) - correct answers
    w: weights (n x 1) - current parameter values for features
    b: bias (scalar) - current bias/intercept value
    """
    m = X.shape[0]  # number of training examples

    # Step 1: Make predictions using current w and b
    # Formula: prediction = b + w1×feature1 + w2×feature2 + ...
    predictions = X @ w + b

    # Step 2: Calculate the error
    error = predictions - y

    # Step 3: Calculate gradients for both w and b
    # These tell us how much to adjust each parameter
    dw = (1/m) * (X.T @ error)   # Gradient for weights
    db = (1/m) * np.sum(error)   # Gradient for bias

    return dw, db

def gradient_descent(X, y, w, b, alpha=0.01, num_iterations=1000):
    """
    Optimize parameters w and b using gradient descent.

    X: features (m x n)
    y: target values (m x 1)
    w: initial weights (n x 1)
    b: initial bias (scalar)
    alpha: learning rate - controls step size
    num_iterations: how many times to update parameters
    """
    for iteration in range(num_iterations):
        # Calculate gradients (dJ/dw and dJ/db of cost function J)
        dw, db = compute_gradient(X, y, w, b)

        # Update rules - move opposite to gradient direction
        w = w - alpha * dw  # Update weights
        b = b - alpha * db  # Update bias

    return w, b

# Example usage - fitting the line y = 2x
X = np.array([[1], [2], [3]])  # Just the feature values (no bias column needed!)
y = np.array([[2], [4], [6]])  # Target values

# Initialize parameters
w = np.zeros((X.shape[1], 1))  # Start with weight = 0
b = 0                          # Start with bias = 0

print("Initial parameters:")
print(f"  w = {w.flatten()}, b = {b}")

# Run gradient descent
w_final, b_final = gradient_descent(X, y, w, b, alpha=0.01, num_iterations=1000)

print("\nOptimized parameters:")
print(f"  w = {w_final.flatten()}, b = {b_final}")
print(f"\nOur model learned: y = {b_final:.4f} + {w_final[0,0]:.4f}×x")
```
This simple update rule, when applied iteratively, allows us to train complex models by gradually adjusting parameters to minimise prediction errors.
Understanding the Derivatives
What is a Derivative?
A derivative measures how much something changes when you make a small change to its input. In our context:
- ∂J/∂w tells us: “If I increase w by a tiny amount, how much will the cost J increase or decrease?”
- ∂J/∂b tells us: “If I increase b by a tiny amount, how much will the cost J increase or decrease?”
Why Partial Derivatives?
We use partial derivatives (∂) instead of regular derivatives (d) because J depends on multiple variables (both w and b). A partial derivative measures change with respect to one variable while holding others constant.
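This "tiny change" definition can be verified numerically. The sketch below (an illustration of mine, reusing compute_gradient and the compute_cost helper from above) nudges $w$ by a small $\epsilon$ while holding $b$ fixed, and compares the resulting change in cost with the analytic gradient:

```python
eps = 1e-6

# Finite-difference approximation of dJ/dw at (w0, b0): nudge w, hold b fixed.
w0 = np.zeros((1, 1))
b0 = 0.0
numeric_dw = (compute_cost(X, y, w0 + eps, b0) - compute_cost(X, y, w0, b0)) / eps

analytic_dw, _ = compute_gradient(X, y, w0, b0)
print(numeric_dw, analytic_dw[0, 0])  # the two values agree to several decimals
```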
Deriving the Gradients
Starting from the cost function, let’s see how we get the gradient formulas:
For ∂J/∂w:
- Start with $J(w,b) = \frac{1}{2m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2$, where the model is $f_{w,b}(x^{(i)}) = wx^{(i)} + b$
- Apply the chain rule: derivative of the squared term × derivative of the inner function
- The 2 from the square cancels with the 1/2, leaving $\frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)\cdot\frac{\partial f_{w,b}(x^{(i)})}{\partial w}$
- The derivative of $f_{w,b}(x^{(i)}) = wx^{(i)} + b$ with respect to $w$ is $x^{(i)}$
- Result: $\frac{\partial J}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)}$
For ∂J/∂b:
- Same starting point and chain rule application
- The derivative of $f_{w,b}(x^{(i)})$ with respect to $b$ is 1
- Result: $\frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$
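Putting the chain-rule steps for $\frac{\partial J}{\partial w}$ together in a single line (a standard derivation, spelled out here for completeness):

$$\frac{\partial J}{\partial w} = \frac{1}{2m}\sum_{i=1}^{m} 2\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)\cdot\frac{\partial}{\partial w}\left(wx^{(i)} + b\right) = \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)}$$

The same computation with $\frac{\partial}{\partial b}$ replaces the final factor $x^{(i)}$ with 1, giving the $\frac{\partial J}{\partial b}$ result above.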
Gradient Descent Algorithm
The algorithm iteratively optimises $w$ and $b$ through these steps:

repeat until convergence:

$$w = w - \alpha \frac{\partial J(w,b)}{\partial w}$$
$$b = b - \alpha \frac{\partial J(w,b)}{\partial b}$$

Where:
- $\alpha$ (alpha) is the learning rate, controlling step size
- Parameters $w$, $b$ are updated simultaneously
Why Subtract the Gradient?
We subtract (rather than add) because:
- If ∂J/∂w is positive, increasing w increases the cost → so we decrease w
- If ∂J/∂w is negative, increasing w decreases the cost → so we increase w
- This ensures we always move in the direction that reduces the cost, as the worked example below shows
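To see both cases with concrete (purely illustrative) numbers, take $J(w) = (w - 2)^2$ with its minimum at $w = 2$ and $\alpha = 0.1$:

$$w = 3: \quad \frac{\partial J}{\partial w} = 2(3-2) = 2 > 0, \quad w \leftarrow 3 - 0.1 \times 2 = 2.8 \;\text{(decreased, toward the minimum)}$$
$$w = 1: \quad \frac{\partial J}{\partial w} = 2(1-2) = -2 < 0, \quad w \leftarrow 1 - 0.1 \times (-2) = 1.2 \;\text{(increased, toward the minimum)}$$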
The Gradient Formulas Explained
For ∂J/∂w:
- The error term $\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$ measures prediction error
- Multiplied by $x^{(i)}$ because $w$ is multiplied by $x^{(i)}$ in the model
- Averaged over all $m$ examples
For ∂J/∂b:
- Same error term, but no $x^{(i)}$ multiplication
- This is because $b$ affects the output directly (not through $x^{(i)}$)
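These summation formulas are exactly what the vectorised lines in compute_gradient compute. A quick check (illustrative, assuming X, y, w, b, and np from the implementation above are still in scope):

```python
m = X.shape[0]
error = X @ w + b - y  # prediction error at the current parameters

# Summation form, written as explicit loops over the m examples
dw_loop = sum(error[i, 0] * X[i, 0] for i in range(m)) / m  # (1/m) Σ error_i · x_i
db_loop = sum(error[i, 0] for i in range(m)) / m            # (1/m) Σ error_i

# Vectorised form, as used in compute_gradient
dw_vec = (1/m) * (X.T @ error)
db_vec = (1/m) * np.sum(error)

print(dw_loop, dw_vec[0, 0])  # identical values
print(db_loop, db_vec)        # identical values
```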
Key Concept: “Simultaneously”
The note about simultaneous updates is crucial:
- First, calculate both partial derivatives using the current w and b values
- Only after computing both gradients do you update w and b
- This ensures consistent gradient calculations across all parameters (see the sketch below)
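A minimal sketch of the difference, assuming X, y, w, b, and compute_gradient from the implementation above are in scope (the incorrect version appears only to illustrate the pitfall):

```python
alpha = 0.01

# Correct: both gradients are computed from the same (w, b), then both update.
dw, db = compute_gradient(X, y, w, b)
w, b = w - alpha * dw, b - alpha * db

# Incorrect: w is updated first, so db is then computed at the *new* w,
# mixing gradients taken at two different points on the cost surface.
dw, _ = compute_gradient(X, y, w, b)
w = w - alpha * dw
_, db = compute_gradient(X, y, w, b)  # already sees the updated w
b = b - alpha * db
```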
How It Works
The algorithm:
- Computes how the cost changes with respect to each parameter (gradients show the “slope” of the cost function)
- Updates parameters in the opposite direction of the gradient (moving “downhill” toward lower cost)
- Repeats until convergence (when parameters stop changing significantly)
- Gradually moves toward the values of w and b that minimize prediction error
This iterative process finds the optimal parameters that best fit the linear model to the training data, like a ball rolling down a hill until it reaches the lowest point.