Gradient Descent

Gradient Descent Definition

Gradient descent is an iterative optimisation algorithm used to find the minimum of a Cost Function by repeatedly moving in the direction of steepest descent (opposite to the gradient). It’s the workhorse algorithm for training machine learning models, particularly neural networks.

Understanding the Update Rule

The following equations represent the fundamental gradient descent update rule:

$$w = w - \alpha \frac{\partial}{\partial w} J(w,b)$$
$$b = b - \alpha \frac{\partial}{\partial b} J(w,b)$$

Note: In this example, = is assignment, not assertion, and both $w$ and $b$ must be updated simultaneously, i.e. the $w$ used to compute the gradient in the second equation is the original value, not the value produced by the first equation. A worked single step follows the component list below.

Components:

  • $w$: The weight parameter (or parameter vector) being optimised
  • $\alpha$ (alpha): The learning rate, which controls how big each step is
  • $J(w,b)$: The cost/loss function that measures how wrong our predictions are
  • $\frac{\partial}{\partial w} J(w,b)$: The partial derivative (gradient) of the cost function with respect to $w$
  • $b$: The bias parameter (another parameter being optimised)
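
As a concrete single step with hypothetical numbers (say $\alpha = 0.1$, current $w = 0.5$, $b = 1$, and gradients $\frac{\partial}{\partial w} J(w,b) = -4$ and $\frac{\partial}{\partial b} J(w,b) = 2$), the updates are:

$$w = 0.5 - 0.1 \times (-4) = 0.9$$
$$b = 1 - 0.1 \times 2 = 0.8$$

The negative gradient pushed $w$ up; the positive gradient pushed $b$ down.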

How It Works:

  1. Calculate the gradient: Find $\frac{\partial}{\partial w} J(w,b)$, which tells us the slope of the cost function at the current position. This indicates which direction increases the cost most steeply.

  2. Move opposite to the gradient: Since we want to minimise the cost, we move in the negative direction of the gradient (hence the minus sign).

  3. Scale by learning rate: Multiply by $\alpha$ to control step size (see the sketch after this list):

    • Too large $\alpha$: Might overshoot the minimum
    • Too small $\alpha$: Convergence will be very slow
  4. Update the parameter: Replace the old value of $w$ with the new value.

  5. Repeat: Continue this process until convergence (when changes become negligibly small) or for a fixed number of iterations.
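
To make the effect of $\alpha$ concrete (step 3 above), here is a minimal self-contained sketch on the one-dimensional cost $J(w) = (w - 3)^2$, whose gradient is $2(w - 3)$; the two learning-rate values are illustrative only:

# Effect of the learning rate on J(w) = (w - 3)**2 (minimum at w = 3)
def run_steps(alpha, w=0.0, steps=5):
    history = [w]
    for _ in range(steps):
        gradient = 2 * (w - 3)    # dJ/dw at the current w
        w = w - alpha * gradient  # gradient descent update
        history.append(w)
    return history

print("alpha = 0.1:", [round(v, 2) for v in run_steps(0.1)])  # creeps steadily toward 3
print("alpha = 1.1:", [round(v, 2) for v in run_steps(1.1)])  # overshoots and diverges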

Intuitive Analogy:

Imagine you’re blindfolded on a hillside trying to reach the valley (minimum). At each step:

  • You feel the slope under your feet (calculate gradient)
  • You take a step downhill (opposite to upward slope)
  • The learning rate determines how big your steps are
  • You repeat until you reach the bottom

Typical Implementation:


# Micropip is used by the CodeEmitter Obsidian plugin to install packages
import micropip
await micropip.install('numpy')  

import numpy as np

def compute_gradient(X, y, w, b):
    """
    Calculate the gradient (slope) of the cost function.
    This tells us which direction to adjust w and b.
    
    X: features (m x n) - input data WITHOUT bias column
    y: target values (m x 1) - correct answers
    w: weights (n x 1) - current parameter values for features
    b: bias (scalar) - current bias/intercept value
    """
    m = X.shape[0]  # number of training examples
    
    # Step 1: Make predictions using current w and b
    # Formula: prediction = b + w1×feature1 + w2×feature2 + ...
    predictions = X @ w + b
    
    # Step 2: Calculate the error
    error = predictions - y
    
    # Step 3: Calculate gradients for both w and b
    # These tell us how much to adjust each parameter
    dw = (1/m) * (X.T @ error)  # Gradient for weights
    db = (1/m) * np.sum(error)   # Gradient for bias
    
    return dw, db


def gradient_descent(X, y, w, b, alpha=0.01, num_iterations=1000):
    """
    Optimize parameters w and b using gradient descent.
    
    X: features (m x n)
    y: target values (m x 1)
    w: initial weights (n x 1)
    b: initial bias (scalar)
    alpha: learning rate - controls step size
    num_iterations: how many times to update parameters
    """
    for iteration in range(num_iterations):
        # Calculate gradients (d/dw and d/db of cost function J)
        dw, db = compute_gradient(X, y, w, b)
        
        # Update rules - move opposite to gradient direction
        w = w - alpha * dw  # Update weights
        b = b - alpha * db  # Update bias
    
    return w, b


# Example usage - fitting the line y = 2x
X = np.array([[1], [2], [3]])  # Just the feature values (no bias column needed!)
y = np.array([[2], [4], [6]])  # Target values

# Initialize parameters
w = np.zeros((X.shape[1], 1))  # Start with weight = 0
b = 0                           # Start with bias = 0

print("Initial parameters:")
print(f"  w = {w.flatten()}, b = {b}")

# Run gradient descent
w_final, b_final = gradient_descent(X, y, w, b, alpha=0.01, num_iterations=1000)

print(f"\nOptimized parameters:")
print(f"  w = {w_final.flatten()}, b = {b_final}")
print(f"\nOur model learned: y = {b_final:.4f} + {w_final[0,0]:.4f}×x")

This simple update rule, when applied iteratively, allows us to train complex models by gradually adjusting parameters to minimise prediction errors.

Understanding the Derivatives

What is a Derivative?

A derivative measures how much something changes when you make a small change to its input. In our context:

  • ∂J/∂w tells us: “If I increase w by a tiny amount, how much will the cost J increase or decrease?”
  • ∂J/∂b tells us: “If I increase b by a tiny amount, how much will the cost J increase or decrease?”

Why Partial Derivatives?

We use partial derivatives (∂) instead of regular derivatives (d) because J depends on multiple variables (both w and b). A partial derivative measures change with respect to one variable while holding others constant.

Deriving the Gradients

Starting from the cost function, let’s see how we get the gradient formulas:

For ∂J/∂w:

  1. Start with $J(w,b) = \frac{1}{2m} \sum_{i=0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2$
  2. Apply the chain rule: derivative of the squared term × derivative of the inner function (written out below)
  3. The 2 from the square cancels with the 1/2, leaving $\frac{1}{m}$
  4. The derivative of $(wx^{(i)} + b)$ with respect to $w$ is $x^{(i)}$
  5. Result:
$$\frac{\partial J}{\partial w} = \frac{1}{m} \sum_{i=0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) x^{(i)}$$
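
To make steps 2 and 3 explicit, substituting $f_{w,b}(x^{(i)}) = wx^{(i)} + b$ and applying the chain rule in full:

$$\frac{\partial J}{\partial w} = \frac{1}{2m} \sum_{i=0}^{m-1} 2 (f_{w,b}(x^{(i)}) - y^{(i)}) \cdot \frac{\partial}{\partial w}(wx^{(i)} + b) = \frac{1}{m} \sum_{i=0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) x^{(i)}$$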

For ∂J/∂b:

  1. Same starting point and chain rule application
  2. The derivative of $(wx^{(i)} + b)$ with respect to $b$ is 1
  3. Result:
$$\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})$$
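
These formulas can be sanity-checked numerically: nudge a parameter by a tiny $h$ and measure how much the cost changes. A minimal sketch, reusing compute_gradient and the X, y, w, b arrays from the implementation above (the cost helper and the value of h are my own additions):

def cost(X, y, w, b):
    # J(w,b) = (1/(2m)) * sum of squared errors
    m = X.shape[0]
    error = X @ w + b - y
    return float(np.sum(error ** 2) / (2 * m))

h = 1e-6  # tiny nudge to each parameter

dw, db = compute_gradient(X, y, w, b)  # analytic gradients from the formulas

# Numerical approximations: (J(parameter + h) - J(parameter)) / h
dw_numeric = (cost(X, y, w + h, b) - cost(X, y, w, b)) / h
db_numeric = (cost(X, y, w, b + h) - cost(X, y, w, b)) / h

print(f"analytic:  dJ/dw = {dw[0,0]:.5f}, dJ/db = {db:.5f}")
print(f"numerical: dJ/dw = {dw_numeric:.5f}, dJ/db = {db_numeric:.5f}")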

Gradient Descent Algorithm

The algorithm iteratively optimises $w$ and $b$ through these steps:

repeat until convergence:

$$w = w - \alpha \frac{\partial J(w,b)}{\partial w}$$
$$b = b - \alpha \frac{\partial J(w,b)}{\partial b}$$

Where:

  • $\alpha$ (alpha) is the learning rate, controlling step size
  • Parameters $w$, $b$ are updated simultaneously
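
"Repeat until convergence" is often implemented by stopping once the updates become negligibly small. A minimal sketch reusing compute_gradient from the implementation above; the tolerance value here is an arbitrary choice:

def gradient_descent_until_convergence(X, y, w, b, alpha=0.01, tol=1e-7, max_iterations=100000):
    """Run gradient descent until every parameter update is smaller than tol."""
    for iteration in range(max_iterations):
        dw, db = compute_gradient(X, y, w, b)
        # Convergence test: stop when all updates are negligibly small
        if max(np.max(np.abs(alpha * dw)), abs(alpha * db)) < tol:
            print(f"Converged after {iteration} iterations")
            break
        w = w - alpha * dw
        b = b - alpha * db
    return w, b

# Example: same data as above, starting again from zeros
w_c, b_c = gradient_descent_until_convergence(X, y, np.zeros((X.shape[1], 1)), 0.0)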

Why Subtract the Gradient?

We subtract (rather than add) because:

  • If ∂J/∂w is positive, increasing w increases the cost → so we decrease w
  • If ∂J/∂w is negative, increasing w decreases the cost → so we increase w
  • This ensures we always move in the direction that reduces the cost, as the sketch below illustrates
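
A tiny illustration with the one-variable cost $J(w) = w^2$, whose derivative is $2w$ and whose minimum is at $w = 0$:

# J(w) = w**2, so dJ/dw = 2w
alpha = 0.1

w = 3.0                   # gradient = 2 * 3.0 = +6 (positive)
w = w - alpha * (2 * w)   # w becomes 2.4: subtracting a positive gradient decreased w

w = -3.0                  # gradient = 2 * -3.0 = -6 (negative)
w = w - alpha * (2 * w)   # w becomes -2.4: subtracting a negative gradient increased w

Both moves head toward the minimum at $w = 0$.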

The Gradient Formulas Explained

$$\frac{\partial J}{\partial w} = \frac{1}{m} \sum_{i=0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) x^{(i)}$$
  • The error term $(f_{w,b}(x^{(i)}) - y^{(i)})$ measures prediction error
  • Multiplied by $x^{(i)}$ because $w$ is multiplied by $x$ in the model
  • Averaged over all $m$ examples
$$\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})$$
  • Same error term, but no $x^{(i)}$ multiplication
  • This is because $b$ affects the output directly (not through $x$)

Key Concept: “Simultaneously”

The note about simultaneous updates is crucial:

  1. First, calculate both partial derivatives using the current w and b values
  2. Only after computing both gradients do you update w and b
  3. This ensures consistent gradient calculations across all parameters, as the sketch below shows
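
A minimal sketch of the difference, reusing compute_gradient and assuming X, y, w, b, and alpha are in scope (the sequential version is a common bug):

# Correct: both gradients come from the CURRENT w and b, then both update
dw, db = compute_gradient(X, y, w, b)
w = w - alpha * dw
b = b - alpha * db

# Incorrect (sequential): db is computed from the already-updated w
dw, _ = compute_gradient(X, y, w, b)
w = w - alpha * dw                     # w changes here...
_, db = compute_gradient(X, y, w, b)   # ...so this gradient uses the new w
b = b - alpha * db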

How It Works

The algorithm:

  1. Computes how the cost changes with respect to each parameter (gradients show the “slope” of the cost function)
  2. Updates parameters in the opposite direction of the gradient (moving “downhill” toward lower cost)
  3. Repeats until convergence (when parameters stop changing significantly)
  4. Gradually moves toward the values of $w$ and $b$ that minimise prediction error

This iterative process finds the optimal parameters that best fit the linear model to the training data, like a ball rolling down a hill until it reaches the lowest point.