Cost Functions

What is a Cost Function?

A cost function (also called a loss function or objective function) is a mathematical measure that quantifies how wrong a model’s predictions are compared to the actual values. It essentially tells us “how much our model costs us” in terms of prediction errors. The goal in machine learning is to minimize this cost by finding the optimal model parameters.

The Squared Error Cost Function

Looking at the following formula:

$$J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2$$

Let me break down each component (a short code sketch follows the list):

  • $J(w,b)$: The cost function value, dependent on parameters $w$ (weight) and $b$ (bias)
  • $m$: The number of training examples
  • $\hat{y}^{(i)}$: The predicted value for the $i$-th example ($\hat{y}^{(i)} = f_{w,b}(x^{(i)}) = wx^{(i)} + b$)
  • $y^{(i)}$: The actual/true value for the $i$-th example
  • $(\hat{y}^{(i)} - y^{(i)})$: The error for a single prediction
  • $\sum$: Sum over all $m$ training examples
  • $\frac{1}{2m}$: A normalization factor (the $\frac{1}{2}$ is often included to simplify derivatives during optimization)
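
To make the formula concrete, here is a minimal NumPy sketch of the cost computation; the function name `compute_cost` and the tiny dataset are illustrative choices, not taken from the course materials.

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared error cost J(w, b) for the linear model f_{w,b}(x) = w*x + b."""
    m = x.shape[0]                     # number of training examples
    y_hat = w * x + b                  # predictions for all m examples at once
    squared_errors = (y_hat - y) ** 2  # (y_hat_i - y_i)^2 for each example
    return squared_errors.sum() / (2 * m)

# Tiny made-up dataset for illustration
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([2.0, 4.1, 5.9])

print(compute_cost(x_train, y_train, w=2.0, b=0.0))  # small cost: the line nearly fits
print(compute_cost(x_train, y_train, w=0.5, b=1.0))  # larger cost: a poor fit
```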

Visual Interpretation

[Figure: error-graph.png]

This image illustrates the squared error cost function, one of the most fundamental concepts in machine learning, particularly for regression problems.

The graph shows:

  • Red X’s: Actual data points $(x^{(i)}, y^{(i)})$
  • Blue line: The linear model $f_{w,b}(x) = wx + b$
  • Pink arrows: The errors between predictions and actual values

The cost function averages the squared lengths of these pink arrows (scaled by the extra factor of $\frac{1}{2}$). Squaring serves two purposes, illustrated by the quick check after this list:

  1. Makes all errors positive (so they don’t cancel out)
  2. Penalizes larger errors more heavily than smaller ones
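
A quick check in Python shows both effects at once (the error values here are made up for illustration):

```python
errors = [-3.0, 1.0, 2.0]
print([e ** 2 for e in errors])  # [9.0, 1.0, 4.0]: every contribution is positive,
                                 # and the error of -3 dominates the smaller ones
```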

The Optimization Goal

Find $w, b$ so that $\hat{y}^{(i)}$ is close to $y^{(i)}$ for all $(x^{(i)}, y^{(i)})$.

The objective is to find the optimal values of $w$ and $b$ that minimize $J(w,b)$, which means finding the line that best fits the data points with the smallest overall squared error. This is typically done using optimization algorithms like Gradient Descent.
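
Below is a hedged sketch of batch gradient descent for this cost, using the standard gradients $\frac{\partial J}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})\,x^{(i)}$ and $\frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})$ (the $\frac{1}{2}$ in the cost cancels the 2 from differentiation). The learning rate, iteration count, and dataset are illustrative, not the course’s exact values.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, num_iters=2000):
    """Minimize J(w, b) by batch gradient descent (illustrative sketch)."""
    w, b = 0.0, 0.0
    m = x.shape[0]
    for _ in range(num_iters):
        error = (w * x + b) - y        # y_hat_i - y_i for every example
        dj_dw = (error * x).sum() / m  # dJ/dw
        dj_db = error.sum() / m        # dJ/db
        w -= alpha * dj_dw             # update both parameters simultaneously
        b -= alpha * dj_db
    return w, b

x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([2.0, 4.1, 5.9])
w, b = gradient_descent(x_train, y_train)
print(w, b)  # converges toward the best-fit line for this data (w ≈ 1.95, b ≈ 0.1)
```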

This particular cost function leads to what’s known as ordinary least squares regression when used with linear models, forming the foundation for many more complex machine learning algorithms.
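
Because the problem is ordinary least squares, the same minimizer is also available in closed form. As a sanity check, NumPy’s `polyfit` (a least-squares solver, used here purely for comparison on the same made-up data as the sketches above) should return nearly the same slope and intercept that gradient descent converges to.

```python
import numpy as np

x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([2.0, 4.1, 5.9])

# np.polyfit with deg=1 fits a straight line by ordinary least squares,
# returning the coefficients [slope, intercept].
w_closed, b_closed = np.polyfit(x_train, y_train, deg=1)
print(w_closed, b_closed)  # ≈ 1.95 and ≈ 0.1, matching the gradient descent result
```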

Source: Machine Learning Specialization on Coursera