Clustering

Clustering is a fundamental branch of unsupervised machine learning that focuses on grouping similar unlabelled data points together based on their inherent characteristics or features. Unlike supervised learning, where a computer learns from examples with known “correct” answers, clustering requires the algorithm to independently discover patterns and structures within the data without any predefined labels.

Core Mechanisms and Algorithm Types

At its heart, clustering identifies groups where items within the same cluster are highly similar to each other and distinct from those in other groups. This is typically achieved by calculating the distance (often Euclidean) between data points; points that are “closer” in the feature space are grouped together.
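The distance calculation mentioned above is simple to sketch. A minimal Euclidean distance in plain Python (the feature names in the comment are made up for illustration):

```python
import math

def euclidean(p, q):
    """Straight-line (Euclidean) distance between two points in feature space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Two hypothetical customers described by (age, monthly_spend):
print(euclidean((25, 40.0), (28, 44.0)))  # → 5.0
```

Points with a small distance between them are "close" in feature space and thus candidates for the same cluster.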

The sources highlight several key approaches to clustering:

  • K-means Clustering: This is a popular “hard clustering” method that partitions data into a user-specified number of groups (K). The process is iterative: it begins by placing random centers (centroids), assigns each data point to the nearest center, and then recalculates each center as the average (mean) of all points in its group, repeating until the centers stop moving significantly.
  • Hierarchical Clustering: This method groups the closest data points together into small clusters and then progressively unites these groups into larger ones. Users can specify a maximum distance to “cut” the hierarchy and stop the process, which allows for different granularities of grouping.
  • Density-Based Spatial Clustering (DBSCAN): Unlike K-means, which assumes roughly circular clusters, DBSCAN groups points based on high-density regions and is capable of identifying clusters of arbitrary shapes while labelling isolated, low-density points as noise.
  • Gaussian Mixture Models (GMMs): Known as a “soft clustering” approach, GMMs do not assign a point to just one group; instead, they provide the probability of a point belonging to each available cluster.
  • Self-Organising Maps (SOMs): These are neural-network-inspired models where processing units on a lattice compete to represent inputs, helping to visualise high-dimensional data relationships while preserving the “topology” or proximity of the space.
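The assign-then-recalculate loop described for K-means above can be sketched in a few lines of plain Python. This is a toy illustration on made-up 2-D points, not a production implementation: for determinism it initialises centroids from the first K points, whereas real libraries typically use random or k-means++ initialisation.

```python
def kmeans(points, k, iters=100):
    # Initialise centroids from the first k points (real implementations
    # usually pick random points or use k-means++ instead).
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # centers stopped moving: converged
            break
        centroids = new_centroids
    return centroids, clusters

# Two visually obvious groups of 2-D points:
points = [(1, 1), (1.5, 2), (8, 8), (9, 9), (1, 0.5), (8.5, 9.5)]
centroids, clusters = kmeans(points, k=2)
```

On this data the loop converges in a few iterations, separating the low-valued points from the high-valued ones.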

Practical Applications

Clustering is widely used across various industries for exploratory data analysis and organisation:

  • Market and Customer Segmentation: Businesses use it to divide customers into groups based on demographics or purchasing habits to tailor specific marketing strategies.
  • Image Segmentation and Compression: In computer vision, clustering pixels based on colour can simplify an image (e.g., reducing it to 16 dominant colours) or isolate specific objects, such as different types of tissue in medical imaging.
  • Recommendation Systems: Services like YouTube or Netflix may cluster users into groups with similar tastes to suggest content watched by others in that same group.
  • Genetics: Researchers use clustering to group species or gene sequences based on similarity.
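The recommendation use case above is easy to illustrate once cluster assignments exist. In this sketch, the users, cluster labels, and viewing histories are all invented; assume the cluster labels came from running something like K-means on viewing-history features beforehand.

```python
from collections import Counter

# Hypothetical users already assigned to "taste" clusters.
user_cluster = {"ana": 0, "bo": 0, "cy": 1, "dee": 1}
watched = {
    "ana": {"docs", "sci-fi"},
    "bo": {"sci-fi", "thrillers"},
    "cy": {"rom-com", "drama"},
    "dee": {"drama"},
}

def recommend(user):
    """Suggest titles watched by the user's cluster-mates but not by the user."""
    peers = [u for u, c in user_cluster.items()
             if c == user_cluster[user] and u != user]
    counts = Counter(title for u in peers for title in watched[u])
    return [t for t, _ in counts.most_common() if t not in watched[user]]

print(recommend("ana"))  # → ['thrillers']
```

"ana" shares a cluster with "bo", so she is recommended the one title "bo" watched that she has not.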

Key Challenges

  • Sensitivity to Outliers: Standard algorithms like K-means can be highly sensitive to extreme data points (outliers), which can pull cluster centers away from the actual logical groups.
  • The “K” Problem: Many techniques require the user to decide the number of clusters in advance, which can be difficult if the underlying structure of the data is unknown.
  • Evaluation: Because there is no “ground truth” (correct label) to compare against, evaluating whether a clustering result is “correct” is more subjective than in other forms of machine learning.
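The outlier sensitivity noted above is easy to demonstrate numerically: because a K-means centroid is an arithmetic mean, one extreme point shifts it dramatically. A minimal sketch with made-up points:

```python
# Three tightly grouped points: the mean sits in the middle of them.
cluster = [(1.0, 1.0), (1.25, 0.75), (0.75, 1.25)]
mean = tuple(sum(d) / len(cluster) for d in zip(*cluster))
print(mean)  # → (1.0, 1.0)

# Add one extreme point and recompute: the centroid is dragged far
# away from the logical group.
with_outlier = cluster + [(50.0, 50.0)]
mean2 = tuple(sum(d) / len(with_outlier) for d in zip(*with_outlier))
print(mean2)  # → (13.25, 13.25)
```

This is one reason density-based methods like DBSCAN, which flag isolated low-density points as noise, can be more robust on data with outliers.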