Anomaly Detection

Using Gaussian probability models to identify data points that deviate significantly from normal behavior.

Anomaly detection is an unsupervised learning technique used to identify rare items, events, or observations that differ significantly from the majority of the data. Unlike Clustering, which groups similar items together, anomaly detection focuses on finding the outliers — the data points that don’t fit the expected pattern.

When to Use Anomaly Detection

Anomaly detection works well when:

  • You have many “normal” examples but very few (or no) labeled anomalies
  • Anomalies are rare and diverse — they could look different each time
  • You want to catch new types of anomalies that haven’t been seen before

Common applications include server monitoring (detecting failing machines), fraud detection, manufacturing quality control, and medical diagnostics.

Gaussian-Based Approach

The most common approach models the “normal” behavior of your data using a Gaussian Distribution. The intuition is simple:

  1. Fit a Gaussian to your training data (assumed to be mostly normal examples)
  2. For any new example, compute its probability under this distribution
  3. If the probability is very low (below threshold ε), flag it as an anomaly

Data points in the dense center of the distribution have high probability — they’re normal. Points in the sparse tails have low probability — they’re anomalies.
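
As a quick illustration of this intuition, here is a minimal 1D sketch (the CPU-load values and the use of scipy.stats.norm are just for illustration; the full per-feature algorithm follows below):

import numpy as np
from scipy.stats import norm

# Hypothetical CPU-load readings from healthy servers
cpu_load = np.array([0.48, 0.52, 0.50, 0.47, 0.55, 0.51, 0.49, 0.53])
mu, sigma = cpu_load.mean(), cpu_load.std()

# Dense center: high probability density -> normal
print(norm.pdf(0.50, loc=mu, scale=sigma))
# Sparse tail: very low density -> flagged once it falls below epsilon
print(norm.pdf(0.95, loc=mu, scale=sigma))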

Parameter Estimation

Given a training set $\{x^{(1)}, \ldots, x^{(m)}\}$, estimate the Gaussian parameters for each feature $i$:

Mean: $\mu_i = \frac{1}{m} \sum_{j=1}^{m} x_i^{(j)}$

Variance: $\sigma_i^2 = \frac{1}{m} \sum_{j=1}^{m} (x_i^{(j)} - \mu_i)^2$

These are closed-form solutions — no iterative optimization required. Simply compute the average and spread of each feature from your training data.

import numpy as np

def estimate_gaussian(X):
    """Estimate mean and variance for each feature."""
    m, n = X.shape
    mu = (1 / m) * np.sum(X, axis=0)
    var = (1 / m) * np.sum((X - mu) ** 2, axis=0)
    return mu, var
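
For example, on a small synthetic two-feature dataset (the values are made up):

X_train = np.array([[13.0, 14.5],
                    [13.4, 15.1],
                    [14.2, 13.9],
                    [13.8, 14.7]])

mu, var = estimate_gaussian(X_train)
print(mu)   # per-feature means, shape (2,)
print(var)  # per-feature variances, shape (2,)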

Computing Probability

For a new example $x$ with $n$ features, compute its probability by assuming features are independent:

$$p(x) = \prod_{i=1}^{n} p(x_i; \mu_i, \sigma_i^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(x_i - \mu_i)^2}{2\sigma_i^2}\right)$$

Each feature contributes its own Gaussian probability, and we multiply them together. A low value for any feature will drag down the overall probability, helping detect anomalies that are unusual in just one dimension.
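
A direct translation of this product into NumPy might look like the following sketch (compute_probability is a name chosen here, not taken from the original notebook):

import numpy as np

def compute_probability(X, mu, var):
    """p(x) for each row of X, assuming independent features (diagonal covariance)."""
    # Per-feature Gaussian densities, shape (m, n)
    densities = (1 / np.sqrt(2 * np.pi * var)) * np.exp(-((X - mu) ** 2) / (2 * var))
    # Multiply across features to get one probability per example, shape (m,)
    return np.prod(densities, axis=1)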

Threshold Selection (ε)

The threshold ε determines what probability is “too low” to be normal. To select it optimally, use a cross-validation set with labeled examples (some known anomalies):

  1. Compute $p(x)$ for all cross-validation examples
  2. Try many values of ε
  3. For each ε, classify examples as anomalies if $p(x) < \varepsilon$
  4. Select the ε that maximizes the F1 score
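
One way to implement this search, assuming p_cv holds $p(x)$ for the cross-validation examples and y_cv holds their labels (1 = anomaly, 0 = normal):

import numpy as np

def select_threshold(y_cv, p_cv):
    """Scan candidate epsilons and keep the one with the highest F1 on the CV set."""
    best_epsilon, best_f1 = 0.0, 0.0
    step = (p_cv.max() - p_cv.min()) / 1000
    for epsilon in np.arange(p_cv.min(), p_cv.max(), step):
        preds = p_cv < epsilon                 # flag as anomaly
        tp = np.sum(preds & (y_cv == 1))
        fp = np.sum(preds & (y_cv == 0))
        fn = np.sum(~preds & (y_cv == 1))
        if tp == 0:                            # avoid division by zero
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_f1, best_epsilon = f1, epsilon
    return best_epsilon, best_f1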

Evaluation Metrics

For imbalanced data (few anomalies, many normal), accuracy is misleading. Instead use:

Precision — Of all examples flagged as anomalies, how many actually are? $\text{precision} = \frac{tp}{tp + fp}$

Recall — Of all actual anomalies, how many did we catch? $\text{recall} = \frac{tp}{tp + fn}$

F1 Score — Harmonic mean balancing precision and recall: $F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$

Where:

  • $tp$ (true positives): correctly identified anomalies
  • $fp$ (false positives): normal examples incorrectly flagged
  • $fn$ (false negatives): anomalies we missed

F1 is preferred because it penalizes models that sacrifice either precision or recall too heavily.
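
As a small sketch of how these follow from the counts (boolean NumPy arrays are assumed; the function name is illustrative):

import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 from boolean label / prediction arrays."""
    tp = np.sum(y_pred & y_true)     # correctly identified anomalies
    fp = np.sum(y_pred & ~y_true)    # normal examples incorrectly flagged
    fn = np.sum(~y_pred & y_true)    # anomalies we missed
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)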

Algorithm Summary

  1. Choose features $x_i$ that might indicate anomalous behavior
  2. Fit parameters $\mu_1, \ldots, \mu_n$ and $\sigma_1^2, \ldots, \sigma_n^2$ from the training data
  3. Compute the probability $p(x)$ for new examples
  4. Flag as an anomaly if $p(x) < \varepsilon$

# Training
mu, var = estimate_gaussian(X_train)

# Prediction
p = multivariate_gaussian(X_new, mu, var)
anomalies = p < epsilon
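
The multivariate_gaussian call above is presumably a helper supplied by the notebook; it is not defined on this page. Under the independence (diagonal-covariance) assumption used throughout this section, a stand-in can simply reuse the per-feature product sketched earlier:

# Stand-in, assuming independent features (diagonal covariance)
def multivariate_gaussian(X, mu, var):
    return compute_probability(X, mu, var)   # per-feature Gaussian product defined above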

Practical Applications

  • Server Monitoring: Track throughput and latency; flag servers with unusual combinations
  • Fraud Detection: Model normal transaction patterns; flag unusual purchases
  • Manufacturing: Monitor sensor readings; detect defective products
  • Network Security: Baseline normal traffic; detect intrusions

High-Dimensional Data

This approach scales well to many features. The notebook example uses 11 features to monitor server health, achieving good detection with F1 ≈ 0.62. As dimensionality increases, ensure you have enough training data to reliably estimate each feature’s parameters.

Anomaly Detection vs. Supervised Learning

Aspect            | Anomaly Detection          | Supervised Learning
------------------|----------------------------|---------------------------
Labeled anomalies | Few or none                | Many of each class
Anomaly types     | Diverse, novel             | Known patterns
Training data     | Mostly normal              | Balanced classes
Best for          | Rare, unpredictable events | Well-defined categories

Use anomaly detection when anomalies are too rare or varied to learn directly. Use supervised classification when you have enough labeled examples of each class.
