Anomaly Detection

Using Gaussian probability models to identify data points that deviate significantly from normal behavior.

Anomaly detection is an unsupervised learning technique used to identify rare items, events, or observations that differ significantly from the majority of the data. Unlike Clustering, which groups similar items together, anomaly detection focuses on finding the outliers — the data points that don’t fit the expected pattern.

When to Use Anomaly Detection

Anomaly detection works well when:

  • You have many “normal” examples but very few (or no) labeled anomalies
  • Anomalies are rare and diverse — they could look different each time
  • You want to catch new types of anomalies that haven’t been seen before

Common applications include server monitoring (detecting failing machines), fraud detection, manufacturing quality control, and medical diagnostics.

Gaussian-Based Approach

The most common approach models the “normal” behavior of your data using a Gaussian Distribution. The intuition is simple:

  1. Fit a Gaussian to your training data (assumed to be mostly normal examples)
  2. For any new example, compute its probability under this distribution
  3. If the probability is very low (below threshold ε), flag it as an anomaly

Data points in the dense center of the distribution have high probability — they’re normal. Points in the sparse tails have low probability — they’re anomalies.
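
As a quick illustration of this intuition, here is a minimal 1D sketch (the CPU-load values and the use of scipy.stats.norm are just for illustration; the full per-feature algorithm follows below):

import numpy as np
from scipy.stats import norm

# Hypothetical CPU-load readings from healthy servers
cpu_load = np.array([0.48, 0.52, 0.50, 0.47, 0.55, 0.51, 0.49, 0.53])
mu, sigma = cpu_load.mean(), cpu_load.std()

# Dense center: high probability density -> normal
print(norm.pdf(0.50, loc=mu, scale=sigma))
# Sparse tail: very low density -> flagged once it falls below epsilon
print(norm.pdf(0.95, loc=mu, scale=sigma))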

Parameter Estimation

Given a training set $\{x^{(1)}, \ldots, x^{(m)}\}$, estimate the Gaussian parameters for each feature $i$:

Mean: $\mu_i = \frac{1}{m} \sum_{j=1}^{m} x_i^{(j)}$

Variance: $\sigma_i^2 = \frac{1}{m} \sum_{j=1}^{m} (x_i^{(j)} - \mu_i)^2$

These are closed-form solutions — no iterative optimization required. Simply compute the average and spread of each feature from your training data.

import numpy as np

def estimate_gaussian(X):
    """Estimate mean and variance for each feature."""
    m, n = X.shape
    mu = (1 / m) * np.sum(X, axis=0)
    var = (1 / m) * np.sum((X - mu) ** 2, axis=0)
    return mu, var
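
For example, on a small synthetic two-feature dataset (the values are made up):

X_train = np.array([[13.0, 14.5],
                    [13.4, 15.1],
                    [14.2, 13.9],
                    [13.8, 14.7]])

mu, var = estimate_gaussian(X_train)
print(mu)   # per-feature means, shape (2,)
print(var)  # per-feature variances, shape (2,)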

Computing Probability

For a new example $x$ with $n$ features, compute its probability by assuming features are independent:

$$p(x) = \prod_{i=1}^{n} p(x_i; \mu_i, \sigma_i^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(x_i - \mu_i)^2}{2\sigma_i^2}\right)$$

Each feature contributes its own Gaussian probability, and we multiply them together. A low value for any feature will drag down the overall probability, helping detect anomalies that are unusual in just one dimension.
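
A direct translation of this product into NumPy might look like the following sketch (compute_probability is a name chosen here, not taken from the original notebook):

import numpy as np

def compute_probability(X, mu, var):
    """p(x) for each row of X, assuming independent features (diagonal covariance)."""
    # Per-feature Gaussian densities, shape (m, n)
    densities = (1 / np.sqrt(2 * np.pi * var)) * np.exp(-((X - mu) ** 2) / (2 * var))
    # Multiply across features to get one probability per example, shape (m,)
    return np.prod(densities, axis=1)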

Threshold Selection (ε)

The threshold ε determines what probability is “too low” to be normal. To select it optimally, use a cross-validation set with labeled examples (some known anomalies):

  1. Compute $p(x)$ for all cross-validation examples
  2. Try many values of ε
  3. For each ε, classify examples as anomalies if $p(x) < \varepsilon$
  4. Select the ε that maximizes the F1 score
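
One way to implement this search, assuming p_cv holds $p(x)$ for the cross-validation examples and y_cv holds their labels (1 = anomaly, 0 = normal):

import numpy as np

def select_threshold(y_cv, p_cv):
    """Scan candidate epsilons and keep the one with the highest F1 on the CV set."""
    best_epsilon, best_f1 = 0.0, 0.0
    step = (p_cv.max() - p_cv.min()) / 1000
    for epsilon in np.arange(p_cv.min(), p_cv.max(), step):
        preds = p_cv < epsilon                 # flag as anomaly
        tp = np.sum(preds & (y_cv == 1))
        fp = np.sum(preds & (y_cv == 0))
        fn = np.sum(~preds & (y_cv == 1))
        if tp == 0:                            # avoid division by zero
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_f1, best_epsilon = f1, epsilon
    return best_epsilon, best_f1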

Evaluation Metrics

For imbalanced data (few anomalies, many normal), accuracy is misleading. Instead use:

Precision — Of all examples flagged as anomalies, how many actually are? $\text{precision} = \frac{tp}{tp + fp}$

Recall — Of all actual anomalies, how many did we catch? $\text{recall} = \frac{tp}{tp + fn}$

F1 Score — Harmonic mean balancing precision and recall: $F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$

Where:

  • $tp$ (true positives): correctly identified anomalies
  • $fp$ (false positives): normal examples incorrectly flagged
  • $fn$ (false negatives): anomalies we missed

F1 is preferred because it penalizes models that sacrifice either precision or recall too heavily.
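
As a small sketch of how these follow from the counts (boolean NumPy arrays are assumed; the function name is illustrative):

import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 from boolean label / prediction arrays."""
    tp = np.sum(y_pred & y_true)     # correctly identified anomalies
    fp = np.sum(y_pred & ~y_true)    # normal examples incorrectly flagged
    fn = np.sum(~y_pred & y_true)    # anomalies we missed
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)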

Algorithm Summary

  1. Choose features $x_i$ that might indicate anomalous behavior
  2. Fit parameters $\mu_1, \ldots, \mu_n$ and $\sigma_1^2, \ldots, \sigma_n^2$ from the training data
  3. Compute the probability $p(x)$ for new examples
  4. Flag as an anomaly if $p(x) < \varepsilon$

# Training
mu, var = estimate_gaussian(X_train)

# Prediction
p = multivariate_gaussian(X_new, mu, var)
anomalies = p < epsilon
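
The multivariate_gaussian call above is presumably a helper supplied by the notebook; it is not defined on this page. Under the independence (diagonal-covariance) assumption used throughout this section, a stand-in can simply reuse the per-feature product sketched earlier:

# Stand-in, assuming independent features (diagonal covariance)
def multivariate_gaussian(X, mu, var):
    return compute_probability(X, mu, var)   # per-feature Gaussian product defined above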

Practical Applications

  • Server Monitoring: Track throughput and latency; flag servers with unusual combinations
  • Fraud Detection: Model normal transaction patterns; flag unusual purchases
  • Manufacturing: Monitor sensor readings; detect defective products
  • Network Security: Baseline normal traffic; detect intrusions

High-Dimensional Data

This approach scales well to many features. The notebook example uses 11 features to monitor server health, achieving good detection with F1 ≈ 0.62. As dimensionality increases, ensure you have enough training data to reliably estimate each feature’s parameters.

Anomaly Detection vs. Supervised Learning

Aspect            | Anomaly Detection          | Supervised Learning
------------------|----------------------------|---------------------------
Labeled anomalies | Few or none                | Many of each class
Anomaly types     | Diverse, novel             | Known patterns
Training data     | Mostly normal              | Balanced classes
Best for          | Rare, unpredictable events | Well-defined categories

Use anomaly detection when anomalies are too rare or varied to learn directly. Use supervised classification when you have enough labeled examples of each class.
