Reducing the number of features
Techniques for determining feature importance, selecting useful features, and combining features through dimensionality reduction.
When a dataset has many features, it can cause problems: slower training, overfitting, and difficulty visualising the data. This note covers techniques to reduce features while keeping the useful information.
Why Reduce Features?
- Curse of dimensionality: Models need exponentially more data as features increase
- Noise reduction: Irrelevant features add noise
- Faster training: Fewer features = faster computation
- Better generalisation: Simpler models often perform better on new data
- Visualisation: Hard to plot more than 2-3 dimensions
Principal Component Analysis (PCA)
PCA finds new features (called principal components) that are linear combinations of the original features. The components are ordered by how much variance they capture, so the first few often summarise the data in far fewer dimensions.
How It Works (Simplified)
- Centre the data (subtract the mean)
- Find the direction of maximum variance (first principal component)
- Find the next direction of maximum variance, perpendicular to the first
- Repeat until you have as many components as original features
- Keep only the top K components that explain most of the variance
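The same steps can be sketched directly with NumPy on toy data (an illustration only; the scikit-learn PCA in the example below handles all of this for you):

import numpy as np

X = np.random.rand(100, 5)                     # toy data: 100 samples, 5 features
X_centred = X - X.mean(axis=0)                 # 1. centre the data
cov = np.cov(X_centred, rowvar=False)          # covariance between features
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]          # 2-4. directions sorted by variance
top_2 = eigenvectors[:, order[:2]]             # 5. keep the top K = 2 components
X_reduced = X_centred @ top_2                  # project onto those directions
print(eigenvalues[order] / eigenvalues.sum())  # proportion of variance per component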
Simple Analogy
Imagine photographing a 3D object. Some angles capture the shape better than others. PCA finds the “best angles” to view your high-dimensional data in fewer dimensions.
Example
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Always scale before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Reduce from many features to 2
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
# How much variance does each component explain?
print(pca.explained_variance_ratio_)
# [0.72, 0.23] → First two components explain 95% of variance
Choosing the Number of Components
# Keep components that explain 95% of variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(f"Reduced from {X.shape[1]} to {X_reduced.shape[1]} features")
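Another common way to pick K is to fit PCA with all components and inspect the cumulative explained variance (this sketch assumes X_scaled from the example above):

import numpy as np

pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
# Smallest number of components whose cumulative variance reaches 95%
k = int(np.argmax(cumulative >= 0.95)) + 1
print(f"{k} components explain {cumulative[k - 1]:.0%} of the variance")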
When to Use PCA
- Visualising high-dimensional data
- Reducing features before training (speeds up model)
- Removing multicollinearity
- Noise reduction
Limitations
- Components are hard to interpret (they’re combinations of original features)
- Assumes linear relationships
- Sensitive to scale (always standardise first)
Determining Feature Importance
Before removing features, figure out which ones matter.
Correlation Analysis
Check how strongly each feature correlates with the target. This only captures linear relationships.
import pandas as pd
# Correlation with target
correlations = df.corr(numeric_only=True)['target'].sort_values(ascending=False)
print(correlations)
# Drop features with low correlation
low_correlation_features = correlations[abs(correlations) < 0.1].index
df_reduced = df.drop(columns=low_correlation_features)
Tree-Based Feature Importance
Decision trees and random forests naturally rank features by how useful they are for splitting.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X, y)
# Get feature importances
importances = pd.Series(model.feature_importances_, index=feature_names)
importances.sort_values(ascending=False).head(10)
Mutual Information
Measures how much knowing a feature tells you about the target. Works for non-linear relationships.
from sklearn.feature_selection import mutual_info_classif
mi_scores = mutual_info_classif(X, y)
mi_series = pd.Series(mi_scores, index=feature_names)
mi_series.sort_values(ascending=False)
Permutation Importance
Shuffle each feature and see how much the model’s performance drops. Bigger drop = more important.
from sklearn.inspection import permutation_importance
result = permutation_importance(model, X_test, y_test, n_repeats=10)
importances = pd.Series(result.importances_mean, index=feature_names)
Feature Selection Methods
Filter Methods
Score features independently of the model, then select top K.
from sklearn.feature_selection import SelectKBest, f_classif
# Select top 10 features
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)
Wrapper Methods
Train models with different feature subsets, pick the best.
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# Recursively eliminate features
selector = RFE(LogisticRegression(), n_features_to_select=10)
X_selected = selector.fit_transform(X, y)
Embedded Methods
Feature selection happens during model training (e.g., Lasso regularisation).
from sklearn.linear_model import LassoCV
# Lasso automatically zeros out unimportant features
# (standardise X first so the penalty treats every feature comparably)
lasso = LassoCV()
lasso.fit(X, y)
# Features with non-zero coefficients are selected
selected_features = feature_names[lasso.coef_ != 0]
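LassoCV is a regressor; for a classification target the same embedded-selection idea can be sketched with an L1-penalised logistic regression (the C value is just an illustrative choice, and the coefficient indexing assumes a binary target):

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Standardise first so the penalty treats every feature comparably
X_scaled = StandardScaler().fit_transform(X)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X_scaled, y)
selected_features = feature_names[clf.coef_[0] != 0]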
Other Dimensionality Reduction Techniques
t-SNE (t-Distributed Stochastic Neighbour Embedding)
Good for visualisation. Preserves local structure (nearby points stay nearby).
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=42)
X_2d = tsne.fit_transform(X)
Use for: Visualising clusters, exploring data
Not for: Reducing features for model training (slow, non-deterministic, and it cannot transform new, unseen data)
UMAP (Uniform Manifold Approximation and Projection)
Similar to t-SNE but faster and preserves more global structure.
import umap  # provided by the umap-learn package (pip install umap-learn)
reducer = umap.UMAP(n_components=2)
X_2d = reducer.fit_transform(X)
Autoencoders
Neural networks that compress data to a smaller representation, then reconstruct it.
Input (100 dims) → Encoder → Bottleneck (10 dims) → Decoder → Output (100 dims)
The bottleneck layer is the reduced representation.
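A minimal sketch of that architecture in Keras (assuming 100 input features, already scaled; the layer sizes are arbitrary):

from tensorflow import keras
from tensorflow.keras import layers

input_dim, bottleneck_dim = 100, 10

inputs = keras.Input(shape=(input_dim,))
encoded = layers.Dense(32, activation="relu")(inputs)
bottleneck = layers.Dense(bottleneck_dim, activation="relu")(encoded)
decoded = layers.Dense(32, activation="relu")(bottleneck)
outputs = layers.Dense(input_dim)(decoded)

autoencoder = keras.Model(inputs, outputs)  # learns to reconstruct its input
encoder = keras.Model(inputs, bottleneck)   # gives the reduced representation
autoencoder.compile(optimizer="adam", loss="mse")

# autoencoder.fit(X_scaled, X_scaled, epochs=50, batch_size=32)
# X_reduced = encoder.predict(X_scaled)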
Quick Comparison
| Technique | Best For | Interpretable? | Speed |
|---|---|---|---|
| PCA | General reduction | Somewhat | Fast |
| Feature importance | Understanding features | Yes | Medium |
| Lasso | Automatic selection | Yes | Fast |
| t-SNE | Visualisation | No | Slow |
| UMAP | Visualisation | No | Medium |
| Autoencoders | Complex non-linear data | No | Slow |
Practical Workflow
- Start simple: Check correlations, remove obviously useless features
- Try tree importance: Quick way to rank features
- Use Lasso: Automatic selection during training
- Apply PCA: If you still have too many features
- Visualise with t-SNE/UMAP: To understand your data structure
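Several of these steps can be chained in a single scikit-learn Pipeline. A rough sketch, assuming a classification problem with X_train and y_train available (k and the variance threshold are illustrative):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("filter", SelectKBest(f_classif, k=20)),  # drop the weakest features first
    ("pca", PCA(n_components=0.95)),           # then compress what remains
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)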