Reducing the number of features

Techniques for determining feature importance, selecting useful features, and combining features through dimensionality reduction.

When a dataset has many features, training slows down, models are more likely to overfit, and the data becomes hard to visualise. This note covers techniques to reduce features while keeping the useful information.

Why Reduce Features?

  • Curse of dimensionality: Models need exponentially more data as features increase
  • Noise reduction: Irrelevant features add noise
  • Faster training: Fewer features = faster computation
  • Better generalisation: Simpler models often perform better on new data
  • Visualisation: Hard to plot more than 2-3 dimensions

Principal Component Analysis (PCA)

PCA finds new features (called principal components) that are combinations of the original features. These new features capture the most variance in the data using fewer dimensions.

How It Works (Simplified)

  1. Centre the data (subtract the mean)
  2. Find the direction of maximum variance (first principal component)
  3. Find the next direction of maximum variance, perpendicular to the first
  4. Repeat until you have as many components as original features
  5. Keep only the top K components that explain most of the variance
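
A minimal NumPy sketch of these steps, assuming X is a small numeric array (illustrative only; in practice use scikit-learn's PCA as shown below):

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))           # toy data: 100 samples, 3 features

X_centred = X - X.mean(axis=0)          # 1. centre the data
cov = np.cov(X_centred, rowvar=False)   # covariance matrix of the centred features
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvectors are the principal directions

order = np.argsort(eigvals)[::-1]       # 2-4. sort directions by variance explained
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

K = 2                                   # 5. keep only the top K components
X_reduced = X_centred @ eigvecs[:, :K]

print(eigvals / eigvals.sum())          # proportion of variance per component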

Simple Analogy

Imagine photographing a 3D object. Some angles capture the shape better than others. PCA finds the “best angles” to view your high-dimensional data in fewer dimensions.

Example

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Always scale before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reduce from many features to 2
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

# How much variance does each component explain?
print(pca.explained_variance_ratio_)
# [0.72, 0.23] → First two components explain 95% of variance

Choosing the Number of Components

# Keep components that explain 95% of variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(f"Reduced from {X.shape[1]} to {X_reduced.shape[1]} features")

When to Use PCA

  • Visualising high-dimensional data
  • Reducing features before training (speeds up model)
  • Removing multicollinearity
  • Noise reduction

Limitations

  • Components are hard to interpret (they’re combinations of original features)
  • Assumes linear relationships
  • Sensitive to scale (always standardise first)
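
To see what a component is made of, inspect its loadings, i.e. the weight it puts on each original feature. A short sketch, assuming a fitted pca (as above) and a feature_names list:

import pandas as pd

# Each row of pca.components_ is one principal component;
# each column is that component's weight on an original feature.
loadings = pd.DataFrame(
    pca.components_,
    columns=feature_names,
    index=[f"PC{i + 1}" for i in range(pca.n_components_)],
)
print(loadings.round(2))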

Determining Feature Importance

Before removing features, figure out which ones matter.

Correlation Analysis

Check how strongly each feature correlates with the target.

import pandas as pd

# Correlation of each numeric feature with the target
correlations = df.corr(numeric_only=True)['target'].sort_values(ascending=False)
print(correlations)

# Drop features whose correlation with the target is weak (|r| < 0.1)
low_correlation_features = correlations[abs(correlations) < 0.1].index
df_reduced = df.drop(columns=low_correlation_features)

Tree-Based Feature Importance

Decision trees and random forests naturally rank features by how useful they are for splitting.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# Get impurity-based feature importances and show the top 10
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))

Mutual Information

Measures how much knowing a feature tells you about the target. Works for non-linear relationships.

import pandas as pd
from sklearn.feature_selection import mutual_info_classif

mi_scores = mutual_info_classif(X, y)
mi_series = pd.Series(mi_scores, index=feature_names)
print(mi_series.sort_values(ascending=False))

Permutation Importance

Shuffle each feature and see how much the model’s performance drops. Bigger drop = more important.

import pandas as pd
from sklearn.inspection import permutation_importance

# Uses the already-fitted model and a held-out test set
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
importances = pd.Series(result.importances_mean, index=feature_names)
print(importances.sort_values(ascending=False))

Feature Selection Methods

Filter Methods

Score features independently of the model, then select top K.

from sklearn.feature_selection import SelectKBest, f_classif

# Select top 10 features
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)

Wrapper Methods

Train models with different feature subsets, pick the best.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Recursively eliminate features
selector = RFE(LogisticRegression(), n_features_to_select=10)
X_selected = selector.fit_transform(X, y)

Embedded Methods

Feature selection happens during model training (e.g., Lasso regularisation).

from sklearn.linear_model import LassoCV

# Lasso automatically zeros out unimportant features
lasso = LassoCV()
lasso.fit(X, y)

# Features with non-zero coefficients are selected
selected_features = feature_names[lasso.coef_ != 0]

Other Dimensionality Reduction Techniques

t-SNE (t-Distributed Stochastic Neighbour Embedding)

Good for visualisation. Preserves local structure (nearby points stay nearby).

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=42)
X_2d = tsne.fit_transform(X)

  • Use for: visualising clusters, exploring data
  • Not for: reducing features for model training (too slow, non-deterministic)

UMAP (Uniform Manifold Approximation and Projection)

Similar to t-SNE but faster and preserves more global structure.

import umap

reducer = umap.UMAP(n_components=2)
X_2d = reducer.fit_transform(X)

Autoencoders

Neural networks that compress data to a smaller representation, then reconstruct it.

Input (100 dims) → Encoder → Bottleneck (10 dims) → Decoder → Output (100 dims)

The bottleneck layer is the reduced representation.
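
A minimal Keras sketch of this architecture, assuming TensorFlow/Keras is installed and X is a scaled array with 100 features (the layer sizes are arbitrary choices for illustration):

from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(100,))
encoded = layers.Dense(32, activation="relu")(inputs)
bottleneck = layers.Dense(10, activation="relu")(encoded)   # the reduced representation
decoded = layers.Dense(32, activation="relu")(bottleneck)
outputs = layers.Dense(100, activation="linear")(decoded)

autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, bottleneck)        # reuse the encoder half on its own

# Train the network to reconstruct its own input
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=20, batch_size=32, verbose=0)

X_reduced = encoder.predict(X)                   # shape: (n_samples, 10)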

Quick Comparison

Technique             Best For                  Interpretable?   Speed
PCA                   General reduction         Somewhat         Fast
Feature importance    Understanding features    Yes              Medium
Lasso                 Automatic selection       Yes              Fast
t-SNE                 Visualisation             No               Slow
UMAP                  Visualisation             No               Medium
Autoencoders          Complex non-linear data   No               Slow

Practical Workflow

  1. Start simple: Check correlations, remove obviously useless features
  2. Try tree importance: Quick way to rank features
  3. Use Lasso: Automatic selection during training
  4. Apply PCA: If you still have too many features
  5. Visualise with t-SNE/UMAP: To understand your data structure
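
As a rough sketch, several of these steps can be chained in a scikit-learn Pipeline (the estimators, k=20, and the 95% variance threshold are illustrative assumptions; X_train/X_test and y_train/y_test come from your own split):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("scale", StandardScaler()),               # PCA (and many selectors) need scaled data
    ("select", SelectKBest(f_classif, k=20)),  # filter step: keep the 20 best-scoring features
    ("pca", PCA(n_components=0.95)),           # keep components explaining 95% of variance
    ("model", LogisticRegression(max_iter=1000)),
])

pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))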
