Reducing the number of features
Techniques for determining feature importance, selecting useful features, and combining features through dimensionality reduction.
When a dataset has many features, it can cause problems: slower training, overfitting, and difficulty visualising the data. This note covers techniques to reduce features while keeping the useful information.
Why Reduce Features?
- Curse of dimensionality: Models need exponentially more data as features increase
- Noise reduction: Irrelevant features add noise
- Faster training: Fewer features = faster computation
- Better generalisation: Simpler models often perform better on new data
- Visualisation: Hard to plot more than 2-3 dimensions
Principal Component Analysis (PCA)
PCA finds new features (called principal components) that are linear combinations of the original features. The components are ordered by how much variance they capture, so the first few often summarise the data in far fewer dimensions.
How It Works (Simplified)
- Centre the data (subtract the mean)
- Find the direction of maximum variance (first principal component)
- Find the next direction of maximum variance, perpendicular to the first
- Repeat until you have as many components as original features
- Keep only the top K components that explain most of the variance
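The same steps can be sketched directly with NumPy on toy data (an illustration only; the scikit-learn PCA in the example below handles all of this for you):

import numpy as np

X = np.random.rand(100, 5)                     # toy data: 100 samples, 5 features
X_centred = X - X.mean(axis=0)                 # 1. centre the data
cov = np.cov(X_centred, rowvar=False)          # covariance between features
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]          # 2-4. directions sorted by variance
top_2 = eigenvectors[:, order[:2]]             # 5. keep the top K = 2 components
X_reduced = X_centred @ top_2                  # project onto those directions
print(eigenvalues[order] / eigenvalues.sum())  # proportion of variance per component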
Simple Analogy
Imagine photographing a 3D object. Some angles capture the shape better than others. PCA finds the “best angles” to view your high-dimensional data in fewer dimensions.
Example
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Always scale before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Reduce from many features to 2
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
# How much variance does each component explain?
print(pca.explained_variance_ratio_)
# [0.72, 0.23] → First two components explain 95% of variance
Choosing the Number of Components
# Keep components that explain 95% of variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(f"Reduced from {X.shape[1]} to {X_reduced.shape[1]} features")
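Another common way to pick K is to fit PCA with all components and inspect the cumulative explained variance (this sketch assumes X_scaled from the example above):

import numpy as np

pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
# Smallest number of components whose cumulative variance reaches 95%
k = int(np.argmax(cumulative >= 0.95)) + 1
print(f"{k} components explain {cumulative[k - 1]:.0%} of the variance")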
When to Use PCA
- Visualising high-dimensional data
- Reducing features before training (speeds up model)
- Removing multicollinearity
- Noise reduction
Limitations
- Components are hard to interpret (they’re combinations of original features)
- Assumes linear relationships
- Sensitive to scale (always standardise first)
Determining Feature Importance
Before removing features, figure out which ones matter.
Correlation Analysis
Check how strongly each feature correlates with the target. This only captures linear relationships.
import pandas as pd
# Correlation with target
correlations = df.corr(numeric_only=True)['target'].sort_values(ascending=False)
print(correlations)
# Drop features with low correlation
low_correlation_features = correlations[abs(correlations) < 0.1].index
df_reduced = df.drop(columns=low_correlation_features)
Tree-Based Feature Importance
Decision trees and random forests naturally rank features by how useful they are for splitting.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X, y)
# Get feature importances
importances = pd.Series(model.feature_importances_, index=feature_names)
importances.sort_values(ascending=False).head(10)
Mutual Information
Measures how much knowing a feature tells you about the target. Works for non-linear relationships.
from sklearn.feature_selection import mutual_info_classif
mi_scores = mutual_info_classif(X, y)
mi_series = pd.Series(mi_scores, index=feature_names)
mi_series.sort_values(ascending=False)
Permutation Importance
Shuffle each feature and see how much the model’s performance drops. Bigger drop = more important.
from sklearn.inspection import permutation_importance
result = permutation_importance(model, X_test, y_test, n_repeats=10)
importances = pd.Series(result.importances_mean, index=feature_names)
Feature Selection Methods
Filter Methods
Score features independently of the model, then select top K.
from sklearn.feature_selection import SelectKBest, f_classif
# Select top 10 features
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)
Wrapper Methods
Train models with different feature subsets, pick the best.
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# Recursively eliminate features
selector = RFE(LogisticRegression(), n_features_to_select=10)
X_selected = selector.fit_transform(X, y)
Embedded Methods
Feature selection happens during model training (e.g., Lasso regularisation).
from sklearn.linear_model import LassoCV
# Lasso automatically zeros out unimportant features
# (standardise X first so the penalty treats every feature comparably)
lasso = LassoCV()
lasso.fit(X, y)
# Features with non-zero coefficients are selected
selected_features = feature_names[lasso.coef_ != 0]
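LassoCV is a regressor; for a classification target the same embedded-selection idea can be sketched with an L1-penalised logistic regression (the C value is just an illustrative choice, and the coefficient indexing assumes a binary target):

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Standardise first so the penalty treats every feature comparably
X_scaled = StandardScaler().fit_transform(X)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X_scaled, y)
selected_features = feature_names[clf.coef_[0] != 0]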
Other Dimensionality Reduction Techniques
t-SNE (t-Distributed Stochastic Neighbour Embedding)
Good for visualisation. Preserves local structure (nearby points stay nearby).
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=42)
X_2d = tsne.fit_transform(X)
Use for: Visualising clusters, exploring data
Not for: Reducing features for model training (slow, non-deterministic, and it cannot transform new, unseen data)
UMAP (Uniform Manifold Approximation and Projection)
Similar to t-SNE but faster and preserves more global structure.
import umap  # provided by the umap-learn package (pip install umap-learn)
reducer = umap.UMAP(n_components=2)
X_2d = reducer.fit_transform(X)
Autoencoders
Neural networks that compress data to a smaller representation, then reconstruct it.
Input (100 dims) → Encoder → Bottleneck (10 dims) → Decoder → Output (100 dims)
The bottleneck layer is the reduced representation.
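A minimal sketch of that architecture in Keras (assuming 100 input features, already scaled; the layer sizes are arbitrary):

from tensorflow import keras
from tensorflow.keras import layers

input_dim, bottleneck_dim = 100, 10

inputs = keras.Input(shape=(input_dim,))
encoded = layers.Dense(32, activation="relu")(inputs)
bottleneck = layers.Dense(bottleneck_dim, activation="relu")(encoded)
decoded = layers.Dense(32, activation="relu")(bottleneck)
outputs = layers.Dense(input_dim)(decoded)

autoencoder = keras.Model(inputs, outputs)  # learns to reconstruct its input
encoder = keras.Model(inputs, bottleneck)   # gives the reduced representation
autoencoder.compile(optimizer="adam", loss="mse")

# autoencoder.fit(X_scaled, X_scaled, epochs=50, batch_size=32)
# X_reduced = encoder.predict(X_scaled)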
Quick Comparison
| Technique | Best For | Interpretable? | Speed |
|---|---|---|---|
| PCA | General reduction | Somewhat | Fast |
| Feature importance | Understanding features | Yes | Medium |
| Lasso | Automatic selection | Yes | Fast |
| t-SNE | Visualisation | No | Slow |
| UMAP | Visualisation | No | Medium |
| Autoencoders | Complex non-linear data | No | Slow |
Practical Workflow
- Start simple: Check correlations, remove obviously useless features
- Try tree importance: Quick way to rank features
- Use Lasso: Automatic selection during training
- Apply PCA: If you still have too many features
- Visualise with t-SNE/UMAP: To understand your data structure
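Several of these steps can be chained in a single scikit-learn Pipeline. A rough sketch, assuming a classification problem with X_train and y_train available (k and the variance threshold are illustrative):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("filter", SelectKBest(f_classif, k=20)),  # drop the weakest features first
    ("pca", PCA(n_components=0.95)),           # then compress what remains
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)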