One-Hot Encoding

One-hot encoding is a popular method for converting categorical (discrete) variables into numerical values so they can be processed by machine learning algorithms. Many common algorithms, most notably neural networks and gradient-boosted trees such as XGBoost, require numerical input and cannot work with raw string categories in practice.

How it Works

When a discrete variable has three or more options (a binary variable can simply be mapped to a single 0/1 column), one-hot encoding creates a new, dedicated column for every possible category. The encoding follows a specific binary logic:

  • The column corresponding to the specific category of a data point is assigned a 1 (referred to as “hot”).
  • All other columns for that data point are assigned a 0 (referred to as “cold”).

For example, if you have a “favourite colour” feature with three options (blue, red, and green), one-hot encoding replaces that single column with three separate columns. A data point for “blue” would have a 1 in the blue column and a 0 in the red and green columns.

Why One-Hot Encoding is Preferred

A common alternative is label encoding, where each category is simply assigned an arbitrary number (e.g., blue=0, red=1, green=2). One-hot encoding is usually preferred over this approach for two reasons:

  • Avoids Unintended Ordering: Simply numbering categories can mislead a model into assuming an inherent order or mathematical relationship where none exists. For instance, a model might incorrectly assume that “green” (2) is greater than “blue” (0) or that “red” (1) is the average of the two.
  • Model Robustness: It allows categories to be treated as independent entities, which is particularly helpful for simpler models like linear regression that might otherwise be confused by the arbitrary scale of label-encoded numbers.
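The difference is easy to see in code. Here is a minimal sketch using pandas; the colour data is the same illustrative example as above:

```python
import pandas as pd

s = pd.Series(['blue', 'red', 'green', 'blue'], name='colour')

# Label encoding: categories become arbitrary integers, which falsely
# implies an ordering blue(0) < green(1) < red(2).
label_encoded = s.astype('category').cat.codes
print(label_encoded.tolist())  # [0, 2, 1, 0]

# One-hot encoding: each category gets its own independent 0/1 column.
one_hot = pd.get_dummies(s, dtype=int)
print(one_hot.columns.tolist())  # ['blue', 'green', 'red']
print(one_hot.iloc[0].tolist())  # first row is 'blue' -> [1, 0, 0]
```

Note that the integer codes carry no meaning, while the one-hot columns treat each colour as an independent feature.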

Limitations and Practical Considerations

There are two main challenges when using this technique:

  • High Cardinality: If a feature has a vast number of options (e.g., 41,683 distinct US postal codes), one-hot encoding would create an equal number of new columns. This can make the dataset massive and difficult to handle.
  • Data Sparsity: Large numbers of categories result in sparse rows, meaning they contain mostly zeros. This can be computationally expensive in terms of memory and processing power.
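One common mitigation is to store the result as a sparse matrix rather than a dense array. The sketch below uses Scikit-Learn's OneHotEncoder, which returns a SciPy sparse matrix by default; the 10,000-row feature with up to 1,000 distinct codes is purely illustrative:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
# A hypothetical high-cardinality feature: 10,000 rows drawn from
# up to 1,000 distinct codes.
codes = rng.integers(0, 1000, size=10_000).astype(str).reshape(-1, 1)

enc = OneHotEncoder()  # sparse output is the default
X = enc.fit_transform(codes)

print(X.shape)  # (10000, n_distinct) — one column per distinct code
# Each row has exactly one non-zero entry, so only 10,000 values are
# actually stored instead of 10,000 * n_distinct.
print(X.nnz)
```

Only the non-zero entries are stored, so memory grows with the number of rows rather than rows times categories.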

In Python, this process is frequently automated using the get_dummies function in the Pandas library or the OneHotEncoder class in Scikit-Learn.

Using Pandas get_dummies

The pd.get_dummies() function provides a straightforward way to perform one-hot encoding on a DataFrame or Series.

Basic Usage

import pandas as pd

df = pd.DataFrame({
    'colour': ['blue', 'red', 'green', 'blue'],
    'size': ['small', 'large', 'medium', 'small']
})

encoded_df = pd.get_dummies(df, columns=['colour'])

This transforms the colour column into three new columns: colour_blue, colour_green, and colour_red.

Key Parameters

  • columns: List of column names to encode. If None, encodes all object/category columns.
  • prefix: String to prepend to new column names (e.g., prefix='col' produces col_blue).
  • prefix_sep: Separator between prefix and category value (default is _).
  • drop_first: If True, drops the first category to avoid multicollinearity (useful for regression models).
  • dtype: Data type for new columns (default is bool; use int for 0/1 integers).
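These parameters can be combined. A quick sketch (the column name and prefix are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'colour': ['blue', 'red', 'green']})

# Custom prefix and separator, with 0/1 integers instead of booleans.
encoded = pd.get_dummies(df, columns=['colour'],
                         prefix='col', prefix_sep='-', dtype=int)

print(encoded.columns.tolist())  # ['col-blue', 'col-green', 'col-red']
print(encoded['col-blue'].tolist())  # [1, 0, 0]
```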

Avoiding the Dummy Variable Trap

When using one-hot encoded features in linear regression, including all category columns can cause multicollinearity since one column is perfectly predictable from the others. Use drop_first=True to remove one category:

encoded_df = pd.get_dummies(df, columns=['colour'], drop_first=True)

This creates only the colour_green and colour_red columns; if both are 0, the colour is implicitly blue.
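A quick check with the same toy DataFrame confirms which category is dropped (pandas drops the first category in sorted order):

```python
import pandas as pd

df = pd.DataFrame({
    'colour': ['blue', 'red', 'green', 'blue'],
    'size': ['small', 'large', 'medium', 'small']
})

encoded_df = pd.get_dummies(df, columns=['colour'], drop_first=True)

# 'blue' (alphabetically first) is dropped; a row of all zeros in the
# remaining colour columns therefore means blue.
print([c for c in encoded_df.columns if c.startswith('colour_')])
# ['colour_green', 'colour_red']
```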

Handling Unknown Categories

A limitation of get_dummies is that it only creates columns for categories present in the data. If your test set contains categories not seen during training, those won’t have corresponding columns. For production pipelines where consistency is critical, consider using Scikit-Learn’s OneHotEncoder with handle_unknown='ignore' instead.


Analogy: Think of one-hot encoding like a panel of light switches in a room. Instead of having a single dial that you turn to different numbers to select a colour, you have a separate “On/Off” switch for every possible colour. To choose “Blue”, you flip the Blue switch on (1) and ensure every other switch is off (0). This way, there is no confusion about whether one colour is “higher” or “lower” than another—they are all just separate switches.