One-Hot Encoding
One-hot encoding is a popular method for converting categorical (discrete) variables into numerical values so they can be processed by machine learning algorithms. Although categorical features are usable in principle, many common algorithms, most notably neural networks and XGBoost, cannot work with non-numerical data in practice.
How it Works
When a discrete variable has three or more options, one-hot encoding creates a new, dedicated column for every possible category. The encoding follows a specific binary logic:
- The column corresponding to the specific category of a data point is assigned a 1 (referred to as “hot”).
- All other columns for that data point are assigned a 0 (referred to as “cold”).
For example, if you have a “favourite colour” feature with three options (blue, red, and green), one-hot encoding replaces that single column with three separate columns. A data point for “blue” would have a 1 in the blue column and a 0 in the red and green columns.
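The colour example above can be sketched in a few lines of plain Python (the `one_hot` helper below is written just for illustration):

```python
# The three possible categories from the "favourite colour" example.
categories = ["blue", "red", "green"]

def one_hot(value, categories):
    """Return a binary vector with a 1 ("hot") in the position of
    `value` and 0 ("cold") everywhere else."""
    return [1 if value == c else 0 for c in categories]

print(one_hot("blue", categories))  # [1, 0, 0] -- the blue column is hot
```

Each encoded vector contains exactly one 1, which is where the technique gets its name.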
Why One-Hot Encoding is Preferred
A common alternative is label encoding, where each category is simply assigned an arbitrary number (e.g., blue=0, red=1, green=2). However, this can be problematic for several reasons:
- Avoids Unintended Ordering: Simply numbering categories can mislead a model into assuming an inherent order or mathematical relationship where none exists. For instance, a model might incorrectly assume that “green” (2) is greater than “blue” (0) or that “red” (1) is the average of the two.
- Model Robustness: It allows categories to be treated as independent entities, which is particularly helpful for simpler models like linear regression that might otherwise be confused by the arbitrary scale of label-encoded numbers.
Limitations and Practical Considerations
There are two main challenges when using this technique:
- High Cardinality: If a feature has a vast number of options (e.g., 41,683 distinct US postal codes), one-hot encoding would create an equal number of new columns. This can make the dataset massive and difficult to handle.
- Data Sparsity: Large numbers of categories result in sparse rows, meaning they contain mostly zeros. This can be computationally expensive in terms of memory and processing power.
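Both problems are easy to demonstrate. The sketch below encodes 1,000 fake postal-code-style categories (the five-digit strings are made up for illustration) and shows that the result is one column per category, with all but one cell per row equal to zero:

```python
import pandas as pd

# 1,000 distinct fake postal codes, one row each.
codes = pd.Series([f"{i:05d}" for i in range(1000)])

wide = pd.get_dummies(codes)
print(wide.shape)  # (1000, 1000) -- one column per category
# Only 1,000 of the 1,000,000 cells are non-zero.
print(int(wide.to_numpy().sum()))
```

At this scale a dense representation wastes most of its memory on zeros, which is why sparse matrix formats are often used instead.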
In Python, this process is frequently automated using the get_dummies function in the Pandas library or the OneHotEncoder class in Scikit-Learn.
Using Pandas get_dummies
The pd.get_dummies() function provides a straightforward way to perform one-hot encoding on a DataFrame or Series.
Basic Usage
import pandas as pd
df = pd.DataFrame({
'colour': ['blue', 'red', 'green', 'blue'],
'size': ['small', 'large', 'medium', 'small']
})
encoded_df = pd.get_dummies(df, columns=['colour'])
This transforms the colour column into three new columns: colour_blue, colour_green, and colour_red.
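Printing the resulting column names confirms the transformation (this is a self-contained rerun of the snippet above; note that the un-encoded size column is left untouched and the new dummy columns are appended in alphabetical order):

```python
import pandas as pd

df = pd.DataFrame({
    'colour': ['blue', 'red', 'green', 'blue'],
    'size': ['small', 'large', 'medium', 'small']
})

encoded_df = pd.get_dummies(df, columns=['colour'])
print(list(encoded_df.columns))
# ['size', 'colour_blue', 'colour_green', 'colour_red']
```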
Key Parameters
| Parameter | Description |
|---|---|
| columns | List of column names to encode. If None, encodes all object/category columns. |
| prefix | String to prepend to new column names (e.g., prefix='col' → col_blue). |
| prefix_sep | Separator between prefix and category value (default is _). |
| drop_first | If True, drops the first category to avoid multicollinearity (useful for regression models). |
| dtype | Data type for new columns (default is bool; use int for 0/1 integers). |
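A short sketch exercising a few of these parameters together (the 'col' prefix is chosen arbitrarily for illustration):

```python
import pandas as pd

s = pd.Series(['blue', 'red', 'green'])

# prefix/prefix_sep control the new column names;
# dtype=int produces 0/1 integers instead of booleans.
out = pd.get_dummies(s, prefix='col', prefix_sep='_', dtype=int)
print(list(out.columns))  # ['col_blue', 'col_green', 'col_red']
print(out['col_blue'].tolist())  # [1, 0, 0]
```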
Avoiding the Dummy Variable Trap
When using one-hot encoded features in linear regression, including all category columns can cause multicollinearity since one column is perfectly predictable from the others. Use drop_first=True to remove one category:
encoded_df = pd.get_dummies(df, columns=['colour'], drop_first=True)
This creates only colour_green and colour_red columns—if both are 0, the colour is implicitly blue.
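A self-contained sketch of the drop_first behaviour, showing that the alphabetically first category (blue) becomes the implicit all-zeros baseline:

```python
import pandas as pd

df = pd.DataFrame({'colour': ['blue', 'red', 'green', 'blue']})

encoded = pd.get_dummies(df, columns=['colour'], drop_first=True)
print(list(encoded.columns))  # ['colour_green', 'colour_red']
# The first row is 'blue', so both remaining columns are zero.
print(encoded.iloc[0].tolist())
```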
Handling Unknown Categories
A limitation of get_dummies is that it only creates columns for categories present in the data. If your test set contains categories not seen during training, those won’t have corresponding columns. For production pipelines where consistency is critical, consider using Scikit-Learn’s OneHotEncoder with handle_unknown='ignore' instead.
Analogy: Think of one-hot encoding like a panel of light switches in a room. Instead of having a single dial that you turn to different numbers to select a colour, you have a separate “On/Off” switch for every possible colour. To choose “Blue”, you flip the Blue switch on (1) and ensure every other switch is off (0). This way, there is no confusion about whether one colour is “higher” or “lower” than another—they are all just separate switches.