Understanding Data Encoding Techniques in Machine Learning: One-Hot Encoding vs. Label Encoding
In machine learning, handling categorical data is a common challenge. Two popular techniques for encoding categorical variables are One-Hot Encoding and Label Encoding. In this post, we’ll explore the differences between these two methods, and we’ll use sample data to illustrate their use in Python.
Sample Data: Let’s consider a sample dataset with a categorical variable, “Color,” which represents different colors of cars.
import pandas as pd
data = {
    'Car': ['Honda', 'Toyota', 'Ford', 'BMW', 'Nissan'],
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Black']
}
df = pd.DataFrame(data)
Label Encoding:
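Label Encoding converts categorical values into numerical labels by assigning each category an integer. scikit-learn's LabelEncoder sorts the categories and numbers them from 0, so the codes follow sorted order (alphabetical for these color names) rather than order of appearance.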
In Python, you can use the LabelEncoder from the scikit-learn library:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df['Color_LabelEncoded'] = label_encoder.fit_transform(df['Color'])
The resulting DataFrame will look like this:
      Car  Color  Color_LabelEncoded
0   Honda    Red                   3
1  Toyota   Blue                   1
2    Ford  Green                   2
3     BMW    Red                   3
4  Nissan  Black                   0
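To check which integer each color received, you can inspect the fitted encoder. The following is a minimal sketch that rebuilds the color column as a plain list so it runs on its own; the names `colors`, `codes`, `mapping`, and `restored` are illustrative and not part of the original example.

```python
from sklearn.preprocessing import LabelEncoder

# Rebuild the sample colors so the snippet is self-contained.
colors = ['Red', 'Blue', 'Green', 'Red', 'Black']

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(colors)

# classes_ holds the categories in sorted order; a category's position is its code.
mapping = {str(category): code for code, category in enumerate(label_encoder.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original labels.
restored = [str(label) for label in label_encoder.inverse_transform(codes)]
print(restored)
```

One caveat worth knowing: scikit-learn documents LabelEncoder as a tool for target labels (y). For input features, its OrdinalEncoder produces the same kind of integer codes and works on several columns at once.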
One-Hot Encoding:
One-Hot Encoding, on the other hand, creates one binary column per category, marking the presence or absence of that category with 1s and 0s.
You can use the pd.get_dummies() function in Pandas for one-hot encoding:
# dtype=int yields 0/1 columns; recent pandas versions default to True/False
df_encoded = pd.get_dummies(df, columns=['Color'], prefix=['Color'], dtype=int)
The resulting DataFrame will look like this:
      Car  Color_Black  Color_Blue  Color_Green  Color_Red
0   Honda            0           0            0          1
1  Toyota            0           1            0          0
2    Ford            0           0            1          0
3     BMW            0           0            0          1
4  Nissan            1           0            0          0
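If the encoding needs to be reused on new data (for example, a test set), scikit-learn's OneHotEncoder is one alternative to pd.get_dummies(). The sketch below assumes the same sample data; the "Purple" row is a made-up unseen category added only to show how handle_unknown='ignore' behaves.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'Car': ['Honda', 'Toyota', 'Ford', 'BMW', 'Nissan'],
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Black'],
})

# handle_unknown='ignore' encodes categories not seen during fit as all zeros.
encoder = OneHotEncoder(handle_unknown='ignore')

# The result is a sparse matrix by default; toarray() makes it dense for display.
encoded = encoder.fit_transform(df[['Color']]).toarray()

# categories_ lists the learned categories, one array per input column.
categories = [str(c) for c in encoder.categories_[0]]
print(categories)
print(encoded)

# Hypothetical unseen color: every one-hot column is 0 for this row.
unseen = encoder.transform(pd.DataFrame({'Color': ['Purple']})).toarray()
print(unseen)
```

The fitted encoder remembers the categories it saw, so training and test data end up with the same columns, which pd.get_dummies() does not guarantee when it is called separately on each set.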
Key Differences:
1. Data Representation:
- Label Encoding assigns a single integer to each category.
- One-Hot Encoding creates binary columns for each category.
2. Magnitude of Values:
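- Label Encoding produces integers whose order and size a model may treat as meaningful, even when the categories (such as colors) have no natural order.
- One-Hot Encoding gives every category its own 0/1 column, so no category is numerically larger than another.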
3. Impact on Algorithms:
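- Label Encoding often works acceptably with tree-based models, which split on thresholds, but linear and distance-based models can read the integers as an unintended ordinal relationship.
- One-Hot Encoding is the safer default for unordered categories because it doesn’t introduce relationships between them.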
4. Dimensionality:
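- Label Encoding keeps dimensionality unchanged: one categorical column becomes one integer column.
- One-Hot Encoding increases dimensionality by adding one column per category, which can be a concern for categorical variables with many distinct values.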
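The dimensionality difference is easy to see on a column with many distinct values. Here is a rough sketch using a hypothetical "City" column with 1,000 made-up categories; the column name and values are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical column with 1,000 distinct categories.
cities = pd.DataFrame({'City': [f'city_{i}' for i in range(1000)]})

# Label encoding: a single array of integer codes, however many categories exist.
label_encoded = LabelEncoder().fit_transform(cities['City'])

# One-hot encoding: one new column per distinct category.
one_hot = pd.get_dummies(cities, columns=['City'], dtype=int)

print(label_encoded.shape)
print(one_hot.shape)
```

With 1,000 rows and 1,000 distinct values, the label-encoded result is a single vector of length 1,000, while the one-hot result is a 1,000 by 1,000 table that is almost entirely zeros.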