Understanding Data Encoding Techniques in Machine Learning: One-Hot Encoding vs. Label Encoding

Chinna Babu Singanamala
2 min read · Oct 23, 2023


In machine learning, handling categorical data is a common challenge. Two popular techniques for encoding categorical variables are One-Hot Encoding and Label Encoding. In this post, we’ll explore the differences between these two methods, and we’ll use sample data to illustrate their use in Python.

Sample Data: Let’s consider a sample dataset with a categorical variable, “Color,” which represents different colors of cars.

import pandas as pd

data = {
    'Car': ['Honda', 'Toyota', 'Ford', 'BMW', 'Nissan'],
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Black']
}

df = pd.DataFrame(data)

Label Encoding:

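Label Encoding converts categorical values into numerical labels. Each category is assigned an integer from 0 to (number of categories - 1); scikit-learn's LabelEncoder assigns them in sorted order of the category values, which is alphabetical for strings.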

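In Python, you can use the LabelEncoder from the scikit-learn library. Note that scikit-learn documents LabelEncoder as intended for target labels; OrdinalEncoder is its counterpart for input features, but LabelEncoder keeps this single-column illustration simple: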

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df['Color_LabelEncoded'] = label_encoder.fit_transform(df['Color'])

The resulting DataFrame will look like this:

      Car  Color  Color_LabelEncoded
0   Honda    Red                   3
1  Toyota   Blue                   1
2    Ford  Green                   2
3     BMW    Red                   3
4  Nissan  Black                   0
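To check which integer belongs to which category, the fitted encoder can be inspected directly. The snippet below is a minimal sketch that rebuilds the sample Color column so it runs on its own; the variable names `codes`, `mapping`, and `restored` are illustrative and not part of the original example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(["Red", "Blue", "Green", "Red", "Black"], name="Color")

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(colors)

# classes_ holds the categories in sorted order; the position of each
# category in that array is the integer it was assigned.
mapping = {str(category): code for code, category in enumerate(label_encoder.classes_)}
print(mapping)  # {'Black': 0, 'Blue': 1, 'Green': 2, 'Red': 3}

# inverse_transform recovers the original strings from the integers.
restored = [str(value) for value in label_encoder.inverse_transform(codes)]
print(restored)  # ['Red', 'Blue', 'Green', 'Red', 'Black']
```

Because the mapping follows sorted order rather than order of appearance, Black gets 0 even though it appears last in the data.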

One-Hot Encoding:

One-Hot Encoding, on the other hand, creates one binary column per category. Each column marks the presence or absence of that category with 1s and 0s, so every row has exactly one 1 among the new columns.

You can use the pd.get_dummies() function in Pandas for one-hot encoding:

# dtype=int yields 0/1 columns; pandas 2.x defaults to True/False otherwise
df_encoded = pd.get_dummies(df, columns=['Color'], prefix=['Color'], dtype=int)

The resulting DataFrame will look like this:

      Car  Color_Black  Color_Blue  Color_Green  Color_Red
0   Honda            0           0            0          1
1  Toyota            0           1            0          0
2    Ford            0           0            1          0
3     BMW            0           0            0          1
4  Nissan            1           0            0          0

Key Differences:

1. Data Representation:

  • Label Encoding assigns a single integer to each category.
  • One-Hot Encoding creates one binary column per category.

2. Magnitude of Values:

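  • Label Encoding produces integers whose order is arbitrary, and a model may misread them as a meaningful ranking or magnitude.
  • One-Hot Encoding gives every category its own column, so no category is numerically larger than another.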

3. Impact on Algorithms:

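  • Label Encoding often works acceptably with tree-based models, which split on thresholds, but it can lead linear and distance-based models to learn an ordinal relationship that does not exist.
  • One-Hot Encoding is the safer default for nominal data because it introduces no ordering between categories.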

4. Dimensionality:

  • Label Encoding keeps a single column, so dimensionality does not grow.
  • One-Hot Encoding adds one column per category, which can be a concern for high-cardinality variables.
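Two of the differences above, magnitude and dimensionality, can be checked numerically. The following is a rough sketch on the sample colors plus a made-up high-cardinality column; the `ids` series and all variable names are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(["Red", "Blue", "Green", "Red", "Black"], name="Color")

labels = LabelEncoder().fit_transform(colors)            # integer codes
one_hot = pd.get_dummies(colors, dtype=int).to_numpy()   # shape (5, 4)

red, blue, black = 0, 1, 4  # row positions of these colors in the sample

# Magnitude: with integer labels, the gap between two colors depends on
# the arbitrary sorted order of their names.
label_red_black = abs(int(labels[red]) - int(labels[black]))
label_blue_black = abs(int(labels[blue]) - int(labels[black]))
print(label_red_black, label_blue_black)  # 3 1

# With one-hot vectors, any two distinct colors are equally far apart.
onehot_red_black = float(np.linalg.norm(one_hot[red] - one_hot[black]))
onehot_blue_black = float(np.linalg.norm(one_hot[blue] - one_hot[black]))
print(onehot_red_black == onehot_blue_black)  # True

# Dimensionality: a hypothetical column with 1,000 distinct values.
ids = pd.Series([f"id_{i}" for i in range(1000)], name="ID")
wide = pd.get_dummies(ids, dtype=int)        # one column per category
narrow = LabelEncoder().fit_transform(ids)   # still a single column
print(wide.shape, narrow.shape)  # (1000, 1000) (1000,)
```

The one-hot distance between any two distinct categories is the square root of 2, while the label distance changes if the categories are merely renamed.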
