Data Cleansing Masterclass: Achieving Pristine Data with Python and Sample Datasets

2 min read · Oct 19, 2023

Data cleaning is a crucial step in the data preprocessing pipeline that can significantly impact the quality of your analysis and machine learning models.

In this tutorial, we will explore essential techniques and best practices for data cleaning in Python, using sample data to illustrate each step.

Sample Data:
Let's start by creating a sample dataset for this tutorial. We'll use Python's Pandas library to generate a simple dataset with some common data quality issues.

import pandas as pd

data = {
    'Name': ['John', 'Alice', 'Bob', 'Charlie', 'Eva', 'Frank', 'Grace', 'Helen', 'Irene', 'Jack'],
    'Age': [25, 32, None, 40, 30, 22, 28, 35, 33, 'unknown'],
    'Salary': [45000, 60000, 54000, None, 75000, 50000, 62000, 48000, 55000, 52000],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami', 'Phoenix', None, 'Seattle', 'Boston', 'Denver'],
    'Gender': ['M', 'F', 'M', 'M', 'F', 'M', 'F', 'F', 'M', None]
}

df = pd.DataFrame(data)

Step 1: Handling Missing Values

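Missing data can bias your analysis or break downstream code. The two approaches below are alternatives: dropping every row with a missing value first would leave nothing to fill.

# Option A: drop rows that contain any missing value (returns a new DataFrame)
df_dropped = df.dropna()

# Option B: fill missing values in a column with a specific value.
# Assign the result back; inplace=True on a single column is unreliable in recent pandas.
# Note: the string 'unknown' is not a missing value, so it is left untouched here.
df['Age'] = df['Age'].fillna(0)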
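Before choosing between dropping and filling, it can help to count how many values are missing in each column. The sketch below uses a small hypothetical frame (`sample`, not the tutorial's `df`) and shows median imputation as one possible alternative to filling with 0; whether the median is appropriate depends on your data.

```python
import pandas as pd

# Hypothetical mini-frame that mirrors some of the tutorial's columns
sample = pd.DataFrame({
    'Age': [25, 32, None, 40],
    'Salary': [45000, 60000, 54000, None],
    'City': ['New York', None, 'Chicago', 'Houston'],
})

# Count missing values per column before deciding on a strategy
missing_counts = sample.isna().sum()
print(missing_counts)

# Fill numeric gaps with the column median and label missing text explicitly
filled = sample.copy()
filled['Age'] = filled['Age'].fillna(filled['Age'].median())
filled['Salary'] = filled['Salary'].fillna(filled['Salary'].median())
filled['City'] = filled['City'].fillna('Unknown')
print(filled)
```

Filling with a median keeps the column's typical value intact, whereas filling with 0 can drag down averages.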

Step 2: Removing Duplicates

Duplicate records can skew your analysis. Remove them.
df.drop_duplicates(inplace=True)
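It is often worth looking at the duplicates before removing them. The following is a minimal sketch on a hypothetical frame (`people`); the choice of `Name` and `City` as key columns is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical frame in which the third row repeats the first
people = pd.DataFrame({
    'Name': ['John', 'Alice', 'John', 'Bob'],
    'City': ['New York', 'Boston', 'New York', 'Chicago'],
    'Salary': [45000, 60000, 45000, 54000],
})

# Flag rows that repeat an earlier row (the first occurrence is not flagged)
dup_mask = people.duplicated()
n_duplicates = int(dup_mask.sum())
print(people[dup_mask])

# Compare only the key columns and keep the first occurrence of each key
deduped = people.drop_duplicates(subset=['Name', 'City'], keep='first').reset_index(drop=True)
print(deduped)
```

Using `subset` matters when two records describe the same entity but differ in a non-key column.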

Step 3: Correcting Data Types

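Ensure data types are consistent and appropriate. The Age column mixes numbers with the string 'unknown', so pandas stores it as a generic object column. Converting it with errors='coerce' turns any value that cannot be parsed into NaN.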

df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
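Because `errors='coerce'` replaces unparseable values with NaN without raising, it can be useful to check which entries were lost in the conversion. This sketch uses a hypothetical series (`ages`); the final cast to pandas' nullable `Int64` dtype is optional and assumes all parsed values are whole numbers.

```python
import pandas as pd

# Hypothetical column mixing numbers, a missing value, and strings
ages = pd.Series([25, 32, None, 'unknown', '41'])

converted = pd.to_numeric(ages, errors='coerce')

# Entries that were present originally but became NaN during conversion
newly_missing = ages.notna() & converted.isna()
print(ages[newly_missing])

# Optional: nullable integer dtype keeps whole numbers alongside missing values
int_ages = converted.astype('Int64')
print(int_ages.dtype)
```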

Step 4: Handling Outliers

Outliers can distort your analysis. Identify and handle them.
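One common way to identify outliers is the interquartile range (IQR) rule, which flags values more than 1.5 × IQR outside the middle 50% of the data. The sketch below is one possible approach, not the only one; the series is hypothetical, and the 500000 entry was added deliberately so there is something to detect.

```python
import pandas as pd

# Hypothetical salaries; the last value is an intentionally extreme entry
salaries = pd.Series([45000, 60000, 54000, 75000, 50000,
                      62000, 48000, 55000, 52000, 500000])

q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR below Q1 and above Q3
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (salaries < lower) | (salaries > upper)
outliers = salaries[is_outlier]
print(outliers)

# Two ways to handle them: drop the rows, or cap values at the fences
without_outliers = salaries[~is_outlier]
capped = salaries.clip(lower=lower, upper=upper)
```

Dropping removes the record entirely, while capping keeps the row but limits its influence; which is appropriate depends on whether the extreme value is an error or a genuine observation.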

Step 5: Data Validation and Cleaning

Validate and clean data for correctness.
# Remove leading and trailing whitespace in text data
df['Name'] = df['Name'].str.strip()

# Ensure consistent capitalization in text data
df['City'] = df['City'].str.title()

# Correct invalid values
df['Gender'] = df['Gender'].replace({'m': 'M', 'f': 'F', 'unknown': 'Other'})
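After normalizing values, a quick validation check can confirm that only expected categories remain. This is a minimal sketch on a hypothetical series (`gender`); the allowed set `{'M', 'F', 'Other'}` is an assumption based on the mapping above.

```python
import pandas as pd

# Hypothetical column with inconsistent case, stray whitespace, and an unexpected code
gender = pd.Series(['M', 'f', ' F', 'unknown', None, 'X'])

# Normalize whitespace and case, then map the placeholder to a category
cleaned = gender.str.strip().str.upper().replace({'UNKNOWN': 'Other'})

# Anything that is present but not in the allowed set needs manual review
allowed = {'M', 'F', 'Other'}
invalid = cleaned[cleaned.notna() & ~cleaned.isin(allowed)]
print(invalid)
```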

Step 6: Standardizing Data

Ensure data is standardized for consistent analysis.
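One reading of standardization is rescaling numeric columns so they are comparable across features. The sketch below shows z-score and min-max scaling on a hypothetical `salary` series; treating this step as numeric rescaling is an assumption, since it could equally refer to standardizing formats or units.

```python
import pandas as pd

# Hypothetical salary values
salary = pd.Series([45000, 60000, 54000, 75000, 50000], name='Salary')

# Z-score: subtract the mean and divide by the (population) standard deviation
z_scores = (salary - salary.mean()) / salary.std(ddof=0)

# Min-max: rescale so the smallest value is 0 and the largest is 1
min_max = (salary - salary.min()) / (salary.max() - salary.min())

print(z_scores.round(3))
print(min_max.round(3))
```

Z-scores suit methods that assume centered data, while min-max scaling is convenient when a bounded range is needed.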

Step 7: Renaming Columns

Make column names descriptive and uniform.
df.rename(columns={'Name': 'Full Name', 'Age': 'Age (years)', 'Salary': 'Annual Salary', 'City': 'Location', 'Gender': 'Sex'}, inplace=True)

Step 8: Saving Cleaned Data

Save your cleaned data to a new file.

df.to_csv('cleaned_data.csv', index=False)
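A quick round-trip check can confirm that the file was written as expected. This sketch writes a hypothetical frame (`cleaned`) to a temporary directory so it does not touch your working files, then reads it back and compares shape and column names.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned frame using the renamed columns
cleaned = pd.DataFrame({
    'Full Name': ['John', 'Alice'],
    'Annual Salary': [45000, 60000],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'cleaned_data.csv')
    # index=False avoids writing the row index as an extra unnamed column
    cleaned.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == cleaned.shape
same_columns = list(reloaded.columns) == list(cleaned.columns)
print(same_shape, same_columns)
```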

By following these steps and best practices, you can ensure that your data is accurate, consistent, and ready for analysis or machine learning.


Written by Chinna Babu Singanamala
