Data Cleansing Masterclass: Achieving Pristine Data with Python and Sample Datasets
Data cleaning is a crucial step in the data preprocessing pipeline that can significantly impact the quality of your analysis and machine learning models.
In this tutorial, we will explore the essential techniques and best practices for data cleaning in Python, using sample data to illustrate each step.
Sample Data:
Let's start by creating a sample dataset for this tutorial. We'll use Python's Pandas library to generate a simple dataset with some common data quality issues.
import pandas as pd
data = {
    'Name': ['John', 'Alice', 'Bob', 'Charlie', 'Eva', 'Frank', 'Grace', 'Helen', 'Irene', 'Jack'],
    'Age': [25, 32, None, 40, 30, 22, 28, 35, 33, 'unknown'],
    'Salary': [45000, 60000, 54000, None, 75000, 50000, 62000, 48000, 55000, 52000],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami', 'Phoenix', None, 'Seattle', 'Boston', 'Denver'],
    'Gender': ['M', 'F', 'M', 'M', 'F', 'M', 'F', 'F', 'M', None]
}
df = pd.DataFrame(data)
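Before changing anything, it usually helps to see where the problems are. The snippet below is a minimal sketch of that inspection; it uses a hypothetical four-row miniature of the dataset above so that it runs on its own.

```python
import pandas as pd

# Hypothetical miniature of the tutorial's dataset, just enough to show the checks.
df = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob', 'Jack'],
    'Age': [25, 32, None, 'unknown'],
    'Salary': [45000, 60000, 54000, None],
})

# Count missing values per column. The text 'unknown' is NOT counted as missing.
missing_per_column = df.isna().sum()
print(missing_per_column)

# A column that mixes numbers and text is stored with the generic 'object' dtype,
# which is a hint that it needs type correction later.
print(df.dtypes)
```

Note that Age reports only one missing value here, because the string 'unknown' looks like valid data to pandas until the column is converted to a numeric type.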
Step 1: Handling Missing Values
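Missing data can bias summary statistics and break models that cannot accept nulls. The two basic options are to drop incomplete rows or to fill the gaps. Choose one per column rather than running both: once dropna() has run, there is nothing left to fill. Note also that the text placeholder 'unknown' in Age is not treated as missing until the column is converted in Step 3.
# Option A: drop every row that contains at least one missing value
df = df.dropna()
# Option B (use instead of Option A): fill missing values with a specific value.
# Assigning the result back avoids the chained inplace call, which recent pandas versions warn about.
df['Age'] = df['Age'].fillna(0)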
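Filling with a constant such as 0 is the simplest option, but it can distort averages. A common alternative, sketched here on hypothetical values rather than the tutorial's full dataset, is to fill numeric columns with the median and categorical columns with the most frequent value.

```python
import pandas as pd

# Hypothetical sample with one gap in each column.
df = pd.DataFrame({
    'Salary': [45000, 60000, None, 75000],
    'City': ['Chicago', None, 'Chicago', 'Miami'],
})

# Numeric column: fill gaps with the median, which is less sensitive to extreme values than the mean.
df['Salary'] = df['Salary'].fillna(df['Salary'].median())

# Categorical column: fill gaps with the most frequent value (the mode).
df['City'] = df['City'].fillna(df['City'].mode()[0])

print(df)
```

Whether imputation is appropriate at all depends on why the values are missing, so treat this as one option rather than a default.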
Step 2: Removing Duplicates
Duplicate records can skew your analysis. Remove them.
df.drop_duplicates(inplace=True)
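The sample dataset happens to contain no repeated rows, so the call above leaves it unchanged. To show the behavior, here is a small sketch on hypothetical data that counts exact duplicates first and then removes duplicates based on a chosen key column.

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 0 exactly; row 3 repeats only the name from row 1.
df = pd.DataFrame({
    'Name': ['John', 'Alice', 'John', 'Alice'],
    'City': ['New York', 'Boston', 'New York', 'Denver'],
})

# Flag rows that repeat an earlier row in every column.
exact_duplicates = df.duplicated()
print(exact_duplicates.sum())

# Treat rows as duplicates when only the key column matches, keeping the first occurrence.
deduped_by_name = df.drop_duplicates(subset=['Name'], keep='first')
print(deduped_by_name)
```

Inspecting the duplicates before dropping them is a useful habit, since a repeated name is not always a repeated person.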
Step 3: Correcting Data Types
Ensure data types are consistent and appropriate.
# Convert Age to a numeric type; entries that cannot be parsed (such as 'unknown') become NaN
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
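To make the effect of errors='coerce' concrete, the following sketch runs the conversion on a short hypothetical series. The final cast to the nullable 'Int64' dtype is an optional extra, not something the steps above require.

```python
import pandas as pd

# Hypothetical ages mixing numbers, a missing value, and a text placeholder.
ages = pd.Series([25, 32, None, 'unknown'])

# Values that cannot be parsed become NaN instead of raising an error.
numeric_ages = pd.to_numeric(ages, errors='coerce')
print(numeric_ages)

# Optional: the nullable integer dtype keeps whole numbers alongside missing values.
nullable_ages = numeric_ages.astype('Int64')
print(nullable_ages)
```

Because coercion creates new missing values, it is worth re-checking for nulls after this step.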
Step 4: Handling Outliers
Outliers can distort your analysis. Identify and handle them.
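One widely used way to identify outliers is the interquartile range (IQR) rule, which flags values more than 1.5 × IQR below the first quartile or above the third. The sketch below applies it to hypothetical salaries; the 500000 entry is invented for illustration and does not appear in the sample dataset.

```python
import pandas as pd

# Hypothetical salaries with one implausibly large value added for illustration.
salaries = pd.Series([45000, 60000, 54000, 75000, 50000, 62000, 48000, 500000])

# Compute the quartiles and the interquartile range.
q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR beyond each quartile.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Identify the outliers.
is_outlier = (salaries < lower) | (salaries > upper)
outliers = salaries[is_outlier]
print(outliers)

# One way to handle them: cap values at the fences instead of deleting rows.
clipped = salaries.clip(lower=lower, upper=upper)
```

Capping is only one choice; removing the rows or leaving them in place after investigation can be equally valid, depending on whether the value is an error or a genuine extreme.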
Step 5: Data Validation and Cleaning
Validate and clean data for correctness.
# Remove leading and trailing whitespace in text data
df['Name'] = df['Name'].str.strip()
# Ensure consistent capitalization in text data
df['City'] = df['City'].str.title()
# Correct invalid values
df['Gender'] = df['Gender'].replace({'m': 'M', 'f': 'F', 'unknown': 'Other'})
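Replacements like these fix known bad codes, but they do not reveal values you did not anticipate. A minimal validation sketch follows; the age bounds of 0 to 120 and the allowed gender codes are assumptions chosen for illustration, not rules taken from the dataset.

```python
import pandas as pd

# Hypothetical records containing two impossible ages and one unexpected gender code.
df = pd.DataFrame({
    'Age': [25, -3, 40, 150],
    'Gender': ['M', 'F', 'X', 'F'],
})

# Assumed rule: a plausible age lies between 0 and 120 inclusive.
valid_age = df['Age'].between(0, 120)

# Assumed rule: only these codes are accepted.
valid_gender = df['Gender'].isin(['M', 'F', 'Other'])

# Collect every row that breaks at least one rule for manual review.
invalid_rows = df[~(valid_age & valid_gender)]
print(invalid_rows)
```

Reviewing the flagged rows before correcting or dropping them keeps the cleaning decisions explicit.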
Step 6: Standardizing Data
Ensure data is standardized for consistent analysis.
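The step above does not say which kind of standardization is meant. One common interpretation is rescaling numeric columns so they are comparable, and the sketch below assumes that reading, showing z-score standardization and min-max scaling on hypothetical salary values.

```python
import pandas as pd

# Hypothetical salaries used only to demonstrate the arithmetic.
salaries = pd.Series([45000, 60000, 54000, 75000, 50000], dtype='float64')

# Z-score: subtract the mean and divide by the standard deviation.
# pandas uses the sample standard deviation (ddof=1) by default.
z_scores = (salaries - salaries.mean()) / salaries.std()

# Min-max scaling: map the smallest value to 0 and the largest to 1.
min_max = (salaries - salaries.min()) / (salaries.max() - salaries.min())

print(z_scores)
print(min_max)
```

Z-scores suit methods that assume roughly centered features, while min-max scaling is convenient when a fixed 0 to 1 range is required.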
Step 7: Renaming Columns
Make column names descriptive and uniform.
df.rename(
    columns={
        'Name': 'Full Name',
        'Age': 'Age (years)',
        'Salary': 'Annual Salary',
        'City': 'Location',
        'Gender': 'Sex',
    },
    inplace=True,
)
Step 8: Saving Cleaned Data
Save your cleaned data to a new file.
df.to_csv('cleaned_data.csv', index=False)
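A quick way to confirm that the export worked is to read the data back and compare it with what was written. This sketch uses an in-memory buffer and a hypothetical two-row frame so that it runs without touching the file system; with a real file you would pass the path to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical cleaned data standing in for the tutorial's final DataFrame.
cleaned = pd.DataFrame({
    'Full Name': ['John', 'Alice'],
    'Annual Salary': [45000, 60000],
})

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Rewind the buffer and load the CSV back into a new DataFrame.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded)
```

Passing index=False matters here: without it, the row index is written as an extra unnamed column that reappears when the file is loaded.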
By following these steps and best practices, you can make your data considerably more accurate and consistent, and better prepared for analysis or machine learning.