Exploratory Data Analysis in Python: A Step-by-Step Guide with Sample Data
Exploratory Data Analysis (EDA) is a critical step in any data science or data analysis project. It involves understanding your data, uncovering patterns, and gaining insights that can inform your subsequent analysis. In this guide, we’ll walk through the process of performing EDA in Python using sample data to illustrate each step.
Step 1: Importing Necessary Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
Step 2: Loading the Sample Data
# Load your dataset
df = pd.read_csv("your_data.csv")
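If you do not have a CSV file at hand, a small synthetic dataset is enough to follow the remaining steps. The sketch below is one possible way to build it; the column names (`numeric_column`, `other_numeric_column`, `categorical_column`), the distributions, and the number of rows are illustrative assumptions, not part of any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names mirror the placeholders used in this guide.
rng = np.random.default_rng(seed=42)
n_rows = 200

df = pd.DataFrame({
    "numeric_column": rng.normal(loc=50, scale=10, size=n_rows),
    "other_numeric_column": rng.uniform(low=0, high=100, size=n_rows),
    "categorical_column": rng.choice(["A", "B", "C"], size=n_rows),
})

# Blank out ten values so the missing-data step has something to find.
missing_rows = rng.choice(n_rows, size=10, replace=False)
df.loc[missing_rows, "numeric_column"] = np.nan

print(df.head())
```

Fixing the seed keeps the data reproducible between runs, which makes it easier to compare results as you work through the steps.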
Step 3: Getting an Overview of the Data
- Use df.head() to view the first few rows of the data.
- Use df.shape to check the dimensions of the dataset (rows, columns).
- Use df.info() to see each column's data type and non-null count, which reveals where values are missing.
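The three calls above can also be combined into a single per-column summary table. This is a minimal sketch on a tiny hypothetical DataFrame; the `overview` name and its column labels are illustrative choices.

```python
import pandas as pd

# Tiny hypothetical dataset with one missing value per column.
df = pd.DataFrame({
    "numeric_column": [1.5, 2.0, None, 4.25],
    "categorical_column": ["A", "B", "A", None],
})

# One row per column: its type, how many values are present, how many are distinct.
overview = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "non_null": df.count(),
    "unique": df.nunique(),
})

print(df.shape)
print(overview)
```

A low `non_null` count points to missing data, and a low `unique` count on a text column suggests it is categorical.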
Step 4: Descriptive Statistics
Calculate basic statistics of the numerical columns:
df.describe()
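By default, df.describe() summarizes only the numeric columns when the DataFrame contains any. For categorical columns, value counts are usually more informative. A short sketch, assuming a hypothetical `categorical_column`:

```python
import pandas as pd

# Hypothetical data: one numeric and one categorical column.
df = pd.DataFrame({
    "numeric_column": [10, 12, 11, 40],
    "categorical_column": ["A", "B", "A", "A"],
})

# Numeric summary: count, mean, std, min, quartiles, max.
numeric_summary = df.describe()

# Categorical summary: count, number of unique values, most frequent value and its frequency.
categorical_summary = df["categorical_column"].describe()

# Frequency of each category.
category_counts = df["categorical_column"].value_counts()

print(numeric_summary)
print(categorical_summary)
print(category_counts)
```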
Step 5: Data Visualization
Visualizing the data is crucial to spot trends and patterns. Let’s create some basic visualizations:
# Histogram
sns.histplot(df['numeric_column'], kde=True)
plt.title("Distribution of Numeric Column")
plt.show()
# Box plot
sns.boxplot(x='categorical_column', y='numeric_column', data=df)
plt.title("Box Plot of Numeric Column by Category")
plt.show()
Step 6: Correlation Analysis
Understand relationships between variables using a correlation matrix:
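# numeric_only=True skips text columns, which would otherwise raise an error in pandas 2.0+
correlation_matrix = df.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")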
plt.show()
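A heatmap becomes hard to read once there are many columns, so it can help to list the column pairs ranked by the absolute value of their correlation. The following is a rough sketch on synthetic data; the column names `x`, `y`, `z`, and `label` are hypothetical.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic data: y depends strongly on x, z is unrelated noise.
rng = np.random.default_rng(seed=0)
x = rng.normal(size=100)
df = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.1, size=100),
    "z": rng.normal(size=100),
    "label": rng.choice(["A", "B"], size=100),  # non-numeric, excluded below
})

corr = df.corr(numeric_only=True)

# Collect each unordered pair of columns once, then sort by absolute correlation.
pairs = pd.Series({(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)})
pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)

print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top alongside strong positive ones.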
Step 7: Handling Missing Data
Identify and handle missing data:
# Check for missing values
df.isnull().sum()
# Handle missing values (e.g., fill with mean or median)
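# Assign the result back; calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas
df['column_name'] = df['column_name'].fillna(df['column_name'].median())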
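Before choosing a fill strategy, it helps to know what share of each column is missing. The sketch below reports that percentage, then fills numeric columns with the median and a categorical column with its most frequent value. The data and column names are hypothetical, and median/mode imputation is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "numeric_column": [1.0, 2.0, np.nan, 4.0, 100.0],
    "categorical_column": ["A", None, "A", "B", "A"],
})

# Percentage of missing values per column.
missing_pct = df.isna().mean().mul(100).round(1)
print(missing_pct)

# Numeric columns: fill with the column median (less sensitive to outliers than the mean).
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Categorical column: fill with the most frequent value.
mode_value = df["categorical_column"].mode().iloc[0]
df["categorical_column"] = df["categorical_column"].fillna(mode_value)

print(df)
```

Here the median of the observed numeric values (1, 2, 4, 100) is 3, whereas the mean would be pulled up to 26.75 by the single large value.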
Step 8: Outlier Detection
Detect outliers before deciding how to handle them:
# Box plot: points beyond the whiskers (1.5 x IQR from the quartiles by default) are candidate outliers
sns.boxplot(x=df['numeric_column'])
plt.title("Box Plot for Outlier Detection")
plt.show()
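The IQR method can also be applied numerically, which gives you the actual outlying rows instead of a picture. A minimal sketch, assuming a hypothetical series of values and the conventional 1.5 multiplier:

```python
import pandas as pd

# Hypothetical values with one obvious outlier.
values = pd.Series([10, 12, 11, 13, 12, 11, 95], name="numeric_column")

# Interquartile range: the spread of the middle 50% of the data.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Values more than 1.5 * IQR outside the quartiles are flagged as outliers.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = values[(values < lower_bound) | (values > upper_bound)]

print(f"Bounds: [{lower_bound}, {upper_bound}]")
print(outliers)
```

Whether to remove, cap, or keep the flagged values depends on the dataset; an outlier may be a data-entry error or a genuine extreme observation.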
Exploratory Data Analysis is an essential first step in any data analysis project. It helps you understand your data, identify patterns, and make informed decisions about how to proceed with further analysis.