Cleaning Wrong Data in Python Using Pandas – A Complete Guide

Last updated 1 month, 3 weeks ago | 135 views 75     5

Tags:- Python Pandas

Real-world data is rarely perfect. Whether you're analyzing sales data, survey responses, or logs, you’ll often encounter wrong data — values that are incorrect, inconsistent, or simply out of place. These anomalies can lead to misleading results if not handled properly.

In this tutorial, you’ll learn how to identify and clean wrong data using Python's Pandas library.


What Is "Wrong Data"?

Wrong data refers to values in your dataset that:

  • Are inaccurate (e.g., age = 250)

  • Are inconsistent (e.g., "Yes" vs "Y" vs "1")

  • Are incorrectly formatted (e.g., "twenty" instead of 20)

  • Violate logical rules (e.g., negative salary)


Tools We'll Use

  • Pandas: for data handling and cleaning

  • NumPy: for working with NaN and numerical operations


Common Examples of Wrong Data

Problem Type Example
Out-of-range values Age = -5 or 200
Incorrect format "twenty" in numeric field
Inconsistency Gender = ["M", "Male", "m"]
Impossible values Date = "30-02-2023"
Typo or error "Calfornia" instead of "California"

Sample Dataset with Issues

Let's create a sample dataset with common wrong data problems:

import pandas as pd
import numpy as np

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, -1, 'thirty', 130, 45],
    'Gender': ['F', 'M', 'Male', 'f', 'Female'],
    'Salary': [50000, 60000, None, 2000000, 'not disclosed'],
    'JoinDate': ['2023-01-01', 'not a date', '2022-03-15', '', '2021-07-01']
}

df = pd.DataFrame(data)
print(df)

Step-by-Step Guide to Cleaning Wrong Data

✅ Step 1: Detect and Clean Numeric Anomalies

Clean Age: Remove non-numeric and unrealistic values.

# Convert Age to numeric
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')

# Replace unrealistic ages
df.loc[(df['Age'] < 0) | (df['Age'] > 120), 'Age'] = np.nan

# Optionally, fill missing age with the median
df['Age'].fillna(df['Age'].median(), inplace=True)

✅ Step 2: Standardize Categorical Data

Clean Gender: Normalize gender values to consistent labels.

# Convert to lowercase for consistency
df['Gender'] = df['Gender'].str.lower()

# Replace variations
df['Gender'] = df['Gender'].replace({
    'f': 'Female',
    'female': 'Female',
    'm': 'Male',
    'male': 'Male'
})

✅ Step 3: Handle Missing or Incorrect Salary Values

# Replace 'not disclosed' with NaN
df['Salary'] = df['Salary'].replace('not disclosed', np.nan)

# Convert to numeric
df['Salary'] = pd.to_numeric(df['Salary'], errors='coerce')

# Optional: Fill missing salaries with average or median
df['Salary'].fillna(df['Salary'].mean(), inplace=True)

✅ Step 4: Clean and Format Date Columns

# Replace empty strings with NaT
df['JoinDate'].replace('', np.nan, inplace=True)

# Convert to datetime
df['JoinDate'] = pd.to_datetime(df['JoinDate'], errors='coerce')

✅ Full Cleaned Example

import pandas as pd
import numpy as np

# Sample data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, -1, 'thirty', 130, 45],
    'Gender': ['F', 'M', 'Male', 'f', 'Female'],
    'Salary': [50000, 60000, None, 2000000, 'not disclosed'],
    'JoinDate': ['2023-01-01', 'not a date', '2022-03-15', '', '2021-07-01']
}

df = pd.DataFrame(data)

# Clean Age
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
df.loc[(df['Age'] < 0) | (df['Age'] > 120), 'Age'] = np.nan
df['Age'].fillna(df['Age'].median(), inplace=True)

# Clean Gender
df['Gender'] = df['Gender'].str.lower().replace({
    'f': 'Female', 'female': 'Female',
    'm': 'Male', 'male': 'Male'
})

# Clean Salary
df['Salary'] = df['Salary'].replace('not disclosed', np.nan)
df['Salary'] = pd.to_numeric(df['Salary'], errors='coerce')
df['Salary'].fillna(df['Salary'].mean(), inplace=True)

# Clean JoinDate
df['JoinDate'] = df['JoinDate'].replace('', np.nan)
df['JoinDate'] = pd.to_datetime(df['JoinDate'], errors='coerce')

# Final cleaned DataFrame
print(df)

Tips and Best Practices

Tip Why It Matters
Use errors='coerce' Prevent crashes when converting formats
Check for outliers Use .describe() and visualizations
Normalize categories For consistent analysis and grouping
Don’t fill blindly Use median, mode, or contextual defaults
Create backups Always keep a copy of raw data for reference

⚠️ Common Pitfalls

Mistake Solution
Using .astype() on dirty data Use pd.to_numeric() or pd.to_datetime() with errors='coerce'
Ignoring blank strings Replace '' with np.nan before cleaning
Overfitting cleaning logic Clean according to business rules, not just technical correctness

Summary

Cleaning wrong data is an essential skill for data analysis and machine learning. With Pandas, you can easily detect, clean, and transform bad data into usable formats.

Key Tools Recap:

  • pd.to_numeric(), pd.to_datetime()

  • .replace(), .fillna(), .dropna()

  • .str.lower(), .str.strip() for string normalization