Cleaning Wrong Data in Python Using Pandas – A Complete Guide

Last updated 7 months, 1 week ago | 663 views 75 5

Cleaning Wrong Data in Python Using Pandas – A Complete Guide

Real-world data is rarely perfect. Whether you're analyzing sales data, survey responses, or logs, you’ll often encounter wrong data — values that are incorrect, inconsistent, or simply out of place. These anomalies can lead to misleading results if not handled properly.

In this tutorial, you’ll learn how to identify and clean wrong data using Python's Pandas library.

What Is "Wrong Data"?

Wrong data refers to values in your dataset that:

Are inaccurate (e.g., age = 250)
Are inconsistent (e.g., "Yes" vs "Y" vs "1")
Are incorrectly formatted (e.g., "twenty" instead of 20)
Violate logical rules (e.g., negative salary)

Tools We'll Use

Pandas: for data handling and cleaning
NumPy: for working with NaN and numerical operations

Common Examples of Wrong Data

Problem Type	Example
Out-of-range values	Age = -5 or 200
Incorrect format	"twenty" in numeric field
Inconsistency	Gender = ["M", "Male", "m"]
Impossible values	Date = "30-02-2023"
Typo or error	"Calfornia" instead of "California"

Sample Dataset with Issues

Let's create a sample dataset with common wrong data problems:

import pandas as pd
import numpy as np

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, -1, 'thirty', 130, 45],
    'Gender': ['F', 'M', 'Male', 'f', 'Female'],
    'Salary': [50000, 60000, None, 2000000, 'not disclosed'],
    'JoinDate': ['2023-01-01', 'not a date', '2022-03-15', '', '2021-07-01']
}

df = pd.DataFrame(data)
print(df)

Step-by-Step Guide to Cleaning Wrong Data

✅ Step 1: Detect and Clean Numeric Anomalies

Clean Age: Remove non-numeric and unrealistic values.

# Convert Age to numeric
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')

# Replace unrealistic ages
df.loc[(df['Age'] < 0) | (df['Age'] > 120), 'Age'] = np.nan

# Optionally, fill missing age with the median
df['Age'].fillna(df['Age'].median(), inplace=True)

✅ Step 2: Standardize Categorical Data

Clean Gender: Normalize gender values to consistent labels.

# Convert to lowercase for consistency
df['Gender'] = df['Gender'].str.lower()

# Replace variations
df['Gender'] = df['Gender'].replace({
    'f': 'Female',
    'female': 'Female',
    'm': 'Male',
    'male': 'Male'
})

✅ Step 3: Handle Missing or Incorrect Salary Values

# Replace 'not disclosed' with NaN
df['Salary'] = df['Salary'].replace('not disclosed', np.nan)

# Convert to numeric
df['Salary'] = pd.to_numeric(df['Salary'], errors='coerce')

# Optional: Fill missing salaries with average or median
df['Salary'].fillna(df['Salary'].mean(), inplace=True)

✅ Step 4: Clean and Format Date Columns

# Replace empty strings with NaT
df['JoinDate'].replace('', np.nan, inplace=True)

# Convert to datetime
df['JoinDate'] = pd.to_datetime(df['JoinDate'], errors='coerce')

✅ Full Cleaned Example

import pandas as pd
import numpy as np

# Sample data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, -1, 'thirty', 130, 45],
    'Gender': ['F', 'M', 'Male', 'f', 'Female'],
    'Salary': [50000, 60000, None, 2000000, 'not disclosed'],
    'JoinDate': ['2023-01-01', 'not a date', '2022-03-15', '', '2021-07-01']
}

df = pd.DataFrame(data)

# Clean Age
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
df.loc[(df['Age'] < 0) | (df['Age'] > 120), 'Age'] = np.nan
df['Age'].fillna(df['Age'].median(), inplace=True)

# Clean Gender
df['Gender'] = df['Gender'].str.lower().replace({
    'f': 'Female', 'female': 'Female',
    'm': 'Male', 'male': 'Male'
})

# Clean Salary
df['Salary'] = df['Salary'].replace('not disclosed', np.nan)
df['Salary'] = pd.to_numeric(df['Salary'], errors='coerce')
df['Salary'].fillna(df['Salary'].mean(), inplace=True)

# Clean JoinDate
df['JoinDate'] = df['JoinDate'].replace('', np.nan)
df['JoinDate'] = pd.to_datetime(df['JoinDate'], errors='coerce')

# Final cleaned DataFrame
print(df)

Tips and Best Practices

Tip	Why It Matters
Use `errors='coerce'`	Prevent crashes when converting formats
Check for outliers	Use `.describe()` and visualizations
Normalize categories	For consistent analysis and grouping
Don’t fill blindly	Use median, mode, or contextual defaults
Create backups	Always keep a copy of raw data for reference

⚠️ Common Pitfalls

Mistake	Solution
Using `.astype()` on dirty data	Use `pd.to_numeric()` or `pd.to_datetime()` with `errors='coerce'`
Ignoring blank strings	Replace `''` with `np.nan` before cleaning
Overfitting cleaning logic	Clean according to business rules, not just technical correctness

Summary

Cleaning wrong data is an essential skill for data analysis and machine learning. With Pandas, you can easily detect, clean, and transform bad data into usable formats.

Key Tools Recap:

pd.to_numeric(), pd.to_datetime()
.replace(), .fillna(), .dropna()
.str.lower(), .str.strip() for string normalization

From The Article

Python PostgreSQL Tutorial – Creating a Database with psycopg2

Explain the difference between @staticmethod and @classmethod?

Django Comments System – A Complete Guide

Python MSSQL: Using WHERE Clause – Filter Data with Precision

Creating a Basic Retrieve API with Django REST Framework

Python NumPy: Binomial Distribution Explained

Cleaning Wrong Data in Python Using Pandas – A Complete Guide

Cleaning Wrong Data in Python Using Pandas – A Complete Guide

What Is "Wrong Data"?

Tools We'll Use

Common Examples of Wrong Data

Sample Dataset with Issues

Step-by-Step Guide to Cleaning Wrong Data

✅ Step 1: Detect and Clean Numeric Anomalies

✅ Step 2: Standardize Categorical Data

✅ Step 3: Handle Missing or Incorrect Salary Values

✅ Step 4: Clean and Format Date Columns

✅ Full Cleaned Example

Tips and Best Practices

⚠️ Common Pitfalls

Summary

Key Tools Recap:

From The Article

Trending View All

How to show data values on top of each bar …

A non-numeric value encountered in PHP

The view account.views.register did not return an HttpResponse object. It …

Input type number maxlength not working

Uncaught TypeError: e.indexOf is not a function in JQuery

How to start array index from 1 in PHP

Interview Questions

PHP Interview Question

PayPal Interview Question

MySQL Interview Question

PHP-MySQL Interview Question

SQL Interview Question

CodeIgniter Interview Question

JQuery Interview Question

htaccess Interview Question

JavaScript Interview Question

HTML Interview Question

Python Interview Question

Django Interview Question