Cleaning Wrong Data in Python Using Pandas – A Complete Guide
Last updated 1 month, 3 weeks ago | 135 views 75 5

Real-world data is rarely perfect. Whether you're analyzing sales data, survey responses, or logs, you’ll often encounter wrong data — values that are incorrect, inconsistent, or simply out of place. These anomalies can lead to misleading results if not handled properly.
In this tutorial, you’ll learn how to identify and clean wrong data using Python's Pandas library.
What Is "Wrong Data"?
Wrong data refers to values in your dataset that:
-
Are inaccurate (e.g., age = 250)
-
Are inconsistent (e.g.,
"Yes"
vs"Y"
vs"1"
) -
Are incorrectly formatted (e.g.,
"twenty"
instead of20
) -
Violate logical rules (e.g., negative salary)
Tools We'll Use
-
Pandas: for data handling and cleaning
-
NumPy: for working with
NaN
and numerical operations
Common Examples of Wrong Data
Problem Type | Example |
---|---|
Out-of-range values | Age = -5 or 200 |
Incorrect format | "twenty" in numeric field |
Inconsistency | Gender = ["M", "Male", "m"] |
Impossible values | Date = "30-02-2023" |
Typo or error | "Calfornia" instead of "California" |
Sample Dataset with Issues
Let's create a sample dataset with common wrong data problems:
import pandas as pd
import numpy as np
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, -1, 'thirty', 130, 45],
'Gender': ['F', 'M', 'Male', 'f', 'Female'],
'Salary': [50000, 60000, None, 2000000, 'not disclosed'],
'JoinDate': ['2023-01-01', 'not a date', '2022-03-15', '', '2021-07-01']
}
df = pd.DataFrame(data)
print(df)
Step-by-Step Guide to Cleaning Wrong Data
✅ Step 1: Detect and Clean Numeric Anomalies
Clean Age
: Remove non-numeric and unrealistic values.
# Convert Age to numeric
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
# Replace unrealistic ages
df.loc[(df['Age'] < 0) | (df['Age'] > 120), 'Age'] = np.nan
# Optionally, fill missing age with the median
df['Age'].fillna(df['Age'].median(), inplace=True)
✅ Step 2: Standardize Categorical Data
Clean Gender
: Normalize gender values to consistent labels.
# Convert to lowercase for consistency
df['Gender'] = df['Gender'].str.lower()
# Replace variations
df['Gender'] = df['Gender'].replace({
'f': 'Female',
'female': 'Female',
'm': 'Male',
'male': 'Male'
})
✅ Step 3: Handle Missing or Incorrect Salary Values
# Replace 'not disclosed' with NaN
df['Salary'] = df['Salary'].replace('not disclosed', np.nan)
# Convert to numeric
df['Salary'] = pd.to_numeric(df['Salary'], errors='coerce')
# Optional: Fill missing salaries with average or median
df['Salary'].fillna(df['Salary'].mean(), inplace=True)
✅ Step 4: Clean and Format Date Columns
# Replace empty strings with NaT
df['JoinDate'].replace('', np.nan, inplace=True)
# Convert to datetime
df['JoinDate'] = pd.to_datetime(df['JoinDate'], errors='coerce')
✅ Full Cleaned Example
import pandas as pd
import numpy as np
# Sample data
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, -1, 'thirty', 130, 45],
'Gender': ['F', 'M', 'Male', 'f', 'Female'],
'Salary': [50000, 60000, None, 2000000, 'not disclosed'],
'JoinDate': ['2023-01-01', 'not a date', '2022-03-15', '', '2021-07-01']
}
df = pd.DataFrame(data)
# Clean Age
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
df.loc[(df['Age'] < 0) | (df['Age'] > 120), 'Age'] = np.nan
df['Age'].fillna(df['Age'].median(), inplace=True)
# Clean Gender
df['Gender'] = df['Gender'].str.lower().replace({
'f': 'Female', 'female': 'Female',
'm': 'Male', 'male': 'Male'
})
# Clean Salary
df['Salary'] = df['Salary'].replace('not disclosed', np.nan)
df['Salary'] = pd.to_numeric(df['Salary'], errors='coerce')
df['Salary'].fillna(df['Salary'].mean(), inplace=True)
# Clean JoinDate
df['JoinDate'] = df['JoinDate'].replace('', np.nan)
df['JoinDate'] = pd.to_datetime(df['JoinDate'], errors='coerce')
# Final cleaned DataFrame
print(df)
Tips and Best Practices
Tip | Why It Matters |
---|---|
Use errors='coerce' |
Prevent crashes when converting formats |
Check for outliers | Use .describe() and visualizations |
Normalize categories | For consistent analysis and grouping |
Don’t fill blindly | Use median, mode, or contextual defaults |
Create backups | Always keep a copy of raw data for reference |
⚠️ Common Pitfalls
Mistake | Solution |
---|---|
Using .astype() on dirty data |
Use pd.to_numeric() or pd.to_datetime() with errors='coerce' |
Ignoring blank strings | Replace '' with np.nan before cleaning |
Overfitting cleaning logic | Clean according to business rules, not just technical correctness |
Summary
Cleaning wrong data is an essential skill for data analysis and machine learning. With Pandas, you can easily detect, clean, and transform bad data into usable formats.
Key Tools Recap:
-
pd.to_numeric()
,pd.to_datetime()
-
.replace()
,.fillna()
,.dropna()
-
.str.lower()
,.str.strip()
for string normalization