How to Clean Wrong Format Data in Python Pandas

Last updated 7 months, 1 week ago

When working with real-world datasets, it's common to encounter values in the wrong format, such as non-date text in a date column or words in a numeric field. These formatting issues can skew analysis and raise errors during data processing.

Fortunately, pandas provides functions such as pd.to_numeric() and pd.to_datetime() to detect, convert, and clean data with incorrect formats.

What is Wrong Format Data?

Wrong format data refers to values in a column that do not match the expected type. Examples include:

A string like "twenty" in a numeric column
Non-date text like "tomorrow" in a date column
Mixed formats like ["25", "twenty-five", "30"] in a single column

This can happen due to:

Manual data entry errors
Data scraped from unstructured sources
Inconsistent formats in files like CSV or Excel

Common Format Issues & Fixes

1. ⛔ Strings in Numeric Columns

import pandas as pd

df = pd.DataFrame({
    'Age': ['25', '30', 'forty', '45', '']
})

# Attempt conversion
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')

Explanation:

The values "forty" and '' (empty string) are not valid numbers.
errors='coerce' replaces them with NaN instead of raising an error, so the resulting column has dtype float64.
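
Because NaN is a float, coerced ages are stored as 25.0 rather than 25. If the valid values are all whole numbers, one option is the nullable Int64 dtype. The following is a minimal sketch under that assumption; the variable names are illustrative only.

```python
import pandas as pd

ages = pd.Series(['25', '30', 'forty', '45', ''], name='Age')

# Coerce unparseable entries to NaN; the result is a float column
ages_float = pd.to_numeric(ages, errors='coerce')

# The nullable integer dtype keeps whole numbers and marks gaps as <NA>
ages_int = ages_float.astype('Int64')

print(ages_float.dtype)            # float64
print(ages_int.dtype)              # Int64
print(int(ages_int.isna().sum()))  # 2
```

This conversion fails if any valid value has a fractional part, so it is only suitable for columns that are truly integer-valued.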

2. Incorrect Date Formats

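df = pd.DataFrame({
    'JoinDate': ['2023-01-01', '01/02/2023', 'unknown', '2023-04-01']
})

# Convert to datetime; format='mixed' (pandas 2.0+) parses each value on its own
df['JoinDate'] = pd.to_datetime(df['JoinDate'], format='mixed', errors='coerce')

Explanation:

'unknown' cannot be parsed as a date and is converted to NaT (Not a Time).
errors='coerce' prevents the conversion from raising an exception.
Since pandas 2.0, to_datetime infers a single format from the first value, so without format='mixed' the differently formatted '01/02/2023' would also become NaT.
An ambiguous value such as '01/02/2023' is read month-first (January 2) unless dayfirst=True is passed.
Avoid 'today' or 'now' as examples of bad input: pandas interprets them as the current timestamp rather than as invalid text.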
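
When every valid value in a column is expected to follow one layout, passing an explicit format is usually safer than relying on inference, because it removes the day/month ambiguity. The sketch below assumes the column is meant to hold DD/MM/YYYY dates; the sample values are made up for illustration.

```python
import pandas as pd

dates = pd.Series(['01/02/2023', '15/03/2023', 'unknown'])

# Every valid value is assumed to be DD/MM/YYYY;
# anything that does not match the format becomes NaT
parsed = pd.to_datetime(dates, format='%d/%m/%Y', errors='coerce')

print(parsed)
```

With this format, '01/02/2023' is read as 1 February 2023, not 2 January.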

3. ✅ Cleaning Boolean or Categorical Data

df = pd.DataFrame({
    'IsActive': ['Yes', 'No', 'Y', 'N', '1', '0']
})

# Normalize values
df['IsActive'] = df['IsActive'].replace({
    'Yes': True, 'Y': True, '1': True,
    'No': False, 'N': False, '0': False
})
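
Real data often contains the same label with different casing or stray whitespace, and sometimes labels that no mapping anticipates. One possible approach, sketched here with a hypothetical mapping and sample values, is to normalize the text first and use .map(), which turns unrecognized labels into missing values that can be reviewed.

```python
import pandas as pd

raw = pd.Series(['Yes', ' no', 'Y', 'n', '1', '0', 'maybe'])

# Hypothetical mapping; extend it to match the labels in your data
bool_map = {'yes': True, 'y': True, '1': True,
            'no': False, 'n': False, '0': False}

# Normalize whitespace and case, then map; unmapped labels become missing
is_active = raw.str.strip().str.lower().map(bool_map).astype('boolean')

# Labels that were not recognized, kept for manual review
unrecognized = raw[is_active.isna()]
print(unrecognized.tolist())  # ['maybe']
```

The nullable 'boolean' dtype is used so that True, False, and missing values can share one column.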

Detecting Wrong Format Data

Use these techniques to detect anomalies:

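Find rows where numeric conversion fails

# Convert into a separate Series so the original values stay visible
converted_age = pd.to_numeric(df['Age'], errors='coerce')
invalid_rows = df[converted_age.isna() & df['Age'].notna()]

Find invalid date entries

# Add format='mixed' (pandas 2.0+) if several date formats are acceptable
converted_date = pd.to_datetime(df['JoinDate'], errors='coerce')
invalid_dates = df[converted_date.isna() & df['JoinDate'].notna()]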
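
The two checks above can be wrapped in one helper that leaves the original column untouched and ignores values that were already missing. This is only a sketch: find_invalid is a hypothetical name, not a pandas function, and the sample data is invented.

```python
import pandas as pd

def find_invalid(series, converter):
    """Return the original values that `converter` could not parse.

    Hypothetical helper; `converter` is pd.to_numeric or pd.to_datetime.
    """
    converted = converter(series, errors='coerce')
    # Invalid = present in the source but missing after conversion
    mask = converted.isna() & series.notna()
    return series[mask]

df = pd.DataFrame({
    'Age': ['25', 'thirty', None, '40'],
    'JoinDate': ['2023-01-01', 'unknown', '2023-03-15', None],
})

bad_ages = find_invalid(df['Age'], pd.to_numeric)
bad_dates = find_invalid(df['JoinDate'], pd.to_datetime)
print(bad_ages.tolist())   # ['thirty']
print(bad_dates.tolist())  # ['unknown']
```

Returning the offending values, rather than only a count, makes it easier to decide whether they should be corrected, mapped, or dropped.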

✅ Full Working Example

import pandas as pd

# Sample DataFrame with format issues
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': ['25', 'thirty', '', '40'],
    'JoinDate': ['2023-01-01', 'unknown', '15/03/2023', 'April 5, 2023']
}

df = pd.DataFrame(data)

# Clean Age column: invalid entries become NaN, then are filled with the mean
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Clean JoinDate column; format='mixed' requires pandas 2.0 or later
df['JoinDate'] = pd.to_datetime(df['JoinDate'], format='mixed', errors='coerce')

# Final cleaned DataFrame
print(df)

Output:

      Name   Age   JoinDate
0    Alice  25.0 2023-01-01
1      Bob  32.5        NaT
2  Charlie  32.5 2023-03-15
3    David  40.0 2023-04-05

Note: The average of 25 and 40 is 32.5, which replaces invalid Age values.

Tips & Best Practices

Best Practice	Description
`errors='coerce'`	Safe way to convert columns with potential bad formats
`df.dtypes`	Always check column data types
Use `fillna()` smartly	Replace `NaN` with meaningful defaults or statistics
Normalize booleans and categories	Use `.replace()` to clean inconsistent text values
Create backup copies	Before making changes, keep an original copy of your DataFrame
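
Several of these practices can be combined in a few lines. The sketch below uses made-up data and picks the median as the fill statistic, which is only one possible choice.

```python
import pandas as pd

raw_df = pd.DataFrame({'Age': ['25', 'thirty', '40']})

# Work on a copy so the raw values stay available for auditing
clean_df = raw_df.copy()
clean_df['Age'] = pd.to_numeric(clean_df['Age'], errors='coerce')

# Fill gaps with the median of the valid values
clean_df['Age'] = clean_df['Age'].fillna(clean_df['Age'].median())

print(raw_df['Age'].tolist())  # original strings are unchanged
print(clean_df.dtypes)         # Age is now float64
```

Comparing raw_df with clean_df afterwards shows exactly which entries were altered by the cleaning step.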

⚠️ Common Pitfalls

Pitfall	Solution
Forgetting to handle blank strings	Use `.replace('', np.nan)` before conversions
Assuming date strings are valid	Always use `pd.to_datetime()` with `errors='coerce'`
Using `astype()` on dirty data	Use `pd.to_numeric()` or `to_datetime()` with error handling instead
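
The last pitfall is easy to reproduce. In this small sketch with invented values, astype(int) stops at the first bad string, while pd.to_numeric() with errors='coerce' converts what it can.

```python
import pandas as pd

ages = pd.Series(['25', 'forty', '40'])

astype_failed = False
try:
    ages.astype(int)  # raises because 'forty' is not a number
except ValueError as exc:
    astype_failed = True
    print(f"astype failed: {exc}")

# to_numeric with errors='coerce' marks the bad entry as NaN instead
converted = pd.to_numeric(ages, errors='coerce')
print(converted.tolist())  # [25.0, nan, 40.0]
```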

Summary

Handling wrong format data is a crucial part of data cleaning in any analysis or machine learning project. With Pandas, you can:

Detect formatting issues
Convert data safely
Replace invalid entries with meaningful values

Key Functions Recap:

pd.to_numeric()
pd.to_datetime()
.replace(), .fillna()
.dtypes to inspect column types
