How to Clean Wrong Format Data in Python Pandas

Last updated 1 month, 3 weeks ago | 124 views 75     5

Tags:- Python Pandas

When working with real-world datasets, it’s common to encounter values in the wrong format — such as strings in a date column, or text in a numeric field. These formatting issues can prevent accurate analysis and raise errors during data processing.

Fortunately, Pandas in Python provides powerful tools to detect, convert, and clean data with incorrect formats.


In This Article:

  • ✅ What is "wrong format" data?

  • ✅ Common formatting issues in datasets

  • ✅ Detecting wrong formats

  • ✅ Fixing wrong formats with Pandas

  • ✅ Full working example

  • ✅ Tips & common pitfalls


What is Wrong Format Data?

Wrong format data refers to values in a column that do not match the expected type. Examples include:

  • A string like "twenty" in a numeric column

  • Non-date text like "tomorrow" in a date column

  • Mixed formats like ["25", "twenty-five", "30"] in a single column

This can happen due to:

  • Manual data entry errors

  • Data scraped from unstructured sources

  • Inconsistent formats in files like CSV or Excel


Common Format Issues & Fixes

1. ⛔ Strings in Numeric Columns

import pandas as pd

df = pd.DataFrame({
    'Age': ['25', '30', 'forty', '45', '']
})

# Attempt conversion
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')

Explanation:

  • The value "forty" and '' (empty string) are not valid numbers.

  • errors='coerce' replaces these with NaN.


2.Incorrect Date Formats

df = pd.DataFrame({
    'JoinDate': ['2023-01-01', '01/02/2023', 'today', '2023-04-01']
})

# Convert to datetime
df['JoinDate'] = pd.to_datetime(df['JoinDate'], errors='coerce')

Explanation:

  • 'today' is not a standard date string and will be converted to NaT (Not a Time).

  • Use errors='coerce' to avoid exceptions.


3. ✅ Cleaning Boolean or Categorical Data

df = pd.DataFrame({
    'IsActive': ['Yes', 'No', 'Y', 'N', '1', '0']
})

# Normalize values
df['IsActive'] = df['IsActive'].replace({
    'Yes': True, 'Y': True, '1': True,
    'No': False, 'N': False, '0': False
})

Detecting Wrong Format Data

Use these techniques to detect anomalies:

Find rows where numeric conversion fails

df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
invalid_rows = df[df['Age'].isnull()]

Find invalid date entries

df['JoinDate'] = pd.to_datetime(df['JoinDate'], errors='coerce')
invalid_dates = df[df['JoinDate'].isnull()]

✅ Full Working Example

import pandas as pd
import numpy as np

# Sample DataFrame with format issues
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': ['25', 'thirty', '', '40'],
    'JoinDate': ['2023-01-01', 'today', '15/03/2023', 'April 5, 2023']
}

df = pd.DataFrame(data)

# Clean Age column
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
df['Age'].fillna(df['Age'].mean(), inplace=True)

# Clean JoinDate column
df['JoinDate'] = pd.to_datetime(df['JoinDate'], errors='coerce')

# Final cleaned DataFrame
print(df)

Output:

      Name   Age   JoinDate
0    Alice  25.0 2023-01-01
1      Bob  32.5        NaT
2  Charlie  32.5 2023-03-15
3    David  40.0 2023-04-05

Note: The average of 25 and 40 is 32.5, which replaces invalid Age values.


Tips & Best Practices

Best Practice Description
errors='coerce' Safe way to convert columns with potential bad formats
df.dtypes Always check column data types
Use fillna() smartly Replace NaN with meaningful defaults or statistics
Normalize booleans and categories Use .replace() to clean inconsistent text values
Create backup copies Before making changes, keep an original copy of your DataFrame

⚠️ Common Pitfalls

Pitfall Solution
Forgetting to handle blank strings Use .replace('', np.nan) before conversions
Assuming date strings are valid Always use pd.to_datetime() with errors='coerce'
Using astype() on dirty data Use pd.to_numeric() or to_datetime() with error handling instead

Summary

Handling wrong format data is a crucial part of data cleaning in any analysis or machine learning project. With Pandas, you can:

  • Detect formatting issues

  • Convert data safely

  • Replace invalid entries with meaningful values

Key Functions Recap:

  • pd.to_numeric()

  • pd.to_datetime()

  • .replace(), .fillna()

  • .dtypes to inspect column types