Removing Duplicates in Python Using Pandas – A Complete Guide

In real-world datasets, it's common to find duplicate rows caused by data entry errors, system glitches, or improper data merges. These duplicates can skew your analysis and must be dealt with efficiently.
Fortunately, Pandas provides powerful and flexible tools for identifying and removing duplicates in a DataFrame.
What You'll Learn:
- What duplicates are in a dataset
- How to detect duplicate rows
- How to remove duplicates using drop_duplicates()
- Working with subsets of columns
- Keeping specific occurrences (first/last)
- A full working example
- Tips and common pitfalls
What Are Duplicates?
Duplicates are rows in a DataFrame that have the same values for all or selected columns.
Example:
import pandas as pd
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 22]
})
Index | Name | Age |
---|---|---|
0 | Alice | 25 |
1 | Bob | 30 |
2 | Alice | 25 |
3 | David | 22 |
Rows 0 and 2 are duplicates.
Detecting Duplicates
Use duplicated() to find duplicate rows.
# Detect all duplicate rows
print(df.duplicated())
Output:
0    False
1    False
2     True
3    False
dtype: bool
Only row 2 is marked as a duplicate because it matches row 0.
You can filter them like this:
# Show only duplicate rows
duplicates = df[df.duplicated()]
print(duplicates)
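Note that duplicated() also accepts a keep parameter. If you want to flag every copy of a repeated row, not just the later ones, pass keep=False:
# Mark ALL copies of duplicated rows, including the first occurrence
all_copies = df[df.duplicated(keep=False)]
print(all_copies)
# Expected output on the sample DataFrame:
#     Name  Age
# 0  Alice   25
# 2  Alice   25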
❌ Removing Duplicates
Use drop_duplicates() to remove duplicates.
# Remove duplicate rows
df_cleaned = df.drop_duplicates()
This removes every row whose values match an earlier row across all columns, keeping the first occurrence of each.
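On the sample DataFrame above, rows 0, 1, and 3 survive; row 2 is dropped because it repeats row 0:
print(df_cleaned)
# Expected output:
#     Name  Age
# 0  Alice   25
# 1    Bob   30
# 3  David   22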
Removing Duplicates Based on Specific Columns
You may want to check for duplicates using a subset of columns.
# Remove duplicates based on 'Name' column
df_unique_names = df.drop_duplicates(subset=['Name'])
This keeps the first occurrence of each name by default.
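When you pass a subset, the other columns are ignored entirely. Here is a small sketch with hypothetical data to make that concrete: two rows share a Name but differ in Age, and the second is still dropped:
# Hypothetical data: same name, different ages
df2 = pd.DataFrame({
    'Name': ['Alice', 'Alice', 'Bob'],
    'Age': [25, 31, 30]
})
print(df2.drop_duplicates(subset=['Name']))
# The Alice/31 row is dropped even though its Age differs:
#     Name  Age
# 0  Alice   25
# 2    Bob   30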
Keeping First or Last Occurrence
You can control which duplicate to keep:
# Keep last occurrence
df_last = df.drop_duplicates(keep='last')
# Drop all duplicates (keep none)
df_none = df.drop_duplicates(keep=False)
Parameter | Description
---|---
keep='first' | Keeps the first occurrence (default)
keep='last' | Keeps the last occurrence
keep=False | Drops all duplicates
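On the four-row sample DataFrame, the two non-default settings produce different survivors:
print(df.drop_duplicates(keep='last'))
# Row 2 survives instead of row 0:
#     Name  Age
# 1    Bob   30
# 2  Alice   25
# 3  David   22

print(df.drop_duplicates(keep=False))
# Both Alice rows are dropped:
#     Name  Age
# 1    Bob   30
# 3  David   22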
In-Place Operation
To remove duplicates directly in the original DataFrame:
df.drop_duplicates(inplace=True)
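Equivalently, you can reassign the result, which many style guides prefer over inplace=True. If you also want a clean 0..n-1 index afterwards, drop_duplicates accepts ignore_index=True (available in pandas 1.0 and later):
# Reassignment instead of inplace mutation
df = df.drop_duplicates()

# Reset the index in the same call (pandas >= 1.0)
df = df.drop_duplicates(ignore_index=True)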
✅ Full Working Example
import pandas as pd
# Sample data
data = {
    'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
    'Age': [25, 30, 25, 22, 30],
    'City': ['NY', 'LA', 'NY', 'SF', 'LA']
}
df = pd.DataFrame(data)
# Step 1: Detect duplicates
print("Duplicate rows:")
print(df[df.duplicated()])
# Step 2: Remove full duplicates
df_clean = df.drop_duplicates()
# Step 3: Remove based on 'Name' only
df_name_unique = df.drop_duplicates(subset=['Name'])
print("\nCleaned DataFrame:")
print(df_clean)
print("\nUnique Names:")
print(df_name_unique)
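Output (rows 2 and 4 repeat rows 0 and 1, so both are flagged and dropped):
Duplicate rows:
    Name  Age City
2  Alice   25   NY
4    Bob   30   LA

Cleaned DataFrame:
    Name  Age City
0  Alice   25   NY
1    Bob   30   LA
3  David   22   SF

Unique Names:
    Name  Age City
0  Alice   25   NY
1    Bob   30   LA
3  David   22   SF
Note that df_clean and df_name_unique happen to match here, because every duplicated Name also duplicates the full row.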
⚠️ Common Pitfalls
Mistake | Tip
---|---
Not specifying subset when needed | Always define which columns matter for duplicates
Forgetting to keep the result | If you want to update the DataFrame, use inplace=True or reassign the result
Expecting duplicated() to include the first occurrence | With the default keep='first', it marks only subsequent duplicates
Dropping necessary records | Review your logic before dropping duplicates; an inplace drop cannot be undone
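A cheap safety check before dropping is to count how many rows would go:
# How many rows would be removed? (True counts as 1 when summed)
n_dupes = df.duplicated().sum()
print(f"{n_dupes} duplicate rows out of {len(df)}")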
Pro Tips
- Use .shape before and after cleaning to check how many rows were removed.
- For fuzzy matching or near-duplicates, consider using libraries like fuzzywuzzy or recordlinkage.
- Use .groupby() and .count() to identify suspicious repetition patterns before dropping (see the sketch after this list).
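A minimal sketch combining the first and last tips, using the column names from the example data above. It uses .size(), a close cousin of .count() that counts rows per group directly:
# Compare row counts before and after cleaning
before = df.shape[0]
df_clean = df.drop_duplicates()
print(f"Removed {before - df_clean.shape[0]} rows")

# Count how often each Name/Age pair appears; anything > 1 repeats
counts = df.groupby(['Name', 'Age']).size()
print(counts[counts > 1])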
Summary
Removing duplicates is a critical step in data cleaning. Pandas makes it easy to identify and remove duplicates based on all or selected columns, with options to keep the first, last, or none.
Key Methods Recap:
- df.duplicated(): detect duplicates
- df.drop_duplicates(): remove duplicates
- subset=: specify which columns to check
- keep=: control which duplicates to keep