Removing Duplicates in Python Using Pandas – A Complete Guide

In real-world datasets, it's common to find duplicate rows caused by data entry errors, system glitches, or improper data merges. These duplicates can skew your analysis and must be dealt with efficiently.
Fortunately, Pandas provides powerful and flexible tools for identifying and removing duplicates in a DataFrame.
What You'll Learn:
- What duplicates are in a dataset
- How to detect duplicate rows
- How to remove duplicates using drop_duplicates()
- Working with subsets of columns
- Keeping specific occurrences (first/last)
- A full working example
- Tips and common pitfalls
What Are Duplicates?
Duplicates are rows in a DataFrame that have the same values for all or selected columns.
Example:
import pandas as pd
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David'],
    'Age': [25, 30, 25, 22]
})
Index | Name | Age |
---|---|---|
0 | Alice | 25 |
1 | Bob | 30 |
2 | Alice | 25 |
3 | David | 22 |
Rows 0 and 2 are duplicates.
Detecting Duplicates
Use duplicated() to find duplicate rows.
# Detect all duplicate rows
print(df.duplicated())
Output:
0    False
1    False
2     True
3    False
dtype: bool
Only row 2 is marked as a duplicate because it matches row 0.
You can filter them like this:
# Show only duplicate rows
duplicates = df[df.duplicated()]
print(duplicates)
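Note that duplicated() also accepts a keep parameter. If you want to flag every copy of a repeated row, not just the later ones, pass keep=False:
# Mark ALL copies of duplicated rows, including the first occurrence
all_copies = df[df.duplicated(keep=False)]
print(all_copies)
# Expected output on the sample DataFrame:
#     Name  Age
# 0  Alice   25
# 2  Alice   25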
❌ Removing Duplicates
Use drop_duplicates() to remove duplicates.
# Remove duplicate rows
df_cleaned = df.drop_duplicates()
This removes every row whose values match an earlier row across all columns, keeping the first occurrence of each.
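On the sample DataFrame above, rows 0, 1, and 3 survive; row 2 is dropped because it repeats row 0:
print(df_cleaned)
# Expected output:
#     Name  Age
# 0  Alice   25
# 1    Bob   30
# 3  David   22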
Removing Duplicates Based on Specific Columns
You may want to check for duplicates using a subset of columns.
# Remove duplicates based on 'Name' column
df_unique_names = df.drop_duplicates(subset=['Name'])
This keeps the first occurrence of each name by default.
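When you pass a subset, the other columns are ignored entirely. Here is a small sketch with hypothetical data to make that concrete: two rows share a Name but differ in Age, and the second is still dropped:
# Hypothetical data: same name, different ages
df2 = pd.DataFrame({
    'Name': ['Alice', 'Alice', 'Bob'],
    'Age': [25, 31, 30]
})
print(df2.drop_duplicates(subset=['Name']))
# The Alice/31 row is dropped even though its Age differs:
#     Name  Age
# 0  Alice   25
# 2    Bob   30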
Keeping First or Last Occurrence
You can control which duplicate to keep:
# Keep last occurrence
df_last = df.drop_duplicates(keep='last')
# Drop all duplicates (keep none)
df_none = df.drop_duplicates(keep=False)
Parameter | Description
---|---
keep='first' | Keeps the first occurrence (default)
keep='last' | Keeps the last occurrence
keep=False | Drops all duplicates
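On the four-row sample DataFrame, the two non-default settings produce different survivors:
print(df.drop_duplicates(keep='last'))
# Row 2 survives instead of row 0:
#     Name  Age
# 1    Bob   30
# 2  Alice   25
# 3  David   22

print(df.drop_duplicates(keep=False))
# Both Alice rows are dropped:
#     Name  Age
# 1    Bob   30
# 3  David   22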
In-Place Operation
To remove duplicates directly in the original DataFrame:
df.drop_duplicates(inplace=True)
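Equivalently, you can reassign the result, which many style guides prefer over inplace=True. If you also want a clean 0..n-1 index afterwards, drop_duplicates accepts ignore_index=True (available in pandas 1.0 and later):
# Reassignment instead of inplace mutation
df = df.drop_duplicates()

# Reset the index in the same call (pandas >= 1.0)
df = df.drop_duplicates(ignore_index=True)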
✅ Full Working Example
import pandas as pd
# Sample data
data = {
    'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
    'Age': [25, 30, 25, 22, 30],
    'City': ['NY', 'LA', 'NY', 'SF', 'LA']
}
df = pd.DataFrame(data)
# Step 1: Detect duplicates
print("Duplicate rows:")
print(df[df.duplicated()])
# Step 2: Remove full duplicates
df_clean = df.drop_duplicates()
# Step 3: Remove based on 'Name' only
df_name_unique = df.drop_duplicates(subset=['Name'])
print("\nCleaned DataFrame:")
print(df_clean)
print("\nUnique Names:")
print(df_name_unique)
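Output (rows 2 and 4 repeat rows 0 and 1, so both are flagged and dropped):
Duplicate rows:
    Name  Age City
2  Alice   25   NY
4    Bob   30   LA

Cleaned DataFrame:
    Name  Age City
0  Alice   25   NY
1    Bob   30   LA
3  David   22   SF

Unique Names:
    Name  Age City
0  Alice   25   NY
1    Bob   30   LA
3  David   22   SF
Note that df_clean and df_name_unique happen to match here, because every duplicated Name also duplicates the full row.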
⚠️ Common Pitfalls
Mistake | Tip
---|---
Not specifying subset when needed | Always define which columns matter for duplicates
Forgetting to keep the result | If you want to update the DataFrame, use inplace=True or reassign the result
Expecting duplicated() to include the first occurrence | With the default keep='first', it marks only subsequent duplicates
Dropping necessary records | Review your logic before dropping duplicates; an inplace drop cannot be undone
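A cheap safety check before dropping is to count how many rows would go:
# How many rows would be removed? (True counts as 1 when summed)
n_dupes = df.duplicated().sum()
print(f"{n_dupes} duplicate rows out of {len(df)}")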
Pro Tips
- Use .shape before and after cleaning to check how many rows were removed.
- For fuzzy matching or near-duplicates, consider using libraries like fuzzywuzzy or recordlinkage.
- Use .groupby() and .count() to identify suspicious repetition patterns before dropping (see the sketch after this list).
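A minimal sketch combining the first and last tips, using the column names from the example data above. It uses .size(), a close cousin of .count() that counts rows per group directly:
# Compare row counts before and after cleaning
before = df.shape[0]
df_clean = df.drop_duplicates()
print(f"Removed {before - df_clean.shape[0]} rows")

# Count how often each Name/Age pair appears; anything > 1 repeats
counts = df.groupby(['Name', 'Age']).size()
print(counts[counts > 1])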
Summary
Removing duplicates is a critical step in data cleaning. Pandas makes it easy to identify and remove duplicates based on all or selected columns, with options to keep the first, last, or none.
Key Methods Recap:
- df.duplicated(): detect duplicates
- df.drop_duplicates(): remove duplicates
- subset=: specify which columns to check
- keep=: control which duplicates to keep