When analyzing data, one of the most valuable tools you can use is correlation analysis. Correlation helps you understand the relationship between numerical variables in your dataset — whether they move together and how strong that relationship is.
In this article, we'll walk through how to compute and interpret correlations using Pandas, along with visualization techniques, practical examples, and tips.
What is Correlation?
Correlation is a statistical measure that expresses the extent to which two variables are linearly related.
Common Correlation Values:
Correlation Coefficient | Interpretation |
---|---|
+1 | Perfect positive correlation |
0 | No linear correlation |
-1 | Perfect negative correlation |
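To make the definition concrete, here is a minimal sketch that computes the Pearson coefficient by hand (covariance divided by the product of the standard deviations) and compares it with what Pandas returns. The helper name `pearson_r` is hypothetical and used only for illustration; the values are the Temperature and SunglassesSales columns from the sample data used in this article.

```python
import numpy as np
import pandas as pd


def pearson_r(x, y):
    """Hypothetical helper: Pearson correlation computed from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    # Sum of co-deviations divided by the product of the deviation norms
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


temperature = [30, 35, 40, 45, 50]
sunglasses = [150, 220, 270, 350, 400]

manual_r = pearson_r(temperature, sunglasses)
pandas_r = pd.Series(temperature).corr(pd.Series(sunglasses))

print(f"manual: {manual_r:.4f}, pandas: {pandas_r:.4f}")
```

Both numbers should agree up to floating-point noise, which is a quick way to convince yourself that `.corr()` defaults to Pearson.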
Getting Started
Requirements:
pip install pandas matplotlib seaborn
Sample Data
import pandas as pd
data = {
'Temperature': [30, 35, 40, 45, 50],
'IceCreamSales': [200, 300, 400, 500, 600],
'SunglassesSales': [150, 220, 270, 350, 400],
'Rainfall': [100, 80, 60, 40, 20]
}
df = pd.DataFrame(data)
print(df)
Calculating Correlations in Pandas
Pandas provides a simple and powerful .corr() method to calculate pairwise correlation of columns.
correlation_matrix = df.corr()
print(correlation_matrix)
Output (rounded to three decimals):
                 Temperature  IceCreamSales  SunglassesSales  Rainfall
Temperature            1.000          1.000            0.998    -1.000
IceCreamSales          1.000          1.000            0.998    -1.000
SunglassesSales        0.998          0.998            1.000    -0.998
Rainfall              -1.000         -1.000           -0.998     1.000
✅ Interpretation:
- Temperature and IceCreamSales have a perfect positive correlation.
- Rainfall has a perfect negative correlation with Temperature and IceCreamSales, and a near-perfect negative correlation with SunglassesSales.
Choosing a Correlation Method
Pandas supports three methods for computing correlation:
Method | Description |
---|---|
'pearson' (default) | Measures linear relationship (most common) |
'kendall' | Measures ordinal association (non-parametric) |
'spearman' | Based on rank, good for non-linear monotonic relationships |
df.corr(method='spearman')
df.corr(method='kendall')
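The difference between the methods is easiest to see on data that is monotonic but not linear. The sketch below uses a made-up cubic relationship (the `df_curve` frame and its columns are illustrative, not part of the article's dataset) and compares the default Pearson method with Spearman.

```python
import pandas as pd

# Illustrative data: y always increases with x, but along a curve
df_curve = pd.DataFrame({"x": range(1, 11)})
df_curve["y"] = df_curve["x"] ** 3

pearson_r = df_curve.corr(method="pearson").loc["x", "y"]
spearman_r = df_curve.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson_r:.3f}")   # below 1: the relationship is not a straight line
print(f"Spearman: {spearman_r:.3f}")  # 1: the rank order of x and y is identical
```

If Spearman is noticeably higher than Pearson on your own data, that can be a hint that the relationship is monotonic but curved.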
Visualizing Correlation with Heatmap
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', linewidths=0.5)
plt.title("Correlation Matrix")
plt.show()
✅ This heatmap gives a quick visual summary of which variables are positively or negatively correlated — great for EDA (Exploratory Data Analysis).
Correlation Between Specific Columns
You can compute the correlation between two specific columns:
correlation = df['Temperature'].corr(df['IceCreamSales'])
print(f"Correlation between Temperature and Ice Cream Sales: {correlation:.2f}")
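A related pattern is correlating one column against all the others at once. As a sketch, assuming the same sample data, `DataFrame.corrwith` can take a single Series and returns one coefficient per remaining column (the variable names `others` and `corr_with_temp` are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Temperature": [30, 35, 40, 45, 50],
    "IceCreamSales": [200, 300, 400, 500, 600],
    "SunglassesSales": [150, 220, 270, 350, 400],
    "Rainfall": [100, 80, 60, 40, 20],
})

# Drop the target column so it is not correlated with itself
others = df.drop(columns="Temperature")

# One Pearson coefficient per remaining column, sorted from most positive to most negative
corr_with_temp = others.corrwith(df["Temperature"]).sort_values(ascending=False)
print(corr_with_temp.round(3))
```

Selecting a single column of the full matrix, `df.corr()["Temperature"]`, gives the same numbers plus the trivial self-correlation.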
✅ Full Working Example
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Sample dataset
data = {
'Temperature': [30, 35, 40, 45, 50],
'IceCreamSales': [200, 300, 400, 500, 600],
'SunglassesSales': [150, 220, 270, 350, 400],
'Rainfall': [100, 80, 60, 40, 20]
}
df = pd.DataFrame(data)
# Correlation Matrix
print("Correlation Matrix:")
print(df.corr())
# Visualizing with heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(), annot=True, cmap='YlGnBu', linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()
# Correlation between specific columns
temp_ice_corr = df['Temperature'].corr(df['IceCreamSales'])
print(f"\nCorrelation between Temperature and Ice Cream Sales: {temp_ice_corr:.2f}")
Tips for Using Correlation
Tip | Why it Helps |
---|---|
Use .abs() to find strongest relationships | Sometimes you're just interested in strength, not direction |
Always visualize your data | Heatmaps help spot issues or interesting trends |
Check for linearity | Pearson assumes a linear relationship |
Remove irrelevant columns | Since pandas 2.0, df.corr() raises an error on non-numeric columns unless you pass numeric_only=True; older versions dropped them silently |
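Two of the tips above can be combined in a short sketch: restrict the matrix to numeric columns, then rank column pairs by absolute correlation. The `City` column is a hypothetical non-numeric column added only to show the effect of `numeric_only=True` (available in recent pandas versions); the rest is the article's sample data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Temperature": [30, 35, 40, 45, 50],
    "IceCreamSales": [200, 300, 400, 500, 600],
    "SunglassesSales": [150, 220, 270, 350, 400],
    "Rainfall": [100, 80, 60, 40, 20],
    "City": ["A", "B", "A", "B", "A"],  # hypothetical non-numeric column
})

# Explicitly skip non-numeric columns such as "City"
corr = df.corr(numeric_only=True)

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(upper_mask)

# Flatten to a Series indexed by column pairs, drop the masked cells, rank by strength
strongest = upper.unstack().dropna().abs().sort_values(ascending=False)
print(strongest)
```

Taking `.abs()` before sorting puts strong negative relationships (such as Rainfall against Temperature) next to strong positive ones instead of at the bottom of the list.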
⚠️ Common Pitfalls
Pitfall | Fix |
---|---|
Interpreting correlation as causation | Correlation ≠ Causation. Use with domain knowledge |
Using correlation with categorical data | Use encoding or other statistical tests instead |
Including outliers | Outliers can skew correlation; visualize first |
Small sample size | Small samples can produce misleading correlations; gather more data before drawing conclusions |
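The outlier pitfall is worth seeing once with numbers. In the sketch below, the data is invented for illustration: eight points with a clear downward trend, plus a single extreme point that is enough to flip the sign of the Pearson coefficient.

```python
import pandas as pd

# Invented data with a clear downward trend
clean = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6, 7, 8],
    "y": [8, 6, 7, 5, 4, 3, 2, 1],
})

# The same data plus one extreme point far from the rest
outlier = pd.DataFrame({"x": [50], "y": [50]})
with_outlier = pd.concat([clean, outlier], ignore_index=True)

r_clean = clean["x"].corr(clean["y"])
r_outlier = with_outlier["x"].corr(with_outlier["y"])

print(f"Without outlier: {r_clean:.3f}")   # strongly negative
print(f"With outlier:    {r_outlier:.3f}")  # strongly positive
```

A scatter plot of `with_outlier` would reveal the problem immediately, which is why visualizing before trusting a coefficient is a good habit.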
Summary
Pandas makes it incredibly easy to perform correlation analysis and visualize relationships between variables.
Key Functions Recap:
Function | Purpose |
---|---|
df.corr() | Compute correlation matrix |
df['col1'].corr(df['col2']) | Pairwise correlation |
sns.heatmap() | Visual heatmap for better analysis |