The normal distribution, also known as the Gaussian distribution, is one of the most important concepts in statistics and data science. It is a good approximate model for many real-world phenomena such as heights, weights, test scores, and measurement errors.
In Python, the NumPy library provides easy-to-use tools for generating and working with data that follows a normal distribution.
What is a Normal Distribution?
A normal distribution is a bell-shaped and symmetric probability distribution. It is defined by two parameters:
- Mean (μ): The center of the distribution.
- Standard Deviation (σ): Measures the spread; higher σ means more spread out.
The probability density function (PDF) is:
$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
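As a minimal sketch, the PDF can be written directly in NumPy and checked numerically. The helper name `normal_pdf` is an assumption made for this illustration; it is not a NumPy function.

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Evaluate the normal PDF at x for mean mu and standard deviation sigma."""
    x = np.asarray(x, dtype=float)
    coeff = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    return coeff * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Height of the curve at the mean of a standard normal: 1 / sqrt(2*pi)
peak = normal_pdf(0.0)

# Approximate the area under the curve over [-8, 8] with a simple Riemann sum;
# a valid PDF should integrate to (very nearly) 1
xs = np.linspace(-8.0, 8.0, 10001)
dx = xs[1] - xs[0]
area = np.sum(normal_pdf(xs)) * dx

print("peak:", peak)
print("area:", area)
```

The peak comes out near 0.399 and the area near 1. A larger σ lowers the peak and widens the curve while the area stays the same.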
NumPy and Normal Distribution
NumPy provides a function to generate samples from a normal distribution:
Syntax
numpy.random.Generator.normal(loc=0.0, scale=1.0, size=None)
Parameters
Parameter | Description |
---|---|
loc | Mean (μ) of the distribution |
scale | Standard deviation (σ) |
size | Number of samples (shape of output) |
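The short sketch below illustrates how the three parameters behave in practice. The seed and the specific values are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

single = rng.normal()                         # size omitted: a single float
vector = rng.normal(loc=10, scale=2, size=5)  # 1-D array with 5 samples
matrix = rng.normal(size=(2, 3))              # 2-D array of shape (2, 3)

print(type(single), vector.shape, matrix.shape)

# scale must be non-negative; a negative value raises ValueError
try:
    rng.normal(scale=-1.0)
    negative_scale_rejected = False
except ValueError:
    negative_scale_rejected = True

print("negative scale rejected:", negative_scale_rejected)
```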
Getting Started
import numpy as np
import matplotlib.pyplot as plt
# Create a random generator
rng = np.random.default_rng(seed=42)
# Generate normal distribution data
data = rng.normal(loc=0, scale=1, size=1000)
This generates 1000 samples from a standard normal distribution (mean=0, std=1).
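One quick sanity check on such samples, offered here as an illustrative sketch, is the 68-95-99.7 rule: roughly 68%, 95%, and 99.7% of normal values fall within 1, 2, and 3 standard deviations of the mean. A larger sample than in the text is assumed so the fractions are stable.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
data = rng.normal(loc=0, scale=1, size=100_000)

# Fraction of samples within k standard deviations of the mean
fractions = {}
for k in (1, 2, 3):
    fractions[k] = np.mean(np.abs(data) <= k)
    print(f"within {k} sigma: {fractions[k]:.3f}")
```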
Visualizing the Distribution
Use matplotlib or seaborn to understand the shape:
import seaborn as sns
sns.histplot(data, bins=30, kde=True, color='skyblue')
plt.title("Normal Distribution (μ=0, σ=1)")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
- kde=True adds a smooth density line.
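To see numerically what the histogram is estimating, the sketch below compares density-scaled bin heights from `np.histogram` with the theoretical PDF at the bin centers. The sample size of 100,000 is an assumption chosen to keep the noise small; it is not the value used in the plot.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
data = rng.normal(loc=0, scale=1, size=100_000)

# density=True rescales bar heights so the total bar area equals 1
heights, edges = np.histogram(data, bins=30, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# Theoretical standard normal PDF evaluated at the bin centers
pdf = np.exp(-centers ** 2 / 2) / np.sqrt(2 * np.pi)

total_area = np.sum(heights * np.diff(edges))
max_gap = np.max(np.abs(heights - pdf))

print("total bar area:", total_area)
print("largest gap between histogram and PDF:", max_gap)
```

The gap shrinks as the sample grows, which is why small samples look ragged next to the smooth curve.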
Example: Custom Mean and Standard Deviation
# μ = 50, σ = 10
data = rng.normal(loc=50, scale=10, size=1000)
sns.histplot(data, kde=True, color='lightgreen')
plt.title("Normal Distribution (μ=50, σ=10)")
plt.show()
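Changing loc and scale only shifts and stretches the standard normal. As a small sketch of that relationship, samples can be converted to z-scores with z = (x − μ) / σ and mapped back with x = μ + σ·z. The variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
mu, sigma = 50, 10
data = rng.normal(loc=mu, scale=sigma, size=1000)

# Standardize: subtract the mean, divide by the standard deviation
z = (data - mu) / sigma

# Undo the transformation to recover the original values
rebuilt = mu + sigma * z

print("z mean:", z.mean(), "z std:", z.std())
```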
Real-World Simulation
Simulate Students' Test Scores
# Average score: 70, Std dev: 12
scores = rng.normal(loc=70, scale=12, size=500)
# Clip scores to a 0-100 range
scores = np.clip(scores, 0, 100)
sns.histplot(scores, bins=20, kde=True)
plt.title("Simulated Student Scores")
plt.xlabel("Score")
plt.ylabel("Number of Students")
plt.show()
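Clipping keeps every score inside the valid 0-100 range, which makes the dataset more realistic for practical analysis. Note that clipped data is no longer exactly normal: any value that fell outside the range is moved onto the boundary (0 or 100).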
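The sketch below counts how many values clipping actually moves. It also shows one possible alternative that is not part of the original example: redrawing out-of-range values until all of them fall inside the bounds.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
raw = rng.normal(loc=70, scale=12, size=500)

# Option 1: clip, so out-of-range values land exactly on 0 or 100
clipped = np.clip(raw, 0, 100)
n_clipped = int(np.sum((raw < 0) | (raw > 100)))

# Option 2 (alternative, an assumption for illustration): redraw out-of-range values
resampled = raw.copy()
out_of_range = (resampled < 0) | (resampled > 100)
while out_of_range.any():
    resampled[out_of_range] = rng.normal(loc=70, scale=12, size=out_of_range.sum())
    out_of_range = (resampled < 0) | (resampled > 100)

print("values moved by clipping:", n_clipped)
print("clipped max:", clipped.max(), "resampled max:", resampled.max())
```

With a mean of 70 and a standard deviation of 12, only values more than 2.5 standard deviations above the mean exceed 100, so few samples are affected either way.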
Check Statistical Properties
print("Mean:", np.mean(scores))
print("Standard Deviation:", np.std(scores))
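This checks whether the sample statistics are close to the specified parameters (70 and 12). Expect small deviations from random sampling and from clipping. Note that np.std divides by n by default (ddof=0).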
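As a sketch of how close "close" should be, the example below compares the default `np.std` (ddof=0) with the sample standard deviation (ddof=1) and computes the standard error of the mean. It uses unclipped data so the comparison with the parameters is direct.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
sample = rng.normal(loc=70, scale=12, size=500)

pop_std = np.std(sample)             # ddof=0: divides by n
sample_std = np.std(sample, ddof=1)  # ddof=1: divides by n - 1

# Standard error of the mean: typical distance between the sample mean and loc
sem = sample_std / np.sqrt(sample.size)

print(f"mean={sample.mean():.2f}")
print(f"std (ddof=0)={pop_std:.2f}, std (ddof=1)={sample_std:.2f}")
print(f"standard error of the mean={sem:.2f}")
```

For 500 samples with σ = 12 the standard error is about 0.54, so a sample mean within a point or so of 70 is what one would expect.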
Generating Multidimensional Data
data_2d = rng.normal(loc=0, scale=1, size=(3, 5))
print(data_2d)
This generates a 3×5 array of values drawn from the standard normal distribution.
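loc and scale can also be arrays, in which case they broadcast against size. The sketch below draws three columns with different means and standard deviations; the specific values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

means = np.array([0.0, 50.0, 100.0])
stds = np.array([1.0, 10.0, 25.0])

# Shape (1000, 3): each column uses its own mean and standard deviation
columns = rng.normal(loc=means, scale=stds, size=(1000, 3))

print("column means:", columns.mean(axis=0))
print("column stds: ", columns.std(axis=0))
```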
✅ Use Case: Normal vs Non-Normal Comparison
# Normal distribution
normal_data = rng.normal(0, 1, 1000)
# Uniform distribution for comparison
uniform_data = rng.uniform(-3, 3, 1000)
# Plot
# fill=True replaces the deprecated shade=True
sns.kdeplot(normal_data, label="Normal", fill=True)
sns.kdeplot(uniform_data, label="Uniform", fill=True, color='orange')
plt.title("Normal vs Uniform Distribution")
plt.legend()
plt.show()
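The difference in shape can also be summarized numerically. The sketch below uses kurtosis (the fourth standardized moment), which is about 3 for a normal distribution and 1.8 for a uniform one. The `kurtosis` helper is written for this example, and a larger sample than in the plot is assumed for stability.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
normal_data = rng.normal(0, 1, 100_000)
uniform_data = rng.uniform(-3, 3, 100_000)

def kurtosis(x):
    """Fourth standardized moment: about 3 for normal, 1.8 for uniform data."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4)

print("std:     ", normal_data.std(), uniform_data.std())
print("kurtosis:", kurtosis(normal_data), kurtosis(uniform_data))
```

The uniform sample has the larger standard deviation (about 1.73 for the range -3 to 3) but lower kurtosis, reflecting its flat top and missing tails.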
✅ Tips
- ✅ Use default_rng(): It’s the modern and recommended way to create random generators.
- ✅ Use kde=True to visualize the density shape.
- ✅ Always set seed for reproducibility during testing.
- ✅ Clip values if simulating bounded data (e.g., scores 0–100).
- ✅ Use large enough sample sizes (≥ 500) for smooth distributions.
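Two of these tips are easy to demonstrate in a short sketch: seeding makes results repeatable, and larger samples give statistics closer to the requested parameters. The sample sizes below are arbitrary choices for the example.

```python
import numpy as np

# Same seed, same stream: two fresh generators produce identical samples
a = np.random.default_rng(seed=42).normal(size=5)
b = np.random.default_rng(seed=42).normal(size=5)
same_seed_matches = np.array_equal(a, b)
print("identical with same seed:", same_seed_matches)

# Sample size: the sample mean and std settle toward loc and scale as n grows
rng = np.random.default_rng(seed=42)
for n in (10, 100, 1_000, 10_000):
    sample = rng.normal(loc=50, scale=10, size=n)
    print(n, round(sample.mean(), 2), round(sample.std(ddof=1), 2))
```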
⚠️ Common Pitfalls
Pitfall | Explanation |
---|---|
❌ Confusing scale with variance | scale = standard deviation (not variance!) |
❌ Using legacy np.random.normal() in new projects | Use default_rng().normal() instead |
❌ Forgetting to set size | If omitted, only a single float is returned |
❌ Assuming small samples show perfect bell curve | You need large enough samples to approximate a bell shape |
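The first pitfall is worth a concrete sketch. If a problem statement gives a variance, take its square root before passing it as scale. The target variance of 25 here is a hypothetical value.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

variance = 25.0            # hypothetical target variance
sigma = np.sqrt(variance)  # scale expects the standard deviation: 5.0

right = rng.normal(loc=0, scale=sigma, size=100_000)
wrong = rng.normal(loc=0, scale=variance, size=100_000)  # variance passed by mistake

print("variance with scale=sigma:   ", right.var())
print("variance with scale=variance:", wrong.var())
```

Passing the variance as scale yields data with a variance near 625 (25 squared) instead of 25.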
Summary
Feature | Description |
---|---|
Function | rng.normal(loc=μ, scale=σ, size=n) |
Mean (loc) | Controls the center of the distribution |
Std dev (scale) | Controls the spread (width) |
Use cases | Simulating real-world numeric data |
Tools | NumPy, Matplotlib, Seaborn |
The normal distribution is a cornerstone of statistical modeling, and NumPy makes it straightforward to generate and analyze normally distributed data.
Full Code Example
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Generator
rng = np.random.default_rng(seed=123)
# Generate normal data
data = rng.normal(loc=60, scale=15, size=1000)
# Plot histogram and KDE
sns.histplot(data, bins=30, kde=True, color='skyblue')
plt.title("Normal Distribution (μ=60, σ=15)")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.axvline(np.mean(data), color='red', linestyle='--', label='Mean')
plt.legend()
plt.show()
# Print statistics
print("Sample Mean:", round(np.mean(data), 2))
print("Sample Standard Deviation:", round(np.std(data), 2))