In data science, understanding how data is distributed is critical. Whether you're simulating data, analyzing real-world datasets, or performing hypothesis testing, you’ll encounter probability distributions.
The numpy.random module provides powerful tools to generate data that follows specific distributions such as normal, binomial, Poisson, and many others.
This guide walks you through NumPy data distributions, how to use them, and practical examples.
What is a Data Distribution?
A distribution describes how the values in a dataset are spread across their possible range. In probability theory, a probability distribution describes how likely different outcomes are.
NumPy allows us to simulate data drawn from these distributions using random number generators.
Getting Started
Import NumPy and create a random generator:
import numpy as np
# Recommended modern generator
rng = np.random.default_rng()
NumPy distributions are accessed via this rng object using methods like normal(), binomial(), etc.
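As a small sketch of how the generator object behaves, the snippet below builds two generators from the same seed and checks that they produce identical draws. The seed value 123 and the variable names are arbitrary choices made for illustration.

```python
import numpy as np

# Two generators built from the same seed (123 is an arbitrary choice)
rng_a = np.random.default_rng(123)
rng_b = np.random.default_rng(123)

draws_a = rng_a.normal(size=5)
draws_b = rng_b.normal(size=5)

# Same seed and same sequence of calls give identical samples
same = np.array_equal(draws_a, draws_b)
print(same)

# An unseeded generator is initialized from fresh OS entropy,
# so its output differs from run to run
rng_c = np.random.default_rng()
print(rng_c.normal(size=3))
```

Passing a seed is mostly useful for tests and demonstrations; for ordinary simulations an unseeded generator is usually fine.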
Common Distributions in NumPy
Let's go through some of the most common ones with examples.
1. Normal Distribution (Gaussian)
- Bell-shaped curve.
- Common in natural phenomena.
# Generate 1000 numbers from a normal distribution with mean=0, std=1
data = rng.normal(loc=0.0, scale=1.0, size=1000)
Parameters:
- loc: Mean (μ)
- scale: Standard deviation (σ)
- size: Output shape
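To see what the three parameters do in practice, here is a minimal sketch that draws a large sample and compares its mean and standard deviation with loc and scale. The values 170 and 10 (and the name heights) are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily

# loc and scale set the centre and spread of the samples
heights = rng.normal(loc=170.0, scale=10.0, size=100_000)
print(heights.mean())  # close to loc = 170
print(heights.std())   # close to scale = 10

# size can also be a tuple, which gives a multi-dimensional array
grid = rng.normal(loc=0.0, scale=1.0, size=(3, 4))
print(grid.shape)
```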
2. Binomial Distribution
- Models the number of successes in n trials with success probability p.
# 10 trials, success probability 0.5
data = rng.binomial(n=10, p=0.5, size=1000)
Example: Flipping a coin 10 times, how many heads?
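The coin-flip question can be answered empirically. The sketch below repeats the 10-flip experiment many times and compares the results with the textbook values n * p and n * p * (1 - p); the repeat count of 50,000 and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

n_flips, p_heads = 10, 0.5
# Each entry is the number of heads in one run of 10 flips
heads = rng.binomial(n=n_flips, p=p_heads, size=50_000)

print(heads.min(), heads.max())  # always within 0..10
print(heads.mean())              # close to n * p = 5
print(heads.var())               # close to n * p * (1 - p) = 2.5

# Fraction of runs with exactly 5 heads (theory: 252/1024, about 0.246)
frac_five = np.mean(heads == 5)
print(frac_five)
```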
3. Poisson Distribution
- Counts the number of events in a fixed interval, assuming events occur independently at a constant average rate (often used for rare events).
# Lambda = 3 (average rate)
data = rng.poisson(lam=3, size=1000)
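A quick way to check that samples look Poisson is that the mean and the variance should both be close to lam. The following sketch assumes lam = 3 as above and uses an arbitrary seed and sample size.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed

lam = 3.0
counts = rng.poisson(lam=lam, size=100_000)

print(counts.mean())  # close to lam
print(counts.var())   # also close to lam

# Fraction of intervals with no events, next to the theoretical exp(-lam)
frac_zero = np.mean(counts == 0)
print(frac_zero, np.exp(-lam))
```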
4. Uniform Distribution
- All values within the interval are equally likely.
# Random floats between 0.0 and 1.0
data = rng.uniform(low=0.0, high=1.0, size=1000)
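One detail worth checking is the interval itself: NumPy documents uniform() as sampling from the half-open interval [low, high). The sketch below uses the arbitrary bounds 2.0 and 5.0 and compares the sample with the expected mean and variance.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed

low, high = 2.0, 5.0
x = rng.uniform(low=low, high=high, size=100_000)

# Documented as half-open: low is included, high is excluded
print(x.min() >= low, x.max() < high)

print(x.mean())  # close to (low + high) / 2 = 3.5
print(x.var())   # close to (high - low) ** 2 / 12 = 0.75
```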
⏱ 5. Exponential Distribution
- Time between events in a Poisson process (e.g., time until the next earthquake).
data = rng.exponential(scale=2.0, size=1000)
Parameter:
- scale: Mean time between events, which is the inverse of the rate (scale = 1/λ)
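Because scale is 1/λ rather than λ, it helps to convert explicitly. This sketch assumes a hypothetical rate of 0.5 events per unit time and confirms that the sample mean tracks scale, not the rate.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed

rate = 0.5          # lambda: events per unit time (assumed for the example)
scale = 1.0 / rate  # what NumPy expects: the mean waiting time

waits = rng.exponential(scale=scale, size=100_000)

print(waits.mean())      # close to scale = 2.0, not to rate = 0.5
print(np.median(waits))  # close to scale * ln(2), about 1.386
```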
⛷ 6. Chi-Square Distribution
- Often used in statistical tests (e.g., the chi-square test).
data = rng.chisquare(df=2, size=1000)
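The df parameter has a concrete meaning: a chi-square variable with df degrees of freedom is the sum of df squared standard normal variables. The sketch below, with an arbitrary seed and sample size, compares direct samples with samples built that way.

```python
import numpy as np

rng = np.random.default_rng(5)  # arbitrary seed
df = 2
n_samples = 100_000

# Direct samples from the chi-square distribution
direct = rng.chisquare(df=df, size=n_samples)

# Same distribution built by hand: sum of df squared standard normals
z = rng.normal(size=(n_samples, df))
built = (z ** 2).sum(axis=1)

print(direct.mean(), built.mean())  # both close to df
print(direct.var(), built.var())    # both close to 2 * df
```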
7. Multinomial Distribution
- Generalization of the binomial distribution to more than two categories.
# 5 experiments of 10 trials each, with probabilities for 3 outcomes
data = rng.multinomial(n=10, pvals=[0.2, 0.5, 0.3], size=5)
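The shape of the result is easy to misread, so here is a short sketch that inspects it. Each row is one experiment, each column counts one outcome, and every row sums to n. The second call with 20,000 experiments is an extra step added here only to check the average counts.

```python
import numpy as np

rng = np.random.default_rng(6)  # arbitrary seed

pvals = [0.2, 0.5, 0.3]
counts = rng.multinomial(n=10, pvals=pvals, size=5)

print(counts.shape)        # one row per experiment, one column per outcome
print(counts.sum(axis=1))  # every row sums to n = 10

# With many experiments, the average count per outcome approaches n * pvals
many = rng.multinomial(n=10, pvals=pvals, size=20_000)
print(many.mean(axis=0))
```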
Visualizing Distributions
Use matplotlib to understand the shape of the distributions:
import matplotlib.pyplot as plt
data = rng.normal(loc=0, scale=1, size=1000)
plt.hist(data, bins=30, edgecolor='black')
plt.title("Normal Distribution")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
Repeat with different distributions to compare their shapes.
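If you want the numbers behind a histogram without opening a plot window, np.histogram returns the bin counts and edges directly (matplotlib's hist relies on it for binning). This is a minimal sketch with an arbitrary seed.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed
data = rng.normal(loc=0, scale=1, size=1000)

# 30 equal-width bins spanning the range of the data
counts, edges = np.histogram(data, bins=30)

print(counts.sum())  # every sample lands in exactly one bin
print(len(edges))    # 30 bins -> 31 edges

# Left edge of the fullest bin; for normal data it sits near the mean
peak = edges[np.argmax(counts)]
print(peak)
```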
Full Code Example
import numpy as np
import matplotlib.pyplot as plt
# Initialize RNG
rng = np.random.default_rng(seed=42)
# Generate different distributions
data_normal = rng.normal(0, 1, 1000)
data_binomial = rng.binomial(10, 0.5, 1000)
data_poisson = rng.poisson(3, 1000)
data_uniform = rng.uniform(0, 10, 1000)
# Plot
fig, axs = plt.subplots(2, 2, figsize=(10, 8))
axs[0, 0].hist(data_normal, bins=30, color='skyblue', edgecolor='black')
axs[0, 0].set_title("Normal Distribution")
# Integer-valued data: centre one bin on each possible value 0..10
axs[0, 1].hist(data_binomial, bins=np.arange(12) - 0.5, color='salmon', edgecolor='black')
axs[0, 1].set_title("Binomial Distribution")
# Integer-valued data: centre one bin on each observed count
axs[1, 0].hist(data_poisson, bins=np.arange(data_poisson.max() + 2) - 0.5, color='lightgreen', edgecolor='black')
axs[1, 0].set_title("Poisson Distribution")
axs[1, 1].hist(data_uniform, bins=30, color='orange', edgecolor='black')
axs[1, 1].set_title("Uniform Distribution")
for ax in axs.flat:
    ax.set_xlabel("Value")
    ax.set_ylabel("Frequency")
plt.tight_layout()
plt.show()
✅ Tips
- Understand Parameters: Each distribution has unique parameters. Know what they represent.
- Visualize Before Use: Plot data to confirm it behaves as expected.
- Use seed for reproducibility: Always helpful in testing or demonstrations.
- Use proper sample sizes: Small samples might not reflect the true shape of the distribution.
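The sample-size tip can be made concrete. The sketch below repeats a standard normal draw 500 times at each of three sample sizes and measures how much the sample mean varies; the sizes, repeat count, and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)  # arbitrary seed

spread = {}
for size in (10, 100, 1_000):
    # 500 independent samples of the given size, one per row
    samples = rng.normal(loc=0.0, scale=1.0, size=(500, size))
    # How much the sample mean varies from one sample to the next
    spread[size] = samples.mean(axis=1).std()
    print(size, spread[size])  # roughly 1 / sqrt(size)
```

The spread shrinks roughly by a factor of sqrt(10) each time the sample size grows tenfold, which is why small samples can give a misleading picture.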
⚠️ Common Pitfalls
Pitfall | Explanation |
---|---|
❌ Misinterpreting parameters | For example, scale in exponential is 1/λ, not λ itself. |
❌ Using legacy RNG functions | Prefer default_rng() over np.random.normal() and similar old APIs. |
❌ Assuming distributions are always symmetric | Many (like Poisson, exponential) are skewed. |
❌ Forgetting sample size | Small samples may mislead your intuition about the distribution. |
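The symmetry pitfall can be checked numerically with the sample skewness (the third standardized moment). The helper sample_skewness below is a hypothetical function written for this sketch, not part of NumPy; the seed and sample size are arbitrary.

```python
import numpy as np

def sample_skewness(x):
    """Third standardized moment; roughly 0 for symmetric data."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered ** 3) / np.std(x) ** 3

rng = np.random.default_rng(9)  # arbitrary seed
size = 100_000

skew_normal = sample_skewness(rng.normal(size=size))                 # theory: 0
skew_poisson = sample_skewness(rng.poisson(lam=3, size=size))        # theory: 1 / sqrt(3)
skew_expon = sample_skewness(rng.exponential(scale=2.0, size=size))  # theory: 2

print(skew_normal, skew_poisson, skew_expon)
```

A value near zero suggests a symmetric shape, while clearly positive values indicate a longer right tail, as with the Poisson and exponential samples here.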
Conclusion
The numpy.random module offers powerful tools for simulating real-world data using different probability distributions. Whether you're modeling dice rolls, simulating experiments, or preparing for statistical analysis, understanding these distributions is essential.
Start experimenting with different parameters, visualize your results, and use these distributions to simulate and analyze data effectively.