The Zipf distribution is a discrete power-law probability distribution that describes the frequency of events in many natural and social systems. It is named after George Zipf, who observed that word frequencies in natural language follow this pattern: a few items are very common, while many are rare.
NumPy makes it easy to simulate and analyze Zipf-distributed data for use in data science, linguistics, network traffic, and probability modeling.
What is the Zipf Distribution?
In a Zipf distribution:
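- The frequency of an element is inversely proportional to a power of its rank: P(X = k) ∝ 1/k^a.
- For example, with a = 2 the 2nd most common item appears one quarter as often as the most common item. It appears half as often only in the classic a = 1 case, which requires a finite number of ranks and is not supported by NumPy's sampler.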
Probability Mass Function (PMF)
$$P(X = k) = \frac{1 / k^a}{\sum_{n=1}^\infty 1 / n^a}$$
Where:
- $k$: Rank (positive integer)
- $a > 1$: Exponent characterizing the distribution
- The denominator is the Riemann zeta function $\zeta(a)$, which acts as the normalization factor
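To make the formula concrete, here is a minimal sketch that evaluates the PMF using only the standard library. The helper names `zeta_approx` and `zipf_pmf` are made up for this illustration, and the zeta value is an approximation (a direct partial sum plus an integral estimate of the remaining tail), not an exact evaluation.

```python
import math

def zeta_approx(a, n_terms=1000):
    """Approximate the Riemann zeta function for a > 1.

    Sums the first n_terms - 1 terms directly and estimates the rest with
    an integral tail plus a half-term correction (Euler-Maclaurin).
    """
    if a <= 1:
        raise ValueError("the series only converges for a > 1")
    head = sum(n ** -a for n in range(1, n_terms))
    tail = n_terms ** (1 - a) / (a - 1) + 0.5 * n_terms ** -a
    return head + tail

def zipf_pmf(k, a):
    """P(X = k) for rank k = 1, 2, 3, ... and exponent a > 1."""
    return k ** -a / zeta_approx(a)

a = 2.0
probs = [zipf_pmf(k, a) for k in range(1, 6)]
print([round(p, 4) for p in probs])

# The ratio between ranks 2 and 1 is 2 ** -a (0.25 for a = 2), up to rounding.
print(zipf_pmf(2, a) / zipf_pmf(1, a))
```

For a = 2 the normalization factor is ζ(2) = π²/6, so rank 1 alone carries roughly 61% of the probability mass.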
Key Properties
Property | Description |
---|---|
Support | $k = 1, 2, 3, \ldots$ |
Shape | Discrete, long right tail |
Mean | Finite if $a > 2$ |
Variance | Finite if $a > 3$ |
Skewness | Very high, depending on $a$ |
Applications | Word frequencies, website visits, file sizes, etc. |
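The moment conditions in the table can be checked numerically. When a > 2 the mean equals ζ(a − 1) / ζ(a). The sketch below compares that value with a sample mean for a = 4, where the variance is also finite and the sample mean settles quickly. The helper `zeta_partial` is a name invented for this example, and it uses a plain partial sum, which is only accurate enough here because the exponents involved are at least 3.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def zeta_partial(s, n_terms=1_000_000):
    """Partial sum of the zeta series; adequate here because s >= 3."""
    n = np.arange(1, n_terms + 1, dtype=np.float64)
    return np.sum(n ** -s)

a = 4.0  # a > 3, so both the mean and the variance are finite
theoretical_mean = zeta_partial(a - 1) / zeta_partial(a)

sample = rng.zipf(a=a, size=100_000)
print(f"theoretical mean: {theoretical_mean:.4f}")
print(f"sample mean:      {sample.mean():.4f}")
```

Repeating the experiment with a close to 2 should show a sample mean that keeps drifting as the sample grows, which is what an infinite or barely finite mean looks like in practice.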
NumPy’s zipf() Function
numpy.random.Generator.zipf(a, size=None)
Parameters
Parameter | Description |
---|---|
a | Exponent parameter, $a > 1$ |
size | Output shape (e.g., 1000 or (2, 3)) |
✅ Returns
An array of random integers drawn from a Zipf distribution with exponent a, or a single integer when size is None.
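As a quick illustration of how `size` shapes the result, the sketch below calls `zipf` in four ways. The variable names are arbitrary, and the last call relies on `a` also accepting an array of exponents, in which case one value is drawn per exponent.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

single = rng.zipf(2.0)               # size=None: one draw
vector = rng.zipf(2.0, size=5)       # 1-D array, shape (5,)
matrix = rng.zipf(2.0, size=(2, 3))  # 2-D array, shape (2, 3)
per_a = rng.zipf([1.5, 2.0, 3.0])    # one draw per exponent, shape (3,)

print(single, vector.shape, matrix.shape, per_a.shape)
print(matrix.dtype)  # an integer dtype
```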
✅ Example: Generate Zipf Data
import numpy as np
rng = np.random.default_rng(seed=42)
# Generate 1000 samples with exponent a=2
data = rng.zipf(a=2.0, size=1000)
print(data[:10])
You’ll notice a lot of small numbers, with a few large outliers — the hallmark of a Zipf distribution.
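A few summary statistics make that impression measurable. For a = 2 the probability of drawing a 1 is 1/ζ(2) = 6/π², about 0.61, so the share of 1s in the sample should land near that value. This sketch regenerates the same kind of sample so that it runs on its own.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
data = rng.zipf(a=2.0, size=1000)

share_of_ones = np.mean(data == 1)  # fraction of draws equal to 1
print(f"share of 1s: {share_of_ones:.2f}")
print(f"median:      {np.median(data):.0f}")
print(f"max:         {data.max()}")
```

The gap between the median and the maximum is a compact way to see the long right tail without plotting anything.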
Visualizing the Zipf Distribution
import matplotlib.pyplot as plt
import seaborn as sns
# Limit values for visualization clarity
filtered_data = data[data < 20]
sns.histplot(filtered_data, bins=range(1, 21), discrete=True, color="purple")
plt.title("Zipf Distribution (a=2.0)")
plt.xlabel("Value (Rank)")
plt.ylabel("Frequency")
plt.grid(True)
plt.show()
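A histogram on linear axes hides the power-law shape. On log-log axes the counts should fall on a roughly straight line with slope close to −a. The sketch below checks this numerically instead of visually, fitting a line to the log counts of ranks 1 to 10. Treat it as a rough diagnostic only: a least-squares fit on log-log counts is not a rigorous estimator of the exponent.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
a = 2.0
sample = rng.zipf(a=a, size=100_000)

# Count how often each of the ranks 1..10 occurs.
ranks = np.arange(1, 11)
counts = np.array([(sample == k).sum() for k in ranks])

# Fit a straight line to log(count) versus log(rank).
slope, intercept = np.polyfit(np.log(ranks), np.log(counts), deg=1)
print(f"fitted slope: {slope:.2f} (expected about {-a})")
```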
Exploring Different a Values
alphas = [1.1, 1.5, 2.0, 3.0]
sample_size = 1000
for a in alphas:
    sample = rng.zipf(a=a, size=sample_size)
    filtered = sample[sample < 20]  # keep ranks below 20 for readability
    sns.kdeplot(filtered, label=f"a={a}", bw_adjust=0.5)
plt.title("Zipf Distribution for Different a-values")
plt.xlabel("Value (Rank)")
plt.ylabel("Density")
plt.legend()
plt.grid(True)
plt.show()
Interpretation:
- Smaller a (e.g. 1.1) → heavier tail (more high values)
- Larger a (e.g. 3.0) → steeper drop-off (dominance of small ranks)
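The effect of a on the tail can also be measured directly. This sketch reports, for each exponent, the share of draws at or above 20, which is exactly the part that a `sample < 20` filter discards. The threshold and sample size are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
threshold = 20  # draws at or above this value are dropped by `sample < 20`

tail_share = {}
for a in [1.1, 1.5, 2.0, 3.0]:
    sample = rng.zipf(a=a, size=100_000)
    tail_share[a] = np.mean(sample >= threshold)
    print(f"a={a}: {tail_share[a]:.1%} of draws are >= {threshold}")
```

For a = 1.1 well over half of the draws sit beyond the cutoff, so a truncated plot for small a shows only a minority of the sample.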
Full Simulation: Word Frequency
import collections
words = rng.zipf(a=2.0, size=10000)
words = words[words < 50] # Truncate long tail
# Frequency count
freq = collections.Counter(words)
sorted_freq = dict(sorted(freq.items()))
plt.bar(sorted_freq.keys(), sorted_freq.values(), color='darkgreen')
plt.title("Simulated Word Frequency (Zipf, a=2.0)")
plt.xlabel("Word Rank")
plt.ylabel("Count")
plt.grid(True)
plt.show()
This mimics how in natural language:
- "the", "is", "and" appear most often,
- while thousands of words appear rarely.
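To tie the ranks back to actual words, one option is to treat each sampled rank as an index into a vocabulary sorted by frequency. The sketch below does this with a tiny, hypothetical eight-word vocabulary; the word list and its order are assumptions made purely for illustration.

```python
import collections
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical vocabulary, ordered from most to least common.
vocab = ["the", "is", "and", "of", "to", "in", "that", "it"]

ranks = rng.zipf(a=2.0, size=10_000)
ranks = ranks[ranks <= len(vocab)]      # drop ranks beyond the vocabulary
tokens = [vocab[k - 1] for k in ranks]  # rank 1 maps to vocab[0]

word_counts = collections.Counter(tokens)
for word, count in word_counts.most_common(3):
    print(word, count)
```

Dropping ranks beyond the vocabulary truncates the distribution, so the resulting counts describe a Zipf law restricted to the first eight ranks.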
Real-World Applications
Domain | Use Case |
---|---|
Linguistics | Word frequency in natural languages |
Web Analytics | Page visits, link popularity |
Economics | Wealth/rank distributions |
File Systems | File sizes and access patterns |
Data Science | Sampling rare vs common events |
Tips for Using Zipf in NumPy
Tip | Why It’s Helpful |
---|---|
✅ Filter large values | Prevents charts being dominated by extreme outliers |
✅ Use large sample size | Gives better approximation of the distribution |
✅ Truncate with condition (e.g. data < N) | Makes visualization clearer |
✅ Vary a for experimentation | Controls skew and tail heaviness |
⚠️ Common Pitfalls
Pitfall | Explanation |
---|---|
❌ Using a <= 1 | Not defined (the normalizing sum diverges, so no valid PMF exists) |
❌ Forgetting it’s discrete | Zipf returns integers only |
❌ Plotting without filtering | Extreme values may distort plots |
❌ Using small samples | Doesn't reflect true Zipf characteristics |
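The first pitfall is easy to guard against in code. NumPy is expected to raise a `ValueError` when a is not greater than 1, but an explicit check gives a clearer message. The wrapper `safe_zipf` below is a hypothetical helper written for this example, not part of NumPy.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def safe_zipf(a, size=None):
    """Hypothetical wrapper that reports an invalid exponent clearly."""
    if not a > 1:
        raise ValueError(f"Zipf exponent must be > 1, got {a}")
    return rng.zipf(a, size=size)

for a in (0.5, 1.0, 1.5):
    try:
        draws = safe_zipf(a, size=3)
        print(f"a={a}: {draws}")
    except ValueError as err:
        print(f"a={a}: rejected ({err})")
```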
Mathematical Connection to Other Distributions
Distribution | Relationship |
---|---|
Power-law | Zipf is a discrete power-law |
Pareto | Pareto is the continuous analog |
Geometric | Also discrete, but decays exponentially rather than as a power law |
Exponential | Zipf decays slower than exponential |
Conclusion
The Zipf Distribution is a powerful way to model real-world data where rank and frequency are inversely related. With just a few lines of NumPy, you can simulate and study Zipf-distributed phenomena — from language and economics to web traffic and more.
Summary Table
Feature | Value |
---|---|
Function | rng.zipf(a, size) |
Output | Discrete integers (≥1) |
Skewed | Yes (right-skewed) |
Mean exists | Only if $a > 2$ |
Use cases | Word frequency, traffic, rankings |