Python NumPy: Chi-Square Distribution Explained

Last updated 1 month, 3 weeks ago

Tags: Python NumPy

The Chi-Square (χ²) distribution is widely used in statistical inference, especially for hypothesis testing and confidence intervals involving variances and categorical data.

With NumPy, generating and working with Chi-Square distributed values is simple and efficient for simulations and analysis.


What is the Chi-Square Distribution?

The Chi-Square distribution is a continuous probability distribution that describes the sum of the squares of independent standard normal variables.

Mathematically:

If $Z_1, Z_2, \dots, Z_k \sim \mathcal{N}(0, 1)$ (standard normal), then:

$$\chi^2 = Z_1^2 + Z_2^2 + \dots + Z_k^2$$

This follows a Chi-Square distribution with k degrees of freedom (df).
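
The definition can be checked numerically. The sketch below is only an illustration: the names `k`, `n_samples`, `manual_chi2`, and `direct_chi2` are chosen for this example, and the seed is arbitrary. It squares and sums k standard normal draws per row, then compares the result with draws from `rng.chisquare`.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

k = 4                # degrees of freedom (illustrative choice)
n_samples = 100_000  # number of simulated chi-square values

# Each row holds k independent standard normal draws
z = rng.standard_normal(size=(n_samples, k))

# Summing the squares across each row gives one chi-square(k) value
manual_chi2 = np.sum(z**2, axis=1)

# Direct draws from NumPy's sampler, for comparison
direct_chi2 = rng.chisquare(df=k, size=n_samples)

print("Manual mean:", manual_chi2.mean(), "| direct mean:", direct_chi2.mean())
print("Manual var: ", manual_chi2.var(), "| direct var: ", direct_chi2.var())
```

Both sets of values should have a mean near k and a variance near 2k, up to sampling noise.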


Key Properties

| Property | Description |
| --- | --- |
| Range | $[0, \infty)$ |
| Shape | Right-skewed (less skew with higher df) |
| Mean | $\mu = k$ |
| Variance | $\sigma^2 = 2k$ |
| Degrees of Freedom | The number of squared standard normals |

NumPy's chisquare() Function

numpy.random.Generator.chisquare(df, size=None)

Parameters

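| Parameter | Description |
| --- | --- |
| df | Degrees of freedom (must be > 0); a float or an array-like of floats |
| size | Output shape (int or tuple of ints); if omitted with a scalar df, a single value is returned |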

✅ Returns

A single value (when size is omitted and df is a scalar) or an array of random values drawn from a Chi-Square distribution.
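
As a quick illustration of how `size` and `df` shape the result, the sketch below calls `rng.chisquare` in four ways. The variable names and the seed are chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

single = rng.chisquare(df=3)               # no size, scalar df: one scalar value
vector = rng.chisquare(df=3, size=5)       # int size: 1-D array of 5 samples
matrix = rng.chisquare(df=3, size=(2, 4))  # tuple size: array with shape (2, 4)
per_df = rng.chisquare(df=[1, 5, 20])      # array-like df: one sample per df value

print(type(single), vector.shape, matrix.shape, per_df.shape)
```

Passing an array-like `df` is handy when one sample per degrees-of-freedom setting is enough. For many samples per setting, a loop or an explicit `size` is usually clearer.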


✅ Example: Generate Chi-Square Data

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(seed=42)

# Generate 1000 chi-square values with 5 degrees of freedom
data = rng.chisquare(df=5, size=1000)

print(data[:5])  # First few samples

Visualizing the Distribution

sns.histplot(data, bins=40, kde=True, color='skyblue')
plt.title("Chi-Square Distribution (df=5)")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.grid(True)
plt.show()

You’ll notice a right-skewed shape typical of the Chi-Square distribution.


Varying the Degrees of Freedom

Let’s compare Chi-Square distributions with different df values:

dfs = [1, 3, 5, 10, 20]

for df in dfs:
    data = rng.chisquare(df=df, size=1000)
    sns.kdeplot(data, label=f'df={df}')

plt.title("Chi-Square Distributions with Varying Degrees of Freedom")
plt.xlabel("Value")
plt.ylabel("Density")
plt.grid(True)
plt.legend()
plt.show()

Observation:

  • Lower df: Highly skewed

  • Higher df: Approaches a normal distribution
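
The same observation can be put into numbers. The sketch below uses a hypothetical helper, `sample_skewness`, written for this example (it is not a NumPy function). It compares the sample skewness at each df with the theoretical skewness of a Chi-Square distribution, sqrt(8 / df).

```python
import numpy as np

rng = np.random.default_rng(seed=2)

def sample_skewness(x):
    """Moment-based sample skewness: mean cubed deviation over std cubed."""
    centered = x - x.mean()
    return np.mean(centered**3) / x.std()**3

skews = {}
for df in [1, 3, 5, 10, 20]:
    draws = rng.chisquare(df=df, size=200_000)
    skews[df] = sample_skewness(draws)
    # Theoretical skewness of a chi-square distribution is sqrt(8 / df)
    print(f"df={df:>2}  sample skew={skews[df]:.3f}  theory={np.sqrt(8 / df):.3f}")
```

The skewness shrinks toward 0 as df grows, which matches the curves becoming more symmetric and normal-like.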


Practical Use Case: Goodness-of-Fit Test (Conceptual)

While NumPy simulates the distribution, statistical libraries like SciPy use it to perform hypothesis tests.

The Chi-Square distribution is commonly used in:

  • Goodness-of-fit tests (is data distributed as expected?)

  • Test of independence in contingency tables

  • Variance tests

Example (not using NumPy directly):

from scipy.stats import chisquare

observed = [18, 22, 20, 25, 15]
expected = [20, 20, 20, 20, 20]

stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"Chi-Square statistic: {stat:.2f}, p-value: {p:.3f}")
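
To see where NumPy fits into such a test, here is a minimal sketch that computes the same Pearson statistic by hand and estimates the p-value by simulation. It assumes the expected counts are fixed in advance, so df = number of categories - 1. The names `null_draws` and `p_sim` are illustrative, and the simulated p-value only approximates the one SciPy computes analytically.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

observed = np.array([18, 22, 20, 25, 15])
expected = np.array([20, 20, 20, 20, 20])

# Pearson statistic: sum over categories of (O - E)^2 / E
stat = np.sum((observed - expected) ** 2 / expected)

# Assumption: expected counts are fixed in advance, so df = categories - 1
df = len(observed) - 1

# Monte Carlo p-value: share of chi-square(df) draws at least as large as stat
null_draws = rng.chisquare(df=df, size=200_000)
p_sim = np.mean(null_draws >= stat)

print(f"Statistic: {stat:.2f}, df: {df}, simulated p-value: {p_sim:.3f}")
```

For these counts the statistic works out to 58 / 20 = 2.90. A large p-value means the observed counts are consistent with the expected ones.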

Full Simulation Example: Generate and Analyze Chi-Square Samples

# Simulate
samples = rng.chisquare(df=10, size=10000)

# Summary
print("Mean:", np.mean(samples))
print("Expected Mean:", 10)
print("Variance:", np.var(samples))
print("Expected Variance:", 2 * 10)

# Plot (stat='density' so the y-axis matches the "Density" label below)
sns.histplot(samples, bins=60, kde=True, stat='density', color='orange')
plt.title("Chi-Square Distribution (df=10, 10k samples)")
plt.xlabel("Value")
plt.ylabel("Density")
plt.grid(True)
plt.show()

The empirical mean and variance should land close to the theoretical values of 10 and 20.


Tips for Using Chi-Square in NumPy

| Tip | Benefit |
| --- | --- |
| ✅ Use larger sample sizes | More stable visualization and analysis |
| ✅ Understand “df” | Higher df = smoother and more normal-like |
| ✅ Use with standard normal samples | Can manually compute χ² from np.random.normal |
| ✅ Seed the RNG for reproducibility | Ensures consistent results across runs |
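
The seeding tip can be illustrated with a short sketch. The seed values and variable names are arbitrary choices for this example.

```python
import numpy as np

# Two generators built from the same seed produce identical streams
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

first = rng_a.chisquare(df=5, size=3)
second = rng_b.chisquare(df=5, size=3)
print("Same seed, same values:", np.array_equal(first, second))

# A generator built from a different seed gives a different stream
rng_c = np.random.default_rng(seed=7)
third = rng_c.chisquare(df=5, size=3)
print("Different seed, same values:", np.array_equal(first, third))
```

Reproducibility applies per generator: drawing from the same `rng` twice gives different values each time, since the generator's state advances with every call.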

⚠️ Common Pitfalls

| Pitfall | Explanation |
| --- | --- |
| ❌ Using zero or negative df | Degrees of freedom must be > 0 |
| ❌ Confusing with normal | Chi-Square is derived from normal, but not symmetric |
| ❌ Forgetting to square normal values | Manual chi-square = sum of squared standard normals |
| ❌ Using for means instead of variances | Chi-Square tests variances and categorical frequencies, not means |
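
The first pitfall is easy to reproduce. The sketch below assumes that `Generator.chisquare` raises `ValueError` for non-positive df, as its documentation states. The exact error message may differ between NumPy versions, so the code only records that an error occurred.

```python
import numpy as np

rng = np.random.default_rng(seed=4)

errors = {}
for bad_df in (-1, 0):
    try:
        rng.chisquare(df=bad_df, size=3)
    except ValueError as exc:
        # Store the message so the rejected inputs can be inspected afterwards
        errors[bad_df] = str(exc)
        print(f"df={bad_df} rejected: {exc}")
```

Validating df before sampling (for example, when it comes from user input) gives a clearer error than letting the sampler fail deep inside a simulation.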

Relation to Other Distributions

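| Distribution | Relation to Chi-Square |
| --- | --- |
| Normal | χ² with k df is the sum of squares of k independent standard normals |
| Gamma | Chi-Square with k df is a Gamma distribution with shape k/2 and scale 2 |
| F-distribution | Ratio of two independent chi-square variables, each divided by its df |
| Student’s t | Standard normal divided by the square root of an independent chi-square variable over its df |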
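
The Gamma relation can be checked by simulation. The sketch below relies on the standard identity that a Chi-Square distribution with k degrees of freedom is a Gamma distribution with shape k/2 and scale 2. The choice of `k`, the sample size, and the quantile levels are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=5)

k = 6        # degrees of freedom (illustrative choice)
n = 200_000  # samples per distribution

chi2_draws = rng.chisquare(df=k, size=n)
# Chi-square(k) coincides with Gamma(shape=k/2, scale=2)
gamma_draws = rng.gamma(shape=k / 2, scale=2.0, size=n)

# Compare a few empirical quantiles of the two samples
probs = [0.1, 0.5, 0.9]
chi2_q = np.quantile(chi2_draws, probs)
gamma_q = np.quantile(gamma_draws, probs)

print("Chi-square quantiles:", np.round(chi2_q, 2))
print("Gamma quantiles:     ", np.round(gamma_q, 2))
```

The two rows of quantiles should agree up to sampling noise.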

Conclusion

The Chi-Square distribution is vital for statistical modeling, especially when analyzing categorical data and variances. NumPy provides an easy and efficient way to generate data and perform simulations.


Summary

| Feature | Value |
| --- | --- |
| Function | rng.chisquare(df, size) |
| Parameters | Degrees of freedom (df > 0) |
| Shape | Skewed right (less with higher df) |
| Use Cases | Goodness-of-fit, test of independence |
| Related to | Normal, Gamma, F-distribution |