Hypothesis Testing Simulator: Understand P-Values

Simulator · Intermediate · ~10 min
p ≈ 0.036 — reject the null at α = 0.05

With n = 30, a sample mean of 105 versus a null mean of 100, and σ = 15, the test statistic is z = 5/(15/√30) ≈ 1.83. That corresponds to a one-tailed p-value of about 0.034 (two-tailed, about 0.068), so the one-tailed test rejects the null hypothesis at the 5% significance level. Because the simulator draws a fresh random sample on each run, the displayed p-value fluctuates around this figure.

Formula

z = (x̄ - μ₀) / (σ / √n)
p-value = 2 × (1 - Φ(|z|)) for two-tailed test
Power = Φ((μ₁ - μ₀)/(σ/√n) - z_α) for one-tailed test at level α
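
As a sanity check on these formulas, here is a minimal Python sketch (the z_test and power_one_tailed helpers are our own naming, assuming NumPy and SciPy; they are not part of the simulator):

import numpy as np
from scipy import stats

def z_test(sample_mean, mu0, sigma, n, two_tailed=True):
    """One-sample z-test: returns (z, p-value)."""
    z = (sample_mean - mu0) / (sigma / np.sqrt(n))
    if two_tailed:
        p = 2 * (1 - stats.norm.cdf(abs(z)))
    else:
        p = 1 - stats.norm.cdf(z)  # upper one-tailed
    return z, p

def power_one_tailed(mu0, mu1, sigma, n, alpha=0.05):
    """Power of the upper one-tailed z-test when the true mean is mu1."""
    z_alpha = stats.norm.ppf(1 - alpha)
    return 1 - stats.norm.cdf(z_alpha - (mu1 - mu0) / (sigma / np.sqrt(n)))

z, p = z_test(105, 100, 15, 30, two_tailed=False)
print(f"z = {z:.3f}, one-tailed p = {p:.4f}")               # z ≈ 1.826, p ≈ 0.034
print(f"power = {power_one_tailed(100, 105, 15, 30):.3f}")  # ≈ 0.572

Note that 1 - Φ(z_α - δ), with δ = (μ₁ - μ₀)/(σ/√n), is the same quantity as Φ(δ - z_α) in the formula above.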

The Most Misunderstood Number in Science

The p-value is arguably the most influential — and most misinterpreted — number in all of science. Introduced by Ronald Fisher in 1925, it was meant as an informal measure of evidence against a null hypothesis. Nearly a century later, it has become a rigid threshold that determines what gets published, what drugs get approved, and what policies get enacted. Understanding what p-values actually measure, and what they do not, is essential statistical literacy.

How the Z-Test Works

The one-sample z-test asks: is the sample mean far enough from the hypothesized population mean to be unlikely under the null hypothesis? The test statistic z measures this distance in units of standard error. A larger z means the sample mean is further from the null — making the null hypothesis less plausible. The p-value converts this distance into a probability under the null distribution.
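To make this concrete, the sketch below (assuming NumPy; parameters match the running example) approximates the two-tailed p-value by brute force: draw many samples under the null and count how often the statistic lands at least as far from zero as the observed one.

import numpy as np

rng = np.random.default_rng(42)
mu0, sigma, n = 100, 15, 30
se = sigma / np.sqrt(n)  # standard error of the mean

# Observed statistic for a sample mean of 105
z_obs = (105 - mu0) / se

# Draw 100,000 samples of size n under the null and compute their z-statistics
z_null = (rng.normal(mu0, sigma, size=(100_000, n)).mean(axis=1) - mu0) / se

# Two-tailed p-value: how often does |z| under the null reach |z_obs|?
p_empirical = np.mean(np.abs(z_null) >= abs(z_obs))
print(f"z_obs = {z_obs:.3f}, empirical two-tailed p = {p_empirical:.4f}")  # ≈ 0.068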

Power and Sample Size

Statistical power is the probability of correctly rejecting a false null hypothesis. It depends on three factors: the true effect size, the sample size, and the significance level α. Underpowered studies — which are alarmingly common — waste resources and produce unreliable results. This simulator lets you see exactly how increasing the sample size or effect size boosts power toward the conventional 80% target.
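For intuition about the 80% target, here is a small sketch of the standard closed-form sample-size calculation for a one-tailed z-test (the n_required helper is our own naming):

import math
from scipy import stats

def n_required(mu0, mu1, sigma, alpha=0.05, target_power=0.80):
    """Smallest n reaching the target power for an upper one-tailed z-test."""
    z_alpha = stats.norm.ppf(1 - alpha)
    z_beta = stats.norm.ppf(target_power)
    return math.ceil(((z_alpha + z_beta) * sigma / (mu1 - mu0)) ** 2)

print(n_required(100, 105, 15))  # 56 -- n = 30 only reaches about 57% power here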

The Replication Crisis

The widespread misuse of p-values contributed to science's replication crisis, where many published findings failed to reproduce. Researchers engaged in p-hacking — running multiple analyses until p < 0.05 appeared by chance. The American Statistical Association issued an unprecedented statement in 2016 warning against over-reliance on p-values. Modern best practice emphasizes effect sizes, confidence intervals, and pre-registration of hypotheses.
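The multiple-analysis mechanism behind p-hacking is easy to simulate: under a true null, p-values are uniform on [0, 1], so 20 independent looks at the data give a 1 - 0.95^20 ≈ 64% chance of at least one spurious hit. A sketch (the counts 20 and 10,000 are illustrative choices, not from the text):

import numpy as np

rng = np.random.default_rng(0)

# Under a true null, p-values are uniform on [0, 1]
p_values = rng.uniform(size=(10_000, 20))  # 10,000 "studies", 20 analyses each

# Chance that at least one of the 20 analyses comes out "significant"
false_positive = np.mean((p_values < 0.05).any(axis=1))
print(f"P(at least one p < 0.05 across 20 null analyses) ≈ {false_positive:.2f}")  # ≈ 0.64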

FAQ

What does a p-value actually mean?

A p-value is the probability of observing data at least as extreme as your sample, assuming the null hypothesis is true. It is NOT the probability that the null hypothesis is true. A p-value of 0.03 means there's a 3% chance of seeing results at least this extreme if there were truly no effect.

What is the difference between Type I and Type II error?

A Type I error (false positive) occurs when you reject a true null hypothesis — its rate is controlled by α. A Type II error (false negative) occurs when you fail to reject a false null hypothesis. Power = 1 - P(Type II error) measures your ability to detect real effects.
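A quick simulation (reusing the running example's parameters and assuming a one-tailed test) shows both rates directly: when the null is true the test rejects about α of the time, and when it is false it rejects at the power rate.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu0, sigma, n, alpha = 100, 15, 30, 0.05
se = sigma / np.sqrt(n)
crit = stats.norm.ppf(1 - alpha)  # one-tailed critical value, ≈ 1.645

def rejection_rate(true_mean, trials=100_000):
    """Fraction of simulated one-tailed z-tests that reject H0."""
    # Sample means of n draws are distributed N(true_mean, se), so draw them directly
    sample_means = rng.normal(true_mean, se, size=trials)
    return np.mean((sample_means - mu0) / se > crit)

print(f"Rejection rate, true mean 100 (Type I rate): {rejection_rate(100):.3f}")  # ≈ 0.05
print(f"Rejection rate, true mean 105 (power):       {rejection_rate(105):.3f}")  # ≈ 0.57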

Why is p < 0.05 used as a threshold?

The 0.05 threshold is a convention established by Ronald Fisher in the 1920s, not a law of nature. Different fields use different thresholds — particle physics requires p < 0.0000003 (5σ). The appropriate threshold depends on the costs of false positives versus false negatives.

Can a statistically significant result be practically meaningless?

Absolutely. With a large enough sample size, even a trivially small effect becomes statistically significant. A drug that lowers blood pressure by 0.1 mmHg might achieve p < 0.001 with n = 100,000, but the effect is clinically irrelevant. Always report effect sizes alongside p-values.
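To put numbers on that example, here is a sketch with an assumed standard deviation of 8 mmHg (the answer above doesn't specify one):

import numpy as np
from scipy import stats

# Hypothetical numbers: 0.1 mmHg mean reduction, assumed sd of 8 mmHg, n = 100,000
effect, sigma, n = 0.1, 8.0, 100_000
z = effect / (sigma / np.sqrt(n))
p_two_tailed = 2 * (1 - stats.norm.cdf(z))
cohens_d = effect / sigma  # standardized effect size
print(f"z = {z:.2f}, p = {p_two_tailed:.5f}, d = {cohens_d:.4f}")
# z ≈ 3.95, p ≈ 0.00008, d ≈ 0.0125: highly "significant", clinically negligible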
