Statistical Significance (p-Value)¶

Prime #: 435
Origin domain: Statistics & Experimental Design
Aliases: P Value, Significance Level, Fisherian P Value, Tail Probability, Observed Significance Level, Statistical Significance
Related primes: Hypothesis Testing (Null vs. Alternative), Type I & Type II Errors, Statistical Power, Confidence Intervals, Effect Size, Multiple Comparisons Correction, Reproducibility & Replicability, Bayesian Updating, Randomization

Core Idea¶

Statistical significance quantifies how probable the observed results, or results more extreme, would be if the null hypothesis were true and only random chance were at work. If that probability (the p-value) falls below a pre-chosen threshold (e.g., < 0.05), the effect is deemed "significant": the data would be surprising under the null, which counts as evidence against it rather than proof of a real effect.

How would you explain it like I'm…

How Weird Is This?

Imagine your friend says, 'I can guess heads or tails every time!' You flip a coin and they guess right 10 times in a row. You think: that's really weird if they're just guessing. A p-value is a number that says how surprising your result would be if nothing special were really going on. Small number, big surprise.

Coincidence number

Suppose you test whether a new cereal makes kids grow taller. You compare two groups and the cereal group ends up a bit taller. But maybe that just happened by luck. A p-value is a number that says: 'If the cereal really did nothing, how often would I see a difference this big just from chance?' If the answer is 'almost never' (like less than 5 times in 100), scientists often call the result 'statistically significant.' But it doesn't prove the cereal works — it's only one clue, and you need more studies to be sure.

P-value

Statistical significance is the tail-probability-as-evidence-against-the-null principle. The p-value is the probability — calculated under an assumed null hypothesis — of seeing a test statistic at least as extreme as the one you actually observed. It's a continuous summary of how incompatible the data are with the null, ranging from 0 (data impossible under the null) to 1 (data exactly typical under the null). The 0.05 threshold for calling a result 'statistically significant' is a historical convention, not a principled boundary. Crucially, a p-value measures P(data | null), not P(null | data) — confusing these is the prosecutor's fallacy. A p-value isn't an effect size, isn't a Type I error rate for your specific result, and a single significant finding doesn't establish an effect; replication does.

Statistical significance is the principle that a tail probability under a null model can serve as a continuous measure of evidence against that null. The p-value is the probability, computed under an assumed null hypothesis H0 and its associated probability model, of observing a test statistic at least as extreme as the one actually observed, where 'extreme' is defined by the alternative hypothesis (one-sided: more extreme in a specified direction; two-sided: in either direction). It ranges from 0 (data impossible under H0) to 1 (data exactly typical under H0). The 0.05 threshold is a historical convention, not a principled boundary. The concept originates with Fisher's 1925 Statistical Methods for Research Workers, where the p-value was introduced as a continuous measure of evidence; Neyman and Pearson's 1933 framework embedded it within decision-theoretic hypothesis testing (alpha as a pre-specified accept/reject threshold), and contemporary practice is a hybrid of the two. The p-value measures P(data as extreme | H0), not P(H0 | data); this conditional-probability asymmetry underlies most misinterpretations (the transposed conditional, or prosecutor's fallacy). Other widespread errors include reading the p-value as the Type I error rate for a specific result, as a measure of effect size, or as a boundary between real and unreal effects, and treating a single significant finding as establishing an effect. The American Statistical Association's 2016 statement on p-values catalogued these misinterpretations, and a 2019 editorial in The American Statistician went further in calling for reform.
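
As a rough illustration of the tail-probability definition, the coin-guessing scenario can be worked out exactly. The sketch below assumes a fair-coin null (each guess is right with probability 0.5) and independent flips. The function names `binomial_tail_p` and `two_sided_p_fair_coin` are illustrative and do not come from any library.

```python
from math import comb


def binomial_tail_p(successes: int, trials: int, p_null: float = 0.5) -> float:
    """One-sided p-value: P(X >= successes) for X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def two_sided_p_fair_coin(successes: int, trials: int) -> float:
    """Two-sided p-value under a fair-coin null (p = 0.5).

    The null distribution is symmetric here, so the smaller tail is doubled
    and the result is capped at 1. This shortcut is only valid for p = 0.5.
    """
    upper = binomial_tail_p(successes, trials)           # P(X >= successes)
    lower = binomial_tail_p(trials - successes, trials)  # equals P(X <= successes) by symmetry
    return min(1.0, 2 * min(upper, lower))


# Ten correct guesses out of ten flips.
print(binomial_tail_p(10, 10))        # 0.0009765625, i.e. 1/1024
print(two_sided_p_fair_coin(10, 10))  # 0.001953125, i.e. 2/1024
```

The one-sided value answers "how often would pure guessing do at least this well?", while the two-sided value also counts results that are equally extreme in the opposite direction, which is why it is twice as large here.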

Broad Use¶

Medical Research: Declaring a new treatment "statistically significant" if the improvement rate is unlikely to be due to random fluctuation.
Psychology & Social Science: Checking if a measured difference in group means is large enough to surpass a threshold of chance variability.
Business Analytics: Determining if an A/B test effect is "real" or a fluke.
Engineering Quality: Accepting a manufacturing improvement as credible only if the p-value falls below a strict alpha (e.g., 0.01).

Clarity¶

Offers a structured criterion for deciding whether the data contradict the null "too strongly" to attribute to chance. It does not measure effect size or practical importance, only how improbable results this extreme would be under the null.

Manages Complexity¶

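The p-value compresses the data's variation into a single number, and comparing it with a threshold reduces that further to a yes/no decision. It must be interpreted cautiously to avoid pitfalls such as p-hacking (running many analyses and reporting only those that cross the threshold) or ignoring context.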
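
One way to see why running many analyses is risky: even when every null hypothesis is true, the chance that at least one test crosses the threshold grows quickly with the number of tests. The sketch below assumes independent tests and a continuous test statistic, so that each p-value is uniform on [0, 1] under its null. The function names are illustrative.

```python
import random


def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """P(at least one p < alpha) across independent tests when all nulls are true."""
    return 1 - (1 - alpha) ** num_tests


def simulate_any_significant(
    num_tests: int, alpha: float = 0.05, trials: int = 10_000, seed: int = 0
) -> float:
    """Monte Carlo estimate of the same probability.

    Assumption: under a true null with a continuous test statistic, each
    p-value is Uniform(0, 1), so it is drawn directly rather than computed
    from simulated data.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(rng.random() < alpha for _ in range(num_tests)):
            hits += 1
    return hits / trials


print(familywise_error_rate(1))      # about 0.05
print(familywise_error_rate(20))     # about 0.64
print(simulate_any_significant(20))  # should land near the analytic value
```

With 20 independent looks at pure noise, roughly two runs in three would produce at least one "significant" result at the 0.05 level, which is the motivation for multiple comparisons corrections.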

Abstract Reasoning¶

Reveals that variation in data is normal, and we need a threshold to separate "could be random" from "probably not random." The concept extends to any domain employing chance modeling.

Knowledge Transfer¶

Political Polling: If a candidate's share differs from 50/50 with p < 0.05, pollsters describe the lead as "statistically significant."
Finance: Testing if a trading strategy's outperformance might simply be luck or truly indicates skill.
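
The polling case can be sketched as a one-sample test of a proportion against 50/50. The example below uses a normal approximation to the binomial and assumes a simple random sample. The respondent counts are hypothetical, and `poll_two_sided_p` is an illustrative name.

```python
from math import erfc, sqrt


def poll_two_sided_p(supporters: int, respondents: int, p_null: float = 0.5) -> float:
    """Two-sided p-value for an observed share versus p_null (normal approximation)."""
    observed = supporters / respondents
    standard_error = sqrt(p_null * (1 - p_null) / respondents)
    z = (observed - p_null) / standard_error
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return erfc(abs(z) / sqrt(2))


# Hypothetical polls of 1,000 respondents each.
print(poll_two_sided_p(530, 1000))  # roughly 0.058: a 53% share misses the 0.05 cutoff
print(poll_two_sided_p(550, 1000))  # roughly 0.002: a 55% share clears it easily
```

A three-point lead in a sample of this size is still compatible with a tied race at the conventional threshold, which is why the sample size matters as much as the headline share.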

Example¶

A marketing A/B test finds a 6% higher click-through rate in variant B, with p = 0.01. The team reads this as follows: if there were no real difference between the variants, a difference at least this large would arise by chance only about 1% of the time. On that basis they adopt version B.
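
A p-value for a comparison like this could come from a pooled two-proportion z-test. The sketch below is one such calculation under stated assumptions: the click and view counts are hypothetical (the example does not give the underlying data), the 6% is read as a relative lift from a 10.0% to a 10.6% click-through rate, and `two_proportion_p_value` is an illustrative name.

```python
from math import erfc, sqrt


def two_proportion_p_value(
    clicks_a: int, views_a: int, clicks_b: int, views_b: int
) -> float:
    """Two-sided p-value for H0: both variants share one click-through rate."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Under H0 the best estimate of the common rate pools both groups.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / standard_error
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return erfc(abs(z) / sqrt(2))


# Hypothetical counts: 10.0% vs 10.6% click-through, 35,000 views per variant.
print(two_proportion_p_value(3500, 35_000, 3710, 35_000))  # roughly 0.009

# Same rates with only 10,000 views per variant.
print(two_proportion_p_value(1000, 10_000, 1060, 10_000))  # roughly 0.16
```

The same observed lift is significant at one traffic level and not at the other, so the p-value alone says nothing about whether a 6% lift is large enough to matter commercially.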

Relationships to Other Abstractions¶

Current abstraction Statistical Significance (p-Value) Prime

Parents (3) — more general patterns this builds on

Statistical Significance (p-Value) is a kind of Statistical Inference Prime

Statistical significance is a specialization of statistical inference that summarizes sample-data incompatibility with a null via a tail probability.
Statistical Significance (p-Value) presupposes Hypothesis Testing (Null vs. Alternative) Prime

Statistical significance presupposes hypothesis testing because the p-value is read as evidence-against only within a pre-specified null/alternative testing frame.
Statistical Significance (p-Value) presupposes Probability Prime

Statistical Significance presupposes Probability: a p-value is the tail probability of a test statistic under an assumed null model.

Children (2) — more specific cases that build on this

Type M Error Domain-specific is part of Statistical Significance (p-Value)

Type M Error contains a statistical-significance gate that selects the tail estimates whose conditional magnitude is evaluated.
Type S Error Domain-specific is part of Statistical Significance (p-Value)

Type S Error contains a two-sided statistical-significance gate whose opposite-tail admissions can carry the wrong sign.

Hierarchy paths (11) — routes to 5 parentless roots

Statistical Significance (p-Value) → Statistical Inference → Inductive Reasoning


Not to Be Confused With¶

Statistical Significance (p-value) is not Statistical Power because the p-value is the probability, computed assuming the null hypothesis is true, of observing data at least as extreme as those observed, while Statistical Power is the probability that a test rejects the null when a specified real effect exists.
Statistical Significance (p-value) is not Statistical Inference because Statistical Significance is a specific decision criterion for hypothesis tests, while Statistical Inference is the broader framework for reasoning about populations from sample data.
Statistical Significance (p-value) is not Effect Size because Statistical Significance depends on sample size and data variability, while Effect Size measures the magnitude of an effect independent of sample size.
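
The dependence on sample size can be made concrete with a small sketch. It assumes a one-sample z-test with known standard deviation, in which the observed mean sits exactly `effect_size_d` standard deviations from the null mean, so the test statistic is z = d * sqrt(n). The function name `p_value_for_effect` is illustrative.

```python
from math import erfc, sqrt


def p_value_for_effect(effect_size_d: float, n: int) -> float:
    """Two-sided p-value when the observed standardized effect is d with n samples."""
    z = effect_size_d * sqrt(n)
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return erfc(abs(z) / sqrt(2))


# The same small effect (d = 0.2) at three sample sizes.
for n in (25, 100, 400):
    print(n, p_value_for_effect(0.2, n))
# n = 25  -> about 0.32
# n = 100 -> about 0.046
# n = 400 -> about 0.00006
```

The effect size is identical in all three rows; only the amount of data changes, and with it the p-value moves from unremarkable to overwhelming.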