Statistical Significance (p-Value)¶

Prime #: 435
Origin domain: Statistics & Experimental Design
Aliases: P Value, Significance Level, Fisherian P Value, Tail Probability, Observed Significance Level, Statistical Significance
Related primes: Hypothesis Testing (Null vs. Alternative), Type I & Type II Errors, Statistical Power, Confidence Intervals, Effect Size, Multiple Comparisons Correction, Reproducibility & Replicability, Bayesian Updating, Randomization

Core Idea¶

Statistical significance is the tail-probability-as-evidence-against-null principle that operationalizes the degree to which observed data are incompatible with a specified null hypothesis. The p-value is the probability, computed under an assumed null hypothesis H₀ and associated probability model, of observing a test statistic at least as extreme as the one actually observed—where "extreme" is defined by the alternative hypothesis (one-sided: more extreme in specified direction; two-sided: more extreme in either direction). It is a continuous summary of sample evidence's incompatibility with the null, ranging from 0 (data impossible under null) to 1 (data exactly typical under null)^[1]. The convention in much of biomedical and social science is to label results with p < 0.05 as "statistically significant," though this threshold is historical rather than principled. The concept originates with Ronald A. Fisher's 1925 Statistical Methods for Research Workers, where Fisher introduced the p-value as a continuous measure of evidence and suggested 0.05 as a convenient benchmark^[2]. Karl Pearson had worked with related tail-probability concepts in his chi-squared framework (1900); Neyman and Pearson 1933 embedded the p-value within the decision-theoretic framework of hypothesis testing (α as pre-specified threshold for reject/do-not-reject decisions). The contemporary notion combines Fisher's continuous-evidence interpretation with Neyman-Pearson's dichotomous-decision framework into a hybrid that has been the subject of extensive methodological debate for nearly a century^[3]. The p-value measures P(data at least as extreme | H₀), not P(H₀ | data)—a fundamental asymmetry of conditional probability that underlies the majority of misinterpretations (the transposed-conditional or prosecutor's fallacy). 
Additional well-documented confusions include treating the p-value as the probability of a Type I error for a specific result (it is not; Type I error rate is α, controlled across the long run), treating it as a measure of effect size (it is not; small effects in large samples produce tiny p-values; large effects in small samples can fail to reach significance), treating the threshold as a boundary between "real" and "not-real" effects (it is not), and treating a single significant finding as establishing the effect (it does not; replication is required)^[1]. The American Statistical Association's 2016 statement on p-values and 2019 follow-up explicitly codified these misinterpretations and advocated for reform; the contested_construct flag reflects sustained methodological critique.
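
As a minimal sketch of the definition above, the following computes an exact tail probability for a coin-guessing setup, where the null hypothesis is pure guessing (success probability 0.5). The function name `binomial_tail_p` and its parameters are illustrative assumptions, not part of any cited source; the two-sided rule used here (sum outcomes no more probable than the observed one) is one common convention among several.

```python
from math import comb

def binomial_tail_p(k_obs: int, n: int, p0: float = 0.5, two_sided: bool = False) -> float:
    """P-value for k_obs successes in n trials under H0: success probability = p0."""
    # Null distribution of the test statistic (number of successes).
    pmf = [comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(n + 1)]
    if not two_sided:
        # One-sided: outcomes with at least as many successes as observed.
        return sum(pmf[k_obs:])
    # Two-sided: outcomes no more probable under H0 than the observed one.
    cutoff = pmf[k_obs] * (1 + 1e-9)
    return min(1.0, sum(q for q in pmf if q <= cutoff))

p_one = binomial_tail_p(10, 10)                  # 10 correct guesses out of 10
p_two = binomial_tail_p(10, 10, two_sided=True)
print(p_one, p_two)
```

Either number is P(data at least as extreme | H₀); neither is the probability that the guesser has no special ability.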

How would you explain it like I'm…

How Weird Is This?

Imagine your friend says, 'I can guess heads or tails every time!' You flip a coin and they guess right 10 times in a row. You think: that's really weird if they're just guessing. A p-value is a number that says how surprising your result would be if nothing special were really going on. Small number, big surprise.

Coincidence number

Suppose you test whether a new cereal makes kids grow taller. You compare two groups and the cereal group ends up a bit taller. But maybe that just happened by luck. A p-value is a number that says: 'If the cereal really did nothing, how often would I see a difference this big just from chance?' If the answer is 'almost never' (like less than 5 times in 100), scientists often call the result 'statistically significant.' But it doesn't prove the cereal works — it's only one clue, and you need more studies to be sure.

P-value

Statistical significance is the tail-probability-as-evidence-against-the-null principle. The p-value is the probability — calculated under an assumed null hypothesis — of seeing a test statistic at least as extreme as the one you actually observed. It's a continuous summary of how incompatible the data are with the null, ranging from 0 (data impossible under the null) to 1 (data exactly typical under the null). The 0.05 threshold for calling a result 'statistically significant' is a historical convention, not a principled boundary. Crucially, a p-value measures P(data | null), not P(null | data) — confusing these is the prosecutor's fallacy. A p-value isn't an effect size, isn't a Type I error rate for your specific result, and a single significant finding doesn't establish an effect; replication does.

Statistical significance is the principle that a tail probability under a null model can serve as a continuous measure of evidence against that null. The p-value is the probability, computed under an assumed null hypothesis H0 and its associated probability model, of observing a test statistic at least as extreme as the one actually observed — where 'extreme' is defined by the alternative hypothesis (one-sided: more extreme in a specified direction; two-sided: in either direction). It ranges from 0 (data impossible under H0) to 1 (data exactly typical under H0). The 0.05 threshold is historical convention, not principled. The concept originates with Fisher's 1925 Statistical Methods for Research Workers, where the p-value was introduced as a continuous measure of evidence; Neyman and Pearson's 1933 framework embedded it within decision-theoretic hypothesis testing (alpha as a pre-specified accept/reject threshold), and contemporary practice is a hybrid of these two. The p-value measures P(data as extreme | H0), not P(H0 | data) — a conditional-probability asymmetry that underlies most misinterpretations (the transposed conditional, or prosecutor's fallacy). Other widespread errors include reading it as the Type I error rate for a specific result, as a measure of effect size, as a boundary between real and unreal effects, and treating a single significant finding as establishing an effect. The American Statistical Association's 2016 and 2019 statements codified these misinterpretations and called for reform.

Structural Signature¶

A p-value computation exhibits: (a) the null hypothesis as probability model: a well-defined H₀ specifying a parameter value or relationship; (b) the probability model under H₀, specifying the joint distribution of the data; (c) the test statistic as data summary: T(X), a function of sample data with a known or derivable distribution under H₀; (d) the null distribution: the sampling distribution of T(X) under H₀, derived from theory (t, F, χ²), randomization (permutation), or simulation (bootstrap); (e) the observed test-statistic value t* computed from realized data; (f) the definition of "at least as extreme": one-sided or two-sided, aligned with the alternative hypothesis; (g) the p-value computation: p = P(T(X) at least as extreme as t* | H₀ and model), a tail probability of the null distribution; (h) proper handling of multiplicity when multiple tests are conducted, using adjusted p-values or FDR-controlled procedures; (i) reporting of the p-value alongside the effect estimate and confidence interval, not the p-value alone^[4]. When these elements are properly implemented and interpreted, the p-value provides a calibrated continuous measure of evidence against H₀; when they are compromised (wrong null distribution, undisclosed multiplicity, selective reporting, post-hoc hypothesis specification), the nominal p-value bears little relation to the actual long-run frequency of extreme test statistics under the null and the evidential interpretation fails.
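
Elements (a) through (g) can be sketched with a randomization-based null distribution. The example below is a hedged illustration only: the function name `permutation_p_value`, the difference-in-means statistic, the number of permutations, and the sample data are all assumptions chosen for the sketch.

```python
import random

def permutation_p_value(group_a, group_b, n_perm=5000, seed=0):
    """Two-sided Monte Carlo permutation p-value for a difference in means.

    H0: group labels are exchangeable (no difference between groups).
    """
    rng = random.Random(seed)
    mean = lambda xs: sum(xs) / len(xs)
    t_obs = mean(group_a) - mean(group_b)      # (e) observed test statistic
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):                    # (d) null distribution by relabeling
        rng.shuffle(pooled)
        t = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(t) >= abs(t_obs) - 1e-12:       # (f) two-sided "at least as extreme"
            extreme += 1
    # (g) tail probability; the add-one correction keeps the estimate above zero
    return (extreme + 1) / (n_perm + 1)

# Made-up measurements for illustration only.
a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4]
b = [4.2, 4.4, 4.1, 4.8, 4.5, 4.3]
print(permutation_p_value(a, b))
```

Elements (h) and (i) are not covered here: a single call says nothing about how many other comparisons were run, and the effect estimate still has to be reported separately.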

What It Is Not¶

Not the probability that the null hypothesis is true. This is the most common misinterpretation. The p-value is P(data at least as extreme | H₀), not P(H₀ | data). Obtaining the latter requires Bayesian analysis with a prior on H₀.
Not the probability that the observed result arose by chance. The result is fixed; what is probabilistic is the sampling framework in which it was observed. The p-value quantifies how extreme the observed result is under H₀, not the probability that H₀ explains the result.
Not a measure of effect size or practical importance. A p-value of 0.0001 with an effect of 0.001 standard deviations is statistically striking but practically trivial; a p-value of 0.10 with an effect of 0.5 standard deviations is statistically inconclusive but potentially important.
Not the Type I error probability for this specific result. Type I error rate is α, a pre-specified long-run property; the p-value is a specific data summary. These are conceptually distinct even when the p-value is compared to α for decision-making^[5].
Not a threshold between "real" and "not-real" effects. Effects exist on a continuum; p = 0.049 and p = 0.051 represent essentially identical evidence despite different significance classifications. The dichotomous threshold is a decision convention, not an epistemic boundary.
Not evidence for the alternative hypothesis in a well-defined sense. The p-value measures incompatibility with H₀ under the assumed model. Rejection of H₀ does not specifically support any particular alternative unless the test is designed with a specific alternative and the probability model is correct.
Not meaningful if the null distribution is misspecified. P-values computed under incorrect null distributions have no well-defined frequency interpretation.
Not independent of sample size. The distribution of p-values under H₁ depends on power, which depends on sample size, effect size, and measurement precision. At very large sample sizes, essentially every point null is rejected, because a point null is almost never exactly true; at very small sample sizes, even large effects may fail to produce small p-values.
Not additive or multiplicative across tests. Combining p-values across studies requires specific methods (Fisher's method, Stouffer's method); naive combination has no standard interpretation.
Not sufficient by itself for publication or policy decisions. Increasing methodological consensus holds that p-values should be reported alongside effect sizes, confidence intervals, sample size, assumption checks, and replication status, not as sole basis for claims.
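
The effect-size and sample-size points in the list above can be made concrete with a small calculation. This is a sketch under simplifying assumptions (two equal-sized groups, known unit variance, and an observed difference exactly equal to the stated effect); the function name `two_sample_z_p` is hypothetical.

```python
from math import erfc, sqrt

def two_sample_z_p(effect_sd: float, n_per_group: int) -> float:
    """Two-sided p-value when the observed mean difference equals effect_sd
    standard deviations, with n_per_group observations per arm."""
    z = effect_sd * sqrt(n_per_group / 2)
    return erfc(abs(z) / sqrt(2))   # equals 2 * (1 - Phi(|z|))

# Same effect, different n; then a tiny effect at small and very large n.
for d, n in [(0.5, 10), (0.5, 100), (0.01, 100), (0.01, 1_000_000)]:
    print(f"d={d:<5} n={n:<9} p={two_sample_z_p(d, n):.3g}")
```

A half-standard-deviation effect is far from significant with 10 observations per group, while a hundredth-of-a-standard-deviation effect yields a minuscule p-value at a million per group, which is why the p-value cannot stand in for effect size.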

Broad Use¶

Clinical trials and regulatory approval (canonical high-stakes context): Pre-specified p-values for primary endpoints serve as the formal basis for regulatory decisions about drug approval^[4]. FDA and EMA guidance specify α = 0.05 two-sided (or 0.025 one-sided) as the standard confirmatory threshold, with adjustments for multiplicity when multiple endpoints or comparisons are involved. ICH E9 (Statistical Principles for Clinical Trials) and related guidelines codify the role of p-values in the confirmatory-trial framework. The replication-requirement culture (typically two positive Phase 3 trials for full approval) partially addresses the false-discovery concerns inherent in single-study p-values. Adaptive trials, group-sequential designs, and seamless Phase 2/3 trials extend p-value methodology to more complex designs while preserving nominal error-rate control.

Psychology and behavioral science (canonical reform context): NHST with p-value reporting dominated psychology through the 20th century. The replication crisis beginning around 2011 (Bem ESP paper; Open Science Collaboration 2015 large-scale replication study reporting ~36% replication rate) prompted widespread methodological reform^[6]. Sources of the crisis include p-hacking (selective analysis to achieve p < 0.05), garden-of-forking-paths (implicit multiple comparisons through analytical flexibility), HARKing (hypothesizing after results are known), and publication bias (selective publication of significant results). Responses include pre-registration (AsPredicted, OSF), registered reports (journals commit based on method not results), effect-size reporting as standard, and Bayesian alternatives. The ASA 2016 and 2019 statements emerged partly from this context.

Physics and particle physics: The 5σ convention (p ≈ 3×10⁻⁷ one-sided) reflects the extraordinary-evidence standard for fundamental-physics discovery claims^[7]. The 2012 Higgs-boson discovery announcement reported significance at approximately the 5σ level from both the ATLAS and CMS experiments. The look-elsewhere effect is the particle-physics term for the multiplicity problem that arises when a signal is searched for across a parameter space; it is corrected with a trials factor.
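
The mapping between a "number of sigmas" and a tail probability is a standard-normal calculation. The helper below is a small sketch; the name `sigma_to_p` is an assumption introduced for illustration.

```python
from math import erfc, sqrt

def sigma_to_p(n_sigma: float, two_sided: bool = False) -> float:
    """Tail probability of a standard normal beyond n_sigma."""
    one_sided = 0.5 * erfc(n_sigma / sqrt(2))
    return 2 * one_sided if two_sided else one_sided

print(f"{sigma_to_p(5):.3g}")                     # one-sided 5 sigma
print(f"{sigma_to_p(3, two_sided=True):.3g}")     # two-sided 3 sigma
print(f"{sigma_to_p(1.96, two_sided=True):.3g}")  # the conventional 0.05
```

The one-sided 5σ value is roughly 2.9×10⁻⁷, and the two-sided 3σ value is roughly 0.0027, the figure associated with ±3σ control limits in quality control.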

Economics and econometrics: "Stars culture," the annotation of regression coefficients with * (p < 0.10), ** (p < 0.05), and *** (p < 0.01), has been the standard reporting convention in empirical economics. Ziliak and McCloskey's 2008 The Cult of Statistical Significance, and the methodological movement that followed it, argued that stars-culture reporting reduces continuous evidence to dichotomous labels and obscures questions of economic significance. The credibility revolution in empirical economics (Angrist-Pischke) emphasizes research design and robust inference, with p-values as one of several evidence summaries rather than the primary deliverable.

Genetics and genomics: Genome-wide association studies use extremely stringent significance thresholds (5×10⁻⁸, corresponding to Bonferroni correction for ~1 million independent tests). Manhattan plots display −log₁₀(p) across the genome, with the 5×10⁻⁸ line as the visualization threshold for genome-wide significance. False discovery rate (FDR) methods (Benjamini-Hochberg 1995) provide a less-stringent alternative for exploratory studies. Replication of GWAS findings in independent cohorts is standard, addressing false-discovery concerns beyond single-study thresholds.
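
The two multiplicity approaches named above can be compared on a toy list of p-values. This is a minimal sketch: the function names and the example p-values are made up for illustration, and the Benjamini-Hochberg guarantee is stated here for independent tests.

```python
def bonferroni_reject(pvals, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure (controls FDR at level q
    for independent tests)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k_max = rank              # largest rank meeting its threshold
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

pvals = [0.001, 0.012, 0.02, 0.03, 0.5]   # illustrative values only
print(bonferroni_reject(pvals))
print(benjamini_hochberg_reject(pvals))
print(0.05 / 1_000_000)   # Bonferroni threshold for ~1 million tests
```

On this toy input Bonferroni rejects only the smallest p-value while Benjamini-Hochberg rejects four, which illustrates why FDR control is described as the less stringent option for exploratory work.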

Epidemiology and public health: P-values for association tests (chi-squared, logistic regression, Cox regression) have been standard. Epidemiological journals increasingly emphasize effect-size estimates with confidence intervals over p-values; Rothman and colleagues have argued since the 1980s for estimation-focused rather than testing-focused reporting.

Technology and A/B testing: Online experimentation platforms report p-values alongside Bayesian posterior probabilities. Sequential testing with α-spending functions (Pocock-type; O'Brien-Fleming-type) keeps the overall Type I error rate at the nominal α when accumulating data are examined at repeated interim looks. Multi-armed bandits, an alternative framework, sidestep formal p-value thresholds in favor of posterior-based allocation.
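
Why unadjusted peeking is a problem can be seen in a short simulation. The sketch below assumes normally distributed observations with known unit variance, equally sized batches between looks, and an unadjusted critical value of 1.96 at every look; the function name `false_positive_rate` and all parameter values are illustrative assumptions, and it does not implement any α-spending correction.

```python
import random
from math import sqrt

def false_positive_rate(n_looks, batch=100, n_sims=4000, z_crit=1.96, seed=1):
    """Share of null experiments declared significant at ANY of n_looks
    interim analyses, each using an unadjusted two-sided z-test."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for _ in range(n_looks):
            # The sum of `batch` N(0,1) observations is N(0, batch): H0 is true.
            total += rng.gauss(0.0, sqrt(batch))
            n += batch
            if abs(total / sqrt(n)) > z_crit:   # peek, stop on "significance"
                hits += 1
                break
    return hits / n_sims

for looks in (1, 5, 20):
    print(looks, false_positive_rate(looks))
```

With a single look the simulated rate should sit near the nominal 5%; with more looks it climbs well above that, which is the inflation that α-spending designs are meant to remove.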

Quality control: P-values are used for process-capability tests, distributional-assumption tests, and outlier tests. Statistical process control charts implicitly use p-value thresholds (Western Electric rules define out-of-control signals). Acceptance sampling and operating characteristic curves are p-value-adjacent in their formulation.

Machine-learning evaluation: Common uses include paired t-tests or sign tests comparing models on validation folds, McNemar's test for paired classification, and permutation tests on feature importance. P-value use varies across ML subcommunities (more common in medical-ML and computational-biology subfields; less common in pure deep-learning research).

Legal and forensic science: DNA match probabilities are tail probabilities of seeing a matching profile by chance; forensic statistical evidence in general draws on p-value-style reasoning, sometimes with misinterpretation (the prosecutor's fallacy is the transposed-conditional error applied to DNA match evidence). The ENFSI and related standards-bodies have issued guidance on proper Bayesian-framework reporting of forensic evidence.

Clarity¶

The p-value frame makes explicit the specific quantity—P(data at least as extreme | H₀ and model)—and the specific misinterpretations it is vulnerable to. Without the frame, people conflate p-values with posterior probabilities of the null (prosecutor's fallacy)^[1], treat significance thresholds as boundaries between real and not-real effects, ignore effect sizes, fail to account for multiplicity, and selectively report significant findings. With the frame, diagnosis becomes specific: What is the null hypothesis, and what is the probability model under it? Is the model appropriate to the data-generating process, or are key assumptions violated? Is the p-value one-sided or two-sided, and which is appropriate to the scientific question? Has multiplicity been properly adjusted for? What is the effect size and its confidence interval—is the result practically important regardless of statistical significance? Was the analysis pre-specified, or selected after seeing the data? If significant, has the result been replicated? If non-significant, is this evidence of no effect, or just insufficient power? The frame clarifies that the p-value is a continuous evidential summary with known limitations, not a truth-determination device.

Manages Complexity¶

The p-value reduces the multivariate inferential problem to a single summary statistic that can be computed, compared to thresholds, and reported in standard ways. This parsimony is the p-value's strength (comparable results across studies; standardized reporting; meta-analytic combination) and its weakness (a single number obscures effect size, precision, assumptions, and context). Cross-domain transfer is productive: multiplicity-adjustment methods move from genomics to large-scale A/B testing; sequential-testing methods from clinical trials to tech experimentation; meta-analytic combination methods from medicine to psychology to economics; permutation-based p-values from agriculture to neuroscience^[2]. The decomposition reveals interplay with other primes: hypothesis testing (#434), the framework of which the p-value is the evidential summary, tightly paired; type I / type II errors (#445), the error framework against which the p-value is calibrated; statistical power (#437), which determines the p-value's distribution under alternatives; confidence intervals (#436), a parallel or alternative framework with advantages for effect-size communication; effect size (#447), the quantity the p-value does not measure but that must accompany it; multiple comparisons correction (#446), which governs how the p-value threshold scales with the number of tests; reproducibility (#441), since single p-values must be replicated; Bayesian updating (#444), the framework that computes what people often mistakenly read p-values as providing.
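
The meta-analytic combination mentioned above needs a dedicated method rather than naive multiplication or averaging. The following is a hedged sketch of Fisher's and Stouffer's methods, assuming independent studies, p-values strictly between 0 and 1, one-sided p-values in the same direction for Stouffer, and equal weights; the function names are illustrative.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def fisher_combined_p(pvals):
    """Fisher's method: -2 * sum(ln p_i) ~ chi-squared with 2k df under H0."""
    k = len(pvals)
    half_x = -sum(log(p) for p in pvals)    # the statistic divided by 2
    # The chi-squared survival function has a closed form for even df = 2k.
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return exp(-half_x) * total

def stouffer_combined_p(pvals):
    """Stouffer's method on one-sided p-values with equal weights."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1 - p) for p in pvals) / sqrt(len(pvals))
    return 1 - nd.cdf(z)

study_pvals = [0.05, 0.05]   # two hypothetical studies
print(fisher_combined_p(study_pvals), stouffer_combined_p(study_pvals))
```

Two studies that each sit at p = 0.05 combine to a smaller p-value under both methods, but not to 0.05 × 0.05, which is one way to see why the naive product has no standard interpretation.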

Abstract Reasoning¶

The analyst asks: What is the null hypothesis, what is the probability model, and does the model fit the data-generating process? Is the test statistic appropriate, and is its null distribution correctly specified? Is the p-value one-sided or two-sided, and which is right for the scientific question? What is the effect size and confidence interval: is the result practically meaningful, trivially small, or inconclusive about magnitude? Was the analysis pre-specified or selected post-hoc^[8]? How many tests were conducted, and has multiplicity been properly addressed? What threshold should be used, and why: is the conventional 0.05 appropriate to the decision stakes and the prior probability of a real effect? What does a specific p-value mean for evidence: is p = 0.04 really so different from p = 0.06? What is the study's power, and how does that condition the interpretation of a non-significant result? Is this finding one of many tested in the same dataset, or an independent confirmatory test? Has the finding replicated? Mature practice reports p-values as continuous evidential summaries alongside effect sizes and confidence intervals, pre-specifies analyses, adjusts for multiplicity, emphasizes replication, and interprets thresholds as conventions rather than truth-boundaries. Immature practice treats p < 0.05 as a discovery certificate, fails to adjust for multiplicity, selectively reports significant findings, and conflates statistical with practical significance.
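
The question about power and non-significant results can be illustrated with a textbook power calculation. This sketch assumes a two-sided, two-sample z-test with known unit variance and equal group sizes; the function name `z_test_power` and the example effect size are assumptions for illustration.

```python
from statistics import NormalDist

def z_test_power(effect_sd: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Power of a two-sided, two-sample z-test with unit variance in each arm."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = effect_sd * (n_per_group / 2) ** 0.5   # mean of the z statistic under H1
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)

for n in (10, 64, 200):
    print(n, round(z_test_power(0.5, n), 3))
```

When power is low, a non-significant result is what one would expect most of the time even if the effect is real, so it carries little evidence of no effect.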

Knowledge Transfer¶

Domain	Typical p-value threshold	Multiplicity approach	Characteristic pitfall
Clinical trial (confirmatory)	0.05 two-sided / 0.025 one-sided	Pre-specified hierarchy	Post-hoc primary-endpoint switch
Psychology experiment	0.05	Limited, often unadjusted	Garden-of-forking-paths
Particle physics (discovery)	3×10⁻⁷ (5σ)	Look-elsewhere effect	Trial-factor miscalculation
Econometric regression	0.05, stars at 0.10/0.05/0.01	Often none	Stars-culture over-interpretation
Genomics (GWAS)	5×10⁻⁸	Bonferroni (implicit)	Cryptic population structure
Epidemiology	0.05 but CIs emphasized	Rarely formal	Confounding not p-value issue
A/B testing	0.05 or Bayesian	Sequential α-spending	Peeking without adjustment
Quality control	0.0027 (±3σ)	Per-chart rules	Autocorrelated processes
Machine-learning comparison	0.05	Dataset-level issues	Test-set reuse
Legal/forensic	Domain-specific	Case-specific	Prosecutor's fallacy

Across rows: the core logic—tail probability under assumed null, with threshold calibrated to decision stakes—transfers across domains with characteristic pitfalls tied to domain conventions.

Examples¶

Formal/abstract¶

The American Statistical Association's 2016 statement on p-values ("Statistical Significance and P-values," Wasserstein-Lazar, The American Statistician 70:2) represents an unusual instance of a scholarly society issuing explicit guidance about the proper and improper use of a statistical technique. The statement was prompted by long-standing concern within the statistical community about p-value misuse in published research and its contribution to the reproducibility crisis. The statement articulates six principles: (1) P-values can indicate how incompatible the data are with a specified statistical model, but the p-value alone does not establish the truth of a model^[9]. (2) P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. (3) Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. (4) Proper inference requires full reporting and transparency: selective reporting, p-hacking, and garden-of-forking-paths analysis undermine p-value interpretation. (5) A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. (6) By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis. The six principles were accompanied by editorial commentary and by subsequent commentaries from Andrew Gelman, John Ioannidis, Sander Greenland, Deborah Mayo, and others presenting a range of views on p-value use and reform^[9]. The 2016 statement was followed in March 2019 by a more assertive editorial ("Moving to a World Beyond 'p < 0.05'") that specifically urged abandonment of the "statistically significant" dichotomous label and emphasized continuous interpretation of p-values alongside effect sizes and confidence intervals. The special issue accompanying this editorial featured 43 articles proposing various reforms.
Discipline-specific responses have varied: epidemiology and some areas of biomedical research have shifted substantially toward estimation-focused reporting; regulatory contexts (clinical trial approval) have continued to use α = 0.05 as a confirmatory threshold while emphasizing effect sizes; psychology has adopted pre-registration and registered reports as methodological reforms; physics conventions (5σ for discovery) predate and exceed the 0.05 convention and have been largely unaffected; economics has debated but not uniformly abandoned stars-culture reporting. Benjamin et al. 2018 ("Redefine Statistical Significance," Nature Human Behaviour) proposed lowering the discovery threshold to α = 0.005, arguing that this would substantially reduce false-positive rates for new findings while retaining α = 0.05 as a "suggestive" threshold^[10]; the proposal drew extensive debate. The evolving ASA-led discussion illustrates the live-debate character of the p-value construct: widespread use alongside substantial methodological critique; incremental reforms (pre-registration, effect-size reporting, CI emphasis) widely adopted; more-fundamental reforms (threshold change, abandonment of significance labeling, Bayesian alternatives) actively debated; ongoing methodological development.

Mapped back: This case illustrates the structural signature of p-values—a test statistic drawn from a known distribution under the null hypothesis, producing a tail probability—as the centerpiece of formal hypothesis testing in high-level science policy; the core abstraction of "continuous evidential incompatibility with a specified model" appears as the continuous interpretation of the ASA principles moving beyond dichotomous significance labeling.

Applied/industry¶

An e-commerce company's growth team runs continuous A/B tests on landing-page variants, checkout-flow modifications, recommendation-algorithm changes, and pricing-display experiments. The experimentation platform computes p-values from frequentist z-tests (for conversion-rate differences) and t-tests (for revenue-per-visitor differences), with default α = 0.05 two-sided and explicit multiplicity adjustment across the primary metric and five secondary metrics tracked per experiment. The team initially operated under a "ship if p < 0.05" dichotomous decision rule but observed two concerning patterns: (a) Many "shipped" variants—features declared winners based on p < 0.05—failed to replicate when re-tested at larger scale or failed to produce expected revenue gains in post-launch tracking. (b) Product managers routinely "peeked" at experiment results before pre-specified sample size was reached, ending experiments early when p < 0.05 appeared and continuing when it did not—a form of analytical flexibility that inflates actual Type I error rate above nominal α = 0.05. The head of experimentation commissions a methodology review with findings and reforms: (i) Peeking without α-spending: Early stopping on nominal p < 0.05 without α-spending inflated actual false-positive rate from 5% to roughly 25%. Reform: the platform now implements default O'Brien-Fleming-style α-spending so that early-stopping remains nominally α = 0.05 after accounting for peeking^[4]. (ii) Garden-of-forking-paths in metric selection: Product managers frequently chose which metric to designate as primary after seeing results, effectively multiple-testing without adjustment. Reform: primary metric pre-registered and auto-locked before data collection; metric changes require methodology-team review. (iii) Effect-size neglect: The ship/no-ship decision had been driven by p-value alone, ignoring effect magnitude. 
Several "significant" results were effects of 0.1-0.3% conversion-rate improvement—statistically detectable but plausibly below practical importance given feature-development cost. Reform: decision rule now requires both p < 0.05 AND effect-size minimum (point estimate ≥ 1% for conversion-rate; ≥ 0.5% for revenue-per-visitor) with confidence interval lower bound informing confidence in effect magnitude. (iv) Publication/visibility bias within company: "Winning" experiments were written up and celebrated; "losing" experiments were de-emphasized or not documented. Reform: all results are documented in standardized template including effect-size estimate, CI, p-value, pre-specified decision rule, and post-hoc observations flagged as exploratory. (v) Replication expectations: Single-experiment p < 0.05 drove shipping decisions. Reform: "high-stakes" launches (affecting pricing, checkout, homepage) require a replication experiment on independent traffic before full rollout; the replication's pre-specified threshold is again α = 0.05 but with effect-size estimate expected consistent with the original finding. Over 12 months following the reforms, the team observes higher-quality shipping decisions (measured by post-launch metric stability matching experiment predictions), reduced "surprise regressions" from shipped variants, and stronger documentation enabling meta-analytic learning across experiment portfolios.
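
A decision rule of the kind described in reform (iii) might be sketched as follows. Everything here is an assumption for illustration: the function name `ab_decision`, the pooled two-proportion z-test, the unpooled normal-approximation confidence interval, and in particular the reading of the minimum effect as a relative lift (the case description does not say whether its percentages are relative or absolute). It also omits the sequential and multiplicity adjustments discussed in the case.

```python
from math import erfc, sqrt

def ab_decision(conv_a, n_a, conv_b, n_b, alpha=0.05, min_rel_lift=0.01):
    """Two-proportion z-test plus a minimum-effect gate (hypothetical rule)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    p_value = erfc(abs(diff / se_null) / sqrt(2))         # two-sided
    se_diff = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)   # approximate 95% CI
    rel_lift = diff / p_a                                 # assumes p_a > 0
    ship = p_value < alpha and rel_lift >= min_rel_lift
    return {"p_value": p_value, "diff": diff, "ci95": ci,
            "rel_lift": rel_lift, "ship": ship}

# Made-up counts: a sizeable lift, then a tiny lift at very large traffic.
print(ab_decision(1_000, 10_000, 1_100, 10_000))
print(ab_decision(5_000_000, 50_000_000, 5_010_000, 50_000_000))
```

In the second call the p-value falls below 0.05 but the relative lift is below the gate, so the rule declines to ship: statistical detectability and practical importance are checked separately.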

Mapped back: This case exemplifies the structural signature of p-value misuse and reform—the dichotomous "ship/no-ship" decision driven by p < 0.05 threshold, peeking-induced false-positive inflation, garden-of-forking-paths metric selection, effect-size neglect—and the reform pathway restoring the core abstraction: continuous evidential interpretation paired with pre-specified decision rules, effect-size reporting, multiplicity adjustment, and replication expectations that jointly enforce the p-value's role within disciplined hypothesis testing rather than as standalone evidence.

Structural Tensions¶

T1 — Continuous evidential interpretation versus dichotomous threshold decision. Fisher's original conception of the p-value was as a continuous evidential measure—0.04 and 0.06 represent similar evidence against the null despite dichotomous "significant/not significant" distinction. Neyman-Pearson decision theory requires a threshold for accept/reject decision. Contemporary practice often employs the dichotomous threshold as if it were epistemically fundamental, losing continuous information^[11]. Reform movements (ASA 2019; Amrhein-Greenland-McShane 2019) urge abandonment of the "significant" label in favor of continuous reporting; regulatory and confirmatory contexts retain the threshold for decision-discipline purposes. Mature practice reports the p-value continuously with accompanying effect sizes and uses thresholds only when a dichotomous decision must be made with pre-specification.

T2 — Threshold convention versus calibration to decision stakes and prior probability. The α = 0.05 convention is historical (Fisher's suggestion, adopted as default) rather than principled for any specific decision. Particle physics uses 5σ for discovery claims because the prior probability of a new particle at a random mass is low and consequences of false discovery are high. Preliminary clinical research uses α = 0.05 for preliminary-evidence threshold with subsequent replication required for regulatory approval. Reform proposals (Benjamin et al. 2018 α = 0.005; Bayesian-threshold calibration to prior probabilities; decision-theoretic approaches weighing Type I and Type II costs) argue for context-specific thresholds. Mature practice calibrates thresholds to decision stakes, prior probability of real effect, and replication opportunity; immature practice applies α = 0.05 universally^[10].

T3 — P-value as single summary versus multi-dimensional evidential reporting. The p-value is a scalar; the evidence it summarizes is multi-dimensional (effect magnitude, precision, assumption adequacy, multiplicity context, replication history). Reducing to a single p-value is simple and comparable but loses information. Comprehensive reporting (effect estimate, confidence interval, p-value, sample size, pre-registration status, multiplicity adjustment, assumption checks) is informative but longer. The field's direction—toward multi-dimensional reporting—trades simplicity for fidelity. Mature practice reports p-values alongside effect sizes and CIs with full context; immature practice reports p-values alone.

T4 — Frequentist long-run interpretation versus Bayesian posterior interest. The p-value is P(data at least as extreme | H₀)—a frequentist conditional. The quantity most scientists and decision-makers actually want is P(H₀ | data) or P(H₁ | data)—a Bayesian posterior. The p-value does not provide this; obtaining it requires a prior probability for H₀ and Bayesian updating. The persistent misinterpretation of p-values as posterior probabilities is not a failure of education alone—it reflects that the Bayesian quantity is often the decision-relevant one while the frequentist p-value is the more easily computed^[12]. Reform through Bayes-factor reporting, posterior-probability calculation, or prior-probability-informed p-value interpretation (Ioannidis 2005 "Why most published research findings are false" used prior-probability reasoning to show p < 0.05 results are often false under realistic prior probabilities) addresses this tension. Mature practice uses both frameworks as appropriate, with explicit attention to what quantity is being computed and what decision-relevant question it answers; immature practice applies one framework uncritically and misinterprets the other.
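
The prior-probability reasoning referred to above can be written as a short Bayes-rule calculation. This is a simplified sketch: it assumes a two-hypothesis world (the effect is either absent or present at the size the study is powered for), conditions on the event "the test was significant" rather than on the exact p-value, and ignores bias and multiplicity; the function name is an assumption.

```python
def prob_effect_real_given_significant(prior: float, power: float, alpha: float = 0.05) -> float:
    """P(H1 true | test significant) by Bayes' rule in a two-hypothesis setup."""
    true_pos = power * prior          # real effects that reach significance
    false_pos = alpha * (1 - prior)   # null effects that reach significance
    return true_pos / (true_pos + false_pos)

for prior in (0.5, 0.1, 0.01):
    print(prior, round(prob_effect_real_given_significant(prior, power=0.8), 3))
```

As the prior probability of a real effect drops, the share of significant results that reflect real effects drops with it, even though α stays at 0.05 throughout. That gap is the difference between the frequentist conditional and the posterior quantity.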

Structural–Framed Character¶

Statistical Significance sits at the structural end of the structural–framed spectrum: it is a pure relational pattern, the same in any domain where it appears, and nothing about its meaning depends on a particular field's vocabulary or assumptions.

The prime is a defined computation: assume a null model, then ask how probable it would be, under that model, to see a test statistic at least as extreme as the one observed. That tail probability is a mathematical object with no built-in normative weight and no reliance on human institutions; the same calculation applies whether the data come from a physics experiment, a drug trial, or an A/B test of a website. To use it is to read off a quantity already determined by the null model and the data, not to bring in an interpretive stance. On every diagnostic, it reads structural.
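As one concrete instance of this computation, the sketch below evaluates an exact one-sided tail probability under a binomial null. The null model (success rate 0.5), the example counts, and the function name `binomial_p_value` are assumptions chosen for illustration; any other null model and test statistic would follow the same pattern.

```python
from math import comb

def binomial_p_value(k: int, n: int, p0: float = 0.5) -> float:
    """One-sided p-value P(X >= k) for X ~ Binomial(n, p0) under the null."""
    return sum(
        comb(n, i) * p0**i * (1.0 - p0) ** (n - i)
        for i in range(k, n + 1)
    )

# Hypothetical data: 60 successes in 100 trials, null success rate 0.5.
p = binomial_p_value(60, 100)
print(f"one-sided p = {p:.4f}")
```

Nothing in the function refers to what the trials are; the same call would serve for coin flips, conversions in an A/B test, or responders in a trial arm.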

Substrate Independence¶

Statistical Significance (p-Value) is a narrowly substrate-independent prime — composite 2 / 5 on the substrate-independence scale. It is a domain-specific formalization inside frequentist hypothesis testing, and its signature imports statistical machinery deeply — a null distribution, a test statistic, a tail probability — none of which travels without that apparatus. It does appear across A/B testing and experimental science broadly, but that is a single statistical technique being applied in many industries rather than a structural pattern recurring across physical, biological, or social substrates as a unifying logic. The construct stays bound to the frequentist frame it came from.

Composite substrate independence — 2 / 5
Domain breadth — 2 / 5
Structural abstraction — 2 / 5
Transfer evidence — 2 / 5

Relationships to Other Abstractions¶

Current abstraction Statistical Significance (p-Value) Prime

Parents (3) — more general patterns this builds on

Statistical Significance (p-Value) is a kind of Statistical Inference Prime

Statistical significance is a specialization of statistical inference that summarizes sample-data incompatibility with a null via a tail probability.

Statistical Significance (p-Value) presupposes Hypothesis Testing (Null vs. Alternative) Prime

Statistical significance presupposes hypothesis testing because the p-value is read as evidence-against only within a pre-specified null/alternative testing frame.

Statistical Significance (p-Value) presupposes Probability Prime

Statistical Significance presupposes Probability: a p-value is the tail probability of a test statistic under an assumed null model.

Children (2) — more specific cases that build on this

Type M Error Domain-specific is part of Statistical Significance (p-Value)

Type M Error contains a statistical-significance gate that selects the tail estimates whose conditional magnitude is evaluated.

Type S Error Domain-specific is part of Statistical Significance (p-Value)

Type S Error contains a two-sided statistical-significance gate whose opposite-tail admissions can carry the wrong sign.

Hierarchy paths (11) — routes to 5 parentless roots

Statistical Significance (p-Value) → Statistical Inference → Inductive Reasoning


Neighborhood in Abstraction Space¶

Statistical Significance (p-Value) sits in a sparse region of abstraction space (77th percentile for distinctiveness): few abstractions share its structure, so a faithful description tends to retrieve it precisely rather than landing on a neighbor.

Family — Statistical Inference & Uncertainty (15 primes)

Nearest neighbors

Hypothesis Testing (Null vs. Alternative) — 0.75
Statistical Power — 0.74
Statistical Inference — 0.72
Multiple Comparisons Correction — 0.69
Confidence Intervals — 0.68

Computed from structural-signature embeddings · 2026-07-26

Not to Be Confused With¶

Statistical Significance (p-value) must be distinguished from Statistical Power, though both are defined within the frequentist hypothesis-testing framework. Statistical Significance asks a retrospective question: "If the null hypothesis were true, how improbable would data at least as extreme as those observed be?" It is measured by the p-value, the tail probability of observing data as extreme as or more extreme than what was actually seen. Statistical Power asks a prospective question: "Given that a true effect of specified magnitude exists, how likely is the test to detect it?" Power is 1 − β, where β is the Type II error probability, and is determined before the study is conducted to inform design choices. A study can produce a small p-value (statistically significant, evidence against the null) despite low power (the design had little chance of detecting the effect, so a significant result is more likely to be a false positive or an overestimate), or a large p-value (not statistically significant) despite high power (the study was well designed but found no evidence of the specified effect). The confusion arises because both are probabilities attached to the same test, but they condition on different scenarios: significance conditions on H₀ being true and reports how extreme the observed data are; power conditions on a true effect existing and reports the probability of detecting it. A practitioner fixated on achieving p < 0.05 might ignore whether the study was adequately powered to detect the effect of interest, leading to studies that are either underpowered (too small to reliably detect real effects of the size that matters) or overpowered (so large that trivial effects are flagged as statistically significant, wasting resources). The distinction is crucial for study interpretation: a non-significant result could mean either that no effect exists or that the study lacked the power to detect a real effect.
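Because power is fixed by the design rather than by the observed data, it can be computed before any data exist. The sketch below does this for a two-sided, two-sample z-test; the equal group sizes, known unit variance, and the function name `two_sample_z_power` are simplifying assumptions made for illustration.

```python
import math
from statistics import NormalDist

def two_sample_z_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Power of a two-sided, two-sample z-test for a standardized effect d.

    Assumes equal group sizes and known unit variance (a simplification).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # Expected value of the z statistic when the true standardized effect is d.
    shift = d * math.sqrt(n_per_group / 2.0)
    # Probability that the statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

# Illustrative designs: the same assumed effect at different sample sizes.
for n in (16, 64, 256):
    print(f"d = 0.5, n per group = {n:>3}: power = {two_sample_z_power(0.5, n):.3f}")
```

Setting `d` to zero returns `alpha`, which makes the link between the two concepts explicit: the rejection probability under the null is the significance level, and under a specified alternative it is the power.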

Nor is Statistical Significance (p-value) identical to Statistical Inference, the broader epistemic framework for reasoning about populations from samples. Statistical significance, as conventionally applied, is a specific decision rule: a binary verdict (reject H₀ if p < α; do not reject otherwise) based on a tail probability computed under the assumed null hypothesis. Statistical Inference encompasses this hypothesis-testing component but is much broader: it includes parameter estimation with confidence intervals and credible intervals, causal inference methods, prediction with out-of-sample generalization, and model evaluation. A researcher conducting statistical inference might use p-values as one tool, but inference also requires reporting effect-size estimates and their uncertainty bounds, checking whether assumptions are met, considering alternative models, and situating results within broader evidence. An inference is an integrated judgment about what the data support; a p-value is a single probabilistic summary. A study might produce a statistically non-significant result yet yield valuable inferences about effect magnitude (a narrow confidence interval suggesting that the effect, if it exists, is small and practically negligible), patterns in the data, or directions for future research. Conversely, an overpowered study producing a significant p-value may support only a trivial inference (a tiny effect was detected that costs more to measure than it is worth). Conflating the two leads to "p-value chasing" (running studies optimized for p < 0.05) while neglecting the inferential goals (understanding the magnitude, direction, and context of effects). Sound statistical inference uses p-values as one component within a richer framework that includes effect-size estimation and interpretation within domain context.

Statistical Significance (p-value) is also distinct from Effect Size, though they are often conflated. Effect Size is the magnitude of an effect (the difference between groups, the strength of association, the practical meaningfulness of a phenomenon), expressed on a scale that does not depend on sample size (e.g., Cohen's d, correlation coefficient, odds ratio). Statistical Significance (p-value) quantifies the extremeness of observed data under the null hypothesis; it is sensitive to both effect magnitude and sample size. The same effect size can be non-significant in a small sample (low power to detect it) or highly significant in a large sample (sufficient power to detect it even when the effect is practically trivial in magnitude). Tiny effects can therefore appear significant in very large samples (a point null is rarely exactly true, so enough data will eventually detect some departure from it), while large effects can fail to reach significance in small samples. A p-value provides no direct measure of effect size: p = 0.01 tells you the data would be unusual under the null, but not whether the effect is 0.1 standard deviations or 2 standard deviations. This separation is widely overlooked: many people treat p < 0.05 as evidence of "a meaningful effect," conflating statistical significance with practical or clinical significance. The reform pathway emphasizes reporting both p-values and effect sizes with their confidence intervals, allowing readers to judge practical significance independently of statistical significance.
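The dependence on sample size can be seen by holding the observed effect fixed and varying only n. The sketch below assumes a two-sample z-test with equal group sizes and unit variance; the function name `z_test_p_value` and the chosen effect of 0.05 standard deviations are illustrative assumptions.

```python
import math
from statistics import NormalDist

def z_test_p_value(d: float, n_per_group: int) -> float:
    """Two-sided p-value when the observed standardized difference equals d.

    Assumes a two-sample z-test with equal group sizes and unit variance.
    """
    z = d * math.sqrt(n_per_group / 2.0)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# The observed effect stays at 0.05 standard deviations; only n changes.
for n in (20, 200, 2000, 20000):
    print(f"d = 0.05, n per group = {n:>5}: p = {z_test_p_value(0.05, n):.3g}")
```

The effect size is identical in every row, so any change in the p-value reflects sample size alone, which is why a small p-value cannot be read as a large effect.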

Solution Archetypes¶

Solution archetypes in the catalog that build on this prime — directly (this prime is a source ingredient) or as a related prime.

Also a related prime in 4 archetypes

Correlation Structure Characterization: Characterize how variables move together—by sign, strength, form, lag, condition, uncertainty, and stability—then explicitly constrain what that association may be used to claim or decide.
Effect Size Standardization: Convert raw inferred effects into comparable, uncertainty-bounded magnitude expressions so evidence can be judged by size and practical meaning, not only by detectability.
Hypothesis Test Power Calibration: Design a hypothesis test around the effect that would actually matter, then tune sample size, noise control, allocation, and error rates so the test has adequate power to detect it.
Network Motif and Pattern Discovery: Discover functionally meaningful recurring local graph structures by comparing observed subgraphs to suitable baselines.

Notes¶

Experimental-design/statistics origin (Fisher 1925 canonical for the p-value as continuous measure; Neyman-Pearson 1933 for the decision-theoretic significance level). The contested_construct flag is strongly warranted—p-value use and interpretation has been subject to sustained methodological debate since Fisher-Neyman, culminating in ASA 2016 and 2019 statements; Benjamin et al. 2018 threshold-redefinition proposal; Amrhein-Greenland-McShane 2019 "Retire statistical significance"; ongoing Bayesian-frequentist framework debates. The tight_pair_with_hypothesis_testing_null_vs_alternative flag reflects that the p-value is the canonical evidential summary within the NHST framework—the two primes are tightly interdependent; reciprocal flag is already wired into #434. Related primes: #434 hypothesis_testing_null_vs_alternative (tight pair), #445 type_i_type_ii_errors (α corresponds to p-value threshold), #437 statistical_power (determines p-value distribution under alternatives), #436 confidence_intervals (parallel framework, often preferred), #447 effect_size (what the p-value does not measure), #446 multiple_comparisons_correction (how thresholds scale with many tests), #441 reproducibility_replicability (replication as evidential standard beyond single p-values), #444 bayesian_updating (alternative framework computing posterior probabilities), #432 randomization (permutation-based p-values). Strong transfer targets: clinical-trial regulatory work, replication-reform in psychology and biomedicine, online A/B testing with sequential-testing corrections, GWAS and genomics multiplicity adjustment, epidemiological association studies, econometric coefficient testing, particle-physics discovery-threshold conventions. 
Pass B should develop archetypes for p-value-plus-effect-size-plus-CI integrated reporting, pre-registration and registered reports, multiplicity adjustment (Bonferroni, Holm, FDR), sequential testing with α-spending, Bayesian-frequentist hybrid reporting, threshold calibration to decision stakes (α = 0.005, domain-specific), assumption-check workflows for parametric p-values, and replication-based evidence aggregation beyond single-study p-values.

References¶

[1] Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. Authoritative critique of statistical practice: exposes how implicit distributional assumptions and convenience-driven model choices generate misinterpretations of significance and uncertainty. ↩

[2] Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver & Boyd. Introduces the p-value as a continuous measure of evidence against a null hypothesis and suggests 0.05 as a convenient benchmark for significance. ↩

[3] Lehmann, E. L. (1993). The Fisher, Neyman-Pearson theories of testing hypotheses: One theory or two? Journal of the American Statistical Association, 88(424), 1242–1249. Canonical historical treatment of the philosophical and methodological differences between the Fisher and Neyman-Pearson approaches. ↩

[4] Cox, D. R. (1958). Planning of Experiments. John Wiley & Sons. Canonical exposition of how active intervention—assigning units to treatments and pre-specifying measurement—isolates causal effects from confounding across scientific domains. ↩

[5] Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society A, 231(694–706), 289–337. Foundational paper: frames inferential conclusions as tentative decisions with controlled long-run error rates, subject to revision as new data accumulate. ↩

[6] Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. Coordinated replication of 100 published psychology experiments: reproduced significant effects in only 36% of cases despite nominal transparency of original methods, dramatizing that disclosed information without shared data, code, and pre-registration is insufficient to support substantive scrutiny. ↩

[7] Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates. Foundational text on power analysis: links sample size, effect size, significance threshold, and noise level into a coherent design discipline, the practical instantiation of "set decision thresholds appropriate to the noise level" for empirical research. ↩

[8] Wilkinson, L., & American Psychological Association Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: guidelines and explanations. American Psychologist, 54(8), 594–604. APA task force guidelines on statistical methods, covering effect-size reporting, confidence intervals, and the use of significance testing. ↩

[9] Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: context, process, and purpose. The American Statistician, 70(2), 129–133. https://doi.org/10.1080/00031305.2016.1154108. ASA p-value statement clarifying replication implications of significance testing. ↩

[10] Benjamin, D. J., et al. (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6–10. Proposes lowering the default threshold for claims of new discoveries to α = 0.005 to address power and prior-probability issues. ↩

[11] Cumming, G. (2014). The new statistics: why and how. Psychological Science, 25(1), 7–29. Sets out the "new statistics" reporting discipline: effect sizes and confidence intervals, that is, point estimates plus their uncertainty. ↩

[12] Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124. Foundational analysis of how publication bias, low statistical power, and flexible analytic choices produce a literature in which most positive findings fail to replicate—motivating epistemic humility about scientific claims. ↩

[13] Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine Series 5, 50(302), 157–175. Introduces the chi-squared test, the foundational hypothesis test for goodness of fit.

[14] Student. (1908). The probable error of a mean. Biometrika, 6(1), 1–25. Gosset's derivation of the t-distribution, foundational for small-sample error-rate control in hypothesis testing.

[15] Ziliak, S. T., & McCloskey, D. N. (2008). The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. University of Michigan Press. Ziliak-McCloskey critique of "stars culture" and conventional p-value over-reliance in economics.