Multiple Comparisons Correction

Prime #
446
Origin domain
Statistics & Experimental Design
Aliases
Multiplicity Adjustment, Family Wise Error Rate Control, False Discovery Rate, Bonferroni Correction, Benjamini Hochberg
Related primes
Hypothesis Testing (Null vs. Alternative), Statistical Significance (p-Value), Type I & Type II Errors, Statistical Power, Reproducibility & Replicability, Selection Bias

Core Idea

(1) When many hypothesis tests are conducted in the same study, the per-test false-positive rate (e.g., α=0.05) does not bound the study-level false-positive rate: a study performing 100 independent tests each at α=0.05 has roughly a 99.4% probability of yielding at least one false-positive result even if all nulls are true. (2) Multiple comparisons correction is the family of statistical techniques that adjust either the per-test significance thresholds or the p-values themselves to control some specified error rate at the family-wise level — the family-wise error rate (FWER, probability of any false rejection), the false discovery rate (FDR, expected proportion of false discoveries among rejections), or alternative criteria like per-family error rate or false coverage rate. (3) The two dominant traditions — Bonferroni-type FWER control (strict, conservative) and Benjamini-Hochberg FDR control (less conservative, accepts a controlled fraction of false discoveries among positives) — embody different answers to the question of how aggressively to penalize multiplicity. (4) The deeper abstraction is that conducting many tests inflates the probability of finding spurious patterns by chance; principled inference under multiplicity requires explicitly choosing what error rate to control and at what level, a choice tied to the decision context and downstream cost of false discoveries.

How would you explain it like I'm…

Lots-of-Tests Fairness Rule

If you flip a coin one time, getting heads doesn't surprise you. But if you flip it a hundred times, a big lucky streak is almost guaranteed somewhere. Scientists who run many tests at once need to be extra strict, or random luck will look like a real discovery.

Lucky Result Correction

Scientists set a rule that a result counts as 'real' if it would happen by random chance less than five percent of the time. That rule is fine for one test. But if you run a hundred tests, you'd expect about five lucky-looking results even when nothing is going on. Multiple comparisons correction is the set of math tricks for tightening that rule when you run lots of tests, so you don't fool yourself with random noise.

Multiple Testing Correction

When researchers run many statistical tests in the same study, the chance of at least one false positive shoots up: a hundred independent tests at the usual 5 percent threshold will produce a false alarm more than 99 percent of the time, even if nothing is really going on. Multiple comparisons correction is the family of methods that adjusts either the per-test thresholds or the p-values themselves to control error at the level of the whole family of tests — for example, Bonferroni correction (strict, controls the chance of any false positive) or Benjamini–Hochberg (less strict, controls the expected fraction of false positives among reported findings).

 

When many hypothesis tests are run in a single study, the per-test false-positive rate (typically alpha = 0.05) does not bound the study-level false-positive rate. A study performing 100 independent tests, each at alpha = 0.05, has roughly a 99.4% chance of producing at least one false positive even if every null hypothesis is true. Multiple comparisons correction is the family of techniques that adjust per-test thresholds or p-values to control a chosen error rate at the family level: the family-wise error rate (FWER, the probability of any false rejection), the false discovery rate (FDR, the expected proportion of false discoveries among rejections), or alternatives such as per-family error rate or false coverage rate. The two dominant traditions, Bonferroni-style FWER control (strict) and Benjamini-Hochberg FDR control (less conservative), encode different answers to how aggressively multiplicity should be penalized given the downstream cost of false discoveries.
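The 99.4% figure follows from simple arithmetic, sketched below under the stated assumptions that the tests are independent and every null hypothesis is true. The function names are illustrative, not from any library.

```python
def fwer_independent(m: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across m independent tests, all nulls true."""
    return 1.0 - (1.0 - alpha) ** m


def expected_false_positives(m: int, alpha: float = 0.05) -> float:
    """Expected number of false positives across m tests, all nulls true."""
    return m * alpha


# Family-wise error grows quickly with the number of uncorrected tests.
for m in (1, 10, 20, 100):
    print(m, round(fwer_independent(m), 3), expected_false_positives(m))
```

With correlated tests the family-wise rate is lower than this formula gives, but it still sits above the per-test rate unless the tests are perfectly dependent.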

Structural Signature

Multiple comparisons correction exhibits six core structural elements: (a) a defined family of tests, typically pre-specified comparisons within a study, though "family" boundaries are contested and consequential for correction severity; (b) the joint distribution of test statistics or p-values across the family; (c) an error criterion to control: family-wise error rate (FWER, probability of any false rejection), false discovery rate (FDR, expected proportion of false discoveries among rejections), per-family error rate, or other; (d) the chosen error level, typically α=0.05 per family or q=0.05 for FDR; (e) a correction procedure yielding adjusted p-values or adjusted thresholds (FWER methods: Bonferroni, Holm, Hochberg, Sidak, Tukey, Scheffé; FDR methods: Benjamini-Hochberg, Benjamini-Yekutieli, Storey q-value; resampling: max-T, cluster-based, randomization); (f) the dependence structure of tests (independent, positively correlated, arbitrary dependence), which affects correction conservativeness. The universal trade-off: stricter correction (Bonferroni) reduces Type I error inflation but raises Type II error, reducing power to detect genuine effects; looser correction (FDR) accepts a controlled proportion of false discoveries in exchange for higher power. The distinguishing commitment is that inferential conclusions must account for the full testing context and family structure, not just a single test's p-value in isolation[1].
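Element (e) can be made concrete with three of the FWER procedures named above. The following is a minimal sketch of adjusted p-values, not a reference implementation; the function names and the example p-values are made up for illustration. Bonferroni and Holm are valid under any dependence, while the Sidak formula assumes independent tests.

```python
from typing import List, Sequence


def bonferroni(pvals: Sequence[float]) -> List[float]:
    """Multiply each p-value by the family size, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def sidak(pvals: Sequence[float]) -> List[float]:
    """Sidak adjustment; assumes the tests are independent."""
    m = len(pvals)
    return [1.0 - (1.0 - p) ** m for p in pvals]


def holm(pvals: Sequence[float]) -> List[float]:
    """Holm step-down adjustment; never larger than Bonferroni."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        candidate = min(1.0, (m - rank) * pvals[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted


example = [0.01, 0.04, 0.03, 0.005]  # hypothetical p-values
print(bonferroni(example))
print(holm(example))
print(sidak(example))
```

A hypothesis is rejected at family-wise level α when its adjusted p-value is at most α, so the three lists can be compared directly against the same cutoff.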

What It Is Not

  • Not a way to rescue a poorly-designed study — multiplicity correction handles the specific problem of inflated false-positive rates from multiple testing but cannot fix selection bias, confounding, or underpowered designs.
  • Not always necessary — a single pre-specified primary hypothesis needs no correction; corrections apply when several tests are conducted and the analyst wants to maintain family-wise or false-discovery-rate control.
  • Not the same as adjusting for confounders in a regression model — confounder adjustment addresses causal identification; multiplicity correction addresses false-positive rate control.
  • Not a replacement for pre-registration — pre-specifying which tests will be conducted often obviates the need for aggressive multiplicity adjustment by limiting the family of tests.
  • Not consistent across approaches — Bonferroni, Holm, Hochberg, and Benjamini-Hochberg can yield different decisions on the same data, each reflecting different assumptions about dependence structure and error-rate definitions.
  • Not always correct to use the most conservative method — Bonferroni is valid under any dependence structure but often sacrifices too much power, particularly when the tests are numerous or strongly correlated.
  • Not equivalent to Bayesian multiplicity handling — Bayesian hierarchical models "shrink" estimates toward a prior distribution, implicitly handling multiplicity through the prior rather than through p-value adjustment.
  • Not limited to explicit hypothesis tests — any data-dependent decision procedure (feature selection, cutpoint choice, model selection) that conditions on the observed data creates analogous multiplicity concerns.
  • Not free of definitional controversy — what constitutes "a family" (single study, single paper, single research career, single field) is judgment-dependent and consequential for the correction applied.
  • Not the only way to address the garden-of-forking-paths problem — pre-registration, hold-out validation, and replication also address the underlying concern.

Broad Use

Multiple comparisons correction is foundational across domains where large numbers of tests are conducted simultaneously or sequentially, from hypothesis-testing frameworks (genomics, clinical trials) to exploratory data analysis (subgroup analysis, feature selection). The core challenge is universal: whenever more than one test is conducted without adjustment, the family-wise false-positive rate generally exceeds the per-test rate. The solutions (Bonferroni-type FWER control, Benjamini-Hochberg FDR control, and alternatives) transfer across domains, with domain-specific conventions reflecting the consequences of false positives and the typical number of tests conducted. In low-dimensional confirmatory contexts (clinical trials with pre-specified primary endpoints), stringent FWER control preserves evidence integrity; in high-dimensional discovery contexts (genomics, neuroimaging), FDR control balances discovery power against false-positive burden. The field has evolved from viewing multiplicity as a nuisance to be suppressed toward viewing it as a design parameter to be calibrated to decision stakes.

Genomics and transcriptomics illustrate the power-discovery trade-off when moving from FWER to FDR control. Differential-expression analyses routinely test thousands or tens of thousands of genes simultaneously; without correction, a 20,000-gene study at α=0.05 per gene would expect 1,000 false-positive findings by chance alone. The field adopted Benjamini-Hochberg FDR control (typically q<0.05 or q<0.10) as standard, allowing ~5–10% of reported gene discoveries to be false in exchange for substantially higher power to detect genuine effects. This pragmatic reframing—accepting a controlled false-discovery fraction among reported findings rather than strictly prohibiting any false positives—enabled the discovery-driven genomics era. Genome-wide association studies (GWAS) take the opposite approach: testing ~1 million SNPs requires the ultra-stringent threshold p<5×10⁻⁸ (Bonferroni correction) to maintain FWER, reflecting the high cost of false-positive genetic associations that may consume years of follow-up research.

Clinical trials, regulatory approval, and sequential testing require stringent multiplicity control to protect evidence integrity for high-stakes decisions. FDA and EMA guidance mandate pre-specification of family structure and multiplicity adjustment strategy in protocols; post-hoc correction is not acceptable. Hierarchical testing procedures (gatekeeping, alpha-spending functions like Pocock or O'Brien-Fleming) control family-wise error across multiple primary endpoints, doses, or interim analyses. Sequential or adaptive designs—testing at interim milestones (50%, 75%, 100% of planned sample)—create multiple testing opportunities that inflate Type I error unless corrections like alpha-spending maintain control across all looks. The distinction between confirmatory pre-specified tests (stringent multiplicity control) and exploratory secondary tests (weaker or no correction) structures the evidence hierarchy in trial reports.
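One of the simplest hierarchical procedures is fixed-sequence testing: endpoints are tested in a pre-specified order, each at the full α, and testing stops at the first non-rejection. The sketch below illustrates that logic only; the endpoint names and p-values are hypothetical, and real trial gatekeeping strategies are usually more elaborate.

```python
from typing import List, Sequence, Tuple


def fixed_sequence_test(
    ordered: Sequence[Tuple[str, float]], alpha: float = 0.05
) -> List[str]:
    """Return the hypotheses rejected by fixed-sequence testing.

    `ordered` holds (name, p_value) pairs in the order fixed in the protocol.
    Each hypothesis is tested at the full alpha, but only if every earlier
    hypothesis in the sequence was rejected.
    """
    rejected = []
    for name, p_value in ordered:
        if p_value > alpha:
            break  # the gate closes: later endpoints cannot be claimed
        rejected.append(name)
    return rejected


# Hypothetical endpoints in protocol order.
endpoints = [
    ("primary", 0.012),
    ("key_secondary", 0.030),
    ("secondary_2", 0.210),
    ("secondary_3", 0.001),
]
print(fixed_sequence_test(endpoints))
```

In this example the last endpoint has a very small p-value but cannot be claimed, because the endpoint before it failed. That is the price paid for testing every earlier endpoint at the full α without splitting it.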

Psychology, social science, and the replication crisis have elevated multiplicity awareness as a core methodological concern. The 2010s replication failures were partly attributed to endemic p-hacking and researcher degrees of freedom (multiple tests conducted but not all reported). Contemporary norms now enforce pre-registration of hypotheses and explicit disclosure of all conducted analyses, with separate interpretation tiers: primary confirmatory tests (few in number, no correction needed), secondary pre-specified tests (modest correction), and exploratory tests (stronger correction or hypothesis-generating flagging). This tiered approach separates confirmatory inference from hypothesis generation, reducing incentives to selectively report results.

Machine learning, feature selection, and exploratory analysis create implicit multiplicity through model search, hyperparameter tuning, and variable selection. Searching over thousands of candidate features or millions of hyperparameter combinations biases selection toward extreme parameter values under the null, inflating effect sizes and enabling overfitting. Cross-validation and held-out test sets address this by separating search-phase data from evaluation-phase data, with test-phase results treated as confirmatory. Information criteria (AIC, BIC) and regularization methods (LASSO, ridge regression) implicitly handle multiplicity through penalty functions that trade goodness-of-fit against model complexity.

Clarity

Multiple comparisons correction makes the family-wise consequences of multiplicity explicit and transparent. A single reported "p<0.05" means radically different things depending on its source: (a) a pre-specified primary hypothesis test in a confirmatory study with one or two planned contrasts (genuine ~5% false-positive probability), (b) one of 20 tested hypotheses, some examined post-hoc, all at α=0.05 without correction (roughly a 64% chance of at least one false positive across the family if the tests are independent and every null is true), or (c) one of many subgroup analyses conducted after seeing data, selected for reporting based on visual inspection of results (essentially uninterpretable; the reported p-value does not account for the selection). Requiring explicit correction forces disclosure: How many tests were conducted? Were they pre-specified or post-hoc? What error rate was controlled? Modern reporting norms such as CONSORT (trials), STROBE (observational), and pre-registration (psychology) mandate the distinction between primary confirmatory tests (pre-specified, few in number, deserving of single-test α) and secondary or exploratory tests (post-hoc, numerous, requiring correction or explicit hypothesis-generating flagging)[2]. This framing shifts the conversation from "did you get a p<0.05?" to "what is this study's actual false-positive rate given the full testing context and the number of tests conducted?"

Manages Complexity

Multiple comparisons correction structures the complexity of large-scale testing into tractable frameworks with known error-rate properties. A genome-wide association study with 1 million SNP tests conducted at α=0.05 per test (uncorrected) would expect approximately 50,000 false-positive SNPs even if no SNP truly affected the outcome (global null). Bonferroni correction sets α=5×10⁻⁸ per SNP to maintain family-wise error rate (FWER) at 0.05 across the million tests, reducing the expected false-positive SNP count to 0.05. For genomics, the transition from Bonferroni FWER control to Benjamini-Hochberg FDR control was a major practical gain: FDR permits a small fraction of false discoveries among the identified genes (q=0.10 means that, in expectation, at most 10% of reported hits are false), in exchange for substantially higher power to detect true associations[1]. This trade-off enabled the discovery-phase genomics that has characterized the field for 20 years. Complexity management extends to dependence structure: neighboring SNPs are in linkage disequilibrium (correlated), neighboring fMRI voxels are spatially correlated, and genomic regions have hierarchical structure. Permutation-based and resampling-based corrections exploit this dependence to produce less conservative thresholds than Bonferroni, which remains valid under any dependence but does not take advantage of it. Hierarchical Bayesian alternatives bypass the correction framework entirely: rather than testing each parameter independently with correction, fit a joint model that shrinks estimates toward a pooled prior, implicitly handling multiplicity through partial pooling[3].
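The partial-pooling idea can be illustrated with a simple empirical-Bayes shrinkage rule. This is only a rough stand-in for a fitted hierarchical Bayesian model: it assumes a normal-normal model with a known, common sampling variance, and estimates the between-unit variance by the method of moments. The function name and the example numbers are hypothetical.

```python
from statistics import mean, variance
from typing import List, Sequence


def shrink_toward_mean(estimates: Sequence[float], sampling_var: float) -> List[float]:
    """Pull each noisy estimate toward the grand mean (partial pooling).

    Assumes every estimate has the same known sampling variance.
    """
    if sampling_var <= 0:
        raise ValueError("sampling_var must be positive")
    grand = mean(estimates)
    # Method-of-moments estimate of the between-unit variance, floored at zero.
    tau2 = max(0.0, variance(estimates) - sampling_var)
    # Weight on the raw estimate: 1 means no pooling, 0 means complete pooling.
    weight = tau2 / (tau2 + sampling_var)
    return [grand + weight * (y - grand) for y in estimates]


raw = [0.1, -0.2, 0.0, 2.5, 0.3]  # hypothetical effect estimates
print(shrink_toward_mean(raw, sampling_var=0.5))
```

The most extreme raw estimate is moved the furthest in absolute terms, which is how shrinkage tempers the apparent standout among many noisy comparisons without adjusting any p-value.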

Abstract Reasoning

Multiple comparisons correction illuminates a deep inference principle: the evidential meaning of a p-value depends on the context in which it was obtained. A p=0.03 result is strong evidence if it came from a single pre-specified hypothesis test designed before observing data; it is weak evidence if it came from one of 20 tested hypotheses, some examined post-hoc; it is nearly meaningless if it emerged from thousands of candidate tests without correction. This principle generalizes across statistics and beyond: the impressive-looking pattern in a data-dredged analysis, the "surprising" correlation discovered after testing hundreds of variable pairs, the "winning" strategy from a backtest that evaluated thousands of trading strategies — all share the structure that the observed extremity is partially an artifact of selection from a large reference class. Selection from extremes inflates effect sizes, contracts confidence intervals, and makes p-values uninterpretable. Recognizing multiplicity means recognizing that the statistical and inferential framework must condition on the full testing and selection process, not just on the single test whose result is being reported. This principle is the heart of the "garden of forking paths" problem (Gelman & Loken 2013) in Bayesian framing and the "p-hacking" and selective reporting concerns central to the 2010s replication crisis in frequentist contexts: the conducted-but-unreported tests and analyses matter as much as the reported ones, because they create the selection process that biases results[2].
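The selection effect described above can be seen in a small simulation, sketched here under the assumption that every candidate is pure noise (independent standard-normal z-scores). The function name, seed, and run counts are arbitrary illustrative choices.

```python
import random


def best_of_null(m: int, rng: random.Random) -> float:
    """Largest of m z-scores drawn from N(0, 1): the 'winner' when nothing is real."""
    return max(rng.gauss(0.0, 1.0) for _ in range(m))


rng = random.Random(0)  # fixed seed so the run is repeatable
CUTOFF = 1.645          # one-sided 5% critical value for a single z-test
for m in (1, 20, 1000):
    winners = [best_of_null(m, rng) for _ in range(200)]
    share = sum(z > CUTOFF for z in winners) / len(winners)
    print(f"m={m}: share of runs whose best candidate looks significant = {share:.2f}")
```

The single-test cutoff is calibrated for one draw; applied to the best of many draws, it is crossed in most runs even though no candidate has any real effect.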

Knowledge Transfer

| Domain | Typical family size | Correction approach | FWER vs FDR choice |
|---|---|---|---|
| GWAS (genomics) | ~1M SNPs | Bonferroni α=5×10⁻⁸ | FWER |
| Transcriptomics | 20K+ genes | Benjamini-Hochberg | FDR (q<0.05 or 0.10) |
| Neuroimaging (voxel) | 100K+ voxels | Cluster-based permutation or FDR | Mixed; field convention |
| Clinical trial (multiple endpoints) | 3-10 endpoints | Hierarchical/gatekeeping or Bonferroni | FWER |
| Psychology study | 5-50 analyses | Bonferroni or Holm; often uncorrected | FWER where specified |
| A/B testing (interim looks) | Few stops | Alpha-spending (Pocock, O'Brien-Fleming) | FWER |
| Marketing subgroup analysis | 10-100 subgroups | Usually uncorrected (exploratory) | Varies |
| Machine learning feature selection | 100-10000 features | Cross-validation; held-out test set | Implicit via validation |
| Quality control SPC | Many control charts | Type I rates per chart, discussed as "false alarms" | Process-specific |
| Epidemiological subgroup analysis | 5-30 subgroups | Interaction test; then selective correction | Context-dependent |

Examples

Formal/abstract

Yoav Benjamini and Yosef Hochberg's 1995 paper "Controlling the False Discovery Rate" (Journal of the Royal Statistical Society: Series B) introduced an alternative to family-wise error rate control that became foundational to genomics and many other high-dimensional fields. The problem it addressed became acute with the increasingly common practice of conducting thousands of tests (in microarray gene-expression studies, for example), where Bonferroni correction was too conservative: it preserved FWER at α=0.05 but at the cost of power so low that few real effects could be detected. Benjamini and Hochberg proposed a different error criterion: rather than controlling the probability of any false rejection, control the expected proportion of false discoveries among the rejections. Their step-up procedure sorts the p-values in ascending order, p(1) ≤ … ≤ p(m), finds the largest k such that p(k) ≤ (k/m)·q, and rejects the hypotheses corresponding to p(1), …, p(k), where m is the number of tests and q is the target FDR. The procedure controls FDR at q under independence, and was later shown to do so under "positive regression dependence" of the test statistics.
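The Benjamini-Hochberg step-up rule can be sketched in a few lines. This is an illustrative implementation written for this entry, not code from the paper; the function names and example p-values are made up. Because the rule is "step-up", a hypothesis can be rejected even when its own p-value misses its own threshold, provided a larger p-value meets its threshold.

```python
from typing import List, Sequence


def benjamini_hochberg(pvals: Sequence[float], q: float = 0.05) -> List[bool]:
    """Return a reject/keep flag per hypothesis under the BH step-up rule."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest 1-based rank with p_(rank) <= (rank / m) * q; 0 means none
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k = rank
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected


def bh_adjusted(pvals: Sequence[float]) -> List[float]:
    """BH-adjusted p-values; reject at level q when the adjusted value is <= q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


example = [0.02, 0.03, 0.035, 0.2]  # hypothetical p-values
print(benjamini_hochberg(example, q=0.05))
print(bh_adjusted(example))
```

In the example, the two smallest p-values exceed their own thresholds (0.0125 and 0.025), but the third meets its threshold (0.0375), so all three are rejected.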

The impact on genomics was revolutionary. A 2003 microarray analysis comparing tumor tissue to normal tissue for ~10,000 genes might find, at Bonferroni α=0.05/10000=5×10⁻⁶, only 20 differentially-expressed genes — too few to characterize biological pathways. At FDR q=0.05, the same analysis might identify 400 differentially-expressed genes with an expected 20 false positives among them, a scientifically much richer finding. The BH procedure and its extensions (Storey's q-value, Benjamini-Yekutieli for arbitrary dependence) became standard tools in bioinformatics pipelines. Google Scholar shows the 1995 paper has over 80,000 citations as of 2025, making it among the most-cited statistical methodology papers in history.

The FDR framework's success is partly methodological and partly philosophical: it reframed multiple testing not as a threat to be suppressed but as a design feature to be calibrated. Genomic discoveries explicitly accept a controlled fraction of false leads in exchange for the power to generate hypotheses from massively parallel data. The framework has spread beyond genomics to neuroimaging (voxel-wise FDR), ecology (species-wise tests), and even some social science applications, though FWER control remains the standard for regulatory clinical trials where any false positive has severe consequences.

Mapped back: This case exemplifies the structural signature element of the false discovery rate (FDR) and the family-wise versus per-comparison error budget distinction — Benjamini-Hochberg's procedure shifts the genomics-discovery framework from family-wise FWER (Bonferroni) to FDR (proportion of false discoveries among called significant), trading individual-test stringency for power; the genomic revolution rested on accepting expected-FDR control as the appropriate error criterion for high-throughput parallel testing.

Applied/industry

A national specialty retail chain with ~3,800 stores operated a customer-segmentation analytics program that ran weekly comparisons of purchase behavior across 24 customer segments (defined by demographics, loyalty tier, and recent shopping cadence) against chain-wide baselines. The analytics team had built a dashboard flagging "segments trending up/down" on ~12 key metrics (basket size, visit frequency, category mix, etc.), with "trending" defined as a week-over-week change outside a ±2σ band, loosely equivalent to p<0.05 per segment-metric combination. Merchandise and marketing teams used this dashboard to target promotions and trend interventions.

After 18 months of operation, the analytics team conducted a retrospective audit. They found that in any given week, roughly 15–20 segment-metric combinations were being flagged as "trending" — and out of 24 segments × 12 metrics = 288 tests per week at a nominal 5% false-positive rate, the expected number of false flags was 288 × 0.05 = 14.4, closely matching the observed pattern. In other words, most of the weekly flags were likely statistical noise, and the marketing team's reactive responses to these flags were amplifying randomness into business decisions. The team rebuilt the dashboard with explicit multiplicity awareness. Two changes: first, they introduced FDR control using a per-week Benjamini-Hochberg procedure, targeting q=0.10 (accepting that ~10% of flags would be false positives among those flagged, rather than 5% per test). This reduced the typical weekly flag count from 15–20 to 3–6, with those flags being far more likely to represent genuine shifts. Second, they distinguished "confirmatory" flags (pre-specified high-priority segment-metric combinations: 32 pairs they cared about most) from "exploratory" flags (the remaining 256), applying separate FDR procedures to each family with the confirmatory family receiving a tighter q=0.05 threshold.
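The audit arithmetic can be checked directly. The sketch below reproduces the expected false-flag count and adds a binomial standard deviation as a rough yardstick; the standard deviation assumes the 288 comparisons are independent and all nulls are true, which is an assumption made here for illustration rather than something stated in the case. The function name is hypothetical.

```python
from math import sqrt
from typing import Tuple


def weekly_false_flag_summary(
    segments: int, metrics: int, alpha: float
) -> Tuple[int, float, float]:
    """Tests per week, expected false flags, and their binomial standard deviation."""
    tests = segments * metrics
    expected = tests * alpha
    sd = sqrt(tests * alpha * (1.0 - alpha))  # assumes independent tests
    return tests, expected, sd


tests, expected, sd = weekly_false_flag_summary(segments=24, metrics=12, alpha=0.05)
print(f"{tests} tests/week, {expected:.1f} expected false flags, sd about {sd:.1f}")
```

Under these assumptions, the observed 15 to 20 weekly flags fall within about one and a half standard deviations of the count expected from noise alone.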

The operational impact was measurable. Marketing interventions triggered by dashboard flags had previously shown an uplift ROI of 0.6x (i.e., loss-making on average), consistent with responding to noise. Post-correction, flag-triggered interventions showed an uplift ROI of 2.1x, consistent with responding to genuine shifts. The team also introduced a "flag was confirmed by next-week data" metric and found that pre-correction flags were confirmed by the following week's data only 34% of the time (close to chance given regression-to-the-mean), while post-correction flags were confirmed 71% of the time — further evidence that the corrections had separated signal from noise. The analytics team presented the case study at an internal data-science summit under the title "The Emperor's New Dashboard" as a cautionary tale: impressive-looking weekly pattern-detection had been mostly noise amplification before multiplicity-aware reporting was introduced.

Mapped back: This case exemplifies the structural signature element of "a defined family of tests" (24 segments × 12 metrics = 288 per-week comparisons) and the core abstraction that "many tests inflate false-positive probability"; the correction procedure (Benjamini-Hochberg FDR) enforces the inference principle that evidential meaning depends on full testing context, converting noise-amplifying dashboards into signal-separating decision tools through explicit family-level error control.

Structural Tensions

T1 — Error-rate control versus statistical power. Stricter correction (Bonferroni, tight FWER) lowers the per-test significance threshold, raising Type II error and reducing power to detect genuine effects. Looser correction (FDR with q=0.10, or no correction) increases power but allows more false positives. Genomics resolved this trade-off pragmatically by moving from FWER to FDR: accepting that up to about 10% of reported gene findings are expected to be false positives in exchange for the power to detect real effects in high-dimensional data. Clinical trials and drug approval maintain strict FWER because a single false-positive claim can lead to ineffective drugs entering the market. The tension is fundamental and context-dependent: the "right" balance depends on the cost structure of false positives versus false negatives in the specific application. No universal correction is "correct" independent of these costs.
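The power cost of a stricter threshold can be quantified for a simple case. The sketch below assumes a two-sided z-test whose statistic is normal with unit variance and mean `effect_z` under the alternative; the effect size of 3.0 and the family sizes are arbitrary illustrative values.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def two_sided_power(effect_z: float, alpha: float) -> float:
    """Power of a two-sided z-test when the statistic is N(effect_z, 1)."""
    crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return _STD_NORMAL.cdf(effect_z - crit) + _STD_NORMAL.cdf(-effect_z - crit)


# Same true effect, tested at a Bonferroni threshold for families of growing size.
for m in (1, 10, 100, 10_000):
    print(m, round(two_sided_power(3.0, 0.05 / m), 3))
```

Power for the same effect falls steadily as the family grows, which is why large families push fields toward either much larger samples or a less strict error criterion such as FDR.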

T2 — Family definition as researcher judgment call. The boundary of "the family" of tests is a researcher choice with no uniquely correct answer. Is the family a single paper (10 tests)? A research program (100 tests across 5 papers)? A field's entire literature (10,000 tests over decades)? The larger the family boundary, the more aggressive the correction required and the lower the resulting power. Most applied practice adopts "within-study" or "within-paper" family definitions for convenience, but the replication crisis revealed that within-study families often exclude substantial testing that occurred during exploratory analysis phases. The tension is unavoidable: narrow families preserve power but risk under-correcting for the multiplicity that actually occurred; broad families preserve inferential validity but sacrifice power. No clean resolution exists.

T3 — Classical frequentist correction versus Bayesian multiplicity handling versus pre-registration. The classical frequentist approach is to conduct many tests and adjust p-values or significance thresholds post-hoc (Bonferroni, FDR, etc.). Bayesian hierarchical models handle multiplicity implicitly through partial pooling and shrinkage toward a common prior, without explicit p-value adjustment. Pre-registration reduces the need for correction by restricting the family of pre-specified confirmatory tests, leaving only hypothesis-generating exploratory analyses (which are flagged as such). These are competing responses to the same concern. The tension is between the universal applicability of correction methods (applicable to any research) and the alternative routes (hierarchical modeling, pre-registration) that handle the problem at a different level, with distinct advantages and costs.

T4 — Simple closed-form corrections versus dependence-aware resampling methods. Bonferroni requires no assumption about dependence (it is valid under any dependence structure and widely applicable, but it is the most conservative of the common methods, especially when tests are strongly correlated). Benjamini-Hochberg is guaranteed to control FDR under independence or positive regression dependence (less conservative; the condition typically holds). Real data often have complex dependence structures (neighboring SNPs in linkage disequilibrium, neighboring voxels in fMRI, features in ML pipelines) that closed-form corrections either ignore or do not cover. Permutation-based and resampling methods (max-T, cluster-based, bootstrap) produce data-adaptive corrections exploiting the actual dependence structure, but at computational cost and with less widely available software. The tension is between simplicity and broad applicability (closed-form methods) versus statistical accuracy and computational burden (resampling methods). Modern practice often uses simple corrections as defaults and resampling-based methods for high-stakes applications[3].
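A single-step max-statistic permutation adjustment, one form of the max-T idea, can be sketched as follows. This is a simplified illustration under several assumptions: two groups, exchangeable observations under the null, and the absolute difference in group means as the statistic (studentized statistics are more common in practice). The function name, seed, and toy data are hypothetical.

```python
import random
from statistics import mean
from typing import List, Sequence


def max_t_adjusted(
    group_a: Sequence[Sequence[float]],
    group_b: Sequence[Sequence[float]],
    n_perm: int = 999,
    seed: int = 0,
) -> List[float]:
    """FWER-adjusted p-values from the permutation distribution of the maximum.

    Each observation is a row with one value per feature. Rows are permuted
    whole, so the correlation between features is preserved in every relabelling.
    """
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)
    n_features = len(pooled[0])

    def stats(rows_a, rows_b):
        return [
            abs(mean(r[j] for r in rows_a) - mean(r[j] for r in rows_b))
            for j in range(n_features)
        ]

    observed = stats(group_a, group_b)
    rng = random.Random(seed)
    exceed = [0] * n_features
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of observations
        perm_max = max(stats(pooled[:n_a], pooled[n_a:]))
        for j in range(n_features):
            if perm_max >= observed[j]:
                exceed[j] += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return [(count + 1) / (n_perm + 1) for count in exceed]


# Toy data: feature 0 separates the groups, features 1 and 2 do not.
a = [[10.0 + i, 0.1 * i, 0.5] for i in range(6)]
b = [[0.0 + i, 0.1 * (5 - i), 0.5] for i in range(6)]
print(max_t_adjusted(a, b, n_perm=199, seed=1))
```

Comparing every observed statistic against the permutation distribution of the maximum is what gives family-wise control, and because whole rows are permuted, strongly correlated features are penalized less than they would be under Bonferroni.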

T5 — Type I error control versus reproducibility and replication success. Correcting for multiplicity controls the probability of any false positive within a study, but replication studies—even if conducted identically—will have independent random sampling variation. A finding that barely passes multiple-correction thresholds in one study (right at the boundary) may fail to replicate in the next study. Replication power depends on effect size, sample size, and design, not just on controlling Type I error. Fields that adopt stringent multiplicity control (e.g., genomics with p<5×10⁻⁸) achieve high replication rates for reported findings, but at the cost of very large sample requirements. The tension is that Type I error control is a necessary but insufficient condition for reproducibility and replication.

T6 — Declared primary hypothesis versus exploratory analysis flexibility. A research study can declare a single primary hypothesis (requiring only one test, no correction) and multiple secondary/exploratory hypotheses (requiring correction or flagging). This structure preserves Type I error for the primary hypothesis while enabling hypothesis generation from exploratory analyses. However, researchers have incentive to post-hoc declare as "primary" findings that emerged exploratorily. Pre-registration (declaring hypotheses before data collection) strengthens the distinction, but compliance with pre-registration is incomplete. The tension is between preserving researcher flexibility for discovery and preventing strategic declaration that undermines the purpose of distinguishing primary from exploratory[2].

Structural–Framed Character

Multiple Comparisons Correction is a hybrid on the structural–framed spectrum, leaning structural with a light frame. At its center is a field-neutral mathematical fact: when many tests are run together, the chance of at least one false positive climbs far above the per-test rate, so the thresholds or p-values must be adjusted to control the study-level error. A modest amount of vocabulary comes along from its home in experimental statistics.

The core structure is genuinely cross-domain: the same error-inflation arithmetic and the family-wise and false-discovery-rate corrections apply unchanged wherever many tests are aggregated — genomics screens, neuroimaging analyses, A/B testing in software, or any large battery of comparisons. It carries little intrinsic normative weight; the inflation is a probabilistic consequence, not a verdict. The light frame it inherits is methodological: the contested judgment about where the limits of a 'family' of tests lie, and the inferential-practice assumption that controlling a study-level error rate is the right discipline to impose. That choice of what to protect against reflects a research convention rather than the bare mathematics. The structural content dominates while the frame stays thin, placing it on the structural side of the middle.

Substrate Independence

Multiple Comparisons Correction is among the most substrate-tethered entries — composite 1 / 5 on the substrate-independence scale. The general worry it addresses — that testing many hypotheses inflates the chance of a spurious hit — has wide appeal, but the prime itself is a bundle of statistical machinery: family-wise error control, false-discovery-rate adjustment, Bonferroni and Benjamini-Hochberg. These are statistical constructs through and through, and they carry no meaning outside hypothesis testing and causal inference. This is a catalog technique anchored to its statistical substrate, not a pattern that recurs across media.

  • Composite substrate independence — 1 / 5
  • Domain breadth — 1 / 5
  • Structural abstraction — 2 / 5
  • Transfer evidence — 1 / 5

Neighborhood in Abstraction Space

Multiple Comparisons Correction sits in a sparse region of abstraction space (81st percentile for distinctiveness): few abstractions share its structure, so a faithful description tends to retrieve it precisely rather than landing on a neighbor.

Family — Multiple-Comparison Correction (1 prime)

Nearest neighbors

Computed from structural-signature embeddings · 2026-05-29

Not to Be Confused With

Multiple comparisons correction must be distinguished from Hypothesis Testing (Null vs. Alternative), its nearest neighbor (similarity 0.654), because they operate at different scopes. Hypothesis testing is the framework for a single test: specifying null and alternative hypotheses, choosing a significance level (typically α=0.05), conducting the test, and interpreting a p-value. Hypothesis testing asks: "For this one test, what is the probability of a false positive?" Multiple comparisons correction, by contrast, addresses what happens when many tests are conducted. The framework remains hypothesis testing, but the scope expands from "this one test" to "all the tests conducted in this study or analysis." A single p-value of 0.03 in a hypothesis test may represent genuine evidence if it is the only test conducted; the same p-value is nearly meaningless if it is one of 100 tests conducted in the same study without correction. Multiple comparisons correction forces the question: "Given that I have conducted 100 tests, what is the probability that at least one false positive occurred by chance?" The distinction is between per-test inference (single hypothesis test) and family-wise inference (multiple tests in context).
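The contrast between per-test and family-wise reading of p = 0.03 can be made concrete. The sketch below assumes 100 independent tests with all nulls true; the helper names are hypothetical and exist only for this illustration.

```python
def bonferroni_adjusted_p(p: float, m: int) -> float:
    """Bonferroni-adjusted p-value: the raw p-value scaled by the number
    of tests in the family, capped at 1."""
    return min(1.0, p * m)


def prob_smallest_p_at_most(p: float, m: int) -> float:
    """Probability that the smallest of m independent null p-values is
    at most p (each null p-value is uniform on [0, 1])."""
    return 1.0 - (1.0 - p) ** m


if __name__ == "__main__":
    p, m = 0.03, 100
    # As the only test in the study, p = 0.03 is below alpha = 0.05.
    print("single test, adjusted p:", bonferroni_adjusted_p(p, 1))
    # As one of 100 tests, the adjusted p-value is capped at 1.
    print("one of 100 tests, adjusted p:", bonferroni_adjusted_p(p, m))
    # Chance that pure noise produces at least one p-value this small.
    print("P(min p <= 0.03 under all nulls):",
          round(prob_smallest_p_at_most(p, m), 3))
```

Read this way, a p-value of 0.03 that is the best of 100 uncorrected tests is roughly what chance alone would be expected to produce.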

Nor is multiple comparisons correction the same as Statistical Power, the probability that a test correctly rejects a false null hypothesis given a true effect of a specified size. Statistical power answers: "If the null is false with effect size δ, will my test detect it?" Multiple comparisons correction answers: "If the null is true for all tests, will I incorrectly reject any of them by chance?" These are separate concerns that nevertheless interact. A study could have high power on each of its tests (good ability to detect effects when present) and still suffer from inflated false-positive rates across the family if multiple comparisons correction is not applied. Conversely, a study could apply stringent multiple comparisons correction (reducing false positives) but have low power per test (low ability to detect genuine effects). The trade-off between them is well known: as the per-test significance threshold becomes more stringent, the Type II error rate rises and power falls. The two are distinct problems requiring separate analysis.
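The power cost of a stricter threshold can be illustrated with a textbook power calculation. This is a sketch under stated assumptions: a one-sided, one-sample z-test with known unit variance, and an illustrative effect size and sample size chosen for this example rather than taken from any study.

```python
from statistics import NormalDist


def power_one_sided_z(effect_size: float, n: int, alpha: float) -> float:
    """Power of a one-sided one-sample z-test (known unit variance) to
    detect a standardized mean shift of effect_size with n observations."""
    standard_normal = NormalDist()
    critical_value = standard_normal.inv_cdf(1.0 - alpha)
    # Under the alternative, the z statistic is shifted by effect_size * sqrt(n).
    shift = effect_size * n ** 0.5
    return 1.0 - standard_normal.cdf(critical_value - shift)


if __name__ == "__main__":
    effect_size, n, m = 0.3, 100, 100  # illustrative values only
    per_test = power_one_sided_z(effect_size, n, alpha=0.05)
    bonferroni = power_one_sided_z(effect_size, n, alpha=0.05 / m)
    print(f"power at alpha=0.05:        {per_test:.3f}")
    print(f"power at alpha=0.05/{m}:    {bonferroni:.3f}")
```

With these illustrative numbers, dividing the threshold by 100 cuts the power of each individual test substantially, which is the Type II cost the paragraph above describes.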

Multiple comparisons correction is also distinct from Reproducibility and Replicability, though multiplicity correction contributes to both being meaningful. Replicability is whether an independent study using the same methods produces similar findings; reproducibility is whether the original analysis results can be reproduced by redoing the computation. Multiple comparisons correction addresses false-positive inflation within a single study: it controls the probability that the study's reported findings include false positives, making the findings more trustworthy. But controlling false positives in one study does not guarantee replication in an independent study: replication depends on effect size, sample size, heterogeneity, and other factors not addressed by multiplicity correction. A well-corrected study's findings might still fail to replicate if the true effect size is small. Conversely, findings from a study that reported some false positives may still replicate where the underlying true effects are large enough. Multiplicity correction therefore supports trustworthy science but is not sufficient for replicability.

Multiple comparisons correction is further distinct from Selection Bias, though they are related. Selection bias occurs when the process of data collection, analysis choice, or outcome selection is not independent of the outcome being measured, introducing systematic distortion in the estimated effects. Multiple comparisons correction addresses a specific kind of selection artifact: the inflation of false-positive rates when many tests are conducted and only the significant ones are reported. Selection bias is broader and can occur even without multiple testing (e.g., a cohort study where loss-to-followup is differential across exposure groups). Multiple comparisons correction does not address selection bias directly; it addresses only the multiple-testing contribution to inflated false-positive rates. However, the underlying problem is related: both selection bias and multiplicity correction concern the need to account for the full data-collection and testing process, not just the reported results in isolation.

Finally, multiple comparisons correction is distinct from Confirmation Bias, a cognitive phenomenon where people selectively seek, interpret, and remember information confirming their prior beliefs. Confirmation bias is a psychological mechanism operating at the level of individual cognition and research culture. Multiple comparisons correction is a statistical procedure operating at the level of formal inference. Confirmation bias might lead researchers to conduct selective tests and report only those consistent with their hypothesis; multiple comparisons correction would, if applied to all tests conducted, reveal that inflation in false-positive rates has occurred and require adjustment. Multiplicity correction is a partial antidote to the consequences of confirmation bias, though it does not address the cognitive phenomenon itself. A researcher could apply rigorous multiple-comparisons correction and still be unconsciously biased in which tests they choose to conduct; correction helps mitigate the inferential consequences of multiplicity but does not eliminate cognitive bias.

Solution Archetypes

Solution archetypes in the catalog that build on this prime — directly (this prime is a source ingredient) or as a related prime.

Built directly on this prime (1)

Also a related prime in 5 archetypes

Notes

Additional canonical references in this cluster: [4], [1], [5], [6], [7], [8], [9], [10].

Multiple comparisons correction is foundational to modern statistical practice in high-dimensional fields (genomics, neuroimaging, machine learning, clinical trials). The field has matured from Bonferroni's simple rule [11] through Benjamini-Hochberg's FDR framework [1] to permutation-based and resampling-based adaptive methods [3]. The 2010s replication crisis in psychology and biomedicine elevated multiplicity concerns to broad scientific awareness, contributing to pre-registration norms, registered-report publication formats, and stronger expectations that all conducted tests be disclosed. The "garden-of-forking-paths" perspective (Gelman & Loken 2013 [2]) emphasizes that multiplicity inflation often arises not from intentional p-hacking but from benign analytical flexibility (exploring multiple dependent variables, multiple analytical approaches, multiple subgroups) that cumulatively inflates Type I error. Solutions to multiplicity operate at multiple levels: (a) formal statistical correction (Bonferroni, FDR, permutation tests), (b) pre-registration to reduce the testing family, and (c) hold-out test sets and replication to validate findings. In sequential and A/B testing contexts, always-valid p-values [14] and e-values [15] provide tools for anytime-valid inference that permit interim analyses without the traditional multiplicity problems. The distinction from model selection is worth noting: choosing among many candidate models (machine learning feature selection, regression model choice) creates related multiplicity concerns that are addressed through information criteria (AIC, BIC), cross-validation, and held-out test sets rather than hypothesis-test correction.
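The formal corrections named in level (a) can be sketched as follows. This is a minimal, unoptimized illustration of Holm's step-down procedure [4] for FWER control and the Benjamini-Hochberg step-up procedure [1] for FDR control; the function names and the example p-values are hypothetical. Benjamini-Hochberg as written here is assumed to be applied to independent or positively dependent tests; arbitrary dependence calls for the adjustment in [5].

```python
from typing import List, Sequence


def holm_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Holm step-down: compare the k-th smallest p-value (k = 0, 1, ...)
    with alpha / (m - k) and stop at the first failure. Controls FWER."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if p_values[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break  # every larger p-value is retained as well
    return reject


def benjamini_hochberg_reject(p_values: Sequence[float],
                              q: float = 0.05) -> List[bool]:
    """Benjamini-Hochberg step-up: find the largest rank k such that the
    k-th smallest p-value is <= k * q / m, then reject the k smallest."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_passing_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            largest_passing_rank = rank
    reject = [False] * m
    for i in order[:largest_passing_rank]:
        reject[i] = True
    return reject


if __name__ == "__main__":
    # Hypothetical p-values, for illustration only.
    p_values = [0.041, 0.001, 0.216, 0.008, 0.039,
                0.042, 0.060, 0.074, 0.205, 0.212]
    print("Holm rejections:", sum(holm_reject(p_values)))
    print("BH rejections:  ", sum(benjamini_hochberg_reject(p_values)))
```

On the same input, the FDR procedure typically rejects at least as many hypotheses as the FWER procedure, reflecting the difference in what each one promises to control.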

References

[1] Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300. Benjamini Hochberg false discovery rate FDR control multiple testing procedure step-up.

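[2] Gelman, A., & Loken, E. (2013). The garden of forking paths: why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time. Unpublished manuscript, Department of Statistics, Columbia University. Gelman Loken garden forking paths multiplicity selection bias p-hacking.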

[3] Westfall, P. H., & Young, S. S. (1993). Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. John Wiley & Sons. Westfall Young resampling multiple testing permutation bootstrap.

[4] Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70. Holm sequentially rejective multiple test procedure closed-form step-down.

[5] Benjamini, Y., & Yekutieli, D. (2001). The control of the false discovery rate under dependency. Annals of Statistics, 29(4), 1165–1188. Benjamini Yekutieli FDR control arbitrary dependence structure.

[6] Sidak, Z. (1967). Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association, 62(318), 626–633. Sidak confidence regions multivariate normal multiple comparisons.

[7] Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75(4), 800–802. Hochberg sharper Bonferroni closed-form step-up step-down.

[8] Tukey, J. W. (1953). The problem of multiple comparisons. Princeton University. Unpublished manuscript. Tukey problem multiple comparisons HSD test pairwise.

[9] Scheffé, H. (1953). A method for judging all contrasts in the analysis of variance. Biometrika, 40(1–2), 87–104. Scheffe method contrasts ANOVA simultaneous confidence intervals.

[10] Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3), 479–498. Storey q-value false discovery rate direct approach robust.

[11] Bonferroni, C. E. (1936). Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze, 8, 3–62. Bonferroni statistical classes probability multiple inequalities union bound.

[12] Genovese, C. R., Lazar, N. A., & Nichols, T. (2002). Thresholding of statistical maps in functional neuroimaging using the false discovery rate. NeuroImage, 15(4), 870–878. Genovese neuroimaging false discovery rate thresholding fMRI.

[13] Hochberg, Y., & Tamhane, A. C. (1987). Multiple Comparison Procedures. John Wiley & Sons. Hochberg Tamhane multiple comparison procedures FWER FDR theory.

[14] Johari, R., Koomen, P., Pekelis, L., & Walsh, D. (2022). Always valid inference: continuous monitoring of A/B tests. Operations Research, 70(3), 1806–1821. Johari always valid continuous monitoring A/B test sequential testing.

[15] Vovk, V., & Wang, R. (2021). E-values: calibration, combination and applications. Annals of Statistics, 49(3), 1736–1754. Vovk e-values calibration exchangeability safe test sequential inference.