Multiple-Testing Discipline¶

Control false discoveries when many comparisons, claims, or tests are being tried.

Essence¶

Multiple-Testing Discipline is the intervention pattern for situations where many comparisons, claims, metrics, screens, or analytic choices are tried and the most attractive result is at risk of being interpreted as though it came from a single planned test. The archetype does not prohibit exploration. It makes exploration honest: the selected result must be understood against the larger search space that made it possible.

The core move is to change the unit of credibility. Instead of asking only, "Does this one result look unlikely by chance?" the pattern asks, "How many chances did the system have to find something that looked unlikely, and what confirmation burden follows from that?" That shift turns a potentially misleading discovery ritual into a disciplined evidence process.

Compression statement¶

When many comparisons are searched, separate the family of attempted claims from any single attractive result, adjust the evidentiary standard or discovery procedure, record exploratory search, and require confirmation before treating selected findings as reliable.

Canonical formula: many attempted claims + ordinary single-claim interpretation -> inflated false positives; claim family + adjusted threshold/procedure + exploration log + confirmation -> credible discovery

When to Use This Archetype¶

Use this archetype when the same evidence base can generate many candidate findings: many outcomes in a study, many metrics in an experiment, many subgroups in a report, many model variants in a benchmark, many anomaly screens in an audit, or many investigative leads in a case file. It is especially important when selected findings will guide decisions, reputations, resource allocation, safety action, publication, enforcement, or product launch.

The archetype is not needed for every open-ended exploration. It becomes necessary when exploratory results are about to be presented as confirmed claims or used for consequential action. In low-stakes exploration, the right move may be simple labeling: "this is a lead." In high-stakes contexts, the right move may require formal correction, holdout validation, independent replication, a registry of attempted claims, or a staged confirmation process.

Structural Problem¶

When many tests are tried, chance has many opportunities to produce an impressive-looking pattern. If the failed, null, or unreported comparisons disappear from view, the remaining selected result looks more surprising than it really is. This is the structural source of p-hacking, metric shopping, subgroup fishing, data dredging, repeated interim looks, and overfitting to validation evidence.
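A short worked equation can make the inflation concrete. The sketch below assumes $m$ independent tests, each run at per-test significance level $\alpha$, with every null hypothesis true; the symbols $m$ and $\alpha$ are notation introduced here for illustration, and real test families are often correlated, so the exact figure will differ.

```latex
\[
P(\text{at least one false positive}) \;=\; 1 - (1 - \alpha)^{m}
\]
\[
\alpha = 0.05,\quad m = 20:\qquad 1 - 0.95^{20} \approx 0.64
\]
```

Under these assumptions, twenty looks at a 5% threshold give roughly a 64% chance that at least one comparison appears significant by chance alone.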

The problem is not simply that people use a p-value, a threshold, or a dashboard. The problem is that the evidentiary context of the selected result is incomplete. A single result is being interpreted without the family of attempts that generated it. Multiple-Testing Discipline restores that missing context.

Intervention Logic¶

The intervention begins by defining the claim family: the related tests, outcomes, metrics, models, filters, time windows, and segments that created opportunities for a selected discovery. It then records the multiplicity inventory, including failed or unselected attempts. Next, it chooses an error-risk policy: strict false-positive avoidance, discovery-rate control, exploratory labeling, or staged confirmation. Finally, it applies a rule or process that matches the search space, and it withholds decision-ready status until the finding has survived the appropriate confirmation burden.

A disciplined process can still be creative. Exploration generates leads; confirmation changes their status. The archetype succeeds when a stakeholder can see both the promising result and the path by which it was found, then judge whether the result is exploratory, adjusted, confirmed, replicated, or ready for action.

Key Components¶

Multiple-Testing Discipline restores the missing evidentiary context around a selected finding by making the broader search space visible and matching the confirmation burden to it. The Claim Family defines which comparisons belong together for interpretation — a single product experiment may contain one primary metric, several secondary metrics, dozens of segments, and multiple time windows, and the family makes those opportunities legible. The Multiplicity Inventory records the actual number and kind of attempted looks, including tested metrics, model variants, filters, thresholds, subgroups, interim checks, and abandoned analyses, so hidden flexibility cannot vanish from the evidence trail. The Error-Risk Profile then states what kind of mistake matters most for the context — a safety screen, a scientific discovery program, and a product experiment need different balances between false positives and missed discoveries — and chooses the policy by which the family will be governed.

The remaining components translate that policy into discipline at the moment of decision. The Multiplicity Adjustment Rule implements the chosen approach: a formal threshold correction, a false-discovery-rate procedure, a holdout requirement, an alpha-spending plan, or a staged confirmation gate. The Exploratory–Confirmatory Boundary prevents patterns found during search from being silently relabeled as planned tests, preserving exploration's value while protecting confirmatory credibility. The Discovery Record keeps failed, null, alternative, and selected analyses visible together so selective memory does not recreate the same inflation risk. Finally, the Confirmation Requirement specifies what must happen before a promising lead can guide consequential action — a new experiment, held-out data, independent replication, a second site, or a stricter review gate — so credibility scales with the stakes of the decision the finding is asked to support.

Component	Description
Claim Family	Defines which comparisons belong together for interpretation. A single product experiment might include one primary metric, several secondary metrics, dozens of segments, and multiple time windows; the claim family makes those opportunities visible.
Multiplicity Inventory	Records the number and kind of attempted looks. This includes tested metrics, model variants, filters, thresholds, subgroups, interim checks, and abandoned analyses.
Error-Risk Profile	States what kind of mistake matters most. A safety screen, scientific discovery program, and product experiment may need different balances between false positives and missed discoveries.
Multiplicity Adjustment Rule	Implements the chosen discipline. It might be a formal correction, a false-discovery-rate procedure, a holdout rule, an alpha-spending plan, or a staged confirmation requirement.
Exploratory–Confirmatory Boundary	Prevents patterns found during search from being treated as planned tests. It preserves exploratory value while protecting confirmatory credibility.
Discovery Record	Keeps failed, null, alternative, and selected analyses visible. Without it, selective memory recreates the same false-discovery risk.
Confirmation Requirement	Defines what must happen before a selected result can guide high-stakes action. Confirmation may require a new experiment, held-out data, independent replication, a second site, or a stricter review gate.
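One way to picture how these components fit together is as a small record structure. The following is a minimal sketch, not a prescribed schema: the class names `Status`, `Attempt`, and `ClaimFamily`, their fields, and the example experiment are all hypothetical and chosen only to mirror the component names above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Status(Enum):
    # Hypothetical status labels for the exploratory-confirmatory boundary.
    EXPLORATORY = "exploratory"
    ADJUSTED = "adjusted"
    CONFIRMED = "confirmed"
    REPLICATED = "replicated"


@dataclass
class Attempt:
    """One attempted look: a metric, segment, model variant, or interim check."""
    description: str
    p_value: float
    selected: bool = False  # True if this result was singled out for reporting
    status: Status = Status.EXPLORATORY


@dataclass
class ClaimFamily:
    """A claim family together with its multiplicity inventory."""
    name: str
    attempts: List[Attempt] = field(default_factory=list)

    def record(self, attempt: Attempt) -> None:
        # Every look is logged, including nulls and abandoned analyses.
        self.attempts.append(attempt)

    def family_size(self) -> int:
        # The number of chances the search had to find something.
        return len(self.attempts)

    def decision_ready(self) -> List[Attempt]:
        # Only confirmed or replicated findings may guide consequential action.
        ready = (Status.CONFIRMED, Status.REPLICATED)
        return [a for a in self.attempts if a.status in ready]


family = ClaimFamily("checkout experiment")
family.record(Attempt("primary: conversion rate", 0.20))
family.record(Attempt("secondary: basket size, mobile segment", 0.03, selected=True))
print(family.family_size())
print(family.decision_ready())
```

In this sketch the selected subgroup result stays in the inventory next to the null primary result, and `decision_ready()` returns nothing until some attempt's status has been changed by a separate confirmation step.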

Common Mechanisms¶

Mechanisms implement the archetype; they are not the archetype itself.

Bonferroni-Like Correction (bonferroni_like_correction): A strict threshold-adjustment mechanism for contexts where even one false positive across the family is costly.
False Discovery Rate Control (false_discovery_rate_control): A screening mechanism that permits many discoveries while controlling the expected share that are false.
Preregistration (preregistration): A procedural commitment mechanism that records planned claims and analyses before results are observed.
Holdout Validation (holdout_validation): An evidence-partitioning mechanism that reserves fresh data or cases for confirmation after discovery.
Claim Registry (claim_registry): A governance mechanism that tracks attempted claims, statuses, owners, and follow-up requirements.
Replication Study (replication_study): An independent-confirmation mechanism that tests whether a selected finding persists in new evidence.
Confirmatory Follow-Up (confirmatory_follow_up): A staged-validation mechanism that turns a promising lead into a targeted confirmatory test.
Alpha-Spending Plan (alpha_spending_plan): A sequential-testing mechanism for repeated interim looks.
Metric Hierarchy (metric_hierarchy): A priority mechanism that prevents teams from promoting a secondary metric after the primary result disappoints.
Multiverse Analysis Report (multiverse_analysis_report): A transparency mechanism that shows whether a result depends on one favorable analytic path.
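To illustrate how two of the listed mechanisms differ in practice, the sketch below applies a Bonferroni threshold and the Benjamini-Hochberg step-up procedure to the same list of p-values. The function names and the example p-values are made up for illustration, and the Benjamini-Hochberg guarantee is usually stated for independent (or positively dependent) tests, which is assumed here.

```python
from typing import List


def bonferroni_reject(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Reject each hypothesis whose p-value is at most alpha / m."""
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m
    return [p <= threshold for p in p_values]


def benjamini_hochberg_reject(p_values: List[float], q: float = 0.05) -> List[bool]:
    """Benjamini-Hochberg step-up procedure at target false discovery rate q."""
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * q.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_k = rank
    # Reject the hypotheses with the largest_k smallest p-values.
    reject = [False] * m
    for idx in order[:largest_k]:
        reject[idx] = True
    return reject


# Illustrative p-values for a family of ten comparisons (not real data).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(bonferroni_reject(p_values))          # only the first p-value is rejected
print(benjamini_hochberg_reject(p_values))  # the first two p-values are rejected
```

With these inputs the Bonferroni threshold is 0.05 / 10 = 0.005, so only 0.001 passes, while Benjamini-Hochberg also admits 0.008 because it is compared against 2/10 × 0.05 = 0.01. Neither function replaces the claim family definition: both are only as honest as the list of p-values they are given.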

Parameter / Tuning Dimensions¶

Important tuning dimensions include the size of the claim family, the dependence among tests, the cost of false positives, the cost of false negatives, the strength of prior theory, the number of exploratory degrees of freedom, the availability of held-out evidence, the stakes of action, and the acceptable delay before confirmation.

A strict familywise approach is appropriate when any false claim is dangerous. A false-discovery-rate approach is often better when screening many candidates and expecting later follow-up. A governance-heavy approach is useful when the main risk is organizational cherry-picking rather than a single statistical formula. A staged exploration-confirmation approach is useful when broad search is necessary but action must wait.
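The difference between the familywise and false-discovery-rate approaches can be stated compactly. The notation below is an assumption introduced for illustration: $V$ is the number of true null hypotheses that are rejected (false discoveries) and $R$ is the total number of rejections in the family.

```latex
\[
\mathrm{FWER} = P(V \ge 1),
\qquad
\mathrm{FDR} = \mathbb{E}\!\left[\frac{V}{\max(R,\,1)}\right]
\]
```

Controlling the first quantity bounds the chance of any false claim at all; controlling the second bounds the expected share of false claims among those reported, which is why it tolerates some false leads in exchange for more discoveries.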

Invariants to Preserve¶

The claim family must remain visible when interpreting a selected result. Exploration and confirmation must not be merged into a single status. Failed or unreported attempts must not disappear from the evidence context. The evidentiary burden should increase as the search space expands. High-stakes action should require stronger confirmation than initial discovery. Mechanisms should remain mechanisms: a p-value, Bonferroni correction, preregistration form, or holdout dataset is not the archetype by itself.

Target Outcomes¶

The archetype aims to reduce false discoveries, reduce p-hacking and metric shopping, improve trust in selected findings, preserve the value of exploratory search, and make reported claims more reproducible. A successful implementation does not eliminate uncertainty; it makes the credibility of discoveries better calibrated to how they were found.

Tradeoffs¶

Multiple-Testing Discipline trades some speed and sensitivity for credibility. Stricter correction can miss real signals. Broad exploration can generate useful leads but weak confirmation. Simple rules are easy to explain but may be too conservative or poorly matched to correlated tests. Claim registries improve memory but add administrative burden. Confirmation delays action, but premature action can be far more costly when the discovery is false.

The best version of the archetype is not always the strictest. It is the version that fits the error-risk profile and clearly labels what status each finding deserves.

Failure Modes¶

A common failure is an undefined claim family, where only the reported tests are corrected and hidden analytic flexibility is ignored. Another is correction theater, where a formal adjustment is applied while metric shopping, data leakage, confounding, or selective reporting continues. Overcorrection paralysis happens when strict rules suppress useful discovery even when false leads could be cheaply followed up. Exploratory label laundering occurs when a claim is labeled exploratory in methods text but presented as confirmed in decisions or headlines. Confirmation contamination happens when holdout or replication evidence is influenced by the original search.

The most subtle failure is treating a result that survived multiplicity discipline as automatically important. A disciplined discovery can still be tiny, biased, confounded, practically irrelevant, or ethically unsafe to act on.

Neighbor Distinctions¶

Hypothesis Testing Frame structures a single claim against a default with evidence thresholds and error costs. Multiple-Testing Discipline adds the many-claim layer: what counts as evidence changes when many opportunities for a false alarm exist.

Reproducibility Protocol makes work rerunnable and auditable. A reproducible workflow can still overclaim a cherry-picked result if the discovery process had many unacknowledged attempts.

Uncertainty Explicitness communicates uncertainty. Multiple-Testing Discipline changes the discovery and confirmation process so reported uncertainty is not falsely narrow because of hidden search.

Confounder Control addresses third-variable distortion. Multiple-Testing Discipline addresses false discoveries created by repeated or flexible search.

Power-Aware Design and Effect Size Reporting remain merge-review neighbors in this batch. They address false negatives and practical magnitude, while this archetype addresses inflated false positives across many claims.

Cross-Domain Examples¶

In genomics, thousands of candidate genes may be screened; the archetype uses discovery-rate control and replication before treating candidates as credible. In product analytics, a feature may be examined across many metrics and segments; the archetype keeps primary metrics separate from exploratory subgroup leads. In machine learning, many hyperparameters may be tried; the archetype protects a locked test set for final confirmation. In public policy evaluation, many regions, outcomes, and demographic subgroups may be inspected; the archetype reports the family and labels subgroup findings appropriately. In safety monitoring, many anomaly screens may create false alarms; the archetype tracks alert families and requires corroboration before escalation.

Non-Examples¶

A single pre-specified pass/fail test is better handled by Hypothesis Testing Frame. A brainstorming session that produces speculative ideas without evidence claims does not require this archetype. A causal comparison distorted by a third variable calls for Confounder Control. A report that lacks uncertainty intervals calls for Uncertainty Explicitness. A statistically significant but practically tiny result calls for Effect Size Reporting and decision-relevance checks, not primarily multiplicity discipline.

Abstractions this archetype builds on — directly (a source ingredient) or as a related pattern. Links follow the typed catalog namespace.

Built directly on (3)

Multiple Comparisons Correction: Adjust the thresholds or p-values of a defined family of simultaneous tests so a chosen family-level error criterion remains bounded despite multiplicity.
Reproducibility & Replicability: Repeatable results.
Type I & Type II Errors: False positive/negative.

Also references 6 related abstractions

Confirmation Bias: Favor confirming evidence.
Hypothesis Testing (Null vs. Alternative): Null vs alternative evaluation.
Probability: Quantifies uncertainty and likelihoods.
Statistical Significance (p-Value): Probability of results at least as extreme as those observed, assuming the null hypothesis is true.
Threshold: Safe vs harmful levels.
Uncertainty: Incomplete knowledge.

Variants¶

Narrower or domain-specific specializations that share this archetype's core structure. Recognized variants are established; candidate variants are provisional.

Familywise Error Control Variant · risk or failure variant · recognized

A stricter variant that tries to avoid even one false positive across a defined family of claims.

Distinct from parent: The parent includes several ways to discipline many-claim discovery; this variant emphasizes strict family-level false-positive prevention.
Use when: Any false claim in the family could cause serious harm, liability, wasted resources, or irreversible action; The number of tests is moderate enough that strict control is still usable; The environment values high specificity over broad exploratory discovery.
Typical domains: clinical safety, regulatory testing, quality acceptance, high stakes audit
Common mechanisms: Bonferroni-like correction, closed testing procedure

False Discovery Rate Variant · risk or failure variant · recognized

A screening-oriented variant that permits many discoveries while controlling the expected share that are false.

Distinct from parent: The parent is broader; this variant accepts a calibrated false-discovery burden to preserve useful discovery.
Use when: Large numbers of candidate signals are screened and some false leads are tolerable; The goal is to prioritize follow-up candidates, not to make final irreversible claims from the first screen; Discovery value is high enough that overly strict familywise control would be counterproductive.
Typical domains: genomics, screening programs, anomaly detection, lead generation
Common mechanisms: false discovery rate control, ranked candidate follow-up

Exploratory–Confirmatory Partition Variant · temporal variant · promote to full archetype candidate

A staged variant that explicitly labels idea-generation evidence separately from confirmation evidence.

Distinct from parent: The parent controls many-claim false-discovery risk; this variant may become broader as a staged epistemic-status pattern for discovery work generally.
Use when: Broad exploration is necessary but later claims must be credible; Discovery and confirmation can be separated by time, dataset, team, site, or protocol; Stakeholders frequently overstate exploratory patterns.
Typical domains: scientific research, product analytics, machine learning, investigative analysis
Common mechanisms: preregistration, holdout validation, replication study

Claim Registry Variant · governance variant · recognized

A governance variant that controls multiplicity by making attempted claims, status, and follow-up obligations visible across a portfolio.

Distinct from parent: The parent includes statistical and procedural strategies; this variant centers on portfolio-level claim memory and accountability.
Use when: Many teams, analysts, experiments, dashboards, alerts, or investigations generate claims over time; The main risk is not a single formula but organizational forgetting, selective reporting, or repeated re-analysis; Reviewers need to see what has been tried, what failed, and what remains exploratory.
Typical domains: regulatory review, analytics governance, audit portfolios, research program management
Common mechanisms: claim registry, review gate

Near names: Multiplicity Control, Multiple Comparisons Correction, False Discovery Control, Data-Dredging Guardrail, p-Hacking Guardrail, Look-Elsewhere Effect Control, Trials Factor Correction.