Hypothesis Testing Frame
Essence
Hypothesis Testing Frame is the pattern of turning an ambiguous evidentiary question into a disciplined claim-evaluation structure. It asks: what is the claim, what remains true by default, what alternative would displace that default, what evidence is allowed to bear on the choice, how strong must that evidence be, and what mistakes are we trying to avoid?
The archetype is not a p-value, a statistical test, or a proof ritual. It is the surrounding architecture that makes a result interpretable. A statistical test may implement the archetype, but so can a quality acceptance standard, a legal burden of proof, a safety approval gate, or a structured diagnostic review.
Compression statement
When a claim must be evaluated, define the default, alternative, evidence threshold, and error costs before interpreting results.
Canonical formula: claim_under_test + default_claim + alternative_claim + test_evidence + evidence_threshold + error_cost_profile -> decision_rule + interpretation_limit
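As a minimal sketch, the formula's inputs can be written down as a plain record before any result is examined. The class name `ClaimFrame`, the `missing_inputs` helper, and the example values are illustrative assumptions, not part of the archetype's definition.

```python
from dataclasses import dataclass, fields


@dataclass
class ClaimFrame:
    """The six inputs of the canonical formula (field names mirror the formula)."""
    claim_under_test: str
    default_claim: str
    alternative_claim: str
    test_evidence: str
    evidence_threshold: str
    error_cost_profile: str

    def missing_inputs(self) -> list[str]:
        """Names of inputs left blank; interpretation should wait until this is empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Hypothetical example values, for illustration only.
frame = ClaimFrame(
    claim_under_test="The new onboarding flow raises 30-day retention",
    default_claim="Retention is unchanged or worse",
    alternative_claim="Retention improves by at least 1 percentage point",
    test_evidence="Randomized comparison on the planned retention metric",
    evidence_threshold="",  # not yet agreed, so the frame is incomplete
    error_cost_profile="False rollout costs rework; a miss delays a benefit",
)

print(frame.missing_inputs())  # ['evidence_threshold']
```

A blank input is treated here as a reason to hold off on interpretation, which is one simple way to enforce "define before interpreting."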
When to Use This Archetype
Use this archetype when a claim must be accepted, rejected, escalated, or left unresolved under uncertainty. It is especially useful when a result could be interpreted opportunistically, when actors might move standards after seeing evidence, or when the two major mistakes—false acceptance and false rejection—carry different costs.
It is less useful for open exploration with no claim yet, for evidence that is visibly biased or confounded before testing begins, or for situations where continuous belief updating is the primary task. In those cases, pair it with other archetypes rather than forcing all evidence work into a single threshold frame.
Structural Problem
The structural problem is not merely lack of data. It is the absence of a stable interpretation frame. A team sees evidence, then decides what it was testing. A reviewer accepts a claim because the result looks impressive, without saying what default was displaced. A non-detection is treated as proof of no effect even though the evidence may have been too weak to detect anything. A threshold is chosen because it supports a preferred conclusion.
Without the frame, claim evaluation becomes vulnerable to ambiguity, confirmation bias, burden shifting, threshold shopping, and overclaiming. The same evidence can be used to support incompatible conclusions because the claim, default, threshold, and error costs were never fixed.
Intervention Logic
The intervention begins by naming the claim under test. It then defines the default claim that remains unless evidence is sufficient, and the alternative claim that would justify changing belief or action. Next it specifies admissible evidence, the evidence threshold, and the error-cost profile: what happens if we accept a false claim, and what happens if we miss a true one?
Only after those pieces are explicit should the evidence be interpreted. The decision rule converts evidence into a status: reject the default, fail to reject it, escalate, continue monitoring, or require follow-up. The interpretation rule then states what the result does not establish. A threshold crossing does not automatically prove causality, practical importance, generalizability, or safety. A failure to cross a threshold does not automatically prove absence.
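The decision rule and its paired interpretation limits can be sketched as follows. This assumes, purely for illustration, that evidence is summarized as an interval estimate of an effect, that the default is "no effect of at least `min_effect`," and that three statuses are enough; the names `Status`, `decide`, and `LIMITS` are hypothetical.

```python
from enum import Enum


class Status(Enum):
    REJECT_DEFAULT = "reject the default"
    FAIL_TO_REJECT = "fail to reject the default"
    REQUIRE_FOLLOW_UP = "require follow-up"


def decide(ci_low: float, ci_high: float, min_effect: float) -> Status:
    """Map an interval estimate of the effect to a claim status.

    min_effect is the pre-set threshold and must be fixed before the
    interval is computed.
    """
    if ci_low >= min_effect:
        # The whole interval clears the threshold.
        return Status.REJECT_DEFAULT
    if ci_high < min_effect:
        # The whole interval falls short of the threshold.
        return Status.FAIL_TO_REJECT
    # The interval straddles the threshold, so the evidence cannot separate the claims.
    return Status.REQUIRE_FOLLOW_UP


# What each status does NOT establish.
LIMITS = {
    Status.REJECT_DEFAULT: "Does not by itself prove causality, practical importance, generalizability, or safety.",
    Status.FAIL_TO_REJECT: "Does not by itself prove absence of an effect.",
    Status.REQUIRE_FOLLOW_UP: "Establishes neither claim; more evidence is needed.",
}

status = decide(ci_low=0.0, ci_high=0.03, min_effect=0.01)
print(status.value, "|", LIMITS[status])
```

Returning the limit alongside the status keeps the "what this does not show" statement attached to every verdict rather than leaving it to the reader.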
Key Components
Hypothesis Testing Frame replaces ambiguous evidence interpretation with an architecture that fixes every important judgment before the result is seen. The Claim Under Test is the precise assertion being exposed to evidence, which prevents drift between several adjacent claims such as "the intervention works," "the intervention is large enough to matter," and "the intervention caused the change." The Default Claim is the baseline presumption that stays in place unless evidence is sufficient to displace it, while the Alternative Claim names the specific departure that would justify changing belief or action and distinguishes superiority, difference, equivalence, noninferiority, or burden-of-proof forms. The Test Evidence ties the test to admissible observations rather than letting any convenient signal count, and the Evidence Threshold sets how much of that evidence is enough — a threshold that must not move after seeing the result. The Error-Cost Profile makes the threshold non-arbitrary by stating what happens when a false claim is accepted and when a true one is missed.
The final three components govern what happens after evidence is in hand and what may legitimately be said about the outcome. The Decision Rule translates evidence into an action status — support for the alternative, retention of the default, insufficient evidence, monitor, or follow-up — rather than collapsing every result into a binary verdict. The Interpretation Rule constrains what can be inferred once the decision fires: it stops significance from becoming importance, non-significance from becoming proof of absence, and association from becoming causation. The Interpretation Limit then marks the outer edge of the frame itself, flagging where sampling, confounding, multiplicity, power, or magnitude questions remain unresolved and where neighbor archetypes must take over. Together the nine components convert a result into a disciplined claim status with explicit error costs rather than letting evidence be reinterpreted to suit whoever speaks first.
| Component | Description |
|---|---|
| Claim Under Test | This is the assertion being exposed to evidence. It prevents the conversation from drifting between several claims, such as “the intervention works,” “the intervention is large enough to matter,” and “the intervention caused the change.” |
| Default Claim | This is the baseline presumption that stays in place unless evidence is strong enough to displace it. It might be no improvement, current policy retained, no defect found, no safety approval, or innocence until guilt is proven. |
| Alternative Claim | This is the proposed departure from the default. It can assert improvement, difference, equivalence, noninferiority, risk, defect, or causal change. |
| Test Evidence | This names the observations or measurements allowed to bear on the claim. It ties the test to actual evidence instead of letting any convenient signal count. |
| Evidence Threshold | This sets how much evidence is enough to change interpretation or action. It can be numerical, procedural, legal, operational, or qualitative, but it must not move after seeing the result. |
| Error-Cost Profile | This explains the consequences of false positives and false negatives. It is what keeps the threshold from being arbitrary or purely conventional. |
| Decision Rule | This translates evidence into action status. It distinguishes “support the alternative,” “retain the default,” “insufficient evidence,” “monitor,” and “requires follow-up.” |
| Interpretation Rule | This constrains what can be said after the decision rule fires. It prevents significance from becoming importance, non-significance from becoming proof of absence, and association from becoming causation. |
| Interpretation Limit | This marks the boundary of the test frame. It reminds users when sampling, confounding, multiplicity, power, or magnitude questions remain unresolved. |
Common Mechanisms
Mechanisms implement the archetype; they are not the archetype itself.
- Null Hypothesis Significance Test (null_hypothesis_significance_test): A formal statistical mechanism for evaluating a default/null claim against an alternative under assumptions and thresholds. It implements the archetype only when the claim frame and interpretation limits are explicit.
- Decision Threshold Rule (decision_threshold_rule): A procedural mechanism that turns evidence into a status change. Thresholds are useful, but a threshold alone is not the archetype.
- Quality Acceptance Test (quality_acceptance_test): An operational mechanism for accepting, rejecting, or reworking a batch, product, service, or process. It works when acceptance criteria and error costs are explicit.
- A/B Test Interpretation Protocol (ab_test_interpretation_protocol): A product or service experimentation mechanism that interprets measured differences against planned metrics and launch criteria.
- Legal Burden-of-Proof Analog (legal_burden_of_proof_analog): A governance mechanism that assigns a default presumption and evidence burden before changing a status or imposing a consequence.
- Inspection Pass/Fail Test (inspection_pass_fail_test): A classification mechanism for checking whether an item meets criteria. It becomes weak when the criteria are arbitrary or adjusted after inspection.
- Falsification Protocol (falsification_protocol): A mechanism for stating what would count against a favored claim before evidence is reviewed.
- Equivalence or Noninferiority Test (equivalence_or_noninferiority_test): A mechanism for showing that a difference is small enough or that a new option is not unacceptably worse.
- Sequential Review Gate (sequential_review_gate): A workflow mechanism for interpreting evidence at predefined milestones. Repeated looks may require Multiple-Testing Discipline.
- Scientific Claim Evaluation Template (scientific_claim_evaluation_template): A template mechanism that prompts users to document claim, default, alternative, evidence, threshold, assumptions, error costs, and limits.
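To make the mechanism-versus-frame distinction concrete, here is a hedged sketch of one mechanism from the list, a null hypothesis significance test, run inside a pre-stated frame. It assumes a one-sided two-proportion z-test with the pooled normal approximation, a threshold `ALPHA` fixed in advance, and made-up counts; none of these choices come from the archetype itself.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """One-sided p-value for the alternative rate_b > rate_a (pooled z-test)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 1 - NormalDist().cdf(z)


# Frame fixed before looking at the data (all values hypothetical).
DEFAULT = "Variant B converts no better than A"
ALTERNATIVE = "Variant B converts better than A"
ALPHA = 0.05

p_value = two_proportion_p_value(success_a=200, n_a=2000, success_b=250, n_b=2000)
status = "reject the default" if p_value < ALPHA else "fail to reject the default"
print(status)  # reject the default
```

The function alone is only the mechanism. The surrounding constants are what turn its output into a claim status, and the result still says nothing about causal design or practical magnitude.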
Parameter / Tuning Dimensions
Important tuning dimensions include the stringency of the threshold, the asymmetry between false-positive and false-negative costs, the specificity of the alternative claim, the credibility and independence of evidence, the strength of assumptions, the availability of follow-up evidence, the decision stakes, and whether action is reversible.
A safety-critical frame usually sets a higher burden before approving action. A screening frame may lower the threshold to avoid missing plausible signals, but then require follow-up before irreversible action. A legal or governance frame may emphasize who bears the burden of proof. A product experimentation frame may balance false rollout risk against the cost of delaying a beneficial change.
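One simple decision-theoretic reading of this asymmetry can be sketched as below. It assumes the two error costs can be put on a common scale and that the evidence has been summarized as a probability `p` that the alternative is true; both are strong assumptions, and the function name and cost figures are hypothetical.

```python
def action_threshold(cost_false_accept: float, cost_false_reject: float) -> float:
    """Probability the alternative must exceed before acting on it.

    Acting on a false claim costs cost_false_accept; holding back on a true
    claim costs cost_false_reject. If p is the probability that the claim is
    true, acting has the lower expected cost when
        (1 - p) * cost_false_accept < p * cost_false_reject,
    which rearranges to p > cost_false_accept / (cost_false_accept + cost_false_reject).
    """
    return cost_false_accept / (cost_false_accept + cost_false_reject)


# Hypothetical cost ratios in arbitrary units.
safety_gate = action_threshold(cost_false_accept=99, cost_false_reject=1)  # 0.99
screening = action_threshold(cost_false_accept=1, cost_false_reject=4)     # 0.2
print(safety_gate, screening)
```

The safety-critical case demands near certainty before approval, while the screening case acts on weak signals, which is why it should route to follow-up rather than to irreversible action.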
Invariants to Preserve
The default claim, alternative claim, and evidence threshold must remain stable during interpretation. The frame must preserve the difference between insufficient evidence and evidence of absence. It must also preserve the difference between evidential status, causal attribution, practical magnitude, generalization, and ethical acceptability.
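The stability invariant can be enforced mechanically in some settings. The sketch below uses a frozen dataclass so that any attempt to change the terms after they are set raises an error; the class name `FixedTerms` and the example values are illustrative assumptions.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class FixedTerms:
    """Terms that must not change once interpretation begins."""
    default_claim: str
    alternative_claim: str
    evidence_threshold: float


terms = FixedTerms(
    default_claim="No improvement",
    alternative_claim="Improvement of at least 1 percentage point",
    evidence_threshold=0.01,  # hypothetical significance level
)

try:
    terms.evidence_threshold = 0.10  # attempt to relax the threshold after the fact
except FrozenInstanceError:
    print("threshold is locked")
```

Immutability in code is only a guard rail. The same invariant is usually protected by process, for example by recording the terms before data access.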
The archetype should never allow a mechanism to replace the frame. A p-value, pass/fail label, inspection score, burden-of-proof phrase, or significance star is only meaningful when embedded in the surrounding claim-evaluation architecture.
Target Outcomes
The target outcomes are disciplined claim evaluation, reduced threshold shopping, clearer burden of evidence, better calibration between evidence and action, and more honest communication of results. The frame should make it easier to say not only “what did we find?” but also “what were we trying to decide, what would have counted as enough, what mistakes were we guarding against, and what remains unsettled?”
A successful implementation does not eliminate uncertainty. It makes uncertainty actionable without pretending that the evidence proves more than it does.
Tradeoffs
The archetype trades interpretive flexibility for discipline. Pre-specified thresholds prevent opportunistic interpretation, but they can also be too rigid when unexpected evidence appears. Stricter thresholds reduce false positives but raise false-negative risk. Looser thresholds reduce misses but increase false alarms. Simple decision labels communicate clearly but can hide nuance.
The frame can also institutionalize power. Whoever defines the default and threshold can shape which claims are easy or hard to establish. This is useful when the default protects people from harm or arbitrary action, but dangerous when it protects incumbents from legitimate challenge.
Failure Modes
A common failure is post-hoc thresholding, where the evidence threshold is selected after seeing the result. Another is default claim drift, where the default changes to favor the preferred conclusion. p-value ritualism occurs when a statistical label replaces reasoning about claim, assumptions, error costs, and limits.
Non-significance as proof of absence is especially dangerous: failing to cross a threshold may mean evidence was weak, noisy, underpowered, or poorly aimed. Practical-irrelevance blindness occurs when a threshold-crossing result is treated as important despite tiny magnitude. Multiplicity leakage occurs when a single claim frame is applied to a result selected from many unreported attempts. Causal overreach occurs when the test frame is treated as proof of cause without causal design support.
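A short sketch shows why a non-detection from a small study carries little weight. It assumes a one-sided one-sample z-test with a known standard deviation, which is a textbook simplification; the function name and the study sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_power(effect: float, sigma: float, n: int, alpha: float = 0.05) -> float:
    """Probability of crossing the threshold when the true effect equals `effect`."""
    z_crit = NormalDist().inv_cdf(1 - alpha)  # critical value for the chosen alpha
    shift = effect * sqrt(n) / sigma          # true effect in standard-error units
    return 1 - NormalDist().cdf(z_crit - shift)


small_study = one_sided_power(effect=0.2, sigma=1.0, n=20)
large_study = one_sided_power(effect=0.2, sigma=1.0, n=400)
print(round(small_study, 2), round(large_study, 2))
```

With 20 observations the test would miss a real effect of this size most of the time, so "fail to reject" from that study cannot be read as evidence of absence.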
Neighbor Distinctions
Multiple-Testing Discipline governs many attempted claims, comparisons, metrics, or subgroups. Hypothesis Testing Frame can structure one planned claim; multiplicity discipline becomes necessary when the claim was selected from a larger search space.
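A small sketch of why selection from a larger search space matters, assuming the comparisons are independent and each uses the same threshold `alpha` (both simplifying assumptions; the function name is illustrative):

```python
def familywise_error_rate(alpha: float, comparisons: int) -> float:
    """Chance of at least one false positive across independent tests at level alpha."""
    return 1 - (1 - alpha) ** comparisons


one_planned_claim = familywise_error_rate(0.05, 1)
twenty_searched = familywise_error_rate(0.05, 20)
print(round(one_planned_claim, 2), round(twenty_searched, 2))  # 0.05 0.64
```

A frame built for one planned claim therefore understates the false-positive risk when the reported result was picked from twenty attempts.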
Power-Aware Design asks whether evidence collection can detect an effect worth acting on. Hypothesis Testing Frame asks how to interpret evidence against a default and alternative. A frame fed by an insensitive study can return “no evidence” without supporting “no effect.”
Effect Size Reporting asks how large or practically important the effect is. Hypothesis Testing Frame asks whether evidence crosses a claim threshold. A result can be statistically supported and practically trivial.
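The gap between statistical support and practical importance can be illustrated with a sketch. It assumes two equal-sized groups with a known common standard deviation and uses the standardized mean difference as the effect size; the function name and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_and_standardized_effect(mean_diff: float, sigma: float, n_per_group: int) -> tuple[float, float]:
    """Return the z statistic and the standardized mean difference."""
    se = sigma * sqrt(2 / n_per_group)  # standard error of the difference in means
    return mean_diff / se, mean_diff / sigma


z, d = z_and_standardized_effect(mean_diff=0.02, sigma=1.0, n_per_group=100_000)
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
print(round(z, 2), d, p_two_sided < 0.001)
```

Here the threshold is crossed comfortably while the effect is only two hundredths of a standard deviation, so the two questions need separate answers.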
Uncertainty Explicitness makes uncertainty visible. Hypothesis Testing Frame turns evidence and uncertainty into claim-status decisions under explicit error costs.
Confounder Control protects causal interpretation from third-variable distortion. Hypothesis Testing Frame protects claim interpretation from ambiguous thresholds and burden shifting.
Bayesian Belief Updating revises beliefs continuously using priors and evidence. Hypothesis Testing Frame usually creates a discrete claim-evaluation status, though Bayesian evidence can feed the frame.
Variants and Near Names
Recognized variants include the Null–Alternative Statistical Test Variant, Superiority Test Variant, Equivalence / Noninferiority Variant, Burden-of-Proof Variant, and Falsification Frame Variant. Near names include claim evaluation frame, null–alternative frame, evidence threshold frame, burden-of-proof frame, null hypothesis testing, and significance testing.
Collapsed candidates include p-value, statistical significance label, confidence interval, single pass/fail rule, and preregistration. These are useful mechanisms, statistics, representations, or support tools, but they should not be drafted as full archetypes here. Error Tradeoff Calibration remains a second-wave promotion candidate if it proves broader than the claim-testing frame.
Cross-Domain Examples
In scientific research, the archetype structures whether evidence supports a planned claim rather than allowing after-the-fact result storytelling. In product analytics, it keeps a team from shipping based on whichever metric looks best after the experiment. In manufacturing, it makes acceptance and rejection criteria auditable. In safety governance, it defines the burden of evidence before approving high-risk action. In legal or compliance contexts, it preserves the presumption and standard of proof. In diagnosis, it clarifies what evidence would support or weaken a working explanation.
Non-Examples
A p-value alone is not this archetype. A pass/fail rule with no explicit claim, default, threshold rationale, or error-cost logic is only a mechanism. A dashboard where many comparisons are searched and one favorable result is reported needs Multiple-Testing Discipline first. A causal claim drawn from confounded data needs Confounder Control. A report saying “no difference found” from a weak or tiny study needs Power-Aware Design before the non-detection can be interpreted confidently.