Reproducibility & Replicability

Prime #: 441
Origin domain: Statistics & Experimental Design
Also from: Philosophy
Aliases: Replication, Reproducible Research, Scientific Replication, Computational Reproducibility, Reproducibility
Related primes: Randomization, Hypothesis Testing (Null vs. Alternative), Statistical Significance (p-Value), Statistical Power, Selection Bias, Confounding, Effect Size, Sampling (Representativeness)

Core Idea

Reproducibility and replicability constitute the repeat-the-study-to-confirm-the-finding principle, which has four aspects.

(1) Reproducibility and replicability are complementary concepts about independent verification of scientific findings. Reproducibility refers to obtaining consistent results from the same data using the same analytic procedures: the "computational" sense in which a well-documented analysis can be rerun by a third party with the original data to yield identical results. Replicability refers to obtaining consistent results from new data using similar methods: the "scientific" sense in which an effect reported in one study can be observed again in an independent study collecting fresh data under similar conditions. The two concepts were traditionally conflated in the word "replication," but the 2019 National Academies of Sciences report Reproducibility and Replicability in Science formalized the distinction^[1], with reproducibility treating same-data-same-analysis as the computational standard and replicability treating new-data-same-design as the scientific-confirmation standard. Both concepts rest on the philosophical commitment, rooted in Popper (falsifiability; 1934 Logik der Forschung), the hypothetico-deductive tradition, and the sociology of science (Merton's norms of universalism and communism), that scientific knowledge must be independently verifiable rather than depending on singular authoritative demonstrations. The multi_origin_equal flag reflects genuine co-origin in experimental-design/statistics (where replication is a foundational concept for valid inference; Fisher 1935: "the greater the precision aimed at, the greater the demand for replication") and philosophy of science (Popper; Lakatos; Kitcher; contemporary philosophy-of-science analyses of replication).

(2) The concept has several identifiable components and distinctions: reproducibility (same data, same code, same result; a computational standard), replicability (new data, same methods, consistent result; a scientific standard), generalizability (effect holds under deliberately varied conditions; a more stringent standard), robustness (effect holds under different analytical specifications; a sensitivity standard), direct replication (attempting to reproduce the original study as closely as possible), conceptual replication (testing the same theoretical prediction with deliberately different methods), pre-registered replication (replication attempt registered before conducting, with pre-specified success criteria), many-labs replication (coordinated multi-laboratory replication effort), registered report (journal commits to publication based on methodology rather than results), replication crisis (the empirical observation, documented across psychology, medicine, economics, and other fields, that many published findings fail to replicate; Ioannidis 2005 "Why most published research findings are false"^[2]; Open Science Collaboration 2015 Science paper reporting 36% replication rate in psychology; Camerer et al. experimental-economics replications; Begley-Ellis 2012 preclinical-cancer-research; multiple-field replication initiatives), and reproducibility crisis (the computational version: inability to reproduce published computational results from available code and data, as documented in Nature surveys and discipline-specific audits).

(3) The deeper logic is that a single study, no matter how carefully conducted and analyzed, has a substantial probability of producing a misleading finding. The causes include sampling variability, selection effects, the winner's curse from the publication threshold, p-hacking, garden-of-forking-paths flexibility, and the fundamental statistical fact that published effects in significance-selected studies over-represent the high-variance tail of their sampling distributions. Independent replication is therefore not an optional refinement but a constitutive feature of reliable scientific knowledge. The contested_construct flag reflects ongoing disagreement about (a) whether the "crisis" is a crisis or normal scientific self-correction, (b) how to define replication success (exact-effect-size match, confidence-interval overlap, same-sign, same-sign-and-significance), (c) what replication rates should be expected given real-effect heterogeneity and statistical power, (d) which replication failures reflect original false positives versus replication false negatives (low replication-study power) versus true between-study heterogeneity in effect sizes, and (e) whether replication can ever be "exact" given inevitable context-and-population differences. The consensus position across most methodological reformers is that current scientific incentives (publication bias toward novel significant findings; career rewards for novelty over verification; under-resourcing of replication; rare data sharing) produce systematically less-replicable research than the epistemic ideal demands, and that reform in multiple dimensions (pre-registration; registered reports; data-and-code sharing; replication funding; replication-positive journals; meta-science infrastructure) has been underway since the 2011-2015 reform wave.

(4) The concept appears across domains: psychology and behavioral science (Open Science Collaboration 2015 Reproducibility Project; Many Labs consortium; registered reports at Royal Society Open Science, Cortex, Nature Human Behaviour; ≈36% replication rate in psychology flagship study; individual cases including Bem ESP 2011 non-replication and priming research broadly), medicine and biomedicine (Begley-Ellis 2012 Nature report that 47 of 53 "landmark" preclinical cancer studies did not replicate; Prinz-Schlange-Asadullah 2011 Bayer in-house audit showing 21% replication rate; Ioannidis 2005 foundational essay; Reproducibility Project: Cancer Biology; direct-to-human research more stringent than preclinical), economics and econometrics (Camerer et al. 2016 Science replication of 18 experimental-economics studies with 61% replication rate; Camerer et al. 2018 social-science-broad replication with 62% rate; AEA data and code availability policy since 2005; replication-in-economics journals), political science and sociology (multiple replication initiatives; TESS data-sharing; some high-profile non-replications including LaCour 2014 retraction), computational and data science (computational reproducibility distinct from scientific replicability; "reproducibility crisis" in ML where results depend on undisclosed hyperparameters, random seeds, implementation details; NeurIPS and ICML reproducibility checklists), genetics and genomics (early GWAS era replete with non-replicating candidate-gene studies; contemporary GWAS requires explicit replication in independent cohorts for publication), drug development and clinical trials (Phase 3 replication expectation for regulatory approval, typically two positive trials; FDA/EMA review of replication patterns across submitted evidence), ecology and evolutionary biology (replication studies more recent but growing; SCORE and SPEC projects in ecology), physics and chemistry (independent-laboratory confirmation of major experimental results as standard; Higgs boson 5σ evidence required independent confirmation from ATLAS and CMS; traditional physics-culture expectation of cross-laboratory reproduction), and meta-science and science-of-science (Bak-Coleman and others using replication data to study scientific production; Meta-Research Innovation Center at Stanford; the TIER protocol; the FAIR data principles). Across these, the repeat-the-study-to-confirm principle is shared, with domain-specific replication norms, failure modes, and reform efforts.
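
The selection effect described in point (3) can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a model of any particular literature: the true standardized effect (0.2), the per-group sample size (20), and the function name simulate_winners_curse are all hypothetical, and each study's estimate is drawn from a normal approximation to its sampling distribution. Only studies that reach significance in the positive direction are treated as "published."

```python
import random
from statistics import NormalDist, mean


def simulate_winners_curse(true_d=0.2, n_per_group=20, n_studies=5000,
                           alpha=0.05, seed=0):
    """Simulate many small studies of one true effect and compare the average
    estimate across all studies with the average among 'published' ones.

    All default values are illustrative assumptions.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is itself reproducible
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Approximate standard error of a standardized mean difference with
    # equal group sizes (assumption: normal approximation, small effect).
    se = (2 / n_per_group) ** 0.5

    estimates = [rng.gauss(true_d, se) for _ in range(n_studies)]
    # "Published" = significant in the positive direction.
    published = [d for d in estimates if d / se > z_crit]

    return {
        "mean_all": mean(estimates),
        "mean_published": mean(published),
        "share_published": len(published) / n_studies,
    }


if __name__ == "__main__":
    for key, value in simulate_winners_curse().items():
        print(f"{key}: {value:.3f}")
```

Under these assumed settings the average across all simulated studies stays close to the true effect, while the average of the significant subset is far larger, because a small study can only cross the significance threshold when its estimate lands well above the truth.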

How would you explain it like I'm…

Check It Again

If someone bakes a cake and says it tastes amazing, you should not believe them until another person, in another kitchen, bakes it the same way and gets the same yummy cake. Science is like that. One person finding something cool is not enough. Other people have to do the test again and get the same answer before we really trust it.

Other Scientists Checking the Result

Scientists try to find true facts about the world, but a single experiment can give the wrong answer by accident. So they check each other's work in two ways. Reproducibility means: if I take your data and your computer code, do I get the same numbers? Replicability means: if I run a new experiment like yours, do I get the same kind of result? When lots of studies fail this check, scientists call it a replication crisis, and they work on better habits — like sharing data and writing down their plan before they start.

Independent Verification of Findings

Reproducibility and replicability are the two ways science checks itself. Reproducibility is the computational standard: another researcher takes your data and your analysis code and gets your exact numbers. Replicability is the scientific standard: another team collects fresh data using similar methods and finds a similar result. The 2019 National Academies report made this distinction official because the words used to be used interchangeably. The reason both matter is that any single study can be misleading — through random chance, selective reporting, hidden choices in analysis, or publication bias that favors surprising results. Since around 2011, large projects in psychology, medicine, and economics have shown that a sizable share of published findings don't replicate, sparking reforms like preregistration and open data.

Reproducibility and replicability are the twin standards by which scientific findings earn the status of reliable knowledge through independent verification. Reproducibility, in the contemporary technical sense, refers to the computational standard: another investigator, given the original data and analysis code, should be able to recompute the reported numerical results exactly. Replicability refers to the scientific standard: an independent team, collecting fresh data under similar conditions and applying similar methods, should obtain consistent results. The 2019 National Academies of Sciences report formalized this distinction, which had previously been muddled under the single word "replication." Both rest on a philosophical commitment, rooted in Popper's falsifiability and Merton's norms of universalism and communism, that scientific claims must be checkable by anyone, not authoritative pronouncements. The construct gained urgency with the "replication crisis," launched by Ioannidis's 2005 argument that most published findings may be false and confirmed by the Open Science Collaboration's 2015 Reproducibility Project (around 36% replication rate in psychology), Begley and Ellis's 2012 Nature audit (47 of 53 preclinical cancer landmarks did not replicate), and parallel results in experimental economics. The diagnosed causes — publication bias toward novel significant results, p-hacking, garden-of-forking-paths analytic flexibility, the winner's curse in selected studies, underpowered designs — have driven a reform wave including preregistration, registered reports, mandatory data-and-code sharing, replication-positive journals, and meta-science infrastructure.

Structural Signature

A reproducibility-and-replicability assessment exhibits the following six core role-phrases:

the original-study artifact-and-result — a published finding from an original study, typically a specific claim (effect exists, effect has specific magnitude and direction, association is statistically significant, pattern is observable), published in a peer-reviewed venue with sufficient detail that readers might evaluate or replicate it.

the independent-replication attempt under matched conditions — either the computational rerun (reproducibility, same data and code) or a new data-collection study following the original methodology (replicability, new data, same design), conducted with sufficient independence and methodological fidelity to test whether the original finding holds.

the operational-definition of replication success — a pre-specified criterion for what would count as successful reproduction or replication (match of computational results, same-sign significant result in new data, confidence-interval overlap, effect-size match within tolerance, or other definition used in the original study's field or replication protocol).

the documentation-and-transparency requirement — availability of original data and code in usable form (reproducibility requirement), and availability of original methodology documentation sufficient to re-implement the study (replicability requirement), enabling independent verification.

the power-and-statistical-adequacy consideration — replication sample size must be adequate to detect the original effect size with reasonable probability (typically ≥80% power to detect the original reported effect, or ≥80% power to detect a smaller plausible effect if accounting for winner's-curse inflation).

the result-interpretation and aggregation framework — analysis of reproduction/replication results against the original finding using pre-specified success criteria; interpretation of outcomes (successful replication strengthens evidence; failed replication triggers diagnostic questions about original false positive vs. replication false negative vs. population heterogeneity vs. method differences); meta-analytic synthesis and aggregation across multiple replication attempts when available.

When these elements are in place, replication functions as the self-correction mechanism of science; when they are absent (no data-and-code sharing; inadequate methodology documentation; under-powered replications; ambiguous success criteria), replication efforts produce inconclusive results and cannot correct the scientific record.
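
The power-and-statistical-adequacy consideration above can be made concrete with a rough sample-size calculation. The sketch below uses the standard normal-approximation formula for a two-sided, two-group comparison of means, n per group ≈ 2((z_{1-α/2} + z_{power}) / d)². The function name and the example effect sizes (an original d of 0.4, and a halved d of 0.2 to allow for winner's-curse inflation) are illustrative assumptions, not values from any study discussed here. An exact t-based calculation would give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate per-group sample size for a two-sided, two-sample
    comparison of means with standardized effect size d
    (normal approximation; hypothetical helper for illustration)."""
    if d == 0:
        raise ValueError("effect size must be non-zero")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


if __name__ == "__main__":
    original_d = 0.4             # assumed effect reported by the original study
    deflated_d = original_d / 2  # assumed plausible true effect after inflation
    print("n per group for original effect:", n_per_group(original_d))
    print("n per group for halved effect:  ", n_per_group(deflated_d))
```

Halving the assumed effect roughly quadruples the required sample, which is why a replication powered only for the originally reported effect can still be under-powered for the true one.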

What It Is Not

Not redundant with statistical significance — a single study reaching statistical significance does not establish the finding; significance controls the Type I error rate for that test, not the reliability of the finding across studies. Replication provides a different and stronger evidential layer.
Not a guarantee of truth even when successful — multiple consistent findings can all be produced by shared methodological flaws, publication bias, measurement-instrument artifacts, or stable-but-invalid confounders. Replication is necessary but not sufficient for confident inference.
Not impossible in principle — some philosophical discussions have argued that no replication is "exact" because population, time, and context always differ, concluding that replication is a meaningless ideal. The pragmatic response (embraced by most of the replication-reform movement) is that "close-enough" replication is possible and informative, and that deliberate variation can probe generalizability rather than undermining replication as a concept.
Not solely about fraud detection — replication failures can arise from many causes (chance; p-hacking; selection bias; population heterogeneity; methodological artifacts; measurement differences; true between-study effect-size variation) with fraud representing a tiny fraction. Treating replication primarily as fraud detection misframes the function.
Not equivalent across fields — conventions for what counts as replication differ substantially across disciplines. Physics culture expects independent-laboratory confirmation as a routine standard; psychology historically treated single-study findings as sufficient until the replication-crisis wave; economics relies on applied-policy evidence that is hard to replicate without natural-experiment re-emergence. Field-specific norms shape what replication looks like.
Not addressable only by replicators — the conditions for replicability are set by original-study authors (methodology documentation; data sharing; pre-registration; analysis transparency) and by the publication and funding systems (rewarding replication efforts; supporting data infrastructure). Replication success is a systemic property, not solely a replicator's achievement.
Not meaningfully assessed by a single replication attempt — failed replication can occur due to replication-study shortcomings (low power; design error; population differences) rather than original-study errors. Multiple replication attempts, preferably with different methods and contexts, provide the evidential base for confident conclusions.
Not the same as "robustness" in the sense of analytic-decision sensitivity — robustness checks (varying analytic specifications within a study) are useful but distinct from replication (new data). Both are important for reliable inference but operate at different levels.
Not orthogonal to statistical power — low-powered original studies produce inflated effect estimates (winner's curse); adequately-powered replications then produce smaller estimates, often failing to reach significance even when the true effect exists. Power considerations are central to replication interpretation.
Not a simple binary outcome — "successful" and "failed" replication categories oversimplify. Effects can replicate in sign but not magnitude; in some populations but not others; statistically significantly but with smaller effects than original; or the reverse. Continuous reporting of replication results (effect-size estimate, CI, power, comparison to original CI) is more informative than binary classification.

Broad Use

Psychology and behavioral science (canonical replication-crisis context): Bem 2011 ESP paper published in a major journal triggered methodological concerns. Open Science Collaboration 2015 Reproducibility Project: Psychology replicated 100 published studies with ≈36% replication rate (varying by replication definition); average replication effect size was about half the original. Many Labs Project (Klein et al. 2014, 2018, 2022) conducted coordinated multi-laboratory replications with more consistent patterns than single-lab replications. Registered reports at Royal Society Open Science, Cortex, Nature Human Behaviour, Psychological Science commit to publication based on pre-registered methodology rather than results. Pre-registration infrastructure (AsPredicted, OSF) became widespread. Power-analysis standards for sample size improved. Priming research, ego-depletion, and other subfield-specific replication failures prompted theoretical reassessment.
Medicine and biomedicine: Ioannidis 2005 PLoS Medicine "Why most published research findings are false" used prior-probability reasoning to argue that most published findings are likely false under plausible assumptions. Begley-Ellis 2012 reported that only 6 of 53 "landmark" preclinical oncology studies replicated, an 11% replication rate in a disease area with high-stakes translation. Prinz-Schlange-Asadullah 2011 Bayer in-house audit showed 21% replication rate across disease areas. Reproducibility Project: Cancer Biology (2014-2021) set out to replicate experiments from 53 high-profile cancer papers and ultimately completed 50 experiments from 23 papers, with mixed results. Clinical-trial replication is somewhat better (Phase 3 trials with pre-registration and regulatory oversight) but preclinical research remains concerning. FDA and EMA increasingly scrutinize replication patterns and pre-registration in approval decisions.
Economics and econometrics: Camerer et al. 2016 replication of 18 experimental-economics studies found 61% replication rate. Camerer et al. 2018 Nature Human Behaviour social-science replication found 62% rate across 21 studies from Nature and Science. Econometric replication requires different definitions (quasi-experimental findings rely on specific natural experiments not available for re-examination; structural models depend on specific assumption-sets). AEA data and code availability policy since 2005; American Economic Review and others enforce replication-file availability. Credibility revolution in empirical economics (Angrist-Pischke) overlaps with replication-consciousness.
Political science and sociology: Multiple replication initiatives including TESS (Time-sharing Experiments for the Social Sciences). High-profile cases include LaCour 2014 retraction. Data-sharing norms improving but uneven.
Computational and data science (canonical computational-reproducibility context): Machine learning has its own reproducibility crisis, in which published results often depend on undisclosed hyperparameters, random seeds, data-preprocessing choices, and implementation details. Hutson 2018 reported that most ML papers cannot be reproduced from their publications alone. NeurIPS, ICML, and other venues added reproducibility checklists. MLPerf benchmarks and reproducibility challenges. "Papers with Code" movement. Docker containers and cloud notebooks for computational-reproducibility support.
Genetics and genomics: Early GWAS era was replete with non-replicating candidate-gene studies — small-sample studies proposed associations that failed in larger samples. Contemporary GWAS explicitly requires replication in independent cohorts for publication (typically discovery cohort plus replication cohort). The 5×10⁻⁸ genome-wide-significance threshold plus required replication has produced highly reliable GWAS findings. Replication culture in human genetics is generally strong relative to many biomedical subfields.
Drug development and clinical trials: Phase 3 replication expectation for regulatory approval — typically two positive trials demonstrating efficacy before full drug approval. Regulatory reviewers scrutinize replication consistency; approval often discusses whether the second trial truly replicated the first. Pre-specified primary endpoints and statistical-analysis plans codify replication standards across the regulatory pipeline. Post-approval effectiveness studies extend the evidence base.
Ecology and evolutionary biology: SPEC (Ecology Replication Project) and related efforts. Field ecology faces unique replication challenges (natural variation; site-specific ecology; long-term timescales) but meta-analytic synthesis and coordinated experimental networks partially address replication concerns. Nakagawa and colleagues have developed replication methodology for ecological contexts.
Physics and chemistry: Traditional physics culture has strong replication norms. Major experimental results (Higgs boson; gravitational waves; recent cosmological measurements) require independent-laboratory confirmation. Tabletop-physics results are expected to replicate across laboratories. The replication crisis has been less prominent in physics than in biomedicine and behavioral science, partly reflecting culture and partly reflecting publication practices.
Meta-science and science-of-science: Stanford Meta-Research Innovation Center (METRICS) founded by Ioannidis 2014. The Center for Open Science (COS) founded 2013. Studies of replication rates, pre-registration adoption, data-sharing rates, and science-reform impact. Bak-Coleman and others use replication data to study scientific production dynamics. The FAIR data principles (Findable, Accessible, Interoperable, Reusable) codify data-infrastructure norms supporting reproducibility.
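
The prior-probability reasoning attributed to Ioannidis 2005 in the medicine entry above can be sketched in a few lines. This uses the simplest form of the argument, with no bias term: if R is the pre-study odds that a tested relationship is true, the probability that a significant finding is true is PPV = (1 − β)R / ((1 − β)R + α), where 1 − β is power and α is the significance threshold. The function name and the example inputs are hypothetical choices for illustration.

```python
def positive_predictive_value(prior_odds, power=0.80, alpha=0.05):
    """Share of significant findings that reflect true relationships,
    in the simplest no-bias version of the prior-probability argument.

    prior_odds: assumed ratio of true to false hypotheses being tested (R).
    """
    if prior_odds <= 0:
        raise ValueError("prior_odds must be positive")
    # Per false hypothesis tested there are prior_odds true ones.
    true_positives = power * prior_odds  # true effects that reach significance
    false_positives = alpha              # null effects that reach significance
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Illustrative scenarios only; the odds and power values are assumptions.
    scenarios = [
        ("well-powered test, 1:1 prior odds", 1.0, 0.80),
        ("under-powered test, 1:10 prior odds", 0.1, 0.20),
    ]
    for label, odds, power in scenarios:
        ppv = positive_predictive_value(odds, power=power)
        print(f"{label}: PPV = {ppv:.2f}")
```

In this simplified model, long-shot hypotheses tested at low power yield significant findings that are more often false than true, while well-powered tests of plausible hypotheses do not.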

Clarity

Names the specific epistemic standards — reproducibility (same data) and replicability (new data)^[1] — that distinguish reliable scientific findings from singular-study artifacts. Without the frame, people treat single-study findings as established knowledge, fail to distinguish computational reproducibility from scientific replicability, and miss the systematic reasons why original published findings may not replicate (publication bias, winner's curse, p-hacking, low power, selection effects). With the frame, diagnosis becomes specific: has this finding been replicated, and under what conditions? Was the replication preregistered with pre-specified success criteria? Was the replication adequately powered to detect an effect of the size the original reported — or of a smaller plausible true effect size? Is the original study's data and code available for reproducibility checking? If replication has not been conducted, what is the baseline expectation given the field's replication-rate patterns? Is the original finding in a subfield with strong replication culture (physics, genetics, clinical trials) or weak (preclinical biology, social psychology)? Does the finding rest on a single high-powered replication or on many independent replications? How should I weight this evidence in my decisions? The frame clarifies that scientific findings exist on a continuum from single-study-speculation to multiply-replicated-consensus, and that the evidential weight of a finding depends on where it sits on that continuum.

Manages Complexity

Decomposes reliability into computational (reproducibility), scientific (replicability), methodological (robustness), and epistemic (generalizability) dimensions^[1], each with distinct defenses and assessments. Cross-domain transfer is productive: pre-registration from clinical trials to psychology to economics; data-sharing requirements from genetics to economics to social sciences; registered reports from psychology to biomedicine; meta-analytic replication aggregation from medicine to all empirical fields; computational-reproducibility infrastructure (Docker, notebooks, cloud computing) from computer science to all computational sciences. The decomposition reveals interplay with other primes: randomization (#432) — proper randomization is a condition for reliable replication; hypothesis testing (#434) — replication aggregates across single-study testing; statistical significance / p-value (#435) — p-value interpretation is compromised without replication expectation; statistical power (#437) — under-powered original studies produce winner's curse and non-replicable findings; selection bias (#440) — publication bias is selection bias in literature, a primary driver of replication failure; confounding (#438) — unmeasured confounders differing across replication contexts produce non-replicable findings; effect size (#447) — replications measure whether effect sizes hold up; sampling representativeness (#433) — replications in different populations test generalizability.

Abstract Reasoning

The analyst asks: what is the evidential status of this finding — single study, one replication, multiple replications across contexts?^[3] What field does it come from, and what are that field's replication-rate patterns and reform maturity? Is the original study well-powered, pre-registered, with transparent data and code, or under-powered with undisclosed analytic flexibility? Is the effect magnitude consistent with what statistically-plausible winner's-curse arguments would imply for a single-study significant finding? If replication studies exist, were they pre-registered, adequately powered, and conducted with methodological fidelity to the original? If replications failed, what explains the failure — original false positive, replication false negative, between-study heterogeneity, methodological differences? How should I weight this evidence relative to singular-study findings in fields with stronger replication cultures? Is there sufficient aggregated evidence for practical-decision confidence, or is more replication needed? Mature practice reads single-study findings with appropriate skepticism, tracks replication status of findings they rely on, supports reform through pre-registration and data-sharing, and treats scientific reliability as emerging from accumulated evidence rather than authoritative-source publication. Immature practice treats published findings as established knowledge, ignores replication status, and treats replication failure as individual-study failure rather than as systemic methodological concern.

Knowledge Transfer

Domain	Replication-rate estimate	Primary failure-mode	Ongoing reform
Social psychology	~25-50% (varies by definition)	Low power; flexibility	Pre-registration; registered reports
Cognitive psychology	~50-65%	Effect-size inflation	Power improvements
Experimental economics	~60%	Context dependence	Data-sharing mandates
Preclinical cancer biology	~10-25%	Methodological diversity; bias	Biotech in-house audits
Clinical trials (regulatory)	High (with 2-trial rule)	Post-approval drift	Real-world evidence
GWAS	High (with replication-cohort rule)	Stringent thresholds work	Continuing
Machine learning	Low (computational reproducibility)	Undisclosed details	Reproducibility checklists
Physics (major results)	High	Systematic biases rare	Independent-lab verification
Ecology	Variable; emerging	Site-specificity; natural variation	Coordinated networks
Computational chemistry	Low-to-variable	Code/data availability	FAIR principles

Across rows: the core logic — repeated independent observation as evidential standard — transfers across domains, with characteristic failure modes and reform trajectories.

Examples

Formal/abstract

The Open Science Collaboration's 2015 Science paper "Estimating the reproducibility of psychological science" (Science 349, aac4716) is the canonical modern replication-study case. Led by Brian Nosek at the University of Virginia and the Center for Open Science, the project coordinated a multi-year, multi-laboratory effort to conduct high-fidelity replications of 100 published studies from three top psychology journals (Psychological Science, Journal of Personality and Social Psychology, Journal of Experimental Psychology: Learning, Memory, and Cognition) published in 2008. The replication studies followed the original protocols as closely as possible, typically with larger sample sizes than the original (median replication sample size was 2.5 times the original, attempting to provide >80% power to detect the original effect size)^[4]. Replication success was operationalized through multiple metrics: (a) did the replication produce a statistically significant result in the same direction as the original (significance-based)? (b) was the original effect size within the replication's 95% confidence interval (interval-overlap-based)? (c) did the meta-analytic combination of original and replication provide significant evidence (combined-evidence-based)? (d) did the replication's authors subjectively judge the result a successful replication (subjective)? Across these metrics, replication rates were approximately: 36% by same-direction significance; 47% by interval overlap; 68% by combined-evidence; 39% by subjective judgment^[5]. The mean replication effect size was approximately half the original (0.197 vs. 0.403 standardized effect), consistent with substantial winner's-curse inflation in original published effects.
Replication success was correlated with original effect size (larger originals more likely to replicate); original p-value (stronger originals more likely); original effect surprising-ness (surprising findings less likely to replicate); and subfield (cognitive-psychology results replicated better than social-psychology). Individual cases included notable non-replications (some priming-research findings; some dual-process cognitive findings) and notable successful replications (many perceptual and memory findings). The paper's publication in Science generated massive methodological-reform impact: (i) It quantified for the first time the psychology replication problem at scale with rigorous methodology, converting anecdotal concerns into statistical evidence^[6]. (ii) It distinguished different definitions of replication success, showing that the estimated rate depends substantially on definition. (iii) It motivated subsequent initiatives — Reproducibility Project: Cancer Biology (biomedical analogue), Many Labs Project (multi-lab coordinated replications), Social Psychology Association Center of Excellence for Open Science. (iv) It accelerated pre-registration adoption in psychology — AsPredicted, OSF pre-registration, Registered Reports formats in multiple journals. (v) It prompted power reform — minimum-sample-size expectations and explicit power-justification requirements became standard at many psychology journals. (vi) It prompted theoretical reassessment of multiple subfields (ego depletion, priming effects, facial feedback hypothesis — all substantially re-examined post-replication). (vii) It entered the broader scientific and public discourse as a central reference for "replication crisis" claims. Subsequent methodological work extended the findings: (a) Camerer et al. 2016 (experimental economics, ~60% replication), 2018 (social-science-broad, ~62% replication)^[7]. 
(b) Many Labs consortium's more-systematic replications showing more reliable patterns than single-lab replications in many cases. (c) Open Science Framework's growth as infrastructure for pre-registration and data-sharing. (d) Institutional and funding-agency requirements for data-sharing and pre-registration. (e) Multiple-Aims-Score (MASS) and other aggregated-evidence approaches to meta-analytic replication synthesis. (f) Field-specific replications in economics, neuroscience, ecology, and other domains. The Open Science Collaboration's 2015 paper represents a benchmark example of the replication-as-scientific-quality-assessment concept: the explicit, large-scale, rigorous, pre-registered multi-laboratory effort to measure scientific reliability, with results that were consequential for methodological reform and that continue to shape research practice a decade later.
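The point that the estimated replication rate depends on the definition of success can be made concrete with a small sketch. The code below is an illustration, not the Open Science Collaboration's analysis pipeline: it assumes effects are expressed as correlations, uses the Fisher z transformation with standard error 1/sqrt(n - 3), and combines the two studies with a fixed-effect inverse-variance weight. The function name `replication_metrics` and the example numbers are hypothetical.

```python
import math
from statistics import NormalDist

# Two-sided 95% critical value of the standard normal (about 1.96).
Z95 = NormalDist().inv_cdf(0.975)


def replication_metrics(r_orig, n_orig, r_rep, n_rep):
    """Evaluate one original/replication pair under three success criteria.

    Assumption: effects are Pearson correlations, analysed on the Fisher z
    scale, where the standard error is approximately 1 / sqrt(n - 3).
    """
    z_o, z_r = math.atanh(r_orig), math.atanh(r_rep)
    se_o, se_r = 1 / math.sqrt(n_orig - 3), 1 / math.sqrt(n_rep - 3)

    # (a) Significance-based: replication significant, same sign as original.
    same_direction = (r_orig * r_rep) > 0
    rep_significant = abs(z_r / se_r) > Z95

    # (b) Interval-overlap-based: original estimate inside replication 95% CI.
    lo, hi = z_r - Z95 * se_r, z_r + Z95 * se_r

    # (c) Combined-evidence-based: fixed-effect meta-analysis of both studies.
    w_o, w_r = 1 / se_o**2, 1 / se_r**2
    z_comb = (w_o * z_o + w_r * z_r) / (w_o + w_r)
    se_comb = math.sqrt(1 / (w_o + w_r))

    return {
        "significant_same_direction": same_direction and rep_significant,
        "original_in_replication_ci": lo <= z_o <= hi,
        "combined_significant": abs(z_comb / se_comb) > Z95,
    }


# Hypothetical pair: a small original study and a larger replication.
example = replication_metrics(r_orig=0.40, n_orig=40, r_rep=0.15, n_rep=100)
print(example)
```

For this hypothetical pair the replication counts as a failure under criteria (a) and (b) but as a success under criterion (c), which is the kind of disagreement between definitions that the project reported in aggregate.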

Mapped back: The OSC formal case establishes the canonical framework for replication assessment: design, operationalization of success criteria, analysis across multiple definitions, outcome interpretation (including effect-size attenuation consistent with publication-bias theory), and downstream institutional impact demonstrating how rigorous replication evidence drives methodological reform.
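The effect-size attenuation mentioned above is what selective publication of significant results would be expected to produce. The simulation below is a minimal sketch under stated assumptions, not a model of the actual psychology literature: every study estimates the same true standardized effect (`true_d`, hypothetical), outcomes have unit variance, and a study is "published" only if a one-sided known-variance z-test is significant.

```python
import random
import statistics


def simulate_winners_curse(true_d=0.2, n_per_group=20, n_studies=5000, seed=441):
    """Compare the mean estimate across all studies with the mean among
    'published' (significant) studies. All parameters are illustrative."""
    rng = random.Random(seed)  # fixed seed so the simulation itself is rerunnable
    se = (2 / n_per_group) ** 0.5  # SE of a mean difference with unit variances
    all_estimates, published = [], []
    for _ in range(n_studies):
        treated = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        estimate = statistics.fmean(treated) - statistics.fmean(control)
        all_estimates.append(estimate)
        if estimate / se > 1.96:  # publication filter: significant and positive
            published.append(estimate)
    return {
        "mean_all": statistics.fmean(all_estimates),
        "mean_published": statistics.fmean(published),
        "share_published": len(published) / n_studies,
    }


sim = simulate_winners_curse()
print(sim)
```

Because only estimates above the significance threshold pass the filter, the published mean necessarily exceeds the true effect in this setup, and an adequately powered replication would be expected to return a smaller estimate even though the effect is real.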

Applied/industry¶

A mid-sized consumer-finance company's data-science team discovers a promising pattern in its customer-transaction data: customers who use a specific combination of payment features appear to have substantially lower default rates on credit products than customers who do not. The original analysis, based on 18 months of transaction data covering approximately 2.1 million customer-months, identifies a statistical relationship with a hazard ratio of approximately 0.72 (28% reduction in default rates) among customers using the feature combination, p < 0.001, with effects persisting after adjustment for credit-score decile, account tenure, product mix, and demographic controls. The team is excited and proposes a marketing campaign to promote the feature combination, projecting $18M annual reduction in default costs at moderate marketing investment. Before launching, the Chief Risk Officer requests a more rigorous verification, specifically raising concerns about: (a) Whether the finding would replicate in a separate dataset and time window, (b) Whether the finding reflects causation or selection, since customers who choose to use certain features may differ systematically from those who do not, and (c) Whether a proposed marketing intervention would produce the same effect as the observed passive association^[8]. The verification plan addresses all three concerns: (i) Out-of-sample reproducibility: Re-running the original analysis on a held-out 6-month window following the original 18-month window, with the same analytic code and procedures. Result: Hazard ratio replicates at 0.78 (95% CI 0.72-0.85), slightly attenuated from the original 0.72 but still showing substantial association. The reproducibility check confirms the pattern is not an artifact of the specific time window. 
(ii) Selection-bias diagnostic: Analysis of customer characteristics at the time of feature-combination adoption, showing that adopters differ systematically from non-adopters on multiple pre-adoption variables (higher credit-score, longer tenure, lower income volatility, better payment history), not all of which were in the original adjustment set. Propensity-score-adjusted analysis using the expanded covariate set yields a hazard ratio of 0.89 (95% CI 0.82-0.97): an 11% reduction, substantially smaller than the naive 28% but still significant. The selection-bias diagnostic suggests that much of the observed effect reflects customer-quality selection rather than causal feature effect^[9]. (iii) Causal verification via randomized promotion: A randomized experiment is designed to promote the feature combination to a random sample of customers eligible to adopt, with a comparable control group receiving no promotion. Promotion increases feature-combination adoption from 8% to 24% among the promoted group. Intention-to-treat analysis of default rates: Hazard ratio 0.96 (95% CI 0.88-1.05), statistically indistinguishable from no effect. Complier average causal effect (CACE, using random promotion as an instrument for feature adoption): 0.85 (95% CI 0.72-1.00), suggesting some effect, but with a CI that includes 1.0 and an estimate substantially weaker than the original 0.72 observational estimate. The three-stage verification tells a clear story: (α) The original pattern reproduces in held-out data (not a spurious artifact), but (β) selection bias explains much of the observed effect (customers adopting the feature combination were pre-selected on lower-default-risk characteristics), and (γ) randomized promotion produces much smaller causal effects than observational analysis suggested^[2]. 
The Chief Risk Officer's decision: Do not launch the originally-proposed marketing campaign; the projected $18M annual default reduction was substantially overestimated because the original observational estimate conflated selection and causal effects. However, a smaller modified marketing campaign targeting specific customer segments where the CACE remained meaningfully non-zero is approved, with pre-registered randomized expansion to test if the effect holds in broader segments. The case illustrates reproducibility-and-replicability analysis as practical application to business decision-making: computational reproducibility (same analysis on held-out data) confirms the pattern is real in the data; scientific replicability (randomized follow-up) reveals that the observational pattern's causal interpretation was substantially overestimated; business decision is made on the more-rigorous evidence base. The contrast between the $18M naive estimate and the much smaller causally-supported opportunity — and the distinction between "pattern in data" (which reproduced) and "causal effect of intervention" (which was much smaller) — makes both methodological points vivid.
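The step from the intention-to-treat estimate to the complier average causal effect can be sketched with the standard Wald (instrumental-variable) ratio: the effect of promotion on the outcome divided by the effect of promotion on adoption. The sketch below is a simplification under stated assumptions. It works on the risk-difference scale rather than the hazard-ratio scale used in the case, so it is not expected to reproduce the 0.85 figure; the adoption rates (8% and 24%) come from the case, while the default counts and the function name `itt_and_cace` are hypothetical. It also assumes promotion affects default only through adoption and that promotion never discourages adoption.

```python
def itt_and_cace(defaults_promoted, n_promoted, defaults_control, n_control,
                 adoption_promoted, adoption_control):
    """Wald estimator on the risk-difference scale.

    ITT  = default risk (promoted arm) - default risk (control arm)
    CACE = ITT / (adoption rate promoted - adoption rate control)
    """
    itt = defaults_promoted / n_promoted - defaults_control / n_control
    first_stage = adoption_promoted - adoption_control
    if first_stage <= 0:
        raise ValueError("promotion must increase adoption for the ratio to be defined")
    return {"itt": itt, "first_stage": first_stage, "cace": itt / first_stage}


# Hypothetical default counts; adoption rates taken from the case description.
estimate = itt_and_cace(
    defaults_promoted=2400, n_promoted=50_000,
    defaults_control=2500, n_control=50_000,
    adoption_promoted=0.24, adoption_control=0.08,
)
print(estimate)
```

Because only 16 percentage points of the promoted group changed behavior, the per-adopter effect is the arm-level difference scaled up by 1/0.16, and its uncertainty is scaled up by the same factor. That is why the CACE interval in the case is much wider than the intention-to-treat interval.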

Mapped back: The industry case demonstrates reproducibility-and-replicability reasoning in a business context with stakes: computational reproducibility (out-of-sample validation) confirms pattern persistence; selection-bias adjustment and randomized causal verification reveal that the original observational effect was substantially inflated; the decision-maker uses the more-rigorous evidence base to revise estimates and strategy, avoiding a large false-positive decision.

Structural Tensions¶

T1 — Replication as science's self-correction versus replication as publication-devaluation. Replication is the mechanism by which the scientific record self-corrects; findings that fail to replicate are de-weighted over time. But in many current publication systems, replications (especially null or failed replications) have low status — journals preferentially publish novel findings; funders preferentially fund novel research; career rewards flow to original discoveries. The incentive structure systematically under-produces replication relative to the epistemic ideal. Reform through dedicated replication journals, replication-funding initiatives, and cultural change (recognition of replication contributions; emphasis on cumulative science) is underway but incomplete. Mature practice values replication as scientific-quality infrastructure; immature practice treats replication as second-class activity.

T2 — Exact versus deliberately-varied replication. Exact replication attempts to match the original study as closely as possible to test whether the finding holds under essentially identical conditions; deliberately-varied replication tests whether the finding generalizes across variations (populations, contexts, methods, measures)^[10]. Both are informative but address different questions. Exact replication addresses whether the original was right about what it claimed; varied replication addresses whether the finding is robust and generalizable. The tension is resource-allocational — both cannot always be pursued, and decisions about the type of replication matter. Mature practice distinguishes the two and deploys them strategically; immature practice treats all replications as equivalent.

T3 — Replication crisis as systemic failure versus normal scientific process. Strong claim: the replication rates observed (30-50% in much of psychology and biomedicine) reflect a crisis of scientific methodology; systematic incentives have produced an unreliable literature, and reform is urgent. Moderate claim: the observed rates reflect a combination of real effects, context-dependent effects, statistical noise, and some false positives, which is not ideal but not catastrophic; gradual reform suffices. Mild claim: the "crisis" framing is overblown; science has always self-corrected through replication, and the observed non-replication is mostly what sampling variability alone would be statistically expected to produce. The contested_construct flag reflects this ongoing disagreement. Mature practice works within the moderate framework, acknowledging systemic issues while avoiding both nihilism and complacency; immature practice takes extreme positions and dismisses counter-evidence.

T4 — Replication-friendly design versus research cost and feasibility. Making research replicable requires pre-registration, data-and-code sharing, comprehensive documentation, adequate power, and transparent methodology — all of which impose costs (researcher time, publication complexity, data-management infrastructure). Full replication-friendliness may not be feasible for some research (ethnographic fieldwork; sensitive-data studies; expensive-equipment experiments). The tension between replication ideals and research practicalities drives reform pace. Mature practice implements replication-friendly practices where feasible and articulates limits transparently; immature practice either resists all reform as burdensome or insists on full reform without acknowledging feasibility constraints.

Structural–Framed Character¶

Reproducibility & Replicability is a hybrid on the structural–framed spectrum. Part of it is a bare pattern that means the same thing in any field — repeat a procedure and check whether the result holds — and part of it is a frame, a vocabulary and a set of assumptions, inherited from experimental design and the practice of science.

The skeleton is a comparison: an original result, an independent repeat, and a verdict on whether the two agree, with reproducibility holding the data and method fixed and replicability varying them. That repeat-and-compare structure is general. But the concept is steeped in the institutional norms of science. Its vocabulary — finding, study, verification, analytic procedure — is drawn from research practice, and it carries a strong evaluative weight: a result that fails to reproduce is suspect, and the ability to confirm findings independently is a mark of trustworthy knowledge. Its purpose is institutional, tied to how a community certifies claims. Applied to computational pipelines, clinical trials, or social-science findings, it imports that epistemic standard rather than merely naming a repeated procedure. The structural core exists, but the scientific frame does substantial work, placing it toward the framed side of the middle.

Substrate Independence¶

Reproducibility & Replicability is a moderately substrate-independent prime, with a composite score of 3 / 5 on the substrate-independence scale. It reaches across experimental design, statistics, and philosophy of science, and its signature (an original-study artifact, independent verification, same-data consistency set against different-data consistency) gives it structure. But that signature is methodological rather than purely structural, and it is fundamentally rooted in the scientific method and empirical verification. The worked examples (the Open Science Collaboration's psychology replication project, a consumer-finance holdout and randomized follow-up) are squarely empirical, and transfer to non-empirical domains like pure mathematics or social systems is limited, which keeps it a methodological prime rather than a free-floating one.

Composite substrate independence — 3 / 5
Domain breadth — 3 / 5
Structural abstraction — 3 / 5
Transfer evidence — 3 / 5

Relationships to Other Abstractions¶

Current abstraction Reproducibility & Replicability Prime

Parents (1) — more general patterns this builds on

Reproducibility & Replicability is a kind of Verification Prime

Reproducibility and replicability are a specialization of verification in which the conformance check is repeating the study to confirm the finding.

Children (1) — more specific cases that build on this

Inter-Annotator Agreement Domain-specific is a decomposition of Reproducibility & Replicability

Inter-Annotator Agreement is the categorical-coding instrument for testing whether a fixed procedure reproduces across independent applications.
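As an illustration of how agreement between two independent applications of a coding procedure can be quantified, the sketch below computes Cohen's kappa, one common chance-corrected agreement statistic. The choice of kappa is an assumption here (the entry does not name a specific statistic), and the function name and labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items.

    kappa = (observed agreement - expected agreement) / (1 - expected agreement)
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if each annotator labelled independently at their own rates.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / n**2
    if expected == 1.0:
        return 1.0  # degenerate case: both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical codings of ten items by two independent annotators.
annotator_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
annotator_b = ["pos", "pos", "neg", "pos", "pos", "neg", "neg", "neg", "pos", "neg"]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

In this example the annotators agree on 8 of 10 items, but half of that agreement would be expected by chance given their label frequencies, so kappa is 0.6 rather than 0.8.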

Hierarchy path (1) — routes to 1 parentless root

Reproducibility & Replicability → Verification → Evaluation → Comparison → Self Checking

Neighborhood in Abstraction Space¶

Reproducibility & Replicability sits in a sparse region of abstraction space (89th percentile for distinctiveness): few abstractions share its structure, so a faithful description tends to retrieve it precisely rather than landing on a neighbor.

Family — Accumulation, Decay & Maintenance at Interfaces (16 primes)

Nearest neighbors

Replay — 0.69
Validation — 0.69
Convergent Independent Adoption — 0.68
Spaced Repetition — 0.67
Memoing — 0.67

Computed from structural-signature embeddings · 2026-07-26

Not to Be Confused With¶

Reproducibility & Replicability must be distinguished from Statistical Inference, though they are complementary. Statistical inference is the reasoning process of drawing conclusions about populations from sample data—hypothesis testing, confidence-interval construction, effect-size estimation, and the entire apparatus of generalizing from samples to populations. Reproducibility (same data, same methods) and replicability (new data, same design) are about verifying that findings are stable and reliable, whether or not they generalize. A finding can be statistically inferred from a sample and be entirely reproducible (same analysis on the same data yields identical results) and still fail to replicate in new data (new samples produce different results). Statistical inference asks "what can I conclude about the population from this sample?"; reproducibility asks "can I get the same result if I rerun the analysis?"; replicability asks "can other researchers get similar results with fresh data?". The distinction is critical because inference validity depends on design and sampling properties, while replicability depends on whether the finding holds beyond the original study's context. A study can use perfect inference methodology (proper randomization, correct statistical tests) and still produce non-replicable findings if it is underpowered, uses selection-biased samples, or is affected by hidden confounders that vary across replication attempts. Conversely, a study with questionable inference assumptions might still produce replicable findings if the effect is large and robust. Understanding the three as distinct dimensions—inference validity (from study design), reproducibility (from analytic stability), and replicability (from independent verification)—clarifies why all three matter and why absence in any dimension undermines scientific reliability.

Reproducibility & Replicability is also distinct from Hypothesis Testing (Null vs. Alternative), though hypothesis testing is a component of both reproducibility and replicability assessment. Hypothesis testing is the formal decision procedure using p-values, significance thresholds, and error rates to evaluate whether sample evidence supports one hypothesis over another. Reproducibility is about whether the computational results are stable when methods are re-executed—a question about technical reliability. Replicability is about whether an effect found in one study is observed again in independent studies—a question about scientific verification. Hypothesis testing can be applied to reproducibility (do replicate-study results differ significantly from original results?) and to replicability (does the replicate study's test reject the null?), but hypothesis testing itself is not the verification process. A hypothesis test is an inferential tool; reproducibility is a computational property; replicability is an empirical pattern. The confusion arises because replicability is often assessed by asking "does the replication study produce a significant result in the same direction?" — a hypothesis-test framing. But this operationalization is just one of several ways to define replication success; other definitions (effect-size matching, confidence-interval overlap, combined meta-analytic evidence) do not rely on hypothesis testing as the primary tool. Mature replication assessment uses multiple operationalizations; hypothesis testing is one lens among several. Understanding the distinction prevents treating p-value magnitude in the replication study as the sole criterion for replication success, when effect-size stability, confidence-interval overlap, and aggregated evidence often provide stronger verification.

Reproducibility & Replicability is also distinct from Confounding, though confounding is a source of non-replicable findings. Confounding is the structural problem in which an unmeasured or uncontrolled variable creates a spurious association between a measured exposure and outcome: the classic "correlation does not imply causation" problem. A study can have strong confounding and still be perfectly reproducible: if you re-run the same analysis on the same data with the same unmeasured confounders present, you get identical results. The confoundedness of those results does not prevent reproducibility. Equally, a finding can be reproducible (stable across re-runs) and unconfounded (the exposure truly causes the outcome). The two dimensions are orthogonal. However, confounding becomes relevant to replicability when the confounder structure differs across studies. If an unmeasured confounder is distributed one way in Study A's population and differently in Study B's population (or is not measured in Study B at all), then the two studies may produce different results despite using similar designs. This is one class of replication failure: different confounder structures across contexts. But other replication failures arise without confounding (sampling variability with small samples; publication bias inflating original estimates; population heterogeneity in true effect sizes). Understanding confounding as a source of contextual variation that can produce replication failure, rather than as the definition of replication failure, clarifies the causal-inference problem and distinguishes it from the reproducibility problem (which is about computational stability within a study) and the replicability problem (which is about consistency across independent studies with potentially different confounding structures).

Solution Archetypes¶

Solution archetypes in the catalog that build on this prime — directly (this prime is a source ingredient) or as a related prime.

Built directly on this prime (4)

Deterministic Transition Contract: Make the transition from current state to next state fully specified so identical starting conditions, rules, inputs, ordering, and environment produce one reproducible successor.
▸ Mechanisms (9)
- canonical_execution_order_runbook
- concurrency_serialization_gate
- dependency_version_lockfile
- deterministic_replay_harness
- differential_transition_comparison
- golden_master_transition_test
- seeded_randomness_protocol
- state_machine_transition_table
- transition_audit_log
Multiple-Testing Discipline: Control false discoveries when many comparisons, claims, or tests are being tried.
▸ Mechanisms (10)
- Alpha-Spending Plan
- Bonferroni-Like Correction
- Claim Registry
- Confirmatory Follow-Up
- False Discovery Rate Control
- Holdout Validation
- Metric Hierarchy
- Multiverse Analysis Report
- Preregistration
- Replication Study
Proceduralization: Convert tacit or inconsistent work into explicit repeatable steps with inputs, outputs, and exception handling.
▸ Mechanisms (10)
- Automation Routine
- Checklist
- Decision Tree
- Playbook
- Process Map
- Protocol
- Runbook
- Standard Operating Procedure — Freezes a stabilized, low-judgment routine into ordered steps, named roles, and explicit acceptance conditions so anyone can run it the same way.
- Swimlane Workflow Diagram
- Workflow Script
Reproducibility Protocol: Make methods, data, assumptions, and environments explicit enough that results can be repeated or checked.
▸ Mechanisms (10)
- Audit Trail
- Containerized Environment Snapshot
- Decision Log
- Lab Notebook Record
- Protocol Documentation
- Replication Package
- Reproducible Research Package
- Rerun Checklist
- Version-Controlled Analysis
- Workflow Script or Pipeline
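
Two of the mechanisms listed above, a seeded randomness protocol and a rerun check, can be sketched together in a few lines. This is a minimal illustration under stated assumptions rather than a catalog implementation: `run_analysis` is a hypothetical stand-in for a real pipeline (here a small percentile bootstrap), the data values are made up, and the fingerprint only certifies same-data, same-code, same-seed agreement within one software environment.

```python
import hashlib
import json
import random


def run_analysis(data, seed):
    """Hypothetical analysis: mean with a rough bootstrap percentile interval.

    All randomness flows through one explicitly seeded generator, so the
    same data and seed yield the same result on every rerun.
    """
    rng = random.Random(seed)
    n = len(data)
    boot_means = sorted(
        sum(data[rng.randrange(n)] for _ in range(n)) / n for _ in range(200)
    )
    # Approximate 2.5th and 97.5th percentiles of the 200 bootstrap means.
    return {"mean": sum(data) / n, "ci_low": boot_means[4], "ci_high": boot_means[194]}


def fingerprint(result):
    """Hash a canonical serialization of the result for rerun comparison."""
    canonical = json.dumps(result, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


data = [2.1, 3.4, 1.9, 4.2, 3.3, 2.8, 3.9, 2.5]  # made-up measurements
first_run = fingerprint(run_analysis(data, seed=441))
rerun = fingerprint(run_analysis(data, seed=441))
print("reproduced" if first_run == rerun else "mismatch")
```

Storing the fingerprint alongside the data version, code version, and seed gives a third party a cheap pass/fail check before any deeper audit. A mismatch signals that something undeclared (an unpinned dependency, an unseeded random call, a changed input) entered the pipeline.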

Also a related prime in 41 archetypes

Alternative-Hypothesis Generation: Before treating a conclusion as settled, generate credible alternative explanations and identify the evidence that would distinguish them.
Assumption-Light Inference: Use inference methods that require fewer fragile assumptions when strong assumptions are unjustified.
Bayesian Belief Updating: Revise beliefs by combining prior expectations with new evidence rather than treating each observation in isolation.
Blinding and Expectancy Bias Reduction: Hide condition identity from the roles that could be biased by knowing it, while preserving safety, correct operation, and auditable exceptions.
Canonical Ordering: Choose a stable ordering rule so comparison, serialization, processing, or coordination becomes consistent.
Checkpoint and Rollback: Save recoverable states before risky change so the system can return to a known-good condition if the change fails.
Comparative Benchmark Validation: Validate a claim by comparing the system against explicit reference standards, gold standards, incumbent alternatives, competitors, or benchmark suites under conditions that make the comparison meaningful.
Conformance Control and Corrective Feedback: Measure output against an explicit specification, gate release on conformance, contain and disposition failures, and feed defect evidence upstream until recurrence risk falls.
Construct–Proxy–Signal Validity Alignment: Make a measurement earn its interpretation by tracing the claim from construct to proxy to signal and requiring evidence that the signal captures the intended construct rather than a correlated surrogate.
Coverage Probability Calibration: Verify and adjust uncertainty intervals so their promised coverage rate is achieved in the regime where decisions will rely on them.

Definition-Time Context Binding: Bind a behavior unit to the minimum context that defined it so later execution resolves against that context rather than silently inheriting an unrelated ambient environment.
Dimensionality Reduction for Signal: Reduce many variables into fewer informative dimensions so structure becomes visible without drowning in noise.
Distributional-Assumption Governance: Make probability-distribution commitments explicit, evidence-grounded, consequence-aware, stress-tested, and revisable before they govern inference or action.
Effect Size Standardization: Convert raw inferred effects into comparable, uncertainty-bounded magnitude expressions so evidence can be judged by size and practical meaning, not only by detectability.
Equivalence-Preserving Rewrite Optimization: Rewrite something into a cheaper, clearer, faster, safer, or more usable form only after proving or testing that the declared behavior stays equivalent.
Evidentiary Trace Warranting: Treat evidence as a defeasible relation between a trace and a claim, not as raw data or free-floating support.
False Convergence Prevention: Prevent apparent stability or agreement from being mistaken for genuine convergence.
Generate-and-Verify Separation: Let many, complex, heuristic, or untrusted parties search for candidates, but require every accepted candidate to pass a substantially cheaper, smaller, explicit, and independently assured verifier.
Hypothesis Test Power Calibration: Design a hypothesis test around the effect that would actually matter, then tune sample size, noise control, allocation, and error rates so the test has adequate power to detect it.
Hypothesis Testing Frame: Frame a claim against a default alternative so evidence can change belief or action under explicit error risks.
Independent Convergence Evidence Appraisal: Treat repeated independent arrival at the same solution-shape as evidence of fit only after auditing independence, shared pressures, abstraction level, and alternative explanations for the convergence.
Independent Convergence Recognition and Transfer Design: Use independently repeated solutions as evidence of shared pressures or constraints while checking that the repetition is not copying, common ancestry, or false similarity.
Independent Evidence Triangulation: Cross-check a scoped claim with multiple meaningfully independent evidence streams, using both convergence and divergence to calibrate confidence and expose hidden dependence, bias, or context.
Independent Verification Oversight: When a validity judgment can be biased by the producer’s incentives or assumptions, route the evidence to an independent verifier with enough access, authority, and separation to challenge the claim before it is accepted.
Inductive Validity Extension: Validate that a rule, guarantee, or process that works in a base case continues to hold as it extends step by step, recursively, or at larger scale.
Knowledge-Warrant Audit: Audit what each belief rests on, classify the strength and type of its warrant, and adjust confidence or action accordingly.
Leakage-Resistant Validation Design: Before trusting a fitted model, score, policy, or benchmark result, enforce the boundary between what would have been knowable at decision time and what was learned only through the target, future, holdout, or deployment outcome.
Measurement-Protocol Standardization: Make comparisons interpretable by ensuring every subject, group, site, or condition is measured with the same construct, instruments, timing, administration, scoring, calibration, and deviation rules.
Model-Guided Signal Separation: Recover a target component from mixed observations by stating what the target is, modeling how target and nuisance combine, applying a calibrated separator, and proving what the output preserves, suppresses, and still leaves uncertain.
Operational Context Validation Testing: Test the system in the conditions where it must actually work, not only in the simplified conditions where it is easiest to prove it works.
Perturbation Testing: Introduce small controlled disturbances to learn system sensitivity, robustness, and hidden dependencies.
Portable Dependency Envelope: Bundle a unit with the dependencies it needs and expose only a standardized exterior so heterogeneous handlers can move, host, or activate it intact.
Recursive Triangulation of Triangulation: When a conclusion already rests on triangulation, audit the triangulation itself by checking whether its evidence streams are independent, its convergence logic is valid, and its confidence claim survives a second-order triangulation layer.
Regression-to-the-Mean Guardrail: Prevent ordinary reversion after extreme observations from being credited to an intervention, person, punishment, reward, or event without a credible counterfactual.
Regroupable Aggregation: Design partial summaries to combine associatively so an aggregate can be chunked, nested, or tree-reduced without changing its defined result.
Self-Hosted Bootstrap Construction: Begin with a trusted minimal seed, let each verified stage produce the capability that builds the next, and finish only when the target system can reproduce and operate itself without hidden external support.
Tacit Knowledge Elicitation: Draw out expert know-how that is used in practice but not yet articulated.
Traceability Linking: Create explicit links from sources, requirements, decisions, actions, or artifacts to their downstream consequences or implementations.
Traceable Measurement System Design: Define exactly what attribute is being measured, anchor it to a unit and frame, realize it through a validated instrument and procedure, and report the result together with uncertainty and traceability.
Variance Reduction: Reduce unwanted variation so signal, quality, fairness, or reliability becomes clearer and more stable.
Versioned Evolution: Track changes as explicit versions so evolution remains comparable, reversible, auditable, and compatible.

Notes¶

The Munafò et al. (2017) manifesto^[8] articulates the methodological-reform framework for reproducible science.

The multi_origin_equal flag is warranted^[11] — reproducibility-and-replicability has genuine co-origin in experimental-design/statistics (Fisher 1935 explicitly treated replication as central to inference — "the greater the precision aimed at, the greater the demand for replication"; replication is foundational to multi-study methodological frameworks) and in philosophy of science (Popper falsifiability framework; Lakatos research-programme methodology; Kitcher naturalistic epistemology; contemporary philosophy-of-science and science-studies analyses). The primary origin_domain: experimental_design_statistics reflects the operational-methodology framework; alternate_origin_domains: [philosophy] reflects the epistemological foundation. The contested_construct flag reflects ongoing disagreement about: (i) whether current patterns constitute "crisis" or normal science; (ii) replication-success definitions (significance-based; interval-based; combined-evidence; subjective); (iii) relative contributions of original-false-positives, replication-false-negatives, and true between-study heterogeneity; (iv) whether exact replication is philosophically coherent; (v) optimal reform direction (pre-registration-heavy; data-sharing-heavy; meta-science-heavy; incentive-reform-heavy). 
Related primes: #432 randomization (proper randomization as condition for reliable replication), #434 hypothesis_testing_null_vs_alternative (replication aggregates across single-study tests), #435 statistical_significance_p_value (p-value interpretation is compromised without replication expectation — contested_construct both primes), #437 statistical_power (under-powered originals produce winner's curse and non-replicable findings), #440 selection_bias (publication bias as selection-in-literature, primary replication-failure driver), #438 confounding (unmeasured confounders differing across contexts), #447 effect_size (replications measure effect-size persistence), #433 sampling_representativeness (replications in different populations test generalizability). Cross-DP-25 notes: reproducibility_replicability (this prime) is the validation framework for empirical inference; missing_data_mechanisms_mcar_mar_mnar identifies structural threats to complete-data assumptions; compression quantifies information content. Strong transfer targets: psychology and social-science replication reform, biomedicine and preclinical-research quality improvement, economics empirical-methodology consolidation, machine-learning computational reproducibility, genetics replication-cohort requirements, clinical-trial regulatory-replication standards, meta-science and science-of-science infrastructure. Pass B should develop archetypes for pre-registration and registered reports, data-and-code sharing standards (FAIR principles, Dryad, OSF), computational-reproducibility infrastructure (containers, notebooks, cloud), many-labs and coordinated multi-site replication, meta-analytic replication synthesis, replication-study design (power, success criteria, context matching), publication-bias correction in meta-analyses (funnel plots, trim-and-fill, p-curve, selection-model meta-analysis), and reform-incentive infrastructure (replication journals, funding, tenure credit for replication).

References¶

[1] National Academies of Sciences, Engineering, and Medicine. (2019). Reproducibility and Replicability in Science. Washington, DC: The National Academies Press. https://doi.org/10.17226/25303. Formalizes the distinction between reproducibility (same data, same analysis) and replicability (new data, similar methods). ↩

[2] Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124. Foundational analysis of how publication bias, low statistical power, and flexible analytic choices produce a literature in which most positive findings fail to replicate—motivating epistemic humility about scientific claims. ↩

[3] Nosek, B. A., Alter, G., Banks, G. C., et al. (2015). Promoting an open research culture. Science, 348(6242), aab2374. https://doi.org/10.1126/science.aab2374. Nosek open-science cultural-reform framework addressing replication infrastructure. ↩

[4] Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716. Coordinated replication of 100 published psychology experiments: reproduced significant effects in only 36% of cases despite nominal transparency of original methods, dramatizing that disclosed information without shared data, code, and pre-registration is insufficient to support substantive scrutiny. ↩

[5] Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716. Open Science Collaboration quantified replication rates across multiple success definitions. ↩

[6] Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716. Open Science Collaboration large-scale replication converting anecdotal crisis concerns into statistical evidence. ↩

[7] Camerer, C. F., Dreber, A., Holzmeister, F., et al. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2(9), 637–644. https://doi.org/10.1038/s41562-018-0399-z. Replication of 21 social-science experiments published in Nature and Science; 13 of 21 (62%) showed a significant effect in the original direction. ↩

[8] Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., du Sert, N. P., Simonsohn, U., Wagenmakers, E.-J., Ware, J. J., & Ioannidis, J. P. A. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1, 0021. Identifies methods, reporting, reproducibility, evaluation, and incentives as the loci of reform; frames replication infrastructure as a verification system whose specifications are the methods and protocols of the original studies. ↩

[9] Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632. Documents how researcher degrees of freedom (undisclosed flexibility in data collection and analysis, later called p-hacking) inflate false-positive rates. ↩

[10] Klein, R. A., Ratliff, K. A., Vianello, M., et al. (2014). Investigating variation in replicability: A "many labs" replication project. Social Psychology, 45(3), 142–152. https://doi.org/10.1027/1864-9335/a000178. Klein Many Labs multi-laboratory coordinated replication effort establishing robustness patterns. ↩

[11] Fisher, R. A. (1935). The Design of Experiments. Edinburgh: Oliver and Boyd. Foundational treatise on experimental design; establishes randomization as the "reasoned basis for inference" and develops the principles of randomization, replication, and blocking that underpin modern randomization-based causal inference. ↩

[12] Begley, C. G., & Ellis, L. M. (2012). Drug development: Raise standards for preclinical cancer research. Nature, 483(7391), 531–533. https://doi.org/10.1038/483531a. Begley-Ellis preclinical-cancer-research non-replication showing 89% failure-to-replicate rate.

[13] Prinz, F., Schlange, T., & Asadullah, K. (2011). Believe it or not: how much can we rely on published data on potential drug targets? Nature Reviews Drug Discovery, 10(9), 712. https://doi.org/10.1038/nrd3439-c1. Prinz et al. Bayer in-house audit of preclinical-research replication rates.

[14] Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: context, process, and purpose. The American Statistician, 70(2), 129–133. https://doi.org/10.1080/00031305.2016.1154108. ASA p-value statement clarifying replication implications of significance testing.

[15] Campbell, D. T., & Stanley, J. C. (1966). Experimental and Quasi-Experimental Designs for Research. Chicago: Rand McNally. Campbell-Stanley internal-validity framework for evaluating replication consistency across contexts.