Effect Size¶

Prime #: 447
Origin domain: Statistics & Experimental Design
Aliases: Cohen's d, Standardized Mean Difference, Practical Significance, Magnitude of Effect, Effect Estimate
Related primes: Statistical Power, Statistical Significance (p-Value), Confidence Intervals, Hypothesis Testing (Null vs. Alternative), Type I & Type II Errors, Reproducibility & Replicability, Regression to the Mean

Core Idea¶

Effect size quantifies the magnitude of a relationship or difference — the size of an observed effect in substantive, interpretable units — independent of sample size and separately from statistical significance. The core insight is that statistical significance ("does this effect exist?") is a question about whether an observation deviates from zero; effect size ("how big is this effect?") is a question about the scale of that deviation in units that matter for practical decision-making. Effect size reporting shifts the analytical frame from the dichotomous significance-testing question to the continuous estimation question, revealing a phenomenon that significance testing obscures: a large sample can make any non-zero effect "statistically significant," while a small sample can render even a large practical effect "non-significant," making significance alone an unreliable guide to importance. The abstraction is that meaningful inference requires attending to magnitude (the effect-size estimate), uncertainty (confidence interval or posterior distribution), and direction (positive or negative) jointly — not collapsing them into a binary reject/do-not-reject decision.

How would you explain it like I'm…

How Big Is It

If two kids race, asking who won is one question. Asking by how much one beat the other, like one step or one whole block, is a different question. Knowing the size of the gap tells you way more than just knowing there was a gap. That bigger picture matters.

Size Of The Difference

When scientists test something, they often ask two different questions. First: 'Is there any effect at all?' Second: 'How big is the effect?' The second question is called effect size. A medicine might really lower blood pressure, but only by a tiny amount that no patient would notice. Effect size tells you the size of the change in real-world units, so you can decide if it actually matters for your life, not just whether a study counts it as 'significant.'

Measuring Magnitude, Not Just Yes/No

Effect size is a number that tells you the magnitude of a difference or relationship in units you can interpret. It is separate from statistical significance, which only tells you whether an effect probably exists. With a huge sample, even a tiny effect can be 'statistically significant'; with a tiny sample, even a big effect can fail that test. So significance alone is a poor guide to whether something matters in practice. To draw a real conclusion you need three pieces together: the size of the effect, the uncertainty around it (a confidence interval), and the direction. Collapsing all that into a yes/no verdict throws away the information you actually need.

Effect size quantifies the magnitude of an observed relationship or difference in substantive, interpretable units, independently of sample size and separately from the question of statistical significance. The conceptual move is from a *dichotomous* hypothesis test (does the effect differ from zero?) to a *continuous* estimation question (how large is the effect, and how precisely have we estimated it?). This matters because the null-hypothesis significance test (NHST) is sample-size dependent: in a large enough study, even a trivially small effect crosses the significance threshold, while a substantively important effect can fail to reach significance in a small study. Reporting standardized effect sizes (Cohen's d, Pearson's r, odds ratios, eta-squared) together with confidence intervals or posterior distributions allows readers to evaluate practical importance, perform meta-analytic synthesis across studies, and conduct power analyses for replication. The discipline is to report magnitude, uncertainty, and direction jointly rather than collapse them into a single reject/do-not-reject verdict.
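The sample-size dependence of significance testing can be made concrete with a small calculation. The following is a minimal sketch, assuming a two-sided z-test with known unit variance and equal group sizes; the function name `two_sided_p` and both scenarios are hypothetical illustrations, not taken from any cited study.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(d: float, n_per_group: int) -> float:
    """Two-sided z-test p-value for an observed standardized mean
    difference d with n_per_group observations in each arm.
    Assumes known, equal unit variance (an illustrative simplification)."""
    z = d * sqrt(n_per_group / 2)  # difference divided by its standard error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical scenarios: a trivial effect in a huge study,
# and a large effect in a tiny study.
p_tiny_effect = two_sided_p(d=0.02, n_per_group=100_000)
p_large_effect = two_sided_p(d=0.8, n_per_group=10)

print(f"d=0.02, n=100000 per arm: p={p_tiny_effect:.2g}")
print(f"d=0.80, n=10 per arm:     p={p_large_effect:.2g}")
```

Under these assumptions the trivial effect clears the conventional 0.05 threshold while the large effect does not, which is why the p-value alone says little about magnitude.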

Structural Signature¶

Effect size operates at the interpretive layer of statistical analysis, with six defining structural roles: the magnitude-of-effect quantification in substantive units, the standardization mechanism dividing observed difference by variability, the dimensionless comparability property enabling cross-study synthesis, the separation-of-magnitude-from-uncertainty design, the meta-analytic pooling foundation replacing p-value vote-counting, and the practical-significance-versus-statistical-significance distinction. Standardized effect sizes (Cohen's d, Pearson's r, odds ratios, η²) remove the original measurement scale: d divides the observed mean difference by a standard deviation, r and η² are bounded, unit-free measures of association and explained variance, and odds ratios are ratios of odds. Dividing by a standard deviation, rather than by a standard error, is what keeps d independent of sample size. The resulting scale-free numbers are comparable across studies and outcome measures; unstandardized effect sizes (raw mean differences, absolute risk reductions, percentage-point changes) preserve the substantive units that stakeholders care about. The critical structural distinction is the separation of magnitude from reliability: a point estimate of effect size (the "how big") is paired with an interval estimate (the "how certain"), and both are needed for complete inference. Power analysis, sample-size planning, and meta-analysis all hinge on effect-size reasoning: sample size calculations are not fundamentally about achieving "statistical significance" but about engineering sufficient statistical power to detect effects of a specified practical magnitude; meta-analyses combine effect-size estimates across studies to produce pooled estimates and assess heterogeneity, a synthesis that would be impossible using p-values alone.
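The pairing of a point estimate with an interval estimate can be sketched for Cohen's d as follows. This is an illustrative sketch, assuming two independent groups, a pooled standard deviation, and the common large-sample normal approximation for the standard error of d; the function name `cohens_d_with_ci` and the scores are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def cohens_d_with_ci(treatment, control, level=0.95):
    """Return (raw difference, Cohen's d, approximate CI for d).

    d = (mean_t - mean_c) / pooled SD. The interval uses a large-sample
    normal approximation, so it is rough for very small groups."""
    n1, n2 = len(treatment), len(control)
    pooled_var = ((n1 - 1) * variance(treatment)
                  + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    raw_diff = mean(treatment) - mean(control)   # unstandardized effect
    d = raw_diff / sqrt(pooled_var)              # standardized effect
    se = sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return raw_diff, d, (d - z * se, d + z * se)


# Hypothetical symptom scores for two small groups.
treated = [12, 14, 15, 13, 16, 14]
untreated = [11, 12, 13, 12, 14, 12]
raw, d, (low, high) = cohens_d_with_ci(treated, untreated)
print(f"raw difference={raw:.2f}, d={d:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

Returning the raw difference alongside d keeps both the substantive units and the scale-free number available to the reader.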

What It Is Not¶

Not the same as statistical significance — A statistically significant effect can be tiny and practically irrelevant; a large effect can be non-significant in a small study. The distinction between #435 (statistical_significance_p_value) and effect size is fundamental: statistical significance addresses "is this distinguishable from zero?" while effect size addresses "how much is it?" These are orthogonal questions; conflating them has been the systematic error of significance-testing culture.
Not a substitute for confidence intervals — A point estimate of effect size without calibrated uncertainty is incomplete. Effect size must be reported alongside interval estimates (#436 confidence_intervals) to communicate both magnitude and precision. A large effect from a small, noisy study is uninformative without its wide confidence interval; conversely, a narrow interval around a trivial effect communicates different information than a wide interval around the same point estimate.
Not uniquely defined — Different metrics (Cohen's d, Pearson's r, odds ratio, η², relative risk, raw difference) highlight different aspects of the same phenomenon and are not always mutually convertible without assumptions. The choice of metric should depend on the research question and the audience; a meta-analyst may prefer standardized d for cross-study synthesis, while a clinician may prefer number-needed-to-treat for decision-making.
Not inherently interpretable without context — Cohen's d=0.2, 0.5, 0.8 as "small/medium/large" conventions are universal approximations that apply poorly in many domains. A d=0.2 in a preventive intervention on educational outcomes is often substantial; the same d=0.2 in laboratory physics might be negligible. Effective effect-size interpretation requires domain-specific judgment about what magnitudes are practically meaningful.
Not a causal claim by itself — Effect size quantifies an observed association or difference in data; whether that difference is causal depends on design features (randomization, control arms, measured and unmeasured confounding assessment). An effect size from an observational study is a descriptive quantity; causal interpretation requires additional argument.
Not identical to clinical or practical significance — In medicine and policy, clinical significance combines effect size with costs, harms, patient values, and implementation feasibility. A small effect size might have enormous clinical significance if the harm prevented is severe and the intervention is cheap; conversely, a large effect size might have trivial significance if harm prevented is rare or intervention cost is prohibitive.
Not monotonic with practical value — A small standardized effect (d=0.1) applied to millions of people (e.g., a 0.2% reduction in mortality across a large population) can create enormous aggregate value; a large standardized effect (d=0.8) in a small population may matter little in absolute terms. Aggregate impact depends on both effect size and population scale.
Not equivalent to #437 (statistical_power) — While power analysis is built on effect-size thinking (power to detect an effect of specified magnitude), reporting the observed effect size does not tell you the power of the study. A non-significant finding with a wide confidence interval excludes neither "no effect" nor "large effect"; determining what the study could have detected requires retrospective power analysis or confidence-interval examination.
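The link between power and a specified effect magnitude can be illustrated with a sample-size calculation. This is a minimal sketch, assuming a two-sided, two-sample comparison of means with equal group sizes and the normal approximation n = 2(z_{1-α/2} + z_{power})² / d² per group; the function name `n_per_group` is hypothetical, and exact t-based calculations give slightly larger values.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate observations needed per group to detect a standardized
    mean difference d with the given two-sided alpha and power.
    Normal approximation; a planning sketch, not an exact t-based result."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


# Cohen's conventional benchmarks, used here only as example inputs.
for d in (0.2, 0.5, 0.8):
    print(f"d={d}: about {n_per_group(d)} per group")
```

Halving the target effect roughly quadruples the required sample, which is why the smallest effect worth detecting has to be chosen before the study is sized.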

Broad Use¶

Psychology and Behavioral Science. The American Psychological Association's Publication Manual (6th edition, 2010, and 7th edition, 2020) explicitly mandates effect-size reporting for all statistical tests. Cohen's d, η² (eta-squared), ω² (omega-squared), and partial η² are standard effect-size metrics. The shift was gradual but decisive: a 1989 survey by Cohen found fewer than 2% of published psychology papers reported effect sizes beyond implicit p-values; by 2010, major psychology journals had adopted formal effect-size reporting requirements. Contemporary psychology distinguishes standardized d (for cross-study meta-analysis) and unstandardized raw differences (for interpretability within a study), with reporting guidance favoring both^[1].

Medical Research and Clinical Trials. Odds ratios, relative risks, hazard ratios, and absolute risk reductions (with number-needed-to-treat, or NNT) are conventional primary-result effect-size metrics in clinical trials. Meta-analyses aggregate effect sizes across trials to produce pooled estimates that inform clinical guidelines and evidence-based medicine syntheses. CONSORT (Consolidated Standards of Reporting Trials) explicitly requires effect-size reporting alongside p-values and confidence intervals. The medical tradition emphasizes communicating effect size in absolute terms (NNT for harm prevention, cost per life-year saved) alongside relative metrics, serving decision-making by patients, clinicians, and regulators^[2].
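The relative and absolute metrics named above can all be derived from the same two-arm event counts. The sketch below assumes a simple two-arm trial with a binary outcome; the function name `risk_metrics` and the counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def risk_metrics(events_treated, n_treated, events_control, n_control):
    """Relative and absolute effect sizes from two-arm event counts."""
    risk_t = events_treated / n_treated
    risk_c = events_control / n_control
    arr = risk_c - risk_t                      # absolute risk reduction
    rr = risk_t / risk_c                       # relative risk
    odds_ratio = (risk_t / (1 - risk_t)) / (risk_c / (1 - risk_c))
    # NNT is only defined here when the treatment lowers risk.
    nnt = 1 / arr if arr > 0 else None
    return {"ARR": arr, "RR": rr, "OR": odds_ratio, "NNT": nnt}


# Hypothetical trial: 30 events in 1000 treated, 40 events in 1000 controls.
result = risk_metrics(30, 1000, 40, 1000)
for name, value in result.items():
    print(f"{name}: {value:.4g}")
```

In this hypothetical trial the relative risk of 0.75 sounds substantial, while the absolute risk reduction of one percentage point corresponds to treating about 100 patients to prevent one event; reporting both prevents either framing from misleading.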

Educational Research and Evidence Synthesis. The What Works Clearinghouse, a US Department of Education evidence-classification system, uses effect sizes in standard deviation units as its primary classification metric, assigning studies to evidence tiers based on effect magnitude. John Hattie's meta-analytic synthesis of approximately 800 meta-analyses (2008 Visible Learning and subsequent updates) identifies effect sizes of 0.4 in standard deviation units as a heuristic threshold for "significant positive impact," with effect sizes above and below providing continuous scaling of intervention effectiveness. This application has driven effect-size awareness in education policy and practice^[3].

Economics and Econometrics. Elasticities (percent-change-in-outcome per percent-change-in-predictor), marginal effects (partial derivatives of outcome with respect to predictors), and dollar-valued impacts are the substantive quantities of interest in economic analysis. T-statistics and p-values play secondary roles as statistical diagnostics; the economic question is always "by how much," not just "is there an effect." Regression-coefficient reporting with confidence intervals and effect-size interpretation based on economic context (e.g., a 0.1 percentage-point change in GDP growth is a very large effect) is standard practice^[4].

Technology, A/B Testing, and Online Experimentation. Percent-lift (percentage-change relative to control) and absolute-difference effect sizes are the primary decision drivers in A/B testing at technology companies. A test may show statistical significance at p<0.01 with a lift of only 0.1%; whether that lift justifies implementation depends on its practical magnitude (0.1% of a metric worth millions may well repay the investment), not on the p-value. Multi-armed bandit algorithms and Bayesian testing frameworks increasingly orient around posterior distributions of effect sizes rather than dichotomous reject/do-not-reject outcomes, allowing sequential decision-making under specified uncertainty. This shift reflects the recognition that business decisions depend on effect magnitude, not on significance thresholds^[5].

Policy Analysis and Cost-Effectiveness. Effect sizes expressed as cost-effectiveness ratios (cost per quality-adjusted life year gained), return-on-investment metrics, or cost-benefit comparisons allow comparisons across diverse policy interventions. A job-training program's effect on earnings, measured in dollars per enrollee, is directly comparable to a health-care subsidy's effect on disease incidence. This application of effect-size reasoning aligns with public-finance and welfare-economic frameworks where the magnitude of impact and its cost-efficiency determine policy adoption^[4].

Systematic Review and Meta-Analysis. Cochrane reviews, Campbell Collaboration systematic reviews, and other high-standards evidence-synthesis programs emphasize effect-size estimation as the central inferential task. Rather than vote-counting (how many studies show significance?), which is biased toward large studies and ignores effect magnitudes, meta-analysis pools effect-size estimates across studies, produces a pooled estimate with confidence interval, and assesses heterogeneity (I² statistics, prediction intervals) to characterize the distribution of effects across contexts. This methodological standard has been adopted across medical, educational, psychological, and policy research^[6].
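Pooling and heterogeneity assessment can be sketched with inverse-variance weighting. This is a simplified illustration, assuming a fixed-effect model (published reviews often use random-effects models instead); the function name `pool_fixed_effect` and the study values are hypothetical.

```python
from math import sqrt


def pool_fixed_effect(effects, std_errors):
    """Inverse-variance fixed-effect pooling of study effect sizes.

    Returns (pooled estimate, pooled SE, Cochran's Q, I-squared)."""
    weights = [1 / se ** 2 for se in std_errors]   # precise studies weigh more
    total_w = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total_w
    pooled_se = sqrt(1 / total_w)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I-squared: share of variation beyond what sampling error would explain.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, pooled_se, q, i_squared


# Hypothetical standardized mean differences and standard errors from 4 studies.
d_values = [0.10, 0.35, 0.50, 0.22]
se_values = [0.05, 0.15, 0.20, 0.10]
pooled, pooled_se, q, i2 = pool_fixed_effect(d_values, se_values)
print(f"pooled d={pooled:.3f} (SE {pooled_se:.3f}), Q={q:.2f}, I2={i2:.0%}")
```

Every study contributes its magnitude and its precision, whereas a vote count would record only which studies crossed p<0.05.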

Clarity¶

Effect-size reporting clarifies what a study actually found by integrating magnitude, precision, and statistical distinguishability into a coherent narrative. The statement "treatment reduced symptoms by 2.3 points on a 20-point scale (95% CI 1.1–3.5, p=0.002)" communicates the point estimate (2.3), the uncertainty bounds (1.1 to 3.5), and the statistical distinguishability from zero (p=0.002), allowing readers to judge whether the effect is practically meaningful and to visualize the range of plausible effects. The same result stated as "treatment was significantly better than control (p<0.01)" suppresses the magnitude and precision information, leaving readers unable to assess practical importance. In meta-analysis, effect-size reporting enables pooling across studies using different sample sizes, outcome measures, and designs; "vote counting" (counting how many studies achieved p<0.05) favors studies with large samples and dichotomizes a continuous quantity (effect magnitude) into a binary outcome (significant/non-significant). Effect-size discipline also clarifies interpretation of null results: a non-significant result with a narrow confidence interval (e.g., d with 95% CI -0.1 to +0.1) is evidence that any effect is at most small; a non-significant result with a wide confidence interval (e.g., d with 95% CI -0.8 to +0.8) is inconclusive, leaving open the possibility of a large effect. This distinction is invisible in pure significance testing but central to interval-based effect-size reporting^[5].

Manages Complexity¶

Effect size provides a unified currency for comparing findings across studies, measures, contexts, and interventions, reducing the apparent incommensurability of diverse evidence. A meta-analyst combining psychotherapy trials measuring depression on different scales (Beck Depression Inventory, Hamilton Depression Rating Scale, Patient Health Questionnaire-9) cannot directly pool raw scores (incompatible metrics) but can pool Cohen's d effect sizes, each computed as the standardized mean difference on its native scale. Meta-regression extends this framework by examining whether effect sizes covary with study characteristics (sample size, demographic composition), trial design features (intensity, duration), or patient-population features (baseline symptom severity, comorbidity), producing moderator analyses that would be impossible with p-values alone. In applied settings, effect-size thinking allows decision-makers to compare interventions on a common scale: a 0.1-SD improvement in educational outcomes may appear less impressive than a 0.5-SD effect from another intervention, but if the first is dramatically cheaper to implement and targets a priority population, its cost-effectiveness can be competitive. The complexity management lies in standardizing the metric (converting diverse raw outcomes into dimensionless effect sizes) while preserving uncertainty through paired interval estimates, a discipline that has consistently improved scientific communication and organizational decision-making where it has been systematically adopted^[6].

Abstract Reasoning¶

Effect-size reasoning embodies a fundamental abstraction about the structure of inference: the statistical question "is there an effect" is distinct from and less important than the substantive question "how big is it, is it stable across contexts, and does its magnitude matter for my decisions?" This perspective shifts analytical orientation from binary hypothesis testing (reject/do-not-reject) to continuous estimation with quantified uncertainty. The abstraction applies far beyond formal statistics: in engineering, setting specifications and tolerances is about magnitude and precision, not about hypothesis tests; in manufacturing quality control, statistical process control charts monitor the centering and spread of a process, not just whether it's "in control"; in medical monitoring, tracking a patient's marker trajectory requires continuous magnitude assessment, not sequential binary tests. The "statistically significant = real and important; non-significant = null and negligible" mental shortcut fails in two distinct ways: (a) with huge samples, it calls tiny effects that are practically negligible "statistically significant," leading to resource allocation toward trivial findings; (b) with small samples, it calls practically meaningful effects "non-significant," leading to premature abandonment of potentially valuable interventions. Effect-size discipline prevents both failures by centering magnitude as the primary inferential object^[4]. The deeper principle is that effective decision-making under uncertainty requires attending to magnitude (the point estimate), direction (positive or negative), uncertainty (confidence or credible interval), and context (what counts as practically meaningful) jointly — collapsing these into a binary decision discards critical information needed for good judgment.

Knowledge Transfer¶

Domain	Typical Effect-Size Metric	Interpretive Context
Psychology	Cohen's d, η², partial η²	d=0.2 small, 0.5 medium, 0.8 large (Cohen's conventions)
Medical trials	Odds ratio, relative risk, HR, NNT	OR=1.5 modest; NNT provides absolute scale
Epidemiology	Risk difference, hazard ratio	Absolute risk reduction preferred for clinical communication
Education	Effect size in SD units (Glass's Δ)	0.1 small, 0.25 moderate, 0.4 large (What Works / Hattie)
Economics	Elasticity, marginal effect, $-impact	Unit-native; economic interpretation context-specific
A/B testing	Percent lift, absolute difference	Context-specific thresholds (e.g., 1% lift on revenue)
Marketing	Lift, ROAS, incremental conversion	Dollar-valued; comparable across campaigns
Meta-analysis	Pooled d, pooled OR with heterogeneity	Combined across studies with I² for heterogeneity
Machine learning	Accuracy/AUC improvement, F1 change	Task-specific; minimum meaningful lift varies
Ecology	Effect size on population growth rate, diversity	Biologically relevant thresholds; scaled to life-history

Examples¶

Formal Example — Jacob Cohen 1988 and the Cohen's d revolution in psychology¶

Jacob Cohen's Statistical Power Analysis for the Behavioral Sciences (1st edition 1969, 2nd edition 1988) is the canonical academic foundation for effect-size thinking in psychology and the social sciences broadly. The 1988 edition codified systematic definitions, conventions, and computational methods for standardized effect sizes across test families: d for mean comparisons, r for Pearson correlations, f for ANOVA, w for chi-square (proportion effects), h for differences in proportions, and q for differences in correlations. Cohen's "small/medium/large" benchmarks (d=0.2, 0.5, 0.8 for standardized mean difference) were explicitly offered as rough and tentative conventions for contexts lacking domain-specific benchmarks, but became widely cited as if universally applicable rules, a misuse Cohen himself explicitly regretted in later writings. The standardized d metric rapidly became the foundational unit for meta-analyses and effect-size syntheses across psychology^[7].

The practical impact on research culture unfolded over decades. A 1989 survey by Cohen found that fewer than 2% of published psychology papers reported effect sizes beyond implicit p-values; by 2010, mainstream psychology journals and the APA Publication Manual (6th edition, 2010) formally required effect-size reporting for primary statistical tests. Power analysis, built directly on effect-size reasoning, became routine in grant proposals and study protocols. Meta-analytic methods pioneered by Gene Glass, Larry Hedges, and John Hunter and Frank Schmidt (building on effect-size pooling rather than p-value vote-counting) became dominant evidence-synthesis tools in psychology, medicine, education, and organizational research.

Cohen's framework also exposed a phenomenon that significance testing had obscured: many published psychology findings had much smaller effect sizes than their authors appeared to believe, and many purported "medium" effects were artifacts of publication bias and selection bias: only studies with effect sizes large enough to achieve significance in small samples survived to publication (see #441 reproducibility_replicability). Gelman and Carlin's 2014 extension into "Type M" (magnitude) and "Type S" (sign) errors showed mathematically that underpowered studies reaching statistical significance tend to dramatically exaggerate the true effect size (a "winner's curse" phenomenon)^[8]. The contemporary emphasis on effect-size-centered reporting, pre-registered tests of pre-specified effect sizes, and explicit power analysis traces directly to Cohen's foundational innovations.
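The exaggeration of significant estimates in underpowered studies can be reproduced with a short simulation. This is an illustrative sketch, assuming normally distributed estimates with a known standard error and a two-sided 5% threshold; the function name `exaggeration_ratio`, the true effect, and the standard error are hypothetical choices rather than values from Gelman and Carlin.

```python
import random
from statistics import mean


def exaggeration_ratio(true_effect, std_error, n_sims=100_000, seed=0):
    """Simulate many replications of one study design and return
    (power, average |estimate| among significant results / true effect)."""
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, std_error)
        if abs(estimate / std_error) > 1.96:     # "p < 0.05", two-sided
            significant.append(abs(estimate))
    power = len(significant) / n_sims
    return power, mean(significant) / true_effect


# Hypothetical underpowered design: true effect equal to one standard error.
power, ratio = exaggeration_ratio(true_effect=0.1, std_error=0.1)
print(f"power={power:.2f}, significant estimates overstate the effect x{ratio:.1f}")
```

Because only estimates far from zero pass the significance filter, the surviving estimates average well above the true effect whenever power is low.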

Mapped back: Effect-size foundational abstraction — standardization of magnitude across diverse contexts; convention-setting (the d=0.2/0.5/0.8 benchmarks); integration with power analysis (sample size planning as magnitude detection); exposure of publication-bias distortion of effect-size estimates.

Applied Example — A national retail chain's transition from significance-centric to magnitude-centric testing¶

A mid-market consumer packaged goods company operating approximately 12,000 US retail doors ran about 60 price-promotion tests per year, each comparing weekly unit sales under a test price against matched control stores. The analytics team's framework was standard: t-test on log-transformed unit sales, decision rule "p<0.05 = success." Over a four-year period, this procedure validated 47 of 240 tests (~20% success rate), and the validated tests were rolled out chain-wide. When finance reviewed the post-rollout performance a year later, the aggregate incremental-margin lift was roughly 40% below projections^[9]. The tests were predicting effects that didn't persist, suggesting the test-validation procedure was systematically overstating true effect sizes.

A data-science team redesigned the testing framework around explicit effect-size discipline. Every test now reported: (a) point estimate of lift (percent-change in units, absolute unit-lift per store per week, incremental-margin contribution), (b) 95% confidence interval, (c) a pre-specified "minimum meaningful effect" (MME) of 5% unit-lift (the smallest effect the business cared about), (d) Bayesian posterior: P(lift > MME). The decision logic changed: tests with statistical significance (p<0.05) but point estimates below MME or confidence intervals including sub-MME values were marked "not actionable regardless of p-value." Tests with non-significant results but wide confidence intervals encompassing MME were marked "inconclusive: larger sample needed," not "failure." Tests with significance and point estimates clearly exceeding MME were marked "high confidence: rollout recommended."
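The decision logic described above can be sketched as a small classifier. This is one possible reading, not the team's actual implementation: it assumes "significant" means the 95% interval excludes zero, approximates P(lift > MME) with a normal distribution centred on the point estimate (roughly a flat-prior posterior), and adds a fallback label for a case the text does not name. The function name `classify_test` and the example inputs are hypothetical.

```python
from statistics import NormalDist


def classify_test(lift, ci_low, ci_high, mme=5.0):
    """Label a test from its lift estimate and 95% CI (all in percent).

    Returns (label, approximate probability that lift exceeds the MME)."""
    se = (ci_high - ci_low) / (2 * 1.96)         # back out SE from the CI
    p_exceeds = 1 - NormalDist(mu=lift, sigma=se).cdf(mme)
    significant = ci_low > 0 or ci_high < 0      # assumption: proxy for p<0.05
    if significant and lift >= mme and ci_low >= mme:
        label = "high confidence: rollout recommended"
    elif significant:
        label = "not actionable regardless of p-value"
    elif ci_high >= mme:
        label = "inconclusive: larger sample needed"
    else:
        label = "not actionable"                 # assumption: case not named in the text
    return label, p_exceeds


# Hypothetical test results: (lift %, CI low %, CI high %).
for lift, low, high in [(8, 6, 10), (3, 1, 5), (4, -2, 10)]:
    label, prob = classify_test(lift, low, high)
    print(f"lift={lift}% CI=({low}, {high}) -> {label}; P(lift>MME)~{prob:.2f}")
```

Treating an interval that dips below the MME as not actionable is the strictest reading of the rule; a team could instead gate on the posterior probability, which is a separate design choice not specified here.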

Post-intervention, the success rate dropped to 12%, but validated rollouts matched financial projections within 8%, indicating the surviving tests were genuinely higher-quality. Sample sizes rose (designed for power against MME, not for p<0.05). Communications to business stakeholders shifted from binary "passed/failed" to rich narratives: "This test shows a 7% lift (95% CI 4%-10%) with 96% posterior probability of exceeding our 5% minimum. The incremental margin would support rollout if validated in the next-period replication." Brand managers began requesting pre-specified MMEs before tests launched, embedding effect-size thinking into the testing culture^[10]. The organizational shift was that "statistical significance" stopped serving as a validation credential; the real conversation became "is the magnitude large enough to matter?"

Mapped back: Effect-size decision-making structure — pre-specified minimum effect thresholds (context-dependent); decision-logic rewiring (significance necessary but not sufficient); sample-size planning aligned to effect magnitude, not to p<0.05; communication and organizational culture shift from binary outcomes to magnitude-centric reasoning.

Structural Tensions¶

T1 — Standardized versus raw/unstandardized metrics. Standardized effect sizes (Cohen's d, Pearson's r, η², Cramér's V) enable direct comparison across studies with different outcome scales, sample sizes, and populations, making them essential for meta-analysis. Raw unstandardized effect sizes (percent-lift, minutes-saved, dollars-earned, lives-saved) preserve substantive units that decision-makers and end-users understand intuitively. The tension is that standardization enables synthesis but obscures interpretability; raw metrics enable interpretation but lack cross-study comparability. Best practice reports both simultaneously: standardized metrics for research synthesis, raw metrics for applied decision-making. No single metric serves all audiences equally well^[11].

T2 — Magnitude (point estimate) versus reliability (interval estimate). Effect size is a point estimate (e.g., d=0.35) communicating the best estimate of magnitude; uncertainty is captured by confidence intervals, credible intervals, or Bayesian posterior distributions. Reporting a large effect size from a small, noisy study (e.g., d=0.8 ± 0.7, with wide interval) misleads readers toward overconfidence in the magnitude if the interval is not equally prominent. Conversely, reporting statistical significance without magnitude (p<0.05) misleads in the opposite direction, by suppressing information about the effect's scale. The tension is in simultaneously and equally communicating "this is the estimated magnitude" and "this is our uncertainty about it," a communication discipline that requires more reporting real estate than either quantity alone, creating pressure to abbreviate or simplify.

T3 — Universal conventions versus context-specific benchmarks. Cohen's small/medium/large rules (d=0.2, 0.5, 0.8 in standard deviations) are widely cited and provide universal reference points. Yet they are deeply context-dependent: in behavioral interventions on cognitive learning outcomes, d=0.2 is often considered substantial and worthwhile; in laboratory physics or chemistry, d=2 might be considered negligibly small. The tension is between the practical convenience of universal conventions (when domain-specific knowledge is unavailable) and the accuracy of context-specific benchmarks (which require substantive domain expertise). Applying universal benchmarks as authoritative rules produces systematic misinterpretation.

T4 — Average effect versus heterogeneous effects across subgroups. A reported average or overall effect size aggregates across populations, subgroups, and contextual conditions that may have substantially different effects. The same d=0.4 overall effect might decompose into d=0.8 in one demographic subgroup and d=0.0 in another, in which case the average obscures important heterogeneity. Subgroup analysis reveals this variation but risks inflating Type I error rates through multiple-comparisons problems. Treatment-effect-heterogeneity modeling (causal forests, Bayesian additive regression trees) offers principled tools but requires larger samples and more sophisticated analysis. The tension is between reporting simplicity (one summary number) and substantive accuracy (heterogeneity-aware reporting), particularly when downstream implementation will affect heterogeneous populations unequally.

T5 — Selection bias and publication bias distortion of effect sizes. The published literature on effect sizes is subject to multiple selection mechanisms: only studies with effects large enough to achieve statistical significance in small samples are likely to be published ("publication bias"); analyses are selected for reporting after seeing data ("selection bias" or "p-hacking"); replication attempts are less likely to be published if they fail. The result is that published effect-size estimates are systematically larger than true population effects, a phenomenon documented in meta-science research and termed the "file-drawer problem." Correcting for selection bias requires methods (trim-and-fill, p-curve, robust Bayesian meta-analysis) that have their own assumptions. The tension is between the apparent precision of a published meta-analytic estimate and the unquantifiable distortion from selection bias in the underlying literature.

T6 — Temporal stability and context generalization of effects. An estimated effect size from one sample, location, time period, or population may not generalize to another. A d=0.5 treatment effect in an efficacy trial (highly controlled conditions, motivated participants) may shrink to d=0.2 in an effectiveness trial (routine care, diverse populations). Broad generalization claims risk overstatement; cautious context-specific interpretation underutilizes findings. Modern practice emphasizes pre-registered sample-specific effect-size estimation and replication across diverse populations, but this requires resources most studies lack. The tension is between pragmatic use of available evidence (accepting potential context limitation) and epistemically cautious inference (qualifying conclusions to observed contexts).

Structural–Framed Character¶

Effect Size is a hybrid on the structural–framed spectrum, leaning structural with a light frame. Part of it is a bare pattern — the magnitude of a difference or relationship, expressed in interpretable units independent of how much data you collected; part of it is a vocabulary inherited from statistics and experimental design.

The structural core is a clean separation of two questions: not whether an effect differs from zero, but how large it is, often standardized by dividing the observed difference by a measure of spread so it can be compared across studies. That quantity is well defined wherever there is a measured contrast — a medical trial, an education study, an A/B test, a psychology experiment — and the arithmetic does not change with the subject matter. The lighter frame comes from its statistical home: the prime presupposes the apparatus of sampling, significance, and the practice of inference, and it carries a mild normative pull toward reporting magnitudes that matter for real decisions rather than chasing bare significance. Because a transferable quantitative pattern dominates while a modest methodological frame rides along, it sits toward the structural side of the middle.

Substrate Independence¶

Effect Size is a narrowly substrate-independent prime — composite 2 / 5 on the substrate-independence scale. The construct — separating the magnitude of an effect from its statistical significance, independent of sample size — is abstractly clean, but it is specialized to experimental design and quantitative research. It does not meaningfully travel outside statistics and data-heavy fields; social scientists and philosophers simply do not invoke effect-size reasoning when they are off the quantitative grid. So a fairly general formal idea ends up tethered in practice to the statistics domain that defines it, rather than functioning as a substrate-independent prime.

Composite substrate independence — 2 / 5
Domain breadth — 2 / 5
Structural abstraction — 4 / 5
Transfer evidence — 1 / 5

Relationships to Other Abstractions¶

Current abstraction Effect Size Prime

Parents (2) — more general patterns this builds on

Effect Size presupposes Comparison Prime

Effect size presupposes comparison because magnitude is read off the relation between two or more co-considered quantities.
Effect Size presupposes Scale Prime

Effect size presupposes scale because it quantifies the magnitude of an observed relationship in substantive units of measurement.

Children (3) — more specific cases that build on this

Small-Study Effects Domain-specific is part of Effect Size

Small-Study Effects contains effect-size estimates as the magnitude coordinate whose systematic relationship with study precision defines the pattern.
Type M Error Domain-specific is part of Effect Size

Type M Error contains observed and assumed true Effect Sizes whose magnitudes form the selected estimate and comparison target.
Type S Error Domain-specific is part of Effect Size

Type S Error contains an estimated and true Effect Size whose signed directions are compared after significance selection.
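The Type M and Type S children listed above can be illustrated with a small simulation. This is a sketch under stated assumptions, not a definitive implementation: each study's estimate is drawn from a normal distribution around an assumed true effect, significance is judged with a two-sided z criterion of 1.96, and the scenario (true d = 0.2, 20 participants per group) is hypothetical. The function name `type_m_and_s` is introduced here for illustration only.

```python
import math
import random


def type_m_and_s(true_effect, std_error, n_sims=100_000, z_crit=1.96, seed=0):
    """Simulate replications and summarize the statistically significant ones.

    Returns (power, exaggeration_ratio, wrong_sign_rate).
    """
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, std_error)  # one study's estimate
        if abs(estimate) / std_error > z_crit:        # passes the significance filter
            significant.append(estimate)
    power = len(significant) / n_sims
    # Type M: how much significant estimates overstate the true magnitude.
    exaggeration = sum(abs(e) for e in significant) / len(significant) / abs(true_effect)
    # Type S: share of significant estimates whose sign is wrong.
    wrong_sign = sum(1 for e in significant if e * true_effect < 0) / len(significant)
    return power, exaggeration, wrong_sign


# Assumed scenario: true d = 0.2 with 20 participants per group,
# so the standard error of d is roughly sqrt(2 / 20).
power, exaggeration, wrong_sign = type_m_and_s(0.2, math.sqrt(2 / 20))
print(f"power={power:.3f}  exaggeration={exaggeration:.2f}  wrong-sign rate={wrong_sign:.3f}")
```

In this low-power setting, any estimate that clears the significance filter must be several times larger than the assumed true effect, which is why selecting on significance inflates reported magnitudes.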

Hierarchy paths (2) — routes to 2 parentless roots

Effect Size → Comparison → Self Checking


Neighborhood in Abstraction Space¶

Effect Size sits in a moderately populated region (48th percentile for distinctiveness): it has near-neighbors but no dense thicket of synonyms.

Family — Aggregation & Distributional Effects (11 primes)

Nearest neighbors

Computed from structural-signature embeddings · 2026-07-26

Not to Be Confused With¶

Effect Size must be distinguished from Proportion and Scale, its closest neighbor (similarity 0.724), though both involve quantitative relationships. Proportion and Scale concerns the sizing and ratio of compositional parts to wholes—how much of a total budget is allocated to marketing versus production, what fraction of team capacity is deployed to project X. This is fundamentally about relational positioning and allocation fractions. Effect Size, by contrast, quantifies the magnitude of a treatment, intervention, or relationship in standardized units that enable comparison across diverse measurement scales and studies. Proportion asks "what fraction of the whole?" while effect size asks "how much did this phenomenon change?" A 30% budget allocation to marketing (a proportion) is a different kind of fact than a Cohen's d=0.5 treatment effect on customer retention (an effect size). Though both are quantitative, they answer different structural questions: proportion is about composition and allocation; effect size is about measurement magnitude and comparative estimation. A meta-analysis synthesizing the effects of marketing interventions reports effect sizes, not proportions; a budget committee allocates proportions, not effect sizes. The two can coexist—a marketer might allocate 20% of budget (proportion) to a treatment whose effect size is d=0.4 (magnitude)—but they operate on different causal and structural planes.

Effect Size also differs from Scale, though both appear to involve "size." Scale is a structural prime naming the observation that systems behave fundamentally differently at different magnitudes—single cells obey different physical laws than multicellular organisms; a village and a nation operate under different coordination mechanisms. Scale is about the ontological bands at which phenomena are described, and the fact that the rules themselves change across bands. Effect Size, by contrast, is a statistical measurement construct: it quantifies the magnitude of a specific phenomenon (a treatment effect, a correlation, a difference) at a given scale, in interpretable units independent of sample size. A health-economics researcher measuring the effect of a drug on patient outcomes across millions of subjects is working at the national scale; the effect size (a hazard ratio of 0.85, an NNT of 50) quantifies the magnitude of the drug's impact. A laboratory researcher measuring the same drug on cultured cells is working at the cellular scale; the effect size (a percent-change in proliferation rate) quantifies the magnitude at that scale. Scale describes which band of the world you're studying; effect size describes how much of a difference your phenomenon makes within that band. They are complementary: effect size requires you to specify a scale (you always measure effects at some scale); scale thinking requires you to specify effect sizes at the scale you've chosen (bigger effects matter more at every scale). But one is ontological (what world-levels are relevant?) and the other is measurement-based (how much does the phenomenon matter?).

Effect Size must also be distinguished from Statistical Significance; the two are often conflated in practice, and this conflation is a source of widespread scientific misinterpretation. Statistical significance answers the question "is this effect distinguishable from zero (or from the null hypothesis)?" using a p-value or, in Bayesian analyses, a check of whether a credible interval excludes the null value. Effect Size answers the question "how big is the effect?" using a point estimate in interpretable units (Cohen's d, odds ratio, percent change, lives saved). The questions are separable: two effects can be highly significant with wildly different magnitudes, and a large effect can be non-significant in a small study. A study of 100,000 participants might show a treatment effect that is statistically significant (p<0.001) but tiny in practical magnitude (d=0.05, clinically negligible). Conversely, a small study of 20 participants (10 per group) might show a large practical effect (d=0.8) that does not reach statistical significance at the 0.05 level because the study lacks power. Conflating significance with importance has been called "the systematic error of significance-testing culture" because it directs resources toward trivial findings (a large enough sample makes even a tiny effect significant) and leads to the abandonment of valuable interventions (a small sample can fail to reach significance despite a large practical effect). Modern statistical reform emphasizes that significance and effect size are separable and both needed: effect size is the magnitude being estimated; significance testing indicates whether the data are compatible with no effect at all; and the precision of the estimate is conveyed by its confidence or credible interval. They must be reported together for complete inference.
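The separation between magnitude and significance can be checked numerically. The sketch below assumes two equal-sized groups and uses a normal approximation to the two-sample t-test (for very small samples the exact t-test gives a somewhat larger p-value, which only strengthens the second case); the function name `two_sided_p` and both scenarios are hypothetical.

```python
import math
from statistics import NormalDist


def two_sided_p(d, n_per_group):
    """Approximate two-sided p-value for an observed standardized difference d
    between two equal-sized groups (normal approximation to the t-test)."""
    z = d * math.sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


p_large_study = two_sided_p(d=0.05, n_per_group=50_000)  # tiny effect, huge sample
p_small_study = two_sided_p(d=0.80, n_per_group=10)      # large effect, tiny sample

print(f"d=0.05, n=50,000 per group: p = {p_large_study:.2g}")
print(f"d=0.80, n=10 per group:     p = {p_small_study:.2g}")
```

The test statistic grows with the square root of the sample size, so the p-value reflects the effect size and the amount of data together, while d itself does not change with n.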

Finally, Effect Size differs from Dose-Response Relationship, though both quantify how outcomes vary. Dose-Response Relationship maps the quantitative input-output function across a range of doses or exposures, characterizing the functional form (linear, nonlinear, sigmoidal, threshold), the curve's shape, and the relationship at each point along the gradient. A dose-response curve shows how a patient's symptom reduction increases as medication dose increases from 10mg to 20mg to 30mg to 40mg, revealing the shape of the relationship (perhaps linear up to 30mg, then plateauing, with diminishing returns at higher doses). Effect Size, by contrast, is a scalar magnitude estimate comparing two points: typically, one treatment condition against a control condition (or one exposure level against another), producing a single number summarizing the magnitude of the difference. Effect size is point-wise comparison; dose-response is functional mapping across a continuum. A clinical trial might report an effect size (treatment vs. control) of d=0.45, a single summary of how much the treatment helps. A dose-response study of the same drug across six dose levels reports the entire curve, showing not just "it helps" but "by how much at each dose, and where the curve flattens," a richer but more complex characterization. Both are valuable for different purposes: effect size drives clinical decision-making (does this dose help enough to use?), while dose-response guides optimal dosing and reveals safety thresholds (at what dose does benefit stop increasing and harm begin?).

Solution Archetypes¶

Solution archetypes in the catalog that build on this prime — directly (this prime is a source ingredient) or as a related prime.

Also a related prime in 22 archetypes

Attrition and Dropout Monitoring: Track who leaves a study, when they leave, why they leave, and from which condition so dropout cannot silently distort causal or comparative conclusions.
Baseline Covariate Balance Verification: Check whether randomization actually produced comparable groups by comparing pre-treatment covariates before causal conclusions are drawn.
Bayesian Belief Updating: Revise beliefs by combining prior expectations with new evidence rather than treating each observation in isolation.
Blocking Design: Group similar experimental units before assignment and compare treatments within blocks so nuisance variation does not obscure the effect being studied.
Catalytic Pairing: Pair factors so one increases the effectiveness of the other beyond what either achieves alone.
Control-Condition Specification: Make an experimental effect interpretable by specifying exactly what the treatment is being compared against and keeping that comparator realistic, ethical, stable, and uncontaminated.
Correlation Structure Characterization: Characterize how variables move together—by sign, strength, form, lag, condition, uncertainty, and stability—then explicitly constrain what that association may be used to claim or decide.
Counterfactual Comparison: Compare what happened with a plausible alternative to isolate causal effect or decision value.
Dimensionality Reduction for Signal: Reduce many variables into fewer informative dimensions so structure becomes visible without drowning in noise.
Dimensioned Comparison Framing: Make comparison legitimate by aligning the items, dimensions, scales, context, and relation-readout rule before drawing conclusions.


Effect Size Standardization: Convert raw inferred effects into comparable, uncertainty-bounded magnitude expressions so evidence can be judged by size and practical meaning, not only by detectability.
Hypothesis Test Power Calibration: Design a hypothesis test around the effect that would actually matter, then tune sample size, noise control, allocation, and error rates so the test has adequate power to detect it.
Hypothesis Testing Frame: Frame a claim against a default alternative so evidence can change belief or action under explicit error risks.
Interaction Effect Mapping: Map how factors change one another's effects when combined so interventions are not evaluated only in isolation.
Measurement-Protocol Standardization: Make comparisons interpretable by ensuring every subject, group, site, or condition is measured with the same construct, instruments, timing, administration, scoring, calibration, and deviation rules.
Minimum Effective Intervention: Use the smallest intervention intensity that reliably produces the desired effect.
Realized-Possible Outcome Gap Mapping: Compare what a process actually produced with what it could credibly have produced, then treat the gap as the main diagnostic object.
Regression-to-the-Mean Guardrail: Prevent ordinary reversion after extreme observations from being credited to an intervention, person, punishment, reward, or event without a credible counterfactual.
Risk-Adjustment and Benchmark Selection: Before calling performance abnormal, inefficient, or skillful, choose a benchmark that matches the relevant risk exposure, opportunity set, time horizon, and information conditions.
Selection–Transmission Change Attribution: When an aggregate mean changes, split the change into how much came from units gaining or losing weight and how much came from units changing internally.
Synergistic Combination Design: Combine elements so their interaction produces more value than isolated implementation.
Time Series Cross-Section Analysis: Compare many units across many moments so change over time is not confused with stable differences between units.

Notes¶

Effect-size reporting is now recognized as a foundational practice of modern statistical inference, emphasized in the American Statistical Association's 2016 statement on p-values and the 2019 follow-up special issue of The American Statistician, in journal and reporting guidelines (APA Publication Manual, CONSORT, STROBE), and in funding-agency requirements. The practical and philosophical shift from significance testing to effect-size-plus-interval reporting remains incomplete in many fields, but momentum is strong in medicine, psychology, economics, and policy research. Contemporary practice emphasizes: (a) pre-specified minimum effects of interest for sample-size planning; (b) confidence intervals or credible intervals alongside point estimates; (c) effect-size reporting in abstracts and headline findings, not relegated to tables; (d) domain-specific interpretation rather than universal benchmarks; (e) heterogeneity assessment in meta-analytic synthesis rather than averaging across diverse effects; (f) Bayesian posterior distributions of effect size as an alternative to frequentist point-and-interval pairs. The tight structural connection to #437 (statistical_power) means the two should be traversed together: power analysis ties together effect size, sample size, significance level, and power, so that fixing any three determines the fourth, and the two primes address complementary aspects of research design and interpretation. Meta-science research documents that fields which moved toward effect-size discipline (clinical medicine, education research) have achieved better cumulative evidence synthesis than fields that remained focused on significance testing.
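Item (a) above, planning a sample around a pre-specified minimum effect of interest, can be sketched with the standard normal-approximation formula for a two-sided comparison of two equal-sized groups. The function name `n_per_group` is hypothetical, and the defaults (alpha of 0.05, power of 0.80) are common conventions assumed here rather than values taken from this document; exact t-based calculations give slightly larger numbers.

```python
import math
from statistics import NormalDist


def n_per_group(min_effect_d, alpha=0.05, power=0.80):
    """Approximate participants per group for a two-sided, two-sample comparison
    (normal approximation; exact t-based calculations give slightly larger n)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / min_effect_d ** 2)


for d in (0.2, 0.5, 0.8):
    print(f"minimum effect of interest d={d}: about {n_per_group(d)} per group")
```

Because the required sample scales with the inverse square of the effect size, halving the smallest effect worth detecting roughly quadruples the number of participants needed.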

References¶

[1] Wilkinson, L., & American Psychological Association Task Force on Statistical Inference. (1999). "Statistical methods in psychology journals: Guidelines and explanations." American Psychologist, 54(8), 594–604. Supports the claim that psychology adopted formal effect-size reporting; the Task Force recommended reporting effect sizes and CIs alongside (or instead of) bare significance.

[2] Sullivan, G. M., & Feinn, R. (2012). "Using effect size — or why the P value is not enough." Journal of Graduate Medical Education, 4(3), 279–282. Supports the medical and clinical-trials discussion: argues effect size is the primary product of research and that P values alone are insufficient, with worked clinical examples (NNT, magnitude vs. significance).

[3] Glass, G. V. (1976). "Primary, secondary, and meta-analysis of research." Educational Researcher, 5(10), 3–8. Coins 'meta-analysis' and advocates averaging standardized effect sizes across studies; supports the educational-research and evidence-synthesis point that effect sizes in SD units became the classification metric (Glass's Δ lineage).

[4] Funder, D. C., & Ozer, D. J. (2019). "Evaluating effect size in psychological research: Sense and nonsense." Advances in Methods and Practices in Psychological Science, 2(2), 156–168. Supports the points on effect-size interpretation in psychology: argues effect sizes are underappreciated and misinterpreted, and proposes benchmark and consequence heuristics.

[5] Cumming, G. (2014). "The new statistics: Why and how." Psychological Science, 25(1), 7–29. The 'new statistics' program centers estimation via effect sizes, confidence intervals, and meta-analysis over NHST; backs both the A/B-testing and Bayesian-posterior framing and the confidence-interval-versus-significance point.

[6] Hedges, L. V., & Olkin, I. (1985). Statistical Methods for Meta-Analysis. Academic Press. Foundational treatment of effect-size estimation, pooling, fixed- and random-effects models, and heterogeneity; supports the points on meta-analytic pooling of effect-size estimates replacing p-value vote-counting.

[7] Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates. Codifies standardized effect-size definitions (d, r, f, w, h, q) and the small/medium/large (0.2/0.5/0.8) conventions offered as rough benchmarks; directly supports the discussion of the Cohen's d framework.

[8] Gelman, A., & Carlin, J. (2014). "Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors." Perspectives on Psychological Science, 9(6), 641–651. Introduces Type M (exaggeration) and Type S (sign) errors and shows that underpowered significant studies exaggerate the true effect size.

[9] Lakens, D. (2013). "Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs." Frontiers in Psychology, 4, 863. On-topic effect-size methodology primer. It supports the general effect-size discipline that the applied 'national retail chain / CPG company' vignette illustrates, but it cannot substantiate that vignette's invented figures.

[10] Cumming, G. (2012). Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. Routledge. Unified treatment of effect sizes, CIs, and meta-analysis. It supports the effect-size-centric decision framework of the same applied vignette generically, not the specific invented numbers.

[11] Hedges, L. V. (1981). "Distribution theory for Glass's estimator of effect size and related estimators." Journal of Educational Statistics, 6(2), 107–128. Derives the exact distribution and bias of Glass's estimator and gives the minimum-variance unbiased d; supports the standardized-versus-raw and bias-corrected estimator distinction.

[12] Cohen, J. (1992). "A power primer." Psychological Bulletin, 112(1), 155–159. Tabulates effect-size indexes and required sample sizes for .80 power across 8 standard tests.

[13] Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). "Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations." European Journal of Epidemiology, 31(4), 337–350. Documents 25 misinterpretations of P values, confidence intervals, and power; it does not cover Type M/S errors.

[14] Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E. J. (2016). "The fallacy of placing confidence in confidence intervals." Psychonomic Bulletin & Review, 23(1), 103–123.

[15] Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E. J. (2014). "Robust misinterpretation of confidence intervals." Psychonomic Bulletin & Review, 21(5), 1157–1164.

[16] Benjamini, Y., & Hochberg, Y. (1995). "Controlling the false discovery rate: A practical and powerful approach to multiple testing." Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300.