Effect Size¶
Core Idea¶
Effect size quantifies the magnitude of a relationship or difference — the size of an observed effect in substantive, interpretable units — independent of sample size and separately from statistical significance. The core insight is that statistical significance ("does this effect exist?") is a question about whether an observation deviates from zero; effect size ("how big is this effect?") is a question about the scale of that deviation in units that matter for practical decision-making. Effect size reporting shifts the analytical frame from the dichotomous significance-testing question to the continuous estimation question, revealing a phenomenon that significance testing obscures: a large sample can make any non-zero effect "statistically significant," while a small sample can render even a large practical effect "non-significant," making significance alone an unreliable guide to importance. The abstraction is that meaningful inference requires attending to magnitude (the effect-size estimate), uncertainty (confidence interval or posterior distribution), and direction (positive or negative) jointly — not collapsing them into a binary reject/do-not-reject decision.
Structural Signature¶
Effect size operates at the interpretive layer of statistical analysis, with six defining structural roles: the magnitude-of-effect quantification in substantive units, the standardization mechanism dividing observed difference by variability, the dimensionless comparability property enabling cross-study synthesis, the separation-of-magnitude-from-uncertainty design, the meta-analytic pooling foundation replacing p-value vote-counting, and the practical-significance-versus-statistical-significance distinction. Standardized effect sizes are scale-free: Cohen's d divides a mean difference by a standard deviation, while Pearson's r, η², and odds ratios are dimensionless by construction (a correlation, a proportion of variance explained, a ratio of odds). Dividing by a standard error instead would yield a test statistic that grows with sample size, not an effect size. Standardized values are comparable across studies and outcome measures; unstandardized effect sizes (raw mean differences, absolute risk reductions, percentage-point changes) preserve the substantive units that stakeholders care about. The critical structural distinction is the separation of magnitude from reliability: a point estimate of effect size (the "how big") is paired with an interval estimate (the "how certain"), and both are needed for complete inference. Power analysis, sample-size planning, and meta-analysis all hinge on effect-size reasoning: sample size calculations are not fundamentally about achieving "statistical significance" but about engineering sufficient statistical power to detect effects of a specified practical magnitude; meta-analyses combine effect-size estimates across studies to produce pooled estimates and assess heterogeneity, a synthesis that would be impossible using p-values alone.
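As a minimal sketch of the pairing described above, the following Python computes a raw difference, Cohen's d with a pooled standard deviation, and an interval for d. The data are invented for illustration, the function names are hypothetical, and the interval uses a large-sample normal approximation (an assumption; exact intervals rely on the noncentral t distribution).

```python
import math
from statistics import NormalDist, mean, variance


def cohens_d(treatment, control):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    pooled_var = ((n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_var)


def d_confidence_interval(d, n1, n2, level=0.95):
    """Approximate interval for d (large-sample normal approximation, an assumption)."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d - z * se, d + z * se


# Hypothetical scores, invented for illustration only.
treatment = [14, 15, 15, 16, 17, 18, 18, 19]
control = [12, 13, 14, 14, 15, 16, 16, 17]

raw_difference = mean(treatment) - mean(control)  # unstandardized: in scale points
d = cohens_d(treatment, control)                  # standardized: in SD units
low, high = d_confidence_interval(d, len(treatment), len(control))

print(f"raw difference = {raw_difference:.3f} points")
print(f"Cohen's d = {d:.2f}, approx. 95% CI [{low:.2f}, {high:.2f}]")
```

With only eight observations per group the interval is very wide, which is the point of reporting it next to the point estimate.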
What It Is Not¶
- Not the same as statistical significance — A statistically significant effect can be tiny and practically irrelevant; a large effect can be non-significant in a small study. The distinction between #435 (statistical_significance_p_value) and effect size is fundamental: statistical significance addresses "is this distinguishable from zero?" while effect size addresses "how much is it?" These are orthogonal questions; conflating them has been the systematic error of significance-testing culture.
- Not a substitute for confidence intervals — A point estimate of effect size without calibrated uncertainty is incomplete. Effect size must be reported alongside interval estimates (#436 confidence_intervals) to communicate both magnitude and precision. A large effect from a small, noisy study is uninformative without its wide confidence interval; conversely, a narrow interval around a trivial effect communicates different information than a wide interval around the same point estimate.
- Not uniquely defined — Different metrics (Cohen's d, Pearson's r, odds ratio, η², relative risk, raw difference) highlight different aspects of the same phenomenon and are not always mutually convertible without assumptions. The choice of metric should depend on the research question and the audience; a meta-analyst may prefer standardized d for cross-study synthesis, while a clinician may prefer number-needed-to-treat for decision-making.
- Not inherently interpretable without context — Cohen's d=0.2, 0.5, 0.8 as "small/medium/large" conventions are universal approximations that apply poorly in many domains. A d=0.2 in a preventive intervention on educational outcomes is often substantial; the same d=0.2 in laboratory physics might be negligible. Effective effect-size interpretation requires domain-specific judgment about what magnitudes are practically meaningful.
- Not a causal claim by itself — Effect size quantifies an observed association or difference in data; whether that difference is causal depends on design features (randomization, control arms, measured and unmeasured confounding assessment). An effect size from an observational study is a descriptive quantity; causal interpretation requires additional argument.
- Not identical to clinical or practical significance — In medicine and policy, clinical significance combines effect size with costs, harms, patient values, and implementation feasibility. A small effect size might have enormous clinical significance if the harm prevented is severe and the intervention is cheap; conversely, a large effect size might have trivial significance if harm prevented is rare or intervention cost is prohibitive.
- Not monotonic with practical value — A small standardized effect (d=0.1) applied to millions of people (e.g., a 0.2% reduction in mortality across a large population) can create enormous aggregate value; a large standardized effect (d=0.8) in a small population may matter little in absolute terms. Aggregate impact depends on both effect size and population scale.
- Not equivalent to #437 (statistical_power) — While power analysis is built on effect-size thinking (power to detect an effect of specified magnitude), reporting the observed effect size does not tell you the power of the study. A non-significant finding with a large confidence interval excludes neither "no effect" nor "large effect"; determining what the study could have detected requires retrospective power analysis or confidence-interval examination.
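The first point in the list above can be made concrete with a short calculation. This sketch assumes two equal-sized groups and a z approximation to the two-sample test; the function name and the chosen values of d and n are hypothetical.

```python
import math
from statistics import NormalDist


def two_sided_p(d, n_per_group):
    """Approximate two-sided p-value for a standardized mean difference d
    with equal group sizes (z approximation, an assumption)."""
    z = d * math.sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# A negligible effect measured on a very large sample...
tiny_effect_huge_sample = two_sided_p(d=0.02, n_per_group=100_000)
# ...versus a conventionally "large" effect measured on a very small one.
large_effect_small_sample = two_sided_p(d=0.8, n_per_group=8)

print(f"d=0.02, n=100000 per group: p = {tiny_effect_huge_sample:.2g}")
print(f"d=0.80, n=8 per group:      p = {large_effect_small_sample:.2g}")
```

Under these assumptions the trivial effect is "significant" and the large one is not, so the p-value alone says little about magnitude.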
Broad Use¶
Psychology and Behavioral Science. The American Psychological Association's Publication Manual (6th edition, 2010, and 7th edition, 2020) explicitly mandates effect-size reporting for all statistical tests. Cohen's d, η² (eta-squared), ω² (omega-squared), and partial η² are standard effect-size metrics. The shift was gradual but decisive: a 1989 survey by Cohen found fewer than 2% of published psychology papers reported effect sizes beyond implicit p-values; by 2010, major psychology journals had adopted formal effect-size reporting requirements. Contemporary psychology distinguishes standardized d (for cross-study meta-analysis) and unstandardized raw differences (for interpretability within a study), with reporting guidance favoring both[1].
Medical Research and Clinical Trials. Odds ratios, relative risks, hazard ratios, and absolute risk reductions (with number-needed-to-treat, or NNT) are conventional primary-result effect-size metrics in clinical trials. Meta-analyses aggregate effect sizes across trials to produce pooled estimates that inform clinical guidelines and evidence-based medicine syntheses. CONSORT (Consolidated Standards of Reporting Trials) explicitly requires effect-size reporting alongside p-values and confidence intervals. The medical tradition emphasizes communicating effect size in absolute terms (NNT for harm prevention, cost per life-year saved) alongside relative metrics, serving decision-making by patients, clinicians, and regulators[2].
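The relative and absolute metrics named above can all be derived from the same 2x2 table. The sketch below uses invented trial counts and a hypothetical function name; it treats NNT as undefined (infinite) when the treatment shows no risk reduction, which is a simplification.

```python
def binary_effect_sizes(events_treated, n_treated, events_control, n_control):
    """Relative and absolute effect sizes for a binary outcome."""
    risk_t = events_treated / n_treated
    risk_c = events_control / n_control
    arr = risk_c - risk_t  # absolute risk reduction
    return {
        "relative_risk": risk_t / risk_c,
        "odds_ratio": (risk_t / (1 - risk_t)) / (risk_c / (1 - risk_c)),
        "absolute_risk_reduction": arr,
        # Simplification: no benefit means no finite number needed to treat.
        "nnt": 1 / arr if arr > 0 else float("inf"),
    }


# Hypothetical trial: 30/1000 events under treatment, 40/1000 under control.
result = binary_effect_sizes(30, 1000, 40, 1000)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

The same data read as a 25% relative risk reduction or as one event prevented per roughly 100 patients treated, which is why reporting both relative and absolute scales matters.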
Educational Research and Evidence Synthesis. The What Works Clearinghouse, a US Department of Education evidence-classification system, uses effect sizes in standard deviation units as its primary classification metric, assigning studies to evidence tiers based on effect magnitude. John Hattie's meta-analytic synthesis of approximately 800 meta-analyses (2008 Visible Learning and subsequent updates) identifies effect sizes of 0.4 in standard deviation units as a heuristic threshold for "significant positive impact," with effect sizes above and below providing continuous scaling of intervention effectiveness. This application has driven effect-size awareness in education policy and practice[3].
Economics and Econometrics. Elasticities (percent-change-in-outcome per percent-change-in-predictor), marginal effects (partial derivatives of outcome with respect to predictors), and dollar-valued impacts are the substantive quantities of interest in economic analysis. T-statistics and p-values play secondary roles as statistical diagnostics; the economic question is always "by how much," not just "is there an effect." Regression-coefficient reporting with confidence intervals and effect-size interpretation based on economic context (e.g., a 0.1 percentage-point change in GDP growth is a very large effect) is standard practice[4].
Technology, A/B Testing, and Online Experimentation. Percent-lift (percentage-change relative to control) and absolute-difference effect sizes are the primary decision drivers in A/B testing at technology companies. Statistical significance alone does not settle the decision: a test may reach p<0.01 with a lift of only 0.1%, and whether that lift justifies implementation depends on its magnitude in context (on a metric worth millions it may; on a minor metric it may not). Multi-armed bandit algorithms and Bayesian testing frameworks increasingly orient around posterior distributions of effect sizes rather than dichotomous reject/do-not-reject outcomes, allowing sequential decision-making under specified uncertainty. This shift reflects the recognition that business decisions depend on effect magnitude, not on significance thresholds[5].
Policy Analysis and Cost-Effectiveness. Effect sizes expressed as cost-effectiveness ratios (cost per quality-adjusted life year gained), return-on-investment metrics, or cost-benefit comparisons allow comparisons across diverse policy interventions. A job-training program's effect on earnings, measured in dollars per enrollee, is directly comparable to a health-care subsidy's effect on disease incidence. This application of effect-size reasoning aligns with public-finance and welfare-economic frameworks where the magnitude of impact and its cost-efficiency determine policy adoption[4].
Systematic Review and Meta-Analysis. Cochrane reviews, Campbell Collaboration systematic reviews, and other high-standards evidence-synthesis programs emphasize effect-size estimation as the central inferential task. Rather than vote-counting (how many studies show significance?), which is biased toward large studies and ignores effect magnitudes, meta-analysis pools effect-size estimates across studies, produces a pooled estimate with confidence interval, and assesses heterogeneity (I² statistics, prediction intervals) to characterize the distribution of effects across contexts. This methodological standard has been adopted across medical, educational, psychological, and policy research[6].
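A minimal sketch of the pooling step, assuming a fixed-effect (common-effect) inverse-variance model; under substantial heterogeneity a random-effects model is the usual alternative. The study estimates and standard errors are invented, and the function name is hypothetical.

```python
import math


def pool_fixed_effect(estimates, standard_errors):
    """Inverse-variance pooled estimate, its standard error, and I-squared."""
    weights = [1 / se ** 2 for se in standard_errors]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1 / total)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, pooled_se, i_squared


# Hypothetical standardized mean differences from four studies.
d_values = [0.10, 0.50, 0.30, 0.90]
ses = [0.10, 0.20, 0.15, 0.25]

pooled, pooled_se, i_squared = pool_fixed_effect(d_values, ses)
print(f"pooled d = {pooled:.3f} (SE {pooled_se:.3f}), I^2 = {i_squared:.0%}")
```

Precise studies dominate the pooled estimate, and I² reports how much of the spread between studies exceeds what sampling error alone would produce. Vote counting discards both pieces of information.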
Clarity¶
Effect-size reporting clarifies what a study actually found by integrating magnitude, precision, and statistical distinguishability into a coherent narrative. The statement "treatment reduced symptoms by 2.3 points on a 20-point scale (95% CI 1.1–3.5, p=0.002)" communicates the point estimate (2.3), the uncertainty bounds (1.1 to 3.5), and how clearly the effect is distinguishable from zero (p=0.002), allowing readers to judge whether the effect is practically meaningful and to visualize the range of plausible effects. The same result stated as "treatment was significantly better than control (p<0.01)" suppresses the magnitude and precision information, leaving readers unable to assess practical importance. In meta-analysis, effect-size reporting enables pooling across studies using different sample sizes, outcome measures, and designs; "vote counting" (counting how many studies achieved p<0.05) favors studies with large samples and reduces a continuous quantity (effect magnitude) to a binary outcome (significant/non-significant). Effect-size discipline also clarifies interpretation of null results: a non-significant result with a narrow confidence interval (e.g., d with 95% CI -0.1 to +0.1) is evidence that any effect is small, at most about 0.1 in either direction; a non-significant result with a wide confidence interval (e.g., d with 95% CI -0.8 to +0.8) is inconclusive, leaving open the possibility of a large effect. This distinction is invisible in pure significance testing but central to interval-based effect-size reporting[5].
Manages Complexity¶
Effect size provides a unified currency for comparing findings across studies, measures, contexts, and interventions, reducing the apparent incommensurability of diverse evidence. A meta-analyst combining psychotherapy trials measuring depression on different scales (Beck Depression Inventory, Hamilton Depression Rating Scale, Patient Health Questionnaire-9) cannot directly pool raw scores (incompatible metrics) but can pool Cohen's d effect sizes, each computed as the standardized mean difference on its native scale. Meta-regression extends this framework by examining whether effect sizes covary with study characteristics (sample size, demographic composition), trial design features (intensity, duration), or patient-population features (baseline symptom severity, comorbidity), producing moderator analyses that would be impossible with p-values alone. In applied settings, effect-size thinking allows decision-makers to compare interventions on a common scale: a 0.1-SD improvement in educational outcomes may appear less impressive than a 0.5-SD effect from another intervention, but if the first is dramatically cheaper to implement and targets a priority population, its cost-effectiveness can be competitive. The complexity management lies in standardizing the metric (converting diverse raw outcomes into dimensionless effect sizes) while preserving uncertainty through paired interval estimates, a discipline that has consistently improved scientific communication and organizational decision-making where it has been systematically adopted[6].
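To illustrate how incompatible raw scores become poolable, the sketch below converts summary statistics from three trials on different depression scales into Cohen's d. All numbers are invented, the function name is hypothetical, and the standard-error formula is the usual large-sample approximation.

```python
import math


def d_from_summary(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Cohen's d and its approximate standard error from group summaries.
    Sign convention (an assumption): positive d means lower symptom
    scores under treatment."""
    pooled_sd = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2))
    d = (mean_c - mean_t) / pooled_sd
    se = math.sqrt((n_t + n_c) / (n_t * n_c) + d ** 2 / (2 * (n_t + n_c)))
    return d, se


# Hypothetical post-treatment summaries on three different scales.
trials = {
    "BDI": dict(mean_t=18.0, sd_t=8.0, n_t=50, mean_c=22.0, sd_c=8.0, n_c=50),
    "HDRS": dict(mean_t=12.0, sd_t=5.0, n_t=40, mean_c=14.5, sd_c=5.0, n_c=40),
    "PHQ-9": dict(mean_t=9.0, sd_t=4.0, n_t=60, mean_c=11.0, sd_c=4.0, n_c=60),
}

standardized = {scale: d_from_summary(**stats) for scale, stats in trials.items()}
for scale, (d, se) in standardized.items():
    print(f"{scale}: d = {d:.2f} (SE {se:.2f})")
```

Raw differences of 4, 2.5, and 2 points are not comparable as they stand, yet in this constructed example each corresponds to the same standardized difference of half a standard deviation.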
Abstract Reasoning¶
Effect-size reasoning embodies a fundamental abstraction about the structure of inference: the statistical question "is there an effect" is distinct from and less important than the substantive question "how big is it, is it stable across contexts, and does its magnitude matter for my decisions?" This perspective shifts analytical orientation from binary hypothesis testing (reject/do-not-reject) to continuous estimation with quantified uncertainty. The abstraction applies far beyond formal statistics: in engineering, setting specifications and tolerances is about magnitude and precision, not about hypothesis tests; in manufacturing quality control, statistical process control charts monitor the centering and spread of a process, not just whether it's "in control"; in medical monitoring, tracking a patient's marker trajectory requires continuous magnitude assessment, not sequential binary tests. The "statistically significant = real and important; non-significant = null and negligible" mental shortcut fails in two distinct ways: (a) with huge samples, it calls tiny effects that are practically negligible "statistically significant," leading to resource allocation toward trivial findings; (b) with small samples, it calls practically meaningful effects "non-significant," leading to premature abandonment of potentially valuable interventions. Effect-size discipline prevents both failures by centering magnitude as the primary inferential object[4]. The deeper principle is that effective decision-making under uncertainty requires attending to magnitude (the point estimate), direction (positive or negative), uncertainty (confidence or credible interval), and context (what counts as practically meaningful) jointly — collapsing these into a binary decision discards critical information needed for good judgment.
Knowledge Transfer¶
| Domain | Typical Effect-Size Metric | Interpretive Context |
|---|---|---|
| Psychology | Cohen's d, η², partial η² | d=0.2 small, 0.5 medium, 0.8 large (Cohen's conventions) |
| Medical trials | Odds ratio, relative risk, HR, NNT | OR=1.5 modest; NNT provides absolute scale |
| Epidemiology | Risk difference, hazard ratio | Absolute risk reduction preferred for clinical communication |
| Education | Effect size in SD units (Glass's Δ) | 0.1 small, 0.25 moderate, 0.4 large (What Works / Hattie) |
| Economics | Elasticity, marginal effect, $-impact | Unit-native; economic interpretation context-specific |
| A/B testing | Percent lift, absolute difference | Context-specific thresholds (e.g., 1% lift on revenue) |
| Marketing | Lift, ROAS, incremental conversion | Dollar-valued; comparable across campaigns |
| Meta-analysis | Pooled d, pooled OR with heterogeneity | Combined across studies with I² for heterogeneity |
| Machine learning | Accuracy/AUC improvement, F1 change | Task-specific; minimum meaningful lift varies |
| Ecology | Effect size on population growth rate, diversity | Biologically relevant thresholds; scaled to life-history |
Examples¶
Formal Example — Jacob Cohen 1988 and the Cohen's d revolution in psychology¶
Jacob Cohen's Statistical Power Analysis for the Behavioral Sciences (1st edition 1969, 2nd edition 1988) is the canonical academic foundation for effect-size thinking in psychology and the social sciences broadly. The 1988 edition codified systematic definitions, conventions, and computational methods for standardized effect sizes across test families: d for mean comparisons, r for Pearson correlations, f for ANOVA, w for chi-square (proportion effects), h for differences in proportions, and q for differences in correlations. Cohen's "small/medium/large" benchmarks (d=0.2, 0.5, 0.8 for standardized mean difference) were explicitly offered as rough and tentative conventions for contexts lacking domain-specific benchmarks, but became widely cited as if universally applicable rules, a misuse Cohen himself explicitly regretted in later writings. The standardized d metric rapidly became the foundational unit for meta-analyses and effect-size syntheses across psychology[7].
The practical impact on research culture unfolded over decades. A 1989 survey by Cohen found that fewer than 2% of published psychology papers reported effect sizes beyond implicit p-values; by 2010, mainstream psychology journals and the APA Publication Manual (6th edition, 2010) formally required effect-size reporting for primary statistical tests. Power analysis, built directly on effect-size reasoning, became routine in grant proposals and study protocols. Meta-analytic methods pioneered by Gene Glass, Larry Hedges, and John Hunter and Frank Schmidt (building on effect-size pooling rather than p-value vote-counting) became dominant evidence-synthesis tools in psychology, medicine, education, and organizational research.
Cohen's framework also exposed a phenomenon that significance testing had obscured: many published psychology findings had much smaller effect sizes than their authors appeared to believe, and many purported "medium" effects were artifacts of publication bias and selection bias, because only studies with effect sizes large enough to achieve significance in small samples survived to publication (see #441 reproducibility_replicability). Gelman and Carlin's 2014 extension into "Type M" (magnitude) and "Type S" (sign) errors showed mathematically that underpowered studies reaching statistical significance tend to dramatically exaggerate the true effect size (a "winner's curse" phenomenon)[8]. The contemporary emphasis on effect-size-centered reporting, pre-registered tests of pre-specified effect sizes, and explicit power analysis traces directly to Cohen's foundational innovations.
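The exaggeration effect can be reproduced with a short simulation. This is a sketch under stated assumptions, not Gelman and Carlin's own code: estimates are drawn from a normal sampling distribution around a known true effect, significance is judged with a two-sided z threshold of 1.96, and the true effect, standard error, and function name are hypothetical.

```python
import random
from statistics import mean


def exaggeration_ratio(true_effect, standard_error, n_sims=20_000, seed=0):
    """Simulate replications of one study; among the statistically
    significant ones, return (mean |estimate| / true effect, share significant)."""
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, standard_error)
        if abs(estimate) > 1.96 * standard_error:  # two-sided test at alpha = 0.05
            significant.append(abs(estimate))
    return mean(significant) / true_effect, len(significant) / n_sims


# Hypothetical underpowered study: true effect 0.2, standard error 0.25.
ratio, power = exaggeration_ratio(true_effect=0.2, standard_error=0.25)
print(f"share of replications reaching significance: {power:.1%}")
print(f"significant estimates overstate the true effect by about {ratio:.1f}x")
```

Any estimate that clears the significance threshold here must exceed 0.49 in absolute value, so every "successful" replication overstates a true effect of 0.2 by more than a factor of two.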
Mapped back: Effect-size foundational abstraction — standardization of magnitude across diverse contexts; convention-setting (the d=0.2/0.5/0.8 benchmarks); integration with power analysis (sample size planning as magnitude detection); exposure of publication-bias distortion of effect-size estimates.
Applied Example — A national retail chain's transition from significance-centric to magnitude-centric testing¶
A mid-market consumer packaged goods company operating approximately 12,000 US retail doors ran about 60 price-promotion tests per year, each comparing weekly unit sales under a test price against matched control stores. The analytics team's framework was standard: t-test on log-transformed unit sales, decision rule "p<0.05 = success." Over a four-year period, this procedure validated 47 of 240 tests (~20% success rate), and the validated tests were rolled out chain-wide. When finance reviewed the post-rollout performance a year later, the aggregate incremental-margin lift was roughly 40% below projections[9]. The tests were predicting effects that did not persist, which suggested the test-validation procedure was systematically overstating true effect sizes.
A data-science team redesigned the testing framework around explicit effect-size discipline. Every test now reported: (a) a point estimate of lift (percent change in units, absolute unit lift per store per week, incremental-margin contribution); (b) a 95% confidence interval; (c) a pre-specified "minimum meaningful effect" (MME) of 5% unit lift, the smallest effect the business cared about; and (d) a Bayesian posterior probability, P(lift > MME). The decision logic changed: tests with statistical significance (p<0.05) but point estimates below MME or confidence intervals including sub-MME values were marked "not actionable regardless of p-value." Tests with non-significant results but wide confidence intervals encompassing MME were marked "inconclusive: larger sample needed," not "failure." Tests with significance and point estimates clearly exceeding MME were marked "high confidence: rollout recommended."
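One way such a decision rule might be encoded is sketched below. The source does not say how the team computed its posterior or what counted as "clearly exceeding" the MME, so everything specific here is an assumption: the posterior is approximated as normal around the estimate (a flat prior), `posterior_threshold` is a hypothetical cutoff, and the function name and example inputs are invented.

```python
from statistics import NormalDist


def classify_test(lift, se, mme=0.05, alpha=0.05, posterior_threshold=0.90):
    """Classify a test from its estimated lift and standard error.
    Assumes a normal approximation with a flat prior, so the posterior
    for the true lift is Normal(lift, se)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    low, high = lift - z * se, lift + z * se
    significant = low > 0 or high < 0
    p_exceeds_mme = 1 - NormalDist(mu=lift, sigma=se).cdf(mme)

    if significant and lift > mme and p_exceeds_mme >= posterior_threshold:
        verdict = "rollout recommended"
    elif not significant and low < mme < high:
        verdict = "inconclusive: larger sample needed"
    else:
        verdict = "not actionable"
    return verdict, (low, high), p_exceeds_mme


# Hypothetical test results: (estimated lift, standard error).
strong = classify_test(lift=0.09, se=0.015)
trivial = classify_test(lift=0.02, se=0.008)
noisy = classify_test(lift=0.04, se=0.030)

for name, (verdict, (low, high), prob) in [("strong", strong), ("trivial", trivial), ("noisy", noisy)]:
    print(f"{name}: {verdict} (95% CI {low:.3f} to {high:.3f}, P(lift > MME) = {prob:.2f})")
```

The "trivial" case is statistically significant yet falls well below the 5% threshold, so it is not actionable; the "noisy" case is non-significant but cannot rule out a meaningful lift, so it is flagged for a larger sample.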
Post-intervention, the success rate dropped to 12%, but validated rollouts matched financial projections within 8%, indicating the surviving tests were genuinely higher-quality. Sample sizes rose (designed for power against MME, not for p<0.05). Communications to business stakeholders shifted from binary "passed/failed" to rich narratives: "This test shows a 7% lift (95% CI 4%-10%) with 96% posterior probability of exceeding our 5% minimum. The incremental margin would support rollout if validated in the next-period replication." Brand managers began requesting pre-specified MMEs before tests launched, embedding effect-size thinking into the testing culture[10]. The organizational shift was that "statistical significance" stopped serving as a validation credential; the real conversation became "is the magnitude large enough to matter?"
Mapped back: Effect-size decision-making structure — pre-specified minimum effect thresholds (context-dependent); decision-logic rewiring (significance necessary but not sufficient); sample-size planning aligned to effect magnitude, not to p<0.05; communication and organizational culture shift from binary outcomes to magnitude-centric reasoning.
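Sample-size planning against a target magnitude can be sketched as follows, assuming the minimum meaningful effect has been expressed as a standardized difference d between two equal-sized groups. The formula is the standard normal-approximation result; exact t-based calculations give slightly larger numbers. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate sample size per group needed to detect a standardized
    difference d with a two-sided test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


for d in (0.8, 0.5, 0.2):
    print(f"d = {d}: about {n_per_group(d)} per group for 80% power")
```

Because the required sample scales with 1/d², halving the smallest effect worth detecting roughly quadruples the sample needed, which is why the minimum meaningful effect has to be fixed before the test is designed.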
Structural Tensions¶
T1 — Standardized versus raw/unstandardized metrics. Standardized effect sizes (Cohen's d, Pearson's r, η², Cramér's V) enable direct comparison across studies with different outcome scales, sample sizes, and populations, making them essential for meta-analysis. Raw unstandardized effect sizes (percent-lift, minutes-saved, dollars-earned, lives-saved) preserve substantive units that decision-makers and end-users understand intuitively. The tension is that standardization enables synthesis but obscures interpretability; raw metrics enable interpretation but lack cross-study comparability. Best practice reports both simultaneously: standardized metrics for research synthesis, raw metrics for applied decision-making. No single metric serves all audiences equally well[11].
T2 — Magnitude (point estimate) versus reliability (interval estimate). Effect size is a point estimate (e.g., d=0.35) communicating the best estimate of magnitude; uncertainty is captured by confidence intervals, credible intervals, or Bayesian posterior distributions. Reporting a large effect size from a small, noisy study (e.g., d=0.8 ± 0.7, with wide interval) misleads readers toward overconfidence in the magnitude if the interval is not equally prominent. Conversely, reporting statistical significance without magnitude (p<0.05) misleads in the opposite direction, by suppressing information about the effect's scale. The tension is in simultaneously and equally communicating "this is the estimated magnitude" and "this is our uncertainty about it," a communication discipline that requires more reporting real estate than either quantity alone, creating pressure to abbreviate or simplify.
T3 — Universal conventions versus context-specific benchmarks. Cohen's small/medium/large rules (d=0.2, 0.5, 0.8 in standard deviations) are widely cited and provide universal reference points. Yet they are deeply context-dependent: in behavioral interventions on cognitive learning outcomes, d=0.2 is often considered substantial and worthwhile; in laboratory physics or chemistry, d=2 might be considered negligibly small. The tension is between the practical convenience of universal conventions (when domain-specific knowledge is unavailable) and the accuracy of context-specific benchmarks (which require substantive domain expertise). Applying universal benchmarks as authoritative rules produces systematic misinterpretation.
T4 — Average effect versus heterogeneous effects across subgroups. A reported average or overall effect size aggregates across populations, subgroups, and contextual conditions that may have substantially different effects. The same d=0.4 overall effect might decompose into d=0.8 in one demographic subgroup and d=0.0 in another, in which case the average obscures important heterogeneity. Subgroup analysis reveals this variation but risks inflating Type I error rates through multiple-comparisons problems. Treatment-effect-heterogeneity modeling (causal forests, Bayesian additive regression trees) offers principled tools but requires larger samples and more sophisticated analysis. The tension is between reporting simplicity (one summary number) and substantive accuracy (heterogeneity-aware reporting), particularly when downstream implementation will affect heterogeneous populations unequally.
T5 — Selection bias and publication bias distortion of effect sizes. The published literature on effect sizes is subject to multiple selection mechanisms: only studies with effects large enough to achieve statistical significance in small samples are likely to be published ("publication bias"); analyses are selected for reporting after seeing data ("selection bias" or "p-hacking"); replication attempts are less likely to be published if they fail. The result is that published effect-size estimates are systematically larger than true population effects— a phenomenon documented in meta-science research and termed the "file-drawer problem." Correcting for selection bias requires methods (trim-and-fill, p-curve, robust Bayesian meta-analysis) that have their own assumptions. The tension is between the apparent precision of a published meta-analytic estimate and the unquantifiable distortion from selection bias in the underlying literature.
T6 — Temporal stability and context generalization of effects. An estimated effect size from one sample, location, time period, or population may not generalize to another. A d=0.5 treatment effect in an efficacy trial (highly controlled conditions, motivated participants) may shrink to d=0.2 in an effectiveness trial (routine care, diverse populations). Broad generalization claims risk overstatement; cautious context-specific interpretation underutilizes findings. Modern practice emphasizes pre-registered sample-specific effect-size estimation and replication across diverse populations, but this requires resources most studies lack. The tension is between pragmatic use of available evidence (accepting potential context limitation) and epistemically cautious inference (qualifying conclusions to observed contexts).
Structural–Framed Character¶
Effect Size is a hybrid on the structural–framed spectrum, leaning structural with a light frame. Part of it is a bare pattern — the magnitude of a difference or relationship, expressed in interpretable units independent of how much data you collected; part of it is a vocabulary inherited from statistics and experimental design.
The structural core is a clean separation of two questions: not whether an effect differs from zero, but how large it is, often standardized by dividing the observed difference by a measure of spread so it can be compared across studies. That quantity is well defined wherever there is a measured contrast — a medical trial, an education study, an A/B test, a psychology experiment — and the arithmetic does not change with the subject matter. The lighter frame comes from its statistical home: the prime presupposes the apparatus of sampling, significance, and the practice of inference, and it carries a mild normative pull toward reporting magnitudes that matter for real decisions rather than chasing bare significance. Because a transferable quantitative pattern dominates while a modest methodological frame rides along, it sits toward the structural side of the middle.
Substrate Independence¶
Effect Size is a narrowly substrate-independent prime — composite 2 / 5 on the substrate-independence scale. The construct — separating the magnitude of an effect from its statistical significance, independent of sample size — is abstractly clean, but it is specialized to experimental design and quantitative research. It does not meaningfully travel outside statistics and data-heavy fields; social scientists and philosophers simply do not invoke effect-size reasoning when they are off the quantitative grid. So a fairly general formal idea ends up tethered in practice to the statistics domain that defines it, rather than functioning as a substrate-independent prime.
- Composite substrate independence — 2 / 5
- Domain breadth — 2 / 5
- Structural abstraction — 4 / 5
- Transfer evidence — 1 / 5
Relationships to Other Primes¶
Parents (2) — more general patterns this builds on
- Effect Size presupposes Comparison
  Effect size quantifies the magnitude of a relationship or difference in substantive, interpretable units, separately from statistical significance. This presupposes comparison: placing items under a shared frame, selecting a dimension along which they are co-considered, applying an alignment rule, and reading off a relation. An effect size is precisely such a reading: it requires comparands (typically treatment and control, or two conditions), a shared scale, and an alignment rule that makes the difference commensurable. Without comparison's operation of generating relational information, the magnitude has nothing to be the magnitude of.
- Effect Size presupposes Scale
  Effect size presupposes scale because reporting magnitude in interpretable units requires a chosen scale (raw difference, standardized mean difference, odds ratio, variance explained) against which the effect is sized. Without scale's commitment to specifying size, resolution, or unit of aggregation, there is no axis along which to express how large the deviation from zero is and no comparison across studies on a common dimension. Effect size is the scale at which a relationship is described, separated from the significance question of whether it differs from zero.
Path to root: Effect Size → Comparison
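As a minimal sketch of how the choice of scale changes what is reported, the snippet below computes a raw mean difference and a standardized mean difference (Cohen's d with a pooled standard deviation) for two groups. The function name `cohens_d` and the scores are made up for illustration; they are not part of the catalog.

```python
import math
import statistics


def cohens_d(treatment, control):
    """Return (raw mean difference, Cohen's d using the pooled sample SD)."""
    n1, n2 = len(treatment), len(control)
    raw_diff = statistics.mean(treatment) - statistics.mean(control)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = (
        (n1 - 1) * statistics.variance(treatment)
        + (n2 - 1) * statistics.variance(control)
    ) / (n1 + n2 - 2)
    return raw_diff, raw_diff / math.sqrt(pooled_var)


# Hypothetical scores, for illustration only.
treatment = [12.0, 14.0, 16.0, 18.0, 20.0]
control = [10.0, 12.0, 14.0, 16.0, 18.0]

raw, d = cohens_d(treatment, control)

# Re-expressing the same data in units 100 times smaller changes the raw
# difference but leaves the standardized difference untouched.
raw_rescaled, d_rescaled = cohens_d(
    [x * 100 for x in treatment], [x * 100 for x in control]
)

print(f"raw difference: {raw:.2f}, rescaled: {raw_rescaled:.2f}")
print(f"Cohen's d: {d:.3f}, rescaled: {d_rescaled:.3f}")
```

The raw difference keeps the substantive units, while d is unit-free, which is what makes it comparable across studies that measure the outcome differently.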
Neighborhood in Abstraction Space¶
Effect Size sits in a sparse region of abstraction space (61st percentile for distinctiveness): few abstractions share its structure, so a faithful description tends to retrieve it precisely rather than landing on a neighbor.
Family — Experimentation & Validation (18 primes)
Nearest neighbors
- Synergy and Antagonism — 0.80
- Selection Bias — 0.79
- Blocking (In Experimental Design) — 0.79
- Regression to the Mean — 0.77
- Experimental Design — 0.77
Computed from structural-signature embeddings · 2026-05-29
Not to Be Confused With¶
Effect Size must be distinguished from Proportion and Scale, its closest neighbor (similarity 0.724), though both involve quantitative relationships. Proportion and Scale concerns the sizing and ratio of compositional parts to wholes—how much of a total budget is allocated to marketing versus production, what fraction of team capacity is deployed to project X. This is fundamentally about relational positioning and allocation fractions. Effect Size, by contrast, quantifies the magnitude of a treatment, intervention, or relationship in standardized units that enable comparison across diverse measurement scales and studies. Proportion asks "what fraction of the whole?" while effect size asks "how much did this phenomenon change?" A 30% budget allocation to marketing (a proportion) is a different kind of fact than a Cohen's d=0.5 treatment effect on customer retention (an effect size). Though both are quantitative, they answer different structural questions: proportion is about composition and allocation; effect size is about measurement magnitude and comparative estimation. A meta-analysis synthesizing the effects of marketing interventions reports effect sizes, not proportions; a budget committee allocates proportions, not effect sizes. The two can coexist—a marketer might allocate 20% of budget (proportion) to a treatment whose effect size is d=0.4 (magnitude)—but they operate on different causal and structural planes.
Effect Size also differs from Scale, though both appear to involve "size." Scale is a structural prime naming the observation that systems behave fundamentally differently at different magnitudes—single cells obey different physical laws than multicellular organisms; a village and a nation operate under different coordination mechanisms. Scale is about the ontological bands at which phenomena are described, and the fact that the rules themselves change across bands. Effect Size, by contrast, is a statistical measurement construct: it quantifies the magnitude of a specific phenomenon (a treatment effect, a correlation, a difference) at a given scale, in interpretable units independent of sample size. A health-economics researcher measuring the effect of a drug on patient outcomes across millions of subjects is working at the national scale; the effect size (a hazard ratio of 0.85, an NNT of 50) quantifies the magnitude of the drug's impact. A laboratory researcher measuring the same drug on cultured cells is working at the cellular scale; the effect size (a percent-change in proliferation rate) quantifies the magnitude at that scale. Scale describes which band of the world you're studying; effect size describes how much of a difference your phenomenon makes within that band. They are complementary: effect size requires you to specify a scale (you always measure effects at some scale); scale thinking requires you to specify effect sizes at the scale you've chosen (bigger effects matter more at every scale). But one is ontological (what world-levels are relevant?) and the other is measurement-based (how much does the phenomenon matter?).
Effect Size must also be distinguished from Statistical Significance; the two are often conflated in practice, and this conflation is the source of widespread scientific misinterpretation. Statistical significance answers the question "is this effect distinguishable from zero (or from the null hypothesis)?" using a p-value or a Bayesian credible-interval check. Effect Size answers the question "how big is the effect?" using a point estimate in interpretable units (Cohen's d, odds ratio, percent change, lives saved). These are orthogonal questions because two effects can both be highly significant with wildly different magnitudes, and a large effect can be non-significant in a small study. A study of 100,000 participants might show a treatment effect that is statistically significant (p<0.001) but tiny in practical magnitude (d=0.05, clinically negligible). Conversely, a small study of 20 participants (10 per group) might show a large practical effect (d=0.8) that does not reach statistical significance at the 0.05 level because the study lacks power. Conflating significance with importance has been called "the systematic error of significance-testing culture" because it directs resources toward trivial findings (a large enough sample makes any tiny non-zero effect significant) and leads to the abandonment of valuable interventions (a small sample fails to reach significance despite a large practical effect). Modern statistical reform emphasizes that significance and effect size are separable and both needed: effect size is the magnitude being estimated; significance testing asks whether that estimate is distinguishable from the null given sampling noise; they must be reported together, with an interval around the estimate, for complete inference.
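The separation between magnitude and significance can be sketched numerically. The helper below is hypothetical and rests on a stated assumption: for two equal-sized groups, the test statistic d·sqrt(n/2) is treated as standard normal. This is a large-sample approximation to the two-sample t-test, so the small-study p-value is only indicative.

```python
import math
from statistics import NormalDist


def approx_two_sided_p(d, n_per_group):
    """Approximate two-sided p-value for a two-group comparison.

    Assumption: z = d * sqrt(n_per_group / 2) is treated as standard normal.
    An exact t-test would give a somewhat larger p-value for small samples.
    """
    z = d * math.sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Tiny effect, very large study: significant but practically negligible.
p_large_study = approx_two_sided_p(d=0.05, n_per_group=50_000)

# Large effect, very small study: practically important but not significant.
p_small_study = approx_two_sided_p(d=0.80, n_per_group=10)

print(f"d=0.05, n=50,000 per group: p = {p_large_study:.2e}")
print(f"d=0.80, n=10 per group:     p = {p_small_study:.3f}")
```

Under these assumptions the first p-value falls far below 0.001 and the second stays above 0.05, even though the second effect is sixteen times larger.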
Finally, Effect Size differs from Dose-Response Relationship, though both quantify how outcomes vary. Dose-Response Relationship maps the quantitative input-output function across a range of doses or exposures, characterizing the functional form (linear, nonlinear, sigmoidal, threshold), the curve's shape, and the relationship at each point along the gradient. A dose-response curve shows how a patient's symptom reduction increases as medication dose increases from 10mg to 20mg to 30mg to 40mg, revealing the shape of the relationship (perhaps linear up to 30mg, then plateauing, with diminishing returns at higher doses). Effect Size, by contrast, is a scalar magnitude estimate comparing two points: typically, one treatment condition against a control condition (or one exposure level against another), producing a single number summarizing the magnitude of the difference. Effect size is point-wise comparison; dose-response is functional mapping across a continuum. A clinical trial might report an effect size (treatment vs. control) of d=0.45, a single summary of how much the treatment helps. A dose-response study of the same drug across six dose levels reports the entire curve, showing not just "it helps" but "by how much at each dose, and where the curve flattens," a richer but more complex characterization. Both are valuable for different purposes: effect size drives clinical decision-making (does this dose help enough to use?), while dose-response guides optimal dosing and reveals safety thresholds (at what dose does benefit stop increasing and harm begin?).
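The contrast between a single effect size and a dose-response curve can be illustrated with a small sketch. All numbers here are invented for illustration: `mean_reduction` holds hypothetical mean symptom reductions per dose, and `common_sd` is an assumed shared standard deviation used to standardize the differences.

```python
# Hypothetical mean symptom reduction at each dose (mg); 0 mg is the control.
mean_reduction = {0: 2.0, 10: 4.0, 20: 6.0, 30: 8.0, 40: 8.2}
common_sd = 10.0  # assumed common standard deviation across dose groups


def standardized_difference(dose_a, dose_b):
    """Standardized mean difference between two dose groups."""
    return (mean_reduction[dose_a] - mean_reduction[dose_b]) / common_sd


# Effect size: one scalar comparing two points (30 mg versus control).
single_effect = standardized_difference(30, 0)

# Dose-response: the whole function, here expressed relative to control.
doses = sorted(mean_reduction)
curve = {dose: standardized_difference(dose, 0) for dose in doses}

# Step-to-step gains show where the curve flattens.
increments = [standardized_difference(b, a) for a, b in zip(doses, doses[1:])]

print(f"single effect size (30 mg vs control): {single_effect:.2f}")
for dose in doses:
    print(f"{dose:>2} mg: d vs control = {curve[dose]:.2f}")
print("gain per dose step:", [round(x, 2) for x in increments])
```

In this made-up data the single number summarizes one contrast, while the increments reveal the plateau after 30 mg that the scalar summary cannot show.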
Solution Archetypes¶
Solution archetypes in the catalog that build on this prime — directly (this prime is a source ingredient) or as a related prime.
Also a related prime in 16 archetypes
- Attrition and Dropout Monitoring
- Baseline Covariate Balance Verification
- Bayesian Belief Updating
- Catalytic Pairing
- Control-Condition Specification
- Counterfactual Comparison
- Dimensionality Reduction for Signal
- Effect Size Standardization
- Hypothesis Test Power Calibration
- Hypothesis Testing Frame
Notes¶
Effect-size reporting is now recognized as a foundational practice of modern statistical inference, emphasized across the American Statistical Association's 2016 and 2019 statements, journal guidelines (APA Publication Manual, CONSORT, STROBE), and funding-agency requirements. The practical and philosophical shift from significance testing to effect-size-plus-interval reporting remains incomplete in many fields, but momentum is strong in medicine, psychology, economics, and policy research. Contemporary practice emphasizes: (a) pre-specified minimum effects of interest for sample-size planning; (b) confidence intervals or credible intervals alongside point estimates; (c) effect-size reporting in abstracts and headline findings, not relegated to tables; (d) domain-specific interpretation rather than universal benchmarks; (e) heterogeneity assessment in meta-analytic synthesis rather than averaging across diverse effects; (f) Bayesian posterior distributions of effect size as an alternative to frequentist point-and-interval pairs. The tight structural connection to #437 (statistical_power) means the two should be traversed together: power analysis is effect-size analysis given sample size, and the two address complementary aspects of research design and interpretation. Meta-science research documents that fields which moved toward effect-size discipline (clinical medicine, education research) have achieved better cumulative evidence synthesis than fields that remained focused on significance testing.
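Two of the practices listed above, sample-size planning from a minimum effect of interest (a) and heterogeneity assessment in pooling (e), can be sketched in a few lines. This is a rough illustration under stated assumptions, not a recommended implementation: `n_per_group` uses the normal approximation for a two-sided, two-group comparison, `pool_fixed_effect` uses inverse-variance (fixed-effect) weights with Cochran's Q and I², and the study estimates are invented.

```python
import math
from statistics import NormalDist


def n_per_group(min_effect_d, alpha=0.05, power=0.80):
    """Approximate per-group sample size to detect a standardized difference.

    Assumption: normal approximation, two-sided test, equal group sizes.
    Exact t-based calculations give slightly larger values.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / min_effect_d) ** 2)


def pool_fixed_effect(estimates, variances):
    """Inverse-variance pooled estimate with 95% CI, Cochran's Q, and I^2."""
    weights = [1 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    ci = (pooled - 1.96 * se, pooled + 1.96 * se)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1
    # I^2: share of total variation attributed to between-study heterogeneity.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, ci, q, i_squared


# Smaller minimum effects of interest demand much larger studies.
n_medium = n_per_group(0.5)
n_small = n_per_group(0.2)

# Hypothetical study-level effect sizes and their sampling variances.
pooled, ci, q, i_squared = pool_fixed_effect([0.1, 0.4, 0.7], [0.04, 0.04, 0.04])

print(f"n per group for d=0.5: {n_medium}; for d=0.2: {n_small}")
print(f"pooled d = {pooled:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"Q = {q:.2f}, I^2 = {i_squared:.0%}")
```

A non-trivial I² in the invented data signals that the studies disagree by more than sampling error would suggest, which is the situation in which a single averaged effect is least informative.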
References¶
[1] Wilkinson, L., & American Psychological Association Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: guidelines and explanations. American Psychologist, 54(8), 594–604.
[2] Sullivan, G. M., & Feinn, R. (2012). Using effect size—or why the P value is not enough. Journal of Graduate Medical Education, 4(3), 279–282.
[3] Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5(10), 3–8.
[4] Funder, D. C., & Ozer, D. J. (2019). Evaluating effect size in psychological research: sense, nonsense, and new heuristics. Advances in Methods and Practices in Psychological Science, 2(2), 156–168.
[5] Cumming, G. (2014). The new statistics: why and how. Psychological Science, 25(1), 7–29.
[6] Hedges, L. V., & Olkin, I. (1985). Statistical Methods for Meta-Analysis. Academic Press.
[7] Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates. Foundational text on power analysis: links sample size, effect size, significance threshold, and noise level into a coherent design discipline, the practical instantiation of "set decision thresholds appropriate to the noise level" for empirical research.
[8] Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. Authoritative critique of statistical practice: exposes how implicit distributional assumptions and convenience-driven model choices generate misinterpretations of significance and uncertainty.
[9] Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4, 863.
[10] Cumming, G. (2012). Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. Routledge.
[11] Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6(2), 107–128.
[12] Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155–159.
[13] Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E. J. (2016). The fallacy of placing confidence in confidence intervals. Psychonomic Bulletin & Review, 23(1), 103–123.
[14] Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E. J. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21(5), 1157–1164.
[15] Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300.