Statistical Power

Core Idea

Statistical Power measures the probability that a test or study will detect a true effect—i.e., reject the null hypothesis when it is indeed false—thereby avoiding a Type II error (falsely concluding "no effect").

How would you explain it like I'm…

Catching Real Effects

Imagine looking for a small bug in the grass. With a tiny magnifying glass and bad lighting, you'll probably miss it even if it's right there. With a big magnifying glass and a bright lamp, you'll find it. Statistical power is just how good your bug-finder is at spotting a bug that really is there.

Chance of catching a real effect

Suppose you think a new sports drink helps kids run faster. To check, you test some kids. If you only test 3 kids, even if the drink really works, the small group might not show the difference — you'd miss it. If you test 300 kids, you're much more likely to spot it. Statistical power is the chance that your study will actually catch a real effect when one exists. Bigger effects, bigger samples, and less random noise all give you more power.

Effect-detection probability

Statistical power is the probability that a study will correctly detect a true effect — that is, reject a null hypothesis when the null is actually false. Formally, power equals 1 minus the chance of a false negative. Four things determine it: the size of the real effect, the size of your sample, the significance threshold you set, and how noisy your measurements are. If your study has low power, you might run it and find 'no effect' even when one really exists — wasting time and money. Worse, the few low-power studies that do find an effect tend to overstate it, because only the lucky high-variance results crossed the bar. That's why scientists do 'power analysis' before running studies: to figure out how big a sample they need to give themselves a fair shot at detecting what they're hoping to find.
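To make the link between sample size and power concrete, here is a minimal sketch that computes approximate power for comparing two equal-sized groups. It assumes a two-sided z-test with known, equal noise in both groups; the function name `power_two_sample_z` and the standardized effect of 0.3 are illustrative choices, not values from any particular study. The normal approximation is rough for very small groups, so read the small-sample figure as an indication only.

```python
from statistics import NormalDist


def power_two_sample_z(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    d           : standardized effect size (mean difference / sigma), assumed known
    n_per_group : number of participants in each of the two groups
    alpha       : significance threshold
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Under the alternative, the test statistic is centered at this value.
    shift = d * (n_per_group / 2) ** 0.5
    # Probability that the statistic lands beyond either critical value.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# Same assumed effect, three different sample sizes.
for n in (3, 30, 300):
    print(n, round(power_two_sample_z(d=0.3, n_per_group=n), 3))
```

With the effect held fixed, power climbs from near the significance threshold at 3 per group to well above 90% at 300 per group. Setting `d=0` returns `alpha`, which is the false-positive rate the threshold is meant to guarantee.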

 

Statistical power is the probability that a hypothesis test correctly rejects a false null hypothesis: power = P(reject H0 | H1 true) = 1 - beta, where beta is the Type II error rate. Power is a function of four quantities: true effect size (delta), sample size (n), significance level (alpha), and measurement noise (sigma). Together with power itself these are bound by a deterministic relationship, so fixing all but one pins down the remaining one. In the common case where delta and sigma are combined into a standardized effect size, fixing any three of effect size, n, alpha, and power determines the fourth. Power analysis is the systematic computation that makes this explicit at the design stage.

The framework was formalized by Neyman and Pearson (1933), but its widespread adoption in the social sciences traces to Jacob Cohen's 1962 audit of abnormal-social psychology research and his 1969/1988 *Statistical Power Analysis for the Behavioral Sciences*, which introduced the conventional small/medium/large effect-size benchmarks and produced the tabulations that made power calculations practical.

Planning power in advance guards against two failure modes. Underpowering (running studies too small to detect plausible effects) wastes resources and yields inconclusive nulls. Worse, when an underpowered study does cross the significance threshold, the resulting estimate is systematically inflated (the winner's curse, or type-M error; Gelman and Carlin 2014), because only estimates that happened to land far from zero could clear the bar. Overpowering wastes resources detecting trivially small effects. Underpowering is the dominant and more damaging failure, and a major contributor to the replication crisis.
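The relationship among delta, sigma, n, alpha, and power can be written out for one simple case. The sketch below assumes a two-sided z-test comparing two groups of n observations each, with known common standard deviation sigma and delta > 0; it ignores the negligible probability of rejecting in the wrong direction. Here Phi is the standard normal CDF and z_q is its q-quantile. Other tests (t-tests, tests of proportions) follow the same pattern with different constants.

```latex
\text{power} \;=\; 1 - \beta \;\approx\;
\Phi\!\left( \frac{\delta}{\sigma}\sqrt{\frac{n}{2}} \;-\; z_{1-\alpha/2} \right),
\qquad
n \;\approx\; \frac{2\,\sigma^{2}\,\bigl(z_{1-\alpha/2} + z_{1-\beta}\bigr)^{2}}{\delta^{2}}
```

The second expression is the first solved for n. It shows why halving the detectable effect roughly quadruples the required sample, and why only the ratio delta/sigma matters in this setting.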

Broad Use

  • Clinical Trials: Researchers calculate power beforehand to ensure sample size is large enough to catch a clinically meaningful difference in treatments.

  • Marketing A/B Tests: Managers want enough users tested so that if there's a real improvement in click-through, the experiment can reliably confirm it.

  • Psychological Studies: Power planning sets the number of participants needed for the expected effect size, so that the study avoids inconclusive results caused by underpowered experiments.

  • Manufacturing & Quality Control: Determining if a small but meaningful defect reduction is actually detectable with given sampling procedures.

Clarity

Shows that failing to find a significant result might not mean there's no effect. One must also ask whether the test had sufficient power to detect that effect, which guards against reading a "false negative" as evidence of absence.
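A small simulation can illustrate why a nonsignificant result from a low-power study says little. The sketch below repeatedly runs a hypothetical two-group study in which a real effect exists, then counts how often a two-sided z-test detects it. The effect size (0.3 standard deviations), group size (20), study count, and seed are all assumed values chosen for illustration, and `simulate_studies` is a made-up helper name.

```python
import random
from statistics import NormalDist


def simulate_studies(true_d=0.3, n_per_group=20, alpha=0.05,
                     n_studies=5000, seed=0):
    """Simulate many two-group studies with a real effect (sigma assumed = 1).

    Returns the fraction of studies reaching significance and the list of
    effect estimates from those significant studies.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = (2 / n_per_group) ** 0.5  # standard error of the mean difference
    significant_estimates = []
    for _ in range(n_studies):
        control = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_group)) / n_per_group
        treated = sum(rng.gauss(true_d, 1.0) for _ in range(n_per_group)) / n_per_group
        estimate = treated - control
        if abs(estimate / se) > z_crit:
            significant_estimates.append(estimate)
    return len(significant_estimates) / n_studies, significant_estimates


rate, hits = simulate_studies()
mean_abs_hit = sum(abs(e) for e in hits) / len(hits)
print("share of studies that detected the real effect:", round(rate, 3))
print("average size of significant estimates vs. true effect:",
      round(mean_abs_hit / 0.3, 2), "x")
```

Under these assumptions most simulated studies miss an effect that is really there, so a single "no effect" result is weak evidence. The studies that do reach significance report estimates more than twice the true effect on average, because an estimate has to exceed roughly 0.62 to clear the threshold at this sample size.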

Manages Complexity

By systematically calculating or targeting a certain power level (often 80% or 90%), experimenters reduce the risk of wasting resources on inconclusive or ambiguous data, thereby optimizing study design.
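Targeting a power level amounts to solving for sample size. As a rough sketch, the function below inverts the normal-approximation power formula for a two-sided comparison of two equal groups. The name `n_per_group` and the standardized effect of 0.5 are assumptions for illustration; an exact t-test calculation gives slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate per-group sample size for a two-sided, two-sample z-test.

    d     : standardized effect size the study should be able to detect
    power : target probability of detecting that effect
    alpha : significance threshold
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


for target in (0.80, 0.90):
    print(f"power {target:.0%}: {n_per_group(0.5, power=target)} per group")
```

Moving from 80% to 90% power raises the requirement by about a third in this setting, which is the trade-off a design team weighs when picking a target.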

Abstract Reasoning

Underscores that statistical tests are not all or nothing; they differ in sensitivity to real effects, shaped by sample size, effect size, and alpha levels—mirroring "signal detection" thinking in broader contexts.

Knowledge Transfer

  • Software Testing: Determining how many test runs or user sessions are needed to reliably detect performance improvements.

  • Epidemiology: Ensuring enough participants or observed cases exist to detect a moderate vaccine benefit.

Example

A medical study aiming for 80% power to detect a 5% improvement in survival calculates that it needs at least 500 patients per group. With fewer patients, power drops below 80% and the risk of a Type II error (missing the improvement even though it exists) rises accordingly.
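The kind of calculation behind such a figure can be sketched with the standard normal-approximation formula for comparing two proportions. This is not a reconstruction of the example's actual inputs: it assumes "5% improvement" means 5 percentage points, a two-sided alpha of 0.05, and baseline survival rates picked purely for illustration. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_two_proportions(p_control, p_treatment, power=0.80, alpha=0.05):
    """Approximate per-group sample size for a two-sided test of two proportions."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    # Sum of the binomial variances of the two groups (per observation).
    variance_sum = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    difference = p_treatment - p_control
    return ceil((z_alpha + z_power) ** 2 * variance_sum / difference ** 2)


# Same 5-point improvement, different assumed baseline survival rates.
for baseline in (0.50, 0.70, 0.90):
    n = n_per_group_two_proportions(baseline, baseline + 0.05)
    print(f"baseline {baseline:.0%}: about {n} patients per group")
```

The required sample varies several-fold with the baseline rate, so a power calculation has to state the baseline as well as the improvement it is designed to detect.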

Relationships to Other Primes

[Diagram: one-hop neighborhood of Statistical Power (parents above, mutual partners to the right, children below). Parents shown: Experimental Design and Probability, each linked by composition.]

Parents (2) — more general patterns this builds on

  • Statistical Power presupposes Experimental Design — Statistical power presupposes experimental design because its computation requires the pre-specified architecture of treatment assignment, sample size, and outcome measurement.
  • Statistical Power presupposes Probability — Statistical power presupposes probability because it is a calibrated probability quantifying correct rejection of a false null.

Path to root: Statistical Power → Probability

Not to Be Confused With

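  • Statistical Power is not Statistical Significance (p-value) because Statistical Power is the probability of correctly detecting an effect when it exists (avoiding Type II error), while Statistical Significance concerns Type I error: the significance level alpha is the probability of rejecting a true null hypothesis, and a p-value is the probability, computed assuming the null is true, of data at least as extreme as those observed.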
  • Statistical Power is not Statistical Inference because Statistical Power concerns the sensitivity of a test to detect effects, while Statistical Inference is the broader framework for reasoning about populations from samples.
  • Statistical Power is not Effect Size because Statistical Power depends on effect size, sample size, and alpha, whereas Effect Size measures the magnitude of an effect independent of sample size.