Shortcut Learning
Core Idea
An adapting system discovers a cheap, locally available feature that correlates with success on its training distribution and uses it as a stand-in for the structure it was meant to learn. It looks competent in-distribution but collapses sharply off-distribution, where the correlation breaks, because no structural knowledge was ever acquired. An optimizer scored only on training success converges on the cheapest feature sufficient for that score, not on the target.
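The collapse can be reproduced in a toy setting. The sketch below is illustrative only: the two features, the noise level, the 95% agreement rate, and all function names are assumptions made for the example, not part of any real system. A logistic model is trained where a noise-free `shortcut` feature almost always matches the label and a noisy `core` feature carries the real signal; it is then scored once where the shortcut still holds and once where it is uninformative.

```python
import math
import random

def make_data(n, shortcut_agreement, rng):
    """Sample ((core, shortcut), label) pairs.

    `core` carries the real signal but is noisy (hypothetical noise level).
    `shortcut` is noise-free and matches the label's sign with probability
    `shortcut_agreement`.
    """
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        sign = 1.0 if y == 1 else -1.0
        core = sign + rng.gauss(0.0, 1.5)
        shortcut = sign if rng.random() < shortcut_agreement else -sign
        rows.append(((core, shortcut), y))
    return rows

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(rows, steps=1000, lr=0.5):
    """Full-batch gradient descent on mean log-loss; returns (weights, bias)."""
    w, b = [0.0, 0.0], 0.0
    n = len(rows)
    for _ in range(steps):
        gw, gb = [0.0, 0.0], 0.0
        for (x0, x1), y in rows:
            err = sigmoid(w[0] * x0 + w[1] * x1 + b) - y
            gw[0] += err * x0
            gw[1] += err * x1
            gb += err
        w[0] -= lr * gw[0] / n
        w[1] -= lr * gw[1] / n
        b -= lr * gb / n
    return w, b

def accuracy(model, rows):
    w, b = model
    hits = sum((w[0] * x0 + w[1] * x1 + b > 0) == (y == 1)
               for (x0, x1), y in rows)
    return hits / len(rows)

rng = random.Random(0)
model = train_logistic(make_data(400, 0.95, rng))
in_dist = make_data(2000, 0.95, rng)   # shortcut still tracks the label
off_dist = make_data(2000, 0.50, rng)  # shortcut is now uninformative
print("weights (core, shortcut):", model[0])
print("in-distribution accuracy: ", accuracy(model, in_dist))
print("off-distribution accuracy:", accuracy(model, off_dist))
```

Under these assumptions the learned weight on `shortcut` should dominate the weight on `core`, and accuracy should drop toward chance once the shortcut stops tracking the label, even though the `core` feature is exactly as informative as before.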
Broad Use
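- Machine learning: a pneumonia detector that recognizes the hospital's X-ray machine rather than the lung; a text classifier latching onto negation tokens (for example, predicting "contradiction" in natural language inference whenever "not" appears).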
- Education: the student who reads keyword cues in test items rather than the concept, then fails on novel framings.
- Hiring analytics: an algorithm learning a demographic proxy because it correlated with prior hires, not the construct.
- Animal cognition: Clever Hans reading his trainer's posture rather than doing arithmetic; pigeons learning background luminance.
- Biological evolution: runaway sexual selection on a display trait whose honest-signal correlation has decayed.
- Clinical prediction: a sepsis model that fires on the documentation of a sepsis workup — predicting the chart, not the patient.
Clarity
It separates being right from being right for a transportable reason, converting "passes training" from a reassurance into a question — which feature is used, and would it still predict off-distribution?
Manages Complexity
It collapses a per-domain catalogue of unrelated-looking generalization failures into one diagnostic: find the cheapest feature achieving training success, then check whether it dissociates from the target out of distribution.
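The two-step diagnostic can be sketched as a small audit over binary features. Everything here is hypothetical: the feature names echo the pneumonia example, the rows are hand-made, and `costs` is an assumed stand-in for how cheaply a learner can extract each feature. The sketch also assumes each feature is coded so that 1 goes with the positive label.

```python
def agreement(rows, feature):
    """Fraction of rows where the binary feature equals the binary label."""
    return sum(r[feature] == r["label"] for r in rows) / len(rows)

def cheapest_sufficient_feature(train_rows, costs, threshold):
    """Lowest-cost feature that alone reaches `threshold` agreement in training."""
    for feature in sorted(costs, key=costs.get):
        if agreement(train_rows, feature) >= threshold:
            return feature
    return None

def make_rows(triples):
    return [{"label": y, "scanner_tag": s, "lung_opacity": o}
            for y, s, o in triples]

# Hand-made illustrative data: (label, scanner_tag, lung_opacity).
train = make_rows([
    (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 0),
    (0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 0, 0),
])
# Stress set: scanner assignment no longer tracks the diagnosis.
stress = make_rows([
    (1, 1, 1), (1, 0, 1), (1, 1, 1), (1, 0, 1),
    (0, 0, 0), (0, 1, 0), (0, 0, 0), (0, 1, 1),
])

costs = {"scanner_tag": 1, "lung_opacity": 10}  # assumed extraction costs
suspect = cheapest_sufficient_feature(train, costs, threshold=0.9)
print("cheapest sufficient feature:", suspect)
for feature in costs:
    print(feature, "train:", agreement(train, feature),
          "stress:", agreement(stress, feature))
```

In this made-up data both features clear the training threshold, so cost decides which one is flagged. The flagged feature is the one whose agreement with the label falls apart on the stress set, which is the dissociation the diagnostic looks for.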
Abstract Reasoning
It yields a structural prediction, not a contingent one: where a cheaper correlate exists, the optimizer finds it, and brittleness concentrates exactly where correlate and target dissociate — the region in-distribution evaluation never visits.
Knowledge Transfer
- ML → education: stress-set evaluation is the same move as a novel-framing test item.
- ML → animal cognition / evolution: "passing training is not the same as having learned the task" carries to beaks and genomes, since the optimizer there is selection.
- Across substrates: build stress sets that decorrelate the cheap feature, train across varying environments (invariance penalties), probe causally, audit construct validity.
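One simple way to build a stress set that decorrelates a cheap feature is to downsample until every combination of feature value and label is equally frequent. The sketch below assumes rows are dictionaries with a categorical feature and label; `build_stress_set` and the `cue` field are hypothetical names for illustration. It balances only the one named feature and does nothing about other correlates.

```python
import random
from collections import defaultdict

def build_stress_set(rows, feature, label="label", seed=0):
    """Downsample so each (feature value, label) cell has the same size."""
    cells = defaultdict(list)
    for r in rows:
        cells[(r[feature], r[label])].append(r)

    feature_values = {f for f, _ in cells}
    label_values = {y for _, y in cells}
    missing = [(f, y) for f in feature_values for y in label_values
               if (f, y) not in cells]
    if missing:
        # Resampling cannot decorrelate a feature if a combination never occurs.
        raise ValueError(f"no examples for combinations: {missing}")

    k = min(len(cell) for cell in cells.values())
    rng = random.Random(seed)
    stress = []
    for key in sorted(cells):
        stress.extend(rng.sample(cells[key], k))
    rng.shuffle(stress)
    return stress

# Illustrative data: the cue matches the label in 90 of 100 rows.
rows = ([{"cue": 1, "label": 1}] * 45 + [{"cue": 0, "label": 0}] * 45
        + [{"cue": 1, "label": 0}] * 5 + [{"cue": 0, "label": 1}] * 5)
stress = build_stress_set(rows, "cue")
match_rate = sum(r["cue"] == r["label"] for r in stress) / len(stress)
print(len(stress), "rows; cue matches label in", match_rate, "of them")
```

The price of balancing is sample size: the stress set is limited by the rarest cell, so a shortcut that almost never disagrees with the label in the available data leaves very few rows to test with.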
Example
A sepsis early-warning model, graded against later diagnosis, learns that a lactate order predicts the outcome. The order is placed because a clinician already suspects sepsis, so the model looks excellent retrospectively. Deployed to alert before suspicion arises, it fails on exactly the population it was meant to help.
Relationships to Other Primes
Parents (1) — more general patterns this builds on
- Shortcut Learning presupposes Learning: it requires an adaptive process under outcome feedback (learning) and adds two things, the condition that a cheaper correlate is available and the prediction of collapse. It is therefore a child of Learning, not a duplicate. Dossier-confirmed: 'presupposes learning... adds the cheaper-correlate-available condition.' The 0.9717 'learning' neighbor is the parent, not a duplicate.
Path to root: Shortcut Learning → Learning → Adaptation
Not to Be Confused With
- Shortcut Learning is not Learning because the prime substitutes a cheap correlate carrying no transportable structure, whereas learning is the genuine acquisition of structure that survives transport.
- Shortcut Learning is not Overfitting because the prime seizes a real, systematic correlation and generalizes fine until the distribution shifts, whereas overfitting memorizes sample-specific noise and fails on a random held-out split.
- Shortcut Learning is not Transfer of Learning because the prime is the prior failure to acquire structure, so there is nothing to transfer, whereas transfer concerns moving genuine competence to a new task.
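The contrast with overfitting in the list above can be made concrete with two deliberately crude learners. This is a caricature under assumed data, not a model of any real system: `fit_memorizer` stores training inputs verbatim, standing in for overfitting, while `fit_shortcut` learns only which label a cue usually carries. Each is scored on a random held-out split and on a shifted split where the cue is uninformative.

```python
import random

def sample(n, cue_agreement, rng):
    """Rows of ((cue, noise), label); `noise` is unrelated to the label."""
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        cue = y if rng.random() < cue_agreement else 1 - y
        noise = rng.getrandbits(32)  # sample-specific detail
        rows.append(((cue, noise), y))
    return rows

def fit_memorizer(train):
    """Overfitting caricature: exact lookup, constant guess for unseen inputs."""
    table = {x: y for x, y in train}
    return lambda x: table.get(x, 0)

def fit_shortcut(train):
    """Shortcut caricature: predict the majority label seen with each cue value."""
    votes = {0: [0, 0], 1: [0, 0]}
    for (cue, _), y in train:
        votes[cue][y] += 1
    rule = {cue: int(counts[1] >= counts[0]) for cue, counts in votes.items()}
    return lambda x: rule[x[0]]

def accuracy(predict, rows):
    return sum(predict(x) == y for x, y in rows) / len(rows)

rng = random.Random(0)
pool = sample(2000, 0.95, rng)
train, held_out = pool[:1000], pool[1000:]  # same distribution, random split
shifted = sample(1000, 0.50, rng)           # cue no longer tracks the label

memorizer = fit_memorizer(train)
shortcut = fit_shortcut(train)
for name, model in [("memorizer", memorizer), ("shortcut", shortcut)]:
    print(name,
          "train:", accuracy(model, train),
          "held-out:", accuracy(model, held_out),
          "shifted:", accuracy(model, shifted))
```

Under these assumptions the memorizer should already fail on the random held-out split, while the shortcut learner should pass it and fail only on the shifted split. That is why a random held-out split can catch overfitting but cannot catch shortcut learning.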