Reward Prediction Error

Prime #: 1148
Origin domain: Neuroscience
Subdomain: computational neuroscience → Neuroscience
Aliases: RPE, Temporal Difference Error

Core Idea

A reward prediction error is the pattern in which a system learns from the signed gap between expected and received outcome — signal minus prediction — using that scalar error, not the outcome itself, as the teaching signal: matched outcomes teach nothing, surprise teaches.

How would you explain it like I'm…

The Surprise Teacher

Imagine you expect one cookie and you get one cookie — no surprise, nothing to learn. But if you expected one and got three, that happy surprise makes you remember whatever led to it. And if you expected one and got none, that letdown makes you trust it less next time. Surprise is the teacher; getting exactly what you expected teaches nothing.

Better Or Worse Than Expected

A reward prediction error is the gap between what you expected and what you actually got — and that gap, not the reward itself, is what teaches you. If the outcome matches your expectation, there's no error and you learn nothing. If it's better than expected, that's a positive error and it strengthens whatever predicted it. If it's worse, that's a negative error and it weakens those predictors. So the system keeps a guess, gets a result, and pays attention to (result minus guess). A neat side effect: as your guesses get better, the surprises shrink and learning naturally slows down — small errors mean you've about maxed out, not that you failed.

Surprise Is The Signal

A reward prediction error is the pattern where a system learns not from raw outcomes but from the gap between expected and received outcomes, using the sign and size of that gap — rather than the outcome itself — to update its model. Outcomes that match expectation make no error and produce no learning; outcomes that beat expectation make a positive error and reinforce whatever predicted them; outcomes that fall short make a negative error and weaken those predictors. The commitment is that the system carries a prediction, receives a signal, and computes a scalar error (signal minus prediction) that serves as the teaching signal for whatever updates the predictor. It's the dual of outcome-only learning: a dog that just salivates when food arrives is responding to the food, but a prediction-error learner that already expected the food learns nothing from it — only unexpected food (positive error) or unexpectedly absent food (negative error) teaches. As the predictor improves, errors shrink and learning slows on its own, so the absence of error signals a ceiling, not a failure.

A reward prediction error is the structural pattern in which a system learns not from raw outcomes but from the gap between expected and received outcomes, and uses the sign and size of that gap, rather than the outcome itself, to update its model. Outcomes matching expectation generate no error and produce no learning; outcomes exceeding expectation produce a positive error and reinforce whatever predicted them; outcomes falling short produce a negative error and weaken those predictors. The essential commitment is that the system carries a prediction (an expectation, forecast, or value estimate), receives a signal (an outcome, reward, or measurement), and computes a scalar error (signal minus prediction) that serves as the teaching signal for whatever process updates the predictor. Every instance specifies four parameters: the predictor (the model issuing expectations), the prediction (its output on a particular trial), the observed outcome, and the learning rate (how strongly the error updates the predictor).

The error is the load-bearing currency of learning: a system without prediction errors keeps no record of surprise and does not improve. The pattern also lets a reasoner ask crisp questions that raw-outcome accounts cannot: whose prediction error, against what predictor, with what learning rate. It is the dual of outcome-only learning: a Pavlovian organism that salivates when food arrives is responding to the stimulus, not its mismatch with expectation, whereas a prediction-error learner that already expected the food learns nothing from its arrival, while unexpected food (positive error) or unexpectedly absent food (negative error) teaches.

A structural consequence is the baseline-shift phenomenon: as the predictor improves, the errors shrink and learning slows of its own accord, so the absence of further error signals that the system has reached its current ceiling, not that effort has failed.
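The four parameters named above can be made concrete with a minimal sketch. It assumes the simplest possible predictor, a single number nudged toward each outcome (a delta rule); the function names, the starting prediction of 0.0, and the learning rate of 0.5 are illustrative choices, not part of the pattern's definition.

```python
def prediction_error(outcome: float, prediction: float) -> float:
    """Signed error: signal minus prediction."""
    return outcome - prediction


def update(prediction: float, outcome: float, learning_rate: float) -> float:
    """Move the prediction toward the outcome by a fraction of the error."""
    return prediction + learning_rate * prediction_error(outcome, prediction)


def run_trials(outcomes, initial_prediction=0.0, learning_rate=0.5):
    """Return the error seen on each trial and the final prediction."""
    prediction = initial_prediction
    errors = []
    for outcome in outcomes:
        errors.append(prediction_error(outcome, prediction))
        prediction = update(prediction, outcome, learning_rate)
    return errors, prediction


# The same reward (1.0) on every trial: the surprise halves each time.
errors, final_prediction = run_trials([1.0] * 5)
print(errors)            # [1.0, 0.5, 0.25, 0.125, 0.0625]
print(final_prediction)  # 0.96875

# A fully expected reward teaches nothing; an omitted one gives a negative error.
print(prediction_error(1.0, 1.0))  # 0.0
print(prediction_error(0.0, 1.0))  # -1.0
```

The shrinking error list is the baseline-shift phenomenon in miniature: the outcome never changes, yet learning slows because the predictor has caught up.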

Broad Use

Neuroscience: midbrain dopamine neurons fire on unpredicted reward and pause when a predicted reward fails to arrive — the canonical biological prediction-error signal.
Reinforcement learning: the temporal-difference error drives every TD-family algorithm; without it the value function does not update.
Bayesian filtering: the innovation in a Kalman filter (observation minus predicted observation) is the only term through which a new measurement updates the estimate.
Finance: earnings surprise (actual minus consensus) moves asset prices; to a first approximation, fully anticipated earnings move nothing.
Education: learners progress most at the edge of expectation: too-easy problems produce no error, and too-hard problems produce errors too large to use.
Forecasting: model skill is the residual against verification, and a shrinking residual signals an approaching ceiling.
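The reinforcement-learning and dopamine entries in this list can be sketched together with a tabular TD(0) update. This is a toy illustration under stated assumptions, not a model of any particular experiment: the two-state chain (a cue followed by a delay), the reward placement, and the values of `alpha` and `gamma` are all hypothetical.

```python
def td_error(reward, value_next, value_current, gamma=1.0):
    """TD error: (reward + discounted next value) minus current value."""
    return reward + gamma * value_next - value_current


def run_episode(values, rewards, alpha=0.5, gamma=1.0):
    """Walk states 0..n-1 in order, then terminate (terminal value is 0).

    rewards[s] is the reward received on leaving state s.
    Updates `values` in place and returns the TD error at each step.
    """
    errors = []
    for s in range(len(values)):
        value_next = values[s + 1] if s + 1 < len(values) else 0.0
        delta = td_error(rewards[s], value_next, values[s], gamma)
        values[s] += alpha * delta  # zero error means zero update
        errors.append(delta)
    return errors


values = [0.0, 0.0]  # state 0 = cue, state 1 = delay before reward
first = run_episode(values, rewards=[0.0, 1.0])
print(first)         # [0.0, 1.0]: the reward is a full surprise

for _ in range(50):  # keep training until the reward is predicted
    trained = run_episode(values, rewards=[0.0, 1.0])
print(trained)       # both errors are now very close to zero

omitted = run_episode(values, rewards=[0.0, 0.0])
print(omitted)       # second error is close to -1: a predicted reward was withheld
```

The three printed lists correspond to the three cases in the text: unpredicted reward (positive error), fully predicted reward (no error), and omitted reward (negative error).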

Clarity

It makes visible that learning is driven by neither signal nor prediction alone but by their signed difference, relocating the explanation of learning from outcome magnitude to magnitude of surprise.

Manages Complexity

It collapses dopamine bursts, TD errors, Kalman innovations, and earnings surprises into one quantity, and reduces "why isn't this learning?" to a four-component checklist: predictor, signal channel, error computation, update rule.

Abstract Reasoning

It licenses three heuristics: "no surprise, no learning," "probe with the unexpected," and "calibrate the learning rate to noise," plus the diagnosis that persistent zero error means either mastery or a sealed-off model.
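One way to see what calibrating the learning rate to noise means is the measurement update of a scalar Kalman filter, where the gain plays the role of the learning rate. The sketch below assumes a static scalar state with no process noise; the function name `kalman_step` and all numeric values are hypothetical.

```python
def kalman_step(estimate, variance, observation, noise_variance):
    """One measurement update for a scalar, static state."""
    innovation = observation - estimate            # the prediction error
    gain = variance / (variance + noise_variance)  # acts as the learning rate
    new_estimate = estimate + gain * innovation
    new_variance = (1.0 - gain) * variance
    return new_estimate, new_variance, gain


# Same prior and same surprise, two different measurement-noise levels.
clean = kalman_step(estimate=0.0, variance=1.0, observation=2.0, noise_variance=1.0)
noisy = kalman_step(estimate=0.0, variance=1.0, observation=2.0, noise_variance=9.0)
print(clean)  # (1.0, 0.5, 0.5)
print(noisy)  # smaller gain, so the same error moves the estimate less
```

The update line has the same shape as a delta rule: new estimate equals old estimate plus rate times error. The only difference is that the rate is computed from the relative uncertainty of the prediction and the measurement instead of being fixed by hand.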

Knowledge Transfer

Robotics: because Kalman innovations and TD errors share the same form (observation minus prediction), localization and reward learning can be treated as instances of one update rule.
Pedagogy: the zone-of-proximal-development literature can be read as the prediction-error principle applied: optimal difficulty is calibrated surprise.
Finance: portfolio-manager habits of decomposing forecasts into expected-versus-surprise components port to any domain with calibrated predictions.

Example

A company reports record profits yet its stock falls because the profits missed the analyst consensus (a negative error), while another company's stock rises on a modest profit that beat expectations: prices move on the surprise, not the level.

Relationships to Other Primes

Parents (2) — more general patterns this builds on

Reward Prediction Error is part of (typical) Predictive Coding: RPE is the scalar teaching signal (signal minus prediction) that a predictive-coding hierarchy transmits, the error any such system uses. The relation is part-of/component, not identity: predictive coding is one system built from prediction errors, and RPE is the error. The relation could also be read the other way, with RPE as the more general pattern; it is drawn here with predictive coding as the architecture and RPE as its currency.
Reward Prediction Error is part of (typical) Reinforcement: RPE is the prediction-error decomposition inside reinforcement, refining what reinforcement's "value signal" actually is (surprise, not magnitude). It is a component of the reinforcement loop, not the whole selection engine.

Path to root: Reward Prediction Error → Predictive Coding → Feedback

Not to Be Confused With

Reward Prediction Error is not Predictive Coding because predictive coding is the hierarchical architecture passing predictions down and residuals up, whereas RPE is the scalar teaching signal any error-driven learner uses.
Reward Prediction Error is not Reinforcement because reinforcement is the whole selection engine, whereas RPE specifies what its value signal actually is — surprise, not magnitude.
Reward Prediction Error is not Bayesian Updating because Bayesian updating revises a full distribution, whereas RPE is a scalar point-estimate correction; the two coincide in special cases such as the linear-Gaussian one, where the posterior mean moves by a gain times the error.