Predictive Coding

Origin domain: Neuroscience
Subdomain: computational neuroscience → Neuroscience
Also from: Engineering & Design, Computer Science & Software Engineering, Psychology
Aliases: Predict and Correct, Residual Coding, Prediction Error Signaling, Generative Model Correction, Predictive Processing

Core Idea

Predictive coding is the structural pattern in which a system maintains an internal generative model that continuously predicts its incoming signal, compares that prediction against the actual input, and then transmits, stores, or acts upon only the residual — the prediction error. The essential commitment is that the expected part of the signal is suppressed and only the surprising part propagates; the model is then updated by the error so that future predictions improve. ^[1] It is a predict–compare–correct loop, not merely a smaller encoding: the residual is both the message and the teaching signal. The concept crystallized in computational neuroscience through Rao and Ballard's (1999) account of the visual cortex as a hierarchy of predictors, in which higher areas send predictions downward and lower areas return only the unexplained error upward. ^[1] What makes the pattern a prime rather than a single algorithm is that the same four-part shape — generative model, prediction, comparison, residual-driven correction — recurs wherever a system must track a changing source under bandwidth, energy, or attention constraints. It answers a recurring question: when reality mostly conforms to expectation, why pay the full cost of representing it again, instead of paying only for where it departs?

The prime is fundamentally economic and epistemic at once. Economically, it spends representational resources in proportion to surprise; epistemically, it treats the gap between belief and observation as the only thing worth carrying forward, since the rest was already implied by the model. ^[2] Friston's (2010) free-energy formulation pushes this further: a system that minimizes prediction error over time is, under stated assumptions, minimizing a bound on its own surprise and thereby maintaining itself against a disordering environment. ^[3]
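
The predict–compare–correct loop described above can be sketched in a few lines. This is a minimal illustration, not any cited author's model: the "generative model" is assumed to be a single running estimate, and the names `predictive_coder` and `learning_rate` are hypothetical.

```python
def predictive_coder(signal, learning_rate=0.5):
    """Run a scalar predict-compare-correct loop and return the residuals.

    Assumption: the whole "model" is one running estimate of the signal.
    """
    estimate = 0.0  # hypothetical initial belief
    residuals = []
    for observed in signal:
        prediction = estimate               # predict
        error = observed - prediction       # compare
        residuals.append(error)             # only the residual propagates
        estimate += learning_rate * error   # correct the model with the error
    return residuals


# A constant source: the residual shrinks as the model catches up.
residuals = predictive_coder([10.0] * 6)
```

On a source that stops changing, each residual is half the previous one, so the message and the teaching signal fade together as the model improves.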

How would you explain it like I'm…

Pay attention only to surprises

Imagine you're listening to a song you know really well. Your brain hums along guessing the next note. When the singer hits exactly what you expected, you barely notice. But if they change one note, your ears perk up — surprise! Your brain is mostly paying attention to what's different from what it expected.

Predict, compare, send only surprise

Predictive coding is the idea that a brain (or any smart system) is always guessing what's coming next, then only paying close attention to the parts where its guess was wrong. Instead of processing every detail from scratch, it builds a model of the world, predicts the next sound or sight, and reacts mainly to surprises. The surprises also teach the model to make better guesses next time. This saves energy and helps explain why familiar things fade into the background.

Predict, Compare, Send the Error

Predictive coding is a structural pattern in which a system maintains an internal generative model that constantly predicts its incoming signal, compares the prediction to the actual input, and forwards only the residual — the prediction error. The expected part of the signal is suppressed; only the surprising part propagates. The error then updates the model so future predictions improve. The pattern crystallized in computational neuroscience with Rao and Ballard (1999), who described the visual cortex as a hierarchy of predictors: higher areas send predictions down, lower areas return only the unexplained error up. The same shape — model, predict, compare, send the residual — recurs anywhere a system must track a changing source under limits on energy, bandwidth, or attention. It is economic (spend resources in proportion to surprise) and epistemic (carry forward only what was not already implied) at once.

Predictive coding is the structural pattern in which a system maintains an internal generative model that continuously predicts its incoming signal, compares the prediction to actual input, and then transmits, stores, or acts on only the residual — the prediction error. The expected portion of the signal is suppressed; only the surprising portion propagates, and the residual updates the model so future predictions improve. It is simultaneously a coding scheme (the residual is the message) and a teaching signal (the residual drives learning). The framework crystallized in computational neuroscience through Rao and Ballard (1999), who modeled the visual cortex as a hierarchy of predictors: higher areas send predictions downward, lower areas return only unexplained error upward. What makes the pattern more than a single algorithm is its recurrence wherever systems track changing sources under bandwidth, energy, or attention constraints. Friston (2010) generalized it as free-energy minimization: a system that minimizes prediction error is, under stated assumptions, minimizing a bound on its own surprise and thereby maintaining itself against a disordering environment.

Structural Signature

Predictive coding encodes a structural pattern: generative model → prediction → comparison-against-input → residual propagation-and-correction. It separates two streams that are usually fused in a naive system — the predicted component (what the model already knows) and the error component (what the model failed to anticipate) — and routes only the second through the expensive channel. ^[1] The model is the standing hypothesis; the residual is its running confession of where it is wrong.

Recurring features:

Transmit only the prediction error, suppress the expected
Generative model predicts; comparison yields a residual
Information lives in the surprising, not the anticipated
Top-down prediction meets bottom-up error
Update the model in proportion to its mistakes
Precision-weighting decides how much each error counts
Explaining away: once predicted, a signal need not propagate

The structural insight is robust precisely because the loop is indifferent to what the signal means. A cortical column predicting the next visual feature, a codec predicting the next audio sample, a Kalman filter predicting the next state, and a language model predicting the next token all instantiate the identical arrangement: a forward model whose error is fed back to sharpen it. Spratling (2017) surveys these implementations and argues they are variations on one canonical computation rather than separate inventions. ^[2] The signature is also recursive: errors at one level become the input to be predicted at the next, so the pattern stacks into hierarchies in which each layer explains away as much of the layer below as it can and forwards only what remains unexplained.
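
The recursive stacking can be illustrated with a deliberately trivial two-level hierarchy. As an assumption made only for this sketch, each level predicts "same as the previous value" in place of a learned generative model; `residual_layer` is a hypothetical name.

```python
def residual_layer(inputs):
    """Predict each input as 'same as the previous one'; return the errors."""
    previous = 0.0
    errors = []
    for value in inputs:
        errors.append(value - previous)  # forward only what was not predicted
        previous = value                 # simplest possible model update
    return errors


ramp = [2.0, 4.0, 6.0, 8.0, 10.0]
level1 = residual_layer(ramp)    # explains away constancy, leaves the slope
level2 = residual_layer(level1)  # explains away the slope, leaves the onset
```

Here `level1` is a constant stream of 2.0 and `level2` is nonzero only at the first step: each layer forwards what it could not explain, and the top level sees almost nothing.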

What It Is Not

Predictive coding does not claim that systems literally store a tiny "error file" and nothing else; the generative model itself is a substantial standing representation, often larger than the residual stream it produces. The prime's claim is about what propagates and what gets corrected, not that the model is free. A common misreading treats predictive coding as a compression trick that throws information away. The loop itself does not: in lossless forms (DPCM with an unquantized residual, the Kalman innovation added back to the predicted measurement) the residual plus the prediction reconstructs the input exactly, and no information is lost. Where a practical codec is lossy, the loss comes from quantizing the residual, not from predicting. The savings come from the residual being cheaper to carry, not from discarding signal.
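
The lossless case is easy to check directly. The sketch below assumes the simplest DPCM predictor (each sample is predicted to equal the previous one) and integer samples with no quantization; `dpcm_encode` and `dpcm_decode` are illustrative names, not a standard API.

```python
def dpcm_encode(samples):
    """Return residuals against a previous-sample predictor (no quantization)."""
    prediction = 0
    residuals = []
    for sample in samples:
        residuals.append(sample - prediction)
        prediction = sample  # the decoder can form the same prediction
    return residuals


def dpcm_decode(residuals):
    """Rebuild the samples by adding each residual to the running prediction."""
    prediction = 0
    samples = []
    for residual in residuals:
        sample = prediction + residual
        samples.append(sample)
        prediction = sample
    return samples


samples = [100, 102, 103, 103, 101, 98]
residuals = dpcm_encode(samples)
assert dpcm_decode(residuals) == samples  # prediction + residual loses nothing
```

After the first sample the residuals are small integers, which is where the saving comes from; the round trip shows that nothing was discarded to obtain it.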

Nor does the prime assert that prediction is always accurate or that surprise is always bad. A well-tuned predictive system expects a steady trickle of error and uses it; a system reporting zero error over a changing input is broken or has stopped learning, not perfected. The error channel is the point, not a defect to be eliminated. Relatedly, predictive coding does not require consciousness, intention, or a brain. The loop runs in a thermostat's error-driven controller and in a phone's audio codec with no more "expectation" than a difference equation supplies; ascribing belief-talk to these systems is a convenience of language, not a commitment of the prime.

Finally, the prime is not a claim that the brain (or any system) is only a predictive coder. It is a structural pattern that a system may use heavily, partly, or not at all in a given subsystem. Predictive coding describes the shape of a particular loop; it does not legislate that all of cognition, or all of signal processing, reduces to that loop. Treating "the brain is a prediction machine" as a totalizing metaphysics overstates what the prime underwrites.

Broad Use

Computational neuroscience: Cortical hierarchies are modelled as passing prediction errors upward while higher levels send predictions downward, the architecture Rao and Ballard (1999) proposed and Friston's free-energy work generalized into a unifying account of perception, action, and learning. ^[1] Mismatch responses (the mismatch negativity, repetition suppression) are read as the signature of error units firing only when prediction fails.

Signal processing: Differential pulse-code modulation (DPCM) and linear predictive coding (LPC) transmit the difference between a predicted and an actual sample, slashing bandwidth; LPC underlies decades of speech codecs by modelling the vocal tract as a predictor and sending only the excitation residual. ^[4]

Control and estimation: The Kalman filter advances a state prediction and corrects it by the innovation — measurement minus predicted measurement — which is the exact same residual loop with an optimal, uncertainty-weighted gain, as Kalman (1960) formalized for linear-Gaussian systems. ^[5]

Machine learning: Autoregressive and self-supervised models learn by predicting the next token, frame, or masked patch and back-propagating the error; the training signal is the prediction residual, and the predict-then-correct loop is the entire learning dynamic.

Perception, reading, and attention: Expectation fills in the predicted, so effort and attention spike at violated predictions: garden-path sentences in reading, unexpected events in a visual scene that draw gaze. Precision-weighting, the gain on the error channel, is one influential structural account of attention itself.

Organizations and operations: Forecast-and-variance management reports only deviations from plan ("management by exception"), and anomaly-detection systems flag only departures from a learned baseline — a residual stream by another name.

Clarity

Naming predictive coding lets practitioners see that information lives in the unexpected: a system can be efficient precisely because it spends resources only where reality departs from its model. It makes "surprise" a first-class, measurable quantity rather than a vague feeling, and it cleanly separates two things a naive design conflates — the model (what is expected) from the error channel (what must still be explained). Once that split is named, design questions sharpen: how good is the model, how expensive is the residual, and how should the two trade off?

The clarity also dissolves a frequent confusion between having a prediction and acting on its error. Many systems forecast; far fewer close the loop by feeding the discrepancy back to correct the forecast. Predictive coding names the closed loop specifically, so a practitioner can ask, of any forecasting system, "where does the residual go?" If the answer is "nowhere," the system merely predicts; if the residual updates the model and gates downstream cost, it is genuinely a predictive coder. This distinction redirects attention from the glamour of prediction toward the often-neglected error pathway that does the real work. ^[6]

Manages Complexity

Predictive coding bounds processing, bandwidth, and storage to the residual stream rather than the full signal, and it localizes learning to wherever predictions fail. A high-dimensional input is reduced to (stable model) + (sparse error), so attention, memory, and computation concentrate on the small, informative remainder while the bulk that conformed to expectation is explained away cheaply. This is the structural reason a video codec can shrink a near-static scene to almost nothing and then spend bits suddenly when something moves: complexity is paid in proportion to surprise, not to raw size.

Stacked into a hierarchy, the pattern manages complexity recursively. Each level absorbs the regularities it can predict and forwards only its residual to the level above, so the system as a whole distributes the burden of explanation across layers and never re-represents what a lower layer already accounted for. Clark (2013) argues that this hierarchical error-routing is what lets bounded biological agents cope with a torrent of sensory data: the upper levels never see the firehose, only the trickle of what the lower levels could not explain. ^[6] The same logic lets an operations dashboard stay legible — managers attend to variances, not to the thousands of metrics that landed on plan.

Abstract Reasoning

The pattern licenses reasoning about prediction error as the engine of both perception and learning, about hierarchical message-passing (predictions down, errors up), and about pathologies of mis-set precision — the gain on the error channel. If precision is set too high, noise is treated as signal and the model chases phantoms; if too low, real change is dismissed as noise and the model goes stale. Framing perception and learning as precision-weighted error correction makes a wide family of failures — hallucination, neglect, perseveration, over-fitting — legible as the same parameter set wrongly. ^[3]
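
For the scalar Gaussian case, which is a textbook simplification assumed here rather than a formula taken from the cited sources, the precision-weighted correction can be written out. The symbols are introduced for illustration: \mu is the current estimate, y the observation, and \pi_y and \pi_\mu the precisions (inverse variances) of the observation and of the estimate.

```latex
\varepsilon = y - \mu, \qquad
K = \frac{\pi_y}{\pi_y + \pi_\mu}, \qquad
\mu \leftarrow \mu + K\,\varepsilon
```

As \pi_y grows relative to \pi_\mu, K approaches 1 and the estimate follows every error, noise included; as it shrinks, K approaches 0 and real change is discounted. The two failure families named above sit at opposite ends of this one gain.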

It also licenses the move of "explaining away": once a signal is predicted, it needs no further transmission, which means absence of error is itself informative. A practitioner can reason counterfactually — what would the residual look like under a different model? which errors would vanish if the model were correct? — and use the structure of the remaining error to diagnose what the model is missing. Because the same residual-and-gain vocabulary spans estimators, codecs, and cortex, an insight about one (say, that an estimator diverges when its noise model is wrong) transfers as a hypothesis about another (a perceptual system that hallucinates may have a miscalibrated precision on its sensory error). ^[6]

Knowledge Transfer

The Kalman innovation, the DPCM residual, and the cortical prediction error are recognizably one structure, so estimator-design intuitions — precision-weighting, optimal gain, the cost of a wrong noise model — transfer to models of attention and to anomaly-detection systems that flag only deviations from a learned baseline. An engineer who understands why a Kalman gain rises when measurement noise falls already holds an intuition for why attention should be drawn more strongly to a high-precision sensory error than to a noisy one; the mapping is not metaphorical but structural, because both are the same precision-weighted update. ^[3] A speech-coding engineer's grasp of why LPC fails on signals that violate its source model transfers directly to understanding why a self-supervised model trained on one distribution produces large, uninformative residuals on another. The shared vocabulary — model, prediction, residual, gain, precision — is what makes the transfer concrete rather than analogical. ^[2]
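
The gain intuition can be checked numerically. The sketch assumes a scalar filter that observes its state directly, so the gain reduces to a ratio of variances; `scalar_kalman_gain` is a hypothetical helper name.

```python
def scalar_kalman_gain(estimate_variance, measurement_variance):
    """Gain for a scalar filter that observes the state directly."""
    return estimate_variance / (estimate_variance + measurement_variance)


# Same prior uncertainty, two sensors of different quality.
noisy_sensor = scalar_kalman_gain(estimate_variance=1.0, measurement_variance=4.0)
clean_sensor = scalar_kalman_gain(estimate_variance=1.0, measurement_variance=0.25)
```

The clean sensor's error is weighted at about 0.8 and the noisy sensor's at about 0.2: the more precise the error channel, the more a given residual is allowed to revise the estimate.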

Examples

Formal/abstract

Kalman filtering (control and estimation): A tracking system maintains a state estimate (position and velocity) and a model of how the state evolves. At each step it predicts the next state and the measurement it expects, then takes an actual measurement. The difference — the innovation, or residual — is multiplied by the Kalman gain (which weights the residual by the relative uncertainty of model versus measurement) and added back to correct the estimate. Crucially, only the innovation drives the update; the predicted part contributes nothing new. The sequence of innovations, for a correctly specified filter, is white noise — meaning all the structure has been explained away, and any remaining pattern in the residuals signals a wrong model. Mapped back: This is the prime in its cleanest form: generative model (state-transition equations), prediction (the predicted measurement), comparison (measurement minus prediction), residual-driven correction (gain times innovation). The precision-weighting that sets the Kalman gain is exactly the precision-weighting that, in the cortical reading, decides how much a prediction error should revise a percept.
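
A scalar version of this filter makes the four parts visible. The sketch assumes a random-walk state observed directly (rather than the position-and-velocity tracker described above), and the default variances are arbitrary illustrative values.

```python
def kalman_1d(measurements, process_var=1e-3, measurement_var=1.0,
              initial_estimate=0.0, initial_var=1.0):
    """Scalar Kalman filter for a random-walk state observed directly.

    Returns (estimates, innovations).
    """
    estimate, variance = initial_estimate, initial_var
    estimates, innovations = [], []
    for z in measurements:
        # Predict: a random walk keeps the mean and grows the uncertainty.
        predicted = estimate
        predicted_var = variance + process_var
        # Compare: the innovation is measurement minus predicted measurement.
        innovation = z - predicted
        # Correct: the gain weights the innovation by relative uncertainty.
        gain = predicted_var / (predicted_var + measurement_var)
        estimate = predicted + gain * innovation
        variance = (1.0 - gain) * predicted_var
        estimates.append(estimate)
        innovations.append(innovation)
    return estimates, innovations


# Noise-free constant measurements: the innovations decay toward zero.
estimates, innovations = kalman_1d([5.0] * 20)
```

Only `innovation` enters the update line; the predicted part is carried forward unchanged. With noisy measurements the innovations would not decay to zero but, for a correctly specified filter, would look like white noise.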

Linear predictive coding (signal processing): A speech codec models the vocal tract as a linear filter that predicts each audio sample from a weighted sum of recent samples. Instead of transmitting the samples, it transmits the filter coefficients (the model) plus the residual — the excitation signal that the predictor could not anticipate. Because human speech is highly predictable from its recent past, the residual is small and cheap, and decoding simply runs the predictor forward and adds the residual back. Mapped back: Model (the linear predictor), prediction (each estimated sample), comparison (sample minus estimate), residual (the excitation). When the signal violates the source model — music, noise, overlapping voices — the residual swells, exactly as a Kalman filter's innovations grow when its dynamics model is wrong and as cortical error units fire when a percept defies expectation.
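
A toy order-2 predictor shows the "coefficients plus residual" split. This is a sketch under strong assumptions: a pure sine stands in for speech, the predictor is fitted by ordinary least squares over the whole signal, and nothing is quantized, whereas real speech codecs work on short frames with higher-order predictors and quantized parameters. All function names are hypothetical.

```python
import math


def fit_lpc2(x):
    """Least-squares order-2 predictor: x[n] ~ a1*x[n-1] + a2*x[n-2]."""
    idx = range(2, len(x))
    r11 = sum(x[n - 1] * x[n - 1] for n in idx)
    r12 = sum(x[n - 1] * x[n - 2] for n in idx)
    r22 = sum(x[n - 2] * x[n - 2] for n in idx)
    b1 = sum(x[n] * x[n - 1] for n in idx)
    b2 = sum(x[n] * x[n - 2] for n in idx)
    det = r11 * r22 - r12 * r12  # 2x2 normal equations, solved by Cramer's rule
    return (b1 * r22 - b2 * r12) / det, (b2 * r11 - b1 * r12) / det


def lpc_residual(x, a1, a2):
    """Excitation: the first two samples verbatim, then prediction errors."""
    return x[:2] + [x[n] - (a1 * x[n - 1] + a2 * x[n - 2])
                    for n in range(2, len(x))]


def lpc_decode(residual, a1, a2):
    """Run the predictor forward and add the residual back."""
    x = list(residual[:2])
    for n in range(2, len(residual)):
        x.append(a1 * x[n - 1] + a2 * x[n - 2] + residual[n])
    return x


tone = [math.sin(0.3 * n) for n in range(100)]
a1, a2 = fit_lpc2(tone)
residual = lpc_residual(tone, a1, a2)
decoded = lpc_decode(residual, a1, a2)
```

Because a sine obeys an exact two-term recurrence, the fitted predictor leaves a residual at the level of floating-point rounding, and decoding recovers the tone to the same accuracy. A signal that breaks the recurrence would leave a correspondingly larger residual.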

Applied/industry

Video compression (industry): A modern video codec predicts each frame from previous frames using motion estimation (and from already-decoded regions of the same frame), then encodes only the residual difference between prediction and reality. A static talking-head scene costs almost nothing because the prediction is nearly perfect and the residual is near zero; a hard cut or fast motion produces a large residual and a spike in bits. The decoder reconstructs by running the same predictor and adding the transmitted residual. Mapped back: This is predictive coding deployed at industrial scale: the prediction (motion-compensated frame) is suppressed and only the surprise (the residual block) is paid for. The cost of the stream tracks surprise, not screen size, which is the prime's complexity-management claim made literal in a bandwidth budget.
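
The cost-tracks-surprise behaviour can be seen in a toy frame differencer. The sketch assumes the crudest possible predictor ("the next frame equals the previous one", with no motion estimation) and counts nonzero residual pixels as a stand-in for bit cost; the frames and function names are invented for illustration.

```python
def frame_residual(previous_frame, current_frame):
    """Residual against the simplest predictor: 'same as the previous frame'."""
    return [[cur - prev for prev, cur in zip(prev_row, cur_row)]
            for prev_row, cur_row in zip(previous_frame, current_frame)]


def nonzero_count(residual):
    """Crude stand-in for bit cost: how many residual pixels are nonzero."""
    return sum(1 for row in residual for value in row if value != 0)


def reconstruct(previous_frame, residual):
    """Decoder side: run the same predictor and add the residual back."""
    return [[prev + res for prev, res in zip(prev_row, res_row)]
            for prev_row, res_row in zip(previous_frame, residual)]


static_a = [[10, 10, 10], [10, 50, 10], [10, 10, 10]]
static_b = [[10, 10, 10], [10, 51, 10], [10, 10, 10]]  # one pixel flickers
hard_cut = [[90, 80, 70], [60, 50, 40], [30, 20, 15]]  # unrelated scene

quiet = frame_residual(static_a, static_b)
burst = frame_residual(static_b, hard_cut)
```

The near-static pair leaves one nonzero residual pixel and the hard cut leaves all nine, although both frames are the same size; in either case `reconstruct` recovers the frame from the prediction plus the residual.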

Anomaly detection and monitoring (industry): A fraud-detection or infrastructure-monitoring system learns a baseline model of normal behavior, predicts the next observation, and raises a residual whenever the actual observation departs from the prediction. Operators see only the deviations — the management-by-exception report — not the overwhelming majority of events that landed on the expected baseline. Tuning the alert threshold is precisely setting the precision on the error channel: too sensitive and noise floods the operators; too lax and real anomalies are explained away as noise. Mapped back: The learned baseline is the generative model, the alert is the residual, and the threshold is precision-weighting. The same failure modes that afflict a mis-tuned Kalman filter or a hallucinating perceptual system appear here as false-positive storms and missed incidents — one structure, one set of pathologies.
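
The threshold-as-precision point can be made concrete with a small detector. As assumptions for this sketch, the baseline is an exponentially weighted moving average, the latency figures are invented, and the baseline is updated only on non-alerting observations, which is one design choice among several.

```python
def detect_anomalies(observations, threshold, smoothing=0.2):
    """Flag indices whose residual against an EWMA baseline exceeds threshold."""
    baseline = observations[0]  # hypothetical warm start on the first value
    alerts = []
    for index, value in enumerate(observations[1:], start=1):
        residual = value - baseline           # compare against the prediction
        if abs(residual) > threshold:
            alerts.append(index)              # only deviations reach operators
        else:
            baseline += smoothing * residual  # learn only from normal traffic
    return alerts


latency_ms = [100, 101, 99, 102, 100, 180, 101, 98, 100, 103]

tuned = detect_anomalies(latency_ms, threshold=20)
too_sensitive = detect_anomalies(latency_ms, threshold=1.5)
too_lax = detect_anomalies(latency_ms, threshold=100)
```

With a threshold of 20 only the spike at index 5 is reported. At 1.5 ordinary jitter at indices 3, 7, and 9 is reported alongside it, and at 100 the spike itself is explained away and nothing is reported.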

Structural Tensions

T1: The savings depend entirely on the model being good. Predictive coding is cheap only when predictions are usually right; against a signal the model cannot anticipate, the residual approaches the full signal and the loop costs more than naive transmission once the model's overhead is counted. The prime promises efficiency proportional to predictability, which means it offers nothing — or a penalty — precisely where the world is genuinely novel. Designers who assume the residual will always be small are surprised when a distribution shift turns their compression scheme into an expansion scheme.
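
The expansion case is easy to reproduce. The sketch assumes a previous-sample predictor and uses the sum of absolute values as a rough cost proxy; both signals are invented for illustration.

```python
def previous_sample_residuals(samples):
    """Residuals against a 'same as last sample' predictor, starting from 0."""
    previous = 0
    residuals = []
    for sample in samples:
        residuals.append(sample - previous)
        previous = sample
    return residuals


def total_magnitude(values):
    """Crude cost proxy: sum of absolute values."""
    return sum(abs(v) for v in values)


smooth = [50, 51, 52, 52, 53, 54]          # matches the model's assumption
alternating = [50, -50, 50, -50, 50, -50]  # violates it on every sample

smooth_cost = total_magnitude(previous_sample_residuals(smooth))
alternating_cost = total_magnitude(previous_sample_residuals(alternating))
```

For the smooth signal the residual stream costs 54 against 312 for the raw samples. For the alternating signal it costs 550 against 300: the same loop that compressed one source expands the other, before the model's own overhead is even counted.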

T2: Precision-weighting trades hallucination against neglect, and there is no setting that avoids both. Raising the gain on the error channel makes the system responsive to real change but credulous toward noise; lowering it makes the system robust to noise but blind to genuine novelty. Every predictive coder must pick a point on this spectrum, and the right point depends on the actual noise structure of the world, which the system is itself trying to learn. The parameter that protects against one failure mode is the same parameter that opens the other.

T3: A strong prior can explain away the very signal that should overturn it. Because predicted components are suppressed, a confident model can treat a true but unexpected observation as noise to be discounted rather than error to be learned from. The mechanism that makes the system efficient — suppressing the expected — is the mechanism by which it can become dogmatic, fitting the world to its model instead of its model to the world. Distinguishing healthy explaining-away from pathological self-confirmation requires information the loop does not natively contain.

T4: The model is invisible until it is wrong, which makes failure hard to anticipate. When predictions are accurate, the residual stream is quiet and the system looks healthy; the standing model goes unexamined precisely because it is succeeding. Latent defects in the model surface only as a surge of error when conditions change, often suddenly and after the cheap, quiet period has bred complacency. Predictive systems therefore tend to fail not gradually but in a burst, when the accumulated mismatch between model and world finally exceeds what the model can still explain away.

T5: Hierarchical error-routing assumes lower levels can be trusted to forward what matters. In a stacked predictive coder each level sees only the residual the level below chose not to explain away. If a lower level wrongly explains away a signal — absorbing it into its own model — the upper levels never learn it exists. The architecture's efficiency comes from upper levels being shielded from the raw input, but that same shielding means a misallocation of explanation at the bottom is structurally invisible at the top. Trust flows upward with the residual, and so does any error in what got suppressed.

T6: Predicting the input can collapse into controlling it. Active versions of the prime let a system reduce prediction error either by updating the model or by acting on the world to make the prediction come true. These are formally interchangeable ways to shrink the residual, but they are not interchangeable in consequence: a system that minimizes surprise by acting can drift toward seeking out only the narrow, dark-room conditions it already predicts well, sacrificing the rich engagement that made prediction worth doing. The same error-minimizing imperative that drives learning can, applied through action, drive disengagement.

Structural–Framed Character

Predictive Coding sits at the structural end of the structural–framed spectrum: it is a pure processing loop, the same wherever it appears, in which a system maintains an internal model that continuously predicts its incoming signal, compares the prediction against the actual input, and transmits, stores, or acts upon only the residual prediction error, updating the model so future predictions improve.

The concept comes from computational neuroscience and signal engineering as a formal mechanism, carries no normative weight, and can be defined entirely in terms of a model, a prediction, and an error signal with no reference to human practice. Applying it recognizes a predict–compare–correct mechanism already operating rather than imposing a frame: the same structure unifies Kalman filtering, differential pulse-code modulation, and cortical error signaling. On every diagnostic, it reads structural.

Substrate Independence

Predictive Coding is about as substrate-independent as a prime can be: composite 5 / 5 on the substrate-independence scale. Its predict-compare-correct-on-residual signature is fully substrate-agnostic, and the cross-domain evidence shows it as literally one structure across cortical hierarchies in biology; DPCM, linear predictive coding, and anomaly detection in computing; the Kalman innovation in formal control; and perception and attention in cognition. The decisive note is that the Kalman innovation, the DPCM residual, and cortical prediction error are recognizably the same object, so intuitions about gain and precision transfer directly among them, which is exactly the kind of concrete cross-substrate evidence the top tier demands. It sits level with the feedback and causality anchors at the ceiling.

Composite substrate independence — 5 / 5
Domain breadth — 5 / 5
Structural abstraction — 5 / 5
Transfer evidence — 5 / 5

Relationships to Other Abstractions

Current abstraction: Predictive Coding Prime

Parents (4) — more general patterns this builds on

Predictive Coding is a kind of Encoding And Decoding Prime

Predictive coding is one encoding scheme (transmit prediction errors against a shared generative model); Encoding and Decoding is the general content-to-code pairing of which it is an instance.

Predictive Coding presupposes Compression Prime

Predictive coding presupposes compression because transmitting only the prediction error exploits the predictable signal's redundancy to shorten its representation.

Predictive Coding presupposes Feedback Prime

Predictive coding presupposes feedback because the predict-compare-correct loop routes prediction-error output back to update the generative model.

Predictive Coding is part of Prediction Error Prime

Prediction-error signals are internal messages in the predictive-coding hierarchy, comparing level-specific predictions with incoming activity and routing the residual upward.

Children (2) — more specific cases that build on this

Efference Copy Prime is a kind of (typical) Predictive Coding

Efference copy is not predictive coding in general; it is the specific self-versus-world attribution built from broadcasting a command copy to the perceiver. Predictive coding (predict input, propagate error) is the genus; efference copy is the self-attribution specialization.

Pattern Completion (Filling the Incomplete) Prime presupposes Predictive Coding

Pattern completion presupposes predictive coding because filling incomplete input requires a generative model whose predictions span the missing parts.

Hierarchy paths (6) — routes to 6 parentless roots

Predictive Coding → Encoding And Decoding → Transformation → Function (Mapping)

Neighborhood in Abstraction Space

Predictive Coding sits among the more crowded primes in the catalog (2nd percentile for distinctiveness): several abstractions describe nearly the same structure, so a description that fits it will tend to fit its neighbors too. Transporting it usually means disambiguating within this family rather than landing on it exactly.

Family — Propagation & Temporal Dynamics (21 primes)

Nearest neighbors

Recurrence — 0.79
Rhythm — 0.77
Classification — 0.77
Interpretation — 0.77
Latency — 0.77

Computed from structural-signature embeddings · 2026-07-26

Not to Be Confused With

Predictive coding must be distinguished from Compression, which the v1 entry identifies as its nearest neighbor. The two are easily conflated because predictive coding is often deployed to compress (DPCM and LPC are compression schemes) and because both exploit redundancy. But the prime and the neighbor name different things. Compression is fundamentally about minimizing the size of a representation by removing redundancy, and it is frequently a static, one-shot transformation: given a body of data, find a shorter code for it. Predictive coding is a dynamic loop in which a generative model runs forward against live input, and the object of interest is the residual, the running prediction error, not the code length. The residual is simultaneously the cheap thing to transmit and the teaching signal that corrects the model; compression as such has no analogue of the second role, because a compressor need not learn from the gap between its expectation and reality in order to predict better next time. One can compress with no predictive loop at all (run-length encoding, Huffman coding over a fixed symbol table), and one can run a predictive-coding loop that shortens nothing (a Kalman filter used purely as an estimator reduces no bit count, yet is a paradigm predictive coder). Compression is best seen as one possible downstream use of a predictive-coding loop, available when you decide to send only the residual and the residual is small, rather than the same structure under a different name. Where compression asks "how few bits?", predictive coding asks "where did my model go wrong, and by how much?"

Predictive coding is also not foreseeing or prediction in the bare sense. Prediction merely forms a belief about a future or hidden state: a forecast, an expectation, a guess at the next value. That belief can sit there unused. Predictive coding additionally closes the loop — it compares the prediction against the actual input, extracts the residual, and feeds that residual back to both correct the model and gate what propagates downstream. The distinction is precisely the one the Clarity section draws: many systems forecast; only a predictive coder routes the discrepancy back into the machinery. A weather model that predicts tomorrow's temperature is doing prediction; a weather model that compares yesterday's prediction to what actually happened and adjusts its parameters in proportion to the error is doing predictive coding. The extra commitments — comparison, residual, correction, and the suppression of the expected — are what separate the active loop from a mere belief about the future. Prediction is a component of predictive coding, not a synonym for it.

Finally, predictive coding is distinct from Pattern Completion, the prime that referred it into the catalog. Pattern completion fills in missing parts of a stored pattern from partial cues: given a fragment, retrieve or reconstruct the whole, as an associative memory recovers a full image from a corrupted one. Its work is reconstruction toward a remembered template. Predictive coding is the ongoing, error-driven correction of a generative model against live, continuously arriving input. The orientations differ: pattern completion looks backward to a stored whole and fills the gaps in the cue, whereas predictive coding looks forward to the next input and propagates the gap between expectation and arrival. Pattern completion's output is the completed pattern; predictive coding's output is the residual — the part that did not match. They can cooperate — a generative model's top-down prediction can be understood as completing the expected pattern, and the resulting error is what predictive coding forwards — but the prime named here is the loop that carries and learns from the mismatch, not the mechanism that fills it in from memory.

Solution Archetypes

Solution archetypes in the catalog that build on this prime — directly (this prime is a source ingredient) or as a related prime.

Built directly on this prime (4)

Also a related prime in 4 archetypes

Event-Rate Magnitude Encoding: Encode intensity as event frequency and decode it by counting or integrating over a calibrated window rather than by inspecting any single event.
Predictive Precommitment Correction: Model the likely consequence of an intended action before commitment, then adjust the action while correction is still cheap.
Sparse-Activation Representation Design: Encode each case with only a few meaningful active units from a much larger codebook, so many distinctions can be represented without dense overload.
Use-Time Source Attribution Calibration: Before using a commingled memory, note, claim, trace, or generated output, classify where it came from and how certain that attribution is.

Notes¶

Predictive coding operates at multiple scales and substrates that share the loop but differ sharply in mechanism and timescale. A cortical microcircuit corrects its predictions in milliseconds; a Kalman filter updates each control cycle; a self-supervised model corrects over millions of gradient steps; an organization's variance-management process corrects monthly. The structural pattern is the same, but the cost of a wrong model, the latency of correction, and the meaning of "precision" are domain-specific, and importing a tuning intuition across these scales without re-checking the noise structure is a common error.
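A minimal sketch of why a tuning intuition does not transfer unchecked: the loop below is a fixed-gain predict-compare-correct update (a simplification, not a full Kalman filter, whose gain would be computed from noise covariances). The function name `track`, the step-shaped signal, and the two gain values are assumptions chosen only for illustration.

```python
def track(observations, gain, initial_estimate=0.0):
    """Run a fixed-gain predict-compare-correct loop.

    Returns the final estimate and the list of residuals (prediction errors).
    """
    estimate = initial_estimate
    residuals = []
    for observed in observations:
        predicted = estimate                     # model: next value equals current estimate
        residual = observed - predicted          # compare
        estimate = predicted + gain * residual   # correct, weighted by the gain
        residuals.append(residual)
    return estimate, residuals


# A noiseless source that steps from 0 to 10 and then stays there.
signal = [0.0] * 5 + [10.0] * 20

fast_estimate, fast_residuals = track(signal, gain=0.8)
slow_estimate, slow_residuals = track(signal, gain=0.1)
```

Both runs see the same residual of 10 at the step, but after twenty further samples the high-gain loop has effectively converged while the low-gain loop is still roughly 1.2 short of the new level. On this noiseless source the high gain looks strictly better; on a noisy source the same gain would chase the noise, which is why the noise structure has to be re-checked before a setting is carried from one substrate to another.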

The free-energy generalization (Friston) is powerful but contested. Treating every error-minimizing system as "inferring its own existence" or "minimizing surprise about itself" extends the prime from a description of a loop into a metaphysical claim about life and mind. Practitioners should hold the structural core — model, prediction, residual, precision-weighted correction — separately from the more ambitious philosophical packaging, which is not required to use the prime and is not what the substrate-independence assessment is scoring.

A recurring practical subtlety is that the absence of error is informative and can be misread. A quiet residual stream may mean the model is excellent, or it may mean the model has stopped attending to a channel, or that precision has been set so low that real change is being explained away. Because the loop suppresses the expected, "everything is normal" and "we have gone blind" can look identical from the residual alone; healthy predictive systems need an independent check on whether their quiet is earned.
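The ambiguity can be shown in a few lines. This is a sketch under stated assumptions: `run_loop` and `audit` are hypothetical helpers, the "stuck sensor" stands in for a channel the model has stopped attending to, and the audit assumes some sample of the true source is available from outside the loop.

```python
def run_loop(readings, gain=0.5):
    """Return (final_estimate, residuals) for a fixed-gain predict-correct loop."""
    estimate = readings[0]
    residuals = []
    for reading in readings:
        residual = reading - estimate
        estimate += gain * residual
        residuals.append(residual)
    return estimate, residuals


def audit(estimate, independent_sample, tolerance=0.5):
    """Independent check: does the estimate agree with a sample taken outside the loop?"""
    return abs(independent_sample - estimate) <= tolerance


true_stable = [5.0] * 10
true_drifting = [5.0 + i for i in range(10)]    # the real source climbs to 14.0

healthy_readings = list(true_stable)            # sensor reports the truth
stuck_readings = [5.0] * len(true_drifting)     # sensor frozen at its first value

estimate_a, residuals_a = run_loop(healthy_readings)
estimate_b, residuals_b = run_loop(stuck_readings)

streams_identical = residuals_a == residuals_b        # both streams are all zeros
quiet_earned_a = audit(estimate_a, true_stable[-1])
quiet_earned_b = audit(estimate_b, true_drifting[-1])
```

The two residual streams are indistinguishable, yet only the first passes the audit. The check has to draw on information the loop itself does not consume; any test built purely from the residuals would inherit the same blindness.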

The active-inference variant — reducing error by acting on the world rather than only updating the model — turns predictive coding from a perception story into an action story, and is where the prime touches control, robotics, and behavior. It is also where the prime's tension with goal-directed behavior is sharpest (see T6), since error minimization through action does not, by itself, distinguish a system that learns from one that merely hides from surprise.
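The two routes to a small error can be sketched side by side. The names `perceive` and `act`, the shared rate, and the thermostat-style numbers are illustrative assumptions, not a formulation taken from the active-inference literature.

```python
def perceive(belief, observation, rate=0.5):
    """Reduce error by moving the belief toward the observation."""
    return belief + rate * (observation - belief)


def act(world_state, belief, rate=0.5):
    """Reduce error by moving the world toward the belief."""
    return world_state + rate * (belief - world_state)


initial_belief, initial_world = 20.0, 10.0

# Perception only: the world stays put and the belief converges on it.
belief = initial_belief
for _ in range(20):
    belief = perceive(belief, initial_world)
perception_error = abs(initial_world - belief)

# Action only: the belief stays put and the world is pushed to match it,
# as a thermostat drives room temperature toward its setpoint.
world = initial_world
for _ in range(20):
    world = act(world, initial_belief)
action_error = abs(world - initial_belief)
```

Both errors end up negligibly small, so the residual alone cannot say which variable moved. In the first case the model changed; in the second the model is exactly what it was at the start, which is why a low error reached through action is not by itself evidence of learning.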

References¶

[1] Rao, R. P. N., & Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1), 79–87. Seminal computational-neuroscience model of the cortex as a hierarchy of predictors in which higher areas send predictions downward and lower areas return only the residual error upward; grounds the core predict–compare–correct loop and its structural separation of predicted from error component. ↩

[2] Spratling, M. W. (2017). A review of predictive coding algorithms. Brain and Cognition, 112, 92–97. Survey treating the major predictive-coding implementations as variations on one canonical computation; supports the economic-and-epistemic framing (spend only on surprise, carry only the unexplained gap) and the claim that the shared model/prediction/residual/gain vocabulary makes cross-domain transfer concrete rather than analogical. ↩

[3] Friston, K. (2010). The free-energy principle: A unified brain theory? Nature Reviews Neuroscience, 11(2), 127–138. Free-energy formulation in which a system that minimizes prediction error minimizes a bound on its own surprise and thereby maintains itself; supports the precision-weighting account that unifies hallucination, neglect, perseveration, and over-fitting as one parameter set wrongly, and the transfer of optimal-gain intuitions to attention. ↩

[4] Atal, B. S., & Schroeder, M. R. (1979). Predictive coding of speech signals and subjective error criteria. IEEE Transactions on Acoustics, Speech, and Signal Processing, 27(3), 247–254. Foundational linear-predictive-coding work modelling the vocal tract as a predictor and transmitting only the residual (excitation) difference between predicted and actual sample; supports the DPCM/LPC signal-processing instantiation of the prime. ↩

[5] Kalman, R. E. (1960). On the general theory of control systems. Proceedings of the First IFAC Congress, 1, 481–492. ↩

[6] Clark, A. (2013). Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 36(3), 181–204. Synthesis arguing that hierarchical prediction-error minimization lets bounded agents cope with the sensory torrent (upper levels see only the residual the lower levels cannot explain); supports the closed-loop vs. mere-forecast distinction and the explaining-away / counterfactual-residual diagnostic. ↩

[7] Kalman, R. E. (1963). Mathematical description of linear dynamical systems. Journal of the Society for Industrial and Applied Mathematics, Series A: Control, 1(2), 152–192.

[8] Majors, C., Fong-Jones, L., & Miranda, G. (2022). Observability Engineering: Achieving Production Excellence. O'Reilly Media.

[9] Hespanha, J. P. (2018). Linear Systems Theory (2nd ed.). Princeton University Press.

[10] Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley.

[11] Moore, B. C. (1981). Principal component analysis in linear systems: Controllability, observability, and model reduction. IEEE Transactions on Automatic Control, 26(1), 17–32.

[12] Sridharan, C. (2018). Distributed Systems Observability. O'Reilly Media.

[13] Ogata, K. (2010). Modern Control Engineering (5th ed.). Prentice Hall.

[14] Majors, C., et al. (2019). Observability: A 3-Year Retrospective. Honeycomb Engineering. https://honeycomb.io.

[15] Bever, J., & Majors, C. (2020). The cost of observability. USENIX SREcon 2020.

[16] Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (Eds.). (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media.

[17] Dwork, C., & Roth, A. (2014). The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4), 211–407.

[18] Kalman, R. E. (1961). On the general theory of control systems. IRE Transactions on Automatic Control, 6(1), 110.

[19] Sridharan, C., et al. (2021). Federated observability architectures for large-scale distributed systems. IEEE/ACM SoCC 2021.

[20] Beyer, B. (2017). Postmortem culture: Learning from failure. In Site Reliability Engineering, Ch. 15. O'Reilly Media.