Convolution¶

Prime #: 751
Origin domain: Mathematics
Subdomain: signal processing → Mathematics

Core Idea¶

Convolution is the operation by which one signal acts on another by sliding, weighting, and summing: at every position, take a weighted local sum of the input using the second signal as the weight pattern, and the answer at that position is the result. In continuous form $(f * g)(t) = \int f(\tau)\, g(t-\tau)\, d\tau$, and in discrete form the corresponding sum; in both, the second function is flipped and slid across the first. The structural commitment is that each output value is a localized, time-shifted mixture of input values produced by a single fixed mixing pattern — the kernel — applied identically at every position.
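As a minimal sketch of the discrete form, the sum $(f * g)[n] = \sum_k f[k]\, g[n-k]$ can be written directly in Python. The function name `convolve` and the sample values are illustrative assumptions rather than part of the entry, and finite lists are treated as zero outside their range.

```python
def convolve(f, g):
    """Full discrete convolution: out[n] = sum_k f[k] * g[n - k]."""
    out = [0.0] * (len(f) + len(g) - 1)
    for k, fk in enumerate(f):
        for j, gj in enumerate(g):
            # f[k] * g[j] lands at output position n = k + j, i.e. j = n - k
            out[k + j] += fk * gj
    return out


signal = [0.0, 0.0, 3.0, 0.0, 0.0, 6.0, 0.0]
kernel = [1 / 3, 1 / 3, 1 / 3]  # three-point moving average (illustrative)
smoothed = convolve(signal, kernel)
print(smoothed)
```

Each isolated spike in `signal` is spread over three neighbouring positions, and because the kernel weights sum to one, the total of the output matches the total of the input.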

This single shape captures three families of phenomenon that look unrelated until the mixing pattern is named. Filtering and smoothing: replace each point with a weighted average of its neighbourhood, as in moving averages, Gaussian blur, and edge detection. Memory and persistence: let the present output depend on a fading mixture of past inputs, as in impulse-response systems, RC circuits, and indebted economies. Spread and aggregation of independent contributions: the distribution of a sum of independent random variables is the convolution of their individual distributions, and the spread of news, votes, or particles follows the same shape. Convolution is a prime because the same kernel object — the impulse response, the influence function, the susceptibility, the synaptic weight — governs analyses across signal processing, probability, neural computation, physics, and the social sciences, and because the operation carries an unusual algebraic richness (it is commutative, associative, and bilinear, it has an identity in the delta function, and it diagonalizes under the Fourier transform) that makes the cross-domain transfer load-bearing rather than cosmetic. The mixing-pattern shape is medium-neutral; only the kernel's interpretation changes from one substrate to the next.

How would you explain it like I'm…

Slide and Smear

Imagine smearing each dot of a drawing a little into its neighbours to make it blurry. To find the new colour of one spot, you mix in a bit of all the spots around it. You slide that same little mixing recipe across the whole picture. That sliding-and-mixing is convolution.

The Sliding Mixer

Convolution takes two signals and lets one of them reshape the other by sliding, weighting, and adding. You take a small pattern of weights, called a kernel, and lay it over each spot of the input, multiply the nearby values by those weights, and add them up to get the new value at that spot. Then you slide the kernel one step over and do it again, everywhere. A blur kernel averages neighbours together to smooth a photo; a different kernel can sharpen edges instead. The trick is that the same little recipe is applied identically at every position.

Sliding Weighted Sum

Convolution is a sliding weighted sum: at every position, you replace a value with a weighted mixture of the input values around it, using a fixed pattern called the kernel. Formally you flip the kernel and slide it across the input, multiplying and summing at each offset. The single idea covers things that look unrelated: smoothing and edge detection (each point becomes a weighted average of its neighbourhood), fading memory (the present output is a decaying mixture of past inputs, as in an echo), and combining randomness (the distribution of a sum of two independent random variables is the convolution of their distributions). What stays the same is the mixing pattern; only its meaning changes from blur to memory to probability.

Convolution is the operation in which one signal acts on another by sliding, weighting, and summing: each output value is a localized, time-shifted mixture of input values produced by a single fixed kernel applied identically everywhere. In continuous form $(f * g)(t) = \int f(\tau)\, g(t-\tau)\, d\tau$, and the discrete sum mirrors it; in both, the second function is reflected and slid across the first. The same shape captures three families that look distinct until the kernel is named. Filtering and smoothing replace each point with a weighted neighbourhood average (moving averages, Gaussian blur, edge detection). Memory and persistence make the present a fading mixture of the past (impulse-response systems, RC circuits). And spread of independent contributions appears because the density of a sum of independent random variables is the convolution of their densities. Convolution is unusually well-behaved algebraically: it is commutative, associative, and bilinear, has the delta function as identity, and diagonalizes under the Fourier transform, which turns convolution into pointwise multiplication and makes the cross-domain transfer load-bearing rather than cosmetic.

Structural Signature¶

the input signal indexed by position — the fixed kernel (mixing pattern) — the slide-flip-weight-sum operation at each position — the translation invariance of the kernel — the algebraic richness (commutativity, associativity, delta identity) — the Fourier diagonalization into pointwise multiplication

A configuration is a convolution when each of the following holds:

A position-indexed input. There is a signal — a function over time, space, an index, or a probability density — whose values are arrayed over positions a kernel can be slid across.
A fixed kernel. A single weight pattern (impulse response, influence function, susceptibility, synaptic weight) specifies how much each input position contributes to a nearby output position; its interpretation is the only thing that changes across substrates.
A local mixing operation. Each output value is the flipped, shifted, weighted sum of input values under the kernel — a localized mixture computed identically at every position.
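Translation invariance. The same kernel applies at every position; convolution is the unique linear operation commuting with translation, in the sense that every linear, translation-invariant map (under mild regularity conditions) is convolution with some kernel, so any analysis demanding shift-invariance lands here by necessity.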
Algebraic richness. The operation is commutative, associative, and bilinear, has the delta function as identity, and admits a deconvolution inverse when well-conditioned — so filters cascade and reorder freely.
Fourier diagonalization. Convolution in the position domain is pointwise multiplication in the frequency domain, giving a dual description (name the kernel; read its frequency content) that is itself substrate-neutral.

These compose into a fixed-mixing-pattern device: array an input over positions, slide one kernel across it taking weighted local sums, and exploit translation invariance, algebraic closure, and Fourier diagonalization — collapsing a whole history of dependence onto a single small, sliceable kernel object.
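The algebraic and Fourier items in the signature can be checked numerically on small examples. The sketch below is an illustration under assumptions, not a proof: the helper names (`convolve`, `dft`, `close`) and the sample sequences are hypothetical, and the DFT is a naive one written out so the block needs only the standard library.

```python
import cmath


def convolve(f, g):
    """Full discrete convolution of two finite sequences."""
    out = [0.0] * (len(f) + len(g) - 1)
    for k, fk in enumerate(f):
        for j, gj in enumerate(g):
            out[k + j] += fk * gj
    return out


def dft(x, n):
    """Naive length-n DFT of x, zero-padded to n samples."""
    padded = list(x) + [0.0] * (n - len(x))
    return [sum(v * cmath.exp(-2j * cmath.pi * k * m / n)
                for m, v in enumerate(padded))
            for k in range(n)]


def close(a, b, tol=1e-9):
    """Element-wise comparison with a small tolerance."""
    return len(a) == len(b) and all(abs(p - q) < tol for p, q in zip(a, b))


f, g, h = [1.0, 2.0, 3.0], [0.5, -1.0], [2.0, 0.0, 1.0, 4.0]
delta = [1.0]  # discrete delta: the identity element

commutative = close(convolve(f, g), convolve(g, f))
associative = close(convolve(convolve(f, g), h), convolve(f, convolve(g, h)))
identity = close(convolve(f, delta), f)

# Convolution theorem: padding to the full output length makes the
# circular convolution implied by the DFT equal the linear one.
n = len(f) + len(g) - 1
lhs = dft(convolve(f, g), n)
rhs = [a * b for a, b in zip(dft(f, n), dft(g, n))]
diagonalized = close(lhs, rhs)

print(commutative, associative, identity, diagonalized)
```

The zero-padding in `dft` is the design choice that matters here: without it the product of spectra corresponds to a wrapped-around (circular) convolution rather than the linear one.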

What It Is Not¶

Not correlation. Cross-correlation slides one signal across another without flipping the kernel and measures similarity at each lag; convolution flips the kernel and computes a system's output. The flip is structurally decisive — convolution is commutative and associative (filters cascade), correlation is not, and correlation answers "how alike, at this shift?" not "what does this system produce?" (see correlation).
Not function_mapping in general. A function mapping is any rule from inputs to outputs; convolution is the unique linear, translation-invariant such mapping. Its defining constraint — one fixed kernel applied identically everywhere — is far narrower than an arbitrary input-output rule (see function_mapping).
Not attention. Attention computes input-dependent, position-varying weights — the mixing pattern is recomputed per query from the data; convolution uses one fixed kernel applied at every position. Attention's weights respond to content; convolution's do not, which is exactly what makes convolution translation-invariant and attention adaptive.
Not aliasing_and_harmonic_distortion. Aliasing is an artifact arising when sampling or nonlinearity folds frequencies; convolution is the clean linear operation. They meet (a sampling kernel's spectrum governs aliasing), but one is a distortion phenomenon, the other the well-behaved mixing operation (see aliasing_and_harmonic_distortion).
Not conjugate_variables. Conjugate variables are pairs (position/momentum, time/frequency) linked by a Fourier transform with a trade-off relation; convolution is an operation that the Fourier transform happens to diagonalize. The Fourier connection is shared, but conjugacy is about a variable pair's uncertainty trade-off, not about a mixing operation.
Common misclassification. Treating any "weighted sum of neighbors" as convolution. The catch: ask whether the same weight pattern applies at every position (translation invariance) and whether the kernel is fixed rather than data-dependent; if weights vary by position or content, it is a general linear map or attention, not a convolution.
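The misclassification check above can be phrased as a test: does the operation commute with a shift of the input? The sketch below assumes finite signals delayed by prepending a zero; `position_weighted` is a hypothetical linear map invented for contrast, and none of the names come from the entry.

```python
def convolve(f, g):
    """Full discrete convolution of two finite sequences."""
    out = [0.0] * (len(f) + len(g) - 1)
    for k, fk in enumerate(f):
        for j, gj in enumerate(g):
            out[k + j] += fk * gj
    return out


def shift(x):
    """Delay a finite signal by one step (prepend a zero)."""
    return [0.0] + list(x)


def position_weighted(x):
    """A linear map whose weight depends on position, so not a convolution."""
    return [(i + 1) * v for i, v in enumerate(x)]


def commutes_with_shift(op, x, tol=1e-12):
    """True when applying op after a shift equals shifting after op."""
    a, b = op(shift(x)), shift(op(x))
    return len(a) == len(b) and all(abs(p - q) < tol for p, q in zip(a, b))


kernel = [0.25, 0.5, 0.25]
x = [1.0, 4.0, 2.0, 8.0]

print(commutes_with_shift(lambda s: convolve(s, kernel), x))  # fixed kernel
print(commutes_with_shift(position_weighted, x))              # weights vary by position
```

Both operations are weighted sums of the input, but only the fixed-kernel one passes the shift test, which is the distinction the checklist asks for.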

Broad Use¶

Signal processing, audio, and image. Low-pass and high-pass filters, Gaussian and box smoothing, edge detection via gradient kernels, and reverb via a room's impulse response: the linear, shift-invariant filter literature is convolution with the right kernel. (Median smoothing, often listed alongside these, is nonlinear and therefore not a convolution.)
Probability. The density of $X+Y$ for independent $X, Y$ is the convolution of their densities; central-limit proofs are statements about iterated convolutions converging, after centering and rescaling, to the Gaussian, and characteristic-function methods are the Fourier-domain version.
Neural computation. A neuron computes a weighted sum of its inputs over time; convolutional networks replace per-position parameters with a shared kernel to encode translation invariance, and early visual receptive fields are well modelled as oriented convolutions.
Physics. A linear time-invariant system's response to any input is the convolution of the input with the impulse response; Green's-function solutions of linear PDEs convolve the Green's function with the source.
Economics and operations. Distributed-lag models convolve past shocks with a response weight; queueing waiting-time distributions are convolutions of service times; depreciation schedules convolve investment with a decay kernel.
Social diffusion and vision. Cascade models compute influence at each node as a convolution of upstream activations with an influence kernel, and the pre-deep-learning vision toolkit — detection, blurring, sharpening — is convolution-based.

Clarity¶

Naming a process as a convolution forces the analyst to write down the kernel — the mixing pattern that turns input into output — and many disputes about "what causes the smoothing here" dissolve once the kernel is exhibited. A complaint about "fading memory" becomes a precise question about the decay rate of an exponential kernel; a debate about whether a signal is "noisy or just blurred" becomes a question about whether to deconvolve and which kernel to invert. The clarifying force is to convert a vague description of how past or neighbouring values influence the present into an explicit, sliceable object — the kernel — whose shape encodes exactly the dependence in question, so that arguments about the dependence become arguments about a function one can plot. The Fourier view adds a second source of clarity: convolution in the time or space domain is multiplication in the frequency domain, so many problems that are hard to reason about as overlapping local mixtures become straightforward as a multiplication after a change of basis. This dual clarity — name the kernel, then read its frequency content — transfers across substrates, because both the kernel and its Fourier transform are substrate-neutral descriptions of the same fixed mixing pattern.

Manages Complexity¶

Convolution turns a potentially complicated dependence — every output depending on the entire history of the input — into a single object, the kernel, that captures that dependence in a small, transparent, sliceable form. Once the kernel is known, the response to any input is fully determined; once its Fourier transform is known, the system's behaviour at each frequency can be read off directly. For a linear time-invariant system this reduction is total: the kernel is the complete description of the system, so an infinite-dimensional input-output mapping collapses to one function, and every linear question about the system becomes a convolution against that function. This is one of the few cases in which complexity collapses to a single, fully characterized object, and the consequence across substrates is that a complicated dynamical, biological, or economic system can sometimes be summarized by the question "what is its impulse response?", after which every linear question about it is answered by a convolution. The management move is to identify the kernel and then reason against it rather than against the full input-output behaviour, and the saving is that the kernel is typically small and interpretable where the full mapping is unbounded — a single decay profile standing in for an entire history of dependence.
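The claim that the kernel is the complete description of a linear time-invariant system can be illustrated with a toy system. In this sketch `leaky_system` is a hypothetical black box (a first-order recursion chosen for illustration): its impulse response is measured once and then used to predict the response to a different input by convolution.

```python
def convolve(f, g):
    """Full discrete convolution of two finite sequences."""
    out = [0.0] * (len(f) + len(g) - 1)
    for k, fk in enumerate(f):
        for j, gj in enumerate(g):
            out[k + j] += fk * gj
    return out


def leaky_system(x, a=0.5):
    """Hypothetical black-box LTI system: y[n] = a * y[n-1] + x[n]."""
    y, prev = [], 0.0
    for v in x:
        prev = a * prev + v
        y.append(prev)
    return y


N = 12
impulse = [1.0] + [0.0] * (N - 1)
h = leaky_system(impulse)  # measured kernel: 1, 0.5, 0.25, ...

x = [2.0, -1.0, 0.0, 3.0, 1.0, 0.0, 0.0, 4.0, -2.0, 0.0, 1.0, 0.0]
direct = leaky_system(x)         # run the system itself
predicted = convolve(x, h)[:N]   # reason against the kernel instead

print(max(abs(d - p) for d, p in zip(direct, predicted)))
```

Only the first `N` output samples are compared, since the measured kernel is known for `N` lags; over that window the prediction from the kernel agrees with running the system.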

Abstract Reasoning¶

Convolution supports several reusable inference patterns, each stated in terms of kernels and mixtures rather than any substrate. Algebraic manipulation: because convolution is commutative, associative, and bilinear with a delta-function identity and a deconvolution inverse when well-conditioned, one can split a complex filter into a cascade of simpler ones or swap the order of operations, manipulations unavailable for general operations. The convolution theorem: the Fourier transform diagonalizes convolution, turning it into pointwise multiplication, which is the deep reason filtering, signal processing, and PDE methods all share Fourier analysis as their workhorse. Concentration via repeated convolution: convolving a probability density of finite variance with itself repeatedly drives its shape toward a Gaussian whose width grows like the square root of the count, which underwrites the universality of the normal distribution as the description of summed independent contributions. Translation invariance: convolution is the unique linear operation that commutes with translation, in the sense that every linear, translation-invariant map is convolution with some kernel, so any analysis demanding shift-invariance (image recognition, time-shift symmetry in physics, equal treatment of equal position) lands on convolution by structural necessity. Shape rules under combination: convolving exponentials of a common rate gives a Gamma, convolving Gaussians gives Gaussians, convolving uniforms gives a B-spline, each a transferable rule about how kernels combine. Each pattern is a template about a fixed translation-invariant mixing operation, and each redeploys across signal processing, probability, physics, and learning by recognizing the convolution structure in the new setting.
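The square-root growth of the width under repeated convolution can be seen in a small sketch. The fair-coin mass function is an assumption made for illustration (it has variance 1/4, so the standard deviation after $n$ flips should be $\sqrt{n}/2$); the helper names are hypothetical.

```python
def convolve(f, g):
    """Full discrete convolution of two finite sequences."""
    out = [0.0] * (len(f) + len(g) - 1)
    for k, fk in enumerate(f):
        for j, gj in enumerate(g):
            out[k + j] += fk * gj
    return out


def mean_var(pmf):
    """Mean and variance of a mass function indexed by outcome 0, 1, 2, ..."""
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    return mean, var


coin = [0.5, 0.5]  # fair coin: P(0) = P(1) = 1/2
pmf = [1.0]        # delta at 0: the distribution of an empty sum
widths = []
for n in range(1, 17):
    pmf = convolve(pmf, coin)    # distribution of the sum of n flips
    _, var = mean_var(pmf)
    widths.append(var ** 0.5)    # standard deviation after n flips

print(widths[3], widths[15])  # n = 4 and n = 16
```

Quadrupling the number of convolutions from 4 to 16 doubles the width, and the resulting mass function is the binomial distribution, whose bell shape is the discrete face of the Gaussian limit.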

Knowledge Transfer¶

The transferable content of convolution is the kernel itself and the diagnostics that attach to it, and because the sliding-weighted-mixture operation is a medium-neutral piece of mathematics, the same kernel object travels intact across signal processing, probability, neural computation, physics, economics, social diffusion, and vision. Filter design transfers into policy design: a central banker's choice of how aggressively to respond to past shocks is, mathematically, the choice of a kernel, and debates over smoothing a policy path are kernel-design debates that import directly from signal-processing intuition about causal kernels, phase delay, and leakage. Impulse response transfers as a universal diagnostic: macroeconomists test models by computing impulse-response functions, engineers test physical systems by injecting an impulse and reading the response, and the same diagnostic characterizes an organization's "filter" by the response of a workflow to a one-off shock. Deconvolution transfers as inference: recovering an unknown cause from an observed effect given a known propagation kernel is the same problem in astronomy (deconvolving a telescope's point-spread function), medical imaging (reconstruction as deconvolution plus tomography), and policy evaluation (backing out the timing of a cause from a lagged effect). Kernel choice transfers as inductive bias: in machine learning the kernel encodes the modeller's prior beliefs about smoothness, locality, and invariance, and the same insight applies to qualitative work, where "which features of the past matter, and how should they fade?" is itself a kernel choice. 
A photographer blurring a background with a Gaussian kernel, an economist whose aggregated noisy quantities are approximately normally distributed because the distribution of the aggregate is a multi-fold convolution tending to a Gaussian by the central limit theorem, and a structural engineer computing a building's earthquake response as the convolution of ground acceleration with the building's impulse response are performing one operation in three substrates. The diagnostic that transfers among them is identical (what is the kernel, and how can I reshape it?), whether the reshaping is choosing a smoothing window, designing a lag structure, or adding dampers to a building.

Examples¶

Formal/abstract¶

Let $X$ and $Y$ be independent random variables, $X$ uniform on $\{1,\dots,6\}$ (a fair die) and $Y$ likewise, and ask for the distribution of the sum $S = X+Y$. The position-indexed input is the probability mass function $p_X$ arrayed over outcomes; the fixed kernel is $p_Y$, the second die's mass function used as the mixing pattern. The slide-flip-weight-sum operation computes $p_S(s) = \sum_k p_X(k)\, p_Y(s-k)$ — for each total $s$, slide $p_Y$ across $p_X$ and sum the products — which is exactly the discrete convolution $(p_X * p_Y)(s)$, yielding the familiar triangular distribution peaking at $7$. The translation invariance holds: the same kernel weights apply at every position $s$. The algebraic richness is visible — convolution's commutativity means it does not matter which die we treat as kernel, and associativity means the distribution of three dice is $p_X * p_Y * p_Z$ computed in any order. Concentration via repeated convolution is the deep consequence: convolving many such densities together drives the sum toward a Gaussian whose width grows like $\sqrt{n}$ — this is the central limit theorem read as iterated convolution. The Fourier diagonalization gives the dual proof: characteristic functions (Fourier transforms of the densities) multiply under convolution, so the CLT becomes a statement that a product of characteristic functions converges to the Gaussian's.

Mapped back: Summing independent random variables instantiates the full signature — a position-indexed input, a fixed kernel, sliding weighted sums, translation invariance, commutative/associative algebra, and Fourier diagonalization — with the central limit theorem emerging as repeated-convolution concentration.
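The two-dice computation can be reproduced exactly with rational arithmetic. This is a minimal sketch: the list layout (index equals the face value or total, with a zero at index 0) and the helper name are assumptions made for illustration.

```python
from fractions import Fraction


def convolve(f, g):
    """Full discrete convolution with exact rational arithmetic."""
    out = [Fraction(0)] * (len(f) + len(g) - 1)
    for k, fk in enumerate(f):
        for j, gj in enumerate(g):
            out[k + j] += fk * gj
    return out


# Index = face value; index 0 carries probability 0.
die = [Fraction(0)] + [Fraction(1, 6)] * 6

two_dice = convolve(die, die)         # p_S(s) = sum_k p_X(k) p_Y(s - k)
three_dice = convolve(two_dice, die)  # one grouping of p_X * p_Y * p_Z

for total in range(2, 13):
    print(total, two_dice[total])
```

The printed values rise from 1/36 at a total of 2 to 1/6 at 7 and fall back symmetrically, which is the triangular distribution described above; grouping the three-dice convolution the other way, `convolve(die, two_dice)`, gives the same list.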

Applied/industry¶

A structural engineer computing a building's response to an earthquake instantiates convolution in its impulse-response form. The building is modeled as a linear time-invariant system whose complete dynamic behavior is captured by a single kernel: its impulse response $h(t)$, the displacement the structure exhibits after a unit-impulse jolt, encoding its natural frequencies and damping. The position-indexed input is the recorded ground acceleration $a(t)$ over time. The building's response is then the convolution $y(t) = (a * h)(t) = \int a(\tau)\, h(t-\tau)\, d\tau$: because $h$ is causal, each instant's response is a fading, weighted mixture of all prior ground motion, weighted by the impulse response, which is the memory-and-persistence family of the operation. This is the manage-complexity payoff in action: an infinite-dimensional input-output mapping collapses to one interpretable function, so the engineer reasons against $h(t)$ rather than re-deriving the response for every possible quake. The Fourier diagonalization is the working tool: convolution in time becomes multiplication in frequency, so the engineer reads off resonance by checking where the ground motion's frequency content overlaps the building's transfer function, and reshapes the kernel by adding dampers (making $h$ decay faster in time, which broadens and lowers the resonant peak of the transfer function). The same impulse-response diagnostic transfers directly to a macroeconomist testing a model via impulse-response functions (the economy's response to a one-off policy shock is a convolution of shocks with a response weight) and to an audio engineer applying reverb by convolving a dry signal with a room's measured impulse response.

Mapped back: Earthquake response, macroeconomic impulse-response analysis, and convolution reverb all summarize a system by one kernel and convolve inputs against it — exploiting Fourier diagonalization to read and reshape it — instantiating convolution in structural-engineering, economics, and audio substrates.
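To make the frequency-domain reading concrete, the sketch below assumes the simplest possible building model, a single damped oscillator $x'' + 2\zeta\omega_0 x' + \omega_0^2 x = \text{input}$, which the entry does not specify. Under that assumption the transfer-function magnitude is $|H(\omega)| = 1/\sqrt{(\omega_0^2 - \omega^2)^2 + (2\zeta\omega_0\omega)^2}$; the 1 Hz natural frequency and the damping ratios are hypothetical values.

```python
import math


def gain(omega, omega0, zeta):
    """|H(omega)| for x'' + 2*zeta*omega0*x' + omega0**2 * x = input."""
    return 1.0 / math.sqrt((omega0 ** 2 - omega ** 2) ** 2
                           + (2.0 * zeta * omega0 * omega) ** 2)


omega0 = 2.0 * math.pi  # hypothetical 1 Hz natural frequency, in rad/s
freqs = [omega0 * r / 100.0 for r in range(1, 301)]  # 0.01 to 3 times omega0

# Peak gain over the frequency grid for three illustrative damping ratios.
peaks = {zeta: max(gain(w, omega0, zeta) for w in freqs)
         for zeta in (0.02, 0.05, 0.20)}

for zeta, peak in peaks.items():
    print(f"zeta={zeta:.2f}  peak gain={peak:.4f}")
```

In this model, raising the damping ratio lowers the resonant peak, which is the kernel reshaping that dampers are meant to achieve; at $\omega = \omega_0$ the gain reduces to $1/(2\zeta\omega_0^2)$.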

Structural Tensions¶

T1 — Linearity Assumed versus Nonlinear Reality (frame). Convolution is exactly the response of a linear time-invariant system; the entire kernel apparatus presupposes superposition holds. The competing reality is that many systems saturate, threshold, or interact nonlinearly. The failure mode is convolving past the linear regime — modeling a structure's response to a large quake, or a market's response to a big shock, with the small-signal impulse response, where the true response no longer scales. Diagnostic: ask whether doubling the input doubles the output; if superposition fails, the kernel describes only a local linearization, and treating it as the complete system mapping mis-predicts exactly the large inputs that matter most.
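The doubling diagnostic can be run as a small experiment. In the sketch below both systems are hypothetical: `linear_system` is a fixed-kernel filter, and `saturating_system` is the same filter followed by a hard clip standing in for a nonlinearity.

```python
def convolve(f, g):
    """Full discrete convolution of two finite sequences."""
    out = [0.0] * (len(f) + len(g) - 1)
    for k, fk in enumerate(f):
        for j, gj in enumerate(g):
            out[k + j] += fk * gj
    return out


def linear_system(x):
    """Fixed-kernel filter (illustrative weights)."""
    return convolve(x, [0.6, 0.3, 0.1])


def saturating_system(x, limit=1.0):
    """Same filter followed by a hard clip (hypothetical nonlinearity)."""
    return [max(-limit, min(limit, v)) for v in linear_system(x)]


def doubles(system, x, tol=1e-12):
    """Diagnostic: does doubling the input double the output?"""
    y1 = system(x)
    y2 = system([2.0 * v for v in x])
    return all(abs(b - 2.0 * a) < tol for a, b in zip(y1, y2))


small = [0.1, 0.2, 0.1]
large = [1.0, 2.0, 1.0]

print(doubles(linear_system, large))      # superposition holds
print(doubles(saturating_system, small))  # still inside the linear regime
print(doubles(saturating_system, large))  # clipped: kernel no longer suffices
```

The saturating system passes the test for the small input and fails it for the large one, which is the pattern the tension describes: a kernel measured at small amplitude says nothing reliable about the large inputs.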

T2 — Translation Invariance versus Position-Dependence (scopal). The single fixed kernel applies identically everywhere, which is what makes the operation a convolution rather than a general linear map. The failure mode is forcing one kernel where the system's behavior actually varies with position or time — a blur kernel uniform across an image with spatially-varying focus, or a fixed lag structure in an economy whose dynamics shifted regime. Diagnostic: ask whether the mixing pattern is genuinely the same at every position; if the impulse response depends on where or when the impulse lands, the system is not shift-invariant, and a single kernel averages away the very heterogeneity being modeled.

T3 — Deconvolution Promise versus Ill-Conditioning (measurement). The algebra offers a deconvolution inverse — recover the input from the output given the kernel — but inversion is only well-conditioned when the kernel does not kill frequencies. The failure mode is naive deconvolution that amplifies noise catastrophically: where the kernel's Fourier transform is near zero, dividing by it explodes measurement error into the reconstruction. Diagnostic: ask whether the kernel's spectrum has zeros or deep troughs; if it does, those frequencies are irrecoverably lost and inversion there is dominated by noise, so deconvolution must be regularized, not run as a clean algebraic inverse.
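A small sketch of the spectral diagnostic and of one common regularization follows. Everything here is an assumption for illustration: the two-point averaging kernel (whose spectrum vanishes at the highest frequency), the circular boundary handling, the Tikhonov-style damped inverse with constant `eps`, and the test signal, which was chosen to carry no energy at the killed frequency so that it remains recoverable.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform (fine for short signals)."""
    n = len(x)
    return [sum(v * cmath.exp(-2j * cmath.pi * k * m / n)
                for m, v in enumerate(x))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft, returning the real part."""
    n = len(spectrum)
    return [sum(v * cmath.exp(2j * cmath.pi * k * m / n)
                for k, v in enumerate(spectrum)).real / n
            for m in range(n)]


def circular_convolve(x, kernel):
    """Circular convolution of two equal-length sequences."""
    n = len(x)
    return [sum(kernel[k] * x[(m - k) % n] for k in range(n))
            for m in range(n)]


x = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]  # unknown cause (hypothetical)
kernel = [0.5, 0.5] + [0.0] * 6               # two-point average, zero-padded
y = circular_convolve(x, kernel)              # observed effect

H = dft(kernel)
smallest_gain = min(abs(h) for h in H)  # near zero: a naive 1/H would explode

# Regularized inverse: conj(H) / (|H|^2 + eps) instead of 1 / H.
eps = 1e-9
Y = dft(y)
x_hat = idft([h.conjugate() * v / (abs(h) ** 2 + eps) for h, v in zip(H, Y)])
max_error = max(abs(a - b) for a, b in zip(x, x_hat))

print(smallest_gain, max_error)
```

The regularized inverse recovers this particular signal closely, but any component of the input at the zeroed frequency would be absent from `y` and could not be restored by any choice of `eps`.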

T4 — Finite Kernel versus Infinite Memory (temporal). A kernel collapses an entire history of dependence into one object, but a kernel with long or heavy tails means each output depends on the distant past, and truncating it for tractability discards real influence. The failure mode is windowing away a slow-decaying tail — using a short moving average on a process with long memory — so persistent effects are silently dropped. Diagnostic: ask how fast the kernel decays relative to the analysis window; if the tail carries non-negligible weight beyond the truncation, the finite kernel under-represents persistence, and the discarded memory reappears as unexplained drift.
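The decay-versus-window diagnostic has a closed form for one simple case. The sketch assumes a normalized geometric kernel $h[n] = (1-a)\,a^n$ (an illustrative choice, not one the entry prescribes), for which the weight lying at lags beyond a window of length $W$ is $a^W$.

```python
def tail_weight(a, window):
    """Weight of the kernel (1 - a) * a**n that falls at lags >= window."""
    kept = sum((1.0 - a) * a ** n for n in range(window))
    return 1.0 - kept


window = 10
fast = tail_weight(0.5, window)   # quickly fading memory
slow = tail_weight(0.99, window)  # long memory

print(f"a=0.50: {fast:.4f} of the weight is discarded")
print(f"a=0.99: {slow:.4f} of the weight is discarded")
```

With the same ten-step window, the fast-decaying kernel loses about a tenth of a percent of its weight while the slow-decaying one loses roughly nine tenths, so the truncation is harmless in one case and discards most of the dependence in the other.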

T5 — Position Domain versus Frequency Domain (representation duality). Fourier diagonalization makes convolution into pointwise multiplication, tempting the analyst to always work in the frequency domain. But the duality has a cost: sharp localization in one domain means delocalization in the other, and frequency-domain reasoning obscures local, transient structure. The failure mode is treating a non-stationary or edge-rich signal with global Fourier methods, smearing a localized event across all frequencies. Diagnostic: ask whether the phenomenon of interest is localized in position or in frequency; if it is a transient or an edge, the multiplication shortcut hides it, and a position-domain or time-frequency view is needed instead.

T6 — Kernel as Inductive Bias versus as Discovered Truth (scopal/epistemic). Choosing a kernel encodes prior beliefs about smoothness, locality, and which past matters — a modeling commitment, not a neutral measurement. The failure mode is reifying the chosen kernel as the system's true response when it was an assumption: a Gaussian smoothing window imposes smoothness the data may not have, and a chosen lag structure bakes in a decay the world may not obey. Diagnostic: ask whether the kernel was measured (an injected impulse, an estimated response) or assumed (a convenient shape); if assumed, the convolution's output partly reflects the analyst's prior, and conclusions inherit whatever locality and smoothness the kernel quietly imposed.

Structural–Framed Character¶

Convolution sits at the structural pole of the structural–framed spectrum: a pure mathematical operation — slide one fixed kernel across an input, taking weighted local sums — with a zero aggregate and every diagnostic reading the same way.

The pattern carries no home vocabulary that must travel with it: the sliding-weighted-mixture is told in an engineer's "impulse response," a probabilist's "density of a sum," a physicist's "Green's function," an economist's "distributed lag," and a neuroscientist's "receptive field," each in its own field's words, with only the kernel's interpretation changing while the operation stays identical — the entry stresses the transfer is "load-bearing rather than cosmetic" precisely because the same kernel object travels intact. It carries no evaluative weight: convolving a signal is value-neutral, a clean linear mixing operation with no approval attached and no normative content. Its origin is formal — an integral (or sum) defined over position-indexed functions, with no institutional pedigree. It is not bound to a human practice: a building's earthquake response is the convolution of ground acceleration with its impulse response, and the distribution of a sum of independent variables is the convolution of their densities, both facts about the substrate that hold with no observer present. And invoking it recognizes a mixing structure already present — any linear translation-invariant system is a convolution by structural necessity, the kernel waiting to be named — rather than importing an interpretive frame. Every diagnostic points one way, which is why the grade is a clean structural zero.

Substrate Independence¶

Convolution is about as substrate-independent as a prime can be — composite 5 / 5 on the substrate-independence scale. Its signature is a medium-neutral mathematical operation — array an input over positions, slide one fixed kernel across it taking flipped, weighted local sums — and only the kernel's interpretation changes from one substrate to the next, so it is recognized rather than translated when it surfaces in a new field. The breadth is maximal: low-pass filters and reverb in signal/audio/image processing, the density of a sum of independent variables in probability, shared-kernel layers and visual receptive fields in neural computation, impulse-response and Green's-function solutions in physics, distributed-lag models and queueing waiting-times in economics and operations, and cascade-influence models in social diffusion all instantiate the identical operation with the kernel reinterpreted as impulse response, influence function, susceptibility, or synaptic weight. The abstraction is maximal — the sliding-weighted-mixture is told as "impulse response," "density of a sum," "Green's function," "distributed lag," or "receptive field," each in its own field's words, while the operation stays identical, and any linear translation-invariant system is a convolution by structural necessity. The transfer is heavily documented and load-bearing, carried by an unusual algebraic richness (commutativity, associativity, bilinearity, the delta identity, and Fourier diagonalization into pointwise multiplication): the same kernel object and the same diagnostic — what is the kernel, and how can I reshape it? — port intact across a photographer's Gaussian blur, an economist's central-limit aggregation, and a structural engineer's earthquake-response computation. Maximal abstraction, maximal breadth, and deep documented transfer all line up at the ceiling.

Composite substrate independence — 5 / 5
Domain breadth — 5 / 5
Structural abstraction — 5 / 5
Transfer evidence — 5 / 5

Relationships to Other Primes¶

Parents (1) — more general patterns this builds on

Convolution is a kind of Function (Mapping)

The entry characterizes convolution as "the unique linear, translation-invariant" function mapping: one fixed kernel applied identically everywhere. It is therefore a sharply constrained specialization of function_mapping (an arbitrary input-output rule).

Path to root: Convolution → Function (Mapping)

Neighborhood in Abstraction Space¶

Convolution sits in a sparse region of abstraction space (77th percentile for distinctiveness): few abstractions share its structure, so a faithful description tends to retrieve it precisely rather than landing on a neighbor.

Family — Signal Transformation & Mapping Effects (10 primes)

Nearest neighbors

Computed from structural-signature embeddings · 2026-06-14

Not to Be Confused With¶

Convolution's most precise confusion is with correlation (specifically cross-correlation), because the two are computed by an almost identical slide-and-sum and differ by a single operation: the flip. Cross-correlation slides one signal across another and, at each lag, sums their pointwise product without reversing the kernel; convolution reverses (flips) the kernel before sliding. This one difference has large structural consequences. Convolution is commutative and associative — it does not matter which signal is the kernel, and a cascade of filters can be reordered or merged — which is exactly why linear-time-invariant systems compose so cleanly; cross-correlation is neither, because the flip is what symmetrizes the operation. The two also answer different questions: convolution computes what a system produces when its impulse response acts on an input, while cross-correlation measures how similar two signals are at each relative shift, the workhorse of template matching, time-delay estimation, and the (misnamed) "convolutional" layers of neural networks, which actually compute cross-correlations. A practitioner who conflates them will get matched-filter outputs and system responses confused, and will wrongly expect the commutativity and cascade-reordering that only the flipped operation enjoys. The diagnostic is simply whether the kernel is reversed: flip present, convolution and its full algebra; flip absent, correlation and its similarity interpretation.
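The effect of the flip can be seen on two short sequences. In this sketch cross-correlation is written as convolution with the reversed second sequence, which is one common convention for real-valued signals; the helper names and sample values are illustrative assumptions.

```python
def convolve(f, g):
    """Full discrete convolution of two finite sequences."""
    out = [0.0] * (len(f) + len(g) - 1)
    for k, fk in enumerate(f):
        for j, gj in enumerate(g):
            out[k + j] += fk * gj
    return out


def cross_correlate(f, g):
    """Slide g across f without flipping: convolve f with g reversed."""
    return convolve(f, g[::-1])


f = [1.0, 2.0, 3.0]
g = [0.0, 1.0, 0.5]  # deliberately asymmetric

conv_fg, conv_gf = convolve(f, g), convolve(g, f)
corr_fg, corr_gf = cross_correlate(f, g), cross_correlate(g, f)

print(conv_fg == conv_gf)        # swapping roles leaves convolution unchanged
print(corr_fg == corr_gf)        # swapping roles changes cross-correlation
print(corr_gf == corr_fg[::-1])  # the swap reverses the lag axis
```

For a symmetric kernel the reversal changes nothing and the two operations coincide, which is one reason the distinction is easy to overlook in practice.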

A second confusion is with function_mapping at the general level, because a convolution is a mapping from input signals to output signals and it is tempting to treat it as just "some linear filter." The point worth holding is that convolution is not an arbitrary linear map — it is the unique linear map that commutes with translation. A general linear operator on a signal can have a different rule at every position (a full, position-dependent matrix); convolution constrains all those rules to be the same fixed kernel slid everywhere, which is an enormous restriction and the entire source of its tractability and its Fourier diagonalization. The confusion matters because reasoning about a system as "a linear map" loses the translation-invariance that makes the kernel a single small object and that licenses the convolution theorem; conversely, assuming a system is a convolution when its behavior actually varies with position imposes a shift-invariance the system does not have. The right framing is that convolution is the special case of linear function-mapping picked out by the demand that equal positions be treated equally — and any analysis that needs position-dependence has left convolution's territory.

Convolution is also worth separating from attention, a confusion now common because both appear as neural-network layers that "mix" a signal with weights. The decisive difference is where the weights come from. A convolution's weights are a fixed kernel, learned once and applied identically at every position regardless of content. Attention's weights are computed from the data at run time — each output position derives its mixing pattern from a content-based comparison (query against keys), so the effective kernel is input-dependent and varies from position to position and example to example. This is why convolution is translation-invariant and parameter-frugal while attention is content-adaptive and able to relate distant positions whose relevance depends on what they contain. A practitioner who treats attention as "just a flexible convolution" misses that attention has no fixed kernel at all and therefore none of convolution's shift-invariance guarantees; one who treats convolution as a degenerate attention misses that convolution's fixed kernel is exactly the inductive bias (locality, invariance) that attention deliberately gives up. The substrates overlap, but the fixed-versus-computed nature of the mixing pattern is a hard structural line.

For a practitioner the cluster sorts by three questions. Is the kernel flipped (convolution, with its commutative cascade algebra) or not (correlation, measuring similarity)? Is the mixing pattern the same fixed kernel at every position (convolution) or position-dependent (a general linear map, or attention)? And are the weights fixed (convolution) or computed from the data (attention)? Each neighbor shares the slide-and-sum surface but differs on exactly one of these axes, and the recurring error is to import convolution's clean guarantees — commutativity, translation invariance, Fourier diagonalization, the single small kernel — into a neighbor that lacks the structural feature those guarantees rest on.

Solution Archetypes¶

No catalogued solution archetypes reference this prime yet.