The Structural Reasoning Substrate
What a Catalog of Abstractions Provides That a Neural Network Alone Does Not
A conceptual paper for the Encyclopedia of Abstractions project
Abstract
Augmented Abstract Reasoning (AAR) pairs a large language model with a curated catalog of cross-domain primes and archetypes and, in informal evaluation, appears to produce structurally faithful, auditable reasoning beyond what bare chain-of-thought reliably yields. This paper asks what the catalog provides conceptually. It argues that the catalog functions as a structural reasoning substrate: named patterns with rigorous structural signatures, explicit relations to neighbors, documented cross-domain instantiations, and load-bearing negative-space definitions. The substrate enables structural rigor (pattern composition, coherence under configuration, consistency-checking against documented constraints), a flavor of cognitive rigor distinct from the propositional rigor (truth-functional, brittle, narrow) the formal-logic tradition has anchored on. The load-bearing claim is that the substrate's rigor depends primarily on its negative-space content: what each pattern is not, how it differs from neighbors, where it overextends, what its failure modes are. Without rigorous negative space, a catalog of abstractions collapses into pattern-matching without real constraint. The paper positions the approach against classical neurosymbolic AI, schema-based reasoning, pattern languages, and pure neural reasoning, and surfaces implications for LLM evaluation, training, and human-AI collaboration.
1. Introduction
Chain-of-thought prompting elicits reasoning that often looks careful but, as the empirical literature has documented over the past two years, frequently fails in structural ways the appearance of careful reasoning obscures. Dziri and colleagues[1] showed that on compositional tasks at scale, CoT-prompted models do not actually compose individual reasoning steps; they pattern-match to training-data instances and produce chains that appear compositional but break down predictably as task complexity grows. Mirzadeh and colleagues[2] showed that subtle perturbations of math word problems — preserving structure while changing surface features — significantly degraded model performance, suggesting CoT chains are keyed to surface features rather than to underlying structure. Turpin and colleagues[3] showed directly that CoT outputs can be unfaithful to the model's actual computation: the model arrives at an answer through some internal process and then generates a chain of thought that rationalizes the predetermined answer.
These failure modes share a structural shape. The reasoning chains are coherent locally but do not necessarily reflect the operations that produced the answer, and they do not generalize structurally beyond training-distribution patterns. For tasks where the central cognitive operation is recognizing structural patterns from one domain in situations from another — cross-domain abstract reasoning — these failure modes are particularly damaging, because cross-domain transfer requires precisely the structural-rather-than-surface reasoning the studies document CoT struggles with.
The Augmented Abstract Reasoning pipeline was developed in part as a response to these failure modes. It pairs a large language model with a curated catalog — the Encyclopedia of Abstractions, currently 511 prime abstractions and 230 solution archetypes — and orchestrates the LLM's use of the catalog through a nine-step procedure with explicit intermediate artifacts at each stage. The pipeline does not replace CoT; CoT is the local reasoning engine within each pipeline step. What the pipeline supplies that bare CoT does not is structural scaffolding that constrains reasoning into more inspectable, more structurally faithful forms.
The pipeline appears to work, at least in informal evaluation. The question this paper takes up is not whether it works; the operational case for the pipeline is made elsewhere. The question is why it works — what the catalog provides that the neural network alone does not, and why the combination behaves as more than the sum of its parts.
The answer this paper develops is that the catalog functions as a structural reasoning substrate, and the substrate enables a distinct flavor of cognitive rigor the field has not adequately named. Formal logic provides one flavor of rigor — truth-functional inference rules, the apparatus of theorem-proving and model-checking. The substrate provides a different flavor: pattern composition, coherence under configuration, consistency-checking against documented constraints. Both are real forms of rigor. They answer different questions. The second has often been conflated with looseness or dismissed as informal, but the substrate approach shows that structural rigor has constraints and verification operations of its own.
The argument has four moves. First, it distinguishes propositional rigor from structural rigor and argues both are legitimate. Second, it characterizes the structural reasoning substrate concretely — what is in it, what makes it rigorous, how it differs from a naive catalog of named concepts. Third, and most important, it argues that the substrate's rigor depends primarily on its negative-space definitions, and that without rigorous negative space the substrate collapses. Fourth, it positions the substrate approach against the neighboring traditions it most resembles and identifies what is new in the integration.
The paper is conceptual rather than empirical and is paired with the AAR pipeline paper, which provides the operational description. The aim here is to articulate what is happening at the conceptual level, so that future work — empirical evaluation, schema refinement, communication with the broader research community — has firmer ground to stand on.
2. Two Flavors of Cognitive Rigor
Discussions of rigorous reasoning typically treat one flavor of rigor as the standard: propositional/logical rigor, with its inference rules, truth conditions, and closed mathematical character. A → B; A; therefore B. Strict, brittle, narrow. This is the rigor of formal proof, of theorem-proving systems, of model checking, of computability theory. When the AI literature talks about whether a system "reasons rigorously," propositional rigor is what is usually meant.
There is a second flavor of cognitive rigor that the literature has handled less carefully. Call it structural or schematic rigor. It works over named patterns rather than truth-functional propositions. Its operations are pattern recognition (is this pattern present?), composition (does this configuration of patterns match a known archetype?), structural mapping (does the meta-model from situation X apply to situation Y?), and consistency-checking against documented structural constraints. Its rigor is not truth-preservation under inference; it is something more like structural coherence under composition. Patterns must fit together in ways the substrate deems compatible; configurations either license known interventions or they do not; archetypes apply or do not apply based on whether their structural signatures match the situation.
Both flavors of rigor are real. They answer different questions. Propositional rigor answers: given these premises, what conclusions are necessarily true? Structural rigor answers: given this situation, which structural patterns are operative, which interventions does that operative set license, and where would those interventions fail? The first question is the one formal logic is designed for. The second question is the one engineering reasoning, design reasoning, clinical reasoning, and most professional reasoning are actually engaged in.
The distinction matters because the AI reasoning literature has, for the most part, treated structural rigor as a degenerate or weaker form of propositional rigor. The framing has been that "true" reasoning is logical, and anything else — pattern recognition, analogical inference, schema application — is approximate, heuristic, fallible. The substrate approach reverses this priority. Structural reasoning is not a degenerate version of propositional reasoning; it is a different kind of cognitive operation, and it has its own constraints, verification procedures, and failure modes. A pattern is correctly recognized or it is not. A configuration is consistent with a documented archetype or it is not. A candidate intervention either survives the archetype's documented failure-mode checks or it does not. These are not propositional statements with truth values, but they are not loose either. They are structural claims with structural verification conditions.
Once the two flavors are distinguished, the project this paper describes becomes legible. The AAR pipeline does not produce propositional rigor — that was never the goal. It produces structural rigor: claims about which patterns are operative, which archetypes apply, which interventions are licensed, which failure modes are at risk. The catalog (the Encyclopedia of Abstractions plus the archetype catalog) is the resource that makes structural rigor possible, because it provides the named patterns, their structural signatures, the documented neighbor relations, and the failure-mode catalogs that the structural-rigor operations require.
This reframing has consequences for how the project should be evaluated. Asking whether the pipeline produces "rigorous reasoning" without specifying which flavor of rigor is a question that has no clean answer. Asking whether the pipeline produces structurally faithful reasoning over a curated substrate of patterns — whether, that is, the pipeline's intermediate artifacts make claims that are checkable against the substrate and survive the checks — is a different and more answerable question. The reframing also has consequences for how to evaluate the catalog itself: a catalog that supports structural rigor is one whose pattern signatures, neighbor distinctions, and failure-mode documentation are rigorous enough to support checkable structural claims, even though no propositional inference rules are in play.
3. What a Structural Reasoning Substrate Is
A structural reasoning substrate is a curated catalog of named patterns with the following four properties.
First, each pattern has a structural signature — an explicit specification of the elements that must be present for the pattern to be operative in a situation. The signature is the rigorous definition of what the pattern is. It typically lists four to seven components, each named and characterized at a level of abstraction that crosses domains. The isomorphism entry in the encyclopedia, for instance, lists six components: a source structured object, a target structured object, a structure-preserving map, a bidirectional structure preservation requirement, an isomorphism class, and an explicit use that the isomorphism unlocks. Each component is defined in terms general enough to apply across mathematics, software architecture, physics, linguistics, and other domains, but specific enough that a candidate situation can be evaluated against it. A claim that isomorphism is operative in a situation is therefore a checkable claim: all six components must be identifiable in the situation, and they must hold the relationships the signature specifies.
Second, each pattern has explicit relations to other patterns: cross-references, parent and child relationships, neighbors that share structural features, alternative origin domains. The relations are not decorative. They are what allows the substrate to function as a navigable space rather than a flat list. A practitioner working with isomorphism can move from there to invariance, symmetry, abstraction, mapping, and equivalence relations along documented links. The relations also allow the substrate to support composition: solution archetypes are documented configurations of two to five primes, and the prime-to-archetype links allow the substrate to identify which archetypes a given configuration of operative primes might support.
Third, each pattern has documented cross-domain instantiations. For isomorphism, the encyclopedia entry documents how the pattern instantiates in mathematics, computer science, physics, linguistics, systems thinking, cognitive science, software architecture, category theory, graph theory, and biology. The instantiations are not a list of examples; they are role-by-role mappings showing how the pattern's abstract components are realized in each domain. This is what enables cross-domain transfer: an LLM (or human) reasoning over a new domain can use the documented instantiations as a guide for whether and how the pattern applies, and as a source of known-good analogies to draw from when reasoning by structural correspondence.
Fourth, and this is the load-bearing property that section 4 develops in detail, each pattern has rigorous negative-space definitions: what the pattern is not, how it differs from neighbors, where it overextends, and which failure modes mark its known limits. The negative-space content is typically equal in length to the positive-space content, and in the v2 density-pass entries it often exceeds it. Without rigorous negative space, the substrate is not yet a substrate; it is a list of named patterns that an LLM will project promiscuously into situations where they do not actually apply.
A structural reasoning substrate is the combination of these four properties at scale. The Encyclopedia of Abstractions, currently 511 primes and 230 archetypes, is one realization of the idea. Smaller-scale realizations exist — Senge's system archetypes[4], Alexander's pattern language[5], the Gang of Four design patterns[6], TRIZ[7] — and they exhibit some but not all of the four properties. What makes the encyclopedia substrate-grade rather than merely catalog-grade is the rigor applied to the negative-space content, which the next section takes up.
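The four properties can be read as a record layout for a single catalog entry. The following is a minimal sketch under stated assumptions, not the encyclopedia's actual schema: the class name `PatternEntry`, every field name, and the `missing_properties` helper are hypothetical, and the graph-theory instantiation is an illustrative filler rather than a quotation from the entry.

```python
from dataclasses import dataclass, field


@dataclass
class PatternEntry:
    """One catalog entry. All field names are hypothetical, chosen for illustration."""
    name: str
    signature: list[str]  # components that must all be identifiable in a situation
    relations: dict[str, list[str]] = field(default_factory=dict)  # relation kind -> pattern names
    instantiations: dict[str, dict[str, str]] = field(default_factory=dict)  # domain -> {component: realization}
    is_not: dict[str, str] = field(default_factory=dict)  # neighbor -> distinguishing note
    failure_modes: list[str] = field(default_factory=list)

    def missing_properties(self) -> list[str]:
        """Name which of the four substrate properties this entry lacks."""
        missing = []
        if not self.signature:
            missing.append("structural signature")
        if not self.relations:
            missing.append("relations")
        if not self.instantiations:
            missing.append("cross-domain instantiations")
        if not (self.is_not and self.failure_modes):
            missing.append("negative space")
        return missing


isomorphism = PatternEntry(
    name="isomorphism",
    signature=[
        "source structured object",
        "target structured object",
        "structure-preserving map",
        "bidirectional structure preservation requirement",
        "isomorphism class",
        "explicit use the isomorphism unlocks",
    ],
    relations={"neighbor": ["invariance", "symmetry", "abstraction", "mapping", "equivalence relations"]},
    # Illustrative role mapping, not quoted from the encyclopedia entry.
    instantiations={"graph theory": {"structure-preserving map": "adjacency-preserving bijection on vertices"}},
    is_not={"equality": "isomorphic objects are structurally the same but distinct as particular things"},
    failure_modes=["declaring isomorphism loosely, then transferring a theorem the loose equivalence does not preserve"],
)

# The same name and signature with nothing else: a catalog-grade entry.
positive_only = PatternEntry(name="isomorphism", signature=list(isomorphism.signature))

print(isomorphism.missing_properties())
print(positive_only.missing_properties())
```

The second entry has a full signature yet still reports three missing properties, which is one way to make the catalog-grade versus substrate-grade distinction mechanically checkable.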
4. The Negative-Space Requirement
The naive view of an abstraction catalog is that it provides positive definitions: a pattern is named, its components are listed, its examples are documented, and an LLM can then recognize the pattern in new situations by matching against the positive content. On this view, the value of the catalog is the names and the descriptions. More entries, more names, more recognized patterns. The richer the positive content, the better the catalog.
This view is wrong in a specific way, and the way it is wrong is what most distinguishes substrate-grade catalogs from merely catalog-grade ones.
The empirical observation, drawn from informal experimentation with LLMs operating over the encyclopedia, is that LLMs are good at recognizing primes where they naturally occur in a domain. Given a problem in cell biology, an LLM with access to the encyclopedia will reliably identify feedback loops, equilibrium dynamics, boundary structures, and modular organization in the system. The match between the prime and the situation, when the prime is naturally present, is one of the things LLMs do well — pattern recognition over named structured vocabulary is in their wheelhouse. The substrate plays to a strength here, not against a weakness.
The challenge is the inverse. LLMs are also good at recognizing primes where they do not naturally occur, because surface features and analogical resonance often produce plausible-looking but spurious matches. A problem about supply-chain coordination may look like a tragedy-of-the-commons problem because both involve shared resources, but the structural signature of tragedy-of-the-commons (open-access common-pool resource, individual incentives to overconsume, lack of effective monitoring, eventual depletion) may not hold for a supply chain with contracted suppliers, monitored deliveries, and structured incentives. An LLM presented with the supply-chain problem and given access to the tragedy-of-the-commons pattern will frequently identify the latter as operative based on surface resemblance, and the overextension produces a misleading analysis. The structural rigor of the substrate collapses at exactly this point.
The remedy is rigorous negative-space documentation: explicit characterization of what each pattern is not, where it overextends, how it differs from its closest neighbors, what failure modes are diagnostic of its boundary.
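To make the remedy concrete, the sketch below treats a pattern claim as a two-sided check: every signature component must be observed, and no documented exclusion condition may be observed. It is a simplification under stated assumptions. The function `check_pattern`, the feature strings, the exclusion list, and the `fishery` situation are all hypothetical; the signature components and the supply-chain contrast are taken from the tragedy-of-the-commons example above.

```python
def check_pattern(observed: set[str], signature: set[str], exclusions: set[str]) -> dict:
    """Check a pattern claim against positive content and negative space.

    The pattern counts as operative only if every signature component is
    observed AND no documented exclusion condition is observed.
    """
    missing = signature - observed
    triggered = exclusions & observed
    return {
        "operative": not missing and not triggered,
        "missing_components": sorted(missing),
        "triggered_exclusions": sorted(triggered),
    }


# Signature components as listed in the text.
TRAGEDY_SIGNATURE = {
    "open-access common-pool resource",
    "individual incentives to overconsume",
    "lack of effective monitoring",
    "eventual depletion",
}

# Hypothetical exclusion conditions, phrased from the supply-chain contrast.
TRAGEDY_EXCLUSIONS = {"contracted access", "monitored use", "structured incentives"}

# Hypothetical situation descriptions; "shared resource" is the surface
# feature both situations have in common.
supply_chain = {"shared resource", "contracted access", "monitored use", "structured incentives"}
fishery = TRAGEDY_SIGNATURE | {"shared resource"}

verdict = check_pattern(supply_chain, TRAGEDY_SIGNATURE, TRAGEDY_EXCLUSIONS)
print(verdict["operative"], verdict["triggered_exclusions"])
print(check_pattern(fishery, TRAGEDY_SIGNATURE, TRAGEDY_EXCLUSIONS)["operative"])
```

The verdict records which components were missing and which exclusions fired, so a rejected match carries its reason along with it.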
The isomorphism entry in the encyclopedia is a good worked example. The "What It Is Not" section runs for several paragraphs and addresses six distinct cousin concepts. Isomorphism is not equality: two isomorphic objects are structurally the same but typically distinct as particular things, and the distinction matters in foundations (the difference between equality and isomorphism is what gives rise to the univalence axiom in homotopy type theory) and in engineering (isomorphic data structures may have different memory layouts and timing characteristics, with structural isomorphism preserved at the level of operations but not implementation). Isomorphism is not homomorphism in general: homomorphisms preserve structure without the bijectivity requirement, and the hierarchy of structure-preserving maps (homomorphism, monomorphism, epimorphism, isomorphism, automorphism) shows that each refinement adds a structural constraint that licenses a different kind of transfer. Isomorphism is not similarity in the soft analogical sense: similarity measures are continuous, partial, and threshold-dependent, while isomorphism is exact and binary, and failing to distinguish soft analogy from strict isomorphism is the source of what the entry calls "the analogical fallacy" in cross-domain reasoning. Isomorphism is not homeomorphism specifically: homeomorphism is isomorphism in the category of topological spaces, but the general construct subsumes many category-specific structural equivalences. Isomorphism is not encoding in general: encoding maps without necessarily preserving all structure, and the framing as encoding rather than isomorphism is often a signal that the practitioner is thinking operationally rather than structurally. Isomorphism is not the trivial existence of bijections between sets of the same cardinality: two sets of the same cardinality admit many bijections (n! of them for finite sets of size n), and whether any of them is an isomorphism with respect to a particular structure is a substantive mathematical question.
Each of these six distinctions is the negation of a likely overextension. An LLM presented with two data structures and asked whether they are isomorphic might, without rigorous negative space, accept that they are because they have the same number of elements (the bijection-of-sets confusion), because they appear similar (the similarity confusion), because one can be converted into the other (the encoding confusion), or because one is a function of the other (the homomorphism confusion). The negative-space documentation forecloses each of these overextensions explicitly. The structural-rigor check for isomorphism becomes: is the candidate pair structurally equivalent in the full sense, with structure preservation running in both directions, the structure-preserving inverse existing, and the relevant category specified — or is it merely one of the nearby cousin relationships that resemble isomorphism but do not entail what isomorphism entails?
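The bijection-of-sets confusion is easy to exhibit on small graphs. The sketch below is an illustration chosen here, not an example from the encyclopedia entry: the function names and the three graphs are assumptions, and the search is brute force over all vertex bijections, so it is only sensible for tiny inputs.

```python
from itertools import permutations


def edge_set(edges):
    """Normalize undirected edges so (u, v) and (v, u) compare equal."""
    return {frozenset(e) for e in edges}


def is_isomorphism(mapping, edges_a, edges_b):
    """True if the vertex bijection `mapping` carries the edges of A exactly onto the edges of B.

    Because `mapping` is a bijection, equality of the two edge sets means
    adjacency is preserved in both directions.
    """
    image = {frozenset(mapping[v] for v in e) for e in edge_set(edges_a)}
    return image == edge_set(edges_b)


def find_isomorphism(vertices_a, edges_a, vertices_b, edges_b):
    """Brute-force search over all vertex bijections; returns one isomorphism or None."""
    if len(vertices_a) != len(vertices_b):
        return None
    for perm in permutations(vertices_b):
        mapping = dict(zip(vertices_a, perm))
        if is_isomorphism(mapping, edges_a, edges_b):
            return mapping
    return None


# A path 0-1-2-3 and a star centered on "c": same vertex count, same edge count.
path_v, path_e = [0, 1, 2, 3], [(0, 1), (1, 2), (2, 3)]
star_v, star_e = ["c", "x", "y", "z"], [("c", "x"), ("c", "y"), ("c", "z")]

# The same path under different labels: b-a-d-c.
relabeled_v, relabeled_e = ["a", "b", "c", "d"], [("b", "a"), ("a", "d"), ("d", "c")]

# A perfectly good bijection of vertex sets that is not an isomorphism.
naive = {0: "a", 1: "b", 2: "c", 3: "d"}

print(find_isomorphism(path_v, path_e, star_v, star_e))            # no bijection preserves adjacency
print(is_isomorphism(naive, path_e, relabeled_e))                  # a bijection, but not structure-preserving
print(find_isomorphism(path_v, path_e, relabeled_v, relabeled_e))  # some bijection does preserve it
```

The path and the star agree on every count a surface match would look at, yet none of the 24 bijections between them preserves adjacency; the relabeled path admits an isomorphism even though the most obvious bijection is not one.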
The isomorphism entry's structural tensions and failure modes section continues the negative-space work in a different register. It documents tensions like strict isomorphism versus useful looseness (the rigor of strict isomorphism is sometimes operationally unavailable, and looser equivalences like homotopy equivalence and observational equivalence fill the gap but with weaker transfer guarantees that must be tracked carefully), and isomorphism-detection cost versus conceptual clarity (the conceptual crispness of isomorphism does not translate into algorithmic tractability, as Babai's 2015 quasi-polynomial-time algorithm for the graph isomorphism problem makes vivid). Each tension is paired with a common failure mode: an analyst who declares isomorphism loosely transfers a theorem the loose equivalence does not preserve and obtains a result that fails in the structural details that motivated the original demand for transfer. The most common version of this failure in software engineering is to declare two systems "behaviorally equivalent" on the basis of a sample of input-output pairs, transfer the architectural design of one to the other, and then discover that the edge-case behaviors differ in ways the original sample failed to expose.
The negative-space content does several jobs at once. It prevents overextension by foreclosing specific likely mistakes. It supports neighbor disambiguation by surfacing the cousins that are most easily confused with the target pattern. It provides the failure-mode catalog that the AAR pipeline's evaluation step (step 9) operates over. It documents the boundary of the pattern in a way that makes the boundary inspectable rather than implicit. And it gives the LLM (and human) doing structural reasoning something to push against, an explicit list of "if any of these conditions hold, the pattern probably does not apply," which turns pattern recognition into a constrained, checked operation rather than a free-form association.
The negative-space work is also what makes the v2 density-pass entries long — typically 3,500 to 5,000 words rather than the 200 to 400 words of the v1 baseline. The length is not stylistic. The negative-space content is doing the load-bearing work of turning the catalog from a list of names into a substrate that supports structurally rigorous reasoning. A catalog without rigorous negative space could be called a catalog of abstractions; only with rigorous negative space does it become a structural reasoning substrate.
A useful diagnostic: if a catalog could be summarized adequately by its positive definitions alone — if removing the "what it is not," tensions, neighbor distinctions, and failure modes would not substantially change how the catalog functioned in practice — it is not yet substrate-grade. The negative-space content has to be doing work that the positive content alone cannot do.
5. Something to Hang On To: The Cognitive Function
Why does the substrate work at all, even with rigorous negative space? What is it about reified, named, structurally specified patterns that supports better reasoning than direct engagement with situations?
The answer is partly about cognitive limits and partly about external representation. Both humans and LLMs reason within constraints that the substrate alleviates.
Working memory is limited for humans (the classic seven-plus-or-minus-two, with more modern estimates varying by task) and limited for LLMs in a different way (the attention budget across the context window has computational costs that scale poorly with the breadth of structure being held in mind simultaneously). Abstract reasoning that depends on holding many implicit relationships in active memory tends to collapse: patterns blur together, distinctions are lost, the analysis drifts. External representation alleviates this by giving the reasoner something to point at. A graph of named entities and labeled relations is faster to comprehend, easier to manipulate, and harder to confuse than the same content in continuous prose. This is the classic finding of the diagrammatic reasoning literature (Larkin and Simon's 1987 paper "Why a Diagram is (Sometimes) Worth Ten Thousand Words"[8]): a diagram supports inferences that the same information in sentential form does not, because the diagram makes structural relations directly inspectable rather than requiring them to be inferred.
The substrate extends this principle to abstract content. A pattern like feedback loop is, in isolation, just a phrase. With the substrate, it becomes a named entity with a specified structural signature, documented relations to neighboring patterns, a list of common failure modes, and cross-domain instantiations. The pattern is reified — given enough explicit structure that it can be manipulated like a physical object. A reasoner can ask whether the pattern is present, configure it with other patterns, check whether its failure-mode conditions apply, compare it to its neighbors, and transfer reasoning across its documented instantiations. Without the substrate, the pattern is something to think about; with the substrate, it is something to think with.
The cognitive science of this effect has been worked under several names. Bartlett's schemas (1932)[9] are organized knowledge structures that constrain interpretation and reasoning; once a schema is activated, certain inferences become available and others become difficult. Rumelhart[10] and Schank[11] developed schema theory in different directions in the 1970s and 80s, with applications to language understanding, memory, and problem-solving. Andy Clark[12] and Edwin Hutchins[13], in the distributed-cognition tradition, argued that cognition is partly constituted by external structures: tools, notations, and shared artifacts are not just conveyances for thought but partly determine what thought is possible. Lev Vygotsky, much earlier, made a similar argument with the framing of cognitive tools. The substrate is in this tradition. It is an external cognitive tool that supports a kind of reasoning that would be much harder without it. The substrate is to abstract reasoning what diagrams are to spatial reasoning, what spreadsheets are to numerical reasoning, what type systems are to programming. Each makes a class of cognitive operations tractable that would otherwise overwhelm working memory or pattern-recognition capacity.
A point worth surfacing because it has practical implications: the substrate is helpful, but it is not magical. A reasoner using the substrate still needs foundational competence in the domain to which the patterns are being applied. The substrate provides the cognitive scaffolding; the domain knowledge provides the substrate's anchoring points. An LLM applying the encyclopedia to a problem in immunology will benefit from the substrate, but only to the extent that the LLM has enough immunological background to map the substrate's abstract components onto immunological entities and processes. A human analyst is symmetrically positioned: the substrate accelerates reasoning over familiar domains and supports more careful reasoning over partially familiar ones, but it does not substitute for domain learning. This bounds the substrate's value but does not undermine it; the substrate is a force multiplier on existing competence, not a replacement for it.
6. Why This Fits LLMs (and Humans) Well
LLMs are good at pattern recognition over named vocabulary. They are good at compositional consistency-checking when the constraints are explicit. They are good at retrieving instantiations of an abstract pattern when the pattern is named and the instantiations are documented. They are comparatively unreliable at strict propositional logic, roughly as one reading of Sutton's "bitter lesson" would suggest: formal inference requires explicit symbol manipulation under inference rules, and LLMs are not optimized for this and make characteristic mistakes when forced to do it.
The substrate plays to LLM strengths and constrains LLM weaknesses. The operations the substrate supports are exactly the operations LLMs do well: name the patterns present in a situation; check whether a situation's structural features match a named pattern's structural signature; retrieve documented cross-domain instantiations of a pattern; compose patterns into archetype configurations; check candidate interventions against documented failure-mode conditions. These are pattern-recognition and structured-retrieval operations, which LLMs handle naturally.
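Of the operations listed above, composition into archetype configurations is the most directly mechanical. A minimal sketch, assuming each archetype is stored as the set of primes it requires: the archetype names and their prime configurations below are hypothetical placeholders, and `candidate_archetypes` is an illustrative helper rather than a pipeline function.

```python
# Hypothetical archetypes, each a configuration of two to five primes.
ARCHETYPES = {
    "archetype_A": {"feedback loop", "equilibrium"},
    "archetype_B": {"feedback loop", "boundary", "modularity"},
    "archetype_C": {"isomorphism", "invariance"},
}


def candidate_archetypes(operative_primes, archetypes):
    """Split archetypes into fully supported and partially supported.

    An archetype is fully supported when all of its primes are operative,
    and partially supported when some but not all are. Archetypes with no
    operative primes are dropped.
    """
    fully, partially = [], []
    for name, required in sorted(archetypes.items()):
        missing = required - operative_primes
        if not missing:
            fully.append(name)
        elif len(missing) < len(required):
            partially.append((name, sorted(missing)))
    return fully, partially


# Primes identified as operative in some situation (hypothetical).
operative = {"feedback loop", "equilibrium", "boundary"}

fully, partially = candidate_archetypes(operative, ARCHETYPES)
print(fully)
print(partially)
```

Reporting the missing primes for a partial match gives the next step something specific to check, namely whether the absent prime is really absent or was simply not yet identified. A fully supported archetype is still only a candidate until its failure-mode conditions have been checked against the situation.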
What the substrate constrains is the overextension failure mode discussed in section 4. Without the negative-space documentation, an LLM presented with a situation tends to identify patterns based on surface resemblance, project them confidently, and produce analyses that look careful but rest on misplaced pattern matches. With the negative-space documentation, the LLM is given explicit boundaries: this pattern is not that one; this pattern does not apply when these conditions hold; this pattern is in tension with this other pattern in the following ways. The boundaries constrain the LLM's tendency to over-pattern-match, which is the LLM equivalent of the analogical fallacy the analogy literature has analyzed for decades.
The fit with human reasoning is symmetric. Humans share most of the strengths LLMs have here — pattern recognition, retrieval of instantiations, compositional consistency-checking when the constraints are explicit — and most of the weaknesses, including the tendency to overextend patterns based on surface resemblance. Humans also share the cognitive-tool dependency: just as LLMs reason better with the substrate, humans reason better with it, and the substrate's structure is what makes joint human-LLM reasoning possible because both parties are operating over the same vocabulary, with the same documented constraints, against the same failure-mode catalog.
This last property — that the substrate is a shared cognitive substrate, not just an LLM-augmentation tool — is the property most often missed. The substrate is not a hidden internal resource that the LLM consults silently. It is a public, inspectable, modifiable artifact that both humans and LLMs use. When the LLM identifies feedback loops in a problem, the human can check the LLM's identification against the same encyclopedia entry the LLM was using. When the LLM proposes an archetype as applicable, the human can check the archetype's failure-mode conditions against the situation. The audit trail is not a reconstruction of what the LLM was doing; it is the same artifact the LLM was using.
This is what makes the substrate-augmented approach valuable beyond what a more capable LLM alone would provide. It enables human-AI collaborative reasoning of a kind that black-box CoT does not. The LLM and the human are reasoning in the same vocabulary, against the same constraints, and they can disagree productively in a way that is locatable: the disagreement is about whether a specific pattern is operative, or whether a specific failure-mode condition applies, or whether an archetype's signature matches the situation. Disagreement at this granularity is resolvable in ways that disagreement about a free-form chain of reasoning is not.
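Locatable disagreement can be pictured as a comparison between two sets of structural claims expressed in the shared vocabulary. The sketch below assumes each analysis is a mapping from a (claim kind, subject) pair to a yes/no verdict; that representation, the function name, and the example claims are hypothetical.

```python
def locate_disagreements(analysis_a, analysis_b):
    """Compare two analyses expressed over the same claim vocabulary.

    Returns (conflicts, unaddressed): claims both sides judged but judged
    differently, and claims only one side made.
    """
    shared = analysis_a.keys() & analysis_b.keys()
    conflicts = sorted(c for c in shared if analysis_a[c] != analysis_b[c])
    unaddressed = sorted(analysis_a.keys() ^ analysis_b.keys())
    return conflicts, unaddressed


# Hypothetical analyses of the same situation by an LLM and a human reviewer.
llm_analysis = {
    ("pattern operative", "feedback loop"): True,
    ("pattern operative", "tragedy of the commons"): True,
    ("failure mode applies", "edge cases differ beyond the sample"): False,
}
human_analysis = {
    ("pattern operative", "feedback loop"): True,
    ("pattern operative", "tragedy of the commons"): False,
    ("signature matches", "archetype_A"): True,
}

conflicts, unaddressed = locate_disagreements(llm_analysis, human_analysis)
print(conflicts)
print(unaddressed)
```

The output names one contested claim and two claims that only one party addressed, which is the granularity at which the paragraph above says disagreement becomes resolvable.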
7. Not Neurosymbolic, but Adjacent
The substrate approach has surface similarities to classical neurosymbolic AI, and it is worth being explicit about why "neurosymbolic" is not the right label for the project.
Classical neurosymbolic AI typically refers to architectures that bolt discrete logic operations onto neural networks. The neural network handles perception and pattern recognition; a separate symbolic reasoner handles logical inference; the two communicate through some kind of interface that translates neural representations into symbolic ones and back. The hope is to combine the strengths of both: neural learning's flexibility and statistical robustness, symbolic reasoning's compositionality and verifiability.
The classical version has proven hard to instantiate at scale. The impedance mismatch between continuous, distributed representations and discrete, symbolic ones has resisted clean engineering for decades. Translations between the two are lossy. Compositional symbolic reasoning over neural-network outputs tends to inherit the network's uncertainties without acquiring the symbolic system's guarantees. The vision has been compelling and the practice has been slow.
The substrate approach is in the same broad family — combining structure with learning — but it differs from classical neurosymbolic AI in important ways. The substrate does not provide formal logic. It does not specify inference rules. It does not require translation between continuous and discrete representations. It provides structured vocabulary, documented patterns, explicit relations, and rigorous negative-space constraints — all of which the LLM can engage with in its native mode of operation, without requiring a separate symbolic reasoner or a clean translation layer.
Where rigor lives, in the substrate approach, is different from where it lives in classical neurosymbolic AI. Classical neurosymbolic AI puts rigor in the inference rules: the symbolic side guarantees truth-preservation under inference, conditional on the inputs being correctly translated. The substrate puts rigor in the catalog: the structural signatures, the negative-space definitions, and the failure-mode catalogs are what constrain the LLM's reasoning, and they constrain it not by enforcing logical inference but by making the LLM's pattern-recognition operations checkable against documented constraints.
This is a softer version of neurosymbolic AI, and the softness is a feature rather than a bug. It avoids the impedance mismatch problem because there is no symbolic side to translate to and from. It plays to LLM strengths rather than fighting them. It produces a different flavor of rigor — structural rather than propositional — but it is rigor nonetheless, and it is rigor that LLMs can actually produce.
The substrate approach is therefore best understood as adjacent to neurosymbolic AI rather than as an instance of it. It shares the broad goal of combining structure with learning, but it instantiates the combination in a way that classical neurosymbolic AI does not anticipate. The right framing is probably not "neurosymbolic" but something closer to "structured-knowledge-augmented neural reasoning" — a clunky phrase, but more accurate to what is happening. The phrase is a placeholder; a better one may emerge as the framing matures, and locking in a more elegant term too early would shape the framing in ways that may not survive scrutiny.
8. Intellectual Ancestors and What Is New
The substrate approach has antecedents in several scattered literatures. Each contributed pieces; none assembled the whole.
Schema theory in cognitive science (Bartlett, Rumelhart, Schank, and others) developed the idea that organized knowledge structures constrain interpretation and reasoning. A schema is a pattern with structure that, when activated, makes certain inferences available and others unlikely. The cognitive-science work focused mostly on individual schemas (a restaurant schema, a story schema) and on how they are activated and used in interpretation. It did not develop into a curated cross-domain catalog with rigorous negative-space content; the schema-theory tradition stayed largely within experimental cognitive psychology and did not produce a substrate.
Diagrammatic reasoning research (Larkin and Simon 1987 most centrally; also Stenning, Lemon, Barwise, Etchemendy in adjacent work) developed the idea that the form of an external representation changes what reasoning is computationally easy. A graph makes relations directly inspectable; a list of propositions requires reconstruction. This work foregrounds the cognitive role of external representation, but it tends to focus on specific representational forms — diagrams, tables, formal notations — rather than on curated catalogs of structured patterns.
Pattern languages in architecture and software (Alexander 1977; Gamma et al. 1994 for the Gang of Four) developed the idea of named, reusable design patterns. Each pattern has a context, a problem, and a solution, and patterns compose into larger pattern languages. Alexander's work explicitly aimed at a substrate-like role for buildings; the GoF book did similar work for object-oriented software design. The pattern-language tradition is the closest cousin to the substrate approach. What it does not have is the cross-domain scope of the encyclopedia (Alexander's patterns are about buildings; GoF patterns are about object-oriented software) or, in most cases, the rigorous negative-space documentation that the substrate requires.
Distributed cognition (Clark 1997, Hutchins 1995) developed the idea that cognition is partly constituted by external structures. Tools and notations are not just conveniences but partly determine what cognition is possible. This is the theoretical framework that explains why the substrate works — it is an external cognitive tool — but the distributed-cognition tradition has not produced specific catalogs designed for the role.
TRIZ (Genrich Altshuller's theory of inventive problem solving) developed a catalog of forty inventive principles that recur across patent literature, with a procedure for matching specific engineering problems to relevant principles. TRIZ is a closer cousin than is often recognized: it has a curated catalog, a matching procedure, and cross-domain instantiations. What it does not have is the systematic negative-space documentation and the integration with a learning-based reasoner.
Senge's system archetypes (1990) catalog about a dozen recurring dynamic patterns in social and organizational systems: limits to growth, shifting the burden, tragedy of the commons, fixes that fail, and so on. The archetypes are documented with their structural patterns and known intervention principles. The catalog is much smaller than the encyclopedia and restricted in scope, and it varies in negative-space rigor, but the underlying idea (a curated set of recurring patterns to support diagnosis and intervention) is in the same family.
What the encyclopedia plus AAR pipeline adds, beyond the contributions of these ancestors, is the integration: a curated cross-domain catalog at substantial scale (511 primes and 230 archetypes); rigorous negative-space documentation in each entry; explicit cross-prime relations enabling navigation and composition; documented archetypes as explicit prime configurations; and an operational procedure that uses the catalog with an LLM as the reasoning engine, producing inspectable intermediate artifacts at each stage. None of the ancestors put these pieces together this way. The pieces are familiar in scattered places; the integration is the contribution.
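The components listed above suggest a data shape for a catalog entry. The following is a minimal sketch under stated assumptions: the class names, field names, and the `catalog_gaps` check are hypothetical illustrations, not the encyclopedia's actual schema. It assumes a prime carries a signature, relations to neighbors, cross-domain instantiations, and negative-space content, and that an archetype is a named configuration of primes.

```python
from dataclasses import dataclass


# All names below are illustrative assumptions, not the encyclopedia's schema.
@dataclass
class NegativeSpace:
    is_not: list[str]          # what the pattern is not
    overextensions: list[str]  # where it gets applied too broadly
    failure_modes: list[str]   # documented ways applying it goes wrong


@dataclass
class Prime:
    name: str
    signature: str                  # structural signature, as free text
    related: dict[str, str]         # neighbor prime -> how it differs
    instantiations: dict[str, str]  # domain -> example
    negative_space: NegativeSpace


@dataclass
class Archetype:
    name: str
    primes: list[str]         # the prime configuration, by name
    failure_modes: list[str]


def catalog_gaps(
    primes: dict[str, Prime], archetypes: list[Archetype]
) -> list[str]:
    """List structural gaps: thin negative space and dangling references."""
    gaps = []
    for prime in primes.values():
        ns = prime.negative_space
        if not (ns.is_not and ns.failure_modes):
            gaps.append(f"{prime.name}: negative space incomplete")
        for neighbor in prime.related:
            if neighbor not in primes:
                gaps.append(
                    f"{prime.name}: relation to unknown prime '{neighbor}'"
                )
    for archetype in archetypes:
        for name in archetype.primes:
            if name not in primes:
                gaps.append(f"{archetype.name}: uses unknown prime '{name}'")
    return gaps
```

A check of this kind only verifies that the slots are filled and the references resolve. Whether the negative-space content is actually rigorous remains a curation judgment that code of this sort cannot make.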
9. Implications and Open Questions
The substrate framing has several implications worth being explicit about, and it leaves several questions open.
For LLM evaluation, the substrate suggests benchmark designs that have not been standard. Current cross-domain reasoning benchmarks tend to test pattern transfer in implicit forms — situations where a model has to recognize a known pattern in a novel context without being given the pattern explicitly. Substrate-grounded benchmarks would test something different: given the substrate, can the LLM produce a structurally faithful analysis that identifies the operative patterns, applies the relevant archetypes, and checks the candidate intervention against documented failure modes? The two kinds of benchmark measure different things, and the field has under-explored the second.
For LLM training, the substrate raises the question of whether the structural artifacts the pipeline produces could be used as training data. If the pipeline produces a corpus of (situation, identified-primes, context-specific-model, meta-model, archetype-query-result, dual-pass-reasoning, failure-mode-evaluation) tuples across many problems, that corpus could in principle be used to train models that produce these artifacts natively. The behavioral specification the pipeline currently enforces externally would become a behavioral specification the model satisfies internally. Whether this is desirable, and how the training would have to be designed to preserve the substrate's negative-space rigor rather than over-fitting to positive examples, are open questions that deserve careful study.
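As a rough illustration of what one such tuple might look like as a training record, consider the sketch below. The seven fields follow the tuple named above, but their types, the JSON Lines serialization, and the `rejection_share` helper are assumptions made for illustration. The helper is one crude, hypothetical way to see how much of a corpus exercises negative-space content rather than only positive matches.

```python
import json
from dataclasses import asdict, dataclass


# Field names mirror the tuple in the text. The types are assumptions.
@dataclass
class PipelineRecord:
    situation: str
    identified_primes: list[str]
    context_specific_model: str
    meta_model: str
    archetype_query_result: list[str]
    dual_pass_reasoning: list[str]
    # Assumed shape: failure-mode condition -> whether it was found to apply.
    failure_mode_evaluation: dict[str, bool]


def to_jsonl(records: list[PipelineRecord]) -> str:
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


def rejection_share(records: list[PipelineRecord]) -> float:
    """Fraction of records in which at least one failure mode applied."""
    if not records:
        return 0.0
    rejected = sum(
        1 for r in records if any(r.failure_mode_evaluation.values())
    )
    return rejected / len(records)
```

A corpus whose rejection share is near zero would consist almost entirely of cases where the candidate pattern fit, which is the over-fitting risk the paragraph above raises.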
For human-AI collaboration, the substrate is among the most powerful pieces of the project, and it is the part most often missed. The substrate is a shared cognitive protocol. When the LLM identifies feedback in a problem, the human and the LLM are looking at the same encyclopedia entry. When the LLM proposes an archetype, the human can inspect the same archetype's failure-mode conditions. Disagreement is locatable; reconciliation is structured; the audit trail is the same artifact the LLM was using rather than a reconstruction of what the LLM was doing. The implications for fields where AI-assisted reasoning must be auditable — medicine, law, engineering, policy — are substantial, and the substrate may matter more there than in fields where opacity is tolerable.
For auditability more generally, the substrate produces something black-box LLM reasoning cannot: an inspectable, structurally constrained record of the reasoning. The structural artifacts are not a faithful representation of the model's internal computation (that problem is well known to be unresolved), but they are a checkable record of what the model committed to at each step. The check is: do these commitments hold up against the substrate? If yes, the reasoning has a meaningful kind of soundness, even if the model's underlying computation cannot be inspected directly.
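The check described above can be sketched as a small audit function. This is a simplified illustration, not the pipeline's procedure: it assumes each catalog entry can be reduced to a set of required features (standing in for the structural signature) and a set of excluded features (standing in for the negative space), and that each commitment records which features the model claimed to observe. All names are hypothetical.

```python
from dataclasses import dataclass


# Hypothetical, simplified stand-ins for catalog content.
@dataclass(frozen=True)
class Constraint:
    required: frozenset[str]  # features the signature demands
    excluded: frozenset[str]  # features the negative space rules out


@dataclass(frozen=True)
class Commitment:
    step: int
    pattern: str              # the pattern the model claimed is operative
    observed: frozenset[str]  # features the model said it observed


def audit(
    commitments: list[Commitment], substrate: dict[str, Constraint]
) -> list[tuple[int, str]]:
    """Return (step, finding) pairs. An empty list means every
    commitment held up against the documented constraints."""
    findings = []
    for c in commitments:
        entry = substrate.get(c.pattern)
        if entry is None:
            findings.append((c.step, f"'{c.pattern}' is not in the catalog"))
            continue
        for missing in sorted(entry.required - c.observed):
            findings.append((c.step, f"'{c.pattern}' requires '{missing}'"))
        for hit in sorted(entry.excluded & c.observed):
            findings.append((c.step, f"'{c.pattern}' is ruled out by '{hit}'"))
    return findings
```

The audit says nothing about how the model arrived at a commitment. It only reports whether the commitment is consistent with what the catalog documents, which is the limited but checkable kind of soundness at issue.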
Several questions remain open. What objectively makes a catalog rigorous enough to function as a substrate? The negative-space requirement is identified here as load-bearing, but the threshold for sufficient rigor is not formalized. How should substrate quality be measured empirically? Inter-rater reliability on pattern identification, archetype application, and failure-mode evaluation would be useful starting points. Does the substrate approach generalize beyond cross-domain abstract reasoning to other reasoning tasks? Mathematical reasoning, scientific reasoning, planning, and ethical reasoning each have different cognitive structures, and the substrate's value in each is an open empirical question.
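One conventional way to quantify inter-rater reliability for two raters assigning categorical labels is Cohen's kappa, $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance from each rater's label frequencies. The paper does not commit to a particular statistic, so the sketch below is only one possible starting point. The function name and the choice to return 1.0 in the degenerate single-label case are assumptions.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout. Kappa is
        # undefined here, and returning 1.0 is a convention assumed
        # for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Applied to pattern identification, each item would be a situation and each label the prime a rater judged operative. The same function could be reused for archetype application and failure-mode evaluation, with labels chosen accordingly.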
There are also limits worth being explicit about. The substrate's value depends on its quality. A loosely curated catalog with weak negative-space content provides little benefit; a rigorously curated one provides substantial benefit. The investment required to build and maintain a substrate-grade catalog is large, and the value extracted from it is sensitive to how rigorously the negative-space work is done. The substrate is also not a complete answer to LLM reasoning problems. It addresses one specific class of failure modes, structural overextension in cross-domain reasoning, and does not address others, such as factual hallucination or arithmetic errors. Other tools and techniques are required for those.
10. Conclusion
The Augmented Abstract Reasoning pipeline works, when it works, because it pairs an LLM with a structural reasoning substrate that supports a distinct flavor of cognitive rigor. The substrate is not a list of named concepts; it is a curated set of patterns with structural signatures, explicit relations, documented cross-domain instantiations, and — most importantly — rigorous negative-space definitions. Without rigorous negative space, the substrate collapses into pattern-matching without real constraint. With it, the substrate provides the reasoning operations the AAR pipeline depends on.
The cognitive rigor the substrate provides is structural rather than propositional. It is not the rigor of formal logic; it is the rigor of pattern composition, configuration coherence, and consistency-checking against documented constraints. Both flavors of rigor are real; they answer different questions; the field has under-developed the second by treating it as a degenerate version of the first. The substrate approach takes structural rigor seriously and shows what it requires.
The framework is not classical neurosymbolic AI, though it is in the same family of attempts to combine structure with learning. It is a softer version, where rigor lives in catalog quality rather than in inference rules. It plays to LLM strengths and constrains LLM weaknesses. It functions equally as a shared cognitive substrate for humans, which makes auditable human-AI collaboration possible.
The contribution is integrative rather than novel-in-parts. Schema theory, diagrammatic reasoning, pattern languages, distributed cognition, TRIZ, and Senge's system archetypes all contributed pieces. What is new is the assembly: a curated cross-domain catalog with rigorous negative-space content, used by an LLM through an operational pipeline that produces inspectable structural artifacts at each stage. The pieces are old; the substrate is new.
References
[1] Dziri, Nouha, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Sean Welleck, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, and Yejin Choi. "Faith and Fate: Limits of Transformers on Compositionality." Advances in Neural Information Processing Systems 36 (2023).
[2] Mirzadeh, Iman, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models." arXiv preprint arXiv:2410.05229, 2024. Subsequently published at ICLR 2025.
[3] Turpin, Miles, Julian Michael, Ethan Perez, and Samuel R. Bowman. "Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting." Advances in Neural Information Processing Systems 36 (2023).
[4] Senge, Peter M. The Fifth Discipline: The Art and Practice of the Learning Organization. New York: Doubleday/Currency, 1990.
[5] Alexander, Christopher. A Pattern Language: Towns, Buildings, Construction. New York: Oxford University Press, 1977.
[6] Gamma, Erich, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Reading, MA: Addison-Wesley, 1994.
[7] Altshuller, Genrich. And Suddenly the Inventor Appeared: TRIZ, the Theory of Inventive Problem Solving. Translated by Lev Shulyak. Worcester, MA: Technical Innovation Center, 1996.
[8] Larkin, Jill H., and Herbert A. Simon. "Why a Diagram Is (Sometimes) Worth Ten Thousand Words." Cognitive Science 11, no. 1 (1987): 65–99.
[9] Bartlett, Frederic C. Remembering: A Study in Experimental and Social Psychology. Cambridge: Cambridge University Press, 1932.
[10] Rumelhart, David E. "Schemata: The Building Blocks of Cognition." In Theoretical Issues in Reading Comprehension, edited by Rand J. Spiro, Bertram C. Bruce, and William F. Brewer, 33–58. Hillsdale, NJ: Lawrence Erlbaum, 1980.
[11] Schank, Roger C., and Robert P. Abelson. Scripts, Plans, Goals, and Understanding: An Inquiry into Human Knowledge Structures. Hillsdale, NJ: Lawrence Erlbaum, 1977.
[12] Clark, Andy. Being There: Putting Brain, Body, and World Together Again. Cambridge, MA: MIT Press, 1997.
[13] Hutchins, Edwin. Cognition in the Wild. Cambridge, MA: MIT Press, 1995.