The Calculus of Abstraction

From a Lexicon of Primes to a Grammar of Operations for Cross-Domain Reasoning

Companion to Structural and Framed Primes and to the project's experimental record (the AAR-vs-CoT pilots, the §9.2 ablation, the forward cross-domain transfer test, and the retrieval diagnostics of 2026-05-22).

Editorial update (2026-05-26). The runtime architectures this paper proposes — above all the verb-driven planner of §9.2 and the typed lift/coverage discipline of §5 — were subsequently built and tested under blinded, pre-registered evaluation (projects 02–04). The headline result is largely null at the frontier: varying the control structure did not reliably move design quality, coverage of load-bearing components, or the faithfulness of the model's reasoning. Two specific claims below are revised by that evidence — the framed/structural gate on the lift (§5.4) was refuted (the lift's value tracks coverage, not framedness; the gate governs decompose, not lift), and the verb-driven planner (§9.2) proved inert relative to the fixed pipeline. The "pleasing metaphor vs. research program" test this paper sets for itself (§11) is now answerable. This document is preserved as the pre-test theory; the full empirical account is in the companion The Limits of Runtime Scaffolding, with the graded operation set in The Verb Grammar of Abstraction Operations and the verified prior-art map in Related Work & References.

Abstract

The Encyclopedia of Abstractions began as a catalog: a curated set of prime abstractions — patterns recurring across at least three domains — and solution archetypes. A catalog is a dictionary. This paper argues that a dictionary isn't enough, and that the more interesting object is a grammar: an explicit set of operations for manipulating abstractions, with the catalog as its lexicon. If primes are the nouns, the operations are the verbs, and the pair constitutes a calculus of abstraction. The paper enumerates the verbs already implicit in the Augmented Abstract Reasoning pipeline (lift, lower, compose, decompose, transport, salience-rank, prune, match, evaluate-fit, reconcile); brings category theory to bear (functors as cross-domain transport, free/forgetful adjunction as the lift/lower pair, Yoneda as the formalization of "an abstraction is its neighborhood"); situates the project against pattern catalogs (Alexander, Gang of Four, TRIZ), general systems theory, analogy work (Gentner), structural realism, and structured-prompting / neurosymbolic literature; and characterizes what we have built — structure as scaffold, not structure as solver — marking the frontier where scaffold could become verifier. The argument is anchored to small, confounded experiments rather than assertion; the most consequential finding is that on problems in well-covered regions, the protocol carries more of the value than the catalog.

1. Introduction: from a dictionary to a grammar

The Encyclopedia of Abstractions has an operational definition at its core: a prime abstraction is one that applies meaningfully across at least three domains of human knowledge. That definition does real work, because it draws a usable line around the abstractions worth cataloging, and on the strength of it the project has built up a corpus of 568 primes, each with a structural signature, and 621 solution archetypes, each with anti-signatures, components, and mechanisms. The natural mental model for such an artifact is a reference work: a dictionary of patterns one consults, the way an architect consults Alexander's pattern language or a programmer consults the Gang of Four.

This paper is about a shift in that mental model, and the shift was forced on us by data rather than chosen for elegance. Over a sequence of experiments in May 2026, we compared the project's nine-step Augmented Abstract Reasoning (AAR) pipeline — which retrieves primes and archetypes from the catalog and reasons over them — against a bare chain-of-thought baseline, and, crucially, against a third condition: a chain-of-thought agent given the structural process of the pipeline (externalize a typed model, strip it to a domain-neutral meta-model, enumerate and score candidate solution patterns, gate them against failure conditions, compose and reconcile) but no catalog at all. On a deliberately hard, catalog-rich problem, the catalog-equipped pipeline and the catalog-free scaffold scored essentially the same, and both clearly beat bare chain-of-thought. The improvement, in other words, was coming from the process — the disciplined sequence of operations performed on abstractions — far more than from the content of the catalog the operations happened to draw on.

That result is initially deflating if one's mental model is "the catalog is the asset." It becomes generative the moment one reframes: perhaps the asset was never only the nouns. Perhaps the asset is partly, even mostly, the verbs — the operations the pipeline performs. The catalog is a lexicon; the pipeline is the first draft of a grammar; and the thing that does the reasoning is the grammar applied to the lexicon. We will call this pairing a calculus of abstraction: a small set of well-defined operations on abstractions, plus a stock of abstractions to operate on, plus an engine (in our case a large language model) that executes the operations.

The phrase "calculus of abstraction" is a deliberate aspiration, not a claim of having achieved one. A calculus, in the strong sense, has objects, operations on those objects, composition laws telling you how operations chain, and ideally identities and inverses and a notion of when two compositions are equal. We have objects (primes, archetypes, the relational models built from them) and we have operations (the pipeline steps), but the operations are at present informal — they are prompt-level disciplines executed by a probabilistic model, not typed transformations with guaranteed behavior. Much of this paper is an attempt to take the operations seriously as first-class objects of study: to name them, to ask what their inverses and duals are, to ask which operations apply to which kinds of abstraction, and to ask what it would take to make them executable and checkable rather than merely suggestive. The honest status of the project, stated up front, is: we have a rich lexicon, a working but informal grammar, and an empirical hint that the grammar matters more than we expected.

There is a second, more specific empirical thread that runs through this paper and shapes the grammar we are reaching for: the distinction between structural and framed primes, developed in the companion paper. Structural primes (feedback, threshold, equilibrium, recursion, hierarchy) are pure relational patterns that travel light across domains; framed primes (sovereignty, due process, legitimacy, property) carry an institutional or normative frame that travels with them and resists clean separation. This distinction turned out to predict — and sometimes to be complicated by — the behavior of our operations. When we tested whether a transported abstraction could be recognized in an alien domain, and when we tested whether semantic search could retrieve the right abstraction from a far-domain instance, the structural/framed axis kept reappearing, sometimes confirming the theory and once sharply refuting a prediction we had made from it. A grammar of abstraction has to account for the fact that its operations behave differently on different kinds of abstraction — which is to say, a serious grammar needs typing rules. We will return to this repeatedly.

The paper proceeds as follows. Section 2 briefly describes what the project has built, so the rest is self-contained. Section 3 lays out the empirical findings that forced the reframe, in enough detail that the theoretical claims later can be checked against them. Section 4 treats the nouns: what kind of object a prime is, the structural/framed spectrum, and an empirical property — distinctiveness, or neighborhood density — that turns out to govern how findable an abstraction is. Section 5 is the heart of the paper: the verbs. We enumerate the operations we already have, elucidate each, and then take seriously the problem of discovering the operations we lack and the typing rules that govern them. Section 6 brings in category theory as a borrowed grammar of structure, with a novice-friendly exposition and an honest accounting of what transfers and what would be mere ornament. Section 7 situates the whole enterprise against the prior art it has independently re-derived. Section 8 characterizes precisely what we have built — a symbolic-shaped discipline of representation — and distinguishes scaffold from verifier. Section 9 argues that the current pipeline is only one traversal of the grammar and sketches alternative architectures. Section 10 lays out a research agenda; Section 11 states limitations honestly; Section 12 concludes.

A note on voice and method. This paper emerged from an extended dialogue, and several of its central framings — "the catalog is the lexicon, the operations are the verbs," "a calculus of abstraction," "structure as scaffold versus structure as verifier" — were produced collaboratively in that dialogue rather than derived from a literature. We have since checked them against the literature, and much of what we found is that the components are well-precedented while the synthesis and the empirical, falsification-oriented stance are less common. We try throughout to be explicit about which is which, because the value of the project depends on not mistaking a re-derivation for a discovery.

2. What the project has built

To keep this paper self-contained, we summarize the artifacts the rest of it refers to.

Prime abstractions. A prime is an abstraction that recurs across at least three domains. Each prime in the corpus carries a core idea (a prose definition), a structural signature (a more domain-neutral statement of the relational pattern, often with named role-phrases), a set of structural tensions, a "what it is not" section that fences the concept off from neighbors, and cross-references to related primes and to archetypes. The corpus currently holds 568 primes. Examples span the obviously structural (feedback, threshold_and_criticality, hierarchy, network, equilibrium, recursion, conservation_laws) and the obviously framed (sovereignty, procedural_fairness_due_process, legitimacy, authority, rights_vs_freedoms), with a long middle (monitoring, coordination, trust, fairness, incentive_compatibility).

Solution archetypes. An archetype is an intervention pattern: a structured description of a problem shape and the move that addresses it. Each carries source primes (the primes it is anchored on), related primes, components, mechanisms, trigger conditions, anti-signatures (the situations in which it should not be applied), root tensions, action logic, target invariants, and expected outcomes. The corpus holds 621 archetypes. Examples relevant to this paper include commons_governance, tipping_point_prevention, critical_mass_building, incentive_compatible_rule_design, independent_verification_oversight, diffusion_containment, cycle_breaking, and irreversible_commitment_management. The anti-signature field is worth flagging now because it is the catalog's most distinctive asset: it encodes not just "here is a pattern" but "here is when this pattern misleads," which is exactly the kind of negative knowledge that unaided reasoning tends to skip.
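The two record types described above can be sketched as plain data classes. This is a minimal illustration, not the project's actual schema: the field names are transliterated from the prose, the `domains` field and the `meets_prime_threshold` helper are assumptions added to make the three-domain definition checkable, and every field is typed as free text because the source does not say how the records are stored.

```python
from dataclasses import dataclass, field


@dataclass
class Prime:
    """One prime abstraction, with the sections listed in the text."""
    slug: str
    core_idea: str                       # prose definition
    structural_signature: str            # domain-neutral relational pattern
    structural_tensions: list[str] = field(default_factory=list)
    what_it_is_not: list[str] = field(default_factory=list)
    related_primes: list[str] = field(default_factory=list)
    related_archetypes: list[str] = field(default_factory=list)
    domains: list[str] = field(default_factory=list)  # ASSUMED field, not named in the text

    def meets_prime_threshold(self) -> bool:
        # Operational definition: recurs across at least three distinct domains.
        return len(set(self.domains)) >= 3


@dataclass
class Archetype:
    """One solution archetype (an intervention pattern)."""
    slug: str
    source_primes: list[str]             # the primes it is anchored on
    related_primes: list[str] = field(default_factory=list)
    components: list[str] = field(default_factory=list)
    mechanisms: list[str] = field(default_factory=list)
    trigger_conditions: list[str] = field(default_factory=list)
    anti_signatures: list[str] = field(default_factory=list)  # when NOT to apply it
    root_tensions: list[str] = field(default_factory=list)
    action_logic: str = ""
    target_invariants: list[str] = field(default_factory=list)
    expected_outcomes: list[str] = field(default_factory=list)
```

Keeping `anti_signatures` as a first-class field rather than burying it in prose is what lets a later step walk it mechanically.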

The Augmented Abstract Reasoning (AAR) pipeline. The pipeline is a nine-step procedure for taking a concrete decision problem and reasoning about it through the catalog. In compressed form: (1) specify the problem; (2) identify operative primes by searching the catalog; (3) salience-rank them; (4) prune to an operative subset; (5) build a context-specific relational model (entities as nodes, relationships as labeled edges, primes annotated on nodes and edges); (6) strip the domain specifics to a meta-model (the structural skeleton); (7) query archetypes whose source primes overlap the operative set, retrieving their full records including anti-signatures; (8) reason via both the context-specific model and the meta-model and reconcile; (9) evaluate fit by walking each candidate archetype's anti-signatures and failure modes, dropping those the situation trips, and produce a structured recommendation. The pipeline is executed by a large language model that has the catalog available through a set of retrieval tools.
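Steps 7 and 9 are the most mechanical parts of the pipeline, and a small sketch may make them concrete. It assumes archetypes are dictionaries with `source_primes` and `anti_signatures` keys, and it models "the situation trips an anti-signature" as simple set membership. In the pipeline as described, that judgment is made by the language model, so the membership test is only a stand-in. All slugs, primes, and flags below are placeholders, not catalog content.

```python
def query_archetypes(archetypes, operative_primes):
    """Step 7 (sketch): keep archetypes whose source primes overlap the operative set."""
    operative = set(operative_primes)
    scored = []
    for arch in archetypes:
        overlap = operative & set(arch["source_primes"])
        if overlap:
            scored.append((len(overlap), arch))
    # Larger overlap first; Python's sort is stable, so ties keep catalog order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [arch for _, arch in scored]


def evaluate_fit(candidates, situation_flags):
    """Step 9 (sketch): drop candidates whose anti-signatures the situation trips."""
    kept, dropped = [], {}
    for arch in candidates:
        tripped = [s for s in arch["anti_signatures"] if s in situation_flags]
        if tripped:
            dropped[arch["slug"]] = tripped  # record why it was dropped
        else:
            kept.append(arch["slug"])
    return kept, dropped


# Placeholder catalog: names are hypothetical.
ARCHETYPES = [
    {"slug": "archetype_a", "source_primes": ["prime_1", "prime_2"], "anti_signatures": ["flag_x"]},
    {"slug": "archetype_b", "source_primes": ["prime_2"], "anti_signatures": ["flag_y"]},
    {"slug": "archetype_c", "source_primes": ["prime_9"], "anti_signatures": []},
]

candidates = query_archetypes(ARCHETYPES, ["prime_1", "prime_2"])
kept, dropped = evaluate_fit(candidates, {"flag_y"})
print(kept, dropped)
```

Returning the tripped anti-signatures alongside the dropped slugs keeps the negative knowledge visible in the final recommendation instead of silently filtering it out.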

The structural/framed distinction. Developed in the companion paper, this is a refinement of the prime concept that classifies primes by how much of their home domain travels with them. Structural primes were domain-stripped at the moment they were named and travel as pure relational pattern; framed primes are partly constituted by an institutional or normative frame and travel "heavy," importing vocabulary and evaluative commitments. The distinction admits degree (a spectrum), correlates with the disciplinary character of the prime's origin (formal disciplines mint structural primes; institutional and normative disciplines mint framed ones), and comes with a decomposition operation: given a framed prime, attempt to extract a structural core by abstracting away the frame, with three possible fates — unification with an existing structural prime, the discovery of a new structural prime, or loss of identity when the residue is too thin to do the original's work. We will argue in Section 5 that this decomposition operation is the project's first explicitly-named verb, and that recognizing it as such is what opens the door to the rest of the grammar.
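The decomposition operation and its three fates can be written down as a small classifier. The sketch below is deliberately crude and rests on assumptions not made in the text: a prime is represented as a set of role phrases, the frame is a subset of those roles, "too thin" is approximated by a minimum role count (`min_core`), and unification requires an exact match of role sets. It is meant to show the shape of the operation, not to decide real cases.

```python
from enum import Enum


class Fate(Enum):
    UNIFICATION = "unifies with an existing structural prime"
    NEW_PRIME = "yields a new structural prime"
    LOSS_OF_IDENTITY = "residue too thin to do the original's work"


def decompose(roles, frame_roles, structural_catalog, min_core=2):
    """Strip the frame from a framed prime and classify the residue.

    roles              -- role phrases of the framed prime
    frame_roles        -- the subset treated as institutional or normative frame
    structural_catalog -- mapping slug -> role phrases of existing structural primes
    min_core           -- ASSUMED thinness threshold (a count, for illustration only)
    """
    core = set(roles) - set(frame_roles)
    if len(core) < min_core:
        return Fate.LOSS_OF_IDENTITY, core, None
    for slug, signature in structural_catalog.items():
        if core == set(signature):
            return Fate.UNIFICATION, core, slug
    return Fate.NEW_PRIME, core, None


# Placeholder roles; none of these are catalog entries.
catalog = {"existing_prime": {"role_a", "role_b"}}
print(decompose({"role_a", "role_b", "frame_role"}, {"frame_role"}, catalog))
```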

With these pieces in hand, we turn to the experiments that reframed how we understand them.

3. The findings that forced the reframe

This section lays out the empirical record the rest of the paper builds on. The experiments are small and we will be explicit about their weaknesses in Section 11; the point here is not to claim proof but to show what moved our priors and why.

3.1 The pipeline, bare reasoning, and the scaffold ablation

The first question was the obvious one: does reasoning through the catalog beat reasoning without it? We ran the AAR pipeline against a bare chain-of-thought baseline — an independent agent with no tools and no catalog, given the same decision scenario and asked to reason carefully and recommend — and had a separate, blinded grader score both outputs on a six-dimension rubric (structural coherence, failure-mode anticipation, implementability, mechanism grounding, stakeholder/equity fit, threshold/time-horizon awareness). To keep the comparison honest, both outputs were re-rendered to strip method-identifying vocabulary and matched for length and format before grading.

On the first scenario — a coastal fishery facing a commons collapse — the pipeline scored 50/60 and bare chain-of-thought 45/60; the grader preferred the pipeline. A re-run on the same scenario after a corpus rebuild scored 53 versus 52, essentially a tie. We then moved to a deliberately harder, messier, multi-domain scenario — a two-sided gig marketplace in a trust-and-liquidity doom-loop, chosen because its operative primes sit in catalog-dense regions — and ran a refined version of the pipeline. There the pipeline scored 53 and bare chain-of-thought 47, the widest gap we had seen.

The decisive experiment was the ablation. We added a third condition: a chain-of-thought agent given the pipeline's structural process as a prompt, but with no catalog access whatsoever. The prompt asked the agent to identify the operative patterns, build a typed relational model, strip it to a domain-neutral meta-model, enumerate candidate solution templates, instantiate, score, and gate each against its failure conditions, compose, and fall back to first principles where no template fits. All three conditions were then re-graded together on one scale by a fresh blinded grader. The result: bare chain-of-thought 46, the catalog pipeline 53, and the catalog-free scaffold 55. The grader ranked the scaffold first.

The honest reading is that the gain over bare reasoning was almost entirely the protocol — the disciplined sequence of operations — and not the catalog content. The catalog-free scaffold, which had the operations but no nouns to look up, recovered the full gain and then some. A frontier model already holds the relevant solution patterns for a problem in a well-trodden region (commons governance, marketplace design); what it lacks, and what the protocol supplies, is the discipline to externalize, decompose, enumerate, and gate. The catalog's marginal contribution over a good structural prompt was, on this problem, undetectable.

This is the finding that reframes the project. If the operations carry the value, then the operations deserve to be studied as first-class objects — which is to say, as a grammar.

3.2 Cross-domain recognition: where the catalog did contribute something

The ablation tested the catalog in-distribution, where the model's own knowledge is dense. The natural next question was whether the catalog earns its keep where the model's knowledge is thin — specifically, in recognizing an abstraction transported into an alien domain. We pre-registered a test: take a framed prime, procedural_fairness_due_process, transport its structure into immunology (designing the target-engagement logic of an autonomous engineered cytotoxic-cell therapy that must commit irreversible kills under noisy identity signals, where attacking healthy tissue is catastrophic and missing the tumor is recoverable), write the scenario frame-clean (no legal vocabulary), and measure whether each condition recognized the due-process structure, scored against five pre-registered structural moves rather than against the word "due process." An independent immunology expert, blind to the target, validated the scenario as a genuine, non-trivial problem in the field.

The result was nuanced and instructive. On a coarse three-point recognition scale, all three conditions — bare chain-of-thought, scaffold, and pipeline — reached the staged-contestable-safeguard architecture (multiple gates before an irreversible action; a veto/clearance path; brakes; a principled bias toward the recoverable error). The structural core of the transported prime was reconstructable by reasoning alone. But the conditions split on the fifth, most distinctively institutional move — independent/impartial review, an auditable record of reasons, and protection scaled to the severity of the mistake. Bare chain-of-thought essentially missed it; the scaffold had it partially; only the catalog-equipped pipeline surfaced it in full. And it did so because the catalog handed it irreversible_commitment_management and independent_verification_oversight, two archetypes that literally encode "consent/review gate," "independent review panel," and "finality record."

Two caveats keep this from being a clean win, and both matter. First, the catalog reached this content via cousin archetypes, not via the due_process prime — honest, scenario-derived retrieval never surfaced the prime itself (more on this in 3.3). Second, and more seriously, the pipeline was run by an operator who knew the target was due process; even though retrieval was honest, the operator's choice of what to foreground in the recommendation cannot be cleanly separated from that knowledge, while the bare and scaffold conditions were run by target-blind sub-agents. The cleanest statement we can defend is: on a framed prime transported to an alien domain, the structural core was reachable by everyone, and the catalog's distinctive contribution was the frame — the institutional half that unaided reasoning gravitates away from — which is exactly what the structural/framed theory predicts a catalog should carry. It is the first result in the series where the catalog contributed something neither bare nor scaffolded reasoning produced, and it is also the result most in need of a target-blind re-run before it counts for much.

Update (2026-05-25): the target-blind re-run, which revises this finding. Project 01's retrieval redesign let us run the test the previous paragraph asked for, and the unique contribution did not survive contamination control. With the operator target-blind and retrieval honest, a blinded grader scored bare reasoning and the pipeline equal on the institutional move (both partial), and neither reached its full institutional form, even though the redesigned search now does surface the due-process material that lexical search had missed (it ranks the prime second on an "independent review" facet of the meta-model and reaches it as a one-hop neighbour of governance in the catalog graph). So the earlier run's operator-contamination caveat appears to have been carrying the result: the unique-frame contribution was substantially an artifact of the operator knowing the target.

A follow-up then isolated why the frame did not spontaneously transport, and the answer was encouraging for the grammar even as it deflates the catalog claim. It was neither substrate-resistance (the alien domain readily yields a mechanism for each role) nor mainly a smuggled frame, but a missing operation: when the frame's roles were extracted and posed as explicit requirements, every condition found a working target-domain mechanism (a "homolog") for every role, in every one of six runs; naive lowering sufficed to find them, and a role-preserving lower found more robust ones. The frame was lost in the original run only because it was never lifted out and made a requirement before transport.

The honest restatement, then: the forward case does not yet demonstrate a catalog-frame contribution that survives a target-blind operator. What it demonstrates is that a verb the fixed pipeline lacks (extract the frame into roles, then instantiate each in the target) is the binding constraint, and that the substrate-resistance one might have feared is not real. This is the cleanest example in the series of a verb's absence, rather than the catalog's content, being what limited the result; §5.4 develops it as a typing rule and project 02 is built around it. (Records: experiments/2026-05-25_search_redesign_results.md; experiments/exp01_forward_rerun_2026-05-25/.)
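The missing verb (extract the frame into roles, then instantiate each role in the target) can be sketched as two small steps plus a coverage check. The role phrases are paraphrased from the institutional move described in this section; the function names, the `propose` callback standing in for the model, and the `mechanism_*` strings are all hypothetical placeholders rather than anything from the experimental records.

```python
FRAME_ROLES = [
    "independent review",
    "auditable record of reasons",
    "protection scaled to severity",
]


def lift_frame(frame_roles):
    """Lift: turn each frame role into an explicit, still unmet requirement."""
    return [{"role": role, "homolog": None} for role in frame_roles]


def lower(requirements, propose):
    """Lower: ask `propose` for a target-domain mechanism (a homolog) per role.

    `propose` maps a role phrase to a mechanism description, or None if it
    finds nothing. It stands in for the reasoning model.
    """
    return [{**req, "homolog": propose(req["role"])} for req in requirements]


def uncovered(requirements):
    """Coverage check: roles that still have no homolog in the target."""
    return [req["role"] for req in requirements if not req["homolog"]]


# Placeholder proposals: a real run would get these from the model.
known = {
    "independent review": "mechanism_1",
    "auditable record of reasons": "mechanism_2",
}
lowered = lower(lift_frame(FRAME_ROLES), known.get)
print(uncovered(lowered))
```

The point of the explicit lift is that a role which was never made a requirement cannot show up as uncovered, so its loss goes unnoticed.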

3.3 Retrieval is the bottleneck, and it has two layers

The fact that honest retrieval never surfaced due_process from the immunology scenario pointed at a problem we had been underweighting: the catalog's search was purely lexical — case-insensitive substring matching over names, slugs, and definitions. A query phrased in immunology terms simply shares no surface vocabulary with a legal prime, so the prime is invisible regardless of how apt it is. We tested whether semantic (embedding-based) retrieval would fix this, using a local bge-small model to embed the catalog and querying it two ways: with the raw immunology scenario, and with a domain-stripped meta-model of it; and against two versions of the corpus, the prose definitions and the more abstract structural signatures.
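Why the lexical retriever is blind here is easy to see in a few lines. The sketch below implements case-insensitive substring matching over name, slug, and definition as described above. The catalog entry's name and definition text are placeholders written for this example, not the catalog's actual wording.

```python
def lexical_search(query, catalog):
    """Case-insensitive substring match over name, slug, and definition."""
    needle = query.lower()
    return [
        entry["slug"]
        for entry in catalog
        if any(needle in entry[key].lower() for key in ("name", "slug", "definition"))
    ]


# Placeholder entry: the name and definition strings are illustrative only.
CATALOG = [
    {
        "slug": "procedural_fairness_due_process",
        "name": "Procedural fairness and due process",
        "definition": "Placeholder text about notice, hearing, and impartial review.",
    },
]

print(lexical_search("Due Process", CATALOG))  # shares surface vocabulary: found
print(lexical_search("cytotoxic", CATALOG))    # immunology wording: nothing matches
```

However apt the prime is structurally, a query with no shared substring returns an empty list.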

The rank of due_process among 568 primes told a clean story. Lexically it was absent — never retrieved. Semantically, the raw scenario placed it at rank 217 (prose) or 186 (signature) — better than absent, but buried in a sea of biomedical neighbors, because the raw scenario embeds into biomedical space. The domain-stripped meta-model lifted it to 129 (prose) and 45 (signature). The improvement is real and the lesson is sharp: there are two distinct retrieval failures tangled together. Layer 1 is lexical brittleness — exact-word dependence — and semantic search fixes it well; under a meta-model query the decision-theory and governance cousins (irreversibility, reversibility_and_irreversibility, consent, authority_delegation_under_uncertainty, incentive_compatibility) jump to the top, and the archetype index returns irreversible_commitment_management as the single best match among 621 archetypes. Layer 2 is cross-domain structural retrieval — connecting a far-domain instance to an abstraction whose home domain is distant in surface-meaning space — and off-the-shelf embeddings do not solve it: even the best configuration left due_process at rank 45, below any practical retrieval cutoff. The meta-model is the key artifact for Layer 2 (it dramatically out-performs the raw scenario, and embedding the structural signature beats embedding the prose), but it is not sufficient on its own.
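The ranks quoted above are positions in a list sorted by embedding similarity. As a minimal sketch of that measurement, the function below ranks a target slug by cosine similarity to a query vector. It uses hand-written two-dimensional vectors and placeholder slugs in place of real bge-small embeddings, so the numbers illustrate the procedure only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_of(target, query_vec, embeddings):
    """1-based rank of `target` when slugs are sorted by descending similarity."""
    ordered = sorted(
        embeddings,
        key=lambda slug: cosine(query_vec, embeddings[slug]),
        reverse=True,
    )
    return ordered.index(target) + 1


# Toy vectors, chosen by hand; slugs are placeholders.
EMBEDDINGS = {
    "prime_near": (1.0, 0.0),
    "prime_mid": (1.0, 1.0),
    "prime_far": (0.0, 1.0),
}
query = (1.0, 0.0)
print(rank_of("prime_mid", query, EMBEDDINGS))
```

Comparing a raw-scenario query with a meta-model query then amounts to calling `rank_of` twice with different query vectors against the same catalog embeddings.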

The most important consequence is retrospective: every prior experiment ran the pipeline with a lexical retriever, which means the catalog's content was being under-retrieved throughout. Some of "the catalog adds nothing" in the ablation is plausibly "the relevant catalog content was never fetched." This does not overturn the headline — the structural core is reachable by reasoning regardless, and even good semantic retrieval surfaces cousins rather than the framed prime itself — but it is a genuine confound that a retrieval redesign should lift, and it is the most actionable engineering finding in the series.

3.4 Two sweeps, and a refuted prediction

We then tried to test the structural/framed theory's transfer prediction directly. The theory implies that structural primes, being domain-neutral, should be easy to retrieve from a neutral structural description, while framed primes, weighed down by their frame, should be hard. We ran two sweeps. In the first, a blind agent wrote a domain-neutral paraphrase of the structure of each of seven framed and eight structural primes, and we measured how well each prime retrieved itself from its paraphrase. In the second, more rigorous sweep, a blind agent instantiated each paraphrase as a concrete far-domain scenario (a croquembouche for recursion, a harbor berth for sovereignty, power-grid governors for feedback, hospital credentialing for due process, a wetland food chain for hierarchy), and we measured the prime's retrieval rank from the far-domain instance.

Both sweeps refuted the prediction. Framed primes were retrieved better, not worse: in the cross-domain sweep, framed primes had median rank 3 (six of seven in the top ten), structural primes median 10 (only four of eight in the top ten), with structural misses as severe as recursion at 280 and symmetry at 91. The cause, once we looked, was not portability at all but distinctiveness, or neighborhood density. Structural primes are generic patterns that sit in dense clusters of near-synonyms — recursion competes with iteration, self-similarity, and decomposition; symmetry with invariance and conservation; equilibrium with attractor-selection and homeostasis — so a far-domain instance lands in the correct neighborhood but the exact prime is crowded out by its siblings. Framed primes are distinctive and sparsely-neighbored, so an instance lands precisely on them. The exception proves the rule: authority retrieved poorly (rank 10) precisely because its near-twin sovereignty outranked it — a dense framed pair.

So the variable that actually governs cross-domain exact-retrieval is the sparsity of an abstraction's semantic neighborhood, not whether it is framed or structural. This both refutes a clean theoretical prediction and replaces it with something more useful, and it has a direct connection to the cognitive science literature we discuss in Section 7: Gentner's MAC/FAC model of analogical retrieval found, decades ago, that human retrieval is driven by surface similarity while the mapping that follows is driven by structure — exactly the split between our Layer-1 (surface/topical) retrieval and the Layer-2 (structural) problem it cannot solve. We had re-derived, in an embedding-space experiment, a dissociation that the analogy literature established in the 1990s.

There is one more reframing buried in the sweep. In the original forward test, the immunology scenario retrieved due_process at rank 45; in this sweep, a clean far-domain instance of due process (hospital credentialing, where the procedure is the theme) retrieved it at rank 1. The difference is not domain distance — both are far from law — but obliqueness. The immunology scenario was a messy, multi-structure problem in which due process is one latent pattern among irreversibility, uncertainty, and asymmetric harm; the credentialing scenario is a clean, single-structure instance. Retrievability degrades with obliqueness, and real problems are oblique. That is a sobering result for any runtime use of the catalog and a central design constraint for the grammar.
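The sweep summaries in this subsection are a median rank and a count of primes inside the top ten. A helper for computing both is sketched below. The input ranks are toy values, not the experiment's per-prime ranks, and treating rank 10 as inside the top ten (`r <= k`) is an assumption about the convention.

```python
from statistics import median


def summarize(ranks, k=10):
    """Summarize a {slug: retrieval_rank} mapping for one group of primes."""
    values = list(ranks.values())
    return {
        "n": len(values),
        "median_rank": median(values),
        "in_top_k": sum(r <= k for r in values),  # ASSUMED: rank k itself counts
    }


# Toy ranks for illustration only.
toy = {"prime_x": 1, "prime_y": 4, "prime_z": 50}
print(summarize(toy))
```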

3.5 What we are licensed to conclude

Every experiment above is an N of one or a handful, with a single grader, a single model class, and — in the most catalog-favorable case — an operator-contamination confound we could not fully remove. It would be a serious error to treat any of these as proof. But it would be an equal error to treat them as worthless. Following Hubbard's account of measurement as the reduction of uncertainty rather than its elimination — and his observation that the first few observations carry the most information when prior uncertainty is high — these runs did move our priors in specific, defensible directions: toward "the protocol carries most of the value in well-covered regions," toward "retrieval, not catalog content, is the binding constraint," and toward "distinctiveness, not framed-versus-structural portability, governs cross-domain retrieval." The discipline that makes a single observation informative is not sample size; it is control of confounds, which is why we kept fighting operator contamination, format leakage, and instance obliqueness. A small, clean observation about the right thing beats a large, dirty one about the wrong thing.

With that evidentiary base in place, we turn to the theory it motivates.

4. The nouns: a lexicon of abstractions, and why it is inert on its own

4.1 What kind of object a prime is

A prime is not a definition in the dictionary sense. It is closer to a role or a pattern schema: a relational structure that can be instantiated by indefinitely many concrete situations across domains. feedback is not "the thing in cybernetics"; it is the schema "a portion of a system's output is routed back to influence its subsequent input," which is realized equally by a thermostat, a population's predator-prey dynamics, a power-grid governor, and a sarcastic remark that provokes the response that justifies it. The prime is the invariant that survives across those instantiations.

This has a consequence the project has half-acknowledged and that becomes central in Section 6: a prime is characterized less by an intrinsic essence than by its relationships — to the domains it appears in, to the other primes it co-occurs with, to the archetypes that draw on it, and to the primes it is explicitly not (the "what it is not" section). The catalog is, in this light, not a list but a graph; a prime's meaning is partly its position in that graph. We flag this now because it is precisely the content of the Yoneda perspective in category theory ("an object is determined by its relationships to all other objects"), and because our own retrieval experiments gave it an empirical edge: how findable a prime is turned out to depend on the density of its neighborhood, which is to say on its relationships, not on its intrinsic content.

4.2 The structural/framed spectrum, updated by evidence

The companion paper sorts primes along a spectrum from structural (pure relational pattern, domain-stripped at birth, travels light) to framed (partly constituted by an institutional or normative frame, travels heavy). The diagnostic criteria are whether the prime's home vocabulary travels with it, whether it carries default evaluative weight, whether it has an institutional referent at origin, whether it can be defined without reference to human practices, and whether applying it feels like recognizing a pattern that was already there or like importing a perspective that recasts the domain. Structural primes (symmetry, equilibrium, recursion, feedback, hierarchy, network, conservation_laws) score structural on all five; framed primes (due_process, sovereignty, legitimacy, dignity) score framed on all five; most primes sit in the middle.

Our experiments complicate this in a productive way. The theory predicts that framed primes should be harder to work with across domains because their frame resists travel. For the recognition operation that held: the structural core of due process transferred to every condition, but its frame (the impartial-review-and-reasons half) was supplied only by the catalog. For the retrieval operation, however, the prediction was flatly refuted: framed primes were easier to retrieve across domains, not harder, because — as Section 3.4 found — exact retrieval is governed by neighborhood sparsity, and framed primes happen to be more distinctive than the generic structural primes that blur into near-synonym clusters. So the structural/framed axis is real and predictive for some operations (decomposition, recognition of the frame) and orthogonal-to or even reversed-from the governing variable for others (retrieval). This is the first concrete sign that different operations have different sensitivities to the kind of abstraction they act on — the typing-rules problem we develop in Section 5.4.

4.3 Distinctiveness, or neighborhood density, as a property of the lexicon

The sweeps surfaced a property of primes that the catalog does not currently represent and probably should: how crowded an abstraction's neighborhood is. Some primes are isolated — few near-synonyms, a sparse local region of abstraction space — and these are reliably retrievable from a description or a far-domain instance, because nothing competes with them. Others are generic — recursion, symmetry, equilibrium, modularity — sitting in dense clusters of patterns that a short description cannot disambiguate, and these are hard to pin to an exact slug even when the instance faithfully realizes them.

This is not a defect of the primes; it is a fact about the lexicon's geometry, and it has two implications. First, for retrieval design (Section 9): for generic primes the right unit of retrieval is a neighborhood or family, not a single best match, because the family is what is actually identifiable and the reasoner can disambiguate within it. Second, for the lexicon itself: the catalog would be more useful if it explicitly represented these clusters — linking or partially merging the structural near-synonyms (recursion / iteration / self-similarity; symmetry / invariance / conservation; equilibrium / steady-state / homeostasis) so the cluster is legible rather than a set of competitors. A dictionary that knows which of its entries are near-synonyms is a better dictionary. We note in passing that "an object is known by its neighbors" is, again, the Yoneda intuition, and that representing the neighborhood explicitly is a step toward treating the catalog as the relational structure it actually is.
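Neighborhood density could be made an explicit, computed property of each entry. The following is a minimal sketch, assuming every prime has an embedding vector; the toy vectors, the 0.9 similarity threshold, and the helper names `cosine` and `neighborhood` are all hypothetical and are not taken from the project's retrieval code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def neighborhood(slug, embeddings, threshold=0.9):
    """Return the other primes whose similarity to `slug` meets the threshold."""
    target = embeddings[slug]
    return sorted(
        other for other, vec in embeddings.items()
        if other != slug and cosine(target, vec) >= threshold
    )

# Toy, hand-made vectors (illustrative only, not real embeddings).
EMBEDDINGS = {
    "equilibrium": [1.0, 0.1, 0.0],
    "steady_state": [0.95, 0.15, 0.0],
    "homeostasis": [0.9, 0.2, 0.05],
    "due_process": [0.0, 0.1, 1.0],
}

# Density = size of the near-synonym cluster around each prime.
density = {slug: len(neighborhood(slug, EMBEDDINGS)) for slug in EMBEDDINGS}
```

On these toy vectors the three equilibrium-like entries each have two neighbors and due_process has none, which is the crowded-versus-isolated contrast described above. A retriever could return the whole `neighborhood` for a crowded entry and a single hit for an isolated one.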

4.4 Why a lexicon underdetermines reasoning

The ablation result (Section 3.1) can now be stated more precisely. A lexicon of abstractions, however rich, is inert: it tells you what patterns exist but not what to do with them. The work of reasoning — anchoring a problem to the right patterns, building a model, abstracting it, finding the apt intervention, checking it against failure modes, reconciling competing readings — is operations on the lexicon, and a frontier model can perform those operations on patterns it already knows without consulting an external list. That is why the catalog-free scaffold matched the catalog-equipped pipeline: both had the operations; the scaffold simply supplied its own nouns from the model's parametric memory rather than from the corpus.

The reframe follows immediately. If the operations are doing the work, the operations are the thing worth building, sharpening, and (eventually) verifying. The catalog remains valuable for two specific jobs the operations alone do poorly — supplying negative knowledge (anti-signatures: when a pattern misleads) and supplying framed content (the institutional half that unaided reasoning skips, Section 3.2) — and as the lexicon over which a future, verifiable grammar would operate. But it is one ingredient, not the whole dish. The rest of this paper is about the other ingredient: the verbs.

5. The verbs: a grammar of operations on abstractions

5.1 The pipeline is a grammar in disguise

Read the nine-step AAR pipeline not as a workflow but as a sequence of operations on abstractions, and a grammar appears. Step 2 takes a concrete problem and finds the abstractions it instantiates; step 5 assembles those abstractions into a structured model; step 6 abstracts that model to a domain-neutral skeleton; step 7 carries the skeleton to a region of the catalog and pulls back candidate interventions; step 8 applies an intervention's logic back down to the concrete situation and reconciles competing readings; step 9 tests each candidate against the situations in which it is known to fail. Each of these is a transformation that takes abstractions (or abstraction-laden structures) as input and yields abstractions as output. They are verbs, and the pipeline is one particular sentence built from them.

Naming the verbs matters for the same reason naming the structural/framed distinction mattered: once a thing is named, you can ask questions about it that were invisible before. What is this verb's inverse? Its dual? Which kinds of abstraction does it apply to? When does it fail? Could it be made to run as a checkable procedure rather than a probabilistic gesture? The companion paper already, without announcing it, named one verb — the decomposition of a framed prime into a structural core plus a frame — and analyzed its behavior (the three fates). That was the first entry in a grammar. The following is an attempt at the rest.

5.2 The verbs we already have

We describe each operation by what it does, where it lives in the current pipeline, its inverse or dual where one exists, its characteristic failure modes, and the path (if any) toward making it executable and checkable rather than merely prompted.

Lift (abstraction). Lift takes a concrete situation, or a domain-specific model of one, and produces a more general structure by discarding domain particulars and keeping the relational skeleton. It is step 6 of the pipeline (context-specific model → meta-model) and, in miniature, step 2 (this concrete problem instantiates these patterns). Lift is the operation the contemporary prompting literature rediscovered as step-back prompting (Zheng et al., 2023), which improves reasoning by having the model derive high-level concepts before working the specifics. Its characteristic failure modes are over-lifting (stripping so much that the residue is vacuous — "this is a system with parts," true of everything and useful for nothing) and mis-lifting (keeping the wrong invariant, so the skeleton omits the load-bearing relation). Lift is partially checkable: one can verify that every entity and relation in the meta-model has a witness in the concrete model, and that the meta-model contains no domain terms — a referential-integrity-and-neutrality check we have run by hand and could automate.
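The referential-integrity-and-neutrality check on a lift might be automated roughly as follows. This is a sketch under stated assumptions: the meta-model is a list of abstract labels, `witnesses` maps each label to the concrete entity it abstracts, and neutrality is approximated by crude substring matching against a hand-supplied list of domain terms. The function name and the thermostat data are illustrative, not the project's own tooling.

```python
def check_lift(meta_model, witnesses, concrete_entities, domain_terms):
    """Return a list of problems; an empty list means both checks pass."""
    problems = []
    for element in meta_model:
        witness = witnesses.get(element)
        if witness is None:
            problems.append(f"no witness for '{element}'")
        elif witness not in concrete_entities:
            problems.append(f"witness '{witness}' is not in the concrete model")
        # Neutrality: abstract labels must not contain domain vocabulary.
        leaked = [term for term in domain_terms if term in element.lower()]
        if leaked:
            problems.append(f"domain terms {leaked} leaked into '{element}'")
    return problems

# Toy thermostat example (all names are illustrative).
concrete = {"thermostat", "furnace", "room temperature"}
domain_terms = ["furnace", "temperature", "thermostat"]

good = check_lift(
    meta_model=["sensor", "actuator", "regulated variable"],
    witnesses={"sensor": "thermostat", "actuator": "furnace",
               "regulated variable": "room temperature"},
    concrete_entities=concrete,
    domain_terms=domain_terms,
)
bad = check_lift(
    meta_model=["sensor", "furnace controller"],
    witnesses={"sensor": "thermostat"},
    concrete_entities=concrete,
    domain_terms=domain_terms,
)
```

Here `good` is empty, while `bad` reports both a missing witness and a leaked domain term for "furnace controller". Note that this check catches neither over-lifting nor mis-lifting: a vacuous skeleton can pass it.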

Lower (instantiation). Lower is the inverse of lift: it takes a general structure or intervention pattern and realizes it in a specific domain, supplying the particulars the abstraction omits. It is step 8 (apply the archetype's action logic to the concrete situation) and the production of the final recommendation. Its failure modes are forced lowering (instantiating a pattern in a domain where it does not actually fit, producing plausible-sounding nonsense) and under-specification (a "recommendation" that restates the abstraction without committing to concrete moves). Lower is the operation most exposed to the probabilistic substrate's weaknesses, because it is generative rather than recognitional; it is also where domain knowledge is most needed and least checkable. A project-01 experiment refined lower's behavior on framed inputs and gave it a checkable core after all. Lowering a transported frame naively tends to silently drop the awkward roles — "independent review" collapses into the acting unit re-checking itself — which is precisely how the forward case failed. A role-preserving lower instead treats each element of the frame as a requirement and searches the target substrate for a mechanism that discharges the same function — a homolog — even when the literal form is unavailable; in the immunology test this recovered every role in every run (a separate sentinel cell for independent confirmation, a one-way genomic edit for a recorded rationale, niche-density-scaled vetoes for consequence-scaling), and the role-preserving framing found more robust homologs than the naive one. The check is then simple: every lifted role must map to a named target-domain mechanism or be explicitly flagged unsatisfiable. We treat this homolog-search as a typed mode of lower (it is meaningful only when there is a frame to re-instantiate), not a separate verb.
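The role-preservation check described above is simple enough to sketch. The role and mechanism names below are the ones reported for the immunology test; the function name, the data layout, and the `naive` mapping (which drops independent confirmation, as the text says naive lowering tends to) are assumptions made for illustration.

```python
def silently_dropped_roles(lifted_roles, homologs, unsatisfiable=frozenset()):
    """Return roles that have neither a named mechanism nor an explicit flag."""
    return [
        role for role in lifted_roles
        if not homologs.get(role) and role not in unsatisfiable
    ]

roles = ["independent confirmation", "recorded rationale", "consequence scaling"]

# Role-preserving lower: every role is discharged by a named mechanism.
role_preserving = {
    "independent confirmation": "separate sentinel cell",
    "recorded rationale": "one-way genomic edit",
    "consequence scaling": "niche-density-scaled vetoes",
}
# Illustrative naive lower: the awkward role has quietly disappeared.
naive = {
    "recorded rationale": "one-way genomic edit",
    "consequence scaling": "niche-density-scaled vetoes",
}

ok = silently_dropped_roles(roles, role_preserving)
failed = silently_dropped_roles(roles, naive)
flagged = silently_dropped_roles(
    roles, naive, unsatisfiable={"independent confirmation"}
)
```

The check passes when the result is empty. A role that cannot be discharged still passes, provided it is listed in `unsatisfiable`, so the only thing the check forbids is a silent drop.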

Compose. Compose assembles multiple primes into a single relational model — entities as nodes, relationships as labeled edges, primes annotated on the structure. It is step 5. Composition is what turns a set of operative primes into a theory of the situation, and it is where most of the genuine modeling work happens. Its failure modes are spurious coupling (drawing edges that do not exist, manufacturing a connected story out of a genuinely modular situation) and missing coupling (failing to draw the edge that carries the dynamics). Composition is the verb most improved by a typed representation: if the model is emitted as a typed graph with referential integrity (every edge references declared nodes; every operative prime appears as an annotation), the structure becomes checkable, and we found in the refined pipeline that imposing this discipline materially tightened the reasoning.
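The two integrity conditions named above (every edge references declared nodes; every operative prime appears as an annotation) can be checked mechanically once the model is emitted as data. A minimal sketch follows; the `TypedModel` layout and the power-grid example are hypothetical and do not reproduce the refined pipeline's actual schema.

```python
from dataclasses import dataclass

@dataclass
class TypedModel:
    nodes: set             # declared entity names
    edges: list            # (source, label, target) triples
    annotations: dict      # prime slug -> set of nodes it is annotated on
    operative_primes: set  # primes that survived salience-rank and prune

def integrity_errors(model):
    """Return referential-integrity violations; empty means the model is well-formed."""
    errors = []
    for source, label, target in model.edges:
        for endpoint in (source, target):
            if endpoint not in model.nodes:
                errors.append(f"edge '{label}' references undeclared node '{endpoint}'")
    for prime in sorted(model.operative_primes):
        if not model.annotations.get(prime):
            errors.append(f"operative prime '{prime}' is never annotated")
    return errors

sound = TypedModel(
    nodes={"governor", "grid_frequency"},
    edges=[("grid_frequency", "informs", "governor"),
           ("governor", "adjusts", "grid_frequency")],
    annotations={"feedback": {"governor", "grid_frequency"}},
    operative_primes={"feedback"},
)
broken = TypedModel(
    nodes={"governor", "grid_frequency"},
    edges=[("grid_frequency", "drives", "load")],
    annotations={"feedback": {"governor"}},
    operative_primes={"feedback", "equilibrium"},
)

sound_errors = integrity_errors(sound)
broken_errors = integrity_errors(broken)
```

This catches malformed structure only. It cannot detect spurious or missing coupling, since an edge that should not exist is still a well-formed edge.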

Decompose. Decompose is the operation the companion paper analyzed: given a framed prime, strip the institutional or normative frame and attempt to expose a structural core, with three fates (the core unifies with an existing structural prime; the core is a new structural prime; the core is too thin and the prime is constitutively framed). Decompose is not currently a pipeline step — it is a catalog-construction and theory operation rather than a problem-solving one — but it belongs in the grammar because it transforms one abstraction into others, and because its inverse is a kind of framing (taking a structural core and adding a frame to specialize it for an institutional domain). Its failure mode is the false fate-two: convincing oneself that an extracted residue is a real, projectable pattern when it is a thin sketch that does no work in any other domain. The check for fate-two is exactly a transport test (below): does the extracted core project into at least three domains where the original framed prime would have looked alien? Decompose, in other words, is verified by transport — a first hint that the verbs constrain each other.

Transport. Transport carries an abstraction across a domain gap: recognizing that a situation in domain B instantiates a pattern whose home is domain A, and carrying the pattern's structure (and, for archetypes, its intervention logic) over. This is the operation at the heart of the whole project's ambition, cross-domain transfer, and our experiments showed it is also the hardest and least reliable. Transport has two sub-operations that the analogy literature has long distinguished and that our retrieval experiments re-derived: retrieval (finding the candidate pattern, which Gentner's MAC/FAC shows is driven by surface similarity, and which our embeddings likewise performed on topical rather than structural similarity) and mapping (aligning the structure once a candidate is in hand, which is structural). The failure modes are retrieval failure (the apt pattern is never surfaced because the instance shares no surface features with it, as in the due_process-in-immunology case) and forced mapping (aligning a pattern that superficially resembles the situation but whose structure does not actually fit, the classic way analogies break). Transport is the verb most in need of help: semantic retrieval over meta-models lifts its retrieval sub-operation out of pure lexical brittleness, but Layer-2 cross-domain retrieval to distant abstractions remains unsolved off-the-shelf, and the mapping sub-operation has no current check at all.

Salience-rank and prune. These paired operations take a candidate set of abstractions and reduce it to the operative subset: rank by load-bearing relevance (a prime is load-bearing if removing it would change the recommended intervention), then cut. They are steps 3 and 4. Their failure modes are premature pruning (cutting a prime that turns out to carry the dynamics) and failure to prune (carrying so many primes that the model becomes an undifferentiated mush). Salience is a judgment the model makes well in practice and that is hard to check, because "would removing this change the answer?" is a counterfactual that, taken seriously, requires running the rest of the pipeline with and without the prime — which suggests an expensive but real verification: ablate each operative prime and confirm the recommendation changes.
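The ablation check can be sketched as a loop that reruns the downstream pipeline once per operative prime. In the sketch below, `recommend` is a stand-in for "run the rest of the pipeline on this set of primes"; it and `toy_recommend` are hypothetical placeholders, and a real run would be one model invocation per ablation, which is where the expense comes from.

```python
def load_bearing(primes, recommend):
    """Map each prime to True if removing it changes the recommendation."""
    full = frozenset(primes)
    baseline = recommend(full)
    return {p: recommend(full - {p}) != baseline for p in primes}

def toy_recommend(primes):
    """Toy downstream pipeline: only `feedback` affects the answer."""
    if "feedback" in primes:
        return "add a damping loop"
    return "no intervention"

result = load_bearing(["feedback", "hierarchy"], toy_recommend)
```

In this toy run feedback is load-bearing and hierarchy is not, so hierarchy is a candidate for pruning. With a probabilistic `recommend`, a single changed output is weak evidence, so repeated runs per ablation would be needed before trusting the verdict.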

Match (retrieve). Match anchors a problem to the catalog: given the operative primes (or the meta-model), find the archetypes whose source primes overlap, or whose problem shape aligns. It is step 7, and it is the operation our experiments identified as the binding constraint (Section 3.3). Its failure mode is precisely the lexical/structural dissociation: surface-driven retrieval finds topically-similar entries and misses structurally-apt-but-surface-distant ones. Match is the most improvable verb in pure engineering terms — replace substring search with semantic search over meta-model queries against structural-signature embeddings, return neighborhoods rather than single hits, and decompose oblique queries into facets — and Section 9 treats this as the priority redesign.

Evaluate-fit (gate). Evaluate-fit walks a candidate intervention's anti-signatures, trigger conditions, and root tensions, and drops the candidate if the situation trips a condition under which the pattern is known to fail. It is step 9, and it is the operation that draws most directly on the catalog's distinctive asset — negative knowledge. Unaided reasoning readily generates patterns; it less readily generates the disciplined "here is when this pattern backfires" check, which is why we found the catalog's anti-signature content valuable even when its positive content was redundant. The failure mode is a hollow gate: the model nods at the anti-signatures without genuinely testing them, treating the step as ceremony. This is the verb most ripe for becoming a hard check rather than a soft judgment — the anti-signatures could be encoded as explicit conditions evaluated against the typed model, converting the gate from the model's opinion into a rule. That conversion is the single clearest example of the scaffold-to-verifier move we develop in Section 8.
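Converting the gate from opinion to rule might look like the following sketch, in which each anti-signature is a labeled predicate evaluated against a typed description of the situation. The anti-signature wording, the situation fields, and the function name are invented for illustration; they are not entries from the catalog.

```python
def evaluate_fit(candidates, situation):
    """Split candidates into kept and dropped, recording which conditions fired."""
    kept, dropped = [], {}
    for name, anti_signatures in candidates.items():
        tripped = [label for label, predicate in anti_signatures
                   if predicate(situation)]
        if tripped:
            dropped[name] = tripped
        else:
            kept.append(name)
    return kept, dropped

# Hypothetical anti-signatures, written as explicit conditions.
candidates = {
    "commons_governance": [
        ("no stable user community", lambda s: not s["stable_user_community"]),
    ],
    "critical_mass_building": [
        ("adoption has no network effect", lambda s: not s["network_effect"]),
    ],
}
situation = {"stable_user_community": False, "network_effect": True}

kept, dropped = evaluate_fit(candidates, situation)
```

Because every dropped candidate carries the label of the condition that fired, a hollow gate becomes visible: a run in which no predicate was evaluated leaves no record to point at.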

Reconcile. Reconcile runs the dual-view reasoning of step 8: apply a candidate's logic to the concrete model and to the meta-model separately, and resolve the (characteristically different) inferences they produce, preferring the reading consistent with both the archetype's source primes and the situation's operative primes. Reconcile is the verb that exploits the redundancy of having both a grounded and an abstracted representation; its failure mode is collapse — letting one view silently dominate, so the redundancy buys nothing. Reconciliation has no obvious automatic check, but it has a natural diagnostic: if the two views never disagree, the meta-model is not doing independent work and one of them is decorative.

5.3 Inverses, duals, and composition of verbs

Even this informal enumeration reveals structure among the verbs. Lift and lower are an inverse-like pair — abstraction and instantiation, climbing up and climbing down the ladder of generality — and we will argue in Section 6 that the right formalization is not strict inverse but adjunction, the category-theoretic relationship that holds between "forget the structure" and "freely build it back." Decompose and frame are a pair: strip the institutional overlay versus add one. Salience-rank and prune are a generate-and-filter pair. And the verbs constrain one another: decompose is verified by transport (does the extracted core actually project?), salience is verified by ablation (does removing the prime change the lower?), lift is verified by witness-checking against the model that compose produced. A grammar is not just a list of operations; it is the web of relationships among them — which compositions are legal, which verbs check which others — and even at this informal stage that web is visibly present.

There is also a natural notion of composing verbs into the pipeline sentence: read this way, the AAR pipeline is the composite lower ∘ reconcile ∘ evaluate-fit ∘ transport ∘ lift ∘ compose ∘ prune ∘ salience-rank ∘ match (read right to left, as with function composition, so match is applied first), give or take ordering. That this composite is only one of many possible sentences in the grammar is the point of Section 9: other orderings, other subsets, and entirely different control structures (a planner that chooses verbs dynamically rather than running a fixed sequence) are available the moment the verbs are first-class.
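Treating verbs as first-class values makes the pipeline sentence an ordinary function composition. A minimal sketch, with each verb stubbed as a function that only records its own name; `compose_verbs` and `verb` are hypothetical helpers, and the stubs stand in for the real prompted operations.

```python
from functools import reduce

def compose_verbs(*verbs):
    """Right-to-left composition: compose_verbs(f, g)(x) == f(g(x))."""
    return reduce(lambda f, g: lambda x: f(g(x)), verbs)

def verb(name):
    """Stub verb that appends its name to an execution trace."""
    return lambda trace: trace + [name]

# The composite as written in the text, leftmost verb applied last.
ORDER = ["lower", "reconcile", "evaluate-fit", "transport", "lift",
         "compose", "prune", "salience-rank", "match"]

pipeline = compose_verbs(*(verb(name) for name in ORDER))
trace = pipeline([])
```

The resulting trace begins with match and ends with lower, the reverse of the written order. A different sentence in the grammar is then just a different list passed to `compose_verbs`.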

5.4 Typing rules: which verb applies to which kind of abstraction

The single most important thing the experiments taught us about the grammar is that it is typed. Verbs do not apply uniformly to all abstractions; their behavior depends on the kind of abstraction they act on, and a serious grammar has to encode this.

The clearest case is decompose, which is only meaningful — and only has the three-fates behavior — on framed primes; applied to a structural prime there is no frame to strip, so the operation is a no-op or returns the prime itself. Transport behaves differently on the two kinds: for a structural prime, transport is recognitional (the pattern is either present in the target domain or not) and its retrieval failure mode is neighborhood crowding (the structural pattern is generic, so the instance retrieves a cluster of near-synonyms); for a framed prime, transport is interpretive (does the frame illuminate the target?) and its failure mode is the smuggled frame (importing institutional assumptions the target does not warrant) on the mapping side and surface-distance on the retrieval side. Evaluate-fit depends on a typing fact too: it requires anti-signatures, which archetypes carry but bare primes largely do not — so the gate is well-typed over archetypes and under-typed over primes. And the recognition experiment showed that the output of transport is type-dependent: transporting a framed prime delivers two separable things — a structural core that any reasoner can reconstruct, and a frame that does not regenerate from first principles and so must be carried. (The §3.2 update sharpens the second half: carrying the frame is necessary but not sufficient — it transports only when an explicit lift-into-roles step makes each of its elements a requirement, not automatically.)
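The decompose rule is the easiest of these to write down as a typed operation. The sketch below assumes a two-valued type tag and ignores mid-spectrum primes; the `Prime` fields and the short core and frame descriptions are illustrative paraphrases, not catalog entries.

```python
from dataclasses import dataclass
from enum import Enum

class Kind(Enum):
    STRUCTURAL = "structural"
    FRAMED = "framed"

@dataclass(frozen=True)
class Prime:
    slug: str
    kind: Kind
    core: str = ""   # structural residue, if any (hypothetical field)
    frame: str = ""  # institutional or normative overlay (hypothetical field)

def decompose(prime):
    """Return (core, frame); on a structural prime this is a no-op."""
    if prime.kind is Kind.STRUCTURAL:
        return prime.slug, None  # nothing to strip: the prime is its own core
    return prime.core, prime.frame

due_process = Prime(
    slug="due_process",
    kind=Kind.FRAMED,
    core="staged checks before an irreversible adverse action",
    frame="impartial review and stated reasons",
)
feedback = Prime(slug="feedback", kind=Kind.STRUCTURAL)

framed_result = decompose(due_process)
structural_result = decompose(feedback)
```

The type tag does the gating: the same call returns a separable core and frame for due_process and returns feedback unchanged with no frame.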

This is why the structural/framed distinction is not a side-theory but a part of the grammar's type system. To specify a verb fully is to specify its behavior on structural inputs, on framed inputs, and on the mid-spectrum cases — and to specify which verbs are even defined on which types. We are some distance from a complete type system, but the experiments have already populated several of its cells, and the discipline of asking "what type does this verb expect, and how does it behave on each type?" is itself a research method.

Project 01 populated one more cell, and validated it. Transporting a framed prime requires an explicit lift-the-frame-into-roles step before mapping, and when the fixed pipeline omits that step, the frame fails to transport. The forward re-run's follow-up pinned down why the omission is fatal, and corrected a tempting wrong answer: the frame did not fail to land because the alien substrate had no place for it (a paired micro-experiment found a working homolog for every frame role, in every run), nor mainly because of a smuggled frame, but because the roles were never extracted and posed as requirements. It was a salience/lift failure, not a substrate or mapping failure. The corrected typing rule is therefore explicit: on a framed input, transport decomposes as lift-roles → map → lower-to-homolog; on a structural input it is direct recognitional transport, with no frame to lift and neighborhood-crowding (not frame-loss) as the live risk. This is the first typed verb-composition the experiments have actually validated end to end; the structural/framed label is the type tag that gates it, and it is the spine of project 02.

[Update, 2026-05: this gate was refuted.] When the lift was tested across 13 primes (project 02), the lift-effect did not track framedness — a pure structural prime (conservation_laws) showed the largest effect, and symmetry (also structural) went negative. The lift is best understood as a coverage-enforcement operation whose value tracks how much of the operative pattern the unaided design already instantiates, not the framed/structural label. Framed/structural typing still genuinely gates decompose (only a framed prime has a frame to strip), but it does not gate lift. And even re-specified as coverage-enforcement, the lift's runtime value came back null at the frontier (projects 02–03). See verb-grammar-of-abstraction.md and the retrospective.

5.5 The hunt for new verbs

The user's intuition — that identifying additional verbs is the real trick — is, we think, correct, and the enumeration above suggests where to look. Several operations are visibly missing from the current pipeline yet clearly belong in a grammar of abstraction. We list candidates not as a finished taxonomy but as a search frontier.

Map (explicit structure-mapping) is the mapping sub-operation of transport, pulled out and made first-class: given a candidate pattern and a situation, produce the explicit correspondence between the pattern's roles and the situation's entities, and report where the correspondence breaks. The analogy literature (Gentner's Structure-Mapping Engine) has a forty-year-old algorithm for exactly this over structured representations; lifting it into the grammar would give transport a checkable mapping step it currently lacks.

Generalize and specialize move a prime up or down a subsumption hierarchy (from feedback to circular_causality, or from commons_governance to congestion-commons governance). These are operations the catalog implies through its cross-references but does not expose as moves.

Dualize swaps a pattern for its opposite-valence twin (diffusion-acceleration versus diffusion-containment; critical-mass-building versus tipping-point-prevention). It is a move the catalog's own family structure already hints at, and one that would let a reasoner ask "what is the inverse problem, and does its solution invert?"

Negate / anti-pattern makes explicit the construction of a failure mode from a pattern, which is what anti-signatures encode statically and which a verb could generate dynamically.

Blend or merge combines two abstractions into a composite (the cognitive-science operation Fauconnier and Turner call conceptual blending), which is what composition does informally but which deserves its own treatment when the inputs are themselves abstractions rather than entities.

Factor and refactor decompose a complex archetype into a composition of simpler ones, or re-express a model in terms of a different but equivalent set of primes. These operations would let the grammar simplify and normalize its own products.

Project and restrict take a model and view it through a single prime's lens, or restrict attention to a sub-structure. The multi-vector decomposition Section 9 proposes for oblique retrieval is really a project-then-transport composite.

Beyond enumerating candidates, the deeper question is how to discover verbs systematically, and three methods present themselves. The first is to study what expert cross-domain reasoners actually do — the move-level descriptions in TRIZ (abstract, find contradiction, apply inventive principle, map back) and in Polya's heuristics are catalogs of verbs in another guise. The second is to mine category theory, which is nothing but a grammar of structure-preserving operations, for operations that have no current analogue in our pipeline (Section 6). The third — and the most native to this project's method — is to mine the pipeline's failure modes: each characteristic failure of a verb points at a missing verb that would have caught it. The hollow-gate failure of evaluate-fit points at a verify verb (a hard check); the forced-mapping failure of transport points at a map-and-report-breakage verb; the spurious-coupling failure of compose points at a factor verb that would expose the manufactured edge. A grammar can be grown by repairing its own failures.

5.6 From suggestive to executable

A final observation about all of these verbs sets up the rest of the paper. As currently realized, every one of them is a prompted disposition: the model is asked to lift, or gate, or reconcile, and it does so probabilistically, with no guarantee that the operation was performed soundly. This is what makes the present system a discipline of representation rather than a calculus in the strong sense. But several of the verbs have a latent checkable core — lift can be witness-checked, compose can be referential-integrity-checked, evaluate-fit can be turned into hard conditions, salience can be ablation-checked, decompose can be transport-verified — and the gap between "the model claims it lifted correctly" and "the lift is verified" is exactly the gap between scaffold and verifier. Naming the verbs is what makes that gap visible and addressable, one verb at a time. We return to this in Section 8 as the central frontier of the project.

6. Category theory as a borrowed grammar of structure

6.1 Why look here at all

If we are reaching for a grammar of operations on structures, it is worth knowing that mathematics already built one, for exactly that purpose. Category theory is, in a slogan, the mathematics of structure and of structure-preserving transformation. It was invented in the 1940s to make precise the recurring observation that constructions in one field (topology, say) had exact analogues in another (algebra), and that the analogy was not loose metaphor but a rigorous correspondence. Its central methodological move is to stop asking what objects are and start asking how they map to one another — to study a mathematical world through its arrows rather than its points. That is the same move our project keeps being pushed toward: a prime is its relationships, transport is a structure-preserving map, the catalog is a graph. So category theory is the natural place to shop for verbs, and for a vocabulary precise enough to ask whether our informal operations have the structure we suspect.

A warning attaches to this whole section, and we state it before the enthusiasm. Category theory is notoriously easy to invoke and hard to apply. It is full of constructs that, named over a vague analogy, produce the appearance of rigor without the substance — what practitioners call "categorical theater," diagrams of arrows that commute only because no one checked. We will be explicit at the end about which of the following is something we could genuinely borrow now, which is a long-term north star, and which is, for the moment, suggestive analogy that should not be oversold. The exposition is pitched for a reader with no category-theory background; specialists should read it as deliberately informal.

6.2 The basics, for a newcomer

A category is almost embarrassingly simple to state. It consists of objects and, between objects, arrows (called morphisms) — and two rules. First, arrows compose: if there is an arrow from A to B and one from B to C, there is a composite arrow from A to C, and composition is associative. Second, every object has an identity arrow to itself that does nothing under composition. That is the entire definition. The objects can be anything — sets, spaces, types, the primes in our catalog — and the arrows are whatever structure-respecting relationships matter in that world: functions between sets, continuous maps between spaces, "is a special case of" between concepts.

The discipline's first lesson is the one already quoted: you learn the most about an object not by dissecting it but by seeing how it relates to everything else through arrows. A "point" in a space, a "set with one element," a "true statement" — each turns out to be characterized entirely by its pattern of arrows to and from other objects. This is the attitude our retrieval experiments stumbled into empirically: a prime's findability — and, we suspect, much of its meaning — lives in its neighborhood of relationships, not in its intrinsic text.

6.3 Functors: cross-domain transport, made precise

The first construct worth borrowing is the functor. A functor is a structure-preserving map between categories: it sends each object of one category to an object of another, and — crucially — each arrow to an arrow, in a way that respects composition and identities. If A maps to F(A) and the arrow A→B maps to F(A)→F(B), then the composite A→B→C must map to the composite of the images. A functor, in other words, carries not just the objects of one world into another but the relationships among them, intact.
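Stated symbolically, for arrows $f : A \to B$ and $g : B \to C$ in the source category, the two conditions a functor $F$ must satisfy are the standard ones:

```latex
F(g \circ f) = F(g) \circ F(f),
\qquad
F(\mathrm{id}_A) = \mathrm{id}_{F(A)}
```

The first equation is the "composite maps to the composite of the images" requirement from the paragraph above; the second says that do-nothing arrows are sent to do-nothing arrows.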

This is precisely what our transport verb is trying to be. When we carry the structure of due_process into immunology, what we want to preserve is not the legal vocabulary but the relationships — that an irreversible adverse action stands downstream of a staged sequence of checks, that a clearance path stands in opposition to the activating signal, that reasons stand as a record over the decision. A faithful transport preserves this relational skeleton while swapping the domain's objects; an unfaithful one (a broken analogy) preserves some surface arrows but violates composition — it maps the pieces but not the way they fit together, which is exactly how the analogy "electrons orbit the nucleus like planets" fails. Category theory gives us the precise diagnostic for a good transport: does it preserve composition? That is a sharper criterion than "does it feel apt," and it suggests that the map verb we proposed in Section 5.5 should report not just a correspondence of roles but whether the correspondence respects the relationships among them.
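A weak, checkable version of this diagnostic can be sketched over labeled relations: map each role of the pattern to an entity of the situation and report every pattern relation whose image is absent. This is a graph-homomorphism check, which is a much thinner condition than functoriality, and the relation labels, the thermostat encoding, and the function name are all hypothetical.

```python
def mapping_breaks(pattern_relations, situation_relations, role_map):
    """Return pattern relations that the role mapping fails to preserve."""
    breaks = []
    for source, label, target in sorted(pattern_relations):
        if source not in role_map or target not in role_map:
            breaks.append((source, label, target, "unmapped role"))
            continue
        image = (role_map[source], label, role_map[target])
        if image not in situation_relations:
            breaks.append((source, label, target, "relation not preserved"))
    return breaks

# The feedback schema as two relations between two roles (toy encoding).
feedback_pattern = {
    ("output", "routed_to", "input"),
    ("input", "influences", "output"),
}
role_map = {"output": "room_temperature", "input": "thermostat"}

closed_loop = {
    ("room_temperature", "routed_to", "thermostat"),
    ("thermostat", "influences", "room_temperature"),
}
open_loop = {
    ("thermostat", "influences", "room_temperature"),
}

faithful = mapping_breaks(feedback_pattern, closed_loop, role_map)
broken = mapping_breaks(feedback_pattern, open_loop, role_map)
```

In the open-loop case every role still has a counterpart, yet the mapping breaks because the return relation is missing. That is the "maps the pieces but not the way they fit together" failure in miniature.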

6.4 Natural transformations: analogies between analogies

If functors are mappings between worlds, natural transformations are mappings between mappings. Given two functors F and G that both carry category C into category D, a natural transformation is a systematic way of turning F's image into G's image — a family of arrows, one for each object, that all fit together coherently (the technical condition is that a certain square of arrows commutes). The motivating examples are things like "every vector space maps coherently into its double-dual," a transformation that works uniformly across all spaces without choosing arbitrary coordinates.
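For reference, the commuting-square condition can be written out. A natural transformation $\alpha$ from $F$ to $G$ assigns to each object $A$ of $\mathcal{C}$ an arrow $\alpha_A : F(A) \to G(A)$ in $\mathcal{D}$, subject to the standard naturality condition:

```latex
\alpha_B \circ F(f) = G(f) \circ \alpha_A
\qquad \text{for every arrow } f : A \to B \text{ in } \mathcal{C}
```

Both sides are arrows from $F(A)$ to $G(B)$: one path applies $F$ to $f$ and then crosses over to $G$, the other crosses over first and then applies $G$ to $f$. Naturality says the two paths agree.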

The relevance to us is more aspirational but worth stating, because it names a level of structure our project gestures at and never formalizes. If transporting commons_governance into fisheries is one functor-like mapping and transporting it into spectrum allocation is another, a natural transformation would be the systematic relationship between those two transports — the sense in which they are "the same analogy applied twice." When the project talks about an archetype's family (the variants of commons_governance: common-pool, congestion-commons, maintenance-commons, pollution-sink), it is groping toward exactly this: a coherent family of related transports. Category theory says such families have their own arrows and can themselves be reasoned about. We are nowhere near formalizing this, but it tells us that "the relationship between two cross-domain mappings" is a real, studyable object and not a vague meta-comment.

6.5 Adjunctions: the formal pairing of lift and lower

The construct we would most like to be true of our grammar is the adjunction, and here the analogy is unusually tight. An adjunction is a precise relationship between two functors going in opposite directions, written F ⊣ G, "F is left adjoint to G." The canonical example is the free/forgetful pair. A forgetful functor throws away structure: it takes a richly-structured object (a group, say) and remembers only its bare underlying set, forgetting how the elements combined. A free functor goes the other way: it takes a bare set and builds the most general, most unconstrained structure of the desired kind on top of it (the "freest" group on those elements, assuming no relations beyond the ones forced by the axioms). These two are adjoint, with free as the left adjoint and forgetful as the right: structure-preserving maps out of the freely-built structure correspond one-to-one, and canonically, with plain functions out of the bare set into the target's underlying set.
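The correspondence can be seen in miniature with the free monoid instead of the free group (a deliberate simplification: lists under concatenation are the free monoid on a set of generators). Any function from the bare set into a monoid extends in exactly one way to a structure-preserving map out of the lists, and restricting that map to one-letter words gives the function back. The names `extend`, `weights`, and `total` are illustrative.

```python
from functools import reduce
from typing import Callable, Iterable, TypeVar

S = TypeVar("S")
M = TypeVar("M")


def extend(f: Callable[[S], M], op: Callable[[M, M], M], unit: M) -> Callable[[Iterable[S]], M]:
    """Extend f : S -> M to the monoid homomorphism from lists over S into (M, op, unit)."""
    def f_star(word: Iterable[S]) -> M:
        return reduce(op, (f(x) for x in word), unit)
    return f_star


# Target monoid: integers under addition. f assigns a weight to each generator.
weights = {"a": 1, "b": 10}
total = extend(weights.__getitem__, lambda x, y: x + y, 0)

u, v = ["a", "b"], ["b", "b", "a"]
# Homomorphism laws: concatenation goes to addition, the empty word to the unit.
assert total(u + v) == total(u) + total(v)
assert total([]) == 0
# Restricting the extension to one-letter words recovers the original function,
# so the map on the bare set and the map on the free structure determine each other.
assert all(total([x]) == weights[x] for x in weights)
```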

Now read lift as forgetful and lower as free. Lift takes a domain-rich situation and forgets its particulars, keeping the bare relational skeleton — that is exactly a forgetful functor's move. Lower takes a bare abstraction and builds the most general instantiation consistent with a target domain — the freest realization — which is exactly a free functor's move. If this analogy holds in any rigorous sense, then lift and lower are not mere inverses (they are not — you cannot recover the discarded particulars) but adjoints, and the adjunction would tell us the precise sense in which lowering an abstraction and then lifting it again returns you to a canonical "most general" version of where you started. We do not claim to have proven this; it is the most enticing of the borrowable ideas and also the one most at risk of being categorical theater. But it is a concrete, checkable research question — is there an adjunction between abstraction and instantiation over the catalog? — and the fact that the free/forgetful adjunction is the textbook case of "add structure" versus "forget structure" is at least strong circumstantial encouragement that the lift/lower pair has more mathematical structure than we have been giving it.
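Stated in symbols, the conjecture would take the following shape, with Lower as the left adjoint and Lift as the right. This is the open question written down, not a result: A ranges over abstractions and X over domain situations, and the two categories are left unnamed because nothing above fixes them.

```latex
\[
\mathrm{Lower} \dashv \mathrm{Lift}
\quad\Longleftrightarrow\quad
\mathrm{Hom}\bigl(\mathrm{Lower}(A),\, X\bigr) \;\cong\; \mathrm{Hom}\bigl(A,\, \mathrm{Lift}(X)\bigr)
\ \ \text{naturally in } A \text{ and } X,
\]
\[
\eta_A : A \longrightarrow \mathrm{Lift}\bigl(\mathrm{Lower}(A)\bigr),
\qquad
\varepsilon_X : \mathrm{Lower}\bigl(\mathrm{Lift}(X)\bigr) \longrightarrow X .
\]
```

The unit η is the "lower, then lift" comparison the paragraph refers to; the counit ε is its mirror image on the domain side. A test of the conjecture would have to exhibit both families of arrows and check the naturality of the bijection.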

6.6 Yoneda: an object is its relationships

The Yoneda lemma is category theory's deepest elementary result, and its informal content is one sentence: an object is completely determined by its relationships to all other objects. If you know every arrow into (or out of) an object — how it sits relative to everything else in the category — you know the object up to isomorphism; its "intrinsic nature" adds nothing beyond its pattern of relationships.

This is the formal statement of something our experiments found empirically and our lexicon does not yet represent. A prime's behavior under retrieval — how findable it is, what it gets confused with — is governed by its neighborhood, by its relationships to other primes, not by its intrinsic definition. The generic structural primes are hard to retrieve precisely because their neighborhoods are crowded; the distinctive framed primes are easy because theirs are sparse. Yoneda says this is not an accident of our embedding model but something close to a structural truth: the identity of a prime is its relational position. The practical upshot, which Section 9 develops, is that the catalog should represent primes by their relationships — explicit neighborhood structure, near-synonym links, dual pairs, subsumption arrows — and that retrieval and reasoning should operate over that relational structure rather than over isolated definition strings. The catalog should become, in category-theoretic spirit, a category.
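A minimal sketch of what "a prime is its relational position" could mean operationally: represent each prime by the set of typed relationships it participates in, and compare primes by those profiles instead of by definition text. The edge list, the relation names, and the `crowding` measure are hypothetical, chosen only to show a crowded neighborhood next to a sparse one.

```python
def profile(edges, prime):
    """Relational profile: every (relation, other end) pair that touches the prime."""
    outgoing = {(rel, dst) for src, rel, dst in edges if src == prime}
    incoming = {(rel + "^-1", src) for src, rel, dst in edges if dst == prime}
    return outgoing | incoming


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def crowding(edges, prime, primes, threshold=0.5):
    """Count other primes whose relational profile is at least `threshold` similar."""
    mine = profile(edges, prime)
    return sum(1 for other in primes
               if other != prime and jaccard(mine, profile(edges, other)) >= threshold)


# Illustrative edges only; not taken from the catalog.
EDGES = [
    ("feedback_loop", "subsumes", "homeostasis"),
    ("feedback_loop", "subsumes", "thermostat_control"),
    ("regulation", "subsumes", "homeostasis"),
    ("regulation", "subsumes", "thermostat_control"),
    ("control", "subsumes", "homeostasis"),
    ("due_process", "dual_of", "summary_action"),
    ("due_process", "guards", "irreversible_action"),
]
PRIMES = ["feedback_loop", "regulation", "control", "due_process"]

for p in PRIMES:
    print(p, crowding(EDGES, p, PRIMES))
```

Two primes with identical profiles are indistinguishable to anything that sees only relationships, which is the Yoneda-flavored reading of near-synonym crowding; a prime with a distinctive profile has no near neighbors to be confused with.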

6.7 Composition, products, and the legality of sentences

Two more borrowable ideas, briefly. Category theory's insistence on composition and its associativity is a model for what it would mean to say the verbs of our grammar compose into legal "sentences": the AAR pipeline is one composite, and a grammar worth the name would specify which composites are well-formed (you cannot evaluate-fit before you have transported anything; you cannot lower what you have not first lifted and mapped). And the universal constructions — products (the canonical way to pair two objects), coproducts (the canonical way to sum them), and the more general limits and colimits (canonical ways to glue a diagram of objects into one) — are precise versions of our compose and factor verbs. When we assemble several primes into a relational model, we are doing something like a limit construction; when we factor a complex archetype into simpler ones, we are doing something like a decomposition into a coproduct. We do not have these as formal operations, but category theory tells us that "the canonical way to combine these structures" is a well-posed question with, often, a unique best answer — which is a higher standard than "whatever the model drew."
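The well-formedness idea can be sketched as a precondition table over verbs. The two constraints named above (no evaluate-fit before a transport, no lower before lift and map) are encoded directly; the remaining rows of the table, and the fact names, are assumptions made for the sake of a runnable example, not a specification of the grammar.

```python
# verb -> (facts required beforehand, facts established afterwards)
# Only the evaluate_fit and lower preconditions come from the text; the rest are assumed.
RULES = {
    "lift":         (set(),                {"lifted"}),
    "map":          ({"lifted"},           {"mapped"}),
    "transport":    ({"lifted"},           {"transported"}),
    "evaluate_fit": ({"transported"},      {"fit_checked"}),
    "lower":        ({"lifted", "mapped"}, {"lowered"}),
}


def first_violation(sentence):
    """Return (position, verb, missing facts) for the first ill-formed step, or None."""
    established = set()
    for position, verb in enumerate(sentence):
        required, produced = RULES[verb]
        missing = required - established
        if missing:
            return position, verb, missing
        established |= produced
    return None


print(first_violation(["lift", "map", "transport", "evaluate_fit", "lower"]))
print(first_violation(["lift", "evaluate_fit"]))
print(first_violation(["lift", "lower"]))
```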

6.8 Applied category theory we can actually mine

Three bodies of applied work are close enough to our project to be raided rather than merely admired.

The first is David Spivak's ologs (ontology logs; Spivak and Kent, 2012; developed in Category Theory for the Sciences, MIT Press, 2014). An olog is a category used as a knowledge-representation schema: objects are types ("an amino acid," "a fishery"), arrows are functional relationships ("has, as its catch limit"), and the commuting-diagram discipline enforces consistency. An olog is, almost exactly, what our catalog would be if its primes and their relationships were represented as a category rather than as text records with cross-reference strings. Ologs are the existence proof that a structured knowledge catalog can be given category-theoretic form usefully, by non-specialists, and they are the most direct template for representing the Encyclopedia as a category.
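To make the commuting-diagram discipline tangible, here is a toy olog fragment with instance data. Types are sets of instances, arrows are functions (plain dictionaries here), and a declared commuting diagram is checked by following both paths on every instance. The types, arrow names, and data are invented for illustration and are not drawn from Spivak's examples or from the catalog.

```python
def follow(path, x):
    """Apply a path of arrows (each a dict acting as a function) from left to right."""
    for arrow in path:
        x = arrow[x]
    return x


def non_commuting(path_a, path_b, elements):
    """Return the instances on which two paths between the same types disagree."""
    return [x for x in elements if follow(path_a, x) != follow(path_b, x)]


# "a fish" --is caught in--> "a fishery" --has, as its catch limit--> "a limit"
caught_in = {"cod_1": "north_sea", "cod_2": "north_sea", "tuna_1": "pacific"}
catch_limit = {"north_sea": 100, "pacific": 250}
# "a fish" --is subject to--> "a limit", declared equal to the two-step path above
subject_to = {"cod_1": 100, "cod_2": 100, "tuna_1": 250}

fish = list(caught_in)
print(non_commuting([caught_in, catch_limit], [subject_to], fish))

# An inconsistent record is caught by the same check.
subject_to_bad = dict(subject_to, cod_2=90)
print(non_commuting([caught_in, catch_limit], [subject_to_bad], fish))
```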

The second is Fong and Spivak's Seven Sketches in Compositionality (2018), an applied-category-theory text built precisely for newcomers, pairing concrete applications (databases, circuits, dynamical systems, scheduling) with the categorical structures that model them (adjunctions, enriched categories, toposes). It is the natural on-ramp for the project's category-theory literacy, and it demonstrates the compositional stance — build complex systems by composing simple, well-understood pieces with known interfaces — that our grammar-of-verbs is an informal instance of.

The third, and the most tantalizing, is DisCoCat — categorical compositional distributional semantics (Coecke, Sadrzadeh, and Clark, 2010). DisCoCat solves a problem with the exact shape of ours: it combines a grammar (how words compose into sentences, formalized as a pregroup grammar) with a space of meanings (word vectors) by means of a functor that carries grammatical derivations into linear maps on the meaning space, so that the meaning of a sentence is computed compositionally from the meanings of its words along the grammatical structure. Translate the analogy: DisCoCat combines a grammar with a lexicon-of-meanings via a structure-preserving map. That is the precise abstract shape of what a calculus of abstraction would be — a grammar of operations combined with a lexicon of abstractions via an engine that respects the structure. DisCoCat is not a drop-in solution; its meanings are word vectors and its grammar is linguistic. But it is the closest existing thing to a worked-out "grammar × lexicon, joined functorially" and it is worth studying as a structural template, including its use of string diagrams to make the information flow visible — a representation our typed relational models could borrow.
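A toy version of the DisCoCat computation may help fix the shape of the idea. For a subject-verb-object sentence, the noun meanings are vectors, the transitive verb is an order-3 tensor, and the grammatical reduction contracts the verb with its subject and object to leave a vector in the sentence space. The two-dimensional noun space, the one-dimensional sentence space, and the hand-written `chases` tensor are assumptions for illustration; real DisCoCat models learn these from corpora.

```python
def sentence_meaning(subj, verb, obj):
    """Contract verb[i][j][k] with subj[i] and obj[k]; the result lives in the sentence space (index j)."""
    sentence_dim = len(verb[0])
    return [
        sum(subj[i] * verb[i][j][k] * obj[k]
            for i in range(len(subj)) for k in range(len(obj)))
        for j in range(sentence_dim)
    ]


# Noun space basis: index 0 = dog, index 1 = cat. Sentence space: one "plausibility" axis.
dog, cat = [1.0, 0.0], [0.0, 1.0]
# chases[i][0][k] is 1 only for (subject=dog, object=cat).
chases = [[[0.0, 1.0]],
          [[0.0, 0.0]]]

print(sentence_meaning(dog, chases, cat))
print(sentence_meaning(cat, chases, dog))
```

The point of the analogy is the division of labor: the grammar fixes which indices are contracted, the lexicon supplies the tensors, and meaning is computed along the grammatical structure. Word order matters because the contraction pattern, not the bag of words, determines the result.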

6.9 Honest accounting: what to borrow now, what is north star, what is theater

Sorting the above by how much weight it can currently bear: The genuinely borrowable-now item is the conceptual discipline — think in morphisms rather than objects; ask of every transport "does it preserve composition?"; represent the catalog as a relational structure (an olog) rather than a list; treat near-synonym crowding as a Yoneda-style fact about relational position; and adopt the compositional stance toward the verbs. None of that requires formalizing anything; it requires changing how we describe and structure what we already have, and it would directly improve retrieval and lexicon design. The long-term north stars are the adjunction between lift and lower (a real, checkable research question that could give the grammar a backbone) and a DisCoCat-style functorial semantics joining the grammar to the lexicon (the formal endpoint of "calculus of abstraction," years away if reachable at all). The theater risk is everything in between: naming our informal operations with categorical terms — calling compose a "limit," calling a family of transports a "natural transformation" — before we have checked that the laws actually hold. Those names are useful as hypotheses and dangerous as decorations. The rule we propose is simple: use a categorical name only once you can state the law it implies and design a test for it. Under that rule, category theory is not a costume for the project but a source of sharp, falsifiable questions about the grammar we are building.

7. Relation to existing research

It is a recurring and slightly humbling feature of this project that its central moves have been made before, in other vocabularies. Cataloguing the overlaps is not an exercise in deflation; it is how we tell a re-derivation from a discovery, and how we find the prior art worth raiding. We group the relevant traditions and, for each, note both the overlap and what (if anything) our work adds.

7.1 Pattern catalogs and inventive-problem-solving

The idea of a curated catalog of reusable cross-context patterns is Christopher Alexander's, from A Pattern Language (1977), which gave architecture a vocabulary of named, composable design patterns and directly inspired the software "Gang of Four" Design Patterns (1994). Our solution archetypes are, structurally, design patterns generalized past any one domain. But the closest ancestor to the AAR pipeline — to the verbs rather than the nouns — is TRIZ, Genrich Altshuller's theory of inventive problem-solving. TRIZ's core procedure is: take a messy specific problem, abstract it to a general contradiction, retrieve a general solution from a catalog of inventive principles, and map that general solution back down to the specific domain. That is, almost exactly, lift → transport → lower over a catalog of abstractions, complete with a retrieval index (the contradiction matrix) that anchors a problem's abstract shape to candidate solutions. The user arrived at the pipeline independently and over a year before encountering TRIZ; the convergence is itself a small piece of evidence that the abstract-solve-instantiate loop is a natural attractor for anyone trying to systematize cross-domain transfer. What our work adds over TRIZ is narrow but real: an explicit typing of the abstractions (structural versus framed), an explicit negative-knowledge layer (anti-signatures), and an empirical, blinded evaluation of whether the catalog actually helps — a question TRIZ asserts rather than tests.

7.2 General systems theory and cybernetics

One level up from any specific catalog is the mid-century program that sought a single trans-disciplinary vocabulary of patterns. Kenneth Boulding's 1956 essay "General Systems Theory: The Skeleton of Science" (Management Science) proposed exactly an "over-arching language" of constructs arranged across a hierarchy of system complexity, motivated by the over-specialization and mutual unintelligibility of the sciences — a near-verbatim statement of the Encyclopedia's motivation. Von Bertalanffy's general systems theory and the cybernetics of Wiener and Ashby (whose law of requisite variety is itself one of our primes) were the same impulse: hunt for the isomorphisms that recur across physical, biological, and social systems. The honest framing is that the Encyclopedia is an operationalization of Boulding's skeleton — the thing that program kept describing and never built, now buildable because a language model can instantiate and test the patterns at scale. The system-archetypes prime in our catalog is a direct descendant of Senge's organizational system archetypes, themselves a descendant of this lineage; there is a pleasant recursion in the fact that "system archetypes" is both an entry in the catalog and a name for what the catalog is.

7.3 The cognitive science of analogy

If transport is our hardest and most central verb, the cognitive science of analogy is the field that has studied it most carefully, and the overlaps are exact enough to be useful. Dedre Gentner's structure-mapping theory (1983) holds that analogy is the alignment of relational structure between a base and a target, with surface features left behind — the same distinction between structural and surface that runs through our structural/framed work and our retrieval findings. Gentner's group built this into the Structure-Mapping Engine (SME), an algorithm that computes the structural alignment between two represented situations and reports the correspondence — precisely the map verb we proposed in Section 5.5 and do not yet have. Most striking is the MAC/FAC model of analogical retrieval (Forbus, Gentner, and Law, 1995), which found a dissociation we re-derived in embedding space three decades later: human retrieval from memory is driven by surface similarity (the cheap "MAC" stage), while the mapping and evaluation that follow are driven by structural similarity (the expensive "FAC" stage), and people therefore tend to retrieve surface-similar cases that are often relationally useless. Our Layer-1 (topical, surface) retrieval versus Layer-2 (structural, cross-domain) problem is the same dissociation, and the finding that expertise shifts retrieval toward relational structure is a direct hint about why a fine-tuned, catalog-trained retriever might do what an off-the-shelf one cannot. Beyond Gentner: Holyoak and Thagard's multi-constraint theory of analogy, Hofstadter and Sander's Surfaces and Essences (2013) arguing that analogy is the core engine of all cognition, and Fauconnier and Turner's conceptual blending (the systematic combination of mental spaces — our blend verb) round out a literature that has, in effect, been studying our verbs under the heading of analogy for forty years. 

The opportunity is to import their algorithms (SME for mapping) and their empirical findings (MAC/FAC for retrieval) rather than rediscover them.
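The two-stage shape of MAC/FAC can be sketched in a few lines. This is not the published model (which uses content vectors for the first stage and SME for the second); it substitutes set overlap for both stages purely to show how a cheap surface filter can exclude the relationally useful case before the structural stage ever sees it. All case names and features are invented.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mac(query, cases, k):
    """Cheap stage: shortlist the k cases with the most similar surface vocabulary."""
    ranked = sorted(cases,
                    key=lambda name: jaccard(query["surface"], cases[name]["surface"]),
                    reverse=True)
    return ranked[:k]


def fac(query, cases, shortlist):
    """Expensive stage: pick the shortlisted case whose relational structure matches best."""
    return max(shortlist,
               key=lambda name: jaccard(query["relations"], cases[name]["relations"]))


COMMONS = {("shared_stock", "depleted_by", "individual_use"),
           ("individual_gain", "exceeds", "individual_cost")}
query = {"surface": {"fish", "boats", "ocean", "quota"}, "relations": COMMONS}
cases = {
    "aquarium_trade": {"surface": {"fish", "ocean", "tanks", "export"},
                       "relations": {("supplier", "sells_to", "buyer")}},
    "harbor_logistics": {"surface": {"boats", "cranes", "cargo", "port"},
                         "relations": {("queue", "feeds", "server")}},
    "spectrum_allocation": {"surface": {"radio", "frequency", "license", "interference"},
                            "relations": COMMONS},
}

narrow = mac(query, cases, k=1)
wide = mac(query, cases, k=3)
print(narrow, fac(query, cases, narrow))
print(wide, fac(query, cases, wide))
```

With a narrow shortlist the structural stage can only choose among surface-similar cases; the relational analog is found only when the first stage lets it through, which is the Layer-1 versus Layer-2 problem in miniature.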

7.4 Philosophy: structure, thickness, and conceptual engineering

The structural/framed distinction has the richest philosophical pedigree, and one connection is deep enough to deserve emphasis. Structural realism in the philosophy of science, as argued in John Worrall's "Structural Realism: The Best of Both Worlds?" (1989) and developed in James Ladyman's later ontic structural realism (1998), holds that what survives across scientific revolutions is structure (the mathematical relationships, Fresnel's equations) rather than content or interpretation (the ether that was discarded). That is, almost exactly, our claim that structural primes travel while framed content does not: the project independently re-derived, for cross-domain abstraction transfer, a thesis that philosophers of science have long defended about what persists when scientific theories change. The thick/thin distinction (Bernard Williams) and essentially-contested-concepts (W. B. Gallie) supply the analysis of the framed end (already worked into the companion paper), and the conceptual engineering movement (Herman Cappelen and others) supplies the framing of the whole enterprise: concepts as tools that can be deliberately designed, redesigned, and assessed for whether they do their intended work. Behind all of this stand the ancient problem of universals and Carnap's notion of explication (replacing a vague concept with a precise one that does its work better), which is, arguably, what each well-formed prime is. The addition our work makes is the decomposition operation with its three fates, which gives the thick/thin and structural-realist intuitions a procedure and a set of testable predictions rather than a static distinction.

7.5 Conceptual metaphor

Lakoff and Johnson's Metaphors We Live By (1980) and the conceptual-metaphor tradition describe how source-domain structure is mapped onto a target domain, carrying entailments with it — which is the mechanism by which a framed prime smuggles its frame across a boundary (our "smuggled frame" failure mode). The literature's catalog of conventional metaphors (argument-is-war, time-is-money) is, in our terms, a catalog of habitual transports, and its analysis of which entailments survive a metaphorical mapping and which mislead is directly relevant to specifying transport's failure conditions.

7.6 Structured prompting and LLM reasoning

On the engine side, the project sits squarely in the contemporary literature on eliciting reasoning from language models, and two techniques are close enough to our verbs to be near-identical. Chain-of-thought prompting (Wei et al., 2022) established that making intermediate reasoning explicit improves performance. Step-back prompting (Zheng et al., 2023) improves reasoning by having the model first abstract to high-level concepts and principles before working the specifics — that is our lift verb, isolated and named. Analogical prompting (Yasunaga et al., 2023, "Large Language Models as Analogical Reasoners") has the model self-generate relevant exemplars before solving — that is our transport verb, run against the model's parametric memory rather than an external catalog. Tree-of-thoughts, graph-of-thoughts, least-to-most decomposition, and retrieval-augmented generation are all variations on "give the probabilistic engine a structure to reason within or a store to reason from." The empirical theme that unifies this literature with our ablation is that much of the gain from elaborate scaffolds comes from eliciting the model's latent capability, not from injecting new knowledge, and that retrieval helps most for long-tail or fresh material and least for well-covered topics. Our finding that the catalog-free scaffold matched the catalog-equipped pipeline in a well-covered region is a specific instance of that theme; what we add is the typology of the operations (a named grammar rather than a single trick) and the structural/framed lens on where external knowledge should still help (the framed content unaided reasoning skips).

7.7 Neuro-symbolic integration

Finally, the framing question the user raised: where does this sit relative to neuro-symbolic AI? Henry Kautz's taxonomy from "The Third AI Summer" (2022; AAAI Robert S. Engelmore Memorial Lecture, 2020) lays out six types of integration, from Type 1 (a neural net whose inputs and outputs happen to be symbols) through loosely-coupled pipelines (Type 2), neural-then-symbolic (Type 3, a neural net's output feeding a symbolic reasoner), symbolic knowledge compiled into training (Type 4), a neural engine with built-in symbolic reasoning (Type 5), to a fully integrated Type 6 that does not yet exist. The honest placement of our system is that it fits none of these cleanly, because in all six the symbolic component is a reasoning engine with its own inferential guarantees, and our "symbolic" component is not an engine at all: it is a discipline of representation imposed by a prompt, executed entirely by the neural substrate. There is no solver. The structure is real (typed graphs, explicit gates, named operations) and it demonstrably changes the output, but it carries no soundness guarantee; it is symbolic in shape and neural in substance. This is a category the Kautz taxonomy does not have a slot for, and naming it precisely is the work of the next section.

8. What we have built: a symbolic-shaped discipline of representation

8.1 Structure as scaffold

The cleanest characterization of the AAR pipeline is that it provides the language model with something to hang onto. A probabilistic next-token engine holds a problem as a diffuse cloud of associations; it is good at evoking the relevant considerations and bad at operating on them precisely, because there are no discrete objects in the cloud to grasp. The pipeline manufactures discrete objects: named primes annotated on labeled nodes, a meta-model that can be pointed at, candidate archetypes that can be enumerated and struck through, an explicit gate that has to be walked. None of these objects is computed by a separate engine; they are representations the model is required to lay down and then manipulate. The value is that operating on externalized handles is more reliable than operating in latent space — which is why the disciplined scaffold beat bare chain-of-thought even with no catalog, and why imposing typed-graph structure on the model-building step tightened the reasoning. The structure is a scaffold the probabilistic process climbs, not a solver that replaces it.

8.2 The offloading principle

This is an instance of a more general principle that also explains why language models use calculators, run code, and — in recent vision work — convert image regions into discrete tokens before counting them. A probabilistic substrate is unreliable at operations that demand exactness and discreteness: counting, multi-step arithmetic, precise spatial enumeration, sound logical inference. The remedy in every case is the same: externalize the operation onto a discrete, manipulable representation the substrate can offload to. Counting fails on a blurry holistic impression and succeeds when you draw boxes and count tokens, because counting is an exact discrete operation and you have given it discrete objects to count. Program-aided approaches offload arithmetic to an interpreter; tool-use offloads lookups to APIs. The AAR pipeline performs the same move for reasoning about abstractions: abstractions held as a probabilistic haze are hard to compose, decompose, and transport, but abstractions externalized as labeled handles — things you can strip, point at, gate, and re-instantiate — become manipulable. "The model needs something to grasp" is not a metaphor; it is the literal requirement that operations the substrate performs unreliably be given discrete external objects to operate on.

Seen this way, the calculus of abstraction is the offloading principle applied to its hardest target. Counting offloads to integers; arithmetic offloads to a calculator; logic offloads to a prover; abstraction offloads to a structured catalog plus a typed model plus an explicit grammar of operations. The lexicon supplies the discrete objects; the grammar supplies the operations; and the externalization is what lets the probabilistic engine do reliably what it cannot do reliably in its head.

8.3 Legibility without soundness, and the ceiling argument

It is essential to be precise about what the scaffold buys and what it does not. It buys legibility (the reasoning is externalized and inspectable), decomposition (a hard problem is broken into operations each easier than the whole), discipline (operations that unaided reasoning skips — the negative check, the abstraction step — are forced), and elicitation (latent capability is drawn out). It does not buy soundness. Every operation in the current pipeline is a probabilistic judgment; the gate is the model's opinion that no anti-signature is tripped, not a proof; the lift is the model's claim to have abstracted faithfully, not a verified one. The scaffold makes the steps visible, which makes errors catchable by a human or a downstream check, but it does not make the steps correct by construction.

This is the precise place to engage the critique that probabilistic systems are intrinsically limited — that, being driven by likelihood rather than proof, a language model can never establish that something is true or exhibit a verifiable derivation of how it got there, and that this is why apparent gains in machine reasoning may be tapering. The kernel is right: next-token prediction does not yield sound derivations, and a system whose only resource is the probabilistic substrate has a real ceiling for rigorous reasoning. But the strong form — probabilistic, therefore can never prove — is too quick, because it ignores composition with a verifier. A probabilistic proposer coupled to a sound checker produces results that are proven: the model guesses a candidate, the checker certifies or rejects it, and over the loop the system's outputs inherit the checker's soundness even though the proposer remains a guesser. This is how formal-mathematics systems built on proof assistants operate, and how code-execution loops turn an unreliable code-writer into a reliable one. The critique, in other words, does not refute the project; it identifies the project's natural endgame. The scaffold gives us legibility now; the way past the ceiling is to make the legible steps checkable.
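The proposer-plus-checker argument can be shown with a deliberately trivial stand-in: a proposer that guesses blindly and a checker that is sound by construction. Factoring is used only because its checker is one line; the function names are illustrative and nothing here resembles the project's pipeline.

```python
import random


def propose(n, rng):
    """Unreliable proposer: a blind guess at a nontrivial factor of n."""
    return rng.randrange(2, n)


def check(n, d):
    """Sound checker: accepts d only if it really is a nontrivial factor of n."""
    return 1 < d < n and n % d == 0


def certified_factor(n, rng, max_tries=1000):
    """Loop the proposer against the checker; anything returned has been verified."""
    for _ in range(max_tries):
        guess = propose(n, rng)
        if check(n, guess):
            return guess
    return None  # no certificate found; the loop never returns an unchecked guess


factor = certified_factor(91, random.Random(0))
print(factor)
print(certified_factor(97, random.Random(0)))  # 97 is prime, so nothing can pass the check
```

The guarantee comes entirely from `check`: the proposer can be arbitrarily bad and the output is still either a verified factor or an explicit failure. What the proposer's quality changes is how often, and how quickly, the loop succeeds.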

8.4 The frontier: scaffold becomes verifier, one verb at a time

The path from "discipline of representation" toward something with genuine guarantees runs through the verbs, because several of them have a latent checkable core that the current pipeline leaves unexploited. Lift can be witness-checked: every entity and relation in the meta-model must have a witness in the concrete model, and the meta-model must contain no domain terms — both mechanically verifiable. Compose can be referential-integrity-checked: every edge must reference declared nodes, and every operative prime must appear in the model — we already imposed this by hand and it helped. Evaluate-fit is the highest-value target: an archetype's anti-signatures, currently prose the model nods at, could be encoded as explicit conditions evaluated against the typed model, converting the gate from opinion into rule, and turning the catalog's negative knowledge from advisory into binding. Salience can be ablation-checked: remove each operative prime and confirm the recommendation changes. Decompose is verified by transport: an extracted structural core counts as a real new prime (fate two) only if it demonstrably projects into other domains, which is a runnable test. And map — the mapping sub-operation of transport — could be handed to an actual structure-mapping algorithm (Gentner's SME) operating over the typed model, replacing a probabilistic alignment with a computed one that reports where the correspondence breaks.
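Two of these checks are simple enough to sketch directly: the witness check on lift and the referential-integrity check on compose. The data shapes below (a meta-model as a mapping from abstract node to its concrete witness, edges as triples, primes annotated per node) are assumptions made for the example, not the pipeline's actual typed-model format, and the substring test for domain terms is a crude placeholder.

```python
def witness_check(meta_nodes, meta_edges, concrete_nodes, concrete_edges, domain_terms):
    """Lift check: each meta entity and relation has a concrete witness; no domain terms leak."""
    errors = []
    for name, witness in meta_nodes.items():
        if witness not in concrete_nodes:
            errors.append(f"{name}: witness {witness!r} not in concrete model")
        if any(term in name.lower() for term in domain_terms):
            errors.append(f"{name}: meta-model label contains a domain term")
    witnessed_pairs = {(src, dst) for src, _, dst in concrete_edges}
    for src, rel, dst in meta_edges:
        if (meta_nodes.get(src), meta_nodes.get(dst)) not in witnessed_pairs:
            errors.append(f"{src} -{rel}-> {dst}: no concrete relation witnesses it")
    return errors


def referential_integrity(node_primes, edges, operative_primes):
    """Compose check: edges reference declared nodes; operative primes appear in the model."""
    errors = [f"edge {edge}: undeclared node"
              for edge in edges if edge[0] not in node_primes or edge[2] not in node_primes]
    annotated = set().union(*node_primes.values()) if node_primes else set()
    errors += [f"operative prime {p!r} is not annotated on any node"
               for p in operative_primes if p not in annotated]
    return errors


# Hypothetical concrete model and two candidate lifts of it.
concrete_nodes = {"tribunal", "hearing", "dismissal"}
concrete_edges = [("tribunal", "conducts", "hearing"), ("hearing", "precedes", "dismissal")]
terms = {"tribunal", "hearing", "dismissal"}
edge = [("staged_check", "gates", "irreversible_action")]

good_lift = {"staged_check": "hearing", "irreversible_action": "dismissal"}
bad_lift = {"hearing_stage": "hearing", "irreversible_action": "execution"}
bad_edge = [("hearing_stage", "gates", "irreversible_action")]

print(witness_check(good_lift, edge, concrete_nodes, concrete_edges, terms))
print(witness_check(bad_lift, bad_edge, concrete_nodes, concrete_edges, terms))

model_nodes = {"staged_check": {"due_process"}, "irreversible_action": set()}
model_edges = edge + [("staged_check", "records", "audit_log")]
print(referential_integrity(model_nodes, model_edges, ["due_process", "requisite_variety"]))
```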

Each of these is a place where a real symbolic component — a type-checker, a constraint evaluator, a graph-validator, an alignment algorithm — could be plugged into a specific verb, turning that verb from a prompted disposition into a checked operation. Doing so would move the system, verb by verb, from "symbolic in shape only" toward the Kautz taxonomy's genuinely integrated types, but built incrementally and where it pays rather than as a monolithic architecture. The design principle is to keep the probabilistic engine as the proposer for every verb — it remains far better than any symbolic component at evoking the relevant abstractions and drafting the model — and to add symbolic checkers at the verbs whose failures are both common and mechanically detectable. That is a concrete, buildable research program, and it is the literal operationalization of "structure as scaffold becoming structure as verifier."

9. Beyond the current pipeline

9.1 The pipeline is one sentence in the grammar

Once the verbs are first-class, the nine-step pipeline is revealed as a single fixed sentence built from them — one composition, in one order, applied uniformly to every problem. There is no reason to think it is the best one, and good reason to think it is not: it lifts and transports before it has checked whether the problem is even in a region the catalog covers; it runs the same nine steps on a problem that needs only two; it cannot revisit an earlier verb when a later one fails. The grammar admits many other sentences, and several alternative architectures follow directly from taking the verbs seriously.

9.2 Alternative architectures

A verb-driven planner. Rather than executing a fixed sequence, an agent could be given the verbs as available operations and choose which to apply next based on the state of the reasoning — lift when the problem is too concrete to match, transport when an abstraction is in hand, factor when a model is too tangled, gate when a candidate is on the table, and stop when a recommendation survives. This turns the pipeline from a script into a search over verb compositions, with the model as the policy that selects operations. It is more flexible and more failure-tolerant (a failed verb can be retried or routed around), at the cost of needing a notion of the reasoning state the planner acts on — which the typed model already partly supplies.

[Update, 2026-05: built and tested — null at the frontier.] This planner was implemented (a structured-state policy choosing verbs, in fixed-pipeline / free / enforced-discipline variants) and graded blind against the fixed pipeline across five scenarios. The conditions tied (≈50/60), a coverage-free variant matched the disciplined one, and the apparent "self-discipline" of the free planner turned out to be an artifact of the verb library's salience rather than policy. Control structure did not carry value for a frontier model on within-reach problems. See retrospective-testing-runtime-abstraction-scaffolding.md §5–§6.
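For readers who want the shape of the planner idea in code, the loop below selects a verb from the current reasoning state instead of following a fixed order. It is a schematic with a hand-written rule table standing in for the model-as-policy and stub verbs standing in for prompted operations; it is not the implementation that was built and evaluated, and every name in it is hypothetical.

```python
def choose_verb(state):
    """Stand-in policy: a fixed rule table where the text envisions a model choosing."""
    if state["recommendation"] is not None or state["exhausted"]:
        return "stop"
    if not state["lifted"]:
        return "lift"
    if state["tangled"]:
        return "factor"
    if state["candidate"] is None:
        return "transport"
    return "gate"


def apply_verb(verb, state, archetypes, passes_gate):
    """Stub verbs: each one only updates the bookkeeping a real verb would produce."""
    if verb == "lift":
        state["lifted"] = True
    elif verb == "factor":
        state["tangled"] = False
    elif verb == "transport":
        if archetypes:
            state["candidate"] = archetypes.pop(0)
        else:
            state["exhausted"] = True
    elif verb == "gate":
        if passes_gate(state["candidate"]):
            state["recommendation"] = state["candidate"]
        state["candidate"] = None  # a rejected candidate is routed around


def plan(archetypes, passes_gate, tangled=False, max_steps=20):
    state = {"lifted": False, "tangled": tangled, "candidate": None,
             "recommendation": None, "exhausted": False}
    remaining = list(archetypes)
    trace = []
    for _ in range(max_steps):
        verb = choose_verb(state)
        trace.append(verb)
        if verb == "stop":
            break
        apply_verb(verb, state, remaining, passes_gate)
    return state["recommendation"], trace


print(plan(["archetype_a", "archetype_b"], lambda a: a == "archetype_b", tangled=True))
```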

A verifier-augmented pipeline. The scaffold-to-verifier move of Section 8.4, realized: keep the sequence but insert hard checks after the verbs that have them (witness-check after lift, referential-integrity after compose, encoded anti-signatures at the gate), and have failures trigger a retry of the offending verb rather than propagating downstream. This is the lowest-risk, highest-immediate-value architecture, because each check is independently useful and the whole degrades gracefully.
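A minimal sketch of the retry behavior, assuming each verb is paired with a check that returns a list of errors. The verbs here are stubs (the lift stub leaks a domain term on its first attempt and not on its second), the anti-signature is a single invented rule, and `run_checked` is a hypothetical driver, not an existing component.

```python
def run_checked(steps, state, max_retries=2):
    """Run (name, verb, check) steps in order; retry a verb whose output fails its check."""
    log = []
    for name, verb, check in steps:
        for attempt in range(1 + max_retries):
            candidate = verb(state, attempt)
            errors = check(candidate)
            log.append((name, attempt, not errors))
            if not errors:
                state = candidate  # only checked output flows downstream
                break
        else:
            return state, log, name  # this verb exhausted its retries
    return state, log, None


DOMAIN_TERMS = {"tribunal", "hearing"}


def lift(state, attempt):
    # Stub proposer: the first draft leaks a domain term, the retry does not.
    labels = ["hearing_stage", "irreversible_action"] if attempt == 0 \
        else ["staged_check", "irreversible_action"]
    return {**state, "meta_labels": labels}


def check_lift(state):
    return [label for label in state["meta_labels"]
            if any(term in label for term in DOMAIN_TERMS)]


def gate(state, attempt):
    return {**state, "recommendation": "staged_review"}


def check_gate(state):
    # Encoded anti-signature (invented): do not recommend staged review for reversible actions.
    return ["anti-signature: action is reversible"] if state["facts"]["reversible"] else []


STEPS = [("lift", lift, check_lift), ("gate", gate, check_gate)]
print(run_checked(STEPS, {"facts": {"reversible": False}}))
print(run_checked(STEPS, {"facts": {"reversible": True}}))
```

In the second run the gate fails every attempt, so the driver stops and names the offending verb instead of passing an unchecked recommendation downstream.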

A redesigned retrieval layer. Our experiments make the retrieval redesign nearly a foregone conclusion. Replace lexical search with semantic search; query with the meta-model rather than the raw problem; embed the structural signatures rather than the prose; return neighborhoods rather than single best matches (because the generic structural primes are identifiable only as clusters); make retrieval type-aware (a framed-prime query and a structural-prime query want different treatment); and, for the oblique real-world problems that are the hard case, decompose the query into facets and retrieve against each — a project-then-transport composite — rather than blending everything into one vector that lands in the wrong neighborhood. Representing the catalog as a relational structure (an olog) rather than a list is the substrate that makes neighborhood-aware, type-aware retrieval natural.
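Two of these changes, neighborhoods instead of single hits and type-aware treatment, fit in a short sketch. The two-dimensional vectors are hand-picked stand-ins for embeddings of structural signatures, the prime names and the `margin` parameter are illustrative, and a real version would take its vectors from whatever embedding model the project uses.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def retrieve(query_vec, catalog, kind, margin=0.05):
    """Type-aware retrieval: a cluster for structural queries, one best match for framed ones."""
    scored = sorted(
        ((cosine(query_vec, vec), name)
         for name, (vec, prime_kind) in catalog.items() if prime_kind == kind),
        reverse=True,
    )
    if not scored:
        return []
    if kind == "framed":
        return [scored[0][1]]
    best = scored[0][0]
    # Structural primes are returned as a neighborhood: everything within `margin` of the best.
    return [name for score, name in scored if best - score <= margin]


# Toy catalog: name -> (stand-in embedding, prime type).
CATALOG = {
    "feedback_loop": ((1.0, 0.1), "structural"),
    "regulation":    ((0.95, 0.2), "structural"),
    "homeostasis":   ((0.9, 0.3), "structural"),
    "queueing":      ((0.1, 1.0), "structural"),
    "due_process":   ((0.0, 1.0), "framed"),
    "sovereignty":   ((0.5, 0.5), "framed"),
}

print(retrieve((1.0, 0.15), CATALOG, "structural"))
print(retrieve((0.1, 1.0), CATALOG, "framed"))
```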

A catalog represented as a category. Following Section 6, the lexicon itself could be restructured so that primes are nodes in an explicit relational graph — near-synonym links, dual pairs, subsumption arrows, source/related-prime edges to archetypes — rather than text records joined by cross-reference strings. This directly serves retrieval (the neighborhood is represented, not inferred), serves the grammar (subsumption arrows are what generalize/specialize traverse; dual links are what dualize traverses), and serves the Yoneda-style insight that a prime is its relationships. Spivak's ologs are the existence proof and the template.
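As a sketch of how verbs would traverse such a graph: generalize and specialize walk subsumption arrows in opposite directions, and dualize follows dual links in either direction. The commons_governance variants are the family named earlier; treating them as subsumption children is an assumption of this example, and the dual pair is invented.

```python
# Typed edges: (source, relation, target).
EDGES = [
    ("commons_governance", "subsumes", "common_pool"),
    ("commons_governance", "subsumes", "congestion_commons"),
    ("commons_governance", "subsumes", "maintenance_commons"),
    ("commons_governance", "subsumes", "pollution_sink"),
    ("exploration", "dual_of", "exploitation"),  # hypothetical dual pair
]


def specialize(prime):
    """Follow subsumption arrows downward to the more specific variants."""
    return sorted(dst for src, rel, dst in EDGES if rel == "subsumes" and src == prime)


def generalize(prime):
    """Follow subsumption arrows upward to the subsuming primes."""
    return sorted(src for src, rel, dst in EDGES if rel == "subsumes" and dst == prime)


def dualize(prime):
    """Dual links are symmetric, so look in both directions."""
    forward = [dst for src, rel, dst in EDGES if rel == "dual_of" and src == prime]
    backward = [src for src, rel, dst in EDGES if rel == "dual_of" and dst == prime]
    return sorted(forward + backward)


print(specialize("commons_governance"))
print(generalize("pollution_sink"))
print(dualize("exploitation"))
```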

Synthetic-data training: the project's origin, revisited. The Encyclopedia was conceived not as a runtime lookup but as a generator of synthetic training data to make a model better at detecting and transporting abstractions, especially in messy situations. Our experiments give that original vision its strongest support yet, by identifying exactly the gap that training would fill. Off-the-shelf embeddings cannot do Layer-2 cross-domain retrieval (the structural match to a surface-distant abstraction), and MAC/FAC tells us why and points at the cure: expertise shifts retrieval from surface to structural similarity. A retriever fine-tuned on (far-domain-instance ↔ prime) pairs generated by the catalog — exactly the instances our cross-domain sweep produced — would be acquiring that expertise. More broadly, the catalog can generate training data for the verbs themselves: lift exercises (instance → meta-model), transport exercises (instance in domain A → prime → instance in domain B), gate exercises (situation + archetype → which anti-signatures trip). The thing the catalog is uniquely good for may be teaching the grammar, not executing it — which would explain why, at runtime in a well-covered region, a model that already has the grammar internalized (a frontier model) gets little from the catalog, while a model being trained could get a great deal.
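The data-generation idea can be sketched as two small generators over a table of per-domain instances: one emitting (far-domain instance, prime) pairs for retriever fine-tuning, the other emitting transport exercises between every ordered pair of domains. The record fields and the instance descriptions are placeholders invented for the example; the catalog's real instance format is not specified here.

```python
from itertools import permutations

# prime -> {domain: instance description}; descriptions are placeholders.
INSTANCES = {
    "commons_governance": {
        "fisheries": "boats race to land a shared stock before rivals do",
        "spectrum": "transmitters crowd a band until nobody gets a clean signal",
    },
    "due_process": {
        "law": "a sanction requires staged hearings and a recorded rationale",
        "immunology": "cell killing requires staged checkpoints before it proceeds",
    },
}


def retriever_pairs(instances):
    """(instance text, prime) positives for fine-tuning a structural retriever."""
    for prime, by_domain in instances.items():
        for domain, text in by_domain.items():
            yield {"query": text, "positive": prime, "domain": domain}


def transport_exercises(instances):
    """instance in domain A -> prime -> instance in domain B, for every ordered domain pair."""
    for prime, by_domain in instances.items():
        for (dom_a, text_a), (dom_b, text_b) in permutations(by_domain.items(), 2):
            yield {"source_domain": dom_a, "source": text_a, "prime": prime,
                   "target_domain": dom_b, "target": text_b}


pairs = list(retriever_pairs(INSTANCES))
exercises = list(transport_exercises(INSTANCES))
print(len(pairs), len(exercises))
```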

An executable calculus / functorial semantics. The north star, years off if reachable: a representation in which the verbs are actual operations with stated laws, the catalog is a category, and the join between grammar and lexicon is a structure-preserving map in the DisCoCat spirit — so that the "meaning" of a reasoning episode is computed compositionally along the grammar from the meanings of its primes. This is the strong sense of "calculus of abstraction," and we name it not because it is near but because it tells us what direction the incremental work points.

9.3 The open problems these architectures inherit

None of the above dissolves the two hardest problems the experiments surfaced. The obliqueness problem (real problems instantiate their key abstractions latently, among many others, rather than as a clean single theme, and retrievability collapses with obliqueness) is not solved by better embeddings or a planner; it is a property of how abstractions hide in real situations, and addressing it may require the facet-decomposition retrieval above plus a model genuinely trained to surface latent structure. The Layer-2 retrieval problem, structural match across a surface gap, remains open off-the-shelf and is the clearest case for catalog-trained fine-tuning. A third problem, verb completeness and typing (do we have the right verbs, and do we know how each behaves on structural versus framed inputs?), is the open theoretical core of the grammar, addressable only by the slow work of enumerating, testing, and repairing. These are not reasons for pessimism; they are the actual research, stated so it can be worked.

10. A research agenda

The reframe converts a single sprawling ambition ("build the Encyclopedia") into a set of separable, testable workstreams. We order them by a rough product of value and tractability.

Near-term, high-tractability.

Rebuild retrieval as semantic + meta-model + neighborhood. This is the most actionable finding of the whole series and the one the project should do next regardless of the theory. Replace the lexical MCP search with semantic search over the catalog; query with the meta-model, not the raw problem; embed structural signatures; return neighborhoods, not single hits; make it type-aware (structural-prime queries want a cluster, framed-prime queries want the distinctive exact match). The diagnostic harness already built (scripts/semantic_catalog.py, with cached bge-small embeddings) is the prototype; the work is to wire it into the pipeline's match and transport verbs and to re-run the forward and ablation experiments under the new retriever to see how much of the catalog's previously-unretrieved value it recovers. We expect a modest but real improvement in-distribution and a larger one on the framed-content cases.
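
The retrieval behavior described above (query with the meta-model, rank by similarity over structural signatures, return a neighborhood for structural primes and the exact hit for framed ones) can be sketched as follows. This is not the contents of scripts/semantic_catalog.py: the `CATALOG` entries and their signature texts are invented, and a bag-of-words vector stands in for the bge-small embeddings so the sketch stays dependency-free.

```python
import math
from collections import Counter

# Hypothetical entries: name -> (kind, structural signature, near-synonyms).
CATALOG = {
    "equilibrium": ("structural", "opposing forces balance so state stays constant",
                    ["homeostasis", "steady-state"]),
    "homeostasis": ("structural", "feedback restores a regulated variable to set point",
                    ["equilibrium"]),
    "steady-state": ("structural", "inflow equals outflow so stock stays constant",
                     ["equilibrium"]),
    "sovereignty": ("framed", "supreme authority over a territory recognized by peers",
                    ["authority"]),
    "authority": ("framed", "recognized right to issue binding commands",
                  ["sovereignty"]),
}


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding: word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(meta_model: str) -> list:
    """Rank signatures against the meta-model; the return shape depends on kind."""
    q = embed(meta_model)
    ranked = sorted(CATALOG, key=lambda n: cosine(q, embed(CATALOG[n][1])),
                    reverse=True)
    top = ranked[0]
    kind, _, neighbors = CATALOG[top]
    # Structural primes come back as a cluster, framed primes as the exact match.
    return [top] + neighbors if kind == "structural" else [top]
```

For example, `retrieve("inflow equals outflow and the stock stays constant")` returns the steady-state hit together with its near-synonym, while `retrieve("supreme authority over a territory")` returns only the sovereignty entry.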

Represent near-synonym clusters in the lexicon. Link or partially merge the structural near-synonym families (recursion/iteration/self-similarity; symmetry/invariance/conservation; equilibrium/steady-state/homeostasis) and the dense framed pairs (authority/sovereignty) so the lexicon's geometry is explicit. This directly addresses the distinctiveness finding and makes neighborhood retrieval well-defined.

Encode anti-signatures as checkable conditions. Start the scaffold-to-verifier program at the highest-value verb: turn the gate from prose the model nods at into conditions evaluated against the typed model. Even partial coverage converts the catalog's negative knowledge from advisory to binding.
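
One way to read "conditions evaluated against the typed model" is as named predicates over the model's fields. The sketch below assumes a typed model that is a plain dictionary; the caching archetype, its two anti-signatures, and the field names `write_ratio` and `max_staleness_s` are hypothetical examples, not catalog content.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AntiSignature:
    name: str
    trips: Callable[[dict], bool]  # True means "do not apply the archetype"


# Hypothetical archetype: its anti-signatures and model fields are assumptions.
CACHE_ARCHETYPE = [
    AntiSignature("writes dominate reads", lambda m: m["write_ratio"] > 0.5),
    AntiSignature("staleness is intolerable", lambda m: m["max_staleness_s"] == 0),
]


def gate(model: dict, anti_signatures: list) -> list:
    """Return the names of tripped anti-signatures; an empty list passes the gate."""
    return [a.name for a in anti_signatures if a.trips(model)]
```

The point of the design is that the gate's verdict no longer depends on the model agreeing with prose: `gate({"write_ratio": 0.8, "max_staleness_s": 30}, CACHE_ARCHETYPE)` names the tripped condition, and the pipeline can refuse to proceed.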

Medium-term, higher-value.

Run the experiments the current ones could not. Three are outstanding. First, re-run the forward recognition test target-blind (or fully automated), to remove the operator-contamination confound that is the chief weakness of the one catalog-favorable result. Second, run the proper framed-vs-structural cross-domain experiment with more primes and instances and, crucially, with the recognition metric (does the catalog supply structure a blind reasoner misses?) rather than only the retrieval-rank metric, to separate "the catalog helps you find the pattern" from "the catalog helps you see the pattern." Third, re-run the §9.2 ablation under semantic retrieval, to test whether better retrieval changes the protocol-versus-catalog verdict.

Enumerate and type the verbs. Take the verb list of Section 5 as a draft and work it: for each verb, state its inputs and outputs, its behavior on structural versus framed inputs, its failure modes, and its checkable core. Mine TRIZ, Polya, and the analogy literature for verbs we are missing, and grow the grammar by repairing the pipeline's observed failures. This is the theoretical core and it is done by careful description plus targeted experiment, not by speculation.
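
A first step toward typing the verbs could be as small as recording each verb's input and output type and checking that a sequence of verbs composes. In the sketch below only the lift signature (instance to meta-model) is taken from this paper; the signatures given for match and lower, and all type names, are placeholder assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verb:
    name: str
    takes: str  # input type
    gives: str  # output type


# Only lift's signature comes from the text; the other two are assumptions.
VERBS = {v.name: v for v in [
    Verb("lift", "instance", "meta_model"),
    Verb("match", "meta_model", "prime"),
    Verb("lower", "prime", "instance"),
]}


def well_formed(sentence: list) -> tuple:
    """Check that each verb's output type feeds the next verb's input type."""
    for prev, nxt in zip(sentence, sentence[1:]):
        a, b = VERBS[prev], VERBS[nxt]
        if a.gives != b.takes:
            return False, f"{prev} gives {a.gives}, but {nxt} takes {b.takes}"
    return True, "ok"
```

Under these assumed signatures, `well_formed(["lift", "match", "lower"])` passes and `well_formed(["lift", "lower"])` reports the type mismatch. A fuller specification would add each verb's behavior on structural versus framed inputs and its failure modes as further fields.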

Import the analogy algorithms. Implement a structure-mapping step (in the spirit of Gentner's SME) over the typed model as the map verb, giving transport a computed alignment that reports where it breaks, rather than a probabilistic one. Use the MAC/FAC dissociation as the design rationale for a two-stage retrieve-then-align transport.
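
The two-stage retrieve-then-align idea can be sketched in a few lines. This is a drastic simplification and not SME: stage one scores candidates by overlap of predicate names (a cheap, surface-level filter in the spirit of MAC), and stage two brute-forces an entity mapping and reports the relations that fail to carry over (a stand-in for the structural alignment of FAC). The case representation as `(predicate, arg1, arg2)` triples, the function names, and the campfire distractor are assumptions for the example.

```python
from collections import Counter
from itertools import permutations

# Cases are lists of (predicate, arg1, arg2) triples; this encoding is an assumption.
MEMORY = {
    "solar_system": [("attracts", "sun", "planet"),
                     ("revolves_around", "planet", "sun"),
                     ("hotter_than", "sun", "planet")],
    "campfire": [("hotter_than", "fire", "log"),
                 ("consumes", "fire", "log")],
}
ATOM = [("attracts", "nucleus", "electron"),
        ("revolves_around", "electron", "nucleus")]


def entities(case):
    return sorted({x for _, a, b in case for x in (a, b)})


def mac_score(probe, case):
    """Stage 1: cheap overlap of predicate names, ignoring structure."""
    p, c = Counter(r[0] for r in probe), Counter(r[0] for r in case)
    return sum(p[k] * c[k] for k in p)


def fac_align(base, target):
    """Stage 2: best entity mapping, plus the base relations that do not carry over."""
    target = set(target)
    be, te = entities(base), entities(target)
    best = (-1, {}, list(base))
    for perm in permutations(te, min(len(be), len(te))):
        mapping = dict(zip(be, perm))
        broken = [(p, a, b) for p, a, b in base
                  if (p, mapping.get(a), mapping.get(b)) not in target]
        score = len(base) - len(broken)
        if score > best[0]:
            best = (score, mapping, broken)
    return best


def retrieve_then_align(probe, memory, k=1):
    top = sorted(memory, key=lambda n: mac_score(probe, memory[n]), reverse=True)[:k]
    return {name: fac_align(memory[name], probe) for name in top}
```

Running `retrieve_then_align(ATOM, MEMORY)` selects the solar-system case, maps sun to nucleus and planet to electron, and lists the hotter-than relation as the place where the transport breaks. The brute-force search is exponential in the number of entities and is only meant to make the output shape concrete.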

Long-term, north-star.

Train the grammar rather than only execute it. Use the catalog as the synthetic-data generator it was originally conceived to be: generate lift, transport, and gate exercises and (far-domain-instance ↔ prime) retrieval pairs, and fine-tune a retriever and possibly a reasoner toward the structural, expertise-driven retrieval that off-the-shelf models lack. This is the most direct route at the Layer-2 problem and the original vision's natural payoff.

Represent the catalog as a category and pursue functorial semantics. Restructure the lexicon as an olog; investigate whether the lift/lower pair is genuinely an adjunction; and hold the DisCoCat-style functorial join of grammar and lexicon as the formal endpoint. Govern this work by the rule of Section 6.9: a categorical name earns its place only when it comes with a stated law and a test.
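
To make "a stated law and a test" concrete, here is what such a pair could look like in the simplest setting. The sketch does not formalize this paper's lift and lower, and it does not use the free/forgetful template; it substitutes a poset-level relative of an adjunction, the Galois connection between sets of instances and sets of features familiar from formal concept analysis. The toy context `HAS` and its feature names are invented. The law is: A ⊆ lower(B) if and only if B ⊆ lift(A).

```python
from itertools import chain, combinations

# Hypothetical toy context: which instances exhibit which features.
HAS = {
    "thermostat": {"feedback", "set_point"},
    "blood_sugar": {"feedback", "set_point", "biological"},
    "market_price": {"feedback"},
}
FEATURES = set().union(*HAS.values())


def lift(instances):
    """Features shared by every given instance (all features for the empty set)."""
    if not instances:
        return set(FEATURES)
    return set.intersection(*(HAS[i] for i in instances))


def lower(features):
    """Instances that exhibit every given feature."""
    return {i for i, f in HAS.items() if set(features) <= f}


def powerset(xs):
    xs = list(xs)
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))


def adjunction_law_holds():
    """Exhaustively test: A <= lower(B) if and only if B <= lift(A)."""
    return all((set(A) <= lower(B)) == (set(B) <= lift(A))
               for A in powerset(HAS) for B in powerset(FEATURES))
```

On this toy context the law holds for every pair of subsets, and `lower(lift({"thermostat"}))` returns the thermostat together with blood sugar, the closure of the original instance under its own abstraction. Whether the pipeline's lift and lower satisfy any comparable law is exactly the open question; the sketch only shows the form an answer would have to take.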

A note on sequencing with the immediate work: the retrieval rebuild is both the most useful near-term engineering and the cleanest test of how much our earlier catalog-pessimism was really retrieval-pessimism in disguise. It should come first, with the structural/framed-aware design and the neighborhood-return behavior built in, after which the experiments are worth re-running on the improved substrate.

11. Limitations and honest caveats

The case made in this paper rests on a thin evidentiary base and a fresh, partly self-generated framing, and the reader should weight it accordingly.

The experiments are small. Most are a single run with a single blinded grader and a single model class for both the conditions and the grader, which leaves shared systematic biases unmeasured. The one result most favorable to the catalog — the cross-domain recognition test — carries an operator-contamination confound we could not eliminate, because the same operator who knew the target ran the catalog condition. The rubric in the comparison experiments overlaps with what the catalog explicitly enforces, biasing toward the pipeline on three of six dimensions. And the scenarios were, in several cases, constructed by the very system being tested, with the attendant risk that they were unconsciously shaped to fit. We have tried to control these (blinded grading, target-blind sub-agents, independent plausibility screens, format normalization), and we have flagged each where it bites, but the honest summary is directional evidence, not proof — which is exactly the status Hubbard's account of measurement licenses, and no more.

The theory took a hit. The structural/framed distinction predicted that framed primes would be harder to retrieve across domains; both sweeps found the opposite, and the governing variable turned out to be neighborhood density rather than framing. We have folded this into the account (the distinction remains predictive for decomposition and for the recognition of frame content, and is orthogonal to retrieval), but a reader should note that one of the project's central theoretical posits made a clear prediction that the data refuted, and treat the surviving claims about the distinction with corresponding caution.

The category-theory section is mostly promissory. With one or two exceptions (the conceptual discipline; ologs as a representation), the categorical connections are analogies we have not formalized and explicitly flag as at risk of being decoration. The adjunction between lift and lower is a hope with a plausible textbook template, not a result. We have stated a rule to keep this honest, but the section should be read as a source of questions, not answers.

The framing is partly self-generated. Several of the paper's central moves originated in dialogue with a language model and were checked against the literature only afterward. We found the components well-precedented and the synthesis less so, but the reader should be alert to the failure mode in which a fluent system and an enthusiastic interlocutor co-produce a framing that is more satisfying than true. The defense we can offer is that the empirical results were frequently surprising and sometimes unwelcome (the catalog underperforming the scaffold; the refuted prediction), which is weak evidence that we were measuring something real rather than confirming a story.

Finally, the central reframe — lexicon versus grammar — is itself a model, to be judged by whether it does useful work, not by whether it is the true description of how abstraction works. Its claim to usefulness is that it converted a stalled "is the catalog valuable?" question into a tractable set of workstreams about specific operations. If, a year from now, the verbs have not been profitably enumerated, typed, or made checkable, the reframe will have been a pleasing metaphor rather than a research program, and should be discarded.

12. Conclusion

The Encyclopedia of Abstractions set out to be a better dictionary of patterns. The experiments reported here suggest it has been quietly becoming something else: not a dictionary but a grammar, a set of operations for manipulating abstractions, with the catalog as the lexicon those operations run over. The evidence for the shift is concrete if small. A catalog-free agent given only the pipeline's operations matched a catalog-equipped one on a hard problem, which says the operations carry the value. The catalog earned its keep in exactly one place: supplying the institutional frame of a transported concept, where unaided reasoning reconstructs the concept's structure but skips its frame. That says the lexicon's distinctive role is narrow and specifiable. And retrieval, not catalog content, proved to be the binding constraint, governed by an abstraction's relational neighborhood rather than by its kind, which says the lexicon should be represented as the relational structure it is.

Naming the operations as verbs does the work that naming usually does: it makes previously invisible questions askable. What is this verb's inverse, its dual, its type signature? Which verbs check which others — decompose verified by transport, salience by ablation, lift by witness? Which sentences in the grammar are well-formed, and is the nine-step pipeline even a good one? Where does a verb's prompted disposition hide a checkable core that would turn scaffold into verifier? Category theory, the one mature grammar of structure-preserving transformation, supplies sharp versions of several of these questions — transport as a composition-preserving functor, lift-and-lower as a possible adjunction, a prime as its Yoneda neighborhood, the catalog as an olog — and the discipline, if held to the rule that a categorical name must come with a law and a test, is a source of falsifiable structure rather than ornament.

We have been careful not to overclaim. The questions are ancient — universals, analogy, the persistence of structure across change — and the components are precedented: pattern catalogs and TRIZ for the lexicon-and-loop, general systems theory for the trans-domain ambition, structure-mapping for transport, structural realism for the structural/framed distinction, step-back and analogical prompting for the verbs, the offloading principle for the whole scaffold. What is less common is the combination and the stance: a large, structured abstraction catalog joined to an explicit, growable grammar of operations, executed by a probabilistic engine and subjected to blinded ablations that genuinely try to falsify its value — and the recognition that the resulting system is neither a classical neuro-symbolic architecture nor mere prompting, but a symbolic-shaped discipline of representation whose frontier is the verb-by-verb conversion of scaffold into verifier.

The deepest point is the one that makes the project newly tractable. Questions that used to be armchair — do abstractions transfer across domains, does an explicit catalog of them help a reasoner, what distinguishes a portable concept from a captive one — are now, because a language model can instantiate a prime in an alien domain on demand and a blinded grader can score whether it was recognized, empirically testable. The instrument is new even though the questions are old. What the Encyclopedia really is, on this reading, is a specimen and a laboratory: the catalog and its grammar are the specimen, the language model is at once the microscope and the subject, and the calculus of abstraction is the thing we are trying to see. The lineage is long; the lab is new; and the next move is not to admire the dictionary but to build, test, and verify the verbs.

Postscript (2026-05-26). The next move was taken: the verbs were enumerated, the lift was made checkable, and the runtime architectures were tested (projects 02–04). By this paper's own §11 standard, the verdict is mixed and worth stating plainly. The grammar is a coherent and useful descriptive contribution — but its runtime value for a frontier model came back null: control structure did not move quality, coverage, or faithfulness, and the lift's framed/structural gate was refuted. The reframe was not a pleasing metaphor — it generated falsifiable workstreams and they were falsified — but the forward-looking value, if any, now appears to lie in the corpus as training signal and as a human curriculum, not in the runtime prompt. The full account, including what the null does and does not license, is the companion retrospective.