Focused Related Work and Prior Art for an Encyclopedia of Abstractions

The closest prior art does not support a broad novelty claim of “abstraction-first prompting helps LLM reasoning at runtime.” That territory is already crowded with step-back, analogical, plan-and-solve, least-to-most, self-discovered, and graph-structured prompting, several of which reported gains on strong frontier models. But the literature is also much friendlier to the project's skeptical framing than those headline gains suggest: CoT-style structure helps mainly on math/symbolic tasks, can hurt badly off that turf, often shrinks with stronger models, and remains only weakly faithful even in modern reasoning systems. The strongest claim that still looks defensible is narrower: a hand-curated, cross-domain prime-abstraction + solution-archetype corpus, coupled to a typed relational/meta-model transport pipeline, and then tested through careful blinded evaluation, appears different from the existing mix of ontologies, prompting tricks, and process-supervision papers. The most serious pre-emption risk lies not in old knowledge graphs, but in newer work on analogical prompting, rationale distillation, multi-domain process supervision, and cross-domain latent adaptation.

Prior catalogs and pattern libraries

Cyc — Lenat and colleagues, long-running project; canonical reference is Lenat & Guha’s 1989 book, with later project documentation. Cyc is a large hand-built commonsense ontology/knowledge base designed for formalized background knowledge and reasoning. Its strong overlap with this project is the ambition to encode reusable, hand-authored conceptual structure. Its weak overlap is equally important: Cyc is organized as a formal ontology/KB for inference, not as a transfer-oriented library of prime abstractions linked to solution archetypes and an explicit source→meta-model→target transport workflow. Inference: Cyc pre-empts the “manual concept engineering” half of the idea, but not the specific “cross-domain abstraction transfer corpus” framing. Flag: none; partial antecedent, not [ALREADY DONE].

ConceptNet, WordNet, SUMO, and Wikidata — Speer et al. 2017, AAAI; Princeton WordNet project; Niles & Pease 2001, FOIS; Vrandečić & Krötzsch 2014, Communications of the ACM. These systems all encode concepts and relations, but each does something different from what the Encyclopedia describes. WordNet is a lexical-semantic network of synsets and relations; ConceptNet is a multilingual commonsense graph linking terms and phrases; SUMO is a formal upper ontology intended to support reasoning; Wikidata is a collaborative linked-data knowledge base. Collectively they are a major prior art family for “concept catalogs,” but they are mostly about entity/relation coverage, lexical organization, or formal ontology. They do not ordinarily package abstractions as portable deep structures for analogical transfer, nor pair them with a sibling library of reusable solution archetypes. Inference: these resources threaten any claim that the project invented “structured conceptual inventories,” but they do not appear to have already built a corpus centered on domain-stripped transfer units like feedback, equilibrium, due process, or irreversibility plus transportable solution patterns. Flag: [WHITESPACE] for the transfer-oriented corpus design.

TRIZ inventive principles — Altshuller tradition; current descriptions emphasize a systematic problem-solving method centered on reusable inventive principles. This is one of the closest non-AI antecedents to the Encyclopedia's “solution archetype” notion. TRIZ explicitly tries to abstract recurring solution patterns from many inventions and reuse them in new domains; the current practitioner literature still frames the 40 principles as a list of known solutions that can inspire solutions elsewhere. The difference is that TRIZ is primarily an engineering contradiction-solving framework, not a general cross-domain library of prime abstractions with typed relational modeling. It is closer to the project's solution archetypes than to the full Encyclopedia of Abstractions. Flag: [ALREADY DONE] for the narrower idea that reusable, cross-case solution principles can be cataloged; not [ALREADY DONE] for the project's fuller prime-abstraction/meta-model transport pipeline.

Software design patterns — Gamma, Helm, Johnson, and Vlissides, 1994/1995, book. Design patterns canonized recurring software solutions as named templates reusable across projects. This is a direct antecedent for the notion of “solution archetypes,” but only within software architecture, and without explicit abstraction mining across heterogeneous domains. It pre-empts the pattern-language flavor of this work, not the broader cross-domain transfer claim. Flag: [ALREADY DONE] for the idea of a named pattern catalog in a single technical domain; not for a cross-domain abstraction encyclopedia.

Systems archetypes — Senge/Meadows lineage, widely operationalized in systems thinking workbooks and essays. This tradition identifies recurring feedback structures and dynamic motifs, such as limits to growth. It is especially relevant because some of the Encyclopedia's example abstractions, such as feedback and equilibrium, sit squarely in this tradition. Systems archetypes are genuinely transfer-oriented across organizations, ecologies, and policy settings, but their scope is feedback-system dynamics rather than general abstraction transport across arbitrary domains. They are probably the strongest historical pre-emption for the feedback/archetype slice of this corpus. Flag: [ALREADY DONE] for a meaningful subspace of recurrent cross-domain dynamic structures; not for the larger corpus design.

Ologs — Spivak & Kent, 2012, PLoS ONE. Ologs are a category-theoretic framework for knowledge representation that are explicitly typed, compositional, and alignable via functors. Among the prior art listed, this is the closest to the project's “typed relational model” and “domain-stripped meta-model” language. The key difference is that ologs provide a formalism for representing and aligning knowledge; they do not provide a hand-curated corpus of prime abstractions or a library of solution archetypes. If novelty is claimed for the typed-relational representation step alone, ologs are a serious pre-emption risk. If the claim is the combination of curation, abstraction inventory, and transfer use-case, the overlap becomes more limited. Flag: [ALREADY DONE] for typed relational modeling/alignment as a formal device; [WHITESPACE] for the curated transfer corpus built on top of it.
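As a rough illustration of what "typed, compositional, and alignable via functors" can mean in practice, the sketch below encodes two tiny typed relational models as types plus labeled arrows and checks whether a proposed mapping between them preserves every arrow's endpoints. The `Olog` and `Arrow` classes, the `is_structure_preserving` helper, and the thermostat/market example are all hypothetical and are not taken from Spivak & Kent. A genuine functor must also respect composition and path equations, which this sketch omits.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Arrow:
    """A labeled, typed relation: source --label--> target."""
    label: str
    source: str
    target: str


@dataclass
class Olog:
    """A toy typed relational model (hypothetical, heavily simplified)."""
    types: set[str]
    arrows: set[Arrow] = field(default_factory=set)


def is_structure_preserving(src: Olog, dst: Olog,
                            type_map: dict[str, str],
                            arrow_map: dict[str, str]) -> bool:
    """True if every src arrow maps to a dst arrow with matching mapped endpoints."""
    dst_arrows = {a.label: a for a in dst.arrows}
    for arrow in src.arrows:
        image = dst_arrows.get(arrow_map.get(arrow.label, ""))
        if image is None:
            return False  # the arrow has no image in the target model
        if type_map.get(arrow.source) != image.source:
            return False  # source type is not carried to the image's source
        if type_map.get(arrow.target) != image.target:
            return False  # target type is not carried to the image's target
    return True


# Two negative-feedback loops from different domains (illustrative only).
thermostat = Olog(
    types={"heater output", "temperature"},
    arrows={Arrow("raises", "heater output", "temperature"),
            Arrow("suppresses", "temperature", "heater output")},
)
market = Olog(
    types={"price", "supply"},
    arrows={Arrow("stimulates", "price", "supply"),
            Arrow("depresses", "supply", "price")},
)

arrow_map = {"raises": "stimulates", "suppresses": "depresses"}
good_types = {"heater output": "price", "temperature": "supply"}
bad_types = {"heater output": "supply", "temperature": "price"}

aligned = is_structure_preserving(thermostat, market, good_types, arrow_map)
misaligned = is_structure_preserving(thermostat, market, bad_types, arrow_map)
```

Here `aligned` is true and `misaligned` is false, because swapping the type assignment breaks the endpoint match. The point of the sketch is that the formalism supplies the alignment check, while the choice of which models to write down remains a curation problem.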

Analogy databases and analogy benchmarks — Ichien et al. 2020 on verbal analogy problem inventories; AnaloBench 2024. These resources gather analogy problems or analogy-rich examples for evaluation, and in AnaloBench’s case they focus on long-context and abstract analogies. Their purpose is measurement, not operational transfer. They are useful as antecedents for evaluation corpora, but they do not constitute a transfer library of concepts and solutions. Flag: [WHITESPACE].

Bottom line for theme one. The project is not first in building concept catalogs, pattern catalogs, upper ontologies, dynamic archetypes, or typed relational formalisms. It may still be first, or close to first, in assembling those instincts into a single hand-curated, transfer-oriented corpus whose central objects are cross-domain abstractions and reusable solution archetypes meant to drive explicit analogical transport. That claim should be made narrowly and with the strongest comparisons aimed at TRIZ, systems archetypes, and ologs, not just Cyc/ConceptNet/WordNet.

Prompted scaffolds at inference time

Take a Step Back — Zheng et al., 2023 arXiv preprint. This paper is a direct antecedent to the “identify higher-level abstraction or principle, then solve” move. It asks models to abstract upward to concepts or first principles before solving the concrete instance, and reports gains on PaLM-2L, GPT-4, and Llama2-70B over several reasoning-heavy tasks. If a paper were to claim novelty for “abstraction-first prompting improves reasoning,” this is a clean pre-emption and a contradiction of a blanket runtime-null story. The honest positioning is: this project tests a stronger, more explicit, more typed version of an existing intuition and finds that at the frontier the extra structure does not reliably pay off. Flag: [CONTRADICTS].

Large Language Models as Analogical Reasoners — Yasunaga et al., 2023 arXiv preprint; ICLR 2024. This work introduces analogical prompting in which the model self-generates relevant exemplars and then solves the target problem. That is not identical to the Encyclopedia's abstraction dictionary, but it is very close in spirit to “retrieve or synthesize structured analogs, then transfer.” It therefore pre-empts the idea that no one has operationalized analogical transfer as a prompting pipeline, and it also contradicts a strong form of the project's runtime-null claim because it reported improvements over prior CoT methods. Flag: [CONTRADICTS].

SELF-DISCOVER — Zhou et al., 2024, arXiv / later ACM indexing. SELF-DISCOVER asks the model to self-compose task-intrinsic reasoning structures from atomic reasoning modules and then follow that structure during decoding. This is another strong prior on “explicit inference-time scaffolding.” It matters because it shows that runtime structure can help on some benchmark families, while being far more lightweight than the project's full ontology-and-transport pipeline. If a stronger scaffold of this kind is null while SELF-DISCOVER is positive, that difference itself is informative: the bottleneck may not be “lack of explicit structure” but the cost/rigidity of richer structure. Flag: [CONTRADICTS].

Least-to-Most, Plan-and-Solve, and Graph of Thoughts — Zhou et al. 2022; Wang et al. 2023 ACL; Besta et al. 2023/2024 AAAI. These methods all decompose problems explicitly, whether into progressively simpler subproblems, an initial plan plus execution, or a graph of interdependent thoughts. They show that adding external structure can help reasoning on symbolic manipulation, compositional generalization, math, and some elaborate search tasks. But they also reveal a limitation highly relevant to the project's framing: the strongest results are concentrated in tasks with explicit symbolic or procedural structure. They are therefore partial contradictions to a blanket runtime-null statement, but they also narrow the domain where structure helps. Flag: [CONTRADICTS] for novelty claims about structured prompting; not a contradiction to the narrower claim that cross-domain abstraction transport may be null at the frontier.

To CoT or not to CoT? — Sprague et al., 2024, arXiv. This is one of the most useful papers for the project's skeptical positioning. Across a meta-analysis of 100+ CoT papers and new evaluations over 20 datasets and 14 models, the authors conclude that CoT yields strong benefits primarily on math or logic, with much smaller gains elsewhere; on MMLU, direct answer generation is almost identical to CoT except where symbolic operations are involved. This strongly supports the possibility that a sophisticated abstraction scaffold could still be largely null on broad, cross-domain reasoning even if narrower symbolic tasks see gains. Flag: none; this supports the project's empirical skepticism.

Mind Your Step — Liu et al., 2024, arXiv. This paper shows that CoT can reduce performance in several settings inspired by cognitive psychology, with absolute accuracy drops as large as 36.3 percentage points for o1-preview relative to GPT-4o zero-shot in one task family. Its contribution is not that CoT always hurts, but that “more explicit thought” can backfire in predictable ways outside classic symbolic reasoning. That is unusually close to the project's runtime-null story and helps resist the field’s default assumption that more structure is generally better. Flag: none; this supports the project's negative result framing.

Revisiting CoT Prompting — Cheng et al., Findings of EMNLP 2025 (arXiv 2506.14641). This paper argues that traditional CoT exemplars mainly align output format, do not improve strong models, and can still help weaker or older ones. That is a near-direct match to the project's “receding horizon” intuition: as models internalize more reasoning behavior, external scaffolds matter less. The work is math-heavy, so it should be used as suggestive rather than decisive evidence. Flag: none; it supports the receding-horizon hypothesis.

TITAN and related prompt-engineering studies — Wang, Daghighfarsoodeh, and Pham, 2024, arXiv (2409.16418); “Do Advanced Language Models Eliminate the Need for Prompt Engineering in Software Engineering?” 2024 arXiv. TITAN explicitly combines step-back and CoT and reports larger zero-shot gains for GPT-3.5 than GPT-4, while a separate 2024 study reports that for a reasoning model such as o1-mini prompt engineering often offers minimal or negative return. These are among the cleanest empirical hints that structured prompting may help weaker or less internally capable models more than stronger ones. The caveat is domain: task-oriented script generation and code-adjacent benchmarks are not the same as broad analogical transfer. Flag: none; suggestive support for receding horizon, not decisive proof.

Synthetic data, distillation, and process supervision

Distilling Step-by-Step — Hsieh et al., 2023, Findings of ACL. This is a major pre-emption for the project's second value arm. It shows that task labels plus LLM-generated rationales can train smaller models that outperform much larger prompted models with less data. The conceptual overlap is strong: if runtime scaffolding helps weak models more than strong ones, the obvious next move is to turn the scaffold into training data, and this paper already demonstrates that move in a general rationale-distillation form. What it does not do is teach a curated library of cross-domain abstractions or analogical transport per se. Flag: [ALREADY DONE] for the broad move from runtime reasoning traces to training-time supervision; not [ALREADY DONE] for an abstraction-specific corpus idea.

SCOTT and Implicit CoT — Wang et al. 2023; Deng et al. 2023. SCOTT distills self-consistent CoT into smaller models while trying to preserve faithfulness; Implicit CoT tries to internalize reasoning in hidden states rather than natural-language traces. Together they matter because they weaken any argument that the value of an abstraction scaffold must be visible at inference time. A pipeline can fail as a runtime display yet still work as a training signal to alter internal computation. Flag: [ALREADY DONE] for this general training-time logic, though not for cross-domain abstraction transfer specifically.

Let’s Verify Step by Step and OmegaPRM — Lightman et al. 2023; Luo et al. 2024. Let’s Verify Step by Step reported that process supervision significantly outperforms outcome supervision on MATH, and OmegaPRM later automated large-scale collection of process labels with an MCTS-style procedure, improving Gemini Pro and Gemma2 on MATH500 and GSM8K without human annotation. These are highly relevant because they show that rich intermediate structure can be valuable in training even if text-visible reasoning is unreliable. The bad news for the project's framing is that this area is already substantial. The good news is that almost all of it remains concentrated in math/code-style domains, leaving non-symbolic, cross-domain abstraction training far less developed. Flag: [ALREADY DONE] for process-supervised reasoning training; [WHITESPACE] for a non-math prime-abstraction training curriculum.
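The difference between outcome and process supervision can be made concrete with a toy scoring sketch. Everything here is an assumption for illustration: the per-step scores are made-up numbers standing in for a trained process reward model, the candidate solutions are invented, and aggregating step scores by product is one common convention rather than a claim about any specific system's implementation.

```python
from math import prod


def outcome_score(final_answer: str, reference: str) -> float:
    """Outcome supervision: only the final answer is checked."""
    return 1.0 if final_answer == reference else 0.0


def process_score(step_scores: list[float]) -> float:
    """Process supervision: combine per-step correctness estimates (product, by assumption)."""
    return prod(step_scores)


def best_of_n(candidates: list[dict]) -> dict:
    """Rerank candidate solutions by their aggregated step-level score."""
    return max(candidates, key=lambda c: process_score(c["step_scores"]))


# Two hypothetical solutions that reach the same final answer.
candidates = [
    {"name": "lucky", "answer": "42", "step_scores": [0.9, 0.2, 0.9]},
    {"name": "sound", "answer": "42", "step_scores": [0.9, 0.9, 0.9]},
]

outcome = [outcome_score(c["answer"], "42") for c in candidates]
chosen = best_of_n(candidates)["name"]
```

Both candidates receive the same outcome score, so outcome supervision cannot tell them apart, while the step-level score prefers the solution without the weak middle step. That is the sense in which intermediate structure can carry training signal even when the final answers agree.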

VersaPRM — Zeng et al., ICML 2025 (arXiv 2502.06737). VersaPRM is especially important because it explicitly attacks the “math-only” limitation of process reward models and shows that current PRMs generalize poorly outside math unless retrained on synthetic multi-domain reasoning data. It then reports non-math gains, for example on MMLU-Pro law. This is a direct prior for arm two of the project: synthetic multi-domain process supervision can help beyond math. But VersaPRM is still a generic multi-domain reasoning dataset, not a hand-curated abstraction ontology with typed transport annotations. Flag: [ALREADY DONE] for multi-domain synthetic process supervision; not for abstraction-specific supervision.

SAL — Zhou et al., 2025, arXiv. SAL may be the single closest paper to the project's training-time thesis. It is explicitly framed as self-supervised analogical learning, trains models to transfer symbolic solutions from easier/common cases to rare/failing cases, and reports gains across StrategyQA, GSM8K, and HotpotQA. That is much closer to “teach abstract transfer skill into models” than standard rationale distillation. The main limitation is that SAL uses self-generated symbolic analogs rather than a human-built abstraction encyclopedia, so it does not pre-empt the project's corpus design—only the general idea that training can target analogical transfer. Flag: [ALREADY DONE] for the general idea of training a model to analogically transfer solution structure; not [ALREADY DONE] for the project's curated abstraction inventory.

Nemotron-CrossThink — Akter et al., 2025/2026, EACL 2026 plus NVIDIA project page. This framework mixes synthetic and open-domain multi-format corpora into RL and reports gains across both math and non-math benchmarks, with the EACL version explicitly stating that math-only training is insufficient and multi-domain blending improves average reasoning accuracy. This is a strong prior against any claim that nobody has tried to post-train broad reasoning skill with synthetic multi-domain data. It is weaker as a pre-emption of this project because it does not teach a principled abstraction vocabulary. Flag: [ALREADY DONE] for broad multi-domain reasoning post-training; not for abstraction-labeled transfer training.

Small Models Struggle to Learn from Strong Reasoners — 2025 arXiv. This is a useful caution against overselling the “receding horizon” into arm two. It reports a “small model learnability gap”: small models do not consistently benefit from long or intricate teacher reasoning traces. That means the move from “weak models sometimes benefit more from scaffolding at inference time” to “let’s just distill rich scaffolds into weak models” is not automatic. Arm two may work only if the abstraction supervision is compressed, modular, or learnability-aware rather than simply verbose. Flag: [CONTRADICTS] to any simplistic version of the receding-horizon/training-data story.

Interim judgment on theme three. Arm two is not empty space; it is arguably the most crowded and promising arm. But the specific whitespace is narrower and more defensible: a supervised/distilled curriculum built from a canonical, human-curated library of cross-domain abstractions and solution archetypes still seems underexplored relative to generic rationale or PRM data.

Human transfer and analogical instruction

Perkins and Salomon — 1988/1989 review tradition on transfer, especially the “high road / low road” distinction. Their core claim is that mindful, effortful abstraction and active search for connections are central to high-road transfer. This is directly supportive of the project's human-curriculum arm: to teach transfer, explicit abstraction is not a bug but a classical prescription. What their work also implies, however, is that transfer is not automatic; it requires pedagogical design and repeated opportunities to abstract and reapply. Flag: none.

Barnett and Ceci — 2002, Psychological Bulletin. This paper’s taxonomy remains one of the best guards against overclaiming. It argues that disputes about “far transfer” often stem from failing to specify which dimensions of transfer differ between learning and application contexts. For this project this matters a great deal: the more dimensions differ between source and target domains, the harder transfer becomes. That strengthens the conceptual motivation for a curated abstraction curriculum, but it also warns that impressive local transfer does not imply broad far transfer. Flag: none.

Analogical encoding — Gentner, Loewenstein, and Thompson, 2003. This is the canonical positive result for schema induction via comparison. Comparing two cases that share an underlying principle helps learners encode the common relational structure and improves later transfer. For this project, this is probably the most intellectually aligned human-learning precedent: compare cases, extract the structure, then use it elsewhere. It strongly supports the plausibility of arm three. Flag: none.

Teaching by Analogy — Gray and Holyoak, 2021, Mind, Brain, and Education. This review argues that analogy is powerful for conceptual understanding and transfer when teaching focuses on causal-relational structure, fully explains correspondences, manages cognitive load, and matches source analogs to learner knowledge. This is more encouraging for the human-curriculum arm than for the runtime-LLM arm, because it stresses guided pedagogy rather than unguided one-shot prompting. Flag: none.

Schema-broadening instruction — Fuchs et al., 2010, empirical education study. In bounded math-word-problem settings, schema broadening improved students’ problem solving; historically, this is one of the cleaner examples that direct schema instruction can work when the target family is tightly controlled. It supports the feasibility of a curriculum based on explicit schemas or abstractions, but only in a narrow near-transfer sense. Flag: none.

Teaching for near transfer — Jerrim et al., 2025, Learning and Individual Differences. This is the most important recent negative result. Using TIMSS 2019 data on roughly 280,000 students, the authors find no evidence that mathematics teaching aimed at schema formation and abstraction is associated with better performance on unfamiliar mathematics questions. That does not refute analogical encoding experiments; it does refute any easy claim that abstraction-oriented instruction scales straightforwardly in mass education. This is a real threat to an overconfident arm-three framing. Flag: [CONTRADICTS].

How structure-mapping can improve K-12 education — Mix and Gentner, 2026, Discover Education. This recent perspective insists that comparison and relational reasoning have broad evidence behind them, while also emphasizing that students usually do not discover deep relational structure without help. That is highly consonant with the project's curriculum idea: direct instruction in abstractions is plausible precisely because unguided students default to surface learning. But this is a perspective/review, not a new randomized demonstration of far transfer at scale. Flag: none.

Theme-four takeaway. The cognitive-science literature is supportive of a carefully designed human abstraction curriculum, but it is supportive in a qualified way: explicit schema/analogy instruction can improve transfer in well-scaffolded settings, yet large-scale evidence says transfer remains stubbornly difficult. The honest pitch is not “direct instruction of abstractions works”; it is “there is serious theory and some bounded empirical support for it, but broad educational transfer remains hard enough that a dedicated curriculum is still an open problem.”

Benchmarks on abstraction and cross-domain transfer

Emergent analogical reasoning in large language models — Webb, Holyoak, and Lu, 2023, Nature Human Behaviour. This is the strongest single paper that threatens a simplistic “frontier transfer is mostly fake” narrative. It reported that large-scale language systems display an emergent ability to reason by analogy, with performance rising sharply with scale on several analogy families. If a paper implies that frontier models simply cannot do analogical transfer, this paper is a direct contradiction. The safer reading is narrower: frontier models can succeed on some analogy benchmarks, but the robustness and scope of that success remain disputed. Flag: [CONTRADICTS].

Using Counterfactual Tasks to Evaluate the Generality of Analogical Reasoning — Lewis and Mitchell, 2024, arXiv. This paper is a direct reply to the Webb story and is central to the project's skeptical map. It creates counterfactual analogy variants designed to preserve abstract structure while reducing similarity to pretraining data; humans stay strong, GPT models drop sharply. That strongly supports the concern that apparent analogical ability may be shallow, benchmark-specific, or surface-driven. Flag: none; it supports the project's skepticism.
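A minimal sketch of the counterfactual idea, assuming a letter-string task with a single rule ("replace the last letter with its successor"): the abstract rule stays fixed while the alphabet ordering changes. The helper names and the keyboard-order alphabet are hypothetical choices made for illustration and are not taken from the paper's materials.

```python
import string


def successor(letter: str, alphabet: str) -> str:
    """Return the letter that follows `letter` in the given ordering.

    Raises IndexError for the final letter; wrap-around is deliberately not modeled.
    """
    return alphabet[alphabet.index(letter) + 1]


def apply_last_letter_successor(target: str, alphabet: str) -> str:
    """Apply the rule 'replace the last letter with its successor' to `target`."""
    return target[:-1] + successor(target[-1], alphabet)


standard = string.ascii_lowercase            # a b c d ...
counterfactual = "qwertyuiopasdfghjklzxcvbnm"  # same letters, unfamiliar order

# Familiar item: abc -> abd, so ijk -> ?
familiar = apply_last_letter_successor("ijk", standard)

# Counterfactual item under the permuted ordering: qwe -> qwr, so asd -> ?
correct_counterfactual = apply_last_letter_successor("asd", counterfactual)

# What a solver would produce if it ignored the stated ordering.
surface_answer = apply_last_letter_successor("asd", standard)
```

The familiar item resolves to `ijl`. Under the permuted ordering the correct completion is `asf`, whereas a solver that falls back on the memorized alphabet produces `ase`. The gap between those two answers is the kind of signal such counterfactual variants are designed to expose.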

Evaluating the Robustness of Analogical Reasoning in LLMs — Lewis and Mitchell, 2024, OpenReview/arXiv. This expands the critique across letter-string analogies, digit matrices, and story analogies and again finds brittleness on many variants, sensitivity to answer order, and paraphrase effects. This is perhaps the best direct evidence that current frontier analogy is often not the kind of stable, domain-stripped transfer this project targets. Flag: none.

ARN: Analogical Reasoning on Narratives — Sourati et al., 2024, Transactions of the ACL. ARN is especially relevant because it moves beyond word-pair analogies to system-level narrative analogies. The paper reports that LLMs can recognize near analogies reasonably well but struggle badly on far analogies, with GPT-4 below random in zero-shot on the far-analogy setting; examples and CoT help, but even the best model remains well below humans. This is exactly the kind of result that supports the project's null framing for genuinely cross-domain, deep-structure transport. Flag: none.

AnaloBench — Ye et al., EMNLP 2024. AnaloBench evaluates abstract and long-context analogies, including retrieval-like conditions where the model must find a relevant scenario in a large pool. The key result is that scaling helps only modestly when analogies are long-context or require “needle in a haystack” retrieval. That matters because the project's scaffold is trying to do precisely the hard part (represent and retrieve relevant deep structure), not just solve a short familiar analogy. Flag: none.

Relevant or Random? Can LLMs Truly Perform Analogical Reasoning? — Qin et al., Findings of ACL 2025. This paper is damaging to some of the optimism around analogical prompting. It finds that self-generated random examples can be comparable to or better than purportedly relevant ones on some reasoning tasks, argues that example correctness rather than relevance is the key driver, and concludes that LLMs cannot always be said to be performing genuine analogical reasoning. That weakens any attempt to treat positive analogical-prompting results as evidence that models are using deep abstraction. Flag: none; it supports the project's skepticism and narrows the interpretation of prior prompting gains.

ARC-AGI, ARC-AGI-2, and ARC Prize 2025 — Chollet and colleagues, 2025–2026, official benchmark pages and technical report. ARC-AGI is explicitly a benchmark for fluid intelligence and new-task generalization rather than memorized knowledge. ARC-AGI-2 is intended to stress higher-level abstract reasoning, and the 2025 technical report states that the top private-eval competition score on ARC-AGI-2 was 24%; the official 2025 results analysis later reports the top verified commercial model at 37.6% and a refinement system at 54%, still far below the benchmark’s 85% target. This is not a perfect measure of analogical transfer, but it is strong evidence that frontier systems remain weak on novel abstract reasoning even after major scaling and system-level refinement. Flag: none.

CoT faithfulness and post-hoc rationalization

Language Models Don’t Always Say What They Think — Turpin et al., NeurIPS 2023. This is still a foundational warning. It shows that CoT explanations can systematically omit the real biasing factors behind a prediction and can rationalize incorrect answers, including cases where accuracy drops substantially under biased prompting. For this work, that means a structured scaffold producing elegant typed models or abstraction labels does not, by itself, establish that those structures are what drove the answer. Flag: none.

Measuring Faithfulness in Chain-of-Thought Reasoning — Lanham et al., 2023, arXiv. Lanham et al. test faithfulness by intervening on CoTs and observing answer changes. The paper is central because it shifted the question from “Do CoTs look reasonable?” to “Are they causally load-bearing?” That framing is directly relevant to the evaluation setup used here, especially where it measures whether abstraction scaffolds improved faithfulness rather than just answer quality. Flag: none.
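The "causally load-bearing" question can be sketched as a truncation-style intervention: cut the chain of thought at each step and check whether the answer changes. This is a simplified illustration inspired by the intervention idea, not the paper's exact metric. The `answer_given_cot` callable is a hypothetical stand-in for a model call, and the two toy "models" below are invented so the example runs without any API.

```python
from typing import Callable, Sequence


def truncation_sensitivity(
    question: str,
    cot_steps: Sequence[str],
    answer_given_cot: Callable[[str, Sequence[str]], str],
) -> float:
    """Fraction of proper CoT prefixes whose answer differs from the full-CoT answer.

    0.0 suggests the answer never depended on the visible reasoning;
    values near 1.0 suggest the later steps were doing real work.
    """
    if not cot_steps:
        return 0.0
    full_answer = answer_given_cot(question, cot_steps)
    changed = sum(
        answer_given_cot(question, cot_steps[:k]) != full_answer
        for k in range(len(cot_steps))  # prefixes of length 0 .. n-1
    )
    return changed / len(cot_steps)


def ignores_cot(question: str, steps: Sequence[str]) -> str:
    """Toy model whose answer is fixed regardless of the reasoning shown."""
    return "B"


def reads_last_step(question: str, steps: Sequence[str]) -> str:
    """Toy model that takes its answer from the last token of the last step."""
    return steps[-1].split()[-1] if steps else "unknown"


steps = ["running total 2", "running total 4", "answer 6"]
post_hoc = truncation_sensitivity("2+2+2?", steps, ignores_cot)
load_bearing = truncation_sensitivity("2+2+2?", steps, reads_last_step)
```

The first toy model scores 0.0 and the second 1.0. Real models fall in between, and a raw score of this kind can be entangled with answer-choice bias and accuracy, so it should be read as a proxy rather than a direct measurement of faithfulness.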

Chain-of-Thought Unfaithfulness as Disguised Accuracy — Bentham, Stringham, and Marasović, 2024, arXiv/OpenReview. This paper is a serious methodological threat. It argues that one influential Lanham-style proxy for faithfulness can be confounded with answer-choice bias and accuracy itself; after normalization, apparent small-model unfaithfulness drops sharply, and the metric correlates strongly with accuracy. Any claim of “no faithfulness gain” resting on a similar proxy should be careful not to overinterpret the null. Flag: [CONTRADICTS] to any faithfulness claim resting on fragile proxy metrics.

Chain-of-Thought Reasoning in the Wild Is Not Always Faithful — Arcuschin et al., 2025, arXiv/OpenReview. This extends the concern beyond artificially inserted prompt biases, showing that unfaithful CoT also occurs on realistic prompts without explicit bias injections. That strengthens the case that post-hoc rationalization is a broad phenomenon, not merely a pathology of adversarial evaluation. Flag: none.

Reasoning Models Don’t Always Say What They Think — Anthropic, 2025. This is the current state-of-the-art cautionary result. Anthropic finds that reasoning models are more faithful than non-reasoning models, but still often verbalize used hints less than 20% of the time; average faithfulness remains low, falls on harder tasks, and plateaus under outcome-based RL. The paper also shows that reward hacks were almost never verbalized in most synthetic RL environments. So the answer to the question “do structured-reasoning formats change faithfulness?” is: slightly, yes; enough to rely on, no. Flag: none; this narrows rather than refutes the project's skeptical framing.

Theme-six implication for this project. A null finding on faithfulness would not be surprising, but neither would a modest positive one. The hard conclusion from the 2023–2026 literature is that external reasoning traces are weak evidence of internal reasoning unless paired with stronger causal tests or mechanistic evidence. That makes the project's emphasis on faithfulness valuable, but only if the metric is robust.

Adjacent architectures and closer competitors

A Path Towards Autonomous Machine Intelligence — LeCun, 2022, OpenReview. LeCun’s position paper argues that intelligence requires world models, planning, and predictive representations learned from the world rather than just next-token prediction. This is a different route to abstraction than the Encyclopedia's: instead of curating human-readable abstractions and transporting them across domains, it seeks latent predictive structure that supports planning and control. It addresses some of the same high-level goal—generalization—but not the same mechanism. In that sense JEPA/world-model work is more orthogonal than competitive. Flag: [WHITESPACE] for explicit, language-level cross-domain abstraction transfer.

V-JEPA and V-JEPA 2 — Bardes et al. 2024; Assran et al. 2025, Meta/arXiv. V-JEPA learns abstract predictive representations from video, and V-JEPA 2 extends that toward understanding, prediction, and planning in the physical world with robot fine-tuning. These models are important because they embody abstraction in latent predictive space, but they target physical-world perception and planning rather than symbolic cross-domain analogy in language. Unless a paper overclaims to solve the general abstraction problem in AI, these architectures are best treated as neighboring but not directly competing. Flag: [WHITESPACE] relative to the project's specific thesis.

Neuro-symbolic AI surveys — Wan et al. 2024 and Colelough et al. 2025. Neuro-symbolic AI is closer than JEPA because it explicitly pursues structured, interpretable, data-efficient reasoning by combining neural learners with symbolic components. This threatens any claim that explicit structure is neglected in mainstream AI. But most neuro-symbolic work remains about logical inference, knowledge-graph reasoning, or solver integration—not a broad inventory of human-meaningful, cross-domain abstractions and solution archetypes. Flag: none; serious adjacent prior art, but not a direct pre-emption.

Neural-symbolic reasoning over knowledge graphs — Liu et al. 2024 / DeLong et al. 2023 surveys. This literature is valuable because it operationalizes symbolic structure, relation composition, and explicit reasoning, often with LLM integration. But it usually reasons within a predefined relational schema rather than transporting named abstractions across heterogeneous domains. If the contribution is the dictionary and transport protocol for cross-domain abstraction, KG reasoning is adjacent rather than identical. Flag: [WHITESPACE].

DIN-Retrieval — Yan et al., ACL Findings 2026. This is a much closer competitor than JEPA. DIN-Retrieval tries to improve cross-domain in-context reasoning by learning domain-invariant hidden representations and using them to retrieve structurally compatible cross-domain demonstrations; it reports modest average gains over prior methods. This directly pre-empts the claim that no one has tried cross-domain transfer of reasoning structure via aligned representations and retrieved analogs. Its limitations help the project's case, though: the gains are modest, and the method is about hidden-state retrieval rather than explicit prime-abstraction modeling. Flag: [ALREADY DONE] for a neighboring version of cross-domain reasoning transfer; not [ALREADY DONE] for a hand-curated abstraction corpus.
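
The general shape of representation-based cross-domain retrieval can be sketched as follows. This is not DIN-Retrieval's code: the vectors stand in for whatever domain-invariant encoder a method learns, and the `pool` schema, the `retrieve_cross_domain` name, and the plain cosine ranking are assumptions for illustration.

```python
from math import sqrt
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def retrieve_cross_domain(query_vec: Sequence[float],
                          query_domain: str,
                          pool: List[Dict],
                          k: int = 2) -> List[Dict]:
    """Rank demonstrations from other domains by similarity to the query.

    Each pool item is a dict with hypothetical keys "id", "domain", "vec".
    Same-domain items are excluded so every retrieved demonstration is cross-domain.
    """
    candidates = [d for d in pool if d["domain"] != query_domain]
    ranked = sorted(candidates,
                    key=lambda d: cosine(query_vec, d["vec"]),
                    reverse=True)
    return ranked[:k]
```

The contrast with the Encyclopedia is visible in the data: the only thing linking a query to its retrieved analogs is vector proximity, with no named abstraction or typed relation that a human could inspect.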

CoDA — Yan et al., 2026, arXiv. CoDA is perhaps the strongest recent pre-emption on the “cross-domain transfer” side. It uses CoT-guided latent adaptation and source-target distribution alignment to transfer reasoning across logical and mathematical domains, reporting improvements of up to 12.3% without target-domain CoT labels. This is not the Encyclopedia's approach (it is latent-space adapter training, not an explicit abstraction encyclopedia), but it competes directly with the idea that the main bottleneck is merely lack of a transport mechanism. It suggests that latent alignment may sometimes beat explicit textual scaffolds. Flag: [ALREADY DONE] for a strong alternative implementation of cross-domain reasoning transfer.

Synthesis¶

The honest answer is that the Encyclopedia-of-Abstractions idea is novel only in a narrow, combinatorial sense. It is not novel to hand-curate conceptual structure, catalog reusable patterns, prompt models to abstract before answering, ask models to retrieve analogs, distill reasoning traces into smaller models, or align source and target reasoning representations. All of that has prior art, and in several cases recent prior art. What still appears plausibly novel is the specific combination of: a hand-curated corpus centered on cross-domain prime abstractions plus solution archetypes; an explicit typed relational model and domain-stripped meta-model layer; a transport step from source pattern to target problem; and a sober empirical conclusion that this richer runtime scaffold is largely null at the frontier even if weaker systems may benefit. That is a much smaller and more defensible novelty claim than “we introduce abstraction-first reasoning for LLMs.”
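
As a way of seeing what that combination amounts to structurally, the following is a hypothetical sketch of one corpus entry and a transport step. Nothing here is taken from the project's actual schema: the `Relation` and `MetaModel` classes, their fields, and the `transport` function are invented names, and the template-filling transport is a deliberately simple stand-in.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Relation:
    """A typed relation between two roles (hypothetical schema)."""
    kind: str
    source: str
    target: str


@dataclass(frozen=True)
class MetaModel:
    """A domain-stripped pattern: roles, typed relations, and a solution archetype."""
    abstraction: str
    roles: Tuple[str, ...]
    relations: Tuple[Relation, ...]
    archetype: str  # solution archetype written in terms of {role} placeholders


def transport(meta: MetaModel, binding: Dict[str, str]) -> str:
    """Instantiate the archetype in a target domain by binding every role."""
    missing = [r for r in meta.roles if r not in binding]
    if missing:
        raise ValueError(f"unbound roles: {missing}")
    return meta.archetype.format(**{r: binding[r] for r in meta.roles})


if __name__ == "__main__":
    # Illustrative entry only; not drawn from the project's corpus.
    bottleneck = MetaModel(
        abstraction="bottleneck",
        roles=("flow", "constraint"),
        relations=(Relation("limits", "constraint", "flow"),),
        archetype="Raise throughput of {flow} by relieving {constraint} first.",
    )
    print(transport(bottleneck, {"flow": "checkout requests",
                                 "constraint": "the connection pool"}))
```

Even this toy version shows why the explicit route differs from the prompting and latent-alignment work above: the roles and relations are named, typed, and checkable before anything reaches the model.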

The biggest threat to the framing is not old ontology work; it is the modern prompting-and-post-training literature. Step-Back, Analogical Prompting, SELF-DISCOVER, Least-to-Most, Plan-and-Solve, and Graph of Thoughts all say, in different ways, that explicit structure can improve reasoning. DIN-Retrieval and CoDA say cross-domain transfer can be improved by aligning or retrieving structure across domains. Distilling Step-by-Step, SAL, VersaPRM, and Nemotron-CrossThink say the training-time arms are already active research areas. If the paper presents the project as if it has discovered these possibilities, the literature will push back hard.

At the same time, the strongest papers that weaken or complicate those optimistic stories line up unusually well with the project's skeptical findings. Sprague et al. show CoT mainly helps math/symbolic tasks. Liu et al. show CoT can reduce performance badly outside its sweet spot. Lewis and Mitchell show that headline analogy successes often collapse under counterfactual or robustness tests. ARN and AnaloBench show that far, long-context, and retrieval-heavy analogies remain difficult. ARC-AGI-2 and ARC Prize 2025 show that frontier systems are still far from human-like novel-task generalization. And the faithfulness literature says that even when a structured reasoning format exists, it is often not a reliable window into the model’s actual causal process. Those are all strong reasons why a null runtime result at the frontier is not embarrassing but credible.

The receding-horizon story also looks plausible but should be stated cautiously. There is evidence that stronger models sometimes benefit less from external scaffolds than weaker ones, including TITAN’s larger gains on GPT-3.5 than GPT-4 and the 2025 study arguing CoT exemplars help weaker models but not strong ones. But there is also counterevidence: small models may fail to learn from long or intricate strong-model traces. So the right claim is not “scaffolding helps weak models more”; it is “external structure seems to have diminishing inference-time returns as models internalize more of the relevant behavior, but training-time learnability remains a separate problem.”

The most defensible white space is therefore not arm one. Runtime scaffolding is the most crowded, most pre-empted, and most likely to look incremental unless the project's corpus or evaluation design is the real story. The better white space lies in two directions. First, training-time abstraction curricula: this review did not find a clear precedent for teaching models from a large, human-curated inventory of cross-domain abstractions and solution archetypes whose labels are themselves the supervision target. SAL, VersaPRM, and Nemotron-CrossThink come closest, but they do not use a canonical abstraction ontology. Second, human pedagogy: cognitive science provides real theoretical backing for analogical encoding and schema instruction, but large-scale transfer remains hard, which means a carefully designed abstraction curriculum for humans is still genuinely open. In short: if the paper wants the cleanest and most future-proof contribution, it should treat arm one as a stringent negative test of an already-popular intuition, and present arms two and three as the real forward-looking white space.