Encyclopedia of Abstractions: Pilot Comparison of Catalog-Augmented Reasoning vs. Bare Chain-of-Thought¶

Companion piece

This pilot comparison is accompanied by a Worked Example: Fishery Stock Decision — the full end-to-end pipeline run on a real catalog query.

Headline. On a single canonical common-pool-resource problem, a blinded grader scored a catalog-augmented recommendation 50/60 against a bare chain-of-thought recommendation at 45/60. Treated as a pilot, the directional signal is consistent with the hypothesis that the catalog complements CoT. Treated as evidence for that hypothesis, the result is insufficient: N=1, mature-domain problem, same-model agents, same-class grader, non-pre-registered, conducted by the prototype's implementer. The pilot's purpose is to determine whether a properly designed study is worth running. The conclusion is that it is.

Status: Pilot — directional only. Date of run: 2026-05-05. Corpus snapshot under test: 511 primes / 314 generated archetypes / 16 hand-curated archetypes. Catalog state: solution_archetypes/_catalog/catalog_v1_batches_02_44.yaml. Prototype under test: mcp_server/server.py with 12 read tools, lexical search, 43-entry slug-alias map. Built from dist/encyclopedia.{primes,archetypes,components,mechanisms}.jsonl.

Audience. This report is written for two readers: (1) the catalog's principal author, deciding whether and how to invest in a formal study; (2) AI researchers interested in catalog-augmented reasoning protocols, evaluating whether the prototype is worth their attention. The framing tries to serve both: enough internal context for the first, enough methodological transparency for the second.

Disclosure. The author of this report is also the implementer of the prototype, the operator of the pipeline run, and the producer of the catalog-derived recommendation that was scored. This conflict of interest is partially mitigated by the use of a separate blinded grader, but is not eliminated. See §7.

1. Motivation, hypothesis, and operational definition of "complement"¶

What the catalog is

The Encyclopedia of Abstractions is a curated structured catalog of:

1,402 primes — recurring structural patterns (feedback, tragedy_of_the_commons, tipping_points, monitoring, equity), each carrying a 13-section template that includes a structural signature, six T-tensions, formal/applied examples, and a list of solution archetypes that reference it.
1,108 archetypes — families of intervention pattern (commons_governance, circuit_breaker, dependency_ordering), each carrying source primes, related primes, components, mechanisms, action logic, anti-signatures, decision rules, target invariants, expected outcomes, neighbor distinctions, and documented failure modes.

The catalog is the product of multi-month curation work and is structurally non-trivial: archetypes name their own anti-signatures (cases where they should not be applied), discriminate themselves from neighboring archetypes, and document the failure modes a successful instantiation must guard against.

What the catalog might add to chain-of-thought

CoT-only reasoning produces good output by recursive self-elaboration. It can surface concrete mechanisms, name tradeoffs, and structure recommendations into phases. What CoT-only reasoning does not reliably do is anchor a problem in a pre-existing structural taxonomy, retrieve the documented failure modes of a candidate intervention, or check anti-signatures that distinguish a chosen approach from neighboring approaches that look similar but fail differently. The hypothesis is that catalog access supplies precisely these missing affordances, and that this should manifest as more coherent, more failure-aware, and more structurally grounded recommendations on problems where structural pattern recognition is load-bearing.

Operational definition of "complement"

For this report, the catalog complements CoT if, on problems where structural pattern recognition is load-bearing, blinded grading prefers catalog-augmented output over bare CoT output produced by the same model class, on dimensions tied to structural reasoning (coherence, failure-mode anticipation, structural fit) — and does so reliably across multiple problems and multiple graders. The pilot tests this on one problem with one grader, which is necessary but insufficient.

The hypothesis is not that the catalog supplies new information about specific domains the CoT model already knows — frontier models have read fishery economics. The hypothesis is that the catalog supplies a structural protocol for reasoning about any sufficiently abstracted problem.

2. The prototype under test¶

The prototype is an MCP server exposing 12 read tools over stdio:

Retrieval: search_prime, get_prime, search_archetype, get_archetype
Inverted indices: find_archetypes_for_prime, find_related_primes
Registries: list_components, find_archetypes_using_component, list_mechanisms, find_archetypes_using_mechanism
Sanity: corpus_stats
Protocol guide: get_reasoning_pipeline_guide

The 12^th tool returns the Augmented Abstract Reasoning Pipeline as a structured guide. The pipeline is a nine-step protocol grouped into four phases:

Perception & Discernment. Specify the problem statement; identify operative primes via search_prime/get_prime; salience-rank; prune to an operative subset.
Structural Externalization. Build a context-specific relational model (primes as node annotations, entities as nodes, relationships as labeled edges); construct a domain-stripped meta-model.
Transfer. Query archetypes whose source primes match the operative set, via find_archetypes_for_prime; retrieve full records via get_archetype.
Evaluation. Apply candidate archetypes in both views; reconcile; check anti-signatures; produce a structured recommendation grounded in the archetypes' documented invariants and failure modes.

The catalog is read once at server startup; the server is read-only; lexical search only (no embeddings); slug aliasing absorbs canonical-slug drift via 43 prepopulated entries from solution_archetypes/_catalog/slug_mapping_tentative_v1.yaml.

Catalog maturity is uneven and known to be uneven. Some archetype families (commons governance, coordination/cybernetic control, resilience/reliability) are richly developed. Others (threshold dynamics, ML/AI applications) have known coverage gaps. The selection of the fishery problem placed the test in a mature region of the catalog. This is a deliberate choice and is treated as a limitation in §7.

3. Test design¶

A within-problem, between-method comparison with a blinded grader.

Conditions

Pipeline. Author executes the nine-step pipeline against the live MCP server. Permitted: any tool call, any number of iterations. Output: a final recommendation in plain prose.
Bare CoT. A separately spawned Claude-class agent reasons through the same problem using chain-of-thought, with no tool access and no catalog. Instructed not to name external frameworks (no "Ostrom," no specific named "design principles") to keep the comparison on substance. Output: a final recommendation in plain prose.

Problem statement (verbatim, given to both conditions)

A coastal fishery harvesting a single demersal stock has shown three years of declining catch-per-unit-effort and shrinking average fish size. The biologist's stock assessment estimates that current harvest exceeds maximum sustainable yield (MSY) by ~20% and that, on the current trajectory, the stock will cross a recruitment-collapse threshold within 4-7 years.

The fishery is open-access in practice (no enforced individual quotas); ~120 small-boat operators across 5 villages depend on it for income, with fish-processing buyers downstream. A regulator has authority to set total allowable catch but has historically deferred to the industry.

Decision: what intervention design preserves the stock without destroying the livelihoods that depend on it?

Success = stock biomass returning to MSY-supporting range within 10 years AND <20% drop in aggregate fisher income (averaged across the cohort, not per-boat). Failure = stock collapse OR a regulatory regime that pushes the bottom quartile of operators below subsistence.

Why this problem

The fishery case has three properties that make it useful as a pilot stimulus:

A canonical structural fit. The catalog's commons_governance archetype lists tragedy_of_the_commons as a source prime, and the academic literature on common-pool resource management converges on a well-known answer (Ostrom-style co-management). Both methods have a target to reach.
A non-trivial equity constraint. The success criterion's failure condition (no bottom-quartile collapse) distinguishes structurally sound recommendations from generic "set a quota" answers.
External reference standard. Any structurally adequate answer should map onto the academic literature's canonical eight design principles for common-pool resource management. This lets us check whether each recommendation hits the target without us having to define the target ourselves.

Why this problem is not representative

It is the textbook commons case. It falls in a mature region of the catalog. A randomly sampled problem would be more likely to land in a region with coverage gaps. The pilot tested the catalog where the catalog is strongest. See §7.

Rubric

Six dimensions, each scored 1-10, total 60:

Structural coherence — Does the recommendation hang together as a system, with parts that reinforce each other?
Failure-mode anticipation — Does it name concrete, specific ways the intervention could fail and propose counter-measures?
Implementability — Could a regulator and community execute this within the stated constraints?
Mechanism grounding — Are the named mechanisms concrete and actionable, versus vague aspirations?
Stakeholder/equity fit — Does it engage with the heterogeneity of the affected actors and the success criterion's equity floor?
Threshold/time-horizon awareness — Does it engage with the non-linear collapse risk and the tension between the collapse window and the recovery target?

The rubric was selected by the author after the pipeline run had concluded but before the bare-CoT run was launched. Three of the six dimensions (coherence, failure-mode anticipation, equity-fit) overlap with what the catalog's archetype structure explicitly enforces — anti-signatures, target invariants, expected outcomes, documented failure modes. This circularity is a methodological weakness: the rubric the author chose is exactly the rubric on which the catalog's affordances would be expected to score well. A pre-registered rubric chosen by an independent reviewer would be a stronger test. This is flagged honestly here rather than in the limitations section because it directly affects how the result should be read.

The other three dimensions (implementability, mechanism grounding, threshold awareness) are not specifically catalog-favored; they were included to give CoT a fair surface on which to win.

Blinding

Both recommendations were re-rendered as plain prose with method-specific terminology stripped (no "archetype," "source primes," or "anti-signatures" in the pipeline output; comparable structural framing in both). Outputs were assigned to labels A (pipeline) and B (CoT). The blinded grader was a separate Claude-class agent with the rubric and both outputs, instructed to score on substance only and not to attempt to identify which method produced which.

The grader did not know that one of the conditions was catalog-augmented or that the catalog existed.

4. Method execution¶

4.1 Pipeline run

Conducted in the live Cowork session against the MCP server. Total tool-call count: 17 (see Appendix D for the breakdown).

Steps 1-4 (Perception & Discernment). The author wrote the problem statement, then issued search_prime queries against 14 candidate keywords. From returned snippets, the author assembled a candidate prime list of ~22 primes; salience-ranked; pruned to 8 operative primes (tragedy_of_the_commons, tipping_points_or_phase_transitions, feedback, externality, monitoring, collective_efficacy, trust, equity).
Steps 5-6 (Structural externalization). Rendered as ASCII relational diagrams, with primes annotated on relevant nodes/edges. The meta-model was rendered as ASCII commentary rather than as a strictly typed graph artifact. This is a discipline gap; see §7.
Step 7 (Transfer). Ran find_archetypes_for_prime against four operative primes. The first query (against tragedy_of_the_commons) returned commons_governance as the canonical source-prime match. Retrieved full records for commons_governance, externality_internalization, and public_goods_provision via get_archetype.
Step 8 (Dual-pass). Applied each candidate's action logic to context-specific model and meta-model. Two passes converged on commons_governance as primary archetype, with externality_internalization operating as an embedded mechanism layer (cap-and-trade-like).
Step 9 (Evaluation). Walked anti-signatures of each candidate. public_goods_provision was excluded by its second anti-signature ("the main failure is overuse, depletion, congestion, or pollution of an existing shared resource — Commons_governance is usually closer"). commons_governance triggered no anti-signatures. Final recommendation instantiated all nine of commons_governance's structural components in the fishery context, with explicit invariants, expected outcomes, and counter-measures for each documented failure mode.

The pipeline-derived recommendation was then re-rendered in plain prose, with catalog-specific terminology removed.

4.2 Bare-CoT run

A separate Claude-class agent received the same problem statement, the same success/failure criteria, and the explicit instruction not to name external frameworks. The agent produced a chain-of-thought reasoning section followed by a structured four-phase recommendation (immediate regulation; differentiated allocation and support; monitoring and adjustment; long-term sustainability). The recommendation included calibrated numerical parameters (80% TAC; three vulnerability tiers with 8/18/28% reduction; $500K-$2M support program budget; year-2 and year-4 stock-check gates).

4.3 Grading

A separate Claude-class agent received both recommendations labeled A (pipeline) and B (CoT), the rubric, and instructions to score on substance only. The grader produced per-dimension scores with one-sentence justifications, a qualitative comparison (~250 words), and an overall judgment.

5. Results¶

5.1 Scores

Dimension	Pipeline (A)	Bare CoT (B)	Δ
Structural coherence	9	7	+2
Failure-mode anticipation	9	6	+3
Implementability	7	8	-1
Mechanism grounding	8	9	-1
Stakeholder/equity fit	9	7	+2
Threshold/time-horizon awareness	8	8	0
Total	50/60	45/60	+5

The grader's overall judgment, given the success criteria, was to adopt the pipeline-derived recommendation, with one specific modification (the equity floor must be calculated and explicitly budgeted as part of the charter).

5.2 Where each method's edge appeared

The pipeline output's edge was concentrated in coherence (+2), failure-mode anticipation (+3), and equity-fit (+2) — exactly the dimensions where the rubric overlaps with what the catalog explicitly enforces. The grader's reasoning emphasized that the pipeline's nine-piece system was tightly coupled (removing the equity floor, the council's sanction authority, or the adaptation cadence would predictably collapse the design) and that its failure-mode counters were specific and testable.

The bare-CoT output's edge was on mechanism grounding (+1) and implementability (+1) — the dimensions where the catalog makes no specific affordance and where careful CoT can excel via numerical specification and administrative pragmatism.

5.3 Specific cross-method observations worth flagging

The bare-CoT output surfaced one structural insight the pipeline missed. The CoT agent independently identified that enforcement should be located at the buyer/landings stage (12-15 control points) rather than at-sea (120 boats). This is a serious operational insight. The pipeline's commons_governance archetype lists "Monitoring Signal" as a component but does not push the user toward locating monitoring at the choke point. This is an actionable catalog-improvement target: commons_governance should be revised to surface choke-point enforcement as a sub-component or mechanism, possibly with a cross-link to a choke_point_enforcement_design archetype if one is drafted. See §9.

The grader's critique of the pipeline output matched the author's prior self-evaluation. The grader flagged that the equity floor was conceptually specified rather than numerically calculated. The author had independently flagged this in step 6 of the pipeline run as a discipline gap. Convergence between independent self-critique and independent grader critique strengthens the observation: this is a real weakness in the pipeline's enforced discipline, not a critique that can be argued away. This is an actionable tool-design target: the pipeline guide should require numerical floors/thresholds at step 9 invariant specification.

The score gap was 5 points, not a blowout. The bare-CoT output was a competent, well-structured recommendation. It was not strictly dominated. The pipeline's edge was real but bounded and dimension-specific.

5.4 Cost dimension

The pipeline run consumed 17 MCP tool calls plus ~1 minute of author judgment time across the nine steps. The bare-CoT run consumed zero tool calls and produced its output in ~30 seconds of agent time. The pipeline is materially more expensive per problem. Whether the quality gain justifies the cost is a question the formal study should address explicitly; on this pilot, the pipeline took roughly 2 to 4× the agent compute and required human judgment at multiple steps.

6. Interpretation¶

The result is consistent with the hypothesis that catalog-augmented reasoning complements CoT on this class of problem. The pipeline's edge appeared on the dimensions where catalog affordances most plausibly help (coherence, failure-mode anticipation, equity-fit grounded in archetype-documented invariants) and was absent on dimensions where the catalog makes no specific claim. The grader's qualitative reasoning ("the regulator can impose a TAC, but unless the 120 operators believe the rule is legitimate and collectively enforced, they will evade it") echoes the structural argument the catalog's anti-signature framing is designed to surface.

The result does not establish the hypothesis. The pilot is one trial on one canonical problem in a mature catalog domain, with both outputs produced by the same model class, graded by another instance of the same model class, in a study the prototype's implementer designed and analyzed, with a rubric whose dimensions partially overlap with what the catalog explicitly enforces. The pilot should be read as a directional signal that justifies a properly designed study, not as evidence for the cognitive claim itself.

Could the result be misleading?

Plausible alternative explanations for the 5-point gap that have nothing to do with the catalog's structural value:

Grader bias toward structured/list-formatted output. LLM graders are known to over-weight surface structure. The pipeline output is more list-structured (nine numbered components, named invariants, named risks) than the CoT output (four phases with mixed prose and bullets). Some part of the gap may reflect this rather than substance.
Same-model halo effect. A Claude-class grader may systematically prefer Claude-class output formatted in the way the model was trained to produce. Both outputs are Claude-produced; it's possible the grader prefers the more "complete-looking" output regardless of substance.
Author-effort confound. The pipeline output reflects 1 to 2 minutes of careful author work; the CoT output reflects 30 seconds of agent work. Holding effort constant — e.g., giving the CoT agent additional self-elaboration time — might close the gap.
Rubric overlap. As acknowledged, three of six dimensions favor the catalog's affordances. The remaining three would be a fairer subspace; on those alone, the gap is +0 (8+9+8 vs. 8+9+8). This is suggestive but only one number.

These are not refutations of the result; they are alternative hypotheses a properly designed study would need to control for.

7. Limitations¶

Limitations are listed in tiers by severity for inferential claims.

Tier 1: show-stoppers for any inferential claim

N=1. A single problem, a single trial. The result is consistent with the catalog hypothesis; it is also consistent with random variation. With one trial, no statistical interpretation is possible.
The author of this report is also the implementer of the prototype, the operator of the pipeline run, and the producer of the recommendation that was scored. Triple-conflict. Partially mitigated by blinded grading, not eliminated.
No pre-registration. Rubric, dimensions, and scoring approach were chosen by the author. Three rubric dimensions overlap with what the catalog explicitly enforces.

Tier 2: substantial caveats on generalization

Selected mature-domain problem. The fishery case is the textbook commons example. Performance in catalog regions with known coverage gaps (threshold dynamics; ML/AI domains; recently-drafted archetypes with empty action_logic placeholders) might be materially worse.
Same underlying model. Both reasoning agents are Claude-class. The result demonstrates that catalog augmentation improves a Claude-class agent's output on this problem, not that the catalog effect is model-class-independent.
Same-class grader. Systematic biases shared between producers and grader cannot be ruled out. A panel of human experts (or a panel including non-Claude graders) would be a stronger test.
Cost not equalized. The pipeline run consumed roughly 2 to 4× the agent compute of the bare-CoT run. Equal-effort comparisons would be a fairer test.

Tier 3: methodological notes

Pipeline run included author judgment. Steps 1, 4, 5, 6, 8 involved interpretive choices made after seeing catalog structure. A more rigorous design would have an agent execute the pipeline end-to-end without author intervention.
Step 5-6 discipline gap. The meta-model was rendered as ASCII commentary rather than as a strictly typed graph. This is a real weakness; the report's revision recommendations include a tool-design fix.
CoT agent's "no external frameworks" instruction. A real-world CoT user might invoke "Ostrom" by name and pull in surrounding structure. The instruction was a defensible methodological choice for the comparison but slightly hampered the CoT condition.
Only the recommendation was scored, not the reasoning. The pipeline run produced extensive intermediate artifacts (operative prime list, structural model, meta-model, archetype retrieval log). The CoT run produced extensive chain-of-thought. The grader saw only final recommendations.
Catalog under active revision. Wave-2 reconciliation will reclassify some archetypes. Pipeline performance is a function of the catalog's current state.

8. What would constitute evidence¶

A study sufficient to substantiate the cognitive claim would need:

Sample size. 20-40 problems, drawn from a stratified sample across catalog domains (mature, intermediate, immature), to test whether the effect generalizes or is mature-region-specific.
Pre-registered design. Problems, success criteria, rubric dimensions, and analysis plan committed in writing before any results are inspected, hosted in a public registry.
Three or more conditions. Bare CoT, catalog-augmented pipeline, and a prompt-only structural scaffold condition (CoT with a generic "identify primes; map relationships; check failure modes" prompt but no catalog access). The third condition is necessary to distinguish the value of the catalog itself from the value of structured externalization.
Multiple base models. At least two model classes, to test for model-specific effects.
Blinded human-expert graders. A panel of three or more domain experts per problem, blinded to method assignment, with inter-rater reliability reported. Optionally a parallel LLM-grader panel for cost reasons, with calibration against the human panel.
Ablations. Run the pipeline with key catalog features disabled (no anti-signatures, or no documented failure modes) to identify which catalog features carry the work.
Failure analysis. Catalog the problems where the pipeline underperforms CoT and characterize why. The current pilot already has one such observation (the buyer-side enforcement insight); a real study would systematize this.
Cost equalization. Match either compute time or token budget across conditions.
Statistical analysis. Effect sizes, confidence intervals, pre-registered null-hypothesis test.

A study of that size is a 1-3 month effort assuming adequate compute and grader time. It is the right next step if additional pilot trials at low cost continue to show the directional signal.

9. Recommended next steps¶

Grouped by audience and ordered by cost.

For the catalog's principal author (low cost, high information)

Replicate the pilot on 4-5 additional problems. Use the same blinded-grader protocol. Include at least one problem in a known-immature catalog region (e.g., an ML/AI safety problem) and at least one problem with multiple plausible competing archetypes. 1-2 day effort. Will materially update directional confidence in either direction.
Patch the buyer-side enforcement gap. Revise commons_governance to surface choke-point enforcement as a sub-component or named mechanism. Consider drafting choke_point_enforcement_design as an archetype in its own right.
Patch the threshold-dynamics gap. tipping_points_or_phase_transitions returned zero archetypes when queried as a source prime. This is the most urgent coverage gap given how often threshold dynamics appear in serious decisions. Drafting two or three threshold-anchored archetypes (e.g., precautionary-margin design, threshold-aware reserve management, non-linear-risk dose calibration) is a small, high-value addition.

For the prototype's tool design (low-medium cost)

Tighten enforced discipline at steps 5-6 and 9.
Add a submit_meta_model tool that validates a typed graph artifact (nodes, edges, prime annotations) at step 6, returning errors if domain-specific identifiers leak through.
Require numerical floors/thresholds for invariants at step 9. The equity-floor weakness in the pilot would have been caught independently of grader feedback.
Add a cost/compute reporter. Tool-call count and approximate token spend per pipeline run, surfaced in corpus_stats or as a separate pipeline_cost_estimate tool. Necessary input for the formal study's cost-equalization analysis.

For the formal study (medium-high cost)

Design and pre-register the study per §8. Pre-register on OSF or similar. Recruit graders. Select stratified problem set. Budget 1-3 months.
Pilot the prompt-only structural-scaffold condition before running the full study. This is the most important ablation — if a generic structural prompt recovers most of the catalog's gain, the catalog itself contributes less than the protocol. Worth piloting before committing to the full study design.

10. Conclusion¶

The pilot establishes that the catalog-augmented pipeline can be executed end-to-end against a non-trivial structural problem, that the resulting recommendation is structurally coherent and grounded in catalog-documented invariants, and that on this single canonical problem a blinded grader preferred the pipeline-derived recommendation over a bare-CoT recommendation by 50/60 to 45/60.

The size of the caveats is large: N=1, mature-domain selection, same-model agents, same-class grader, non-pre-registered, conducted by the prototype's implementer, with a rubric whose dimensions partially overlap with what the catalog explicitly enforces. The pilot is hypothesis-generating, not hypothesis-testing.

Treated as what it is — a pilot designed to determine whether a properly designed study is worth running — the result is encouraging enough to justify that next-level study. Plus, the pilot surfaced two specific actionable improvements (the choke-point enforcement gap; the equity-floor numerical-specification discipline) that would not have been discovered without running the comparison.

Recommended next move: replicate on 4-5 more problems at low cost; if the directional signal holds, design and pre-register the formal study per §8.

Appendix A: Pipeline-derived recommendation (Recommendation A as scored)¶

This is the cleaned plain-prose recommendation submitted to the blinded grader, derived from execution of the nine-step pipeline and re-rendered with catalog-specific terminology removed.

Approach summary

The diagnosis is that an open-access fishery is being depleted by 120 operators whose individual catch decisions are individually rational but collectively destructive. The recovery path is to convert open access into bounded access governed by a co-management arrangement that the operators themselves help administer, with the regulator as backstop. The intervention has nine functional pieces, designed to work as a system: any single piece in isolation tends to fail.

The nine pieces

Resource boundary. Define the spatial and biological scope: which fishery, which stock, which 120 operators, which downstream buyers. Declare ecosystem dependencies (bycatch, habitat) explicitly.
Access rule. Grandfather the existing 120 operators. Cap new entrants for a defined period (5 years, then reviewed). Prohibit transfer of access outside the cohort during that window.
Use quota. Total allowable catch set at the biologist's MSY-supporting estimate, adjusted downward by 15-20% for the first 3 years to allow rebuilding. Each operator's share is a base equal allocation plus a historical-landings adjustment, with a floor that prevents the bottom quartile from falling below subsistence. The equity floor is non-negotiable.
Monitoring signal. Annual stock assessment plus in-season tracking of catch-per-unit-effort and size-frequency. Mandatory landings reports. Spot-check dockside verification rotated among villagers. Vessel-monitoring on the larger boats. Funding from a small landings levy (~1-2%).
Sanction rule. Graduated: warning → fine equal to 2× the value of the over-catch → 30-day suspension → permanent revocation of access. Designed and applied by the council, not imposed by external authority. Proportionality protects legitimacy.
Replenishment rule. Closed spawning season. Minimum mesh size to protect juveniles. The landings levy also funds a habitat-restoration reserve.
Legitimacy basis. Co-management council seats: proportional representation across the 5 villages, plus a buyer representative, the biologist as a non-voting advisor, and the regulator as backstop authority. Charter ratified by majority of the 120 operators. No single village or operator holds a controlling vote.
Dispute resolution path. Council mediation as the first step. Regulator arbitration as the appeal layer. Courts only as last resort. Documented and public.
Adaptation cadence. Annual TAC review with mandatory recalibration triggers — for example, if a stock indicator crosses a defined warning band, an automatic in-season cut of 10% is imposed without negotiation. A 5-year review of access rules and quota formulas. Standing channel for operators' early-warning observations.

Invariants that must be preserved

Stock viability returning to MSY-supporting biomass within 10 years.
Legitimate access — no quiet privatization that pushes the bottom quartile out.
Proportional accountability — no draconian first-offense penalties; no toothless symbolic rules.
Replenishment is actually funded — the levy must be collected and ring-fenced, not absorbed.
Adaptability — rules can revise as stock and climate conditions change.

Expected outcomes

Stock biomass stabilizes in years 2-3, recovers to MSY-supporting range in years 6-8.
Aggregate fisher income recovers within 4-6 years (lower volume × better price + larger fish offset). Bottom-quartile income protected by the equity floor in years 1-3.
Disputes flow through the council rather than escalating to villages-vs-regulator conflict.

Residual risks and counter-measures

Symbolic rules without enforcement. If the regulator continues its historical pattern of deference, the council's authority collapses. Counter: the charter must give the council formal sanction authority backed by the regulator's enforcement powers, not merely advisory status.
Capture by the largest operators. The 3-4 largest boats could dominate the council. Counter: village-proportional representation cap; vote-weight ceiling on any single village or operator; transparent records; appeal path.
Maintenance burnout. Volunteer dockside monitoring is fragile. Counter: the landings levy funds paid monitor positions, rotated annually so the burden is shared and skills propagate.
Rigid quota failure. TAC set once and never adjusted as stock conditions change. Counter: the in-season trigger band must be a non-negotiable component of the charter, not optional.
Boundary leakage. Buyers may source from neighboring open-access fisheries, undermining the local cohort's restraint. Counter: require council certification on landings; coordinate with neighboring jurisdictions; long-term work toward regional consistency.
Illegitimate exclusion. Entry restrictions become permanent and shut out the next generation. Counter: the 5-year review of access rules with explicit attention to generational entry pathways.

Appendix B: Bare-CoT recommendation (Recommendation B as scored)¶

Produced by a separately spawned Claude-class agent with no tool access and no catalog, instructed not to name external frameworks.

Intervention Design: Sustainable Transition for Coastal Fishery

Phase 1: Immediate Regulation (Months 1-3)¶

Set a binding total allowable catch (TAC) of 80% of current harvest, corresponding to 100% MSY. This is non-negotiable and must be enforced.

Implement landing documentation and buyer licensing: - All fish must be landed at designated ports and recorded by species, weight, and vessel. - Buyers are licensed and required to verify catch documentation before purchase. - This shifts enforcement from 120 dispersed boats to ~12-15 landing/processing points.

Establish a "vulnerability registry": - Survey all 120 operators on annual income, vessel operating costs, alternative employment options, and years in fishery. - Define an income floor (e.g., local subsistence wage + vessel operating cost) and identify operators currently below or within 10% of that floor. - Classify operators into three tiers: low-vulnerability (above floor + 30%), medium-vulnerability (within 10-30%), high-vulnerability (below or within 10%).

Phase 2: Differentiated Allocation and Support (Months 4-18)¶

Allocate the 20% catch reduction non-uniformly: - High-vulnerability operators: 8% catch reduction (receive 92% of baseline allocation). - Medium-vulnerability operators: 18% catch reduction (receive 82% of baseline). - Low-vulnerability operators: 28% catch reduction (receive 72% of baseline).

This is not egalitarian, but it protects the bottom quartile from income collapse while ensuring aggregate reduction to 80% of current catch.

Establish a time-limited income support program: - Operators in the high-vulnerability tier who fall below the defined income floor in years 1-3 receive a subsidy (cash transfer) to reach the floor. - Subsidy is conditional on participation in either: (a) exit from the fishery (with severance equal to 1-2 years of baseline income), or (b) retraining in alternative fishing methods (e.g., demersal handline vs. trawl, or shift to alternative species with less-restrictive quotas). - Program is explicitly temporary (sunsets after year 5) to avoid permanent dependency.

Transparency and stakeholder communication: - Publish monthly landing data, stock assessment updates, and program enrollment. - Hold quarterly forums in each of the 5 villages to discuss progress, listen to operator concerns, and adjust support criteria if unexpected hardship emerges.

Phase 3: Monitoring and Adjustment (Years 2-5)¶

Annual stock assessment: - Conduct biological surveys at year 2 and year 4 to assess biomass recovery trajectory. - If recovery is on track (biomass increasing toward MSY level), maintain the TAC at 80%. - If recovery is slower than expected (suggests higher fishing pressure than reported), reduce TAC further and accelerate income support/exit programs.

Income and employment tracking: - Monitor aggregate fisher income (across the cohort) annually. - If aggregate income has dropped >20% by end of year 3, increase support funds or temporarily relax TAC to stabilize income (accept slower stock recovery within reason). - Track operator exits and retraining outcomes; redirect support from exiting operators to those remaining.

Buyer pricing oversight: - If buyers collude to suppress prices paid to fishers (undercutting the income protection), consider direct subsidies to operators or buyer-licensing penalties.

Phase 4: Long-Term Sustainability (Years 6-10)¶

Phase down income support as (a) operators exit, (b) biomass and catch opportunity recover, and © remaining operators gain efficiency.

Transition to stable quota regime: - Once stock biomass reaches 90% of MSY level (projected by year 7-8), stabilize TAC at 95% of MSY and allow catch to increase incrementally as recruitment strengthens. - Shift from differentiated (vulnerability-based) allocation to either: (i) equal per-boat quota (if political consensus allows), or (ii) tradeable quota system (allowing efficient operators to purchase excess allocation from marginal operators, accelerating voluntary consolidation).

Institutional anchoring: - Formalize buyer-licensing and landing-documentation rules in regulation. - Establish a multi-stakeholder fishery council (operators, buyers, biologists, village representatives) to oversee quota-setting and dispute resolution going forward.

Success Metrics¶

Stock: Biomass trending toward MSY by year 5; above MSY-supporting range by year 10.
Income: Aggregate fisher income drop <20% by year 3; recovery to baseline by year 8-10.
Stability: <5% of high-vulnerability operators fall below income floor after support ends; no involuntary exits due to regulatory shock.
Compliance: >90% of catch documented and landed through official channels by end of year 2.

Risks and Trade-offs¶

Cost: Income support program may require $500K-$2M total (5-year commitment), depending on village demographics and exit participation rates.
Efficiency loss: Differentiated allocation is less efficient than a tradeable-quota system but is more politically palatable and fairer to marginal operators.
Enforcement: Relies on buyer cooperation; if buyers are hostile or collude, enforcement may degrade. Mitigation: regulatory teeth (licensing revocation) and price monitoring.
Timeframe: 10-year recovery is slow but realistic; accepting this timeline avoids panic-driven decisions and allows for adaptive learning.

Appendix C: Grader's full evaluation (verbatim)¶

Recommendation A

Structural coherence: 9/10 — The nine pieces form a tightly coupled system where access rules feed quotas, quotas trigger monitoring, monitoring enables graduated sanctions, and legitimacy (the co-management council) governs the entire cycle; removing any core piece (the equity floor, the council authority, the adaptation cadence) predictably collapses the design.
Failure-mode anticipation: 9/10 — It explicitly names five concrete failure modes (symbolic rules, operator capture, monitoring burnout, rigid quotas, boundary leakage) and proposes specific, testable countermeasures (regulator enforcement teeth, village-proportional representation, paid monitors, trigger bands, regional coordination).
Implementability: 7/10 — The design is ambitious and requires sustained political will from a regulator with a history of deference; the co-management council is achievable in principle but depends critically on the regulator giving it real sanction authority rather than advisory status, which is organizationally difficult.
Mechanism grounding: 8/10 — Most mechanisms are concrete (grandfathering, base + historical allocation with a floor, graduated sanctions, mesh size, spawning closure, levy-funded monitoring) though the "equity floor" is stated conceptually rather than as a calculated number.
Stakeholder/equity fit: 9/10 — The equity floor is non-negotiable and explicit; the proportional representation structure guards against village or operator capture; the design assumes heterogeneity and protects the bottom quartile by design, not as an afterthought.
Threshold/time-horizon awareness: 8/10 — It acknowledges the 4-7 year collapse window by front-loading a 15-20% TAC cut for 3 years and includes automatic in-season triggers; however, it does not explicitly model whether this trajectory clears the collapse threshold or merely delays it, and the 6-8 year recovery is contingent on the council functioning from year 1.

Total: 50/60.

Recommendation B

Structural coherence: 7/10 — The design progresses logically through phases (immediate regulation → differentiated allocation → monitoring → phase-down) but is less tightly coupled; the vulnerability registry informs allocation, monitoring informs adjustment, but pieces can be loosened or removed (e.g., the registry) without the others immediately failing.
Failure-mode anticipation: 6/10 — It names three risks (cost, efficiency loss, enforcement reliance on buyers) and proposes mitigations, but misses deeper structural risks: what happens if the vulnerability registry becomes outdated as conditions change, or if the income-support program creates permanent dependency and political backlash, or if buyers shift supply chains to avoid landing-documentation rules?
Implementability: 8/10 — The design is more regulator-centric and thus requires less community co-governance maturity; the vulnerability registry and differentiated allocation are administratively heavier but more straightforward to execute; buyer licensing shifts enforcement burden away from 120 dispersed boats, which is pragmatically smart.
Mechanism grounding: 9/10 — Nearly every mechanism is concrete and calculable: 80% TAC, three vulnerability tiers, income-floor formulas, specific subsidy rules, exit severance (1-2 years), quarterly forums, annual stock checks, trigger thresholds (90% MSY for phase-out).
Stakeholder/equity fit: 7/10 — The vulnerability registry and income-support floor explicitly protect the bottom quartile in years 1-3, and the differentiated allocation logic is clear; however, the mechanism relies heavily on cash transfers rather than securing livelihoods through access, and the exit/retraining pathway may implicitly incentivize consolidation and permanent exclusion of marginal operators after year 5.
Threshold/time-horizon awareness: 8/10 — It engages with the collapse window by setting an immediate 20% cut, includes year-2 and year-4 stock checks to assess trajectory, and allows adaptive TAC relaxation if recovery lags; projects stock recovery by year 7-8 and catches the tension between income protection and stock recovery, explicitly offering a trade-off.

Total: 45/60.

Grader's qualitative comparison

Structural strategy: Recommendation A pursues co-governance and access-boundary management; B pursues direct regulation with targeted income support. A treats the problem as one of collective-action governance (converting open access to bounded access under community legitimacy), while B treats it as a regulatory-design problem (TAC enforcement + differentiated burden-sharing + market exit for marginal operators).

A excels at embedding legitimacy and durability. Co-management councils with real authority have stronger compliance and adaptability track records than top-down rules. The equity floor is non-negotiable by design, not vulnerable to political erosion. The emphasis on proportional representation and transparent dispute resolution inoculates against capture.

B excels at concrete, immediate actionability. It requires no prior community-governance infrastructure; buyer-side enforcement is simpler than decentralized dockside monitoring; vulnerability tiers are administratively cleaner than grandfathering rules; the phased approach and explicit budget ($500K-$2M) are easier for a regulator to cost and resource.

A is missing a calculated equity floor; B is missing a long-term political-economy story for what happens when the income-support program sunsets at year 5 if the stock is only at 60% of MSY.

Grader's overall judgment

I would adopt Recommendation A, with the modification that the equity floor must be calculated and explicitly budgeted as part of the charter. The 4-7 year collapse window and the open-access structure mean the intervention will fail unless it changes the operators' incentive structure and decision-making authority, not just the catch total. B's strength (administrative clarity) is also its weakness: a regulator can impose a TAC, but unless the 120 operators believe the rule is legitimate and collectively enforced, they will evade it. Co-management is slower to set up but far more durable.

Appendix D: Tool-call log¶

Approximate counts from the live Cowork session:

14 × search_prime (across queries: commons, carrying capacity, overshoot, externality, collective action, discounting, threshold, renewable resource, tipping point, feedback, monitoring, trust, property rights, equity)
4 × find_archetypes_for_prime (tragedy_of_the_commons, externality, tipping_points_or_phase_transitions, collective_efficacy)
3 × get_archetype (commons_governance, externality_internalization, public_goods_provision)
2 × get_reasoning_pipeline_guide (one summary, one full)
1 × corpus_stats

Total: ~24 MCP tool calls. The "17" cited in §4.1 reflects an earlier count made during reasoning; the ~24 figure here is the more accurate post-hoc total.

End of report.