Learnability and Curriculum Construction over a Prime Catalog

A tiering of the primes by first-encounter difficulty, with the algorithm and its honest ceiling

A conceptual paper for the Encyclopedia of Abstractions project

Abstract

A catalog of 654 cross-domain primes is not a curriculum. To turn the catalog into something a learner can actually walk through, the primes need an ordering: a sequence in which the easier ones are met before the harder ones, with prerequisites honored. This paper describes how the Encyclopedia constructs that ordering. The approach is a difficulty-weighted topological sort that combines three kinds of evidence: a per-prime learning-age assessment derived from a triangulated LLM "ELI ladder" (explanations at 5, 10, 15, 18, and specialist levels), a set of word-level signals over the prime's slug and catalog text (Kuperman age-of-acquisition norms, SUBTLEX frequency, Brysbaert concreteness, Flesch-Kincaid-style readability), and the catalog's own typed prerequisite DAG, honored as a hard topological constraint. The output is a single linear order chunked into five display tiers, with the top tier holding the most intuitive umbrella primes and the bottom tier reserved for the small set of primes where no faithful kindergarten explanation exists. This paper says what each signal contributes, what it misses, and where the algorithm hits an honest ceiling that no purely-objective signal can break past.

1. What we mean by "learnability" here

"Learnability" in this catalog does not mean the amount of formal work the concept does, nor how foundational the prime is to other primes. A prime can be foundational and easy (a kindergartener has an intuition for trust long before encountering the word), or formal and shallow (a confidence interval presupposes nothing structurally heavy but cannot be honestly explained to a kindergartener without misrepresenting it as a posterior probability). What we want to measure is first-encounter difficulty for a learner: when this prime is the next thing a teacher would introduce, how hard is it to honestly land?

That framing has two consequences worth stating up front.

First, learnability is partly orthogonal to distinctiveness and partly orthogonal to structural↔framed character. A prime can be intuitive and crowded (trust sits near belief, expectation, reliance); it can be hard and distinctive (conjugate variables has no near-twins in the catalog). The three properties answer different questions about a prime: how easy is it to find? (distinctiveness), how much does the home domain travel with it? (structural/framed), and how soon can a learner meet it without being lied to? (learnability).

Second, learnability is partly compositional. If a prime is a structurally-derived compound — tragedy of the commons, say, composing individual rationality, collective action, commons, and free-riding — then teaching it before its prereqs forces the teacher to either hide the composition (a lie) or briefly cover all four (which is just teaching the prereqs first, badly). The catalog already encodes those prereqs as the typed hierarchy DAG; a faithful curriculum must respect them. So learnability scoring has two layers: an intrinsic per-prime difficulty estimate and a topological floor from the DAG. A prime cannot appear in the curriculum before all of its DAG ancestors, regardless of how simple the prime itself looks.

2. The intrinsic difficulty signal: what we tried and what worked

The intrinsic component is the one that doesn't fall out for free from the catalog's structure. We need a per-prime number that estimates first-encounter difficulty. We tried several signal families before settling on the current mix:

Word-level signals over the slug. The prime's slug (tragedy_of_the_commons, attention, nonparametric_methods) carries information about how an English speaker would first encounter the concept by name. Kuperman et al. 2012's age-of-acquisition norms^[1] give a per-word age in years; Brysbaert et al.'s 2014 concreteness ratings^[2] give a 1–5 score; SUBTLEX subtitle counts^[3] give a log-frequency. Aggregating these over the slug's content words gives a first-pass per-prime estimate: aggregation registers as a late-acquired, low-frequency, abstract word; wave registers as early-acquired, high-frequency, concrete.
Readability over the catalog's prose. The prime's own core_idea block is several hundred words long; running a Flesch-Kincaid-style proxy over it gives a sense of how dense its catalog explanation is. Dense explanations don't necessarily indicate hard concepts (some primes are over-articulated in the catalog), but they correlate.
Structural-signature complexity. The ## Structural Signature section of each v2 prime lists role-phrases — the named structural roles the prime decomposes into. The count is a coarse Sweller-style element-interactivity proxy: a prime that explicitly factors into 7 roles is structurally heavier than one that factors into 2.
An "ELI ladder" of explanations. This was the most labor-intensive piece. For every one of the corpus's ~650 primes, three independent LLM agents produced a five-level ladder of explanations: kindergarten (ELI5), 5th-grade (ELI10), high-school (ELI15), college (ELI18), and specialist, plus a "kid-friendly name" at the first three. A fourth judge agent picked the best of three at each level, or marked the level N/A if at least two generators independently said no faithful explanation at that level was possible. The lowest level at which a faithful explanation survived this consensus is lowest_valid_eli_level. The triangulation matters: a single generator left to its own devices will almost always force an explanation through, even when the result misrepresents the concept; requiring two of three to agree on N/A is what surfaces the genuinely-unteachable cases.
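
The consensus rule can be sketched as follows. This is a minimal illustration, not the project's actual script: it assumes each generator's ladder is a mapping from level name to explanation text, with None standing for that generator's N/A, and it omits the judge's best-of-three selection. The names LEVELS, na_quorum, and the function signature are hypothetical.

```python
# Level names in ascending order of audience age (naming is an assumption).
LEVELS = ["ELI5", "ELI10", "ELI15", "ELI18", "specialist"]


def lowest_valid_eli_level(ladders, na_quorum=2):
    """Return the lowest level that survives the N/A consensus.

    ladders: one dict per generator, mapping level name -> explanation
             text, or None when that generator declared the level N/A.
             A missing key is treated the same as None.
    na_quorum: how many generators must say N/A for the level to be
               marked N/A (two of three in the text above).
    """
    for level in LEVELS:
        na_votes = sum(1 for ladder in ladders if ladder.get(level) is None)
        if na_votes < na_quorum:
            return level
    # Every level was voted N/A; no faithful explanation survived.
    return None
```

Under this rule a single dissenting generator cannot push a prime out of the ELI5 stratum; it takes two independent N/A votes at a level to do so.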

What the ladder discovers, and what it doesn't. The ladder turns out to do exactly one thing extraordinarily well: it cleanly identifies the K-unteachable stratum. Across 654 primes, 9 had ≥2 of 3 generators independently mark ELI5 as impossible: confidence_intervals, conjugate_variables, dialectics, entanglement, gauge_invariance_gauge_symmetry, grand_narrative_metanarrative, historical_determinism, historicism, indifference_curves. Each fails for principled reasons (confidence intervals collapse into the canonical posterior-probability misreading; entanglement collapses into classical hidden-state correlations that Bell tests rule out; dialectics collapses into "people arguing"). These nine constitute their own tier — Tier 5, the K-unteachable.

What the ladder does not do well is discriminate among the other 645. 98.6% of primes pass the ELI5-faithfulness bar, so the level itself cannot sort the remaining 645 into 645 ordered positions. The content of the ladder (the actual ELI5/10/15 explanations) is the more valuable artifact, since it is directly usable as teaching content on each prime's web page, but as a signal for ordering primes within the kindergarten-teachable majority, the ladder hits a ceiling.

So the production scorer leans on the word-level signals for within-tier ordering, with the ladder's contribution being the explicit Tier 5 boundary. The current weights:

Max Kuperman AoA over slug tokens — 0.45 (primary: hard-to-acquire concept-names indicate hard concepts)
Catalog core_idea readability — 0.25
Min SUBTLEX log-frequency over slug tokens (negated) — 0.15
Min Brysbaert concreteness over slug tokens (negated) — 0.05
Structural-signature element count — 0.10
Plus: ELI level tier multiplier (100 × lowest_valid_eli_level) — promotes the 9 K-unteachable primes into Tier 5 regardless of within-level signal
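
A hedged sketch of how these weights might combine into one number per prime. Several things here are assumptions rather than facts from the scorer: that each signal has already been scaled to the range 0 to 1, that "negated" means a sign flip, that the ELI term is added as an offset using a level index of 0 for ELI5 and 1 for ELI10, and all function and key names. Slug tokens absent from a norms table are simply skipped, as a stand-in for restricting to content words.

```python
# Weights as listed above; the keys are hypothetical names for the signals.
WEIGHTS = {
    "aoa_max": 0.45,           # max Kuperman AoA over slug tokens
    "readability": 0.25,       # core_idea readability proxy
    "freq_min": 0.15,          # min SUBTLEX log-frequency (negated)
    "concreteness_min": 0.05,  # min Brysbaert concreteness (negated)
    "role_count": 0.10,        # structural-signature element count
}
NEGATED = {"freq_min", "concreteness_min"}


def slug_extreme(slug, norms, pick):
    """Aggregate a word-level norm over a slug's tokens with max or min.

    norms: dict word -> value (a hypothetical lookup table).
    pick: the builtin max or min.
    Returns None when no token of the slug appears in the table.
    """
    values = [norms[token] for token in slug.split("_") if token in norms]
    return pick(values) if values else None


def difficulty(signals, eli_level_index):
    """Weighted sum of pre-normalized signals plus the ELI-level offset."""
    within_level = sum(
        weight * (-signals[name] if name in NEGATED else signals[name])
        for name, weight in WEIGHTS.items()
    )
    # With signals in [0, 1] the weighted sum stays between -0.2 and 0.8,
    # so an offset of 100 per level always dominates it.
    return within_level + 100 * eli_level_index
```

With this shape, any prime whose lowest valid level is above ELI5 sorts after every ELI5-teachable prime, whatever its word-level signals say.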

3. The algorithm: Kahn's topological sort with a priority queue

With per-prime difficulty in hand, the curriculum has to be an ordering that also respects the DAG. The algorithm is Kahn's topological sort^[4] with a difficulty-priority min-heap:

Initialize a count of unsatisfied parent edges per prime. Primes with no parents (DAG roots) start with count 0.
Push every count-0 prime into a min-heap keyed by intrinsic difficulty. The cheapest (most intuitive) root sits at the top.
Repeatedly: pop the cheapest available prime; append it to the curriculum order; decrement the parent count of each of its children. When a child's count reaches 0 (all its prereqs have appeared), push it into the heap.

This guarantees two properties:

No prime appears before any of its DAG ancestors. A learner reading the curriculum in order never encounters a concept whose prereqs haven't been introduced. This is the catalog's typed prerequisite structure honored directly.
Among primes available at any given point, the easiest one comes next. The intrinsic difficulty estimate breaks ties, and does most of the curriculum-ordering work, among primes whose DAG prereqs are all met.
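
The three steps above can be sketched with Python's standard heapq module. This is an illustrative version under stated assumptions, not the production scorer: the input shapes (a dict of scores and a dict of prerequisite lists), the function name, and the choice to break exact score ties alphabetically by slug are all hypothetical.

```python
import heapq


def curriculum_order(scores, parents):
    """Kahn's topological sort with a min-heap keyed by difficulty.

    scores: dict prime -> intrinsic difficulty (lower means easier).
    parents: dict prime -> iterable of prerequisite primes; primes that
             are absent from this dict are treated as DAG roots.
    """
    children = {prime: [] for prime in scores}
    unmet = {prime: 0 for prime in scores}
    for child, prereqs in parents.items():
        for parent in set(prereqs):
            children[parent].append(child)
            unmet[child] += 1

    # Seed the heap with the roots. Tuples compare by score first, then
    # by slug, which keeps the output deterministic on ties.
    heap = [(scores[p], p) for p in scores if unmet[p] == 0]
    heapq.heapify(heap)

    order = []
    while heap:
        _, prime = heapq.heappop(heap)
        order.append(prime)
        for child in children[prime]:
            unmet[child] -= 1
            if unmet[child] == 0:  # all prereqs have now appeared
                heapq.heappush(heap, (scores[child], child))

    if len(order) != len(scores):
        raise ValueError("prerequisite graph contains a cycle")
    return order
```

Note that a very easy child still waits for a hard parent: its score only matters once it enters the heap, which is the topological floor described in §1.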

The output is a single linear sequence over all 654 primes. To present it, we chunk it into five display tiers. Tier 5 is special-cased to hold exactly the K-unteachable primes (the ELI10+ stratum). Tiers 1–4 split the remaining ELI5-teachable primes evenly (~160 each). The tier numbers are a display convenience; the load-bearing thing is the global linear order.
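
A small sketch of the chunking step, assuming the K-unteachable primes are supplied as a set and that any remainder from the even split goes to the earliest tiers (the source says only "evenly"; the remainder rule and all names are assumptions).

```python
def assign_tiers(order, k_unteachable, n_teachable_tiers=4):
    """Map each prime in the linear order to a display tier.

    order: the global linear curriculum order (list of primes).
    k_unteachable: set of primes special-cased into the last tier.
    Returns a dict prime -> tier number (1-based).
    """
    last_tier = n_teachable_tiers + 1
    tiers = {p: last_tier for p in order if p in k_unteachable}

    # Split the remaining primes into contiguous, near-equal chunks,
    # preserving their relative position in the global order.
    teachable = [p for p in order if p not in k_unteachable]
    base, extra = divmod(len(teachable), n_teachable_tiers)
    start = 0
    for tier in range(1, n_teachable_tiers + 1):
        size = base + (1 if tier <= extra else 0)
        for prime in teachable[start:start + size]:
            tiers[prime] = tier
        start += size
    return tiers
```

For 654 primes with 9 set aside, this rule yields tier sizes of 162, 161, 161, and 161, in line with the "~160 each" figure.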

4. What this produces: a quick portrait

The current corpus snapshot (654 primes, v0.7.0 of the scorer) lands roughly as follows:

Tier 1 opens with wave, time, deep_time, in_group_out_group, flow, attention, self_organization, authority, figure_ground, competition, frame_of_reference, role, trust, exchange — concepts whose slug names are early-acquired words and whose structural signatures are small enough that a learner can carry them in one mental gesture.
Tier 4 holds the densest ELI5-teachable primes — concepts whose slug names are late-acquired or whose structural signatures factor into many roles. These are the "hardest things still teachable to a kindergartener with a careful analogy."
Tier 5 holds exactly 9 primes — the K-unteachable stratum named in §2.

A spot-check against an experienced curator's intuition lands ~75% of placements close to where they "should" be by gut. The remaining ~25% have a few recognizable failure patterns, each of which the scoring deliberately does not try to fix.

5. The honest ceiling: what this scoring cannot do, and why we accept it

Three classes of failure recur regardless of how we tune the weights:

(a) The word-vs-concept gap. Some primes are named with simple words that have specialized abstract meanings (wave as oscillation pattern, pipeline as a sequence-of-stages, flow as a state of absorbed engagement). The slug-AoA signal reads them as easy because the word is acquired early. Other primes are named with technical words for intuitive concepts (aggregation for "combining things together," observability for "can you see what's going on inside"). The signal reads them as hard because the word is late-acquired. Both are the same bug: the slug name is not the concept. The signal has no way to distinguish them.

(b) DAG-vs-intuition conflicts. A curator's intuition might say texture (which has parents in the DAG) is more intuitive than frame of reference (a DAG root). Any algorithm that honors the topological constraint will put frame of reference first — its prereqs are empty. To satisfy the intuition, we would have to violate the catalog's own statement of prerequisite structure, which would break the basic invariant that a curriculum doesn't introduce concepts before their prereqs. So we don't.

(c) Catalog development depth ≠ concept difficulty. A prime whose core_idea is densely written may simply be a prime the catalog has elaborated heavily, not a prime that's intrinsically harder than its sparser-described siblings. The readability signal can't tell these apart.

We accept these three failure classes rather than fight them. The reasons:

Hand-curating the order is brittle as the catalog grows. Any per-prime tier override is a piece of work that has to be revisited every time a new prime is added or an existing one is renamed. The catalog adds new primes regularly (DP-56, DP-57, …); the curriculum needs to be re-runnable from scratch from the corpus + DAG + ELI ladder without manual intervention.
The remaining misfires are recognizable but not corpus-corrupting. A learner who encounters observability in Tier 2 instead of Tier 1 is not misled about anything — the prime's content is still right; only its placement is mildly off. The curriculum is an aid, not a hard sequencing requirement.
An honest signal is more valuable than a tuned one. If the curriculum is going to be wrong about ~25% of primes' intuitive ordering, far better to be wrong in legible ways the algorithm can be reasoned about than in arbitrary ways that emerge from per-prime adjustments. The three failure classes above are predictable from the slug, the DAG, and the catalog prose — anyone reading the curriculum can see where the algorithm is doing well and where it's bumping into a known ceiling.

6. Anti-self-grading discipline

The curriculum scorer makes no LLM calls. Every signal it uses comes from one of three sources: the catalog's curated content (DAG depth, structural-signature counts), human-derived psycholinguistic norms (Kuperman AoA, Brysbaert concreteness, SUBTLEX frequency), or pre-existing artifacts whose creation was logged separately (the ELI ladder). The LLM's role was bounded to content generation under tight prompt constraints, with a separate judge agent enforcing accuracy against the catalog's own core_idea. The actual scoring of that content uses human norms, not LLM judgment. This preserves the project's standing discipline that the same model class should not be both the source and the grader of a difficulty signal.

The ELI ladder generation does involve LLM judgment (the 2-of-3 N/A consensus that defines the K-unteachable tier), but that judgment is bounded: we don't ask the model "how hard is this prime"; we ask it to produce an explanation at a given level, and we use a consensus failure to produce one as the difficulty signal. The discipline is to never let the LLM directly rate difficulty, and always to derive the signal from what the LLM either could or could not produce, judged by accuracy against catalog content rather than by self-reported confidence.

7. Position in the broader project

The curriculum joins three other conceptual axes along which the catalog is organized:

Structural ↔ framed describes how much of a prime's home domain travels with it when it's transported.
Distinctiveness describes how findable a prime is given a faithful description.
Hierarchy (typed DAG) describes the prereq, composition, and decompose relationships among primes.
Learnability (this paper) describes the order in which a learner should meet the primes.

The four are independent in the same way distinctiveness and structural-framed character are independent: knowing a prime is structural doesn't tell you how findable it is, and knowing it's findable doesn't tell you how soon a learner can meet it without being misled. A prime's page on the site reports each property separately so the reader can form their own picture.

8. Reproducibility

The full pipeline — the scorer, the ELI ladder generation and consolidation scripts, the canonical ladder, and a README — is published in the implementation bundle (same artifacts the live site is built from). Raw per-batch provenance is preserved internally as ~260 generator and judge JSON files. The scorer's output is rebuilt deterministically from scratch on each run; the per-batch process is summarized in §8 of From Candidate to Catalog, the public methodology companion.

The bundle, comprising all five scripts, this conceptual paper, and a README, can be downloaded as eoa_learnability_implementation.zip.

The numerical thresholds and weights documented in §2 are the ones in production as of v0.7.0. They were arrived at empirically: 12 weight schemes were tested against a 16-test intuition-acceptance harness, and the scheme reported in §2 produced the highest score (12/16). The four remaining acceptance failures map cleanly onto the three failure classes in §5 and are the algorithm's honest ceiling.
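
The selection procedure can be pictured with a sketch like the one below. The form of the acceptance tests is not stated in this paper, so the sketch assumes each test is a pairwise expectation of the form "this prime should appear before that one"; that assumption, the function names, and the input shapes are all hypothetical.

```python
def acceptance_score(order, expectations):
    """Count how many (earlier, later) expectations an ordering satisfies.

    order: a linear curriculum order (list of primes).
    expectations: list of (earlier, later) pairs of primes, both of which
                  must appear in the order.
    """
    position = {prime: index for index, prime in enumerate(order)}
    return sum(
        1 for earlier, later in expectations
        if position[earlier] < position[later]
    )


def best_scheme(orders_by_scheme, expectations):
    """Return the name of the weight scheme whose order scores highest.

    orders_by_scheme: dict scheme name -> curriculum order produced by
                      running the scorer with that scheme's weights.
    On ties, the first scheme in the dict's iteration order wins.
    """
    return max(
        orders_by_scheme,
        key=lambda name: acceptance_score(orders_by_scheme[name], expectations),
    )
```

Keeping the harness as a fixed list of expectations means every candidate scheme is judged against the same curator intuitions, and the failures that remain can be inspected one by one.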

References

[1] Kuperman, Victor, Hans Stadthagen-Gonzalez, and Marc Brysbaert. "Age-of-Acquisition Ratings for 30,000 English Words." Behavior Research Methods 44, no. 4 (2012): 978–990. Crowd-sourced norms giving the age at which each of 30,000 English words is typically learned; the per-word difficulty signal aggregated over a prime's slug.

[2] Brysbaert, Marc, Amy Beth Warriner, and Victor Kuperman. "Concreteness Ratings for 40 Thousand Generally Known English Word Lemmas." Behavior Research Methods 46, no. 3 (2014): 904–911. Human abstract-to-concrete ratings (1–5) for 40,000 words; the signal that flags abstract concept-names as harder to first encounter.

[3] Brysbaert, Marc, and Boris New. "Moving beyond Kučera and Francis: A Critical Evaluation of Current Word Frequency Norms and the Introduction of a New and Improved Word Frequency Measure for American English." Behavior Research Methods 41, no. 4 (2009): 977–990. Introduces SUBTLEX-US, word-frequency norms drawn from film and television subtitles; the source of the subtitle log-frequency signal.

[4] Kahn, Arthur B. "Topological Sorting of Large Networks." Communications of the ACM 5, no. 11 (1962): 558–562. The classic algorithm for linearizing a directed acyclic graph consistent with its edges; adapted here with a difficulty-priority queue so the curriculum honors prerequisites while preferring easier primes.