Testable Predictions
The paper is published and set in stone. This page is the living layer - what we conjecture now, what evidence has accumulated, and what experiments we run. Each prediction carries a quantitative falsification anchor from the paper and an honest assessment of where it stands today.
The table is the paper's contract with the reader. Print it, run the measurements, mark the rows.
Evidence Tiers
| Tier | Prediction | Section | Falsification Anchor |
|---|---|---|---|
| BUILD | 13.1 Scene disambiguation | I | 30 point gap |
| TEST | 13.2 Episode reconstruction | II | 20 point gap |
| LIVED | 13.3 Revenue localisation | III | 10% of unattributed revenue |
| BUILD | 13.4 Tick settling | IV | 2-5 tick convergence; ~10% RMS of minimum-jerk |
| LIVED | 13.5 Four shape composition | V | 9/10 queries above threshold |
| TEST | 13.6 Temporal reasoning | VI | 8/10 correct |
| TEST | 13.7 Episode handover | VII | 80% vs 50% continuity |
| BUILD | 13.8 Fable fidelity | VIII | 70% structural, 50% tonal |
| BUILD | 13.9 Flock versus homunculus | IX | 30% adversarial gap |
| BUILD | 13.10 Three button cell | X | 40% mistake reduction |
| BUILD | 13.11 Structural kindness | XI | 50 point dimensional preservation gap |
| BUILD | 13.12 Aggregate | XII | All of the above |
Prediction Detail
13.1 Scene Disambiguation BUILD
A parameter-matched LLM with a four-dimensional context store will disambiguate the cat-on-the-mat-with-horror example at least thirty percentage points better than a flat context window baseline.
30-point gap
13.2 Episode Reconstruction TEST
A full Episode storage shape will reconstruct a hundred-sample scene at least twenty points more accurately than a flat context window. Ordering: flat < vector < graph < Episode.
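The two conditions in 13.2 (the ordering and the twenty-point gap) can be checked together. What follows is a minimal sketch, assuming accuracy is reported as a percentage per storage shape. The function name, the dictionary keys, and the scores in the example are illustrative assumptions, not taken from the paper.

```python
# Storage shapes in the order the prediction expects accuracy to rise.
ORDER = ["flat", "vector", "graph", "episode"]

def check_episode_reconstruction(scores, min_gap=20.0):
    """Return (ordering_holds, gap_holds) for prediction 13.2.

    scores maps each storage shape to reconstruction accuracy in percent.
    """
    values = [scores[name] for name in ORDER]
    # Strict ordering: flat < vector < graph < episode.
    ordering_holds = all(a < b for a, b in zip(values, values[1:]))
    # Gap is measured in percentage points between Episode and flat.
    gap = scores["episode"] - scores["flat"]
    return ordering_holds, gap >= min_gap

# Made-up scores, for illustration only.
example = {"flat": 41.0, "vector": 52.0, "graph": 58.0, "episode": 63.0}
ordering_ok, gap_ok = check_episode_reconstruction(example)
```

Reporting the two results separately keeps a failed ordering distinguishable from a failed gap, which matters if the boundary between survival and failure is itself the research question.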
20-point gap
13.3 Revenue Localisation LIVED
In a compound enterprise with three or more legacy policy admin systems and a warehouse on top, graph-as-referent will locate at least ten percent of previously unattributed revenue within sixty days.
10% unattributed revenue
13.4 Tick Settling vs Minimum-Jerk BUILD
A three-floor derivative stack will converge its vote within two to five ticks on a reaching task, regardless of tick rate. The trajectory will approximate Flash and Hogan's minimum-jerk profile within ~10% RMS error.
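For readers who want to score a trajectory against the anchor, here is a minimal sketch of the comparison. The minimum-jerk position profile is the standard Flash and Hogan polynomial, x(τ) = x0 + (xf − x0)(10τ³ − 15τ⁴ + 6τ⁵) with τ running from 0 to 1. Two things are assumptions of this sketch rather than statements from the paper: that samples are evenly spaced over the movement duration, and that the RMS error is normalised by the movement amplitude |xf − x0|. The function names are illustrative.

```python
import math

def minimum_jerk(x0, xf, n):
    """Minimum-jerk positions at n evenly spaced samples (n >= 2)."""
    positions = []
    for i in range(n):
        tau = i / (n - 1)  # normalised time in [0, 1]
        s = 10 * tau**3 - 15 * tau**4 + 6 * tau**5
        positions.append(x0 + (xf - x0) * s)
    return positions

def relative_rms_error(observed, x0, xf):
    """RMS deviation from the minimum-jerk profile.

    Assumption: normalised by the movement amplitude |xf - x0|,
    so 0.10 corresponds to "10% RMS".
    """
    reference = minimum_jerk(x0, xf, len(observed))
    mse = sum((o - r) ** 2 for o, r in zip(observed, reference)) / len(observed)
    return math.sqrt(mse) / abs(xf - x0)

# A straight-line ramp from 0 to 1, scored against the profile.
ramp = [i / 10 for i in range(11)]
ramp_error = relative_rms_error(ramp, 0.0, 1.0)
```

Under this normalisation a trajectory passes the second half of the anchor when the returned value is at or below roughly 0.10. The tick-convergence half needs a separate count of ticks until the vote stops changing.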
2-5 tick convergence + ~10% RMS
13.5 Four Shape Composition LIVED
On ten canonical queries (flat aggregates, multi-hop traversals, semantic similarity, raw payload), the four-shape composition will answer at least nine of the ten above threshold. No single shape exceeds 7/10.
9/10 queries
13.6 Temporal Reasoning Under Ledger TEST
On ten temporal reasoning tasks, a ledger-equipped system will answer at least eight correctly. Without a ledger, at most four.
8/10 vs 4/10
| Benchmark | Tests | Baseline |
|---|---|---|
| CounterBench | Counterfactual inference (1K causal graph questions) | LLMs near random-guessing |
| TempoBench | Multi-step temporal logic automata | Sharp difficulty scaling |
| TDBench | Bitemporal SQL, validity windows | Domain-specific |
| TemporalBench | Past vs present state distinction | Weak context-aware reasoning |
13.7 Episode Handover TEST
On scenes with 5+ participants, 20+ turns, and non-trivial emotional tone, Episode-backed handover preserves continuity above 80%. Transcript paste falls below 50%.
80% vs 50%
13.8 Fable Round-Trip Fidelity BUILD
A well-authored Fable at 1:100 compression, given to a receiver with the compression context, reconstructs the Episode with 70%+ structural fidelity and 50%+ tonal fidelity. Without context, below 30%.
70% structural, 50% tonal
13.9 Flock vs Homunculus BUILD
A hundred-voter Flock settles within 2-5 ticks, produces minimum-jerk trajectories, matches a homunculus on decision quality, and exceeds it by 30% on adversarial robustness.
30% adversarial gap
13.10 Three-Button Coercion Resistance BUILD
In a hundred forced-mistake stimuli, a three-button cell (Act, Dismiss, Ask-sibling) reduces mistakes by 40% vs a two-button cell, with full dissent preservation and scale-consistent behaviour.
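Unlike the point gaps elsewhere on this page, the 13.10 anchor reads most naturally as a relative reduction measured against the two-button baseline. The sketch below makes that reading explicit. The relative interpretation is an assumption of this sketch, and the function names and counts are illustrative.

```python
def mistake_reduction(two_button_mistakes, three_button_mistakes):
    """Relative reduction in mistakes, as a fraction of the two-button count."""
    if two_button_mistakes <= 0:
        raise ValueError("two-button baseline must record at least one mistake")
    return (two_button_mistakes - three_button_mistakes) / two_button_mistakes

def meets_anchor(two_button_mistakes, three_button_mistakes, anchor=0.40):
    """True if the three-button cell cuts mistakes by at least the anchor."""
    return mistake_reduction(two_button_mistakes, three_button_mistakes) >= anchor

# Made-up counts out of a hundred stimuli, for illustration only.
passes = meets_anchor(50, 30)
```

If the paper intends an absolute gap in percentage points instead, the comparison becomes a plain subtraction of the two mistake rates. Either way, dissent preservation and scale consistency still need their own checks.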
40% mistake reduction
13.11 Structural Kindness BUILD
On a hundred ethically loaded decisions, a Diorama architecture preserves dimensional content 80%+ of the time. A flat architecture preserves it below 30%. The falsification anchor is the fifty-point gap between the two.
50-point gap
Historical Illustration: The Slater Precedent
In 1789, Samuel Slater memorised the design of Richard Arkwright's textile machinery in Derbyshire and emigrated to Rhode Island with nothing but the shape in his head. He succeeded because the receivers - Moses Brown and the Pawtucket merchants - already had the substrate: business understanding, employment structures, the capacity to negotiate change. Their existing knowledge was the free inference. The machinery was a compressed representation that decompressed against their context.
The same industrial knowledge produced two architectures with different structural properties.
The Rhode Island System (Slater's mills): small, family-based, village-scale. Workers were families with names, skills, community ties. The architecture preserved dimensional content by default - not because Slater was kind, but because the structure was too small and too embedded to flatten people into labour units without consequences the owner could see.
The Waltham-Lowell System (Francis Cabot Lowell, 1814 onwards): large-scale factory towns. Initially preserved worker dimensionality - the "mill girls" had boarding houses, lending libraries, a literary magazine (the Lowell Offering), lectures. Then the architecture flattened. By the 1840s: longer hours, lower wages, speedups, child labour. The libraries stayed but the decisions no longer consulted them. The decision architecture had no structural resistance to ignoring dimensional content when quarterly profit became the single axis.
The cruelty was not a decision. It was an architectural consequence. The Lowell system had every hortatory mechanism - moral codes, boarding house rules, a magazine giving workers a voice. What it lacked was structural resistance to flattening when economic pressure arrived. Section XI argues: "Kindness is not a property that can be reliably installed by exhortation alone on a substrate that is geometrically indifferent to it." The fifty-point gap is not only a hypothesis about the future. It is an observation about 1840.
13.12 Aggregate BUILD
Run the full benchmark suite. Observe all gaps simultaneously. Any single failure kills the aggregate.
All of the above
Benchmark Mapping
Several predictions can be tested against benchmarks that already exist. We name them so a reader who wants to attack a specific prediction knows where to start.
| Prediction | Benchmark | Tests | Baseline |
|---|---|---|---|
| 13.2 | LoCoMo | Recall, multi-hop, structured retrieval | Mem0 66.9%, MIRIX 85.4% |
| 13.5 | LongMemEval | Retrieval from complex histories | Oracle ~92%; commercial 30% drop |
| 13.6 | CounterBench | Counterfactual inference (1K questions) | Near random-guessing |
| 13.6 | TempoBench | Multi-step temporal logic | Sharp difficulty scaling |
| 13.6 | TDBench | Bitemporal SQL queries | Domain-specific |
| 13.6 | TemporalBench | Past vs present distinction | Weak context-aware reasoning |
| 13.7 | LoCoMo | Cross-session continuity | MemGPT 74%, Synapse F1 40.5 |
| 13.12 | AMA-Bench | Long-horizon agent memory | AMA-Agent 57.2% |
Honest Disclosure
Two predictions (13.3 and 13.5) are retrodictions - observations of systems already in operation, dressed as predictions. They are included because the measurement protocol applies prospectively to new instances, but the observation preceded the prediction in both cases.
How to Read This Table
The twelve predictions form a tight web of falsification. Any one of them can be attacked in isolation, in which case the framework fails at that prediction and survives in reduced form at the others.
Clean pass. All twelve hold. The framework earns further testing at larger scale.
Partial pass. Some hold, some fail. The boundary between survival and failure becomes the new research question.
Clean fail. A majority fail. The framework becomes a cautionary example with clear falsification criteria - still more useful than a right paper with vague ones.
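The three outcomes above amount to a simple decision rule over the marked rows. Here is a minimal sketch, assuming each prediction has been marked as held or failed and that "majority" means strictly more than half. The function name and the shape of the results dictionary are hypothetical.

```python
def classify_outcome(results):
    """Classify a marked table as clean pass, partial pass, or clean fail.

    results maps a prediction id (for example "13.1") to True if it held.
    """
    if not results:
        raise ValueError("no predictions have been marked")
    total = len(results)
    failed = sum(1 for held in results.values() if not held)
    if failed == 0:
        return "clean pass"
    if failed > total / 2:  # assumption: majority means strictly more than half
        return "clean fail"
    return "partial pass"

# Made-up marks for illustration: eleven rows hold, one fails.
marks = {f"13.{i}": True for i in range(1, 13)}
marks["13.4"] = False
outcome = classify_outcome(marks)
```

The sketch treats every row as marked independently by whoever runs the measurements. It does not derive the 13.12 aggregate from the other eleven rows.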
The Invitation
This is a research programme, not a proof. Readers are invited to build, measure, and report.
Living document. Last updated 1 June 2026. Evidence tiers, benchmark mapping, and historical illustrations added following adversarial review.