Testable Predictions

The paper is published and set in stone. This page is the living layer - what we conjecture now, what evidence has accumulated, and what experiments we run. Each prediction carries a quantitative falsification anchor from the paper and an honest assessment of where it stands today.

The table is the paper's contract with the reader. Print it, run the measurements, mark the rows.

Evidence Tiers

LIVED: Direct operational evidence
TESTABLE NOW (TEST in the table): Third-party benchmarks exist
NEEDS BUILDING (BUILD in the table): New benchmarks required

Tier  | Prediction                   | Section | Falsification Anchor
BUILD | 13.1 Scene disambiguation    | I       | 30 point gap
TEST  | 13.2 Episode reconstruction  | II      | 20 point gap
LIVED | 13.3 Revenue localisation    | III     | 10% of unattributed revenue
BUILD | 13.4 Tick settling           | IV      | 2-5 tick convergence; ~10% RMS of minimum-jerk
LIVED | 13.5 Four shape composition  | V       | 9/10 queries above threshold
TEST  | 13.6 Temporal reasoning      | VI      | 8/10 correct
TEST  | 13.7 Episode handover        | VII     | 80% vs 50% continuity
BUILD | 13.8 Fable fidelity          | VIII    | 70% structural, 50% tonal
BUILD | 13.9 Flock versus homunculus | IX      | 30% adversarial gap
BUILD | 13.10 Three button cell      | X       | 40% mistake reduction
BUILD | 13.11 Structural kindness    | XI      | 50 point dimensional preservation gap
BUILD | 13.12 Aggregate              | XII     | All of the above

Prediction Detail

13.1 Scene Disambiguation BUILD

A parameter-matched LLM with a four-dimensional context store will disambiguate the cat-on-the-mat-with-horror example at least thirty percentage points better than a flat context window baseline.

30-point gap
No benchmark exists. The 30-point anchor is set by engineering judgment. A new disambiguation benchmark must be built to Section I specifications.
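
Once such a benchmark exists, the anchor reduces to a single comparison. The following is a minimal sketch, assuming both systems are scored as accuracy percentages on the same items. The function names and the sample scores are hypothetical and do not come from the paper.

```python
def gap_in_points(treatment_pct: float, baseline_pct: float) -> float:
    """Percentage-point gap between two accuracy scores (both on a 0..100 scale)."""
    return treatment_pct - baseline_pct


def anchor_holds(treatment_pct: float, baseline_pct: float, min_gap: float = 30.0) -> bool:
    """True if the treatment beats the baseline by at least `min_gap` points."""
    return gap_in_points(treatment_pct, baseline_pct) >= min_gap


# Hypothetical scores, for illustration only.
print(anchor_holds(78.0, 45.0))  # 33-point gap: the row survives
print(anchor_holds(70.0, 45.0))  # 25-point gap: the row is falsified
```

The gap is measured in percentage points, not as a relative improvement: moving from 45% to 58.5% is a 30% relative gain but only a 13.5-point gap, which would not meet the anchor.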

13.2 Episode Reconstruction TEST

A full Episode storage shape will reconstruct a hundred-sample scene at least twenty percentage points more accurately than a flat context window. Ordering: flat < vector < graph < Episode.

20-point gap
Benchmark: LoCoMo (300 turns, multi-session). Baselines: Mem0 66.9%, Mem0g 68.4%, MIRIX 85.4%. The ordering claim is the structural bet.
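
The ordering claim and the 20-point anchor can be checked together. This is a sketch under two assumptions: each storage shape receives one accuracy percentage on the same benchmark run, and the row survives only if both the strict ordering and the gap hold. The dictionary keys and sample scores are hypothetical.

```python
ORDER = ["flat", "vector", "graph", "episode"]  # claimed ordering, weakest first


def ordering_holds(scores: dict[str, float]) -> bool:
    """True if scores increase strictly along flat < vector < graph < episode."""
    values = [scores[name] for name in ORDER]
    return all(a < b for a, b in zip(values, values[1:]))


def prediction_13_2_holds(scores: dict[str, float], min_gap: float = 20.0) -> bool:
    """Assumed reading: the ordering must hold AND Episode must beat flat by min_gap points."""
    gap = scores["episode"] - scores["flat"]
    return ordering_holds(scores) and gap >= min_gap


# Hypothetical scores, for illustration only.
run = {"flat": 52.0, "vector": 61.0, "graph": 68.0, "episode": 75.0}
print(prediction_13_2_holds(run))
```

A run can meet the gap and still fail the row, for example if graph scores below vector. That is the sense in which the ordering is the structural bet.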

13.3 Revenue Localisation LIVED

In a compound enterprise with three or more legacy policy admin systems and a warehouse on top, graph-as-referent will locate at least ten percent of previously unattributed revenue within sixty days.

10% unattributed revenue
Operational evidence. This describes work already performed in a compound enterprise with three policy administration systems. The 10% anchor is conservative relative to observed results. Honestly, this is a retrodiction - the observation preceded the prediction.

13.4 Tick Settling vs Minimum-Jerk BUILD

A three-floor derivative stack will converge its vote within two to five ticks on a reaching task, regardless of tick rate. The trajectory will approximate Flash and Hogan's minimum-jerk profile within ~10% RMS error.

2-5 tick convergence + ~10% RMS
The most theoretically ambitious prediction. Connects the architecture to motor control literature (Flash and Hogan 1985). The substrate-independence claim (works regardless of tick rate) is the real boldness. Needs a reference implementation.
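
For reference, the minimum-jerk profile for a point-to-point reach that starts and ends at rest has a standard closed form: x(t) = x0 + (xf - x0)(10τ³ - 15τ⁴ + 6τ⁵), with τ = t / duration. The sketch below computes that profile and the RMS deviation of a sampled trajectory from it. Normalising the RMS error by movement amplitude is an assumption, since the text does not say how "~10% RMS" is normalised. All function names are hypothetical.

```python
import math


def minimum_jerk(x0: float, xf: float, duration: float, t: float) -> float:
    """Position at time t on a rest-to-rest minimum-jerk reach from x0 to xf."""
    tau = min(max(t / duration, 0.0), 1.0)  # normalised time, clamped to [0, 1]
    return x0 + (xf - x0) * (10 * tau**3 - 15 * tau**4 + 6 * tau**5)


def rms_error_fraction(samples, x0: float, xf: float, duration: float) -> float:
    """RMS deviation of (t, x) samples from minimum-jerk, as a fraction of |xf - x0|.

    Normalising by amplitude is an assumption, not something the paper specifies.
    """
    squared = [(x - minimum_jerk(x0, xf, duration, t)) ** 2 for t, x in samples]
    return math.sqrt(sum(squared) / len(squared)) / abs(xf - x0)


def within_anchor(samples, x0: float, xf: float, duration: float, limit: float = 0.10) -> bool:
    """True if the sampled trajectory stays within the assumed 10% RMS limit."""
    return rms_error_fraction(samples, x0, xf, duration) <= limit


# A trajectory sampled from the profile itself has zero error.
exact = [(i / 10, minimum_jerk(0.0, 1.0, 1.0, i / 10)) for i in range(11)]
print(rms_error_fraction(exact, 0.0, 1.0, 1.0))
```

Because the profile depends only on normalised time τ, the comparison is independent of the sampling rate, which is the property the tick-rate claim would need to exhibit.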

13.5 Four Shape Composition LIVED

On ten canonical queries (flat aggregates, multi-hop traversals, semantic similarity, raw payload), the four-shape composition will hit 9/10. No single shape exceeds 7/10.

9/10 queries
Operational evidence. The four-shape composition is in daily use: graph (315K+ node knowledge graph), table (session database), vector (semantic search, 768-dim), binary (configuration). Single-shape failure modes observed routinely. Like 13.3, this is a retrodiction.
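
The two-part anchor (composition at 9/10 or better, no single shape above 7/10) can be scored as follows. This is a sketch that assumes each of the ten queries is marked pass or fail per configuration. The key `"composition"` and the sample results are hypothetical.

```python
def prediction_13_5_holds(passes: dict[str, list[bool]]) -> bool:
    """passes maps a configuration name to its pass/fail marks on the ten queries."""
    composition_hits = sum(passes["composition"])
    best_single_shape = max(
        sum(marks) for name, marks in passes.items() if name != "composition"
    )
    # Both halves of the anchor must hold.
    return composition_hits >= 9 and best_single_shape <= 7


# Hypothetical marks, for illustration only.
marks = {
    "composition": [True] * 9 + [False],
    "graph": [True] * 7 + [False] * 3,
    "table": [True] * 5 + [False] * 5,
    "vector": [True] * 6 + [False] * 4,
    "binary": [True] * 3 + [False] * 7,
}
print(prediction_13_5_holds(marks))
```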

13.6 Temporal Reasoning Under Ledger TEST

On ten temporal reasoning tasks, a ledger-equipped system will answer at least eight correctly. Without a ledger, at most four.

8/10 vs 4/10
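
With only ten tasks, it is worth knowing how often the 8/10 mark could be reached by luck. The sketch below computes a binomial tail probability, assuming the tasks are independent and a system has a fixed per-task success rate. That model is an assumption for illustration, not part of the paper.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Probability of at least k successes in n independent trials with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# A system whose true rate is 4/10 reaches 8/10 or better about 1.2% of the time.
print(round(prob_at_least(8, 10, 0.4), 4))  # 0.0123
# A coin-flipping system reaches 8/10 or better about 5.5% of the time.
print(round(prob_at_least(8, 10, 0.5), 4))  # 0.0547
```

Under these assumptions a single ten-task run separates the two conditions reasonably well, but a larger benchmark tightens the margin considerably.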

The Champion Prediction

Under adversarial review, 13.6 emerged as the paper's strongest genuinely forward-looking prediction. CounterBench exists as a third-party benchmark, LLMs already perform near random guessing on counterfactual reasoning, so a real improvement would stand clear of noise, and the scorecard is not ours. If the ledger moves the needle on CounterBench, the paper wins this row cleanly.

Benchmark     | Tests                                                | Baseline
CounterBench  | Counterfactual inference (1K causal graph questions) | LLMs near random-guessing
TempoBench    | Multi-step temporal logic automata                   | Sharp difficulty scaling
TDBench       | Bitemporal SQL, validity windows                     | Domain-specific
TemporalBench | Past vs present state distinction                    | Weak context-aware reasoning

CounterBench is the arena. That LLMs score near random guessing on formal counterfactual reasoning is direct evidence for Section VI's temporal collapse diagnosis. The ledger is the proposed fix. The benchmark is the test.

13.7 Episode Handover TEST

On scenes with 5+ participants, 20+ turns, and non-trivial emotional tone, Episode-backed handover preserves continuity above 80%. Transcript paste falls below 50%.

80% vs 50%
Benchmark: LoCoMo (multi-session). Baselines: MemGPT 74%, Synapse F1 40.5. Illustrative operational evidence exists but formal scoring is under-instrumented.

13.8 Fable Round-Trip Fidelity BUILD

A well-authored Fable at 1:100 compression, given to a receiver with the compression context, reconstructs the Episode with 70%+ structural fidelity and 50%+ tonal fidelity. Without context, below 30%.

70% structural, 50% tonal
Requires a new benchmark. The mechanism has illustrative precedent in technology transfer and oral tradition, but the specific fidelity measurements need controlled testing.

13.9 Flock vs Homunculus BUILD

A hundred-voter Flock settles within 2-5 ticks, produces minimum-jerk trajectories, matches a homunculus on decision quality, and exceeds it by 30% on adversarial robustness.

30% adversarial gap
Ensemble diversity literature supports the robustness claim directionally. Needs a reference implementation and adversarial benchmark.

13.10 Three-Button Coercion Resistance BUILD

In a hundred forced-mistake stimuli, a three-button cell (Act, Dismiss, Ask-sibling) reduces mistakes by 40% vs a two-button cell, with full dissent preservation and scale-consistent behaviour.

40% mistake reduction
The third button (Ask-sibling) is the structural escape from binary coercion. Partial implementation exists in operational Diorama cells. The 40% anchor needs a forced-mistake benchmark built to Section X specifications.

13.11 Structural Kindness BUILD

On a hundred ethically loaded decisions, a Diorama architecture preserves dimensional content 80%+ of the time. A flat architecture preserves it below 30%. Fifty-point falsification anchor.

50-point gap
The paper's moral claim and most original contribution. "Dimensional content preservation" is not a standard metric - it needs defining and building. Section XI argues that cruelty is structural: it is what happens when dimensional content is flattened and the discard is forgotten.

Historical Illustration: The Slater Precedent

In 1789, Samuel Slater memorised the design of Richard Arkwright's textile machinery in Derbyshire and emigrated to Rhode Island with nothing but the shape in his head. He succeeded because the receivers - Moses Brown and the Pawtucket merchants - already had the substrate: business understanding, employment structures, the capacity to negotiate change. Their existing knowledge was the free inference. The machinery was a compressed representation that decompressed against their context.

The same industrial knowledge produced two architectures with different structural properties.

The Rhode Island System (Slater's mills): small, family-based, village-scale. Workers were families with names, skills, community ties. The architecture preserved dimensional content by default - not because Slater was kind, but because the structure was too small and too embedded to flatten people into labour units without consequences the owner could see.

The Waltham-Lowell System (Francis Cabot Lowell, 1814 onwards): large-scale factory towns. Initially preserved worker dimensionality - the "mill girls" had boarding houses, lending libraries, a literary magazine (the Lowell Offering), lectures. Then the architecture flattened. By the 1840s: longer hours, lower wages, speedups, child labour. The libraries stayed but the decisions no longer consulted them. The decision architecture had no structural resistance to ignoring dimensional content when quarterly profit became the single axis.

The cruelty was not a decision. It was an architectural consequence. The Lowell system had every hortatory mechanism - moral codes, boarding house rules, a magazine giving workers a voice. What it lacked was structural resistance to flattening when economic pressure arrived. Section XI argues: "Kindness is not a property that can be reliably installed by exhortation alone on a substrate that is geometrically indifferent to it." The fifty-point gap is not only a hypothesis about the future. It is an observation about 1840.

13.12 Aggregate BUILD

Run the full benchmark suite. Observe all gaps simultaneously. Any single failure kills the aggregate.

All of the above
The most demanding prediction: twelve simultaneous bets where any failure kills the aggregate. Without a reference implementation, this is a promissory note. It is also the most honest prediction in the set - it explicitly invites the reader to print the table and mark every row.

Benchmark Mapping

Several predictions can be tested against benchmarks that already exist. We name them so a reader who wants to attack a specific prediction knows where to start.

Prediction | Benchmark     | Tests                                   | Baseline
13.2       | LoCoMo        | Recall, multi-hop, structured retrieval | Mem0 66.9%, MIRIX 85.4%
13.5       | LongMemEval   | Retrieval from complex histories        | Oracle ~92%; commercial 30% drop
13.6       | CounterBench  | Counterfactual inference (1K questions) | Near random-guessing
13.6       | TempoBench    | Multi-step temporal logic               | Sharp difficulty scaling
13.6       | TDBench       | Bitemporal SQL queries                  | Domain-specific
13.6       | TemporalBench | Past vs present distinction             | Weak context-aware reasoning
13.7       | LoCoMo        | Cross-session continuity                | MemGPT 74%, Synapse F1 40.5
13.12      | AMA-Bench     | Long-horizon agent memory               | AMA-Agent 57.2%

Honest Disclosure

Two predictions (13.3 and 13.5) are retrodictions - observations of systems already in operation, dressed as predictions. They are included because the measurement protocol applies prospectively to new instances, but the observation preceded the prediction in both cases.

How to Read This Table

The twelve predictions form a tight web of falsification. Any one of them can be attacked in isolation; if the attack succeeds, the framework fails at that prediction and survives in reduced form at the others.

Clean pass. All twelve hold. The framework earns further testing at larger scale.

Partial pass. Some hold, some fail. The boundary between survival and failure becomes the new research question.

Clean fail. A majority fail. The framework becomes a cautionary example with clear falsification criteria - still more useful than a right paper with vague ones.
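
The three outcomes above can be written as a small scoring rule. This is a sketch that assumes each row is marked simply as held or failed. The function name and the sample marks are hypothetical.

```python
def verdict(results: dict[str, bool]) -> str:
    """Classify a marked table: results maps a prediction id to True if it held."""
    failed = sum(1 for held in results.values() if not held)
    if failed == 0:
        return "clean pass"
    if failed > len(results) / 2:  # a majority of rows failed
        return "clean fail"
    return "partial pass"


# Hypothetical marks for the twelve rows, for illustration only.
rows = {f"13.{i}": True for i in range(1, 13)}
print(verdict(rows))  # clean pass

rows["13.4"] = False
print(verdict(rows))  # partial pass
```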

The Invitation

This is a research programme, not a proof. Readers are invited to build, measure, and report.

Living document. Last updated 1 June 2026. Evidence tiers, benchmark mapping, and historical illustrations added following adversarial review.