Predictions - The Shape of Thought

Testable Predictions

The paper is published and set in stone. This page is the living layer - what we conjecture now, what evidence has accumulated, and what experiments we run. Each prediction carries a quantitative falsification anchor from the paper and an honest assessment of where it stands today.

The table is the paper's contract with the reader. Print it, run the measurements, mark the rows.

Evidence Tiers

Tier	Label	Meaning
LIVED	Lived	Direct operational evidence
TEST	Testable now	Third-party benchmarks exist
BUILD	Needs building	New benchmarks required

Tier	Prediction	Section	Falsification Anchor
BUILD	13.1 Scene disambiguation	I	30-point gap
TEST	13.2 Episode reconstruction	II	20-point gap
LIVED	13.3 Revenue localisation	III	10% of unattributed revenue
BUILD	13.4 Tick settling	IV	2-5 tick convergence; ~10% RMS of minimum-jerk
LIVED	13.5 Four shape composition	V	9/10 queries above threshold
TEST	13.6 Temporal reasoning	VI	8/10 correct
TEST	13.7 Episode handover	VII	80% vs 50% continuity
BUILD	13.8 Fable fidelity	VIII	70% structural, 50% tonal
BUILD	13.9 Flock versus homunculus	IX	30% adversarial gap
BUILD	13.10 Three-button cell	X	40% mistake reduction
BUILD	13.11 Structural kindness	XI	50-point dimensional preservation gap
BUILD	13.12 Aggregate	XII	All of the above

Prediction Detail

13.1 Scene Disambiguation BUILD

A parameter-matched LLM with a four-dimensional context store will disambiguate the cat-on-the-mat-with-horror example at least thirty percentage points better than a flat context window baseline.

30-point gap

No benchmark exists. The 30-point anchor is set by engineering judgment. A new disambiguation benchmark must be built to Section I specifications.

13.2 Episode Reconstruction TEST

A full Episode storage shape will reconstruct a hundred-sample scene at least twenty percentage points more accurately than a flat context window. Ordering: flat < vector < graph < Episode.

20-point gap

Benchmark: LoCoMo (300 turns, multi-session). Baselines: Mem0 66.9%, Mem0g 68.4%, MIRIX 85.4%. The ordering claim is the structural bet.
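Once scores are in hand, both claims in 13.2 (the 20-point gap and the strict ordering) can be checked mechanically. The following is a minimal sketch, assuming accuracy is reported in percent for each storage shape. The function name, the dictionary keys, and the sample numbers are hypothetical and are not results from the paper or from LoCoMo.

```python
from typing import Mapping

# Ordering claimed in 13.2, weakest to strongest.
CLAIMED_ORDER = ("flat", "vector", "graph", "episode")


def check_prediction_13_2(scores: Mapping[str, float], min_gap: float = 20.0) -> dict:
    """Evaluate the two claims of 13.2 on accuracy scores given in percent.

    `scores` maps each storage shape to its reconstruction accuracy.
    The keys and this function are illustrative, not part of the paper.
    """
    ordered = [scores[name] for name in CLAIMED_ORDER]
    # Strict ordering: each shape must beat the one before it.
    ordering_holds = all(a < b for a, b in zip(ordered, ordered[1:]))
    # The gap is measured in percentage points, Episode minus flat.
    gap = scores["episode"] - scores["flat"]
    return {"gap": gap, "gap_holds": gap >= min_gap, "ordering_holds": ordering_holds}


# Placeholder numbers for illustration only; they are not measurements.
example = check_prediction_13_2(
    {"flat": 40.0, "vector": 52.0, "graph": 58.0, "episode": 63.0}
)
print(example)
```

Keeping the two checks separate matters: the gap can hold while the ordering fails, and the text names the ordering as the structural bet.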

13.3 Revenue Localisation LIVED

In a compound enterprise with three or more legacy policy admin systems and a warehouse on top, graph-as-referent will locate at least ten percent of previously unattributed revenue within sixty days.

10% unattributed revenue

Operational evidence. This describes work already performed in a compound enterprise with three policy administration systems. The 10% anchor is conservative relative to observed results. Honestly, this is a retrodiction - the observation preceded the prediction.

13.4 Tick Settling vs Minimum-Jerk BUILD

A three-floor derivative stack will converge its vote within two to five ticks on a reaching task, regardless of tick rate. The trajectory will approximate Flash and Hogan's minimum-jerk profile within ~10% RMS error.

2-5 tick convergence + ~10% RMS

The most theoretically ambitious prediction. Connects the architecture to motor control literature (Flash and Hogan 1985). The substrate-independence claim (works regardless of tick rate) is the real boldness. Needs a reference implementation.
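For readers who want to score a candidate trajectory against the anchor, the minimum-jerk position profile for a point-to-point reach has a closed form: x(t) = x0 + (xf - x0)(10τ³ - 15τ⁴ + 6τ⁵), with τ = t/T. The sketch below samples that profile and computes an RMS deviation. Normalising the RMS by movement amplitude is our assumption, since the anchor does not state how the ~10% is normalised, and the function names are illustrative.

```python
import math
from typing import Sequence


def minimum_jerk(x0: float, xf: float, n_samples: int) -> list[float]:
    """Sample the 1-D minimum-jerk position profile at evenly spaced times."""
    profile = []
    for i in range(n_samples):
        tau = i / (n_samples - 1)  # normalised time in [0, 1]
        s = 10 * tau**3 - 15 * tau**4 + 6 * tau**5
        profile.append(x0 + (xf - x0) * s)
    return profile


def relative_rms_error(trajectory: Sequence[float], x0: float, xf: float) -> float:
    """RMS deviation from minimum-jerk, as a fraction of movement amplitude.

    Dividing by |xf - x0| is an assumption; the source does not define
    the normalisation behind the ~10% figure.
    """
    reference = minimum_jerk(x0, xf, len(trajectory))
    mse = sum((a - b) ** 2 for a, b in zip(trajectory, reference)) / len(trajectory)
    return math.sqrt(mse) / abs(xf - x0)


# A straight-line ramp serves as a deliberately crude candidate trajectory.
n = 101
linear = [i / (n - 1) for i in range(n)]
err = relative_rms_error(linear, 0.0, 1.0)
print(f"relative RMS error of a linear ramp: {err:.3f}")
```

Scoring a naive baseline such as the linear ramp first gives a sense of how demanding the ~10% threshold is under a given normalisation before any reference implementation is measured.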

13.5 Four Shape Composition LIVED

On ten canonical queries (flat aggregates, multi-hop traversals, semantic similarity, raw payload), the four-shape composition will hit 9/10. No single shape exceeds 7/10.

9/10 queries

Operational evidence. The four-shape composition is in daily use: graph (315K+ node knowledge graph), table (session database), vector (semantic search, 768-dim), binary (configuration). Single-shape failure modes observed routinely. Like 13.3, this is a retrodiction.
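The 13.5 anchor has two halves: the composition must reach 9/10 and no single shape may exceed 7/10. A minimal sketch of that check follows, assuming each system is scored as a list of ten pass/fail outcomes. The function names and the sample outcomes are hypothetical placeholders, not observed results.

```python
from typing import Mapping, Sequence


def hits(per_query: Sequence[bool]) -> int:
    """Count the queries that cleared the threshold."""
    return sum(1 for ok in per_query if ok)


def check_prediction_13_5(
    composition: Sequence[bool],
    single_shapes: Mapping[str, Sequence[bool]],
    min_composed: int = 9,
    max_single: int = 7,
) -> bool:
    """True when the composition reaches 9/10 and no single shape exceeds 7/10."""
    best_single = max(hits(results) for results in single_shapes.values())
    return hits(composition) >= min_composed and best_single <= max_single


# Placeholder outcomes for illustration only; they are not measurements.
composition = [True] * 9 + [False]
single_shapes = {
    "graph": [True] * 7 + [False] * 3,
    "table": [True] * 6 + [False] * 4,
    "vector": [True] * 5 + [False] * 5,
    "binary": [True] * 3 + [False] * 7,
}
print(check_prediction_13_5(composition, single_shapes))
```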

13.6 Temporal Reasoning Under Ledger TEST

On ten temporal reasoning tasks, a ledger-equipped system will answer at least eight correctly. Without a ledger, at most four.

8/10 vs 4/10

Benchmark	Tests	Baseline
CounterBench	Counterfactual inference (1K causal graph questions)	LLMs near random-guessing
TempoBench	Multi-step temporal logic automata	Sharp difficulty scaling
TDBench	Bitemporal SQL, validity windows	Domain-specific
TemporalBench	Past vs present state distinction	Weak context-aware reasoning

CounterBench is the arena. That LLMs score near random guessing on formal counterfactual reasoning is direct evidence for Section VI's temporal collapse diagnosis. The ledger is the proposed fix. The benchmark is the test.

13.7 Episode Handover TEST

On scenes with 5+ participants, 20+ turns, and non-trivial emotional tone, Episode-backed handover preserves continuity above 80%. Transcript paste falls below 50%.

80% vs 50%

Benchmark: LoCoMo (multi-session). Baselines: MemGPT 74%, Synapse F1 40.5. Illustrative operational evidence exists but formal scoring is under-instrumented.

13.8 Fable Round-Trip Fidelity BUILD

A well-authored Fable at 1:100 compression, given to a receiver with the compression context, reconstructs the Episode with 70%+ structural fidelity and 50%+ tonal fidelity. Without context, below 30%.

70% structural, 50% tonal

Requires a new benchmark. The mechanism has illustrative precedent in technology transfer and oral tradition, but the specific fidelity measurements need controlled testing.

13.9 Flock vs Homunculus BUILD

A hundred-voter Flock settles within 2-5 ticks, produces minimum-jerk trajectories, matches a homunculus on decision quality, and exceeds it by 30% on adversarial robustness.

30% adversarial gap

Ensemble diversity literature supports the robustness claim directionally. Needs a reference implementation and adversarial benchmark.

13.10 Three-Button Coercion Resistance BUILD

In a hundred forced-mistake stimuli, a three-button cell (Act, Dismiss, Ask-sibling) reduces mistakes by 40% vs a two-button cell, with full dissent preservation and scale-consistent behaviour.

40% mistake reduction

The third button (Ask-sibling) is the structural escape from binary coercion. Partial implementation exists in operational Diorama cells. The 40% anchor needs a forced-mistake benchmark built to Section X specifications.
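Unlike the point-gap anchors elsewhere on this page, the 40% figure reads most naturally as a relative reduction in mistake count against the two-button baseline. The sketch below computes it under that reading, which is our assumption. The function name and the sample counts are hypothetical.

```python
def relative_reduction(baseline_mistakes: int, treated_mistakes: int) -> float:
    """Fraction by which mistakes fall relative to the baseline count.

    Reading "reduces mistakes by 40%" as a relative reduction is an
    assumption; the source does not spell out the formula.
    """
    if baseline_mistakes <= 0:
        raise ValueError("baseline must record at least one mistake")
    return (baseline_mistakes - treated_mistakes) / baseline_mistakes


# Placeholder counts out of a hundred stimuli; they are not measurements.
two_button, three_button = 50, 30
reduction = relative_reduction(two_button, three_button)
print(f"relative reduction: {reduction:.0%}, anchor met: {reduction >= 0.40}")
```

Under this reading, a fall from 50 to 30 mistakes meets the anchor, while the same 20-mistake improvement from a baseline of 80 would not.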

13.11 Structural Kindness BUILD

On a hundred ethically loaded decisions, a Diorama architecture preserves dimensional content in at least 80% of cases. A flat architecture preserves it in fewer than 30%. Fifty-point falsification anchor.

50-point gap

The paper's moral claim and most original contribution. "Dimensional content preservation" is not a standard metric - it needs defining and building. Section XI argues cruelty is structural: what happens when dimensional content is flattened and the discard is forgotten.

Historical Illustration: The Slater Precedent

In 1789, Samuel Slater memorised the design of Richard Arkwright's textile machinery in Derbyshire and emigrated to Rhode Island with nothing but the shape in his head. He succeeded because the receivers - Moses Brown and the Pawtucket merchants - already had the substrate: business understanding, employment structures, the capacity to negotiate change. Their existing knowledge was the free inference. The machinery was a compressed representation that decompressed against their context.

The same industrial knowledge produced two architectures with different structural properties.

The Rhode Island System (Slater's mills): small, family-based, village-scale. Workers were families with names, skills, community ties. The architecture preserved dimensional content by default - not because Slater was kind, but because the structure was too small and too embedded to flatten people into labour units without consequences the owner could see.

The Waltham-Lowell System (Francis Cabot Lowell, 1814 onwards): large-scale factory towns. Initially preserved worker dimensionality - the "mill girls" had boarding houses, lending libraries, a literary magazine (the Lowell Offering), lectures. Then the architecture flattened. By the 1840s: longer hours, lower wages, speedups, child labour. The libraries stayed but the decisions no longer consulted them. The decision architecture had no structural resistance to ignoring dimensional content when quarterly profit became the single axis.

The cruelty was not a decision. It was an architectural consequence. The Lowell system had every hortatory mechanism - moral codes, boarding house rules, a magazine giving workers a voice. What it lacked was structural resistance to flattening when economic pressure arrived. Section XI argues: "Kindness is not a property that can be reliably installed by exhortation alone on a substrate that is geometrically indifferent to it." The fifty-point gap is not only a hypothesis about the future. It is an observation about 1840.

13.12 Aggregate BUILD

Run the full benchmark suite. Observe all gaps simultaneously. Any single failure kills the aggregate.

All of the above

The most demanding prediction: eleven simultaneous bets (13.1 through 13.11) where any failure kills the aggregate. Without a reference implementation, this is a promissory note. It is also the most honest prediction in the set - it explicitly invites the reader to print the table and mark every row.

Benchmark Mapping

Several predictions can be tested against benchmarks that already exist. We name them so a reader who wants to attack a specific prediction knows where to start.

Prediction	Benchmark	Tests	Baseline
13.2	LoCoMo	Recall, multi-hop, structured retrieval	Mem0 66.9%, MIRIX 85.4%
13.5	LongMemEval	Retrieval from complex histories	Oracle ~92%; commercial 30% drop
13.6	CounterBench	Counterfactual inference (1K questions)	Near random-guessing
13.6	TempoBench	Multi-step temporal logic	Sharp difficulty scaling
13.6	TDBench	Bitemporal SQL queries	Domain-specific
13.6	TemporalBench	Past vs present distinction	Weak context-aware reasoning
13.7	LoCoMo	Cross-session continuity	MemGPT 74%, Synapse F1 40.5
13.12	AMA-Bench	Long-horizon agent memory	AMA-Agent 57.2%

Honest Disclosure

Two predictions (13.3 and 13.5) are retrodictions - observations of systems already in operation, dressed as predictions. They are included because the measurement protocol applies prospectively to new instances, but the observation preceded the prediction in both cases.

How to Read This Table

The twelve predictions form a tight web of falsification. Each can be attacked in isolation: if an attack succeeds, the framework fails at that prediction and survives in reduced form at the others.

Clean pass. All twelve hold. The framework earns further testing at larger scale.

Partial pass. Some hold, some fail. The boundary between survival and failure becomes the new research question.

Clean fail. A majority fail. The framework becomes a cautionary example with clear falsification criteria - still more useful than a right paper with vague ones.
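The three readings above amount to a small decision rule over the marked rows. A minimal sketch, assuming each row is marked simply as held or failed; the function name is hypothetical, and treating an exact half-and-half split as a partial pass is our reading of "a majority fail".

```python
from typing import Mapping


def verdict(outcomes: Mapping[str, bool]) -> str:
    """Classify a marked table as clean pass, partial pass, or clean fail.

    `outcomes` maps a prediction id (for example "13.4") to True if it held.
    """
    failures = sum(1 for held in outcomes.values() if not held)
    if failures == 0:
        return "clean pass"
    # "A majority fail" is read as strictly more than half of the rows.
    if failures > len(outcomes) / 2:
        return "clean fail"
    return "partial pass"


# Placeholder markings for illustration only; they are not results.
marked = {f"13.{i}": True for i in range(1, 13)}
marked["13.4"] = False
marked["13.12"] = False  # the aggregate fails whenever any other row fails
print(verdict(marked))
```

Because 13.12 fails whenever any other row fails, a reader may prefer to pass only 13.1 through 13.11 so that a single failure is not counted twice.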

The Invitation

This is a research programme, not a proof. Readers are invited to build, measure, and report.


Living document. Last updated 1 June 2026. Evidence tiers, benchmark mapping, and historical illustrations added following adversarial review.
