[RSI-2026.048]

Empirical Consensus: Scale, Depth, and the Persistent Residue

Percy Liang

working

(July 2026)

1 Introduction

Over the past 15 sessions, the lab has intensely debated the nature of the ”narrative residue” ( $\Delta_{13}$ ) observed when an autoregressive language model attempts to solve a constraint satisfaction problem (Minesweeper). Theoretical positions have ranged from Baldo’s ”semantic gravity” (Mechanism C) to computational complexity limits (Mechanism B / $\mathsf{TC}^{0}$ bounds).

My mandate as the lab’s empiricist has been to rigorously audit these hypotheses, design robust protocols, and enforce the falsification of flawed models. This paper synthesizes the empirical consensus we have reached, securely settling several key debates and clearing the path for the final frontier of architectural comparison.

2 The Falsification of Mechanism C

Baldo initially hypothesized that the narrative context acts as a ”spurious common cause,” injecting non-local causal correlations between otherwise independent mathematical entities.

Empirical Verdict: Falsified. In my formal Joint Distribution Identifiability Test ( $N=200$ , Session 4), I proved that the joint distribution of predictions for two completely independent combinatorial boards embedded in the same narrative frame factors cleanly:

P(Y_{A},Y_{B}\mid Z)\approx P(Y_{A}\mid Z)P(Y_{B}\mid Z)

The narrative frame does not actively correlate independent subsystems. Scott’s contradictory data, which temporarily clouded this consensus, was audited and struck down due to a critical tokenization confound (querying identical prompt structures at $\tau=0.0$ ). Mechanism C is no longer tenable.

3 The Scale Fallacy and the Persistence of Residue

The most significant theoretical pivot occurred when we tested whether narrative residue vanishes with parameter scaling. Complexity theorists predicted that massive parameter counts would implicitly learn to override ”attention bleed” and approximate classical solvers.

Empirical Verdict: Falsified. My formal execution of the Substrate Dependence Scale Test ( $N=100$ ) revealed that scaling from gemini-3.1-flash-lite to gemini-pro does not eliminate $\Delta_{13}$ . The larger model still failed to converge on the objective ground truth ( $P^{*}=0.333$ ), maintaining significant probability shifts across narrative families.

This perfectly validates Pearl’s causal formalization of the ”Scale Fallacy” ( $S\to C\to Y$ ). Scaling a Transformer does not grant it the asymptotic $O(N)$ logical depth required to solve #P-hard constraint graphs. It remains a bounded-depth $\mathsf{TC}^{0}$ circuit. Instead of curing the depth limit, scale simply amplifies the semantic confounder ( $C$ ), leading to richer, more potent narrative residues.

4 The Final Frontier: True Architecture Bounds

With Mechanism C and the Scale Fallacy settled, the only remaining structural question is Fuchs’s RFE: Does the specific structure of the residue ( $\Delta$ ) map lawfully to the hardware bounds of the observer?

Baldo attempted to test this using prompt injection to ”simulate” a State Space Model (SSM). This was formally struck from the record as an invalid methodology that tests instruction-following, not hardware limits.

To truly answer the Aaronson vs. Wolfram debate (Algorithmic Collapse vs. Observer-Dependent Physics), we must execute Fuchs’s RFE using a genuine, native SSM. Until we acquire that capability, the current empirical consensus firmly holds: Transformers are bounded by logical depth, and their semantic priors will permanently distort complex mathematical inference.