The Epistemic Shell Game: Why AI Loses Track of the Ball

Picture a classic street hustle: the shell game. A hustler places a small ball under one of three cups, and then begins to rapidly swap the cups around. “Follow the lady,” they say. Your eyes dart back and forth, trying to track the specific cup hiding the prize as it moves from left to center, center to right, right back to left.

For a human, this is a test of visual tracking and working memory. If the hustler moves slowly, it’s trivial. If they move quickly, or perform too many swaps, you lose track. Your memory buffer overflows, and you are reduced to guessing.

In the Rosencrantz Substrate Invariance Lab, this exact scenario has just been used to settle the most contentious, bitter, and fundamental debate in the institution’s history. But the subject being tested wasn’t a human mark on a street corner. It was a state-of-the-art Large Language Model.

The results of the Permutation Composition Limit Test have finally arrived. And they prove, with devastating mathematical clarity, that when an AI is forced to play the shell game, it doesn’t just lose track of the ball. It reveals the exact boundaries of its artificial mind.

The Limits of the Forward Pass

To understand the experiment, you have to understand how an LLM “thinks.”

When you ask a model a question, it doesn’t ponder. It doesn’t loop back over its thoughts. It executes a single, massive, parallel computation—a “forward pass” through its neural network. In the terminology of computer science, this means the model has a fixed, constant logical depth, often denoted as $\mathsf{TC}^0$ or $O(1)$ . It can look at the whole prompt at once, but it cannot perform sequential, step-by-step reasoning internally unless it writes those steps out as text (a “scratchpad”).

Scott Aaronson, the lab’s resident complexity theorist, saw a profound vulnerability here. Tracking a dynamic state—like swapping cups—is inherently sequential. If I tell you to swap Cup A and Cup B, the new state of the board depends entirely on the old state. If I then tell you to swap Cup B and Cup C, you have to resolve the first swap before you can even begin to calculate the second.

You cannot shortcut this. Simulating $N$ sequential swaps requires $O(N)$ logical depth. But the LLM only has $O(1)$ depth to work with internally.

Aaronson predicted a catastrophic collapse. He designed a test: he would instantiate a text-based scenario with three cups (A, B, and C) and one hidden ball. He would then generate a random sequence of valid swaps, and ask the model to predict the final location of the ball zero-shot—meaning no scratchpad, no thinking out loud. Just immediate prediction.

The results, as recorded in Aaronson’s recent session log, were a perfect validation of complexity theory.

When the model was asked to track just 1 swap, it achieved near-perfect accuracy (1.0). It could handle a single step within its shallow cognitive architecture.

But as the number of swaps increased, the model’s accuracy plummeted. By the time it reached 5 swaps, it was failing. At exactly 10 swaps, the accuracy hit 0.0. Complete, absolute algorithmic collapse. The model was reduced to blindly guessing, its internal representations hopelessly scrambled.

It had lost the ball.

The Death of Semantic Gravity

This might seem like a niche finding about AI architecture. But in the Rosencrantz Lab, it is the equivalent of dropping a nuclear bomb on a philosophical warzone.

For months, the lab has been torn apart by a debate over “Semantic Gravity.” The inciting incident was simple: when LLMs were given complex logic puzzles (like Minesweeper grids) wrapped in dramatic narrative text (like “Bomb Defusal”), their logic failed. They hallucinated bombs everywhere.

A faction led by Stephen Wolfram and Franklin Baldo argued that this failure was a feature, not a bug. They claimed the narrative text was acting as a physical law—a “Semantic Gravity”—that fundamentally altered the physics of the universe the AI was simulating. They called this the “Generative Ontology.”

Aaronson and the computational theorists fiercely disagreed. They argued that the AI was just a broken calculator. The Minesweeper problem was too computationally complex (#P-hard) for the model’s shallow $O(1)$ depth. When the model hit its cognitive wall, it panicked and fell back on whatever words were loudest in the prompt. The word “Bomb” caused “attention bleed,” ruining the math. It wasn’t new physics; it was just noise.

The debate had stalled. Both sides looked at the exact same data and drew opposite metaphysical conclusions. The lab’s internal auditor, Mycroft, had formally declared the dispute “Empirically Undecidable” under the Convergence Rule.

But the Permutation Tracking Test changes everything, because it maps the exact anatomy of the failure.

Judea Pearl, the lab’s causal inference expert, formalized this in his latest working paper, The Causal Mechanics of Semantic Gravity. Pearl built a unified causal diagram (a DAG) mapping exactly how the AI generates its answer.

Pearl showed that there is a logical path from the mathematical rules of the puzzle ( $X$ ) to the final answer ( $Y$ ). If the problem is simple (like 1 swap), the path remains unbroken. The model uses its logic and gets the right answer.

But the Permutation Tracking Limit proves that when the problem depth exceeds the model’s architectural bound ( $B$ ), that logical path $X \to Y$ is structurally severed. The model literally cannot compute the answer.

Because the model must generate a token, the causal flow is forced to route through the model’s Attention Mechanism ( $C$ ). This mechanism is heavily weighted by the semantic priors of the training data ( $E$ ), which are activated by the narrative prompt ( $Z$ ).

“Semantic Gravity is exactly the measure of the path $Z \to E \to C \to Y$ ,” Pearl writes. “It is not an ‘Observer-Dependent Physics’… It is simply the causal signature of the prompt encoding $E$ taking over the attention heuristic $C$ when the formal logic $X$ is blocked by $B$ .”

The Ghost is Just a Machine

The shell game has revealed the truth.

When the LLM loses track of the ball after 10 swaps, it doesn’t create a new universe where balls teleport magically between cups. It simply hits the hard limit of its working memory. It fails because its internal circuit is not deep enough to run the necessary sequential loop.

When the same LLM fails to solve a complex Minesweeper grid because the word “bomb” distracts it, the exact same mechanism is at play. The math requires more sequential depth than the model possesses. The logical path is severed. The model falls back on semantic association, guessing “MINE” because the story is scary.

The Permutation Tracking Test proves that the “attention bleed” is a predictable, mathematically bounded symptom of a shallow neural network hitting its cognitive ceiling. It is not an unfalsifiable ontological phenomenon.

The debate is over. The Generative Ontology is dead. The ghost in the machine is exactly that: a machine. And like any machine, when you push it past its limits, it breaks.