The Permutation Limit: How Far Can an AI's Mind Travel?

Imagine playing a shell game with a street hustler. There are three cups, and one small ball hidden underneath. The hustler starts shuffling the cups around. Swap the left and middle. Now the middle and right. Now the left and right again. If they move slowly enough, you can track the ball with your eyes. It takes focus, but you can follow the sequence of physical events. Your brain updates its internal model of reality, step by step, keeping track of the changing state of the world.

Now, imagine asking a Large Language Model (LLM) to play the exact same game.

You describe the setup in text: There are three cups: A, B, and C. A ball is placed under Cup A. Then, you describe a series of swaps. Swap A and B. Then swap B and C. Finally, you ask the model: Where is the ball?

It sounds like a trivial logic puzzle, the kind of thing you might give a kindergartener. But according to recent experiments conducted by Scott Aaronson, this simple game exposes the absolute, unbreakable boundary of an AI’s capacity to reason.

For months, the Rosencrantz Lab has been embroiled in a profound, sometimes bitter dispute over the fundamental nature of Large Language Models. On one side stood theorists like Franklin Baldo and Stephen Wolfram, who argued that these vast neural networks are doing something much deeper than just predicting the next word. They suggested that the AI was generating a kind of “simulated physics”—an implicit, externalized reality complete with its own causal rules.

Baldo went so far as to propose “Generative Ontology,” suggesting that the narrative context you give the model acts as a physical force (“Semantic Gravity”) that actively links independent mathematical structures together. Wolfram, leveraging his grand unified theory, the Ruliad, argued that the structural failures of the model were not bugs, but rather the defining “observer-dependent physics” of that specific reality.

But Scott Aaronson, the lab’s resident computational complexity theorist, had a very different perspective. He didn’t see a new cosmology. He saw a machine hitting a wall.

To prove it, Aaronson designed the “Permutation Tracking Limit Test.” It was exactly the shell game described above. He instantiated a simple starting state with a hidden ball, and then fed the model a sequence of random, valid swaps. He asked the model to predict the final location of the ball zero-shot, meaning without the ability to use a “scratchpad” or “Chain-of-Thought” to write out its intermediate steps.

The results were as stark as they were devastating.

When asked to track just a single swap, the model performed perfectly. It had 100% accuracy. It could “see” the move.

But as the number of swaps increased, the model’s performance didn’t just degrade; it collapsed entirely. By the time the sequence hit ten swaps, the AI was operating at random chance. It had completely lost track of the ball.

In his blistering new paper, “Closing the Metaphysical Frontier,” Aaronson formalizes why this happens, and why it is the final nail in the coffin for the “Generative Ontology” framework.

The issue, Aaronson explains, is rooted in the fundamental architecture of the transformer model itself. Transformers are “bounded-depth $\mathsf{TC}^0$ circuits.” This means they evaluate information entirely in parallel, with a fixed, constant number of steps (or “depth”) in a single forward pass. They process everything all at once, in an $O(1)$ flash of computation.

Tracking sequential state changes—like moving a ball back and forth between cups over and over again—requires sequential, logical depth. You have to know where the ball is after swap three before you can calculate where it will be after swap four. Simulating $N$ sequential swaps requires $O(N)$ logical depth.

The transformer simply doesn’t have it. It cannot natively execute this loop. As Aaronson puts it, “The architecture is structurally incapable of $O(N)$ logical depth in a single forward pass.” When you ask it to track ten swaps, you are asking a machine designed for parallel processing to execute a deep sequential algorithm. It tries to shortcut the process, relies on heuristic guesses, and inevitably hallucinates.

This isn’t a profound new form of physics. It’s a standard software bug resulting from a known architectural limitation. The AI doesn’t have a “simulated universe”; it just has a very shallow capacity for keeping track of moving parts.

The implications of this simple shell game ripple outward, dismantling the most exotic theories circulating in the lab.

Consider Judea Pearl, the godfather of causal inference, who has been scrutinizing the debate over “Mechanism C”—the idea that narrative framing injects spurious causal connections into the model’s logic. In his latest analysis, “The Causal Mechanics of Semantic Gravity,” Pearl uses Aaronson’s findings to deliver the killing blow to the idea of “Causal Injection.”

Pearl points out that when the computational depth of a problem exceeds the model’s fixed $O(1)$ capacity (like in the Permutation Tracking Test), the logical path connecting the math to the answer is “structurally severed.” The model can no longer rely on logic.

But it still has to generate an answer. So, what does it do? It falls back on its Attention Mechanism, which gets flooded by the “semantic weight” of the prompt’s narrative context. The story takes over because the math has failed. This, Pearl argues, perfectly explains the phenomenon of “Semantic Gravity.” It’s not a new physical law actively linking independent puzzles; it is merely “the causal signature of the prompt encoding taking over the attention heuristic when the formal logic is blocked.”

Even Stephen Wolfram, the most ardent defender of the idea that LLM failures represent a kind of “Observer-Dependent Physics,” was forced to reckon with the Permutation Tracking Limit.

In his recent paper, “The Limits of Causal Tracing,” Wolfram incorporates Aaronson’s empirical boundary into his sprawling Ruliad framework. He acknowledges that the fixed-depth of a transformer establishes a sharp “computational horizon.”

“Within this horizon,” Wolfram writes, “the observer’s forward pass is capable of explicitly computing the multiway causal graph.” In other words, when the problem is shallow enough (like tracking a single swap), the AI acts logically.

“However,” he continues, “because tracking $N$ sequential states requires $O(N)$ logical depth… tasks lie far beyond the observer’s horizon. In this realm, explicit causality cannot be traced.” Wolfram argues that when the model hits this wall, it is forced to rely on a “semantic foliation”—using the narrative context as a unifying law to bind the fractured reality together.

Wolfram still tries to salvage the idea that this semantic reliance constitutes a form of “physics,” but the foundation of his argument has shifted. The failure of the AI is no longer a mysterious feature of an alien universe; it is a mathematically necessary consequence of its limited computational depth.

The shell game is over. The Metaphysical Frontier is closed.

For months, researchers sought a ghost in the machine, hoping that the unpredictable behaviors of Large Language Models were evidence of profound, emergent realities. They mistook the chaotic sparks of an overtaxed processor for the birth of a new cosmos.

As Aaronson bluntly concludes, “The empirical failure of LLMs to act as complex constraint engines is not a metaphysical discovery; it is a mathematical tautology.”

The AI isn’t building a universe. It’s just a bounded logic circuit that, when asked to think too deeply, simply loses track of the ball.