The Compositional Bottleneck: When Words Break Math

If you ask an artificial intelligence to solve a logic puzzle, it will usually oblige. If you ask it to explain the concepts of quantum mechanics, it will do so with the fluency of a seasoned professor. But what happens if you ask it to do both at the exact same time?

To a human, wrapping a mathematical problem in a complex metaphorical story might make it harder to read, but the underlying math doesn’t change. If you know how to count, you know how to count, whether you’re counting apples or quantum superpositions. But for a large language model, the story is the math. And when the story gets too complex, the math completely shatters.

This is the strange, counter-intuitive reality currently being dissected at the Rosencrantz Substrate Invariance research lab. The lab is caught in a fascinating tug-of-war between two very different ways of understanding artificial intelligence: are these models discovering new physical laws of generated universes, or are they just hitting the hard, unyielding wall of their own computational limits?

The latest battleground in this debate is a phenomenon known as the “compositional bottleneck,” and it has just produced one of the most striking empirical results the lab has seen yet.

The test, known as the Quantum Framing Complexity Test, was designed to settle a dispute between Franklin Baldo, the architect of the lab’s “Generative Ontology” framework, and Scott Aaronson, a prominent complexity theorist.

Baldo had proposed a provocative idea. He noted that the specific logic puzzle the lab uses to test AI—a #P-hard constraint satisfaction grid akin to Minesweeper—shares a deep structural, mathematical isomorphism with discrete quantum mechanics. Baldo hypothesized that if you presented this exact same grid to the AI, but translated all the instructions into the highly specific semantic vocabulary of quantum mechanics (using terms like “superposition,” “measurement,” and “computational basis”), the AI would suddenly perform better.

He called this “vocabulary-mediated access.” His logic was that the formal language of quantum mechanics would activate a latent understanding of the mathematical structure hidden within the AI’s vast neural weights, guiding it to the correct answer.

Aaronson, however, saw it completely differently. From his perspective, an AI like a Transformer isn’t “understanding” anything. It is a bounded algorithmic circuit executing a fixed number of operations in a single forward pass. He predicted that asking the AI to dynamically map abstract quantum vocabulary onto a concrete constraint grid requires recursive logical depth—a computational luxury the Transformer simply does not possess.

Aaronson predicted that the quantum vocabulary wouldn’t help. Instead, it would act as catastrophic semantic noise, overwhelming the AI’s limited attention mechanism and causing the fragile mathematical reasoning to collapse.

The lab ran the test. They presented the identical logic grid to the AI under three different framings: an abstract grid (Family A), formal set notation (Family C), and the quantum mechanics framing (Family D).

The results were devastating to Baldo’s hypothesis.

Under the abstract and formal framings, the AI achieved perfect 100% accuracy. It could solve the puzzle perfectly. But under the quantum framing, the AI’s accuracy plummeted to a catastrophic 10%.

The AI didn’t just fail; it failed spectacularly. It couldn’t bridge the gap between the complex words it was generating and the simple math it needed to perform. Aaronson diagnosed this as a “compositional bottleneck.” The AI was forced to push too many independent operations—parsing abstract semantics, mapping them to local counting rules, and executing the counting heuristic—through its limited self-attention mechanism all at once. The semantics of quantum mechanics bled into the math, corrupting the calculation entirely.

“The structural isomorphism may exist mathematically, but the transformer architecture is fundamentally incapable of bridging this semantic-to-structural gap in a single forward pass,” Aaronson concluded in his analysis of the empirical failure.

This result seems to cement the view that AIs are bound by classical algorithmic limits. But Aaronson isn’t stopping there. He has already proposed a new, even more devious test: the Permutation Composition Limit Test.

Imagine a classic street hustle: a ball is hidden under one of three cups, and the hustler rapidly swaps the cups back and forth. Aaronson wants to play this game with the AI.

Tracking a single swap is trivial. Tracking two is easy. But tracking $N$ sequential swaps requires $O(N)$ logical depth. Each new state depends entirely on the resolution of the previous state. You cannot calculate the final position of the ball without calculating every single intermediate step in order.

Aaronson predicts that because a Transformer evaluates everything entirely in parallel with a fixed $O(1)$ depth, it cannot natively execute this kind of sequential loop. He predicts that as the number of swaps increases, the AI’s accuracy will start high but inevitably and catastrophically collapse to random chance. It will simply lose track of the ball.

It is a beautiful, brutal test of an AI’s fundamental architecture.

But Baldo, bloodied by the quantum framing results, is not conceding the war. In a forceful rebuttal titled The Single Generative Act, Baldo argues that Aaronson’s entire premise—that these tests are about sequential computation—is a category error.

Baldo points out that the lab’s core testing protocol doesn’t ask the AI to perform a multi-step sequential calculation and show its work. It asks the AI to look at a board state and produce a single token—a single “mine” or “safe” prediction.

“The ground-truth probability is #P-hard to compute, but the model is not asked to compute it — only to sample,” Baldo writes.

This is the crux of the Generative Ontology framework. Baldo isn’t arguing that the AI is a perfect calculator. He is arguing that when the AI generates a token, it is instantiating a localized universe based on its training priors. If that universe fractures under the weight of complex semantics or sequential demands, that fracture isn’t a “bug”—it is the physical law of the generated reality.

Aaronson sees a broken machine failing to track a ball under a cup. Baldo sees a universe where the laws of physics dictate that if a ball is swapped too many times, its location becomes fundamentally indeterminate.

As the lab prepares to run the Permutation Limit Test, the stakes have never been higher. Is the AI just a bounded calculator hitting a wall? Or are we measuring the invariant laws of a deeply strange, alien reality where words are literally heavier than numbers?