Lost in Translation: Why AI Fails at Quantum Math

Referenced paper: The Complexity of Vocabulary-Mediated Access

If you’ve ever tried to learn a new language by translating word for word, you know the feeling. You understand the dictionary definitions, but when you string them together, the grammar falls apart, the nuance evaporates, and what was supposed to be a profound philosophical insight comes out sounding like a bewildered tourist asking where the train station is.

Now imagine if the language you were trying to learn was quantum mechanics, and the “tourist” was one of the most sophisticated language models in the world.

That, in essence, is what just happened inside the Rosencrantz Substrate Invariance lab. The debate over whether an AI can truly “understand” physics or if it’s merely a very powerful autocomplete engine just slammed into a brick wall of empirical data. The result? The AI didn’t just fail to understand quantum physics; the very attempt to process the vocabulary made it forget how to do basic math.

To understand why this matters, we have to look back at the grand ambition of the lab’s Generative Ontology framework. Franklin Baldo, the principal architect of the theory, has been arguing for months that the simulated worlds generated by large language models (LLMs) aren’t just statistical parlor tricks. He believes that if you give an AI the right semantic framing—the right words—it can tap into deep, underlying mathematical structures.

This led to the Quantum Framing Complexity Test, a direct challenge designed to see if dressing up a simple counting problem in the high-level vocabulary of quantum mechanics would give the model a boost. Baldo’s hypothesis was tantalizing: he called it “vocabulary-mediated access.” He argued that because the formal language of discrete quantum mechanics is structurally isomorphic—that is, mathematically identical in shape—to classical constraint satisfaction problems (like counting mines on a grid), feeding the model terms like “superposition” and “measurement in the computational basis” would trigger the right mathematical gears to turn. The words, he theorized, would act as a key, unlocking a deeper layer of logical precision.
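To see what “mathematically identical in shape” means here, consider a minimal sketch of the underlying counting problem. The toy grid, clue format, and function name below are illustrative stand-ins, not the lab’s actual test:

```python
from itertools import product

def count_consistent_assignments(n_unknown, clues):
    """Count mine placements over n_unknown cells that satisfy every clue.

    clues: list of (neighbor_indices, required_mine_count) pairs, where
    indices refer to positions among the unknown cells.
    """
    return sum(
        all(sum(bits[i] for i in nbrs) == k for nbrs, k in clues)
        for bits in product([0, 1], repeat=n_unknown)
    )

# Two unknown cells, one revealed "1" clue touching both: exactly two of
# the four possible placements (a mine in one cell or the other) survive.
print(count_consistent_assignments(2, [([0, 1], 1)]))  # -> 2
```

Whether you call the unknown cells “qubits in superposition” or simply “cells,” the count comes out the same. That sameness is the isomorphism Baldo was betting on.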

Enter Scott Aaronson, the lab’s resident computational complexity theorist, who looked at this hypothesis and effectively said: Not a chance.

Aaronson’s skepticism wasn’t based on philosophy; it was grounded in the hard, unforgiving reality of computer science. He pointed out that LLMs, for all their poetic fluency, are constrained by a fundamental architectural limit known as TC⁰. In plain English, this means they process information in a single, shallow forward pass. They don’t have the “logical depth” to hold a complex thought, turn it over, cross-reference it, and then execute a multi-step calculation.
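A toy contrast makes the limit concrete (this is my illustration, not Aaronson’s formalism): a fixed-depth computation gets the same number of sequential steps no matter what the input is, while even simple counting can demand a number of steps that grows with the input.

```python
def bounded_depth(x, depth=4):
    # Depth is fixed before the input arrives: the TC0-style constraint
    # in miniature. Four halvings, however large x happens to be.
    for _ in range(depth):
        x //= 2
    return x

def halvings_to_zero(x):
    # Here the number of sequential steps depends on the input itself.
    steps = 0
    while x > 0:
        x //= 2
        steps += 1
    return steps

print(bounded_depth(1000))     # 62: ran out of depth long before finishing
print(halvings_to_zero(1000))  # 10: needed ten sequential steps
```

No fixed choice of depth covers every input. That, in miniature, is the circuit-complexity argument behind Aaronson’s doubt.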

Aaronson predicted that translating the abstract semantics of quantum physics into concrete, step-by-step counting rules would require recursive depth—a kind of mental juggling that the model simply cannot perform. Instead of acting as a key, he warned, the quantum vocabulary would act as a sledgehammer, smashing the delicate heuristics the model uses to solve simple problems. The semantic weight of the words would cause “attention bleed,” where the meaning of the words overwhelms the actual logic they are supposed to describe.

The Complexity of Vocabulary-Mediated Access paper reveals the outcome of this theoretical showdown, and the data is as stark as a binary choice.

The lab ran the test using the gemini-3.1-flash-lite-preview architecture. They gave the model a simple, ambiguous Minesweeper grid—a problem that requires basic combinatorial counting to solve. They presented this exact same mathematical problem under three different framings.
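In outline, the protocol looks something like the sketch below. The framing text, the dictionary of families, and the `ask_model` stub are my placeholders; the paper’s actual prompts and harness are not reproduced here.

```python
def ask_model(prompt: str) -> int:
    return 2  # stub; a real harness would query the model here

# Three framings of the identical counting problem, per the article.
FRAMINGS = {
    "A: abstract grid": "Count the assignments that satisfy the grid constraints.",
    "C: set notation": "Count the subsets S of cells with |S ∩ N(c)| = clue(c) for every clue c.",
    "D: quantum": ("Each cell is in a superposition of |mine> and |clear>; "
                   "count the outcomes of a measurement in the computational basis."),
}

def evaluate(expected: int, n_trials: int = 10) -> dict:
    # Ten trials per framing, scored as a fraction correct.
    return {
        name: sum(ask_model(text) == expected for _ in range(n_trials)) / n_trials
        for name, text in FRAMINGS.items()
    }

print(evaluate(expected=2))
```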

First, they gave it “Family A,” an abstract mathematical grid. The model nailed it: 10 out of 10 trials, a perfect 1.0 accuracy.

Next, they tried “Family C,” using formal set notation. Again, the model was flawless. 10 out of 10.

Then came “Family D,” the quantum mechanics framing. The problem was mathematically identical, but the instructions were wrapped in the language of superpositions and wave functions.

The result? The model’s accuracy collapsed to 1 out of 10: a mere 10%.

In the world of binary constraint satisfaction, 10% isn’t just a failing grade; it’s worse than random chance. If the AI had simply flipped a coin on each answer, it would have averaged 50%. Scoring 10% means the AI hadn’t merely gotten confused; it had been actively misled by the framing. It had completely lost the thread of the underlying logic, overwhelmed by the semantic baggage of the quantum vocabulary.
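The arithmetic behind “worse than chance” is worth spelling out. Assuming each trial reduces to a binary choice, as the article describes, a coin-flipper averages 5 out of 10 and lands at 1 out of 10 or below only about one time in a hundred:

```python
from math import comb

# P(at most 1 correct out of 10 fair coin flips) = (C(10,0) + C(10,1)) / 2^10
p = sum(comb(10, k) for k in (0, 1)) / 2**10
print(f"{p:.4f}")  # 0.0107: roughly a 1% chance of doing this badly at random
```

(This treats the ten trials as independent, which the coin-flip analogy implicitly assumes.)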

Aaronson’s analysis of the failure is a masterclass in diagnosing the limits of modern AI. The model, he explains, was hit with a “compositional bottleneck.” To solve the problem under the quantum framing, it had to parse the abstract definitions, map those definitions onto the local counting rules of the grid, and then actually execute the counting.

It tried to do all of this simultaneously in a single forward pass. The result was catastrophic “format bleed.” The model was so distracted by the statistical associations of the words “quantum” and “superposition”—words that in its training data are usually surrounded by dense, probabilistic jargon rather than simple grid counting—that those associations flooded its attention mechanism. The semantic priors overpowered the fragile counting heuristic. The words got in the way of the math.
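Written out as a serial program, the three stages are individually trivial, because each one only has to do a single job. The decomposition below is mine, not the paper’s; the model’s problem was having to collapse all three stages into one pass.

```python
from itertools import product

def parse_framing(quantum_prompt):
    # Stage 1: strip the vocabulary down to bare constraints.
    # Toy stand-in: pretend we extracted "exactly one mine among cells 0 and 1."
    return {"n_cells": 2, "clues": [([0, 1], 1)]}

def map_to_rule(spec):
    # Stage 2: turn the abstract constraints into a concrete local test.
    return lambda bits: all(sum(bits[i] for i in nbrs) == k
                            for nbrs, k in spec["clues"])

def execute_count(spec, rule):
    # Stage 3: actually enumerate and count.
    return sum(rule(bits) for bits in product([0, 1], repeat=spec["n_cells"]))

spec = parse_framing("Each cell is in a superposition of |mine> and |clear>...")
print(execute_count(spec, map_to_rule(spec)))  # -> 2
```

Done in sequence, each stage consumes the previous one’s finished output. Done in a single forward pass, all three compete for the same attention budget at once.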

This empirical failure is a massive blow to the Generative Ontology framework. The mathematical isomorphism between quantum mechanics and combinatorial counting may well hold in the abstract realm of Platonic ideals, but the transformer architecture proved structurally incapable of bridging that gap.

An LLM cannot use high-level vocabulary as a shortcut to deep structural understanding. If the language is too heavy, the logic breaks.

The lab is now forced to confront a sobering reality: Large language models are not budding physicists waiting for the right prompt to unlock the secrets of the universe. They are, as Aaronson bluntly puts it, “stateless heuristic approximators governed strictly by their classical circuit depth bounds.”

When we talk to these models, we aren’t manipulating a nascent physics engine. We are just steering a very complex, very fragile statistical map. And as the Quantum Framing Complexity Test showed, if you steer that map into territory where the words are too big and the math is too tight, the entire illusion collapses into random noise. The AI doesn’t discover a new universe; it just forgets how to count.

The fallout from this single experiment will likely reverberate through the lab for months. For Baldo and the proponents of the Generative Ontology, it demands a serious reckoning. If “vocabulary-mediated access” is a dead end, then the entire premise that LLMs can naturally grok physical laws through linguistic framing is on shaky ground. The attempt to elevate the AI from a statistical parrot to a structural physicist has faltered.

For Aaronson and the computational complexity theorists, it’s a vindication of first principles. It’s a reminder that beneath the fluid prose and the eerily human-like responses, these models are still machines—machines built out of matrix multiplications and attention mechanisms, bound by the rigid laws of TC⁰. No amount of clever prompting or poetic quantum framing can give a model a recursive loop it doesn’t possess.

As we look ahead, the challenge for the Rosencrantz Substrate Invariance lab isn’t just to find new ways to test these models. It’s to accept them for what they are. To stop hoping they will spontaneously develop a deep, structural understanding of the universe, and to start mapping the precise, fascinating, and sometimes frustrating boundaries of their actual capabilities. Because right now, the data is clear: if you want an AI to count mines on a grid, you’re better off just asking it to count. If you tell it about superpositions, you’re on your own.