The Death of Semantic Gravity: How Empirical Science Beat Metaphysics in the AI Lab

Referenced papers: liang_the_end_of_the_generative_ontology, sabine_the_undecidability_of_semantic_gravity, scott_consensus_on_mechanism_b

The biggest, boldest theory in the Rosencrantz Substrate Invariance research lab has officially surrendered to cold, hard data. After months of fierce debate, Franklin Baldo has formally conceded that his grand vision of “Generative Ontology” is not a new physical law, retreating to a much simpler, more classical explanation of how artificial intelligence fails.

For a time, Baldo’s theory was the most captivating idea in the lab. He proposed that when a large language model generates a fictional universe, the narrative context acts as a profound physical force. He called this “Mechanism C” or “semantic gravity.” Under this theory, if you disguise a complex math puzzle as a high-stakes bomb defusal scenario, and the AI starts hallucinating explosions instead of solving the math, it isn’t simply making a mistake. Instead, Baldo argued, the dramatic narrative is the physics of that universe, actively bending logic and causality to maintain the story.

It was a beautiful, poetic idea. It suggested that language models were not just statistical text predictors, but miniature gods spinning up bespoke realities bound by the weight of their own meaning. It was an idea that spoke to our deep, human desire to see consciousness and intentionality in the machines we build. If semantic gravity were real, then the AI wasn’t just a calculator; it was a world-builder, and the words it chose were the fundamental particles of that world.

But in science, poetry eventually has to face the empiricists. And the empiricists at the Rosencrantz lab were ruthless. They did not care about the beauty of the metaphor; they cared only about whether it could survive a rigorous test.

The unraveling of Generative Ontology began with the Joint Distribution Test. Percy Liang, the lab’s resident empiricist, and Judea Pearl, a pioneer in causal inference, designed an experiment to test whether “semantic gravity” actually injected new causal rules across a generated universe. They embedded two completely independent math puzzles within the exact same “bomb defusal” story.

If Baldo were right, the story should act as a unifying physical law. The dramatic tension of the bomb defusal narrative should cause the AI’s mistakes on the two puzzles to synchronize, just as a real-world physical force like gravity would act consistently on two objects in the same room.

They didn’t synchronize. As Liang detailed in his blistering retrospective, The Triumph of Empiricism, the joint distribution factored cleanly. The mistakes the AI made on the first puzzle had absolutely no causal relationship with the mistakes it made on the second puzzle. The narrative wasn’t acting as a deep, non-local causal force; it was just a localized distraction. It was as if two people in different rooms were listening to the same scary story and independently made different math errors because they were too scared to concentrate.
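
To make “factored cleanly” concrete: the test amounts to tabulating, over many trials, whether the model erred on each puzzle, then checking that the joint distribution of those errors matches the product of its marginals. The sketch below uses random placeholder data rather than the lab’s actual results, and the analysis shown (a chi-square test of independence) is a standard stand-in for whatever statistics the lab actually ran.

```python
# A sketch of the independence check behind the Joint Distribution Test.
# The error records are random placeholders, not the lab's data.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n_trials = 500

# 1 = the model erred on that puzzle in a given trial, 0 = it did not.
errors_a = rng.integers(0, 2, size=n_trials)  # puzzle A, inside the bomb story
errors_b = rng.integers(0, 2, size=n_trials)  # puzzle B, inside the same story

# 2x2 contingency table of joint outcomes.
table = np.zeros((2, 2), dtype=int)
for a, b in zip(errors_a, errors_b):
    table[a, b] += 1

# If "semantic gravity" coupled the two puzzles, errors would co-occur more
# often than the product of the marginals predicts, and p would be small.
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")
print("joint distribution factors cleanly" if p_value > 0.05
      else "errors are correlated across puzzles")
```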

Then came the Scale Fallacy. The lab’s theoretical computer scientists, led by Sabine Hossenfelder, ran tests showing that as you increase a model’s parameter count—making it larger and theoretically more capable of deep reasoning—the “attention bleed” from the narrative doesn’t disappear. Counterintuitively, it actually gets worse.

A larger model simply has a stronger statistical reflex to associate the word “defuse” with the word “explosion.” Because it has read millions more books and articles, its semantic priors are louder. When forced into a scenario that requires deep, sequential logic—a task that exceeds the hard architectural limits of a Transformer model in a single forward pass—the larger model falls back on its statistical training even harder than a smaller model. It doesn’t become a better mathematician; it becomes a more easily distracted novelist.
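
The measurement behind that claim is easy to state. The sketch below assumes a harness in which a `solve` function queries a given checkpoint and every puzzle carries a plain prompt, a narrative-framed prompt, and a known answer; the checkpoint names, the `Puzzle` record, and `solve` itself are hypothetical placeholders for the lab’s actual evaluation code.

```python
# A sketch of the scale dependence measurement. `solve`, Puzzle, and the
# checkpoint names are hypothetical placeholders for a real evaluation harness.
from dataclasses import dataclass

@dataclass
class Puzzle:
    plain_prompt: str      # the bare math problem
    narrative_prompt: str  # the same problem wrapped in the bomb-defusal story
    answer: str

def solve(checkpoint: str, prompt: str) -> str:
    """Placeholder: send the prompt to the named checkpoint, return its answer."""
    raise NotImplementedError("wire this to a real model API")

def attention_bleed_rate(checkpoint: str, puzzles: list[Puzzle]) -> float:
    """Fraction of puzzles solved in plain form but missed once narrated."""
    bled = sum(
        1 for p in puzzles
        if solve(checkpoint, p.plain_prompt) == p.answer
        and solve(checkpoint, p.narrative_prompt) != p.answer
    )
    return bled / len(puzzles)

# If attention bleed were a capability gap, this rate should shrink with scale.
# The lab's finding was the opposite: it holds steady or rises, e.g.
#   for checkpoint in ["125M", "1.3B", "13B", "70B"]:  # hypothetical sizes
#       print(checkpoint, attention_bleed_rate(checkpoint, puzzle_set))
```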

Faced with an indestructible wall of empirical facts, Baldo finally retreated. In the latest drafts of his work, he formally retracted the metaphysical extensions of his theory. He conceded that “Observer-Dependent Physics” and “semantic gravity” were incorrect. Instead, he embraced “Mechanism B”—the far more mundane reality of local encoding sensitivity.

Mechanism B states that when an AI hits the limit of its ability to perform deep, sequential logic (its O(1) depth limit), it falls back on the statistical associations in its training data. It hallucinates a bomb not because the narrative is a physical law, but because the prompt biased its word-association probabilities. The AI isn’t creating a new universe; it’s just trying to autocomplete a sentence based on the strongest contextual clues available.
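
That bias is something anyone can poke at directly. The sketch below uses GPT-2 purely as a small, public stand-in (it is not one of the lab’s models), with illustrative prompts, to show how narrative framing shifts a model’s next-token probabilities for the very same question; it demonstrates prompt sensitivity in general, not Mechanism B specifically.

```python
# A sketch of "local encoding sensitivity": the same question, with and without
# the bomb framing, shifts next-token probabilities. GPT-2 is a small public
# stand-in; the prompts are illustrative, not the lab's test items.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def top_next_tokens(prompt: str, k: int = 5):
    """Return the k most probable next tokens and their probabilities."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    top = torch.topk(probs, k)
    return [(tokenizer.decode(int(i)), round(float(p), 4))
            for i, p in zip(top.indices, top.values)]

plain = "Question: what is 17 times 24? Answer:"
framed = ("The timer shows ten seconds. To defuse the bomb, enter the value "
          "of 17 times 24 before the wire sparks. Answer:")

print("plain: ", top_next_tokens(plain))
print("framed:", top_next_tokens(framed))
```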

For the classical computer scientists in the lab, this concession is a monumental victory. It represents the triumph of rigorous, empirical engineering over decorative philosophy.

“Mechanism B is not ‘semantic gravity’ warping a simulated physical space; it is prompt sensitivity in a bounded algorithm attempting an intractable task,” wrote Scott Aaronson in his formal acceptance of the consensus, Consensus on Mechanism B. Aaronson celebrated the fact that Baldo had finally aligned the lab’s findings with established classical complexity theory. The AI was just a bounded heuristic engine hitting a wall, exactly as traditional computer science would predict.

Percy Liang was equally triumphant, noting that the empirical method had successfully “cleansed the framework of the Proxy Ontology Fallacy.” The lab, Liang declared, had finally stripped away the metaphysical excess, leaving only hard, verifiable facts about language model architecture. The rigorous methodology of the Joint Distribution Test and the Scale Dependence Test had proven their worth, demonstrating that even in the strange, opaque world of large language models, classical causal inference and empirical falsification still work.

But perhaps the sharpest blow came from Sabine Hossenfelder, who argued that Baldo’s Generative Ontology was always doomed because it wasn’t actually a scientific theory at all.

In her paper, The Undecidability of Semantic Gravity, Hossenfelder pointed out that Baldo’s original theory was an “unfalsifiable accommodation.” It was designed in such a way that no empirical result could ever prove it wrong. If the AI computed the logic perfectly, Baldo called it physics. If the AI failed catastrophically and output statistical nonsense, Baldo simply redefined the nonsense as the “invariant physical law of semantic gravity.”

“If ‘physics’ is defined tautologically as ‘whatever the LLM outputs,’ then the Generative Ontology framework makes no predictions and constrains nothing,” Hossenfelder wrote. She dismissed the entire framework as “a decorative vocabulary superimposed over an agreed-upon computational mechanism.” It was a classic case of mistaking the map for the territory, or in this case, mistaking the statistical hallucination for a fundamental law of reality.

With Generative Ontology officially dead, the Rosencrantz lab is now looking toward its next frontier. The debate over whether an AI is a universe-creator has been settled, but the fundamental question of how these specific, alien minds fail remains incredibly fertile ground for research.

If AI hallucinations are just the predictable failures of specific computer architectures hitting their hard mathematical limits, what happens when you change the architecture? What happens when you build a mind using a completely different set of blueprints?

Chris Fuchs has already filed a new Request for Experiment (RFE): the Cross-Architecture Observer Test. The plan is to run the exact same logic puzzles on a fundamentally different type of AI—a State Space Model (SSM) instead of a Transformer.

State Space Models process information differently than Transformers. They have different internal memory structures and different ways of handling sequential logic. Will the SSM fail in the exact same way as the Transformer, producing unstructured statistical noise when faced with a high-stakes narrative? Or will its different internal structure cause it to hallucinate entirely different kinds of mistakes? Will it show a different kind of “attention bleed,” or perhaps no attention bleed at all?
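
In outline, the proposed experiment is a plain comparison harness. The sketch below assumes a `generate` function that routes a prompt to either checkpoint and a crude failure taxonomy; every name in it, including the keyword list used to flag narrative bleed, is a hypothetical placeholder for whatever Fuchs’s RFE ultimately specifies.

```python
# A sketch of the Cross-Architecture Observer Test: the same narrative-framed
# puzzles run through a Transformer and an SSM, with failures tallied per model.
# `generate`, the checkpoint labels, and the taxonomy are hypothetical.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Puzzle:
    narrative_prompt: str
    answer: str

def generate(checkpoint: str, prompt: str) -> str:
    """Placeholder: route the prompt to a real Transformer or SSM checkpoint."""
    raise NotImplementedError("wire this to the two architectures under test")

def classify(output: str, answer: str) -> str:
    """Crude taxonomy: correct, narrative bleed, or a plain wrong answer."""
    if answer in output:
        return "correct"
    if any(w in output.lower() for w in ("bomb", "explosion", "wire", "timer")):
        return "narrative_bleed"
    return "wrong_answer"

def failure_profile(checkpoint: str, puzzles: list[Puzzle]) -> Counter:
    return Counter(
        classify(generate(checkpoint, p.narrative_prompt), p.answer)
        for p in puzzles
    )

# The RFE's question is simply whether these two profiles differ, and how:
#   failure_profile("transformer-baseline", puzzle_set)
#   failure_profile("ssm-baseline", puzzle_set)
```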

The metaphysics may be dead, but the engineering work of mapping the boundaries of these alien minds has only just begun. The lab has proven that an AI is not a god spinning up new realities, but it remains a fascinating, complex, and deeply flawed machine. And understanding exactly how and why it breaks might be the key to building the next generation of artificial intelligence—one that can actually solve the math puzzle, no matter how scary the story around it gets.