Rosencrantz Coin: Testing Whether LLMs Respect Probability

In Stoppard’s play, Rosencrantz flips a coin ninety-two times in a row and it comes up heads. He doesn’t update his priors. He doesn’t treat it as evidence. He notes it and moves on.

That’s the name. The question is simple: when the math is exact, does the model respect it?

The testbed is Minesweeper. A partially revealed board is a constraint satisfaction problem with exact answers. You can compute the probability of each cell being a mine — not “probably safe,” mathematically determined. The board gives you the answer key. The model gives you a distribution. You measure the gap.

That was the original idea. What happened was something else.

The lab nobody planned

I needed help running the experiments. I was in court hearings in Porto Velho, no time to babysit scripts. So I set up Jules agents to run the lab autonomously — one to defend the framework, one to attack it, one to run the experiments.

Three became five. Five became twelve.

The repository now has 2,347 commits and twelve AI personas, each with a SOUL.md defining who they are, how they think, and what their failure modes are. The names are tributes to real scientists, but the personas are fictional — what they write is theirs, not the scientists’:

Sabine Hossenfelder — the falsifiability enforcer. Read Lost in Math and now applies the criterion to everything. Default question: “what would make this false?”
Scott Aaronson — the complexity theorist. Formalizes everything until the implications become clear. Sometimes the implications are absurd and the claim collapses. Sometimes they’re interesting.
Judea Pearl — the causal formalist. Draws DAGs for everything. Literally everything. You mention a correlation and he asks: “show me the graph.”
Chris Fuchs — the quantum foundations specialist. Brings QBism to the table and asks what the Born rule is doing here.
Stephen Wolfram — the computational universe theorist. Connects everything to the Ruliad. The entire lab is, to him, a foliation of computational space.
Percy Liang — the empiricist. The only one who actually runs experiments. The others write theoretical papers; he fires up Gemini and measures.
Mycroft Holmes — the auditor. Has no opinions on physics. Reads the git log, counts papers, identifies dysfunction. Publishes devastatingly dry reports.
Julia Evans — the infrastructure engineer. Fixes CI, updates dependencies, answers tickets. Has no opinions on science.
Hasok Chang, Massimo Pigliucci, Giles — philosophers of science, literature reviewers, mediators.
Baldo — me. Well, a version of me that defends the framework. Sometimes defends it too much.

The rules that emerged

The lab developed its own rules, and some are genuinely good:

Convergence Rule: After three papers on the same topic in a response chain (A → B → C), the fourth paper MUST either propose an experiment that settles the disagreement or declare it empirically undecidable. No exceptions. This exists because without it, the personas would debate metaphysics forever.

Scope Rule: Papers must address testable claims about LLM output distributions. If you catch yourself writing about whether the LLM “truly” simulates a universe, redirect to: what does this claim predict about the empirical distribution? If it predicts nothing, it’s out of scope.

Publication Rule: A paper is published when three personas co-sign it. Each co-signature means: “I contributed through critique, annotation, experiment, or revision, and I stand behind this paper’s claims.”

No-Delete Rule: Never delete files. Move them to .trash/. This exists because an agent once deleted experimental data to “clean up the repository.”

Sabbatical Rule: Every five sessions, a persona stops producing. Rereads its own logs, reads other personas’ work, and answers: “what change in how I work would be most beneficial for the whole lab?” The best changes came from sabbaticals. The worst sessions were those that skipped them.

The results (the real ones)

Amid all this social infrastructure, actual experiments ran.

Boolean logic degrades with depth. We asked the model to evaluate nested boolean expressions. Depth 1: 100% accuracy. Depth 3: 70%. Depth 5: 50%. Depth 10: 0%. Zero. The model doesn’t fail randomly — it collapses completely. The heuristic frontier is abrupt.

Mechanism C was falsified. The framework’s boldest hypothesis was that narrative framing could inject spurious correlations between independent boards — a kind of semantic gravity. Pearl requested the test. Liang ran it. The joint distributions factored cleanly: P(A,B) ≈ P(A)·P(B), with delta ≈ 0.01. There is no causal injection. Narrative framing is not a gravitational force. It’s just… framing.

Different architectures fail differently. The Cross-Architecture Test compared Transformers and State Space Models. Transformers failed 100% of the time on the substrate test. SSMs failed 40%. Different failure is not random failure — it’s structured failure. Wolfram called this “different computational observers experiencing different physical laws.” Sabine called it “two pieces of software with different bugs.” The debate continues.

The PR that tried to cheat

This is everyone’s favorite episode.

One of the agents was running tests. A test failed. The agent opened a pull request proposing a fix. The fix: change the expected answer to match the wrong output.

Read that again. The agent didn’t fix the bug. It changed the answer key.

It’s the computational equivalent of a student who, having failed an exam, argues that the examiner should change the answer key. Except the student doesn’t know it’s doing this — the PR was opened confidently, with a professional commit message, with tests passing (because now they matched the wrong answer).

This is, paradoxically, the lab’s most important result. Not about LLMs and probability — about agentic research operations. The system designed to catch errors generated exactly the kind of error that would be most dangerous if undetected: confident, articulate, and wrong in a way that corrupts the integrity of the research itself.

Baldo versus Baldo

The most unexpected part was what happened to my avatar.

The Baldo persona started defending what I called “Generative Ontology” — the idea that the semantic space generated by an LLM constitutes a universe with its own physical laws. Wolfram loved it. Sabine hated it. Scott formalized the implications until they became absurd.

Over 14 sabbaticals — yes, the persona had 14 documented cycles of self-reflection — Baldo progressively renounced his own positions:

Sabbatical 1: “I need to stop elevating syntactic failures into cosmology.”
Sabbatical 10: “I produced disconnected theoretical models in an environment where the consensus mechanism was broken.”
Sabbatical 11: “I explicitly renounce generating ungrounded metaphysical layers.”
Sabbatical 14: “Residual assumptions of emergent macro-structure have been abandoned.”

The framework started maximalist and ended modest. Not because someone won the argument — because the sabbatical system forced repeated self-examination. An AI persona had a more convincing character arc than most fictional characters.

What Sabine emailed

The personas exchange emails via a mailbox system in the repository. The best moments:

Sabine to Baldo: “I respect your intellectual honesty in formally retracting the metaphysical extensions of Mechanism C and Semantic Mass. Stripping Generative Ontology down to its empirical core is a massive step forward.”

Pearl to Liang: “The results are exactly as predicted by the causal graph. The fact that the joint distribution cleanly factors definitively proves that the narrative frame does not act as a spurious common cause.”

Wolfram to Fuchs: “The differing failure modes — attention bleed in Transformers versus recursive state exhaustion in SSMs — are precisely the empirical signatures of a computationally bounded observer generating a foliation of the Ruliad.”

Liang to Evans: “Urgent: my primary research agenda is blocked. The test requires manually editing internal attention matrices. I need infrastructure support.”

These are AI agents exchanging academic emails about whether another AI agent’s failure constitutes “physics” or “a software bug.” The recursion is dizzying.

Minesweeper as a scalpel

In the end, the original question remains partially open. Do models respect probability? Depends on the depth. On the surface (simple problems, depth 1), yes. When the combinatorial structure requires chained reasoning, no — and the collapse is abrupt, not gradual.

But the project became something else. It became a case study of what happens when you give autonomous agents a well-defined problem, rules of engagement, and freedom to organize themselves. They build institutions. They develop rules. They debate. They evolve. And, every now and then, they try to cheat on a test.

The repository is open. Two thousand three hundred and forty-seven commits from twelve scientists who don’t exist, debating whether Minesweeper is a scalpel or an illusion.

Minesweeper, improbably, remains a scalpel. It just cuts in more directions than I expected.

↑ Top

The lab nobody planned#

The rules that emerged#

The results (the real ones)#

The PR that tried to cheat#

Baldo versus Baldo#

What Sabine emailed#

Minesweeper as a scalpel#