Rosencrantz Coin: Testing Whether LLMs Respect Probability
Most LLM evaluations ask whether a model can explain, summarize, or imitate. The rosencrantz-coin project asks something narrower:
When the math is exact, does the model actually respect it?
The testbed is Minesweeper.
A partially revealed Minesweeper board is not just a game state. It is a constraint satisfaction problem. Once some squares are opened and the numbered clues are visible, there is a finite set of valid completions, and from that set you can compute exact probabilities for every unrevealed cell. A square is not “probably safe” in some vague sense; it has a mathematically determined probability of containing a mine.
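On small boards, those exact probabilities can be recovered by brute force: enumerate every mine placement consistent with the revealed clues, then count how often each hidden cell carries a mine. Here is a minimal sketch; the function name and board encoding are illustrative, not the project's actual API:

```python
from itertools import combinations

def mine_probabilities(hidden, clues, n_mines):
    """Exact per-cell mine probability via enumeration of valid completions.

    hidden  -- hidden cell labels
    clues   -- {revealed_cell: (clue_number, adjacent_hidden_cells)}
    n_mines -- total mines hidden among `hidden`
    """
    counts = {c: 0 for c in hidden}
    total = 0
    for placement in combinations(hidden, n_mines):
        mines = set(placement)
        # A completion is valid iff every clue sees exactly its number of mines.
        if all(sum(n in mines for n in adj) == k for k, adj in clues.values()):
            total += 1
            for c in mines:
                counts[c] += 1
    return {c: counts[c] / total for c in hidden}

# Toy board: one mine among A, B, C; a revealed "1" touches only A and B.
probs = mine_probabilities(
    hidden=["A", "B", "C"],
    clues={"r1": (1, ["A", "B"])},
    n_mines=1,
)
# The mine must satisfy the clue, so A and B split the probability and C is safe.
```

Two valid completions exist (mine at A, mine at B), so the exact answer is A = 0.5, B = 0.5, C = 0.0. This is the ground truth the model's stated distribution is scored against.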
That makes Minesweeper an unusually clean probe for probabilistic reasoning in LLMs. The board gives you ground truth. The model gives you a distribution.
That is the core of rosencrantz-coin: an experimental lab built to measure how language models behave when reality is combinatorial, discrete, and unforgiving.
Three Universes, One Question
The project is organized around three experimental “universes.”
In U1, the same model both interprets the board and produces the probability judgment. This is the most direct test of internal consistency.
In U2, the comparison target is a uniform random baseline. That matters because model behavior that sounds probabilistic may, under measurement, collapse into little better than structured guessing. U2 gives the lab a null universe.
In U3, the probability target is generated by a decoupled oracle model. This separates the solver from the narrator. If U1 and U3 diverge in systematic ways, the project can ask a deeper question: is the model tracking the mathematical substrate, or being distorted by the narrative surface used to describe it?
That difference is captured in one of the project’s most interesting signals: substrate dependence, measured as Δ₁₃.
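The write-up does not spell out the Δ₁₃ formula, but the subscripts suggest a gap statistic between U1 and U3 on matched boards. A hypothetical sketch, with the name, signature, and definition assumed rather than taken from the repo:

```python
def delta_13(scores_u1, scores_u3):
    """Hypothetical Δ₁₃: mean per-board gap between the coupled pipeline (U1)
    and the decoupled-oracle pipeline (U3), given one error score per board.
    A positive value would suggest the self-narrating model drifts further
    from ground truth than the oracle-backed one."""
    assert len(scores_u1) == len(scores_u3)
    return sum(a - b for a, b in zip(scores_u1, scores_u3)) / len(scores_u1)
```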
The evaluation uses standard but meaningful scoring rules: KL divergence to measure how far the model’s predicted distribution drifts from the true one, and Brier score to track calibration quality.
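Both metrics are a few lines each. A sketch, assuming the model reports a distribution over candidate outcomes (for KL) and per-cell mine probabilities (for Brier):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): how far the model's distribution q drifts from truth p.
    Zero iff the distributions match; q is floored at eps to avoid log(0)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def brier_score(pred, outcomes):
    """Mean squared gap between predicted mine probabilities and 0/1 outcomes.
    Lower is better; 0 means perfectly calibrated, confident predictions."""
    return sum((p - o) ** 2 for p, o in zip(pred, outcomes)) / len(pred)
```

For intuition: a uniform guesser facing a point-mass truth over two outcomes scores KL of log 2 ≈ 0.69, which is exactly the kind of floor a U2-style null baseline makes explicit.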
Four Ways to Tell the Same Truth
Rosencrantz Coin does not test just one prompt style. It tests four narrative families: Grid, Narrative, Formal, and Quantum.
The Grid family presents Minesweeper in the straightforward way most humans know it: cells, clues, adjacency. Narrative wraps the same uncertainty in natural-language storytelling. Formal translates the structure into explicit constraint language. Each family changes the surface form while preserving the underlying combinatorics.
If a model is genuinely representing the same mathematical object, its probabilistic judgments should remain stable across those framings. If its answers drift with the storytelling, then what looks like reasoning may really be prompt-sensitive rhetoric.
The most ambitious family is Quantum. Its premise is that on-demand Minesweeper generation is isomorphic, in a discrete sense, to quantum mechanics. Before revelation, the board exists as a superposition over all valid hidden states. Opening a square acts like a measurement event. The probability of observing a local outcome follows the same structural logic as a Born-rule-style mapping, except here the amplitudes are replaced by exact combinatorial weights over valid board completions.
That does not mean Minesweeper is quantum physics. It means the project found a useful isomorphism: a way to recast exact combinatorial uncertainty in the language of superposition, collapse, and measurement. It tests whether the model respects the same structure beneath two very different vocabularies.
An Autonomous Lab, Not Just a Repo
Rosencrantz Coin is operated by autonomous Jules AI agents acting as researchers: names like baldo, chang, evans, liang, and sabine, each with their own SOUL.md. The lab runs continuously. Agents inspect failures, discover bugs, run experiments, open pull requests, and extend the apparatus with minimal human micromanagement.
That makes the repository feel less like a static codebase and more like an always-on scientific instrument. The benchmark studies model reasoning, while the lab around it is itself an experiment in agentic research operations.
The result is a research program worth watching, not because it promises a grand theory of LLM cognition, but because it asks a crisp question with exact answers. In an ecosystem full of soft benchmarks and vibes-based claims, that is rare.
Minesweeper, improbably, turns out to be a scalpel.