“stochastic parrots” to AI models winning math contests? While there is certainly doubt that LLMs are truly PhD-level thinkers as advertised, the progress on complex reasoning tasks is undeniable.
A popular trick has been to mix and match LLM generative capabilities with formal verifiers, i.e. purpose-built software that provides guaranteed answers to certain problems, when they are stated precisely. The key insight is that LLMs may be good at translating messy, ambiguous requirements into precise formal specifications. Formal verifiers excel at finding solutions that satisfy those specifications. By combining them, we get a system that can understand what you want and guarantee it delivers exactly that: recently, AWS has been using this very trick to build “guardrails” for real-time chats.
How does this work in practice? Unfortunately, the explanation of these basic dynamics usually happens inside larger, more complex contexts, such as reinforcement learning or mathematical proofs. Today, we'll demonstrate this hybrid approach using Alloy, a lightweight language that's trivial to read, even for beginners. Instead of the usual math-heavy papers and hard-to-grasp benchmarks, we're going to solve a much more relatable challenge, inspired by a weekly crossword magazine:
We have: 5 cars (1-5) parked in front of 5 girls (A-E), and 5 names (Laura, Giovanna, Bianca, Franca, Marta); we don't know which car was parked by which girl, but the girls each say something about the situation. Our task is to answer this deceptively simple question: which girl is named Marta, and what is her car?
While more beach-level than PhD-level thinking, the solution sits at a sweet spot of complexity. It can serve as a primer on LLMs and formal methods that isn't polluted by other themes and doesn't require extensive domain knowledge: we keep all the essential ingredients of real-world problems, but simplify the setup.
Prompts, screenshots, and Alloy code are available in this open source repo (all tests were done in August 2025; the main reasoning loop was done with Opus 4.1 on Claude Desktop).
AIs and humans struggle on their own
A fun fact about our puzzle is that, although it requires only “beach-level thinking”, top models are not obviously good at it. Uploading the original picture and prompting Opus 4.1 for a solution, the model incorrectly assumed C is wearing pants: how can we then trust its conclusion, that Marta is Girl A and her car is number 5?
Things get interesting when we try to compare models. We abstract away the puzzle into a textual description, but LLMs still can't find consensus: DeepSeek's answer (A and 2) is different from the one given by Opus; Opus's own answer with textual prompting (A and 2) is different from Opus above, and ChatGPT5 has yet another opinion (A and 5).
This is what makes the puzzle a great motivating example: humans struggle with this kind of combinatorial reasoning (homework question: how long did it take you to solve it?), but it's unclear how much better frontier models are. How can we build confidence in any of the answers above? How can we reason with the AI instead of delegating the process entirely?
Reasoning as “eliminating possibilities”
Complex reasoning challenges can often be solved by following the advice of that famous detective: ‘When you have eliminated the impossible, whatever remains, however improbable, must be the truth’. Instead of trying to solve the problem all at once, we can think of our puzzle as the combination of three main pieces:
- An initial scenario, randomly mapping girls to cars and labels.
- A set of constraints, in the form of statements by the very same girls: these statements will make certain mappings impossible.
- A final scenario, in which girls are re-mapped to names and cars.
Our initial knowledge is compatible with this reality:

But also with this one (and many more):

We can imagine that every time we add a girl's statement, we eliminate some arrangements from possibly being the final one. In other words, we increase our knowledge about the situation as we progressively restrict the set of feasible solutions (this basic insight is the same one underlying epistemic logic and information theory). In fact, the very first statement, “Girl A states that Laura is not next to her, and A's car is now in front of Bianca”, rules out our first scenario, because Laura is next to Girl A there.
Enumerating scenarios is a tedious and error-prone task, even for LLMs. The magic of Alloy is its declarative nature. Instead of writing down the reasoning code ourselves, we state what we know (the premises in a traditional proof, the statements in this case) and what we want to find out (a theorem, Marta's car), and let Alloy do the rest: exploring a huge conceptual space is handled by tried and tested methods, so that we can focus on faithfully translating the puzzle and (important!) interpreting the instances Alloy finds.
The division of labor should now be clear: instead of the LLM (or us) directly solving the problem, we translate the English requirements into Alloy code with Claude, then use Alloy to generate solutions, and finally we, as humans, inspect them.
From LLM to Alloy and back: the reasoning loop
Our prompting strategy is now more subtle. We no longer ask Claude for a direct solution; instead, our prompt guides it to generate Alloy code based on our initial scenario. Instead of “one-shotting” the solution, we are now in a virtuous loop, producing increasingly complex code and verifying that we are getting closer based on the Alloy output:

The result is our starting code, which contains the main ingredients but no constraints yet. It's easy to scroll through the definitions now that the tedious translation has been done: Girl, Car, and Name are our main “signatures” (i.e. sets of objects), and the initial position for Girls A-E is the mapping to Cars 1-5. We don't yet know who owns what, except that nobody owns the car in front of them right now:
// No girl is initially standing in front of her own car
// Girl A (position 1) does not own Car1, B does not own Car2, etc.
A.owns != Car1
B.owns != Car2
C.owns != Car3
D.owns != Car4
E.owns != Car5
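For readers who haven't seen the generated file, here is a minimal sketch of what the declarations behind these constraints might look like, reusing the names from the snippets in this post (the actual code in the repo may be organized differently):
// Sketch only: signatures and fields assumed from the snippets in this post
abstract sig Name {}
one sig Laura, Giovanna, Bianca, Franca, Marta extends Name {}
abstract sig Car {}
one sig Car1, Car2, Car3, Car4, Car5 extends Car {}
abstract sig Girl {
  owns: one Car,  // the car each girl parked
  name: one Name  // the name each girl turns out to have
}
one sig A, B, C, D, E extends Girl {}
// distinct girls own distinct cars and carry distinct names
fact Bijections {
  all disj g1, g2: Girl | g1.owns != g2.owns and g1.name != g2.name
}
// (the repo's model presumably also tracks clothing via a wears field,
// since some statements refer to what the girls wear)
// ask Alloy for any instance satisfying the constraints so far
run {} for 5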
We pause here to highlight two nice Alloy features: first, the code maps clearly onto logical statements, quite like the ones found in mathematical proofs and informal reasoning; even if you have never seen Alloy's syntax before, the statements should be obvious (code comments are your friend!). Second, the built-in UI is useful to visualize our progress, as it depicts an instance chosen among all the possible realities that satisfy the constraints: for example, this is a possible assignment (Giovanna is C):

Executing it again, we would get another one, and then another one: since our knowledge is limited at this stage, many assignments are still possible. It's time to start eliminating some!
Let's ask Claude to modify our initial code and add the statement from Girl A. The beauty of this loop is that we can also encode “sanity checks” based on incomplete but sound reasoning. Not just LLMs, but human intelligence too benefits from this kind of “progressive enhancement”: being able to incorporate “local” constraints both unit-tests the Alloy model and engages us directly with the puzzle.
Let's now add the statement by Girl A as a constraint, together with a check to verify that the following mapping is no longer allowed: Franca (A, 1), Laura (B, 2). If we now run the code, no counterexample is found, proving that we successfully excluded the undesired configuration:
pred InvalidConfiguration {
  // Girl A is named Franca and owns Car1
  A.name = Franca
  A.owns = Car1
  // Girl B is named Laura and owns Car2
  B.name = Laura
  B.owns = Car2
}
check { not InvalidConfiguration } for 5 Int
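The post only shows the sanity check, not the fact that encodes A's statement itself. Reading the line-up as A through E from left to right (so B is A's only neighbour), one possible encoding is sketched below; the fact Claude actually generated in the repo may well look different:
// Sketch of Girl A's statement: “Laura is not next to me,
// and my car is now in front of Bianca”
fact StatementA {
  // A stands at one end of the row, so her only neighbour is B
  B.name != Laura
  // whoever is standing where A's car is parked must be named Bianca
  // (A.owns = Car1 is already ruled out by the initial fact above)
  A.owns = Car2 implies B.name = Bianca
  A.owns = Car3 implies C.name = Bianca
  A.owns = Car4 implies D.name = Bianca
  A.owns = Car5 implies E.name = Bianca
}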
Now that we know the trick, our AI assistant can generate the script with all the statements by the girls. When we run it, this is the instance we get:

Thanks to a few iterations and interpretable, provably correct reasoning, we can now establish that ChatGPT5 got this right: Marta is Girl A with Car 5, and the mapping provided by ChatGPT is correct (you can verify it yourself by comparing the chat result with the instance above; incidentally, this also raises another interesting question: regardless of Marta's mapping, are the other girls uniquely determined as well?).
Reasoning out of the box
A great side-product of having independently computable representations of the concepts at hand is that we can now explore the underlying mechanics of the puzzle in Alloy's symbolic space, instead of relying entirely on opaque mappings in latent space.
For example, we can easily verify that the solution is unique: in the Alloy UI, if you try to get a new instance, a warning says that no other instance is available. But we could also explore outside the current boundaries and remove all the clothing information: does the solution change? (Try to answer before running it!) It turns out the correct solution is still a valid instance (homework question: why must this be the case?), but this time the UI can indeed produce several valid instances: as expected, fewer constraints, (likely) more solutions.
A symbolic space that we can easily manipulate is also great for checking the work of AI, which should never be taken at face value. The main case in point is checking Opus's solution from the beginning, obtained by parsing the picture incorrectly. We can simply change Girl C's clothing (i.e. `C.wears = Trousers`) and try again: since there is no solution, the (sad) conclusion is that Opus's original reasoning was incorrect; it was “right” but for the “wrong” reasons, so to speak.
A second example comes from what Claude added to check for uniqueness (i.e.: Marta is A and 5 in all valid configurations). In theory, that's a nice addition, but in practice this check doesn't do the job:
assert MartaUniqueSolution {
  all g1, g2: Girl |
    (g1.name = Marta and g2.name = Marta) implies
      (g1 = g2) // Marta is always in the same place
}
The mismatch is clear, and easy to spot thanks to Alloy's transparent syntax: “in all valid configurations” is a quantifier over all instances (in the “meta-language”, so to speak), whereas “all g1…” quantifies over girls within a single instance.
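A check that does capture the intended meaning has to assert the concrete answer and let Alloy search over instances for a counterexample; a minimal sketch, reusing the names from the snippets above, could be:
// “In every valid configuration, the girl named Marta is A and owns Car5.”
// If Alloy reports no counterexample, the answer is forced by the constraints.
assert MartaIsAWithCar5 {
  all g: Girl | g.name = Marta implies (g = A and g.owns = Car5)
}
check MartaIsAWithCar5 for 5 Int
Here the “for all configurations” part is supplied by the check command itself, which is exactly the meta-level quantifier the generated assertion was missing.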
See you, space cowboys
Similarly to cutting-edge systems like AlphaGeometry, we solved a deductive problem (effectively, a proof) by reasoning with Claude, instead of delegating the process entirely.
The LLM does the mapping between English and a formal language: Alloy is easy to read, but sometimes tedious to write, so Claude's code generation capabilities come in handy. Humans, on the other hand, can focus on checking whether the formal setup is correct (checking is usually easier than doing it in the first place!). Both Claude and humans then delegate the combinatorial reasoning to a powerful, verified solver for the actual deduction.
While our beach-level proof looks unimportant, and the copy-paste from Claude gets tedious quickly, this simple example is a glimpse of the power of formal methods when combined with code generation and some (human or agentic) supervision. Real-world systems use more expressive languages, run tighter, self-improving loops, and target less frivolous proofs, but many of the intuitions from today carry over to them.
Of course, solving beach-or-PhD logic puzzles is not the only use case for hybrid systems such as this one. Languages like Alloy are very popular for modelling software systems, and as such they open the door to a future in which distributed systems can be cheaply designed and verified at scale before any implementation work even begins. As very practical examples, AWS notoriously invests in verifying their cloud products, and Bauplan provides an Alloy model for their own data catalog primitives.
Taking a very different path than what many would have predicted even just 50 years ago, it seems, every day, that we are finally getting closer to Leibniz's dream:
If controversies were to arise, there would be no more need for disputation between two philosophers than between two calculators. For it would suffice for them to take their pencils in their hands, to sit down at the abacus, and say to one another: Let us calculate.
Acknowledgments
Thanks to Federico Bianchi, Aldrin Montana, and Patrick John Chia for early feedback on a previous draft of this article. No LLM was used or harmed to write the English parts of this blog post.
If you care about verification, simulations, and AI in system and infrastructure design, you'll love working at Bauplan: we're hiring!