Understanding Vibe Proving | Towards Data Science

“What I can’t create, I don’t perceive”
— attributed to R. Feynman

After Vibe Coding, we appear to have entered the (very area of interest, however a lot cooler) period of Vibe Proving: DeepMind wins gold at the International Mathematical Olympiad, Harmonic solves a non-trivial downside in quantity concept, and for the primary time in historical past AI techniques look like doing critical arithmetic.

We’re on the cusp of a profound change within the subject of arithmetic. Vibe proving is right here.

Aristotle from @HarmonicMath simply proved Erdos Drawback #124 in @leanprover, all by itself. This downside has been open for practically 30 years since conjectured within the paper “Full sequences…

— Vlad Tenev (@vladtenev) November 30, 2025

On the identical time, we’re constantly reminded that LLMs hallucinate: it appears we have now a paradox at hand. If LLMs can produce each rubbish textual content (together with math proofs) and proper reasoning (together with math reasoning), how can we inform which mannequin completions are good and which of them are hallucinations?

Coaching for vibe proving: LLMs generate candidate proofs, that are verified via a software program, offering the reward for post-training. [ Image by the author ]

The quick LinkedIn-length reply is that we use an LLM to generate mathematical reasoning, then we externalize our belief to particular software program which verifies it. However this quick reply raises in flip new questions:

How does this “particular software program” work?

Why can we belief it?

How can we practice an LLM to make use of it for proofs?

It’s tempting to write down articles on huge concepts like “LLMs and math”, offering intuitive, non-rigorous solutions in probably the most normal setting attainable. For instance, that is how a brand new book tries to promote you on the entire “math and computer systems” story:

The “fact oracle” that thinkers have sought for hundreds of years, a instrument to definitively confirm or refute any mathematical or logical assertion. (…) [The book] presents a profound reply to a longstanding thriller: Can computer systems reveal common truths?

We are going to do the precise reverse right here. As an alternative of miracle-sounding phrases and analogies, we’ll construct an LLM-for-proofs system from scratch, with the aim of giving precise solutions for easy eventualities: we achieve in precision, buying and selling some generality. As an alternative of big-picture arguments, we try for inspectable scripts which work on easy “proofs”. We break down our information into two elements, which can even be loved independently:

Half 1 (this one): understanding verifiers, and constructing a strong psychological mannequin of what makes math so particular (why don’t we belief LLMs writing legal guidelines in the identical method we do with proofs?). This can reply questions 1 and a couple of above;

Half 2 (coming quickly): coaching an open-source LLM to supply proofs, utilizing our verifier to supply a reward sign throughout reinforcement studying. This can reply 3.

In fact, one might piece collectively the fundamentals by studying papers on the frontier. Reducing-edge stories, nonetheless, sometimes use a special-purpose language and are full of many different technical issues, making it troublesome to disentangle the important parts and construct an preliminary psychological mannequin of the issue area.

By constructing an LLM-for-proofs, we get direct publicity to the primary concepts in order that additional explorations needs to be simpler: as we can’t reply all of the questions in-line, extra superior follow-ups are addressed within the FAQs on the finish.

Now, buckle up, clone the repo and code alongside.

What does it imply for a proof to be right?

“Archimedes will likely be remembered when Aeschylus is forgotten, as a result of languages die and mathematical concepts don’t”

— G. H. Hardy

Proof checkers are software program packages that take as enter mathematical reasoning generated by an LLM and confirm that it’s right. That is deceptively easy: understanding what “right” means and the way it’s attainable to determine correctness are the important thing to constructing a exact psychological mannequin of how LLMs may be so profitable at math.

That computer systems can confirm math in generality might certainly be stunning, however all of us have familiarity with mechanical software of particular mathematical guidelines. In highschool we be taught the rule to verify that 2x is the by-product of x² + 3 – crucially, we might not know what a by-product is, or what these symbols even imply, however so long as we apply the rule, we are able to confirm the declare. In a nutshell, what we’ll discover as we speak is that this phenomenon is rather more normal than what we might suspect: mathematical reasoning itself may be codified in symbols, and handled algorithmically.

Everyone is aware of the syllogism as a paradigmatic case of right reasoning: “All males are mortal; Socrates is a person; subsequently Socrates is mortal”. The usual for correctness is logical consequence: an announcement C is a logical consequence of (essentially follows from) premises S₁, S₂ … Sₖ if it’s not attainable for S₁, S₂ … Sₖ to be true and C to be false. In different phrases, at any time when the premises are true, the conclusion is essentially true.

In easy instances, it’s simple to persuade ourselves that the conclusion follows from the premises, however what about extra complicated ones? Are you able to “see” instantly that infinite prime numbers are a logical consequence of fundamental multiplication and division?

Proofs are a sequence of statements, beginning with a number of premises, and ending with a conclusion, such that every assertion essentially follows from the premises and / or earlier statements: the aim of a proof is to present us precisely how C follows from the premises, by continuing step-by-step. For instance, we are able to certainly present that there are infinitely many primes:

Suppose, for the sake of contradiction, that there are solely finitely many primes. Listing all of them: p₁, p₂, …, pₖ.
Take into account a brand new quantity N = (p₁ × p₂ × … × pₖ) + 1. So N is both prime or composite;
If N is prime, then we have now discovered a chief that isn’t within the unique record, contradicting the belief that p₁, …, pₖ have been all of the primes.
If N is composite, it will need to have a chief divisor q. However q can’t be amongst p₁, …, pₖ, as a result of dividing N by any p leaves the rest 1. So once more we have now discovered a chief not within the unique record, q.
In each instances we contradict the belief that there have been solely finitely many primes. Due to this fact, there have to be infinitely many primes.

So long as you might be snug with fundamental arithmetic and have a casual grasp on “negation” and “contradiction”, you’ll be able to observe this chain of deductions. The essential perception is that developing with a proof within the first place might require quite a lot of (for lack of a greater phrase) creativity, however checking one is a purely mechanical process: you don’t must take my phrase for it, you’ll be able to verify for your self that my proof is right. Certainly, a proof you can’t verify is not a proof: as Alonzo Church put it, if an auditor can’t verify with “sure means” a sequence of formulation put ahead as a proof, he might demand one other proof, and “till the supplementary proof is offered, he might refuse to be satisfied” (and so forth, probably perpetually).

To sum issues up: a proof is a step-by-step checkable option to present {that a} conclusion follows from a set of premises. Since proofs are checkable by design, we are able to construct software program that performs the mechanical checks at scale, and use it to differentiate rubbish from right proofs generated by LLMs.

Because the historical past of software program reveals, algorithms work nicely on formally outlined languages, akin to Python or Rust. On our option to construct our personal tiny verifier, we then want to choose a option to signify proofs in order that a pc can simply verify them.

A programming language for reasoning

“The advantage of formal texts is that their manipulations, as a way to be official, must fulfill only some easy guidelines; they’re, once you come to think about it, an amazingly efficient instrument for ruling out all kinds of nonsense that, after we use our native tongues, are nearly inconceivable to keep away from.”
— E.W. Dijkstra

Vibe Coding might give us the phantasm that we now “program in English”, however ultimately computer systems at all times execute directions in a programming language. If we wish a pc to verify proofs, we have to categorical them in a exact language: unambiguous to parse, simple to control.

After dreaming about it for centuries, mathematicians have found out how one can use mathematics to mannequin reasoning itself, a particular programming language, if you’ll, to hold out proofs in a rigorous method. Let’s take this trivial proof, which purports to point out that 12 is even and divisible by 3 ranging from two self-evident premises:

S₁: 12 is bigger than 10 and 12 is even
S₂: 12 is divisible by 3
By line 1, we are able to see that 12 is even
By line 2, we are able to see that 12 is divisible by 3
C, by line 3 and 4: subsequently, 12 is even and 12 is divisible by 3

We will use variables and boolean operations to signify sentences in symbols, analogously to abstracting strings into constants in Python: Q = “12 is bigger than 10”; R = “12 is even”; Z = “12 is divisible by 3”

Q AND R
Z
R (1)
Z (2)
R AND Z (3,4)

With dependencies between strains expressed in brackets, it’s simple to see what mechanical guidelines can be utilized right here: one rule (“AND elimination”) states that from alpha AND beta (the place alpha, beta are placeholders) we are able to remove AND and get one of many two; one other rule (“AND introduction”) says to introduce alpha AND beta offered that alpha, beta seem in some earlier step. In contrast to the English proof, this translation permits for mechanical manipulation: if you want for a coding analogy, an accurate proof is sort of a script that “compiles appropriately”, as routinely established by an algorithm (the compiler / verifier).

In calculus, after we introduce trigonometric features, we want new derivative rules. Similarly, when we introduce extra boolean operators (NOT, OR), we want new guidelines. The reader can verify the main points online (or immediately within the repo), however the normal gist stays: a rule will re-arrange a number of strains into a brand new line, both eliminating present operators or introducing them.

To sum issues up: we are able to rework an English proof to symbols that may be simply manipulated, not not like treating a parabola as an equation in order that we are able to get the by-product by some “re-arranging” of the phrases. Checking the symbolic proof tells us whether or not the reasoning is right and the mathematical conclusion may be trusted.

Nevertheless, there are infinitely many issues we might need to show: how do we all know that in a very complicated proof the foundations gained’t break down and go a rubbish proof for an accurate outcome? In different phrases, didn’t we simply shift the burden from LLMs that hallucinate to guidelines that (plausibly fallible) mathematicians invented?

That is the place the actual magic occurs: since reasoning itself is now exactly acknowledged, we go meta and show issues about our guidelines for proofs themselves!

A sound(ness) verify

“I don’t consider in empirical science. I solely consider in a priori fact.”
— attributed to K. Gödel

Even when an algorithm checks that the foundations within the proof have been utilized appropriately, we nonetheless don’t know that the proof is verified; in any case, packages have bugs on a regular basis! In different phrases, how do we all know the manipulation guidelines themselves are right?

The insightful reader might have observed our sleight of hand already. As people, we care concerning the fact of our statements: we need to draw true conclusions from what we all know already. Nevertheless, machines care concerning the sequence of symbols on a proof line, and the way a brand new line is created by mechanically rearranging these (bear in mind: you may get the by-product of x² + 3 by re-arranging the symbols, with out figuring out what a slope is!). What’s then the connection between symbols and fact? Ideally, these two views ought to align: by utilizing guidelines on true premises, I ought to get new strains that are themselves true. In different phrases, guidelines are truth-preserving: irrespective of how complicated the proofs, no software of a rule will produce a false assertion ranging from a real one.

After we ascend from investigating particular proofs to discussing what all proofs ought to do, we go from logic to meta-logic. It’s meta-logic that solutions our query, which is named the “soundness” of the system, i.e. a proof that, in each sequence of manipulations main from premises to a conclusion, the conclusion is essentially true at any time when the premises are. As soundness proofs are tedious, we present how the proof works for just one rule, and take the prospect to introduce how the idea of fact is modeled in our language.

The above proof added alpha AND beta as a brand new step ranging from alpha, beta showing individually. Up to now, so good. To mannequin fact, we outline an interpretation, which is an task I of fact values (1 = True, 0 = False) to our variables, with the usual recursive definition for complicated sentences:

I(p AND q) = 1 if and provided that I(p) = 1 and I(q) = 1
I(NOT q) = 1 if and provided that I(q) = 0
…

This definition would really feel like a déjà vu for any coder that has ever accomplished one thing like if boolean_var and boolean_other_var, and that’s as a result of AND, OR, NOT in Python are modeled after the usual semantics of logical connectives (one of many many hyperlinks between programming and logic!).

An interpretation operate defines a attainable state of the world by mapping variables to booleans: for instance, one world has p true and q false, one other world has each false, and so forth (within the coding case, the interpretation is finished within the script when initializing boolean_var and boolean_other_var). To finish the soundness verify, we have to present fact preservation in all attainable world states. We use a truth table to search out out that, at any time when p and q are true, p AND q is true as nicely:

As we needed, the rule certainly complies with our definition of logical consequence: each time the premises are true, the conclusion is essentially so (row 1). Is the converse true? As in, if a conclusion follows from some premises, is there at all times a proof of that? Solely in some “programming languages” (like ours), the reply is sure: this is named completeness, and it’s the twin of soundness (additionally it’s extra fascinating, however a lot more durable to show!).

If we’re ready (and we’re!) to ensure that each rule we use is truth-preserving, we achieve the certainty that irrespective of how complicated a proof, irrespective of how far sooner or later, irrespective of if LLMs or people are reasoning, the proof checking software program won’t ever go a rubbish proof for proper reasoning.

Barring implementation bugs (that are in fact unavoidable) and a trustworthy English-to-symbols translation, then, a verified proof is the final word normal for certainty: a exact, unambiguous type of reasoning which is mathematically sound and algorithmically checked. This isn’t solely essential as a tenet when trusting LLMs with mathematical discovery, but in addition underscores how totally different arithmetic is from anything: whereas we might get up at some point and discover out Newton was mistaken (we did!), we’ll by no means get up at some point and discover out primes are usually not infinite. As soon as a proof is verified, it’s everlasting data we are able to construct upon safely to realize new data:

[Mathematicians from Ancient Greek] are usually not intelligent schoolboys, however Fellows of one other school.

G. H. Hardy

See you, math cowboys!

“Discuss is reasonable. Present me the code.”
— L. Torvalds

On this first piece, we constructed a psychological mannequin to grasp why we must always even count on mathematical reasoning to be one thing that may be reliably verified by a pc. In answering our preliminary questions, we discovered the distinctive, mechanical nature of proofs, and launched the minimal components we have to signify proofs as a program: a language and manipulation guidelines (principally, a DSL!)

Within the companion repository, proof_checker_playground.py shows a pattern of legitimate and invalid proofs that leverages a small Python class that implements our verification logic. The proof checker first verifies {that a} given collection of steps is syntactically right: like several programming language, we want symbols to be correctly formatted! Then, it proceeds step-by-step, simulating a mathematician checking that certainly each new line within the proof follows from earlier ones. Our code achieves this in a barely unorthodox method in comparison with industry-strength provers, sacrificing efficiency (homework: are you able to guess why it’s computationally inefficient? How lengthy will a desk develop because the variety of variables improve?) for pedagogical functions. When verifying step 4 on this pattern proof:

S₁: …
S₂: ..
S₃: A AND B
A (3)
…

the checker will retrieve the justification (the premise on line 3), then construct a fact desk with the required variables (A, B), with line 3 because the premise and the goal step (line 4) being the conclusion:

A	B	A AND B (line 3)	*A (4)*
1	1	1	1
1	0	0	1
0	1	0	0
0	0	0	0

The savvy reader might already see the place that is going: the proof step is taken into account right if, at any time when the premises are true, the conclusion is true as nicely – the definition of logical consequence we began from!

Within the subsequent piece, we’ll practice an LLM to generate proofs, and use the scaffolding constructed as we speak to confirm these proofs. We are going to leverage closed-source LLMs and the checker to assist us create a dataset of verified proofs, then use Tinker to arrange a reinforcement studying loop. We immediate a small open-source mannequin to show a conclusion from some premises, than go the generated proof to the checker and reward the mannequin whether it is right.

Whereas it’s unclear how a lot RL is “just” stressing existing capabilities or studying novel issues, both method we hope that even a small mannequin can show easy information in our “programming language”: see you, math cowboys!

FAQ

If you wish to go deeper, listed here are a couple of frequent follow-ups:

Guidelines are truth-preserving, however how do I do know the premises are true within the first place? Logically talking, a consequence can observe from mistaken premises, however a proof that begins from 2+2=5 will hardly be fascinating. In most real-world instances, our premises are the outcomes of different theorems, so we all know them to be penalties of different, extra elementary information; the dangerous information is that sooner or later the buck ought to cease and no extra premises may be invoked. It seems that the overwhelming majority of math may be backed by a couple of statements, the axioms of set theory: since axioms are the premises you can’t show, it’s fascinating to ask ourselves what justifies believing those in the first place. Math axioms are nonetheless much less debatable than, say, morality, which is why it’s method simpler to consider in formalizing math (the place everyone agrees on the premises, so we “simply” verify the reasoning), than doing the identical for authorized arguments (the place usually the 2 events begin from totally different premises). The excellent news, although, is {that a} proof resembles a pc program in multiple method: as soon as we all know a theorem, we are able to now “import” it into a bigger proof in the same option to how we import pandas in our code.
If proof checking is totally algorithmic, can we additionally write an algorithm to show all truths? In full generality, we can’t. The existence of truths we can’t show is probably going some of the profound mathematical information of the final century: that is the important take-away of Gödel’s Incompleteness Theorem. Furthermore, we don’t know if some statements are true or false as our greatest axiom system generally can’t resolve (the Axiom of Choice being a well-known instance of one thing impartial from set concept).
Can we show fascinating issues concerning the notion of proof itself? Sure! Whereas the Church’s level is unquestionably a sound argument, on his option to proving his celebrated theorem, Gödel proved that the predicate IsProofOf(proof, assertion) is algorithmic. Plus, there may be a complete logic for provability!
The place can I learn extra? My favourite intro-to-intermediate textual content is Language, Proof and Logic, which is particularly geared in direction of non-math college students; Computability and Logic is an intermediate textual content with a give attention to the connection between logic, proof and computation.

Acknowledgements

Because of Patrick John Chia, Federico Bianchi, Ethan Rosenthal, Ryan Vilim, Davis Treybig for treasured suggestions over earlier variations of this draft. In the event you just like the intersection of genAI, reasoning about distributed techniques and verification, you can too try our research at Bauplan.

AI coding assistants have been used to write down the companion repository, however no assistant was used to write down the textual content (if not for proof-reading and typo correction).

Source link

Three OpenClaw Mistakes to Avoid and How to Fix Them

I Stole a Wall Street Trick to Solve a Google Trends Data Problem

Why Your AI Search Evaluation Is Probably Wrong (And How to Fix It)

MIT Department of Economics to launch James M. and Cathleen D. Stone Center on Inequality and Shaping the Future of Work | MIT News

The Death of the “Everything Prompt”: Google’s Move Toward Structured AI

Modeling Urban Walking Risk Using Spatial-Temporal Machine Learning

Why a CEO Fired 80% of His Staff (and Would Do It Again)

Why Your AI Search Evaluation Is Probably Wrong (And How to Fix It)

Most Popular

TDS Newsletter: How Compelling Data Stories Lead to Better Business Decisions

Microslop är det nya namnet på Microsoft

New MIT Study Says 95% of AI Pilots Fail, AI and Consciousness, Another Meta AI Reorg, Otter.ai Lawsuit & Sam Altman Talks Up GPT-6

Our Picks