Introduction
Large language models (LLMs) are increasingly capable of solving complex reasoning tasks, such as math Olympiad problems, scientific Q&A, and multi-step logical puzzles[3,8]. But are they really great? Yes, they are, but right now they are computationally expensive and inefficient at test time[5,6]. To address this challenge, researchers at Meta AI have come up with a solution called “DeepConf,” also known as “Deep Think with Confidence”[1].
There is a problem known as self-consistency with majority voting.
I’m sure you are wondering what this problem looks like in practice. Imagine a classroom of 100 students. You give them a complex Olympiad problem and an hour to solve it. At the end, you collect all the answers and vote: the answer with the most votes “wins.”
That is how self-consistency with majority voting works in LLMs[2,3]. Instead of just one solution, the model explores hundreds of reasoning paths (for example, 512 different step-by-step solutions) and then chooses the most frequent answer.
On the AIME 2025 math benchmark, a single pass by Qwen3-8B (called pass@1) gets about 68% accuracy; it’s like taking one answer from one student. But if you generate 512 reasoning traces per question (called cons@512) and take the majority answer, accuracy jumps to 82%[1,4].
Sounds great, right? The catch is that those extra 511 traces generate nearly 100 million additional tokens, and more traces don’t always help; performance can stay flat or even drop when low-quality solutions dominate the vote[1,7,8]. In other words, if the students are guessing randomly, the class vote doesn’t reflect the best thinker in the room[1].
What did the researchers do about it: Early Fixes
Researchers tried to solve this problem by looking at the model’s internal uncertainty signals. Now what is that internal unc...? It’s like checking on each student at regular intervals, say every 5 minutes, to see whether they are taking the right baby steps. The model looks at the probability distribution of each token and computes its confidence or entropy at that point. If the model has high confidence or low entropy (a narrow spread with a high peak), then the model is certain about that particular token prediction, and vice versa[1,11].
By aggregating these token-level statistics across an entire reasoning trace, we can estimate how “trustworthy” the solution really is. We can also filter out low-confidence traces before majority voting, just like ignoring the answers from students who clearly guessed. Fewer bad votes, stronger results[1].

However, these methods are still global, scoring a trace only after it has been fully generated, and they don’t fully solve the efficiency problem[1,6,13].
Let’s talk about some math here: how token entropy, token confidence, and trace confidence work[1,11].
Token Entropy:
Hᵢ = −Σⱼ Pᵢ(j) · log Pᵢ(j)
Let’s break this entropy thing down. The log Pᵢ(j) term tells us how surprising the token prediction is, where Pᵢ(j) is the probability of token j at the i-th position. When the probability is 1, the surprise is 0 (the model is dead sure; no drama, no uncertainty), which means the model is highly certain about that token prediction. We then average the entropies of all tokens to describe the uncertainty of each step[1].
Token Confidence:
Cᵢ = −(1/k) · Σⱼ₌₁ᵏ log Pᵢ(j),  where the sum runs over the k most likely tokens at position i
Token confidence measures how sharp the model’s guess is at each token position (an anti-surprise meter): the sharper the distribution, the higher the score[1].
Average Trace Confidence:
C(t) = (1/n) · Σᵢ₌₁ⁿ Cᵢ,  where n is the number of tokens in trace t
While we compute confidence for each token, the average of these per-token confidence scores gives the confidence of the whole trace[1].
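To make these three formulas concrete, here is a minimal Python sketch. It assumes you can pull per-position next-token probability distributions out of your model (for example, via the logprobs of your inference API); the top-k cutoff k = 5 is an illustrative choice, not a value from the paper.

```python
import numpy as np

def token_entropy(probs: np.ndarray) -> float:
    """Hᵢ = −Σⱼ Pᵢ(j)·log Pᵢ(j) for one next-token distribution."""
    p = probs[probs > 0]                      # drop zeros to avoid log(0)
    return float(-np.sum(p * np.log(p)))

def token_confidence(probs: np.ndarray, k: int = 5) -> float:
    """Cᵢ: negative mean log-probability of the k most likely tokens."""
    topk = np.sort(probs)[-k:]                # the k largest probabilities
    return float(-np.mean(np.log(topk)))

def trace_confidence(dists: list[np.ndarray]) -> float:
    """Average trace confidence: the mean of the per-token confidences."""
    return float(np.mean([token_confidence(p) for p in dists]))

# Toy example: a peaked (confident) step vs. a flat (uncertain) step.
sure = np.array([0.97, 0.01, 0.01, 0.005, 0.005])
unsure = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
print(token_entropy(sure), token_entropy(unsure))          # low vs. high entropy
print(token_confidence(sure) > token_confidence(unsure))   # True: sharper = more confident
```

Note the direction of each signal: entropy goes up with uncertainty, while confidence goes up with sharpness.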
Confidence-Aware Test-Time Scaling: DeepConf
DeepConf takes the idea further, instead of taking hundreds of solutions and simply voting over them[2,3,12]. It looks at the model’s internal confidence signals during and after generation, filtering out low-quality reasoning traces dynamically, either in real time (online mode) or after all the solutions are generated (offline mode). It keeps only the most trusted reasoning paths and reduces wasted computation[1,6].
And the results? On AIME 2025, DeepConf@512 with GPT-OSS-120B hits a jaw-dropping 99.9% accuracy, compared with 97.0% for plain majority voting and only 91.8% for a single attempt (pass@1). At the same time, DeepConf cuts token generation by up to 84.7% compared to brute-force parallel thinking[1,6,7].
With the intuition clear, it’s time to see how these confidence measures actually work under the hood.
Group Confidence:
C_Gᵢ = (1/|Gᵢ|) · Σ_{t ∈ Gᵢ} Cₜ
Cₜ is still our token-level confidence. Think of group confidence (C_Gᵢ) as a zoomed-in certainty check, where |Gᵢ| is the number of preceding tokens in an overlapping sliding window (for example, 1024 or 2048 tokens). This gives us a local snapshot of certainty[1].
Bottom 10% Group Confidence:
C₁₀(t) = (1/|G_b|) · Σ_{Gᵢ ∈ G_b} C_Gᵢ,  where G_b is the 10% of groups with the lowest confidence
When we sort the group confidence scores and zoom in on the bottom 10%, we are basically shining a light on the weakest links in the chain of reasoning. If those steps look shaky, we can toss the trace out to save computation[1].
Tail Confidence:
C_tail(t) = (1/|T_tail|) · Σ_{tᵢ ∈ T_tail} Cᵢ,  where T_tail is the final fixed block of tokens (e.g., 2048)
Tail confidence is simple; we just take a fixed number of final tokens, say 2048, and measure how confident the model is over the last few steps (checking the last mile), which is critical for reaching correct conclusions[1].
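Here is a small sketch of these three local metrics, under the same assumptions as before: token_confs is an array of per-token confidence scores Cₜ like those computed above, and the window and tail sizes are the illustrative 1024/2048 values from the text.

```python
import numpy as np

def group_confidences(token_confs: np.ndarray, window: int = 1024) -> np.ndarray:
    """C_Gᵢ: mean confidence over a sliding window of preceding tokens."""
    out = np.empty(len(token_confs))
    for i in range(len(token_confs)):
        lo = max(0, i - window + 1)           # overlapping windows
        out[i] = token_confs[lo:i + 1].mean()
    return out

def bottom_10pct_confidence(group_confs: np.ndarray) -> float:
    """Mean confidence of the weakest 10% of groups: the shakiest links."""
    k = max(1, int(0.10 * len(group_confs)))
    return float(np.sort(group_confs)[:k].mean())

def tail_confidence(token_confs: np.ndarray, tail: int = 2048) -> float:
    """Mean confidence over the final `tail` tokens: the last-mile check."""
    return float(token_confs[-tail:].mean())
```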
We can use DeepConf in two modes: offline and online[1].
Offline Thinking with Confidence
When you are offline, you don’t call the model again and again or fetch extra data. Instead, you are left with the traces you have already generated.
The challenge is to squeeze the most reliable answer out of them.
In offline mode, we can do plain majority voting over the finished traces (which can break down when noisy results dominate) or confidence-weighted majority voting, where we take the mean confidence value of each trace and simply multiply that confidence score by the occurrence count of its solution[1,2].
Confidence Filtering and Voting: Before voting, discard the weakest traces. First filter the traces by confidence (keep the top η% of traces), then do either plain voting or confidence-weighted voting[1,9,10].
You can use whichever confidence metric suits you, such as average trace confidence, group confidence, or tail confidence[1,10,11].
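As a tiny sketch of the voting step itself (the answers and confidence scores below are made-up values; plain majority voting is just the special case where every weight is 1):

```python
from collections import defaultdict

def weighted_majority_vote(traces: list[tuple[str, float]]) -> str:
    """Each (answer, confidence) pair casts a vote weighted by confidence."""
    votes: defaultdict[str, float] = defaultdict(float)
    for answer, conf in traces:
        votes[answer] += conf
    return max(votes, key=votes.get)

# Two shaky traces say "42"; one confident trace says "41" and wins.
print(weighted_majority_vote([("42", 0.8), ("42", 0.7), ("41", 2.9)]))
```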

Step-by-step explanation:
Inputs:
Prompt P: the question or input you want answered.
Number of traces N: how many reasoning paths you will generate.
Filtering threshold η: the percentage of top traces to keep.
Confidence measure C(t): computes the confidence score of a trace using any method you like[1].
Initialization:
Create an empty trace set T.
Create an empty confidence set C[1].
Generate Traces:
For each iteration i from 1 to N: generate a trace tᵢ for prompt P.
Calculate the confidence score Cᵢ = C(tᵢ).
Store the pair (tᵢ, Cᵢ) in T and C[1].
Filter High-Confidence Traces:
From all N traces, select the top η% based on their confidence scores.
This removes the noisy or low-quality traces, keeping only strong, confident answers[1].
Voting:
Calculate the vote score V(a) for each possible answer a.
This can be plain counting or confidence-weighted voting[1].
Select the Final Answer:
Choose the answer â with the highest vote score[1]:
â = argmaxₐ V(a)
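Putting the whole offline recipe together, here is a compact sketch. It reuses weighted_majority_vote from the earlier snippet; generate_scored_trace is a hypothetical helper that samples one reasoning path and returns its final answer plus a confidence score computed with whichever metric you picked.

```python
def deepconf_offline(prompt: str, n: int = 512, eta: float = 10.0) -> str:
    """Offline DeepConf sketch: generate n traces, keep the top eta%, vote."""
    # Generate and score all traces up front (this is the offline setting).
    scored = [generate_scored_trace(prompt) for _ in range(n)]  # hypothetical helper

    # Filter: keep only the top eta% most confident traces.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    survivors = scored[:max(1, int(n * eta / 100))]

    # Vote: confidence-weighted majority over the survivors.
    return weighted_majority_vote(survivors)
```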
Online Thinking with Confidence
The algorithm generates traces on the fly, dynamically measuring confidence as soon as there is enough evidence[1,5,14,15].
The Algorithm:

Step-by-Step Explanation
1. Inputs
Prompt P: again, the question you are answering.
Trace budget B: the maximum number of traces you are willing to generate.
Initial traces Nᵢₙᵢₜ: a starting pool of traces to warm up with.
Filtering threshold η: how many high-confidence traces to keep.
Consensus threshold τ: the share of votes at which you can stop because you are confident in the majority answer[1].
2. Offline Warmup
Before generating online:
Run Algorithm 1 with Nᵢₙᵢₜ traces.
Compute the confidence threshold s:
Take the (100 − η)th percentile of the confidence scores from the initial traces.
This defines the minimum confidence a token/group needs to be considered.
Initialize the trace set T with the initial traces and calculate the initial vote values V(a) for all answers[1].
V(a) = Σ_{t ∈ T} 𝟙{answer(t) = a}  (plain counting; multiply each term by Cₜ for weighted voting)
Determine the initial majority answer â[1].
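In code, the warmup threshold is just a percentile cut. A sketch with made-up scores:

```python
import numpy as np

# Hypothetical warmup pool: confidence scores of the N_init initial traces.
warmup_scores = [1.9, 2.4, 1.1, 3.0, 2.2, 1.5, 2.8, 0.9]
eta = 25.0                                    # keep the top 25% of traces here
s = np.percentile(warmup_scores, 100 - eta)   # bar sits at the (100 - eta)th percentile
print(s)                                      # groups scoring below s get cut off early
```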
3. Online Generation Loop
While two conditions hold:
The current majority answer is not yet confident enough:
V(â) / Σₐ V(a) < τ
And you still haven’t exceeded the trace budget: |T| < B.
→ Keep generating new traces[1]:
4. Generate a Trace Step by Step
While generating a trace t: generate token by token.
After each token i, calculate the group confidence C_Gᵢ for that token’s group.
If C_Gᵢ < s: stop generating the trace (low confidence).
Else: add token i to the trace t[1].
5. Update
Add the completed trace t to the trace set T.
Compute the trace confidence Cₜ.
Update the vote counts V(a) for all answers.
Update the majority answer â[1].
6. Termination
Stop when either:
The majority answer â achieves consensus above the threshold τ,
Or the trace budget B is reached.
Return the final majority answer â[1].
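Here is a compact sketch of the whole online loop under the same assumptions. generate_scored_trace is the hypothetical ungated sampler from before, and generate_gated_trace is a hypothetical variant that aborts mid-generation (returning None as the answer) the moment a group confidence drops below s.

```python
import numpy as np
from collections import defaultdict

def deepconf_online(prompt: str, budget: int = 512, n_init: int = 16,
                    eta: float = 10.0, tau: float = 0.95) -> str:
    """Online DeepConf sketch: warm up, set the bar s, sample until consensus."""
    votes: defaultdict[str, float] = defaultdict(float)

    # 2. Offline warmup: score an initial pool and set the stopping bar s.
    warmup = [generate_scored_trace(prompt) for _ in range(n_init)]  # hypothetical
    s = np.percentile([conf for _, conf in warmup], 100 - eta)
    for answer, conf in warmup:
        votes[answer] += conf
    n_traces = n_init

    # 3. Online generation loop: run until consensus or budget exhaustion.
    while n_traces < budget:
        top = max(votes, key=votes.get)
        if votes[top] / sum(votes.values()) >= tau:   # consensus reached
            break
        # 4. Generate one trace, gated group by group against the bar s.
        answer, conf = generate_gated_trace(prompt, s)  # hypothetical, may abort
        n_traces += 1
        # 5. Update votes only with traces that survived the gate.
        if answer is not None:
            votes[answer] += conf

    # 6. Return the final majority answer.
    return max(votes, key=votes.get)
```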

I think this algorithm is the art of early stopping, saving an enormous amount of computation and resources[1,5,6,7,13,14].
Conclusion
So, what do you think? What’s the moral of the story? Even the smartest “students” in the AI classroom sometimes need a little self-doubt to shine. DeepConf shows how powerful that self-doubt is. We can save millions of computations not by brute force but by choosing smarter, confidence-based approaches. It’s like turning a chaotic math contest into a calm team of expert problem-solvers.
As AI keeps learning to think with confidence, we are moving toward a future where models are not only smarter but also thriftier, spending less compute, making fewer mistakes, and delivering more brainpower per token. And who knows? Maybe someday your favorite model will be your most frugal, self-aware study buddy. Until then, let’s keep thinking smarter, not harder.
References
[1] Dayananda, A., Sivasubramanian, S., & Bartlett, P. (2025). Deep Think with Confidence: Confidence-Aware Test-Time Scaling for Better Alignment. arXiv preprint arXiv:2508.15260. Retrieved from https://arxiv.org/pdf/2508.15260
[2] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2022). Self-consistency improves chain-of-thought reasoning in language models. arXiv preprint arXiv:2203.11171.
[3] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., & others. (2022). Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (Vol. 35, pp. 24824–24837).
[4] Art of Problem Solving. (2025). 2025 AIME I. https://artofproblemsolving.com/wiki/index.php/2025_AIME_I. Accessed: 2025.
[5] OpenAI. (2024). OpenAI o1 system card. arXiv preprint arXiv:2412.16720.
[6] Snell, C., Lee, J., Xu, K., & Kumar, A. (2024). Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314.
[7] Brown, B., Juravsky, J., Ehrlich, R., Clark, R., Le, Q. V., Ré, C., & Mirhoseini, A. (2024). Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787.
[8] Chen, L., Davis, J. Q., Hanin, B., Bailis, P., Stoica, I., Zaharia, M., & Zou, J. (2024). Are more LLM calls all you need? Towards scaling laws of compound inference systems. https://arxiv.org/abs/2403.02419
[9] Aggarwal, P., Madaan, A., Yang, Y., et al. (2023). Let’s sample step-by-step: Adaptive-consistency for efficient reasoning and coding with LLMs. arXiv preprint arXiv:2305.11860.
[11] Fadeeva, E., Rubashevskii, A., Shelmanov, A., Petrakov, S., Li, H., Mubarak, H., … & Panov, M. (2024). Fact-checking the output of large language models via token-level uncertainty quantification. arXiv preprint arXiv:2403.04696.
[13] Li, Y., Yuan, P., Feng, S., Pan, B., Wang, X., Sun, B., … & Li, K. (2024). Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning. arXiv preprint arXiv:2401.10480.
[14] Han, Z., Li, Z., Wang, Y., Guo, C., Song, R., He, J., … & Chen, W. (2024). Adaptive inference-time compute: LLMs can predict if they can do better, even mid-generation. arXiv preprint arXiv:2410.02725.