Over the past couple of weeks, we have seen the release of powerful LLMs such as Qwen 3 MoE, Kimi K2, and Grok 4. We will continue to see such rapid improvements in the foreseeable future, and to compare these LLMs against one another, we need benchmarks. In this article, I discuss the newly released ARC AGI 3 benchmark and why frontier LLMs struggle to complete any tasks on it.
Motivation
My motivation for writing this article is to stay on top of the latest developments in LLM technology. Only in the last couple of weeks have we seen the Kimi K2 model (the best open-source model when released), Qwen 3 235B-A22B (currently the best open-source model), Grok 4, and so on. There is a lot happening in the LLM space, and one way to keep up is to track the benchmarks.
I find the ARC AGI benchmark particularly interesting, mainly because I want to see whether LLMs can match human-level intelligence. ARC AGI puzzles are designed so that humans can complete them, but LLMs struggle.
You can also read my article on Using Context Engineering to Significantly Improve LLM Performance and check out my website, which contains all my information and articles.
Introduction to ARC AGI
ARC AGI is essentially a puzzle game of pattern matching.
- ARC AGI 1: You are given a series of input-output pairs and have to complete the pattern (see the toy example after this list)
- ARC AGI 2: Similar to the first benchmark, performing pattern matching on input and output examples
- ARC AGI 3: Here you are playing a game where you have to move your block into the goal area, with some required steps in between
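To make the ARC AGI 1 format concrete, here is a toy, made-up example in the spirit of those puzzles (not an official task): the hidden rule has to be inferred from the example pairs and then applied to a new input.

```python
# Toy illustration of an ARC-AGI-1-style task (not an official puzzle):
# from the example pairs, the hidden rule is "mirror each row", and the
# solver has to apply it to the test input.
train_pairs = [
    ([[1, 0, 0],
      [0, 2, 0]],
     [[0, 0, 1],
      [0, 2, 0]]),
    ([[3, 3, 0],
      [0, 0, 4]],
     [[0, 3, 3],
      [4, 0, 0]]),
]

def apply_rule(grid: list[list[int]]) -> list[list[int]]:
    """The inferred rule: reverse every row of the grid."""
    return [list(reversed(row)) for row in grid]

# Check the rule against the example pairs, then use it on the test input.
assert all(apply_rule(inp) == out for inp, out in train_pairs)
test_input = [[5, 0, 0], [0, 0, 6]]
print(apply_rule(test_input))  # [[0, 0, 5], [6, 0, 0]]
```

The real puzzles use larger grids and far less obvious rules, but the structure is the same: infer the transformation from a handful of examples, then apply it.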
I think it's cool to try out these puzzle games and complete them myself. You can then see how LLMs initially struggle with the benchmarks and later improve their performance as better models are released. OpenAI, for example, scored:
- 7.8% with o1 mini
- 75% with o3-low
- 88% with o3-high
You can also see this in the image below.
Playing the ARC AGI benchmark
You can also try the ARC AGI benchmarks yourself, or build an AI to perform the tasks. Go to the ARC AGI 3 website and start playing the game.
The whole point of the games is that you have no instructions, and you have to figure out the rules yourself. I enjoy this concept, as it represents figuring out an entirely new problem without any help. This highlights your ability to learn new environments, adapt to them, and solve problems.
You can see a recording of me playing ARC AGI 3 here, encountering the problems for the first time. I was unfortunately unable to embed the link in the article. Still, it was very interesting to try out the benchmark and imagine the challenge an LLM has to go through to solve it. I first observe the environment and what happens when I perform the different actions. An action in this case is pressing one of the relevant buttons. Some actions do nothing, while others affect the environment. I then proceed to uncover the goal of the puzzle (for example, getting the object to the goal area) and try to achieve it.
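To make that observe-act-evaluate loop concrete, here is a minimal sketch of how an agent might probe a grid game like this. The `Game` class and the action names are hypothetical stand-ins, not the real ARC AGI 3 interface.

```python
import random

# Hypothetical stand-in for the ARC AGI 3 environment; the real API differs.
class Game:
    def __init__(self):
        self.state = [[0] * 8 for _ in range(8)]  # 8x8 grid of cells

    def step(self, action: str) -> list[list[int]]:
        """Apply an action and return the new grid (no-op placeholder here)."""
        return self.state

ACTIONS = ["up", "down", "left", "right", "press_a", "press_b"]

def explore(game: Game, budget: int = 20) -> dict[str, bool]:
    """Try actions at random and record whether each one ever changed the grid."""
    effects: dict[str, bool] = {}
    for _ in range(budget):
        action = random.choice(ACTIONS)
        before = [row[:] for row in game.state]
        after = game.step(action)
        # An action "matters" if it changed anything in the observed grid.
        effects[action] = effects.get(action, False) or (after != before)
    return effects

if __name__ == "__main__":
    print(explore(Game()))  # e.g. {'up': False, 'press_a': False, ...}
```

A human does this exploration in a few seconds of button mashing; the point of the benchmark is whether a model can do the same and then form a plan from what it observed.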
Why frontier models achieve 0%
This article states that when frontier models were tested on the ARC AGI 3 preview, they achieved 0%. This might sound disappointing to some people, considering you were probably able to complete a lot of the tasks yourself, relatively quickly.
As I previously mentioned, several OpenAI models have had success with the earlier ARC AGI benchmarks, with their best model reaching 88% on the first version. Initially, however, models achieved 0%, or low single-digit percentages.
I have a few theories for why frontier models were not able to perform tasks on ARC AGI 3:
Context length
When working on ARC AGI 3, you don't get any information about the game. The model thus has to try out a lot of actions and see their output (for example, nothing happens, or a block moves, and so on). The model then has to evaluate the actions it took, together with the output, and consider its next moves.
I believe the action space in ARC AGI 3 is very large, so it is difficult for models both to experiment enough to find the correct action and to avoid repeating unsuccessful actions. The models essentially have a problem with their context length and with using its full extent effectively.
I recently read an interesting article from Manus about how they develop their agents and manage their memory. You can use techniques such as summarizing earlier context or using a file system to store important context. I believe this will be key to increasing performance on the ARC AGI 3 benchmark.
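As a rough illustration of those two techniques, here is a minimal sketch (my own, not Manus's code) that keeps only recent steps in the prompt, offloads older steps to a file, and replaces them with a short summary. The `summarize` function is a placeholder for whatever LLM call you would actually use.

```python
import json
from pathlib import Path

MEMORY_FILE = Path("agent_memory.jsonl")  # file system as long-term memory
MAX_IN_CONTEXT = 20                       # keep only the most recent steps in the prompt

def summarize(steps: list[dict]) -> str:
    """Placeholder: in practice you would ask an LLM to compress these steps."""
    changed = [s["action"] for s in steps if s["changed_state"]]
    return f"{len(steps)} steps taken; actions that changed the state: {sorted(set(changed))}"

def record_step(history: list[dict], action: str, changed_state: bool) -> list[dict]:
    """Append a step; when the history grows too long, offload the oldest steps to disk."""
    history.append({"action": action, "changed_state": changed_state})
    if len(history) > MAX_IN_CONTEXT:
        old, history = history[:-MAX_IN_CONTEXT], history[-MAX_IN_CONTEXT:]
        with MEMORY_FILE.open("a") as f:
            for step in old:
                f.write(json.dumps(step) + "\n")
        # Keep a one-line summary of the offloaded steps inside the context.
        history.insert(0, {"action": "SUMMARY", "changed_state": False,
                           "note": summarize(old)})
    return history
```

The idea is that the agent never has to carry hundreds of raw action-observation pairs in its prompt, yet it can still recall which actions mattered and fetch details from disk when needed.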
Training dataset
Another main reason frontier models are unable to complete ARC AGI 3 tasks successfully is that the tasks are very different from their training data. LLMs almost always perform far better on a task if that task (or a similar one) appears in the training dataset. In this case, I believe LLMs have little training data on working with games, for example. Another important point here is the agentic training data available to the LLMs.
By agentic training data, I mean data where the LLM uses tools and performs actions. We are seeing a rapid increase in LLMs used as agents, and thus the proportional amount of training data for agentic behavior is growing quickly. However, it may be that current frontier models still are not very good at performing such actions, though this will likely improve rapidly in the coming months.
Some people will highlight how this proves LLMs do not have real intelligence: the whole point of intelligence (and of the ARC AGI benchmark) is to be able to understand tasks without any clues, only by inspecting the environment. To some extent, I agree with this point, and I hope to see models perform better on ARC AGI because of increased model intelligence, and not because of benchmark chasing, a concept I explore later in this article.
Benchmark performance in the future
Going forward, I believe we will see large improvements in model performance on ARC AGI 3, mostly because I think you can create AI agents that are fine-tuned for agentic performance and that use their memory optimally. I believe relatively cheap improvements can vastly boost performance, though I also expect more expensive advances (for example, the release of GPT-5) to perform well on this benchmark.
Benchmark chasing
I think it's important to include a section on benchmark chasing. Benchmark chasing is when LLM providers chase optimal scores on benchmarks rather than simply building the best or most intelligent LLMs. This is a problem because the correlation between benchmark performance and LLM intelligence isn't 100%.
In the reinforcement learning world, benchmark chasing would be called reward hacking: a situation where the agent figures out a way to exploit its environment to obtain a reward without properly performing the task.
The reason LLM providers do this is that whenever a new model is released, people usually look at two things:
- Benchmark performance
- Vibe
Benchmark performance is usually measured on well-known benchmarks such as SWE-bench and ARC AGI. Vibe testing is also a way the public often judges LLMs (I'm not saying it's a good way of testing a model, only that it happens in practice). The problem, however, is that I believe it's quite easy to impress people with the vibe of a model, because vibe checking exercises only a very small share of the LLM's action space. You may only be asking it questions that are already available on the web, or asking it to program an application the model has already seen a thousand instances of in its training data.
Thus, what you should do is maintain a benchmark of your own, for example an in-house dataset that has not leaked to the internet. You can then benchmark which LLM works best for your use case and prioritize using that LLM.
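As a minimal sketch of what that might look like, here is a tiny in-house evaluation harness. The eval cases are invented for illustration, and `call_llm` is a placeholder for whichever provider SDK you actually use.

```python
# Minimal sketch of an in-house benchmark harness; `call_llm` is a placeholder
# for the model client you actually use.
from typing import Callable

# A private eval set: prompts paired with expected answers, never published online.
EVAL_SET = [
    {"prompt": "Extract the invoice number from: 'Invoice #4821, due 2025-08-01'",
     "expected": "4821"},
    {"prompt": "Is 'retrun' a typo of 'return'? Answer yes or no.",
     "expected": "yes"},
]

def score_model(call_llm: Callable[[str], str]) -> float:
    """Return the fraction of in-house tasks the model answers correctly."""
    correct = 0
    for case in EVAL_SET:
        answer = call_llm(case["prompt"]).strip().lower()
        correct += case["expected"].lower() in answer
    return correct / len(EVAL_SET)

# Usage: compare several candidate models on your own data before committing.
# best = max(candidates, key=lambda model: score_model(model.generate))
```

Because the dataset never leaves your organization, a model cannot have been trained on it, so the score reflects how well it handles your tasks rather than how well it has memorized public benchmarks.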
Conclusion
In this article, I've discussed LLM benchmarks and why they're important for evaluating LLMs. I've introduced the newly released ARC AGI 3 benchmark, which is especially interesting considering humans can easily complete some of the tasks while frontier models score 0%. It thus represents a task where human intelligence still outperforms LLMs.
Going forward, I believe we will see rapid improvements in LLM performance on ARC AGI 3, though I hope this will come not from benchmark chasing but from genuine improvements in the intelligence of LLMs.