New Benchmark Shows AI Agents Perform Poorly When Automating Real Jobs

A new paper from the Center for AI Safety and Scale AI has introduced the Remote Labor Index (RLI), the first benchmark designed to measure how well AI agents can perform paid, remote jobs.

The RLI benchmark includes real-world projects from freelance platforms, spanning complex fields such as game development, architecture, data analysis, and video production. These aren’t simple tasks: The projects represented over 6,000 hours of human work valued at more than $140,000.

The results? Current AI agents performed poorly.

Manus, the top-performing agent, could only automate 2.5 percent of the work. Other top models, such as Grok 4 and Sonnet 4.5, managed just 2.1 percent, while GPT-5 hit 1.7 percent and Gemini 2.5 Pro came in below 1 percent. The researchers noted that failures stemmed from incomplete deliverables, broken files, and low-quality work that would not meet professional standards.

While these low numbers might seem reassuring to human workers, they don’t tell the whole story. To understand what these findings really mean for the future of AI in the workforce, I discussed them with SmarterX and Marketing AI Institute founder and CEO Paul Roetzer on Episode 178 of The Artificial Intelligence Show.

Why General Agents Are the Wrong Measuring Stick

Roetzer wasn’t surprised by the low automation rates, noting that the benchmark tests general agents that aren’t specifically trained for these complex jobs.

The real, and much faster, progress is happening with specialized agents. He points to examples including OpenAI reportedly hiring Goldman Sachs bankers to train models to do the job of an investment banker.

“My guess is OpenAI’s is way further along than 2.5 percent for that specific thing,” he says.

This highlights a critical distinction in how we should think about AI’s capabilities. The RLI provides a valuable baseline for general models, but the real economic impact will likely come from models intensely focused on a specific job.

Good at Tasks, Not Yet at Jobs

Roetzer explains this using a simple framework: tasks, projects, and jobs.

Right now, AI is very good at the task level, which includes the small, discrete actions that make up a larger project.

“It’s good at the tasks,” he says. “It isn’t good at doing the full thing.”

An agent can’t replace a CEO, for example, but it might help with 25 different tasks that a CEO does every month. Humans, however, are still essential for setting goals, planning, connecting data sources, integrating tools, and, most importantly, overseeing and verifying the AI output.

The Economic Turing Test

The key metric to watch, according to Roetzer, is how long an agent can work without a human needing to intervene, a concept he calls “actions per disengagement,” similar to how Tesla measures self-driving.

We haven’t yet reached what he calls the “economic Turing test,” where the economic labor of AI is indistinguishable from that of a human.

“Is it to the point where I would hire an agent or a symphony of agents instead of a human?” he asks. “In every instance I can think of, the answer is still no.”

Still, agents are slowly but surely getting better, more autonomous, and more reliable within specific jobs. And even augmenting people with AI agents could lead to a reduction in the number of people needed, says Roetzer.

“As the agents get more autonomous, as they get more reliable, as more companies understand how to build and integrate them into workflows, you don’t need as many people doing the work that you previously did.”
