A new paper from the Center for AI Safety and Scale AI has introduced the Remote Labor Index (RLI), the first benchmark designed to measure how well AI agents can perform paid, remote jobs.
The RLI benchmark consists of real-world projects from freelance platforms, spanning complex fields such as game development, architecture, data analysis, and video production. These aren’t simple tasks: The projects represented over 6,000 hours of human work valued at more than $140,000.
The results? Current AI agents performed poorly.
Manus, the top-performing agent, could only automate 2.5 percent of the work. Other top models, such as Grok 4 and Sonnet 4.5, managed just 2.1 percent, while GPT-5 hit 1.7 percent and Gemini 2.5 Pro came in under 1 percent. The researchers noted that failures stemmed from incomplete deliverables, broken files, and low-quality work that would not meet professional standards.
While these low numbers might seem reassuring to human workers, they don't tell the whole story. To understand what these findings really mean for the future of AI in the workforce, I discussed them with SmarterX and Marketing AI Institute founder and CEO Paul Roetzer on Episode 178 of The Artificial Intelligence Show.
Why General Agents Are the Wrong Measuring Stick
Roetzer wasn’t surprised by the low automation rates, noting that the benchmark tests general agents that aren't specifically trained for these complex jobs.
The real, and much faster, progress is happening with specialized agents. He points to examples including OpenAI reportedly hiring Goldman Sachs bankers to train models to do the job of an investment banker.
“My guess is OpenAI’s is way further along than 2.5 percent for that specific thing,” he says.
This highlights a critical distinction in how we should think about AI’s capabilities. The RLI provides a valuable baseline for general models, but the real economic impact will likely come from models intensely focused on a specific job.
Good at Tasks, Not Yet at Jobs
Roetzer explains this using a simple framework: tasks, projects, and jobs.
Right now, AI is excellent at the task level, which covers the small, discrete actions that make up a larger project.
“It’s good at the tasks,” he says. “It’s not good at doing the full thing.”
An agent can't replace a CEO, for example, but it might help with 25 different tasks that a CEO does every month. Humans, however, are still essential for setting goals, planning, connecting data sources, integrating tools, and, most importantly, overseeing and verifying the AI output.
The Economic Turing Test
The key metric to watch, according to Roetzer, is how long an agent can work without a human needing to intervene, a concept he calls “actions per disengagement,” similar to how Tesla measures self-driving.
We haven't yet reached what he calls the “economic Turing test,” where the economic labor of AI is indistinguishable from that of a human.
“Is it to the point where I would hire an agent or a symphony of agents instead of a human?” he asks. “In every instance I can think of, the answer is still no.”
Still, agents are slowly but surely getting better, more autonomous, and more reliable within specific jobs. And even augmenting people with AI agents could lead to a reduction in the number of people needed, says Roetzer.
“As the agents get more autonomous, as they get more reliable, as more companies understand how to build and integrate them into workflows, you don't need as many people doing the work that you previously did.”
