
    How to build a better AI benchmark

    By ProfitlyAI, May 8, 2025

    The limits of traditional testing

    If AI companies have been slow to respond to the growing failure of benchmarks, it’s partly because the test-scoring approach has been so effective for so long.

    One of the biggest early successes of contemporary AI was the ImageNet challenge, a kind of antecedent to today’s benchmarks. Released in 2010 as an open challenge to researchers, the database held more than 3 million images for AI systems to categorize into 1,000 different classes.

    Crucially, the test was completely agnostic to methods, and any successful algorithm quickly gained credibility regardless of how it worked. When an algorithm called AlexNet broke through in 2012, with a then-unconventional form of GPU training, it became one of the foundational results of modern AI. Few would have guessed in advance that AlexNet’s convolutional neural nets would be the secret to unlocking image recognition, but after it scored well, no one dared dispute it. (One of AlexNet’s developers, Ilya Sutskever, would go on to cofound OpenAI.)

    A large part of what made this challenge so effective was that there was little practical difference between ImageNet’s object classification challenge and the actual process of asking a computer to recognize an image. Even if there were disputes about methods, no one doubted that the highest-scoring model would have an advantage when deployed in an actual image recognition system.

    But in the 12 years since, AI researchers have applied that same method-agnostic approach to increasingly general tasks. SWE-Bench is commonly used as a proxy for broader coding ability, while other exam-style benchmarks often stand in for reasoning ability. That broad scope makes it difficult to be rigorous about what a specific benchmark measures, which in turn makes it hard to use the findings responsibly.

    Where things break down

    Anka Reuel, a PhD student who has been focusing on the benchmark problem as part of her research at Stanford, has become convinced that the evaluation problem is the result of this push toward generality. “We’ve moved from task-specific models to general-purpose models,” Reuel says. “It’s not about a single task anymore but a whole bunch of tasks, so evaluation becomes harder.”

    Like the University of Michigan’s Jacobs, Reuel thinks “the main issue with benchmarks is validity, even more than the practical implementation,” noting: “That’s where a lot of things break down.” For a task as complicated as coding, for instance, it’s nearly impossible to incorporate every possible scenario into your problem set. As a result, it’s hard to gauge whether a model is scoring better because it’s more skilled at coding or because it has more effectively manipulated the problem set. And with so much pressure on developers to achieve record scores, shortcuts are hard to resist.
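
    One rough way to probe the question of skill versus a manipulated problem set is to compare a model’s score on the public problem set with its score on fresh problems it could not have seen. The sketch below is only an illustration under stated assumptions: the function names and the pass/fail results are hypothetical, and it is not a method attributed to Reuel or to any benchmark named in this article.

```python
def accuracy(results):
    """Fraction of problems solved; `results` is a list of booleans."""
    return sum(results) / len(results)


def generalization_gap(public_results, heldout_results):
    """Public-set accuracy minus held-out accuracy.

    A large positive gap hints that the score may reflect familiarity
    with the public problem set rather than general skill.
    """
    return accuracy(public_results) - accuracy(heldout_results)


# Hypothetical pass/fail outcomes, invented for illustration only.
public_results = [True, True, True, True, False]     # 4 of 5 solved
heldout_results = [True, False, True, False, False]  # 2 of 5 solved

gap = generalization_gap(public_results, heldout_results)
print(f"public={accuracy(public_results):.2f} "
      f"held-out={accuracy(heldout_results):.2f} gap={gap:.2f}")
```

    A gap near zero is consistent with a score that generalizes, though a small sample like this one cannot settle the question on its own.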

    For developers, the hope is that success on many specific benchmarks will add up to a generally capable model. But the techniques of agentic AI mean a single AI system can encompass a complex array of different models, making it hard to evaluate whether improvement on a specific task will lead to generalization. “There’s just many more knobs you can turn,” says Sayash Kapoor, a computer scientist at Princeton and a prominent critic of sloppy practices in the AI industry. “When it comes to agents, they have sort of given up on the best practices for evaluation.”


