    How Good Are New GPT-OSS Models? We Put Them to the Test.

By ProfitlyAI | September 16, 2025


OpenAI hadn't released an open-weight language model since GPT-2 back in 2019. Six years later, they surprised everyone with two: gpt-oss-120b and the smaller gpt-oss-20b.

Naturally, we wanted to know: how do they actually perform?

To find out, we ran both models through our open-source workflow optimization framework, syftr. It evaluates models across different configurations (fast vs. cheap, high vs. low accuracy) and includes support for OpenAI's new "thinking effort" setting.
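Under the hood, this kind of search is a multi-objective optimization problem. The sketch below is not syftr's actual API; it is a minimal Optuna-style illustration of the idea, assuming a hypothetical run_workflow helper that executes one RAG configuration and returns its measured accuracy and latency.

```python
import optuna

def objective(trial: optuna.Trial):
    # Sample one workflow configuration from the search space.
    llm = trial.suggest_categorical(
        "llm", ["gpt-oss-20b-low", "gpt-oss-20b-medium", "gpt-oss-120b-medium"]
    )
    top_k = trial.suggest_int("retriever_top_k", 2, 20)
    # run_workflow is a hypothetical helper: build the RAG flow, answer the
    # benchmark questions, and return (accuracy, mean latency in seconds).
    accuracy, latency = run_workflow(llm=llm, top_k=top_k)
    return accuracy, latency

# Maximize accuracy while minimizing latency; best_trials holds the Pareto set.
study = optuna.create_study(directions=["maximize", "minimize"])
study.optimize(objective, n_trials=100)
for t in study.best_trials:
    print(t.params, t.values)
```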

In theory, more thinking should mean better answers. In practice? Not always.

We also use syftr to explore questions like "is LLM-as-a-Judge actually working?" and "what workflows perform well across many datasets?".

Our first results with GPT-OSS might surprise you: the best performer wasn't the biggest model or the deepest thinker.

Instead, the 20b model with low thinking effort consistently landed on the Pareto frontier, even rivaling the 120b medium configuration on benchmarks like FinanceBench, HotpotQA, and MultihopRAG. Meanwhile, high thinking effort rarely mattered at all.

How we set up our experiments

We didn't just pit GPT-OSS against itself. Instead, we wanted to see how it stacked up against other strong open-weight models. So we compared gpt-oss-20b and gpt-oss-120b with:

    • qwen3-235b-a22b
    • glm-4.5-air
    • nemotron-super-49b
    • qwen3-30b-a3b
    • gemma3-27b-it
    • phi-4-multimodal-instruct

To test OpenAI's new "thinking effort" feature, we ran each GPT-OSS model in three modes: low, medium, and high thinking effort. That gave us six configurations in total (see the request sketch after this list):

    • gpt-oss-120b-low / -medium / -high
    • gpt-oss-20b-low / -medium / -high
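For reference, here is a minimal sketch of how a single request at a chosen thinking-effort level might look against an OpenAI-compatible endpoint serving gpt-oss. The base_url, model name, and the system-prompt convention for setting the effort level are assumptions; depending on your serving stack, a dedicated parameter may be used instead.

```python
from openai import OpenAI

# Assumed: a local OpenAI-compatible server (e.g., vLLM or Ollama) hosting gpt-oss-20b.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def ask(question: str, effort: str = "low") -> str:
    """Send one question with thinking effort set to low, medium, or high."""
    resp = client.chat.completions.create(
        model="gpt-oss-20b",
        messages=[
            # Assumption: effort is set via the system prompt; check your server's docs.
            {"role": "system", "content": f"Reasoning: {effort}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

print(ask("Which company had the higher operating margin, and why?", effort="low"))
```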

For evaluation, we cast a wide net: five RAG and agent modes, 16 embedding models, and a range of flow configuration options. To judge model responses, we used GPT-4o-mini and compared answers against known ground truth.
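The judging step is a standard LLM-as-a-Judge pattern. The sketch below is a minimal illustration of it, not the exact prompt or rubric used in our runs, and assumes an OPENAI_API_KEY is set in the environment.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading a RAG system's answer against a reference answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with exactly one word: CORRECT or INCORRECT."""

def judge(question: str, reference: str, candidate: str) -> bool:
    """Return True if GPT-4o-mini judges the candidate answer correct."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("CORRECT")
```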

Finally, we tested across four datasets:

• FinanceBench (financial reasoning)
• HotpotQA (multi-hop QA)
• MultihopRAG (retrieval-augmented reasoning)
• PhantomWiki (synthetic Q&A pairs)

We optimized workflows twice: once for accuracy + latency, and once for accuracy + cost, capturing the tradeoffs that matter most in real-world deployments.
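Each optimization produces a Pareto frontier: the set of configurations where you cannot improve accuracy without paying more in latency or cost. Below is a minimal sketch of how such a frontier can be extracted from completed trials; the numbers are illustrative placeholders, not measured results.

```python
# Each trial: (name, accuracy, mean latency in seconds, cost in USD per query).
# The numbers are made up purely for illustration.
trials = [
    ("gpt-oss-20b-low",     0.57, 1.2, 0.0004),
    ("gpt-oss-120b-medium", 0.61, 3.8, 0.0021),
    ("gpt-oss-120b-high",   0.60, 9.5, 0.0065),
]

def pareto_frontier(trials, value, cost):
    """Keep trials that no other trial beats on both objectives at once."""
    frontier = []
    for t in trials:
        dominated = any(
            value(o) >= value(t) and cost(o) <= cost(t)
            and (value(o) > value(t) or cost(o) < cost(t))
            for o in trials
        )
        if not dominated:
            frontier.append(t)
    return frontier

# Accuracy vs. latency and accuracy vs. cost frontiers from the same trial set.
latency_frontier = pareto_frontier(trials, value=lambda t: t[1], cost=lambda t: t[2])
cost_frontier = pareto_frontier(trials, value=lambda t: t[1], cost=lambda t: t[3])
print(latency_frontier)
print(cost_frontier)
```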

Optimizing for latency, cost, and accuracy

When we optimized the GPT-OSS models, we looked at two tradeoffs: accuracy vs. latency and accuracy vs. cost. The results were more surprising than we expected:

• GPT-OSS 20b (low thinking effort):
  Fast, inexpensive, and consistently accurate. This setup appeared on the Pareto frontier again and again, making it the best default choice for most non-scientific tasks. In practice, that means quicker responses and lower bills compared to higher thinking efforts.
• GPT-OSS 120b (medium thinking effort):
  Best suited for tasks that demand deeper reasoning, like financial benchmarks. Use it when accuracy on complex problems matters more than cost.
• GPT-OSS 120b (high thinking effort):
  Expensive and usually unnecessary. Keep it in your back pocket for edge cases where other models fall short. On our benchmarks, it didn't add value.
Figure 1: Accuracy-latency optimization with syftr
Figure 2: Accuracy-cost optimization with syftr

Reading the results more carefully

At first glance, the results look straightforward. But there's an important nuance: an LLM's top accuracy score depends not just on the model itself, but on how the optimizer weighs it against the other models in the mix. To illustrate, let's look at FinanceBench.

When optimizing for latency, all GPT-OSS models (except high thinking effort) landed on similar Pareto frontiers. In this case, the optimizer had little reason to focus on the 20b low thinking configuration; its top accuracy was only 51%.

Figure 3: Per-LLM Pareto frontiers for latency optimization on FinanceBench

When optimizing for cost, the picture shifts dramatically. The same 20b low thinking configuration jumps to 57% accuracy, while the 120b medium configuration actually drops 22%. Why? Because the 20b model is far cheaper, so the optimizer shifts more weight toward it.

Figure 4: Per-LLM Pareto frontiers for cost optimization on FinanceBench

The takeaway: performance depends on context. Optimizers will favor different models depending on whether you're prioritizing speed, cost, or accuracy. And given the large search space of possible configurations, there may be even better setups beyond the ones we tested.

Finding agentic workflows that work well for your setup

The new GPT-OSS models performed strongly in our tests, especially the 20b with low thinking effort, which often outpaced more expensive competitors. The bigger lesson? A bigger model and more effort don't always mean more accuracy. Sometimes, paying more just gets you less.

This is exactly why we built syftr and made it open source. Every use case is different, and the best workflow for you depends on the tradeoffs you care about most. Want lower costs? Faster responses? Maximum accuracy?

Run your own experiments and find the Pareto sweet spot that balances these priorities for your setup.


