    How We Are Testing Our Agents in Dev

    By ProfitlyAI · December 6, 2025 · 6 min read


    Why testing agents is so hard

    Verifying that an AI agent is performing as expected is not easy. Even small tweaks to components like your prompt variations, agent orchestration, and models can have large and unexpected impacts.

    Some of the top challenges include:

    Non-deterministic outputs

    The underlying issue at hand is that agents are non-deterministic. The same input goes in; two different outputs can come out.

    How do you test for an expected result when you don't know what the expected result will be? Simply put, testing for strictly defined outputs doesn't work.

    Unstructured outputs

    The second, and less discussed, challenge of testing agentic systems is that outputs are often unstructured. The foundation of agentic systems is large language models, after all.

    It's much easier to define a test for structured data. For example, the id field should never be NULL and should always be an integer. How do you define the quality of a large field of text?
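A structured-field check like the one above is just a few plain assertions. Here is a minimal sketch (the record layout and function name are illustrative, not from the original post):

```python
def validate_record(record: dict) -> list[str]:
    """Deterministic checks on a structured record; returns a list of failure messages."""
    failures = []
    if record.get("id") is None:
        failures.append("id must never be NULL")
    elif not isinstance(record["id"], int):
        failures.append("id must be an integer")
    return failures
```

A clean record yields an empty list; anything else pinpoints exactly which rule broke, with no LLM in the loop.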

    Cost and scale

    LLM-as-judge is the most common method for evaluating the quality or reliability of AI agents. However, it's an expensive workload, and each user interaction (trace) can contain hundreds of interactions (spans).

    So we rethought our agent testing strategy. In this post we'll share our learnings, along with a new key concept that has proven pivotal to ensuring reliability at scale.

    [Image courtesy of the author]

    Testing our agent

    We have two agents in production that are leveraged by more than 30,000 users. The Troubleshooting Agent combs through hundreds of alerts to determine the root cause of a data reliability incident, while the Monitoring Agent makes smart data quality monitoring recommendations.

    For the Troubleshooting Agent we test three main dimensions: semantic distance, groundedness, and tool usage. Here is how we test for each.

    Semantic distance

    We leverage deterministic tests when appropriate, as they're clear, explainable, and cost-effective. For example, it's relatively easy to deploy a test to ensure one of the subagent's outputs is in JSON format, that outputs don't exceed a certain length, or to check that the guardrails are being called as intended.

    However, there are times when deterministic tests won't get the job done. For example, we explored embedding both expected and new outputs as vectors and using cosine similarity tests. We thought this would be a cheaper and faster way to evaluate semantic distance (is the meaning similar?) between observed and expected outputs.

    However, we found there were too many cases where the wording was similar but the meaning was different.
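For reference, the cosine-similarity approach we moved away from looks roughly like this. The embedding model is up to you; the vectors below are stand-ins for embedded outputs:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

The failure mode described above follows directly: two outputs with overlapping wording land close in embedding space even when one reverses the other's conclusion.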

    Instead, we now show our LLM judge the expected output from the current configuration and ask it to score the similarity of the new output on a 0–1 scale.
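A sketch of that judge setup, assuming a generic chat-completion client. The prompt wording, `build_judge_prompt`, `parse_score`, and the commented `call_llm` helper are illustrative, not our exact implementation:

```python
def build_judge_prompt(expected: str, observed: str) -> str:
    """Build a prompt asking an LLM judge for a 0-1 semantic-similarity score."""
    return (
        "Score from 0 to 1 how semantically similar the OBSERVED output is "
        "to the EXPECTED output. Reply with only the number.\n\n"
        f"EXPECTED:\n{expected}\n\nOBSERVED:\n{observed}"
    )


def parse_score(reply: str) -> float:
    """Parse the judge's reply into a float clamped to the 0-1 range."""
    return max(0.0, min(1.0, float(reply.strip())))


# In production the prompt is sent to your model of choice, e.g.:
# score = parse_score(call_llm(build_judge_prompt(expected, observed)))
```

Clamping the parsed score keeps a slightly off-script judge reply from breaking downstream threshold logic.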

    Groundedness

    For groundedness, we check to ensure that the key context is present when it should be, but also that the agent will decline to answer when the key context is missing or the question is out of scope.

    This is important, as LLMs are eager to please and will hallucinate when they aren't grounded with good context.

    Tool usage

    For tool usage we have an LLM-as-judge evaluate whether the agent performed as expected for the pre-defined scenario, meaning:

    • No tool was expected and no tool was called
    • A tool was expected and a permitted tool was used
    • No required tools were omitted
    • No non-permitted tools were used
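Most of those rules are mechanical enough to sketch deterministically. This is a minimal illustration of the rule set, not the LLM-as-judge implementation the post describes; the function and its set-based signature are hypothetical:

```python
def check_tool_usage(
    expected: set[str], called: set[str], permitted: set[str]
) -> list[str]:
    """Apply the tool-usage rules to one scenario; returns failure messages."""
    failures = []
    if not expected and called:
        failures.append("no tool was expected, but tools were called")
    missing = expected - called
    if missing:
        failures.append(f"required tools omitted: {sorted(missing)}")
    illegal = called - permitted
    if illegal:
        failures.append(f"non-permitted tools used: {sorted(illegal)}")
    return failures
```

An LLM judge is still useful on top of this for the fuzzier question of whether the *right* permitted tool was chosen for the scenario.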

    The real magic is not deploying these tests, but how these tests are applied. Here is our current setup, informed by some painful trial and error.

    Agent testing best practices

    It's important to remember that not only are your agents non-deterministic, but so are your LLM evaluations! These best practices are primarily designed to combat these inherent shortcomings.

    Soft failures

    Hard thresholds can be noisy with non-deterministic tests, for obvious reasons. So we invented the concept of a "soft failure."

    The evaluation comes back with a score between 0 and 1. Anything less than 0.5 is a hard failure, while anything above 0.8 is a pass. Soft failures occur for scores between 0.5 and 0.8.

    Changes can be merged with a soft failure. However, if a certain threshold of soft failures is exceeded, it constitutes a hard failure and the process is halted.

    For our agent, it's currently configured so that if 33% of tests result in a soft failure, or if there are more than 2 soft failures total, then it's considered a hard failure. This prevents the change from being merged.
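The thresholds above can be sketched as a small merge gate. This is an illustration of the described rules; exact boundary handling (e.g. whether exactly 33% blocks a merge) is an assumption:

```python
def classify(score: float) -> str:
    """Map a 0-1 judge score to pass / soft / hard per the stated thresholds."""
    if score < 0.5:
        return "hard"
    if score > 0.8:
        return "pass"
    return "soft"


def gate(scores: list[float], soft_ratio: float = 0.33, soft_cap: int = 2) -> bool:
    """Return True if the change may merge under the soft-failure rules."""
    labels = [classify(s) for s in scores]
    if "hard" in labels:
        return False  # any hard failure halts the process outright
    soft = labels.count("soft")
    # too many soft failures, by ratio or absolute count, becomes a hard failure
    return soft <= soft_cap and soft / len(labels) < soft_ratio
```

Keeping the gate as plain code means the merge decision itself is deterministic even though the underlying scores are not.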

    Re-evaluate soft failures

    Soft failures can be a canary in a coal mine, or in some cases they can be nonsense. About 10% of soft failures are the result of hallucinations. In the case of a soft failure, the evaluations will automatically re-run. If the resulting tests pass, we assume the original result was incorrect.
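The re-run policy can be expressed as a small wrapper around the judge call. A sketch under the assumption that a later pass supersedes the soft failure; `run_eval` stands in for the actual LLM judge invocation:

```python
def evaluate_with_retry(run_eval, retries: int = 1) -> tuple[float, bool]:
    """Re-run an evaluation on soft failure; a later result replaces the first.

    run_eval: zero-arg callable returning a 0-1 judge score.
    Returns (final_score, was_rerun).
    """
    score = run_eval()
    rerun = False
    for _ in range(retries):
        if not (0.5 <= score <= 0.8):  # only soft failures are re-evaluated
            break
        rerun = True
        score = run_eval()
    return score, rerun
```

Hard failures and passes are deliberately never retried, so the retry budget is only spent where hallucinated judgments are plausible.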

    Explanations

    When a test fails, you need to understand why it failed. We now ask every LLM judge to not just provide a score, but to explain it. It's imperfect, but it helps build trust in the evaluation and often speeds up debugging.

    Removing flaky tests

    You have to test your tests. Especially with LLM-as-judge evaluations, the way the prompt is constructed can have a significant impact on the results. We run tests multiple times, and if the delta across the results is too large we will revise the prompt or remove the flaky test.
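A simple spread check captures this flakiness screen. The run count and the acceptable delta are illustrative knobs, not values from the post:

```python
def is_flaky(run_eval, runs: int = 5, max_delta: float = 0.3) -> bool:
    """Run an evaluation several times; flag it flaky if scores spread too far.

    run_eval: zero-arg callable returning a 0-1 judge score.
    max_delta: widest acceptable spread between best and worst score.
    """
    scores = [run_eval() for _ in range(runs)]
    return max(scores) - min(scores) > max_delta
```

A test flagged here gets its prompt revised first; removal is the fallback when no prompt change stabilizes it.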

    Monitoring in production

    Agent testing is new and challenging, but it's a walk in the park compared to monitoring agent behavior and outputs in production. Inputs are messier, there is no expected output to baseline against, and everything happens at a much larger scale.

    Not to mention, the stakes are much higher! System reliability problems quickly become business problems.

    This is our current focus. We're leveraging agent observability tools to tackle these challenges and will report new learnings in a future post.

    The Troubleshooting Agent has been one of the most impactful features we've ever shipped. Developing reliable agents has been a career-defining journey, and we're excited to share it with you.


    Michael Segner is a product strategist at Monte Carlo and the author of the O'Reilly report "Improving Data + AI Reliability Through Observability." This post was co-authored with Elor Arieli and Alik Peltinovich.



