    Production-Grade Observability for AI Agents: A Minimal-Code, Configuration-First Approach

    By ProfitlyAI | December 17, 2025


    As LLM agents grow more complex, traditional logging and monitoring fall short. What teams really need is observability: the ability to trace agent decisions, evaluate response quality automatically, and detect drift over time, without writing and maintaining large amounts of custom evaluation and telemetry code.

    Teams therefore need to adopt the right observability platform while they focus on the core task of building and improving the agents' orchestration, and integrate their application with that platform with minimal overhead to their functional code. In this article, I'll demonstrate how to set up an open-source AI observability platform to do the following using a minimal-code approach:

    • LLM-as-a-Judge: Configure pre-built evaluators to score responses for Correctness, Relevance, Hallucination and more. Display scores across runs with detailed logs and analytics.
    • Testing at scale: Set up datasets that store regression test cases for measuring accuracy against expected ground-truth responses. Proactively detect LLM and agent drift.
    • MELT data: Track metrics (latency, token usage, model drift), events (API calls, LLM calls, tool usage) and logs (user interactions, tool executions, agent decision making) with detailed traces, all without writing detailed telemetry and instrumentation code.

    We will be using Langfuse for observability. It is open-source and framework-agnostic, and works with popular orchestration frameworks and LLM providers.
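
    As a quick illustration of how lightweight the instrumentation can be: for plain Python functions outside any framework, the Langfuse SDK also offers an @observe decorator that records a trace automatically. The demo in this article does not use it (it relies on the LangChain callback handler plus explicit spans, shown below); this is only a minimal sketch, and answer_ticket is a hypothetical function, not part of the demo:

    from langfuse import observe

    @observe()  # automatically records a trace for each call to this function
    def answer_ticket(ticket: str) -> str:
        # ...call your LLM or agent orchestration here...
        return "placeholder response"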

    Multi-agent application

    For this demonstration, I've attached the LangGraph code of a Customer Service application. The application accepts tickets from the user, classifies each ticket as Technical, Billing or Both using a Triage agent, and then routes it to the Technical Support agent, the Billing Support agent, or both. A Finalizer agent then synthesizes the responses into a single coherent, more readable answer. The flowchart is as follows:

    Customer Service agentic application
    The code is attached here
    # --------------------------------------------------
    # 0. Load .env
    # --------------------------------------------------
    from dotenv import load_dotenv
    load_dotenv(override=True)
    
    # --------------------------------------------------
    # 1. Imports
    # --------------------------------------------------
    import os
    from typing import TypedDict
    
    from langgraph.graph import StateGraph, END
    from langchain_openai import AzureChatOpenAI
    
    from langfuse import Langfuse
    from langfuse.langchain import CallbackHandler
    
    # --------------------------------------------------
    # 2. Langfuse Client (WORKING CONFIG)
    # --------------------------------------------------
    langfuse = Langfuse(
        host="https://cloud.langfuse.com",
        public_key=os.environ["LANGFUSE_PUBLIC_KEY"] , 
        secret_key=os.environ["LANGFUSE_SECRET_KEY"]  
    )
    langfuse_callback = CallbackHandler()
    os.environ["LANGGRAPH_TRACING"] = "false"
    
    
    # --------------------------------------------------
    # 3. Azure OpenAI Setup
    # --------------------------------------------------
    llm = AzureChatOpenAI(
        azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"],
        api_version=os.environ.get("AZURE_OPENAI_API_VERSION", "2025-01-01-preview"),
        temperature=0.2,
        callbacks=[langfuse_callback],  # 🔑 enables token usage
    )
    
    # --------------------------------------------------
    # 4. Shared State
    # --------------------------------------------------
    class AgentState(TypedDict, total=False):
        ticket: str
        category: str
        technical_response: str
        billing_response: str
        final_response: str
    
    # --------------------------------------------------
    # 5. Agent Definitions
    # --------------------------------------------------
    
    def triage_agent(state: dict) -> dict:
        with langfuse.start_as_current_observation(
            as_type="span",
            name="triage_agent",
            input={"ticket": state["ticket"]},
        ) as span:
            span.update_trace(name="Customer Service Query - LangGraph Demo")

            response = llm.invoke([
                {
                    "role": "system",
                    "content": (
                        "Classify the query as one of: "
                        "Technical, Billing, Both. "
                        "Respond with only the label."
                    ),
                },
                {"role": "user", "content": state["ticket"]},
            ])

            raw = response.content.strip().lower()

            if "both" in raw:
                category = "Both"
            elif "technical" in raw:
                category = "Technical"
            elif "billing" in raw:
                category = "Billing"
            else:
                category = "Technical"  # ✅ safe fallback

            span.update(output={"raw": raw, "category": category})

            return {"category": category}
    
    
    
    def technical_support_agent(state: dict) -> dict:
        with langfuse.start_as_current_observation(
            as_type="span",
            name="technical_support_agent",
            input={
                "ticket": state["ticket"],
                "category": state.get("category"),
            },
        ) as span:

            response = llm.invoke([
                {
                    "role": "system",
                    "content": (
                        "You are a technical support specialist. "
                        "Provide a clear, step-by-step solution."
                    ),
                },
                {"role": "user", "content": state["ticket"]},
            ])

            answer = response.content

            span.update(output={"technical_response": answer})

            return {"technical_response": answer}
    
    
    def billing_support_agent(state: dict) -> dict:
        with langfuse.start_as_current_observation(
            as_type="span",
            name="billing_support_agent",
            input={
                "ticket": state["ticket"],
                "category": state.get("category"),
            },
        ) as span:

            response = llm.invoke([
                {
                    "role": "system",
                    "content": (
                        "You are a billing support specialist. "
                        "Answer clearly about payments, invoices, or accounts."
                    ),
                },
                {"role": "user", "content": state["ticket"]},
            ])

            answer = response.content

            span.update(output={"billing_response": answer})

            return {"billing_response": answer}
    
    def finalizer_agent(state: dict) -> dict:
        with langfuse.start_as_current_observation(
            as_type="span",
            name="finalizer_agent",
            input={
                "ticket": state["ticket"],
                "technical": state.get("technical_response"),
                "billing": state.get("billing_response"),
            },
        ) as span:

            parts = [
                f"Technical:\n{state['technical_response']}"
                for k in ["technical_response"]
                if state.get(k)
            ] + [
                f"Billing:\n{state['billing_response']}"
                for k in ["billing_response"]
                if state.get(k)
            ]

            if not parts:
                final = "Error: No agent responses available."
            else:
                response = llm.invoke([
                    {
                        "role": "system",
                        "content": (
                            "Combine the following agent responses into ONE clear, professional, "
                            "customer-facing answer. Do not mention agents or internal labels. "
                            f"Answer the user's query: '{state['ticket']}'."
                        ),
                    },
                    {"role": "user", "content": "\n\n".join(parts)},
                ])
                final = response.content

            span.update(output={"final_response": final})
            return {"final_response": final}
    
    
    # --------------------------------------------------
    # 6. LangGraph Construction
    # --------------------------------------------------
    builder = StateGraph(AgentState)

    builder.add_node("triage", triage_agent)
    builder.add_node("technical", technical_support_agent)
    builder.add_node("billing", billing_support_agent)
    builder.add_node("finalizer", finalizer_agent)

    builder.set_entry_point("triage")

    # Conditional routing
    builder.add_conditional_edges(
        "triage",
        lambda state: state["category"],
        {
            "Technical": "technical",
            "Billing": "billing",
            "Both": "technical",
            "__default__": "technical",  # ✅ never dead-end
        },
    )

    # Sequential routing: continue to billing only when both teams are needed
    builder.add_conditional_edges(
        "technical",
        lambda state: "Both" if state.get("category") == "Both" else "__default__",
        {
            "Both": "billing",         # Continue to billing if Both
            "__default__": "finalizer",
        },
    )
    builder.add_edge("billing", "finalizer")
    builder.add_edge("finalizer", END)
    
    graph = builder.compile()
    
    
    # --------------------------------------------------
    # 9. Main
    # --------------------------------------------------
    if __name__ == "__main__":

        print("===============================================")
        print(" Conditional Multi-Agent Support System (Ready)")
        print("===============================================")
        print("Enter 'exit' or 'quit' to stop the program.\n")

        while True:
            # Get user input for the ticket
            ticket = input("Enter your support query (ticket): ")

            # Check for exit command
            if ticket.lower() in ["exit", "quit"]:
                print("\nExiting the support system. Goodbye!")
                break

            if not ticket.strip():
                print("Please enter a non-empty query.")
                continue

            try:
                # --- Run the graph with the user's ticket ---
                result = graph.invoke(
                    {"ticket": ticket},
                    config={"callbacks": [langfuse_callback]},
                )

                # --- Print Results ---
                category = result.get("category", "N/A")
                print(f"\n✅ Triage Classification: **{category}**")

                # Check which agents were executed based on the presence of a response
                executed_agents = []
                if result.get("technical_response"):
                    executed_agents.append("Technical")
                if result.get("billing_response"):
                    executed_agents.append("Billing")

                print(f"🛠️ Agents Executed: {', '.join(executed_agents) if executed_agents else 'None (Triage Failed)'}")

                print("\n================ FINAL RESPONSE ================\n")
                print(result["final_response"])
                print("\n" + "="*60 + "\n")

            except Exception as e:
                # Important for debugging: print the exception type and message
                print(f"\nAn error occurred during processing ({type(e).__name__}): {e}")
                print("\nPlease try another query.")
                print("\n" + "="*60 + "\n")

    Observability Configuration

    To set up Langfuse, go to https://cloud.langfuse.com/ and create an account with a billing tier (a hobby tier with generous limits is available), then set up a Project. In the project settings, you can generate the public and secret keys which need to be provided at the beginning of the code. You also need to add the LLM connection, which will be used for the LLM-as-a-Judge evaluation.

    Langfuse project setup
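
    Since both the application code and the regression script read these values from the environment, it helps to fail fast when something is missing. Below is a minimal, optional sketch that checks for the variable names used in this article before any agent runs; AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT are the standard variables the Azure OpenAI client reads, included here on the assumption that you rely on environment-based authentication:

    import os

    # Environment variables referenced by the code in this article
    REQUIRED_VARS = [
        "LANGFUSE_PUBLIC_KEY",
        "LANGFUSE_SECRET_KEY",
        "AZURE_OPENAI_DEPLOYMENT_NAME",
        "AZURE_OPENAI_API_KEY",    # standard Azure OpenAI auth variable (assumed)
        "AZURE_OPENAI_ENDPOINT",   # standard Azure OpenAI endpoint variable (assumed)
    ]

    missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")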

    LLM-as-a-Judge setup

    This is the core of the performance evaluation setup for agents. Here you can configure various pre-built evaluators from the Evaluator Library, which score the responses on criteria such as Conciseness, Correctness, Hallucination, Answer Critic and so on. These should suffice for most use cases; otherwise, custom evaluators can be set up as well. Here is a view of the Evaluator Library:

    Evaluator library

    Select the evaluator you wish to use, say Relevance. You can choose to run it on new or existing traces or on dataset runs. In addition, review the evaluation prompt to make sure it satisfies your evaluation objective. Most importantly, the query, generation and other variables need to be correctly mapped to their source (usually the Input and Output of the application trace). In our case, these are the ticket entered by the user and the response generated by the Finalizer agent, respectively. For dataset runs, you can additionally compare the generated responses to the ground-truth responses stored as expected outputs (explained in the next sections).

    Here is the configuration of the 'GT Accuracy' evaluation I set up for new dataset runs, together with the variable mapping. The evaluation prompt preview is also shown. Most of the evaluators score within a range of 0 to 1:

    Evaluator setup
    Evaluator prompt

    For the customer service demo, I've configured three evaluators: Relevance and Conciseness, which run on all new traces, and GT Accuracy, which runs on dataset runs only.

    Active evaluators

    Datasets setup

    Create a dataset to use as a test case repository. Here you can store test cases, each with an input query and the ideal expected response. To populate the dataset there are three options: create one record at a time, upload a CSV of queries and expected responses, or, quite conveniently, add inputs and outputs directly from application traces whose responses have been judged to be of good quality by human experts.
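
    If you prefer to seed the dataset from code rather than through the UI or a CSV upload, the SDK also exposes dataset-management calls. Here is a minimal sketch, assuming the create_dataset and create_dataset_item client methods available in recent Langfuse SDK versions; the example ticket and expected response are invented for illustration:

    from langfuse import Langfuse

    langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST from the environment

    # Create the test case repository (skip this line if you already created it in the UI)
    langfuse.create_dataset(name="Regression")

    # Store one test case: the input query and the ideal expected response
    langfuse.create_dataset_item(
        dataset_name="Regression",
        input={"ticket": "I was charged twice for my subscription this month."},
        expected_output="Acknowledge the duplicate charge, explain the refund timeline, and confirm the corrected invoice.",
    )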

    Here is the dataset I've created for the demo. It is a mix of technical, billing and 'Both' queries, and I've created all the records from application traces:

    Dataset view

    That's it! The configuration is done and we're ready to run observability.

    Observability Results

    The Langfuse Home page is a dashboard of several useful charts. It shows the count of execution traces, scores and averages at a glance, traces over time, model usage and cost, and so on.

    Observability overview dashboard

    MELT data

    The most useful observability data is available under the 'Tracing' option, which displays summarized and detailed views of all executions. Here is a view of the dashboard showing the time, name, input, output and the essential latency and token usage metrics. Note that for every execution of our application, two evaluation traces are generated, one each for the Conciseness and Relevance evaluators we set up.

    Tracing overview
    Conciseness and Relevance evaluation runs for each application execution

    Let's look at the details of one execution of the Customer Service application. In the left panel, the agent flow is depicted both as a tree and as a flowchart. It shows the LangGraph nodes (agents) and the LLM calls together with their token usage. If our agents had tool calls or human-in-the-loop steps, they would be depicted here as well. Note that the evaluation scores for Conciseness and Relevance are also shown at the top, 0.40 and 1 respectively for this run. Clicking on them reveals the reasoning behind the score and a link to the evaluator trace.

    On the right, for each agent, LLM and tool call, we can see the input and the generated output. For instance, here we see that the query was classified as 'Both', and accordingly the left panel shows that both the technical and billing support agents were called, which confirms our flow is working as expected.

    Multi-agent trace

    At the top of the right-hand panel is the 'Add to datasets' button. At any step of the tree, clicking this button opens a panel like the one shown below, where you can add the input and output of that step directly to a test dataset created in the previous section. This is a useful feature for human experts to add frequently occurring user queries and good responses to the dataset during normal agent operations, building a regression test repository with minimal effort. Later, when there is a major upgrade or release of the application, the regression dataset can be run and the generated outputs scored against the expected outputs (ground truth) recorded here, using the 'GT Accuracy' evaluator we created during the LLM-as-a-Judge setup. This helps to detect LLM drift (or agent drift) early and take corrective action.

    Add to Dataset
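
    The same action can also be scripted, for example from a human-review workflow. Here is a minimal sketch that reuses the langfuse client initialized earlier and assumes the source_trace_id parameter supported by recent SDK versions, so the dataset item keeps a link back to the trace it came from; the trace id and texts below are placeholders:

    # Promote a reviewed trace's input and output into the Regression dataset
    langfuse.create_dataset_item(
        dataset_name="Regression",
        input={"ticket": "My app crashes on login and I was billed twice."},
        expected_output="The combined technical and billing answer approved by the reviewer.",
        source_trace_id="<trace-id-from-the-Tracing-view>",  # placeholder
    )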

    Here is one of the evaluation traces (Conciseness) for this application trace. The evaluator gives its reasoning for the score of 0.4 it assigned to this response.

    Evaluator reasoning

    Scores

    The Scores option in Langfuse provides a list of all the evaluation runs from the various active evaluators, together with their scores. More useful still is the analytics dashboard, where two scores can be selected and metrics such as mean and standard deviation, along with trend lines, can be viewed.

    Scores dashboard
    Score analytics

    Regression testing

    With datasets in place, we are ready to run regression testing using the test case repository of queries and expected outputs. We have saved four queries in our Regression dataset, with a mix of technical, billing and 'Both' queries.

    For this, we can run the attached code, which fetches the relevant dataset and runs the experiment. All test runs are logged together with their average scores. We can view the results of a specific test, with Conciseness, GT Accuracy and Relevance scores for each test case, in a single dashboard. And, as needed, the detailed trace can be opened to see the reasoning behind a score.

    You can view the code here.
    from langfuse import get_client
    from langfuse.openai import OpenAI
    from langchain_openai import AzureChatOpenAI
    from langfuse import Langfuse
    import os

    # Initialize client
    from dotenv import load_dotenv
    load_dotenv(override=True)

    langfuse = Langfuse(
        host="https://cloud.langfuse.com",
        public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
        secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    )

    llm = AzureChatOpenAI(
        azure_deployment=os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME"),
        api_version=os.environ.get("AZURE_OPENAI_API_VERSION", "2025-01-01-preview"),
        temperature=0.2,
    )

    # Define the task function that is run against every dataset item
    def my_task(*, item, **kwargs):
        question = item.input["ticket"]
        response = llm.invoke([{"role": "user", "content": question}])

        raw = response.content.strip().lower()

        return raw

    # Get the dataset from Langfuse
    dataset = langfuse.get_dataset("Regression")

    # Run the experiment directly on the dataset
    result = dataset.run_experiment(
        name="Production Model Test",
        description="Monthly evaluation of our production model",
        task=my_task,  # see above for the task definition
    )

    # Use the format method to display the results
    print(result.format())
    Test runs
    Scores for a test run

    Key Takeaways

    • AI observability doesn't have to be code-heavy.
      Most evaluation, tracing, and regression testing capabilities for LLM agents can be enabled through configuration rather than custom code, significantly reducing development and maintenance effort.
    • Rich evaluation workflows can be defined declaratively.
      Capabilities such as LLM-as-a-Judge scoring (relevance, conciseness, hallucination, ground-truth accuracy), variable mapping, and evaluation prompts are configured directly in the observability platform, without writing bespoke evaluation logic.
    • Datasets and regression testing are configuration-first features.
      Test case repositories, dataset runs, and ground-truth comparisons can be set up and reused through the UI or simple configuration, allowing teams to run regression tests across agent versions with minimal additional code.
    • Full MELT observability comes "out of the box."
      Metrics (latency, token usage, cost), events (LLM and tool calls), logs, and traces are automatically captured and correlated, avoiding the need for manual instrumentation across agent workflows.
    • Minimal instrumentation, maximum visibility.
      With lightweight SDK integration, teams gain deep visibility into multi-agent execution paths, evaluation results, and performance trends, freeing developers to focus on agent logic rather than observability plumbing.

    Conclusion

    As LLM agents become more complex, observability is no longer optional. Without it, multi-agent systems quickly turn into black boxes that are difficult to evaluate, debug, and improve.

    An AI observability platform shifts this burden away from developers and application code. Using a minimal-code, configuration-first approach, teams can enable LLM-as-a-Judge evaluation, regression testing, and full MELT observability without building and maintaining custom pipelines. This not only reduces engineering effort but also accelerates the path from prototype to production.

    By adopting an open-source, framework-agnostic platform like Langfuse, teams gain a single source of truth for agent performance, making AI systems easier to trust, evolve, and operate at scale.

    Want to know more? The Customer Service agentic application presented here follows a manager-worker architecture pattern, which does not work in CrewAI. Read how observability helped me fix this well-known issue with CrewAI's manager-worker hierarchical process, by tracing agent responses at each step and refining them until the orchestration worked as it should. Full analysis here: Why CrewAI's Manager-Worker Architecture Fails — and How to Fix It

    Connect with me and share your comments at www.linkedin.com/in/partha-sarkar-lets-talk-AI

    All images and data used in this article are synthetically generated. Figures and code were created by me.


