grow more complex, conventional logging and monitoring fall short. What teams really need is observability: the ability to trace agent decisions, evaluate response quality automatically, and detect drift over time, without writing and maintaining large amounts of custom evaluation and telemetry code.
Teams therefore need to adopt the right observability platform while they focus on the core task of building and improving the agents’ orchestration, and integrate their application with that platform with minimal overhead to their functional code. In this article, I’ll demonstrate how to set up an open-source AI observability platform to do the following using a minimal-code approach:
- LLM-as-a-Judge: Configure pre-built evaluators to score responses for Correctness, Relevance, Hallucination and more. Display scores across runs with detailed logs and analytics.
- Testing at scale: Set up datasets to store regression test cases for measuring accuracy against expected ground-truth responses. Proactively detect LLM and agent drift.
- MELT data: Track metrics (latency, token usage, model drift), events (API calls, LLM calls, tool usage), logs (user interaction, tool execution, agent decision making) with detailed traces, all without writing detailed telemetry and instrumentation code.
We will be using Langfuse for observability. It is open-source and framework-agnostic, and works with popular orchestration frameworks and LLM providers.
Multi-agent application
For this demonstration, I have attached the LangGraph code of a Customer Service application. The application accepts a ticket from the user, classifies it as Technical, Billing or Both using a Triage agent, then routes it to the Technical Support agent, the Billing Support agent, or both. A Finalizer agent then synthesizes the agents’ responses into a coherent, more readable format. The flowchart is as follows:
The code is attached here:
# --------------------------------------------------
# 0. Load .env
# --------------------------------------------------
from dotenv import load_dotenv
load_dotenv(override=True)
# --------------------------------------------------
# 1. Imports
# --------------------------------------------------
import os
from typing import TypedDict
from langgraph.graph import StateGraph, END
from langchain_openai import AzureChatOpenAI
from langfuse import Langfuse
from langfuse.langchain import CallbackHandler
# --------------------------------------------------
# 2. Langfuse Client (WORKING CONFIG)
# --------------------------------------------------
langfuse = Langfuse(
    host="https://cloud.langfuse.com",
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
)
langfuse_callback = CallbackHandler()
os.environ["LANGGRAPH_TRACING"] = "false"
# --------------------------------------------------
# 3. Azure OpenAI Setup
# --------------------------------------------------
llm = AzureChatOpenAI(
    azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"],
    api_version=os.environ.get("AZURE_OPENAI_API_VERSION", "2025-01-01-preview"),
    temperature=0.2,
    callbacks=[langfuse_callback],  # 🔑 enables token usage capture
)
# --------------------------------------------------
# 4. Shared State
# --------------------------------------------------
class AgentState(TypedDict, total=False):
    ticket: str
    category: str
    technical_response: str
    billing_response: str
    final_response: str
# --------------------------------------------------
# 5. Agent Definitions
# --------------------------------------------------
def triage_agent(state: dict) -> dict:
    with langfuse.start_as_current_observation(
        as_type="span",
        name="triage_agent",
        input={"ticket": state["ticket"]},
    ) as span:
        span.update_trace(name="Customer Service Query - LangGraph Demo")
        response = llm.invoke([
            {
                "role": "system",
                "content": (
                    "Classify the query as one of: "
                    "Technical, Billing, Both. "
                    "Respond with only the label."
                ),
            },
            {"role": "user", "content": state["ticket"]},
        ])
        raw = response.content.strip().lower()
        if "both" in raw:
            category = "Both"
        elif "technical" in raw:
            category = "Technical"
        elif "billing" in raw:
            category = "Billing"
        else:
            category = "Technical"  # ✅ safe fallback
        span.update(output={"raw": raw, "category": category})
        return {"category": category}
def technical_support_agent(state: dict) -> dict:
    with langfuse.start_as_current_observation(
        as_type="span",
        name="technical_support_agent",
        input={
            "ticket": state["ticket"],
            "category": state.get("category"),
        },
    ) as span:
        response = llm.invoke([
            {
                "role": "system",
                "content": (
                    "You are a technical support specialist. "
                    "Provide a clear, step-by-step solution."
                ),
            },
            {"role": "user", "content": state["ticket"]},
        ])
        answer = response.content
        span.update(output={"technical_response": answer})
        return {"technical_response": answer}
def billing_support_agent(state: dict) -> dict:
    with langfuse.start_as_current_observation(
        as_type="span",
        name="billing_support_agent",
        input={
            "ticket": state["ticket"],
            "category": state.get("category"),
        },
    ) as span:
        response = llm.invoke([
            {
                "role": "system",
                "content": (
                    "You are a billing support specialist. "
                    "Answer clearly about payments, invoices, or accounts."
                ),
            },
            {"role": "user", "content": state["ticket"]},
        ])
        answer = response.content
        span.update(output={"billing_response": answer})
        return {"billing_response": answer}
def finalizer_agent(state: dict) -> dict:
    with langfuse.start_as_current_observation(
        as_type="span",
        name="finalizer_agent",
        input={
            "ticket": state["ticket"],
            "technical": state.get("technical_response"),
            "billing": state.get("billing_response"),
        },
    ) as span:
        # Collect whichever agent responses are present
        parts = [
            f"Technical:\n{state['technical_response']}"
            for k in ["technical_response"]
            if state.get(k)
        ] + [
            f"Billing:\n{state['billing_response']}"
            for k in ["billing_response"]
            if state.get(k)
        ]
        if not parts:
            final = "Error: No agent responses available."
        else:
            response = llm.invoke([
                {
                    "role": "system",
                    "content": (
                        "Combine the following agent responses into ONE clear, professional, "
                        "customer-facing answer. Do not mention agents or internal labels. "
                        f"Answer the user's query: '{state['ticket']}'."
                    ),
                },
                {"role": "user", "content": "\n\n".join(parts)},
            ])
            final = response.content
        span.update(output={"final_response": final})
        return {"final_response": final}
# --------------------------------------------------
# 6. LangGraph Construction
# --------------------------------------------------
builder = StateGraph(AgentState)
builder.add_node("triage", triage_agent)
builder.add_node("technical", technical_support_agent)
builder.add_node("billing", billing_support_agent)
builder.add_node("finalizer", finalizer_agent)
builder.set_entry_point("triage")
# Conditional routing
builder.add_conditional_edges(
    "triage",
    lambda state: state["category"],
    {
        "Technical": "technical",
        "Billing": "billing",
        "Both": "technical",
        "__default__": "technical",  # ✅ never dead-end
    },
)
# Sequential decision: the router must return a key present in the map,
# so anything other than "Both" is normalized to "__default__" (finalizer)
builder.add_conditional_edges(
    "technical",
    lambda state: "Both" if state["category"] == "Both" else "__default__",
    {
        "Both": "billing",  # Continue to billing if Both
        "__default__": "finalizer",
    },
)
builder.add_edge("billing", "finalizer")
builder.add_edge("finalizer", END)
graph = builder.compile()
# --------------------------------------------------
# 9. Main
# --------------------------------------------------
if __name__ == "__main__":
    print("===============================================")
    print(" Conditional Multi-Agent Support System (Ready)")
    print("===============================================")
    print("Enter 'exit' or 'quit' to stop the program.\n")
    while True:
        # Get user input for the ticket
        ticket = input("Enter your support query (ticket): ")
        # Check for exit command
        if ticket.lower() in ["exit", "quit"]:
            print("\nExiting the support system. Goodbye!")
            break
        if not ticket.strip():
            print("Please enter a non-empty query.")
            continue
        try:
            # --- Run the graph with the user's ticket ---
            result = graph.invoke(
                {"ticket": ticket},
                config={"callbacks": [langfuse_callback]},
            )
            # --- Print Results ---
            category = result.get('category', 'N/A')
            print(f"\n✅ Triage Classification: **{category}**")
            # Check which agents were executed based on the presence of a response
            executed_agents = []
            if result.get("technical_response"):
                executed_agents.append("Technical")
            if result.get("billing_response"):
                executed_agents.append("Billing")
            print(f"🛠️ Agents Executed: {', '.join(executed_agents) if executed_agents else 'None (Triage Failed)'}")
            print("\n================ FINAL RESPONSE ================\n")
            print(result["final_response"])
            print("\n" + "="*60 + "\n")
        except Exception as e:
            # This is important for debugging: print the exception type and message
            print(f"\nAn error occurred during processing ({type(e).__name__}): {e}")
            print("\nPlease try another query.")
            print("\n" + "="*60 + "\n")
Observability Configuration
To set up Langfuse, go to https://cloud.langfuse.com/ and create an account with a billing tier (a hobby tier with generous limits is available), then set up a Project. In the project settings, you can generate the public and secret keys which must be provided at the beginning of the code. You also need to add the LLM connection, which will be used for the LLM-as-a-Judge evaluation.
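The keys go into the .env file that the application loads at startup. Before running the agents, a quick sanity check like the sketch below can confirm that everything the code expects is present; the Azure OpenAI variable names listed here are assumed to follow the standard conventions read by AzureChatOpenAI:
# Minimal sketch: verify the environment variables the application expects.
# LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY come from the Langfuse project settings;
# the Azure OpenAI names below are assumed, standard conventions.
import os
from dotenv import load_dotenv

load_dotenv(override=True)

required = [
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_SECRET_KEY",
    "AZURE_OPENAI_API_KEY",          # assumed: read by AzureChatOpenAI
    "AZURE_OPENAI_ENDPOINT",         # assumed: read by AzureChatOpenAI
    "AZURE_OPENAI_DEPLOYMENT_NAME",
]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("Environment looks good - ready to run the agents.")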

LLM-as-a-Judge setup
This is the core of the performance evaluation setup for the agents. Here you can configure various pre-built evaluators from the Evaluator Library, which score responses on criteria such as Conciseness, Correctness, Hallucination, Answer Critic and so on. These should suffice for most use cases; otherwise, custom evaluators can also be set up. Here is a view of the Evaluator Library:

Select the evaluator, say Relevance, that you wish to use. You can choose to run it for new or existing traces or for dataset runs. In addition, review the evaluation prompt to make sure it satisfies your evaluation objective. Most importantly, the query, generation and other variables must be correctly mapped to their source (usually, the Input and Output of the application trace). In our case, these will be the ticket entered by the user and the response generated by the Finalizer agent, respectively. For dataset runs, you can also compare the generated responses with the ground-truth responses stored as expected outputs (explained in the next sections).
Here is the configuration for the ‘GT Accuracy’ evaluation I set up for new dataset runs, along with the variable mapping. The evaluation prompt preview is also depicted. Most of the evaluators score within a range of 0 to 1:


For the customer service demo, I have configured three evaluators: Relevance and Conciseness, which run for all new traces, and GT Accuracy, which runs for dataset runs only.

Datasets setup
Create a dataset to use as a test case repository. Here, you can store test cases with the input query and the ideal expected response. To create the dataset, there are three options: create one record at a time, upload a CSV of queries and expected responses, or, quite conveniently, add inputs and outputs directly from application traces whose responses are judged to be of good quality by human experts.
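If you prefer to seed the dataset from code instead of the UI, here is a minimal sketch. It assumes the create_dataset and create_dataset_item methods of the Langfuse Python SDK; the ticket and expected response shown are illustrative only:
# Minimal sketch: create a regression dataset and add one test case programmatically.
# The example ticket and expected output are illustrative only.
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY from the environment

langfuse.create_dataset(name="Regression")
langfuse.create_dataset_item(
    dataset_name="Regression",
    input={"ticket": "I was charged twice this month and the app crashes on login."},
    expected_output=(
        "Apology, troubleshooting steps for the login crash, and confirmation that the "
        "duplicate charge will be refunded."
    ),
)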
Here is the dataset I have created for the demo. These are a mix of technical, billing, or ‘Both’ queries, and I have created all the records from application traces:

That’s it! The configuration is done and we are ready to run observability.
Observability Results
The Langfuse Home page is a dashboard of several useful charts. It shows the count of execution traces, scores and averages at a glance, traces over time, model usage and cost, and so on.

MELT data
The most useful observability data is available under the ‘Tracing’ option, which displays summarized and detailed views of all executions. Here is a view of the dashboard depicting the time, name, input, output and the essential latency and token usage metrics. Note that for every agent execution of our application, two evaluation traces are generated for the Conciseness and Relevance evaluators we set up.


Let’s look at the details of one of the executions of the Customer Service application. On the left panel, the agent flow is depicted both as a tree and as a flowchart. It shows the LangGraph nodes (agents) and the LLM calls along with the token usage. If our agents had tool calls or human-in-the-loop steps, they would have been depicted here as well. Note that the evaluation scores for Conciseness and Relevance are also shown at the top, which are 0.40 and 1 respectively for this run. Clicking on them reveals the reasoning for the score and a link that takes us to the evaluator trace.
On the right, for each agent, LLM and tool call, we can see the input and the generated output. For instance, here we see that the query was classified as ‘Both’, and accordingly the left chart shows that both the technical and billing support agents were called, which confirms our flow is working as expected.

At the top of the right-hand panel, there is the ‘Add to datasets’ button. At any step of the tree, clicking this button opens a panel like the one depicted below, where you can add the input and output of that step directly to a test dataset created in the previous section. This is a useful feature for human experts to add frequently occurring user queries and good responses to the dataset during normal agent operations, thereby building a regression test repository with minimal effort. Later, when there is a major upgrade or release of the application, the regression dataset can be run and the generated outputs scored against the expected outputs (ground truth) recorded here, using the ‘GT Accuracy’ evaluator we created during the LLM-as-a-Judge setup. This helps to detect LLM drift (or agent drift) early and take corrective steps.

Here is one of the evaluation traces (Conciseness) for this application trace. The evaluator gives the reasoning for the score of 0.4 it assigned to this response.

Scores
The Scores option in Langfuse provides a list of all the evaluation runs from the various active evaluators along with their scores. More pertinent is the Analytics dashboard, where two scores can be selected and metrics such as mean and standard deviation, along with sample traces, can be viewed.


Regression testing
With datasets in place, we are ready to run regression testing using the test case repository of queries and expected outputs. We have stored four queries in our Regression dataset, with a mix of technical, billing and ‘Both’ queries.
For this, we can run the attached code, which fetches the relevant dataset and runs the experiment. All test runs are logged along with their average scores. We can view the result of a specific test, with Conciseness, GT Accuracy and Relevance scores for each test case, in a single dashboard. And as needed, the detailed trace can be accessed to see the reasoning behind a score.
You can view the code here:
from langfuse import get_client
from langfuse.openai import OpenAI
from langchain_openai import AzureChatOpenAI
from langfuse import Langfuse
import os
# Initialize client
from dotenv import load_dotenv
load_dotenv(override=True)
langfuse = Langfuse(
    host="https://cloud.langfuse.com",
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
)
llm = AzureChatOpenAI(
    azure_deployment=os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME"),
    api_version=os.environ.get("AZURE_OPENAI_API_VERSION", "2025-01-01-preview"),
    temperature=0.2,
)
# Define your task function: it receives a dataset item and returns the generated output
def my_task(*, item, **kwargs):
    question = item.input['ticket']
    response = llm.invoke([{"role": "user", "content": question}])
    raw = response.content.strip().lower()
    return raw
# Get the dataset from Langfuse
dataset = langfuse.get_dataset("Regression")
# Run the experiment directly on the dataset
result = dataset.run_experiment(
    name="Production Model Test",
    description="Monthly evaluation of our production model",
    task=my_task  # see above for the task definition
)
# Use the format method to display results
print(result.format())
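Note that my_task above sends the ticket straight to the model. To have the regression run exercise the complete multi-agent flow instead, the task can invoke the compiled graph from the application code and return the Finalizer’s answer. Here is a minimal sketch, assuming the compiled graph and the Langfuse callback are importable from a (hypothetical) customer_service_app module:
# Minimal sketch: run the whole LangGraph application for each dataset item,
# so the GT Accuracy evaluator scores the finalizer's answer rather than a bare LLM call.
# Assumes the compiled `graph` and `langfuse_callback` can be imported from a
# (hypothetical) customer_service_app module.
from customer_service_app import graph, langfuse_callback

def graph_task(*, item, **kwargs):
    result = graph.invoke(
        {"ticket": item.input["ticket"]},
        config={"callbacks": [langfuse_callback]},
    )
    return result["final_response"]

# Reuse the same dataset; only the task changes.
result = dataset.run_experiment(
    name="Full-Graph Regression",
    description="Regression run through the complete multi-agent flow",
    task=graph_task,
)
print(result.format())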


Key Takeaways
- AI observability doesn’t have to be code-heavy. Most evaluation, tracing, and regression testing capabilities for LLM agents can be enabled through configuration rather than custom code, significantly reducing development and maintenance effort.
- Rich evaluation workflows can be defined declaratively. Capabilities such as LLM-as-a-Judge scoring (relevance, conciseness, hallucination, ground-truth accuracy), variable mapping, and evaluation prompts are configured directly in the observability platform, without writing bespoke evaluation logic.
- Datasets and regression testing are configuration-first features. Test case repositories, dataset runs, and ground-truth comparisons can be set up and reused through the UI or simple configuration, allowing teams to run regression tests across agent versions with minimal additional code.
- Full MELT observability comes “out of the box.” Metrics (latency, token usage, cost), events (LLM and tool calls), logs, and traces are automatically captured and correlated, avoiding the need for manual instrumentation across agent workflows.
- Minimal instrumentation, maximum visibility. With lightweight SDK integration, teams gain deep visibility into multi-agent execution paths, evaluation results, and performance trends, freeing developers to focus on agent logic rather than observability plumbing.
Conclusion
As LLM agents become more complex, observability is no longer optional. Without it, multi-agent systems quickly turn into black boxes that are difficult to evaluate, debug, and improve.
An AI observability platform shifts this burden away from developers and application code. Using a minimal-code, configuration-first approach, teams can enable LLM-as-a-Judge evaluation, regression testing, and full MELT observability without building and maintaining custom pipelines. This not only reduces engineering effort but also accelerates the path from prototype to production.
By adopting an open-source, framework-agnostic platform like Langfuse, teams gain a single source of truth for agent performance, making AI systems easier to trust, evolve, and operate at scale.
Want to know more? The Customer Service agentic application presented here follows a manager-worker architecture pattern, which does not work in CrewAI. Read how observability helped me fix this well-known issue with CrewAI’s manager-worker hierarchical process, by tracing agent responses at each step and refining them to get the orchestration to work as it should. Full analysis here: Why CrewAI’s Manager-Worker Architecture Fails — and How to Fix It
Connect with me and share your comments at www.linkedin.com/in/partha-sarkar-lets-talk-AI
All images and data used in this article are synthetically generated. Figures and code were created by me.
