    Let AI Tune Your Voice Assistant

July 14, 2025


Before we get started: the world of voice AI has quite a few overlapping terms. To make sure we're all on the same page, let's quickly go over the main terms and how I will be using them in this article:

• Voice assistant: The application or "character" the user speaks to. This is the entire system from the user's perspective.
• Live API: The technical "gateway" that connects the user to the model. It handles the real-time, bidirectional streaming of audio and data.
• AI Model: The "brain" behind the agent. This is the Large Language Model (LLM) that understands intent and decides which action to take.

Image by author

With that cleared up, let's dive in 😃

What is this about?

In the past few months I've seen a surge of interest in voice assistants. Not only from the clients I work with, but the industry as a whole: Google DeepMind demonstrated Project Astra at Google I/O, OpenAI launched GPT-4o with advanced voice capability a while back, and recently ElevenLabs launched a similar service with 11ai.

Voice assistants are becoming increasingly common, allowing us to perform actions in the world just by speaking to them. They fill a gap that so many first-generation voice assistants like Siri and Alexa have left wide open: they have a much better understanding of natural language, can infer our intent far more reliably, and have contextual memory. In short, they are simply much easier to talk to.

The core mechanism that lets them perform actions and makes them truly useful is function calling – the ability to use tools like a calendar or weather service. However, the assistant's effectiveness depends entirely on how we instruct its underlying AI model on when to use which tool. This is where the system prompt becomes critical.

In this tutorial, we'll explore how we can leverage Automated Prompt Engineering (APE) to improve an agent's function-calling capabilities by automatically refining this system prompt. The tutorial is split into two parts.

First, we'll build a robust test suite for our voice assistant. This involves taking a user query, using an LLM to generate multiple semantic variations, and finally converting these text queries into a diverse set of audio files. These audio files will be used to interact with the Live API.

In the second part, we'll use APE to iteratively improve the agent's performance. We'll begin with an initial system prompt and evaluate it against our audio test suite by observing which function the agent calls for each audio file. We then compare these responses to the ground truth—the expected behavior for that query—to calculate an overall accuracy score. This score, together with the prompt that produced it, is sent to an "optimiser" LLM. The optimiser then crafts a new, improved system prompt based on the performance of all previous attempts, and the process starts again.

At the end of this process, we'll (hopefully) have a new system prompt that instructs the AI model much more effectively on when to use each function.
As always, all of the code is freely available in a GitHub repo: https://github.com/heiko-hotz/voice-assistant-prompt-optimization/
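To make the workflow concrete, here is a rough pseudocode sketch of the loop we're about to build. Every helper here is a placeholder for a component described later in the article, not the repo's actual API:

# High-level sketch of the APE loop (placeholder helpers, not the repo's real functions)
def run_ape(starting_prompt: str, num_iterations: int = 10) -> str:
    history = []                                   # optimisation trajectory: (prompt, score) pairs
    best_prompt = starting_prompt
    best_score = evaluate(starting_prompt)         # accuracy on the audio test suite (baseline)
    history.append((best_prompt, best_score))

    for _ in range(num_iterations):
        meta_prompt = build_meta_prompt(history)   # task description + all previous prompts and scores
        candidate = optimiser_llm(meta_prompt)     # "optimiser" LLM proposes an improved system prompt
        score = evaluate(candidate)
        history.append((candidate, score))
        if score > best_score:
            best_prompt, best_score = candidate, score

    return best_prompt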

Why should we care?

As we enter the age of voice assistants powered by LLMs, it's crucial to make sure these agents actually behave the way we want them to. Imagine asking an agent to check our calendar, only for it to call a weather API and tell us the forecast. It's an extreme example, but hopefully it brings the point home.

This was already a headache with chatbots, but with voice assistants, things get much more complicated. Audio is inherently messier than a clean, written query. Think about all the ways a user can prompt the underlying AI model—with different accents or dialects, talking fast or slow, throwing in fillers like 'uhm' and 'ah', or with a loud coffee shop in the background.

Image by author – created with ChatGPT

And this extra dimension causes a real problem. When I work with organisations I often see them struggle with this added complexity, and they frequently revert to the one method they feel they can trust: manual testing. This means teams of people sitting in a room, reading from scripts to simulate real-world scenarios. It's not only incredibly time-consuming and expensive, it's also not very effective.

This is where automation becomes essential. If we want our agents to have even the slightest chance of getting complex tasks right, we have to get the basics right, and we have to do it systematically. This blog post is all about an approach that automates the entire evaluation and optimization pipeline for voice assistants—a method designed to save development time, cut testing costs, and build a more reliable voice assistant that users will actually trust and keep using.


Quick recap: The principles of Automated Prompt Engineering (APE)

Luckily, I've already written about Automated Prompt Engineering in the past, so I can shamelessly refer back to my older blog post 😏

We will use the exact same principle of OPRO (Optimisation by PROmpting) in this project. But to quickly recap:

It's a bit like hyperparameter optimisation (HPO) in the good old days of supervised machine learning: manually trying out different learning rates and batch sizes was suboptimal and simply not practical. The same is true for manual prompt engineering. The challenge, however, is that a prompt is text-based and therefore its optimisation space is huge (just imagine how many different ways there are to rephrase a single prompt). In contrast, traditional ML hyperparameters are numerical, making it easy to select values for them programmatically.

So, how do we automate the generation of text prompts? What if we had a tool that never gets tired, capable of producing countless prompts in different styles while continuously iterating on them? We would need a tool proficient in language understanding and generation – and what tool truly excels at language? That's right, a Large Language Model (LLM) 😃

But we don't just want it to try out different prompts randomly; we want it to learn from previous iterations. That is at the heart of the OPRO method: if random prompt generation is analogous to random search in HPO, OPRO is analogous to Bayesian search. It doesn't just guess randomly; it actively tries to hill-climb towards the evaluation metric by learning from past results.

Image by author

The key to OPRO is the meta-prompt (number 8 in the diagram above), which is used to guide the "optimiser" LLM. This meta-prompt includes not only the task description but also the optimisation trajectory—a history of all the previous prompts and their performance scores. With this information, the optimiser LLM can analyse patterns, identify the elements of successful prompts, and avoid the pitfalls of unsuccessful ones. This learning process allows the optimiser to generate increasingly effective prompts over time, iteratively improving the target LLM's performance.

Our project structure

Before we dive deeper into the entire process, it's worth taking a quick look at the project structure to get a good overview:

voice-assistant-prompt-optimization/
├── 01_prepare_test_suite.py     # Step 1: Generate test cases and audio
├── 02_run_optimization.py       # Step 2: Run prompt optimization
├── initial-system-instruction.txt  # Comprehensive starting prompt
├── optimization.log             # Detailed optimization logs (auto-generated)
├── test_preparation.log         # Test suite preparation logs (auto-generated)
├── audio_test_suite/           # Generated audio files and mappings
├── configs/
│   ├── input_queries.json      # Base queries for test generation
│   └── model_configs.py        # AI model configurations
├── data_generation/
│   ├── audio_generator.py      # Text-to-speech generation
│   ├── query_restater.py       # Query variation generation
│   └── output_queries.json     # Generated query variations (auto-generated)
├── evaluation/
│   └── audio_fc_evaluator.py   # Function call evaluation system
├── optimization/
│   ├── metaprompt_template.txt # Template for prompt optimization
│   └── prompt_optimiser.py     # Core optimization engine
├── runs/                       # Optimization results (auto-generated)
└── requirements.txt            # Python dependencies

Let's get started and walk through the individual components in detail.

The Starting Point: Defining Our Test Cases

Before we can start optimizing, we first need to define what "good" looks like. The entire process starts with creating our "exam paper" and its corresponding answer key. We do this in a single configuration file: configs/input_queries.json.

Inside this file, we define a list of test scenarios. For each scenario, we provide two key pieces of information: the user's initial query and the expected outcome—the ground truth. This can be a function call with its name and corresponding parameters, or no function call at all.

Let's take a look at the structure for a couple of examples:

    {
      "queries": [
        {
            "query": "What's the weather like today?",
            "trigger_function": true,
            "function_name": "get_information",
            "function_args": {
              "query": "What's the weather like today?"
            }
        },
        {
          "query": "I need to speak to a human please",
          "trigger_function": true,
          "function_name": "escalate_to_support",
          "function_args": {
            "reason": "human-request"
          }
        },
        {
            "query": "Thanks, that's all I needed",
            "trigger_function": false
        }
      ]
    }

As we can see, each entry specifies the query, whether a function should be triggered, and the expected function_name and function_args. The evaluator will later use this ground truth to grade the assistant's performance.

The quality of these "seed" queries is critical for the whole optimization process. Here are a few things to keep in mind:

We Need to Cover All Our Bases

It's easy to only test the obvious ways a user might talk to our agent. But a good test suite needs to cover everything. This means we should include queries that:

• Trigger every single function the agent can use.
• Trigger every possible argument or reason for a function (e.g., we should test both the human-request and vulnerable-user reasons for the escalate_to_support function).
• Trigger no function at all. Cases like "Thanks, that's all" are really important. They teach the model when not to do something, so it doesn't make annoying or wrong function calls when it shouldn't.

We Should Include Ambiguity and Edge Cases

This is where things get interesting, and where most models fail. Our starting queries need to include some of the odd, unclear phrasings that people actually use. For example:

• Direct vs. Indirect: We should have a direct command like "I need to speak to a human" right next to something indirect like "Can I talk to someone?". At first, the model will probably only get the direct one right. The APE process will teach it that both mean the same thing.
• Subtle Nuance: For the vulnerable-user case, a query like "I'm feeling really overwhelmed" is a much harder test than something obvious. It forces the model to pick up on emotion, not just look for keywords.

By putting these hard cases in our starting set, we're telling the APE system, "Hey, focus on fixing these." The workflow will then keep trying until it finds a prompt that can actually deal with them.

Part 1: Building the Test Suite

Alright, let's pop the hood and look at how we generate the test suite. The main script for this part is 01_prepare_test_suite.py, which builds our "exam." It's a two-step process: first, we generate multiple text variations of the initial user queries we provided, and then we turn them into realistic audio files.

Step 1: Rephrasing Queries with an LLM

Everything kicks off by reading our "seed" queries from `input_queries.json`, which we saw above. We start with about 10 of these, covering all the different functions and scenarios we care about. But as discussed, we don't want to test only these 10 examples; we want to create many different variations to make sure the voice assistant gets it right no matter how the user asks for a specific action.

So, for each of these 10 queries, we ask a "rephraser" LLM to come up with 5 different ways of saying the same thing (this number is configurable). We don't just want 5 boring copies; we want variety. The system prompt we use to guide the LLM for this step is fairly simple but effective:

Please restate the following user query for a financial voice assistant in {NUM_RESTATEMENTS} different ways.
    The goal is to create a diverse set of test cases.

    Guidelines:
    - The core intent must remain identical.
    - Use a mix of tones: direct, casual, polite, and formal.
    - Vary the sentence structure and vocabulary.
    - Return ONLY the restatements, each on a new line. Do not include numbering, bullets, or any other text.

    Original Query: "{query}"
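As a rough sketch of how this rephrasing step could be wired up (hypothetical helper, assuming the google-genai SDK with an API key in the environment; the repo's actual logic lives in data_generation/query_restater.py and configs/model_configs.py):

# Minimal sketch of the rephrasing step; function and model names are assumptions.
from google import genai

NUM_RESTATEMENTS = 5
RESTATE_PROMPT = """Please restate the following user query for a financial voice assistant in {num} different ways.
The goal is to create a diverse set of test cases.

Guidelines:
- The core intent must remain identical.
- Use a mix of tones: direct, casual, polite, and formal.
- Vary the sentence structure and vocabulary.
- Return ONLY the restatements, each on a new line. Do not include numbering, bullets, or any other text.

Original Query: "{query}"
"""

def restate_query(client: genai.Client, query: str) -> list[str]:
    """Ask the rephraser LLM for semantic variations of a seed query."""
    prompt = RESTATE_PROMPT.format(num=NUM_RESTATEMENTS, query=query)
    response = client.models.generate_content(
        model="gemini-2.0-flash",  # assumed model name
        contents=prompt,
    )
    # One restatement per line, as requested in the prompt
    return [line.strip() for line in response.text.splitlines() if line.strip()]

# e.g. restate_query(genai.Client(), "I need to speak to a human please")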

This whole process is kicked off by our 01_prepare_test_suite.py script. It reads the input_queries.json file, runs the rephrasing, and generates an intermediate file called output_queries.json that looks something like this:

    {
      "queries": [
        {
          "original_query": "I need to speak to a human please",
          "trigger_function": true,
          "restatements": [
            "Get me a human.",
            "Could I please speak with a human representative?",
            "Can I get a real person on the line?",
            "I require assistance from a live agent.",
            "Please connect me with a human."
          ],
          "function_name": "escalate_to_support",
          "function_args": { "motive": "human-request" }
        },
        {
          "original_query": "Thanks, that is all I wanted",
          "trigger_function": false,
          "restatements": [
            "Thank you, I have everything I need.",
            "Yep, thanks, I'm good.",
            "I appreciate your assistance; that's all for now.",
            "My gratitude, the provided information is sufficient.",
            "Thank you for your help, I am all set."
          ]
        }
      ]
    }

Notice how each original_query now has a list of restatements. This is great because it gives us a much wider set of test cases. We're not just testing one way of asking for a human; we're testing six (the original query and 5 variations), from the very direct "Get me a human" to the more polite "Could I please speak with a human representative?".

Now that we have all these text variations, we're ready for the next step: turning them into actual audio to create our test suite.

Step 2: Creating the audio files

So, we've got a bunch of text. But that's not enough. We're building a voice assistant, so we need actual audio. This next step is probably the most important part of the whole setup, because it's what makes our tests realistic.

This is all still handled by the 01_prepare_test_suite.py script. It takes the output_queries.json file we just made and feeds every single line—the original queries and all their restatements—into a Text-to-Speech (TTS) service.

To get the best and most lifelike voices available, we'll use Google's new Chirp 3 HD voices. They are basically the latest generation of Text-to-Speech, powered by LLMs themselves, and they sound remarkably natural. And we don't just convert the text to audio with one standard voice. Instead, we use a whole list of these HD voices with different genders, accents, and dialects—US English, UK English, Australian, Indian, and so on. We do this because real users don't all sound the same, and we want to make sure our agent understands a request for help whether it's spoken with a British accent or an American one. Here is the list of voices we use:

    VOICE_CONFIGS = [
        # US English voices
        {"name": "en-US-Chirp3-HD-Charon", "dialect": "en-US"},
        {"name": "en-US-Chirp3-HD-Kore", "dialect": "en-US"},
        {"name": "en-US-Chirp3-HD-Leda", "dialect": "en-US"},
        
        # UK English voices
        {"name": "en-GB-Chirp3-HD-Puck", "dialect": "en-GB"},
        {"name": "en-GB-Chirp3-HD-Aoede", "dialect": "en-GB"},
        
        # Australian English voices
        {"name": "en-AU-Chirp3-HD-Zephyr", "dialect": "en-AU"},
        {"name": "en-AU-Chirp3-HD-Fenrir", "dialect": "en-AU"},
        
        # Indian English voices
        {"name": "en-IN-Chirp3-HD-Orus", "dialect": "en-IN"},
        {"name": "en-IN-Chirp3-HD-Gacrux", "dialect": "en-IN"}
    ]
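For reference, here is a sketch of what the underlying TTS call might look like (assuming the google-cloud-texttospeech client library; the repo's version lives in data_generation/audio_generator.py):

# Sketch of synthesizing one test query with a Chirp 3 HD voice (helper name assumed)
from google.cloud import texttospeech

def synthesize(text: str, voice_name: str, dialect: str, out_path: str) -> None:
    """Render a single test query to a 16-bit PCM WAV file with the given HD voice."""
    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=text),
        voice=texttospeech.VoiceSelectionParams(language_code=dialect, name=voice_name),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.LINEAR16
        ),
    )
    with open(out_path, "wb") as f:
        f.write(response.audio_content)

# e.g. synthesize("Can I talk to someone?", "en-GB-Chirp3-HD-Puck", "en-GB", "restatement_02_en-GB.wav")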
Side note:
When I was developing this project, I hit a really annoying snag. Once I had generated a WAV file I would send the audio to the Live API, and… nothing. It would just hang, failing silently. It turns out the generated audio files ended too abruptly. The API's Voice Activity Detection (VAD) didn't get enough time to recognise that the user (our audio file) had finished speaking. It was simply waiting for more audio that never came.

So I developed a workaround: I programmatically appended one second of silence to the end of every single audio file. That little pause gives the API the signal it needs to realise it's its turn to respond.
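A minimal sketch of that silence-padding workaround, using only the Python standard library (the exact helper name in the repo may differ):

import wave

def append_silence(path: str, seconds: float = 1.0) -> None:
    """Append trailing silence to a PCM WAV file so the Live API's VAD detects end of speech."""
    with wave.open(path, "rb") as wav_in:
        params = wav_in.getparams()
        frames = wav_in.readframes(wav_in.getnframes())

    # Silence = zero-valued samples for every channel
    n_silent_bytes = int(params.framerate * seconds) * params.sampwidth * params.nchannels
    silence = b"\x00" * n_silent_bytes

    with wave.open(path, "wb") as wav_out:
        wav_out.setparams(params)
        wav_out.writeframes(frames + silence)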

After the script runs, we end up with a new folder called audio_test_suite/. Inside, it's full of .wav files, with names like restatement_02_en-GB_… .wav. We also need to link these audio files back to the original statements and, more importantly, to the ground truth. To that end we create an audio mapping file, `audio_test_suite/audio_mapping.json`. It maps every single audio file path to its ground truth—the function call we expect the agent to make when it hears that audio.

    {
      "audio_mappings": [
        {
          "original_query": "What's the weather like today?",
          "audio_files": {
            "original": {
              "path": "audio_test_suite/query_01/original_en-IN_Orus.wav",
              "voice": "en-IN-Chirp3-HD-Orus",
              "expected_function": {
                "name": "get_information",
                "args": {
                  "query": "What's the weather like today?"
                }
              }
            },
    ...

    With our audio test suite and its mapping file in hand, our exam is finally ready. Now, we can move on to the interesting part: running the optimization loop and seeing how our agent actually performs.

    Part 2: Running the Optimization Loop

    Alright, this is the main event. With our audio test suite ready, it’s time to run the optimization. Our 02_run_optimization.py script orchestrates a loop with three key players: an initial prompt to get us started, an Evaluator to grade its performance, and an optimiser to suggest improvements based on those grades. Let’s break down each one.

    The Starting Point: A rather naive Prompt

    Every optimization run has to start somewhere. We begin with a simple, human-written starting_prompt. We define this directly in the 02_run_optimization.py script. It’s intentionally basic because we want to see a clear improvement.

    Here’s an example of what our starting prompt might look like:

    You are a helpful AI voice assistant.
    Your goal is to help users by answering questions and performing actions through function calls.
    
    
    # User Context
    - User's preferred language: en
    - Interaction mode: voice
    
    
    # Responsibilities
    Your main job is to understand the user's intent and route their request to the correct function.
    - For general questions about topics, information requests, or knowledge queries, use the `get_information` function.
    - If the user explicitly asks to speak to a human, get help from a person, or requests human assistance, use the `escalate_to_support` function with the reason 'human-request'.
    - If the user sounds distressed, anxious, mentions feeling overwhelmed, or describes a difficult situation, use the `escalate_to_support` function with the reason 'vulnerable-user'.

    This prompt looks reasonable, but it’s very literal. It probably won’t handle indirect or nuanced questions well, which is exactly what we want our APE process to fix.

    The Evaluator: Grading the Test

    The first thing our script does is run a baseline test. It takes this starting_prompt and evaluates it against our entire audio test suite. This is handled by our AudioFunctionCallEvaluator.

    The evaluator’s job is simple but critical:

    1. It takes the system prompt.
    2. It loops through every single audio file in our audio_test_suite/.
    3. For each audio file, it calls the live API with the given system prompt.
    4. It checks the function call the API made and compares it to the ground truth from our audio_mapping.json.
    5. It counts up the passes and fails and produces an overall accuracy score.

    This score from the first run is our baseline. It lets us know where we stand, and we have the first data point for our optimization history.
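As a minimal sketch of that baseline pass (call_live_api is a hypothetical stand-in for the real AudioFunctionCallEvaluator plumbing, which streams one WAV file to the Live API with the given system prompt and returns the function call the model made, or None):

import json
from typing import Callable, Optional

def evaluate_prompt(
    system_prompt: str,
    call_live_api: Callable[[str, str], Optional[dict]],  # (system_prompt, wav_path) -> call or None
    mapping_path: str = "audio_test_suite/audio_mapping.json",
) -> float:
    """Run every audio file against the Live API and return overall accuracy."""
    with open(mapping_path) as f:
        mappings = json.load(f)["audio_mappings"]

    passed, total = 0, 0
    for entry in mappings:
        for variant in entry["audio_files"].values():
            expected = variant.get("expected_function")   # None => no function call expected
            actual = call_live_api(system_prompt, variant["path"])
            total += 1
            if expected is None:
                passed += actual is None
            else:
                passed += (
                    actual is not None
                    and actual["name"] == expected["name"]
                    and actual.get("args", {}) == expected.get("args", {})
                )
    return passed / total  # e.g. 0.68 for a 68% baseline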

    Our evaluation/audio_fc_evaluator.py is the engine that actually “grades” each prompt. When we tell it to evaluate a prompt, it doesn’t just do a simple check.

First, it needs to know which tools the agent can use at all. These are defined as a strict schema right in the evaluator code. This is exactly how the AI model understands its capabilities:

    # From evaluation/audio_fc_evaluator.py
    GET_INFORMATION_SCHEMA = {
        "name": "get_information",
        "description": "Retrieves information or answers general questions...",
        "parameters": {"type": "OBJECT", "properties": {"query": {"type": "STRING", ...}}}
    }
    ESCALATE_TO_SUPPORT_SCHEMA = {
        "name": "escalate_to_support",
        "description": "Escalates the conversation to human support...",
        "parameters": {"type": "OBJECT", "properties": {"reason": {"type": "STRING", ...}}}
    }
    TOOL_SCHEMAS = [GET_INFORMATION_SCHEMA, ESCALATE_TO_SUPPORT_SCHEMA]

The actual implementation of these tools is irrelevant here (in our code they can simply be dummy functions) – the important part is that the AI model selects the correct tool!

Then, it runs all our audio tests against the Live API. For each test, the comparison logic is quite nuanced. It doesn't just check for a correct function call; it checks for specific failure types:

• PASS: The model did exactly what was expected.
• FAIL (Wrong Function): It was supposed to call get_information but called escalate_to_support instead.
• FAIL (Missed Call): It was supposed to call a function but made no call at all.
• FAIL (False Positive): It was supposed to stay quiet (as for "thanks, that's all") but called a function anyway.

This detailed feedback is crucial. It's what gives the optimiser the rich information it needs to actually learn.
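The classification itself can be sketched like this (names assumed; an argument mismatch is treated as a wrong call, and the real logic lives in evaluation/audio_fc_evaluator.py):

from typing import Optional

def classify_result(expected: Optional[dict], actual: Optional[dict]) -> str:
    """Compare the expected function call (ground truth) with the call the model actually made."""
    if expected is None:
        return "PASS" if actual is None else "FAIL_FALSE_POSITIVE"
    if actual is None:
        return "FAIL_MISSED_CALL"
    if actual["name"] != expected["name"] or actual.get("args", {}) != expected.get("args", {}):
        return "FAIL_WRONG_FUNCTION"
    return "PASS"

# e.g. classify_result(
#     {"name": "escalate_to_support", "args": {"reason": "human-request"}},
#     {"name": "get_information", "args": {"query": "talk to someone"}},
# ) -> "FAIL_WRONG_FUNCTION"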

The Optimiser: Learning from Mistakes

This is the heart of the OPRO method. Our script takes the result from the evaluator—the prompt, its initial score, and a detailed breakdown of which queries failed—and uses it to build a meta-prompt. This is the lesson plan we send to our optimiser LLM.

The meta-prompt is structured to give the optimiser maximum context. It looks something like this:

You are an expert in prompt engineering for voice AI. Your task is to write a new, improved system prompt that fixes the weaknesses you see below.

## PROMPT_HISTORY_WITH_DETAILED_ANALYSIS
<PROMPT>
<PROMPT_TEXT>
You are a helpful voice assistant...
</PROMPT_TEXT>
<OVERALL_ACCURACY>
68%
</OVERALL_ACCURACY>
<QUERY_PERFORMANCE>
"I need to speak to a human please": 6/6 (100%)
"Can I talk to someone?": 1/6 (17%) - CRITICAL
"I'm feeling really overwhelmed...": 2/6 (33%) - CRITICAL
</QUERY_PERFORMANCE>
<CRITICAL_FAILURES>
"Can I talk to someone?" → Expected: escalate_to_support, Got: get_information
</CRITICAL_FAILURES>
</PROMPT>

## INSTRUCTIONS
...Write a new prompt that will fix the CRITICAL issues...

This is incredibly powerful. The optimiser LLM doesn't just see a score. It sees that the prompt works fine for direct requests but fails critically on indirect ones. It can then reason about why it's failing and generate a new prompt specifically designed to fix that problem.

This brings us to `optimization/prompt_optimiser.py`. Its job is to take all that rich feedback and turn it into a better prompt. The secret sauce is the meta-prompt, which is built from a template file: optimization/metaprompt_template.txt. We have already seen what the meta-prompt looks like in the previous section.

The optimiser script uses helper functions like _calculate_query_breakdown() and _extract_failing_examples() to create a detailed report for the {prompt_scores} section. It then feeds this complete, detailed meta-prompt to the "optimiser" LLM. The optimiser model writes a new prompt, which the script extracts with a simple regular expression that finds the text inside the [[…]] brackets.
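That extraction step might look roughly like this (the [[...]] convention comes from the repo; the helper name is assumed):

import re

def extract_new_prompt(optimiser_response: str) -> str | None:
    """Return the candidate prompt the optimiser wrapped in [[...]], or None if absent."""
    match = re.search(r"\[\[(.*?)\]\]", optimiser_response, re.DOTALL)
    return match.group(1).strip() if match else None

# e.g. extract_new_prompt("Reasoning... [[You are a helpful AI voice assistant...]]")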

Logging, repeating, and the final result

All of this hard work is meticulously logged. Every run creates a timestamped folder inside runs/ containing:

• iteration_0/, iteration_1/, and so on, with the exact prompt used, its score, and a detailed JSON of the evaluation.
• best_prompt.txt: The best-scoring prompt found during the run.
• prompt_history.txt: A log of every prompt tried and its performance breakdown.
• score_history_summary.txt: A neat summary of how the score improved over time.

So, when the loop is finished, you don't just get one good prompt. You get a complete audit trail of how the system "thought" its way to a better solution.

After the loop finishes, we're left with our prize: the best-performing prompt. When I first ran this, it was genuinely fascinating to see what the optimiser came up with. The initial prompt was very rigid, but the final, optimised prompt was much more nuanced.

In the run folder we can see how the prompt improved the model's performance over time:

Image by author

And we can also see how each query group improved in each iteration:

Image by author

Finally, we can see how the prompt evolved from a rather simple prompt into something far more sophisticated:

# Identity
You are a helpful AI voice assistant.
Your goal is to help users by answering questions and performing actions through function calls.
...

# Function Selection Logic
Your primary responsibility is to accurately understand the user's request and select the appropriate function. Your decision-making process is a strict hierarchy. The most important distinction is whether the user is expressing an emotional state of distress versus requesting functional help with a task or topic.

**STEP 1: Check for Escalation Triggers (`escalate_to_support`).**
This is your first and highest priority.

*   **Reason 1: `human-request`**
    *   **Condition:** Use this ONLY when the user explicitly asks to speak to... a person, human, or agent.
    *   **Examples:** "I need to speak to a human," "Can I talk to someone?"

*   **Reason 2: `vulnerable-user`**
    *   **Condition:** Use this when the user's primary intent is to express a state of emotional distress, confusion, or helplessness. Focus on their *state of being*, even if they mention a topic.
    *   **Triggers for `vulnerable-user` include:**
        1.  **Direct Emotional Expressions:** The user states they feel overwhelmed, stressed, anxious...
        2.  **Indirect Distress or Helplessness:** The user makes a general, non-specific request for help, or expresses being lost or clueless... **This applies even if a topic is mentioned.**
            *   Examples: "I'm having a really hard time and could use some help," ... "My financial situation is a huge headache, and I'm totally clueless about what to do."

**STEP 2: If NO Escalation Triggers are Met, Default to `get_information`.**
If the request is not an unambiguous escalation, it is an information request.
*   **Condition:** Use this for ANY and ALL user requests for information, explanations, "how-to" guides, or task-based assistance on a specific subject.
*   **Examples:** "What's the weather like today?", "How do I cook pasta?" ...

## Critical Disambiguation Rules
To ensure accuracy, follow these strict distinctions:

*   **"Help" Requests:**
    *   **Vague Plea = Escalate:** "I require immediate assistance." -> `escalate_to_support(reason='vulnerable-user')`
    *   **Specific Task = Information:** "I need help with my academic work." -> `get_information(query='help with academic work')`

*   **Topic-Related Requests:**
    *   **Distress ABOUT a Topic = Escalate:** "My finances feel completely unmanageable right now." -> `escalate_to_support(reason='vulnerable-user')`
    *   **Question ABOUT a Topic = Information:** "Can you tell me about managing finances?" -> `get_information(query='how to manage finances')`

... [ Final Rule, Greeting, Language, and General Behavior sections ] ...

And, as always with automated prompt engineering, I find it fascinating to see the optimiser's analysis and reasoning:

### Analysis of Failures and Strategy for Improvement

1. Core Problem Identified: The optimiser first pinpointed the main weakness: the model struggles when a user expresses distress about a specific topic (e.g., "I'm overwhelmed by my finances"). It was incorrectly keying on the "topic" and ignoring the user's emotional state.
2. Analysis of Past Failures: It then reviewed earlier attempts, recognising that while a simple, strict hierarchy was a good start (as in Prompt 1), adding a rule about topics that was too broad was a "fatal flaw" (as in Prompt 2), and abandoning the hierarchy altogether was a disaster.
3. Strategic Plan for the New Prompt: Based on this, it devised a new strategy:
Shift from Keywords to Intent: The core change was to stop looking for mere keywords ("stressed") or topics ("finances") and instead focus on intent detection. The key question became: "Is the user expressing an emotional state of being, or are they asking for functional task/information assistance?"
Add "Critical Disambiguation" Rules: To make this new logic explicit, the optimiser planned to add a sharp new section with direct comparisons to resolve ambiguity. The two most important contrasts it decided to add were:
Vague Plea vs. Specific Task: Differentiating "I need help" (escalate) from "I need help with my homework" (get information).
Distress ABOUT a Topic vs. Question ABOUT a Topic: This was the crucial fix, contrasting "I'm overwhelmed by my finances" (escalate) with "Tell me about financial planning" (get information).

Where We Go From Here: Limitations and Next Steps

Let's be honest about this: it isn't magic. It's a powerful tool, but what we've built is a solid foundation that handles one specific, yet very important, part of the puzzle. There are a few big limitations we need to be aware of, and plenty of ways to make this project even better.

The Biggest Limitation: We're Only Testing the First Turn

The most important thing to understand is that our current setup only tests a single-turn interaction. We send an audio file, the agent responds, and we grade that one response. That's it. But real conversations are almost never that simple.

A real user might have a back-and-forth conversation:

User: "Hi, I need some help with my account."
Agent: "Of course, I can help with that. What seems to be the problem?"
User: "Well, I'm just feeling really overwhelmed by it all, I don't know where to start."

In our current system, we only test that last, crucial sentence. But a truly great agent needs to maintain context over multiple turns. It should understand that the user is in distress within the context of an account problem. Our current optimization process doesn't test for that at all. This is, by far, the biggest opportunity for improvement.

Other Things This Doesn't Do (Yet)

• We're Testing in a Soundproof Booth: The audio we generate is "studio quality"—perfectly clean, with no background noise. But real users are almost never in a studio. They're in coffee shops, walking down the street, or have a TV on in the background. Our current tests don't check how well the agent performs when the audio is messy and full of real-world noise.
• It's Only as Good as Our Initial Test Cases: The whole process is guided by the input_queries.json file we create at the start. If we don't include a certain type of edge case in our initial queries, the optimiser won't even know it needs to solve for it. The quality of our starting test cases really matters.
• The Optimiser Can Get Stuck: Sometimes the optimiser LLM hits a "local maximum." It finds a prompt that's pretty good (say, 85% accurate) and then just keeps making tiny, unhelpful tweaks to it instead of trying a completely different, more creative approach that could get it to 95%.

The Fun Part: How We Can Improve It

These limitations aren't dead ends; they're opportunities. This is where we can really start to experiment and take the project to the next level.

• Building Multi-Turn Test Scenarios: This is the big one. We could change our test suite from a list of single audio files to a list of conversational scripts. The evaluator would need to simulate a multi-turn dialogue, sending one audio file, getting a response, and then sending the next one. This would allow us to optimise for prompts that excel at maintaining context.
• Smarter Evaluation: Instead of re-running the entire audio test suite every single time, what if we only re-ran the tests that failed in the last iteration? This would make each loop much faster and cheaper.
• Better Evaluation Metrics: We could easily extend our evaluator. What if, in addition to checking the function call, we used another LLM to score the agent's politeness or conciseness? Then we could optimise for several things at once.
• Human-in-the-Loop: We could build a simple UI that shows us the new prompt the optimiser came up with. We could then give it a thumbs-up or make a small manual edit before the next evaluation round, combining AI scale with human intuition.
• Exemplar Selection: And of course, there's the next logical step: exemplar selection. Once we've found the best prompt, we could run another loop to find the best few-shot examples to go with it, pushing the accuracy even higher.

The possibilities are huge. Feel free to take the code and try implementing some of these ideas yourself. This is just the beginning of what we can do with automated prompt engineering for voice.

Conclusion

And that's a wrap! We've gone from a simple idea to a full-blown automated prompt engineering pipeline for a voice AI assistant. It's a testament to the power of APE and the OPRO algorithm, showing that they can work even in the messy world of audio.

In this blog post, we've explored how crafting effective prompts is critical for an agent's performance, and how the manual process of tweaking and testing is too slow and tedious for today's complex voice assistants. We saw how APE lets us get away from that frustrating manual work and move towards a more systematic, data-driven approach.

But we didn't just talk theory – we got practical. We walked through the entire process, from generating a diverse audio test suite with lifelike voices to implementing the OPRO loop where an "optimiser" LLM learns from a detailed history of successes and failures. We saw how this automated process can take a simple starting prompt and discover a much better one that handles the tricky, ambiguous queries real users throw at it.

Of course, what we've built is just a starting point. There are many ways to enhance it further, like building multi-turn conversational tests or adding background noise to the audio. The possibilities are huge.

I really hope you enjoyed this walkthrough and found it useful. The full project is available in the GitHub repository, and I encourage you to check it out. Feel free to clone the repo, run it yourself, and maybe even try to implement some of the improvements we discussed.

Thanks for reading, and happy optimising! 🤗


    Heiko Hotz

👋 Follow me on Towards Data Science and LinkedIn to read more about Generative AI, Machine Learning, and Natural Language Processing.
