This article is co-authored by Ugo Pradère and David Haüet
How hard can it be to transcribe an interview? You feed the audio to an AI model, wait a few minutes, and boom: perfect transcript, right? Well… not quite.
When it comes to accurately transcribing long audio interviews, even more so when the spoken language is not English, things get much more complicated. You need high-quality transcription with reliable speaker identification, precise timestamps, and all that at a reasonable price. Not so simple after all.
In this article, we take you behind the scenes of our journey to build a scalable and production-ready transcription pipeline using Google’s Vertex AI and Gemini models. From unexpected model limitations to budget evaluation and timestamp drift disasters, we’ll walk you through the real challenges, and how we solved them.
Whether you are building your own audio processing tool or just curious about what happens “under the hood” of a robust transcription system using a multimodal model, you will find practical insights, clever workarounds, and lessons learned that should be worth your time.
Context of the project and constraints
At the beginning of 2025, we started an interview transcription project with a clear goal: to build a system capable of transcribing interviews in French, typically involving a journalist and a guest, but not restricted to this case, and lasting from a few minutes to over an hour. The final output was expected to be more than just a raw transcript: it had to reflect the natural spoken dialogue, written in a “book-like” style, ensuring both a faithful transcription of the original audio content and good readability.
Before diving into development, we conducted a short market analysis of existing solutions, but the results were never satisfactory: the quality was often disappointing, the pricing definitely too high for intensive usage, and sometimes both at once. At that point, we realized a custom pipeline would be necessary.
Because our team operates within the Google ecosystem, we were required to use Google Vertex AI services. Google Vertex AI offers a variety of Speech-to-Text (S2T) models for audio transcription, including specialized ones such as “Chirp,” “Latest Long,” or “Phone call,” whose names already hint at their intended use cases. However, producing a complete transcription of an interview that combines high accuracy, speaker diarization, and precise timestamping, especially for long recordings, remains a real technical and operational challenge.
First attempts and limitations
We initiated our project by evaluating all these models on our use case. However, after extensive testing, we quickly came to the following conclusion: no Vertex AI service fully meets the complete set of requirements and would allow us to achieve our goal in a simple and effective way. There was always at least one missing specification, usually on timestamping or diarization.
The poor Google documentation, it must be said, cost us a significant amount of time during this initial research. This prompted us to ask Google for a meeting with a Google Cloud Machine Learning Specialist to try to find a solution to our problem. After a quick video call, our discussion with the Google rep quickly confirmed our conclusions: what we aimed to achieve was not as simple as it seemed at first. The complete set of requirements could not be fulfilled by a single Google service, and a custom implementation around Vertex AI S2T services had to be developed.
We presented our preliminary work and decided to continue exploring two strategies:
- Use Chirp 2 to generate the transcription and timestamps of long audio files, then use Gemini for diarization.
- Use Gemini 2.0 Flash for transcription and diarization, even though the timestamping is approximate and the output token limit requires looping.
In parallel with these investigations, we also had to take the financial aspect into account. The tool would be used for hundreds of hours of transcription per month. Unlike text, which is generally cheap enough not to think about it, audio can be quite costly. We therefore included this parameter from the start of our exploration to avoid ending up with a solution that worked but was too expensive to run in production.
Deep dive into transcription with Chirp 2
We began with a deeper investigation of the Chirp 2 model, since it is considered the “best in class” Google S2T service. A straightforward application of the documentation provided the expected result. The model turned out to be quite effective, offering good transcription with word-by-word timestamping, according to the following output in JSON format (a minimal request sketch follows the example):
"transcript":"Oui, en effet",
"confidence":0.7891818284988403
"phrases":[
{
"word":"Oui",
"start-offset":{
"seconds":3.68
},
"end-offset":{
"seconds":3.84
},
"confidence":0.5692862272262573
}
{
"word":"en",
"start-offset":{
"seconds":3.84
},
"end-offset":{
"seconds":4.0
},
"confidence":0.758037805557251
},
{
"word":"effet",
"start-offset":{
"seconds":4.0
},
"end-offset":{
"seconds":4.64
},
"confidence":0.8176857233047485
},
]
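For reference, here is a minimal sketch of how such word-level output can be requested from Chirp 2 through the Speech-to-Text v2 API. This is an illustration under assumptions rather than our production code: the project ID, region, and file name are placeholders, and for recordings longer than about a minute the batch recognition variant would be used instead of the synchronous call shown here.

# Minimal sketch (placeholders, not production code): word-level timestamps from Chirp 2
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

PROJECT_ID = "your-gcp-project"  # placeholder
REGION = "europe-west4"          # assumption: a region where Chirp 2 is available

client = SpeechClient(client_options={"api_endpoint": f"{REGION}-speech.googleapis.com"})

config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["fr-FR"],
    model="chirp_2",
    features=cloud_speech.RecognitionFeatures(
        enable_word_time_offsets=True,  # word-by-word start/end offsets
        enable_word_confidence=True,    # per-word confidence scores
    ),
)

with open("interview.wav", "rb") as f:  # placeholder file
    request = cloud_speech.RecognizeRequest(
        recognizer=f"projects/{PROJECT_ID}/locations/{REGION}/recognizers/_",
        config=config,
        content=f.read(),
    )

response = client.recognize(request=request)
for result in response.results:
    for word in result.alternatives[0].words:
        print(word.word, word.start_offset, word.end_offset, word.confidence)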
However, a new requirement came along during the project, added by the operational team: the transcription must be as faithful as possible to the original audio content and include small filler words, interjections, onomatopoeia and even mumbling that can add meaning to a conversation, and often come from the non-speaking participant, either at the same time or toward the end of a sentence of the speaking one. We are talking about words like “oui oui,” “en effet” but also simple expressions like (hmm, ah, etc.), so typical of the French language! It is actually not uncommon to validate or, more rarely, oppose someone’s point with a simple “Hmm Hmm”. Upon analyzing the Chirp 2 transcription, we noticed that while some of these small words were present, a number of these expressions were missing. First drawback for Chirp 2.
The main challenge in this approach lies in the reconstruction of the speakers’ sentences while performing diarization. We quickly abandoned the idea of giving Gemini the context of the interview and the transcription text, and asking it to determine who said what. This strategy could easily result in incorrect diarization. We instead explored sending the interview context, the audio file, and the transcription content in a compact format, instructing Gemini to only perform diarization and sentence reconstruction without re-transcribing the audio file. We requested a TSV format, an ideal structured format for transcription: “human readable” for fast quality checking, easy to process algorithmically, and lightweight. Its structure is as follows:
First line with speaker presentation:
Diarization\tSpeaker_1:speaker_name\tSpeaker_2:speaker_name\tSpeaker_3:speaker_name\tSpeaker_4:speaker_name, etc.
Then the transcription in the following format:
speaker_id\ttime_start\ttime_stop\ttext, with:
- speaker_id: Numeric speaker ID (e.g., 1, 2, etc.)
- time_start: Segment start time in the format 00:00:00
- time_stop: Segment end time in the format 00:00:00
- text: Transcribed text of the dialogue segment
An example output (a small parsing helper is sketched after it):
Diarization  Speaker_1:Lea Finch  Speaker_2:David Albec
1 00:00:00 00:03:00 Hi Andrew, how are you?
2 00:03:00 00:03:00 Fine thanks.
1 00:04:00 00:07:00 So, let’s start the interview
2 00:07:00 00:08:00 All right.
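For illustration, a small helper like the following can parse this TSV structure into usable segments; it is a sketch under the format assumptions above, not our exact production code.

# Illustrative parser for the TSV transcript format described above
from dataclasses import dataclass

@dataclass
class Segment:
    speaker_id: int
    time_start: str  # "HH:MM:SS"
    time_stop: str   # "HH:MM:SS"
    text: str

def parse_transcript(tsv: str):
    lines = [line for line in tsv.splitlines() if line.strip()]
    # First line: "Diarization\tSpeaker_1:name\tSpeaker_2:name..."
    speakers = {}
    for field in lines[0].split("\t")[1:]:
        speaker_key, name = field.split(":", 1)
        speakers[int(speaker_key.split("_")[1])] = name
    # Remaining lines: speaker_id\ttime_start\ttime_stop\ttext
    segments = []
    for line in lines[1:]:
        speaker, start, stop, text = line.split("\t", 3)
        segments.append(Segment(int(speaker), start, stop, text))
    return speakers, segments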
A simple version of the context provided to the LLM:
Here is the interview of David Albec, professional football player, by journalist Lea Finch
The result was of fairly good quality, with what appeared to be accurate diarization and sentence reconstruction. However, instead of getting the exact same text, it appeared slightly modified in several places. Our conclusion was that, despite our clear instructions, Gemini probably carries out more than just diarization and actually performed a partial re-transcription.
We also evaluated at this point the cost of transcription with this strategy. Below is the approximate calculation based solely on audio processing:
Chirp 2 price/min: $0.016
Gemini 2.0 Flash price/min: $0.001875
Price/hour: $1.0725
Chirp 2 is indeed quite “expensive”, about ten times more than Gemini 2.0 Flash at the time of writing, and still requires the audio to be processed by Gemini for diarization. We therefore decided to put this strategy aside for now and explore an approach using the brand-new multimodal Gemini 2.0 Flash alone, which had just left experimental mode.
Next: exploring audio transcription with Gemini 2.0 Flash
We provided Gemini with both the interview context and the audio file, requesting a structured output in a consistent format. By carefully crafting our prompt with standard LLM guidelines, we were able to specify our transcription requirements with a high degree of precision. In addition to the usual elements any prompt engineer might include, we emphasized several key instructions essential for ensuring a quality transcription (comments in italic; a minimal call sketch follows the list):
- Transcribe interjections and onomatopoeia even when mid-sentence.
- Preserve the full expression of words, including slang, insults, or inappropriate language. => the model tends to change words it considers inappropriate. For this specific point, we had to ask Google to deactivate the safety rules on our Google Cloud Project.
- Build complete sentences, paying particular attention to changes of speaker mid-sentence, for example when one speaker finishes another’s sentence or interrupts. => Such errors affect diarization and accumulate throughout the transcript until the context is strong enough for the LLM to correct them.
- Normalize prolonged words or interjections like “euuuuuh” to “euh”, and not “euh euh euh euh euh…” => this was a classic bug we kept encountering, referred to as the “repetition bug”, and discussed in more detail below.
- Identify speakers by voice tone while using context to determine who is the journalist and who is the interviewee. => in addition, we can pass the identity of the first speaker in the prompt.
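As an illustration of this step, here is a minimal call sketch assuming the Vertex AI Python SDK; the project, region, bucket URI and prompt wording are placeholders, not our exact production values.

# Minimal sketch of the multimodal transcription call (placeholders throughout)
import vertexai
from vertexai.generative_models import GenerativeModel, Part, GenerationConfig

vertexai.init(project="your-gcp-project", location="europe-west4")  # placeholders

model = GenerativeModel("gemini-2.0-flash")

prompt = (
    "Here is the interview of David Albec, professional football player, "
    "by journalist Lea Finch.\n"
    "Transcribe the audio in French as TSV: a 'Diarization' header line, then one line "
    "per segment: speaker_id<TAB>time_start<TAB>time_stop<TAB>text.\n"
    "Transcribe interjections and onomatopoeia, preserve slang, normalize prolonged "
    "interjections ('euuuuuh' -> 'euh'), and identify speakers by voice tone."
)

audio = Part.from_uri("gs://my-bucket/interview.mp3", mime_type="audio/mpeg")  # placeholder URI

response = model.generate_content(
    [prompt, audio],
    generation_config=GenerationConfig(temperature=0.0, max_output_tokens=8192),
)
print(response.text)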
Initial results were actually quite satisfying in terms of transcription, diarization, and sentence construction. Transcribing short test files made us feel like the project was nearly complete… until we tried longer files.
Dealing with Long Audio and LLM Token Limitations
Our early tests on short audio clips were encouraging, but scaling the approach to longer audios quickly revealed new challenges: what initially seemed like a simple extension of our pipeline turned out to be a technical hurdle in itself. Processing files longer than a few minutes indeed revealed a series of challenges related to model constraints, token limits, and output reliability:
- One of the first problems we encountered with long audio was the token limit: the number of output tokens exceeded the maximum allowed (8192 tokens), forcing us to implement a looping mechanism by repeatedly calling Gemini while resending the previously generated transcript, the initial prompt, a continuation prompt, and the same audio file (a simplified sketch of this loop follows the continuation prompt below).
Here is an example of the continuation prompt we used:
Continue transcribing the audio interview from the previous result. Start processing the audio file from the previously generated text. Do not start from the beginning of the audio. Be careful to continue the previously generated content, which is provided between the following tags <previous_result>.
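A simplified sketch of this looping mechanism, assuming the same SDK as above (helper names are ours, and error handling is omitted):

# Sketch of the continuation loop: keep calling Gemini while the response stops
# on the output-token cap, resending the audio plus the text generated so far.
from vertexai.generative_models import FinishReason

def transcribe_with_loop(model, base_prompt, continuation_prompt, audio_part):
    transcript = ""
    while True:
        if not transcript:
            contents = [base_prompt, audio_part]
        else:
            contents = [
                base_prompt,
                continuation_prompt,
                f"<previous_result>{transcript}</previous_result>",
                audio_part,
            ]
        response = model.generate_content(contents)
        transcript += response.text
        # Stop once the model finished naturally rather than on the token cap
        if response.candidates[0].finish_reason != FinishReason.MAX_TOKENS:
            return transcript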
- Using this transcription loop with large data inputs seems to significantly degrade the LLM output quality, especially for timestamping. In this configuration, timestamps can drift by over 10 minutes on an hour-long interview. While a few seconds of drift was considered compatible with our intended use, a few minutes made timestamping useless.
Our initial tests on short audios of a few minutes resulted in a maximum drift of 5 to 10 seconds, and significant drift was generally observed after the first loop, when the output token limit was reached. We conclude from these experimental observations that, while this looping approach ensures continuity in transcription fairly well, it not only leads to cumulative timestamp errors but also to a drastic loss of LLM timestamp accuracy.
- We also encountered a recurring and particularly frustrating bug: the model would sometimes fall into a loop, repeating the same word or phrase over dozens of lines. This behavior made entire portions of the transcript unusable and typically looked something like this:
1 00:00:00 00:03:00 Hi Andrew, how are you?
2 00:03:00 00:03:00 Fine thanks.
2 00:03:00 00:03:00 Fine thanks
2 00:03:00 00:03:00 Fine thanks
2 00:03:00 00:03:00 Fine thanks.
2 00:03:00 00:03:00 Fine thanks
2 00:03:00 00:03:00 Fine thanks.
etc.
This bug seems erratic but occurs more frequently with medium-quality audio: strong background noise or a distant speaker, for example. And “in the field”, that is often the case. Likewise, speaker hesitations or word repetitions seem to trigger it. We still don’t know exactly what causes this “repetition bug”. The Google Vertex team is aware of it but hasn’t provided a clear explanation.
The consequences of this bug were especially limiting: once it occurred, the only viable solution was to restart the transcription from scratch. Unsurprisingly, the longer the audio file, the higher the probability of encountering the issue. In our tests, it affected roughly one out of every three runs on recordings longer than an hour, making it extremely difficult to deliver a reliable, production-quality service under such conditions.
- To make matters worse, resuming transcription after a max-token “cutoff” required resending the entire audio file each time. Although we only needed the next segment, the LLM would still process the full file again (without outputting the corresponding transcription), meaning we were billed for the full audio length on every resend.
In practice, we found that the token limit was typically reached between the 15th and 20th minute of the audio. Consequently, transcribing a one-hour interview often required 4 to 5 separate LLM calls, leading to a total billing equivalent to 4 to 5 hours of audio for a single file.
With this process, the cost of audio transcription does not scale linearly. While a 15-minute audio would be billed as 15 minutes, in a single LLM call, a 1-hour file could effectively cost 4 hours, and a 2-hour file could increase to 16 hours, following a near-quadratic pattern (≈ 4x², where x = number of hours).
This made long audio processing not just unreliable, but also expensive.
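To make the scaling concrete, a rough back-of-the-envelope estimate (our assumption of roughly one call per 15 minutes of audio, each billed for the full file) reproduces this pattern:

# Rough illustration of the near-quadratic billing: one LLM call per ~15 min of
# audio, each call billed for the whole file because the full audio is resent.
import math

def billed_hours(audio_hours: float, minutes_per_call: float = 15.0) -> float:
    calls = math.ceil(audio_hours * 60 / minutes_per_call)
    return calls * audio_hours  # every call processes (and bills) the full audio

for hours in (0.25, 1.0, 2.0):
    print(f"{hours} h of audio -> {billed_hours(hours)} h billed")
# 0.25 h -> 0.25 h billed, 1.0 h -> 4.0 h billed, 2.0 h -> 16.0 h billed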
Pivoting to Chunked Audio Transcription
Given these major limitations, and being much more confident in the ability of the LLM to handle text-based tasks than audio, we decided to shift our approach and isolate the audio transcription step in order to maintain high transcription quality. A quality transcription is indeed the key step of the need, and it makes sense to ensure that this part of the process is at the core of the strategy.
At this point, splitting the audio into chunks became the obvious solution. Not only did it seem likely to greatly improve timestamp accuracy, by avoiding the degradation of LLM timestamping performance after looping and the cumulative drift, but it would also reduce cost, since each chunk would ideally be processed only once. While it introduced new uncertainties around merging partial transcriptions, the tradeoff seemed in our favor.
We thus focused on breaking long audio into shorter chunks that would ensure a single LLM transcription request. During our tests, we observed that issues like repetition loops or timestamp drift typically began around the 18-minute mark in most interviews. It became clear that we should use 15-minute (or shorter) chunks to be safe. Why not use 5-minute chunks? The quality improvement looked minimal to us while tripling the number of segments. In addition, shorter chunks reduce the overall context, which can hurt diarization.
Although this setup drastically minimized the repetition bug, we observed that it still occurred occasionally. Wanting to deliver the best possible service, we were determined to find an efficient countermeasure to this problem, and we identified an opportunity in our previously annoying output token limit: with 10-minute chunks, we could be confident that the token limit would not be exceeded in nearly all cases. Thus, if the token limit was hit, we knew for sure the repetition bug had occurred and could restart that chunk’s transcription (see the sketch below). This pragmatic strategy turned out to be very effective at detecting and working around the bug. Great news.
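In practice this detection can be a simple retry loop; the sketch below is illustrative, with the retry count and helper names being our own placeholders:

# Pragmatic retry strategy: on a 10-minute chunk the output should never reach
# the token cap, so hitting MAX_TOKENS is treated as the repetition bug.
from vertexai.generative_models import FinishReason

MAX_RETRIES = 3  # illustrative value

def transcribe_chunk(model, prompt, audio_chunk_part):
    for _ in range(MAX_RETRIES):
        response = model.generate_content([prompt, audio_chunk_part])
        if response.candidates[0].finish_reason != FinishReason.MAX_TOKENS:
            return response.text
        # Token cap reached on a short chunk: almost certainly the repetition
        # bug, so discard this output and re-transcribe the chunk.
    raise RuntimeError("Repetition bug suspected after several retries")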
Correcting audio chunk transcripts
With good transcripts of 10-minute audio chunks in hand, we applied at this stage an algorithmic post-processing to each transcript to fix minor issues (a simplified sketch follows the list):
- Removal of header tags like tsv or json added at the beginning and the end of the transcription content:
Despite optimizing the prompt, we couldn’t fully eliminate this side effect without hurting the transcription quality. Since this is easily handled algorithmically, we chose to do so.
- Replacing speaker IDs with names:
Speaker identification by name only starts once the LLM has enough context to determine who is the journalist and who is being interviewed. This results in incomplete diarization at the beginning of the transcript, with early segments using numeric IDs (first speaker in the chunk = 1, etc.). Moreover, since each chunk may have a different ID order (the first person to talk being speaker 1), this would create confusion during merging. We instructed the LLM to only use IDs during the transcription process and to provide a diarization mapping in the first line. The speaker IDs are then replaced during the algorithmic correction and the diarization header line removed.
- Rarely, malformed or empty transcript lines are encountered. These lines are deleted, but we flag them with a note to the user: “formatting issue on this line”, so users are at least aware of a potential content loss and can eventually correct it by hand. In our final optimized version, such lines were extremely rare.
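A simplified sketch of this post-processing (illustrative only; the exact rules and helpers in our pipeline differ) could look like this:

# Illustrative cleanup of a chunk transcript: drop stray 'tsv'/'json' fence tags,
# map numeric speaker IDs to names from the Diarization header, flag bad lines.
import re

def clean_chunk(raw: str) -> list[str]:
    lines = [line for line in raw.splitlines() if line.strip()]
    # Remove code-fence or format tags occasionally wrapped around the output
    lines = [line for line in lines if not re.fullmatch(r"`*(tsv|json)?`*", line.strip())]
    # Header line maps IDs to names, e.g. "Diarization\tSpeaker_1:Lea Finch\t..."
    speakers = {}
    for field in lines[0].split("\t")[1:]:
        speaker_key, name = field.split(":", 1)
        speakers[speaker_key.split("_")[1]] = name
    cleaned = []
    for line in lines[1:]:
        parts = line.split("\t", 3)
        if len(parts) != 4 or not parts[3].strip():
            cleaned.append("formatting issue on this line")  # keep the user informed
            continue
        speaker_id, start, stop, text = parts
        cleaned.append("\t".join([speakers.get(speaker_id, speaker_id), start, stop, text]))
    return cleaned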
Merging chunks and maintaining content continuity
At the previous stage of audio chunking, we initially tried to make chunks with clean cuts. Unsurprisingly, this led to the loss of words and even full sentences at the cut points. So we naturally switched to overlapping chunk cuts to avoid such content loss, leaving the optimization of the size of the overlap to the chunk-merging process.
Without a clean cut between chunks, the possibility of merging the chunks algorithmically disappeared. For the same audio input, the transcript lines can be quite different, with breaks at different points of the sentences and even filler words or hesitations being rendered differently. In such a situation, it is complex, not to say impossible, to build an effective algorithm for a clean merge.
This left us with the LLM option, of course. Quickly, a few tests showed that the LLM could better merge segments together when the overlaps included full sentences. A 30-second overlap proved sufficient. With a 10-minute audio chunk structure, this implies the following chunk cuts (sketched in code after the list):
- 1st transcript: 0 to 10 minutes
- 2nd transcript: 9m30s to 19m30s
- 3rd transcript: 19m to 29m… and so on.
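The chunking itself can be done with any audio library; here is a sketch assuming pydub, with the 10-minute / 30-second values from above:

# Sketch of the overlapping chunking: 10-minute chunks, each starting 30 seconds
# before the end of the previous one.
from pydub import AudioSegment

CHUNK_MS = 10 * 60 * 1000  # 10 minutes
OVERLAP_MS = 30 * 1000     # 30 seconds

def split_audio(path: str) -> list[AudioSegment]:
    audio = AudioSegment.from_file(path)
    chunks, start = [], 0
    while start < len(audio):
        chunks.append(audio[start:start + CHUNK_MS])
        start += CHUNK_MS - OVERLAP_MS  # next chunk starts 30 s before this one ends
    return chunks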
These overlapping chunk transcripts were corrected by the previously described algorithm and sent to the LLM for merging to reconstruct the full audio transcript. The idea was to send the full set of chunk transcripts with a prompt instructing the LLM to merge them and return the full merged audio transcript in TSV format, as in the previous LLM transcription step. In this configuration, the merging process has mainly three quality criteria:
- Ensure transcription continuity without content loss or duplication.
- Adjust timestamps to resume from where the previous chunk ended.
- Preserve diarization.
As expected, the output token limit was exceeded, forcing us into an LLM call loop. However, since we were now using text input, we were more confident in the reliability of the LLM… probably too much. The result of the merge was satisfactory in general but prone to several issues: tag insertions, multi-line entries merged into one line, incomplete lines, and even hallucinated continuations of the interview. Despite many prompt optimizations, we couldn’t achieve sufficiently reliable results for production use.
As with audio transcription, we identified the amount of input information as the main issue. We were sending several hundred, even thousands, of text lines containing the prompt, the set of partial transcripts to fuse, a roughly comparable amount with the previous transcript, and some more with the prompt and its example. Definitely too much for a precise application of our set of instructions.
On the plus side, timestamp accuracy did indeed improve significantly with this chunking approach: we maintained a drift of just 5 to 10 seconds at most on transcriptions over an hour. Since the start of a transcript should have minimal drift in timestamping, we instructed the LLM to use the timestamps of the “ending chunk” as the reference for the fusion and to correct any drift by a second per sentence. This made the cut points seamless and kept the overall timestamp accuracy.
Splitting the chunk transcripts for full transcript reconstruction
In a modular approach similar to the workaround we used for transcription, we decided to carry out the merge of the transcripts separately, in order to avoid the previously described issues. To do so, each 10-minute transcript is split into three parts based on the start_time of the segments (see the sketch after the list):
- Overlap segment to merge at the beginning: 0 to 1 minute
- Main segment to paste: 1 to 9 minutes
- Overlap segment to merge at the end: 9 to 10 minutes
NB: Since every chunk, including the first and last ones, is processed the same way, the overlap at the beginning of the first chunk is directly merged with the main segment, and the overlap at the end of the last chunk (if there is one) is merged accordingly.
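In code, the split is straightforward; the sketch below is illustrative and assumes each segment’s start_time has already been parsed into seconds from the chunk start:

# Illustrative split of a corrected chunk transcript into the three parts used
# for reconstruction, based on each segment's start time within the chunk.
def split_for_merge(segments, chunk_minutes=10, overlap_minutes=1):
    head, main, tail = [], [], []
    for seg in segments:  # seg.start_time: seconds from the chunk start (assumption)
        if seg.start_time < overlap_minutes * 60:
            head.append(seg)   # overlap to merge with the previous chunk
        elif seg.start_time < (chunk_minutes - overlap_minutes) * 60:
            main.append(seg)   # main part, pasted as-is
        else:
            tail.append(seg)   # overlap to merge with the next chunk
    return head, main, tail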
The end and beginning segments are then sent in pairs to be merged. As expected, the quality of the output drastically increased, resulting in an efficient and reliable merge between the transcript chunks. With this procedure, the response of the LLM proved to be highly reliable and showed none of the previously mentioned errors encountered during the looping process.
The process of transcript assembly for an audio of 28 minutes 42 seconds:

Full transcript reconstruction
At this final stage, the only remaining task was to reconstruct the complete transcript from the processed splits. To achieve this, we algorithmically combined the main content segments with their corresponding merged overlaps, alternating between the two.
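A sketch of this reassembly, using the same hypothetical segment lists as above:

# Alternate each chunk's main part with the LLM-merged junction that follows it
def rebuild_transcript(main_parts, merged_overlaps):
    full = []
    for i, main in enumerate(main_parts):
        full.extend(main)
        if i < len(merged_overlaps):  # one merged junction per pair of consecutive chunks
            full.extend(merged_overlaps[i])
    return full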
Overall process overview
The overall process involves 6 steps, 2 of which are performed by Gemini:
- Chunking the audio into overlapping audio chunks
- Transcribing each chunk into a partial text transcript (LLM step)
- Correcting the partial transcripts
- Splitting the audio chunk transcripts into start, main, and end text splits
- Fusing the end and start splits of each pair of chunk splits (LLM step)
- Reconstructing the full transcript

The overall process takes about 5 minutes per hour of transcription delivered to the user in an asynchronous tool. Quite reasonable considering the amount of work done behind the scenes, and this for a fraction of the price of other tools or pre-built Google models like Chirp 2.
One additional improvement that we considered but ultimately decided not to implement was timestamp correction. We observed that timestamps at the end of each chunk typically ran about 5 seconds ahead of the actual audio. A straightforward solution could have been to incrementally adjust the timestamps algorithmically, by roughly one second every two minutes, to correct most of this drift. However, we chose not to implement this adjustment, as the minor discrepancy was acceptable for our business needs.
Conclusion
Building a high-quality, scalable transcription pipeline for long interviews turned out to be much more complex than simply picking the “right” Speech-to-Text model. Our journey with Google’s Vertex AI and Gemini models highlighted key challenges around diarization, timestamping, cost-efficiency, and long audio handling, especially when aiming to capture the full information of an audio.
Using careful prompt engineering, smart audio chunking strategies, and iterative refinements, we were able to build a robust system that balances accuracy, performance, and operational cost, turning an initially fragmented process into a smooth, production-ready pipeline.
There is still room for improvement, but this workflow now forms a solid foundation for scalable, high-fidelity audio transcription. As LLMs continue to evolve and APIs become more flexible, we are optimistic about even more streamlined solutions in the near future.
Key takeaways
- No Vertex AI S2T model met all our needs: Google Vertex AI provides specialized models, but each has limitations in terms of transcription accuracy, diarization, or timestamping for long audios.
- Token limits and long prompts drastically impact transcription quality: Gemini’s output token limitation significantly degrades transcription quality for long audios, requiring heavily prompted looping strategies and finally forcing us to shift to shorter audio chunks.
- Chunked audio transcription and transcript reconstruction significantly improve quality and cost-efficiency: splitting audio into 10-minute overlapping segments minimized major bugs like repeated sentences and timestamp drift, enabling higher quality results and drastically reduced costs.
- Careful prompt engineering remains essential: precision in prompts, especially regarding diarization and interjections for transcription, as well as transcript fusions, proved to be crucial for reliable LLM performance.
- Short transcript fusion merges maximize reliability: splitting each chunk transcript into smaller segments, with end-to-start merging of overlaps, provided high accuracy and avoided common LLM issues like hallucinations or incorrect formatting.