A picking operation is the process of collecting items from storage locations to fulfil customer orders.
It is one of the most labour-intensive activities in logistics, accounting for up to 55% of total warehouse operating costs.
For each order, an operator receives a list of items to collect from their storage locations.
They walk to each location, identify the product, pick the right quantity, and confirm the operation before moving to the next line.
In most warehouses, operators rely on RF scanners or handheld tablets to receive instructions and confirm each pick.
- What happens when operators need both hands for handling?
- How do you onboard operators who don’t read the local language?
Voice picking solves this by replacing the screen with audio instructions: the system tells the operator where to go and what to pick, and the operator confirms verbally.

When I was designing supply chain solutions in logistics companies, vocalisation was the default choice, especially for price-sensitive projects.
Based on my experience, with vocalisation, operators’ productivity can reach 250 boxes/hour for retail and FMCG operations.
The concept is not new. Hardware suppliers and software editors have offered voice-picking solutions since the early 2000s.
But these systems come with significant constraints:
- Proprietary hardware at $2,000 to $5,000 per headset
- Vendor-locked software with limited customisation
- Long deployment cycles of 3 to 6 months per site
- Rigid language support that requires retraining for each new language
For a 50-FTE warehouse, the total investment reaches $150K to $300K, excluding training costs.
It’s too expensive for my customers.
What if you could achieve similar results using a smartphone, a custom web application, and modern AI voice technology?
In this article, I’ll show how I built a minimalist voice-picking module that integrates with Warehouse Management Systems, using ElevenLabs for text-to-speech and speech recognition.

This web application has been deployed in the distribution centre of a small supermarket chain with great results (the customer is happy!).
The objective is not to design solutions that compete with market leaders, but rather to offer an alternative to logistics and manufacturing operations that lack the capacity to invest in expensive equipment and need customised solutions.
Problem Statement
Before we get into voice-picking powered by ElevenLabs, let me introduce the logistics operations this AI-powered web application will support.

This is the central distribution centre of a small supermarket chain that delivers to 50 stores in Central Europe.

The facility is organised in a grid layout with aisles (A through L) and positions along each aisle:
- Each location stores a specific item (called a SKU) with a known quantity in boxes.
- Operators need to know where to go and what to expect when they arrive.
What’s the objective? Boost the operators’ productivity!
The customer was not satisfied with the order allocation and walking paths provided by their old system.

They first asked to reduce operators’ walking distance and increase the number of boxes picked per hour using the solutions presented in this article.
The solution was a web application connected to the Warehouse Management System (WMS) database that guides the operator through the warehouse.

This visual layout provides a real-time view of what we have in the system, with a better routing solution.
Our target is to go from a productivity of 75 boxes/hour to 200 boxes/hour with:
- A better allocation of orders with spatial clustering and pathfinding to minimise the walking distance per box picked
- Voice-picking to guide operators in a seamless way
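
To give an idea of how pathfinding can reduce walking distance, here is a minimal sketch of a nearest-neighbour ordering of pick locations. It is not the algorithm deployed at the customer site: the grid coordinates, the Manhattan distance and all function names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # (aisle index, position index): hypothetical grid coordinates


def manhattan(a: Coord, b: Coord) -> int:
    """Walking distance on a grid, ignoring racks and one-way aisles."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def nearest_neighbour_route(start: Coord, stops: Dict[str, Coord]) -> List[str]:
    """Order locations by always walking to the closest remaining one."""
    remaining = dict(stops)
    route: List[str] = []
    current = start
    while remaining:
        # Ties are broken alphabetically so the result is deterministic
        name = min(remaining, key=lambda n: (manhattan(current, remaining[n]), n))
        route.append(name)
        current = remaining.pop(name)
    return route


def route_length(start: Coord, stops: Dict[str, Coord], route: List[str]) -> int:
    """Total distance walked when visiting the stops in the given order."""
    total, current = 0, start
    for name in route:
        total += manhattan(current, stops[name])
        current = stops[name]
    return total


# Hypothetical coordinates: aisle A = 0, aisle B = 1, positions 1 to 3
stops = {"A1": (0, 1), "A3": (0, 3), "B2": (1, 2)}
route = nearest_neighbour_route((0, 0), stops)
```

Nearest-neighbour is a simple heuristic and does not guarantee the shortest path, but it is often a reasonable first baseline to compare against the sequence produced by an existing system.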
How the Picking Flow Works
Before jumping into the vocalisation of the tool, let me introduce the process of order picking.
Three stores sent orders to the warehouse:
- Store 1 ordered 3 boxes of Organic Green Tea 500g, which are located in Location A1
- Store 2 ordered 2 boxes of Earl Grey Tea 250g, which are located in Location A3
- Store 3 ordered 5 boxes of Arabica Coffee Beans 1kg, which are located in Location B2
A picking batch is a group of store orders consolidated into a single work assignment.

The system generates a batch with multiple order lines, each with instructions:
- Where to go (the storage location)
- What to pick (the SKU reference)
- How many boxes to collect
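
As a rough illustration, a batch can be modelled as a list of order lines that carry these three pieces of information. The class, the field names and the SKU codes below are hypothetical, and sorting by location code only stands in for the real routing logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PickLine:
    """One instruction of a picking batch (field names are illustrative)."""
    store: str      # store that placed the order
    location: str   # where to go
    sku: str        # what to pick (hypothetical SKU reference)
    quantity: int   # how many boxes to collect


def build_batch(lines: List[PickLine]) -> List[PickLine]:
    """Consolidate store orders into one batch, sorted by location code.

    Sorting alphabetically is a placeholder, not the optimisation itself.
    """
    return sorted(lines, key=lambda line: line.location)


orders = [
    PickLine("Store 3", "B2", "SKU-1003", 5),
    PickLine("Store 1", "A1", "SKU-1001", 3),
    PickLine("Store 2", "A3", "SKU-1002", 2),
]
batch = build_batch(orders)
total_quantity = sum(line.quantity for line in batch)
```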

The operator just has to process each line sequentially.
Once they confirm a pick, the system advances to the next instruction.
This sequential flow is important because the order of the lines, produced by the optimisation algorithms, determines the walking path through the warehouse.

As this is a custom tool, we could implement this optimisation without relying on an external editor.
Why build a custom solution? Because it’s cheaper and easier to implement.
Initially, the customer planned to purchase a commercial solution and wanted me to integrate the pathfinding solution.
After investigation, we discovered that it would have been more expensive to integrate the app into the vendor solution than to build something from scratch.
What is the process without the AI-based voice feature?
Manual Mode: The Screen-Based Baseline
In manual mode, the operator reads each instruction on screen and confirms by tapping a button.
Two actions are available at each step:
- Confirm Pick: the operator collected the right quantity
- Report Issue: the location is empty, the quantity doesn’t match, or the product is damaged
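
The two actions can be sketched as a small session object that keeps a cursor on the current line. This is a simplified model, not the application’s code: the class name, the string format of the lines and the choice to move on after an issue is reported are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PickingSession:
    """Walks through the lines of a batch one at a time (illustrative model)."""
    lines: List[str]  # e.g. "A1: 3 x SKU-1001" (hypothetical format)
    cursor: int = 0
    issues: List[Tuple[str, str]] = field(default_factory=list)

    def current(self) -> Optional[str]:
        """Instruction to display, or None when the batch is finished."""
        return self.lines[self.cursor] if self.cursor < len(self.lines) else None

    def confirm_pick(self) -> None:
        """The right quantity was collected: move to the next line."""
        if self.current() is not None:
            self.cursor += 1

    def report_issue(self, reason: str) -> None:
        """Log the discrepancy against the current line, then move on (assumed behaviour)."""
        line = self.current()
        if line is not None:
            self.issues.append((line, reason))
            self.cursor += 1


session = PickingSession(["A1: 3 x SKU-1001", "A3: 2 x SKU-1002", "B2: 5 x SKU-1003"])
session.confirm_pick()                    # A1 picked
session.report_issue("location empty")    # A3 flagged
```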

I built the manual mode as a reliable fallback in case we have issues with ElevenLabs.
But it keeps the operator’s eyes and one hand tied to the device at every step.
We need to add vocal commands!
Voice Mode: Hands-Free with ElevenLabs
Now that you know why we want the voice mode to replace screen interaction, let me explain how I added two AI-powered components.

Text-to-Speech: ElevenLabs Reads the Instructions
When the operator starts a picking session in voice mode, each instruction is converted to speech using the ElevenLabs API.
Instead of reading “Location A-03-2, pick 4 boxes of SKU-1042” on a screen, the operator hears a natural voice say:
“Location Alpha Three Two. Pick 4 boxes.”
ElevenLabs provides several advantages over basic browser-based TTS:
- Natural intonation that’s easy to understand in a noisy warehouse
- 29+ languages available out of the box, with no retraining
- Consistent voice quality across all instructions
- Sub-second generation for short sentences like pick instructions
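
A minimal sketch of this step, assuming the sentence is built on the application side before being sent to ElevenLabs. The helper names and the spelling rules are hypothetical. The request targets the public ElevenLabs text-to-speech REST endpoint, but the model name and payload should be checked against the current API reference; the request is only prepared here, not sent.

```python
import json
import urllib.request

# NATO spelling for aisle letters A to L (the aisles of this warehouse)
NATO = {
    "A": "Alpha", "B": "Bravo", "C": "Charlie", "D": "Delta",
    "E": "Echo", "F": "Foxtrot", "G": "Golf", "H": "Hotel",
    "I": "India", "J": "Juliett", "K": "Kilo", "L": "Lima",
}
NUMBERS = ["Zero", "One", "Two", "Three", "Four",
           "Five", "Six", "Seven", "Eight", "Nine"]


def spoken_location(code: str) -> str:
    """Turn 'A-03-2' into 'Alpha Three Two' (leading zeros are dropped)."""
    aisle, *parts = code.split("-")
    words = [NATO[aisle.upper()]]
    for part in parts:
        words.extend(NUMBERS[int(digit)] for digit in str(int(part)))
    return " ".join(words)


def pick_sentence(location: str, quantity: int) -> str:
    """Build the short sentence the operator will hear."""
    unit = "box" if quantity == 1 else "boxes"
    return f"Location {spoken_location(location)}. Pick {quantity} {unit}."


def build_tts_request(text: str, voice_id: str, api_key: str) -> urllib.request.Request:
    """Prepare (without sending) a call to the ElevenLabs text-to-speech endpoint."""
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
    # Model choice is an assumption; pick the one that fits your languages
    body = json.dumps({"text": text, "model_id": "eleven_multilingual_v2"}).encode("utf-8")
    headers = {"xi-api-key": api_key, "Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Passing the prepared request to `urllib.request.urlopen` would return the audio bytes to play on the operator’s smartphone. Keeping the SKU out of the spoken sentence, as in the example above, keeps the instruction short.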
However what about speech recognition?
Speech-to-Text: The Operator Confirms Verbally
After hearing the instruction, the operator walks to the location, picks the items, and needs to confirm.
Here, I made a deliberate design choice to rely on the speech recognition and reasoning capabilities of ElevenLabs.
Using a single endpoint, we capture the response and match it against expected commands:
- “Confirm” or “Done” to validate the pick
- “Problem” or “Issue” to flag a discrepancy
- “Repeat” to hear the instruction again
The agentic part interprets the operator’s feedback and tries to match it to the expected interactions (CONFIRM, ISSUE, or REPEAT).

For a multilingual warehouse, this is a significant benefit:
- A Czech operator and a Filipino operator can both receive instructions in their native language from the same system, without any hardware change.
- I don’t have to consider all the possible languages in the design of the solution.
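
For comparison, here is what a purely local matcher for the three expected interactions (CONFIRM, ISSUE, REPEAT) could look like. This is a swapped-in keyword technique, not the ElevenLabs-based interpretation described above: the keyword table is an illustrative assumption and only covers the words you list, which is exactly the limitation the multilingual endpoint avoids. It may still be useful as a cheap offline check.

```python
import re
from typing import Optional

# Illustrative English keywords; every additional language needs its own entries
COMMANDS = {
    "CONFIRM": {"confirm", "done"},
    "ISSUE": {"problem", "issue"},
    "REPEAT": {"repeat"},
}


def match_command(transcript: str) -> Optional[str]:
    """Return CONFIRM, ISSUE or REPEAT, or None if nothing matches.

    REPEAT and ISSUE are checked first so that an ambiguous utterance
    never validates a pick by accident.
    """
    words = set(re.findall(r"[a-z]+", transcript.lower()))
    for intent in ("REPEAT", "ISSUE", "CONFIRM"):
        if words & COMMANDS[intent]:
            return intent
    return None
```

When the function returns `None`, a sensible default is to replay the instruction rather than guess.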
Why use ElevenLabs?
For another feature, the inventory cycle count tool presented in this video, I’ve used n8n with AI agent nodes to perform the same task.

This was working quite well, but it required a more complex setup:
- Two AI nodes: one for the audio transcription using OpenAI models, and one AI agent to format the output of the transcription
- System prompts that assumed the operator was speaking English
I’ve replaced that with a single ElevenLabs endpoint with multilingual capabilities.
Putting both components together, a single pick cycle looks like this:

- The app calls ElevenLabs to generate the audio instruction
- The operator hears: “Location Alpha Three Two. Pick 4 boxes.”
- The operator walks to the location (hands free, eyes free)
- The operator picks the items and says, “Confirm”
- The speech recognition endpoint processes the confirmation and the app moves to the next picking location
The entire interaction takes a few seconds of system time.
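
The cycle can be sketched as a loop in which the text-to-speech and speech recognition calls are injected as plain functions. `speak` and `listen` are placeholders (assumptions) for the two ElevenLabs calls, so the control flow can be tested without any network access.

```python
from typing import Callable, List, Tuple

Instruction = Tuple[str, int]  # (spoken location, number of boxes)


def run_batch(
    batch: List[Instruction],
    speak: Callable[[str], None],
    listen: Callable[[], str],
) -> List[Tuple[str, str]]:
    """Play each instruction until the operator answers CONFIRM or ISSUE.

    Any other answer (REPEAT or an unrecognised reply) replays the instruction.
    """
    log: List[Tuple[str, str]] = []
    for location, boxes in batch:
        sentence = f"Location {location}. Pick {boxes} boxes."
        while True:
            speak(sentence)
            intent = listen()
            if intent in ("CONFIRM", "ISSUE"):
                break  # both answers close the line
        log.append((location, intent))
    return log


# Simulated session: scripted answers replace the microphone
spoken: List[str] = []
answers = iter(["REPEAT", "CONFIRM", "ISSUE"])
log = run_batch(
    [("Alpha Three Two", 4), ("Bravo One One", 2)],
    speak=spoken.append,
    listen=lambda: next(answers),
)
```

Injecting the two functions also makes it straightforward to switch back to the manual mode: the same loop can be driven by button taps instead of voice.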
What about the costs?
This is where the comparison with traditional systems becomes striking.

For this mid-size warehouse with 50 FTEs, the customer estimated that the traditional approach would cost roughly $60K to $150K in the first year.
The AI-powered approach costs a few API calls.
The trade-off is clear: traditional systems offer proven reliability and offline capability for high-volume operations.
If the voice service fails, we have the manual mode as a fallback.
This AI-powered approach offers accessibility and speed for organisations that cannot justify a six-figure investment.
What Does That Mean for Operations Managers and Decision Makers?
Voice picking is no longer a technology reserved for the largest 3PLs and retailers with large budgets.
If your warehouse has WiFi and your operators have smartphones, you can prototype a voice-guided picking system in days.
It’s easy to test it on a real batch to measure the impact before committing any significant budget for productisation.
Three scenarios where this approach makes particular sense:
- Multilingual facilities where operators struggle with screen-based instructions in a language that’s not their own
- Multi-site operations where deploying proprietary hardware to every small warehouse is not economically viable
- High-turnover environments where training time on complex scanning systems directly impacts productivity
What about other processes?
Good news: the same architecture extends beyond picking.
Voice-guided workflows can support any process where an operator needs instructions while keeping their hands free.
You can find a live demo of an inventory cycle counting tool here:
How to start this journey?
As you can easily guess, the front end of these applications has been vibe-coded using Lovable and Claude Code.
For the backend, if you have limited coding capabilities, I would suggest starting with n8n.

n8n is a low-code automation platform that lets you connect APIs and AI models using visual workflows.
The initial version of this solution was built with this tool:
- I started with a backend connected to a Telegram Bot
- Users tried out the tool through this interface
- After validation, we moved it to a web application
This is the easiest way to start, even with limited coding skills.
I share a step-by-step tutorial with free templates to start automating from day 1 in this video:
Let me know what you plan to build using all these great tools!
About Me
Let’s connect on LinkedIn and Twitter. I’m a Supply Chain Engineer who uses data analytics to improve logistics operations and reduce costs.
If you’re looking for tailored consulting solutions to optimise your supply chain and meet sustainability goals, please contact me.