Voice-First Lead Capture
How we built voice AI lead capture on WhatsApp for a manufacturing sales team in India. Two bots, Twilio, Prisma, ₹151/month infrastructure.
The client sells industrial bearings and machined components to mid-sized manufacturers across Maharashtra and Gujarat. Their sales team is ten people. Most of them spend three to four days a week outside the office: distributor visits, trade shows at Pragati Maidan, cold drops at industrial estates in Pune and Vadodara. The rest of the week they are in transit.
We were brought in to fix one specific thing. The lead data from those field visits was not making it into the CRM. The owner suspected he was losing somewhere between a third and half of every visit's information by the time the salesperson got back to a desk.
The actual problem
The CRM the client was using is fine. It is a Zoho instance that the owner's brother set up four years ago. Salespeople had logins. There was a mobile app. The lead form had eleven fields and a free-text notes section.
Nobody used it in the field.
We sat with two of the salespeople for half a day each before writing any code. The pattern was identical for both of them. They would meet a contact, exchange cards, talk for twenty minutes, take a photo of the visiting card, and move on to the next meeting. The Zoho app would get opened in the evening at the hotel, sometimes. More often it got opened on Sunday for the whole week's visits at once. By Sunday, the details were a blur. Volume estimates became guesses. Competitor names dropped off entirely. Follow-up dates got rounded to "next week."
This is not a CRM adoption problem in the way it usually gets framed. The salespeople were not lazy or resistant to process. The interface just did not match the tempo of the work. Standing in a factory parking lot in 38-degree heat, nobody is going to fill out an eleven-field form on a phone screen.
40-60%
Industry estimate for lead data lost to delayed manual CRM entry
What they did do, all day, was send WhatsApp voice notes. To each other, to the regional manager, to the owner. Quick updates, route changes, asks for pricing clarification. Voice notes were already the medium.
If the capture mechanism matches the medium people already use, adoption is not a problem you have to solve. It just happens.
So the brief became: build the lead capture inside WhatsApp, accept voice notes as the primary input, and meet the existing tempo of the work.
Why two bots
The first version of the design used a single WhatsApp bot for everything. Salespeople would send voice notes for capture. Managers would send text questions for analytics. The bot would route based on the sender.
We scrapped this in week one for two reasons. The first was conversational. Salespeople sometimes asked the bot questions like "show me the leads I logged at Auto Expo." That should have been a manager-style analytics query, but in the single-bot model it got tangled with capture state. The bot would sometimes try to extract lead fields from the question itself.
The second reason was permission. The owner wanted managers to see all leads across the team. Salespeople should only see their own. Building two bots with different identities and different scopes turned out to be cleaner than building permission logic inside one bot.
So the system runs two separate Twilio WhatsApp numbers. Each has its own bot personality, its own command set, its own database scope. They share the same backend application and the same Postgres database, but the conversational surface is split.
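A minimal sketch of how one backend could serve both numbers, assuming the webhook tells the bots apart by the `To` field of the inbound Twilio message. The phone numbers, `BotKind`, `resolveBot`, and `leadScope` are hypothetical names for illustration, not the production code.

```typescript
// Hypothetical sketch: one webhook handler, two WhatsApp numbers.
// The phone numbers below are placeholders, not the client's real numbers.
type BotKind = "employee" | "manager";

const BOT_NUMBERS: Record<string, BotKind> = {
  "whatsapp:+910000000001": "employee", // capture bot (placeholder)
  "whatsapp:+910000000002": "manager", // analytics bot (placeholder)
};

// Twilio's webhook payload carries the destination in `To` and the sender in `From`.
interface InboundMessage {
  To: string;
  From: string;
  Body?: string;
}

// Pick the bot from the number the message was sent to; null means unknown number.
function resolveBot(msg: InboundMessage): BotKind | null {
  const kind = BOT_NUMBERS[msg.To];
  return kind === undefined ? null : kind;
}

// Each bot gets its own data scope: salespeople see only their own leads,
// managers get an unfiltered view.
function leadScope(kind: BotKind, senderId: string): { capturerId?: string } {
  return kind === "employee" ? { capturerId: senderId } : {};
}
```

Keeping the scope decision in one small function means the permission rule lives next to the routing rule rather than being scattered through query code.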
The Employee Bot
This is the capture bot. The flow looks like this.
A salesperson sends a voice note. Most are between fifteen and forty-five seconds. The bot acknowledges receipt within a second or two so the salesperson knows it landed, then runs the audio through transcription. We use Sarvam AI's speech-to-text for this. It handles Indian English and Hinglish noticeably better than Whisper for our use case, and at ₹1 per minute it sits at a tenth the cost of OpenAI's transcription tier.
Once we have the transcript, a Gemini 3.1 Flash Lite call extracts structured fields. The schema is narrow on purpose: company name, contact person, product interest, monthly volume estimate, current supplier, budget range, follow-up date, free-text notes. Nothing else. We tried a wider schema in the first prototype with fields for designation, location, decision-maker status, and pain points. The model would hallucinate three or four of those fields whenever the salesperson did not mention them. Cutting the schema down made the extraction reliable.
The bot sends back a formatted lead card as a WhatsApp message. Each field shows the extracted value, and there are interactive buttons at the bottom: Confirm, Edit, Discard. If the salesperson taps Edit, the bot walks through fields one at a time and accepts either text or another voice note as the correction. If they tap Confirm, the lead writes to Postgres and the salesperson gets a one-line confirmation with the lead ID.
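The Confirm, Edit, Discard flow can be read as a small state machine. The sketch below is one way to model it; the state and event names are hypothetical, and it assumes the card is shown again for review once edits are finished, which the text does not spell out.

```typescript
// Hypothetical state machine for one lead card. Names are illustrative.
type CaptureState = "awaiting_review" | "editing" | "confirmed" | "discarded";
type CaptureEvent = "confirm" | "edit" | "discard" | "field_corrected" | "edits_done";

function nextState(state: CaptureState, ev: CaptureEvent): CaptureState {
  switch (state) {
    case "awaiting_review":
      if (ev === "confirm") return "confirmed"; // lead is written to Postgres here
      if (ev === "edit") return "editing";
      if (ev === "discard") return "discarded";
      return state;
    case "editing":
      // Each correction (text or voice) keeps the bot in the edit walk-through.
      if (ev === "field_corrected") return "editing";
      // Assumption: after the last field, the updated card goes back for review.
      if (ev === "edits_done") return "awaiting_review";
      return state;
    default:
      return state; // confirmed and discarded are terminal
  }
}
```

Ignoring events that do not apply to the current state also covers the single-bot problem described earlier: a stray message cannot push a card into a state it should not reach.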
The whole interaction takes two to three minutes from voice note to confirmed lead. The Zoho form, when it got filled out at all, took ten to fifteen.
A typical voice note: "Met Rajesh Kulkarni at Gupta Industries in Chakan, they need around five hundred units of bearing housings per month, they are buying from SKF right now, budget is roughly two lakh per month, want to follow up next Tuesday." The extraction returns company "Gupta Industries", contact "Rajesh Kulkarni", product "bearing housings", volume "500 units/month", current supplier "SKF", budget "₹2,00,000/month", follow-up "next Tuesday". The free-text notes field gets the literal transcript so nothing is lost.
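To make the narrow schema concrete, here is a sketch of the eight fields as a TypeScript type, plus a post-processing step that keeps only known fields and drops anything else the model returns. The field names and `normalizeExtraction` are assumptions for illustration; the text does not describe how the production code validates the model output.

```typescript
// The eight fields named in the text. Every extracted value is nullable so
// "not mentioned" has a representation other than an invented value.
interface ExtractedLead {
  company: string | null;
  contact: string | null;
  productInterest: string | null;
  monthlyVolume: string | null;
  currentSupplier: string | null;
  budgetRange: string | null;
  followUp: string | null;
  notes: string; // always the literal transcript
}

const LEAD_FIELDS = [
  "company", "contact", "productInterest", "monthlyVolume",
  "currentSupplier", "budgetRange", "followUp",
] as const;

// Hypothetical guard: copy only known fields with non-empty string values.
// Unknown keys (for example a hallucinated "designation") are ignored.
function normalizeExtraction(raw: Record<string, unknown>, transcript: string): ExtractedLead {
  const lead: ExtractedLead = {
    company: null, contact: null, productInterest: null, monthlyVolume: null,
    currentSupplier: null, budgetRange: null, followUp: null, notes: transcript,
  };
  for (const field of LEAD_FIELDS) {
    const value = raw[field];
    if (typeof value === "string" && value.trim() !== "") {
      lead[field] = value.trim();
    }
  }
  return lead;
}
```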
The Manager Bot
This is the analytics bot. Different number, different conversational surface. Three managers have access: the owner, the regional sales head, and the operations head.
The interaction is text in, formatted text out. A manager types "show me leads from last week" or "how many leads has Amit captured this month" or "which leads from Pune have not been followed up yet." The bot parses the intent, runs the corresponding Postgres query through Prisma, and returns a formatted summary with lead IDs that link back to the Google Sheet view.
We did not build a full natural-language-to-SQL layer for this. The query patterns we observed in week one were narrow. There were maybe twelve real questions managers wanted to ask, and they were variations of three or four templates: by salesperson, by date range, by territory, by status. We hand-wrote intent classifiers for those templates and fall back to a Gemini call for anything off-template. The off-template path triggers maybe twice a week.
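A sketch of what template matching with a fallback might look like, using the three example questions quoted earlier. The patterns, the territory list, and the `LeadQuery` shape are hypothetical; a `null` result stands for "hand this to the Gemini fallback".

```typescript
// Hypothetical filter object built from the four template dimensions.
interface LeadQuery {
  salesperson?: string;
  territory?: string;
  range?: "last_week" | "this_month";
  status?: "not_followed_up";
}

// Placeholder territory list; the real one would come from configuration.
const TERRITORIES = ["pune", "vadodara"];

// Returns a query when at least one template matches, otherwise null so the
// caller can fall back to an LLM call for off-template questions.
function parseManagerQuery(text: string): LeadQuery | null {
  const q = text.toLowerCase();
  const query: LeadQuery = {};

  const who = q.match(/leads has (\w+) captured/);
  if (who) query.salesperson = who[1];

  for (const t of TERRITORIES) {
    if (q.indexOf("from " + t) !== -1) query.territory = t;
  }
  if (q.indexOf("last week") !== -1) query.range = "last_week";
  if (q.indexOf("this month") !== -1) query.range = "this_month";
  if (q.indexOf("not been followed up") !== -1) query.status = "not_followed_up";

  return Object.keys(query).length > 0 ? query : null;
}
```

Returning a plain filter object rather than SQL keeps the query construction in Prisma, where the per-bot scope can be applied in one place.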
The bot also pushes scheduled messages. A daily digest at 9 AM with yesterday's captures grouped by salesperson. A weekly performance summary on Monday morning with capture counts, conversion rates from lead to follow-up, and territories with no activity in the past seven days. The owner reads the daily digest in bed before getting up. He told us this two months in.
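As an illustration of the daily digest, the sketch below groups a day's captures by salesperson and renders plain text suitable for a WhatsApp message. The `CapturedLead` shape and the layout are assumptions; the actual digest format is not shown in this write-up.

```typescript
// Hypothetical minimal lead shape for the digest.
interface CapturedLead {
  id: number;
  salesperson: string;
  company: string;
}

// Group leads by salesperson and render one block per person, sorted by name.
function formatDailyDigest(dateLabel: string, leads: CapturedLead[]): string {
  const bySalesperson: Record<string, CapturedLead[]> = {};
  for (const lead of leads) {
    (bySalesperson[lead.salesperson] = bySalesperson[lead.salesperson] || []).push(lead);
  }
  const lines = ["Captures for " + dateLabel + ": " + leads.length];
  for (const person of Object.keys(bySalesperson).sort()) {
    const items = bySalesperson[person];
    lines.push(person + " (" + items.length + ")");
    for (const item of items) {
      lines.push("  #" + item.id + " " + item.company);
    }
  }
  return lines.join("\n");
}
```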
The interesting thing about the manager bot is what it replaced. The owner used to ask the regional head for these numbers over WhatsApp every other day. The regional head would open Zoho on his laptop, run a few filters, screenshot the result, and send it back. That round trip took fifteen to twenty minutes and depended on the regional head being at a desk. Now the owner just asks the bot.
Storage and the Sheets mirror
Lead data lives in Postgres, accessed through Prisma. That is the source of truth. Every confirmed lead also writes to a Google Sheet within a few seconds, via the Google Sheets API.
The Sheets mirror exists for one reason: managers in this business live in spreadsheets. Asking them to log into a database tool, even a friendly one like Retool or a Zoho dashboard, would have been a non-starter. They wanted to sort leads by volume, filter by territory, share specific rows with the procurement team, copy a column into an email. All the things a Google Sheet does without anyone needing to learn anything.
The sync is one-way. The bot writes to both Postgres and Sheets when a lead gets confirmed. Edits to the Sheet do not flow back. We considered making it bidirectional and decided against it. The owner sometimes deletes rows from the Sheet to clean up his view, and we did not want those deletions cascading into the actual database. If a lead needs to be marked stale or wrong, that happens through the manager bot with a "discard lead 4823" command.
The Prisma schema is small. One Lead table with the extracted fields plus metadata (capturer ID, captured-at timestamp, status, edit history as JSONB). One Employee table for the phone whitelist. One AuditLog table that records every bot interaction for debugging. That is the whole data model.
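For readers who want a picture of the Lead record and the one-way mirror, here is a hedged TypeScript view of the table and a function that flattens a lead into a spreadsheet row. The column names, the status values, and `toSheetRow` are assumptions; the Google Sheets API accepts appended rows as arrays of cell values, which is the shape produced here.

```typescript
// Hypothetical shape of one edit-history entry stored in the JSONB column.
interface EditEntry {
  field: string;
  previous: string | null;
  next: string | null;
  at: string; // ISO timestamp
}

// Hypothetical TypeScript view of the Lead table; names are illustrative.
interface LeadRow {
  id: number;
  capturerId: string;
  capturedAt: string; // ISO timestamp
  status: "confirmed" | "discarded";
  company: string | null;
  contact: string | null;
  productInterest: string | null;
  monthlyVolume: string | null;
  currentSupplier: string | null;
  budgetRange: string | null;
  followUp: string | null;
  notes: string;
  editHistory: EditEntry[];
}

// Flatten a lead into one spreadsheet row. Edit history stays in Postgres only,
// and nothing is ever read back from the Sheet.
function toSheetRow(lead: LeadRow): string[] {
  const cell = (v: string | null): string => (v === null ? "" : v);
  return [
    String(lead.id), lead.capturedAt, lead.capturerId, lead.status,
    cell(lead.company), cell(lead.contact), cell(lead.productInterest),
    cell(lead.monthlyVolume), cell(lead.currentSupplier), cell(lead.budgetRange),
    cell(lead.followUp), lead.notes,
  ];
}
```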
Authentication
Phone number based. Each bot has a whitelist of authorized numbers stored in the Employee table. The Twilio webhook checks the From field on every incoming message. If the number is not on the whitelist for that bot, the message gets a polite rejection: "This number is not authorized to use this bot. Please contact your manager."
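The whitelist check can be sketched as a pure function over the Employee rows. Twilio delivers WhatsApp senders in the form `whatsapp:+<number>`, so the sketch strips that prefix before comparing. The `Employee` shape with an `allowedBots` list and the `authorize` helper are hypothetical; the rejection text is the one quoted above.

```typescript
type Bot = "employee" | "manager";

// Hypothetical view of an Employee row: a phone number plus the bots it may use.
interface Employee {
  phone: string; // E.164, for example "+919999999999"
  allowedBots: Bot[];
}

const REJECTION =
  "This number is not authorized to use this bot. Please contact your manager.";

// Strip the channel prefix Twilio adds to WhatsApp addresses.
function normalizeSender(from: string): string {
  return from.replace(/^whatsapp:/, "").trim();
}

// Check the sender against the whitelist for the bot that received the message.
function authorize(
  from: string,
  bot: Bot,
  employees: Employee[]
): { ok: true; employee: Employee } | { ok: false; reply: string } {
  const phone = normalizeSender(from);
  const match = employees.filter(
    (e) => e.phone === phone && e.allowedBots.indexOf(bot) !== -1
  )[0];
  return match ? { ok: true, employee: match } : { ok: false, reply: REJECTION };
}
```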
No passwords. No magic links. No login flow. The phone number is the identity, and the WhatsApp account on that phone acts as the second factor, because WhatsApp verifies ownership of the number when the account is registered.
For a team of ten this is fine. For a team of two hundred it would not be. The friction of getting added to the whitelist is a manager messaging us with the new salesperson's number, which works at this scale and would not at four times the size. We have a TODO to build a self-service onboarding flow if the client expands the system to their distributor partners, which they have started talking about.
Response time
We targeted sub-three-second response from voice note received to lead card sent back. Most of the time we hit it. The breakdown:
Twilio webhook fires within 200 to 400 milliseconds of the message being sent. Sarvam transcription on a thirty-second clip takes 800 to 1500 milliseconds. The Gemini extraction call adds another 600 to 900 milliseconds. Database write and WhatsApp send for the lead card add another 200 to 400 milliseconds. Total is usually 1.8 to 3.2 seconds.
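The stage figures can be added up as a simple latency budget. This is only the arithmetic from the paragraph above, with hypothetical names (`Stage`, `totalBudget`); it shows why the worst case lands slightly over the three-second target.

```typescript
interface Stage {
  label: string;
  minMs: number;
  maxMs: number;
}

// Stage ranges as quoted in the text.
const STAGES: Stage[] = [
  { label: "Twilio webhook", minMs: 200, maxMs: 400 },
  { label: "Sarvam transcription (30 s clip)", minMs: 800, maxMs: 1500 },
  { label: "Gemini extraction", minMs: 600, maxMs: 900 },
  { label: "DB write + WhatsApp send", minMs: 200, maxMs: 400 },
];

// Sum the best-case and worst-case latency across all stages.
function totalBudget(stages: Stage[]): { minMs: number; maxMs: number } {
  return stages.reduce(
    (acc, s) => ({ minMs: acc.minMs + s.minMs, maxMs: acc.maxMs + s.maxMs }),
    { minMs: 0, maxMs: 0 }
  );
}

const latencyBudget = totalBudget(STAGES); // { minMs: 1800, maxMs: 3200 }
const TARGET_MS = 3000;
const worstCaseOverTarget = latencyBudget.maxMs > TARGET_MS; // true: 3200 > 3000
```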
The reason this matters is field tempo. If the bot takes ten seconds to respond, the salesperson has already pocketed the phone and started walking to the next meeting. They will not come back to confirm. The lead either gets lost or sits in an unconfirmed state that pollutes the inbox. Sub-three is the threshold where the confirmation happens in the same micro-context as the voice note.
We hit a wall once when Sarvam had a regional outage in November. Latency went up to twelve seconds for about forty minutes. Confirmation rate dropped from 94 percent to 61 percent in that window. Once latency recovered, confirmation rate snapped back. That was the empirical proof that the latency target was not a vanity metric.
What it costs
This is the part that surprised us.
₹151/month
Total infrastructure cost for the ten-person team in March
The breakdown for that month:
Twilio WhatsApp messaging was ₹78. Each voice-note-to-confirmed-lead interaction averages four messages round trip (the voice note in, the lead card back, the confirm tap, the confirmation reply). At the team's volume of around 280 confirmed leads in March, that is roughly 1,120 messages plus the manager bot traffic.
Sarvam transcription was ₹34. About 34 minutes of audio processed across the month.
Gemini 3.1 Flash Lite extraction and intent parsing was ₹19. Lite tier on the read-heavy workload is genuinely that cheap.
Postgres on Railway was ₹0. We are well inside the free tier.
The Railway application server itself runs at ₹420 per month flat, but it is shared across this client and two others on the same Node.js codebase, so the allocated cost is around ₹140. We leave it out of the breakdown above because the app server would exist for the other clients regardless.
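Three of the figures above are straightforward arithmetic, reproduced here as a sketch. The variable names are ours; the inputs are the numbers quoted in the text.

```typescript
// Capture traffic: four messages per confirmed lead.
const confirmedLeads = 280;
const messagesPerLead = 4; // voice note in, lead card back, confirm tap, confirmation reply
const captureMessages = confirmedLeads * messagesPerLead; // 1120

// Transcription: minutes of audio times the per-minute rate.
const transcriptionMinutes = 34;
const sarvamRateInrPerMinute = 1;
const transcriptionInr = transcriptionMinutes * sarvamRateInrPerMinute; // 34

// App server: flat monthly price split evenly across three clients.
const appServerMonthlyInr = 420;
const clientsSharingServer = 3; // this client plus two others
const allocatedAppServerInr = appServerMonthlyInr / clientsSharingServer; // 140
```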
The reason this is so cheap is not clever optimization. It is that the messaging and AI tiers used here are priced for volume, and a ten-person manufacturing sales team generates very little volume by those tiers' standards. The system would still cost under ₹500 per month at four times the team size.
10 to 15 minutes per lead via Zoho mobile form, captured for maybe half of visits
2 to 3 minutes per lead via voice note with confirmation, captured for 90+ percent of visits
What broke and what we are still figuring out
Three things broke in the first month that are worth naming.
The first was the lead card formatting. WhatsApp's interactive button messages have a hard limit on character count for the body text. Our initial card template overflowed for leads with long company names plus long notes fields. The bot would just fail silently. We caught it because one salesperson messaged the regional head saying "the bot is ignoring me." Fix was to truncate notes in the card display while keeping the full notes in the database, with a "show full notes" button if needed.
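A sketch of the truncation fix, assuming the body limit is passed in as a parameter. The helper names are hypothetical, and the sketch assumes the extracted fields on their own fit within the limit; the right limit depends on the message type being sent.

```typescript
// Clip text to at most maxChars characters, ending with an ellipsis when cut.
function clip(text: string, maxChars: number): { text: string; truncated: boolean } {
  if (text.length <= maxChars) return { text: text, truncated: false };
  if (maxChars <= 0) return { text: "", truncated: true };
  const ellipsis = "…";
  const head = text.slice(0, maxChars - ellipsis.length).replace(/\s+$/, "");
  return { text: head + ellipsis, truncated: true };
}

// Build the card body from the field lines plus as much of the notes as fits.
// The full notes stay in the database; the flag tells the caller whether to
// attach a "show full notes" button.
function buildCardBody(
  fieldLines: string[],
  notes: string,
  bodyLimit: number
): { body: string; showFullNotesButton: boolean } {
  const prefix = fieldLines.join("\n") + "\nNotes: ";
  const room = Math.max(0, bodyLimit - prefix.length);
  const clipped = clip(notes, room);
  return { body: prefix + clipped.text, showFullNotesButton: clipped.truncated };
}
```

Checking the length before sending, rather than relying on the API to reject the message, also removes the silent-failure mode described above.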
The second was Hinglish numbers. The phrase "do lakh" (two lakh, or 200,000) was getting transcribed correctly but extracted as "2" by Gemini, dropping the lakh suffix. We added few-shot examples to the extraction prompt covering lakh, crore, hazaar, and the common spoken patterns. Accuracy on numeric fields jumped from 78 to 96 percent.
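As an illustration of the few-shot approach, the sketch below builds a block of number examples that could be appended to an extraction prompt. The specific examples and wording are ours, not the production prompt.

```typescript
interface NumberExample {
  spoken: string;
  value: number;
}

// Illustrative examples only: lakh = 100,000, crore = 10,000,000, hazaar = 1,000.
const NUMBER_EXAMPLES: NumberExample[] = [
  { spoken: "do lakh", value: 200000 },
  { spoken: "dedh lakh", value: 150000 },
  { spoken: "paanch hazaar", value: 5000 },
  { spoken: "ek crore", value: 10000000 },
];

// Render the examples as a plain-text block for inclusion in a prompt.
function buildNumberGuidance(examples: NumberExample[]): string {
  const lines = examples.map((ex) => '"' + ex.spoken + '" -> ' + ex.value);
  return ["Convert spoken Indian number words to plain digits. Examples:"]
    .concat(lines)
    .join("\n");
}
```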
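The third was timezones. The follow-up date extraction would sometimes resolve "next Tuesday" against the UTC date instead of the IST date. IST runs five and a half hours ahead of UTC, so the two calendars disagree for part of every day and the resolved follow-up could land on the wrong date. We pinned all date arithmetic to Asia/Kolkata in the extraction layer.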
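A sketch of resolving a relative weekday against the IST calendar rather than the UTC one. It relies on IST being a fixed UTC+5:30 offset with no daylight saving, which is a simplification of looking up Asia/Kolkata in a timezone library. Reading "next Tuesday" as the first Tuesday strictly after today is also an assumption, and the function names are hypothetical.

```typescript
// IST is a fixed offset of +5:30 from UTC with no daylight saving time.
const IST_OFFSET_MS = (5 * 60 + 30) * 60 * 1000;

// Calendar date and weekday (0 = Sunday) in IST for a given instant.
function istCalendar(instant: Date): { y: number; m: number; d: number; weekday: number } {
  const shifted = new Date(instant.getTime() + IST_OFFSET_MS);
  return {
    y: shifted.getUTCFullYear(),
    m: shifted.getUTCMonth(),
    d: shifted.getUTCDate(),
    weekday: shifted.getUTCDay(),
  };
}

// Resolve "next <weekday>" to the first such day strictly after today's IST
// date, returned as YYYY-MM-DD.
function nextWeekdayIST(instant: Date, targetWeekday: number): string {
  const today = istCalendar(instant);
  const delta = ((targetWeekday - today.weekday + 6) % 7) + 1; // always 1..7
  const result = new Date(Date.UTC(today.y, today.m, today.d + delta));
  return result.toISOString().slice(0, 10);
}
```

For example, 20:00 UTC on Sunday 1 March 2026 is already 01:30 on Monday in IST, so "next Monday" resolves to 9 March when pinned to IST, where UTC-based arithmetic would have produced 2 March.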
The thing we are still figuring out is the daily digest. Right now it is static. Same format every day, regardless of what happened. A smarter version would surface anomalies: Amit captured 11 leads yesterday against his usual three, or no leads from Vadodara in five days, or three leads from the same company in a week which suggests a cluster worth a manager's attention. We have a draft of that but have not shipped it yet because we want a few more months of baseline data before deciding what counts as anomalous.
The other open question is what happens when a lead gets followed up. Right now the system captures cleanly, but follow-up tracking still happens in Zoho. Salespeople log a call note in Zoho when they do the Tuesday follow-up. Closing the loop, having the bot prompt for follow-up outcomes via voice note, would extend the same friction reduction to the next stage. We have not built it because the owner wants to see six months of capture data first before changing the follow-up process.
What we would do differently
Two things, in hindsight.
We would have built the manager bot second, not in parallel. We launched both bots in week three together, and the manager bot got almost no use for the first month because the database did not have enough leads in it to be interesting. Building the manager bot after a month of capture data would have let us see which queries actually mattered, instead of guessing.
We would have spent less time on the Sheets sync and more on the audit log. The Sheets sync was three days of work and gets used heavily. The audit log was half a day of work and has saved us probably ten times that in debugging the three issues above. We underbuilt the thing that mattered for operations and overbuilt the thing that mattered for adoption. Both turned out fine, but the time allocation was wrong.
The voice-first capture is the part of the system we are most confident about. It works because it matches what salespeople were already doing. The dual-bot architecture is the part we are least confident about. It works at this team size and would need rethinking if the team doubles. The cost structure is the part that genuinely surprised us, and the part that makes this kind of system viable for businesses that would never authorize a six-figure CRM overhaul.
Stack: Twilio WhatsApp Business API, Sarvam AI for speech-to-text, Gemini 3.1 Flash Lite for extraction and intent, Prisma over Postgres on Railway, Google Sheets API for the manager mirror. Deployed February 2026. Ten-person sales team across Maharashtra and Gujarat.