The Real Cost of AI Voice Agents in India
We modeled three production voice stacks for Indian SMEs. The per-minute math, the telecaller crossover point, and where the real money actually goes.
Five invoices for one phone call
We spent the last quarter modeling AI voice agent costs for three clients in different industries. Every one of them asked the same question: "What will it actually cost to automate our calls?" The answer we kept arriving at was: more than the headline number, less than you think at scale, and the pricing is structured in a way that makes comparison genuinely difficult.
A voice agent is not a single product. It is a stack of four to five billable layers, each charged differently, by different vendors, in different currencies. The confusion is not accidental. When a platform advertises "$0.05 per minute," that number covers only the orchestration fee. The speech recognition, the language model, the voice synthesis, and the phone line are all separate charges. We found that the gap between the advertised price and the fully loaded cost runs anywhere from 3x to 6x.
This post breaks down what we found when we modeled real costs for Indian businesses. We are using April 2026 pricing throughout, and converting to rupees where the vendor bills in dollars. The goal is a reference you can use to build your own estimates, not a recommendation for any specific vendor.
This analysis focuses on outbound and inbound voice for structured business calls: payment reminders, order confirmations, delivery status checks. For conversational AI over WhatsApp (which many of our clients actually prefer), we wrote about the Hinglish processing challenges in our WhatsApp bot post.
The five layers and what each one costs
We broke the stack into five cost layers. Here is what we found for each, with India-specific options where they exist.
Layer 1: Speech-to-Text (STT)
This converts the caller's voice into text that the language model can process. Costs range from about ₹0.25/min to ₹2/min depending on language support and accuracy.
Sarvam AI's STT runs at ₹0.50/min (₹30/hour). It handles Hindi and Hindi-English code-switching significantly better than global models like AssemblyAI or Google's standard tier. Google's Enhanced model sits around ₹2/min but offers broader language coverage. For most Indian B2B calls, which are conducted in a Hindi-English mix, Sarvam was the best price-performance option we found.
Layer 2: The language model
This is the "brain" that reads the transcribed text, decides what to say, and generates a response. It is also, counterintuitively, the cheapest layer.
A three-minute call using DeepSeek V3 costs roughly ₹0.02 in inference. GPT-4o mini is under ₹0.50 for the same call. Even if you use a frontier model, the LLM cost per call stays under ₹2. The model is not where your money goes, which surprises most people we talk to. The cost center is everything around the model.
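To put those per-call figures in monthly terms, here is a minimal sketch that scales them to a volume of 50 calls per day. The 30-day month and the function name are our assumptions. The GPT-4o mini and frontier values are the upper bounds quoted above, not measured costs.

```python
# Per-call LLM cost in INR for a three-minute call, taken from the figures above.
# The second and third entries are upper bounds ("under ₹0.50", "under ₹2").
LLM_COST_PER_CALL_INR = {
    "DeepSeek V3": 0.02,
    "GPT-4o mini (upper bound)": 0.50,
    "Frontier model (upper bound)": 2.00,
}

def monthly_llm_cost(cost_per_call_inr, calls_per_day=50, days_per_month=30):
    """Scale a per-call inference cost to a monthly figure (30-day month assumed)."""
    return cost_per_call_inr * calls_per_day * days_per_month

for model, per_call in LLM_COST_PER_CALL_INR.items():
    print(f"{model}: ₹{monthly_llm_cost(per_call):,.0f}/month")
```

At this volume the LLM line item runs from about ₹30/month to a ceiling of ₹3,000/month.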
Layer 3: Text-to-Speech (TTS)
This turns the model's text response back into spoken audio. This is where costs diverge the most across vendors.
Google's standard voices run about ₹0.33/min. Their WaveNet voices (noticeably more natural) cost ₹1.33/min. ElevenLabs, which produces the most lifelike English output, charges $0.12 to $0.18 per minute depending on your plan, which works out to roughly ₹10-15/min. That is a 30x to 45x spread between the cheapest option and the most natural-sounding one.
For Indian languages specifically, Sarvam AI's Bulbul v3 is priced at ₹30 per 10,000 characters. In practice, a typical three-minute agent response runs about 600-800 words, which translates to roughly ₹1.20-1.50/min. It supports 11 Indian languages natively.
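Character-based pricing is easier to compare once it is converted to a per-minute rate. The sketch below does that conversion for the ₹30 per 10,000 characters price. The function name and the characters-per-minute inputs are ours, picked to show which billed volumes correspond to the ₹1.20-1.50/min range. They are not measurements, so check your own transcripts before relying on them.

```python
def tts_cost_per_minute(chars_per_minute, price_per_10k_chars_inr=30.0):
    """Convert character-based TTS pricing into INR per minute of call time."""
    return chars_per_minute * price_per_10k_chars_inr / 10_000

# Illustrative inputs: billed characters synthesized per minute of call.
for chars in (400, 500):
    print(f"{chars} chars/min -> ₹{tts_cost_per_minute(chars):.2f}/min")
```

The billed volume depends on how much of each minute the agent spends speaking, so a chattier script moves this number more than any vendor discount will.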
Layer 4: Telephony
The actual phone line. This cost is easy to overlook and hard to reduce.
Plivo (founded in India, bills in INR) charges approximately ₹0.80-1.00/min for outbound calls to Indian mobile numbers. Exotel, another India-native option, falls in the same ₹0.80-1.00/min range. Twilio charges about ₹0.65/min for outbound calls to landlines, but calls to mobile numbers, combined with USD billing and its forex fluctuation, push the effective cost to around ₹1.20-1.50/min.
At 50 three-minute calls per day (about 4,500 minutes a month), the telephony layer alone costs ₹3,600-6,750/month. It is a floor that no amount of model optimization will lower.
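The monthly telephony figure follows directly from the per-minute rates. A minimal sketch, assuming three-minute calls and a 30-day month (the baseline this post uses for its stack comparison). The function name is ours.

```python
def telephony_monthly_inr(rate_per_min, calls_per_day=50, minutes_per_call=3, days=30):
    """Monthly telephony spend for a given per-minute carrier rate."""
    return rate_per_min * calls_per_day * minutes_per_call * days

cheapest = telephony_monthly_inr(0.80)   # low end of the Plivo/Exotel range
priciest = telephony_monthly_inr(1.50)   # high end of the effective Twilio range
print(f"₹{cheapest:,.0f} to ₹{priciest:,.0f} per month")
```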
Layer 5: Orchestration
The platform that ties STT, LLM, TTS, and telephony into a single API. This is where the pricing gets misleading.
Vapi advertises $0.05/min for their platform fee. But that is just the orchestration charge. When you add their default STT, LLM, TTS, and telephony providers, the fully loaded cost lands between $0.23 and $0.33/min (₹19-28/min). That is a 4-6x gap between the number on the landing page and what shows up on your invoice.
4-6x
gap between advertised and fully-loaded voice agent pricing
Three stacks we modeled
We put together three realistic configurations and priced them for a baseline of 50 three-minute outbound calls per day (about 150 minutes of voice per day, 4,500 minutes per month).
Stack A: Budget India-native
Components: DeepSeek V3 + Sarvam STT + Sarvam Bulbul TTS + Plivo telephony, self-orchestrated.
Per-minute cost: ₹2.50-3.50/min
Monthly at 50 calls/day: ₹11,250-15,750
This is the lowest-cost stack we could assemble that still handles Hindi-English code-switching reliably. The tradeoff is that you are building your own orchestration layer, which means engineering time upfront. We wrote about a similar build-vs-buy tradeoff in our harnesses post; the pattern of wrapping vendor APIs in your own control logic applies here directly.
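As a sanity check, the Stack A rate can be rebuilt from the layer prices quoted earlier. This is a sketch under stated assumptions: the dictionary keys are our own labels, the LLM cost is spread evenly over a three-minute call, and self-orchestration is counted as zero platform fee with engineering time excluded.

```python
# Per-minute cost ranges in INR (low, high) for each Stack A layer.
STACK_A_LAYERS = {
    "stt_sarvam": (0.50, 0.50),
    "llm_deepseek_v3": (0.02 / 3, 0.02 / 3),    # ₹0.02 per three-minute call
    "tts_sarvam_bulbul": (1.20, 1.50),
    "telephony_plivo": (0.80, 1.00),
    "orchestration_self_hosted": (0.0, 0.0),    # assumption: no platform fee
}

def stack_rate(layers):
    """Sum the low ends and the high ends of every layer's range."""
    low = sum(lo for lo, _ in layers.values())
    high = sum(hi for _, hi in layers.values())
    return low, high

low, high = stack_rate(STACK_A_LAYERS)
print(f"Stack A: ₹{low:.2f} to ₹{high:.2f} per minute")
```

The layers sum to about ₹2.51-3.01/min, which sits inside the ₹2.50-3.50 range quoted for this stack.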
Stack B: Mid-market with managed platform
Components: GPT-4o mini + Deepgram STT + Google WaveNet TTS + Retell orchestration + Twilio telephony.
Per-minute cost: ₹10-13/min
Monthly at 50 calls/day: ₹45,000-58,500
Better voice quality and lower engineering overhead, but roughly 4x the cost of Stack A. This makes sense when voice naturalness directly affects conversion rates (sales calls, appointment booking) and you have the budget to justify it.
Stack C: All-in-one India platform
Components: Sarvam Samvaad (all-inclusive voice agent platform).
Per-minute cost: ~₹1/min
Monthly at 50 calls/day: ~₹4,500
The cheapest option if your calls are primarily in Hindi or one of the 10 other Indian languages Sarvam supports. The per-minute rate includes STT, LLM, TTS, and telephony bundled together. The constraint is that you are locked into their stack, their voices, and their language model choices.
For comparison, Bolna, another India-focused voice AI platform (recently raised $6.3M from General Catalyst), comes in around ₹4-5/min fully loaded.
| Stack | Per-minute cost | Monthly (50 calls/day) | Best for |
|---|---|---|---|
| A: Budget India-native | ₹2.50-3.50 | ₹11,250-15,750 | Hindi-English calls, engineering team available |
| B: Mid-market managed | ₹10-13 | ₹45,000-58,500 | English-primary, voice quality matters |
| C: Sarvam all-in-one | ~₹1 | ~₹4,500 | Hindi/Indian language calls, minimal setup |
| Bolna all-in-one | ₹4-5 | ₹18,000-22,500 | Indian languages, mid-range budget |
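The monthly column is just the per-minute rate multiplied by the baseline volume. A small sketch for re-pricing the table at your own volume. The names are ours and the 30-day month is an assumption.

```python
MONTHLY_MINUTES = 50 * 3 * 30  # 50 calls/day x 3 min/call x 30 days

# Per-minute cost ranges in INR (low, high), copied from the table above.
PER_MINUTE_INR = {
    "A: Budget India-native": (2.50, 3.50),
    "B: Mid-market managed": (10.00, 13.00),
    "C: Sarvam all-in-one": (1.00, 1.00),
    "Bolna all-in-one": (4.00, 5.00),
}

def monthly_cost(rate_range, minutes=MONTHLY_MINUTES):
    """Convert a (low, high) per-minute range into a (low, high) monthly range."""
    low, high = rate_range
    return round(low * minutes), round(high * minutes)

for stack, rates in PER_MINUTE_INR.items():
    low, high = monthly_cost(rates)
    print(f"{stack}: ₹{low:,} to ₹{high:,} per month")
```

Passing a different `minutes` value is the quickest way to re-price for your own call volume. Note that Python's `,` format groups digits in thousands, not in the lakh style used elsewhere in this post.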
The telecaller comparison
A telecaller in a Tier-2 Indian city costs ₹15,000-18,500/month on average. At first glance, that puts one person in the same range as Budget Stack A at 50 calls/day, so the savings look marginal. So why consider AI at all?
Because the economics change the moment you move past one person.
One telecaller handles 40-80 outbound calls per day on a single shift. They take leave, they have inconsistent days, and the Indian BPO sector sees 30-35% annual attrition even after recent improvements. Roughly one in three of your callers will leave within a year, and they take their training with them.
To go from 50 calls/day to 200, you need three to four people. To reach 500, you need eight to ten people plus a supervisor. At 1,000 calls/day, you are running a small call center: 20+ agents, a team lead, QA processes, a training pipeline, second shift coverage, possibly night shift at a 10-20% premium.
AI does not scale that way. Going from 50 to 500 to 5,000 calls/day, the per-minute cost stays flat. No hiring. No training pipeline. No attrition risk.
₹15,000-18,500
monthly telecaller salary in Tier-2 cities (2026)
Here is where the crossover happens:
| Daily calls | Telecaller cost/month | Budget AI (Stack A) | Sarvam (Stack C) |
|---|---|---|---|
| 50 | ₹15,000-18,500 (1 person) | ₹11,250-15,750 | ₹4,500 |
| 200 | ₹60,000-74,000 (4 people) | ₹45,000-63,000 | ₹18,000 |
| 500 | ₹1,80,000+ (10 people + supervisor) | ₹1,12,500-1,57,500 | ₹45,000 |
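At 50 calls/day, it is roughly a wash between Stack A and a telecaller. At 200 calls/day, Stack A ranges from about break-even to ₹29,000/month cheaper depending on which ends of the ranges you compare, and Stack C saves ₹42,000-56,000/month. At 500 calls/day, Stack A saves at least ₹22,500/month and the Stack C gap reaches at least ₹1.35 lakh. And this is before you account for consistency: the AI agent delivers the same quality on the 500th call as the first, at 11 PM the same as at 11 AM.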
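The comparison can be reproduced from the table's own inputs. The sketch below reports a worst case (cheapest telecallers against the most expensive AI configuration) and a best case (the reverse) for each volume. The function and variable names are ours. The 500-call telecaller figure is quoted as a floor, so it is entered as a single value and the resulting savings are lower bounds.

```python
MINUTES_PER_CALL = 3
DAYS_PER_MONTH = 30

# Telecaller monthly cost (low, high) in INR, copied from the table above.
# The 500-call row is a floor ("₹1,80,000+"), so low == high here.
TELECALLER_COST = {50: (15_000, 18_500), 200: (60_000, 74_000), 500: (180_000, 180_000)}

# Per-minute AI cost (low, high) in INR.
AI_RATE = {"Stack A": (2.50, 3.50), "Stack C": (1.00, 1.00)}

def savings_range(daily_calls, stack):
    """Return (worst_case, best_case) monthly savings of AI versus telecallers."""
    minutes = daily_calls * MINUTES_PER_CALL * DAYS_PER_MONTH
    ai_low, ai_high = (rate * minutes for rate in AI_RATE[stack])
    human_low, human_high = TELECALLER_COST[daily_calls]
    return human_low - ai_high, human_high - ai_low

for calls in TELECALLER_COST:
    for stack in AI_RATE:
        worst, best = savings_range(calls, stack)
        print(f"{calls} calls/day, {stack}: ₹{worst:,.0f} to ₹{best:,.0f}")
```

A negative worst case means the ranges overlap at that volume, so the answer depends on your actual salaries and your actual per-minute rate.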
Which calls to automate first
Not all calls are the same, and this matters more than the cost math.
From what we have observed across our deployments, roughly 60-70% of business calls are structured and repetitive. Payment reminders. Order confirmations. Delivery status checks. Appointment scheduling. Follow-up nudges. These follow a script. The customer's response falls into predictable patterns. AI handles these well today, consistently, without fatigue, across all hours.
The remaining 30-40% require judgment. A customer escalating a complaint about a wrong delivery. A supplier negotiating revised terms. A long-time client who needs someone to actually listen. These calls go off-script within the first minute, and current AI starts losing coherence after about three to five minutes of unstructured conversation. Not because of language limitations, but because the underlying reasoning struggles when it needs to track emotional context, recall relationship history, and make real-time judgment calls all at once.
The pragmatic starting point: automate the structured 60-70% and build an escalation path for the rest. Your human agents spend their time on calls that genuinely need human judgment, instead of reading the same payment reminder script 40 times a day. We have seen this hybrid pattern work well in our voice-first lead capture deployments.
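One way to express that escalation path is a small routing rule that runs at call start and again whenever the conversation state changes. This is a hypothetical sketch, not a description of any platform's API: the call-type labels, the `off_script` flag (which your own intent or complaint detection would have to set), and the 180-second threshold are all our assumptions.

```python
from dataclasses import dataclass

# Call types treated as structured and scripted (labels are illustrative).
STRUCTURED_CALL_TYPES = {
    "payment_reminder", "order_confirmation", "delivery_status",
    "appointment_scheduling", "follow_up",
}

# Hypothetical cutoff, loosely based on the three-to-five-minute observation above.
MAX_UNSTRUCTURED_SECONDS = 180

@dataclass
class CallState:
    call_type: str
    off_script: bool = False        # set by your own detection logic
    seconds_off_script: int = 0

def route(call: CallState) -> str:
    """Return "ai" or "human" for the current state of a call."""
    if call.call_type not in STRUCTURED_CALL_TYPES:
        return "human"
    if call.off_script and call.seconds_off_script >= MAX_UNSTRUCTURED_SECONDS:
        return "human"
    return "ai"

print(route(CallState("payment_reminder")))
print(route(CallState("payment_reminder", off_script=True, seconds_off_script=200)))
```

Keeping the rule this explicit makes it easy to audit which calls reached a human and why.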
The Hindi question (and the dialect question)
This is where the conversation gets specific for India.
Standard Hindi with English mixing (how most B2B conversations actually happen) works well today. Gemini's models handle Hindi-English code-switching natively, without a separate translation layer. Sarvam AI and Gnani.ai have been built specifically for Indian languages. The experience is not flawless. You will hit edge cases with technical terms, numbers spoken in mixed formats, and regional pronunciation variations. But for structured calls with predictable vocabulary, it is production-ready.
Pure regional dialects like Bhojpuri, Marwadi, Chhattisgarhi, and Haryanvi are a different story. AI4Bharat and IIT Madras have built IndicVoices, a dataset that has grown to 23,700 hours of speech across 400+ Indian districts covering all 22 scheduled languages. This is a substantial foundation, significantly larger than the 12,000-hour milestone from a couple of years ago. But production-ready dialect support for real-time voice conversations is probably still 18-24 months out.
The practical question worth asking: do your business calls actually need dialect support? Most B2B supplier and dealer conversations happen in standard Hindi or English, even when both parties speak a dialect at home. If your use case is B2C outreach in rural markets, dialect capability matters. For most SME operations calls, it does not. At least not yet.
Three gaps that still trip up production deployments
Beyond language, we have observed three issues that consistently cause problems in production voice agents.
Latency
If the agent takes more than 800 milliseconds to respond, the conversation feels unnatural. People start talking over the agent, or they hang up thinking the line dropped. Traditional cascaded pipelines (STT to LLM to TTS, processed sequentially) run 800ms to 2 seconds end-to-end. Newer native speech models, like Gemini 2.5's native audio mode, bring this down to sub-second response times by processing audio natively rather than converting to text first.
The cheapest stacks are still noticeably slower than a human conversation. If your use case is transactional (payment reminders), the slight delay is tolerable. If it is conversational (sales, support), latency is a dealbreaker at current budget-tier speeds.
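A simple latency budget makes the problem concrete: in a cascaded pipeline the stages run one after another, so their delays add up. The sketch below checks a set of stage timings against the 800 ms target. The stage names and millisecond values are hypothetical placeholders, so substitute timings from your own call traces.

```python
# Target from the text above: responses slower than 800 ms start to feel unnatural.
RESPONSE_BUDGET_MS = 800

def pipeline_latency_ms(stage_latencies_ms):
    """Total latency of a cascaded pipeline whose stages run sequentially."""
    return sum(stage_latencies_ms.values())

def over_budget(stage_latencies_ms, budget_ms=RESPONSE_BUDGET_MS):
    """True when the summed stage latency exceeds the response budget."""
    return pipeline_latency_ms(stage_latencies_ms) > budget_ms

# Hypothetical measurements for illustration only.
example_stages = {"stt": 300, "llm": 500, "tts": 250, "network": 150}
print(pipeline_latency_ms(example_stages), over_budget(example_stages))
```

Logging the per-stage numbers, not just the total, shows which layer to swap first when a stack misses the budget.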
Background noise
Speech recognition models achieve high accuracy in controlled environments and significantly lower accuracy in real-world conditions: factory floors, busy retail shops, someone calling from an auto-rickshaw. Voice activity detection triggers on background sounds, the transcription comes out garbled, and the conversation derails. This is an underrated problem that does not show up in demos but shows up immediately in production.
Long conversations
Anything beyond five minutes of unstructured back-and-forth sees a noticeable drop in quality. Context drifts, the agent starts repeating itself, hallucination risk increases. For a two-minute payment reminder, this is irrelevant. For a ten-minute customer complaint, it makes the agent unusable. This ceiling is improving with each model generation, but it is real today.
~5 min
practical ceiling for unstructured voice AI conversations (2026)
Where this is going
The trajectory is clear. Sarvam AI offering bundled voice agents at ₹1/min across 11 Indian languages would have seemed implausible two years ago. Bolna raising $6.3M from General Catalyst specifically for India-focused voice orchestration signals that investors see this market maturing. Gemini processing Hindi-English speech natively, without a separate translation pipeline, is a genuine capability shift.
The cost curve is dropping. The quality curve is rising. They have not crossed yet for complex calls, but for structured, high-volume calls, the crossover already happened.
For most Indian SMEs today, the right move is not "replace all your callers with AI" and it is not "wait until it is perfect." It is: pick your highest-volume, most scripted call type (probably payment reminders or order confirmations), run a pilot at 50 calls/day, measure the actual per-minute cost against your current spend, and scale from there as the technology catches up to your more complex call types.
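For the pilot itself, the measurement is simple: add up every layer's invoice for the month and divide by the minutes actually connected. A minimal sketch follows. The invoice amounts are made-up placeholders for a 4,500-minute month, and the function name is ours.

```python
def fully_loaded_rate(invoices_inr, billed_minutes):
    """Sum every layer's invoice and divide by connected minutes."""
    if billed_minutes <= 0:
        raise ValueError("billed_minutes must be positive")
    return sum(invoices_inr.values()) / billed_minutes

# Hypothetical pilot month: 50 calls/day x 3 min x 30 days = 4,500 minutes.
pilot_invoices = {
    "stt": 2_250,
    "llm": 30,
    "tts": 6_000,
    "telephony": 4_050,
    "orchestration": 0,
}
rate = fully_loaded_rate(pilot_invoices, 4_500)
print(f"Fully loaded: ₹{rate:.2f}/min")
```

Comparing this measured rate with the modeled one tells you which layer drifted from the estimate before you scale up.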
The math works today for the right calls. The question is which calls to start with.
Cost analysis based on published API pricing as of April 2026. All INR figures use approximate USD/INR conversion where vendors bill in dollars. Stack costs exclude engineering time for integration and ongoing maintenance. Sarvam AI, Bolna, Plivo, and Exotel pricing sourced from their respective pricing pages. Telecaller salary data from Glassdoor India.