Building a WhatsApp Bot That Understands Hinglish Orders
How we built a WhatsApp bot for FMCG orders that parses Hinglish, validates against five business rules, and runs at under ₹5,000 a month for 100 users.
The deployment
Wonderfresh sells home care products across South India. Lavender floor cleaner, rose dishwash liquid, lemon toilet cleaner, neem phenyl, a few other fragrance lines. 82 SKUs once you account for variants and pack sizes. About 100 field salespersons place orders on behalf of the retail stores they cover.
Every order came in over WhatsApp. Not as a fallback or convenience channel, but as the only channel. Salespersons would type something like bhai 5 carton lavender floor clean aur 10 rose dishwash bhej do ("brother, send 5 cartons of lavender floor cleaner and 10 rose dishwash") into a chat, and someone at HQ would read it, decode the abbreviations, look up the SKUs, and key the order into a Google Sheet. The system worked when there were 20 salespersons. By the time there were 100, the HQ team was processing several hundred messages a day, and the misreads had become a real cost: wrong product shipped, quantity off by a carton, orders missed entirely because they scrolled past unanswered.
We were asked to take the manual parser out of the loop without changing how the salespersons placed orders.
What the messages actually look like
Three real examples from the first week of production:
- lavnder flr cln 500ml 5 ctn means Lavender Floor Cleaner 500ml, 5 cartons
- rose dishwsh liqd 3 means Rose Dishwash Liquid, 3 units
- lemon toliet clear bda wala 10 means Lemon Toilet Cleaner (large size), 10 units
These are not edge cases. They are the median message. Salespersons type fast, on field phones, in whatever mix of Hindi and English comes naturally. bda wala (the big one) is a size qualifier. ctn is carton, liqd is liquid, clear is cleaner, toliet is a transposition that occurs in roughly one in eight messages we logged.
Building a deterministic parser was off the table. Every regex we sketched out either matched too narrowly (and missed real orders) or matched too broadly (and confused lavender 5 with lavender pack of 5 instead of 5 cartons of lavender). The variation lives in the data and the data does not converge.
Architecture
A Twilio number receives the WhatsApp message and posts to an Express webhook on our Node.js server. The webhook routes the message body to Gemini 2.5 Flash Lite with a system prompt that includes the full 82-SKU catalog, common aliases, and known misspellings. Gemini returns structured JSON: a list of {product_id, quantity, unit, confidence} objects. Anything below 0.7 confidence is flagged for clarification rather than processed.
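The gate after the model call is simple and deterministic. A minimal sketch of that post-parse step, assuming this item shape (the field names mirror the JSON described above, but the exact prompt schema is Wonderfresh-internal):

```typescript
// Shape the model is instructed to return. Names are illustrative,
// not the exact production schema.
interface ParsedItem {
  product_id: string;
  quantity: number;
  unit: "carton" | "unit" | "bottle";
  confidence: number; // model's self-reported 0–1 confidence
}

const CONFIDENCE_THRESHOLD = 0.7;

// Split parsed line items into ones we process and ones we bounce
// back to the salesperson as a clarification question.
function gateByConfidence(items: ParsedItem[]) {
  const accepted = items.filter((i) => i.confidence >= CONFIDENCE_THRESHOLD);
  const needsClarification = items.filter((i) => i.confidence < CONFIDENCE_THRESHOLD);
  return { accepted, needsClarification };
}
```

Everything downstream of this function only ever sees items that cleared the threshold.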
The structured output flows into five validators that run in parallel: inventory check, credit limit, MOQ, payment compatibility, GST. If all five pass, we render an order summary back to the salesperson over WhatsApp. They reply APPROVE or A, and the order writes to Google Sheets, decrements inventory, and triggers a notification email to the accounts team.
The whole loop has to fit inside a 15-second budget. Twilio enforces a 15-second timeout on webhook responses, after which the request is marked failed and retried. We targeted under 3 seconds end-to-end so we had headroom for retries, network jitter, and the occasional cold start on Railway.
Why Gemini 2.5 Flash Lite
We tested three models on a sample of 200 real messages from the historical WhatsApp logs: Gemini 2.5 Flash Lite, Gemini 2.5 Flash, and GPT-4o mini. Flash and 4o mini were both more accurate by 2-3 percentage points on the harder Hinglish messages. Flash Lite was 4-5x cheaper and roughly half the latency.
The accuracy gap was consistent but small, and most of the errors at the Flash Lite tier were ones the confidence threshold caught anyway. The cost gap was decisive. Flash Lite is priced at $0.10 per million input tokens and $0.40 per million output tokens. Our average request is around 4,000 input tokens (the catalog is included in every prompt) and 200 output tokens. At Wonderfresh's volume, we land under ₹5,000 a month in AI spend across all 100 active users.
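The per-request arithmetic is worth making explicit, using the published Flash Lite prices and our average token counts from above:

```typescript
// Flash Lite pricing from the text: $0.10 per 1M input tokens,
// $0.40 per 1M output tokens.
const INPUT_USD_PER_TOKEN = 0.10 / 1_000_000;
const OUTPUT_USD_PER_TOKEN = 0.40 / 1_000_000;

function requestCostUsd(inputTokens: number, outputTokens: number): number {
  return inputTokens * INPUT_USD_PER_TOKEN + outputTokens * OUTPUT_USD_PER_TOKEN;
}

// Average request: catalog-heavy prompt (~4,000 in), short JSON reply (~200 out).
const perRequest = requestCostUsd(4_000, 200); // ≈ $0.00048 per parse
```

At roughly $0.0005 a parse, the ₹5,000 monthly ceiling buys on the order of a hundred thousand parses (the exact count depends on the USD/INR rate, which we won't pin down here), which is far more headroom than 100 salespersons need.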
The catalog stays in the prompt instead of in a vector store because the catalog is small enough to fit and the latency of a separate retrieval call wasn't worth it for 82 SKUs. If the catalog grows past a few hundred items we will revisit this.
The validation engine
Once we have a structured order, the language model is done. Everything that follows is deterministic Node.js code, because everything that follows has to be auditable by Wonderfresh's accounts team.
Five validators run concurrently:
- Inventory check. Is the SKU in stock in the quantity requested? Inventory lives in Google Sheets, fronted by a small read-through cache to avoid hammering the Sheets API.
- Credit limit. Does the retailer this salesperson is ordering on behalf of have enough credit remaining for the order value? Credit data also lives in Sheets.
- MOQ validation. Every order must clear ₹25,000 to be economically viable for delivery. The validator computes the running total, and if the order is short, the bot tells the salesperson exactly how much more they need to add. Not "minimum order quantity not met," but "add ₹4,200 more to qualify."
- Payment compatibility. COD orders for credit-eligible customers get an automatic 5% discount. The validator checks whether the customer's payment method is one we accept for that retailer and applies the discount line if applicable.
- GST calculation. Per-product tax rates, applied to the line items, totaled in the summary. This was the validator that took the longest to get right because Wonderfresh's product categories don't all sit in one HSN bucket.
All five run in Promise.all. Total validation time is bounded by the slowest one, which is usually the inventory check because of the Sheets round trip. We measured a p95 of about 800ms for the full validator suite.
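The orchestration around the five validators is small. A sketch of the shape, with stub validators standing in for the real ones:

```typescript
// Run independent validators concurrently; total latency is bounded by
// the slowest one. Validator names mirror the five in the text.
type ValidationResult = { name: string; ok: boolean; message?: string };
type Validator = () => Promise<ValidationResult>;

async function runValidators(
  validators: Validator[],
): Promise<{ ok: boolean; failures: ValidationResult[] }> {
  const results = await Promise.all(validators.map((v) => v()));
  const failures = results.filter((r) => !r.ok);
  return { ok: failures.length === 0, failures };
}
```

Each failure carries its own `message`, which is what lets the reply to the salesperson name the specific validator and the specific fix.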
If any validator fails, the salesperson gets a specific error: which validator failed, what is wrong, what would fix it. Generic "order failed" messages were the single biggest source of frustration in the first pilot week. Once every error came with a specific remediation, the inbound clarification messages dropped to almost zero.
The approval loop
After validation, the bot sends a formatted order summary back over WhatsApp: line items, quantities, unit prices, GST breakdown by line, COD discount if applicable, and the total. The salesperson replies APPROVE or A to confirm.
On approval, three things happen, in this order:
- The order writes to the orders sheet in Google Sheets, with a generated order ID.
- Inventory in the inventory sheet decrements by the ordered quantities.
- An HTML email goes to the accounts team via the Gmail API, with the same formatted summary the salesperson saw.
The Gmail API is the part of this that surprises people. We send the notification email as the actual Wonderfresh Gmail account, not from an SMTP relay on our server. SMTP relay from a cloud server lands in spam often enough that it would have been a problem within a month. Sending as the genuine Wonderfresh account means the emails thread correctly with the rest of the accounts team's inbox, and the deliverability is whatever Google's own deliverability is.
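The only non-obvious part of the Gmail path is that `users.messages.send` takes the full RFC 2822 message, base64url-encoded, in a `raw` field. A sketch of building that string (the addresses and subject are placeholders):

```typescript
// Build the base64url-encoded RFC 2822 message the Gmail API expects.
// The send itself is then one call with an OAuth'd googleapis client.
function buildRawEmail(to: string, from: string, subject: string, html: string): string {
  const message = [
    `To: ${to}`,
    `From: ${from}`,
    `Subject: ${subject}`,
    "MIME-Version: 1.0",
    'Content-Type: text/html; charset="UTF-8"',
    "", // blank line separates headers from body
    html,
  ].join("\r\n");
  return Buffer.from(message).toString("base64url");
}
```

With the googleapis Node client, the result goes into something like `gmail.users.messages.send({ userId: "me", requestBody: { raw } })`, authenticated as the Wonderfresh account.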
End-to-end, from the salesperson typing A to the confirmation message landing back on their phone, takes about 2 seconds.
State and storage
Conversation state lives in SQLite, accessed through Drizzle ORM. Each WhatsApp number has an active conversation context: the parsed order being assembled, what stage of approval it is at, which clarification questions have been asked and answered. The state machine is small enough that we did not reach for a workflow engine.
We chose SQLite over Postgres because the working set fits in memory on a single Railway instance and because we wanted backups to be a single file copy. The Sheets API is the system of record for orders and inventory. SQLite is the conversation buffer.
```typescript
// db/schema.ts
import { sqliteTable, text, integer } from 'drizzle-orm/sqlite-core'

export const conversations = sqliteTable('conversations', {
  phone: text('phone').primaryKey(),
  state: text('state', { enum: ['idle', 'parsing', 'awaiting_clarification', 'awaiting_approval'] }).notNull(),
  draftOrder: text('draft_order', { mode: 'json' }),
  lastMessageAt: integer('last_message_at', { mode: 'timestamp' }).notNull(),
})
```

A cron job clears idle conversations older than 24 hours. Drafts that have been awaiting approval for more than an hour get a follow-up nudge from the bot.
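The state machine itself fits in a transition table. The event names here are illustrative, not the handler's exact vocabulary:

```typescript
// The four states from the schema above, as a transition table.
type State = "idle" | "parsing" | "awaiting_clarification" | "awaiting_approval";
type Event =
  | "message_received"
  | "parse_ambiguous"
  | "parse_clean"
  | "clarified"
  | "approved"
  | "timeout";

const transitions: Record<State, Partial<Record<Event, State>>> = {
  idle: { message_received: "parsing" },
  parsing: { parse_ambiguous: "awaiting_clarification", parse_clean: "awaiting_approval" },
  awaiting_clarification: { clarified: "awaiting_approval", timeout: "idle" },
  awaiting_approval: { approved: "idle", timeout: "idle" },
};

// Events that don't apply in the current state are ignored.
function next(state: State, event: Event): State {
  return transitions[state][event] ?? state;
}
```

A table this small is the reason we did not reach for a workflow engine.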
What broke
The first version of the prompt asked Gemini to return a single best match per line item. This worked until a salesperson typed safai wala 5. Safai wala translates roughly to "the cleaning one" and could refer to floor cleaner, toilet cleaner, or surface cleaner depending on which retailer was asking and what they had ordered before. The model picked one and assigned a confidence of 0.82, well above our 0.7 threshold. The retailer received the wrong product.
The fix was to have Gemini return its top three candidates with confidences when the gap between first and second was below a threshold, and to route those to a clarification message: "Did you mean: 1. Floor cleaner 2. Toilet cleaner 3. Surface cleaner?" The salesperson replies with a number and the order proceeds. This cost us about 30 lines of additional prompt instruction and a small branch in the message handler.
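The branch in the message handler boils down to a confidence-gap check. A sketch, where the gap threshold is an assumption we tuned against logged corrections, not a number from the text:

```typescript
interface Candidate {
  productId: string;
  confidence: number;
}

const MIN_CONFIDENCE = 0.7; // absolute floor from the text
const MIN_GAP = 0.15; // assumed value; tuned against logged corrections

// Accept the top candidate only when it is both confident and clearly
// ahead of the runner-up; otherwise ask the salesperson to pick.
function resolve(
  candidates: Candidate[],
): { kind: "accept"; productId: string } | { kind: "clarify"; options: string[] } {
  const sorted = [...candidates].sort((a, b) => b.confidence - a.confidence);
  const [first, second] = sorted;
  if (first.confidence >= MIN_CONFIDENCE && (!second || first.confidence - second.confidence >= MIN_GAP)) {
    return { kind: "accept", productId: first.productId };
  }
  return { kind: "clarify", options: sorted.slice(0, 3).map((c) => c.productId) };
}
```

For safai wala, the top two candidates sit a few hundredths apart, so the handler emits the numbered clarification message instead of guessing.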
The second thing that broke was retries. Twilio retries failed webhook deliveries, and during one Gemini API outage, the same message hit our webhook five times across half an hour. Five duplicate orders were almost placed. We added an idempotency check keyed on the Twilio message SID, and now duplicate webhook deliveries are recognized and ignored.
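The idempotency check is a one-key lookup. In production the seen SIDs live in SQLite so they survive restarts; an in-memory Set is enough to show the shape:

```typescript
// Dedupe webhook deliveries on the Twilio MessageSid, which is stable
// across retries of the same message.
const seenSids = new Set<string>();

// Returns true the first time a SID is seen, false on any retry delivery.
function shouldProcess(messageSid: string): boolean {
  if (seenSids.has(messageSid)) return false;
  seenSids.add(messageSid);
  return true;
}
```

The webhook handler calls this before doing anything else, so a retried delivery gets an immediate 200 response and no side effects.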
The third was a feedback loop we did not build initially and wished we had. When a salesperson corrected a parse (typed the right SKU after the bot misidentified one), that correction was useful training data, but we threw it away. We retrofitted a feedback table that stores corrections and includes a sample of recent ones in the prompt context. Accuracy on the long tail of weird messages climbed by a couple of percentage points after this. Building this in from the start would have saved us a few weeks of slightly worse accuracy and a re-architecting of the message handler.
Results
- Under 3 seconds end-to-end response time, well within Twilio's 15-second webhook timeout.
- ~85% order accuracy under the old manual WhatsApp parsing.
- ~96% accuracy with Gemini 2.5 Flash Lite plus deterministic validators; the remaining 4% is routed to clarification.
The accuracy numbers come from comparing 1,000 orders processed by the bot against an audit performed by the Wonderfresh accounts team. The 85% baseline is from a sample of historical orders processed manually before the bot existed, audited against the WhatsApp source messages.
The 4% the bot does not confidently parse are not failures. They are the messages where the bot asks a clarification question instead of guessing, and where the salesperson then provides the additional information that resolves the ambiguity. The clarification rate is roughly twice as high in the first week a new salesperson is onboarded and drops as the bot's exposure to that salesperson's vocabulary grows.
Cost is under ₹5,000 a month in Gemini spend at current volume, plus Railway hosting and Twilio per-message charges, which Wonderfresh was already paying.
Stack
Node.js and TypeScript on the backend. Express for the webhook server. Gemini 2.5 Flash Lite via the Google AI Studio API. Twilio for the WhatsApp Business channel. Google Sheets API for orders and inventory. Gmail API for accounts team notifications. Drizzle ORM with SQLite for conversation state. Deployed on Railway, single instance, with the SQLite file backed up nightly to a Cloudflare R2 bucket.
We considered moving Wonderfresh off Sheets and onto Postgres during scoping. They use Sheets for everything else in the business and the migration was not worth the disruption for a system that the Sheets API can serve. If their volume grows another 5x we will revisit.
Built by South Arc Digital for Wonderfresh, deployed in production since November 2025. Stack: Node.js, TypeScript, Express, Gemini 2.5 Flash Lite, Twilio WhatsApp Business, Google Sheets API, Gmail API, Drizzle ORM, SQLite, Railway.
Want to build something like this?
We design and ship AI products, automation systems, and custom software.
Get in touch