Building Harnesses for Operational AI Agents
Most agent harness literature focuses on coding agents. When agents process orders, query legacy ERPs, or generate business reports, the harness looks different because it has to account for business context, not just technical context.
An agent harness is every piece of code, configuration, and execution logic that isn't the model itself. The model provides intelligence; the harness makes that intelligence practical. LangChain's formula is straightforward: Agent = Model + Harness.
The harness literature in 2026, from Anthropic's work on long-running coding agents to LangChain's harness anatomy, defines five core components: filesystems, code execution, sandboxes, memory, and context management. These components assume the agent writes and runs code.
We deployed AI agents across three industries in India: a textile manufacturer, an industrial gas distributor, and an FMCG distributor. These agents didn't write code. They processed orders, answered inventory queries, and generated daily business reports over WhatsApp, against legacy systems, under real-time constraints. The harness components we needed were fundamentally different, because the constraints came from the business, not from a codebase.
This post describes the harness architecture that emerged from those deployments.
The business problem
Enterprise software is an abstraction layer. An ERP takes the messy reality of warehouses, factories, and delivery routes and forces it into structured data that a business can reason about. But there's a persistent gap: the people who operate the software are rarely the people who make decisions based on it.
Consider what it actually takes for a business owner to answer a simple question like "can we commit 200 units of Product X?" In most ERP systems, the friction chain looks like this: open a laptop or app, navigate to the right module, either create a custom report or hope someone on the team has already built one, generate the report, wait for it to load, read through the output, and in older systems, export to a pivot table to do any real analysis. Newer ERPs offer dashboards, but those dashboards still require the owner to log in, know where to look, and interpret the data themselves.
In practice, most owners skip this entirely. In the businesses we worked with, this pattern was consistent. The textile manufacturer's owner never logged into his MS SQL Server ERP. A staff member generated reports and relayed the numbers over a phone call. The gas distributor's owner reviewed cylinder fleet performance through WhatsApp voice notes from his operations manager. The FMCG sales team couldn't check inventory during customer calls; they messaged the warehouse and waited.
This introduced latency, interpretation errors, and information loss into every business decision. A Retain International analysis found that only 26% of employees fully utilize ERP capabilities. Deloitte's Tech Trends 2026 reported that just 14% of organizations had production-ready agentic solutions and only 11% were actively using them. The gap between having software and actually using it for decisions remained wide.
AI agents could bridge this gap. A business owner who won't open an ERP dashboard might ask "how much stock do we have?" on WhatsApp. A distributor who won't check a reporting tool will read a daily digest pushed to their phone at 7am. The interface changes from the software's terms to the business leader's terms.
But this only works if the agent harness accounts for how the business actually operates, not just what the software's API exposes.
The technical problem
As Fortune reported on the MIT NANDA initiative's 2025 study, researchers surveyed 350 employees and analyzed 300 public AI deployments. Their central finding was that "the core problem isn't the quality of the AI models, but the 'learning gap' for both tools and organizations." A notable split emerged: AI solutions purchased from specialized vendors succeeded roughly two-thirds of the time, while internal generic builds succeeded at about half that rate.
This tracked with what we observed. Plugin-based AI features, the generic "ask your data" tools that SaaS platforms ship, struggled in our deployments because they lacked domain understanding. They didn't know that AMT meant different things in different tables, that CO2 cylinders rotate slower than oxygen, or that segment-specific order cadences varied from daily (dealers) to biweekly (marketing accounts). These weren't edge cases. They were the entire surface area of real business operations.
The pattern we converged on was consistent with what Harrison Chase described when he observed that "nearly all of the agentic systems we see in production are a combination of workflows and agents." Tomasz Tunguz quantified this from his own production systems: 65% of workflow nodes ran as non-AI code, and only 14% of workflows were fully agentic. Stripe's Blueprint Architecture formalized the same principle: "deterministic code handles the predictable; LLMs tackle the ambiguous."
The harness was where this split happened in practice. It was where business context got encoded: the domain rules, the schema translations, the routing decisions that determined whether the agent produced something useful or something confidently wrong.
The Abstraction Stack

- Business Leader: WhatsApp, voice, conversational queries
- Agent Harness: tools, routing, schema translation, memory, pre-computed context
- Business Software: ERP, inventory tools, invoicing, CRM
- Business Reality: warehouses, factories, sales calls, deliveries, payments

Each layer simplifies the one below it. The agent harness bridges rigid software and how business leaders actually communicate.
Harness component: heuristic-first intent routing
The business need. An FMCG distributor's sales team placed orders over WhatsApp. Messages arrived in English, Hindi, or Hinglish with no consistent format:
"5 carton glass cleaner aur 10 pack phenyl for Rajesh, COD"
"Rajesh ke liye 5 GC aur 10 phenyl bhej do, cash payment"
Before building, we mapped the full user flow: a salesperson in the field sends a WhatsApp message, the system classifies intent, parses order details using Gemini, validates against five business rules in parallel (minimum order value of ₹25,000, credit limits, inventory, payment compatibility, GST at 18%), computes pricing, and presents a summary for approval. On approval, the order is written to both SQLite and Google Sheets, the accounts team receives simultaneous email and WhatsApp notifications, inventory is updated, and the salesperson gets a confirmation. The entire flow needed to complete within Twilio's 15-second webhook timeout.
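Because the five rule checks are independent, they can be fanned out with Promise.all so validation latency is bounded by the slowest rule rather than their sum. A minimal sketch, with only two of the five rules shown and all names ours rather than the production code's:

```typescript
// Illustrative sketch of parallel business-rule validation. Only two of
// the five rules are shown; rule names, shapes, and the COD threshold
// are hypothetical, not taken from the production system.
interface Order { totalValue: number; paymentMode: 'COD' | 'CREDIT' }
interface RuleResult { rule: string; ok: boolean; reason?: string }

const MIN_ORDER_VALUE = 25_000; // ₹25,000 minimum from the flow above

async function checkMinValue(order: Order): Promise<RuleResult> {
  const ok = order.totalValue >= MIN_ORDER_VALUE;
  return { rule: 'min_order_value', ok, reason: ok ? undefined : 'Below ₹25,000 minimum' };
}

async function checkPaymentMode(order: Order): Promise<RuleResult> {
  // Assumed rule for illustration: COD disallowed above ₹1,00,000
  const ok = !(order.paymentMode === 'COD' && order.totalValue > 100_000);
  return { rule: 'payment_compatibility', ok, reason: ok ? undefined : 'COD not allowed above ₹1L' };
}

async function validateOrder(order: Order): Promise<RuleResult[]> {
  // All rules evaluated concurrently; total latency ~= slowest rule
  return Promise.all([checkMinValue(order), checkPaymentMode(order)]);
}
```

Inside a 15-second webhook budget, that concurrency matters: five sequential database-backed checks can easily blow the deadline where five parallel ones will not.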
What broke. Running full LLM intent classification on every incoming message, including obvious ones like "hi" or "help," cost 800ms-1.2s per call. With downstream validation, pricing calculation, and the approval workflow still needing to execute, that latency budget was unsustainable for messages that could be classified in milliseconds.
What we built. A two-tier intent detector. A fast heuristic layer caught obvious patterns using regex matching, resolving approximately 60% of messages in under 5ms. Only ambiguous messages fell through to Gemini 3.1 Flash Lite for full classification with confidence scoring.
```typescript
// services/ai/intentDetector.ts
quickIntentCheck(message: string): MessageIntent | null {
  const lowerMsg = message.toLowerCase().trim();

  const greetings = ['hi', 'hello', 'hey', 'good morning'];
  if (greetings.some(g => lowerMsg === g || lowerMsg.startsWith(g + ' '))) {
    return MessageIntent.GREETING;
  }

  // Numbers + product keywords = likely an order
  const hasNumbers = /\d+/.test(message);
  const hasProductKeywords =
    /carton|bottle|pack|piece|cleaner|wash|soap|phenyl|fragrance/i.test(message);
  if (hasNumbers && hasProductKeywords) {
    return MessageIntent.ORDER;
  }

  return null; // Ambiguous: fall through to LLM
}
```

One failure worth noting: the system included an approval flow where salespeople replied "ok" to confirm orders. A salesperson named Ashok kept triggering false approvals because the substring "ok" matched inside his name. The fix was a word-boundary regex (\bok\b) across all keyword matching. This was one of several cases that reinforced our decision to pursue the two-tier routing model, where deterministic checks handled predictable patterns before the LLM was invoked.
Two-Tier Intent Routing

- Incoming WhatsApp message: "5 carton glass cleaner aur 10 phenyl, COD"
- Heuristic check (~5ms): regex for numbers plus product keywords, question marks, greeting words
- Resolved (~60%): greetings, help requests, orders with clear product keywords
- Ambiguous (~40%): partial orders, mixed queries, context-dependent messages
- Gemini 3.1 Flash Lite (~800ms): full intent classification with confidence scoring
- Structured action: ORDER / GREETING / QUESTION / INVENTORY_QUERY / PARTIAL_ORDER
What this enabled. Sub-3-second end-to-end order processing. The sales team placed and confirmed orders during customer calls without contacting the warehouse. Approved orders flowed directly to the accounts team via email and WhatsApp, with inventory updated automatically. The per-message inference cost stayed under a few cents because the heuristic layer eliminated the LLM call for the majority of messages.
Harness component: tool abstraction over a legacy schema
The business need. A textile manufacturer's owner wanted to query his business data conversationally, checking stock, reviewing outstanding payments, tracking pending orders, over WhatsApp in Hindi or English. The data lived in a 20-year-old MS SQL Server ERP with 442,000 sales records across five companies.
What broke. The schema was adversarial in ways we didn't anticipate. The column for supplier phone numbers was spelled SupplerPhone. The column AMT in the orders table meant the full order amount; Amt in the sales table meant the line-item amount; Amount in the outstanding table meant the bill amount. Same word, three meanings, case-sensitive. Multi-company data required company filters, but some tables expected 'APPLE LIFESTYLE INDUSTRIES LIMITED' (uppercase) while others expected 'Apple Lifestyle Industries Limited' (mixed case).
What we built. Three layers of harness infrastructure between the agent and the database.
MCP tools as the abstraction
We built 35+ MCP tools using the @modelcontextprotocol/sdk, each encapsulating one business question. The agent discovered available tools and called the appropriate one; each tool handled schema translation internally.
```typescript
// mcp/server.ts
mcpServer.tool(
  'check_stock',
  'Check stock availability for a quality/product.',
  { quality: z.string().describe('Product quality name to check') },
  async ({ quality }) => {
    const qualityLike = buildLikeClause('Quality', quality);
    // Three parallel queries across tables with inconsistent naming
    const [readyStock, wipStock, pendingOrders] = await Promise.all([
      executeQuery(`SELECT Quality, SUM(BalPcs) as ReadyStock
        FROM SpTblFinalQualityStockReadyToSale
        WHERE ${qualityLike} AND BalPcs > 0 GROUP BY Quality`),
      executeQuery(`SELECT Quality, SUM(BalPcs) as WIPStock
        FROM AppleJobWorkPendingGoodsPCS
        WHERE ${qualityLike} AND BalPcs > 0 GROUP BY Quality`),
      executeQuery(`SELECT Quality, SUM(BalPcs) as PendingPcs
        FROM APPLESALEORDERPENDING
        WHERE ${qualityLike} AND BalPcs > 0 GROUP BY Quality`)
    ]);
    // Returns unified stock: ready, work-in-progress, and committed
  }
);
```

The agent called check_stock. It never needed to know that SpTblFinalQualityStockReadyToSale existed. Each tool encapsulated the schema inconsistencies, the parallel queries, and the business logic for its domain.
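The buildLikeClause helper is referenced above but not shown. One plausible shape for it, assuming case-insensitive partial matching with basic escaping (the real implementation may differ, and bound parameters would be preferable to string interpolation):

```typescript
// Hypothetical sketch of buildLikeClause: a case-insensitive partial
// match that escapes quotes and LIKE wildcards in the user's input.
function buildLikeClause(column: string, value: string): string {
  const escaped = value
    .replace(/'/g, "''")              // escape single quotes
    .replace(/[%_]/g, m => `\\${m}`); // escape LIKE wildcards % and _
  return `UPPER(${column}) LIKE UPPER('%${escaped}%') ESCAPE '\\'`;
}

buildLikeClause('Quality', 'Banarasi silk');
// -> UPPER(Quality) LIKE UPPER('%Banarasi silk%') ESCAPE '\'
```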
The MCP server also served a second purpose: connected via stdio transport, it gave us a way to debug and observe the system interactively through Claude Code during development, querying the production database conversationally while building the tools.
The PostgreSQL analytical layer
We didn't query the MS SQL ERP directly for agent requests. Instead, we built a PostgreSQL analytical tier that synced from MS SQL on a tiered schedule: orders and outstanding data every 2 minutes, purchase details every 5 minutes, sales history every 15 minutes, and master data hourly. The sync used incremental windows for large tables (fetching only the last 2 hours of changes) to avoid locking the ERP during business hours.
The reasons were practical. The ERP's MS SQL Server had 30-second connection timeouts and 45-second query timeouts, too slow for WhatsApp responses where users expected answers in seconds. The client's ERP vendor had also flagged that frequent direct queries were impacting transactional performance during business hours. PostgreSQL, hosted on Railway alongside the backend, responded in under a second with 3-second connection timeouts and 8-second statement timeouts. If PG was unavailable (Railway restarts, connection resets), the system fell back to MS SQL automatically and reset the PG connection pool for recovery on the next request.
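The fallback behavior can be sketched as a small wrapper. Names like resetPgPool are ours for illustration; the production code surely differs in detail:

```typescript
// Hypothetical sketch of the PG-first, MS-SQL-fallback path described
// above: try the fast analytical tier first; on failure, reset its pool
// so the next request can recover, and answer from the slower ERP.
async function executeWithFallback<T>(
  queryPg: () => Promise<T>,
  queryMssql: () => Promise<T>,
  resetPgPool: () => void
): Promise<T> {
  try {
    return await queryPg(); // fast path: sub-second analytical tier
  } catch {
    resetPgPool();          // recover the PG pool for the next request
    return queryMssql();    // slow path keeps the agent answering
  }
}
```

The design choice here is that availability beats latency: a 45-second MS SQL answer is still better than no answer when the analytical tier is mid-restart.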
SQL dialect translation
The agent used a ReAct reasoning loop: a pattern where the model thinks step-by-step, decides what data it needs, generates a SQL query, observes the results, and then either queries again or produces a final answer. We chose ReAct because business questions often require multiple queries to answer. "Who are my top customers and what do they owe?" requires hitting the sales table, then the outstanding table, then combining the results. A single-shot prompt couldn't handle this reliably.
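In skeleton form, the loop looks something like this. It is a minimal sketch: llm and runSql are assumed wrappers, not real library calls, and the production agent carries far more state.

```typescript
// Minimal ReAct loop sketch: the model alternates between emitting a SQL
// query (action) and reading its result (observation) until it can
// answer. `llm` and `runSql` are assumed wrappers for illustration.
type Step =
  | { kind: 'query'; sql: string }
  | { kind: 'answer'; text: string };

async function reactLoop(
  question: string,
  llm: (scratchpad: string) => Promise<Step>,
  runSql: (sql: string) => Promise<string>,
  maxSteps = 5
): Promise<string> {
  let scratchpad = `Question: ${question}`;
  for (let i = 0; i < maxSteps; i++) {
    const step = await llm(scratchpad);
    if (step.kind === 'answer') return step.text; // model is done reasoning
    const observation = await runSql(step.sql);   // execute and observe
    scratchpad += `\nQuery: ${step.sql}\nObservation: ${observation}`;
  }
  return 'Step budget exhausted without an answer.';
}
```

A question like "who are my top customers and what do they owe?" becomes two passes through this loop: one query against the sales table, a second against outstanding, then a final answer.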
The ReAct agent generated queries in MS SQL syntax, which is what the 200-line schema prompt taught it. But we executed against PostgreSQL. A translation function converted between dialects at runtime:
```typescript
// services/mssql.ts
function convertQueryToPostgres(query: string): string {
  let pgQuery = query;

  // DATEADD(month, -3, GETDATE()) -> NOW() + INTERVAL '-3 months'
  pgQuery = pgQuery.replace(
    /DATEADD\s*\(\s*month\s*,\s*(-?\d+)\s*,\s*GETDATE\(\)\s*\)/gi,
    (_, months) => `NOW() + INTERVAL '${months} months'`
  );

  // ISNULL(x, y) -> COALESCE(x, y)
  pgQuery = pgQuery.replace(/ISNULL\s*\(/gi, 'COALESCE(');

  // DATEDIFF uses a balanced-parenthesis parser
  // for nested calls like DATEDIFF(day, Dat, NOW())
  pgQuery = replaceDATEDIFF(pgQuery);

  // 60+ column remappings: PascalCase -> snake_case
  // Sorted by key length so 'Supplier Group' processes before 'Supplier'
  const sortedColumns = Object.entries(PG_COLUMN_MAP)
    .sort((a, b) => b[0].length - a[0].length);
  for (const [mssqlCol, pgCol] of sortedColumns) {
    pgQuery = pgQuery.replace(new RegExp(`\\b${mssqlCol}\\b`, 'g'), pgCol);
  }

  return pgQuery;
}
```

The replaceDATEDIFF function used a recursive descent parser because DATEDIFF(day, Dat, NOW()) nests one function call inside another, and simple regexes failed on these cases.
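The length-sorting detail matters more than it looks. With a naive iteration order, 'Supplier' rewrites the prefix of 'Supplier Group' first and leaves a half-translated identifier behind. A self-contained demonstration with an illustrative two-entry map:

```typescript
// Why the column map is sorted longest-key-first: in insertion order,
// 'Supplier' fires first and corrupts 'Supplier Group'; sorted by key
// length, the longer key wins before the shorter one can match.
// The two-entry map here is illustrative, not the production 60+ map.
const COLUMN_MAP: Record<string, string> = {
  'Supplier': 'supplier',
  'Supplier Group': 'supplier_group',
};

function remap(sql: string, longestFirst: boolean): string {
  const entries = Object.entries(COLUMN_MAP);
  if (longestFirst) entries.sort((a, b) => b[0].length - a[0].length);
  let out = sql;
  for (const [from, to] of entries) {
    out = out.replace(new RegExp(`\\b${from}\\b`, 'g'), to);
  }
  return out;
}

remap('SELECT Supplier Group FROM t', false); // 'SELECT supplier Group FROM t' (broken)
remap('SELECT Supplier Group FROM t', true);  // 'SELECT supplier_group FROM t'
```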
Dynamic schema context
A 200-line system prompt taught the ReAct agent which columns existed, what they meant (not what they were named), which tables required company filters, and which casing to use. Date context, including current IST time, Indian fiscal year boundaries (April-March), and pre-computed date filters, was injected dynamically on every invocation.
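The date-context injection can be sketched as a small helper. The names and output format here are our illustration; the production prompt differs:

```typescript
// Hypothetical sketch of the dynamic date context: current IST time plus
// the Indian fiscal year (April-March), recomputed on every agent
// invocation instead of being baked into the static schema prompt.
function buildDateContext(now: Date = new Date()): string {
  // IST is UTC+5:30 with no daylight saving; shift, then read UTC fields
  const ist = new Date(now.getTime() + 5.5 * 60 * 60 * 1000);
  const year = ist.getUTCFullYear();
  const month = ist.getUTCMonth() + 1; // 1-12
  const fyStart = month >= 4 ? year : year - 1; // fiscal year begins in April
  const fyLabel = `FY${fyStart}-${String((fyStart + 1) % 100).padStart(2, '0')}`;
  return `Current IST time: ${ist.toISOString().replace('Z', ' IST')}\nFiscal year: ${fyLabel}`;
}
```

Injecting this on every invocation means a query like "sales this fiscal year" resolves against the correct April boundary even when the static prompt is months old.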
We also made design decisions about what information was served natively as WhatsApp text versus what was generated as a chart, Excel file, or PDF and delivered as a downloadable link. Text queries ("how much stock?", "who owes us money?") returned formatted WhatsApp messages. Requests for visual analysis ("send me a sales trend chart", "export outstanding to Excel") triggered report generation via ChartJS, ExcelJS, or PDFKit, stored in an in-memory cache with a 1-hour TTL, and served through a public /media/:id endpoint that Twilio could access.
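The media cache itself is a simple pattern. A minimal in-memory version with lazy TTL eviction, in the spirit of the /media/:id endpoint described above (identifiers and shapes are illustrative):

```typescript
// Minimal sketch of an in-memory media cache with a 1-hour TTL. Entries
// are evicted lazily on read. Names and shapes here are illustrative,
// not the production implementation.
interface CachedMedia { body: Buffer; mimeType: string; expiresAt: number }

const TTL_MS = 60 * 60 * 1000; // 1 hour
const mediaCache = new Map<string, CachedMedia>();

function putMedia(id: string, body: Buffer, mimeType: string, now = Date.now()): void {
  mediaCache.set(id, { body, mimeType, expiresAt: now + TTL_MS });
}

function getMedia(id: string, now = Date.now()): CachedMedia | null {
  const entry = mediaCache.get(id);
  if (!entry) return null;
  if (now > entry.expiresAt) {
    mediaCache.delete(id); // expired: evict on access
    return null;
  }
  return entry;
}
```

An hour is long enough for Twilio to fetch the file and the user to tap the link, and short enough that generated reports never accumulate in memory.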
What this enabled. Natural language querying over WhatsApp across 7,100+ SKUs. The owner asked "how much stock do we have?" (in Hindi, English, or a mix) and got an answer that hit three tables, translated between two SQL dialects, and returned in under 4 seconds. For deeper analysis, they could ask for a chart or export and receive it as a downloadable file in the same conversation.
Harness component: pre-computed context engine
The business need. An industrial gas distributor's owner wanted daily business intelligence. Not raw numbers, but interpreted reports that compared yesterday against historical baselines, flagged customer anomalies, and suggested actions. Reports needed to arrive via both email (detailed) and WhatsApp (brief) at 7am.
What broke. Asking an LLM to compute baselines, identify anomalies, and generate reports from raw database records produced inconsistent results. The model calculated different averages on different runs, missed important customer changes, and hallucinated trends that didn't exist in the data.
What we built. A four-layer pre-computed context engine. Deterministic code handled all aggregation, comparison, and anomaly detection. The LLM received structured facts and did only what it's good at: interpretation and narrative generation.
Four-Layer Pre-Computed Context Engine

- Format templates (static): report-type instructions for the daily briefing, Monday review, and Friday outlook; section structure, tone, length constraints.
- Daily dynamics (refreshed every day): yesterday's revenue, invoice count, customer deltas (no_order, surge, rotation_drop), alerts, outstanding balances.
- Baselines (pre-computed daily): 13-week weekday averages, week-over-week comparison, month-to-date vs prior month, median values.
- Static context (loaded at module load): company overview, active products with vessel costs, customer segments, gas-type rotation thresholds.

The LLM receives structured, pre-aggregated facts, not raw database records. Deterministic code handles computation. The model handles interpretation.
The context engine computed a BusinessContext document daily. The report agent, a Google ADK LlmAgent, received this pre-computed context as its instruction, not as data to query:
```typescript
// lib/agents/report-agent.ts
export function createReportAgent(reportType, context, previousSummary) {
  // Layer 2: baselines from BusinessContext
  const dow = context.baselines?.dayOfWeek || {};
  const layer2 = `## Baseline Metrics
- ${dow.dayName} baseline (13-week): ${dow.avgInvoices?.toFixed(0)} invoices,
  ${formatINR(dow.avgRevenue)} revenue
- This week to date: ${formatINR(weekly.thisWeek)}
  (last week: ${formatINR(weekly.lastWeek)}, ${weekly.weekOverWeekPct}%)`;

  // Layer 3: yesterday's data + customer deltas
  const deltaLines = customerDeltas
    .filter(d => d.detail?.daysSinceLastOrder <= 90) // exclude churned
    .slice(0, 15)
    .map(formatDelta)
    .join('\n');

  // Full instruction = static + baselines + daily + format rules
  const instruction = `${STATIC_CONTEXT}\n${layer2}\n${layer3}\n${formatInstruction}`;

  return new LlmAgent({
    name: 'guljag_report_agent',
    model: 'gemini-3.1-flash-lite-preview',
    instruction,
    generateContentConfig: {
      temperature: 0.3,
      responseMimeType: 'application/json',
    },
  });
}
```

The key design decision: pre-compute everything deterministically, then hand the LLM structured facts. The model never touched raw records. It received "Tuesday baseline: 42 invoices, ₹3.2L revenue," not 10,000 invoice rows to aggregate on its own.
Customer deltas, the anomaly detection layer, were also pre-computed. The context engine identified five event types: no_order (customer overdue based on segment-specific cadence), surge (unusual spike), rotation_drop (declining cylinder turnover), recovery_target (idle assets with capital locked), and payment_received. Each event type had segment-specific thresholds: dealers ordered daily, factories weekly, marketing accounts biweekly.
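The segment-specific no_order check reduces to a small threshold comparison. A sketch under assumed thresholds (the cadences follow the post; the grace multiplier is our illustration):

```typescript
// Hypothetical sketch of the no_order detection: a customer is flagged
// when days since their last order exceed their segment's expected
// cadence by a grace factor. The grace factor is an assumption.
const CADENCE_DAYS: Record<string, number> = {
  dealer: 1,     // dealers order daily
  factory: 7,    // factories weekly
  marketing: 14, // marketing accounts biweekly
};

interface Customer { name: string; segment: string; daysSinceLastOrder: number }

function isNoOrderEvent(c: Customer, graceFactor = 2): boolean {
  const cadence = CADENCE_DAYS[c.segment] ?? 7; // default for unknown segments
  return c.daysSinceLastOrder > cadence * graceFactor;
}
```

Encoding the cadence per segment is what keeps the alert useful: three quiet days is an anomaly for a dealer but unremarkable for a marketing account.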
What this enabled. The owner received a WhatsApp message at 7am: yesterday's revenue vs the 13-week Tuesday baseline, the three customers who needed follow-up, and the LPG segment update. A detailed HTML email with the full analysis arrived simultaneously. The LLM's output was consistent because its inputs were deterministic.
Cross-cutting concern: conversation memory
Business queries were naturally conversational. A user would ask "show me stock for Banarasi silk." Then: "what about their outstanding?" The word "their" referred to customers who bought Banarasi silk, information from the previous query's results.
We built a rolling window memory system in Redis. The last 10 messages were retained in full. Older messages were compressed into a structured summary that tracked what was discussed and which database tables were queried, not just the text of the conversation:
```typescript
// services/memory.ts
function summarizeOldMessages(conversation: ConversationMemory): void {
  const toSummarize = conversation.messages.slice(0, SUMMARY_WINDOW);
  conversation.messages = conversation.messages.slice(SUMMARY_WINDOW);

  const summaryPoints: string[] = [];
  for (const msg of toSummarize) {
    if (msg.role === 'user') {
      summaryPoints.push(`User asked: "${msg.content.substring(0, 50)}..."`);
    } else if (msg.queryUsed) {
      summaryPoints.push(`Bot queried ${msg.tableUsed || 'database'}`);
    }
  }
  conversation.summary = summaryPoints.join('; ');
}
```

This gave the ReAct agent enough context to resolve "their" without inflating the prompt with full conversation history. Memory expired after 24 hours via a Redis TTL.
What the harness enabled
The underlying software didn't change. The ERP still has SupplerPhone. The product catalog still lives in Google Sheets. But the harness made these systems accessible to the people who needed the information most:
- The textile manufacturer's owner queried 7,100+ SKUs over WhatsApp in Hindi, without opening the ERP
- The gas distributor received a daily report at 7am comparing yesterday to the 13-week baseline, with customer-specific action items, without checking a dashboard
- The FMCG sales team placed orders in under 3 seconds during customer calls, without contacting the warehouse. Approved orders flowed automatically to the accounts team.
Across all three deployments, the agents acted as translation layers between how the business actually communicated (WhatsApp messages, voice notes, phone calls) and the structured systems underneath (ERP, MongoDB, Google Sheets). The monthly infrastructure cost for each system stayed under $10, with per-message inference costs of a few cents, kept low by the heuristic-first routing pattern that eliminated the LLM call for the majority of messages.
Each outcome was driven by a specific business constraint that shaped the harness. The harness components weren't chosen from a menu. They emerged from diagnosing how each business actually operated.
Observations
What worked. The heuristic-first pattern, deterministic code handling what it could with the LLM reserved for judgment, proved reliable across all three deployments. Pre-computing context for the report agent eliminated the inconsistency we saw when the LLM processed raw data directly. MCP tools as the abstraction layer between agent and database gave us both production reliability and developer observability through the same interface.
What didn't. The 200-line schema prompt for the textile ERP was hand-crafted over weeks of trial and error: querying the database, discovering that AMT meant different things in different tables, finding the SupplerPhone typo. Tooling that could automate semantic schema discovery would dramatically reduce deployment time. Anthropic's recent work on context engineering suggests that progressive disclosure and just-in-time retrieval may be better approaches for large, evolving schemas.
What we'd do differently. The conversation memory used simple rolling-window summarization. More sophisticated approaches, such as the always-on memory agent pattern where a consolidation agent periodically synthesizes related memories and builds entity relationships, could improve reasoning across longer conversation arcs. We'd also invest earlier in structured evaluation: McKinsey's State of AI 2025 found that only 6% of organizations qualified as "AI high performers," and a common differentiator was rigorous measurement of AI's impact on business outcomes.
References
- LangChain, "The Anatomy of an Agent Harness," 2026. blog.langchain.com
- Anthropic, "Effective harnesses for long-running agents," 2026. anthropic.com/engineering
- Anthropic, "Effective context engineering for AI agents," 2026. anthropic.com/engineering
- Harrison Chase, "How to Think About Agent Frameworks," LangChain, 2025 (updated Feb 2026). blog.langchain.com
- Tomasz Tunguz, "Is AI Doing Less & Less?," Feb 2026. tomtunguz.com
- Stripe Engineering, "Blueprint Architecture," via ByteByteGo. blog.bytebytego.com
- MIT NANDA Initiative, as reported by Fortune, "95% of generative AI pilots at companies are failing," Aug 2025. fortune.com
- McKinsey, "The State of AI in 2025: Agents, Innovation, and Transformation." mckinsey.com
- Deloitte, "Agentic AI Strategy," Tech Trends 2026. deloitte.com
- Gartner, "40% of Enterprise Apps Will Feature AI Agents by 2026," Aug 2025. gartner.com
- Retain International, "54 ERP Statistics and Trends." retaininternational.com
- Google Cloud, "AI Agent Trends 2026." cloud.google.com
- Google Cloud Platform, "Always-On Memory Agent." github.com
The systems described in this post were deployed across three industries in India. The MCP server, ReAct agent, and WhatsApp integrations were built with TypeScript and Node.js. The pre-computed context engine and data harmonization used MongoDB with Node.js. The report agent used Google ADK. Inference ran on Gemini 3.1 Flash Lite.
Want to build something like this?
We design and ship AI products, automation systems, and custom software.
Get in touch