AI vendors are becoming consultants because support agents keep slipping up
Enterprises are learning that rolling out AI agents takes more than a few logins. OpenAI is reportedly hiring hundreds of engineers to customize models with customer data and build agents for clients, with about 60 already in consulting roles and 200+ in technical support. Anthropic is also working hands-on with customers instead of just shipping an app.
The reason is simple: out-of-the-box agents aren't reliable enough for production support. Retailer Fnac reportedly tested OpenAI and Google models for customer service, but the agents kept mixing up serial numbers, only stabilizing after help from AI21 Labs.
Why this matters for customer support leaders
- Expect services, not just software. Real value often starts with a consulting sprint to wire models into your stack and data.
- Integration is the work. Agents must talk to your systems of record, apply business rules, and handle edge cases before an agent UI is even useful.
- Rollouts will take longer than a typical SaaS deployment. Budget time for evaluation, guardrails, and change management.
- Vendor choice now affects process design, not just price. You're buying a playbook and a team, not only a model.
Context: "Frontier" shows the hidden work
OpenAI's new agentic enterprise platform, Frontier, highlights the moving parts: connect to systems of record, encode business context, execute and optimize agents, then layer interfaces on top. That stack explains why providers are leaning into consulting, and why scaling B2B agents may be slower than the pitch decks suggest.
Tools like Claude Cowork can help, but speed-to-value depends on your connectors, policies, and data hygiene. Model gains will lift routine tasks; security and reliability risks won't vanish overnight.
Where AI agents break in support
- Entity mix-ups: serial vs. order vs. ticket ID; customer "John A." vs. "John B."
- Tool-call failures: missing auth, timeouts, flaky APIs, non-idempotent actions.
- Partial context: agent sees the ticket but not the warranty, policy, or previous RMAs.
- Edge cases: returns across channels, bundles, fraud flags, regional policies.
- State management: multi-step flows without checkpoints or rollback (one mitigation is sketched after this list).
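Two of these failure modes, flaky tool calls and unmanaged state, are partly engineering problems. Here is a minimal sketch of one mitigation, assuming a hypothetical create_rma call into your order system: retries are only safe because a reused idempotency key lets the downstream system deduplicate, and checkpoints keep a restarted flow from repeating completed steps.

```python
import time
import uuid

def create_rma(order_id: str, idempotency_key: str) -> dict:
    # Hypothetical order-system call; stubbed here for illustration.
    return {"rma_id": "RMA-0001", "status": "created"}

def call_with_retries(fn, *args, attempts=3, backoff_s=2.0, **kwargs):
    # Retry flaky or timing-out calls; only safe when fn is idempotent.
    for attempt in range(1, attempts + 1):
        try:
            return fn(*args, **kwargs)
        except (TimeoutError, ConnectionError):
            if attempt == attempts:
                raise
            time.sleep(backoff_s * attempt)

checkpoints: dict = {}  # step name -> result, so a restarted flow skips finished steps

def run_step(step_name: str, fn, *args, **kwargs):
    if step_name in checkpoints:
        return checkpoints[step_name]
    result = call_with_retries(fn, *args, **kwargs)
    checkpoints[step_name] = result
    return result

# One idempotency key per logical action: retries reuse it, so the order
# system can deduplicate instead of creating a second RMA.
rma = run_step("create_rma", create_rma,
               order_id="ORD-123456", idempotency_key=str(uuid.uuid4()))
```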
A practical rollout plan for support teams
- Pick one narrow use case. Example: warranty lookup and reply draft for post-purchase tickets.
- Define success upfront. Target FCR, deflection, AHT variance, CSAT change, and escalation rate.
- Build a gold test set. 100-300 real tickets with ground-truth answers and tool calls; the evaluation sketch after this list shows one way to score against it.
- Wire to systems safely. Read-only first. Scope by team, region, and action. Add canary tenants.
- Human-in-the-loop. Drafts on day one. Graduated autonomy only after stable metrics.
- Guardrails that bite. PII redaction, policy snippets, tool schemas with strict validation, and output filters.
- Fine-tune with your data. Start with retrieval over policies; consider supervised fine-tuning once you have labeled examples.
- Instrument everything. Track tool-call accuracy, hallucination flags, correction rate, and cost per resolved ticket.
- Have a rollback plan. Version prompts and tools as a bundle, and keep a one-click revert.
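To make the gold-test-set and instrumentation steps concrete, here is a minimal offline evaluation sketch. Everything in it is a placeholder: run_agent stands in for however you invoke your draft agent, and the JSONL field names and scoring rules are assumptions to adapt to your own schema.

```python
import json

def run_agent(ticket_text: str) -> dict:
    # Hypothetical agent entry point; returns a reply draft plus the tool calls it made.
    return {"reply": "stub", "tool_calls": []}

def evaluate(gold_path: str) -> dict:
    # Score agent output against a gold set of tickets with ground-truth tool calls.
    with open(gold_path) as f:
        gold = [json.loads(line) for line in f]  # one JSON ticket per line

    tool_hits, reply_hits = 0, 0
    for case in gold:
        out = run_agent(case["ticket_text"])
        # Tool-call accuracy: the agent must make exactly the expected calls with expected args.
        if out["tool_calls"] == case["expected_tool_calls"]:
            tool_hits += 1
        # Crude reply check: every required fact (IDs, dates, amounts) appears in the draft.
        if all(fact in out["reply"] for fact in case["required_facts"]):
            reply_hits += 1

    n = len(gold)
    return {"tool_call_accuracy": tool_hits / n, "fact_coverage": reply_hits / n, "cases": n}
```

Run something like this before every prompt or tool change; a drop in tool-call accuracy is your regression signal before customers ever see it.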
Data and security guardrails that prevent headlines
- Least-privilege access: separate service accounts per agent capability; rotate keys.
- Deterministic actions: APIs that require explicit IDs and confirmations; no free-text side effects (see the sketch after this list).
- Redaction and minimization: scrub PII before model calls; pass only what's needed per step.
- Hallucination containment: require tool-confirmed facts for order status, payments, and identity.
- Audit trails: log model prompts, responses, tool inputs/outputs, and human approvals.
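A rough sketch of what the whitelist, strict validation, redaction, and audit ideas above can look like in code. The action names, ID formats, and regex-based redaction are illustrative assumptions, not a complete PII solution.

```python
import json
import logging
import re

audit = logging.getLogger("agent.audit")

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact(text: str) -> str:
    # Minimize what reaches the model: strip obvious PII before any prompt is built.
    return EMAIL.sub("[email]", text)

# Whitelisted actions with strict argument schemas; anything else is rejected.
ALLOWED_ACTIONS = {
    "lookup_order": {"order_id": r"ORD-\d{6}"},
    "schedule_callback": {"ticket_id": r"TKT-\d{8}", "slot": r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}"},
}

def execute_action(name: str, args: dict, approved_by=None) -> None:
    schema = ALLOWED_ACTIONS.get(name)
    if schema is None:
        raise PermissionError(f"action {name!r} is not whitelisted")
    if set(args) != set(schema):
        raise ValueError(f"{name} requires exactly {sorted(schema)}")
    for field, pattern in schema.items():
        # Explicit, well-formed IDs only; no free-text arguments reach the system of record.
        if not re.fullmatch(pattern, str(args[field])):
            raise ValueError(f"{field}={args[field]!r} fails validation")
    # Audit trail: who approved what, with which inputs, before the side effect happens.
    audit.info(json.dumps({"action": name, "args": args, "approved_by": approved_by}))
    # ... dispatch to the real system of record here ...
```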
Vendor evaluation checklist
- Do they offer implementation engineers and playbooks for support use cases?
- Proven connectors to your CRM, order system, and knowledge base? How are errors handled?
- Offline evaluation tools with your test set? Support for regression testing before deploys?
- Safety features: data residency, PII controls, action whitelists, and approval flows.
- Reliability metrics shared weekly: tool-call success, rollback reasons, incident history.
- Support model: response SLAs, on-call escalation, and who owns post-mortems.
- Total cost clarity: model usage, integration time, and ongoing ops headcount.
What to pilot now (low risk, high signal)
- Agent assist: suggested replies and macros grounded in policies and past resolutions.
- Auto-tagging and routing: classify intent, product, sentiment, and urgency (see the sketch after this list).
- Case summarization: compress long threads for faster handoffs and QA.
- Content QA: policy checks before sending offers, refunds, or replacements.
- Controlled actions: safe, reversible steps like scheduling or FAQ links, not refunds.
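For the auto-tagging pilot, the key design choice is constraining the model to a fixed label set and refusing to guess when the output falls outside it. A minimal sketch, with call_model as a stand-in for your provider's API and example labels (intent and urgency only) to replace with your own taxonomy:

```python
def call_model(prompt: str) -> str:
    # Hypothetical model call; swap in your provider's client.
    return "billing|high"  # stub output for illustration

INTENTS = {"billing", "shipping", "returns", "technical", "account"}
URGENCY = {"low", "medium", "high"}

def tag_ticket(ticket_text: str) -> dict:
    prompt = (
        "Classify the support ticket.\n"
        f"Intent (one of {sorted(INTENTS)}) and urgency (one of {sorted(URGENCY)}).\n"
        "Answer as: intent|urgency\n\n"
        f"Ticket: {ticket_text}"
    )
    raw = call_model(prompt).strip().lower()
    intent, _, urgency = raw.partition("|")
    # Reject anything outside the allowed labels; route to human triage instead of guessing.
    if intent not in INTENTS or urgency not in URGENCY:
        return {"intent": "triage", "urgency": "medium", "needs_review": True}
    return {"intent": intent, "urgency": urgency, "needs_review": False}
```

Product and sentiment tags can be added the same way; the pattern stays the same: a closed label set, validation on the way out, and a human-triage fallback.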
The bottom line
AI agents can reduce handle time and lift consistency, but only after you wire them into your stack with tight controls. That's why OpenAI and Anthropic are acting like consultants, and why your roadmap should treat agent work as a program, not a widget. Start narrow, instrument deeply, and earn autonomy with data.
Upskill your support team
If you're building an internal playbook and need structured training for support roles, explore our AI courses by job and focused certifications for Claude and ChatGPT.