OpenAI and Anthropic Turn to Consulting as Enterprise AI Agents Fall Short

OpenAI and Anthropic are rolling up their sleeves as support bots flub IDs and tool calls. The wins show up only after messy integration, tight guardrails, and small, measured pilots.

Published on: Feb 08, 2026

AI vendors are becoming consultants because support agents keep slipping

Enterprises are learning that rolling out AI agents takes more than a few logins. OpenAI is reportedly hiring hundreds of engineers to customize models with customer data and build agents for clients, with about 60 already in consulting roles and 200+ in technical support. Anthropic is also working hands-on with customers instead of just shipping an app.

The reason is simple: out-of-the-box agents aren't reliable enough for production support. Retailer Fnac reportedly tested OpenAI and Google models for customer service, but agents kept mixing up serial numbers, only stabilizing after help from AI21 Labs.

Why this matters for customer support leaders

  • Expect services, not just software. Real value often starts with a consulting sprint to wire models into your stack and data.
  • Integration is the work. Agents must talk to your systems of record, apply business rules, and handle edge cases before an agent UI is even useful.
  • Rollouts will take longer than a typical SaaS deployment. Budget time for evaluation, guardrails, and change management.
  • Vendor choice now affects process design, not just price. You're buying a playbook and a team, not only a model.

Context: "Frontier" shows the hidden work

OpenAI's new agentic enterprise platform, Frontier, highlights the moving parts: connect to systems of record, encode business context, execute and optimize agents, then layer interfaces on top. That stack explains why providers are leaning into consulting, and why scale in B2B agents may be slower than the pitch decks suggest.

Tools like Claude Cowork can help, but speed-to-value depends on your connectors, policies, and data hygiene. Model gains will lift routine tasks; security and reliability risks won't vanish overnight.
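
To make those layers concrete, here is a minimal sketch of what an agent definition on such a stack might look like. Every type, field, and value below is an illustrative assumption, not Frontier's actual API:

```python
from dataclasses import dataclass

# Illustrative types only: they mirror the four layers named above,
# not any real Frontier or vendor interface.

@dataclass
class SystemOfRecord:
    name: str               # e.g. "order_management"
    base_url: str
    read_only: bool = True  # start read-only; earn write access later

@dataclass
class BusinessContext:
    policies: list[str]          # policy documents fed to retrieval
    escalation_rules: list[str]  # conditions that force a human handoff

@dataclass
class AgentSpec:
    connectors: list[SystemOfRecord]  # layer 1: systems of record
    context: BusinessContext          # layer 2: business context
    allowed_actions: list[str]        # layer 3: explicit action whitelist
    interface: str                    # layer 4: "agent_assist", "chat", "api"

spec = AgentSpec(
    connectors=[SystemOfRecord("order_management", "https://erp.example.internal")],
    context=BusinessContext(policies=["returns-eu-v3"],
                            escalation_rules=["payment_dispute"]),
    allowed_actions=["lookup_order", "draft_reply"],
    interface="agent_assist",
)
```

Notice how little of this is "the model": most of the definition is connectors, policy, and permissions, which is exactly the work the consulting engagements cover.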

Where AI agents break in support

  • Entity mix-ups: serial vs. order vs. ticket ID; customer "John A." vs. "John B." (see the schema sketch after this list).
  • Tool-call failures: missing auth, timeouts, flaky APIs, non-idempotent actions.
  • Partial context: agent sees the ticket but not the warranty, policy, or previous RMAs.
  • Edge cases: returns across channels, bundles, fraud flags, regional policies.
  • State management: multi-step flows without checkpoints or rollback.
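
Many of these failures can be caught at the tool boundary rather than in the prompt. Below is a minimal sketch using Pydantic (one validator choice among many; the ID formats and field names are hypothetical) of a tool input schema that rejects ambiguous identifiers instead of letting the agent guess:

```python
from pydantic import BaseModel, Field, ValidationError

# A strictly-typed tool input that refuses ambiguous IDs.
# The ID patterns are hypothetical; use your systems' real formats.

class WarrantyLookup(BaseModel):
    serial_number: str = Field(pattern=r"^SN-\d{10}$")       # device serial, not an order ID
    order_id: str = Field(pattern=r"^ORD-\d{8}$")            # order, not a ticket
    confirm_customer_id: str = Field(pattern=r"^CUST-\d+$")  # explicit ID, no fuzzy name match

def call_warranty_tool(raw_args: dict) -> dict:
    try:
        args = WarrantyLookup(**raw_args)  # reject anything that doesn't parse
    except ValidationError as err:
        # Surface the error so the agent re-asks instead of guessing.
        return {"error": str(err)}
    return {"status": "ok", "args": args.model_dump()}

# An agent that swaps a ticket ID into serial_number fails loudly here,
# instead of silently querying the wrong record.
print(call_warranty_tool({"serial_number": "TCK-123",
                          "order_id": "ORD-00000001",
                          "confirm_customer_id": "CUST-42"}))
```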

A practical rollout plan for support teams

  • Pick one narrow use case. Example: warranty lookup and reply draft for post-purchase tickets.
  • Define success upfront. Target first-contact resolution (FCR), deflection, average handle time (AHT) variance, CSAT change, and escalation rate.
  • Build a gold test set. 100-300 real tickets with ground-truth answers and tool calls (a scoring harness is sketched after this list).
  • Wire to systems safely. Read-only first. Scope by team, region, and action. Add canary tenants.
  • Human-in-the-loop. Drafts on day one. Graduated autonomy only after stable metrics.
  • Guardrails that bite. PII redaction, policy snippets, tool schemas with strict validation, and output filters.
  • Fine-tune with your data. Start with retrieval over policies; consider supervised fine-tuning once you have labeled examples.
  • Instrument everything. Track tool-call accuracy, hallucination flags, correction rate, and cost per resolved ticket.
  • Have a rollback plan. Version prompts and tools together as one bundle, and keep a one-click revert.
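
For the gold test set above, the scoring harness can be as small as the sketch below. It assumes a JSONL file of tickets with expected tool calls and answers, plus a run_agent callable you supply; both names are placeholders:

```python
import json
from typing import Callable

# Minimal offline scoring over a gold file of JSONL records shaped like
# {"ticket": ..., "expected_tool": ..., "expected_answer": ...}.
# `run_agent` is whatever invokes your agent and returns (tool_called, answer).

def evaluate(run_agent: Callable[[str], tuple[str, str]], gold_path: str) -> dict:
    tool_hits = answer_hits = total = 0
    failures = []
    with open(gold_path) as f:
        for line in f:
            case = json.loads(line)
            tool, answer = run_agent(case["ticket"])
            total += 1
            tool_hits += tool == case["expected_tool"]
            answer_hits += answer.strip() == case["expected_answer"].strip()
            if tool != case["expected_tool"]:
                failures.append(case["ticket"][:80])  # keep a short repro sample
    if total == 0:
        raise ValueError("gold set is empty")
    return {
        "tool_call_accuracy": tool_hits / total,
        "exact_answer_rate": answer_hits / total,
        "sample_failures": failures[:5],
    }
```

Exact-match answer scoring is deliberately strict for a first pass; once the harness exists, many teams swap in semantic similarity or rubric grading.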

Data and security guardrails that prevent headlines

  • Least-privilege access: separate service accounts per agent capability; rotate keys.
  • Deterministic actions: APIs that require explicit IDs and confirmations; no free-text side effects.
  • Redaction and minimization: scrub PII before model calls; pass only what's needed per step (sketched below).
  • Hallucination containment: require tool-confirmed facts for order status, payments, and identity.
  • Audit trails: log model prompts, responses, tool inputs/outputs, and human approvals.
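
As one example of redaction and minimization, the sketch below scrubs obvious PII patterns before any text leaves your boundary. The regexes are illustrative; production systems usually pair patterns with a trained PII detector and a field allowlist:

```python
import re

# Replace obvious PII with placeholders before the model call.
# Patterns are illustrative, not exhaustive.

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{8,}\d"),
    "CARD":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact(text: str) -> tuple[str, dict]:
    """Replace PII with placeholders; return the mapping for the audit trail."""
    found = {}
    for label, pattern in PATTERNS.items():
        for i, match in enumerate(pattern.findall(text)):
            placeholder = f"[{label}_{i}]"
            found[placeholder] = match
            text = text.replace(match, placeholder)
    return text, found

clean, mapping = redact("Customer jane@example.com called from +1 555 010 9999.")
# Send `clean` to the model; keep `mapping` only in the encrypted audit log.
```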

Vendor evaluation checklist

  • Do they offer implementation engineers and playbooks for support use cases?
  • Proven connectors to your CRM, order system, and knowledge base? How are errors handled?
  • Offline evaluation tools with your test set? Support for regression testing before deploys (see the deploy-gate sketch after this checklist)?
  • Safety features: data residency, PII controls, action whitelists, and approval flows.
  • Reliability metrics shared weekly: tool-call success, rollback reasons, incident history.
  • Support model: response SLAs, on-call escalation, and who owns post-mortems.
  • Total cost clarity: model usage, integration time, and ongoing ops headcount.
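
The regression-testing question above is easy to enforce in CI. A minimal deploy gate, reusing the evaluation harness from the rollout plan via a hypothetical my_eval module (the baseline number is a placeholder you'd set from your own history), might look like this:

```python
# A pytest-style deploy gate: fail CI if the new prompt/tool bundle
# regresses on the gold set.

from my_eval import evaluate, run_agent_candidate  # hypothetical module and names

BASELINE_TOOL_ACCURACY = 0.95  # placeholder; derive from your own run history

def test_no_tool_call_regression():
    report = evaluate(run_agent_candidate, "gold_tickets.jsonl")
    assert report["tool_call_accuracy"] >= BASELINE_TOOL_ACCURACY, (
        f"Tool-call accuracy {report['tool_call_accuracy']:.2%} fell below "
        f"the {BASELINE_TOOL_ACCURACY:.0%} baseline; blocking deploy."
    )
```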

What to pilot now (low risk, high signal)

  • Agent assist: suggested replies and macros grounded in policies and past resolutions.
  • Auto-tagging and routing: classify intent, product, sentiment, and urgency (a minimal tagging sketch follows this list).
  • Case summarization: compress long threads for faster handoffs and QA.
  • Content QA: policy checks before sending offers, refunds, or replacements.
  • Controlled actions: safe, reversible steps like scheduling or FAQ links, not refunds.
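
Auto-tagging is the easiest of these to prototype. Here is a minimal sketch using the OpenAI Python SDK; the model name, label sets, and routing fields are assumptions to swap for your own:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def tag_ticket(text: str) -> dict:
    """Classify a ticket into routing tags; labels here are illustrative."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whatever model you've evaluated
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                "Classify the support ticket. Reply as JSON with keys "
                "intent (refund|warranty|shipping|other), sentiment "
                "(negative|neutral|positive), and urgency (low|medium|high)."
            )},
            {"role": "user", "content": text},
        ],
    )
    return json.loads(response.choices[0].message.content)

tags = tag_ticket("My order arrived broken and I need a replacement before Friday.")
# Route on tags["urgency"] and tags["intent"]; log every prediction for QA.
```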

The bottom line

AI agents can reduce handle time and lift consistency, but only after you wire them into your stack with tight controls. That's why OpenAI and Anthropic are acting like consultants, and why your roadmap should treat agent work as a program, not a widget. Start narrow, instrument deeply, and earn autonomy with data.

Upskill your support team

If you're building an internal playbook and need structured training for support roles, explore our AI courses by job and focused certifications for Claude and ChatGPT.

