Anthropic study finds AI agents still mostly coding as autonomous runs grow longer

AI agents are taking off in dev, accounting for about half of tool calls, while other teams still test the waters. Autonomy climbs as trust grows, with longer runs and more auto-approvals.

Published on: Feb 23, 2026

AI agents are surging in software development, and almost nowhere else

Anthropic analyzed millions of real interactions across its coding agent and public API. One takeaway stands out: software engineering accounts for nearly half of all agent tool calls. Customer service, sales, finance, BI, and e-commerce barely register a few percentage points each. We're still in the early days of agent adoption outside dev work.

What the usage data says

  • Software development dominates agent usage (~50% of public API tool calls).
  • Other sectors are experimental at best, each at only a few percentage points of traffic.
  • Claude Code's longest autonomous sessions nearly doubled from Oct 2025 to Jan 2026: under 25 minutes to 45+ minutes.
  • Median work step time holds steady around 45 seconds.

Anthropic frames this as simple market reality: developers adopted agents first, built workflows around them, and kept pushing scope. Other industries are just starting to test where agents fit.

Autonomy is rising, but underused

Autonomy didn't spike with single model releases. It climbed steadily across versions, which points to human factors: users building trust, assigning bigger tasks, and product improvements that reduce friction. Anthropic calls this a "deployment overhang": models can handle more than people currently ask them to do.

That lines up with commentary from OpenAI and Microsoft leadership. And it's consistent with an external evaluation by METR, which estimates that Claude Opus 4.5 can complete tasks that would take a human nearly five hours, at a 50% success rate.

How user behavior shifts with experience

  • New users fully auto-approve ~20% of sessions; after ~750 sessions, that passes 40%.
  • Interruptions rise modestly with experience, from ~5% of work steps to ~9%: users let agents run, then step in only when needed.
  • Public API oversight stays high for simple tasks (~87%) and drops for complex ones (~67%).

Even experienced users don't intervene most of the time. Over 90% of work steps run without interruption.

Claude self-checks more than humans interrupt

On the hardest tasks, Claude pauses itself to ask questions more often than humans step in. That's a useful safety valve: the model recognizes uncertainty and asks before it commits.

  • Why Claude pauses itself: present choices between approaches (35%), request missing technical context or corrections (32%), gather diagnostics or test results (21%), clarify vague or incomplete requests (13%), request missing credentials/tokens/access (12%), get approval before taking action (11%).
  • Why humans interrupt: Claude was slow/hanging/excessive (17%), they had enough help to continue solo (7%), they wanted to take the next step themselves (7%), requirements changed mid-task (5%).

Anthropic's stance: encourage this asking behavior, pair it with external safeguards (auth systems, scoped permissions), and monitor outcomes post-deployment. Forcing manual approval on every micro-action adds drag without guaranteed safety gains.

What this means for engineering leaders and product teams

  • Scope the sandbox first: grant least-privilege access, repo subsets, and read-only defaults. Add write/deploy rights progressively as the agent proves itself.
  • Instrument everything: log tool calls, cost, latency, error classes, and reversal/rollback events. Alert on anomaly patterns (spikes in retries, unusual file churn, long stalls).
  • Gate by risk, not by ritual: require approvals for data writes, production deploys, and credential use. Don't gate every trivial action.
  • Teach the agent to ask: prompt for "ask-before-act" on destructive steps, ambiguous specs, or missing context. Reward clarification over blind execution.
  • Design for long sessions: batch related work, set checkpoints, and cache context. Give the agent room to run while maintaining review hooks.
  • Track the overhang: maintain a backlog of tasks the agent could plausibly own. Measure success rate and intervention rate per task class.
  • Protect credentials: use short-lived tokens, per-task scopes, environment segregation, and rate limits. Log access and rotate aggressively.
  • Create runbooks: define interruption triggers, rollback steps, and escalation paths. Make it obvious when to step in.
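
The "gate by risk" and "instrument everything" ideas above can be sketched together. This is a minimal hypothetical example, not from the study; the risk tiers, action names, and `gate_tool_call` helper are all illustrative assumptions:

```python
import json
import time

# Hypothetical risk tiers: gate approvals by action risk, not by ritual.
HIGH_RISK = {"deploy_production", "write_database", "use_credentials"}

def gate_tool_call(action: str, payload: dict, approver=None) -> dict:
    """Auto-approve low-risk actions; require explicit sign-off for
    high-risk ones. Every decision is logged for later monitoring."""
    entry = {"ts": time.time(), "action": action, "payload": payload}
    if action in HIGH_RISK:
        # approver is a callable (e.g. a human review hook); no approver
        # means the high-risk action is blocked by default.
        approved = bool(approver and approver(action, payload))
        entry["decision"] = "approved" if approved else "blocked"
    else:
        entry["decision"] = "auto-approved"
    print(json.dumps(entry))  # in practice: ship to your logging pipeline
    return entry

# A test run proceeds unattended; a production deploy needs a human.
gate_tool_call("run_tests", {"suite": "unit"})
gate_tool_call("deploy_production", {"env": "prod"},
               approver=lambda action, payload: False)
```

The point of the sketch is the shape, not the specifics: trivial reads and test runs flow through with only a log entry, while writes, deploys, and credential use hit a single review hook that defaults to blocking.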

Why most non-dev teams lag, and how to catch up

Developer teams had obvious agent-friendly loops: code edit → test → iterate. Other functions need similar loops defined. Start by mapping tasks with clear success criteria, low blast radius, and frequent repetition (e.g., report generation, data grooming, internal QA).

Stand up pilots with tight scopes, real metrics, and weekly iteration. Expand only after you've proven value and tightened controls.
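
Tracking "real metrics" per task class could look like the following sketch. The `PilotMetrics` class and the task-class names are hypothetical, chosen to match the success-rate and intervention-rate measures suggested above:

```python
from collections import defaultdict

class PilotMetrics:
    """Per-task-class pilot metrics: success rate and human-intervention
    rate, so scope is expanded only where both look healthy."""

    def __init__(self):
        self.runs = defaultdict(
            lambda: {"total": 0, "success": 0, "interrupted": 0}
        )

    def record(self, task_class: str, success: bool, interrupted: bool):
        r = self.runs[task_class]
        r["total"] += 1
        r["success"] += success        # bool counts as 0/1
        r["interrupted"] += interrupted

    def report(self) -> dict:
        return {
            cls: {
                "success_rate": r["success"] / r["total"],
                "intervention_rate": r["interrupted"] / r["total"],
            }
            for cls, r in self.runs.items()
        }

m = PilotMetrics()
m.record("report_generation", success=True, interrupted=False)
m.record("report_generation", success=True, interrupted=True)
m.record("data_grooming", success=False, interrupted=True)
print(m.report())
```

Weekly review of such a report gives a concrete expansion rule: grow the agent's scope in a task class only when success stays high and interventions trend down.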
