Anthropic designs three-agent framework to support long-running autonomous software development

Anthropic built a three-agent system to handle multi-hour coding tasks, splitting work between planning, generation, and evaluation agents. The design fixes context loss and self-grading bias that typically derail long autonomous sessions.

Categorized in: AI News, IT and Development
Published on: Apr 05, 2026

Anthropic's Three-Agent System Tackles Long-Running AI Development Tasks

Anthropic has introduced a multi-agent framework designed to handle extended autonomous development sessions, addressing fundamental problems that cause AI systems to lose coherence over multi-hour workflows. The approach divides work among three specialized agents: one for planning, one for generation, and one for evaluation.

The framework targets both frontend design and full-stack software creation. Anthropic engineers built it to solve two critical failures in autonomous coding: context loss between sessions and premature task termination.

How the System Maintains State

Rather than compacting context, a technique that preserves information but makes models cautious as they approach token limits, Anthropic uses structured handoff artifacts. When one agent completes its work, it passes a defined state to the next agent, allowing the workflow to continue without amnesia.

This matters because models operating near context limits often perform worse. The handoff approach sidesteps the problem entirely by resetting context between agents while maintaining continuity through explicit artifacts.
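The idea of a handoff artifact can be sketched in a few lines. Anthropic has not published its schema, so the field names below are illustrative assumptions, not the actual format:

```python
import json
from dataclasses import dataclass, asdict, field

# Hypothetical handoff artifact: explicit state one agent serializes for the
# next, so each agent starts with a fresh context but full continuity.
@dataclass
class HandoffArtifact:
    phase: str                       # which agent produced this ("planner", ...)
    completed_steps: list = field(default_factory=list)
    remaining_steps: list = field(default_factory=list)
    working_state: dict = field(default_factory=dict)  # e.g. last good commit

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

# The planner hands a fresh-context generator everything it needs:
artifact = HandoffArtifact(
    phase="planner",
    completed_steps=["drafted component tree"],
    remaining_steps=["implement header", "implement footer"],
    working_state={"commit": "abc123", "tests_passing": True},
)
restored = HandoffArtifact(**json.loads(artifact.to_json()))
```

Because the artifact is plain JSON, the receiving agent reconstructs the exact state regardless of what was in the previous agent's context window.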

Separating Judgment From Execution

Agents routinely overestimate the quality of their own outputs, especially on subjective tasks like design. Anthropic addressed this by creating a separate evaluator agent, calibrated with specific scoring criteria and few-shot examples.

For frontend work, the evaluator uses four grading criteria: design quality, originality, craft, and functionality. It interacts with live pages using Playwright, then provides detailed feedback that guides the generator through iterative refinement cycles.
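A separated evaluator can be reduced to a rubric plus an aggregation rule. The four criteria below come from the article; the 0-10 scale, equal weighting, and function names are assumptions, and a real evaluator would drive live pages (e.g. via Playwright) and ask a model to produce the scores:

```python
# The evaluator's grading criteria for frontend work, per the article.
CRITERIA = ("design_quality", "originality", "craft", "functionality")

def aggregate_scores(scores: dict) -> float:
    """Average per-criterion scores (assumed 0-10 scale) into one grade."""
    missing = set(CRITERIA) - scores.keys()
    if missing:
        raise ValueError(f"evaluator must score every criterion: {missing}")
    return sum(scores[c] for c in CRITERIA) / len(CRITERIA)

def needs_another_iteration(scores: dict, threshold: float = 8.0) -> bool:
    """The generator keeps refining until the grade clears the bar."""
    return aggregate_scores(scores) < threshold

review = {"design_quality": 7, "originality": 9, "craft": 8, "functionality": 6}
```

Forcing a score for every criterion is what keeps the evaluator honest: the generator receives structured, per-dimension feedback rather than a single pass/fail verdict.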

Prithvi Rajasekaran, engineering lead at Anthropic Labs, said: "Separating the agent doing the work from the agent judging it proves to be a strong lever to address this issue."

Results From Extended Sessions

Iteration cycles range from five to fifteen per run, with some sessions lasting up to four hours. Each cycle produces progressively refined outputs that combine visual distinction with functional accuracy.

The structured approach enables clear task decomposition. Planning, generation, and evaluation remain separate responsibilities with defined handoffs, making it easier to track progress and identify where breakdowns occur.
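The decomposition above can be sketched as a single loop with explicit handoffs. The agent functions here are toy stand-ins (assumptions for illustration), not Anthropic's interface:

```python
# Minimal sketch of the generate → evaluate refinement loop with a bounded
# iteration budget, mirroring the 5-15 cycles described in the article.
def run_pipeline(task, generate, evaluate, max_iters=15, threshold=8.0):
    """Iterate until the evaluator's score clears the threshold
    or the budget runs out; feedback flows through explicit handoffs."""
    artifact, history = None, []
    for _ in range(max_iters):
        artifact = generate(task, feedback=history[-1] if history else None)
        score, feedback = evaluate(artifact)
        history.append(feedback)
        if score >= threshold:
            break
    return artifact, len(history)

# Toy agents: quality starts at 5 and each round of feedback adds a point.
def toy_generate(task, feedback=None):
    return 5 if feedback is None else feedback + 1

def toy_evaluate(artifact):
    return artifact, artifact  # (score, feedback); feedback carries state

result, rounds = run_pipeline("landing page", toy_generate, toy_evaluate)
```

Keeping generation and evaluation as separate callables is what makes breakdowns traceable: each iteration leaves a scored artifact behind.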

What Practitioners Are Seeing

Industry observers have noted the framework's practical advantages. The separation of evaluation from generation improves reliability by removing a conflict of interest: the agent generating code no longer judges its own work.

The structure itself (JSON specifications, enforced testing, commit-by-commit progress) prevents the context amnesia that typically derails long-running agents. Every new session starts from a known working state.

Operational Considerations

Teams implementing this framework need to establish evaluation criteria upfront and calibrate scoring mechanisms. Agents execute evaluations automatically, but human oversight remains necessary for initial setup and quality validation.

The workflow supports both parallel and sequential agent execution, depending on task dependencies. This flexibility allows teams to distribute processing across multiple agents or run them in sequence as needed.
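The sequential-versus-parallel choice comes down to task dependencies, which a short sketch makes concrete. The agent functions are again toy stand-ins, not Anthropic's API:

```python
from concurrent.futures import ThreadPoolExecutor

def run_sequential(agents, state):
    """Each step depends on the previous one's output (plan → generate → ...)."""
    for agent in agents:
        state = agent(state)
    return state

def run_parallel(agents, state):
    """Independent subtasks fan out over the same starting state."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda agent: agent(state), agents))

plan     = lambda s: s + ["planned"]
generate = lambda s: s + ["generated"]
evaluate = lambda s: s + ["evaluated"]

seq = run_sequential([plan, generate, evaluate], [])
par = run_parallel([plan, generate, evaluate], [])
```

Dependent phases must run in order (`seq` accumulates all three steps), while independent subtasks, such as evaluating several page variants, can run concurrently (`par` yields one result per agent).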

What Comes Next

As models improve, the harness's role will shift. Some tasks may move directly to next-generation models without requiring multi-agent coordination. Simultaneously, better models enable the harness to tackle more complex work.

Engineers should experiment with harness configurations, monitor execution traces, decompose tasks carefully, and adjust workflows as model capabilities evolve. The optimal combination of agents and responsibilities will continue to change.


