Exponential AI Progress Defies Slowdown Claims: METR and GDPval Show 2-Hour Tasks, GPT-5 Nears Expert Level
METR and OpenAI evals show models handling 2-hour software tasks at ~50% success, with GPT-5 and Claude Opus 4.1 near expert level. Plan for longer runs, guardrails, observability.

Exponential AI Progress: METR and OpenAI Evals Show 2-Hour Autonomy and Near-Expert Performance
Claims of an AI slowdown miss what the data says. A new analysis pulls hard numbers from METR and OpenAI evaluations, showing models already completing 2-hour tasks at meaningful success rates, with GPT-5 and Claude Opus 4.1 pushing toward expert-level performance.
The critique is simple: we keep underestimating compounding curves. As the author notes, "Long after the timing and scale of the coming global pandemic was obvious from extrapolating the exponential trends, politicians, journalists and most public commentators kept treating it as a remote possibility or a localized phenomenon."
What the evaluations actually measure
METR's long-horizon autonomy evals show task-length capability doubling every seven months. Systems now complete software engineering tasks of up to two hours at roughly 50% success, with Grok 4, Claude Opus 4.1, and GPT-5 running ahead of prior trend lines.
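To make the compounding concrete, here is a minimal back-of-envelope sketch that assumes a clean 7-month doubling law and a 2-hour starting horizon; the constants and function names are illustrative, and since the newest models are already ahead of the trend line, the dates it produces are if anything conservative.

```python
# Sketch: project the 50%-success task horizon forward under a doubling law.
# Assumptions (illustrative, not from the source data): a clean exponential
# with a 7-month doubling period and a ~2-hour horizon for today's frontier models.
from math import log2

DOUBLING_MONTHS = 7.0      # METR's reported doubling period
HORIZON_NOW_HOURS = 2.0    # ~50%-success task length today

def horizon_after(months: float) -> float:
    """Projected 50%-success task length (hours) after `months` months."""
    return HORIZON_NOW_HOURS * 2 ** (months / DOUBLING_MONTHS)

def months_until(target_hours: float) -> float:
    """Months until the projected horizon reaches `target_hours`."""
    return DOUBLING_MONTHS * log2(target_hours / HORIZON_NOW_HOURS)

if __name__ == "__main__":
    for m in (7, 14, 21):
        print(f"+{m:2d} months: ~{horizon_after(m):.0f}-hour tasks at 50% success")
    # Two doublings take a 2-hour horizon to a full 8-hour workday.
    print(f"8-hour horizon in ~{months_until(8):.0f} months on the 7-month trend")
```

Two doublings from a 2-hour horizon reach an 8-hour workday, which is the arithmetic behind the 2026 projections later in this piece.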
OpenAI's GDPval assesses 44 occupations across nine industries using 1,320 practitioner-authored tasks and blinded grading against human outputs. GPT-5 sits close to human performance across several sectors, while Claude Opus 4.1 matches industry experts on many tasks.
One notable point: OpenAI deserves credit for publishing an evaluation in which another lab's model outperformed its own. Cross-lab transparency matters if you care about objective signals over marketing claims.
Why this matters for engineering teams
- Long-horizon agents are becoming practical: 1-2 hour software tasks at usable success rates change the build-vs-buy calculus for internal tooling.
- Surface-level chat quality is a poor proxy for capability. Structured evals reveal gains that casual testing misses.
- Expect variance across models and tasks. Benchmark headlines at launch often don't hold under standardized, blinded comparisons.
- Design for autonomy: retries, tool use, memory, and budget control will be first-class concerns for platform teams (see the sketch after this list).
- Observability is mandatory: trace spans, function-level logging, and dataset versioning to diagnose long-run failures.
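As a rough illustration of those last two points, here is a minimal sketch of a guardrailed agent loop with per-step tracing; the limits, field names, and the `call_model`/`run_tool` hooks are placeholders for whatever model and tool APIs a platform actually uses, not an existing framework.

```python
# Sketch: a long-horizon agent loop with hard step/time/cost limits and a
# per-step trace for later diagnosis; retry logic would wrap `run_tool`.
# `call_model` and `run_tool` are stand-ins, not a real API.
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class RunLimits:
    max_steps: int = 50            # hard cap on agent iterations
    max_seconds: float = 7200.0    # ~2-hour wall-clock ceiling
    max_cost_usd: float = 5.0      # spend budget for the whole run

@dataclass
class Trace:
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list = field(default_factory=list)

    def record(self, step: int, kind: str, **payload) -> None:
        self.spans.append({"step": step, "kind": kind, "t": time.time(), **payload})

def run_agent(task: str, call_model, run_tool, limits: RunLimits = RunLimits()) -> Trace:
    trace, cost, start = Trace(), 0.0, time.time()
    for step in range(limits.max_steps):
        if time.time() - start > limits.max_seconds or cost > limits.max_cost_usd:
            trace.record(step, "abort", reason="budget_or_timeout")
            break
        action = call_model(task, trace.spans)       # model picks the next action
        cost += action.get("cost_usd", 0.0)
        trace.record(step, "model", action=action.get("name"))
        if action.get("name") == "finish":
            trace.record(step, "done", result=action.get("result"))
            break
        outcome = run_tool(action)                   # execute the tool call
        trace.record(step, "tool", ok=outcome.get("ok", False))
    return trace
```

Persisting `trace.spans` (for example as JSONL) gives you the replayable record of long-run failures that the practical moves below call for.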
Methodology and limitations
METR's tasks average about 3 on a 16-point "messiness" scale; typical software work is 7-8. Real jobs include coordination, ambiguity, and shifting requirements that these evals only partially capture.
GDPval focuses on well-defined, digital tasks with complete instructions. Many roles rely on proprietary systems, stakeholder feedback, and iterative cycles. Even with these constraints, both studies are useful trend trackers.
Model signals you should note
- Grok 4 and Gemini 2.5 Pro underperformed relative to their launch-era benchmark narratives under standardized grading.
- Claude 3.7 Sonnet (seven months old) hit ~50% on one-hour software tasks.
- Grok 4, Claude Opus 4.1, and GPT-5 show 2+ hour autonomy in the METR setup.
If the trend continues
- Mid-2026: autonomous agents working for a full 8-hour day on a non-trivial share of tasks.
- By end of 2026: at least one system reaches expert parity across multiple industries.
- 2027: frequent above-expert performance on many tasks.
These are extrapolations of observed curves, not speculative breakthroughs.
Where the gains show up
- Professional services: legal analysis, financial planning, and management consulting improve with better structure, memory, and tool use.
- Healthcare admin and technical engineering: strong gains in documentation, planning, and structured troubleshooting.
- Manufacturing and retail: steady progress in supply chain optimization, inventory, and customer operations.
- Marketing operations: autonomous agents coordinate data pulls, MMM variants, and content workflows. Reported traffic effects skew positive, but error rates still matter.
On accuracy: a recent review found ~20% error rates across major platforms for PPC strategy tasks, with Google AI Overviews at ~26% incorrect responses and Google Gemini at ~6%. Treat outputs as drafts, not decisions, and gate high-risk actions.
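One hedged way to operationalize "drafts, not decisions" is a simple risk gate in front of any automated action; the tiers, threshold, and routing labels below are illustrative policy choices, not values taken from the error-rate review.

```python
# Sketch: gate model outputs by risk before anything is applied automatically.
# Risk tiers and the confidence threshold are placeholder policy, not data
# from the cited review.
from enum import Enum

class Risk(Enum):
    LOW = 1        # e.g. an internal summary or draft copy
    MEDIUM = 2     # e.g. bid or budget suggestions pending review
    HIGH = 3       # e.g. spend changes or customer-facing actions

AUTO_APPLY_MAX_RISK = Risk.LOW
MIN_CONFIDENCE = 0.9           # placeholder threshold

def route(risk: Risk, confidence: float) -> str:
    """Return 'auto_apply' or 'human_review' for a model-proposed action."""
    if confidence < MIN_CONFIDENCE or risk.value > AUTO_APPLY_MAX_RISK.value:
        return "human_review"
    return "auto_apply"

print(route(Risk.HIGH, 0.95))  # -> human_review: high-risk actions always get a reviewer
print(route(Risk.LOW, 0.95))   # -> auto_apply
```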
Infrastructure headwinds
Europe faces mounting energy constraints for AI builds. The Netherlands reports grid congestion, with data centers drawing electricity comparable to 100,000 households each. Capacity planning and location choices will affect latency, cost, and feasibility.
Evaluation resources
Review the sources and methodology directly before making roadmap calls.
Practical moves for the next two quarters
- Stand up an eval harness mirroring METR/GDPval patterns: blinded comparisons, practitioner-authored tasks, and clear pass criteria tied to business KPIs (a minimal harness sketch follows this list).
- Pilot long-horizon agents with strict guardrails: timeout caps, budget tracking, tool whitelists, and human review gates for irreversible actions.
- Instrument deeply: per-step traces, error taxonomies, deterministic replays, and dataset snapshots for regressions.
- Adopt standardized prompts and templates; version them like code. Measure variance across model families, not just versions.
- Plan for infra: concurrency, cold starts, vector storage growth, and GPU/CPU mix. Energy and region choices now affect cost curves later.
- Train teams on failure modes: hallucination containment, tool-loop breaks, and escalation paths.
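For the first bullet, a minimal harness sketch in the METR/GDPval spirit: practitioner-authored tasks with explicit pass criteria, run across models, with blinded grading left as a stub. `Task`, `run_model`, and `grade_blind` are placeholder names, not part of either published eval.

```python
# Sketch: a small eval harness with practitioner-authored tasks, explicit pass
# criteria, and per-model pass rates. `run_model` and `grade_blind` are stubs
# for your model client and your blinded grading step.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    task_id: str
    prompt: str           # authored by a practitioner, not by the eval team
    pass_criterion: str   # explicit, KPI-linked definition of success

def evaluate(tasks: list[Task],
             models: list[str],
             run_model: Callable[[str, str], str],
             grade_blind: Callable[[Task, str], bool]) -> dict[str, float]:
    """Per-model pass rates; grading never sees which model produced an output."""
    rates: dict[str, float] = {}
    for model in models:
        passes = sum(grade_blind(t, run_model(model, t.prompt)) for t in tasks)
        rates[model] = passes / len(tasks) if tasks else 0.0
    return rates

if __name__ == "__main__":
    # Trivial stubs, just to show the shape of a run.
    tasks = [Task("t1", "Draft a rollback plan for service X", "plan covers data and traffic")]
    print(evaluate(tasks, ["model-a", "model-b"],
                   run_model=lambda m, p: f"{m} answer: rollback steps for {p}",
                   grade_blind=lambda t, out: "rollback" in out.lower()))
```

Versioning the task set and grader alongside prompt templates keeps regressions diagnosable, which is the same discipline the instrumentation bullet asks for.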
Timeline highlights
- 2020: METR clocks GPT-2 at 1-second task capability
- 2022-2024: task length doubles every ~7 months
- May 2024: Google begins AI search experiments affecting publishers
- July 2024: 80% of companies block LLMs (HUMAN Security)
- Jul 10, 2025: One in five PPC strategy answers contains inaccuracies
- Jul 13, 2025: Dutch grid crunch signals EU energy gaps for AI
- Jul 21-27, 2025: Practical guides and industry analysis on agentic AI in advertising
- Aug-Sep 2025: New enterprise AI tools ship; METR/GDPval trends published
Who's behind the analysis
Julian Schrittwieser, Member of Technical Staff at Anthropic and previously a Principal Research Engineer at DeepMind, connects the dots across METR and OpenAI evaluations. His background includes work on AlphaGo, AlphaZero, MuZero, and reinforcement learning with human feedback at scale.
Where to skill up
If you're building with agents or deploying long-horizon workflows, upskill your team on agent patterns, evals, and safety guardrails. See our AI certification for coding and the role-based tracks in our courses by job.
Bottom line
Exponential curves are intact. Don't judge progress by chat vibes; judge it by standardized evaluations and task duration at controlled success rates. Plan for agents that work longer, handle more context, and require better observability. Build the rails now so you're ready when 8-hour autonomy crosses the threshold.