Exponential AI Progress Defies Slowdown Claims: METR and GDPval Show 2-Hour Tasks, GPT-5 Nears Expert Level
METR and OpenAI evals show models handling 2-hour software tasks at ~50% success, with GPT-5 and Claude Opus 4.1 near expert level. Plan for longer runs, guardrails, observability.

Exponential AI Progress: METR and OpenAI Evals Show 2-Hour Autonomy and Near-Expert Performance
Claims of an AI slowdown miss what the data says. A new analysis pulls hard numbers from METR and OpenAI evaluations, showing models already completing 2-hour tasks at meaningful success rates, with GPT-5 and Claude Opus 4.1 pushing toward expert-level performance.
The critique is simple: we keep underestimating compounding curves. As the author notes, "Long after the timing and scale of the coming global pandemic was obvious from extrapolating the exponential trends, politicians, journalists and most public commentators kept treating it as a remote possibility or a localized phenomenon."
What the evaluations actually measure
METR's long-horizon autonomy evals show task-length capability doubling every seven months. Systems now complete software engineering tasks of up to two hours at roughly 50% success, with Grok 4, Claude Opus 4.1, and GPT-5 running ahead of prior trend lines.
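To make the compounding concrete, here is a minimal back-of-envelope sketch that assumes a clean 7-month doubling law and a 2-hour starting horizon; the constants and function names are illustrative, and since the newest models are already ahead of the trend line, the dates it produces are if anything conservative.

```python
# Sketch: project the 50%-success task horizon forward under a doubling law.
# Assumptions (illustrative, not from the source data): a clean exponential
# with a 7-month doubling period and a ~2-hour horizon for today's frontier models.
from math import log2

DOUBLING_MONTHS = 7.0      # METR's reported doubling period
HORIZON_NOW_HOURS = 2.0    # ~50%-success task length today

def horizon_after(months: float) -> float:
    """Projected 50%-success task length (hours) after `months` months."""
    return HORIZON_NOW_HOURS * 2 ** (months / DOUBLING_MONTHS)

def months_until(target_hours: float) -> float:
    """Months until the projected horizon reaches `target_hours`."""
    return DOUBLING_MONTHS * log2(target_hours / HORIZON_NOW_HOURS)

if __name__ == "__main__":
    for m in (7, 14, 21):
        print(f"+{m:2d} months: ~{horizon_after(m):.0f}-hour tasks at 50% success")
    # Two doublings take a 2-hour horizon to a full 8-hour workday.
    print(f"8-hour horizon in ~{months_until(8):.0f} months on the 7-month trend")
```

Two doublings from a 2-hour horizon reach an 8-hour workday, which is the arithmetic behind the 2026 projections later in this piece.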
OpenAI's GDPval assesses 44 occupations across nine industries using 1,320 practitioner-authored tasks and blinded grading against human outputs. GPT-5 sits close to human performance across several sectors, while Claude Opus 4.1 matches industry experts on many tasks.
One notable point: OpenAI deserves credit for publishing an evaluation in which another lab's model outperformed its own. Cross-lab transparency matters if you care about objective signals over marketing claims.
Why this matters for engineering teams
- Long-horizon agents are becoming practical: 1-2 hour software tasks at usable success rates change the build-vs-buy calculus for internal tooling.
- Surface-level chat quality is a poor proxy for capability. Structured evals reveal gains that casual testing misses.
- Expect variance across models and tasks. Benchmark headlines at launch often don't hold under standardized, blinded comparisons.
- Design for autonomy: retries, tool use, memory, and budget control will be first-class concerns for platform teams (see the sketch after this list).
- Observability is mandatory: trace spans, function-level logging, and dataset versioning to diagnose long-run failures.
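As a rough illustration of those last two points, here is a minimal sketch of a guardrailed agent loop with per-step tracing; the limits, field names, and the `call_model`/`run_tool` hooks are placeholders for whatever model and tool APIs a platform actually uses, not an existing framework.

```python
# Sketch: a long-horizon agent loop with hard step/time/cost limits and a
# per-step trace for later diagnosis; retry logic would wrap `run_tool`.
# `call_model` and `run_tool` are stand-ins, not a real API.
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class RunLimits:
    max_steps: int = 50            # hard cap on agent iterations
    max_seconds: float = 7200.0    # ~2-hour wall-clock ceiling
    max_cost_usd: float = 5.0      # spend budget for the whole run

@dataclass
class Trace:
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list = field(default_factory=list)

    def record(self, step: int, kind: str, **payload) -> None:
        self.spans.append({"step": step, "kind": kind, "t": time.time(), **payload})

def run_agent(task: str, call_model, run_tool, limits: RunLimits = RunLimits()) -> Trace:
    trace, cost, start = Trace(), 0.0, time.time()
    for step in range(limits.max_steps):
        if time.time() - start > limits.max_seconds or cost > limits.max_cost_usd:
            trace.record(step, "abort", reason="budget_or_timeout")
            break
        action = call_model(task, trace.spans)       # model picks the next action
        cost += action.get("cost_usd", 0.0)
        trace.record(step, "model", action=action.get("name"))
        if action.get("name") == "finish":
            trace.record(step, "done", result=action.get("result"))
            break
        outcome = run_tool(action)                   # execute the tool call
        trace.record(step, "tool", ok=outcome.get("ok", False))
    return trace
```

Persisting `trace.spans` (for example as JSONL) gives you the replayable record of long-run failures that the practical moves below call for.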
Methodology and limitations
METR's tasks average about 3 on a 16-point "messiness" scale; typical software work is 7-8. Real jobs include coordination, ambiguity, and shifting requirements that these evals only partially capture.
GDPval focuses on well-defined, digital tasks with complete instructions. Many roles rely on proprietary systems, stakeholder feedback, and iterative cycles. Even with these constraints, both studies are useful trend trackers.
Model signals you should note
- Grok 4 and Gemini 2.5 Pro underperformed relative to their launch-era benchmark narratives under standardized grading.
- Claude 3.7 Sonnet (seven months old) hit ~50% on one-hour software tasks.
- Grok 4, Claude Opus 4.1, and GPT-5 show 2+ hour autonomy in the METR setup.
If the trend continues
- Mid-2026: autonomous agents working for a full 8-hour day on a non-trivial share of tasks.
- By end of 2026: at least one system reaches expert parity across multiple industries.
- 2027: frequent above-expert performance on many tasks.
These are extrapolations of observed curves, not speculative breakthroughs.
Where the gains show up
- Professional services: legal analysis, financial planning, and management consulting improve with better structure, memory, and tool use.
- Healthcare admin and technical engineering: strong gains in documentation, planning, and structured troubleshooting.
- Manufacturing and retail: steady progress in supply chain optimization, inventory, and customer operations.
- Marketing operations: autonomous agents coordinate data pulls, MMM variants, and content workflows. Reported traffic effects skew positive, but error rates still matter.
On accuracy: a recent review found ~20% error rates across major platforms for PPC strategy tasks, with Google AI Overviews at ~26% incorrect responses and Google Gemini at ~6%. Treat outputs as drafts, not decisions, and gate high-risk actions.
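One hedged way to operationalize "drafts, not decisions" is a simple risk gate in front of any automated action; the tiers, threshold, and routing labels below are illustrative policy choices, not values taken from the error-rate review.

```python
# Sketch: gate model outputs by risk before anything is applied automatically.
# Risk tiers and the confidence threshold are placeholder policy, not data
# from the cited review.
from enum import Enum

class Risk(Enum):
    LOW = 1        # e.g. an internal summary or draft copy
    MEDIUM = 2     # e.g. bid or budget suggestions pending review
    HIGH = 3       # e.g. spend changes or customer-facing actions

AUTO_APPLY_MAX_RISK = Risk.LOW
MIN_CONFIDENCE = 0.9           # placeholder threshold

def route(risk: Risk, confidence: float) -> str:
    """Return 'auto_apply' or 'human_review' for a model-proposed action."""
    if confidence < MIN_CONFIDENCE or risk.value > AUTO_APPLY_MAX_RISK.value:
        return "human_review"
    return "auto_apply"

print(route(Risk.HIGH, 0.95))  # -> human_review: high-risk actions always get a reviewer
print(route(Risk.LOW, 0.95))   # -> auto_apply
```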
Infrastructure headwinds
Europe faces mounting energy constraints for AI builds. The Netherlands reports grid congestion, with data centers drawing electricity comparable to 100,000 households each. Capacity planning and location choices will affect latency, cost, and feasibility.
Evaluation resources
Review the sources and methodology directly before making roadmap calls.
Practical moves for the next two quarters
- Stand up an eval harness mirroring METR/GDPval patterns: blinded comparisons, practitioner-authored tasks, and clear pass criteria tied to business KPIs (a minimal harness sketch follows this list).
- Pilot long-horizon agents with strict guardrails: timeout caps, budget tracking, tool whitelists, and human review gates for irreversible actions.
- Instrument deeply: per-step traces, error taxonomies, deterministic replays, and dataset snapshots for regressions.
- Adopt standardized prompts and templates; version them like code. Measure variance across model families, not just versions.
- Plan for infra: concurrency, cold starts, vector storage growth, and GPU/CPU mix. Energy and region choices now affect cost curves later.
- Train teams on failure modes: hallucination containment, tool-loop breaks, and escalation paths.
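For the first bullet, a minimal harness sketch in the METR/GDPval spirit: practitioner-authored tasks with explicit pass criteria, run across models, with blinded grading left as a stub. `Task`, `run_model`, and `grade_blind` are placeholder names, not part of either published eval.

```python
# Sketch: a small eval harness with practitioner-authored tasks, explicit pass
# criteria, and per-model pass rates. `run_model` and `grade_blind` are stubs
# for your model client and your blinded grading step.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    task_id: str
    prompt: str           # authored by a practitioner, not by the eval team
    pass_criterion: str   # explicit, KPI-linked definition of success

def evaluate(tasks: list[Task],
             models: list[str],
             run_model: Callable[[str, str], str],
             grade_blind: Callable[[Task, str], bool]) -> dict[str, float]:
    """Per-model pass rates; grading never sees which model produced an output."""
    rates: dict[str, float] = {}
    for model in models:
        passes = sum(grade_blind(t, run_model(model, t.prompt)) for t in tasks)
        rates[model] = passes / len(tasks) if tasks else 0.0
    return rates

if __name__ == "__main__":
    # Trivial stubs, just to show the shape of a run.
    tasks = [Task("t1", "Draft a rollback plan for service X", "plan covers data and traffic")]
    print(evaluate(tasks, ["model-a", "model-b"],
                   run_model=lambda m, p: f"{m} answer: rollback steps for {p}",
                   grade_blind=lambda t, out: "rollback" in out.lower()))
```

Versioning the task set and grader alongside prompt templates keeps regressions diagnosable, which is the same discipline the instrumentation bullet asks for.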
Timeline highlights
- 2020: METR clocks GPT-2 at 1-second task capability
- 2022-2024: task length doubles every ~7 months
- May 2024: Google begins AI search experiments affecting publishers
- July 2024: 80% of companies block LLMs (HUMAN Security)
- Jul 10, 2025: One in five PPC strategy answers contains inaccuracies
- Jul 13, 2025: Dutch grid crunch signals EU energy gaps for AI
- Jul 21-27, 2025: Practical guides and industry analysis on agentic AI in advertising
- Aug-Sep 2025: New enterprise AI tools ship; METR/GDPval trends published
Who's behind the analysis
Julian Schrittwieser, Member of Technical Staff at Anthropic and previously a Principal Research Engineer at DeepMind, connects the dots across METR and OpenAI evaluations. His background includes work on AlphaGo, AlphaZero, MuZero, and reinforcement learning with human feedback at scale.
Where to skill up
If you're building with agents or deploying long-horizon workflows, upskill your team on agent patterns, evals, and safety guardrails. See our AI certification for coding and the role-based tracks in our courses by job.
Bottom line
Exponential curves are intact. Don't judge progress by chat vibes; judge it by standardized evaluations and task duration at controlled success rates. Plan for agents that work longer, handle more context, and require better observability. Build the rails now so you're ready when 8-hour autonomy crosses the threshold.