US government report puts China 8 months behind in AI, but the benchmarks behind that claim are hard to verify

A US government report claims China's DeepSeek lags American AI by eight months, but the finding relies on benchmarks the government itself designed and controls. Independent evaluators report the gap holding steady rather than widening, and DeepSeek beats US models on cost.

Published on: May 04, 2026

US Claims China's AI Is 8 Months Behind. The Evidence Doesn't Support It.

The Center for AI Standards and Innovation (CAISI), a US government body, released a report concluding that China's DeepSeek V4 Pro lags American frontier models by eight months. The finding relies on benchmarks that CAISI developed internally and controls entirely - tests that cannot be independently verified.

This matters to government officials evaluating AI capability gaps and making procurement decisions. The headline figure comes from a specific statistical comparison to GPT-5, released eight months prior. But the underlying data rests on proprietary tests where verification is impossible.
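A time-lag figure like this is typically derived by matching the challenger's benchmark score to the most recent home-team model it still equals, then counting months since that model's release. The sketch below illustrates the mechanics only; every score and date in it is invented for illustration, not taken from CAISI's report.

```python
from datetime import date

# Hypothetical illustration of a "months behind" calculation.
# All scores and release dates below are invented, NOT CAISI's data.
us_releases = [
    (date(2025, 9, 1), 62.0),  # an earlier frontier model's benchmark score
    (date(2026, 3, 1), 74.0),  # a later frontier model's score
]
challenger_score = 63.0        # hypothetical DeepSeek score on the same suite
report_date = date(2026, 5, 1)

# Most recent US release the challenger still matches or beats.
matched = max(
    (d for d, s in us_releases if challenger_score >= s),
    default=None,
)

lag_months = (report_date.year - matched.year) * 12 + (
    report_date.month - matched.month
)
print(lag_months)  # 8
```

Note how sensitive the output is to which benchmarks define "matches": swap the suite and the matched release, and therefore the headline number, can move by many months.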

The Methodology Problem

CAISI did commit to its benchmark suite before seeing results, a practice most evaluators skip. The organization published confidence intervals and described its methods in detail. That transparency is genuine.

But three of the most damaging benchmarks for DeepSeek - PortBench, CTF-Archive-Diamond, and ARC-AGI-2 semi-private - are either CAISI-developed or use private datasets. You cannot verify an experiment you cannot see.

DeepSeek claims V4 Pro performs on par with Opus 4.6 and GPT-5.4, models released two months ago, not eight. Artificial Analysis, an independent evaluator without geopolitical interests, reports the US-China capability gap remains steady rather than widening.

When one competitor designs the test, administers it, and declares itself the winner, the result is a credentialed opinion, not science.

Cost Changes the Picture

CAISI's own cost comparison shows DeepSeek V4 Pro cheaper than GPT-5.4 mini on five of seven tests, sometimes by more than 50 percent. Cursor, a widely used AI coding assistant, built its in-house model on a Chinese open-weight base specifically for the cost advantage over OpenAI and Anthropic.

Capability benchmarks measure one characteristic. Cost per useful task determines scalability and real-world deployment. By that measure, the gap narrows considerably.
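Cost per useful task divides the price of an attempt by the fraction of attempts that succeed, which is why a cheaper, less accurate model can still win on deployment economics. The numbers below are illustrative assumptions, not CAISI's published figures.

```python
# Hypothetical cost-per-useful-task comparison.
# Prices and success rates are illustrative assumptions only.
models = {
    "frontier_model": {"cost_per_task": 0.040, "success_rate": 0.79},
    "cheaper_model": {"cost_per_task": 0.012, "success_rate": 0.46},
}

cost_per_solved = {
    name: m["cost_per_task"] / m["success_rate"]  # $ per successful task
    for name, m in models.items()
}

for name, cost in cost_per_solved.items():
    print(f"{name}: ${cost:.4f} per solved task")
```

Under these assumed numbers the cheaper model costs roughly half as much per solved task despite a far lower raw score, which is the sense in which cost narrows a capability gap.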

What the Numbers Actually Show

The US does have a capability lead in some areas. On ARC-AGI-2 tests, GPT-5.5 scored 79 percent versus DeepSeek's 46 percent. That gap is real.

But "eight months behind" carries false precision: the figure derives from internal comparisons one competitor conducted against another. It also assumes both sides optimize for the same outcomes. They may not.

The US likely leads on capability. China leads on cost. Framing this as a race requires assuming both countries prioritize identical metrics.

Government officials should treat the eight-month claim as one data point from a single source, not settled fact. Independent verification remains necessary before using this assessment to inform policy or budgeting decisions.

