US Claims China's AI Is 8 Months Behind. The Evidence Doesn't Support It.
The Center for AI Standards and Innovation (CAISI), a US government body, released a report concluding that China's DeepSeek V4 Pro lags American frontier models by eight months. The finding relies on benchmarks that CAISI developed internally and controls entirely: tests that cannot be independently verified.
This matters to government officials evaluating AI capability gaps and making procurement decisions. The headline figure comes from a specific statistical comparison to GPT-5, released eight months prior. But the underlying data rests on proprietary tests where verification is impossible.
The Methodology Problem
CAISI did commit to its benchmark suite before seeing results, a practice most evaluators skip. The organization published confidence intervals and described its methods in detail. That transparency is genuine.
But three of the most damaging benchmarks for DeepSeek (PortBench, CTF-Archive-Diamond, and the ARC-AGI-2 semi-private set) are either CAISI-developed or rely on private datasets. You cannot verify an experiment you cannot see.
DeepSeek claims V4 Pro performs on par with Opus 4.6 and GPT-5.4, models released two months ago, not eight. Artificial Analysis, an independent evaluator with no geopolitical stake, reports that the US-China capability gap is holding steady rather than widening.
When one competitor designs the test, administers it, and declares itself the winner, the result is a credentialed opinion, not science.
Cost Changes the Picture
CAISI's own cost comparison shows DeepSeek V4 Pro cheaper than GPT-5.4 mini on five of seven tests, sometimes by more than 50 percent. Cursor, a widely used AI coding assistant, built its in-house model on a Chinese open-weight base specifically for the cost advantage over OpenAI and Anthropic.
Capability benchmarks measure one characteristic. Cost per useful task determines scalability and real-world deployment. By that measure, the gap narrows considerably.
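The "cost per useful task" framing can be made concrete with back-of-the-envelope arithmetic. The sketch below is a minimal illustration with invented prices, token counts, and success rates (none drawn from the CAISI report): a model with a lower success rate can still win on cost per successful task if its per-token price is low enough.

```python
# Hypothetical illustration: headline accuracy is not the same as
# cost-effectiveness. All numbers below are invented for illustration,
# not taken from the CAISI report or any vendor price list.

def cost_per_useful_task(price_per_m_tokens: float,
                         tokens_per_task: int,
                         success_rate: float) -> float:
    """Dollars spent per task that actually succeeds."""
    raw_cost = price_per_m_tokens * tokens_per_task / 1_000_000
    return raw_cost / success_rate

# Model A: higher capability, higher price (hypothetical figures)
a = cost_per_useful_task(price_per_m_tokens=10.0,
                         tokens_per_task=50_000,
                         success_rate=0.79)

# Model B: lower capability, much lower price (hypothetical figures)
b = cost_per_useful_task(price_per_m_tokens=2.0,
                         tokens_per_task=50_000,
                         success_rate=0.46)

print(f"Model A: ${a:.3f} per useful task")  # higher capability
print(f"Model B: ${b:.3f} per useful task")  # cheaper per success anyway
```

Under these invented numbers, the lower-capability model still costs less per successful task, which is the shape of the argument buyers optimizing for deployment cost would make.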
What the Numbers Actually Show
The US does have a capability lead in some areas. On ARC-AGI-2 tests, GPT-5.5 scored 79 percent versus DeepSeek's 46 percent. That gap is real.
But "eight months behind" is an exact figure derived from internal comparisons conducted by one competitor against another. It assumes both sides optimize for the same outcomes. They may not.
The US likely leads on capability. China leads on cost. Framing this as a race requires assuming both countries prioritize identical metrics.
Government officials should treat the eight-month claim as one data point from a single source, not settled fact. Independent verification remains necessary before using this assessment to inform policy or budgeting decisions.