AI solves six of 10 research-level math problems in first formal benchmark test

Top AI models solved 6-7 out of 10 research-level math problems in the first results from "First Proof," a benchmark designed by mathematicians to test whether AI can genuinely aid their work. Models skipped citations and cost up to $1,000 per query.

Published on: Jun 12, 2026

AI Models Score C-Minus on Rigorous Math Benchmark

The best artificial intelligence models answered six or seven of ten research-level math problems essentially correctly in the first official results from "First Proof," a project designed to evaluate whether large language models can actually help professional mathematicians.

The benchmark differs from previous AI math tests because mathematicians themselves designed it. Rather than rely on metrics created by AI companies, the team wanted to measure what matters to researchers doing actual math work.

Who Tested and What They Found

Only OpenAI and three academic groups (teams from ETH Zurich and Aarhus University, UCLA, and Princeton) agreed to submit models for testing. OpenAI's ChatGPT-5.5 Pro solved four to five problems correctly. IMProofBench, built by the Swiss and Danish researchers, performed best with six or seven correct answers.

Expert graders assembled at Harvard's Center of Mathematical Sciences and Applications to evaluate the responses using the same standard that math journals apply: a solution is accepted if its flaws are minor and could be easily fixed.

The models excelled at finding obscure references and applying established techniques in new ways. In one case, an AI pursued a strategy the problem's authors had identified but abandoned as too tedious. The model's computational stamina pushed through where humans stopped.

The Hidden Infrastructure Behind Performance

State-of-the-art math models aren't single systems. They're multiple models stacked together, each checking and pushing the others to work harder. A basic language model left alone will often claim a problem is impossible or invent plausible-sounding nonsense.

IMProofBench uses this "scaffolding" approach, consulting a council of models including Anthropic's Claude and Google's Gemini. This layering improved results but created a new problem: cost.
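The general shape of such a scaffold can be sketched in a few lines of Python. This is only an illustration of the idea described above, not IMProofBench's actual code: the names `council_solve`, `toy_prover` and `toy_checker` are hypothetical, the "models" are stand-in functions rather than real API calls, and the majority-vote rule is an assumption.

```python
from typing import Callable, List, Optional

# Hypothetical types: a "model" is any function from a prompt to text.
Model = Callable[[str], str]
# A "checker" reads a problem and a candidate proof and returns True to accept.
Checker = Callable[[str, str], bool]


def council_solve(problem: str, prover: Model, checkers: List[Checker],
                  max_rounds: int = 3) -> Optional[str]:
    """Ask one model for a proof, let the others vote, retry on rejection."""
    prompt = problem
    for round_number in range(1, max_rounds + 1):
        candidate = prover(prompt)
        votes = [check(problem, candidate) for check in checkers]
        # Assumption: accept only on a strict majority of checker votes.
        if sum(votes) * 2 > len(votes):
            return candidate
        # Push the prover to try again after a rejected attempt.
        prompt = (f"{problem}\n\nAttempt {round_number} was rejected "
                  "by reviewers. Try again.")
    return None


def toy_prover(prompt: str) -> str:
    # Stand-in for a real model: gives up at first, does better when pushed.
    return "proof sketch" if "rejected" in prompt else "this problem is impossible"


def toy_checker(problem: str, candidate: str) -> bool:
    # Stand-in for a reviewing model: rejects answers that simply give up.
    return "impossible" not in candidate


print(council_solve("Show that X holds.", toy_prover, [toy_checker, toy_checker]))
```

In this sketch every round makes one prover call plus one call per checker, so the number of paid model calls grows with both the size of the council and the number of retries.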

Some queries racked up nearly $1,000 in charges just to produce wrong answers. Researchers worry that such costs could strain research funding, with math grants needing substantial line items just to buy tokens from technology companies.

Persistent Problems With Academic Standards

The models frequently omitted citations. "If it was a human, one might call it plagiarism," said Lauren Williams, a Harvard mathematician on the First Proof team.

Williams hopes the math community will pressure AI companies to align their products with scientific ethics. The team plans to release additional problems in coming weeks and conduct the next official round of testing in the fall.

This round of testing received funding from philanthropic foundations and unrestricted donations from major AI companies, though Anthropic did not submit a model for evaluation.

What This Means for Research

The results show that large language models are becoming useful tools for research, but with significant limitations. They generate substantial amounts of incorrect or irrelevant material alongside useful work, requiring researchers to spend considerable time filtering results.

The benchmark itself represents progress. Williams said the team executed "something that's much closer to being a proper benchmark, as opposed to an experiment" by prioritizing objectivity and transparency.

