CritPt Benchmark Finds LLMs Top Out Around 10% on Research-Scale Physics

CritPt tests LLMs on unpublished, research-grade physics problems across 10 areas, with auto-grading of steps and outputs. Top models score 4-10% on full challenges, exposing gaps.

Categorized in: AI News, Science and Research
Published on: Oct 05, 2025

CritPt: A Benchmark That Tests LLMs on Real Physics Reasoning

Most AI evaluations live in neat datasets. Real research is messy, open-ended, and unforgiving. CritPt (Complex Research using Integrated Thinking, Physics Test) closes that gap by testing large language models on unpublished, research-grade physics problems sourced from active labs.

It spans ten research areas and probes whether models can plan, reason, compute, and verify across full research workflows. Early results show a clear gap: even top models with coding tools reach only about 10% accuracy on full challenges.

What CritPt Is

CritPt is a benchmark built from 71 composite, research-scale challenges and 190 modular checkpoints. Over 50 physicists authored the problems based on current projects, not textbook exercises. Every problem is self-contained, verifiable, and structured to prevent shortcut guessing or data contamination.

The benchmark uses a physics-aware auto-grading system that checks arrays of floating-point numbers, complex symbolic expressions, and reasoning steps, not just final answers.
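
As a rough illustration of what those checks involve, here is a minimal sketch that compares a float array against a reference within tolerance and tests two symbolic expressions for equivalence rather than string identity. The function names and tolerances are assumptions for this example, not CritPt's actual grader.

```python
import numpy as np
import sympy as sp

def grade_array(predicted, reference, rel_tol=1e-6, abs_tol=1e-9):
    """Accept a predicted float array if it matches the reference within tolerance."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if predicted.shape != reference.shape:
        return False
    return bool(np.allclose(predicted, reference, rtol=rel_tol, atol=abs_tol))

def grade_symbolic(predicted_expr, reference_expr, symbols):
    """Accept a symbolic answer if it is mathematically equivalent to the reference."""
    pred = sp.sympify(predicted_expr, locals=symbols)
    ref = sp.sympify(reference_expr, locals=symbols)
    return sp.simplify(pred - ref) == 0

# Equivalent forms of the same expression should both pass.
k = sp.symbols("k", positive=True)
print(grade_array([1.0, 2.0000000001], [1.0, 2.0]))               # True
print(grade_symbolic("2*sin(k/2)**2", "1 - cos(k)", {"k": k}))    # True
```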

Why This Matters

Physics research involves multi-step reasoning, modeling, and translation between theory, computation, and experiment. LLMs have not earned trust here yet. CritPt gives a realistic, standardized way to track progress and pinpoint failure modes that actually block research.

If you're building or adopting AI for scientific work, this is the kind of benchmark that keeps development honest.

What It Covers: Ten Research Areas

  • Topological quantum materials (topological insulators, crystalline insulators)
  • Two-dimensional materials (graphene, WSe2, moiré systems)
  • Strongly correlated electrons and emergent phenomena
  • Quantum spin liquids, Kitaev physics, and Majorana platforms
  • Excitations and excitons in quantum materials
  • Disorder, localization, and transport in complex systems
  • Magnetism and spin dynamics
  • Spectroscopy and scattering (RIXS, EELS, neutron techniques)
  • Dark matter searches (axions, dark photons, scalar fields) using interferometers and gravitational detectors
  • Astrophysics and nuclear physics tasks linked to the above domains

Topological Quantum Materials and Exotic Phases: What's Inside

Problems include modeling topological and crystalline insulators, analyzing correlated behavior, and predicting signatures of fractionalization. Tasks range from fitting and interpreting spectroscopic data (RIXS, EELS) to simulating spin excitations, disorder effects, and europium-based compounds.

Several challenges link condensed matter and cosmology via dark-matter detection using interferometers and gravitational-wave infrastructure. For context on detectors used in related searches, see LIGO. For ongoing physics preprints, arXiv remains a central resource.

How the Problems Are Built

  • Unpublished, hand-crafted by active researchers; grounded in ongoing work
  • Self-contained statements with machine-verifiable answers (arrays, symbols)
  • Strict anti-contamination safeguards; no recycled textbook items
  • Composite challenges simulate full projects; checkpoints isolate subskills (sketched below)
  • Physics-informed auto-grader evaluates reasoning steps, not just endpoints
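
To make the composite-versus-checkpoint structure concrete, here is a minimal sketch of how a self-contained challenge with machine-verifiable answers might be represented. The dataclass names, fields, and the toy dispersion values are invented for illustration; they do not reflect CritPt's internal format.

```python
from dataclasses import dataclass, field

import numpy as np
import sympy as sp

@dataclass
class Checkpoint:
    """One modular subtask with a machine-verifiable reference answer."""
    prompt: str
    reference: object        # a numeric array or a symbolic expression
    rel_tol: float = 1e-6    # tolerance used when the reference is numeric

@dataclass
class Challenge:
    """A composite, research-scale problem decomposed into checkpoints."""
    title: str
    statement: str
    checkpoints: list = field(default_factory=list)

# Invented example: a toy dispersion relation, not an actual CritPt problem.
k = sp.symbols("k", positive=True)
challenge = Challenge(
    title="Spin-wave dispersion in a toy magnet",
    statement="Self-contained setup with all definitions and conventions goes here.",
    checkpoints=[
        Checkpoint(prompt="Give omega(k) in closed form.",
                   reference=2 * sp.sin(k / 2) ** 2),
        Checkpoint(prompt="Evaluate omega at k = 0.1, 0.5, 1.0.",
                   reference=np.array([0.0049958, 0.1224174, 0.4596977])),
    ],
)
```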

What the Results Show

Across several leading LLMs, base accuracy on full challenges hovers around 4%. Tool-using variants (with coding) reach roughly 10%. Models do better on isolated subtasks than on end-to-end research problems, which is where most real bottlenecks live.

Translation: today's systems can assist with fragments but frequently break on integration, consistency, and validation, which are exactly the skills needed to move a project forward.

Practical Takeaways for Your Lab

  • Use composite challenges to stress-test end-to-end workflows before deploying AI on live projects.
  • Benchmark both with and without tools. The 2-3x lift from coding tools is real but still limited.
  • Audit failure modes: unit handling, boundary cases, symbol definitions, and stepwise logical consistency.
  • Adopt human-in-the-loop review. Treat model outputs as drafts; require verifiable quantities and reproducible scripts.
  • Align internal datasets with CritPt formats (arrays, symbolic forms) to enable reliable auto-grading.
  • Invest in evaluation harnesses; a minimal sketch follows this list. Progress depends more on measured iteration than on model hype.
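
As a starting point for such a harness, the sketch below runs a model over a set of checkpoints with and without tools and records per-item correctness. The `model_answer` callable, the JSONL layout, and the field names are placeholders, not part of CritPt.

```python
import json
from pathlib import Path

import numpy as np

def grade(predicted, reference, rel_tol=1e-6):
    """Numeric grading only, for brevity; symbolic checks would plug in here."""
    try:
        return bool(np.allclose(np.asarray(predicted, dtype=float),
                                np.asarray(reference, dtype=float), rtol=rel_tol))
    except (TypeError, ValueError):
        return False

def run_eval(problems_path, model_answer, use_tools=False):
    """Evaluate a model on a JSONL file whose lines hold "id", "prompt", "reference".

    `model_answer(prompt, use_tools=...)` stands in for your own model or agent call.
    """
    results = []
    for line in Path(problems_path).read_text().splitlines():
        item = json.loads(line)
        predicted = model_answer(item["prompt"], use_tools=use_tools)
        results.append({"id": item["id"],
                        "use_tools": use_tools,
                        "correct": grade(predicted, item["reference"])})
    return results

# Compare accuracy with and without coding tools on the same problem set:
# accuracy = lambda rs: sum(r["correct"] for r in rs) / len(rs)
# base = accuracy(run_eval("checkpoints.jsonl", my_model, use_tools=False))
# tooled = accuracy(run_eval("checkpoints.jsonl", my_model, use_tools=True))
```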

How This Guides AI Development

For model and tool builders, CritPt offers a clean target: improve reasoning across multi-stage physics tasks with strict verification. That likely means better planning, symbolic math, units and uncertainty handling, and tighter integration with scientific computing.
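
One of those capabilities, unit handling, is cheap to verify automatically. The snippet below is an illustrative dimensional-consistency check built on the pint library; it is an assumption about how builders might wire in such checks, not something CritPt prescribes.

```python
import pint

ureg = pint.UnitRegistry()

def dimensionally_consistent(quantity, expected_units):
    """Return True if a quantity carries the physical dimensions of the expected units."""
    return quantity.dimensionality == ureg.parse_expression(expected_units).dimensionality

# A model-produced photon energy E = hbar * omega should reduce to joules.
hbar = 1.054571817e-34 * ureg.joule * ureg.second
omega = 2.0e15 / ureg.second
photon_energy = hbar * omega
print(dimensionally_consistent(photon_energy, "joule"))  # True
```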

It also pushes for interpretable intermediate steps and tool use that mirrors how researchers actually work.

Learn More and Upskill

To keep up with applied AI skills relevant to research workflows, see Complete AI Training by job role. For broader discovery, monitor physics preprints on arXiv.

Bottom line: CritPt brings real physics into AI evaluation. The gap is measurable, and that's good news: now we know where to build.