ReplicationBench: Can AI Agents Replicate Astrophysics Papers, End to End?

ReplicationBench tests whether AI agents can fully reproduce astrophysics papers, including code, data analysis, and results, in multi-turn code execution environments. Right now, top models score under 20%.

Published on: Nov 01, 2025


AI agents are getting better at research support, but trust requires proof. ReplicationBench sets a high bar: can an agent reproduce a full astrophysics paper end-to-end, including the code, data analysis, and core results?

The benchmark supplies each agent with the manuscript, the original dataset, execution metadata, and a set of author-written tasks. Agents work in multi-turn code execution environments, and their traces are graded both automatically and by domain experts.
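
To make that setup concrete, here is a minimal sketch of what a single paper bundle and task might look like as data structures. The field names (`manuscript_path`, `execution_metadata`, `tolerance`, and so on) are illustrative assumptions, not ReplicationBench's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ReplicationTask:
    """One author-written task tied to a specific result in the paper.

    Illustrative schema only; the benchmark's real format may differ.
    """
    task_id: str
    description: str          # what the agent must reproduce, in prose
    target_result: str        # the quantity, figure, or table the task maps to
    tolerance: float = 0.05   # acceptable relative deviation for numeric checks

@dataclass
class PaperBundle:
    """Everything the agent receives before it starts working."""
    manuscript_path: str                 # the paper text (e.g., PDF or LaTeX source)
    dataset_paths: list[str]             # original archival data files
    execution_metadata: dict             # software versions, configs, random seeds
    tasks: list[ReplicationTask] = field(default_factory=list)
```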

Why astrophysics is a smart testbed

Astrophysics relies heavily on public, archival data and computational workflows. That means fewer lab dependencies and cleaner replication conditions. It's a direct way to assess whether an AI agent follows methods faithfully and produces technically correct outputs.

How ReplicationBench works

  • Task decomposition: each paper is split into tasks that cover experimental setup, derivations, data analysis, and codebase reproduction.
  • Grounded targets: tasks are co-developed with paper authors and map to specific scientific results.
  • Dual grading: agents are evaluated on faithfulness (adherence to methods) and correctness (accuracy of results), using automated checks and expert review; see the sketch after this list.
  • Execution context: agents operate in multi-turn, code-running environments to mirror real research workflows.
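
The dual-grading idea can be sketched in a few lines. The function names, thresholds, and values below are assumptions for illustration; in the actual benchmark, faithfulness is judged by experts and rubric-based review, not string matching.

```python
def grade_correctness(predicted: float, expected: float, tolerance: float = 0.05) -> bool:
    """Automated check: is the agent's numeric result within tolerance of the paper's value?"""
    return abs(predicted - expected) <= tolerance * abs(expected)

def grade_faithfulness(trace: str, method_checklist: list[str]) -> float:
    """Stand-in for expert review: fraction of required method steps that appear
    in the agent's trace. Real grading is a scientific judgment call."""
    followed = sum(1 for step in method_checklist if step.lower() in trace.lower())
    return followed / len(method_checklist)

# A task counts as replicated only if both dimensions pass (hypothetical values).
correct = grade_correctness(predicted=0.71, expected=0.68)
faithful = grade_faithfulness("fit a power law, then bootstrap the errors",
                              ["fit a power law", "bootstrap"]) >= 0.8
replicated = correct and faithful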

What the results show

ReplicationBench is hard for current frontier models. Even the best models score under 20% across paper-scale tasks.

Analysis of agent trajectories reveals a wide range of failure modes in scientific work. The errors span process, interpretation, and execution, and they compound across steps.

Why this matters for your team

If you're considering agents for research, this benchmark sets expectations. It shows that scaling from "helpful snippets" to "paper-level reliability" is a different problem entirely.

  • Don't skip replication. Before asking agents to try novel ideas, see if they can reproduce known results with the same inputs.
  • Decompose the work. Define tasks that align with key results and evaluation checks. Make "faithfulness vs. correctness" explicit.
  • Control the environment. Version datasets, code, dependencies, and execution metadata, and track every run (see the sketch after this list).
  • Use dual evaluation. Automate what you can, but keep expert review in the loop for scientific judgment calls.
  • Plan for errors. Expect failure modes across setup, data handling, method adherence, and result interpretation. Design guardrails and retries.
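
As a starting point for the "control the environment" item, here is a minimal sketch of a run manifest for a Python workflow. The manifest fields and file names are assumptions, not something ReplicationBench prescribes.

```python
import hashlib
import json
import platform
import subprocess
import sys
import time
from pathlib import Path

def file_sha256(path: str) -> str:
    """Hash a dataset file so every run records exactly which version it used."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def record_run_manifest(datasets: list[str], out_path: str = "run_manifest.json") -> dict:
    """Capture the execution context for one agent run: data versions,
    interpreter, installed packages, and a timestamp."""
    manifest = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "python": sys.version,
        "platform": platform.platform(),
        "datasets": {p: file_sha256(p) for p in datasets},
        "packages": subprocess.run(
            [sys.executable, "-m", "pip", "freeze"],
            capture_output=True, text=True, check=False
        ).stdout.splitlines(),
    }
    Path(out_path).write_text(json.dumps(manifest, indent=2))
    return manifest
```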

What ReplicationBench contributes

It's the first benchmark focused on paper-scale, expert-validated tasks in astrophysics. The setup generalizes to other data-driven fields, and it gives teams a repeatable way to measure reliability instead of guessing.

Subject areas

Computation and Language (cs.CL); Instrumentation and Methods for Astrophysics (astro-ph.IM)

Where to learn more

Read the paper on arXiv: arXiv:2510.24591.

If you're building skills to evaluate AI agents and code-driven research workflows, browse curated training paths: Complete AI Training - Courses by Skill.

