ReplicationBench: Can AI Agents Replicate Astrophysics Papers, End to End?

ReplicationBench tests whether AI agents can fully reproduce astrophysics papers, including code, data analysis, and results, in multi-turn code execution environments. Right now, top models score under 20%.

Published on: Nov 01, 2025


AI agents are getting better at research support, but trust requires proof. ReplicationBench sets a high bar: can an agent reproduce a full astrophysics paper end-to-end, including the code, data analysis, and core results?

The benchmark supplies each agent with the manuscript, the original dataset, execution metadata, and a set of author-written tasks. Agents work in multi-turn code execution environments, and their traces are graded both automatically and by domain experts.
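
To make that setup concrete, here is a minimal sketch of what a single paper bundle and task might look like as data structures. The field names (`manuscript_path`, `execution_metadata`, `tolerance`, and so on) are illustrative assumptions, not ReplicationBench's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ReplicationTask:
    """One author-written task tied to a specific result in the paper.

    Illustrative schema only; the benchmark's real format may differ.
    """
    task_id: str
    description: str          # what the agent must reproduce, in prose
    target_result: str        # the quantity, figure, or table the task maps to
    tolerance: float = 0.05   # acceptable relative deviation for numeric checks

@dataclass
class PaperBundle:
    """Everything the agent receives before it starts working."""
    manuscript_path: str                 # the paper text (e.g., PDF or LaTeX source)
    dataset_paths: list[str]             # original archival data files
    execution_metadata: dict             # software versions, configs, random seeds
    tasks: list[ReplicationTask] = field(default_factory=list)
```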

Why astrophysics is a smart testbed

Astrophysics relies heavily on public, archival data and computational workflows. That means fewer lab dependencies and cleaner replication conditions. It's a direct way to assess whether an AI agent follows methods faithfully and produces technically correct outputs.

How ReplicationBench works

  • Task decomposition: each paper is split into tasks that cover experimental setup, derivations, data analysis, and codebase reproduction.
  • Grounded targets: tasks are co-developed with paper authors and map to specific scientific results.
  • Dual grading: agents are evaluated on faithfulness (adherence to methods) and correctness (accuracy of results), using automated checks and expert review; see the sketch after this list.
  • Execution context: agents operate in multi-turn, code-running environments to mirror real research workflows.
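
The dual-grading idea can be sketched in a few lines. The function names, thresholds, and values below are assumptions for illustration; in the actual benchmark, faithfulness is judged by experts and rubric-based review, not string matching.

```python
def grade_correctness(predicted: float, expected: float, tolerance: float = 0.05) -> bool:
    """Automated check: is the agent's numeric result within tolerance of the paper's value?"""
    return abs(predicted - expected) <= tolerance * abs(expected)

def grade_faithfulness(trace: str, method_checklist: list[str]) -> float:
    """Stand-in for expert review: fraction of required method steps that appear
    in the agent's trace. Real grading is a scientific judgment call."""
    followed = sum(1 for step in method_checklist if step.lower() in trace.lower())
    return followed / len(method_checklist)

# A task counts as replicated only if both dimensions pass (hypothetical values).
correct = grade_correctness(predicted=0.71, expected=0.68)
faithful = grade_faithfulness("fit a power law, then bootstrap the errors",
                              ["fit a power law", "bootstrap"]) >= 0.8
replicated = correct and faithful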

What the results show

ReplicationBench is hard for current frontier models. Even the best models score under 20% across paper-scale tasks.

Analysis of agent trajectories reveals a wide range of failure modes in scientific work. The errors span process, interpretation, and execution, and they compound across steps.

Why this matters for your team

If you're considering agents for research, this benchmark sets expectations. It shows that scaling from "helpful snippets" to "paper-level reliability" is a different problem entirely.

  • Don't skip replication. Before asking agents to try novel ideas, see if they can reproduce known results with the same inputs.
  • Decompose the work. Define tasks that align with key results and evaluation checks. Make "faithfulness vs. correctness" explicit.
  • Control the environment. Version datasets, code, dependencies, and execution metadata, and track every run (see the sketch after this list).
  • Use dual evaluation. Automate what you can, but keep expert review in the loop for scientific judgment calls.
  • Plan for errors. Expect failure modes across setup, data handling, method adherence, and result interpretation. Design guardrails and retries.
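
As a starting point for the "control the environment" item, here is a minimal sketch of a run manifest for a Python workflow. The manifest fields and file names are assumptions, not something ReplicationBench prescribes.

```python
import hashlib
import json
import platform
import subprocess
import sys
import time
from pathlib import Path

def file_sha256(path: str) -> str:
    """Hash a dataset file so every run records exactly which version it used."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def record_run_manifest(datasets: list[str], out_path: str = "run_manifest.json") -> dict:
    """Capture the execution context for one agent run: data versions,
    interpreter, installed packages, and a timestamp."""
    manifest = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "python": sys.version,
        "platform": platform.platform(),
        "datasets": {p: file_sha256(p) for p in datasets},
        "packages": subprocess.run(
            [sys.executable, "-m", "pip", "freeze"],
            capture_output=True, text=True, check=False
        ).stdout.splitlines(),
    }
    Path(out_path).write_text(json.dumps(manifest, indent=2))
    return manifest
```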

What ReplicationBench contributes

It's the first benchmark focused on paper-scale, expert-validated tasks in astrophysics. The setup generalizes to other data-driven fields, and it gives teams a repeatable way to measure reliability instead of guessing.

Subject areas

Computation and Language (cs.CL); Instrumentation and Methods for Astrophysics (astro-ph.IM)

Where to learn more

Read the paper on arXiv: arXiv:2510.24591.

If you're building skills to evaluate AI agents and code-driven research workflows, browse curated training paths: Complete AI Training - Courses by Skill.

