ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers?
AI agents are getting better at research support, but trust requires proof. ReplicationBench sets a high bar: can an agent reproduce a full astrophysics paper end-to-end, including the code, data analysis, and core results?
The benchmark supplies each agent with the manuscript, the original dataset, execution metadata, and a set of author-written tasks. Agents work in multi-turn code execution environments, and their traces are graded both automatically and by domain experts.
Why astrophysics is a smart testbed
Astrophysics relies heavily on public, archival data and computational workflows. That means fewer lab dependencies and cleaner replication conditions. Replicating an astrophysics paper is therefore a direct test of whether an AI agent follows methods faithfully and produces technically correct results.
How ReplicationBench works
- Task decomposition: each paper is split into tasks that cover experimental setup, derivations, data analysis, and codebase reproduction.
- Grounded targets: tasks are co-developed with paper authors and map to specific scientific results.
- Dual grading: agents are evaluated on faithfulness (adherence to the paper's methods) and correctness (accuracy of results), using automated checks and expert review; see the sketch after this list.
- Execution context: agents operate in multi-turn, code-running environments to mirror real research workflows.
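To make the task structure concrete, here's a minimal sketch of what a paper-level task with dual grading could look like. The field names, scoring rule, and numbers below are illustrative assumptions, not ReplicationBench's actual schema.

```python
from dataclasses import dataclass


@dataclass
class ReplicationTask:
    """One author-defined task tied to a specific result in the paper.

    Hypothetical schema for illustration; ReplicationBench's real format may differ.
    """
    paper_id: str        # identifier for the source paper
    description: str     # what the agent must reproduce
    target_value: float  # the published result the task maps to
    tolerance: float     # acceptable numerical deviation


@dataclass
class Grade:
    faithfulness: float  # adherence to the paper's methods (expert-scored, 0 to 1)
    correctness: float   # accuracy of the reproduced result (automated, 0 to 1)


def grade_correctness(task: ReplicationTask, agent_value: float) -> float:
    """Automated correctness check: full credit if the agent's value is within tolerance."""
    return 1.0 if abs(agent_value - task.target_value) <= task.tolerance else 0.0


# Hypothetical example: the agent must recover a fitted power-law slope of 1.42 +/- 0.05.
task = ReplicationTask(
    paper_id="example-paper-001",
    description="Re-fit the power-law slope from the cleaned light-curve catalog.",
    target_value=1.42,
    tolerance=0.05,
)
grade = Grade(
    faithfulness=0.8,                           # assigned by a domain expert reviewing the trace
    correctness=grade_correctness(task, 1.44),  # 1.0 here: within tolerance of the target
)
print(grade)
```

Keeping faithfulness and correctness as separate scores reflects the benchmark's core distinction: an agent can land on the right number by the wrong route, or follow the method faithfully and still get the wrong answer.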
What the results show
ReplicationBench is hard for current frontier models. Even the best models score under 20% across paper-scale tasks.
Analysis of agent trajectories reveals many distinct failure modes in scientific work. The errors span process, interpretation, and execution, and they compound across steps.
Why this matters for your team
If you're considering agents for research, this benchmark sets expectations. It shows that scaling from "helpful snippets" to "paper-level reliability" is a different problem entirely.
- Don't skip replication. Before asking agents to try novel ideas, see if they can reproduce known results with the same inputs.
- Decompose the work. Define tasks that align with key results and evaluation checks. Make "faithfulness vs. correctness" explicit.
- Control the environment. Version datasets, code, dependencies, and execution metadata, and track every run; a minimal sketch follows this list.
- Use dual evaluation. Automate what you can, but keep expert review in the loop for scientific judgment calls.
- Plan for errors. Expect failure modes across setup, data handling, method adherence, and result interpretation. Design guardrails and retries.
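For the "control the environment" and "track every run" points above, a lightweight starting point is to snapshot execution metadata alongside each agent run. This is a minimal sketch under assumed conventions (file layout, field names, and a git-based workflow), not a prescribed tool.

```python
import hashlib
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: str) -> str:
    """Hash a dataset file so later runs can prove they used identical inputs."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def snapshot_run(dataset_path: str, out_dir: str = "runs") -> Path:
    """Record dataset hash, code version, and environment for one agent run."""
    metadata = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "dataset_sha256": sha256_of(dataset_path),
        # Assumes the analysis code lives in a git repository.
        "git_commit": subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True
        ).stdout.strip(),
        "python_version": sys.version,
        "platform": platform.platform(),
        "packages": subprocess.run(
            [sys.executable, "-m", "pip", "freeze"], capture_output=True, text=True
        ).stdout.splitlines(),
    }
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    record = out / f"run_{metadata['timestamp'].replace(':', '-')}.json"
    record.write_text(json.dumps(metadata, indent=2))
    return record


# Example usage with a hypothetical dataset file:
# snapshot_run("data/lightcurves.fits")
```

Whether you use a script like this or a full experiment tracker, the goal is the same: every result an agent produces should be traceable to exact inputs, code, and environment.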
What ReplicationBench contributes
It's the first paper-scale, expert-validated benchmark for AI agents in astrophysics. The setup generalizes to other data-driven fields, and it gives teams a repeatable way to measure reliability instead of guessing.
Subject areas
Computation and Language (cs.CL); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Where to learn more
Read the paper on arXiv: arXiv:2510.24591.
If you're building skills to evaluate AI agents and code-driven research workflows, browse curated training paths: Complete AI Training - Courses by Skill.