Exeter team to benchmark how well "AI Scientists" handle experimental error
A University of Exeter researcher, Dr Stephan Guttinger, has secured a Research Leadership Award from the Leverhulme Trust to test a crucial question: can autonomous AI systems reason through experimental error the way working scientists do?
The aim is practical and high-impact. If AI agents can diagnose, explain, and recover from error, they can support safer, faster, and more cost-effective progress in areas such as drug discovery.
The bottleneck: error reasoning isn't in the data
Most scientific error-solving happens in lab meetings, on whiteboards, or in hallway conversations. Those discussions rarely make it into papers, preprints, or datasets that AI models learn from.
As Dr Guttinger notes, even the most sophisticated AI benchmarks don't meaningfully test for this type of reasoning. We don't yet know how capable current systems are at identifying experimental error or proposing viable fixes.
Building a theory of error, then testing it
To close the gap, the Exeter team will first develop a systematic theory of error in science: the types of errors researchers encounter and the strategies they use to address them. That theory will anchor a structured database of error cases and response patterns across disciplines.
From there, the project will deliver two benchmarks. The first will be a traditional set of 500+ question-answer pairs to probe the error-reasoning ability of standalone AI agents. The second will evaluate human-AI teams, capturing how collaboration changes diagnosis, hypothesis updates, and next-step selection.
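The project has not published a schema for these benchmark items, so the sketch below is purely illustrative: one hypothetical way an error-reasoning entry could pair a lab scenario with an expert reference diagnosis. Every field name and the example content are assumptions for illustration, not the team's actual design.

```python
from dataclasses import dataclass, field

@dataclass
class ErrorReasoningItem:
    """Hypothetical benchmark item: an experimental-error scenario plus a reference diagnosis."""
    item_id: str
    discipline: str                      # e.g. "molecular biology", "analytical chemistry"
    scenario: str                        # the experiment and the anomalous result observed
    question: str                        # what the agent is asked to do (diagnose, propose a fix, ...)
    error_type: str                      # label drawn from an error taxonomy (assumed field)
    reference_answer: str                # expert-written diagnosis and recommended next step
    distractors: list[str] = field(default_factory=list)  # plausible but wrong diagnoses

# Invented example, for illustration only
example = ErrorReasoningItem(
    item_id="qa-0001",
    discipline="molecular biology",
    scenario=("A Western blot shows no bands, although the same protocol worked last week "
              "and the antibody lot is unchanged."),
    question="What is the most likely source of error, and what should the team check first?",
    error_type="protocol execution",
    reference_answer=("Check transfer efficiency with a total-protein stain before re-running; "
                      "a failed transfer is a common cause of blank blots."),
)
```

A record along these lines could, in principle, feed both the cross-disciplinary error database and the question-answer benchmark from the same source material.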
Interdisciplinary by design
Dr Guttinger, a lecturer in Philosophy of Data in the Department of Social and Political Sciences, Philosophy and Anthropology, is assembling a team spanning philosophy, natural sciences, and computer science. The goal: develop the conceptual and mathematical tools, plus the datasets, needed to assess how AI Scientists work through error on their own and with humans.
This foundation is meant to guide the trustworthy development of AI research agents that can operate inside real labs and real workflows.
Why this matters for research teams
- Model evaluation will get closer to day-to-day lab reality. Expect benchmarks that test error detection, root-cause analysis, and recovery strategies, not just clean textbook problems.
- Human-AI collaboration will be measured, not assumed. Teams will see evidence on where AI agents help, where they hinder, and how to structure handoffs.
- Better error reasoning could shorten feedback loops in areas like assay development, synthesis planning, and method transfer, helping teams cut wasted cycles.
What to watch next
- The release of the error taxonomy and database, which could double as training material for both models and lab onboarding.
- Benchmark leaderboards that reveal which AI systems are reliable under messy, imperfect lab conditions.
Learn more about the Research Leadership Awards at the Leverhulme Trust.
If your lab is building AI fluency across roles, explore practical learning paths by role at Complete AI Training.