Researchers use Battleship to teach AI better scientific decision-making
Researchers tested how well AI models make decisions under resource constraints by having them play a collaborative version of Battleship, benchmarking the models against each other and against human players. The work, presented at the International Conference on Learning Representations in April, offers a framework for improving how AI assists with scientific research.
The core problem: scientists must choose which hypotheses to pursue and which experiments to run with limited budgets. "You can get only so much data because getting data is either expensive or time-consuming," said Valerio Pepe, the research scientist who led the project before joining OpenAI.
The researchers designed a two-player version of Battleship where one participant asked questions about ship locations while the other answered. By tracking how many rounds it took to sink all ships, they compared how large language models (LLMs) performed against each other and 42 human players.
Results
Initially, humans sank all ships in fewer rounds than Meta's Llama-4-Scout, an efficiency-focused AI model. OpenAI's GPT-5 outperformed both.
The researchers then optimized their models using Bayesian experimental design, a statistical approach that weighs possible outcomes against prior beliefs to pick the most informative next question. They trained the models to ask questions that maximized both accuracy and information gain, while also planning one move ahead.
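The article does not include the researchers' code; a minimal sketch of the underlying idea, assuming the model's beliefs are represented as weighted candidate boards and candidate questions as yes/no predicates (the function names and data representation here are illustrative, not from the paper):

```python
import math

def entropy(weights):
    """Shannon entropy (in bits) of a set of unnormalized weights."""
    total = sum(weights)
    return -sum(w / total * math.log2(w / total) for w in weights if w > 0)

def expected_information_gain(hypotheses, question):
    """Expected entropy reduction from asking a yes/no question.

    hypotheses: list of (board, weight) pairs - current beliefs over boards.
    question:   callable taking a board and returning True or False.
    """
    prior = entropy([w for _, w in hypotheses])
    # Partition the belief distribution by the question's answer.
    yes = [w for b, w in hypotheses if question(b)]
    no = [w for b, w in hypotheses if not question(b)]
    total = sum(yes) + sum(no)
    gain = prior
    for split in (yes, no):
        p = sum(split) / total  # probability of this answer
        if p > 0:
            gain -= p * entropy(split)  # expected remaining uncertainty
    return gain

def best_question(hypotheses, questions):
    """One-step lookahead: ask the question expected to be most informative."""
    return max(questions, key=lambda q: expected_information_gain(hypotheses, q))
```

A question that splits the candidate boards evenly yields one bit of expected information; a question whose answer is already certain yields none, so `best_question` never picks it.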
The breakthrough came when players switched from natural language to code snippets for communication. Accuracy jumped significantly.
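The article does not reproduce the actual snippets; a hypothetical illustration of the idea, assuming the hidden board is a mapping from grid cells to ship occupancy, would be a question submitted as a predicate the answerer runs on the board rather than as free-form text:

```python
# Hypothetical "question as code": instead of asking in natural language
# "Is there a ship in the top-left quadrant?", the asker submits a
# predicate. The board representation here is an assumption, not the
# paper's: a dict mapping (row, col) -> True where a ship cell sits.

def question(board):
    """Does any ship cell lie in the top-left 4x4 quadrant?"""
    return any(board.get((r, c), False) for r in range(4) for c in range(4))

hidden_board = {(1, 2): True, (6, 6): True}
answer = question(hidden_board)  # True: cell (1, 2) is inside the quadrant
```

Because the predicate is executable, its meaning is unambiguous to the answerer, which is one plausible reason accuracy improved over natural-language questions.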
After optimization, Llama-4-Scout beat GPT-5 two-thirds of the time while costing roughly one hundredth as much to run. It also won in seven fewer moves than human players on average.
Application to science
Battleship is far simpler than real scientific problems. Chemical and biological samples don't yield clear answers the way a game board does. But the decision-making strategies the models learned should transfer to actual research work, Pepe said.
Yuanqi Du, a Cornell researcher focused on AI for chemistry, sees value in the framework. "The framework will be very useful to measure whether language models are really making progress in deciding which hypotheses to pursue among all possibilities," Du said. "Understanding the whole hypothesis space you're searching, that's the hardest part."