Scientists use Battleship to improve AI decision-making in research

Researchers trained AI models to make smarter decisions under budget limits using collaborative Battleship. After optimization, a low-cost Llama model beat GPT-5 two-thirds of the time and outpaced human players by seven moves.

Published on: May 09, 2026
Researchers use Battleship to teach AI better scientific decision-making

Researchers tested how well AI models make decisions under resource constraints by having them play a collaborative version of Battleship and comparing their performance with that of human players and of other models. The work, presented at the International Conference on Learning Representations in April, offers a framework for improving how AI assists with scientific research.

The core problem: scientists must choose which hypotheses to pursue and which experiments to run with limited budgets. "You can get only so much data because getting data is either expensive or time-consuming," said Valerio Pepe, the research scientist who led the project before joining OpenAI.

The researchers designed a two-player version of Battleship where one participant asked questions about ship locations while the other answered. By tracking how many rounds it took to sink all ships, they compared how large language models (LLMs) performed against each other and 42 human players.

Results

Initially, humans won faster than Meta's Llama-4-Scout, an efficiency-focused AI model. OpenAI's GPT-5 outperformed both.

The researchers then optimized their models using Bayesian experimental design, a statistical approach that uses prior beliefs about possible outcomes to pick the experiment expected to be most informative. They trained the models to ask questions that maximized both accuracy and information gain, while also planning one move ahead.
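To make the idea of information gain concrete, here is a minimal sketch in Python. It is not the researchers' code: the 4x4 board, the single two-cell ship, the uniform prior, and all function names are assumptions chosen for illustration. For a yes/no question with a deterministic answer, the expected information gain equals the entropy of the answer, so a question that splits the remaining possibilities evenly is worth the most.

```python
import math

SIZE = 4  # hypothetical 4x4 board, chosen only for illustration


def all_placements(size=SIZE):
    """Enumerate every placement of a single two-cell ship (toy hypothesis space)."""
    placements = []
    for r in range(size):
        for c in range(size):
            if c + 1 < size:  # horizontal placement
                placements.append(frozenset({(r, c), (r, c + 1)}))
            if r + 1 < size:  # vertical placement
                placements.append(frozenset({(r, c), (r + 1, c)}))
    return placements


def binary_entropy(p):
    """Entropy in bits of a yes/no answer that is 'yes' with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def expected_information_gain(hypotheses, question):
    """Expected bits learned from a yes/no question under a uniform prior."""
    p_yes = sum(1 for h in hypotheses if question(h)) / len(hypotheses)
    return binary_entropy(p_yes)


def is_horizontal(ship):
    """Question: do both cells of the ship share a row?"""
    rows = {r for r, _ in ship}
    return len(rows) == 1


def touches_corner(ship):
    """Question: does the ship cover the top-left cell?"""
    return (0, 0) in ship


hypotheses = all_placements()
questions = {"is_horizontal": is_horizontal, "touches_corner": touches_corner}
gains = {name: expected_information_gain(hypotheses, q) for name, q in questions.items()}
best_question = max(gains, key=gains.get)
```

In this toy setup the orientation question splits the placements in half, while the corner question is almost always answered "no", so a budget-limited player would ask the former first.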

The breakthrough came when players switched from natural language to code snippets for communication. Accuracy jumped significantly.
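The article does not describe the format of those code snippets, so the following is only a hypothetical sketch of the idea: a question written as a small Python function can be executed against the board, which makes the answer computed rather than interpreted. The board layout and every name here are assumptions made for illustration.

```python
# Hypothetical hidden board: "." is water, letters mark ship cells.
TRUE_BOARD = [
    "....",
    ".AA.",
    "....",
    "B...",
]


def ship_in_row(board, row):
    """Question as code: is any ship cell present in the given row?"""
    return any(cell != "." for cell in board[row])


def answer(question, board):
    """Run a question (a function of the board) and return a yes/no answer."""
    return bool(question(board))


# The asker sends a snippet; the answerer only has to execute it.
row_one_occupied = answer(lambda b: ship_in_row(b, 1), TRUE_BOARD)
row_zero_occupied = answer(lambda b: ship_in_row(b, 0), TRUE_BOARD)
```

A question in this form leaves no room for ambiguity about what is being asked, which is one plausible reason executable questions could be answered more accurately than free-form ones.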

After optimization, Llama-4-Scout beat GPT-5 two-thirds of the time while costing roughly one hundredth as much to run. It also won in seven fewer moves than human players on average.

Application to science

Battleship is far simpler than real scientific problems. Chemical and biological samples don't yield clear answers the way a game board does. But the decision-making strategies the models learned should transfer to actual research work, Pepe said.

Yuanqi Du, a Cornell researcher focused on AI for chemistry, sees value in the framework. "The framework will be very useful to measure whether language models are really making progress in deciding which hypotheses to pursue among all possibilities," Du said. "Understanding the whole hypothesis space you're searching, that's the hardest part."

