Small AI Model Beats GPT-5 at Battleship While Costing 1 Percent as Much
Researchers at MIT's Computer Science and Artificial Intelligence Laboratory and Harvard's School of Engineering and Applied Sciences used the classic guessing game to test how well language models ask questions in uncertain environments. They found that a smaller model, Llama 4 Scout, outperformed GPT-5 after being equipped with better inference strategies, and did so at a fraction of the computational cost.
The work addresses a real problem with current AI agents. While large language models excel at answering complex queries, they struggle to ask good questions when exploring a space of possibilities. This matters for applications like medical diagnosis and scientific discovery, where agents must investigate many potential solutions.
How the experiment worked
The researchers created "Collaborative Battleship," a version of the game where one player (the captain) asks yes-or-no questions to locate hidden ships, while another player (the spotter) answers. The team first collected data from over 40 humans playing together, then tested state-of-the-art language models against this benchmark.
Without any modifications, the largest models could beat human players but often asked poor questions. Llama 4 Scout, a smaller model, won only 8 percent of the time.
The researchers applied two techniques. First, they gave models a Monte Carlo inference strategy that reweighs the likelihood of each possibility after every response, raising or lowering the odds as new information comes in. This helped models ask more informative questions.
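The idea of choosing questions by sampling can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' implementation: the 4x4 grid, the single two-cell ship, the three candidate questions, and every function name are assumptions made for the example. It samples possible boards, picks the question whose yes-or-no answer is least predictable across those samples, and then discards the boards that contradict the answer.

```python
import math
import random

SIZE = 4  # hypothetical 4x4 grid, far smaller than a real game


def sample_board(rng):
    """Place one length-2 ship at random; return the set of occupied cells."""
    if rng.random() < 0.5:  # horizontal placement
        r, c = rng.randrange(SIZE), rng.randrange(SIZE - 1)
        return frozenset({(r, c), (r, c + 1)})
    r, c = rng.randrange(SIZE - 1), rng.randrange(SIZE)  # vertical placement
    return frozenset({(r, c), (r + 1, c)})


def entropy(p):
    """Binary entropy in bits; 0 when the answer is already certain."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def best_question(samples, questions):
    """Pick the question whose yes/no answer is least predictable."""
    def score(name):
        p_yes = sum(questions[name](b) for b in samples) / len(samples)
        return entropy(p_yes)
    return max(questions, key=score)


def update(samples, question, answer):
    """Keep only the sampled boards that agree with the spotter's answer."""
    return [b for b in samples if question(b) == answer]


# Candidate questions, each written as a predicate over a sampled board.
questions = {
    "ship in top half?": lambda b: any(r < SIZE // 2 for r, _ in b),
    "ship in row 0?": lambda b: any(r == 0 for r, _ in b),
    "ship at cell (0, 0)?": lambda b: (0, 0) in b,
}

rng = random.Random(0)
samples = [sample_board(rng) for _ in range(2000)]
chosen = best_question(samples, questions)
remaining = update(samples, questions[chosen], True)
```

In this toy setup the broad "top half" question is preferred over asking about a single cell, because its answer splits the sampled boards most evenly and so rules out the most possibilities whichever way it goes.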
Second, they converted each question into Python code that explicitly told the model how to verify answers. When asked "Is there a ship in column one that spans two rows?" the model received clear instructions to search that area and assess the ship's width.
The results
Llama 4 Scout's win rate jumped from 8 percent to 82 percent. With that boost, it outpaced GPT-5 while using around 1 percent of that model's computational resources.
The code conversion approach also improved accuracy across the board. GPT-4o-mini saw a 30 percent performance bump. Claude 4 Opus gained about 8 percentage points. Even GPT-5 improved when answering questions as the spotter.
The team also tested the approach on "Guess Who?", a game where models must narrow down 100 character options. Llama 4 Scout improved from 30 percent success to 72 percent. GPT-4o jumped from 62 percent to 90 percent.
What this means for AI agents
The findings suggest that AI agents have untapped potential for what researchers call "needle-in-a-haystack" discovery: finding rare solutions within massive option spaces. This applies directly to identifying molecular structures and other scientific problems that require systematic exploration.
Jacob Andreas, an MIT associate professor and lead investigator, said the work opens possibilities for using code-generation techniques to improve how models explore and gather information, not just verify solutions. He sees potential applications in coding and mathematical problem-solving.
The researchers acknowledge limits. Models still struggle with complex questions compared to humans. Expert Battleship players remain difficult to beat, unlike in chess, where AI systems consistently defeat top players.
The team plans to test models in more complex environments with larger option spaces, and to study whether humans and AI collaborate more effectively together. They also want to explore fine-tuning models on game simulations and leveraging additional computing power for more advanced inference.
The work was presented at the International Conference on Learning Representations in April and was supported by MIT's Siegel Family Quest for Intelligence, the MIT-IBM Watson AI Lab, the Air Force Office of Scientific Research, DARPA, and the National Science Foundation, among others.