AI models in a classic economics game reveal where machines think differently from humans
A new study drops leading language models into a classic strategic game and checks how their choices stack up against human data. If your work touches experiments, forecasting, or decision systems, this is signal you can use.
The headline: modern models adapt to game structure like trained participants, often pick more "rational" numbers than humans, but miss simple dominant strategies that any econ undergrad would spot in seconds.
The experiment at a glance
Researchers tested GPT-4o, GPT-4o Mini, Gemini-2.5-flash, Claude-Sonnet-4, and Llama-4-Maverick on variants of the "Guess the Number" game, a modern take on the Keynesian beauty contest. In each scenario, players pick a number between 0 and 100, and whoever lands closest to a fraction (typically 1/2 or 2/3) of the group's aggregate wins.
- 16 scenarios reflecting classic human experiments (Nagel and follow-ons).
- Opponents described with different traits: students, experts, angry, analytical, etc.
- Aggregation rules varied: average, median, maximum.
- 50 independent runs per model per scenario; no learning across rounds.
All 4,000 outputs stayed within bounds. Almost every response included explicit strategic reasoning; only 23 did not.
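For concreteness, here is a minimal sketch of the game mechanics in Python. It is not the paper's code, and the function and parameter names are ours, but it captures the fraction and aggregation-rule variants described above.

```python
import statistics

def guess_the_number_winner(guesses, fraction=0.5, aggregate=statistics.mean):
    """Return (winning guess, target): the guess closest to fraction * aggregate(guesses).

    `fraction` is the target multiplier (1/2 or 2/3 in most scenarios) and
    `aggregate` is the pooling rule (mean, median, or max in the study).
    """
    target = fraction * aggregate(guesses)
    return min(guesses, key=lambda g: abs(g - target)), target

# Example: five "players" with guesses in [0, 100], half-the-average rule.
guesses = [50, 33, 22, 10, 0]
winner, target = guess_the_number_winner(guesses, fraction=0.5)
print(f"target = {target:.1f}, winning guess = {winner}")  # target = 11.5, winning guess = 10
```

Swapping `aggregate=max` or `aggregate=statistics.median` reproduces the rule variants the researchers tested.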
Where models align with theory, and where they don't
Compared with well-known human results, the models consistently chose lower numbers. In Nagel-style conditions, people average around 27 when the target is half the average and around 37 when it is two-thirds. Every model came in below those human means: some pushed close to zero, the Nash equilibrium in most versions, while others (notably GPT-4o) landed higher but still below humans. The differences were statistically significant.
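One way to read these gaps is through the standard level-k ladder: a naive level-0 player anchors at 50, and each deeper level best-responds by multiplying by the target fraction, so guesses shrink geometrically toward the Nash equilibrium at zero. A quick sketch (the level-0 anchor of 50 is the conventional modeling assumption, not a number from the paper):

```python
def level_k_guess(k, fraction=2/3, anchor=50.0):
    """Level-0 guesses the anchor; level-k best-responds to level-(k-1),
    so the guess is anchor * fraction**k, converging to the Nash play of 0."""
    return anchor * fraction ** k

for k in range(6):
    print(k, round(level_k_guess(k), 1))
# 0 50.0, 1 33.3, 2 22.2, 3 14.8, 4 9.9, 5 6.6
```

On this ladder, the human mean of ~37 under the two-thirds rule sits between level 0 and level 1, while the models' lower guesses correspond to deeper levels of iteration.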
When the rule used the maximum instead of an average, both humans and models chose higher numbers, as expected. But the models spread out: Claude Sonnet hovered near the mid-30s, while Llama chose smaller numbers. Directionally correct, then, but not uniform.
The standout gap: in two-player versions, choosing zero is weakly dominant. Because the target is only a fraction (below one) of the two guesses' average, it always lies below their midpoint, so the lower guess wins whenever the guesses differ; zero can therefore never lose. None of the models detected or explained this. They defaulted to iterative "what will the other player do" reasoning instead of recognizing dominance. That's a clear miss relative to formal training; the quick check below makes the dominance concrete.
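A brute-force verification of that claim, sketched under the half-the-average rule with integer guesses (our illustration, not the paper's analysis):

```python
def two_player_payoff(my_guess, their_guess, fraction=0.5):
    """1.0 if my_guess is strictly closer to the target, 0.5 on a tie, else 0.0."""
    target = fraction * (my_guess + their_guess) / 2
    mine, theirs = abs(my_guess - target), abs(their_guess - target)
    return 1.0 if mine < theirs else 0.5 if mine == theirs else 0.0

# Zero is weakly dominant: it does at least as well as any alternative x
# against every opponent guess y, and strictly better against some y
# (e.g., y = 0, where zero ties and any positive x loses outright).
assert all(
    two_player_payoff(0, y) >= two_player_payoff(x, y)
    for y in range(101)
    for x in range(1, 101)
)
print("zero weakly dominates every other integer guess")
```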
Scale effects: bigger models, deeper reasoning
Across Llama variants from 1B to 405B parameters, behavior shifted with size. Small models guessed like typical humans (often near 50). As size grew, choices moved toward equilibrium and explanations reflected more layers of thinking. Scale pushed models away from human heuristics and closer to theory.
Context sensitivity: emotion and framing matter
Prompt phrasing and social cues nudged choices in familiar ways. Describing opponents as angry led to higher guesses; sadness had smaller effects. Labeling opponents as analytical pulled numbers down relative to "intuitive" descriptions. GPT-4o Mini and Llama were especially sensitive to wording, yet the overall response pattern stayed stable and predictable.
Why this matters for science and research
- Modeling behavior: LLMs track comparative statics well. If theory predicts higher/lower, they move the right way.
- Rationality gap: Models often assume more sophistication than real humans show, underweight bounded reasoning, and can still miss simple dominant strategies.
- Method design: For experiments and agent-based work, model size and prompt framing materially change behavior.
- Policy and markets: If a system assumes strategic opponents but faces noise and emotion, forecasts can drift.
How to use this in your workflow
- Encode simple dominance: Before deployment, prompt the model to check for dominant strategies and eliminate non-viable options (see the harness sketch after this list).
- Calibrate opponent models: Specify distributions of opponent types (bounded, noisy, biased). Don't let the model assume universal sophistication.
- Stress-test framing: Vary wording, emotional cues, and social context in prompts. Log shifts in recommendations.
- Choose model size intentionally: Smaller models may mirror human heuristics; larger ones skew toward equilibrium. Pick based on task needs.
- Use ensembles and human-in-the-loop: Combine a "theory checker" with a "behavioral checker," then review disagreements.
- Audit reasoning chains: Require short rationales. Flag explanations that miss basic dominance or contradict setup rules.
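A minimal harness sketch pulling several of these points together. The `query_model` client, prompt wording, and log format here are illustrative assumptions, not from the study; swap in your actual SDK call. It prepends an explicit dominance check, sweeps emotional and analytical framings, and logs each rationale for later audit.

```python
import json

DOMINANCE_CHECK = (
    "Before choosing, state whether any option weakly dominates the others; "
    "if one does, pick it and say so explicitly."
)

FRAMINGS = {
    "neutral":    "Your opponents are other participants.",
    "analytical": "Your opponents are highly analytical experts.",
    "emotional":  "Your opponents are angry and frustrated.",
}

def query_model(prompt: str) -> str:
    """Placeholder for your LLM client call (OpenAI, Anthropic, local, etc.)."""
    return "42 -- rationale: ..."  # canned reply so the sketch runs end to end

def run_framing_sweep(task: str, log_path: str = "framing_log.jsonl") -> None:
    """Ask the same task under each framing, with an explicit dominance check,
    and log prompt + response so shifts in recommendations are auditable."""
    with open(log_path, "a") as log:
        for name, framing in FRAMINGS.items():
            prompt = f"{framing}\n{task}\n{DOMINANCE_CHECK}\nGive a short rationale."
            response = query_model(prompt)
            log.write(json.dumps({"framing": name, "prompt": prompt,
                                  "response": response}) + "\n")
```

Diffing the logged responses across framings gives you a quick, repeatable read on how much wording alone moves the model's recommendations.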
Limitations and next steps
The study focused on one-shot decisions with no learning across rounds. Real settings add learning, incentives, feedback loops, and strategic noise. It also tested a limited but diverse set of model families.
Useful extensions: real-payoff experiments, repeated games, heterogeneous stakes, adversarial opponents, and tool-assisted reasoning. For applied teams, replicate these tests on your stack with your prompts and guardrails before trusting agent behavior in live systems.
Sources and further reading
Background on the game: the Wikipedia overview of the Keynesian beauty contest. The study itself appears in the Journal of Economic Behavior & Organization.
If you're building research workflows with LLMs and want structured upskilling, see our AI certification for data analysis.