AI models in a classic economics game reveal where machines think differently from humans
A new study drops leading language models into a classic strategic game and checks how their choices stack up against human data. If your work touches experiments, forecasting, or decision systems, this is signal you can use.
The headline: modern models adapt to game structure like trained participants, often pick more "rational" numbers than humans, but miss simple dominant strategies that any econ undergrad would spot in seconds.
The experiment at a glance
Researchers tested GPT-4o, GPT-4o Mini, Gemini-2.5-flash, Claude-Sonnet-4, and Llama-4-Maverick on variants of the "Guess the Number" game, a modern take on the Keynesian beauty contest. In each scenario, players pick a number between 0 and 100, and whoever lands closest to a fraction (typically 1/2 or 2/3) of the group's aggregate wins.
- 16 scenarios reflecting classic human experiments (Nagel and follow-ons).
- Opponents described with different traits: students, experts, angry, analytical, etc.
- Aggregation rules varied: average, median, maximum.
- 50 independent runs per model per scenario; no learning across rounds.
All 4,000 outputs stayed within bounds. Almost every response included explicit strategic reasoning; only 23 did not.
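For concreteness, here is a minimal sketch of the game mechanics in Python. It is not the paper's code, and the function and parameter names are ours, but it captures the fraction and aggregation-rule variants described above.

```python
import statistics

def guess_the_number_winner(guesses, fraction=0.5, aggregate=statistics.mean):
    """Return (winning guess, target): the guess closest to fraction * aggregate(guesses).

    `fraction` is the target multiplier (1/2 or 2/3 in most scenarios) and
    `aggregate` is the pooling rule (mean, median, or max in the study).
    """
    target = fraction * aggregate(guesses)
    return min(guesses, key=lambda g: abs(g - target)), target

# Example: five "players" with guesses in [0, 100], half-the-average rule.
guesses = [50, 33, 22, 10, 0]
winner, target = guess_the_number_winner(guesses, fraction=0.5)
print(f"target = {target:.1f}, winning guess = {winner}")  # target = 11.5, winning guess = 10
```

Swapping `aggregate=max` or `aggregate=statistics.median` reproduces the rule variants the researchers tested.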
Where models align with theory, and where they don't
Compared with well-known human results, the models consistently chose lower numbers. In Nagel-style conditions, people average around 27 when the target is half the average and around 37 when it is two-thirds. Every model came in below those human means: some pushed close to zero, the Nash equilibrium in most versions, while others (notably GPT-4o) landed higher but still below humans. The differences were statistically significant.
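One way to read these gaps is through the standard level-k ladder: a naive level-0 player anchors at 50, and each deeper level best-responds by multiplying by the target fraction, so guesses shrink geometrically toward the Nash equilibrium at zero. A quick sketch (the level-0 anchor of 50 is the conventional modeling assumption, not a number from the paper):

```python
def level_k_guess(k, fraction=2/3, anchor=50.0):
    """Level-0 guesses the anchor; level-k best-responds to level-(k-1),
    so the guess is anchor * fraction**k, converging to the Nash play of 0."""
    return anchor * fraction ** k

for k in range(6):
    print(k, round(level_k_guess(k), 1))
# 0 50.0, 1 33.3, 2 22.2, 3 14.8, 4 9.9, 5 6.6
```

On this ladder, the human mean of ~37 under the two-thirds rule sits between level 0 and level 1, while the models' lower guesses correspond to deeper levels of iteration.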
When the rule used the maximum instead of an average, both humans and models chose higher numbers, as expected. But the models spread out: Claude Sonnet hovered near the mid-30s, while Llama chose smaller numbers. Directionally correct, then, but not uniform.
The standout gap: in two-player versions, choosing zero is weakly dominant. Because the target is only a fraction (below one) of the two guesses' average, it always lies below their midpoint, so the lower guess wins whenever the guesses differ; zero can therefore never lose. None of the models detected or explained this. They defaulted to iterative "what will the other player do" reasoning instead of recognizing dominance. That's a clear miss relative to formal training; the quick check below makes the dominance concrete.
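A brute-force verification of that claim, sketched under the half-the-average rule with integer guesses (our illustration, not the paper's analysis):

```python
def two_player_payoff(my_guess, their_guess, fraction=0.5):
    """1.0 if my_guess is strictly closer to the target, 0.5 on a tie, else 0.0."""
    target = fraction * (my_guess + their_guess) / 2
    mine, theirs = abs(my_guess - target), abs(their_guess - target)
    return 1.0 if mine < theirs else 0.5 if mine == theirs else 0.0

# Zero is weakly dominant: it does at least as well as any alternative x
# against every opponent guess y, and strictly better against some y
# (e.g., y = 0, where zero ties and any positive x loses outright).
assert all(
    two_player_payoff(0, y) >= two_player_payoff(x, y)
    for y in range(101)
    for x in range(1, 101)
)
print("zero weakly dominates every other integer guess")
```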
Scale effects: bigger models, deeper reasoning
Across Llama variants from 1B to 405B parameters, behavior shifted with size. Small models guessed like typical humans (often near 50). As size grew, choices moved toward equilibrium and explanations reflected more layers of thinking. Scale pushed models away from human heuristics and closer to theory.
Context sensitivity: emotion and framing matter
Prompt phrasing and social cues nudged choices in familiar ways. Describing opponents as angry led to higher guesses; sadness had smaller effects. Labeling opponents as analytical pulled numbers down relative to "intuitive" descriptions. GPT-4o Mini and Llama were especially sensitive to wording, yet the overall response pattern stayed stable and predictable.
Why this matters for science and research
- Modeling behavior: LLMs track comparative statics well. If theory predicts higher/lower, they move the right way.
- Rationality gap: Models often assume more sophistication than real humans show, underweight bounded reasoning, and can still miss simple dominant strategies.
- Method design: For experiments and agent-based work, model size and prompt framing materially change behavior.
- Policy and markets: If a system assumes strategic opponents but faces noise and emotion, forecasts can drift.
How to use this in your workflow
- Encode simple dominance: Before deployment, prompt the model to check for dominant strategies and eliminate non-viable options (see the harness sketch after this list).
- Calibrate opponent models: Specify distributions of opponent types (bounded, noisy, biased). Don't let the model assume universal sophistication.
- Stress-test framing: Vary wording, emotional cues, and social context in prompts. Log shifts in recommendations.
- Choose model size intentionally: Smaller models may mirror human heuristics; larger ones skew toward equilibrium. Pick based on task needs.
- Use ensembles and human-in-the-loop: Combine a "theory checker" with a "behavioral checker," then review disagreements.
- Audit reasoning chains: Require short rationales. Flag explanations that miss basic dominance or contradict setup rules.
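A minimal harness sketch pulling several of these points together. The `query_model` client, prompt wording, and log format here are illustrative assumptions, not from the study; swap in your actual SDK call. It prepends an explicit dominance check, sweeps emotional and analytical framings, and logs each rationale for later audit.

```python
import json

DOMINANCE_CHECK = (
    "Before choosing, state whether any option weakly dominates the others; "
    "if one does, pick it and say so explicitly."
)

FRAMINGS = {
    "neutral":    "Your opponents are other participants.",
    "analytical": "Your opponents are highly analytical experts.",
    "emotional":  "Your opponents are angry and frustrated.",
}

def query_model(prompt: str) -> str:
    """Placeholder for your LLM client call (OpenAI, Anthropic, local, etc.)."""
    return "42 -- rationale: ..."  # canned reply so the sketch runs end to end

def run_framing_sweep(task: str, log_path: str = "framing_log.jsonl") -> None:
    """Ask the same task under each framing, with an explicit dominance check,
    and log prompt + response so shifts in recommendations are auditable."""
    with open(log_path, "a") as log:
        for name, framing in FRAMINGS.items():
            prompt = f"{framing}\n{task}\n{DOMINANCE_CHECK}\nGive a short rationale."
            response = query_model(prompt)
            log.write(json.dumps({"framing": name, "prompt": prompt,
                                  "response": response}) + "\n")
```

Diffing the logged responses across framings gives you a quick, repeatable read on how much wording alone moves the model's recommendations.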
Limitations and next steps
The study focused on one-shot decisions with no learning across rounds. Real settings add learning, incentives, feedback loops, and strategic noise. It also tested a limited but diverse set of model families.
Useful extensions: real-payoff experiments, repeated games, heterogeneous stakes, adversarial opponents, and tool-assisted reasoning. For applied teams, replicate these tests on your stack with your prompts and guardrails before trusting agent behavior in live systems.
Sources and further reading
Background on the game: the Wikipedia overview of the Keynesian beauty contest. The study itself appears in the Journal of Economic Behavior & Organization.
If you're building research workflows with LLMs and want structured upskilling, see our AI certification for data analysis.