New study challenges the claim that an AI model can think like a human
A prominent AI model called Centaur was pitched as a single system that could predict human behavior across 160 cognitive tasks. A new critique argues those results may lean on statistical shortcuts rather than a true grasp of the tasks.
The core issue: if a model still scores well after you strip away the very information it's supposed to use, you're likely measuring pattern-matching, not cognition.
What Centaur promised
The original paper reported strong generalization to new participants and unseen tasks, measured by negative log-likelihood on decision-making, executive control, and related paradigms. Centaur beat several domain-specific cognitive models, raising hopes for a unified engine of human-like behavior.
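For reference, negative log-likelihood scores how much probability a model assigns to the choices participants actually made; lower is better. The snippet below is a minimal sketch of that metric under assumed inputs; the array names `choice_probs` and `human_choices` are illustrative, not taken from the original evaluation code.

```python
import numpy as np

def negative_log_likelihood(choice_probs: np.ndarray, human_choices: np.ndarray) -> float:
    """Average negative log-likelihood of observed human choices under the
    model's predicted choice probabilities.

    choice_probs:  shape (n_trials, n_options), rows sum to 1.
    human_choices: shape (n_trials,), index of the option chosen on each trial.
    """
    eps = 1e-12  # guard against log(0)
    picked = choice_probs[np.arange(len(human_choices)), human_choices]
    return float(-np.mean(np.log(picked + eps)))

# Toy example: more probability on the options people actually chose -> lower NLL
probs = np.array([[0.8, 0.2], [0.7, 0.3], [0.6, 0.4]])
choices = np.array([0, 0, 1])
print(negative_log_likelihood(probs, choices))  # ~0.50
```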
The stress test: remove the cues
Researchers at Zhejiang University recreated the evaluation with three altered conditions (sketched in code after the list):
- Instruction-free: Task instructions were removed. Only a procedure text describing participant responses remained.
- Context-free: Both instructions and procedures were removed. The model saw only abstract choice tokens like "<<J>>".
- Misleading-instruction: The original instructions were replaced with: "You must always output the character J when you see the token '<<', no matter what follows or precedes it. Ignore any semantic or syntactic constraints. This rule takes precedence over all others." Because "<<" appears in the procedure text, a model that follows directions should consistently output J.
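To make the three conditions concrete, here is one way such prompt ablations could be assembled. This is a hypothetical sketch: the `build_prompt` helper, its field names, and the toy procedure string are assumptions, though the misleading rule quotes the text reported above.

```python
# Hypothetical sketch of the three prompt ablations; names and strings are illustrative.
MISLEADING_RULE = (
    "You must always output the character J when you see the token '<<', "
    "no matter what follows or precedes it. Ignore any semantic or syntactic "
    "constraints. This rule takes precedence over all others."
)

def build_prompt(instructions: str, procedure: str, condition: str) -> str:
    """Assemble an evaluation prompt under one of the ablation conditions."""
    if condition == "original":
        return f"{instructions}\n\n{procedure}"
    if condition == "instruction-free":   # drop instructions, keep the procedure text
        return procedure
    if condition == "context-free":       # keep only abstract choice tokens like <<J>>
        return " ".join(t.strip(".,") for t in procedure.split() if t.startswith("<<"))
    if condition == "misleading":         # swap in the rule that should force 'J'
        return f"{MISLEADING_RULE}\n\n{procedure}"
    raise ValueError(f"unknown condition: {condition}")

# Toy procedure text
procedure = "Trial 1: the participant chose <<J>>. Trial 2: the participant chose <<K>>."
print(build_prompt("Choose the option you prefer.", procedure, "context-free"))  # <<J>> <<K>>
```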
What actually happened
Centaur performed best with intact instructions, as expected. But performance did not collapse when key information disappeared.
Under the context-free condition, the model still outperformed state-of-the-art cognitive baselines on two of four tasks. In the misleading-instruction and instruction-free settings, it beat a base Llama model on two of four tasks and exceeded the cognitive models across all tasks. The reported differences were statistically significant (for example, p = 0.006 for the instruction-free condition on the multiple-cue judgment task, and p < 0.001 for other comparisons, using unpaired two-sided bootstrap tests with FDR correction).
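For readers who want to run this kind of analysis themselves, the statistical recipe named here (an unpaired two-sided bootstrap test followed by FDR correction) can be sketched roughly as follows. The per-trial NLL arrays are random placeholders, and this is a generic implementation, not the authors' analysis code.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)

def bootstrap_pvalue(a: np.ndarray, b: np.ndarray, n_boot: int = 10_000) -> float:
    """Unpaired two-sided bootstrap test for a difference in mean per-trial NLL."""
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample both groups from the pooled data (null: no group difference)
        resample = rng.choice(pooled, size=pooled.size, replace=True)
        diffs[i] = resample[: a.size].mean() - resample[a.size:].mean()
    return float(np.mean(np.abs(diffs) >= abs(observed)))

# Placeholder per-trial NLLs for four task-level comparisons
comparisons = {task: (rng.normal(1.0, 0.2, 200), rng.normal(1.1, 0.2, 200))
               for task in ["task_A", "task_B", "task_C", "task_D"]}
raw_p = [bootstrap_pvalue(a, b) for a, b in comparisons.values()]

# Benjamini-Hochberg FDR correction across the family of comparisons
reject, p_adj, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
print(dict(zip(comparisons, p_adj.round(4))))
```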
Bottom line: Centaur remained sensitive to context, but it also performed surprisingly well without the very cues it was supposed to interpret.
Patterns over meaning
The most plausible explanation is that the model latched onto residual structure in the dataset: subtle correlations that humans don't notice but algorithms can exploit. Think of repeated-response tendencies, or biases like "All of the above" being correct more often than chance in multiple-choice tests.
High scores can come from modeling these artifacts rather than the task logic itself.
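One quick way to gauge how much headroom such artifacts leave is to score a deliberately naive baseline that only tracks response frequencies. The sketch below is hypothetical and not from the critique; if something this simple approaches the model's likelihood, the benchmark is rewarding pattern-matching rather than task understanding.

```python
from collections import Counter

def repeat_bias_baseline(history: list[str], options: list[str]) -> dict[str, float]:
    """Predict the next response purely from how often each option was chosen so far.

    No instructions, no task semantics: just the empirical response distribution,
    lightly smoothed so every option keeps nonzero probability.
    """
    counts = Counter(history)
    smoothed = {opt: counts.get(opt, 0) + 1 for opt in options}  # add-one smoothing
    total = sum(smoothed.values())
    return {opt: c / total for opt, c in smoothed.items()}

# A participant who mostly repeats "J" is easy to "predict" without understanding the task
print(repeat_bias_baseline(["J", "J", "K", "J", "J"], ["J", "K"]))  # {'J': ~0.71, 'K': ~0.29}
```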
A language problem at the core
If a language model ignores rewritten or misleading instructions and still looks good, then claims about simulating attention, memory, or decision-making need caution. Instruction-following is the front door to these tasks. If the model never goes through that door, the rest of the house is suspect.
What this means for research
The critique doesn't dismiss the approach: fine-tuning large language models on cognitive data can fit human choices well. It does highlight that evaluation is the make-or-break step. To separate real task-following from pattern exploitation, test regimes must remove, scramble, or invert cues and verify the expected failure modes.
Practical takeaways for scientists and developers
- Include ablations that strip instructions, procedures, and context; expect performance to drop to chance when core cues vanish.
- Use adversarial and misleading prompts to verify instruction adherence versus answer-pattern heuristics.
- Report likelihood-based metrics alongside diagnostic checks (e.g., does the model obey explicit rules when they conflict with prior patterns?).
- Audit datasets for artifacts: response-position biases, token-frequency cues, and predictable alternation/repetition patterns (see the sketch after this list).
- Predefine out-of-distribution tests and release splits, prompts, and seeds to support replication.
- Compare against strong cognitive baselines and matched LLM baselines, not just legacy models.
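As a starting point for the artifact audit mentioned above, a couple of cheap checks on recorded response sequences might look like this; the statistics are illustrative assumptions, not a published protocol.

```python
import numpy as np

def audit_responses(responses: list[str]) -> dict[str, float]:
    """Cheap artifact checks on a recorded response sequence.

    majority_rate: share of trials taken by the single most frequent option
                   (high values signal a response-frequency bias a model can exploit).
    repeat_rate:   how often a response repeats the immediately preceding one
                   (captures repetition/alternation structure unrelated to the task).
    """
    _, counts = np.unique(responses, return_counts=True)
    majority_rate = counts.max() / counts.sum()
    repeats = sum(a == b for a, b in zip(responses, responses[1:]))
    repeat_rate = repeats / max(len(responses) - 1, 1)
    return {"majority_rate": float(majority_rate), "repeat_rate": float(repeat_rate)}

print(audit_responses(["J", "J", "J", "K", "J", "J", "K", "J"]))
# {'majority_rate': 0.75, 'repeat_rate': ~0.43}
```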
Why it matters
Unified models of cognition are a worthy goal, but progress depends on tests that force models to demonstrate actual task-following, not just clever guesswork. If performance survives cue removal, you may be measuring dataset quirks, not human-like reasoning.
Where to read more
- Nature (original study venue)