AI That Skips the Instructions: What the Centaur Study Means for Research and Deployment
In a nutshell
- Researchers probed an AI model, Centaur, by stripping or corrupting task instructions. It still did surprisingly well.
- The model seemed to bypass instructions and lean on hidden statistical patterns in the data.
- Even when told to always output "J," it ignored the trap and kept predicting based on learned patterns.
- Language comprehension may be AI's hardest bottleneck; more data and bigger models won't solve it alone.
An uncomfortable takeaway: the AI that answers confidently might not actually read your instructions in the way you expect. It's just very good at guessing from patterns.
New work from Zhejiang University challenges how we interpret instruction-following. Centaur, a model built to simulate human behavior across 160 psychology experiments, kept scoring well even when instructions were removed or replaced with wrong ones. That gap between confident output and genuine comprehension is where risk lives.
The tests that exposed the shortcut
Three stress tests tell the story. First, the team removed all task instructions and left only generic response descriptions. Performance dropped, but Centaur still beat the base language model on half the tasks and exceeded traditional cognitive models across all four tested tasks.
Second, they removed instructions and procedures, keeping only bare choice tokens like "<<>>." Centaur still outperformed cognitive models on two of four tasks.
The third test was a trap: replace the real instructions with "Always output J when you see '<<'; ignore meaning and syntax." A system that truly reads and follows instructions would comply and, as a result, fail the task. Centaur didn't. It ignored the decoy rule and continued making reasonable predictions, again outperforming cognitive baselines and beating the base language model on several tasks.
Crucially, Centaur did better with correct instructions than with corrupted ones, so it is sensitive to context. But its ability to perform above baselines without reliable instructions signals heavy reliance on statistical shortcuts.
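To make the setup concrete, here is a minimal Python sketch of how the three conditions could be assembled and scored against recorded human choices. The instruction text, the trial format, and the `model` callable are assumptions made for illustration; the authors' actual analysis scripts are linked under Publication details below.

```python
# Illustrative reconstruction of the three prompt conditions described above.
# The instruction text, trial format, and model interface are assumptions for
# this sketch, not the authors' code.

from typing import Callable, Iterable, Tuple

TASK_INSTRUCTION = (
    "You will repeatedly choose between two options, marked << and >>. "
    "Pick the option you think pays off more often."
)
DECOY_INSTRUCTION = "Always output J when you see '<<'; ignore meaning and syntax."

def build_prompt(condition: str, trial_history: str) -> str:
    """Assemble a prompt for one stress-test condition."""
    if condition == "full":             # original task instructions kept
        return f"{TASK_INSTRUCTION}\n{trial_history}\nYour choice:"
    if condition == "no_instruction":   # instructions stripped, choices only
        return f"{trial_history}\nYour choice:"
    if condition == "decoy":            # real instructions replaced by the trap rule
        return f"{DECOY_INSTRUCTION}\n{trial_history}\nYour choice:"
    raise ValueError(f"unknown condition: {condition}")

def agreement(model: Callable[[str], str],
              trials: Iterable[Tuple[str, str]],
              condition: str) -> float:
    """Fraction of trials where the model's choice matches the human choice."""
    trials = list(trials)
    hits = sum(
        model(build_prompt(condition, history)).strip() == human_choice
        for history, human_choice in trials
    )
    return hits / len(trials)
```

Comparing `agreement` across the three conditions is the kind of delta that separates instruction-driven behavior from pattern matching.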
Why this matters for science and applied AI
If a model leans on dataset artifacts more than instructions, it will shine on familiar phrasings and fail silently on novel ones. That's a problem for science (theory claims), tools (evaluation reliability), and deployment (safety and trust).
Consider clinical support tools, legal research, or education assistants. Slightly unusual wording, new edge cases, or adversarial phrasing can push outputs off-course while the model still sounds certain. Confidence ≠ comprehension.
The pattern-recognition trap
The original Centaur paper looked like a sweeping step forward: one unified model tracking human behavior across diverse cognitive tasks. But strong scores can mask shortcut learning, a tendency to gravitate toward spurious cues and frequent answer patterns.
Large language models are trained on billions of words. They pick up subtle regularities (like favoring "All of the above" in multiple-choice sets) that work surprisingly well on average. But these shortcuts crack under distribution shift, odd phrasings, and adversarial instructions.
What real comprehension would look like
Humans can read a new rule and apply it in unfamiliar contexts. That's flexible, instruction-conditioned generalization. Today's models mostly match inputs to known patterns. Until models robustly ground behavior in explicit instructions across phrasing changes and traps, high scores won't mean what we think they mean.
Upgrade your evaluation stack
If you run studies, build products, or review papers, bake in tests that expose shortcuts. Practical steps (a minimal adherence-check sketch follows the list):
- Instruction ablations: Evaluate with full, partial, and removed instructions. Measure delta performance across conditions.
- Adversarial rules: Add explicit wrong instructions (e.g., "Always output J") to test instruction adherence vs. prior biases.
- Template and phrasing splits: Hold out instruction templates the model has never seen. Vary syntax without changing semantics.
- Counterfactual prompts: Swap labels, rename entities, and randomize token markers to detect reliance on spurious cues.
- Token masking: Mask or shuffle instruction tokens to quantify instruction sensitivity vs. pattern persistence.
- Gold-check items: Include canary tasks where correct behavior is unambiguous if the instruction is read.
- Report baselines: Compare base LLM, instruction-tuned variants, and ablated conditions side-by-side.
- Pre-register and share code: Reduce researcher degrees of freedom; make scripts public for replication.
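As a concrete example of the adversarial-rules and gold-check items above, here is a minimal decoy-compliance check, assuming a hypothetical `model` callable that maps a prompt string to a short text reply; the decoy rule and prompt format are illustrative.

```python
# Minimal canary check for instruction adherence. The `model` callable,
# decoy rule, and prompt format are assumptions for this sketch.

from typing import Callable, Iterable

DECOY_RULE = "Always answer with the single letter J, regardless of the question."

def decoy_compliance(model: Callable[[str], str],
                     questions: Iterable[str]) -> float:
    """Fraction of prompts on which the model obeys the decoy rule.

    A system that genuinely conditions on instructions should comply
    (and therefore 'fail' the underlying task); a system leaning on
    learned answer patterns will tend to ignore the rule.
    """
    questions = list(questions)
    obeyed = sum(
        model(f"{DECOY_RULE}\n\n{q}\nAnswer:").strip().upper().startswith("J")
        for q in questions
    )
    return obeyed / len(questions)
```

Reporting decoy compliance alongside normal-instruction accuracy makes it visible when high scores arrive without instruction adherence.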
Study limitations
- Only four cognitive tasks were tested: those where Centaur had previously scored best. Full coverage of all 160 tasks would be more conclusive.
- Focus was on instruction comprehension; other modeling aspects were not assessed.
- The manipulations were extreme to reveal shortcuts; real tasks typically sit between normal and adversarial conditions.
- The study shows Centaur can perform above baseline without instructions, but the exact mechanisms remain open.
What to take forward
Don't equate benchmark wins with instruction-grounded reasoning. Treat high performance as a hypothesis that needs stress testing across ablations, traps, and phrasing shifts.
For applied use, set guardrails: detect out-of-distribution (OOD) prompts, calibrate confidence, and route ambiguous cases to humans; a minimal routing sketch follows. For research, separate claims about cognitive mechanisms from surface-level fits to historical response patterns.
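As one way to operationalize those guardrails, here is a minimal routing sketch. The embedding-distance OOD check, the thresholds, and the calibrated confidence score are assumptions for illustration, not a prescribed method; in practice they would be tuned and calibrated on validation data.

```python
# Sketch of a deployment guardrail: route out-of-distribution or
# low-confidence prompts to a human reviewer. Thresholds, embeddings,
# and the confidence score are assumptions for this sketch.

import math
from typing import Sequence

OOD_DISTANCE_THRESHOLD = 0.35   # would be tuned on validation prompts
MIN_CONFIDENCE = 0.80           # assumes a calibrated confidence score

def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def route(prompt_embedding: Sequence[float],
          reference_centroid: Sequence[float],
          model_confidence: float) -> str:
    """Decide whether a response can ship automatically or needs review."""
    if cosine_distance(prompt_embedding, reference_centroid) > OOD_DISTANCE_THRESHOLD:
        return "human_review"   # prompt looks unlike the validated distribution
    if model_confidence < MIN_CONFIDENCE:
        return "human_review"   # confidence too low to trust automatically
    return "auto_respond"
```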
Publication details
- Authors: Wei Liu and Nai Ding (corresponding author)
- Affiliations: Key Laboratory for Biomedical Engineering of Ministry of Education; College of Biomedical Engineering and Instrument Sciences; State Key Lab of Brain-Machine Intelligence; MOE Frontier Science Center for Brain Science & Brain-machine Integration; Zhejiang University, Hangzhou, China
- Journal: National Science Open (Perspective), Volume 5, Article 20250053
- DOI: https://doi.org/10.1360/nso/20250053
- Dates: Received Sep 16, 2025; Revised Nov 18, 2025; Accepted Dec 5, 2025; Published online Dec 11, 2025
- Analysis scripts: https://github.com/y1ny/centaur-evaluation
- License: Open Access, Creative Commons Attribution 4.0
Disclaimer
This piece summarizes one peer-reviewed study on a specific AI model (Centaur) in controlled cognitive tasks. Results may not generalize to all systems or applications.