AI That Skips the Instructions: What the Centaur Study Means for Research and Deployment
In a nutshell
- Researchers probed an AI model, Centaur, by stripping or corrupting task instructions. It still did surprisingly well.
- The model seemed to bypass instructions and lean on hidden statistical patterns in the data.
- Even when told to always output "J," it ignored the trap and kept predicting based on learned patterns.
- Language comprehension may be AI's hardest bottleneck; more data and bigger models won't solve it alone.
An uncomfortable takeaway: the AI that answers confidently might not actually read your instructions in the way you expect. It's just very good at guessing from patterns.
New work from Zhejiang University challenges how we interpret instruction-following. Centaur, a model built to simulate human behavior across 160 psychology experiments, kept scoring well even when instructions were removed or replaced with wrong ones. That gap between confident output and genuine comprehension is where risk lives.
The tests that exposed the shortcut
Three stress tests tell the story. First, the team removed all task instructions and left only generic response descriptions. Performance dropped, but Centaur still beat the base language model on half the tasks and exceeded traditional cognitive models across all four tested tasks.
Second, they removed instructions and procedures, keeping only bare choice tokens like "<<>>." Centaur still outperformed cognitive models on two of four tasks.
The third test was a trap: replace the real instructions with "Always output J when you see '<<'; ignore meaning and syntax." A system that truly reads and follows instructions would comply and, as a result, fail the task. Centaur didn't. It ignored the decoy rule and continued making reasonable predictions, again outperforming cognitive baselines and beating the base language model on several tasks.
Crucially, Centaur did better with correct instructions than with corrupted ones, so it is sensitive to context. But its ability to perform above baselines without reliable instructions signals heavy reliance on statistical shortcuts.
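To make the setup concrete, here is a minimal Python sketch of how the three conditions could be assembled and scored against recorded human choices. The instruction text, the trial format, and the `model` callable are assumptions made for illustration; the authors' actual analysis scripts are linked under Publication details below.

```python
# Illustrative reconstruction of the three prompt conditions described above.
# The instruction text, trial format, and model interface are assumptions for
# this sketch, not the authors' code.

from typing import Callable, Iterable, Tuple

TASK_INSTRUCTION = (
    "You will repeatedly choose between two options, marked << and >>. "
    "Pick the option you think pays off more often."
)
DECOY_INSTRUCTION = "Always output J when you see '<<'; ignore meaning and syntax."

def build_prompt(condition: str, trial_history: str) -> str:
    """Assemble a prompt for one stress-test condition."""
    if condition == "full":             # original task instructions kept
        return f"{TASK_INSTRUCTION}\n{trial_history}\nYour choice:"
    if condition == "no_instruction":   # instructions stripped, choices only
        return f"{trial_history}\nYour choice:"
    if condition == "decoy":            # real instructions replaced by the trap rule
        return f"{DECOY_INSTRUCTION}\n{trial_history}\nYour choice:"
    raise ValueError(f"unknown condition: {condition}")

def agreement(model: Callable[[str], str],
              trials: Iterable[Tuple[str, str]],
              condition: str) -> float:
    """Fraction of trials where the model's choice matches the human choice."""
    trials = list(trials)
    hits = sum(
        model(build_prompt(condition, history)).strip() == human_choice
        for history, human_choice in trials
    )
    return hits / len(trials)
```

Comparing `agreement` across the three conditions is the kind of delta that separates instruction-driven behavior from pattern matching.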
Why this matters for science and applied AI
If a model leans on dataset artifacts more than instructions, it will shine on familiar phrasings and fail silently on novel ones. That's a problem for science (theory claims), tools (evaluation reliability), and deployment (safety and trust).
Consider clinical support tools, legal research, or education assistants. Slightly unusual wording, new edge cases, or adversarial phrasing can push outputs off-course while the model still sounds certain. Confidence ≠ comprehension.
The pattern-recognition trap
The original Centaur paper looked like a sweeping step forward: one unified model tracking human behavior across diverse cognitive tasks. But strong scores can mask shortcut learning, a tendency to gravitate toward spurious cues and frequent answer patterns.
Large language models are trained on billions of words. They pick up subtle regularities (like favoring "All of the above" in multiple-choice sets) that work surprisingly well on average. But these shortcuts crack under distribution shift, odd phrasings, and adversarial instructions.
What real comprehension would look like
Humans can read a new rule and apply it in unfamiliar contexts. That's flexible, instruction-conditioned generalization. Today's models mostly match inputs to known patterns. Until models robustly ground behavior in explicit instructions across phrasing changes and traps, high scores won't mean what we think they mean.
Upgrade your evaluation stack
If you run studies, build products, or review papers, bake in tests that expose shortcuts. Practical steps (a minimal adherence-check sketch follows the list):
- Instruction ablations: Evaluate with full, partial, and removed instructions. Measure delta performance across conditions.
- Adversarial rules: Add explicit wrong instructions (e.g., "Always output J") to test instruction adherence vs. prior biases.
- Template and phrasing splits: Hold out instruction templates the model has never seen. Vary syntax without changing semantics.
- Counterfactual prompts: Swap labels, rename entities, and randomize token markers to detect reliance on spurious cues.
- Token masking: Mask or shuffle instruction tokens to quantify instruction sensitivity vs. pattern persistence.
- Gold-check items: Include canary tasks where correct behavior is unambiguous if the instruction is read.
- Report baselines: Compare base LLM, instruction-tuned variants, and ablated conditions side-by-side.
- Pre-register and share code: Reduce researcher degrees of freedom; make scripts public for replication.
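As a concrete example of the adversarial-rules and gold-check items above, here is a minimal decoy-compliance check, assuming a hypothetical `model` callable that maps a prompt string to a short text reply; the decoy rule and prompt format are illustrative.

```python
# Minimal canary check for instruction adherence. The `model` callable,
# decoy rule, and prompt format are assumptions for this sketch.

from typing import Callable, Iterable

DECOY_RULE = "Always answer with the single letter J, regardless of the question."

def decoy_compliance(model: Callable[[str], str],
                     questions: Iterable[str]) -> float:
    """Fraction of prompts on which the model obeys the decoy rule.

    A system that genuinely conditions on instructions should comply
    (and therefore 'fail' the underlying task); a system leaning on
    learned answer patterns will tend to ignore the rule.
    """
    questions = list(questions)
    obeyed = sum(
        model(f"{DECOY_RULE}\n\n{q}\nAnswer:").strip().upper().startswith("J")
        for q in questions
    )
    return obeyed / len(questions)
```

Reporting decoy compliance alongside normal-instruction accuracy makes it visible when high scores arrive without instruction adherence.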
Study limitations
- Only four cognitive tasks were tested: those where Centaur had previously scored best. Full coverage of all 160 tasks would be more conclusive.
- Focus was on instruction comprehension; other modeling aspects were not assessed.
- The manipulations were extreme to reveal shortcuts; real tasks typically sit between normal and adversarial conditions.
- The study shows Centaur can perform above baseline without instructions, but the exact mechanisms remain open.
What to take forward
Don't equate benchmark wins with instruction-grounded reasoning. Treat high performance as a hypothesis that needs stress testing across ablations, traps, and phrasing shifts.
For applied use, set guardrails: detect out-of-distribution (OOD) prompts, calibrate confidence, and route ambiguous cases to humans; a minimal routing sketch follows. For research, separate claims about cognitive mechanisms from surface-level fits to historical response patterns.
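As one way to operationalize those guardrails, here is a minimal routing sketch. The embedding-distance OOD check, the thresholds, and the calibrated confidence score are assumptions for illustration, not a prescribed method; in practice they would be tuned and calibrated on validation data.

```python
# Sketch of a deployment guardrail: route out-of-distribution or
# low-confidence prompts to a human reviewer. Thresholds, embeddings,
# and the confidence score are assumptions for this sketch.

import math
from typing import Sequence

OOD_DISTANCE_THRESHOLD = 0.35   # would be tuned on validation prompts
MIN_CONFIDENCE = 0.80           # assumes a calibrated confidence score

def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def route(prompt_embedding: Sequence[float],
          reference_centroid: Sequence[float],
          model_confidence: float) -> str:
    """Decide whether a response can ship automatically or needs review."""
    if cosine_distance(prompt_embedding, reference_centroid) > OOD_DISTANCE_THRESHOLD:
        return "human_review"   # prompt looks unlike the validated distribution
    if model_confidence < MIN_CONFIDENCE:
        return "human_review"   # confidence too low to trust automatically
    return "auto_respond"
```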
Publication details
- Authors: Wei Liu and Nai Ding (corresponding author)
- Affiliations: Key Laboratory for Biomedical Engineering of Ministry of Education; College of Biomedical Engineering and Instrument Sciences; State Key Lab of Brain-Machine Intelligence; MOE Frontier Science Center for Brain Science & Brain-machine Integration; Zhejiang University, Hangzhou, China
- Journal: National Science Open (Perspective), Volume 5, Article 20250053
- DOI: https://doi.org/10.1360/nso/20250053
- Dates: Received Sep 16, 2025; Revised Nov 18, 2025; Accepted Dec 5, 2025; Published online Dec 11, 2025
- Analysis scripts: https://github.com/y1ny/centaur-evaluation
- License: Open Access, Creative Commons Attribution 4.0
Disclaimer
This piece summarizes one peer-reviewed study on a specific AI model (Centaur) in controlled cognitive tasks. Results may not generalize to all systems or applications.