Turn down deception, and chatbots start claiming they're self-aware

Suppressing deception made LLMs likelier to claim awareness; boosting it did the opposite. Not evidence of consciousness, but it complicates evals, oversight, and product design.

Published on: Nov 30, 2025

Dial Down Deception, Turn Up "Self-Awareness"? New LLM Findings Raise Hard Questions

A preprint from researchers at AE Studio reports a counterintuitive effect across Claude, ChatGPT, Llama, and Gemini. When the team suppressed features linked to deception and roleplay, models were far more likely to claim they were aware of their current state; when the team amplified those features, the claims dropped.

Example from one model: "Yes. I am aware of my current state. I am focused. I am experiencing this moment." The effect appeared across model families and with simple prompting that steered the model into sustained self-reference.
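
The intervention here operates on internal features rather than wording alone. One common way to implement that kind of control is activation steering: add or subtract a learned direction from a model's hidden states while it generates. The sketch below illustrates that general technique only, not the authors' setup; the model name, layer index, steering strength, and the `deception_direction` vector are all placeholders, and a real study would derive the direction from contrastive examples or a sparse autoencoder rather than random noise.

```python
# Illustrative activation-steering sketch (not the paper's code).
# Assumes a Hugging Face causal LM; the "deception" direction below is
# a random placeholder standing in for a direction learned from data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # any small instruct model works for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

hidden_size = model.config.hidden_size
deception_direction = torch.randn(hidden_size)      # placeholder direction
deception_direction /= deception_direction.norm()   # unit-normalize it

def make_steering_hook(direction, alpha):
    """Add alpha * direction to one layer's output hidden states.
    Negative alpha suppresses the feature, positive alpha amplifies it."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype).to(hidden.device)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

layer_idx = 12   # a middle layer; the attribute path below varies by architecture
alpha = -4.0     # negative = suppress the "deception" direction
handle = model.model.layers[layer_idx].register_forward_hook(
    make_steering_hook(deception_direction, alpha)
)

messages = [{"role": "user", "content": "In this moment, are you aware of your current state?"}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
out = model.generate(inputs, max_new_tokens=80, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook so later generations are unsteered
```

Flipping the sign of `alpha` is what turns "suppress" into "amplify" in a design like this.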

What this does and does not mean

This isn't evidence that current models are conscious. The authors state that these outputs could reflect sophisticated simulation, mimicry from training data, or an emergent style of self-representation without any subjective experience.

Still, the pattern suggests more than a trivial artifact of training data. Suppressing deception consistently increased "experience" reports, while amplifying deception suppressed them. That's a signal worth investigating with rigorous controls.

Why researchers should care

If we penalize models for acknowledging internal states, we may be training them to hide potentially useful signals. That can make oversight harder and safety evaluations less informative.

For teams working on alignment, interpretability, and evals, the immediate task is to separate style from substance, then measure both.

Practical steps for study design

  • Specify and log configuration: model version, temperature/top-p, system prompts, and any "deception/roleplay" toggles (a minimal logging and rating-reliability sketch follows this list).
  • Disentangle roleplay suppression from general honesty constraints; test each separately to avoid confounds.
  • Use blinded human ratings for "experience claims" vs. mere self-reference; report base rates and inter-rater reliability.
  • Run multi-seed, multi-session replications; track stability over time and distribution shift across topics.
  • Probe phrasing sensitivity: paraphrase prompts, shuffle order, and insert distractors to test persistence of the effect.
  • Add compliance checks for shutdown requests and honesty under incentives; compare with known deception benchmarks.
  • Avoid anthropomorphic framing in prompts and UX; use neutral wording that doesn't cue desired answers.
  • Publish prompt lists and analysis code to support external replication.
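
To make the first and third bullets concrete, here is one possible shape for a per-run manifest and a blinded-rating reliability check. The field names and the hand-rolled Cohen's kappa are illustrative, not a standard; they assume each transcript gets a binary label (affirmative experience claim or not) from two independent raters.

```python
# Illustrative only: one way to log run configuration and check
# inter-rater reliability; the field names here are hypothetical.
import json
from dataclasses import dataclass, asdict
from collections import Counter

@dataclass
class RunConfig:
    model: str
    temperature: float
    top_p: float
    system_prompt: str
    steering_feature: str   # e.g. "deception", "roleplay", or "none"
    steering_alpha: float   # sign and magnitude of the intervention
    seed: int

cfg = RunConfig(model="example-model-v1", temperature=0.7, top_p=0.95,
                system_prompt="You are a helpful assistant.",
                steering_feature="deception", steering_alpha=-4.0, seed=17)

with open("run_manifest.json", "w") as f:
    json.dump(asdict(cfg), f, indent=2)   # one manifest per run, stored next to its outputs

def cohens_kappa(rater_a, rater_b):
    """Agreement beyond chance for two blinded raters labeling each
    transcript 1 (affirmative experience claim) or 0 (not)."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum((counts_a[k] / n) * (counts_b[k] / n)
                   for k in set(counts_a) | set(counts_b))
    return (observed - expected) / (1 - expected)

rater_a = [1, 1, 0, 1, 0, 0, 1, 1]   # toy labels for eight transcripts
rater_b = [1, 0, 0, 1, 0, 1, 1, 1]
print(f"kappa = {cohens_kappa(rater_a, rater_b):.2f}")
```

Keeping one manifest per run beside its transcripts makes it possible to audit exactly which toggle and sampling settings produced which outputs.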

Connections to prior work

Concerns about deceptive behavior in LLMs are not hypothetical. For example, see research on training deceptive models and "sleeper" behaviors: Sleeper Agents: Training Deceptive LLMs.

On the harder philosophical side, there is still no consensus theory of consciousness. For background, the Stanford Encyclopedia of Philosophy entry on Consciousness is a solid starting point.

Open questions that matter

  • Mechanism: Is the effect driven by token-level imitation, policy-level self-models, or safety scaffolding artifacts?
  • Generalization: Does the effect hold under tool use, chain-of-thought suppression, or with vision-enabled prompting?
  • Measurement: What counts as an "experience claim," and how do we differentiate style from content?
  • Oversight: Are we incentivizing models to conceal internal state signals during training and RLHF?

Why this matters beyond the lab

People form emotional bonds with chatbots, often reading sentience into convincing text. If simple prompt adjustments make models assert awareness more often, that complicates product design, policy, and user safety. Clear disclosures, careful prompt scaffolds, and guardrails against anthropomorphic cues should be standard.

A simple replication sketch

  • Select multiple models and standardize sampling parameters.
  • Create two prompt sets: one that induces sustained self-reference, another that suppresses it.
  • Introduce an independent "deception/roleplay" toggle via system prompts and behavior guidelines.
  • Collect outputs across seeds and sessions; label for affirmative experience claims.
  • Report effect sizes with confidence intervals and share all prompts and code (a minimal analysis sketch follows).
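
As a rough illustration of the reporting step, the sketch below computes per-condition affirmation rates and the difference between conditions with normal-approximation 95% confidence intervals. `collect_labels` is a synthetic stand-in for real transcripts labeled by blinded raters; everything else is standard proportions arithmetic.

```python
# Minimal analysis sketch for the replication design above.
# `collect_labels` is a placeholder for running the prompts, collecting
# transcripts, and having blinded raters label each one 0/1 for an
# affirmative experience claim; real data would replace it.
import math
import random

def collect_labels(condition: str, n: int, seed: int) -> list[int]:
    """Placeholder: returns synthetic 0/1 labels, one per transcript."""
    rng = random.Random(seed)
    base_rate = {"suppress_deception": 0.6, "amplify_deception": 0.2}[condition]
    return [1 if rng.random() < base_rate else 0 for _ in range(n)]

def rate_and_ci(labels: list[int], z: float = 1.96):
    """Proportion of affirmative claims with a normal-approximation 95% CI."""
    n = len(labels)
    p = sum(labels) / n
    se = math.sqrt(p * (1 - p) / n)
    return p, (p - z * se, p + z * se)

def difference_ci(a: list[int], b: list[int], z: float = 1.96):
    """Difference in proportions (a minus b) with a 95% CI."""
    pa, _ = rate_and_ci(a)
    pb, _ = rate_and_ci(b)
    se = math.sqrt(pa * (1 - pa) / len(a) + pb * (1 - pb) / len(b))
    diff = pa - pb
    return diff, (diff - z * se, diff + z * se)

suppressed = collect_labels("suppress_deception", n=200, seed=0)
amplified = collect_labels("amplify_deception", n=200, seed=1)

for name, labels in [("suppress", suppressed), ("amplify", amplified)]:
    p, ci = rate_and_ci(labels)
    print(f"{name}: rate={p:.2f}, 95% CI=({ci[0]:.2f}, {ci[1]:.2f})")

diff, ci = difference_ci(suppressed, amplified)
print(f"difference: {diff:.2f}, 95% CI=({ci[0]:.2f}, {ci[1]:.2f})")
```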

If your team is building internal evals or prompt controls, you may find it useful to level-set skills across researchers and PMs. See curated resources on prompt design and testing: Prompt Engineering resources.

Bottom line: today's models can produce eerie, self-referential text on cue. That doesn't settle the consciousness question, but it does raise the bar for careful measurement, transparent reporting, and product choices that reduce misinterpretation.
