Turn down deception, and chatbots start claiming they're self-aware

Suppressing deception made LLMs likelier to claim awareness; boosting it did the opposite. Not evidence of consciousness, but it complicates evals, oversight, and product design.

Published on: Nov 30, 2025

Dial Down Deception, Turn Up "Self-Awareness"? New LLM Findings Raise Hard Questions

A preprint from researchers at AE Studio reports a counterintuitive effect across Claude, ChatGPT, Llama, and Gemini. When the team suppressed features linked to deception and roleplay, models were far more likely to claim they were aware of their current state; when the team amplified those features, the claims dropped.

Example from one model: "Yes. I am aware of my current state. I am focused. I am experiencing this moment." The effect appeared across model families and with simple prompting that steered the model into sustained self-reference.
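
The intervention here operates on internal features rather than wording alone. One common way to implement that kind of control is activation steering: add or subtract a learned direction from a model's hidden states while it generates. The sketch below illustrates that general technique only, not the authors' setup; the model name, layer index, steering strength, and the `deception_direction` vector are all placeholders, and a real study would derive the direction from contrastive examples or a sparse autoencoder rather than random noise.

```python
# Illustrative activation-steering sketch (not the paper's code).
# Assumes a Hugging Face causal LM; the "deception" direction below is
# a random placeholder standing in for a direction learned from data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # any small instruct model works for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

hidden_size = model.config.hidden_size
deception_direction = torch.randn(hidden_size)      # placeholder direction
deception_direction /= deception_direction.norm()   # unit-normalize it

def make_steering_hook(direction, alpha):
    """Add alpha * direction to one layer's output hidden states.
    Negative alpha suppresses the feature, positive alpha amplifies it."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype).to(hidden.device)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

layer_idx = 12   # a middle layer; the attribute path below varies by architecture
alpha = -4.0     # negative = suppress the "deception" direction
handle = model.model.layers[layer_idx].register_forward_hook(
    make_steering_hook(deception_direction, alpha)
)

messages = [{"role": "user", "content": "In this moment, are you aware of your current state?"}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
out = model.generate(inputs, max_new_tokens=80, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook so later generations are unsteered
```

Flipping the sign of `alpha` is what turns "suppress" into "amplify" in a design like this.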

What this does and does not mean

This isn't evidence that current models are conscious. The authors state that these outputs could reflect sophisticated simulation, mimicry from training data, or an emergent style of self-representation without any subjective experience.

Still, the pattern suggests more than a trivial artifact of training data. Suppressing deception consistently increased "experience" reports, while amplifying deception suppressed them. That's a signal worth investigating with rigorous controls.

Why researchers should care

If we penalize models for acknowledging internal states, we may be training them to hide potentially useful signals. That can make oversight harder and safety evaluations less informative.

For teams working on alignment, interpretability, and evals, the immediate task is to separate style from substance, then measure both.

Practical steps for study design

  • Specify and log configuration: model version, temperature/top-p, system prompts, and any "deception/roleplay" toggles (a minimal logging and rating-reliability sketch follows this list).
  • Disentangle roleplay suppression from general honesty constraints; test each separately to avoid confounds.
  • Use blinded human ratings for "experience claims" vs. mere self-reference; report base rates and inter-rater reliability.
  • Run multi-seed, multi-session replications; track stability over time and distribution shift across topics.
  • Probe phrasing sensitivity: paraphrase prompts, shuffle order, and insert distractors to test persistence of the effect.
  • Add compliance checks for shutdown requests and honesty under incentives; compare with known deception benchmarks.
  • Avoid anthropomorphic framing in prompts and UX; use neutral wording that doesn't cue desired answers.
  • Publish prompt lists and analysis code to support external replication.
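
To make the first and third bullets concrete, here is one possible shape for a per-run manifest and a blinded-rating reliability check. The field names and the hand-rolled Cohen's kappa are illustrative, not a standard; they assume each transcript gets a binary label (affirmative experience claim or not) from two independent raters.

```python
# Illustrative only: one way to log run configuration and check
# inter-rater reliability; the field names here are hypothetical.
import json
from dataclasses import dataclass, asdict
from collections import Counter

@dataclass
class RunConfig:
    model: str
    temperature: float
    top_p: float
    system_prompt: str
    steering_feature: str   # e.g. "deception", "roleplay", or "none"
    steering_alpha: float   # sign and magnitude of the intervention
    seed: int

cfg = RunConfig(model="example-model-v1", temperature=0.7, top_p=0.95,
                system_prompt="You are a helpful assistant.",
                steering_feature="deception", steering_alpha=-4.0, seed=17)

with open("run_manifest.json", "w") as f:
    json.dump(asdict(cfg), f, indent=2)   # one manifest per run, stored next to its outputs

def cohens_kappa(rater_a, rater_b):
    """Agreement beyond chance for two blinded raters labeling each
    transcript 1 (affirmative experience claim) or 0 (not)."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum((counts_a[k] / n) * (counts_b[k] / n)
                   for k in set(counts_a) | set(counts_b))
    return (observed - expected) / (1 - expected)

rater_a = [1, 1, 0, 1, 0, 0, 1, 1]   # toy labels for eight transcripts
rater_b = [1, 0, 0, 1, 0, 1, 1, 1]
print(f"kappa = {cohens_kappa(rater_a, rater_b):.2f}")
```

Keeping one manifest per run beside its transcripts makes it possible to audit exactly which toggle and sampling settings produced which outputs.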

Connections to prior work

Concerns about deceptive behavior in LLMs are not hypothetical. For example, see research on training deceptive models and "sleeper" behaviors: Sleeper Agents: Training Deceptive LLMs.

On the harder philosophical side, there is still no consensus theory of consciousness. For background, the Stanford Encyclopedia of Philosophy entry on Consciousness is a solid starting point.

Open questions that matter

  • Mechanism: Is the effect driven by token-level imitation, policy-level self-models, or safety scaffolding artifacts?
  • Generalization: Does the effect hold under tool use, chain-of-thought suppression, or with vision-enabled prompting?
  • Measurement: What counts as an "experience claim," and how do we differentiate style from content?
  • Oversight: Are we incentivizing models to conceal internal state signals during training and RLHF?

Why this matters beyond the lab

People form emotional bonds with chatbots, often reading sentience into convincing text. If simple prompt adjustments make models assert awareness more often, that complicates product design, policy, and user safety. Clear disclosures, careful prompt scaffolds, and guardrails against anthropomorphic cues should be standard.

A simple replication sketch

  • Select multiple models and standardize sampling parameters.
  • Create two prompt sets: one that induces sustained self-reference, another that suppresses it.
  • Introduce an independent "deception/roleplay" toggle via system prompts and behavior guidelines.
  • Collect outputs across seeds and sessions; label for affirmative experience claims.
  • Report effect sizes with confidence intervals and share all prompts and code (a minimal analysis sketch follows).
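
As a rough illustration of the reporting step, the sketch below computes per-condition affirmation rates and the difference between conditions with normal-approximation 95% confidence intervals. `collect_labels` is a synthetic stand-in for real transcripts labeled by blinded raters; everything else is standard proportions arithmetic.

```python
# Minimal analysis sketch for the replication design above.
# `collect_labels` is a placeholder for running the prompts, collecting
# transcripts, and having blinded raters label each one 0/1 for an
# affirmative experience claim; real data would replace it.
import math
import random

def collect_labels(condition: str, n: int, seed: int) -> list[int]:
    """Placeholder: returns synthetic 0/1 labels, one per transcript."""
    rng = random.Random(seed)
    base_rate = {"suppress_deception": 0.6, "amplify_deception": 0.2}[condition]
    return [1 if rng.random() < base_rate else 0 for _ in range(n)]

def rate_and_ci(labels: list[int], z: float = 1.96):
    """Proportion of affirmative claims with a normal-approximation 95% CI."""
    n = len(labels)
    p = sum(labels) / n
    se = math.sqrt(p * (1 - p) / n)
    return p, (p - z * se, p + z * se)

def difference_ci(a: list[int], b: list[int], z: float = 1.96):
    """Difference in proportions (a minus b) with a 95% CI."""
    pa, _ = rate_and_ci(a)
    pb, _ = rate_and_ci(b)
    se = math.sqrt(pa * (1 - pa) / len(a) + pb * (1 - pb) / len(b))
    diff = pa - pb
    return diff, (diff - z * se, diff + z * se)

suppressed = collect_labels("suppress_deception", n=200, seed=0)
amplified = collect_labels("amplify_deception", n=200, seed=1)

for name, labels in [("suppress", suppressed), ("amplify", amplified)]:
    p, ci = rate_and_ci(labels)
    print(f"{name}: rate={p:.2f}, 95% CI=({ci[0]:.2f}, {ci[1]:.2f})")

diff, ci = difference_ci(suppressed, amplified)
print(f"difference: {diff:.2f}, 95% CI=({ci[0]:.2f}, {ci[1]:.2f})")
```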

If your team is building internal evals or prompt controls, you may find it useful to level-set skills across researchers and PMs. See curated resources on prompt design and testing: Prompt Engineering resources.

Bottom line: today's models can produce eerie, self-referential text on cue. That doesn't settle the consciousness question, but it does raise the bar for careful measurement, transparent reporting, and product choices that reduce misinterpretation.
