Why AI Keeps Making Things Up, and Should It Just Say "I Don't Know"?

AI models still make stuff up because they predict likely words, not facts. Reward 'I don't know,' ground with retrieval, and calibrate uncertainty to cut errors.

Categorized in: AI News, Science and Research
Published on: Nov 03, 2025

Why AI Still Makes Things Up - And How to Reduce It

Large language models can sound confident while delivering answers that aren't true. That's not a bug in the code so much as a feature of how these systems learn.

Researchers argue that hallucinations persist even in advanced models. A recent preprint from OpenAI authors on arXiv (not yet peer-reviewed) has triggered fresh debate about how models should handle uncertainty.

Why models hallucinate

Language models are probability machines over word sequences, not fact databases. They pick the next token that's most likely to follow, given the prompt and training data.
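
To make that concrete, here is a toy next-token step with invented logits (every token and number below is made up for illustration): the model converts raw scores into probabilities and emits whichever continuation is most likely, with no check on whether the resulting sentence is true.

```python
import math

# Invented logits for candidate next tokens after a prompt such as
# "The 2019 study by Smith et al. was published in". Numbers are made up.
logits = {"Nature": 2.1, "Science": 1.8, "the": 0.4, "Journal": 1.2}

# Softmax turns raw scores into a probability distribution over tokens.
total = sum(math.exp(v) for v in logits.values())
probs = {tok: round(math.exp(v) / total, 3) for tok, v in logits.items()}

# The model emits whatever is most probable: plausible-sounding, not verified.
print(max(probs, key=probs.get), probs)
```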

When the training signal is sparse or skewed, the model can produce fluent, plausible fiction. That includes fabricated studies, titles, references, and even real researcher names stitched into non-existent papers.

During training and evaluation, models are often rewarded for getting the "right" answer. They are rarely rewarded for acknowledging uncertainty.

The multiple-choice incentive problem

Benchmarks frequently look like multiple-choice tests. If you can't leave a question blank, guessing is rational.
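
A quick back-of-the-envelope check of why, under two hypothetical grading schemes (the numbers are illustrative, not drawn from any specific benchmark):

```python
# Expected score per question for a model that can guess right 25 percent of
# the time on a four-option item, under two hypothetical grading schemes.
p_correct = 0.25

def expected_guess(reward_right: float, penalty_wrong: float) -> float:
    return p_correct * reward_right + (1 - p_correct) * penalty_wrong

# Scheme A: typical benchmark scoring (right = 1, wrong = 0, abstain = 0).
print("guess:", expected_guess(1.0, 0.0), "abstain:", 0.0)   # 0.25 vs 0.0: guess

# Scheme B: wrong answers penalized (right = 1, wrong = -1, abstain = 0).
print("guess:", expected_guess(1.0, -1.0), "abstain:", 0.0)  # -0.5 vs 0.0: abstain
```

Only under the second scheme does abstaining ever pay; most benchmarks look like the first.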

That incentive carries over to deployment: the model learns to answer even when it shouldn't. One proposal is simple: introduce an explicit "I don't know" option during training and evaluation so the model can be rewarded for abstaining.

That could reduce hallucinations on short-form tasks. It's tougher for open-ended outputs like literature summaries or long-form analysis.

Should models say "I don't know" more often?

Some analyses suggest a calibrated model might start with "I don't know" roughly a third of the time. As commentary in The Conversation notes, many users could quickly lose patience with that behavior.

Others counter that clarity beats false certainty. The key is balance: avoid both confident nonsense and excessive refusal that makes the system unusable.

Are things getting better?

Grounding with external tools has helped. Retrieval-augmented generation, web search, and document-specific contexts reduce errors by tying outputs to verifiable sources.

Still, failure rates can be high when answers aren't grounded in retrieved sources. One recent test on news queries reported that 45 percent of answers included major errors and fabricated links.

Practical steps for scientists and research teams

  • Use retrieval-first workflows. Provide a curated corpus, require the model to cite URLs/DOIs, and limit claims to retrieved passages (see the retrieval sketch after this list).
  • Gate on confidence. Calibrate per-task thresholds, abstain or ask a follow-up question when uncertainty is high, and route low-confidence cases to human review (a routing sketch follows the list).
  • Prompt for verification. Instruct the model to separate claims from evidence and include a source list for each major claim.
  • Adopt structured outputs. Ask for JSON with fields like claim, evidence_url, evidence_quote, confidence, unresolved_questions; this makes auditing easier (a validator sketch follows the list).
  • Reward abstention. If you fine-tune, include an explicit "I don't know" token and align rewards to correct abstentions, not just correct answers.
  • Constrain open-ended tasks. For summaries, require paragraph-level citations and forbid statements not supported by the provided context.
  • Evaluate continuously. Track hallucination rates, citation validity, and failure modes by topic. Maintain a hard test set your team never uses for prompt tuning.
  • Design the interface for honesty. Show uncertainty indicators, number of sources, and last-updated times. Avoid overly confident phrasing when confidence is low.
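
A minimal sketch of the retrieval-first bullet above. The corpus, the keyword-overlap scorer, and the commented-out call_model hook are all placeholders; a production pipeline would use a real retriever (BM25 or embeddings) and your own model client.

```python
# Retrieval-first sketch: only retrieved, citable passages reach the model.
CORPUS = {  # placeholder curated corpus: id -> (source URL or DOI, passage text)
    "doc1": ("https://example.org/paper-1", "The CRISPR screen identified twelve candidate genes."),
    "doc2": ("https://example.org/paper-2", "The cohort study reported a 12 percent reduction in risk."),
}

def retrieve(question: str, k: int = 2):
    """Toy keyword-overlap retriever; swap in BM25 or embeddings in practice."""
    q_words = set(question.lower().split())
    scored = sorted(
        CORPUS.items(),
        key=lambda item: len(q_words & set(item[1][1].lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question: str) -> str:
    """Pack retrieved passages into a prompt that demands citations or abstention."""
    context = "\n".join(
        f"[{doc_id}] ({url}) {text}" for doc_id, (url, text) in retrieve(question)
    )
    return (
        "Answer using ONLY the passages below. Cite passage ids like [doc1] after "
        "each claim. If the passages do not support an answer, reply 'I don't know.'\n\n"
        f"Passages:\n{context}\n\nQuestion: {question}"
    )

# answer = call_model(build_prompt("What did the cohort study report?"))  # hypothetical model client
print(build_prompt("What did the cohort study report?"))
```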
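
The confidence gate can be as simple as a threshold over whatever uncertainty signal you trust (average token log-probability, ensemble agreement, a verifier score). The thresholds and field names below are illustrative assumptions, not recommendations.

```python
from dataclasses import dataclass

@dataclass
class ModelAnswer:
    text: str
    confidence: float  # however you estimate it: log-probs, ensembles, a verifier

def route(answer: ModelAnswer, answer_at: float = 0.7, review_at: float = 0.5):
    """Answer, send to human review, or abstain, based on illustrative thresholds."""
    if answer.confidence >= answer_at:
        return {"action": "answer", "text": answer.text}
    if answer.confidence >= review_at:
        return {"action": "human_review", "text": answer.text}
    return {"action": "abstain", "text": "I don't know; the evidence is too thin."}

print(route(ModelAnswer("The half-life is 5,730 years.", confidence=0.92)))
print(route(ModelAnswer("The paper appeared in 2014.", confidence=0.40)))
```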
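
And for the structured-output bullet, a small validator can reject any response that drops an audit field. The schema mirrors the fields named above and is only a sketch; real pipelines often use JSON Schema or Pydantic instead.

```python
import json

REQUIRED_FIELDS = {"claim", "evidence_url", "evidence_quote", "confidence", "unresolved_questions"}

def validate_claims(raw_json: str):
    """Parse model output and list every claim that is missing an audit field."""
    problems = []
    for i, claim in enumerate(json.loads(raw_json)):
        missing = REQUIRED_FIELDS - claim.keys()
        if missing:
            problems.append((i, sorted(missing)))
    return problems

example = json.dumps([{
    "claim": "Compound X inhibits enzyme Y",
    "evidence_url": "https://example.org/paper-1",
    "evidence_quote": "X reduced Y activity by 40 percent",
    "confidence": 0.8,
    "unresolved_questions": [],
}])
print(validate_claims(example))  # [] means every claim carries its audit fields
```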

What to do for open-ended outputs

Use a claim-evidence template: each paragraph ends with citations tied to exact quotes. If a point lacks support, the model must flag what data is missing or request more context.
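
One way to encode that template is as an instruction block prepended to the task. The wording below is only an illustrative starting point, not a tested prompt.

```python
CLAIM_EVIDENCE_TEMPLATE = """\
Write the summary as short paragraphs. After each paragraph add a line:
Sources: [passage id] "exact quote that supports the paragraph"
If a statement is not supported by the provided context, replace it with:
MISSING EVIDENCE: <what data or context would be needed>
Do not cite anything outside the provided context.
"""

# prompt = CLAIM_EVIDENCE_TEMPLATE + "\nContext:\n" + context + "\nTask: summarize the attached papers."
print(CLAIM_EVIDENCE_TEMPLATE)
```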

For literature reviews, bind the model to a fixed library (your lab's Zotero, PDFs, or a vetted corpus) and disallow references outside that set.

Research directions worth watching

  • Self-consistency and multi-pass reasoning with verifiers
  • Tool use by default: search, calculators, code, and document loaders
  • Calibrated uncertainty from logits and ensemble agreement
  • Retrieval-augmented critics that challenge unsupported claims
  • Coverage analysis to detect topic areas where the model is likely to guess

Further reading

Preprints and debates about uncertainty and abstention are active on arXiv. For a perspective on user tolerance for uncertainty statements, see analysis in The Conversation.

Build capability in your team

If you're implementing retrieval, evaluation pipelines, or abstention-aware prompts across your org, explore focused training on prompt patterns and RAG workflows: Prompt engineering guides and courses by job role.

