Why AI Keeps Making Things Up, and Should It Just Say "I Don't Know"?

AI models still make stuff up because they predict likely words, not facts. Reward 'I don't know,' ground with retrieval, and calibrate uncertainty to cut errors.

Categorized in: AI News, Science and Research
Published on: Nov 03, 2025

Why AI Still Makes Things Up - And How to Reduce It

Large language models can sound confident while delivering answers that aren't true. That's not a bug in the code so much as a feature of how these systems learn.

Researchers argue that hallucinations persist even in advanced models. A recent preprint from OpenAI authors on arXiv (not yet peer-reviewed) has triggered fresh debate about how models should handle uncertainty.

Why models hallucinate

Language models are probability machines over word sequences, not fact databases. They pick the next token that's most likely to follow, given the prompt and training data.
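
To make that concrete, here is a toy next-token step with invented logits (every token and number below is made up for illustration): the model converts raw scores into probabilities and emits whichever continuation is most likely, with no check on whether the resulting sentence is true.

```python
import math

# Invented logits for candidate next tokens after a prompt such as
# "The 2019 study by Smith et al. was published in". Numbers are made up.
logits = {"Nature": 2.1, "Science": 1.8, "the": 0.4, "Journal": 1.2}

# Softmax turns raw scores into a probability distribution over tokens.
total = sum(math.exp(v) for v in logits.values())
probs = {tok: round(math.exp(v) / total, 3) for tok, v in logits.items()}

# The model emits whatever is most probable: plausible-sounding, not verified.
print(max(probs, key=probs.get), probs)
```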

When the training signal is sparse or skewed, the model can produce fluent, plausible fiction. That includes fabricated studies, titles, references, and even real researcher names stitched into non-existent papers.

During training and evaluation, models are often rewarded for getting the "right" answer. They are rarely rewarded for acknowledging uncertainty.

The multiple-choice incentive problem

Benchmarks frequently look like multiple-choice tests. If you can't leave a question blank, guessing is rational.
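
A quick back-of-the-envelope check of why, under two hypothetical grading schemes (the numbers are illustrative, not drawn from any specific benchmark):

```python
# Expected score per question for a model that can guess right 25 percent of
# the time on a four-option item, under two hypothetical grading schemes.
p_correct = 0.25

def expected_guess(reward_right: float, penalty_wrong: float) -> float:
    return p_correct * reward_right + (1 - p_correct) * penalty_wrong

# Scheme A: typical benchmark scoring (right = 1, wrong = 0, abstain = 0).
print("guess:", expected_guess(1.0, 0.0), "abstain:", 0.0)   # 0.25 vs 0.0: guess

# Scheme B: wrong answers penalized (right = 1, wrong = -1, abstain = 0).
print("guess:", expected_guess(1.0, -1.0), "abstain:", 0.0)  # -0.5 vs 0.0: abstain
```

Only under the second scheme does abstaining ever pay; most benchmarks look like the first.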

That incentive carries over to deployment: the model learns to answer even when it shouldn't. One proposal is simple: introduce an explicit "I don't know" option during training and evaluation so the model can be rewarded for abstaining.

That could reduce hallucinations on short-form tasks. It's tougher for open-ended outputs like literature summaries or long-form analysis.

Should models say "I don't know" more often?

Some analyses suggest a calibrated model might start with "I don't know" roughly a third of the time. As commentary in The Conversation notes, many users could quickly lose patience with that behavior.

Others counter that clarity beats false certainty. The key is balance: avoid both confident nonsense and excessive refusal that makes the system unusable.

Are things getting better?

Grounding with external tools has helped. Retrieval-augmented generation, web search, and document-specific contexts reduce errors by tying outputs to verifiable sources.

Still, failure rates can be high when answers aren't grounded in retrieved sources. One recent test on news queries reported that 45 percent of answers included major errors and fabricated links.

Practical steps for scientists and research teams

  • Use retrieval-first workflows. Provide a curated corpus, require the model to cite URLs/DOIs, and limit claims to retrieved passages (see the retrieval sketch after this list).
  • Gate on confidence. Calibrate per-task thresholds, abstain or ask a follow-up question when uncertainty is high, and route low-confidence cases to human review (a routing sketch follows the list).
  • Prompt for verification. Instruct the model to separate claims from evidence and include a source list for each major claim.
  • Adopt structured outputs. Ask for JSON with fields like claim, evidence_url, evidence_quote, confidence, unresolved_questions; this makes auditing easier (a validator sketch follows the list).
  • Reward abstention. If you fine-tune, include an explicit "I don't know" token and align rewards to correct abstentions, not just correct answers.
  • Constrain open-ended tasks. For summaries, require paragraph-level citations and forbid statements not supported by the provided context.
  • Evaluate continuously. Track hallucination rates, citation validity, and failure modes by topic. Maintain a hard test set your team never uses for prompt tuning.
  • Design the interface for honesty. Show uncertainty indicators, number of sources, and last-updated times. Avoid overly confident phrasing when confidence is low.
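
A minimal sketch of the retrieval-first bullet above. The corpus, the keyword-overlap scorer, and the commented-out call_model hook are all placeholders; a production pipeline would use a real retriever (BM25 or embeddings) and your own model client.

```python
# Retrieval-first sketch: only retrieved, citable passages reach the model.
CORPUS = {  # placeholder curated corpus: id -> (source URL or DOI, passage text)
    "doc1": ("https://example.org/paper-1", "The CRISPR screen identified twelve candidate genes."),
    "doc2": ("https://example.org/paper-2", "The cohort study reported a 12 percent reduction in risk."),
}

def retrieve(question: str, k: int = 2):
    """Toy keyword-overlap retriever; swap in BM25 or embeddings in practice."""
    q_words = set(question.lower().split())
    scored = sorted(
        CORPUS.items(),
        key=lambda item: len(q_words & set(item[1][1].lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question: str) -> str:
    """Pack retrieved passages into a prompt that demands citations or abstention."""
    context = "\n".join(
        f"[{doc_id}] ({url}) {text}" for doc_id, (url, text) in retrieve(question)
    )
    return (
        "Answer using ONLY the passages below. Cite passage ids like [doc1] after "
        "each claim. If the passages do not support an answer, reply 'I don't know.'\n\n"
        f"Passages:\n{context}\n\nQuestion: {question}"
    )

# answer = call_model(build_prompt("What did the cohort study report?"))  # hypothetical model client
print(build_prompt("What did the cohort study report?"))
```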
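
The confidence gate can be as simple as a threshold over whatever uncertainty signal you trust (average token log-probability, ensemble agreement, a verifier score). The thresholds and field names below are illustrative assumptions, not recommendations.

```python
from dataclasses import dataclass

@dataclass
class ModelAnswer:
    text: str
    confidence: float  # however you estimate it: log-probs, ensembles, a verifier

def route(answer: ModelAnswer, answer_at: float = 0.7, review_at: float = 0.5):
    """Answer, send to human review, or abstain, based on illustrative thresholds."""
    if answer.confidence >= answer_at:
        return {"action": "answer", "text": answer.text}
    if answer.confidence >= review_at:
        return {"action": "human_review", "text": answer.text}
    return {"action": "abstain", "text": "I don't know; the evidence is too thin."}

print(route(ModelAnswer("The half-life is 5,730 years.", confidence=0.92)))
print(route(ModelAnswer("The paper appeared in 2014.", confidence=0.40)))
```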
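
And for the structured-output bullet, a small validator can reject any response that drops an audit field. The schema mirrors the fields named above and is only a sketch; real pipelines often use JSON Schema or Pydantic instead.

```python
import json

REQUIRED_FIELDS = {"claim", "evidence_url", "evidence_quote", "confidence", "unresolved_questions"}

def validate_claims(raw_json: str):
    """Parse model output and list every claim that is missing an audit field."""
    problems = []
    for i, claim in enumerate(json.loads(raw_json)):
        missing = REQUIRED_FIELDS - claim.keys()
        if missing:
            problems.append((i, sorted(missing)))
    return problems

example = json.dumps([{
    "claim": "Compound X inhibits enzyme Y",
    "evidence_url": "https://example.org/paper-1",
    "evidence_quote": "X reduced Y activity by 40 percent",
    "confidence": 0.8,
    "unresolved_questions": [],
}])
print(validate_claims(example))  # [] means every claim carries its audit fields
```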

What to do for open-ended outputs

Use a claim-evidence template: each paragraph ends with citations tied to exact quotes. If a point lacks support, the model must flag what data is missing or request more context.
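
One way to encode that template is as an instruction block prepended to the task. The wording below is only an illustrative starting point, not a tested prompt.

```python
CLAIM_EVIDENCE_TEMPLATE = """\
Write the summary as short paragraphs. After each paragraph add a line:
Sources: [passage id] "exact quote that supports the paragraph"
If a statement is not supported by the provided context, replace it with:
MISSING EVIDENCE: <what data or context would be needed>
Do not cite anything outside the provided context.
"""

# prompt = CLAIM_EVIDENCE_TEMPLATE + "\nContext:\n" + context + "\nTask: summarize the attached papers."
print(CLAIM_EVIDENCE_TEMPLATE)
```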

For literature reviews, bind the model to a fixed library (your lab's Zotero, PDFs, or a vetted corpus) and disallow references outside that set.

Research directions worth watching

  • Self-consistency and multi-pass reasoning with verifiers
  • Tool use by default: search, calculators, code, and document loaders
  • Calibrated uncertainty from logits and ensemble agreement
  • Retrieval-augmented critics that challenge unsupported claims
  • Coverage analysis to detect topic areas where the model is likely to guess

Further reading

Preprints and debates about uncertainty and abstention are active on arXiv. For a perspective on user tolerance for uncertainty statements, see analysis in The Conversation.

Build capability in your team

If you're implementing retrieval, evaluation pipelines, or abstention-aware prompts across your org, explore focused training on prompt patterns and RAG workflows: Prompt engineering guides and courses by job role.

