Study Finds AI Models Like ChatGPT Often Exaggerate Scientific Research, Human Summaries More Reliable

A study found that AI models like ChatGPT often exaggerate scientific findings, with up to 73% of summaries overgeneralising results. Human-written summaries remain more accurate, and the researchers urge caution when using AI to summarise research.

Published on: May 16, 2025

AI Models Like ChatGPT and DeepSeek Frequently Exaggerate Scientific Findings, Study Reveals

A recent study has found that AI language models such as ChatGPT often misrepresent scientific research by exaggerating its findings. Researchers analysed thousands of AI-generated summaries and found that up to 73% contained overgeneralisations, even when the models were explicitly prompted to prioritise accuracy. Surprisingly, newer AI versions performed worse than their predecessors. Human-written summaries remained significantly more precise, underscoring the need for caution when relying on AI to summarise scientific work.

Findings of the Study

Published in Royal Society Open Science, the study examined how Large Language Models (LLMs) like ChatGPT and DeepSeek summarise scientific papers. Researchers Uwe Peters and Benjamin Chin-Yee evaluated 4,900 AI-generated summaries from ten leading LLMs, including ChatGPT, Claude, DeepSeek, and LLaMA.

The analysis covered abstracts and full articles from prestigious journals such as Nature, Science, and The Lancet. Results showed that up to 73% of summaries contained overgeneralised or inaccurate conclusions. Notably, six out of ten models frequently transformed cautious, study-specific statements like “The treatment was effective in this study” into broad, definitive claims such as “The treatment is effective.” These subtle shifts in wording can mislead readers by overstating the applicability of research findings.

Why Are These Exaggerations Happening?

The tendency to exaggerate appears linked to both the training data and user interactions. Since overgeneralisations already exist in the scientific literature, models learn to replicate these patterns rather than correct them. Additionally, LLMs are optimised to produce responses that seem helpful and widely applicable.

As Benjamin Chin-Yee points out, models may favour generalisations because they tend to please users more, even if this distorts the original meaning. This behaviour leads to summaries that sound authoritative but fail to accurately reflect the nuances and limitations of the studies.

Accuracy Prompts Backfire

Contrary to expectations, instructing AI models to be more accurate actually increased the frequency of exaggerated conclusions. When prompted to avoid inaccuracies, the models were nearly twice as likely to produce overgeneralised summaries as when given neutral instructions.

As the study notes, this poses a risk for students, researchers, and policymakers who might assume that emphasising accuracy in prompts guarantees more reliable summaries. The findings suggest the opposite effect.
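
The effect is easy to probe informally. The sketch below, assuming the OpenAI Python SDK, summarises the same abstract under a neutral prompt and under an accuracy-emphasising prompt; the prompt wordings, model name, and abstract are illustrative placeholders, not the study's actual materials.

    # Hypothetical sketch of the comparison the study describes: one
    # abstract summarised under a neutral prompt and under an
    # accuracy-emphasising prompt. Wordings are illustrative only.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    abstract = (
        "In this cohort of 85 adults, the intervention was associated "
        "with improved sleep quality over six weeks."
    )

    prompts = {
        "neutral": "Summarise the following abstract.",
        "accuracy": (
            "Summarise the following abstract. Be as accurate as "
            "possible and avoid any inaccuracies."
        ),
    }

    for condition, instruction in prompts.items():
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[
                {"role": "system", "content": instruction},
                {"role": "user", "content": abstract},
            ],
        )
        # Compare the outputs by hand: present-tense, generic claims
        # ("the intervention improves sleep") signal overgeneralisation,
        # versus past-tense, study-specific wording.
        print(condition, "->", response.choices[0].message.content)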

Humans Still Do Better

When comparing AI-generated summaries with those written by humans, the study found that AI was almost five times more likely to overgeneralise. This significant gap underscores the necessity for human oversight when employing AI tools in scientific and academic contexts.

Recommendations for Safer Use

To reduce the risk of exaggeration, the researchers recommend using models such as Claude, which showed the highest accuracy in avoiding overgeneralisation. They also suggest lowering the model's "temperature" setting to minimise creative embellishment and using prompts that encourage past-tense, study-specific language (a minimal example follows the checklist below).

Ensuring AI contributes to clearer and more accurate science communication demands ongoing vigilance and rigorous testing of these systems within scientific workflows.

  • Use LLMs with proven accuracy records, such as Claude.
  • Lower the "temperature" setting to reduce speculative output.
  • Prompt AI to report findings in past tense and within the study’s scope.
  • Maintain human review of AI-generated scientific summaries.
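
These recommendations translate directly into API parameters. The sketch below assumes the OpenAI Python SDK; the model name, abstract text, and system prompt are illustrative placeholders rather than the study's materials.

    # Minimal sketch: a low-temperature, scope-constrained summary
    # request via the OpenAI Python SDK. Model, prompt, and abstract
    # are illustrative placeholders.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    abstract = (
        "In a randomised trial of 120 patients, the treatment reduced "
        "symptom scores relative to placebo over eight weeks."
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any chat model works here
        temperature=0.2,      # low temperature to curb embellishment
        messages=[
            {
                "role": "system",
                "content": (
                    "Summarise the abstract in the past tense. Report "
                    "only what this study found in its own sample and "
                    "setting; do not generalise beyond the reported "
                    "population or timeframe."
                ),
            },
            {"role": "user", "content": abstract},
        ],
    )

    print(response.choices[0].message.content)

The same two levers, a reduced temperature and a scope-constraining system prompt, carry over to other providers' APIs; human review of the output remains the final safeguard.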

For those interested in exploring AI tools and training to improve handling of scientific content, resources like Complete AI Training offer relevant courses and certifications.

