AAAS Study Finds ChatGPT Fails Science Writing, Human Journalists Prevail

AAAS found ChatGPT failed to meet professional science writing standards after a year-long test. Humans beat AI on accuracy, context, and restraint; editors saw more work, not time savings.

Published on: Sep 23, 2025

ChatGPT Fails Science Writing Test: Human Journalists Win

The American Association for the Advancement of Science spent a full year asking a simple question: can AI write like a professional science journalist? The answer was a clear no.

Over 12 months, ChatGPT tried to summarize 64 tough research papers. Human evaluators scored the outputs. The AI struggled with accuracy, context, and restraint: the basics of credible science communication.

The Experiment That Exposed AI's Limits

From December 2023 to December 2024, AAAS researchers fed ChatGPT studies loaded with jargon, controversy, big claims, and unusual formats. Each paper was run with three prompt styles, across GPT-4 and GPT-4o versions. Summaries were judged by the same experts who write SciPak briefs for Science and EurekAlert, the people other journalists rely on to get the science right.

In short: this was a fair test against real standards. And ChatGPT didn't clear the bar.

AI's Report Card: Failing Grades Across the Board

Could the summaries mix seamlessly with human-written briefs? Average score: 2.26 out of 5. Compelling writing? 2.14. Only one of 64 summaries earned a perfect mark from any evaluator. Thirty received the lowest possible rating.

It wasn't close.

Where ChatGPT Goes Wrong

  • Confuses correlation with causation, a core scientific error.
  • Omits critical context needed for accurate interpretation.
  • Overuses hype words like "groundbreaking" and "novel," inflating routine findings.

As AAAS writer Abigail Eisenstadt put it: "These technologies may have potential as helpful tools for science writers, but they are not ready for 'prime time,' at this point for the SciPak team."

Transcription Isn't Translation

ChatGPT can restate what a paper says. That's transcription. Science journalism requires translation: probing methods, exposing limits, balancing conflicting findings, and connecting results to prior literature and real-world stakes.

When studies had mixed results, or when asked to synthesize across papers, the model stumbled. Editors found that starting with AI increased workload: fact-checking and rewriting erased any time savings.

A Familiar Pattern

The results match wider AI reliability issues. Prior tests show AI search tools cite incorrect sources at high rates. In science communication, that failure rate is unacceptable.

Yes, evaluators could be biased. But the results were consistently poor across topics, prompts, and models. Bias doesn't explain that level of underperformance.

The Verdict: Structure Without Substance

ChatGPT can mimic tone and format. But structure isn't substance. The AAAS team concluded the model "does not meet the style and standards for briefs in the SciPak press package."

They left the door open for new trials if major updates change capabilities; note that GPT-5 became publicly available in August. Until then, the craft of translating research for public use stays with humans.

What Science Writers Can Do Right Now

  • Use AI as a research assistant, not a writer. Ask for definitions, paper outlines, and lists of variables-then verify everything.
  • Ban hype. Remove adjectives that overstate significance unless they're supported by consensus or external commentary (a simple automated pass is sketched after this list).
  • Force causal discipline. Add a line to every draft: "What else could explain this result?" If the model implies causation, rewrite to correlation unless the methods justify it.
  • Interrogate methods and limits. Require a paragraph on sample size, controls, confounders, statistical power, and study design.
  • Cross-check claims against the paper's figures and supplementary materials. If you can't trace a claim to a specific result, cut it.
  • When synthesizing multiple studies, write the comparison yourself. AI blends narratives; it doesn't adjudicate evidence.
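
The "ban hype" rule above lends itself to a quick automated first pass. Here is a minimal Python sketch, assuming a hand-maintained word list; the HYPE_WORDS set and the flag_hype helper are illustrative choices, not part of the AAAS study.

```python
import re

# Illustrative overstatement vocabulary; tune this to your house style.
HYPE_WORDS = {
    "groundbreaking", "novel", "revolutionary", "breakthrough",
    "unprecedented", "game-changing", "paradigm-shifting",
}

def flag_hype(draft: str) -> list[tuple[int, str]]:
    """Return (line_number, word) pairs for every hype word found in the draft."""
    hits = []
    for lineno, line in enumerate(draft.splitlines(), start=1):
        # Match hyphenated words too, e.g. "game-changing".
        for word in re.findall(r"[a-z]+(?:-[a-z]+)*", line.lower()):
            if word in HYPE_WORDS:
                hits.append((lineno, word))
    return hits

if __name__ == "__main__":
    sample = "This groundbreaking study reports a novel enzyme.\nResults were modest."
    for lineno, word in flag_hype(sample):
        print(f"line {lineno}: '{word}' may overstate significance")
```

A flagged word isn't automatically wrong; the script just forces a human decision about whether the evidence supports the adjective.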

Editor's Quick Checklist

  • Are causal statements warranted by methods?
  • Is the effect size meaningful, not just statistically significant?
  • Are limitations and uncertainties explicitly stated?
  • Is context provided: prior work, competing explanations, real-world implications?
  • Any hype words without external support? Remove them.

For Teams Using AI Carefully

  • Create a style rule: AI outputs are notes, not final copy.
  • Keep a living library of vetted prompts for definitions, variable extraction, citation formatting, and interview prep.
  • Log every AI-assisted piece with a fact-check trail (one possible record format is sketched after this list).
  • Train staff on prompt discipline and verification workflows.
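
One way to implement the fact-check trail above is a structured record per article. This is a minimal sketch; the FactCheckEntry and AIAssistLog schemas and their field names are assumptions for illustration, not an established standard.

```python
from dataclasses import dataclass, field, asdict
from datetime import date
import json

@dataclass
class FactCheckEntry:
    """One verified claim in an AI-assisted piece."""
    claim: str
    source: str          # where the claim was verified: paper section, figure, DOI
    verified_by: str
    verified_on: str

@dataclass
class AIAssistLog:
    """Fact-check trail for a single AI-assisted article."""
    slug: str
    model: str           # model and version used for drafting notes
    prompts_used: list[str] = field(default_factory=list)
    entries: list[FactCheckEntry] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

if __name__ == "__main__":
    log = AIAssistLog(slug="enzyme-study-brief", model="gpt-4o")
    log.prompts_used.append("Define 'catalytic promiscuity' in plain language.")
    log.entries.append(FactCheckEntry(
        claim="Reaction rate increased 12-fold",
        source="Fig. 2B of the paper",
        verified_by="A. Editor",
        verified_on=str(date.today()),
    ))
    print(log.to_json())
```

Storing the log as JSON alongside the article keeps the verification trail auditable after publication.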

AI will improve, and tests should continue. For now, reliable science writing still depends on judgment, context, and precision: the human parts of the job.

If you're building a safe, efficient AI workflow for reporting and editing, see practical prompt and verification resources here: Prompt Engineering Guides. For context on the organizations involved, visit AAAS.

