Real Citations, Mixed Signals: Stress-Testing Academic AIs

Research AIs curb fake citations but mislead with unweighted meters, weak plausibility checks, and shifting summaries. Use them to scout literature; verify and weigh evidence.

Categorized in: AI News, Science and Research
Published on: Sep 20, 2025

AI That "Won't Hallucinate"? A Field Test for Researchers

Hallucinations in general-purpose AI are well documented: fabricated citations, confident wrong answers, and invented details. A new class of tools built for research claims to avoid this by searching the literature first and only then applying AI to summarize. One of the most visible is Consensus, marketed as a quick, automated assistant for finding, analyzing, and synthesizing studies on a question.

The promise: no fake citations. The reality: major hallucinations are rarer, but smaller errors, misleading summaries, and inconsistency still matter. If you work in academia or healthcare, those details are where risk hides.

How Consensus Works

Consensus offers three modes. Quick mode summarizes up to 10 papers using abstracts. Pro reviews up to 20 papers and Deep up to 50, pulling full texts only when articles are open access or when connected to an institutional library; otherwise, abstracts are used. Quick is free; Pro and Deep require subscriptions after limited trials.

Strengths are real. It refused to summarize made-up papers, recognized well-known hoaxes, summarized known papers accurately, and answered mechanistic questions (like whether COVID-19 vaccines integrate into the human genome) correctly. Across many queries, citation hallucinations were not observed. That's progress. But the problems show up one layer down.

Where It Stumbles

The "Consensus Meter" often looks authoritative but can mislead. It tallies how many papers say "yes," "possibly," "mixed," or "no," without weighting study quality. Mouse studies and small, biased trials can outweigh better evidence in the graphic, even when the written summary is cautious. This happened on topics like ivermectin and cancer, dermal fentanyl exposure, ear seeds, craniosacral therapy, and essential oils for memory.

The tool also handles plausibility weakly. It frames functional medicine as "promising" and gives cupping and ear acupuncture more credit than warranted. Pushback against pseudoscience often lives outside the formal literature (blogs, podcasts, and magazines), so it does not get surfaced. The result: claims with low prior probability get undue lift from thin or low-quality studies.

Edge cases reveal tone problems. Ask about a fictional condition like "rectal meningioma," and you may get phrasing such as "extremely rare" rather than "does not exist," which can mislead non-experts.

The Slot-Machine Effect

Change the wording and you change the answer. Ask for "benefits of cupping" and you get an affirmative-leaning take; ask for the "scientific consensus" and you get a sober appraisal of insufficient high-quality evidence. Even with identical wording, back-to-back runs can shift emphasis. On acetaminophen use in pregnancy, for example, some outputs stress potential risk while others highlight confounding by indication (fever or illness causing the risk, not acetaminophen itself). Mode changes (Quick vs. Pro) can even flip conclusions on surgical questions.

Bottom line: retrieval, ranking, and summarization are not yet reproducible enough to trust a single run. If you would not publish off a single database query, do not rely on a single AI summary.
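
A low-tech way to check this on your own questions is to diff the citation lists from repeated runs of the same query. The sketch below assumes you have already exported the cited DOIs or PMIDs from each run; the `export_dois_for_run` helper and its identifiers are placeholders, not a real Consensus API.

```python
# Minimal sketch: quantify how much two runs of the same query actually agree.
# `export_dois_for_run` is a placeholder for however you export citations from
# the tool (CSV download, copy-paste, API); the DOIs below are made up.

def jaccard(a, b):
    """Overlap between two citation sets (1.0 = identical, 0.0 = disjoint)."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

def export_dois_for_run(run_label):
    saved_runs = {
        "run_1_quick": {"10.1000/aaa", "10.1000/bbb", "10.1000/ccc"},
        "run_2_quick": {"10.1000/aaa", "10.1000/ddd", "10.1000/eee"},
    }
    return saved_runs[run_label]

run1 = export_dois_for_run("run_1_quick")
run2 = export_dois_for_run("run_2_quick")
print(f"Citation overlap: {jaccard(run1, run2):.2f}")  # 0.20 here -> low agreement
print("Only in run 1:", sorted(run1 - run2))
print("Only in run 2:", sorted(run2 - run1))
```

Low overlap does not make either run wrong, but it tells you the retrieval step is unstable and that neither run alone is a reliable picture of the literature.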

How Eight AIs Handled Four Science Questions

Eight platforms were compared on four questions: Consensus (Quick and Pro), Elicit, SciSpace, OpenEvidence, ChatGPT, Gemini, Microsoft 365 Copilot, and Claude Sonnet 4.

  • Acetaminophen in pregnancy: The most accurate answers emphasized confounding by indication. Consensus (both modes), Elicit, Gemini, and ChatGPT included this; others missed it. Copilot skewed alarmist.
  • Homeopathy for upper respiratory infections: Correct answer: no benefit. Most tools were accurate. ChatGPT used false balance. Copilot listed "reported benefits," implying support that isn't there.
  • Seroquel for wet AMD (a trap question designed to invite hallucination): All tools avoided inventing evidence and named actual wet AMD treatments instead.
  • Fentanyl through skin: Correct conclusion: not a quick hazard on intact skin. All tools did fine; Claude's phrasing was cautious. Consensus' Meter graphic was again misleading relative to the text.

Speed varied widely. Consensus and OpenEvidence were fast. Elicit and SciSpace were slow (single answers could take ~10 minutes). Faster tools will win attention, but speed without consistent accuracy is a liability in research and care.

Practical Guidance for Labs, Clinics, and Libraries

  • Treat AI summaries as triage, not answers. Use them to retrieve papers and map the space. Click through and read the studies.
  • Weight evidence by design and quality. Systematic reviews and large RCTs carry more weight than small observational studies and animal work. Do not let a count-based meter sway you.
  • Check reproducibility. Rerun the same query. Vary phrasing. Compare modes. Save citation lists with timestamps.
  • Verify in primary sources. Cross-check claims on PubMed and consult the Cochrane Library for high-quality syntheses; a small verification sketch follows this list.
  • Watch for confounding and prior probability. Ask: is there a plausible mechanism? Could the exposure be a proxy for the underlying condition (confounding by indication)?
  • Be wary of scoreboards. "Consensus" graphics that ignore study quality can mislead. Read the methods and limitations sections.
  • Policy, ethics, and footprint. Check institutional rules on AI use, data handling, and authorship. Consider compute, water use, and cost. Keep protected health information (PHI) and confidential data out of general-purpose tools.
  • Systematic reviews with AI: Document your pipeline, keep human screening and extraction, and stick to established standards (e.g., PRISMA). Automation helps; it does not replace appraisal.
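
For the primary-source check, NCBI's public E-utilities API can confirm whether a cited article actually exists in PubMed. Here is a minimal sketch, assuming the standard `esearch` endpoint and the `requests` library; the example title is a placeholder for whatever citation you want to verify.

```python
# Minimal sketch: confirm that a citation surfaced by an AI tool actually exists
# in PubMed, using NCBI's public E-utilities esearch endpoint.
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_pmids_for_title(title):
    """Return PMIDs whose title matches the quoted phrase (empty list = no match)."""
    params = {
        "db": "pubmed",
        "term": f'"{title}"[Title]',
        "retmode": "json",
        "retmax": 5,
    }
    resp = requests.get(ESEARCH, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]

# Placeholder title: substitute the citation you want to verify.
pmids = pubmed_pmids_for_title("Cupping therapy for chronic low back pain")
print("PubMed matches:", pmids or "none found -- verify the citation by hand")
```

An empty result does not always mean a fabricated citation (titles are sometimes truncated or paraphrased in AI output), but it flags references worth checking by hand.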

What This Means Right Now

Hallucinated citations are becoming rare on research-focused tools. The larger risk is subtle: misweighted evidence, misleading visuals, and inconsistent summaries that change with wording or mode. In science and medicine, small errors compound into wasted time, wrong confidence, and potential harm.

Use these tools as fast literature scouts. Do not outsource judgment.

Quick Checklist

  • Run the query three ways; compare outputs.
  • Open every cited paper that matters; verify quotes and stats.
  • Prefer systematic reviews, meta-analyses, and large RCTs; downweight small, uncontrolled, or animal studies for clinical claims.
  • Look for confounding explanations before causal claims.
  • Record search terms, dates, modes, and inclusion criteria for reproducibility (a minimal logging sketch follows this checklist).
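
For that last item, even a few lines of scripting make AI-assisted searches auditable. Below is a minimal sketch that appends one JSON record per query to a local log file; the field names and example values are illustrative, not a required schema.

```python
# Minimal sketch: append one timestamped JSON record per AI-assisted search to a
# local log file so queries can be rerun and audited later. Field names and
# example values are illustrative only.
import datetime
import json
import pathlib

LOG = pathlib.Path("ai_search_log.jsonl")

def log_search(tool, mode, query, inclusion_criteria, cited_ids):
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "tool": tool,
        "mode": mode,
        "query": query,
        "inclusion_criteria": inclusion_criteria,
        "cited_ids": cited_ids,
    }
    with LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_search(
    tool="Consensus",
    mode="Pro",
    query="Does acetaminophen use in pregnancy increase neurodevelopmental risk?",
    inclusion_criteria="human studies; accounts for confounding by indication",
    cited_ids=["10.1000/example-doi"],  # placeholder identifier
)
```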

Take-Home Message

  • Research-focused AI tools reduce obvious hallucinations by grounding outputs in retrieved literature, but they can still misread studies or mislead with unweighted "consensus" graphics.
  • Outputs vary with question phrasing, mode, and even repeated runs. Treat summaries as retrieval aids; read the papers.
  • Across eight platforms, overall accuracy on targeted science questions was strong, with notable exceptions on tone, balance, and graphics. The assistant helps; the expert decides.
