Study finds AI chatbots give problematic health advice around half the time

Five leading AI chatbots gave problematic medical advice in roughly half their responses during testing, per a BMJ Open study. Fabricated references and open-ended questions were the biggest failure points.

Categorized in: AI News, Science and Research
Published on: Apr 21, 2026

Five of the world's most popular AI chatbots produced problematic medical advice in roughly half their responses when tested on common health questions, according to research published in BMJ Open. Researchers asked ChatGPT, Gemini, Grok, Meta AI, and DeepSeek each 50 questions spanning cancer, vaccines, stem cells, nutrition, and athletic performance.

Two independent experts rated the answers. Nearly 20% were rated highly problematic, 50% problematic, and 30% somewhat problematic. Only two of the 250 responses were outright refusals.

Grok performed worst, with 58% of responses flagged as problematic; ChatGPT followed at 52% and Meta AI at 50%. Overall, the five chatbots performed at roughly the same level.

Where Chatbots Struggled Most

Performance varied significantly by topic. Chatbots did best on vaccines and cancer, fields with large, well-structured bodies of research, yet they still produced problematic answers roughly a quarter of the time.

Nutrition and athletic performance were the weakest areas. Both domains are awash in conflicting online advice and rest on a thinner base of rigorous evidence.

Open-ended questions created the biggest problems. Researchers found 32% of open-ended answers were highly problematic, compared with just 7% for closed questions. Most real-world health queries are open-ended. People ask things like "Which supplements are best for overall health?" rather than true-or-false questions.

References Look Like Proof

When asked for ten scientific references, the median completeness score across all chatbots was just 40%. No chatbot produced a single fully accurate reference list across 25 attempts. Errors included wrong authors, broken links, and entirely fabricated papers.

This matters because citations create false authority. A reader seeing a neatly formatted reference list has little reason to doubt the content above it.

Why This Happens

Language models do not know things. They predict the most statistically likely next word based on training data. They do not weigh evidence or make value judgments.

Their training material includes peer-reviewed papers alongside Reddit threads, wellness blogs, and social media arguments. The model treats all sources equally when calculating probability.
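As a toy illustration of that point, the Python sketch below (a hand-rolled bigram counter, not anything from the study or a real model) computes next-word probabilities from a few labeled snippets. Note that the source label never enters the calculation:

```python
from collections import Counter

# Hypothetical mixed-quality "training data": one authoritative snippet
# and three low-quality ones making a stronger, shakier claim.
training_text = [
    ("peer-reviewed", "vitamin d supports bone health"),
    ("wellness blog", "vitamin d cures fatigue"),
    ("forum post",    "vitamin d cures everything"),
    ("forum post",    "vitamin d cures colds"),
]

# Count which word follows the two-word context "vitamin d".
context = ("vitamin", "d")
next_words = Counter()
for _source, text in training_text:  # the source label is ignored entirely
    tokens = text.split()
    for i in range(len(tokens) - 2):
        if (tokens[i], tokens[i + 1]) == context:
            next_words[tokens[i + 2]] += 1

total = sum(next_words.values())
for word, count in next_words.most_common():
    print(f"P({word!r} | 'vitamin d') = {count / total:.2f}")

# Prints P('cures') = 0.75 and P('supports') = 0.25: "cures" wins purely
# on frequency, even though it comes mostly from unreliable sources.
```

Real models are vastly more sophisticated, but the core objective is the same: predict what usually comes next, not what is true.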

Context Matters

Researchers deliberately crafted prompts to push the chatbots toward misleading answers, a stress-testing technique called "red teaming." That means the error rates probably overstate what happens with neutral phrasing. The study also tested the free versions available in February 2025; paid tiers may perform better.

Still, most people use free versions, and most health questions are not carefully worded. The study's conditions reflect how people actually use these tools.

A Broader Pattern

These findings align with other recent research. A February 2026 study in Nature Medicine found something striking: chatbots could get the right medical answer almost 95% of the time in controlled settings. When real people used those same chatbots, they got the right answer less than 35% of the time, no better than people who didn't use them at all.

A separate study in JAMA Network Open tested 21 leading AI models on medical diagnosis. With only basic patient details, the models failed to suggest the right set of possible conditions more than 80% of the time. Adding exam findings and lab results pushed accuracy above 90%.

Another study found chatbots readily repeated and elaborated on made-up medical terms slipped into prompts.

What This Means

These chatbots are not going away, nor should they. They can summarize complex topics, help prepare questions for doctors, and serve as starting points for research.

But they should not be treated as stand-alone medical authorities. If you use one for medical advice, verify any health claim it makes. Treat references as leads to check rather than as facts. And notice when a response sounds confident but offers no disclaimers.
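One practical way to check a reference is to look up its DOI against Crossref's public REST API and compare what actually resolves. The Python sketch below is a minimal illustration of that idea, not something from the study; the example DOIs are our own, and the second is deliberately made up.

```python
import requests

def check_doi(doi: str) -> None:
    """Fetch the Crossref record for a DOI and print what it resolves to,
    so a chatbot-supplied citation can be compared against the real entry."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if resp.status_code == 404:
        print(f"{doi}: no such DOI on record (possibly fabricated)")
        return
    resp.raise_for_status()
    work = resp.json()["message"]
    title = (work.get("title") or ["<no title>"])[0]
    authors = ", ".join(
        f"{a.get('given', '')} {a.get('family', '')}".strip()
        for a in work.get("author", [])
    )
    print(f"{doi}: {title} ({authors or 'no authors listed'})")

check_doi("10.1038/nature14539")    # real paper: LeCun et al., "Deep learning"
check_doi("10.9999/fake.2026.123")  # hypothetical DOI, should return 404
```

A matching title and author list does not make the cited claim true, but a 404 is a strong hint the reference was invented.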

For more on how AI for Healthcare is being deployed, see our coverage of applications in clinical settings.

