AI gives wrong medical answers more than 20% of the time when used by everyday consumers, study finds

Medical AI answers consumer health questions with just 76% accuracy, making errors in roughly 1 of every 5 responses, nearly double the mistake rate of human physicians. A Penn State study tested 212 real-world health prompts across four popular AI models.


Published on: Jun 01, 2026

Medical AI Stumbles on Real-World Health Questions, Penn State Study Shows

Large language models answer consumer health questions with just 76% accuracy and make errors in roughly one out of every five responses, according to a Penn State study that tested how people actually use AI for medical guidance.

The error rate tops 20%, nearly double the mistake rate of human physicians. Researchers evaluated 212 health-related prompts submitted by 34 participants who queried four popular LLMs during a weeklong competition. Nine board-certified physicians then assessed the AI responses for accuracy.

The findings raise serious concerns about integrating these systems into patient-facing healthcare applications. A 20% error rate "far exceeds the acceptable margin in most healthcare settings," the researchers wrote in their report, posted online ahead of peer-review publication.

Why Real-World Testing Matters

The study filled a gap in the literature. As the general public increasingly turns to LLMs for informal medical guidance, few rigorous assessments have tested how these systems perform with everyday health queries from actual consumers rather than carefully crafted prompts from researchers.

Bonam Mingole, an informatics PhD candidate who led the work, said the study's strength lies in replicating real usage patterns. Participants chose their preferred LLM and used it as they would on a normal day, rather than following scripted instructions.

"This type of participatory research is important for understanding how the public uses AI in their daily life," Mingole said.

The Accuracy Problem Isn't Just Technical

The researchers acknowledge that LLM responses often impress physicians and users alike. Yet GPT-4o, the best-performing model tested, still generated invalid responses in roughly one out of every five cases. If acted upon, such errors could lead to harmful clinical outcomes.

Beyond raw accuracy, the study identified three systemic concerns:

Healthcare disparities: Lower-quality responses for underrepresented patient populations and rare conditions risk exacerbating existing inequities. Addressing this requires commitment to equity in data collection, model development, and evaluation, not just technical fixes.
Psychological harm: False positives can increase health-related preoccupation, prompt unnecessary medical consultations, or cause patients to avoid professional care due to fear or mistrust. These psychological costs are often overlooked in discussions about LLM integration.
Structural deflection: In regions already facing physician shortages, relying on LLMs may create a false sense of sufficiency and distract from the need to increase the supply of actual health professionals.

What Comes Next

The researchers recommend that healthcare organizations "approach with great caution" any integration of LLMs into clinical applications. Users should exercise careful judgment when employing these tools for self-diagnosis or health decisions.

LLMs may offer temporary support, the authors emphasize, but should never be viewed as replacements for clinical expertise.
