Medical AI Stumbles on Real-World Health Questions, Penn State Study Shows
Large language models answer consumer health questions with just 76% accuracy and make errors in roughly one out of every five responses, according to a Penn State study that tested how people actually use AI for medical guidance.
The error rate tops 20%, nearly double the mistake rate of human physicians. Researchers evaluated 212 health-related prompts submitted by 34 participants who queried four popular LLMs during a weeklong competition. Nine board-certified physicians then assessed the AI responses for accuracy.
The findings raise serious concerns about integrating these systems into patient-facing healthcare applications. A 20% error rate "far exceeds the acceptable margin in most healthcare settings," the researchers wrote in their report, posted online ahead of peer-reviewed publication.
Why Real-World Testing Matters
The study filled a gap in the literature. As the general public increasingly turns to LLMs for informal medical guidance, few rigorous assessments have tested how these systems perform with everyday health queries from actual consumers, as opposed to carefully crafted prompts from researchers.
Bonam Mingole, an informatics PhD candidate who led the work, said the study's strength lies in replicating real usage patterns. Participants picked their preferred LLM and used it as they would on a normal day, rather than following scripted instructions.
"This type of participatory research is important for understanding how the public uses AI in their daily life," Mingole said.
The Accuracy Problem Isn't Just Technical
The researchers acknowledge that LLM responses often impress physicians and users alike. Even so, GPT-4o, the best-performing model tested, still generated invalid responses in roughly one out of every five cases. If acted upon, such errors could lead to harmful clinical outcomes.
Beyond raw accuracy, the study identified three systemic concerns:
- Healthcare disparities: Lower-quality responses for underrepresented patient populations and rare conditions risk exacerbating existing inequities. Addressing this requires commitment to equity in data collection, model development, and evaluation, not just technical fixes.
- Psychological harm: False positives can increase health-related preoccupation, prompt unnecessary medical consultations, or cause patients to avoid professional care due to fear or mistrust. These psychological costs are often overlooked in discussions about LLM integration.
- Structural deflection: In regions already facing physician shortages, relying on LLMs may create a false sense of sufficiency and distract from the need to increase the supply of actual health professionals.
What Comes Next
The researchers recommend that healthcare organizations "approach with great caution" any integration of LLMs into clinical applications. Users should exercise careful judgment when employing these tools for self-diagnosis or health decisions.
LLMs may offer temporary support, the authors emphasize, but should never be viewed as replacements for clinical expertise.