AI Chatbots Fail Accuracy Test for Medical Advice
Nearly one-quarter of Americans under 30 now turn to AI chatbots for health guidance, yet a Penn State University study found that 24% of responses to everyday medical questions contain errors serious enough to warrant physician review.
Researchers evaluated 212 health-related prompts submitted by 34 participants during a weeklong competition. Nine board-certified physicians assessed responses from ChatGPT-4o, ChatGPT-3.5, Gemini-1.5 Pro, and Llama3-8b using a six-point scale measuring accuracy, information quality, reasoning, and potential harm.
The study differs from prior research by testing real-world usage patterns rather than performance on medical licensing exams. Participants chose their preferred model and used it as they normally would, mimicking how patients actually seek health information online.
Performance Gaps Across Specialties
ChatGPT-4o delivered the strongest results at 84.62% accuracy. Llama3-8b performed worst at 50%.
Performance varied significantly by medical specialty. Obstetrics and gynecology, along with otolaryngology, generated the highest validity scores with minimal harm risk. Internal medicine, neurology, and dermatology produced the lowest validity scores and the highest risk of harmful responses.
The findings exposed another problem: AI systems performed worse for underrepresented patient populations and rare conditions. Researchers warned that without attention to equity in data collection and model development, AI tools risk widening existing healthcare disparities.
What Affects Response Quality
Prompt length mattered. Queries between 60 and 250 characters produced the most accurate responses. Highly specific prompts also improved output quality.
An unexpected finding: physician reviewers perceived greater harm risk when prompts were written from a medical professional's perspective rather than a patient's perspective.
Researchers also tested whether supplementing the models with medical textbooks, clinical guidelines, and peer-reviewed research improved performance. The results were mixed. Physicians preferred the baseline versions of Gemini and Llama over their medically enhanced versions, while the ChatGPT models showed no meaningful difference.
What This Means for Patient Care
The error rate remains too high for AI to replace physician judgment on diagnosis or treatment. The stakes are rising: over half of U.S. adults already consult online resources for medical advice, and that share continues to grow.
Researchers plan to expand the study with larger, more balanced datasets and investigate ways to discourage overreliance on AI-generated medical advice. The work underscores the need for clear communication about AI's current limitations in healthcare settings.