AI chatbots answer health questions accurately 76% of the time, Penn State study finds

AI health chatbots answered medical questions with 76% accuracy in a Penn State study, leaving an error rate roughly double that of human doctors. Researchers found the worst results in neurology, internal medicine, and dermatology.

Published on: May 29, 2026
AI Health Chatbots Hit 76% Accuracy, But Doctors Remain Superior

Artificial intelligence chatbots answer everyday health questions with nearly 76% accuracy - a result that raises serious concerns about their use by patients seeking medical advice, according to a Penn State study presented at the 2026 Association for Computing Machinery Fairness, Accountability and Transparency conference.

Researchers examined how average internet users rely on generative AI and large language models (LLMs) for symptom checking and health concerns. They recruited 34 participants, who submitted 212 prompts to ChatGPT-4o, ChatGPT-3.5, Gemini-1.5 Pro, and Llama3-8b. Nine board-certified physicians then evaluated the AI responses for accuracy and potential harm.

Where AI Performs Well - and Where It Fails

Performance varied significantly by medical specialty. Obstetrics and gynecology, along with ear, nose and throat treatment, showed the highest accuracy. Internal medicine, neurology, and dermatology saw the worst results, with lower accuracy scores and higher potential for harm.

The error rate exceeded 20% - roughly double the error rate of human physicians. That gap matters. A single wrong answer about neurological symptoms or skin conditions could lead a patient down the wrong diagnostic path.

More specific prompts produced better responses. Queries between 60 and 250 characters generated the most accurate outputs.
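For readers who want to see what that range looks like in practice, here is a minimal sketch in Python that checks whether a question falls inside it. The function name `prompt_length_feedback` and the constants are hypothetical, and treating both ends of the 60 to 250 character range as inclusive is an assumption; the study as reported here does not say how the boundaries were handled.

```python
# Hypothetical helper, not part of the Penn State study.
# The 60-250 character range comes from the article; treating both
# bounds as inclusive is an assumption made for illustration.
MIN_CHARS = 60
MAX_CHARS = 250


def prompt_length_feedback(prompt: str) -> str:
    """Classify a prompt as 'too short', 'in range', or 'too long'."""
    # Ignore leading and trailing whitespace so padding does not count.
    length = len(prompt.strip())
    if length < MIN_CHARS:
        return "too short"
    if length > MAX_CHARS:
        return "too long"
    return "in range"


if __name__ == "__main__":
    vague = "I have a headache"
    specific = (
        "I have had a throbbing headache behind my left eye for three days, "
        "worse in the morning, with mild nausea."
    )
    print(prompt_length_feedback(vague))
    print(prompt_length_feedback(specific))
```

A length check like this only flags whether a question sits in the range the researchers associated with better answers. It says nothing about whether the response that comes back is correct.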

Training on Medical Data Didn't Help

Researchers trained the base models on medical textbooks, clinical guidelines, and peer-reviewed research to see if specialized training would improve accuracy. Seven medical professionals evaluated both the original and augmented versions.

The results surprised the researchers. The evaluators preferred the base versions of Gemini and Llama over the augmented models. ChatGPT showed no meaningful difference either way.

"I don't think AI will replace human physicians," said Jennifer Kraschnewski, director of the Penn State Clinical and Translational Science Institute. "But there's a huge opportunity for us to help upskill today's physician in a way that's never been done before."

The Right Use Case: Physician Tools, Not Patient Tools

AI for healthcare works best when physicians use it, not patients. Doctors have the training to catch errors and contextualize recommendations. Patients using AI as a first-line diagnostic tool face real risks.

"People will continue to use AI for diagnosing their health problems," said S. Shyam Sundar, Evan Pugh University Professor at Penn State. "By understanding their use patterns and testing the validity of AI performance, our project helps advance literacy on the best and worst uses of AI for medical advice."

The takeaway: AI chatbots can support clinical decision-making in trained hands. For patients searching symptoms online, the accuracy gap remains too wide to trust without physician review.

