AI Chatbots Answer Health Questions With 76% Accuracy, Study Finds
Large language models like ChatGPT answer everyday health questions with nearly 76% accuracy, but their error rate is still roughly double that of human physicians, according to Penn State researchers. The finding raises questions about whether patients should rely on AI for medical advice.
The researchers tested how the average internet user might deploy AI for health concerns - a scenario largely unexplored in previous studies. They ran a competition at Penn State where 34 participants submitted 212 prompts and AI-generated responses covering real and imaginary health problems, written from both patient and doctor perspectives.
Participants chose from four large language models: ChatGPT-4o, ChatGPT-3.5, Gemini-1.5 Pro, and Llama3-8b. Nine board-certified physicians then evaluated the AI responses for accuracy and potential harm using a six-point scale.
Performance Varied Sharply by Medical Specialty
Responses to questions in obstetrics and gynecology, along with ear, nose and throat medicine, scored best. Responses in internal medicine, neurology, and dermatology fared worst, with lower accuracy scores and higher potential for harm.
The researchers also found that more specific prompts - particularly those between 60 and 250 characters - produced more accurate responses. Vague or overly broad questions generated less reliable answers.
Training on Medical Data Didn't Help Much
The team trained base models on medical textbooks, clinical guidelines, and peer-reviewed research to see if additional training would improve performance. A panel of seven medical professionals evaluated both base and augmented versions.
The results were unexpected. Physicians preferred the base versions of Gemini and Llama over the augmented models. For ChatGPT, they showed no significant preference either way.
Error Rates Remain a Concern
Despite the 76% accuracy rate, the AI error rate exceeded 20% - roughly double the error rate of human physicians. Researchers said those mistakes could harm patients.
The study suggests AI for healthcare may work better as a tool for physicians than for patients. Doctors could use generative AI and LLM systems to improve their own efficiency and decision-making, rather than patients relying on those systems to self-diagnose.
"People will continue to use AI for diagnosing their health problems," said study co-author S. Shyam Sundar. "By understanding their use patterns and testing the validity of AI performance, our project helps advance literacy on the best and worst uses of AI for medical advice."
The researchers presented their findings at the 2026 Association for Computing Machinery Fairness, Accountability and Transparency conference in Montreal.