AI Models Excel at Final Diagnosis but Struggle With Early Clinical Reasoning
Large language models can identify the correct diagnosis in more than 90% of cases when given complete patient information, but they fail at the foundational reasoning that doctors use to navigate uncertainty, according to research published in JAMA Network Open.
A study from Mass General Brigham evaluated 21 AI models across structured patient scenarios and found a critical gap: models perform well at confirming diagnoses but struggle to generate them.
Where Models Fall Short
The weakness emerges early in the diagnostic process. When physicians have incomplete information, they generate a differential diagnosis, a list of possible conditions that guides further testing and decision-making. In most cases, the models failed to produce a reliable differential.
"These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn't much information," said Arya Rao, lead author and M.D.-Ph.D. student at Harvard Medical School.
The problem runs deeper than accuracy metrics suggest. AI systems tend to converge too quickly on a single answer rather than maintaining uncertainty and exploring alternatives, the opposite of how physicians approach ambiguous cases.
A New Way to Measure Performance
Researchers developed an evaluation framework that assesses performance across multiple stages of care: initial hypothesis generation, test selection, final diagnosis, and treatment planning. Traditional accuracy metrics, which score only the final answer, mask weaknesses in these intermediate reasoning steps.
Performance improved notably as additional structured data (lab results, imaging) was introduced, suggesting the models rely heavily on complete inputs to reach accurate conclusions.
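To make the idea concrete, here is a minimal Python sketch of this kind of staged evaluation, in which case information is revealed one stage at a time and the model is scored at each step. The stage names, the sample case, and the query_model function are illustrative assumptions, not the study's actual protocol, data, or code.

def query_model(prompt: str) -> str:
    """Stand-in for a real LLM call (e.g., a chat-completion API)."""
    return "pulmonary embolism"  # placeholder response for the sketch

# Hypothetical case, revealed stage by stage as in a real workup.
CASE = {
    "triage": "54-year-old with chest pain and shortness of breath.",
    "history_exam": "Pain worsens on inspiration; recent long-haul flight.",
    "labs_imaging": "Elevated D-dimer; CT angiogram shows filling defect.",
}
TRUE_DIAGNOSIS = "pulmonary embolism"

def evaluate_staged(case: dict, truth: str) -> dict:
    """Score the model after each stage of information is revealed."""
    revealed, scores = [], {}
    for stage, info in case.items():
        revealed.append(info)
        prompt = (
            "Given the information so far, list a differential diagnosis:\n"
            + "\n".join(revealed)
        )
        answer = query_model(prompt).lower()
        # Crude stage-level metric: does the differential contain the truth?
        scores[stage] = truth in answer
    return scores

print(evaluate_staged(CASE, TRUE_DIAGNOSIS))

Comparing the per-stage scores is what exposes the gap the study describes: a model can score well at the final labs_imaging stage while failing at the open-ended triage stage.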
Newer model versions generally outperformed earlier ones, but the underlying limitations in clinical reasoning remained consistent.
What This Means for Healthcare Organizations
The findings cut both ways. High rates of correct final diagnoses reinforce AI's potential as a clinical support tool, particularly in data-rich environments where comprehensive patient information is available.
But the inability to reliably navigate early-stage diagnostic reasoning raises concerns about overreliance. Real-world medicine often involves incomplete information and ambiguity, the exact conditions where these models struggle most.
Current AI systems are not ready to operate independently in clinical environments. They should augment human judgment, not replace it.
"We want to help separate the hype from the reality of these tools as they apply to health care," Rao said. "Our results reinforce that large language models in healthcare continue to require a human-in-the-loop and very close oversight."