AI models reach accurate final diagnoses but struggle with early clinical reasoning, study finds

AI models correctly identify diagnoses in over 90% of cases with complete data, but fail at early clinical reasoning, per a Mass General Brigham study. Researchers say the tools need close human oversight.

Published on: Apr 30, 2026

AI Models Excel at Final Diagnosis but Struggle With Early Clinical Reasoning

Large language models can identify the correct diagnosis in more than 90% of cases when given complete patient information, but they fail at the foundational reasoning that doctors use to navigate uncertainty, according to research published in JAMA Network Open.

A study from Mass General Brigham evaluated 21 AI models across structured patient scenarios and found a critical gap: models perform well at confirming diagnoses but struggle to generate them.

Where Models Fall Short

The weakness emerges early in the diagnostic process. When physicians have incomplete information, they generate a differential diagnosis, a list of possible conditions that guides further testing and decision-making. In most cases, the models failed to generate this list reliably.

"These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn't much information," said Arya Rao, lead author and M.D.-Ph.D. student at Harvard Medical School.

The problem runs deeper than accuracy metrics suggest. AI systems tend to converge too quickly on a single answer rather than maintaining uncertainty and exploring alternatives, the opposite of how physicians approach ambiguous cases.

A New Way to Measure Performance

Researchers developed an evaluation framework that assesses performance across multiple stages of care: initial hypothesis generation, test selection, final diagnosis, and treatment planning. Traditional accuracy metrics mask weaknesses in intermediate reasoning steps.
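The idea of scoring each stage separately, rather than only the final answer, can be sketched in a few lines. This is a hypothetical illustration of stage-wise evaluation under assumed stage names and made-up scores, not the study's actual rubric or data:

```python
# Illustrative stage names mirroring the stages described above.
STAGES = ["differential", "test_selection", "final_diagnosis", "treatment"]

def stage_averages(case_scores):
    """Average each stage's 0-1 score across cases, so a strong
    final-diagnosis score cannot mask weak early-stage reasoning."""
    n = len(case_scores)
    return {s: sum(c.get(s, 0.0) for c in case_scores) / n for s in STAGES}

# Two invented cases: final diagnosis scores high, early reasoning low.
results = [
    {"differential": 0.4, "test_selection": 0.6,
     "final_diagnosis": 0.95, "treatment": 0.8},
    {"differential": 0.3, "test_selection": 0.5,
     "final_diagnosis": 0.9, "treatment": 0.7},
]
print(stage_averages(results))
```

Reporting the per-stage averages rather than a single accuracy number is what exposes the gap the study describes: a model can look excellent on `final_diagnosis` while scoring poorly on `differential`.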

Performance improved notably as additional structured data, such as lab results and imaging, was introduced. This suggests the models rely heavily on complete inputs to reach accurate conclusions.

Newer model versions generally outperformed earlier ones, but the underlying limitations in clinical reasoning remained consistent.

What This Means for Healthcare Organizations

The findings cut both ways. High rates of correct final diagnoses reinforce AI's potential as a clinical support tool, particularly in data-rich environments where comprehensive patient information is available.

But the inability to reliably navigate early-stage diagnostic reasoning raises concerns about overreliance. Real-world medicine often involves incomplete information and ambiguity, the exact conditions where these models struggle most.

Current AI systems are not ready to operate independently in clinical environments. They should augment human judgment, not replace it.

"We want to help separate the hype from the reality of these tools as they apply to health care," Rao said. "Our results reinforce that large language models in healthcare continue to require a human-in-the-loop and very close oversight."

Learn more about AI for Healthcare and Generative AI and LLM applications in clinical settings.

