LLM outperforms physicians on clinical reasoning tasks but is not ready for autonomous practice, researchers say

OpenAI's o1 model outscored physicians on clinical reasoning tasks by up to 48 percentage points, per Harvard and Beth Israel research in Science. Doctors remain essential: the study tested only text-based tasks, leaving much of real clinical work unexamined.

Published on: May 06, 2026

AI Model Outperforms Physicians on Clinical Reasoning Tasks, but Doctors Still Essential

An advanced language model from OpenAI outperformed hundreds of physicians across multiple clinical reasoning tasks, according to research published in Science by teams at Harvard Medical School and Beth Israel Deaconess Medical Center. The findings show generative AI and LLMs have reached a new capability threshold, but the researchers warn the technology is not ready to work without human oversight.

The researchers tested the o1 series model against physicians using published patient cases, real emergency room data, and clinical management scenarios. In one experiment involving ER patients at different diagnostic stages, the model outperformed both GPT-4o and two expert attending physicians when evaluated by independent physicians.

In a separate test using five clinical vignettes, the o1-preview model scored 41 percentage points higher than GPT-4 alone, 41.9 percentage points higher than physicians using GPT-4, and 48.4 percentage points higher than physicians with standard resources.
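For readers curious about the mechanics, the sketch below shows how this kind of text-based evaluation could be wired up with the OpenAI Python client. It is a minimal illustration, assuming a hypothetical vignette and prompt wording; only the model name, o1-preview, comes from the study.

```python
# Minimal illustrative harness for a text-based clinical reasoning test.
# The vignette and prompt are hypothetical; this is not the study's code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

VIGNETTE = (
    "A 54-year-old man presents to the emergency department with two hours "
    "of substernal chest pain radiating to the left arm..."  # hypothetical case
)

def differential_diagnosis(model: str, vignette: str) -> str:
    """Ask a model for a ranked differential diagnosis for one vignette."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": (
                "Clinical reasoning exercise.\n"
                f"Case: {vignette}\n"
                "Provide a ranked differential diagnosis with brief reasoning."
            ),
        }],
    )
    return response.choices[0].message.content

print(differential_diagnosis("o1-preview", VIGNETTE))
```

In the study, responses from models and physicians were then scored by independent physicians; a snippet like the one above would cover only the generation step, not the grading.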

"Our findings suggest that LLMs have now eclipsed most benchmarks of clinical reasoning," the researchers concluded.

The Case for Human Supervision Remains Strong

Despite the performance gap, the study's authors emphasized that AI cannot yet operate independently in clinical settings. The research examined only six aspects of clinical reasoning, while dozens of other tasks may have greater impact on actual patient care.

Clinical medicine involves far more than text. Physicians interpret visual data, listen to patient histories, and integrate information from multiple sources. The study assessed none of these capabilities.

Peter Brodeur, a clinical fellow at Beth Israel Deaconess and co-first author, said the concern goes beyond diagnosis accuracy. "A model might get the top diagnosis right but also suggest unnecessary testing that could expose a patient to harm," Brodeur said. "Humans should be the ultimate baseline when it comes to evaluating performance and safety."

New Testing Methods Required

The researchers called for fundamentally different evaluation approaches as healthcare AI advances. Traditional multiple-choice benchmarks no longer work: models now score near 100% on those tests, making further progress invisible.
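To see why near-perfect scores stop being informative, consider a small back-of-the-envelope calculation (ours, not the paper's): two hypothetical models answering 990 and 994 of 1,000 questions correctly have overlapping confidence intervals, so the benchmark cannot separate them.

```python
# Illustrative only: the ceiling effect on a saturated multiple-choice
# benchmark. Both hypothetical models sit near 100% and are statistically
# indistinguishable at this sample size.
import math

def accuracy_and_se(correct: int, total: int) -> tuple[float, float]:
    """Accuracy and its binomial standard error."""
    p = correct / total
    return p, math.sqrt(p * (1 - p) / total)

for name, correct in [("model_a", 990), ("model_b", 994)]:
    p, se = accuracy_and_se(correct, 1000)
    # Approximate 95% interval: p +/- 1.96 * se
    print(f"{name}: {p:.1%} +/- {1.96 * se:.1%}")

# Output: ~99.0% +/- 0.6% vs ~99.4% +/- 0.5%. The intervals overlap,
# which is why the field needs harder, open-ended evaluations.
```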

The field needs new benchmarks, studies on how humans and AI work together, and prospective clinical trials that test real-world outcomes. Brodeur noted that "models are increasingly capable" and the testing infrastructure has not kept pace.

The study's limitations underscore what remains unknown about AI's role in medicine. Text-based performance, while impressive, does not predict how a model would function in a hospital where clinicians juggle incomplete information, time pressure, and the weight of actual patient outcomes.

