AI Model Outperforms Doctors on Clinical Reasoning, But Researchers Warn Against Solo Practice
An advanced language model outperformed physicians across multiple clinical reasoning tasks, according to research published in Science. But the study's authors made clear: AI is not ready to practice medicine independently.
Researchers from Harvard Medical School and Beth Israel Deaconess Medical Center tested OpenAI's o1 model against hundreds of physicians on various clinical tasks. The experiments included published patient cases, evaluations of real emergency room patients, and diagnostic and treatment planning scenarios.
The o1 model scored higher than physicians in every test. In one experiment using clinical vignettes, o1-preview scored 41 percentage points higher than GPT-4 alone, 41.9 points higher than doctors using GPT-4, and 48.4 points higher than physicians relying on conventional resources. In an emergency department test, the model outperformed both ChatGPT-4o and two expert attending physicians when assessed by independent physician reviewers.
The Limitations Are Significant
The researchers identified substantial gaps between lab performance and clinical readiness. The study examined only six aspects of clinical reasoning, while dozens of other tasks may matter more in actual patient care.
The experiments tested only text-based inputs for both humans and the AI. Real clinical medicine involves auditory and visual information, such as imaging, vital sign trends, and patient presentation, that the study did not assess.
Peter Brodeur, a Harvard Medical School clinical fellow and study co-author, highlighted a concrete risk: "A model might get the top diagnosis right but also suggest unnecessary testing that could expose a patient to harm." He said humans must remain the baseline for evaluating AI performance and safety.
New Testing Methods Needed
As large language models advance, current evaluation methods are becoming obsolete. The researchers said multiple-choice tests no longer work: models now consistently score near 100%, making further progress impossible to measure.
The researchers called for new approaches: updated benchmarks that test clinically relevant tasks, studies on how humans and AI interact together, and prospective clinical trials before deployment in real settings.
The broader question remains unsettled. Research consistently shows that AI in healthcare performs better under physician supervision, yet it still struggles with certain tasks and can generate plausible-sounding errors. Until testing catches up with model capability, clinical oversight is not optional.