AI Model Outperforms Doctors on Clinical Reasoning, But Researchers Warn Against Solo Practice
An advanced language model outperformed physicians across multiple clinical reasoning tasks, according to research published in Science. But the study's authors made clear: AI is not ready to practice medicine independently.
Researchers from Harvard Medical School and Beth Israel Deaconess Medical Center tested OpenAI's o1 model against hundreds of physicians on various clinical tasks. The experiments included published patient cases, evaluations of real emergency room patients, and diagnostic and treatment planning scenarios.
The o1 model scored higher than physicians in every test. In one experiment using clinical vignettes, o1-preview scored 41 percentage points higher than GPT-4 alone, 41.9 points higher than doctors using GPT-4, and 48.4 points higher than physicians relying on conventional resources. In an emergency department test, the model outperformed both ChatGPT-4o and two expert attending physicians when assessed by independent physician reviewers.
The Limitations Are Significant
The researchers identified substantial gaps between lab performance and clinical readiness. The study examined only six aspects of clinical reasoning, while dozens of other tasks may matter more in actual patient care.
The experiments tested only text-based inputs for both humans and the AI. Real clinical medicine involves auditory and visual information, such as imaging, vital sign trends, and patient presentation, that the study did not assess.
Peter Brodeur, a Harvard Medical School clinical fellow and study co-author, highlighted a concrete risk: "A model might get the top diagnosis right but also suggest unnecessary testing that could expose a patient to harm." He said humans must remain the baseline for evaluating AI performance and safety.
New Testing Methods Needed
As large language models advance, current evaluation methods are becoming obsolete. The researchers said multiple-choice tests no longer work: models now consistently score near 100%, making further progress impossible to measure.
The researchers called for new approaches: updated benchmarks that test clinically relevant tasks, studies on how humans and AI interact together, and prospective clinical trials before deployment in real settings.
The broader question remains unsettled. Research consistently shows that AI in healthcare performs better under physician supervision, yet it still struggles with certain tasks and can generate plausible-sounding errors. Until testing catches up with model capability, clinical oversight is not optional.