AI model outperforms ER doctors at diagnosis in real-world hospital test

An OpenAI reasoning model outdiagnosed experienced ER physicians in a Harvard study using real patient cases from a Boston emergency department. Researchers say the findings call for rigorous clinical trials before the technology enters hospitals.

Published on: May 02, 2026

An AI reasoning model developed by OpenAI diagnosed patients more accurately than experienced emergency room physicians in a study published Thursday in Science. Researchers at Harvard Medical School and Beth Israel Deaconess Medical Center tested the model on actual cases from their Boston emergency department, measuring diagnostic accuracy at multiple points during patient care.

The AI outperformed two experienced physicians while using only electronic health records - the same limited information available to the doctors at the time.

How the Test Worked

Researchers graded the AI model's performance at three stages: triage, admission, and ongoing care. One case involved a patient initially treated for a pulmonary embolism - a blood clot in the lungs - whose symptoms worsened despite medication. The AI identified a history of lupus, an autoimmune condition that can cause heart inflammation, as the underlying problem. The diagnosis was correct.

The team also tested the model against case reports from the New England Journal of Medicine and clinical vignettes designed to probe difficult diagnostic questions. The model consistently outperformed physician baselines across these scenarios.

What Sets This Apart

Earlier versions of large language models struggled with medical uncertainty and generating differential diagnoses - the list of possible conditions explaining a patient's symptoms. This model handled both tasks effectively.

"It works with the messy real-world data of the emergency department," said Dr. Adam Rodman, a clinical researcher at Beth Israel and study author. "It works for making diagnoses in the real world."

Significant Limitations Remain

The AI relied on text alone. In actual clinical practice, physicians integrate images, sounds, and nonverbal cues. The model also performed well partly because it reviewed only emergency department records - a narrow slice of a patient's total medical history. Researchers acknowledge performance would likely drop substantially with more complex patient histories.

Dr. David Reich, chief clinical officer for Mount Sinai Health System, who was not involved in the study, said the results show promise but raise a separate challenge: "Now the open question is how the heck do you introduce it into clinical workflows in ways that actually improve care?"

What Comes Next

Study authors rejected the notion that these results support replacing physicians with AI. Instead, they argue the findings demonstrate the need for rigorous, forward-looking trials to measure whether the technology actually improves patient outcomes in practice.

"This paper is a beautiful summary of just how much things have improved," Reich said. "You have something which is quite accurate, possibly ready for prime time."

Raj Manrai, assistant professor of Biomedical Informatics at Harvard Medical School, added: "I think it does mean that we're witnessing a really profound change in technology that will reshape medicine."

Designing such trials remains challenging. But researchers say this study makes the case that the work must proceed systematically, with clinical evidence guiding how AI gets integrated into hospitals.

