OpenAI reasoning model matches or outperforms doctors at diagnosing patients, study finds

An OpenAI reasoning model outperformed experienced physicians at diagnosing patients from real emergency department records, according to a Harvard and Beth Israel Deaconess study published in Science. The researchers say the results are not a case for replacing doctors, but a call for clinical trials.

Published on: May 02, 2026

AI Model Diagnoses Patients Better Than Experienced Doctors in Hospital Study

An AI reasoning model developed by OpenAI outperformed experienced physicians at diagnosing patients using only electronic health records, according to a study published Thursday in Science. Researchers at Harvard Medical School and Beth Israel Deaconess Medical Center tested the model on actual emergency department cases, including complex scenarios where initial diagnoses proved incorrect.

In one case, a patient admitted with a pulmonary embolism deteriorated despite treatment. The AI model identified a history of lupus, an autoimmune condition that can cause heart inflammation, as the actual underlying problem. The diagnosis was correct.

How the Study Worked

The team evaluated the AI model at three critical points: triage, admission, and during hospital stay. They compared its performance against two experienced internal medicine physicians using the same information available to clinicians at each stage.

The AI also tackled diagnostic cases published in the New England Journal of Medicine and clinical vignettes designed to test how well it handled established medical benchmarks and complex diagnostic questions.

"This is the big conclusion for me: it works with the messy real-world data of the emergency department," said Dr. Adam Rodman, a clinical researcher at Beth Israel and study author. "It works for making diagnoses in the real world."

What Makes This Different

Earlier versions of large language models struggled with medical uncertainty and with generating differential diagnoses, the list of possible conditions that could explain a patient's symptoms. This model handled both effectively.

The researchers acknowledge a significant limitation: the AI worked from text alone. Clinicians rely on images, sounds, and nonverbal cues that weren't available to the model. Additionally, the emergency department represents only a slice of patient care. The study's success might not translate to patients with complex histories spanning weeks of hospitalization.

Dr. David Reich, chief clinical officer for Mount Sinai Health System, noted the advancement. "You have something which is quite accurate, possibly ready for prime time," he said. "Now the open question is how the heck do you introduce it into clinical workflows in ways that actually improve care?"

The Real-World Challenge

Arriving at a difficult diagnosis, where the AI excelled, differs from how medicine actually unfolds. Patient outcomes involve subtleties and variations that final diagnoses alone don't capture.

The study's authors rejected the notion that these results justify replacing doctors with AI. "I think it does mean that we're witnessing a really profound change in technology that will reshape medicine," said Raj Manrai, assistant professor of Biomedical Informatics at Harvard Medical School.

The findings underscore the need for rigorous testing through forward-looking clinical trials. Such trials could clarify whether AI actually improves patient outcomes in practice, a harder question than whether a model can diagnose correctly.

"It's a very challenging process to design these trials," Reich said, "but this study is a perfect call to action."
