OpenAI's o1 model outperforms emergency room doctors in diagnostic accuracy, Harvard study finds

OpenAI's o1 correctly diagnosed ER patients 67% of the time at initial triage, compared to 55% and 50% for two specialist physicians. Researchers caution that AI won't replace doctors: the study only tested text-based records, not full clinical exams.

Published on: May 03, 2026

OpenAI's reasoning model o1 diagnosed emergency room patients more accurately than doctors in a clinical study published in Science. Researchers from Harvard Medical School, Stanford University, and Beth Israel Deaconess Medical Center compared AI and physician performance across six experiments, including real cases from 76 emergency room patients.

In the initial triage stage, o1 identified diagnoses matching or closely approximating the actual conditions in 67.1% of cases; two specialist physicians achieved 55.3% and 50%, respectively. The gap widened in treatment planning: o1 scored 90% on clinical management tasks, while physicians using GPT-4 scored 41%.

How the Study Worked

Researchers evaluated AI and physician performance at three critical moments: when patients first arrived and described symptoms, after the emergency room doctor's evaluation, and when hospitalization decisions were made.

As more patient information accumulated, AI performance improved. With data from the emergency room doctor's evaluation, o1 matched or closely approximated the actual diagnosis in 72.4% of cases. By the hospitalization decision stage, this reached 81.6%.

In medical reasoning tests, o1 earned full marks on 78 of 80 questions. Specialists earned full marks on 28 of 80, and residents on 16 of 72.

The Limits of the Current Research

The research team emphasized that these results do not indicate AI will replace physicians. The experiments relied on written medical records and case descriptions, not the full clinical picture.

Actual emergency medicine involves non-textual information: patient facial expressions, pain levels, voice tone, imaging studies, and physical examination findings. The researchers said their work "only evaluated text-based clinical judgment capabilities."

Before AI is implemented in clinical practice, the team said, prospective clinical trials and studies of human-AI collaboration are necessary. The findings suggest a future in which AI offers diagnostic suggestions to physicians as soon as a patient arrives, while human judgment remains essential.

For professionals working in healthcare AI, and for anyone following AI research, the study demonstrates both the capabilities and the current limits of large language models in medical settings.

