AI Models Match or Exceed Emergency Doctors in Diagnostic Accuracy, Study Finds
OpenAI's o1 language model diagnosed emergency room patients as accurately as or better than human physicians in a head-to-head comparison, according to a study published this week in Science. Researchers from Harvard Medical School and Beth Israel Deaconess Medical Center tested the model against internal medicine attending physicians using real patient cases.
The comparison focused on 76 patients who arrived at Beth Israel's emergency room. Two independent attending physicians assessed diagnoses from both the AI models and human doctors without knowing which came from AI. The o1 model performed especially well at initial triage, where doctors have the least information and face the most time pressure.
At that critical first diagnostic touchpoint, o1's diagnosis matched or closely matched the final diagnosis in 67% of cases. One human physician achieved this in 55% of cases; the other, in 50%. OpenAI's 4o model performed on par with the physicians across diagnostic stages.
The researchers presented the AI systems with the same text-based information available in electronic medical records at the time each diagnosis was made, with no preprocessing. "We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines," said Arjun Manrai, who leads an AI lab at Harvard Medical School and co-authored the study.
Study Does Not Claim AI Is Ready for Clinical Use
The researchers stopped short of suggesting these models should make real diagnostic decisions. Instead, they called for "prospective trials to evaluate these technologies in real-world patient care settings."
Adam Rodman, a Beth Israel physician and study co-author, told the Guardian that no formal accountability framework exists around AI diagnoses. Patients still expect "humans to guide them through life or death decisions," he said.
The study examined only text-based information. The researchers acknowledged that prior work has shown current language models struggle with nontext inputs such as imaging data.
Criticism: Wrong Comparison Group
Emergency physician Kristen Panthagani raised a methodological concern: the study compared AI performance to internal medicine physicians, not emergency medicine specialists. "If we're going to compare AI tools to physicians' clinical ability, we should start by comparing to physicians who actually practice that specialty," she said.
Panthagani also noted that emergency medicine priorities differ from diagnostic accuracy alone. "My primary goal is not to guess your ultimate diagnosis. My primary goal is to determine if you have a condition that could kill you," she said.