General-purpose large language models outperform specialized healthcare AI tools in clinical evaluations

Three consumer LLMs beat two specialized clinical AI tools across all three healthcare tests, including a novel benchmark of 100 real physician queries rated by 12 clinicians.

Categorized in: AI News, Healthcare
Published on: Jun 25, 2026

Three consumer-level large language models have beaten two specialized clinical AI tools across three distinct healthcare evaluations, according to a new study from New York University. The results, published this month in Nature Medicine, challenge the assumption that healthcare-specific AI is inherently superior for clinical decision support.

The three general-purpose models (OpenAI's GPT-5.2, Google's Gemini 3.1 Pro, and Anthropic's Claude Opus 4.6) were pitted against the OpenEvidence platform and Wolters Kluwer's UpToDate Expert AI. Both clinical tools are purpose-built to aid clinical decision-making with research-grounded evidence.

Three evaluation stages

Graduate student Krithik Vishwanath, neurosurgeon Eric Oermann, MD, and colleagues tested the models in three stages. They used 500 MedQA questions to measure medical knowledge, 500 HealthBench items to assess alignment with clinicians, and a novel benchmark called Real Clinical Queries (RCQ). The RCQ was built from 100 de-identified physician queries to a general-purpose language model in a live clinical environment. Twelve U.S. clinicians then performed randomized, blinded reviews of the model outputs, generating 1,800 model-question annotations.

Key findings

The frontier LLMs "outperformed clinical AI tools in all three evaluations," the study reports. On the RCQ benchmark, the specialized clinical AI tools performed comparably to Google Search's AI Overview, which is enabled automatically. "These findings highlight the need for independent, real-world evaluation of AI tools before they enter clinical settings," the study authors write.

Implications for clinical AI

In their discussion, Vishwanath and colleagues note that clinical AI tools often carry institutional legitimacy and are likely safe for routine use. Yet their results show that specialized clinical AI tools are not superior to frontier models on knowledge, communication, or clinical alignment. "The superior performance of frontier models in our study suggests that scale, alignment and cross-domain reasoning may outweigh domain-specific tuning as determinants of medical competency for particular tasks," they write.

The researchers say this has implications for procurement, reimbursement, and regulatory oversight. They suggest that hospital-specific LLMs that use institutional data could be a path forward, with careful use of frontier models for less-sensitive tasks.

Why this matters for healthcare professionals

The study challenges the assumption that purpose-built clinical AI tools are automatically better for clinical decision support. Healthcare leaders should demand independent, real-world evaluation of any AI tool before deployment, regardless of whether it carries a specialized label. The findings also point toward a future in which healthcare AI combines large general-purpose models for certain tasks with institution-specific models tuned to local data to mitigate harm. As the researchers note, "As generative LLMs become integrated into healthcare at the enterprise, individual clinician and consumer levels, there is an increasing need for rigorous, independent evaluation on real-world tasks."

