Stanford's MedAgentBench Sets Real-World Standard for Clinical AI Agents

Stanford's MedAgentBench tests AI agents on realistic EHR tasks, prioritizing safety and real clinical workflows. Early results show top models approaching a 70% success rate, positioning them to augment clinicians under human review.

Stanford Develops Real-World Benchmarks for Healthcare AI Agents

Date: September 15, 2025

AI in medicine must prove it can handle the same EHR tasks clinicians do every day. A Stanford team built a benchmark to test that, prioritizing safety and real clinical workflows over hype. The goal is clear: confirm what these systems can reliably do, then let them augment care without adding risk.

"Working on this project convinced me that AI won't replace doctors anytime soon," said Kameron Black, Clinical Informatics Fellow at Stanford Health Care. "It's more likely to augment our clinical workforce."

MedAgentBench: A benchmark for AI agents that act in the EHR

The study, MedAgentBench: A Virtual EHR Environment to Benchmark Medical LLM Agents (published in NEJM AI), evaluates whether AI agents can complete the kinds of tasks clinicians perform inside a live EHR. Unlike chatbots, agents can autonomously perform multistep actions with minimal supervision: pulling data, reasoning over it, and using tools to execute orders.

"Chatbots say things. AI agents can do things," said Jonathan Chen, the paper's senior author. That higher bar demands a repeatable way to measure capability and error modes before any real-world use.

How the benchmark works

The team created a virtual EHR with 100 realistic patient profiles and 785,000 total records (labs, vitals, meds, diagnoses, procedures). They tested about a dozen large language models on 300 physician-authored tasks, using FHIR API endpoints to simulate real system interactions and messy data. The tasks fall into three categories:

  • Retrieve patient data (labs, meds, problem lists, vitals)
  • Order tests and imaging
  • Prescribe and adjust medications
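
As a concrete illustration of the first category, here is a minimal sketch, assuming the Python requests library and a hypothetical local FHIR endpoint, of how an agent-side tool might retrieve a patient's recent labs. The base URL, patient ID, and function name are illustrative assumptions; MedAgentBench supplies its own virtual EHR server.

```python
import requests

# Hypothetical FHIR server URL; MedAgentBench runs against its own
# virtual EHR, so this endpoint is only a stand-in for illustration.
FHIR_BASE = "http://localhost:8080/fhir"

def get_recent_labs(patient_id: str, loinc_code: str) -> list[dict]:
    """Fetch a patient's lab Observations via a standard FHIR search."""
    resp = requests.get(
        f"{FHIR_BASE}/Observation",
        params={"patient": patient_id, "code": loinc_code,
                "_sort": "-date", "_count": "5"},
        timeout=10,
    )
    resp.raise_for_status()
    bundle = resp.json()  # FHIR returns a Bundle of matching resources
    return [entry["resource"] for entry in bundle.get("entry", [])]

# Example: the last five serum potassium results (LOINC code 2823-3).
for obs in get_recent_labs("example-patient-id", "2823-3"):
    qty = obs.get("valueQuantity", {})
    print(qty.get("value"), qty.get("unit"))
```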

Results at a glance

Overall Success Rate (SR) on MedAgentBench:

  • Claude 3.5 Sonnet v2 - 69.67%
  • GPT-4o - 64.00%
  • DeepSeek-V3 (685B, open) - 62.67%
  • Gemini 1.5 Pro - 62.00%
  • GPT-4o-mini - 56.33%
  • o3-mini - 51.67%
  • Qwen2.5 (72B, open) - 51.33%
  • Llama 3.3 (70B, open) - 46.33%
  • Gemini 2.0 Flash - 38.33%
  • Gemma 2 (27B, open) - 19.33%
  • Gemini 2.0 Pro - 18.00%
  • Mistral v0.3 (7B, open) - 4.00%
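
For clarity, SR is simply the fraction of the 300 tasks an agent completes correctly. The pass count below is a back-of-envelope reconstruction (209/300 rounds to the top score above), not a figure reported in the paper:

```python
# SR is the fraction of the 300 tasks an agent completes correctly.
successes = 209     # hypothetical pass count; 209/300 rounds to 69.67%
total_tasks = 300   # physician-authored tasks in MedAgentBench
print(f"SR = {successes / total_tasks:.2%}")  # -> SR = 69.67%
```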

Many models struggled with nuanced reasoning, complex workflows, and interoperability, the same issues clinicians face daily. Knowing the error types and their frequency is essential before any pilot, so teams can build safeguards and oversight.

Why this matters for clinical care

The benchmark shows that top models can already handle a meaningful subset of day-to-day tasks. Early wins are most likely in "housekeeping" work (structured requests, repeatable workflows, well-bounded orders) under human review. Performance is improving with newer models, especially when systems are tuned for observed error patterns.

With the right design, safety checks, structure, and consent, the path from lab prototype to tightly scoped pilots is within reach.

Action checklist for health systems

  • Start in a sandbox EHR or virtual environment with synthetic data.
  • Use FHIR-based tool access and restrict agent permissions by task.
  • Define a narrow, high-value task list with clear success criteria and stop conditions.
  • Keep a human in the loop for orders; require approvals for meds, imaging, and high-risk actions.
  • Log everything, and add guardrails (timeouts, rate limits, formulary rules, CDS hooks); a minimal sketch follows this list.
  • Track error patterns (omissions, wrong-patient risk, order mismatches) and retrain or harden workflows.
  • Pilot with a small group of clinicians; measure time saved, error rates, and user trust.
  • Establish governance: IRB or oversight committee, security review, bias and equity checks, informed consent.
  • Align vendor contracts to require auditability, rollback, and clear liability terms.
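
To make the guardrail and human-in-the-loop items concrete, here is a minimal sketch of an approval gate for agent-proposed medication orders. The tiny formulary, rate limit, and function names are illustrative assumptions, not mechanisms from the study.

```python
import time

# Illustrative guardrails for agent-proposed medication orders. The tiny
# formulary and every name below are assumptions for this sketch, not
# interfaces from the MedAgentBench study.
FORMULARY = {"amoxicillin", "potassium chloride"}
RATE_LIMIT_SECONDS = 2.0
_last_order_time = 0.0

def submit_order(drug: str, approved_by: str | None) -> str:
    global _last_order_time
    # Rate limit: block bursts of automated ordering.
    if time.monotonic() - _last_order_time < RATE_LIMIT_SECONDS:
        return "REJECTED: rate limit exceeded"
    # Formulary rule: only known drugs may be ordered at all.
    if drug not in FORMULARY:
        return "REJECTED: not on formulary"
    # Human in the loop: medications always need clinician sign-off.
    if approved_by is None:
        return "PENDING: clinician approval required"
    _last_order_time = time.monotonic()
    print(f"audit: {drug} approved by {approved_by}")  # log everything
    return f"PLACED: {drug}"

print(submit_order("amoxicillin", approved_by=None))       # PENDING
print(submit_order("amoxicillin", approved_by="Dr. Lee"))  # PLACED
```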

The road ahead

Hospitals are already applying AI to notes and chart summaries. Benchmarks like MedAgentBench help decide where agents can safely assist next. With a projected global healthcare staffing shortfall exceeding 10 million workers by 2030, the promise isn't replacement; it's relief. As James Zou and Eric Topol note, AI is moving from tool to teammate.

"With deliberate design, safety, structure, and consent, it will be feasible to start moving these tools from research prototypes into real-world pilots," said Black.

Paper details and further reading

The study: MedAgentBench: A Virtual EHR Environment to Benchmark Medical LLM Agents, published in NEJM AI. Senior author: Jonathan Chen, Stanford.

Upskilling your team

If your clinical or informatics team is exploring AI-agent workflows, see practical training options by role: AI courses by job.