Less hype, more proof: HIMSS26 spotlights shared benchmarks to de-risk healthcare AI

At HIMSS26, leaders pushed shared benchmarks so hospitals can test AI before they buy. The Healthcare AI Challenge rates models on usefulness, speed, and ROI to cut risk.

Published on: Mar 13, 2026

At the 2026 HIMSS Global Health Conference & Exposition in Las Vegas, Nabile Safdar, chief AI officer at Emory Healthcare, and Bernardo Bizzo, senior director of artificial intelligence at Mass General Brigham, made a simple case: health systems need shared benchmarks to make safer, smarter AI decisions. Their initiative, the Healthcare AI Challenge, is built to give clinicians and leaders clear evidence before they commit resources, rework workflows, and retrain teams.

Why AI decisions still feel risky

AI tooling is moving fast, but the evaluation playbook inside hospitals hasn't kept up. "The AI opportunity is to enhance healthcare workflow efficiency," Bizzo said. "But health systems lack tools to assess foundational models for safety and effectiveness."

Leaders are asked to fund implementation, training, and integration without clear proof of value. "Clinical leaders are not sure if it's bringing them value," Safdar said. Many models were built for general use, not clinical environments, which makes ROI murky. As Bizzo put it: "We lack benchmarks to assess how these tools are performing and how well they can help us."

The Healthcare AI Challenge: a shared, test-before-you-buy model

To close that gap, Bizzo and colleagues launched the Healthcare AI Challenge, a cross-institution effort that evaluates AI systems using common datasets and standardized methods. Its "AI Arena" lets clinical experts compare outputs across tasks like radiology reporting and medical record summarization.

So far: five challenges, more than 4,500 evaluations, roughly 200 participants across 40 institutions. Eighteen foundation models have been tested, spanning general-purpose and healthcare-specific systems. The goal is a repeatable, transparent process you can trust before you scale. "As healthcare professionals, this is the information you want to know before investing in a model," Bizzo said.

Measure what matters in clinics, not just accuracy

Technical accuracy matters, but it's not the whole story. "A lot of us get stuck on accuracy," Safdar said. "But often your family practice clinician is thinking, 'Does it make me faster?'"

In the AI Arena, evaluators can compare human performance with AI outputs and pit models against one another head-to-head. They assess speed, clinical usefulness, and whether results clear an acceptable threshold for real workflows. That's the level of evidence clinical leaders need to green-light deployment.
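The article doesn't say how the Arena turns these head-to-head judgments into rankings, but a common way to aggregate pairwise preferences is an Elo-style rating. Here is a minimal sketch, assuming simple Elo updates over clinician votes; the model names and judgment data are hypothetical, not the Challenge's actual scoring code:

```python
from collections import defaultdict

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update from a single head-to-head judgment.

    score_a is 1.0 if A's output was preferred, 0.0 if B's, 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a += k * (score_a - expected_a)
    r_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a, r_b

# Illustrative clinician judgments: (model_a, model_b, score for model_a).
judgments = [
    ("general-model", "clinical-model", 0.0),   # clinical model preferred
    ("clinical-model", "human-baseline", 0.5),  # tie
    ("general-model", "human-baseline", 0.0),   # human preferred
]

ratings = defaultdict(lambda: 1500.0)  # all participants start equal
for a, b, score_a in judgments:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.0f}")
```

Treating a human baseline as one more "player" in the same pool is one way to put the Arena's human-vs-model comparisons on a common scale.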

A practical checklist for safer AI adoption

  • Start with 1-2 priority use cases and define baseline metrics (time, quality, throughput, safety).
  • Use shared datasets and standardized tasks so results are comparable across vendors and versions.
  • Include frontline clinicians in scoring for usefulness, clarity, and cognitive load, not just accuracy.
  • Benchmark model outputs against human performance and against peer models.
  • Shadow test in production-like settings before live use; monitor error types and near misses.
  • Decide acceptable thresholds for performance and turnaround time by specialty and task.
  • Validate workflow fit: handoffs, documentation, audit trails, and fallback paths when the model is uncertain.
  • Track total cost of ownership: licenses, compute, integration, training, support, and model updates.
  • Set guardrails for PHI handling, bias checks, incident response, and pause/rollback criteria.
  • Define ROI measures upfront: minutes saved per user per day, report quality, denial reduction, throughput gains (see the sketch after this list).
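
To make the ROI item concrete, here is a back-of-the-envelope sketch. Every input below (user count, minutes saved, working days, loaded cost, TCO) is a hypothetical placeholder; substitute your own baseline measurements:

```python
# Hypothetical annualized ROI for a documentation-assist model.
users = 120                      # clinicians using the tool
minutes_saved_per_user_day = 12  # measured against your baseline
working_days = 230
loaded_cost_per_hour = 95.0      # fully loaded clinician cost, USD

hours_saved = users * minutes_saved_per_user_day * working_days / 60
gross_benefit = hours_saved * loaded_cost_per_hour

# Total cost of ownership: licenses, compute, integration, training, support.
tco = 250_000.0

roi = (gross_benefit - tco) / tco
print(f"Hours saved/year: {hours_saved:,.0f}")
print(f"Gross benefit:    ${gross_benefit:,.0f}")
print(f"ROI vs TCO:       {roi:.0%}")
```

Even a rough model like this forces the conversation onto measurable inputs, which is exactly what shared benchmarks are meant to supply.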

What's next: EHR integration and agentic workflows

The team plans to expand the platform to connect directly with EHR systems and evaluate emerging agentic workflows. The focus is clear: measure real productivity gains, not just model scores. "We want to measure how much more efficient users are and have that information available so you know how much ROI you can expect," Bizzo said.

Expect closer alignment with safety expectations and regulatory thinking as health systems operationalize these tools. For context on current device and software guidance, see the FDA's overview of AI/ML-enabled medical devices.

Bottom line

Shared benchmarks and clinician-centered evaluations reduce risk and guesswork. If you can compare models side by side, quantify workflow impact, and set clear thresholds, you can invest with confidence, and know when to walk away.

For practical training on building evaluation muscle inside your organization, explore AI for Healthcare and strategy resources for leaders in AI for Executives & Strategy.

