AI Medical Devices May Fail on Real Patients Despite Strong Test Results
AI-enabled medical devices that perform well during testing can still malfunction when deployed on real patients whose medical images differ from the training data, according to a new report from the Paragon Health Institute. The finding highlights a gap between controlled validation environments and actual clinical practice.
The report focuses on "generalization uncertainty," meaning uncertainty about an AI system's ability to process real-world data accurately outside the lab. When devices encounter patients, imaging techniques, or clinical environments substantially different from their training data, performance can degrade, creating patient-safety risks and eroding clinician confidence.
Why Training Data Matters
Unlike traditional software built on deterministic rules, AI medical devices rely on predictive models trained on specific datasets. Device performance is tightly coupled to that training data's characteristics.
Kev Coleman, director of the Healthcare AI Initiative at Paragon Health Institute, said current validation approaches fall short. "Too little training data or too much consistency among that data can result in the AI device working well during development but having problems in the real world," he said.
Broad demographic representation in training data alone doesn't guarantee reliability. Individual patients whose medical images differ significantly from dominant dataset characteristics still face higher risks of inaccurate outputs.
Equipment and Technique Create Hidden Variables
The report identifies a frequently overlooked factor: variation introduced by imaging hardware and technician technique. Differences in radiology equipment, image quality, and clinical workflows all influence whether AI systems generalize across healthcare settings.
These variations mean an AI device validated on one hospital's imaging systems may perform differently elsewhere.
A Voluntary Approach to Validation
Rather than requiring disclosure of proprietary training data, the report recommends "Digital Similarity Analysis," a voluntary tool comparing an individual patient's medical image against the device's training and testing data before deployment.
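The article does not describe how such a comparison would be computed, so the following is only an illustrative sketch of the general idea, not the report's method. It assumes an image can be reduced to two summary statistics (mean brightness and contrast) and flags a patient image whose statistics fall far from those of the training images. The function names, the choice of features, and the threshold of 3.0 are all hypothetical; a real tool would likely rely on richer features and clinically validated thresholds.

```python
from statistics import mean, pstdev


def image_features(pixels):
    """Reduce a grayscale image (list of rows) to [mean intensity, contrast].

    Hypothetical feature choice, used here only to keep the sketch small.
    """
    flat = [value for row in pixels for value in row]
    return [mean(flat), pstdev(flat)]


def similarity_report(patient_image, training_images, z_threshold=3.0):
    """Compare one patient image against the training set, feature by feature.

    Returns the per-feature z-scores and a flag that is True when any
    feature lies more than z_threshold standard deviations from the
    training mean. The threshold is an assumption, not a published value.
    """
    training_features = [image_features(img) for img in training_images]
    patient_features = image_features(patient_image)

    z_scores = []
    for index, value in enumerate(patient_features):
        column = [features[index] for features in training_features]
        spread = pstdev(column) or 1e-9  # guard against zero spread
        z_scores.append(abs(value - mean(column)) / spread)

    return {
        "z_scores": z_scores,
        "out_of_distribution": max(z_scores) > z_threshold,
    }


# Toy 2x2 "images" standing in for a training set (made-up values).
training = [
    [[100, 110], [90, 100]],
    [[100, 120], [80, 100]],
    [[110, 115], [105, 110]],
    [[90, 105], [75, 90]],
]

typical_patient = [[100, 112], [88, 100]]     # resembles the training images
atypical_patient = [[240, 250], [230, 240]]   # far brighter than any of them

typical_result = similarity_report(typical_patient, training)
atypical_result = similarity_report(atypical_patient, training)
```

In this toy setup the second image would be flagged for review before the device's output is trusted, while the first would pass. The point is the workflow the report describes, checking similarity per patient rather than per demographic group, without the vendor having to publish the training data itself.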
Coleman said validation gaps vary by algorithm type and clinical setting. The FDA is refining its oversight of AI devices, recognizing discontinuities between AI systems and the agency's original framework for software as a medical device.
Post-market surveillance and total product life cycle risk management are under consideration, particularly as the agency weighs approving adaptive or generative AI within medical devices.