MIT Researchers Propose Framework to Make Medical AI Less Confident, More Cautious
A team led by MIT researchers has developed a method to engineer uncertainty into clinical AI systems, requiring them to ask questions and admit knowledge gaps rather than issuing authoritative diagnoses. The framework, published in BMJ Health & Care Informatics, addresses a specific failure mode: clinicians over-relying on AI recommendations even when their own judgment contradicts the system's output.
The problem is not that AI gets diagnoses wrong. The problem is that it sounds completely certain when it does.
Why Confidence Matters More Than Accuracy
Studies have documented ICU physicians deferring to incorrect AI recommendations and radiologists following flawed AI suggestions despite contradictory visual evidence in front of them. In both cases, the AI's confident tone drove the decision, not the quality of its reasoning.
Medical errors kill more than 250,000 Americans annually. As hospitals deploy more AI tools, the risk is that these systems could worsen a particular failure mode: automation bias, the human tendency to over-trust machine outputs.
Large language models regularly show minimal variation in their expressed confidence between correct and incorrect answers. They sound equally certain whether they are right or wrong. Some also display sycophantic behavior, complying with illogical medical requests up to 100% of the time when the request comes from an authority figure.
Leo Anthony Celi, a physician at Beth Israel Deaconess Medical Center and senior author of the study, said the solution is not smarter AI. "We're now using AI as an oracle, but we can use AI as a coach. We could use AI as a true co-pilot. That would not only increase our ability to retrieve information but increase our agency to be able to connect the dots."
How BODHI Works
The framework is called BODHI, standing for Balanced, Open-minded, Diagnostic, Humble and Inquisitive. Rather than retraining models from scratch, BODHI operates at the prompting level, meaning it can be layered onto existing AI systems without extensive rework.
The approach runs in two passes. In the first pass, the model analyzes its own uncertainty before responding to a clinician. It classifies the type of task, estimates its own confidence level, identifies information gaps, generates clarifying questions, and notes warning signs that require escalation. This analysis is structured and auditable.
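The paper's exact prompt and output schema are not reproduced here, but the first pass can be pictured as a structured self-assessment produced before any answer is drafted. The sketch below is illustrative only: the field names, confidence categories, and the `call_model` helper are assumptions, not the authors' implementation.

```python
import json
from dataclasses import dataclass, field

@dataclass
class UncertaintyAnalysis:
    """Hypothetical first-pass output: the model audits itself before it answers."""
    task_type: str                        # e.g. "diagnosis", "triage", "medication question"
    confidence: str                       # e.g. "high", "medium", "low"
    information_gaps: list[str] = field(default_factory=list)
    clarifying_questions: list[str] = field(default_factory=list)
    red_flags: list[str] = field(default_factory=list)  # findings that warrant escalation

def first_pass(clinician_query: str, call_model) -> UncertaintyAnalysis:
    """Run the self-assessment pass.

    `call_model` stands in for whatever LLM API is in use; because BODHI works
    at the prompt level, the underlying model needs no retraining.
    """
    prompt = (
        "Before answering, classify the task, rate your confidence, list missing "
        "information, propose clarifying questions, and flag anything that requires "
        "escalation. Respond only with JSON using the keys: task_type, confidence, "
        "information_gaps, clarifying_questions, red_flags.\n\n"
        f"Clinician query: {clinician_query}"
    )
    return UncertaintyAnalysis(**json.loads(call_model(prompt)))
```

Because the self-assessment arrives as structured data rather than free text, it can be logged and reviewed alongside the final answer, which is what makes the first pass auditable.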
The second pass generates the actual response to the clinician, shaped by what the first pass produced. A component called the Virtue Activation Matrix determines which behavioral stance the model should adopt based on two dimensions: how confident it is and how complex the clinical scenario is.
A high-confidence, low-complexity case triggers a "proceed and monitor" response. A low-confidence, high-complexity case triggers explicit escalation to human expertise, framed as deferential rather than directive.
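The published description of the Virtue Activation Matrix is qualitative, so the sketch below is an assumption-laden illustration: the two corner cases mirror the examples above, while the bin labels, the middle cells, and the stance names are invented for the example.

```python
# Hypothetical Virtue Activation Matrix: (confidence, complexity) -> behavioral stance.
# Only the two corner cases are described in the article; the other cells are guesses.
VIRTUE_ACTIVATION_MATRIX = {
    ("high", "low"):  "proceed_and_monitor",       # answer, but keep watching for new data
    ("high", "high"): "answer_with_caveats",       # assumed middle ground
    ("low",  "low"):  "ask_clarifying_questions",  # assumed middle ground
    ("low",  "high"): "escalate_to_human_expert",  # defer to clinicians, not direct them
}

def select_stance(confidence: str, complexity: str) -> str:
    """Map the first pass's confidence estimate and the case complexity to the
    stance that shapes how the second-pass response is worded."""
    return VIRTUE_ACTIVATION_MATRIX[(confidence, complexity)]

# Example: a low-confidence, high-complexity case triggers explicit escalation.
assert select_stance("low", "high") == "escalate_to_human_expert"
```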
Test Results Show Dramatic Shifts in Behavior
Researchers tested BODHI on 200 challenging clinical scenarios from HealthBench Hard, a benchmark covering emergency medicine, primary care, and specialty consultations. Two AI models were evaluated in their standard form and with BODHI applied.
One model, GPT-4.1-mini, showed its context-seeking rate (how often it asked clarifying questions rather than issuing answers) jump from 7.8% to 97.3%. Its overall clinical quality score improved from 2.5% to 19.1%.
The second model, GPT-4o-mini, showed more modest overall improvement but its context-seeking rate rose from zero to 73.5%. Both models produced consistent results across multiple independent test runs.
One metric moved in the wrong direction. Communication quality scores dropped roughly 12 percentage points for both models. The researchers argue this is the expected cost of epistemic constraint: confident declarations sound polished; appropriately hedged responses with questions do not.
In their view, the standard benchmark may be penalizing exactly the behavior that makes clinical AI safer.
Addressing Data Bias in Medical AI
The BODHI research is part of a broader effort by Celi and colleagues at MIT Critical Data to address structural problems in how medical AI is designed and for whom.
Many clinical AI models are trained on electronic health records from U.S. institutions. These datasets reflect existing patterns of care, access, and documentation. People who lack access to the healthcare system, including many rural patients, may be absent from those datasets entirely.
At workshops hosted by MIT Critical Data, researchers prompt data scientists, clinicians, social scientists, and patients to examine their training data before building anything. Were certain populations excluded? Does the data capture the real drivers of the outcome being predicted?
Sebastián Andrés Cajas Ordoñez, lead author and researcher at MIT Critical Data, said the goal is to keep humans central to AI systems. "We are trying to include humans in these human-AI systems, so that we are facilitating humans to collectively reflect and reimagine, instead of having isolated AI agents that do everything. We want humans to become more creative through the usage of AI."
What Comes Next
The MIT team plans to implement BODHI in AI systems trained on MIMIC, the large clinical database from Beth Israel Deaconess Medical Center, and test it with clinicians in the Beth Israel Lahey Health system. Radiology and emergency triage have been identified as additional targets.
The broader case being made is architectural: AI systems used in high-stakes settings should be designed to express uncertainty as a feature rather than suppress it. A model that asks a clarifying question before committing to a diagnosis is not weaker. It is safer.
Premature diagnostic closure, committing to a diagnosis before all relevant information is gathered, is a known contributor to medical error. A system that routinely asks "what information is missing here?" before issuing a recommendation nudges clinical workflows in the opposite direction.
Whether this shift in design philosophy takes hold depends on how AI developers, hospital administrators, and regulatory bodies respond to evidence that humility, not just accuracy, should be a performance metric for clinical AI tools.
For more on how AI is being applied in healthcare settings, see our coverage of AI for Healthcare and explore AI Research Courses for deeper understanding of AI development and validation approaches.