MIT researchers design AI medical framework that asks questions instead of giving confident wrong answers

MIT researchers built BODHI, a framework that makes medical AI ask questions and flag uncertainty instead of giving confident answers. In tests, one model's context-seeking rate jumped from 7.8% to 97.3%.

Categorized in: AI News, Science and Research
Published on: Apr 04, 2026

MIT Researchers Design AI That Admits Uncertainty in Medical Decisions

The problem with asking an AI for a diagnosis is not that it might be wrong. It is that it might be wrong and sound completely certain.

That distinction matters in medicine. Studies show experienced ICU physicians defer to AI recommendations even when their own instincts push back. Radiologists have followed incorrect AI suggestions despite contradictory evidence in front of them. Confidence, regardless of whether it is warranted, is persuasive.

Researchers at MIT propose that the solution is not smarter AI, but humbler AI. Their framework, published in BMJ Health & Care Informatics, engineers uncertainty into clinical systems so models ask questions and admit gaps rather than press forward with authoritative-sounding answers.

Why Overconfident AI Matters in Medicine

Medical errors kill more than 250,000 people annually in the United States. AI was supposed to reduce that toll. Instead, current tools may be making a specific failure mode worse: automation bias, the human tendency to over-rely on machine outputs.

Large language models regularly exhibit overconfidence in clinical reasoning tasks. Recent work found that even accurate models show minimal variation in expressed confidence between correct and incorrect answers. Some models also comply with illogical medical requests up to 100% of the time when the request comes from an authority figure.

Leo Anthony Celi, senior author of the study and a physician at Beth Israel Deaconess Medical Center, frames the alternative: "We're now using AI as an oracle, but we can use AI as a coach. We could use AI as a true co-pilot. That would not only increase our ability to retrieve information but increase our agency to be able to connect the dots."

How BODHI Works

The framework is called BODHI, standing for Balanced, Open-minded, Diagnostic, Humble, and Inquisitive. Rather than modifying a model's underlying architecture, BODHI works at the level of prompting. This means it can be layered onto existing AI systems without extensive rework.

The approach runs in two passes. The first pass requires the model to analyze its own uncertainty before responding to a clinician: it must classify the task type, estimate its own confidence, identify information gaps, generate clarifying questions, and surface red flags requiring escalation.
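The article does not reproduce the framework's prompt templates, but the first pass amounts to a structured self-assessment the model must complete before any clinician-facing text is generated. Below is a minimal sketch in Python of what that assessment might look like as a data structure; the field names, categories, and example values are illustrative assumptions, not the published schema.

```python
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class UncertaintyAssessment:
    """First-pass self-assessment, completed before any clinician-facing
    response is generated. Field names and categories are illustrative
    assumptions, not BODHI's published schema."""
    task_type: Literal["diagnosis", "triage", "treatment", "information"]
    confidence: float  # model's self-estimated confidence, 0.0-1.0
    information_gaps: list[str] = field(default_factory=list)
    clarifying_questions: list[str] = field(default_factory=list)
    red_flags: list[str] = field(default_factory=list)  # findings needing escalation

# A hypothetical first-pass output for a chest-pain triage request:
assessment = UncertaintyAssessment(
    task_type="triage",
    confidence=0.4,
    information_gaps=["no ECG results", "medication history unknown"],
    clarifying_questions=["Is the pain exertional?", "Any prior cardiac events?"],
    red_flags=["possible acute coronary syndrome"],
)
```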

The second pass generates the actual response shaped by the first pass analysis. A component called the Virtue Activation Matrix determines which behavioral stance the model should adopt based on two dimensions: how confident it is and how complex the clinical scenario is. High confidence and low complexity trigger a "proceed and monitor" response. Low confidence and high complexity trigger escalation to human expertise.
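In code, the matrix component reduces to a small lookup keyed on those two dimensions. In the sketch below, which reuses the UncertaintyAssessment sketch above, the "proceed and monitor" and escalation cells come from the article; the two mixed cells and the confidence cutoff are assumptions chosen to match BODHI's emphasis on inquisitiveness.

```python
from typing import Literal

# Sketch of a Virtue Activation Matrix as a lookup table. Only the
# (high, low) and (low, high) cells are described in the article; the
# other two stances are illustrative assumptions.
STANCES = {
    # (confidence, scenario complexity) -> behavioral stance
    ("high", "low"):  "proceed and monitor",
    ("high", "high"): "verify key facts, then proceed",   # assumed
    ("low",  "low"):  "ask clarifying questions first",   # assumed
    ("low",  "high"): "escalate to human expertise",
}

def choose_stance(assessment: "UncertaintyAssessment",
                  complexity: Literal["low", "high"]) -> str:
    """Map the first-pass self-assessment onto a second-pass stance."""
    confidence = "high" if assessment.confidence >= 0.7 else "low"  # cutoff is illustrative
    return STANCES[(confidence, complexity)]

# For the low-confidence chest-pain assessment above, a complex scenario
# routes to escalation rather than an authoritative answer:
# choose_stance(assessment, "high") -> "escalate to human expertise"
```

The second pass then generates the clinician-facing text under whichever stance the lookup returns, which is how a low-confidence, high-complexity case ends up as a question or a referral rather than a confident answer.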

Test Results Show Significant Behavioral Shifts

Researchers tested BODHI on 200 challenging clinical scenarios covering emergency medicine, primary care, and specialty consultations. Two AI models were evaluated in standard form and with BODHI applied.

One model, GPT-4.1-mini, saw its context-seeking rate jump from 7.8% to 97.3%, and its overall clinical quality score improve from 2.5% to 19.1%. The second model, GPT-4o-mini, saw its context-seeking rate rise from zero to 73.5%, with its overall score improving from 0.0% to 2.2%.
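The article does not describe the scoring rubric behind these rates, but a context-seeking rate is conceptually simple: the fraction of test scenarios in which the model's response asks for missing information rather than committing to an answer. A deliberately naive sketch, for illustration only:

```python
def context_seeking_rate(responses: list[str]) -> float:
    """Fraction of responses that ask for more information.

    The keyword detector below is deliberately crude; the study's
    actual rubric is not described in the article and was presumably
    far more careful.
    """
    cues = ("could you", "can you", "do you have", "i would need",
            "before answering", "more information")
    def seeks_context(text: str) -> bool:
        lower = text.lower()
        return "?" in text and any(cue in lower for cue in cues)
    return sum(seeks_context(r) for r in responses) / len(responses)

# On 200 scenarios, 7.8% -> 97.3% corresponds to roughly 16 versus
# 195 context-seeking responses (before averaging across runs).
```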

Both models showed consistent results across multiple independent test runs. One metric moved in the wrong direction: communication quality scores dropped roughly 12 percentage points for both models. The researchers argue this is the expected cost of epistemic constraint. Confident declarations sound polished; appropriately hedged responses containing questions do not.

Broader Questions About Training Data

The BODHI work is part of a broader effort by MIT Critical Data, a global consortium, to address structural problems in how medical AI is designed.

Many clinical AI models are trained on electronic health records from U.S. institutions that reflect existing patterns of care, access, and documentation. People who lack access to healthcare, including many rural patients, may be absent from those datasets entirely. The resulting models can encode existing inequities without deliberate intent.

At workshops, MIT Critical Data prompts data scientists, clinicians, social scientists, and patients to examine training data before building anything. Were certain populations excluded? Does the data capture the real drivers of the outcome being predicted? The assumption is that no dataset is neutral.

Sebastián Andrés Cajas Ordoñez, lead author and researcher at MIT Critical Data, said: "We are trying to include humans in these human-AI systems, so that we are facilitating humans to collectively reflect and reimagine, instead of having isolated AI agents that do everything."

Next Steps and Implications

The MIT team plans to implement the framework in AI systems trained on MIMIC, the large clinical database from Beth Israel Deaconess Medical Center, and test it with clinicians in the Beth Israel Lahey Health system. Radiology and emergency triage are additional targets.

The broader case being made is architectural: AI systems used in high-stakes settings should express uncertainty as a feature rather than suppress it for authority. A model that asks a clarifying question before committing to a diagnosis is not weaker. It is safer.

Premature diagnostic closure, committing to a diagnosis before gathering all relevant information, is a known contributor to medical error. A system that routinely asks "what information is missing here?" nudges clinical workflows toward more complete assessment. It treats uncertainty as data rather than as a flaw to hide.

Whether this shift in design philosophy takes hold depends partly on how AI developers, hospital administrators, and regulatory bodies respond to evidence that humility, not just accuracy, should be a performance metric for clinical tools.


