MIT researcher builds framework to make clinical AI say "I don't know" more often

MIT and Beth Israel researchers built BODHI, a framework that makes AI models ask for more information instead of giving confident answers. In tests on 1,000 clinical cases, the models paused or requested context 735 times.

Published on: Jun 10, 2026

Leo Anthony Celi, an intensive care physician at Beth Israel Deaconess Medical Center and a researcher at MIT, has published a framework designed to make AI models pause and ask for more information instead of giving confident answers when they are uncertain.

The framework, called BODHI (Balanced, Open-minded, Diagnostic, Humble, Inquisitive), was published in 2026 in BMJ Health & Care Informatics. Celi and colleagues wrapped the framework around GPT-4.1-mini and GPT-4o-mini and tested both models on 1,000 clinical cases.

What the tests showed

Under BODHI, the models paused or asked for additional context in 735 of the 1,000 cases. The framework requires models to assess their own confidence and identify gaps in their knowledge before answering clinical questions.
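The article does not describe how BODHI is implemented, so the following is only a minimal sketch of the general pattern it describes: a wrapper that checks a model's self-assessment before releasing an answer. The names `SelfAssessment`, `humble_answer`, and `stub_assess`, and the 0.8 threshold, are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SelfAssessment:
    """Hypothetical structure for a model's self-report (not from the paper)."""
    confidence: float                      # self-reported confidence, 0.0 to 1.0
    knowledge_gaps: List[str] = field(default_factory=list)
    draft_answer: str = ""


def humble_answer(
    question: str,
    assess: Callable[[str], SelfAssessment],
    threshold: float = 0.8,                # assumed cutoff, chosen for illustration
) -> str:
    """Return the draft answer only if no gaps are reported and confidence is high."""
    result = assess(question)
    if result.knowledge_gaps or result.confidence < threshold:
        gaps = result.knowledge_gaps or ["more clinical context"]
        return "I need more information before answering: " + "; ".join(gaps)
    return result.draft_answer


def stub_assess(question: str) -> SelfAssessment:
    """Stand-in for a real model call; a deployed system would query an LLM here."""
    if "lab values attached" in question:
        return SelfAssessment(
            confidence=0.9,
            draft_answer="Draft answer based on the attached labs.",
        )
    return SelfAssessment(
        confidence=0.4,
        knowledge_gaps=["recent lab values", "current medications"],
    )


if __name__ == "__main__":
    print(humble_answer("Patient with fatigue", stub_assess))
    print(humble_answer("Patient with fatigue, lab values attached", stub_assess))
```

In this sketch the first call returns a request for the missing items and the second returns the draft answer. A real wrapper would also need to decide how much to trust the model's self-reported confidence, which the stub sidesteps.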

The tradeoff is measurable: both models scored about 12 percentage points lower on standard communication-quality benchmarks when operating under BODHI. Those benchmarks typically reward fluent, confident-sounding responses.

Celi said: "Right now, we use AI as an oracle. We could use it as a coach."

The problem BODHI addresses

Clinical settings face a documented automation bias problem. When AI systems sound confident, clinicians tend to defer to them rather than apply independent judgment. In the U.S. health system, preventable medical errors remain a significant cause of patient harm.

Standard benchmarks that reward polished, certain-sounding outputs can mask clinical risk. A system that appears more competent on communication tests may actually encourage clinicians to trust it too much.

What practitioners should watch

Practitioners should check whether the peer-reviewed paper reports BODHI's effect on actual clinical decision-making or downstream safety measures, not just pause rates and benchmark scores. External validation of the 735-out-of-1,000 pause rate and the 12-point benchmark drop will help show whether the framework generalizes.

Benchmark designers face a challenge of their own: rewarding appropriate uncertainty rather than surface fluency. Current metrics often penalize the kind of epistemic humility that clinical settings need.
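One way to picture a metric that rewards appropriate uncertainty is a toy scoring rule that gives credit for asking when a case is underspecified and penalizes answering anyway. The sketch below is an illustration only: the `GradedCase` fields and the point values are assumptions made here, not a benchmark used in the study.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class GradedCase:
    """Hypothetical grading record for one benchmark case."""
    asked_for_context: bool       # model paused or requested more information
    context_was_missing: bool     # graders judged the case underspecified
    answer_correct: bool = False  # only meaningful when the model answered


def score_case(case: GradedCase) -> float:
    """Assumed point values: +1 for appropriate behavior, -1 for overconfidence."""
    if case.asked_for_context:
        # Asking is rewarded only when information really was missing.
        return 1.0 if case.context_was_missing else 0.0
    if case.context_was_missing:
        # Answering an underspecified case counts as overconfidence.
        return -1.0
    return 1.0 if case.answer_correct else -1.0


def mean_score(cases: Iterable[GradedCase]) -> float:
    """Average the per-case scores; raises ZeroDivisionError on empty input."""
    cases = list(cases)
    return sum(score_case(c) for c in cases) / len(cases)
```

Under a rule like this, a fluent but unwarranted answer scores lower than a well-placed request for context, which is the opposite of what fluency-oriented benchmarks tend to reward.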

For teams building AI for deployed healthcare systems, the study demonstrates a concrete safety tradeoff. Encouraging generative AI models and LLMs to express uncertainty reduces their apparent polish but may also reduce harmful overconfidence in clinical contexts.

