Researchers Build Framework to Make Clinical AI Say "I Don't Know"
Leo Anthony Celi, an intensive care physician at Beth Israel Deaconess Medical Center and researcher at MIT, has published a framework designed to make AI models pause and ask for more information instead of giving confident-sounding answers when they are not actually certain.
The framework, called BODHI (Balanced, Open-minded, Diagnostic, Humble, Inquisitive), was published in 2026 in BMJ Health & Care Informatics. Celi and colleagues wrapped the framework around GPT-4.1-mini and GPT-4o-mini and tested both models on 1,000 clinical cases.
What the tests showed
Under BODHI, the models paused or asked for additional context in 735 of the 1,000 cases. The framework requires models to assess their own confidence and identify gaps in their knowledge before answering clinical questions.
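The article does not describe how BODHI is implemented, so the following is only a minimal sketch of the general idea of a confidence gate: a wrapper that asks for more context when a draft reports low confidence or lists knowledge gaps. Every name here (`Draft`, `Response`, `gated_response`, `stub_model`) and the 0.8 threshold are assumptions made for illustration, and the stub stands in for a real model call.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Draft:
    """Hypothetical draft answer plus the model's self-assessment."""
    answer: str
    confidence: float  # self-reported, 0.0 to 1.0 (assumed scale)
    missing_info: List[str] = field(default_factory=list)


@dataclass
class Response:
    kind: str  # "answer" or "ask"
    text: str


def gated_response(
    draft_fn: Callable[[str], Draft],
    case: str,
    threshold: float = 0.8,  # assumed cutoff, not taken from the paper
) -> Response:
    """Return the draft answer only if it is confident and has no listed gaps."""
    draft = draft_fn(case)
    if draft.missing_info or draft.confidence < threshold:
        # Pause: name the gaps instead of answering.
        gaps = draft.missing_info or ["additional clinical context"]
        return Response("ask", "Before answering, I need: " + "; ".join(gaps))
    return Response("answer", draft.answer)


def stub_model(case: str) -> Draft:
    """Stand-in for a real model call; returns canned self-assessments."""
    if "allergies unknown" in case:
        return Draft("draft recommendation", 0.55, ["allergy history"])
    return Draft("draft recommendation", 0.9)


if __name__ == "__main__":
    print(gated_response(stub_model, "fever, allergies unknown"))
    print(gated_response(stub_model, "fever, full history provided"))
```

In this sketch the first case triggers a request for the allergy history, while the second passes through as an answer. A real system would need a far more reliable confidence signal than a self-reported number.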
The tradeoff is measurable: both models scored about 12 percentage points lower on standard communication-quality benchmarks when operating under BODHI. Those benchmarks typically reward fluent, confident-sounding responses.
Celi said: "Right now, we use AI as an oracle. We could use it as a coach."
The problem BODHI addresses
Clinical settings face a documented automation bias problem. When AI systems sound confident, clinicians tend to defer to them rather than apply independent judgment. In the U.S. health system, preventable medical errors remain a significant cause of patient harm.
Standard benchmarks that reward polished, certain-sounding outputs can mask clinical risk. A system that appears more competent on communication tests may actually encourage clinicians to trust it too much.
What practitioners should watch
Practitioners should check whether the peer-reviewed paper reports BODHI's effect on actual clinical decision-making or downstream safety measures, not just pause rates and benchmark scores. External validation of the 735-out-of-1,000 pause rate and the 12-point benchmark drop will matter for judging whether the framework generalizes.
Benchmark designers face a design challenge: rewarding appropriate uncertainty rather than surface fluency. Current metrics often penalize the kind of epistemic humility that clinical settings need.
For teams deploying AI in healthcare systems, the study demonstrates a concrete safety tradeoff. Encouraging generative AI models and LLMs to express uncertainty reduces their apparent polish but may also reduce harmful overconfidence in clinical contexts.