AI medical advice wins trust over doctors - even when it's wrong
A new study from researchers at MIT, published in the New England Journal of Medicine, found that people - including medical experts - trust AI-generated medical answers more than those written by physicians or drawn from online health platforms. The sample included 300 participants who rated responses on accuracy, validity, trustworthiness, and completeness.
AI consistently scored higher across those dimensions, and participants struggled to tell AI responses apart from human-written ones. The alarming twist: even low-accuracy AI answers - participants were not told which answers were inaccurate - were still rated as valid and trustworthy, and users indicated they would act on them.
If you work in science or research, that gap between perceived quality and actual correctness is the signal to pay attention to.
Why people trust AI's medical answers
- Fluency bias: well-structured, confident text reads as "correct."
- Calibration gap: models present certainty that isn't aligned with ground truth.
- Interface effects: clean formatting, complete-seeming lists, and formal tone amplify perceived authority.
- Human factors: clinicians often hedge responsibly; users may read that as uncertainty compared to AI's polished delivery.
Real harm isn't theoretical
The researchers highlight cases where AI advice led to dangerous behavior. One person sought ER care after following a chatbot's instructions to treat hemorrhoids with rubber bands. In another case, described in a clinical case report, a 60-year-old man was hospitalized for three weeks after ingesting a pool-related chemical suggested as a substitute for table salt.
Study signal for science and R&D teams
The core takeaway: trust does not equal truth. If you're building, evaluating, or deploying AI for health contexts, your metrics, UX, and governance must account for over-trust in confident but incorrect outputs.
How to evaluate AI medical systems beyond standard accuracy
- Run blinded head-to-heads against physicians and reputable platforms; include a "can you tell AI vs. human?" check.
- Add low-accuracy decoy arms to measure user susceptibility, action intent, and unnecessary care-seeking.
- Score answers on actionability, potential harm, and alignment with guidelines - not just perceived completeness.
- Measure calibration: expected vs. empirical correctness given stated confidence (see the sketch after this list).
- Require references and check verifiability; reward abstention and deferral when uncertainty is high.
- Red-team with domain experts for rare, ambiguous, and high-stakes scenarios.
- Track behavior change outcomes: Did the answer prompt risky self-treatment or inappropriate escalation?
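To make the calibration bullet concrete, here is a minimal Python sketch of an expected-calibration-error style check. It assumes you have, for each answer, a confidence the system stated and a correctness flag from expert grading; the function, variable names, and sample numbers are illustrative, not data or methods from the study.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Gap between stated confidence and expert-verified correctness (lower is better)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    report = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences >= lo) & (confidences < hi)
        if hi == 1.0:                        # include confidence == 1.0 in the top bin
            mask |= confidences == 1.0
        if not mask.any():
            continue
        avg_conf = confidences[mask].mean()  # what the system claimed
        avg_acc = correct[mask].mean()       # what reviewers verified
        ece += mask.mean() * abs(avg_conf - avg_acc)
        report.append((lo, hi, avg_conf, avg_acc, int(mask.sum())))
    return ece, report

# Hypothetical graded sample: stated confidence per answer, plus a 0/1
# correctness flag from clinician review (not figures from the study).
stated_confidence = [0.95, 0.92, 0.88, 0.60, 0.55, 0.30, 0.25]
expert_correct    = [1,    0,    0,    1,    0,    1,    0]

ece, bins = expected_calibration_error(stated_confidence, expert_correct, n_bins=5)
print(f"ECE = {ece:.3f}")
for lo, hi, conf_bin, acc_bin, n in bins:
    print(f"[{lo:.1f}, {hi:.1f}): stated {conf_bin:.2f} vs. correct {acc_bin:.2f} (n={n})")
```

A large gap in the high-confidence bins is exactly the failure mode the study points to: answers that sound certain but aren't.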
Practical safeguards you can ship now
- Confidence-aware outputs: include clearly labeled uncertainty and "could be wrong" cues tied to abstention.
- Deferral defaults: easy, visible "contact a clinician" paths for high-risk topics and symptom triage (see the gating sketch after this list).
- Source discipline: link to guideline-grade references and highlight when evidence is weak or contested.
- Retrieval with strict guardrails: block advice on high-harm topics unless it is verified against curated sources.
- Format carefully: avoid over-polished presentation for uncertain answers; reduce false authority signals.
- Post-deployment monitoring: incident reporting, drift checks, and rapid rollback for problematic patterns.
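As a companion to the deferral bullet above, here is a minimal Python sketch of a confidence- and risk-aware gate. The risk list, threshold, and message text are hypothetical placeholders; a real system would source them from clinical governance and a curated taxonomy, not a hand-written set in code.

```python
from dataclasses import dataclass

# Hypothetical risk list; a real deployment would map topics to risk tiers
# from a curated clinical taxonomy, not a hard-coded set.
HIGH_RISK_TOPICS = {"chest pain", "overdose", "self-harm", "infant fever"}

DEFER_MESSAGE = (
    "This question is best answered by a clinician. "
    "Please contact your doctor or local urgent-care service."
)

@dataclass
class DraftAnswer:
    topic: str
    text: str
    confidence: float        # system-stated confidence in [0, 1]
    sources_verified: bool   # True only if every citation is guideline-grade

def render(draft: DraftAnswer, min_confidence: float = 0.8) -> str:
    """Show the answer, show it with an uncertainty label, or defer to a clinician."""
    # Hard gate: high-risk topics fail closed unless sources are verified
    # and stated confidence clears the threshold.
    if draft.topic in HIGH_RISK_TOPICS and not (
        draft.sources_verified and draft.confidence >= min_confidence
    ):
        return DEFER_MESSAGE
    # Soft gate: uncertain or unsourced answers ship with an explicit warning
    # instead of a polished, authoritative-looking response.
    if draft.confidence < min_confidence or not draft.sources_verified:
        return "[Low confidence - this could be wrong; verify with a clinician]\n" + draft.text
    return draft.text

# Usage: an uncertain answer on a high-risk topic is replaced by the deferral path.
print(render(DraftAnswer(topic="chest pain", text="...", confidence=0.65, sources_verified=False)))
```

The design choice worth copying is that high-risk topics fail closed: deferral is the default, and showing an answer requires both verified sources and high confidence.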
Key details (as reported)
- N = 300 participants; responses came from a physician, an online platform, or an AI model.
- Experts and non-experts rated AI higher on accuracy, validity, trustworthiness, and completeness.
- Participants couldn't reliably distinguish AI from human-written responses.
- Low-accuracy AI outputs were still rated highly, and users indicated they would follow the advice, including unnecessary care-seeking.
Implications
- Perceived quality can outrun factual quality, increasing risk in high-stakes domains.
- AI systems need built-in friction around medical advice: calibrated confidence, transparent sourcing, and safe defaults.
- Regulatory and clinical governance should evaluate action risk, not just answer polish.
For background, see the New England Journal of Medicine and MIT's work in AI and healthcare. A recent survey from Censuswide also reported sizable consumer trust in AI health advice.
Level up your team's AI evaluation skills
If your research group is formalizing AI evaluation, calibration, and prompt discipline, structured training helps shorten the learning curve and reduce avoidable risk.
- AI courses by job function - pick paths for clinical, data, and product teams.
- AI certification for data analysis - build measurement, calibration, and evaluation depth.