Health Education is Adopting GenAI Faster Than Evidence Can Keep Up
Medical students are asking ChatGPT for diagnostic help. Nursing educators are using Claude to prepare case studies. Faculty teams are testing Gemini to manage student requests around the clock. This is not hypothetical - it's happening now, and the pace of adoption is striking.
Yet the evidence behind most of these uses remains thin. Educators and institutions are moving faster than the research can support.
Promising is not the same as proven
GenAI likely has a role in health education. It may help students practice more, receive faster feedback, and access support when needed. It may free educators from repetitive administrative work.
But "may help" is doing heavy lifting in most conversations about these tools.
The existing evidence is patchy and inconsistent. Much of it comes from short interventions, small samples, and limited contexts. Some studies conflate task performance with actual learning - a critical distinction.
A tool can produce fluent, well-structured feedback and do nothing for learning. A chatbot can answer exam questions and weaken a student's judgment if they start outsourcing their thinking. A system that saves one educator an hour can create two hours of extra work elsewhere.
In health education, what matters is what actually changes because of a tool. Technical capability alone is not enough.
We're confusing adoption with impact
The current debate blurs adoption with impact. It mistakes efficiency gains for educational improvement. It treats "promising" as if it meant "proven."
This matters because health education shapes judgment, professional identity, accountability, and safe practice. It teaches students not just what to know, but how to think, how to act, and how to make decisions that affect real people.
What would better evidence look like?
A simple rule applies: the bigger the claim, the stronger the evidence should be.
If the claim is that GenAI gives better feedback, what ensures it's specific and accurate? Do students actually use it? Does it build their ability to evaluate their own work, or just hand them an answer?
If the claim is that GenAI improves performance, a short-term bump in test scores is insufficient. Does learning stick? Does it transfer to new situations? Does it support reasoning or replace it?
For assessment claims - using GenAI in grading or progression decisions - the bar must be higher. Any such system needs scrutiny for fairness, transparency, and bias. Consistency does not guarantee fairness.
Claims that an educational GenAI tool will ultimately benefit patients deserve careful handling. Evidence must link teaching practice to clinical behavior, and clinical behavior to health outcomes. That chain should not be assumed.
Not every tool needs a randomized trial
GenAI tools should not be locked away until someone runs a controlled trial. That is not realistic.
But there is a difference between an educator using GenAI to draft a formative quiz and a system using it to flag struggling students or inform decisions about progression. Every claim carries an obligation: the strength of the evidence should match the weight of the claim.
The most important outcomes in health education are often invisible in the short term. Good health professionals develop over years through practice, reflection, feedback, supervision, and experience. Some of these processes may improve with GenAI. Some may narrow, particularly if learning becomes too focused on what is easy to prompt, score, or automate.
Responsibility is shared
Improving the evidence base is not one person's job.
- Researchers need to be clear about what they claim
- Educators need to distinguish between experimenting with a tool and proving it improves learning
- Universities need to resist turning pilots into policy before consequences are understood
- Industry and accrediting bodies should support AI literacy without endorsing untested systems
- Vendors need to make narrower, more honest claims
- Students should be treated as partners, not simply as end users or risks to manage
Evaluation must continue after implementation
Educational tools do not remain stable once introduced into real settings. Students adapt. Educators redesign tasks. Institutions change rules. GenAI models are updated, sometimes substantially and with little warning.
What works for one cohort or discipline may not work for another. Evaluation cannot be a one-off event - it must be built into implementation.
Apply the same standards
The real question is whether we will hold GenAI to the same standards we expect of everything else in health and medicine.
We would not introduce a new clinical intervention on the basis that it "seems useful" or "may help." We would not accept vague claims of benefit without evidence that those benefits are real, meaningful, and sustained.
Education should not be different simply because the risks are less visible or take longer to emerge. If GenAI is to play a meaningful role in preparing future health professionals, it needs the same care we apply to the rest of health practice.
Not simply because it is new. But because it matters.