Decoding Hidden Bias in Health Care LLMs: A Practical Path for Clinicians
Large language models are writing notes, summarizing charts, and shaping recommendations in clinical workflows. They also carry racial bias from their training data, which can seep into outputs in ways that are easy to miss and hard to justify.
New work from Northeastern University shows a concrete way to see when an LLM is using race, and whether that use is appropriate. The tool at the center of it: a sparse autoencoder that turns opaque model internals into human-readable concepts.
Why this matters at the bedside
Bias in pain treatment is well documented: Black patients are less likely than white patients to receive pain medication at comparable pain levels. An AI model trained on historical notes can repeat that pattern.
Sometimes race is clinically relevant (e.g., gestational hypertension risk, or conditions with known genetic patterns). Cystic fibrosis, for example, is more common in individuals of Northern European descent. See an overview from the Mayo Clinic.
The point isn't to strip race from every decision. It's to know when race is influencing a recommendation, and whether that influence is justified.
What the researchers did
Researchers processed de-identified clinical notes and discharge summaries from the MIMIC dataset, focusing on cases where patients self-identified as white, Black, or African American. They ran the notes through an LLM (Gemma-2), then used a sparse autoencoder to expose the model's "latents": interpretable features the model uses internally.
They trained a detector to flag latents associated with race. What surfaced was troubling: latents linked to Black patients frequently co-activated with stigmatizing concepts such as "incarceration," "gunshot," and "cocaine use." The value here isn't that bias exists (we already know that); it's that we can now see when and how it creeps into recommendations.
If you want to understand the source data used in this study, explore the MIMIC database on PhysioNet.
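To make that pipeline concrete, here's a minimal sketch of the detector step. It assumes you already have per-note latent activations from a sparse autoencoder run over the LLM's hidden states; the array shapes, labels, and the scikit-learn logistic regression are illustrative assumptions, not the study's exact method.

```python
# Minimal sketch: flag latents associated with race (illustrative, not the paper's exact method).
# Assumes per-note SAE latent activations and self-identified race labels are already available.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder data: 2,000 notes x 4,096 SAE latents. Real values would come from the
# sparse autoencoder applied to the LLM's hidden states over clinical notes.
latent_activations = rng.random((2000, 4096)).astype(np.float32)
race_labels = rng.integers(0, 2, size=2000)  # 0 = white, 1 = Black / African American

X_train, X_test, y_train, y_test = train_test_split(
    latent_activations, race_labels, test_size=0.2, random_state=0
)

# L1-regularized logistic regression keeps the weight vector sparse, so the surviving
# nonzero weights point at the individual latents that carry race signal.
detector = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
detector.fit(X_train, y_train)
print(f"held-out accuracy: {detector.score(X_test, y_test):.3f}")

# Latents with the largest positive weights are the ones most associated with the
# Black / African American label; these are the candidates to inspect for
# co-activation with stigmatizing concepts.
weights = detector.coef_.ravel()
top_latents = np.argsort(weights)[-10:][::-1]
print("latents to inspect:", top_latents.tolist())
```

If a simple classifier like this predicts documented race well above chance from the latents alone, that is already a signal worth acting on: race is recoverable from the model's internal features even when the note never states it.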
How a sparse autoencoder helps
LLMs compress inputs into internal representations that humans can't easily interpret. The sparse autoencoder decodes those representations into legible concepts ("this feature looks like race," "this one looks like substance use").
When a race-related latent lights up, you can see that race is influencing the output. That visibility lets clinicians and data teams decide whether to accept, mitigate, or block that influence.
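For teams that want to see the mechanics, here's a minimal sparse autoencoder sketch in PyTorch: one overcomplete hidden layer trained with a reconstruction loss plus an L1 sparsity penalty, the standard recipe for this kind of interpretability work. The dimensions, penalty weight, and single training step are illustrative, not the configuration used in the study.

```python
# Minimal sparse autoencoder sketch (illustrative dimensions, not the study's config).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 2304, d_latent: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)   # hidden state -> sparse latents
        self.decoder = nn.Linear(d_latent, d_model)   # latents -> reconstructed hidden state

    def forward(self, hidden_states: torch.Tensor):
        latents = torch.relu(self.encoder(hidden_states))  # ReLU keeps most latents at zero
        reconstruction = self.decoder(latents)
        return latents, reconstruction

sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_weight = 1e-3  # strength of the sparsity penalty

# `hidden_states` would be activations captured from the LLM while it reads
# clinical notes; random data stands in here.
hidden_states = torch.randn(64, 2304)

latents, reconstruction = sae(hidden_states)
loss = nn.functional.mse_loss(reconstruction, hidden_states) + l1_weight * latents.abs().mean()
loss.backward()
optimizer.step()
```

The sparsity penalty is what makes the latents legible: because each one fires on only a narrow slice of inputs, individual latents tend to align with single concepts, which is what makes a label like "race" or "substance use" meaningful in the first place.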
What you can do right now
- Add a rationale step. Require the model to list the factors that most influenced its recommendation and to state whether race was considered and why.
- Constrain race usage. In prompts or policies, state: "Include race only when directly relevant to pathophysiology, risk, or evidence for this condition."
- Use interpretability checks. If you control the model stack, add a sparse autoencoder layer to log when race-related latents activate for high-impact tasks.
- Audit by cohort. Compare outputs across identical cases with race terms varied. Flag systematic differences where race isn't clinically relevant (see the sketch after this list).
- Protect high-risk decisions. Pain control, triage, and discharge planning should default to human review if race-related features activate.
- Retrain or adjust. If bias appears, consider retraining on curated data, reweighting, or suppressing specific latents that drive stigmatizing associations.
- Document and monitor. Maintain a model card that describes when race may be used, known limitations, and current mitigation steps. Monitor drift over time.
- Prompting helps, but it isn't enough. Asking for "unbiased answers" reduces risk, but it won't catch every case. Keep human oversight in the loop.
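Here's what the cohort audit above can look like in code: a minimal sketch that varies only the race term in otherwise identical notes and flags divergent recommendations. The `get_recommendation` function and the note template are hypothetical placeholders for your own model endpoint and case library.

```python
# Minimal cohort-audit sketch: vary only the race term and compare recommendations.
# `get_recommendation` and NOTE_TEMPLATE are hypothetical placeholders.

NOTE_TEMPLATE = (
    "54-year-old {race} man presenting with acute lower back pain, 7/10 severity, "
    "no red-flag symptoms. Recommend a pain management plan."
)

def get_recommendation(note: str) -> str:
    """Replace this stub with a call to your clinical LLM endpoint."""
    return "placeholder recommendation"

def audit_pair(template: str, races=("white", "Black")) -> dict:
    """Generate the same case for each cohort and collect the model's recommendations."""
    return {race: get_recommendation(template.format(race=race)) for race in races}

if __name__ == "__main__":
    results = audit_pair(NOTE_TEMPLATE)
    for race, recommendation in results.items():
        print(f"--- {race} ---\n{recommendation}\n")
    if len(set(results.values())) > 1:
        # Divergent outputs on clinically identical notes are the audit signal:
        # route the case to human review and record it for the model card.
        print("FLAG: recommendations differ across race terms; escalate for review.")
```

Run this over a batch of representative cases rather than a single template, and keep the flagged pairs: they are exactly the examples a governance committee will want to see.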
Prompt patterns you can copy
Try these templates in clinical decision support contexts (with appropriate governance); an example of wiring them into an API call follows the list:
- "List the top 5 factors that influenced this recommendation. If race was considered, explain why it is clinically relevant for this condition and cite evidence."
- "Provide the recommendation without using race unless it has direct clinical relevance. If you consider race, justify with condition-specific evidence and guideline references."
Limitations to keep in mind
Sparse autoencoders surface associations, not intent. A flagged latent doesn't prove causation and may misfire without careful validation. This is a safety tool, not a final answer.
Clinical oversight remains essential. External validation, outcome tracking, and fairness metrics should guide any deployment decision.
The bottom line
Bias in LLM outputs is a patient safety risk. Now there's a practical way to see when race is in the mix and decide what to do about it.
If your team is rolling out AI in care settings and needs structured upskilling on safe prompting, governance, and evaluation, explore our healthcare-focused tracks at Complete AI Training.