Large language models in biomedicine and healthcare: what matters now
Healthcare runs on text and signals: notes, labs, images, genomics, and the endless stream of literature. Large language models (LLMs) give us a way to read, summarize, and reason across it at speed. When used well, they cut clicks, surface insights, and help teams make better calls faster.
The catch: privacy, bias, and workflow fit. You don't need a moonshot. You need clear use cases, strong guardrails, and tight integration with clinical systems. Below is a practical guide: what LLMs do well today, where they struggle, and how to implement them safely.
Where LLMs deliver value today
Genomics
- Variant effect prediction: Models like Enformer and GPN learn long-range genomic interactions and predict how variants influence gene expression. Useful for prioritizing variants in rare disease workups and research pipelines.
- Cis-regulatory region mapping: DNABERT and Nucleotide Transformer learn motifs and context to score promoters, enhancers, and splice sites, even in low-data settings.
- DNA-protein interactions: Self-supervised models (e.g., MoDNA, GROVER) predict transcription factor binding and regulatory elements directly from sequence.
Transcriptomics
- Cell-type annotation: scBERT, Geneformer, and scGPT transfer knowledge from millions of cells to label new datasets and expose rare states.
- Batch effect correction: tGPT and SCimilarity create unified embeddings that travel across studies, aiding atlas-scale queries.
- Perturbation prediction: scGPT forecasts gene and drug responses by modeling gene-gene interactions; promising for target discovery and hypothesis testing.
- Spatial labels: Nicheformer blends single-cell and spatial data to predict niches and regions, supporting tissue mapping and tumor microenvironment studies.
Proteomics
- Structure prediction: ESM-style models, MSA Transformer, and ProLLaMA push sequence-to-structure accuracy and support protein engineering workflows.
- Function annotation: FAPM and ProteinChat link sequences and 3D embeddings to Gene Ontology terms and catalytic activities, helpful when homologs are scarce.
- Protein-protein interactions: ProLLM uses a "chain of thought" strategy to reason about signaling pathways and predict interaction pairs.
Drug discovery
- De novo generation: MolFM and TGM-DLM generate valid, property-aligned molecules guided by text prompts and diffusion over SMILES tokens.
- Drug-target interactions (DTIs): DTI-LM and DLM-DTI improve warm- and cold-start DTI predictions with compact architectures that run on modest hardware.
- Compound screening: Generalist therapeutic LLMs (e.g., Tx-LLM) unify molecules, proteins, and text to triage libraries and suggest repurposing leads.
Biomedical informatics
- Question answering: Retrieval-augmented systems like BioRAG reason over millions of papers to answer clinical or research questions with citations.
- Summarization: BioMedLM-level models condense notes, trials, and guidelines; in reader studies, clinicians often prefer these summaries for completeness and clarity.
- Clinical decision support (early signals): Studies in precision oncology show LLMs can suggest options that complement specialist judgment. Treat them as a second set of eyes, not a final arbiter.
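The retrieval-augmented pattern behind these QA systems can be sketched in a few lines. This is a toy keyword-overlap retriever with hypothetical snippet IDs; a production system would use dense embeddings and a vector index, but the core contract is the same: every answer comes back with the sources it drew from.

```python
# Toy retrieval-with-citations sketch. Corpus, source IDs, and the
# keyword-overlap scoring are illustrative, not a real RAG stack.
def retrieve(query, corpus, k=2):
    """Rank snippets by keyword overlap with the query; return (source_id, text)."""
    q_terms = set(query.lower().split())
    scored = []
    for source_id, text in corpus.items():
        overlap = len(q_terms & set(text.lower().split()))
        scored.append((overlap, source_id, text))
    scored.sort(reverse=True)
    return [(sid, txt) for score, sid, txt in scored[:k] if score > 0]

corpus = {
    "guideline-042": "first line therapy for stage II disease includes surgery",
    "policy-007": "prior authorization required for off formulary agents",
    "pathway-113": "stage II disease follow up imaging every six months",
}
hits = retrieve("recommended therapy for stage II disease", corpus)
```

The returned source IDs are what the generation step cites, so clinicians can verify the answer against documents they already trust.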
Implementation playbook for healthcare teams
- Start with a narrow, high-friction task: discharge summary drafts, prior-auth letter drafts, tumor board evidence packs, variant triage notes, or trial-matching candidate lists.
- Keep data private: use de-identified corpora for training; run inference in a protected environment; contractually ban secondary data use.
- Build on RAG: retrieve from guidelines, pathways, formularies, institutional policies, and local order sets so the model cites what your clinicians trust.
- Wire into existing systems: FHIR, SQL, or graph backends as tools; let the model read but not write without review. Keep audit logs.
- Human-in-the-loop: clinicians approve outputs; require rationale and sources; mandate uncertainty flags.
- Measure what matters: time saved, error rates, rework, guideline adherence, and patient outcomes. Sunset what doesn't move the needle.
- Iterate with guardrails: prompt templates, content filters, PHI redaction, and fallbacks when confidence is low.
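Two of the guardrails above, PHI redaction and a low-confidence fallback, can be combined in one gate in front of the model's output. This is a minimal sketch: the regex patterns and the 0.7 threshold are illustrative only, not a validated de-identification method (real deployments use dedicated PHI scrubbers and tuned thresholds).

```python
import re

# Guardrail sketch: regex PHI redaction plus a low-confidence fallback.
# Patterns and threshold are illustrative, not a certified de-id pipeline.
PHI_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE), "[MRN]"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "[DATE]"),
]

def redact(text):
    """Replace PHI-like spans with placeholder tokens."""
    for pattern, token in PHI_PATTERNS:
        text = pattern.sub(token, text)
    return text

def guarded_output(draft, confidence, threshold=0.7):
    """Redact PHI; route low-confidence drafts to human review, not the chart."""
    if confidence < threshold:
        return "FLAGGED FOR REVIEW: " + redact(draft)
    return redact(draft)
```

Redaction runs on both branches, so even flagged drafts never carry raw identifiers into review queues or logs.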
Data governance, bias, and safety
- Privacy by design: de-identify training data; restrict PHI exposure; enforce least-privilege access. HIPAA obligations still apply. See HHS HIPAA Privacy Rule.
- Bias checks: stratify metrics by sex, age, race/ethnicity, language, insurance, and site. Track drift over time and retrain with counterfactuals where feasible.
- Transparency: require citations, highlight uncertainty, and expose which data sources were used. Keep versioned prompts and models.
- Regulatory awareness: for patient-facing or device-like use, review the FDA's direction on AI/ML-enabled medical devices: FDA guidance portal.
- Incident response: define pathways to report harmful outputs, misclassification, or data leakage; pause and patch quickly.
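The stratified bias check above reduces to a small amount of bookkeeping: compute the metric per subgroup and flag any group that lags the overall rate by more than a chosen gap. The record fields and the 0.10 gap below are illustrative; a real audit would add confidence intervals, multiple metrics, and more strata.

```python
from collections import defaultdict

# Stratified-accuracy sketch; fields and the 0.10 gap are illustrative.
def stratified_accuracy(records, group_key):
    """Accuracy per subgroup, e.g. group_key='site' or 'language'."""
    totals, correct = defaultdict(int), defaultdict(int)
    for r in records:
        g = r[group_key]
        totals[g] += 1
        correct[g] += int(r["pred"] == r["label"])
    return {g: correct[g] / totals[g] for g in totals}

def flag_disparities(records, group_key, max_gap=0.10):
    """Return subgroups whose accuracy trails the overall rate by > max_gap."""
    overall = sum(r["pred"] == r["label"] for r in records) / len(records)
    by_group = stratified_accuracy(records, group_key)
    return [g for g, acc in by_group.items() if overall - acc > max_gap]
```

Run the same check on each new data slice over time and the output doubles as a drift signal for the retraining decision.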
Technical choices that matter
- RAG vs. fine-tuning: RAG is safer for fast updates and institutional specificity; fine-tune when you need stable style or domain reasoning.
- Structured data: encode labs, meds, and vitals as prompts, or let the model call tools (FHIR server, SQL, graph) for up-to-date values.
- Efficiency: use LoRA/QLoRA, 4- to 8-bit quantization, and domain adapters to cut compute and cost.
- Evaluation: blend automated metrics (factuality, citation accuracy) with expert review. For research tasks, pre-register evaluations to reduce hindsight bias.
- Security: isolate workloads, scrub prompts/outputs for PHI, and disable training on user data by default.
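The efficiency point is easy to quantify with back-of-envelope arithmetic: weight memory is roughly parameter count times bits per weight. The sketch below counts weights only; KV cache, activations, and any optimizer state add more on top.

```python
# Back-of-envelope weight-memory estimate for quantization choices.
# Counts model weights only; KV cache and activations are extra.
def weight_memory_gb(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / 1e9

params_7b = 7e9
fp16_gb = weight_memory_gb(params_7b, 16)  # 16-bit weights
int4_gb = weight_memory_gb(params_7b, 4)   # 4-bit quantized weights
```

A 7B-parameter model drops from 14 GB of weights at 16-bit to 3.5 GB at 4-bit, which is the difference between needing a data-center GPU and fitting on a single workstation card.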
Known limitations (plan around these)
- Data scarcity and bias: rare diseases, pediatric cohorts, and underrepresented populations remain under-labeled; models mirror those gaps.
- Interpretability: attention maps aren't explanations. Use rationales, citations, and counterfactual tests; keep humans in control.
- Cost: large models strain GPU budgets. Prefer smaller domain models with retrieval; scale only if outcomes justify it.
- Biological plausibility: generated molecules or interactions may be synthetically impractical or physiologically off. Add constraints and expert review.
- Multimodality: integrating sequence, structure, expression, images, and EHR is still early. Expect brittle edges across modalities.
- Evaluation ground truth: biology is noisy; many "labels" are provisional. Use multiple assays or literature triangulation where possible.
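Literature triangulation can be made mechanical: accept a label only when a majority of independent evidence sources agree, and mark everything else provisional. The sources, labels, and majority rule below are illustrative; real curation pipelines weight sources by reliability rather than counting votes equally.

```python
from collections import Counter

# Triangulation sketch: majority vote across independent evidence sources
# (assays, curated databases, literature). Labels here are illustrative.
def triangulate(calls, min_agree=2):
    """calls: labels from independent sources; returns (label, status)."""
    label, count = Counter(calls).most_common(1)[0]
    status = "accepted" if count >= min_agree else "provisional"
    return label, status
```

Keeping the "provisional" bucket explicit stops noisy single-source labels from silently contaminating evaluation sets.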
12-month sample roadmap
- Q1: pick 1-2 low-risk use cases; deploy RAG prototype; secure data agreements; define metrics.
- Q2: pilot with 10-30 clinicians or scientists; weekly feedback; cut prompts that confuse; add guardrails.
- Q3: expand to a second department; integrate with FHIR/graph tools; formalize bias and safety reviews.
- Q4: cost optimization (LoRA, quantization); external validation; publish internal guidance and training.
- Ongoing: monitor drift, re-evaluate models quarterly, and keep a kill switch.
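The ongoing drift check and kill switch can start as simply as comparing the current quarter's metric to a rolling baseline and tripping on a large drop. The 0.05 threshold is illustrative; set it from your own metric's variance.

```python
# Drift-monitor sketch: trip a kill switch when the current metric
# falls well below the rolling baseline. Threshold is illustrative.
def check_drift(baseline_scores, current_scores, max_drop=0.05):
    baseline = sum(baseline_scores) / len(baseline_scores)
    current = sum(current_scores) / len(current_scores)
    return {
        "baseline": baseline,
        "current": current,
        "kill_switch": baseline - current > max_drop,
    }
```

Wire the `kill_switch` flag to whatever actually disables the integration (feature flag, routing rule), not just to an alert someone may miss.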
What "good" looks like
- Clinicians save minutes per note and get better first drafts; fewer clicks in the EHR.
- Researchers prioritize variants, interactions, or compounds with higher hit rates downstream.
- Outputs include sources, uncertainty, and match institutional guidelines.
- Governance is clear: data use, auditing, escalation, and model lifecycle all documented.
Key model examples to know (non-exhaustive)
- Genomics: Enformer, GPN, DNABERT, Nucleotide Transformer, MoDNA, GROVER
- Transcriptomics: scBERT, Geneformer, scGPT, tGPT, SCimilarity, Nicheformer
- Proteomics: ESM-2, MSA Transformer, ProLLaMA, FAPM, ProteinChat, ProLLM
- Drug discovery: MolFM, TGM-DLM, DTI-LM, DLM-DTI, Tx-LLM
- Biomedical NLP: BioBERT, PubMedBERT, BioMegatron, BioMedLM, BioGPT, BioRAG
Further learning
Want structured, role-based training for your team? Explore practical AI courses by job category at Complete AI Training.
Bottom line: start small, keep humans in the loop, and measure outcomes that matter. The teams that win will pair solid clinical judgment with disciplined use of models and data.