GPT-4o detects cancer drug adverse events in clinical notes but falls short of decision support threshold

AI models from OpenAI scored below clinical-use thresholds when screening cancer patient records for drug side effects, a multicenter study found. GPT-4o achieved F1 scores of 56-66%, well short of the 80% generally needed to support automated decisions.

Published on: Apr 07, 2026

Study Tests AI Models for Detecting Drug Safety Signals in Cancer Treatment Notes

Researchers tested large language models from OpenAI to identify immune-related adverse events in clinical notes from cancer patients taking immune checkpoint inhibitors. The multicenter study, published April 6 in eBioMedicine, found the models useful for automated screening but not yet reliable enough for clinical decision-making.

Immune checkpoint inhibitors, a class of cancer drugs introduced in 2011, can trigger serious side effects affecting the colon, liver, lungs, heart, nervous system, skin, and endocrine system. Detecting these adverse events currently requires expensive manual review of electronic health records or custom software built for specific drugs at specific hospitals.

How the Study Worked

The team used zero-shot learning, feeding GPT-3.5, GPT-4, and GPT-4o a single detailed prompt describing immune-related adverse events without providing examples. The prompt listed six checkpoint inhibitors and dozens of their known side effects.
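The zero-shot setup described above can be sketched as a single instruction-only prompt with no labeled examples. This is a hypothetical illustration, not the study's actual prompt; the drug list below reflects the six checkpoint inhibitors mentioned, and the adverse-event categories are drawn from the organ systems listed earlier in this article.

```python
# Hypothetical sketch of zero-shot prompting for irAE detection.
# The study's actual prompt wording and parameters are not reproduced here.

ICIS = ["ipilimumab", "nivolumab", "pembrolizumab",
        "atezolizumab", "avelumab", "durvalumab"]  # six checkpoint inhibitors

IRAE_CATEGORIES = ["colitis", "hepatitis", "pneumonitis", "myocarditis",
                   "neurologic toxicity", "dermatitis", "endocrinopathy"]

def build_zero_shot_prompt(note_text: str) -> str:
    """Assemble one detailed instruction prompt -- no worked examples,
    which is what makes the approach 'zero-shot'."""
    return (
        "You are reviewing an oncology clinical note for immune-related "
        "adverse events (irAEs) caused by immune checkpoint inhibitors "
        f"({', '.join(ICIS)}). Possible irAE categories: "
        f"{', '.join(IRAE_CATEGORIES)}. "
        "List every category documented in the note, or answer 'none'.\n\n"
        f"CLINICAL NOTE:\n{note_text}"
    )

prompt = build_zero_shot_prompt("Patient on pembrolizumab reports new diarrhea...")
```

The same prompt is then sent, unchanged, with each clinical note to the model's chat endpoint, which is what lets the method transfer across hospitals without site-specific training.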

Researchers tested the models against clinical notes from 100 patients at Vanderbilt Health, 70 at the University of California, San Francisco, and 272 from seven Roche-sponsored trials. GPT-4o performed best among the three models tested.

Results Fall Short for Clinical Use

The researchers measured performance using F1 scores, which range from zero to one (reported here as percentages) and balance false positives against false negatives. Scores of 80% or above might support automated clinical decisions; 90% or higher is considered excellent.
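As a worked example of the metric, the F1 score is the harmonic mean of precision and recall, so a tendency to over-flag events (more false positives) drags precision, and with it F1, down. The counts below are illustrative, not figures from the study.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # of flagged events, how many were real
    recall = tp / (tp + fn)     # of real events, how many were flagged
    return 2 * precision * recall / (precision + recall)

# An overpredicting screener: 50 true detections, 40 false alarms, 10 misses.
# precision = 50/90 ~ 0.56, recall = 50/60 ~ 0.83
print(round(f1_score(tp=50, fp=40, fn=10), 2))  # -> 0.67
```

Note how recall can be high while heavy overprediction still holds F1 in the 60s, the same range the study reports.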

GPT-4o achieved average F1 scores of 56% at Vanderbilt, 66% at UCSF, and 62% across Roche trial notes when detecting adverse events at the patient level. When analyzing individual clinical notes, the average F1 score was 57%.

The models showed a consistent tendency to overpredict adverse events, flagging side effects that weren't actually present in the records.

Potential for Screening Across Sites

While the performance doesn't meet the threshold for automated clinical decision support, the method could accelerate adverse event detection across multiple hospitals and research centers. Cosmin Bejan, assistant professor of biomedical informatics at Vanderbilt Health and the study's corresponding author, said the approach could reduce time and costs for drug safety monitoring.

"Manual patient chart abstraction for monitoring the safety and efficacy of drugs already at market requires tremendous resources," Bejan said. "If zero-shot learning with LLMs could help with these notes, it could significantly reduce time and costs for all concerned."

The study was supported by the National Institutes of Health.

Related finding: A separate analysis published in JAMA Oncology last December found that checkpoint inhibitors were independently associated with increased risk of Stevens-Johnson syndrome and toxic epidermal necrolysis, a dangerous skin reaction that sometimes occurred in patients also taking human leukocyte antigen-restricted drugs.
