AI checks clinical trial reports for missing steps
Randomized, controlled trials set the standard for evidence in medicine. Yet too many published studies skip key reporting details, making it hard to judge quality and reproduce results. A University of Illinois Urbana-Champaign team trained AI models on PSC's NSF-funded Bridges-2 system to flag missing elements in trial reports based on established guidelines. Their goal: an open-source tool that authors and journals can use to plan, conduct, and report trials with fewer gaps.
Why this matters for researchers and editors
Random assignment and pre-specified outcomes reduce bias. But even when researchers follow best practices, those steps don't always make it into the paper. With thousands of trials published each year, manual checks for reporting completeness don't scale.
"Clinical trials are considered the best type of evidence for clinical care. If a drug is going to be used for a disease … it needs to be shown that it's safe and it's effective … But there are a lot of problems with the publications of clinical trials. They often don't have enough details. They're not transparent about what exactly has been done and how, so we have trouble assessing how rigorous their evidence is." - Halil Kilicoglu, University of Illinois Urbana-Champaign
How the team built the checker
The team grounded their work in the CONSORT 2010 and SPIRIT 2013 reporting guidelines, which together outline 83 key items for high-quality trials. They fine-tuned Transformer-based natural language processing models to detect whether papers reported those items.
Bridges-2 provided the GPU resources and ready-to-use software stack needed to train on large text datasets. The models learned from 200 randomized trial articles published between 2011 and 2022, with a portion labeled for training and the remainder held out for testing.
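The team hasn't released this exact code here, but the approach described above, fine-tuning a Transformer for multi-label sentence classification against checklist items with a train/test split, looks roughly like the sketch below using Hugging Face Transformers. The base model, file names, label subset, and hyperparameters are illustrative placeholders, not the team's actual configuration.

```python
# Minimal sketch: fine-tune a Transformer to tag sentences with CONSORT/SPIRIT items.
# Model choice, file names, labels, and hyperparameters are assumptions for illustration.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

ITEMS = ["randomization", "allocation_concealment", "primary_outcome"]  # hypothetical subset of checklist items

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased",
    num_labels=len(ITEMS),
    problem_type="multi_label_classification",  # a sentence can report several items at once
)

# Expected CSV columns: "text" plus one 0/1 column per checklist item.
data = load_dataset("csv", data_files={"train": "train_sentences.csv",
                                       "test": "test_sentences.csv"})

def preprocess(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=256)
    # Multi-label heads expect float label vectors, one slot per item.
    enc["labels"] = [[float(batch[item][i]) for item in ITEMS]
                     for i in range(len(batch["text"]))]
    return enc

data = data.map(preprocess, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="consort_tagger",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("consort_tagger")  # keep the fine-tuned checkpoint for later inference
```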
"We are developing deep learning models. And these require GPUs, graphical processing units. And you know, they are … expensive to maintain … When you sign up for Bridges you get … the GPUs, and that's useful. But also, all the software that you need is generally installed. And mostly my students are doing this work, and … it's easy to get [them] going on [Bridges-2]." - Halil Kilicoglu, University of Illinois Urbana-Champaign
Training, testing, and results
The models were trained to map language patterns to specific checklist items. Performance was scored with F1, balancing precision and recall. The best models reached F1 scores of 0.742 at the sentence level and 0.865 at the article level.
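To make the two granularities concrete, here is a small sketch of how sentence-level and article-level F1 can be computed with scikit-learn. The aggregation rule used here (an article counts as reporting an item if any of its sentences is predicted positive) is an assumption for illustration, not necessarily the paper's exact definition, and the records are toy data.

```python
# Sketch of the two evaluation granularities: per-sentence decisions vs. per-article coverage.
from collections import defaultdict
from sklearn.metrics import f1_score

# Each record: (article_id, checklist_item, true 0/1, predicted 0/1) for one sentence. Toy data.
sentence_records = [
    ("art1", "randomization", 1, 1),
    ("art1", "randomization", 0, 1),
    ("art1", "primary_outcome", 1, 0),
    ("art2", "randomization", 1, 1),
    ("art2", "primary_outcome", 1, 1),
]

# Sentence-level F1: score every (sentence, item) decision directly.
y_true = [r[2] for r in sentence_records]
y_pred = [r[3] for r in sentence_records]
print("sentence-level F1:", f1_score(y_true, y_pred))

# Article-level F1: collapse sentences, then ask per (article, item) whether it was reported at all.
truth, pred = defaultdict(int), defaultdict(int)
for art, item, t, p in sentence_records:
    truth[(art, item)] |= t
    pred[(art, item)] |= p
keys = sorted(truth)
print("article-level F1:", f1_score([truth[k] for k in keys], [pred[k] for k in keys]))
```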
The work was published in Scientific Data, a Nature Portfolio journal, in February 2025, suggesting that AI can help screen trial reports for missing reporting items at scale.
What's next: better models and wider access
The team plans to improve performance with more training data and model distillation. The idea is to let a large model teach a smaller one that runs on a laptop or desktop.
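For readers unfamiliar with distillation, the sketch below shows the general technique in a plain classification setup: a small "student" is trained to match the softened predictions of a large "teacher" alongside the gold labels. The temperature, loss weighting, and toy tensors are illustrative defaults, not the team's implementation.

```python
# Sketch of knowledge distillation: student mimics teacher's soft predictions plus the true labels.
# Temperature T and mixing weight alpha are illustrative, not the team's actual settings.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft part: KL divergence between softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard part: ordinary cross-entropy against the annotated labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: a batch of 4 sentences, 3-way classification.
student_logits = torch.randn(4, 3, requires_grad=True)
teacher_logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```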
The endgame is an open-source tool. Authors could pre-check manuscripts before submission. Journals could add an automated screening step and send papers back for fixes when items are missing.
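What would such a pre-check look like in practice? A rough sketch, assuming a fine-tuned multi-label tagger like the one above: run it over a manuscript's sentences and list checklist items with no supporting sentence. The checkpoint path, item names, and 0.5 threshold are placeholders, not a released tool.

```python
# Sketch of an author-side pre-check with a hypothetical fine-tuned checkpoint ("consort_tagger").
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ITEMS = ["randomization", "allocation_concealment", "primary_outcome"]  # hypothetical subset

tokenizer = AutoTokenizer.from_pretrained("consort_tagger")  # checkpoint from the training sketch
model = AutoModelForSequenceClassification.from_pretrained("consort_tagger").eval()

sentences = [
    "Participants were randomly assigned (1:1) using a computer-generated sequence.",
    "The primary outcome was change in HbA1c at 12 weeks.",
]

with torch.no_grad():
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    probs = torch.sigmoid(model(**enc).logits)  # multi-label: per-item probability per sentence

covered = probs.max(dim=0).values > 0.5         # item counts as reported if any sentence supports it
missing = [item for item, ok in zip(ITEMS, covered.tolist()) if not ok]
print("Possibly missing items:", missing)
```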
Practical takeaways for your team
- Write to the checklist. Map each section of your manuscript to CONSORT/SPIRIT items, including randomization, allocation concealment, pre-specified outcomes, and analysis plans.
- Pre-register and pre-specify. Make your objectives and success criteria explicit up front, and report any deviations clearly.
- Adopt AI pre-checks once available. Use them as a first pass to catch omissions before peer review.
- If you're building similar tools, expect to fine-tune Transformer models on labeled sentences and sections, evaluate with F1, and budget for GPUs or HPC allocations (e.g., via NSF ACCESS).
Resources: CONSORT Statement and SPIRIT Statement.