AI-written code matches and sometimes beats human researchers at predicting preterm birth risk, study finds

AI-generated code matched human expert performance in biomedical research, a Cell Reports Medicine study found. OpenAI's o3-mini produced preterm birth predictions as accurate as those from research teams that had spent years on the same analysis.

Published on: Apr 07, 2026

AI-Generated Code Matches Human Expertise in Biomedical Research

Large language models can produce medical research code as accurate as human experts, according to a study published in February in Cell Reports Medicine. The finding suggests LLMs could accelerate research timelines, though scientists warn the technology requires careful oversight.

Researchers at the University of California, San Francisco used eight different LLMs to write code for analyzing patient data and predicting preterm birth risk. A graduate student and a high school student provided each model with a single prompt describing the available datasets and the prediction task.

Four models produced working code. OpenAI's o3-mini performed as well as the original human teams who had spent years on the same analysis. For one task - estimating gestational age from epigenetic data - the AI output was more accurate than the human-generated code.

The junior researchers completed the work in six months and published their findings within a year. The original DREAM Challenge teams, which tackled the same problem using traditional methods, took years to reach comparable results.

What the code actually did

The analysis drew on open datasets of measurements from blood samples, placental tissue, and the vaginal microbiomes of pregnant people. Machine learning models were trained to identify patterns linking these biological markers to gestational age and preterm birth risk.
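
To make that concrete, here is a minimal illustrative sketch of the kind of analysis involved, using synthetic data and scikit-learn. The feature matrix, model choice, and metric are assumptions for illustration, not details taken from the study.

```python
# Illustrative sketch only: the study's actual pipelines are not reproduced here.
# Assumes a tabular matrix of omics features (e.g., methylation or microbiome
# abundances), one row per sample.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))        # placeholder biological features (synthetic)
y = rng.integers(0, 2, size=200)      # 1 = preterm birth, 0 = term (synthetic labels)

# Train a classifier linking the markers to preterm birth risk and estimate
# performance with cross-validated AUROC.
model = GradientBoostingClassifier(random_state=0)
auroc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"Mean cross-validated AUROC: {auroc.mean():.2f}")
```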

Preterm birth, occurring before 37 weeks of pregnancy, affects roughly 11% of infants worldwide. Early delivery increases risk for brain, eye, and digestive system complications. Better prediction tools could enable closer monitoring and preventive treatments.

The gap between code generation and accuracy

Not all LLM applications in biomedical research perform equally. A separate study published in Nature Biomedical Engineering tested LLMs on 293 coding tasks from 39 published studies and found accuracy below 40% when models worked autonomously.

Accuracy jumped to 74% when researchers added human review. The researchers had LLMs produce step-by-step analysis plans that humans validated before any code executed. This separation of planning from execution proved critical.

"The goal is not to ask researchers to blindly trust an AI system," said Zifeng Wang, a doctoral researcher involved in the study. Instead, systems should make "reasoning, planning, and intermediate steps visible enough that researchers can supervise and validate the process."

Measuring AI performance remains unsolved

Health care lacks standardized benchmarks for evaluating AI performance in medical contexts. Without agreed-upon metrics, comparing different models and tracking progress becomes difficult.

The problem intensifies because commercial AI models improve rapidly. Most benchmarks become outdated within months as new versions exceed existing performance standards. Stanford University's AI Research and Science Evaluation Healthcare Network is working to develop industry standards by year's end.

Researchers caution against holding AI to impossible standards while overlooking human error. One computer science professor at Johns Hopkins noted that when comparing error rates, humans often underestimate their own miss rates - sometimes by significant margins - while assuming AI is inherently unreliable.

What comes next: Autonomous AI systems

Current applications represent an early phase. Researchers are moving toward "agentic" AI systems that can execute multistep workflows with minimal human intervention. These systems would check their own work, iterate toward objectives, and take actions like searching databases or running code without waiting for user prompts.
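
In code, such an agentic loop might be sketched roughly as follows. Every tool here is a hypothetical stub, included only to show the plan-act-check-iterate pattern, not any particular system.

```python
# Illustrative agentic loop: the agent acts (queries a database, runs an
# analysis), checks its own result, and iterates toward an objective without
# waiting for user prompts. All tool functions are hypothetical stubs.

def search_database(query: str) -> str:
    return f"records matching '{query}'"      # stub retrieval step

def run_analysis(data: str) -> dict:
    return {"auroc": 0.71, "data": data}      # stub analysis result

def self_check(result: dict, target_auroc: float) -> bool:
    return result["auroc"] >= target_auroc    # agent validates its own work

def agent(goal: str, target_auroc: float = 0.8, max_steps: int = 3) -> dict:
    result = {"auroc": 0.0, "data": ""}
    for _ in range(max_steps):
        data = search_database(goal)          # act without a human prompt
        result = run_analysis(data)
        if self_check(result, target_auroc):  # objective met, stop iterating
            break
    return result

print(agent("preterm birth biomarkers"))
```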

This shift offers substantial potential but introduces serious risks. The more autonomous the system, the less human oversight exists to catch errors or validate reasoning.

Marina Sirota, interim director of the Bakar Computational Health Sciences Institute at UCSF, said her team is exploring additional applications beyond code writing. They developed Chat PTB, an LLM tool embedded in research papers published by the March of Dimes. Instead of manually searching literature for information about preterm birth, researchers can query the tool and receive synthesized answers with citations in seconds rather than hours.

The human role remains central

Scientists agree AI belongs in laboratories but not without supervision. The scientific method itself - hypothesis testing, validation, peer review - should govern how AI tools are deployed.

"The question is not whether LLMs accelerate science or create poor-quality output," said Ian McCulloh, a computer science professor at Johns Hopkins. "The question is how we leverage this powerful technology within the scientific method."

For researchers implementing AI-generated code, this means applying the same scrutiny to AI output as to any collaborator's work. Validation, testing, and documentation remain non-negotiable regardless of whether code came from a human or a model.
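
In practice, that scrutiny can start with the same kind of sanity tests any collaborator's code would get. The predictor below is a hypothetical stand-in, not code from the study; the checks are examples of the minimum validation a researcher might run before trusting the output.

```python
# Treat AI-generated code like any collaborator's: test it before trusting it.
# `predict_gestational_age` is a hypothetical function under test.
import numpy as np

def predict_gestational_age(methylation: np.ndarray) -> np.ndarray:
    """Stand-in for an AI-generated predictor (returns weeks of gestation)."""
    return 30 + 10 * methylation.mean(axis=1)

def test_predictions_are_plausible():
    features = np.random.rand(20, 100)          # synthetic input samples
    preds = predict_gestational_age(features)
    assert preds.shape == (20,), "one prediction per sample"
    assert np.all((preds > 20) & (preds < 45)), "gestational age in plausible range"

test_predictions_are_plausible()
print("Sanity checks passed; deeper validation and peer review still required.")
```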

The opportunity is substantial. The challenge is ensuring the scientific process survives the efficiency gains.

