OpenAI has released LifeSciBench, a benchmark containing 750 expert-authored tasks that test whether AI systems can handle applied research workflows in drug discovery and the life sciences. The benchmark, created with input from 173 Ph.D.-level scientists in biotechnology and pharmaceuticals, moves beyond structured biology questions to assess scientific reasoning, data interpretation, and research decisions across seven distinct workflows.
Published results show OpenAI's GPT-Rosalind model achieved an overall exact pass rate of 36.1 percent, compared with 25.7 percent for GPT-5.5. Both results fall well short of the benchmark's 70 percent task-level success threshold, leaving most tasks unsolved even by the strongest model tested.
How the benchmark measures scientific work
OpenAI built the LifeSciBench taxonomy after surveying practicing life scientists about their daily workflows. The seven categories span evidence handling, analysis, design and optimization, scientific reasoning, validation, translation, and scientific communication. Each task is written as a request a scientist might give to a knowledgeable collaborator, supported by prompts, context, and artifacts such as figures, PDFs, tables, genomic sequences, molecular structures, and web references.
The benchmark includes 1,062 supporting artifacts and 19,020 rubric criteria, an average of roughly 25 criteria per task. Seventy-nine percent of tasks require multiple reasoning steps, and 53 percent demand that models interpret or combine information from at least one external artifact. Independent reviewers assessed the benchmark's relevance: 97 percent of them held a doctorate, and reviewer agreement exceeded 96 percent on measures of real-world relevance and scientific grounding.
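The per-task averages follow directly from the published totals. A minimal sketch of that arithmetic, using only the counts quoted in this article (the variable names are illustrative, and the artifact average is taken over all 750 tasks, including those that ship with no artifact):

```python
# Totals as reported in the article.
TASKS = 750
RUBRIC_CRITERIA = 19_020
ARTIFACTS = 1_062

# Simple averages across the whole benchmark.
criteria_per_task = RUBRIC_CRITERIA / TASKS
artifacts_per_task = ARTIFACTS / TASKS

print(f"Rubric criteria per task: {criteria_per_task:.1f}")  # 25.4
print(f"Artifacts per task: {artifacts_per_task:.2f}")       # 1.42
```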
Yin He, who works with startups at OpenAI, described the focus in a LinkedIn post: "The goal is not to build systems that simply ace biology exams. It is to understand whether they can contribute meaningfully to the work of discovering and developing new medicines."
Where models improved - and where they fell short
GPT-Rosalind showed gains over GPT-5.5 in several workflows. The scientific communication pass rate jumped from 56.3 percent to 71.1 percent, though OpenAI cautions that this category contains only nine tasks. Translation tasks, which cover moving research from preclinical evidence toward clinical use, rose from 36.8 percent to 57.7 percent.
The model also scored higher on actionable outputs and uncertainty handling. Its rubric score for expert-useful or actionable responses reached 44.7 percent, up from 29.1 percent for GPT-5.5. Scores for handling uncertainty and caveats rose from 29.3 percent to 44.8 percent.
Performance dropped sharply when models had to work with files or web sources. GPT-Rosalind achieved a 45.1 percent pass rate on text-only tasks but fell to 28.1 percent on tasks involving artifacts or URLs. Models struggled to extract information from complex figures and large sequence files and then to incorporate that evidence into their final answers. Design, optimization, and prediction remained one of the hardest workflows, with a pass rate of 30.7 percent. Analysis tasks reached just 30.3 percent. Tasks requiring precise outputs fared even worse: numeric tasks saw a 14.8 percent pass rate, while sequence or structure outputs reached 24 percent.
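As a rough consistency check on these figures, one can ask what share of tasks would have to involve artifacts or URLs for the two subset pass rates to average out to the 36.1 percent overall rate. The sketch below assumes, and this is an assumption rather than something the article states, that text-only and artifact/URL tasks split the benchmark cleanly and that the overall rate is a task-weighted average of the two. The function name is hypothetical.

```python
def implied_artifact_share(overall, text_only, with_artifacts):
    """Solve overall = (1 - p) * text_only + p * with_artifacts for p.

    Assumes the two subsets partition the benchmark (not confirmed by the source).
    """
    return (text_only - overall) / (text_only - with_artifacts)


# GPT-Rosalind pass rates, in percent, as quoted in the article.
share = implied_artifact_share(overall=36.1, text_only=45.1, with_artifacts=28.1)
print(f"Implied share of artifact/URL tasks: {share:.1%}")  # 52.9%
```

Under that assumption the implied share is about 53 percent, which lines up with the reported 53 percent of tasks that require at least one external artifact.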
Limits of the benchmark
OpenAI is clear about what LifeSciBench does not measure. The benchmark assesses performance on self-contained research tasks and does not establish whether AI systems accelerate drug discovery or improve research outcomes. The organization plans to connect future results with deployment studies examining model use across live research workflows, repeated rounds of reasoning, and experimental follow-up.
For research scientists evaluating AI tools for their own work, the benchmark provides a concrete signal about where current systems struggle. Models still fail on tasks requiring precise calculations, artifact interpretation, and multi-step experimental design. Those gaps matter in a field where small errors in sequence data or numeric outputs can cascade into wasted experiments.
Why this matters for research scientists
LifeSciBench offers a practical reference point for scientists considering AI assistance in their workflows. The 36.1 percent overall pass rate on expert-designed tasks suggests that models can handle some parts of scientific communication and translation but remain unreliable for the design, analysis, and precise-output tasks that form the core of bench research. Scientists evaluating these tools should test them against their own domain-specific artifacts, such as figures, sequence files, and chemical structures, rather than relying on text-only prompts. The benchmark also underscores that AI training for research scientists needs to address artifact interpretation and multi-step reasoning, not just knowledge recall.