OpenAI introduces LifeSciBench to test AI performance on real-world life science research tasks

OpenAI's LifeSciBench benchmark, built with 173 Ph.D.-level scientists, shows its best model scoring a 36.1 percent pass rate on 750 expert drug-discovery tasks, far below the 70 percent success threshold.

Published on: Jun 22, 2026

OpenAI has released LifeSciBench, a benchmark containing 750 expert-authored tasks that test whether AI systems can handle applied research workflows in drug discovery and the life sciences. The benchmark, created with input from 173 Ph.D.-level scientists in biotechnology and pharmaceuticals, moves beyond structured biology questions to assess scientific reasoning, data interpretation, and research decisions across seven distinct workflows.

Published results show OpenAI's GPT-Rosalind model achieved an overall exact pass rate of 36.1 percent, compared with 25.7 percent for GPT-5.5. Both results fall well short of the benchmark's 70 percent task-level success threshold, leaving most tasks unsolved even by the strongest model tested.

How the benchmark measures scientific work

OpenAI built the LifeSciBench taxonomy after surveying practicing life scientists about their daily workflows. The seven categories span evidence handling, analysis, design and optimization, scientific reasoning, validation, translation, and scientific communication. Each task is written as a request a scientist might give to a knowledgeable collaborator, supported by prompts, context, and artifacts such as figures, PDFs, tables, genomic sequences, molecular structures, and web references.

The benchmark includes 1,062 supporting artifacts and 19,020 rubric criteria - an average of roughly 25 per task. Seventy-nine percent of tasks require multiple reasoning steps, and 53 percent demand that models interpret or combine information from at least one external artifact. Independent reviewers assessed the benchmark's relevance: 97 percent held a doctorate, and reviewer agreement exceeded 96 percent on measures of real-world relevance and scientific grounding.
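The per-task average quoted above follows directly from the published totals. A minimal Python sketch of that arithmetic, using only the counts reported in this article (the variable names are illustrative, not OpenAI's):

```python
# Totals as reported in the article; the names are illustrative.
TASKS = 750
RUBRIC_CRITERIA = 19_020
ARTIFACTS = 1_062

# Mean number of rubric criteria and supporting artifacts per task.
criteria_per_task = RUBRIC_CRITERIA / TASKS
artifacts_per_task = ARTIFACTS / TASKS

print(f"criteria per task: {criteria_per_task:.2f}")
print(f"artifacts per task: {artifacts_per_task:.2f}")
```

The artifact figure is a mean over all 750 tasks; the article does not say how artifacts are distributed across individual tasks.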

Yin He, who works with startups at OpenAI, described the focus in a LinkedIn post: "The goal is not to build systems that simply ace biology exams. It is to understand whether they can contribute meaningfully to the work of discovering and developing new medicines."

Where models improved - and where they fell short

GPT-Rosalind showed gains over GPT-5.5 in several workflows. The scientific communication pass rate jumped from 56.3 percent to 71.1 percent, though OpenAI cautions that this category contains only nine tasks. Translation tasks, which cover moving research from preclinical evidence toward clinical use, rose from 36.8 percent to 57.7 percent.

The model also scored higher on actionable outputs and uncertainty handling. Its rubric score for expert-useful or actionable responses reached 44.7 percent, up from 29.1 percent for GPT-5.5. Scores for handling uncertainty and caveats rose from 29.3 percent to 44.8 percent.
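To compare these gains side by side, the differences can be expressed in percentage points. The sketch below uses only the figures quoted in this article; the labels are informal shorthand, and the first three rows are pass rates while the last two are rubric scores, so the rows are not directly comparable with one another.

```python
# (GPT-5.5, GPT-Rosalind) figures in percent, as quoted in the article.
# Labels are informal shorthand, not OpenAI's category names.
results = {
    "overall pass rate": (25.7, 36.1),
    "scientific communication pass rate": (56.3, 71.1),
    "translation pass rate": (36.8, 57.7),
    "actionable-response rubric score": (29.1, 44.7),
    "uncertainty-handling rubric score": (29.3, 44.8),
}

# Percentage-point gain of GPT-Rosalind over GPT-5.5 for each measure.
gains = {name: round(new - old, 1) for name, (old, new) in results.items()}

for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: +{gain} points")
```

On these numbers, translation shows the largest gain, at about 21 points.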

Performance dropped sharply when models had to work with files or web sources. GPT-Rosalind achieved a 45.1 percent pass rate on text-only tasks but fell to 28.1 percent on tasks involving artifacts or URLs. Models struggled to extract information from complex figures and large sequence files, then incorporate that evidence into final answers. Design, optimization, and prediction remained one of the hardest workflows, with a pass rate of 30.7 percent. Analysis tasks reached just 30.3 percent. Tasks requiring precise outputs produced even lower results - numeric tasks saw a 14.8 percent pass rate, while sequence or structure outputs hit 24 percent.
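The text-only and artifact pass rates can be cross-checked against the overall figure with a weighted average. The sketch below is a rough consistency check, not OpenAI's methodology: it assumes the 53 percent of tasks described earlier as requiring an external artifact is the same group as the "artifacts or URLs" split, which the article does not confirm. The function name is hypothetical.

```python
def blended_pass_rate(text_rate: float, artifact_rate: float, artifact_share: float) -> float:
    """Weighted average of two subgroup pass rates, in percent."""
    return (1 - artifact_share) * text_rate + artifact_share * artifact_rate

# Figures quoted in the article for GPT-Rosalind.
# ASSUMPTION: the 53 percent artifact share applies to the artifacts-or-URLs split.
overall_estimate = blended_pass_rate(text_rate=45.1, artifact_rate=28.1, artifact_share=0.53)

print(f"estimated overall pass rate: {overall_estimate:.1f} percent")
```

Under that assumption the estimate lands very close to the reported 36.1 percent overall pass rate, which suggests the published figures are internally consistent.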

Limits of the benchmark

OpenAI is clear about what LifeSciBench does not measure. The benchmark assesses performance on self-contained research tasks and does not establish whether AI systems accelerate drug discovery or improve research outcomes. The organization plans to connect future results with deployment studies examining model use across live research workflows, repeated rounds of reasoning, and experimental follow-up.

For research scientists evaluating AI tools for their own work, the benchmark provides a concrete signal about where current systems struggle. Models still fail on tasks requiring precise calculations, artifact interpretation, and multi-step experimental design. Those gaps matter in a field where small errors in sequence data or numeric outputs can cascade into wasted experiments.

Why this matters for research scientists

LifeSciBench offers a practical reference point for scientists considering AI assistance in their workflows. With an overall pass rate of 36.1 percent on expert-designed tasks, the results suggest that models can handle some scientific communication and translation tasks but remain unreliable for the design, analysis, and precise-output tasks that form the core of bench research. Scientists evaluating these tools should test them against their own domain-specific artifacts - figures, sequence files, and chemical structures - rather than relying on text-only prompts. The benchmark also underscores that AI training for research scientists needs to address artifact interpretation and multi-step reasoning, not just knowledge recall.
