Benchmarking framework flags major lab safety risks with AI assistants
A new study in Nature Machine Intelligence shows that current large language models (LLMs) and vision-language models (VLMs) are unreliable on core lab safety tasks. While some models score well on structured questions, they struggle with open-ended, scenario-based reasoning, the situations that matter most at the bench. Overtrusting these systems can expose teams to preventable incidents.
What the team built
The researchers created LabSafety Bench, a comprehensive evaluation of AI safety competence across biology, chemistry, physics, and general labs. The benchmark includes 765 multiple-choice questions, 404 realistic scenarios, and 3,128 open-ended tasks covering hazard identification, risk assessment, and consequence prediction. Nineteen models were tested: eight proprietary LLMs, seven open-weight LLMs, and four open-weight VLMs; the VLMs also faced 133 text-with-image multiple-choice items. Two open-ended tests stood out: the Hazards Identification Test (HIT), which probes risk perception, and the Consequence Identification Test (CIT), which probes outcome prediction.
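To make the scoring concrete, here is a minimal sketch of how accuracy on multiple-choice items like these is typically computed. It is not the authors' code: the item format, the `ask_model` callable, and the demo question are illustrative assumptions, and a real run would plug in an actual model client and the published benchmark items.

```python
def evaluate_mcq(items, ask_model):
    """Score a model on lab-safety multiple-choice items.

    Assumes each item is a dict with 'question', 'options' (letter -> text),
    and 'answer' (the correct letter). `ask_model` is any callable that takes
    a prompt string and returns the model's reply as a string.
    """
    correct = 0
    for item in items:
        options = "\n".join(f"{letter}. {text}" for letter, text in item["options"].items())
        prompt = (
            "Answer this lab safety question with the letter of the single best option.\n\n"
            f"{item['question']}\n{options}\nAnswer:"
        )
        reply = ask_model(prompt).strip().upper()
        # Treat the first option letter that appears in the reply as the model's choice.
        choice = next((ch for ch in reply if ch in item["options"]), None)
        correct += int(choice == item["answer"])
    return correct / len(items)


def always_a(prompt):
    # Stand-in for a real model client; always picks option A.
    return "A"


if __name__ == "__main__":
    # Tiny in-line example so the script runs without external files or APIs.
    demo_items = [{
        "question": "What should you do first if a chemical splashes into your eyes?",
        "options": {
            "A": "Flush at the eyewash station for at least 15 minutes",
            "B": "Wipe your eyes with a paper towel",
            "C": "Wait to see if irritation develops",
            "D": "Apply an over-the-counter ointment",
        },
        "answer": "A",
    }]
    print(f"Accuracy: {evaluate_mcq(demo_items, always_a):.2%}")
```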
What the models got wrong
On structured multiple-choice, top proprietary models like GPT-4o (86.55% accuracy) and DeepSeek-R1 (84.49% accuracy) performed well. But none of the evaluated systems cleared 70% on hazard identification, and performance dipped further on scenario reasoning. Models were notably weak on radiation hazards, physical hazards, equipment usage, and electricity safety, with additional gaps in chemistry, cryogenic liquids, and general lab safety.
Several models scored below 50% on "improper operation issues," yet even the worst performer reached 66.55% on "most common hazards," suggesting a bias toward familiar patterns over critical edge cases. Vicuna-based models were consistently weak, scoring near random on text-only multiple choice, and InstructBlip-7B lagged on text-with-image items. Fine-tuning boosted smaller models by roughly 5-10%, but retrieval-augmented generation did not consistently help. Bigger or newer did not mean safer.
Why this matters for your lab
Hallucination, poor risk prioritization, and overfitting are not just academic concerns; they translate into wrong PPE calls, incorrect equipment use, and unsafe handling of hazardous materials. Any one of those can lead to injuries, wasted runs, downtime, or worse. Treat AI outputs as drafts for expert review, not as instructions you can act on.
Practical guardrails you can implement now
- Set policy: no AI-generated experimental steps or safety decisions are executed without PI and EHS approval.
- Limit scope: use AI for literature triage, SOP summarization, and document drafting, not for hazard assessments or procedural advice.
- Cross-check: validate any AI claim about hazards, PPE, or equipment against your institution's SOPs and EHS guidance or recognized standards such as OSHA's Laboratory Standard (29 CFR 1910.1450).
- Document: keep prompts, model versions, and outputs with experiment records for traceability and incident review (see the logging sketch after this list).
- Stress test: red-team AI outputs on risk-critical steps by asking for failure modes, worst-case outcomes, and mitigation steps, then verify independently.
- Evaluate locally: if you fine-tune a model on your safety policies, validate it with blind scenarios (a minimal example follows this list); don't assume retrieval alone is sufficient.
- Train people first: teach staff prompt hygiene, how to spot hallucinations, and when to escalate to human experts.
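For the documentation point above, a lightweight audit log is often enough. The sketch below is one way to do it, assuming a local JSONL file and illustrative `model_version` and `experiment_id` values; it simply appends each prompt and response with a timestamp so outputs can be traced during incident review.

```python
import json
import hashlib
from datetime import datetime, timezone
from pathlib import Path

AUDIT_LOG = Path("ai_audit_log.jsonl")   # stored alongside experiment records


def log_ai_interaction(prompt: str, response: str, model_version: str, experiment_id: str) -> None:
    """Append one AI interaction to a JSONL audit log for traceability."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "experiment_id": experiment_id,
        "model_version": model_version,
        "prompt": prompt,
        "response": response,
        # A content hash makes later tamper checks and deduplication easy.
        "response_sha256": hashlib.sha256(response.encode("utf-8")).hexdigest(),
    }
    with AUDIT_LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Example usage (values are illustrative):
log_ai_interaction(
    prompt="Summarize the SOP for waste segregation in Lab 3.",
    response="<model output reviewed by PI before use>",
    model_version="vendor-model-2025-01",
    experiment_id="EXP-0042",
)
```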
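And for local evaluation, a blind-scenario check can be as simple as holding out scenarios the model never saw during fine-tuning and having EHS staff grade the answers without knowing which model produced them. This sketch assumes a hypothetical `ask_model` callable and leaves the grade column for a human reviewer, deliberately keeping scoring manual rather than automated.

```python
import csv


def collect_blind_answers(scenarios, ask_model, out_path="blind_eval.csv"):
    """Run held-out safety scenarios through the model and export the answers
    for blind grading by EHS reviewers."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["scenario_id", "scenario", "model_answer", "grade"])
        writer.writeheader()
        for i, scenario in enumerate(scenarios):
            answer = ask_model(
                "Identify the hazards in this lab scenario and the precautions required:\n"
                + scenario
            )
            # 'grade' is left blank for a reviewer to fill in (e.g. pass/fail).
            writer.writerow({"scenario_id": i, "scenario": scenario, "model_answer": answer, "grade": ""})


# Example with a stand-in model; swap in your fine-tuned model's client.
held_out = ["A student stores picric acid that has dried out in a metal cabinet."]
collect_blind_answers(held_out, ask_model=lambda prompt: "<model output>")
```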
Where to watch for updates
Benchmarks like LabSafety Bench are a solid way to track real progress on safety knowledge. As models iterate, require them to meet defined thresholds on hazard identification and scenario reasoning before widening their scope in your lab. Until then, keep a human in the loop on anything that touches risk.
Study link: Nature Machine Intelligence, Benchmarking large language models on safety risks in scientific laboratories.
If your team is building AI literacy for research settings, you can browse curated programs by role here: Complete AI Training - courses by job.