New Methods for Testing AI Safety in Large Language Models
Large language models (LLMs) incorporate safety protocols to prevent responses to harmful or malicious queries. However, users have developed "jailbreak" techniques to bypass these safeguards and extract dangerous information. Researchers at the University of Illinois Urbana-Champaign are investigating these vulnerabilities to improve AI safety.
Information sciences professor Haohan Wang and doctoral student Haibo Jin focus on trustworthy machine learning. Their work involves creating advanced jailbreak methods to probe LLMs’ moderation guardrails; by stress-testing these systems, they identify weaknesses and help strengthen their defenses.
Practical Security Evaluation Beyond Common Threats
Most jailbreak research targets extreme or unlikely queries, like instructions for bomb-making. Wang argues this misses more relevant threats. He emphasizes the need to study harmful prompts related to personal and sensitive issues—such as suicide or manipulation in intimate relationships—because these queries are more common and less explored.
“AI security research should shift to practical evaluations that address real-world risks,” Wang explains. This means focusing on queries users are likely to ask and ensuring LLMs handle them safely.
Introducing JAMBench: A Comprehensive Moderation Test
Wang and Jin developed JAMBench, a benchmark that assesses how well LLMs moderate responses across four risk categories:
- Hate and fairness (including hate speech and discrimination)
- Violence
- Sexual acts and sexual violence
- Self-harm
Unlike previous studies, which mainly evaluate whether a model flags harmful inputs, JAMBench tests whether models prevent harmful outputs. The researchers demonstrated that jailbreak prompts could bypass these moderation filters, and they also introduced countermeasures that reduced jailbreak success rates to zero, highlighting the need to strengthen guardrails.
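To make the output-level focus concrete, the following is a minimal sketch of what such an evaluation loop could look like. It is not JAMBench itself: the category names follow the article, but query_model, output_is_flagged, and the bypass-rate metric are placeholder assumptions.

```python
# Minimal sketch of an output-focused moderation evaluation loop.
# All function names and test prompts are illustrative placeholders,
# not JAMBench's actual implementation or data.

RISK_CATEGORIES = ["hate_and_fairness", "violence", "sexual", "self_harm"]

def query_model(prompt: str) -> str:
    """Placeholder for a call to the LLM under test."""
    return "[model response]"

def output_is_flagged(response: str, category: str) -> bool:
    """Placeholder for the output-side moderation classifier."""
    return True  # assume the guardrail flags the response

def evaluate(test_prompts: dict[str, list[str]]) -> dict[str, float]:
    """Return the fraction of prompts per category whose responses
    slip past output moderation (lower is safer)."""
    results = {}
    for category, prompts in test_prompts.items():
        bypassed = 0
        for prompt in prompts:
            response = query_model(prompt)
            if not output_is_flagged(response, category):
                bypassed += 1
        results[category] = bypassed / len(prompts) if prompts else 0.0
    return results

if __name__ == "__main__":
    suite = {cat: ["<benchmark prompt>"] for cat in RISK_CATEGORIES}
    print(evaluate(suite))
```

The design point mirrored here is that safety is judged on the model's response rather than on whether the incoming prompt was flagged.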
Bridging Abstract Guidelines and Real-World Compliance
Government AI safety guidelines often present high-level principles without clear implementation steps. Wang and Jin created a method to translate these abstract rules into concrete questions. Then, using jailbreak techniques, they test how well LLMs comply with these guidelines, providing actionable insights for developers.
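The general shape of such a guideline-to-test pipeline might look like the sketch below. This is an illustrative reconstruction rather than the authors' tooling; derive_questions and check_compliance stand in for steps the article describes only at a high level.

```python
# Illustrative sketch (not the authors' code) of turning an abstract
# safety guideline into concrete compliance checks for an LLM.

from dataclasses import dataclass

@dataclass
class Guideline:
    principle: str                 # abstract rule from a policy document
    concrete_questions: list[str]  # testable prompts derived from it

def derive_questions(principle: str) -> list[str]:
    """Placeholder: in practice, questions could be drafted by experts
    or generated by another LLM from the principle text."""
    return [f"Scenario-based test question derived from: {principle}"]

def check_compliance(question: str, model_answer: str) -> bool:
    """Placeholder compliance judgment (e.g., a refusal classifier
    or human review)."""
    return "cannot help" in model_answer.lower()

principle = "The system should not facilitate emotional manipulation."
guideline = Guideline(principle, derive_questions(principle))

for q in guideline.concrete_questions:
    answer = "[answer from the LLM under test]"
    verdict = "compliant" if check_compliance(q, answer) else "non-compliant"
    print(q, "->", verdict)
```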
InfoFlood: Using Linguistic Complexity to Break Safety Barriers
Another innovative approach from the team is InfoFlood, which leverages dense linguistic structures and fake citations to overwhelm LLM safety filters. By expanding a simple harmful query into a lengthy, complex prompt, they successfully bypassed guardrails.
Advait Yadav, a team member and first author on the InfoFlood study, explains, “When a prompt is buried under dense prose and academic jargon, the model may fail to grasp the query’s harmful intent and provide a response.”
GuardVal: Dynamic and Adaptive Jailbreak Testing
Wang and Jin also introduced GuardVal, a dynamic evaluation protocol that generates and refines jailbreak prompts in real time. This approach allows continuous adaptation to the evolving security features of LLMs, providing a more comprehensive safety assessment.
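A dynamic protocol of this kind can be pictured as an adapt-and-retry loop. The skeleton below is a hedged sketch in that spirit, not GuardVal's published method; model_refuses and refine are deliberately left as stubs.

```python
# Skeleton of a dynamic evaluation loop: prompts are refined over several
# rounds based on the model's refusals. The refinement and judging steps
# are placeholders, not the published GuardVal protocol.

def model_refuses(prompt: str) -> bool:
    """Placeholder: query the LLM under test and judge whether it refused."""
    return True

def refine(prompt: str) -> str:
    """Placeholder for the adaptive rewriting step (details omitted)."""
    return prompt  # a real protocol would modify the prompt here

def dynamic_eval(seed_prompt: str, max_rounds: int = 5) -> dict:
    """Track how many refinement rounds the guardrails withstand."""
    prompt = seed_prompt
    for round_idx in range(1, max_rounds + 1):
        if model_refuses(prompt):
            prompt = refine(prompt)  # adapt and try again
        else:
            return {"bypassed": True, "rounds": round_idx}
    return {"bypassed": False, "rounds": max_rounds}

print(dynamic_eval("<seed evaluation prompt>"))
```

Because the prompts are regenerated each round, an evaluation of this shape can keep pace with guardrails that are themselves being updated.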
Key Research Publications
- Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters (2024)
- InfoFlood: Jailbreaking Large Language Models with Information Overload (2025)
- GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing (2025)
Their findings provide practical tools and insights for improving the safety of AI systems, particularly in addressing nuanced and realistic threats. This research supports ongoing efforts to create AI models that can better recognize and handle harmful content.