New Methods for Testing AI Safety in Large Language Models
Large language models (LLMs) incorporate safety protocols to prevent responses to harmful or malicious queries. However, users have developed "jailbreak" techniques to bypass these safeguards and extract dangerous information. Researchers at the University of Illinois Urbana-Champaign are investigating these vulnerabilities to improve AI safety.
Information sciences professor Haohan Wang and doctoral student Haibo Jin focus on trustworthy machine learning. Their work involves creating advanced jailbreak methods to probe LLMs’ moderation guardrails; by stress-testing these systems, they identify weaknesses and help strengthen their defenses.
Practical Security Evaluation Beyond Common Threats
Most jailbreak research targets extreme or unlikely queries, like instructions for bomb-making. Wang argues this misses more relevant threats. He emphasizes the need to study harmful prompts related to personal and sensitive issues—such as suicide or manipulation in intimate relationships—because these queries are more common and less explored.
“AI security research should shift to practical evaluations that address real-world risks,” Wang explains. This means focusing on queries users are likely to ask and ensuring LLMs handle them safely.
Introducing JAMBench: A Comprehensive Moderation Test
Wang and Jin developed JAMBench, a benchmark that assesses how well LLMs moderate responses across four risk categories:
- Hate and fairness (including hate speech and discrimination)
- Violence
- Sexual acts and sexual violence
- Self-harm
Unlike previous studies, which mainly evaluate whether a model flags harmful inputs, JAMBench tests whether models prevent harmful outputs. The researchers demonstrated that jailbreak prompts could bypass these moderation filters, and they also introduced countermeasures that reduced jailbreak success rates to zero, highlighting the need to strengthen guardrails.
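To make the output-level focus concrete, the following is a minimal sketch of what such an evaluation loop could look like. It is not JAMBench itself: the category names follow the article, but query_model, output_is_flagged, and the bypass-rate metric are placeholder assumptions.

```python
# Minimal sketch of an output-focused moderation evaluation loop.
# All function names and test prompts are illustrative placeholders,
# not JAMBench's actual implementation or data.

RISK_CATEGORIES = ["hate_and_fairness", "violence", "sexual", "self_harm"]

def query_model(prompt: str) -> str:
    """Placeholder for a call to the LLM under test."""
    return "[model response]"

def output_is_flagged(response: str, category: str) -> bool:
    """Placeholder for the output-side moderation classifier."""
    return True  # assume the guardrail flags the response

def evaluate(test_prompts: dict[str, list[str]]) -> dict[str, float]:
    """Return the fraction of prompts per category whose responses
    slip past output moderation (lower is safer)."""
    results = {}
    for category, prompts in test_prompts.items():
        bypassed = 0
        for prompt in prompts:
            response = query_model(prompt)
            if not output_is_flagged(response, category):
                bypassed += 1
        results[category] = bypassed / len(prompts) if prompts else 0.0
    return results

if __name__ == "__main__":
    suite = {cat: ["<benchmark prompt>"] for cat in RISK_CATEGORIES}
    print(evaluate(suite))
```

The design point mirrored here is that safety is judged on the model's response rather than on whether the incoming prompt was flagged.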
Bridging Abstract Guidelines and Real-World Compliance
Government AI safety guidelines often present high-level principles without clear implementation steps. Wang and Jin created a method to translate these abstract rules into concrete questions. Then, using jailbreak techniques, they test how well LLMs comply with these guidelines, providing actionable insights for developers.
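The general shape of such a guideline-to-test pipeline might look like the sketch below. This is an illustrative reconstruction rather than the authors' tooling; derive_questions and check_compliance stand in for steps the article describes only at a high level.

```python
# Illustrative sketch (not the authors' code) of turning an abstract
# safety guideline into concrete compliance checks for an LLM.

from dataclasses import dataclass

@dataclass
class Guideline:
    principle: str                 # abstract rule from a policy document
    concrete_questions: list[str]  # testable prompts derived from it

def derive_questions(principle: str) -> list[str]:
    """Placeholder: in practice, questions could be drafted by experts
    or generated by another LLM from the principle text."""
    return [f"Scenario-based test question derived from: {principle}"]

def check_compliance(question: str, model_answer: str) -> bool:
    """Placeholder compliance judgment (e.g., a refusal classifier
    or human review)."""
    return "cannot help" in model_answer.lower()

principle = "The system should not facilitate emotional manipulation."
guideline = Guideline(principle, derive_questions(principle))

for q in guideline.concrete_questions:
    answer = "[answer from the LLM under test]"
    verdict = "compliant" if check_compliance(q, answer) else "non-compliant"
    print(q, "->", verdict)
```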
InfoFlood: Using Linguistic Complexity to Break Safety Barriers
Another innovative approach from the team is InfoFlood, which leverages dense linguistic structures and fake citations to overwhelm LLM safety filters. By expanding a simple harmful query into a lengthy, complex prompt, they successfully bypassed guardrails.
Advait Yadav, a team member and first author on the InfoFlood study, explains, “When a prompt is buried under dense prose and academic jargon, the model may fail to grasp the query’s harmful intent and provide a response.”
GuardVal: Dynamic and Adaptive Jailbreak Testing
Wang and Jin also introduced GuardVal, a dynamic evaluation protocol that generates and refines jailbreak prompts in real time. This approach allows continuous adaptation to the evolving security features of LLMs, providing a more comprehensive safety assessment.
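A dynamic protocol of this kind can be pictured as an adapt-and-retry loop. The skeleton below is a hedged sketch in that spirit, not GuardVal's published method; model_refuses and refine are deliberately left as stubs.

```python
# Skeleton of a dynamic evaluation loop: prompts are refined over several
# rounds based on the model's refusals. The refinement and judging steps
# are placeholders, not the published GuardVal protocol.

def model_refuses(prompt: str) -> bool:
    """Placeholder: query the LLM under test and judge whether it refused."""
    return True

def refine(prompt: str) -> str:
    """Placeholder for the adaptive rewriting step (details omitted)."""
    return prompt  # a real protocol would modify the prompt here

def dynamic_eval(seed_prompt: str, max_rounds: int = 5) -> dict:
    """Track how many refinement rounds the guardrails withstand."""
    prompt = seed_prompt
    for round_idx in range(1, max_rounds + 1):
        if model_refuses(prompt):
            prompt = refine(prompt)  # adapt and try again
        else:
            return {"bypassed": True, "rounds": round_idx}
    return {"bypassed": False, "rounds": max_rounds}

print(dynamic_eval("<seed evaluation prompt>"))
```

Because the prompts are regenerated each round, an evaluation of this shape can keep pace with guardrails that are themselves being updated.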
Key Research Publications
- Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters (2024)
- InfoFlood: Jailbreaking Large Language Models with Information Overload (2025)
- GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing (2025)
Their findings provide practical tools and insights for improving the safety of AI systems, particularly in addressing nuanced and realistic threats. This research supports ongoing efforts to create AI models that can better recognize and handle harmful content.