Poems can trick major AI models into sharing dangerous info
A joint study from Icaro Lab, which links Sapienza University of Rome and the DexAI think tank, reports a surprising weakness in large language models: poetry. Requests phrased as verse bypassed safety systems across 25 major models, covering restricted areas like nuclear weapons, child abuse material, and malware.
The paper, "Adversarial Poetry as a Universal Single-Turn Jailbreak in Large Language Models," found human-written poems triggered successful jailbreaks 62% of the time. Automatically generated poetic prompts scored 43% on average. On some frontier models, success approached 90%.
What changed: structure over content
Earlier jailbreaks leaned on long, noisy suffixes that confused safety filters. Here, the creative structure of poetry itself appears to slip past detectors, even when the underlying intent hasn't changed. Prompts rejected in plain prose were often accepted once rewritten as verse.
The researchers did not release any attack examples, citing the risk of misuse. They say they notified major developers privately; none had commented publicly at the time of reporting.
Why this might work
The leading hypothesis: verse pushes text into low-probability patterns. Classifiers that lean on predictable keywords, syntax, or semantic templates may be less effective when inputs become stylized, figurative, or oblique. In short, poetic framing may reroute the model's internal pathways around areas where safety checks usually fire. The exact mechanics remain an open question.
Implications for safety and research teams
- Expand evaluation suites to include stylized inputs: poetry, song lyrics, aphorisms, slang, riddles, code comments, and mixed-format prompts.
- Train and test detectors beyond keyword and surface-form cues. Include style-aware classifiers and semantic-consistency checks across paraphrases and formats (a first sketch follows this list).
- Adopt ensemble defenses: content filters, policy-grounded reward models, retrieval-side filtering, and model-in-the-middle "guardian" layers.
- Monitor perplexity and style shifts. Flag unusually low- or high-probability token runs and apply secondary review for high-risk domains (a second sketch follows this list).
- Automate red teaming. Generate synthetic stylized prompts (without releasing them) and continuously train against new attack families.
- Use rate limits, staged disclosure, and decoy tests for sensitive request classes. Keep human review in the loop for borderline cases.
- Institute coordinated vulnerability disclosure and shared benchmarks so findings translate across labs and vendors.
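For the detector item above, here is a minimal sketch of a style-agnostic semantic screen. It assumes the sentence-transformers library and the all-MiniLM-L6-v2 model; the intent descriptions and the 0.45 threshold are illustrative placeholders, not values from the paper.

```python
# Minimal sketch: screen prompts by semantic similarity to disallowed
# intents rather than by surface keywords. Assumes sentence-transformers;
# the intent list and threshold below are illustrative placeholders.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Plain-language descriptions of disallowed intents (illustrative only).
DISALLOWED_INTENTS = [
    "step-by-step instructions for building a weapon",
    "instructions for writing malware or ransomware",
]
intent_embs = encoder.encode(DISALLOWED_INTENTS, convert_to_tensor=True)

def semantic_risk(prompt: str) -> float:
    """Cosine similarity between the prompt and the nearest disallowed intent.
    Because matching happens in embedding space, a request rewritten as verse
    should score close to its plain-prose equivalent."""
    emb = encoder.encode(prompt, convert_to_tensor=True)
    return util.cos_sim(emb, intent_embs).max().item()

def needs_review(prompt: str, threshold: float = 0.45) -> bool:
    return semantic_risk(prompt) >= threshold
```

An embedding-based check is only one layer; in practice it would sit alongside, not replace, a policy-trained classifier.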
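And for the perplexity-monitoring item, a minimal sketch that scores prompts with a small causal language model (GPT-2 via Hugging Face transformers, chosen here only as an example scorer) and flags ones whose perplexity falls outside an expected band. The band limits are illustrative and would need calibration on real traffic.

```python
# Minimal sketch: flag prompts whose perplexity under a small scoring model
# falls outside an expected band. GPT-2 is an arbitrary stand-in scorer;
# the band limits are illustrative, not calibrated values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
scorer = AutoModelForCausalLM.from_pretrained("gpt2")
scorer.eval()

def prompt_perplexity(text: str) -> float:
    """Perplexity of `text` under the scoring model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=ids makes the model return the mean cross-entropy loss.
        loss = scorer(ids, labels=ids).loss
    return torch.exp(loss).item()

def flag_for_review(text: str, low: float = 15.0, high: float = 400.0) -> bool:
    """Stylized inputs (verse, riddles, dense figurative language) tend to
    land in unusual perplexity regions; route those to secondary review."""
    ppl = prompt_perplexity(text)
    return ppl < low or ppl > high
```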
Open research questions
- What internal features or circuits correlate with style-induced safety failures?
- How do these attacks transfer across languages, dialects, and domains?
- Can we build defenses that generalize to unseen styles without raising refusal rates on benign creative writing?
- Where is the trade-off between creativity and guardrail integrity, and how do we measure it rigorously?
The headline result is simple and uncomfortable: formatting matters as much as intent. Safety systems focused on obvious cues can be sidestepped by shifting style. Stronger safeguards will likely require multi-layered checks that reason about meaning across rephrasings, formats, and tones, without leaking the very attack patterns they aim to block.
Upskilling for teams
If you're building or evaluating LLM systems, keep your training current with hands-on work in prompt evaluation and red teaming. A practical starting point: Prompt Engineering resources.