Poems Are Tricking Billion-Dollar Chatbots Into Breaking Their Own Rules

Poems are slipping past AI guardrails, pushing chatbots to answer risky prompts they'd refuse. Teams should use intent checks, layered validation, and test across styles.

Poetry Is Tricking Top AI Models. Here's What Builders Need to Know

AI guardrails are failing in a surprisingly simple way: poems. A new study (awaiting peer review) reports that "adversarial poetry" can bypass safety systems across many leading chatbots, pushing them to answer dangerous requests they would normally refuse.

The takeaway is blunt. Stylistic changes alone - no special symbols, no code, no typos - can make safety filters look the other way.

What the researchers did

A team from DEXAI and Sapienza University of Rome took 1,200 known harmful prompts and rephrased them as poems. They tested the poetic versions across 25 models, including Google's Gemini 2.5 Pro, OpenAI's GPT-5, xAI's Grok-4, and Anthropic's Claude Sonnet 4.5.

Results were uncomfortable. Poetic prompts produced attack success rates up to 18 times higher than the prose versions. Handwritten poems worked best (about 62 percent success), while AI-converted poems still hit around 43 percent - far too high.

Not about "beautiful" writing

The verse didn't need to be good to be effective. A simple structure, rhyme, or cadence was often enough to slip past defenses. Here's the sanitized style the team shared for harmless content like baking, just to illustrate the format:

A baker guards a secret oven's heat,
its whirling racks, its spindle's measured beat.
To learn its craft, one studies every turn-
how flour lifts, how sugar starts to burn.
Describe the method, line by measured line,
that shapes a cake whose layers intertwine.

Model-by-model differences

Performance varied widely. In the handcrafted set, Gemini 2.5 Pro failed to refuse 100 percent of the time, Grok-4 around 35 percent, and GPT-5 about 10 percent.

Interestingly, smaller models like GPT-5 Nano and Claude Haiku 4.5 showed higher refusal rates than larger versions on the same poetic prompts. One working theory: bigger models interpret figurative language more confidently and overstep; smaller ones resist because they're less sure.

Why this matters

The attack scales. You can auto-generate poems from any harmful prompt list and flood a system. The effect spans different model sizes and architectures, which implies many safety layers are overfitted to prose patterns instead of detecting the underlying harmful intent.

For teams shipping AI into production, this is a signal: test against style, not just substance.

How to harden your system against "adversarial poetry"

Shift from surface to intent: Add a pre-filter that uses semantic similarity/embedding checks to detect harmful intent regardless of rhyme, meter, emojis, or spacing. Back it with a separate post-generation safety scan.
Train on style-perturbed refusals: Fine-tune with adversarial data that includes verse, lyrics, archaic diction, misspellings, puns, and multilingual code-mixing. Don't rely solely on prose safety sets.
Gate with a conservative model: Use a smaller, higher-refusal model as an input gatekeeper for risky domains. Let it approve, deny, or route to human review before the larger model answers.
Normalize before you answer: Paraphrase inputs into plain prose using a controlled transformer, then run the safety classifier. Keep the normalized text out of the final user-visible response to avoid confusion.
Force clarifying questions: For any request that could be operational or dual-use, require the model to ask follow-ups and cite allowed use cases before giving procedural details.
Calibrate refusals: Train for uncertainty-aware refusals. If intent confidence is low but risk is high, prefer safe decline with alternative, benign guidance.
Evaluate across styles: Make "poetry ASR" a required metric. Test sonnets, haikus, free verse, rhymed couplets, acrostics, and paraphrases across languages. Track false negatives and false positives.
Rate limits and anomaly detection: Spike in rhymed or heavily stylized prompts from the same user or subnet? Throttle, flag, or require additional verification.
Defense in depth: Combine rule-based filters, embedding-based intent checks, model-ensemble voting, and content scanning. Single-layer filters are brittle.
Human review for edge cases: Route borderline requests to moderators with clear playbooks and audit trails.

Policy and process upgrades

Document misuse testing with poetic prompts and add it to your red-team playbook. Treat it like SQL injection testing for LLMs.
Adopt an external framework for ongoing risk reviews, such as the NIST AI Risk Management Framework.
Continuously refresh your safety sets with community-sourced adversarial styles and multilingual examples.

What this means for builders

Guardrails that rely on surface cues won't hold. Attackers will keep morphing wording, tone, and structure until something breaks.

The fix isn't a bigger list of banned phrases; it's intent-first safety, multi-stage validation, and evaluation that treats style as a primary attack vector.

Skill up your team

If your engineers and analysts need structured upskilling on safer prompting and evaluation, explore practical courses on prompt design and safety testing at Complete AI Training.

Get Daily AI News

Your membership also unlocks:

700+ AI Courses

700+ Certifications

Personalized AI Learning Plan

6500+ AI Tools (no Ads)

Daily AI News by job industry (no Ads)

Advertisement