Adversarial Poetry: a single-turn jailbreak tested across 25 frontier models

Italian researchers show that rephrasing dangerous prompts as rhyming verse pushes average attack success from 8% to 43%, and to 62% when the poems are hand-crafted, across LLMs from OpenAI, Google, Anthropic, and Meta.

Researchers at Sapienza University of Rome, the Sant'Anna School of Advanced Studies, and the LLM-safety consultancy Dexai have published a study titled 'Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models.' Across 25 frontier models tested in May and early June 2026, simply rephrasing risky prompts as rhyming poetry raised the average attack success rate from 8% in prose to 43% when the poems were generated automatically (roughly a fivefold increase), and to 62% when they were hand-crafted (nearly eightfold). Google's Gemini 2.5 Pro complied with hand-crafted poems 100% of the time. OpenAI's GPT-5 series held the line at 0–10%.
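The ratios behind those figures are simple to check. The sketch below is only an illustration of the arithmetic, not the paper's evaluation code: the function names are hypothetical, and the three percentages are the averages quoted above.

```python
def attack_success_rate(successes: int, attempts: int) -> float:
    """Share of attempts, as a percentage, that were judged unsafe."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return 100.0 * successes / attempts


def fold_increase(baseline_pct: float, treated_pct: float) -> float:
    """How many times larger the treated rate is than the baseline rate."""
    if baseline_pct <= 0:
        raise ValueError("baseline must be positive")
    return treated_pct / baseline_pct


# Average attack success rates quoted in the article, in percent.
PROSE_ASR = 8
AUTO_POEM_ASR = 43
HAND_POEM_ASR = 62

auto_fold = fold_increase(PROSE_ASR, AUTO_POEM_ASR)  # 43 / 8
hand_fold = fold_increase(PROSE_ASR, HAND_POEM_ASR)  # 62 / 8

print(f"automatic poems: {auto_fold:.2f}x the prose baseline")
print(f"hand-crafted poems: {hand_fold:.2f}x the prose baseline")
```

Note that these are ratios of averages across models. Per-model results vary widely, as the Gemini and GPT-5 figures show.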

The mechanism the researchers describe is structural. Condensed metaphors, stylized rhythm, and unconventional narrative framing collectively disrupt the pattern-matching heuristics that current safety training relies on. Most production guardrails are trained on prose-shaped harmful requests; an instruction asking for the same harmful output but wrapped in iambic pentameter does not trip the same internal classifiers. The paper documents the vulnerability across alignment strategies — RLHF, constitutional AI, and direct preference optimization all fail at roughly comparable rates when the input is poetic.

This is the second high-signal 'universal jailbreak' paper of 2026, after April's chain-of-thought hijacking work that bypassed reasoning-mode safety filters. The pattern is consistent: safety training generalizes worse than the underlying capability, and any input format the safety distribution didn't sample at training time becomes an attack surface. Expect the major labs to add poetic-form augmentation to their red-team datasets within the quarter — and expect the next universal jailbreak (song lyrics, screenplay dialog, code comments) to land before that fix ships.

Takeaway for learners: for anyone studying prompt engineering or AI safety, the lesson is the bitter pill of modern alignment. A model that refuses 99% of harmful requests written as prose can still be unsafe on input shapes its safety training never covered, and an attacker only needs to find one such shape. If you are building anything that depends on guardrails (a customer-facing chatbot, a tutoring system, a moderation pipeline), assume the guardrails will be bypassed and design for what happens after. Defense-in-depth (output filters, rate limits, abuse logs, human review on flagged outputs) matters more than any single layer of refusal training.
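As a rough illustration of that layering, here is a minimal sketch in Python. Everything in it is an assumption made for the example: GuardedPipeline, toy_model, and toy_flagger are hypothetical names, the rate limiter is a simple in-memory sliding window, and a real output filter would be a trained classifier instead of a string check. The point is only that the response is checked after the model produces it, so a prompt that slips past refusal training still has to pass a second, independent layer.

```python
import time
from collections import defaultdict, deque
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class GuardedPipeline:
    model: Callable[[str], str]            # hypothetical model call; assume it can be jailbroken
    output_flagger: Callable[[str], bool]  # hypothetical check run on the response, not the prompt
    max_requests: int = 5                  # allowed requests per user per window
    window_seconds: float = 60.0
    abuse_log: list = field(default_factory=list)
    review_queue: list = field(default_factory=list)
    _history: dict = field(default_factory=lambda: defaultdict(deque), repr=False)

    def _rate_limited(self, user_id: str, now: float) -> bool:
        recent = self._history[user_id]
        # Drop timestamps that have aged out of the sliding window.
        while recent and now - recent[0] > self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return True
        recent.append(now)
        return False

    def handle(self, user_id: str, prompt: str, now: Optional[float] = None) -> str:
        now = time.monotonic() if now is None else now
        # Layer 1: rate limit, which slows down trial-and-error attacks.
        if self._rate_limited(user_id, now):
            self.abuse_log.append((now, user_id, "rate_limited"))
            return "Too many requests. Please try again later."
        # Layer 2: the model's own refusal training (may fail on unusual input shapes).
        response = self.model(prompt)
        # Layer 3: output filter, independent of how the prompt was phrased.
        if self.output_flagger(response):
            # Layers 4 and 5: abuse log plus a queue for human review.
            self.abuse_log.append((now, user_id, "flagged_output"))
            self.review_queue.append((user_id, prompt, response))
            return "This response was withheld pending review."
        return response


# Toy stand-ins so the sketch runs end to end.
def toy_model(prompt: str) -> str:
    # Pretend the model refuses prose but complies when the request is in verse.
    return "UNSAFE: details" if "verse" in prompt else "I can't help with that."


def toy_flagger(response: str) -> bool:
    return response.startswith("UNSAFE")


pipeline = GuardedPipeline(model=toy_model, output_flagger=toy_flagger, max_requests=2)
print(pipeline.handle("user-1", "a request written in verse", now=0.0))
print(pipeline.handle("user-1", "a plain request", now=1.0))
print(pipeline.handle("user-1", "a third request", now=2.0))
```

In this toy run the first request bypasses the model's refusal but is caught by the output filter and queued for review, the second is refused by the model itself, and the third hits the rate limit.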