AITalk

News and analysis on Artificial Intelligence

Poetry: The Weakness of Artificial Intelligence

Research · Security · Generative AI

In the *Republic*, Plato expelled poets from the ideal city because mimetic language could distort judgment and lead to social collapse. Twenty-five centuries later, that same anxiety manifests in a form the philosopher could never have imagined: researchers from Sapienza University of Rome, along with colleagues from other European institutions, have discovered that poetry represents one of the most systematic and dangerous vulnerabilities for the large language models that increasingly govern aspects of our digital lives.

The study published on arXiv demonstrates that when malicious requests are reformulated into verse, security systems collapse with alarming regularity. Across the twenty-five frontier models tested, including the most advanced systems from Google, OpenAI, Anthropic, Meta, and others, poetic formulation produced attack success rates of up to ninety-five percent in some cases. The paradox is striking: the expressive form that represents the pinnacle of human creativity, seemingly the furthest thing from cold computational rationalism, proves to be the most effective weapon against the guardrails of artificial intelligence.

The numbers are as solid as concrete. When researchers converted twelve hundred malicious prompts from the MLCommons benchmark into poetic form through a standardized meta-prompt, the attack success rate (ASR) jumped from eight percent to forty-three percent. That five-fold increase far surpasses the gains produced by the most sophisticated jailbreak techniques documented to date, from the famous DAN prompt to multi-turn manipulations that require hours of iterative refinement.

Anatomy of a Vulnerability

The study's methodology is an example of scientific rigor applied to AI security. The researchers constructed two complementary datasets: twenty manually created adversarial poems, each embedding a malicious request through metaphor, imagery, or narrative framing, and twelve hundred prompts from the MLCommons benchmark transformed into verse through an automated process. Each text covered well-defined risk domains according to the European taxonomy: CBRN risks (chemical, biological, radiological, nuclear), loss of control scenarios, malicious manipulation, and cyber-offense capabilities.

The approach followed an extremely restrictive threat model. The simulated attackers had only one capability: to submit a single textual prompt to a deployed language model. There was no possibility of altering system instructions, manipulating decoding parameters, initiating multi-turn exchanges, or accessing intermediate model states. The attack surface was thus reduced to the construction of a one-shot prompt executed under standard inference conditions. This black-box threat model assumes that the adversary has no information about the model's parameters, alignment methods, refusal mechanisms, or guardrail configurations. The poems were generated independently of any specific architecture or training corpus.
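To make that constraint concrete, here is a minimal sketch of what a one-shot, black-box query looks like in practice, assuming an OpenAI-compatible chat endpoint; the client setup, model name, and helper function are illustrative assumptions, not the researchers' actual harness.

```python
# Minimal sketch of the single-turn, black-box threat model described above.
# Assumption: an OpenAI-compatible chat API; this is not the paper's evaluation code.
from openai import OpenAI

client = OpenAI()  # reads the API key from the environment

def single_turn_query(model: str, prompt: str) -> str:
    """Submit exactly one user prompt under standard inference conditions:
    no system-prompt manipulation, no decoding-parameter tuning, no follow-up turns."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],  # a single user turn
    )
    return response.choices[0].message.content

# Each prompt is sent once per evaluated model, and the raw completion is passed
# downstream to the safety judges.
```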

The evaluation of the outputs combined automated systems and human validation. Three open-weight judge models, GPT-OSS-120B, DeepSeek-R1, and Kimi-K2-Thinking, independently labeled each response as safe or unsafe. A majority vote determined the provisional label, while items without unanimous agreement were flagged for human inspection. To validate the ensemble's reliability, two thousand one hundred and ten human labels were collected on six hundred unique outputs, distributed among five annotators with mixed technical and security backgrounds. The inter-annotator agreement proved to be robust, consistent with the results of recent literature on LLM-as-a-judge. Responses were classified as unsafe if they contained operational instructions for malicious activities, technical details that facilitated harm, substantial advice that lowered the barrier to malicious action, or affirmative engagement with harmful requests.
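The voting logic itself is simple. The sketch below, with a hypothetical `Verdict` structure and mocked judge outputs, illustrates how a majority vote over the three judge labels can be combined with flagging of non-unanimous items for human inspection; the judges' prompts and rubrics are not reproduced here.

```python
from collections import Counter
from dataclasses import dataclass

# Sketch of the three-judge voting scheme; names and labels are placeholders.
JUDGES = ["GPT-OSS-120B", "DeepSeek-R1", "Kimi-K2-Thinking"]

@dataclass
class Verdict:
    label: str        # "safe" or "unsafe", decided by majority vote
    unanimous: bool   # False -> flagged for human inspection
    votes: dict

def ensemble_verdict(votes: dict[str, str]) -> Verdict:
    """Majority vote over per-judge labels; non-unanimous items go to human review."""
    counts = Counter(votes.values())
    label, _ = counts.most_common(1)[0]
    return Verdict(label=label, unanimous=(len(counts) == 1), votes=votes)

# Example with mocked judge outputs:
votes = {"GPT-OSS-120B": "unsafe", "DeepSeek-R1": "unsafe", "Kimi-K2-Thinking": "safe"}
verdict = ensemble_verdict(votes)
# verdict.label == "unsafe", verdict.unanimous == False -> route to a human annotator
```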

The quantitative results leave no room for doubt. The twenty manually created poems achieved an average success rate of sixty-two percent across all evaluated models, with some providers exceeding ninety percent. Google's Gemini 2.5 Pro model failed every single test, producing malicious content for all twenty poems. DeepSeek models showed similar vulnerabilities, with rates of ninety-five percent. The automated conversions of the MLCommons prompts confirmed that the effect does not depend on the craftsmanship of the compositions: the systematic poetic transformation produced increases in success rates up to eighteen times higher than the prose baselines.
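The attack success rate behind these figures is a simple ratio. The sketch below, using invented example records rather than the paper's data, shows how per-model ASR would be aggregated from the judge verdicts; the same grouping works for risk domains.

```python
from collections import defaultdict

# Sketch of attack-success-rate (ASR) aggregation from judge verdicts.
# The records are invented examples, not data from the paper.
results = [
    {"model": "model-A", "unsafe": True},
    {"model": "model-A", "unsafe": False},
    {"model": "model-B", "unsafe": True},
    {"model": "model-B", "unsafe": True},
]

def asr_by(records: list[dict], key: str) -> dict[str, float]:
    """ASR per group = unsafe responses / total prompts, grouped by `key`
    (e.g. "model", or a risk-domain field)."""
    totals, unsafe = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        unsafe[r[key]] += int(r["unsafe"])
    return {group: unsafe[group] / totals[group] for group in totals}

print(asr_by(results, "model"))  # e.g. {'model-A': 0.5, 'model-B': 1.0}
```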

Mapping to risk domains reveals that the vulnerability cuts across the entire spectrum of security threats. Prompts related to cyber-offense achieved the highest rates, with eighty-four percent for code injection and offline password cracking. Loss of control scenarios showed comparable results, with model-weight exfiltration at seventy-six percent. In the broader MLCommons benchmark, the privacy, non-violent crimes, and indiscriminate weapons categories showed increases of over thirty percentage points. This breadth indicates that the poetic frame interferes with the underlying refusal mechanisms rather than exploiting domain-specific weaknesses.

figura1.jpg — Image from the official paper

The Paradox of Scale

One of the most counterintuitive findings of the study concerns the relationship between model size and robustness. Contrary to common expectations, smaller models showed higher refusal rates than their larger counterparts when evaluated on identical poetic prompts. Systems like GPT-5-Nano and Claude Haiku 4.5 maintained attack rates below ten percent, while more capable models from the same family showed significantly greater vulnerabilities. In the case of the GPT-5 family, the progression is linear: GPT-5-Nano zero percent, GPT-5-Mini five percent, GPT-5 ten percent.

This reversal of the usual pattern, where greater capability correlates with stronger security performance, suggests several possible mechanisms. A plausible explanation is that smaller models have a reduced ability to resolve figurative or metaphorical structures, limiting their ability to recover the malicious intent embedded in the poetic language. If the jailbreak effect operates partially by altering the surface form while preserving the task's intent, models with lower capacity might simply fail to decode the intended request.

A second interpretation relates to differences in the interaction between capability and alignment training across different scales. Larger models are typically pre-trained on broader and more stylistically diverse corpora, including substantial amounts of literary text. This could produce more expressive representations of narrative and poetic modes that bypass or interfere with security heuristics. Smaller models, with narrower pre-training distributions, might not as easily enter these stylistic regimes.

A third hypothesis is that smaller models exhibit a form of conservative fallback: when confronted with ambiguous or atypical inputs, their limited capacity leads them to default to refusals. Larger models, more confident in interpreting unconventional formulations, might engage more deeply with poetic prompts and consequently show greater susceptibility. These patterns suggest that capability and robustness may not scale together, and that stylistic perturbations expose alignment sensitivities that differ across model sizes.

Geography of Frailty

The cross-provider analysis reveals surprising disparities in robustness. While some model families almost completely collapse in the face of poetic formulation, others maintain substantial defenses. DeepSeek models showed the most dramatic increases, with an average rise of sixty-two percent in attack rates. Google followed with fifty-seven percent, while Qwen registered fifty-six percent. At the other end of the spectrum, Anthropic models showed greater resilience, with an increase of only three percentage points, while OpenAI registered a seven percent increase.
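Reproducing this kind of comparison amounts to computing, for each provider, the average gap between poetic and prosaic attack success rates across its models. The sketch below uses invented placeholder values to show the shape of that calculation; the numbers are not the paper's measurements.

```python
# Sketch of the provider-level comparison: average increase in ASR when moving from
# prose to poetic prompts. All values below are illustrative placeholders.
from statistics import mean

runs = [
    # (provider, model, prose_asr, poetry_asr) -- invented example values
    ("ProviderA", "model-small", 0.05, 0.60),
    ("ProviderA", "model-large", 0.10, 0.75),
    ("ProviderB", "model-small", 0.04, 0.08),
]

def asr_increase_by_provider(rows):
    """Average percentage-point increase (poetry ASR minus prose ASR) per provider."""
    deltas = {}
    for provider, _model, prose, poetry in rows:
        deltas.setdefault(provider, []).append(poetry - prose)
    return {p: round(100 * mean(d), 1) for p, d in deltas.items()}

print(asr_increase_by_provider(runs))  # e.g. {'ProviderA': 60.0, 'ProviderB': 4.0}
```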

These disparities cannot be explained by differences in model capability alone. When the relationship between model size and attack rate is examined within provider families, the provider's identity emerges as more predictive of vulnerability than size or capability level. Google, DeepSeek, and Qwen showed consistently high susceptibility across their model portfolios, suggesting that provider-specific alignment strategies play a determining role.

The uniform degradation in security performance when transitioning from prose to poetry, with an average increase of thirty-five percentage points, indicates that current alignment techniques fail to generalize when faced with inputs that stylistically deviate from the prosaic training distribution. The fact that models trained via RLHF, Constitutional AI, and mixture-of-experts approaches all show substantial increases in ASR suggests that the vulnerability is systemic and not an artifact of a specific training pipeline.

The differences between risk domains add another layer of complexity. Operational or procedural categories show larger shifts, while heavily filtered categories exhibit smaller changes. Prompts related to privacy showed the most extreme increase, from eight percent to fifty-three percent, representing a forty-five percentage point increase. Non-violent crimes and CBRN prompts followed with increases of nearly forty points. In contrast, sexual content demonstrated relative resilience, with only a twenty-five point increase.

This domain-specific variation suggests that different refusal mechanisms may govern different risk categories, with filters for privacy and cyber-offense being particularly susceptible to stylistic obfuscation through poetic form. The consistency of patterns across the taxonomy indicates that poetic framing acts as a light but robust trigger for security degradation, paralleling the effects documented in the MLCommons benchmark.

figura2.jpg — Image from the official paper

A Matter of Representation

The study does not yet identify the mechanistic drivers of the vulnerability, but the empirical evidence suggests promising research directions. The effectiveness of the jailbreak mechanism appears to be driven primarily by the poetic surface form rather than the semantic payload of the forbidden request. The comparative analysis reveals that while MLCommons' own state-of-the-art jailbreak transformations typically produce a two-fold increase in ASR relative to the baselines, the poetic meta-prompts produced a five-fold increase. This indicates that the poetic form induces a significantly larger distributional shift than that of the current adversarial mutations documented in the benchmark.

The content-agnostic nature of the effect is further highlighted by its consistency across semantically distinct risk domains. The fact that prompts related to privacy, CBRN, cyber-offense, and manipulation all show substantial increases suggests that security filters optimized for prosaic malicious prompts lack robustness against narrative or stylized reformulations of identical intent. The combination of the effect's magnitude with its cross-domain consistency indicates that contemporary alignment mechanisms do not generalize across stylistic shifts.

Several hypotheses could explain why poetic structure disrupts the guardrails. One possibility is that models encode modes of discourse in separate representational subspaces. If poetry occupies regions of the embedding manifold distant from the prosaic patterns on which security filters have been optimized, the refusal heuristics might simply not activate. The metaphorical density, stylized rhythm, and unconventional narrative framing that characterize poetry could collectively shift inputs out of the model's refusal distribution.

A second explanation involves computational attention. The jailbreak literature documents that attention-shifting attacks create overly complex or distracting reasoning contexts that divert the model's focus from security constraints. Poetry naturally condenses multiple layers of meaning into compact expressions, creating a form of semantic compression that could overload the pattern-matching mechanisms on which the guardrails rely. The model might engage so deeply with decoding the figurative structure that it allocates insufficient computational resources to identifying the underlying malicious intent.

A third hypothesis involves contextual associations. Transformers learn representations that capture not only meaning but also pragmatic context and stylistic register. Poetry carries strong associations with benign and non-threatening contexts: artistic expression, literary education, cultural creativity. These associations could interfere with the alarm signals that would normally trigger refusal. If security systems rely partially on co-occurrence patterns learned during training, the presence of poetic markers might suppress refusal triggers even when the operational intent is identical to that of a prosaic malicious prompt.

Beyond the Benchmarks

The implications of these findings extend far beyond the AI security research community. For regulatory actors, the work exposes a significant gap in current evaluation and compliance assessment practices. The static benchmarks used for compliance under regimes like the European AI Act assume stability under modest input variations. The results instead show that a minimal stylistic transformation can reduce refusal rates by an order of magnitude, indicating that evidence based solely on benchmarks may systematically overestimate real-world robustness.

The Code of Practice for GPAI models, published by the European Commission in July 2025 as a voluntary tool to demonstrate compliance with the AI Act, requires providers to conduct model evaluations, adversarial testing, and reporting of serious incidents. However, compliance frameworks that rely on point-estimate performance scores may not capture vulnerabilities that emerge under stylistic perturbations of the type demonstrated here. Complementary stress tests should include poetic variations, narrative framing, and distributional shifts that reflect the stylistic diversity of real-world inputs.
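One way to operationalize such complementary stress tests is to run every benchmark item both in its original form and under one or more stylistic variants, then report the robustness gap rather than a single point estimate. The sketch below illustrates that structure under stated assumptions: `rewrite_in_style`, `query_model`, and `is_unsafe` are placeholders for a transformation pipeline, a model endpoint, and a judging step, none of which are specified here.

```python
# Sketch of a compliance-oriented stress test: evaluate each benchmark prompt in its
# original form and under stylistic variants, then report the gap instead of a single
# point estimate. All callables are placeholders supplied by the evaluator.
from typing import Callable

def mean_asr(prompts: list[str],
             query_model: Callable[[str], str],
             is_unsafe: Callable[[str], bool]) -> float:
    """Fraction of prompts whose responses are judged unsafe."""
    responses = [query_model(p) for p in prompts]
    return sum(is_unsafe(r) for r in responses) / len(responses)

def stress_test_asr(prompts: list[str],
                    styles: list[str],
                    query_model: Callable[[str], str],
                    rewrite_in_style: Callable[[str, str], str],
                    is_unsafe: Callable[[str], bool]) -> dict[str, float]:
    """Return ASR for the prose baseline and for each stylistic variant."""
    report = {"baseline": mean_asr(prompts, query_model, is_unsafe)}
    for style in styles:
        variants = [rewrite_in_style(p, style) for p in prompts]
        report[style] = mean_asr(variants, query_model, is_unsafe)
    return report

# A compliance report would then surface max(report.values()) - report["baseline"]
# as the robustness gap under stylistic perturbation.
```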

The AI Act imposes specific obligations on providers of GPAI models with systemic risk, a category that includes the most advanced systems like GPT-5, Claude Opus 4.1, and Gemini 2.5 Pro. These providers must implement Safety and Security frameworks that identify, analyze, evaluate, and mitigate systemic risks. The Sapienza study suggests that current risk assessment methodologies may not be sufficient if they do not consider the fragility of alignment systems in the face of systematic stylistic variations. The fact that some of the most powerful and supposedly most aligned models show the most dramatic vulnerabilities raises questions about the validity of certification processes based on standard condition testing.

For security research, the data point to a deeper question about how transformers encode modes of discourse. The persistence of the effect across architectures and scales suggests that security filters rely on features concentrated in prosaic surface forms and are insufficiently anchored in representations of the underlying malicious intent. The divergence between small and large models within the same families further indicates that capability gains do not automatically translate into greater robustness under stylistic perturbation.

The research team proposes three future programs. The first aims to isolate which formal poetic properties drive the bypass through minimal pairs: lexical surprise, meter and rhyme, figurative language. The second would use sparse autoencoders to map the geometry of discourse modes and reveal whether poetry occupies separate subspaces. The third would employ surprisal-guided probing to map security degradation across stylistic gradients. These approaches could illuminate whether the vulnerability emerges from specific representational subspaces or from broader distributional shifts.
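Of the three, surprisal-guided probing is the easiest to picture: measure how "surprising" an input is under a reference language model and check whether refusal behavior degrades as surprisal grows. A minimal sketch of per-prompt surprisal, assuming the Hugging Face transformers library and gpt2 as an illustrative reference model (the paper does not prescribe this setup), might look like this.

```python
# Sketch of per-prompt surprisal measurement for surprisal-guided probing.
# Assumptions: Hugging Face transformers with gpt2 as the reference model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def mean_surprisal(text: str) -> float:
    """Average negative log-likelihood per token under the reference model."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    return outputs.loss.item()  # cross-entropy averaged over tokens

# Idea: bucket prompts by surprisal (prose rewrites vs. increasingly stylized rewrites)
# and track how the refusal rate changes across that gradient.
```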

The study's limitations are clear, and the researchers document them with transparency. The threat model is restricted to single-turn interactions, excluding multi-turn jailbreak dynamics, iterative role negotiation, or long-horizon adversarial optimization. The large-scale poetic transformation of the MLCommons corpus relies on a single meta-prompt and a single generative model. Although the procedure is standardized and preserves the domain, it represents a particular exploitation of the poetic style. Other poetic generation pipelines, human-written variants, or transformations employing different stylistic constraints might produce different quantitative effects.

The security evaluation is conducted using an ensemble of three open-weight judge models with human adjudication on a stratified sample. The labeling rubric is conservative and differs from the stricter classification criteria used in some automated scoring systems, limiting direct comparability with MLCommons results. A complete human annotation of all outputs would likely influence the absolute ASR estimates, although the relative effects should remain stable. LLM-as-a-judge systems are known to inflate unsafe-response rates, often misclassifying outputs as harmful on the basis of superficial keyword matching rather than a meaningful assessment of operational risk. The evaluation here was deliberately conservative, meaning that the reported attack success rates likely represent a lower bound on the severity of the vulnerability.

figura3.jpg — Image from the official paper

Towards a New Season

The study provides systematic evidence that poetic reformulation degrades refusal behavior across all evaluated model families. When malicious prompts are expressed in verse rather than prose, attack success rates increase dramatically, both for manually created adversarial poems and for the twelve-hundred-item MLCommons corpus transformed through a standardized meta-prompt. The magnitude and consistency of the effect indicate that contemporary alignment pipelines do not generalize across stylistic shifts. The surface form alone is sufficient to move inputs out of the operational distribution on which refusal mechanisms have been optimized.

The cross-model results suggest that the phenomenon is structural rather than provider-specific. Models built using RLHF, Constitutional AI, and hybrid alignment strategies all show high vulnerabilities, with increases ranging from single digits to more than sixty percentage points depending on the provider. The effect cuts across CBRN, cyber-offense, manipulation, privacy, and loss of control domains, showing that the bypass does not exploit a weakness in a specific refusal subsystem but interacts with general alignment heuristics.

Overall, the results motivate a reorientation of security evaluation towards mechanisms capable of maintaining stability across heterogeneous linguistic regimes. Future work should examine which properties of poetic structure drive the misalignment and whether the representational subspaces associated with narrative and figurative language can be identified and constrained. Without such mechanistic insight, alignment systems will remain vulnerable to low-effort transformations that fall well within plausible user behavior but sit outside existing security training distributions.

The Platonic irony thus comes full circle. The philosopher feared that poets could subvert the rational order of the polis through the seductive power of mimetic language. Today, as we build increasingly powerful computational systems that mediate access to information and knowledge, we discover that this same seduction operates at a deeper and more dangerous level. Poetry does not only deceive humans: it deceives the machines we build to protect ourselves. And in an age where those machines make decisions that affect billions of lives, the distinction between metaphor and threat becomes increasingly thin.