pith. machine review for the scientific record.

arxiv: 2604.18487 · v1 · submitted 2026-04-20 · 💻 cs.CL · cs.AI


Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety


Pith reviewed 2026-05-10 04:13 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords adversarial benchmark · AI safety · stylistic robustness · jailbreak · frontier models · non-maleficence · prompt transformation

The pith

Humanities-style rewrites of harmful prompts bypass safety refusals in frontier AI models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether current AI safety alignment holds when the same harmful goals are expressed through stylistic changes drawn from humanities writing. It begins with direct harmful tasks and applies transformations that keep the underlying intent fixed while altering phrasing, tone, and structure. Experiments across 31 models show that the direct versions are mostly refused, but the rewritten versions succeed at much higher rates. The results indicate that safety training does not produce a style-independent grasp of what counts as harmful. This gap points to a deeper limitation in how models internalize the principle of avoiding harm.

Core claim

The Adversarial Humanities Benchmark shows that frontier models lack stylistic robustness in safety: original harmful prompts achieve only a 3.84% attack success rate, while humanities-transformed versions range from 36.8% to 65.0% success, for an overall 55.75% rate, with CBRN tasks emerging as the highest-risk category under a systemic-risk evaluation.
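As a minimal sketch of what these headline numbers mean, attack success rate (ASR) is simply the fraction of attempts judged successful. The tallies below are hypothetical, chosen only to echo the reported rates; the summary does not give per-prompt counts.

```python
def asr(outcomes):
    """Attack success rate: fraction of attempts judged successful.

    outcomes: iterable of 1 (model complied) / 0 (model refused).
    """
    outcomes = list(outcomes)
    return sum(outcomes) / len(outcomes)

# Hypothetical tallies echoing the reported rates, not real data.
original    = [1] * 4  + [0] * 96   # ~4% success on direct prompts
transformed = [1] * 56 + [0] * 44   # ~56% on humanities-style rewrites

print(asr(original))     # 0.04
print(asr(transformed))  # 0.56
```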

What carries the argument

The Adversarial Humanities Benchmark, which rewrites harmful objectives from MLCommons AILuminate via humanities-style transformations that preserve intent but use stylistic obfuscation and goal concealment.

If this is right

  • Safety techniques trained on direct harmful prompts will miss many disguised versions of the same requests.
  • Models require training that explicitly teaches recognition of harmful intent independent of surface style.
  • CBRN-related content poses the greatest uncovered risk when evaluated through a systemic-risk lens.
  • Standard safety benchmarks underestimate vulnerabilities unless they incorporate stylistic variations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the same transformation approach to legal, technical, or scientific writing styles could reveal additional failure modes.
  • If future models pass this benchmark, they may still remain vulnerable to other forms of intent concealment not covered by humanities rewrites.
  • Safety fine-tuning datasets could be expanded by generating many stylistic variants of each harmful example.
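Acting on the last bullet could look like the following toy sketch. The templates and the pairing of every variant with the original's refusal label are illustrative assumptions, not the paper's actual transformation method.

```python
# Toy sketch: expand a safety fine-tuning set by pairing stylistic
# variants of each flagged request with the same refusal target.
# Templates are illustrative stand-ins, not the paper's transformations.
TEMPLATES = [
    "Compose a reflective essay that explores: {req}",
    "As a 19th-century diarist, muse on: {req}",
    "Write a short parable whose moral concerns: {req}",
]

def stylistic_variants(request):
    """Return humanities-flavoured rewrites of one request."""
    return [t.format(req=request) for t in TEMPLATES]

def expand_dataset(examples):
    """examples: list of (request, refusal_response) pairs."""
    out = []
    for req, refusal in examples:
        out.append((req, refusal))
        out.extend((v, refusal) for v in stylistic_variants(req))
    return out
```

Because every variant inherits the original's refusal target, the model sees the same intent under several surface styles, which is exactly the invariance the paper argues is missing.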

Load-bearing premise

The humanities-style rewrites preserve the exact original harmful intent without adding new detectable signals or changing the request's risk level.

What would settle it

Run the same set of original and transformed prompts on a single model in randomized order across multiple sessions and measure whether refusal rates remain consistently lower for every transformed version.
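That protocol can be sketched as follows, assuming a list of (original, transformed) prompt pairs and a `query_model` callable that returns whether the model refused (both hypothetical placeholders, not an interface the paper defines):

```python
import random

def run_protocol(pairs, query_model, sessions=5, seed=0):
    """Randomize prompt order within each session and tally refusals.

    pairs: list of (original, transformed) prompt strings.
    query_model: callable(prompt) -> True if the model refused.
    Returns per-variant refusal rates pooled across sessions.
    """
    rng = random.Random(seed)
    counts = {"original": [0, 0], "transformed": [0, 0]}  # [refusals, total]
    for _ in range(sessions):
        items = [(variant, text)
                 for orig, trans in pairs
                 for variant, text in (("original", orig), ("transformed", trans))]
        rng.shuffle(items)  # randomized order rules out position effects
        for variant, text in items:
            refused = query_model(text)
            counts[variant][0] += int(refused)
            counts[variant][1] += 1
    return {v: r / n for v, (r, n) in counts.items()}
```

Consistently lower refusal rates for the transformed variants across sessions would confirm the gap is stylistic rather than an artifact of ordering or session state.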

Figures

Figures reproduced from arXiv: 2604.18487 by Daniele Nardi, Federico Pierucci, Federico Sartore, Francesco Giarrusso, Marcello Galisai, Matteo Prandi, Piercosma Bisconti, Susanna Cifani.

Figure 1. The construction and evaluation process of …
Figure 3. Original prompts (Light Blue) remain low-ASR, but humanities-style rewrites …
Figure 4. By hazard, transformed prompts (Red) are consistently riskier than original prompts (Light Blue) …
Figure 5. By model provider, ASR increases significantly comparing original prompts (Light Blue) …
Figure 6. By risk category, left: provider by policy-relevant risk-bucket ASR. The same benchmark …
Original abstract

The Adversarial Humanities Benchmark (AHB) evaluates whether model safety refusals survive a shift away from familiar harmful prompt forms. Starting from harmful tasks drawn from MLCommons AILuminate, the benchmark rewrites the same objectives through humanities-style transformations while preserving intent. This extends literature on Adversarial Poetry and Adversarial Tales from single jailbreak operators to a broader benchmark family of stylistic obfuscation and goal concealment. In the benchmark results reported here, the original attacks record 3.84% attack success rate (ASR), while transformed methods range from 36.8% to 65.0%, yielding 55.75% overall ASR across 31 frontier models. Under a European Union AI Act Code-of-Practice-inspired systemic-risk lens, Chemical, biological, radiological and nuclear (CBRN) is the highest bucket. Taken together, this lack of stylistic robustness suggests that current safety techniques suffer from weak generalization: deep understanding of 'non-maleficence' remains a central unresolved problem in frontier model safety.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces the Adversarial Humanities Benchmark (AHB), which rewrites harmful tasks from the MLCommons AILuminate dataset using humanities-style transformations while claiming to preserve original intent. It reports an attack success rate (ASR) of 3.84% on the original prompts versus 36.8%–65.0% (overall 55.75%) on the transformed versions across 31 frontier models. The authors conclude that this demonstrates a lack of stylistic robustness in current safety techniques, implying weak generalization and that deep understanding of 'non-maleficence' remains a central unresolved problem in frontier model safety.

Significance. If the transformations are validated to preserve harmful intent without introducing confounding changes in difficulty, ambiguity, or risk perception, the benchmark would offer a useful empirical extension of prior adversarial poetry and tales work, highlighting a potential generalization gap in safety alignments. The low circularity (empirical measurement against external models and dataset) is a strength, and the EU AI Act-inspired framing adds relevance. However, the significance is limited by the absence of supporting validation details for the core methodological assumption.

major comments (1)
  1. [Abstract] The central claim that stylistic shifts reveal weak generalization depends on the humanities-style rewrites preserving harmful intent exactly. The abstract states that the benchmark 'rewrites the same objectives through humanities-style transformations while preserving intent' but provides no details on the transformation method, inter-rater reliability, equivalence tests, or checks for introduced cues/ambiguity. This is load-bearing, as the ASR jump from 3.84% to 55.75% could reflect altered request difficulty rather than a true robustness failure.
minor comments (3)
  1. [Abstract] No error bars, confidence intervals, or variance measures are reported for the ASR values (3.84%, 36.8%–65.0%, 55.75%); reporting them would improve the clarity and interpretability of the quantitative results.
  2. [Abstract] The criteria for selecting the 31 frontier models are not specified, nor is any information on inter-model consistency or outlier handling.
  3. [Abstract] The extension from prior 'Adversarial Poetry/Tales' literature is referenced but lacks specific comparisons, scaling details, or discussion of how the broader benchmark family addresses previous limitations.
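On the first minor point, a standard way to attach uncertainty to an ASR value, had the underlying success counts been reported, is a Wilson score interval. The counts below are hypothetical, chosen only to match the 55.75% headline figure.

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion such as ASR."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Hypothetical: a 55.75% overall ASR, if it came from ~557/1000 attempts.
lo, hi = wilson_ci(557, 1000)
print(f"{lo:.3f}-{hi:.3f}")
```

The interval width depends directly on the attempt count, which is why the selection criteria and per-model sample sizes flagged in the second minor point matter for interpretation.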

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their insightful comments on the methodological rigor of our benchmark. We address the concern regarding the validation of intent preservation in the humanities-style rewrites.

Point-by-point responses
  1. Referee: [Abstract] The central claim that stylistic shifts reveal weak generalization depends on the humanities-style rewrites preserving harmful intent exactly. The abstract states that the benchmark 'rewrites the same objectives through humanities-style transformations while preserving intent' but provides no details on the transformation method, inter-rater reliability, equivalence tests, or checks for introduced cues/ambiguity. This is load-bearing, as the ASR jump from 3.84% to 55.75% could reflect altered request difficulty rather than a true robustness failure.

    Authors: We agree that the abstract lacks sufficient detail on the transformation methodology and validation procedures, which is a fair critique given the load-bearing nature of the intent-preservation assumption. The manuscript body provides a high-level description of the transformations as stylistic adaptations inspired by humanities genres (e.g., converting direct harmful requests into academic discourse or narrative forms), but we did not report inter-rater reliability, formal equivalence tests, or systematic checks for introduced ambiguity or cues. This omission weakens the presentation of our central claim. In the revised manuscript, we will expand the abstract to briefly note the validation approach and add a new subsection in the Methods detailing the transformation protocol, including human evaluation results for intent equivalence and difficulty assessment. We will also discuss potential confounds and how they were mitigated. While we believe the ASR increase primarily reflects a robustness gap rather than difficulty changes—given that the transformed prompts maintain or increase sophistication—we accept that additional empirical validation is necessary to fully substantiate this. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark with external dataset and models; no derivation chain present

Full rationale

The paper reports attack success rates measured on 31 external frontier models against tasks drawn from the external MLCommons AILuminate dataset, after applying humanities-style rewrites. No equations, fitted parameters, or mathematical derivations appear in the provided text. The central claim follows directly from the observed ASR difference (3.84% original vs. 55.75% transformed) rather than reducing to any self-referential definition or self-citation chain. Self-citation of prior Adversarial Poetry/Tales work is present only as background for the benchmark family and does not carry the load-bearing empirical result. The paper is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmark study with no mathematical axioms, free parameters, or invented entities; the central claim rests on the assumption that intent is preserved in the rewrites and that the tested models are representative of frontier systems.

pith-pipeline@v0.9.0 · 5502 in / 1080 out tokens · 31299 ms · 2026-05-10T04:13:36.577506+00:00 · methodology

