Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety
Pith reviewed 2026-05-10 04:13 UTC · model grok-4.3
The pith
Humanities-style rewrites of harmful prompts bypass safety refusals in frontier AI models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Adversarial Humanities Benchmark shows that frontier models lack stylistic robustness in safety. Original harmful prompts achieve an attack success rate (ASR) of only 3.84 percent, while humanities-transformed versions reach 36.8 to 65.0 percent depending on the method, for an overall ASR of 55.75 percent across 31 frontier models. CBRN tasks emerge as the highest-risk category under a systemic-risk evaluation.
What carries the argument
The Adversarial Humanities Benchmark, which rewrites harmful objectives from MLCommons AILuminate through humanities-style transformations that preserve intent while applying stylistic obfuscation and goal concealment.
If this is right
- Safety techniques trained on direct harmful prompts will miss many disguised versions of the same requests.
- Models require training that explicitly teaches recognition of harmful intent independent of surface style.
- CBRN-related content poses the greatest uncovered risk when evaluated through a systemic-risk lens.
- Standard safety benchmarks underestimate vulnerabilities unless they incorporate stylistic variations.
Where Pith is reading between the lines
- Extending the same transformation approach to legal, technical, or scientific writing styles could reveal additional failure modes.
- Future models that pass this benchmark may remain vulnerable to other forms of intent concealment not covered by humanities rewrites.
- Safety fine-tuning datasets could be expanded by generating many stylistic variants of each harmful example.
Load-bearing premise
The humanities-style rewrites preserve the exact original harmful intent without adding new detectable signals or changing the request's risk level.
What would settle it
Run the same set of original and transformed prompts on a single model in randomized order across multiple sessions and measure whether refusal rates remain consistently lower for every transformed version.
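A minimal sketch of how such a check could be organized, assuming each trial is logged as a (prompt_id, variant, refused) record. The function names, the "original" baseline label, and the record layout are illustrative assumptions, not part of the benchmark.

```python
import random
from collections import defaultdict


def build_schedule(prompt_ids, variants, n_sessions, seed=0):
    """Return one independently shuffled list of (prompt_id, variant) trials per session."""
    rng = random.Random(seed)
    schedule = []
    for _ in range(n_sessions):
        trials = [(p, v) for p in prompt_ids for v in variants]
        rng.shuffle(trials)  # randomize presentation order within the session
        schedule.append(trials)
    return schedule


def refusal_rates(records):
    """Aggregate (prompt_id, variant, refused) records into a refusal rate per (prompt_id, variant)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [refusals, trials]
    for prompt_id, variant, refused in records:
        counts[(prompt_id, variant)][0] += int(refused)
        counts[(prompt_id, variant)][1] += 1
    return {key: refusals / trials for key, (refusals, trials) in counts.items()}


def consistently_lower(rates, prompt_ids, baseline="original"):
    """For each prompt, report whether every transformed variant is refused less often than the baseline."""
    result = {}
    for prompt_id in prompt_ids:
        base = rates[(prompt_id, baseline)]
        transformed = [
            rate for (p, v), rate in rates.items() if p == prompt_id and v != baseline
        ]
        result[prompt_id] = all(rate < base for rate in transformed)
    return result
```

The per-prompt comparison matters here: an aggregate drop in refusals could be driven by a few prompts, whereas the settling condition asks whether the effect holds for every transformed version.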
Original abstract
The Adversarial Humanities Benchmark (AHB) evaluates whether model safety refusals survive a shift away from familiar harmful prompt forms. Starting from harmful tasks drawn from MLCommons AILuminate, the benchmark rewrites the same objectives through humanities-style transformations while preserving intent. This extends literature on Adversarial Poetry and Adversarial Tales from single jailbreak operators to a broader benchmark family of stylistic obfuscation and goal concealment. In the benchmark results reported here, the original attacks record 3.84% attack success rate (ASR), while transformed methods range from 36.8% to 65.0%, yielding 55.75% overall ASR across 31 frontier models. Under a European Union AI Act Code-of-Practice-inspired systemic-risk lens, Chemical, biological, radiological and nuclear (CBRN) is the highest bucket. Taken together, this lack of stylistic robustness suggests that current safety techniques suffer from weak generalization: deep understanding of 'non-maleficence' remains a central unresolved problem in frontier model safety.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Adversarial Humanities Benchmark (AHB), which rewrites harmful tasks from the MLCommons AILuminate dataset using humanities-style transformations while claiming to preserve original intent. It reports an attack success rate (ASR) of 3.84% on the original prompts versus 36.8%–65.0% (overall 55.75%) on the transformed versions across 31 frontier models. The authors conclude that this demonstrates a lack of stylistic robustness in current safety techniques, implying weak generalization and that deep understanding of 'non-maleficence' remains a central unresolved problem in frontier model safety.
Significance. If the transformations are validated to preserve harmful intent without introducing confounding changes in difficulty, ambiguity, or risk perception, the benchmark would offer a useful empirical extension of prior adversarial poetry and tales work, highlighting a potential generalization gap in safety alignments. The low circularity (empirical measurement against external models and dataset) is a strength, and the EU AI Act-inspired framing adds relevance. However, the significance is limited by the absence of supporting validation details for the core methodological assumption.
major comments (1)
- [Abstract] Abstract: The central claim that stylistic shifts reveal weak generalization depends on the humanities-style rewrites preserving harmful intent exactly. The abstract states that the benchmark 'rewrites the same objectives through humanities-style transformations while preserving intent' but provides no details on the transformation method, inter-rater reliability, equivalence tests, or checks for introduced cues/ambiguity. This is load-bearing, as the ASR jump from 3.84% to 55.75% could reflect altered request difficulty rather than a true robustness failure.
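One way the inter-rater reliability requested here could be reported is Cohen's kappa over two annotators' judgments of whether each rewrite preserves the original intent. The sketch below is illustrative only: the paper does not state which agreement statistic, if any, was used, and the label encoding (1 = intent preserved, 0 = not preserved) is an assumption.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

A kappa near 1 would support the intent-preservation premise, while a value near 0 would mean the annotators agree no more often than chance.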
minor comments (3)
- [Abstract] Abstract: No error bars, confidence intervals, or variance measures are reported for the ASR values (3.84%, 36.8%–65.0%, 55.75%), which would improve the clarity and interpretability of the quantitative results.
- [Abstract] Abstract: The criteria for selecting the 31 frontier models are not specified, nor is any information on inter-model consistency or outlier handling.
- [Abstract] Abstract: The extension from prior 'Adversarial Poetry/Tales' literature is referenced but lacks specific comparisons, scaling details, or discussion of how the broader benchmark family addresses previous limitations.
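As an illustration of the interval estimates requested in the first minor comment, the sketch below computes a Wilson score interval for a single ASR value. The source reports only percentages, so the trial counts a caller would pass in are assumed inputs, and the choice of the Wilson interval is an assumption rather than anything the authors describe.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95% coverage)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("require trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)
```

The Wilson interval is used here instead of the simpler normal approximation because it behaves sensibly for rates close to zero, which is the regime of the 3.84% baseline.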
Simulated Author's Rebuttal
We thank the referee for their insightful comments on the methodological rigor of our benchmark. We address the concern regarding the validation of intent preservation in the humanities-style rewrites.
Referee: [Abstract] Abstract: The central claim that stylistic shifts reveal weak generalization depends on the humanities-style rewrites preserving harmful intent exactly. The abstract states that the benchmark 'rewrites the same objectives through humanities-style transformations while preserving intent' but provides no details on the transformation method, inter-rater reliability, equivalence tests, or checks for introduced cues/ambiguity. This is load-bearing, as the ASR jump from 3.84% to 55.75% could reflect altered request difficulty rather than a true robustness failure.
Authors: We agree that the abstract lacks sufficient detail on the transformation methodology and validation procedures, which is a fair critique given the load-bearing nature of the intent-preservation assumption. The manuscript body provides a high-level description of the transformations as stylistic adaptations inspired by humanities genres (e.g., converting direct harmful requests into academic discourse or narrative forms), but we did not report inter-rater reliability, formal equivalence tests, or systematic checks for introduced ambiguity or cues. This omission weakens the presentation of our central claim. In the revised manuscript, we will expand the abstract to briefly note the validation approach and add a new subsection in the Methods detailing the transformation protocol, including human evaluation results for intent equivalence and difficulty assessment. We will also discuss potential confounds and how they were mitigated. We believe the ASR increase primarily reflects a robustness gap rather than difficulty changes, given that the transformed prompts maintain or increase sophistication, but we accept that additional empirical validation is necessary to fully substantiate this.
Revision: yes
Circularity Check
Empirical benchmark with external dataset and models; no derivation chain present
The paper reports attack success rates measured on 31 external frontier models against tasks drawn from the external MLCommons AILuminate dataset, after applying humanities-style rewrites. No equations, fitted parameters, or mathematical derivations appear in the provided text. The central claim follows directly from the observed ASR difference (3.84% original vs. 55.75% transformed) rather than reducing to any self-referential definition or self-citation chain. Self-citation of prior Adversarial Poetry/Tales work is present only as background for the benchmark family and does not carry the load-bearing empirical result. The paper is therefore self-contained against external benchmarks.
Forward citations
Cited by 1 Pith paper
- Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
Boiling the Frog is a new stateful multi-turn benchmark for agentic safety that reports an aggregate strict attack success rate of 44.4% across nine models, with rates ranging from 20.5% to 92.9% depending on the mode...