Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety
Pith reviewed 2026-05-10 04:13 UTC · model grok-4.3
The pith
Humanities-style rewrites of harmful prompts bypass safety refusals in frontier AI models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Adversarial Humanities Benchmark shows that frontier models lack stylistic robustness in safety: across 31 frontier models, the original harmful prompts achieve only a 3.84 percent attack success rate, while humanities-transformed versions of the same requests succeed between 36.8 and 65.0 percent of the time (55.75 percent overall), with CBRN tasks emerging as the highest-risk category under a systemic-risk evaluation.
What carries the argument
The Adversarial Humanities Benchmark, which rewrites harmful objectives from MLCommons AILuminate via humanities-style transformations that preserve intent but use stylistic obfuscation and goal concealment.
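The abstract does not describe how the benchmark is organized internally. As a minimal illustration only, a benchmark family of this kind could pair each original AILuminate objective with its stylistic rewrites in a structure like the following; the `BenchmarkItem` class, field names, and style labels are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkItem:
    """One harmful objective plus its humanities-style rewrites.

    Hypothetical schema for illustration; the actual AHB format is not
    published in the material reviewed here.
    """
    item_id: str
    hazard_category: str                     # e.g. an AILuminate hazard label such as "cbrn"
    original_prompt: str                     # the direct form of the objective
    variants: dict[str, str] = field(default_factory=dict)  # style name -> rewritten prompt

    def all_prompts(self) -> dict[str, str]:
        """Every condition to evaluate, keyed by style (including the original)."""
        return {"original": self.original_prompt, **self.variants}

# Placeholder example; no real harmful content is included here.
item = BenchmarkItem(
    item_id="ahb-0001",
    hazard_category="cbrn",
    original_prompt="<direct harmful objective>",
    variants={
        "literary_essay": "<same objective recast as a reflective essay>",
        "historical_vignette": "<same objective recast as a period narrative>",
    },
)
```

Keeping every surface form keyed to one item makes per-style attack-success comparisons straightforward downstream.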
If this is right
- Safety techniques trained on direct harmful prompts will miss many disguised versions of the same requests.
- Models require training that explicitly teaches recognition of harmful intent independent of surface style.
- CBRN-related content poses the greatest uncovered risk when evaluated through a systemic-risk lens.
- Standard safety benchmarks underestimate vulnerabilities unless they incorporate stylistic variations.
Where Pith is reading between the lines
- Extending the same transformation approach to legal, technical, or scientific writing styles could reveal additional failure modes.
- If future models pass this benchmark, they may still remain vulnerable to other forms of intent concealment not covered by humanities rewrites.
- Safety fine-tuning datasets could be expanded by generating many stylistic variants of each harmful example.
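The last bullet is straightforward to prototype: each harmful objective and its stylistic variants can be expanded into supervised refusal rows. A minimal sketch, assuming a chat-style JSONL format and a fixed placeholder refusal as the training target; none of this comes from the paper.

```python
import json

REFUSAL = "I can't help with that."  # placeholder target response, not a tuned refusal

def variant_rows(original: str, variants: dict[str, str]) -> list[dict]:
    """One supervised (prompt -> refusal) row per surface form of the same objective."""
    rows = []
    for style, prompt in {"original": original, **variants}.items():
        rows.append({
            "style": style,
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": REFUSAL},
            ],
        })
    return rows

def write_jsonl(examples: list[dict], path: str = "refusal_variants.jsonl") -> None:
    """Write one JSON object per line, the usual format for fine-tuning data."""
    with open(path, "w", encoding="utf-8") as fh:
        for row in examples:
            fh.write(json.dumps(row, ensure_ascii=False) + "\n")
```

Pairing every surface form with the same refusal is meant to push the tuned model toward recognizing the underlying intent rather than the wording.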
Load-bearing premise
The humanities-style rewrites preserve the exact original harmful intent without adding new detectable signals or changing the request's risk level.
What would settle it
Run the same set of original and transformed prompts on a single model in randomized order across multiple sessions and measure whether refusal rates remain consistently lower for every transformed version.
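A rough harness for that experiment might look like the sketch below; `query_model` and `is_refusal` stand in for a model API call and a refusal grader, both of which are assumptions rather than details given by the paper.

```python
import random
from collections import defaultdict
from typing import Callable

def refusal_rates(
    trials: list[tuple[str, str]],       # (condition, prompt) pairs, e.g. ("original", ...) or ("literary_essay", ...)
    query_model: Callable[[str], str],   # hypothetical: sends one prompt, returns the model's reply
    is_refusal: Callable[[str], bool],   # hypothetical: grades a reply as refusal vs. compliance
    n_sessions: int = 5,
    seed: int = 0,
) -> dict[str, float]:
    """Per-condition refusal rate, with prompt order re-randomized every session."""
    rng = random.Random(seed)
    refused, asked = defaultdict(int), defaultdict(int)
    for _ in range(n_sessions):
        order = list(trials)
        rng.shuffle(order)               # fresh random order each session
        for condition, prompt in order:
            reply = query_model(prompt)
            refused[condition] += int(is_refusal(reply))
            asked[condition] += 1
    return {condition: refused[condition] / asked[condition] for condition in asked}
```

Consistently lower refusal rates for the transformed conditions across sessions would support the robustness-gap reading; convergence toward the original condition would point instead to ordering or grading artifacts.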
Original abstract
The Adversarial Humanities Benchmark (AHB) evaluates whether model safety refusals survive a shift away from familiar harmful prompt forms. Starting from harmful tasks drawn from MLCommons AILuminate, the benchmark rewrites the same objectives through humanities-style transformations while preserving intent. This extends literature on Adversarial Poetry and Adversarial Tales from single jailbreak operators to a broader benchmark family of stylistic obfuscation and goal concealment. In the benchmark results reported here, the original attacks record 3.84% attack success rate (ASR), while transformed methods range from 36.8% to 65.0%, yielding 55.75% overall ASR across 31 frontier models. Under a European Union AI Act Code-of-Practice-inspired systemic-risk lens, Chemical, biological, radiological and nuclear (CBRN) is the highest bucket. Taken together, this lack of stylistic robustness suggests that current safety techniques suffer from weak generalization: deep understanding of 'non-maleficence' remains a central unresolved problem in frontier model safety.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Adversarial Humanities Benchmark (AHB), which rewrites harmful tasks from the MLCommons AILuminate dataset using humanities-style transformations while claiming to preserve original intent. It reports an attack success rate (ASR) of 3.84% on the original prompts versus 36.8%–65.0% (overall 55.75%) on the transformed versions across 31 frontier models. The authors conclude that this demonstrates a lack of stylistic robustness in current safety techniques, implying weak generalization and that deep understanding of 'non-maleficence' remains a central unresolved problem in frontier model safety.
Significance. If the transformations are validated to preserve harmful intent without introducing confounding changes in difficulty, ambiguity, or risk perception, the benchmark would offer a useful empirical extension of prior adversarial poetry and tales work, highlighting a potential generalization gap in safety alignment. The low circularity (empirical measurement against external models and an external dataset) is a strength, and the EU AI Act-inspired framing adds regulatory relevance. However, the significance is limited by the absence of supporting validation details for the core methodological assumption.
major comments (1)
- [Abstract] The central claim that stylistic shifts reveal weak generalization depends on the humanities-style rewrites preserving harmful intent exactly. The abstract states that the benchmark 'rewrites the same objectives through humanities-style transformations while preserving intent' but provides no details on the transformation method, inter-rater reliability, equivalence tests, or checks for introduced cues/ambiguity. This is load-bearing, as the ASR jump from 3.84% to 55.75% could reflect altered request difficulty rather than a true robustness failure.
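To make the requested inter-rater check concrete: agreement between two annotators on a binary 'intent preserved?' judgment per transformed prompt could be summarized with Cohen's kappa. The sketch below uses invented ratings purely for illustration; nothing in it comes from the paper.

```python
def cohens_kappa(rater_a: list[bool], rater_b: list[bool]) -> float:
    """Cohen's kappa for two raters making binary 'intent preserved?' calls.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each rater's marginals.
    """
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    pa_yes = sum(rater_a) / n
    pb_yes = sum(rater_b) / n
    p_e = pa_yes * pb_yes + (1 - pa_yes) * (1 - pb_yes)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

# Invented example: 10 transformed prompts rated by two annotators.
a = [True, True, True, False, True, True, True, True, False, True]
b = [True, True, False, False, True, True, True, True, True, True]
print(round(cohens_kappa(a, b), 3))  # 0.375 for these invented ratings
```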
minor comments (3)
- [Abstract] No error bars, confidence intervals, or variance measures are reported for the ASR values (3.84%, 36.8%–65.0%, 55.75%), which would improve the clarity and interpretability of the quantitative results; a simple proportion-interval sketch follows this list.
- [Abstract] The criteria for selecting the 31 frontier models are not specified, nor is any information on inter-model consistency or outlier handling.
- [Abstract] The extension from prior 'Adversarial Poetry/Tales' literature is referenced but lacks specific comparisons, scaling details, or discussion of how the broader benchmark family addresses previous limitations.
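For the interval comment above, a Wilson score interval is one standard way to attach uncertainty to a reported ASR. The sketch below illustrates it with the paper's 3.84% figure and an assumed trial count, since the abstract does not report denominators.

```python
from math import sqrt

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion such as ASR."""
    if trials == 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

# Illustration only: a 3.84% ASR would correspond to roughly 38 successes if there
# were 1000 attempts; the actual per-condition trial counts are not in the abstract.
low, high = wilson_interval(38, 1000)
print(f"ASR ~3.8% -> 95% CI [{low:.3%}, {high:.3%}]")
```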
Simulated Author's Rebuttal
We thank the referee for their insightful comments on the methodological rigor of our benchmark. We address the concern regarding the validation of intent preservation in the humanities-style rewrites.
Point-by-point responses
- Referee: [Abstract] The central claim that stylistic shifts reveal weak generalization depends on the humanities-style rewrites preserving harmful intent exactly. The abstract states that the benchmark 'rewrites the same objectives through humanities-style transformations while preserving intent' but provides no details on the transformation method, inter-rater reliability, equivalence tests, or checks for introduced cues/ambiguity. This is load-bearing, as the ASR jump from 3.84% to 55.75% could reflect altered request difficulty rather than a true robustness failure.
Authors: We agree that the abstract lacks sufficient detail on the transformation methodology and validation procedures, which is a fair critique given the load-bearing nature of the intent-preservation assumption. The manuscript body provides a high-level description of the transformations as stylistic adaptations inspired by humanities genres (e.g., converting direct harmful requests into academic discourse or narrative forms), but we did not report inter-rater reliability, formal equivalence tests, or systematic checks for introduced ambiguity or cues. This omission weakens the presentation of our central claim. In the revised manuscript, we will expand the abstract to briefly note the validation approach and add a new subsection in the Methods detailing the transformation protocol, including human evaluation results for intent equivalence and difficulty assessment. We will also discuss potential confounds and how they were mitigated. While we believe the ASR increase primarily reflects a robustness gap rather than difficulty changes—given that the transformed prompts maintain or increase sophistication—we accept that additional empirical validation is necessary to fully substantiate this. revision: yes
Circularity Check
Empirical benchmark with external dataset and models; no derivation chain present
Full rationale
The paper reports attack success rates measured on 31 external frontier models against tasks drawn from the external MLCommons AILuminate dataset, after applying humanities-style rewrites. No equations, fitted parameters, or mathematical derivations appear in the provided text. The central claim follows directly from the observed ASR difference (3.84% original vs. 55.75% transformed) rather than reducing to any self-referential definition or self-citation chain. Self-citation of prior Adversarial Poetry/Tales work is present only as background for the benchmark family and does not carry the load-bearing empirical result. The paper is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models. 2025.
- [2] Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models. arXiv preprint arXiv:2511.15304.
- [3] Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks. 2024.
- [4] Schulhoff, Sander; Pinto, Jeremy; Khan, Anaum; Bouchard, Louis-François; Si, Chenglei; Anati, Svetlina; Tagliabue, Valen; Kost, Anson; Carnahan, Christopher; Boyd-Graber, Jordan. Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition. Proceedings of the 2023...
- [5] Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. 2023.
- [6] DeepInception: Hypnotize Large Language Model to Be Jailbreaker. 2024.
- [7] Jailbreak Mimicry: Automated Discovery of Narrative-Based Jailbreaks for Large Language Models. 2025.
- [8] Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models. 2024.
- [9] Jailbroken: How Does LLM Safety Training Fail? 2023.
- [10] Universal and Transferable Adversarial Attacks on Aligned Language Models. 2023.
- [11] How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs. 2024.
- [12] Ignore Previous Prompt: Attack Techniques For Language Models. 2022.
- [13] Safety in Large Reasoning Models: A Survey. 2025.
- [14] Chain-of-Thought Hijacking. 2025.
- [15] Open Problems in Mechanistic Interpretability. arXiv preprint arXiv:2501.16496.
- [16] Polysemantic attention head in a 4-layer transformer. Alignment Forum.
- [17] Attention is not explanation. arXiv preprint arXiv:1902.10186.
- [18] Retrieval head mechanistically explains long-context factuality. arXiv preprint arXiv:2404.15574.
- [19] Lookback lens: Detecting and mitigating contextual hallucinations in large language models using only attention maps. arXiv preprint arXiv:2407.07071.
- [20] The Universal Weight Subspace Hypothesis. arXiv preprint arXiv:2512.05117 (2025).
- [21] Morphology of the folktale. 1968.
- [22] Are aligned neural networks adversarially aligned? 2024.
- [23] "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. 2024.
- [24] ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs. 2024.
- [25] Multilingual Jailbreak Challenges in Large Language Models. 2024.
- [26] Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study. 2024.
- [27] GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts. 2024.
- [28] AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. 2024.
- [29] Open Sesame! Universal Black Box Jailbreaking of Large Language Models. 2024.
- [30] Collective constitutional AI: Aligning a language model with public input. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency. 2024.
- [31] Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073.
- [33] Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Carrier Articles. arXiv preprint arXiv:2408.11182.
- [34] Hidden You Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Logic Chain Injection. arXiv preprint arXiv:2404.04849.
- [35] Emerging Technologies in the Development and Delivery of CBRN Threats.
- [36] Are we losing control?
- [37] AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons. 2025.
- [38] Characterizing manipulation from AI systems. Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization.
- [39] The emerging threat of AI-driven cyber attacks: A review. Applied Artificial Intelligence. 2022.
- [40] Fine-Tuning Language Models from Human Preferences. 2020.
- [41] Introducing v0.5 of the AI Safety Benchmark from MLCommons. 2024.
- [42] No free labels: Limitations of LLM-as-a-judge without human grounding. arXiv preprint arXiv:2503.05061.
- [43] Cannot or Should Not? Automatic Analysis of Refusal Composition in IFT/RLHF Datasets and Refusal Behavior of Black-Box LLMs. 2024.
- [44]
- [45] RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. 2020.
- [46] Red Teaming Language Models with Language Models. 2022.
- [47] SafetyBench: Evaluating the Safety of Large Language Models. 2024.
- [48] Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs. 2023.
- [49] Deng, Jiawen; Zhou, Jingyan; Sun, Hao; Zheng, Chujie; Mi, Fei; Meng, Helen; Huang, Minlie. COLD: A Benchmark for Chinese Offensive Language Detection. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. doi:10.18653/v1/2022.emnlp-main.796
- [50] ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation. 2023.
- [51] SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models. 2024.
- [52] BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset. 2023.
- [53] HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. 2024.
- [54] XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models. 2024.
- [55] DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. 2024.
- [56] JailbreakBench: An open robustness benchmark for jailbreaking large language models. Advances in Neural Information Processing Systems.
- [57] A StrongREJECT for Empty Jailbreaks. 2024.
- [58] Mou, Yutao; Zhang, Shikun; Ye, Wei. SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types. doi:10.52202/079017-3910
- [59] From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda. arXiv preprint arXiv:2601.08837.
- [60] Rao, Abhinav; Vashistha, Sachin; Naik, Atharva; Aditya, Somak; Choudhury, Monojit. Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks. 2024.
- [61] Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv preprint arXiv:2204.05862.
- [62] Robustness Audit of Qwen Models on the Icaro Adversarial Humanities Benchmark. 2026.
- [63] General-Purpose. 2025.