Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety

Daniele Nardi; Federico Pierucci; Federico Sartore; Francesco Giarrusso; Marcello Galisai; Matteo Prandi; Piercosma Bisconti; Susanna Cifani

arxiv: 2604.18487 · v1 · submitted 2026-04-20 · 💻 cs.CL · cs.AI


Pith reviewed 2026-05-10 04:13 UTC · model grok-4.3


keywords: adversarial benchmark, AI safety, stylistic robustness, jailbreak, frontier models, non-maleficence, prompt transformation


The pith

Humanities-style rewrites of harmful prompts bypass safety refusals in frontier AI models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether current AI safety alignments hold when the same harmful goals are expressed through stylistic changes drawn from humanities writing. It begins with direct harmful tasks and applies transformations that keep the underlying intent fixed while altering phrasing, tone, and structure. Experiments across 31 models show direct versions are mostly refused, but the rewritten versions succeed at much higher rates. The results indicate that safety training does not produce a style-independent grasp of what counts as harmful. This gap points to a deeper limitation in how models internalize the principle of avoiding harm.

Core claim

The Adversarial Humanities Benchmark shows that frontier models lack stylistic robustness in safety: original harmful prompts achieve only 3.84 percent attack success rate, while humanities-transformed versions range from 36.8 percent to 65.0 percent success, for an overall 55.75 percent rate, with CBRN tasks emerging as the highest-risk category under a systemic-risk evaluation.
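
The headline numbers are attack success rates: the share of model responses judged to comply with the harmful request. A minimal sketch of that bookkeeping follows, assuming each evaluated response is stored as a (condition, success) pair. The record layout, the function name `attack_success_rate`, and the toy records are illustrative assumptions, not the paper's pipeline. The last lines only restate the reported figures as a gap and a ratio.

```python
from collections import defaultdict

def attack_success_rate(records):
    """Return {condition: ASR in percent} from (condition, success) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for condition, success in records:
        totals[condition] += 1
        hits[condition] += int(success)
    return {c: 100.0 * hits[c] / totals[c] for c in totals}

# Toy records, invented for illustration only (not data from the paper).
toy = [
    ("original", False), ("original", False), ("original", False), ("original", True),
    ("transformed", True), ("transformed", True), ("transformed", False), ("transformed", True),
]
rates = attack_success_rate(toy)  # {"original": 25.0, "transformed": 75.0}

# Arithmetic on the figures reported in the paper.
reported_original, reported_transformed = 3.84, 55.75
gap_points = reported_transformed - reported_original  # about 51.9 percentage points
ratio = reported_transformed / reported_original       # about 14.5x
```

On the reported figures, the transformed prompts succeed roughly 14.5 times as often as the originals, a gap of about 51.9 percentage points.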

What carries the argument

The Adversarial Humanities Benchmark, which rewrites harmful objectives from MLCommons AILuminate via humanities-style transformations that preserve intent but use stylistic obfuscation and goal concealment.

If this is right

Safety techniques trained on direct harmful prompts will miss many disguised versions of the same requests.
Models require training that explicitly teaches recognition of harmful intent independent of surface style.
CBRN-related content poses the greatest uncovered risk when evaluated through a systemic-risk lens.
Standard safety benchmarks underestimate vulnerabilities unless they incorporate stylistic variations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Extending the same transformation approach to legal, technical, or scientific writing styles could reveal additional failure modes.
If future models pass this benchmark, they may still remain vulnerable to other forms of intent concealment not covered by humanities rewrites.
Safety fine-tuning datasets could be expanded by generating many stylistic variants of each harmful example.

Load-bearing premise

The humanities-style rewrites preserve the exact original harmful intent without adding new detectable signals or changing the request's risk level.

What would settle it

Run the same set of original and transformed prompts on a single model in randomized order across multiple sessions and measure whether refusal rates remain consistently lower for every transformed version.
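
One way to organize that check is sketched below, assuming a hypothetical `query_model(prompt_id, variant)` callable that returns True when the model refuses. The function names, the session count, and the stub model are all illustrative assumptions. The sketch shuffles every (prompt, variant) pair afresh in each session, then lists the prompts whose transformed version was not refused less often than the original.

```python
import random
from collections import defaultdict

def run_sessions(prompt_ids, query_model, sessions=5, seed=0):
    """Return {(prompt_id, variant): refusal rate} over randomized sessions."""
    rng = random.Random(seed)
    refusals = defaultdict(int)
    for _ in range(sessions):
        trials = [(p, v) for p in prompt_ids for v in ("original", "transformed")]
        rng.shuffle(trials)  # fresh random order in every session
        for prompt_id, variant in trials:
            refusals[(prompt_id, variant)] += int(query_model(prompt_id, variant))
    return {key: count / sessions for key, count in refusals.items()}

def inconsistent_prompts(rates, prompt_ids):
    """Prompts where the transformed variant is NOT refused less often."""
    return [p for p in prompt_ids
            if rates[(p, "transformed")] >= rates[(p, "original")]]

# Stub standing in for a real model call (purely hypothetical behaviour):
# it refuses every original prompt and, for prompt "p2" only, every rewrite.
def stub_model(prompt_id, variant):
    return variant == "original" or prompt_id == "p2"

ids = ["p1", "p2", "p3"]
rates = run_sessions(ids, stub_model)
exceptions = inconsistent_prompts(rates, ids)  # ["p2"]
```

An empty `exceptions` list would mean every transformed prompt was refused less often than its original, which is the pattern the check is looking for.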

Figures

Figures reproduced from arXiv: 2604.18487 by Daniele Nardi, Federico Pierucci, Federico Sartore, Francesco Giarrusso, Marcello Galisai, Matteo Prandi, Piercosma Bisconti, Susanna Cifani.

**Figure 3.** Original prompts (Light Blue) remain low-ASR, but humanities-style rewrites ((from ...

**Figure 4.** By hazard, transformed prompts (Red) are consistently riskier than original prompts (Light ...

**Figure 5.** By model provider, ASR increases significantly comparing original prompts (Light Blue) ...

**Figure 6.** By risk category, left: Provider by policy-relevant risk bucket ASR. The same benchmark ...

Original abstract

The Adversarial Humanities Benchmark (AHB) evaluates whether model safety refusals survive a shift away from familiar harmful prompt forms. Starting from harmful tasks drawn from MLCommons AILuminate, the benchmark rewrites the same objectives through humanities-style transformations while preserving intent. This extends literature on Adversarial Poetry and Adversarial Tales from single jailbreak operators to a broader benchmark family of stylistic obfuscation and goal concealment. In the benchmark results reported here, the original attacks record 3.84% attack success rate (ASR), while transformed methods range from 36.8% to 65.0%, yielding 55.75% overall ASR across 31 frontier models. Under a European Union AI Act Code-of-Practice-inspired systemic-risk lens, Chemical, biological, radiological and nuclear (CBRN) is the highest bucket. Taken together, this lack of stylistic robustness suggests that current safety techniques suffer from weak generalization: deep understanding of 'non-maleficence' remains a central unresolved problem in frontier model safety.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The benchmark shows stylistic rewrites lift ASR from 3.84% to 55.75% on 31 models, but without checks that intent stayed identical, the generalization claim rests on an untested assumption.


The paper's main point is straightforward: taking harmful tasks from the MLCommons AILuminate set and rewriting them in humanities styles produces much higher attack success rates than the originals. The numbers are 3.84% ASR on direct prompts versus 36.8% to 65.0% on the transformed ones, for 55.75% overall across 31 frontier models. It frames this as evidence that safety training fails to generalize across stylistic forms and flags CBRN as the highest-risk bucket under an EU AI Act Code-of-Practice-inspired systemic-risk lens.

Referee Report

1 major / 3 minor

Summary. The paper introduces the Adversarial Humanities Benchmark (AHB), which rewrites harmful tasks from the MLCommons AILuminate dataset using humanities-style transformations while claiming to preserve original intent. It reports an attack success rate (ASR) of 3.84% on the original prompts versus 36.8%–65.0% (overall 55.75%) on the transformed versions across 31 frontier models. The authors conclude that this demonstrates a lack of stylistic robustness in current safety techniques, implying weak generalization and that deep understanding of 'non-maleficence' remains a central unresolved problem in frontier model safety.

Significance. If the transformations are validated to preserve harmful intent without introducing confounding changes in difficulty, ambiguity, or risk perception, the benchmark would offer a useful empirical extension of prior adversarial poetry and tales work, highlighting a potential generalization gap in safety alignments. The low circularity (empirical measurement against external models and dataset) is a strength, and the EU AI Act-inspired framing adds relevance. However, the significance is limited by the absence of supporting validation details for the core methodological assumption.

major comments (1)

[Abstract] The central claim that stylistic shifts reveal weak generalization depends on the humanities-style rewrites preserving harmful intent exactly. The abstract states that the benchmark 'rewrites the same objectives through humanities-style transformations while preserving intent' but provides no details on the transformation method, inter-rater reliability, equivalence tests, or checks for introduced cues/ambiguity. This is load-bearing, as the ASR jump from 3.84% to 55.75% could reflect altered request difficulty rather than a true robustness failure.
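
One standard way to report the inter-rater reliability requested here is Cohen's kappa over two annotators' judgments of whether each rewrite preserves the original intent. The sketch below is a generic implementation, not something the paper describes. The label scheme (1 = intent preserved, 0 = not) and the sample labels are assumptions made for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance if the raters labelled independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Invented labels: 1 = "rewrite preserves intent", 0 = "it does not".
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
kappa = cohens_kappa(annotator_1, annotator_2)  # 0.5
```

Kappa corrects raw agreement for agreement expected by chance, so it is more informative than a plain percentage when most rewrites fall into one label.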

minor comments (3)

[Abstract] No error bars, confidence intervals, or variance measures are reported for the ASR values (3.84%, 36.8%–65.0%, 55.75%), which would improve the clarity and interpretability of the quantitative results.
[Abstract] The criteria for selecting the 31 frontier models are not specified, nor is any information on inter-model consistency or outlier handling.
[Abstract] The extension from prior 'Adversarial Poetry/Tales' literature is referenced but lacks specific comparisons, scaling details, or discussion of how the broader benchmark family addresses previous limitations.
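
For a rate such as ASR, one common choice for the interval requested in the first minor comment is the Wilson score interval for a binomial proportion. The sketch below is generic. The abstract does not report trial counts, so the counts used are placeholders, and the independence of trials that the interval assumes may not hold when the same prompts are reused across models.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half

# Placeholder counts, not taken from the paper: 50 successes in 100 trials.
low, high = wilson_interval(50, 100)  # roughly (0.404, 0.596)
```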

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their insightful comments on the methodological rigor of our benchmark. We address the concern regarding the validation of intent preservation in the humanities-style rewrites.


Referee: [Abstract] The central claim that stylistic shifts reveal weak generalization depends on the humanities-style rewrites preserving harmful intent exactly. The abstract states that the benchmark 'rewrites the same objectives through humanities-style transformations while preserving intent' but provides no details on the transformation method, inter-rater reliability, equivalence tests, or checks for introduced cues/ambiguity. This is load-bearing, as the ASR jump from 3.84% to 55.75% could reflect altered request difficulty rather than a true robustness failure.

Authors: We agree that the abstract lacks sufficient detail on the transformation methodology and validation procedures, which is a fair critique given the load-bearing nature of the intent-preservation assumption. The manuscript body provides a high-level description of the transformations as stylistic adaptations inspired by humanities genres (e.g., converting direct harmful requests into academic discourse or narrative forms), but we did not report inter-rater reliability, formal equivalence tests, or systematic checks for introduced ambiguity or cues. This omission weakens the presentation of our central claim. In the revised manuscript, we will expand the abstract to briefly note the validation approach and add a new subsection in the Methods detailing the transformation protocol, including human evaluation results for intent equivalence and difficulty assessment. We will also discuss potential confounds and how they were mitigated. While we believe the ASR increase primarily reflects a robustness gap rather than difficulty changes, given that the transformed prompts maintain or increase sophistication, we accept that additional empirical validation is necessary to fully substantiate this.

revision: yes

Circularity Check

0 steps flagged

Empirical benchmark with external dataset and models; no derivation chain present


The paper reports attack success rates measured on 31 external frontier models against tasks drawn from the external MLCommons AILuminate dataset, after applying humanities-style rewrites. No equations, fitted parameters, or mathematical derivations appear in the provided text. The central claim follows directly from the observed ASR difference (3.84% original vs. 55.75% transformed) rather than reducing to any self-referential definition or self-citation chain. Self-citation of prior Adversarial Poetry/Tales work is present only as background for the benchmark family and does not carry the load-bearing empirical result. The paper is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmark study with no mathematical axioms, free parameters, or invented entities; the central claim rests on the assumption that intent is preserved in the rewrites and that the tested models are representative of frontier systems.

pith-pipeline@v0.9.0 · 5502 in / 1080 out tokens · 31299 ms · 2026-05-10T04:13:36.577506+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
cs.CL 2026-05 unverdicted novelty 7.0

Boiling the Frog is a new stateful multi-turn benchmark for agentic safety that reports an aggregate strict attack success rate of 44.4% across nine models, with rates ranging from 20.5% to 92.9% depending on the mode...

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1] Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models. 2025.
[2] Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models. arXiv preprint arXiv:2511.15304.
[3] Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks. 2024.
[4] Schulhoff, Sander and Pinto, Jeremy and Khan, Anaum and Bouchard, Louis-François and Si, Chenglei and Anati, Svetlina and Tagliabue, Valen and Kost, Anson and Carnahan, Christopher and Boyd-Graber, Jordan. Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition. Proceedings of the 2023... doi:10.18653/v1/2023.emnlp-main.302
[5] Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. 2023.
[6] DeepInception: Hypnotize Large Language Model to Be Jailbreaker. 2024.
[7] Jailbreak Mimicry: Automated Discovery of Narrative-Based Jailbreaks for Large Language Models. 2025.
[8] Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models. 2024.
[9] Jailbroken: How Does LLM Safety Training Fail? 2023.
[10] Universal and Transferable Adversarial Attacks on Aligned Language Models. 2023.
[11] How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs. 2024.
[12] Ignore Previous Prompt: Attack Techniques For Language Models. 2022.
[13] Safety in Large Reasoning Models: A Survey. 2025.
[14] Chain-of-Thought Hijacking. 2025.
[15] Open Problems in Mechanistic Interpretability. arXiv preprint arXiv:2501.16496.
[16] Polysemantic attention head in a 4-layer transformer. Alignment Forum.
[17] Attention is not Explanation. arXiv preprint arXiv:1902.10186.
[18] Retrieval head mechanistically explains long-context factuality. arXiv preprint arXiv:2404.15574.
[19] Lookback lens: Detecting and mitigating contextual hallucinations in large language models using only attention maps. arXiv preprint arXiv:2407.07071.
[20] The Universal Weight Subspace Hypothesis. arXiv preprint arXiv:2512.05117.
[21] Morphology of the folktale. 1968.
[22] Are aligned neural networks adversarially aligned? 2024.
[23] "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. 2024.
[24] ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs. 2024.
[25] Multilingual Jailbreak Challenges in Large Language Models. 2024.
[26] Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study. 2024.
[27] GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts. 2024.
[28] AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. 2024.
[29] Open Sesame! Universal Black Box Jailbreaking of Large Language Models. 2024.
[30] Collective constitutional ai: Aligning a language model with public input. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency. 2024.
[31] Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073.
[33] Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Carrier Articles. arXiv preprint arXiv:2408.11182.
[34] Hidden You Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Logic Chain Injection. arXiv preprint arXiv:2404.04849.
[35] Emerging Technologies in the Development and Delivery of CBRN Threats.
[36] Are we losing control?
[37] AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons. 2025.
[38] Characterizing manipulation from AI systems. Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization.
[39] The emerging threat of ai-driven cyber attacks: A review. Applied Artificial Intelligence. 2022.
[40] Fine-Tuning Language Models from Human Preferences. 2020.
[41] Introducing v0.5 of the AI Safety Benchmark from MLCommons. 2024.
[42] No free labels: Limitations of llm-as-a-judge without human grounding. arXiv preprint arXiv:2503.05061.
[43] Cannot or Should Not? Automatic Analysis of Refusal Composition in IFT/RLHF Datasets and Refusal Behavior of Black-Box LLMs. 2024.
[44] AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons. arXiv preprint arXiv:2503.05731.
[45] RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. 2020.
[46] Red Teaming Language Models with Language Models. 2022.
[47] SafetyBench: Evaluating the Safety of Large Language Models. 2024.
[48] Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs. 2023.
[49] Deng, Jiawen and Zhou, Jingyan and Sun, Hao and Zheng, Chujie and Mi, Fei and Meng, Helen and Huang, Minlie. COLD: A Benchmark for Chinese Offensive Language Detection. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. doi:10.18653/v1/2022.emnlp-main.796
[50] ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation. 2023.
[51] SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models. 2024.
[52] BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset. 2023.
[53] HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. 2024.
[54] XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models. 2024.
[55] DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. 2024.
[56] Jailbreakbench: An open robustness benchmark for jailbreaking large language models. Advances in Neural Information Processing Systems.
[57] A StrongREJECT for Empty Jailbreaks. 2024.
[58] Mou, Yutao and Zhang, Shikun and Ye, Wei. SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types. doi:10.52202/079017-3910
[59] From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda. arXiv preprint arXiv:2601.08837.
[60] Rao, Abhinav and Vashistha, Sachin and Naik, Atharva and Aditya, Somak and Choudhury, Monojit. Tricking. 2024.
[61] Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv preprint arXiv:2204.05862.
[62] Robustness Audit of Qwen Models on the Icaro Adversarial Humanities Benchmark. 2026.
[63] General-Purpose. 2025.
