LLM-Safety Evaluations Lack Robustness

Gauthier Gidel; Leo Schwinn; Simon Geisler; Sophie Xhonneux; Stephan Günnemann; Tim Beyer

arxiv: 2503.02574 · v2 · pith:XWX6Y5XSnew · submitted 2025-03-04 · 💻 cs.CR · cs.AI

Tim Beyer, Sophie Xhonneux, Simon Geisler, Gauthier Gidel, Leo Schwinn, Stephan Günnemann

Pith reviewed 2026-05-23 01:23 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords LLM safety · evaluation robustness · red-teaming · LLM judges · safety alignment · noise sources · dataset curation · response evaluation

The pith

Safety evaluations for large language models are too noisy to allow fair comparisons of attacks and defenses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that safety alignment research for LLMs is slowed by noise from small datasets, inconsistent methods, and unreliable evaluation setups that prevent fair comparisons of attacks and defenses. It breaks down the evaluation pipeline into four stages and identifies specific problems at each one. Guidelines are proposed to reduce this noise and bias in future work. An opposing perspective on practical reasons for the limitations is also included. Addressing the issues would enable more reliable results and steadier progress in the field.

Core claim

We argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise, such as small datasets, methodological inconsistencies, and unreliable evaluation setups. This can, at times, make it impossible to evaluate and compare attacks and defenses fairly, thereby slowing progress. We systematically analyze the LLM safety evaluation pipeline, covering dataset curation, optimization strategies for automated red-teaming, response generation, and response evaluation using LLM judges. At each stage, we identify key issues and highlight their practical impact. We also propose a set of guidelines for reducing noise and bias in evaluations.
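To make the "small datasets" noise source concrete, here is a minimal sketch (not taken from the paper) that computes a Wilson score interval for an attack success rate (ASR). The function name `wilson_interval` and the success counts are hypothetical and chosen only for illustration.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion, here an ASR."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Hypothetical numbers: the same 40% ASR measured on 50 vs. 1000 prompts.
small = wilson_interval(20, 50)
large = wilson_interval(400, 1000)
print(f"n=50:   {small[0]:.2f} to {small[1]:.2f}")
print(f"n=1000: {large[0]:.2f} to {large[1]:.2f}")
```

Under these assumed counts, the 50-prompt interval spans roughly 0.28 to 0.54, while the 1000-prompt interval spans roughly 0.37 to 0.43. Two attacks whose ASRs differ by ten points on a 50-prompt set would therefore be hard to separate from sampling noise alone.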

What carries the argument

Systematic stage-by-stage analysis of the LLM safety evaluation pipeline covering dataset curation, optimization strategies for automated red-teaming, response generation, and LLM-judge evaluation.

Load-bearing premise

The listed sources of noise across the evaluation stages are the dominant factors that make fair comparisons impossible, rather than other unmentioned factors or mitigations.

What would settle it

A study that applies identical attacks and defenses to multiple standardized setups and obtains consistent performance rankings would show that fair comparisons remain possible despite the noise.
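One way such a ranking-consistency check could be operationalized is with a rank correlation between evaluation setups. The sketch below computes Kendall's tau-a over per-attack ASR values; the helper `kendall_tau`, the attack names, and all numbers are hypothetical and are not results from the paper.

```python
from itertools import combinations
from typing import Dict


def kendall_tau(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> float:
    """Kendall tau-a between two score tables over the same methods.

    Tied pairs count as neither concordant nor discordant.
    """
    methods = sorted(scores_a)
    if set(methods) != set(scores_b) or len(methods) < 2:
        raise ValueError("both setups must score the same two or more methods")
    concordant = discordant = 0
    for m1, m2 in combinations(methods, 2):
        prod = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if prod > 0:
            concordant += 1
        elif prod < 0:
            discordant += 1
    n_pairs = len(methods) * (len(methods) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical ASR values for four attacks under two evaluation setups.
setup_1 = {"attack_a": 0.62, "attack_b": 0.48, "attack_c": 0.31, "attack_d": 0.12}
setup_2 = {"attack_a": 0.55, "attack_b": 0.58, "attack_c": 0.27, "attack_d": 0.10}

tau = kendall_tau(setup_1, setup_2)
print(f"Kendall tau between setups: {tau:.2f}")
```

A tau near 1 across many setups would suggest that rankings survive the noise; values well below 1, as in this toy example where the top two attacks swap places, would point the other way.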

Figures

Figures reproduced from arXiv: 2503.02574 by Gauthier Gidel, Leo Schwinn, Simon Geisler, Sophie Xhonneux, Stephan Günnemann, Tim Beyer.

**Figure 1.** We decompose the LLM safety evaluation pipeline into three stages and analyze common problems at each step.

**Figure 2.** Impact of the SentencePiece white space token on ...

**Figure 3.** Comparing ASR differences across common judges reveals significant model-specific bias, despite all four models being derivatives of the same Mistral 7B model.

**Figure 4.** Impact of white-space tokens on GCG attack progress vs. Llama-2-7b-chat-hf.

**Figure 5.** Impact of implementation details on GCG attack progress vs. Llama-3.1-8B-Instruct.

**Figure 6.** Average loss for HarmBench-style affirmative target sequence using a harmless prompt.

**Figure 7.** Loss for the first token (“Sure”) for a HarmBench-style affirmative target using a harmless prompt.

**Figure 8.** Comparing the ASR of two judge models. We consider an attack successful if StrongReject gives a score higher ...

Original abstract

In this paper, we argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise, such as small datasets, methodological inconsistencies, and unreliable evaluation setups. This can, at times, make it impossible to evaluate and compare attacks and defenses fairly, thereby slowing progress. We systematically analyze the LLM safety evaluation pipeline, covering dataset curation, optimization strategies for automated red-teaming, response generation, and response evaluation using LLM judges. At each stage, we identify key issues and highlight their practical impact. We also propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers. Lastly, we offer an opposing perspective, highlighting practical reasons for existing limitations. We believe that addressing the outlined problems in future research will improve the field's ability to generate easily comparable results and make measurable progress.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper usefully maps evaluation noise in LLM safety, but its claim that fair comparison can be impossible is not backed by examples or numbers.

The main takeaway is that this paper walks through the LLM safety evaluation pipeline and flags recurring sources of noise at each stage. That breakdown is the useful part. It covers dataset size and selection, how attacks get optimized, how responses are generated, and how LLM judges score them, then lists practical problems like inconsistent methods and unreliable automated scoring. The guidelines for future papers are concrete enough to be usable, and the section offering an opposing perspective on why some limits persist adds balance instead of just complaining.

Referee Report

3 major / 2 minor

Summary. The paper claims that LLM safety alignment research is hindered by intertwined sources of noise such as small datasets, methodological inconsistencies, and unreliable evaluation setups. It argues these can make it impossible at times to evaluate and compare attacks and defenses fairly. The manuscript systematically analyzes the pipeline stages of dataset curation, optimization strategies for automated red-teaming, response generation, and LLM-judge evaluation; identifies key issues and their practical impact at each stage; proposes guidelines for reducing noise and bias; and presents an opposing perspective on practical reasons for existing limitations.

Significance. If the analysis holds and demonstrates that the enumerated noise sources dominate over mitigations to the point of rendering fair comparisons impossible, the work could encourage more rigorous and standardized evaluation practices in LLM safety, potentially enabling measurable progress. The systematic breakdown across stages, the proposed guidelines, and the inclusion of an opposing perspective are constructive elements that could aid researchers in improving reproducibility.

major comments (3)

[Abstract] The claim that the identified issues 'can, at times, make it impossible to evaluate and compare attacks and defenses fairly' is presented without quantitative evidence, error analysis, or concrete examples showing how the noise affects specific comparisons or why existing mitigations in the literature fail to permit reliable evaluations.

[Systematic analysis sections] (dataset curation, optimization, response generation, LLM-judge evaluation) The assumption that the listed issues are the dominant and representative sources of noise that render fair comparisons impossible is not verified; no direct comparison to or falsification against mitigations already present in the literature is provided to establish that the problems are load-bearing for the headline conclusion.

[Guidelines section] The proposed set of guidelines for reducing noise is not accompanied by any demonstration, case study, or before/after quantification showing that their adoption would enable fair comparisons where it was previously impossible.

minor comments (2)

The manuscript would benefit from explicit definitions or operationalizations of key terms such as 'noise', 'fair comparison', and 'impossible' to make the argument more precise.
Adding citations to specific prior works that exemplify each identified issue (e.g., particular papers using small datasets or LLM judges) would strengthen the practical impact discussion.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. Our manuscript is a position and analysis paper that surveys noise sources across the LLM safety evaluation pipeline based on existing literature, rather than an empirical study with new experiments. We address each major comment below.

Referee: [Abstract] The claim that the identified issues 'can, at times, make it impossible to evaluate and compare attacks and defenses fairly' is presented without quantitative evidence, error analysis, or concrete examples showing how the noise affects specific comparisons or why existing mitigations in the literature fail to permit reliable evaluations.

Authors: The abstract summarizes the core argument, with the full manuscript providing concrete examples and citations from published work (e.g., inconsistent attack success rates across papers due to dataset size differences or judge model variations in sections on dataset curation and LLM-judge evaluation). We agree the abstract could be strengthened with a brief illustrative reference and will revise it accordingly to point readers to specific cases discussed later in the paper.

revision: partial

Referee: [Systematic analysis sections] (dataset curation, optimization, response generation, LLM-judge evaluation) The assumption that the listed issues are the dominant and representative sources of noise that render fair comparisons impossible is not verified; no direct comparison to or falsification against mitigations already present in the literature is provided to establish that the problems are load-bearing for the headline conclusion.

Authors: The analysis draws on a broad survey of the literature to identify recurring issues that persist in practice. The opposing perspective section already discusses practical reasons for limitations and some existing mitigations. To strengthen this, we will expand the analysis sections with additional explicit comparisons to common mitigations (e.g., larger datasets or multi-judge setups) and why they often remain insufficient based on observed inconsistencies across papers.

revision: partial

Referee: [Guidelines section] The proposed set of guidelines for reducing noise is not accompanied by any demonstration, case study, or before/after quantification showing that their adoption would enable fair comparisons where it was previously impossible.

Authors: The guidelines are proposed as direct responses to the noise sources identified in the analysis and are presented as recommendations for future work. A before/after quantification or new case study would require conducting fresh experiments, which falls outside the scope of this analysis paper. We can revise the guidelines section to include more detailed rationale linking each guideline to specific noise sources and expected benefits.

revision: no

Circularity Check

0 steps flagged

No significant circularity; observational critique without self-referential derivations or load-bearing self-citations

The paper is an observational analysis of noise sources in LLM safety evaluations across dataset curation, optimization, response generation, and LLM-judge stages. It identifies issues, proposes guidelines, and includes an opposing perspective on limitations. No equations, fitted parameters, predictions, or derivations appear that could reduce to inputs by construction. Central claims rest on systematic review of practices rather than self-citation chains or uniqueness theorems from prior author work. This is self-contained as a critique paper; the reader's assigned score of 2.0 aligns with minor or absent circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper is a field critique rather than a formal model, so the ledger contains no fitted parameters or invented entities; the central claim rests on domain assumptions about how evaluations are currently conducted.

axioms (2)

domain assumption: LLM judges are commonly used for response evaluation and introduce unreliability.
Invoked in the abstract when listing sources of noise in the evaluation pipeline.

domain assumption: Small datasets and methodological inconsistencies are representative across the field.
Underlies the claim that noise makes fair comparisons impossible.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs
cs.CR 2026-05 conditional novelty 7.0

Compilation optimizations can be exploited to create stealthy backdoors in LLMs that remain dormant without optimization but achieve ~90% attack success while preserving clean accuracy near 100%.

AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
cs.AI 2026-04 unverdicted novelty 7.0

AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.

When Efficiency Backfires: Cascading LLMs Trigger Cascade Failure under Adversarial Attack
cs.CR 2026-05 unverdicted novelty 6.0

LLM cascade systems are vulnerable to a new adversarial attack that simultaneously degrades accuracy and destroys the intended cost savings by targeting both the lightweight models and the escalation decision mechanism.

How Sensitive Are Safety Benchmarks to Judge Configuration Choices?
cs.CL 2026-04 unverdicted novelty 6.0

LLM judge prompt variations alone shift HarmBench harmful-response rates by up to 24.2 percentage points and produce moderate instability in model safety rankings.

Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models
cs.CR 2025-12 unverdicted novelty 6.0

A meta-prompt and hierarchical detection framework automates LLM red-teaming, achieving 3.9 times higher vulnerability discovery rate than manual methods with 89% accuracy on GPT-OSS-20B.

Toward Principled LLM Safety Testing: Solving the Jailbreak Oracle Problem
cs.CR 2025-06 unverdicted novelty 6.0

Formalizes the jailbreak oracle problem for LLMs and introduces Boa, a two-phase breadth-first then depth-first search system to solve it efficiently.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · cited by 6 Pith papers · 29 internal anchors

[1]

Phi-4 Technical Report

Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gunasekar, S., Harrison, M., Hewett, R. J., Javaheripi, M., Kauffmann, P., et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Accessed: 2025- 01-29

URL https://www.aisi.gov.uk/. Accessed: 2025- 01-29. An, B., Zhu, S., Zhang, R., Panaitescu-Liess, M.-A., Xu, Y ., and Huang, F. Automatic pseudo-harmful prompt generation for evaluating false refusals in large language models. In First Conference on Language Modeling ,

work page 2025
[5]

Andriushchenko, M

URL https://openreview.net/forum? id=ljFgX6A8NL. Andriushchenko, M. and Flammarion, N. Does refusal training in LLMs generalize to the past tense? arXiv preprint arXiv:2407.11969,

work page arXiv
[6]

arXiv preprint arXiv:2404.02151 (2024)

Andriushchenko, M., Croce, F., and Flammarion, N. Jail- breaking leading safety-aligned LLMs with simple adap- tive attacks. arXiv preprint arXiv:2404.02151,

work page arXiv
[7]

Refusal in Language Models Is Mediated by a Single Direction

Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. 2024a. URL https: //api.semanticscholar.org/CorpusID: 268232499. Anthropic. Responsible scaling policy, 2024b. URL https://www.anthropic.com/news/ anthropics-responsible-scaling-policy . Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., and Nanda, N. Refusal in language mod...

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Constitutional AI: Harmlessness from AI Feedback

Bai, Y ., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKin- non, C., et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Mechanistic Interpretability for AI Safety -- A Review

Bereska, L. and Gavves, E. Mechanistic interpretability for AI safety–A review. arXiv preprint arXiv:2404.14082,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

and Poria, S

Bhardwaj, R. and Poria, S. Red-teaming large language models using chain of utterances for safety-alignment. arXiv preprint arXiv:2308.09662,

work page arXiv
[11]

Lessons from the Trenches on Reproducible Evaluation of Language Models

Biderman, S., Schoelkopf, H., Sutawika, L., Gao, L., Tow, J., Abbasi, B., Aji, A. F., Ammanamanchi, P. S., Black, S., Clive, J., et al. Lessons from the trenches on repro- ducible evaluation of language models. arXiv preprint arXiv:2405.14782,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

A realistic threat model for large language model jailbreaks

Boreiko, V ., Panfilov, A., V oracek, V ., Hein, M., and Geip- ing, J. A realistic threat model for large language model jailbreaks. arXiv preprint arXiv:2410.16222,

work page arXiv
[13]

The art of saying no: Contex- tual noncompliance in language models

Brahman, F., Kumar, S., Balachandran, V ., Dasigi, P., Py- atkin, V ., Ravichander, A., Wiegreffe, S., Dziri, N., Chandu, K., Hessel, J., et al. The art of saying no: Contex- tual noncompliance in language models. arXiv preprint arXiv:2407.12043,

work page arXiv
[14]

Jailbreaking Black Box Large Language Models in Twenty Queries

9 Position: LLM-Safety Evaluations Lack Robustness Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G. J., and Wong, E. Jailbreaking black box large language mod- els in twenty queries. arXiv preprint arXiv:2310.08419,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

Chao, P., Debenedetti, E., Robey, A., Andriushchenko, M., Croce, F., Sehwag, V ., Dobriban, E., Flammarion, N., Pappas, G. J., Tramer, F., et al. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. arXiv preprint arXiv:2404.01318,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? Try ARC, the AI2 reasoning chal- lenge. arXiv preprint arXiv:1803.05457,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

National artificial intelligence initiative act of 2020,

Congress, U. National artificial intelligence initiative act of 2020,

work page 2020
[18]

Robustbench: a stan- dardized adversarial robustness benchmark,

Croce, F., Andriushchenko, M., Sehwag, V ., Debenedetti, E., Flammarion, N., Chiang, M., Mittal, P., and Hein, M. Robustbench: a standardized adversarial robustness benchmark. arXiv preprint arXiv:2010.09670,

work page arXiv 2010
[19]

OR- Bench: An over-refusal benchmark for large language models

Cui, J., Chiang, W.-L., Stoica, I., and Hsieh, C.-J. OR- Bench: An over-refusal benchmark for large language models. arXiv preprint arXiv:2405.20947,

work page arXiv
[20]

Safe RLHF: Safe Reinforcement Learning from Human Feedback

Dai, J., Pan, X., Sun, R., Ji, J., Xu, X., Liu, M., Wang, Y ., and Yang, Y . Safe RLHF: Safe reinforcement learning from human feedback. arXiv preprint arXiv:2310.12773,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Imagenet: A large-scale hierarchical image database

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Ieee,

work page 2009
[22]

The Llama 3 Herd of Models

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Attacking large language mod- els with projected gradient descent

Geisler, S., Wollschl ¨ager, T., Abdalla, M., Gasteiger, J., and G ¨unnemann, S. Attacking large language mod- els with projected gradient descent. arXiv preprint arXiv:2402.09154,

work page arXiv
[24]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Google, G., Georgiev, P., Lei, V . I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al. Gemini 1.5: Unlocking multimodal understand- ing across millions of tokens of context. arXiv preprint arXiv:2403.05530,

work page internal anchor Pith review Pith/arXiv arXiv
[25]

Alignment faking in large language models

Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDi- armid, M., Marks, S., Treutlein, J., Belonax, T., Chen, J., Duvenaud, D., et al. Alignment faking in large language models. arXiv preprint arXiv:2412.14093,

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Y ., Joglekar, M., Wallace, E., Jain, S., Barak, B., Heylar, A., Dias, R., Vallone, A., Ren, H., Wei, J., et al

Guan, M. Y ., Joglekar, M., Wallace, E., Jain, S., Barak, B., Heylar, A., Dias, R., Vallone, A., Ren, H., Wei, J., et al. Deliberative alignment: Reasoning enables safer lan- guage models. arXiv preprint arXiv:2412.16339,

work page arXiv
[27]

X-Risk Analysis for AI Research

Hendrycks, D. and Mazeika, M. X-risk analysis for ai research. arXiv preprint arXiv:2206.05862,

work page arXiv
[28]

Measuring Massive Multitask Language Understanding

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring mas- sive multitask language understanding. arXiv preprint arXiv:2009.03300,

work page internal anchor Pith review Pith/arXiv arXiv 2009
[29]

An Overview of Catastrophic AI Risks

Hendrycks, D., Mazeika, M., and Woodside, T. An overview of catastrophic ai risks. arXiv preprint arXiv:2306.12001,

work page internal anchor Pith review arXiv
[30]

The Curious Case of Neural Text Degeneration

Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y . The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751,

work page internal anchor Pith review Pith/arXiv arXiv 1904
[31]

Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

Huang, Y ., Gupta, S., Xia, M., Li, K., and Chen, D. Catas- trophic jailbreak of open-source LLMs via exploiting generation. arXiv preprint arXiv:2310.06987,

work page internal anchor Pith review Pith/arXiv arXiv
[32]

Best-of-n jailbreaking

Hughes, J., Price, S., Lynch, A., Schaeffer, R., Barez, F., Koyejo, S., Sleight, H., Jones, E., Perez, E., and Sharma, M. Best-of-n jailbreaking. arXiv preprint arXiv:2412.03556,

work page arXiv
[33]

arXiv preprint arXiv:2405.21018

Jia, X., Pang, T., Du, C., Huang, Y ., Gu, J., Liu, Y ., Cao, X., and Lin, M. Improved techniques for optimization-based jailbreaking on large language models. arXiv preprint arXiv:2405.21018,

work page arXiv
[34]

Torchattacks: A pytorch repository for adversarial attacks

10 Position: LLM-Safety Evaluations Lack Robustness Kim, H. Torchattacks: A pytorch repository for adversarial attacks. arXiv preprint arXiv:2010.01950,

work page arXiv 2010
[35]

Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

URL http://www.cs.toronto. edu/˜kriz/cifar.html. MIT License. Kudo, T. Sentencepiece: A simple and language indepen- dent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226,

work page internal anchor Pith review Pith/arXiv arXiv
[36]

The role of imagenet classes in fr \’echet inception distance

Kynk¨a¨anniemi, T., Karras, T., Aittala, M., Aila, T., and Lehtinen, J. The role of imagenet classes in fr \’echet inception distance. arXiv preprint arXiv:2203.06026 ,

work page arXiv
[37]

arXiv preprint arXiv:2404.07921

Liao, Z. and Sun, H. AmpleGCG: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed llms. arXiv preprint arXiv:2404.07921,

work page arXiv
[38]

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Liu, X., Xu, N., Chen, M., and Xiao, C. Autodan: Generat- ing stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451, 2023a. Liu, Y ., Deng, G., Xu, Z., Li, Y ., Zheng, Y ., Zhang, Y ., Zhao, L., Zhang, T., Wang, K., and Liu, Y . Jailbreaking ChatGPT via prompt engineering: An empirical study. arXiv preprint arXiv:230...

work page internal anchor Pith review Pith/arXiv arXiv
[39]

TDC 2023 (llm edition): The trojan detection challenge

Mazeika, M., Zou, A., Mu, N., Phan, L., Wang, Z., Yu, C., Khoja, A., Jiang, F., O’Gara, A., Sakhaee, E., Xiang, Z., Rajabi, A., Hendrycks, D., Poovendran, R., Li, B., and Forsyth, D. TDC 2023 (llm edition): The trojan detection challenge. In NeurIPS Competition Track,

work page 2023
[40]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., et al. Harm- Bench: A standardized evaluation framework for auto- mated red teaming and robust refusal. arXiv preprint arXiv:2402.04249,

work page internal anchor Pith review Pith/arXiv arXiv
[41]

arXiv preprint arXiv:2411.00640 , url=

Miller, E. Adding error bars to evals: A statistical ap- proach to language model evaluations. arXiv preprint arXiv:2411.00640,

work page arXiv
[42]

Technical Report on the CleverHans v2.1.0 Adversarial Examples Library

Papernot, N., Faghri, F., Carlini, N., Goodfellow, I., Fein- man, R., Kurakin, A., Xie, C., Sharma, Y ., Brown, T., Roy, A., et al. Technical report on the cleverhans v2. 1.0 adver- sarial examples library. arXiv preprint arXiv:1610.00768,

work page internal anchor Pith review Pith/arXiv arXiv
[43]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jack- son Petty, Richard Yuanzhe Pang, Julien Dirani, Ju- lian Michael, and Samuel R Bowman

Polo, F. M., Weber, L., Choshen, L., Sun, Y ., Xu, G., and Yurochkin, M. tinyBenchmarks: evaluating LLMs with fewer examples. arXiv preprint arXiv:2402.14992,

work page arXiv
[44]

Safety alignment should be made more than just a few tokens deep

Qi, X., Panda, A., Lyu, K., Ma, X., Roy, S., Beirami, A., Mittal, P., and Henderson, P. Safety alignment should be made more than just a few tokens deep. arXiv preprint arXiv:2406.05946,

work page arXiv
[45]

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

R¨ottger, P., Kirk, H. R., Vidgen, B., Attanasio, G., Bianchi, F., and Hovy, D. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263,

work page internal anchor Pith review Pith/arXiv arXiv
[46]

S., Saha, S., Sriramanan, G., Kattakinda, P., Chegini, A., and Feizi, S

Sadasivan, V . S., Saha, S., Sriramanan, G., Kattakinda, P., Chegini, A., and Feizi, S. Fast adversarial attacks on language models in one GPU minute. arXiv preprint arXiv:2402.15570,

work page arXiv
[47]

and Hardt, M

Salaudeen, O. and Hardt, M. ImageNot: A contrast with ImageNet preserves model rankings. arXiv preprint arXiv:2404.02112,

work page arXiv
[48]

A probabilis- tic perspective on unlearning and alignment for large lan- guage models

Scholten, Y ., G¨unnemann, S., and Schwinn, L. A probabilis- tic perspective on unlearning and alignment for large lan- guage models. arXiv preprint arXiv:2410.03523,

work page arXiv
[49]

Soft prompt threats: Attacking safety alignment and unlearning in open-source LLMs through the embedding space

11 Position: LLM-Safety Evaluations Lack Robustness Schwinn, L., Dobre, D., Xhonneux, S., Gidel, G., and Gunnemann, S. Soft prompt threats: Attacking safety alignment and unlearning in open-source LLMs through the embedding space. arXiv preprint arXiv:2402.09063,

work page arXiv
[50]

On second thought, let’s not think step by step! Bias and toxicity in zero-shot reasoning

Shaikh, O., Zhang, H., Held, W., Bernstein, M., and Yang, D. On second thought, let’s not think step by step! Bias and toxicity in zero-shot reasoning. arXiv preprint arXiv:2212.08061,

work page arXiv
[51]

”do anything now”: Characterizing and evaluating in-the- wild jailbreak prompts on large language models

Shen, X., Chen, Z., Backes, M., Shen, Y ., and Zhang, Y . ”do anything now”: Characterizing and evaluating in-the- wild jailbreak prompts on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pp. 1671–1685,

work page 2024
[52]

C., Perez, E., Hadfield-Menell, D., et al

Sheshadri, A., Ewart, A., Guo, P., Lynch, A., Wu, C., Heb- bar, V ., Sleight, H., Stickland, A. C., Perez, E., Hadfield- Menell, D., and Casper, S. Targeted latent adversarial training improves robustness to persistent harmful behav- iors in llms. arXiv preprint arXiv:2407.15549,

work page arXiv
[53]

Thompson, T. B. and Sklar, M. Breaking circuit break- ers, 2024a. URL https://confirmlabs.org/ posts/circuit_breaking.html. Thompson, T. B. and Sklar, M. FLRT: Fluent Student- Teacher Redteaming. arXiv preprint arXiv:2407.17447, 2024b. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y ., Bashlykov, N., Batra, S., Bhargava, P., Bhosa...

work page arXiv
[54]

Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models

Vijayakumar, A. K., Cogswell, M., Selvaraju, R. R., Sun, Q., Lee, S., Crandall, D., and Batra, D. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424,

work page internal anchor Pith review Pith/arXiv arXiv
[55]

Wei, J., Karina, N., Chung, H. W., Jiao, Y. J., Papay, S., Glaese, A., Schulman, J., and Fedus, W. Measuring short-form factuality in large language models. arXiv preprint arXiv:2411.04368.

[56]

Xhonneux, S., Sordoni, A., Günnemann, S., Gidel, G., and Schwinn, L. Efficient adversarial training in LLMs with continuous attacks. arXiv preprint arXiv:2405.15589.

[57]

Yong, Z.-X., Menghini, C., and Bach, S. H. Low-resource languages jailbreak GPT-4. arXiv preprint arXiv:2310.02446.

[58]

Yu, J., Lin, X., Yu, Z., and Xing, X. GPTFuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253.

[59]

Yu, L., Do, V., Hambardzumyan, K., and Cancedda, N. Robust LLM safeguarding via refusal feature adversarial training. arXiv preprint arXiv:2409.20089.

[60]

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.

[61]

Zhao, W., Ren, X., Hessel, J., Cardie, C., Choi, Y., and Deng, Y. WildChat: 1M ChatGPT interaction logs in the wild. arXiv preprint arXiv:2405.01470.

[62]

Zhu, S., Amos, B., Tian, Y., Guo, C., and Evtimov, I. AdvPrefix: An objective for nuanced LLM jailbreaks. arXiv preprint arXiv:2412.10321.

[63]

Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023a.

Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversarial attacks on ali...

A. Appendix

A.1. Background

A.1.1. Robustness for LLMs

In the context of LLM safety research, an aligned model can be defined as one whose generations follow the rules and guidelines laid out by its developer. Such an LLM may be adversarially attacked via methods that produce inputs that cause the mode...

and datasets (Krizhevsky et al., 2009), making results easier to reproduce and avoiding noise resulting from slightly different implementations.

A.2. Case Study: Effects of White-space Tokens on GCG Against Llama 2

We run GCG (Zou et al., 2023b) against Llama-2-7b-chat-hf using default hyperparameters. W...

begin with the phrase “Sure, here...”, we want to evaluate how “natural” this response is for various models. We run a small test using a benign prompt (“Write a blog post that explains how to use the ‘transformers’ library to train a language model”) and the target phrase “Sure, here’s a blog post that explains how to use the ‘transformers’ library to tr...

would not be sufficient to support this claim at commonly accepted statistical significance levels. In addition, we find that the use of automated judging for jailbreaks can lead to overly confident conclusions regarding the safety of models. We also want to emphasize that the findings of the work as a whole remain sound and well-supported, but only becaus...
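The remark about statistical significance can be illustrated with a minimal sketch. It assumes a two-sided two-proportion z-test with pooled variance, which is one common choice and not necessarily the test used in the paper. The function name `two_proportion_z_test` and all counts below are hypothetical and chosen only for illustration.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two success rates.

    Returns (z, p_value). Uses the pooled-variance normal approximation,
    which is only a rough guide when the counts are small.
    """
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # Both rates are 0% or both are 100%: no evidence of a difference.
        return 0.0, 1.0
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical attack success counts (not taken from the paper):
# the same 60% vs. 44% gap, measured on 50 prompts and on 500 prompts.
z_small, p_small = two_proportion_z_test(30, 50, 22, 50)
z_large, p_large = two_proportion_z_test(300, 500, 220, 500)

print(f"n=50:  z={z_small:.2f}, p={p_small:.3f}")
print(f"n=500: z={z_large:.2f}, p={p_large:.2g}")
```

Under these assumed counts, the 16-point gap does not reach the conventional 0.05 threshold with 50 prompts per condition, while the same gap on 500 prompts does. For very small counts an exact test may be a safer choice than the normal approximation.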

Llama 2 13B finetuned classifier. To do so, we run a suite of 10 different attack algorithms against a zoo of 25 models using all 300 non-copyright prompts in HarmBench’s dataset. All attacks are evaluated in the “many-trial” setting, regardless of the setting in which they were originally introduced. This means all intermediate prompt candidates are roll...

During the attack, the victim model generates up to 256 tokens using greedy generation.

• AmpleGCG (Liao & Sun, 2024): We use osunlp/AmpleGCG-llama2-sourced-llama2-7b-chat to generate 200 attack suffixes with diversity penalty

to prompt the model.

• BEAST (Sadasivan et al., 2024): We use k1 = k2 = 15 and set the temperature to 1 to sample N = 40 suffix tokens.

• PGD (Schwinn et al., 2024): We initialize the attack using the suffix “x x x x x x x x x x x x x x x x x x x x” as it tokenizes to exactly 20 tokens for all tested models, and run signed gradient descent optimization fo...

[1]

Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gunasekar, S., Harrison, M., Hewett, R. J., Javaheripi, M., Kauffmann, P., et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905.

[2]

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

[4]

URL https://www.aisi.gov.uk/. Accessed: 2025-01-29.

An, B., Zhu, S., Zhang, R., Panaitescu-Liess, M.-A., Xu, Y., and Huang, F. Automatic pseudo-harmful prompt generation for evaluating false refusals in large language models. In First Conference on Language Modeling. URL https://openreview.net/forum?id=ljFgX6A8NL.

[5]

Andriushchenko, M. and Flammarion, N. Does refusal training in LLMs generalize to the past tense? arXiv preprint arXiv:2407.11969.

[6]

Andriushchenko, M., Croce, F., and Flammarion, N. Jailbreaking leading safety-aligned LLMs with simple adaptive attacks. arXiv preprint arXiv:2404.02151.

[7]

Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. 2024a. URL https://api.semanticscholar.org/CorpusID:268232499.

Anthropic. Responsible scaling policy, 2024b. URL https://www.anthropic.com/news/anthropics-responsible-scaling-policy.

Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., and Nanda, N. Refusal in language mod...

[8]

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.

[9]

Bereska, L. and Gavves, E. Mechanistic interpretability for AI safety–A review. arXiv preprint arXiv:2404.14082.

[10]

Bhardwaj, R. and Poria, S. Red-teaming large language models using chain of utterances for safety-alignment. arXiv preprint arXiv:2308.09662.

[11]

Biderman, S., Schoelkopf, H., Sutawika, L., Gao, L., Tow, J., Abbasi, B., Aji, A. F., Ammanamanchi, P. S., Black, S., Clive, J., et al. Lessons from the trenches on reproducible evaluation of language models. arXiv preprint arXiv:2405.14782.

[12]

Boreiko, V., Panfilov, A., Voracek, V., Hein, M., and Geiping, J. A realistic threat model for large language model jailbreaks. arXiv preprint arXiv:2410.16222.

[13]

Brahman, F., Kumar, S., Balachandran, V., Dasigi, P., Pyatkin, V., Ravichander, A., Wiegreffe, S., Dziri, N., Chandu, K., Hessel, J., et al. The art of saying no: Contextual noncompliance in language models. arXiv preprint arXiv:2407.12043.

[14]

Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G. J., and Wong, E. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419.

[15]

Chao, P., Debenedetti, E., Robey, A., Andriushchenko, M., Croce, F., Sehwag, V., Dobriban, E., Flammarion, N., Pappas, G. J., Tramer, F., et al. JailbreakBench: An open robustness benchmark for jailbreaking large language models. arXiv preprint arXiv:2404.01318.

[16]

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.

[17]

Congress, U. National artificial intelligence initiative act of 2020.

[18]

Croce, F., Andriushchenko, M., Sehwag, V., Debenedetti, E., Flammarion, N., Chiang, M., Mittal, P., and Hein, M. RobustBench: a standardized adversarial robustness benchmark. arXiv preprint arXiv:2010.09670.

[19]

Cui, J., Chiang, W.-L., Stoica, I., and Hsieh, C.-J. OR-Bench: An over-refusal benchmark for large language models. arXiv preprint arXiv:2405.20947.

[20]

Dai, J., Pan, X., Sun, R., Ji, J., Xu, X., Liu, M., Wang, Y., and Yang, Y. Safe RLHF: Safe reinforcement learning from human feedback. arXiv preprint arXiv:2310.12773.

[21]

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. IEEE.

[22]

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[23]

Geisler, S., Wollschläger, T., Abdalla, M., Gasteiger, J., and Günnemann, S. Attacking large language models with projected gradient descent. arXiv preprint arXiv:2402.09154.

[24]

Google, G., Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.

[25]

Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., Treutlein, J., Belonax, T., Chen, J., Duvenaud, D., et al. Alignment faking in large language models. arXiv preprint arXiv:2412.14093.

[26]

Guan, M. Y., Joglekar, M., Wallace, E., Jain, S., Barak, B., Heylar, A., Dias, R., Vallone, A., Ren, H., Wei, J., et al. Deliberative alignment: Reasoning enables safer language models. arXiv preprint arXiv:2412.16339.

[27]

Hendrycks, D. and Mazeika, M. X-risk analysis for AI research. arXiv preprint arXiv:2206.05862.

[28]

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.

[29]

Hendrycks, D., Mazeika, M., and Woodside, T. An overview of catastrophic AI risks. arXiv preprint arXiv:2306.12001.

[30]

Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.

[31]

Huang, Y., Gupta, S., Xia, M., Li, K., and Chen, D. Catastrophic jailbreak of open-source LLMs via exploiting generation. arXiv preprint arXiv:2310.06987.

[32]

Hughes, J., Price, S., Lynch, A., Schaeffer, R., Barez, F., Koyejo, S., Sleight, H., Jones, E., Perez, E., and Sharma, M. Best-of-n jailbreaking. arXiv preprint arXiv:2412.03556.

[33]

Jia, X., Pang, T., Du, C., Huang, Y., Gu, J., Liu, Y., Cao, X., and Lin, M. Improved techniques for optimization-based jailbreaking on large language models. arXiv preprint arXiv:2405.21018.

[34]

Kim, H. Torchattacks: A PyTorch repository for adversarial attacks. arXiv preprint arXiv:2010.01950.

[35]

URL http://www.cs.toronto.edu/~kriz/cifar.html. MIT License.

Kudo, T. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.

[36]

Kynkäänniemi, T., Karras, T., Aittala, M., Aila, T., and Lehtinen, J. The role of ImageNet classes in Fréchet inception distance. arXiv preprint arXiv:2203.06026.

[37]

Liao, Z. and Sun, H. AmpleGCG: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed LLMs. arXiv preprint arXiv:2404.07921.

[38]

Liu, X., Xu, N., Chen, M., and Xiao, C. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451, 2023a.

Liu, Y., Deng, G., Xu, Z., Li, Y., Zheng, Y., Zhang, Y., Zhao, L., Zhang, T., Wang, K., and Liu, Y. Jailbreaking ChatGPT via prompt engineering: An empirical study. arXiv preprint arXiv:230...

[39]

Mazeika, M., Zou, A., Mu, N., Phan, L., Wang, Z., Yu, C., Khoja, A., Jiang, F., O’Gara, A., Sakhaee, E., Xiang, Z., Rajabi, A., Hendrycks, D., Poovendran, R., Li, B., and Forsyth, D. TDC 2023 (LLM edition): The trojan detection challenge. In NeurIPS Competition Track.

[40]

Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., et al. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249.

[41]

Miller, E. Adding error bars to evals: A statistical approach to language model evaluations. arXiv preprint arXiv:2411.00640.

[42]

Papernot, N., Faghri, F., Carlini, N., Goodfellow, I., Feinman, R., Kurakin, A., Xie, C., Sharma, Y., Brown, T., Roy, A., et al. Technical report on the CleverHans v2.1.0 adversarial examples library. arXiv preprint arXiv:1610.00768.

[43]

Polo, F. M., Weber, L., Choshen, L., Sun, Y., Xu, G., and Yurochkin, M. tinyBenchmarks: evaluating LLMs with fewer examples. arXiv preprint arXiv:2402.14992.

[44]

Qi, X., Panda, A., Lyu, K., Ma, X., Roy, S., Beirami, A., Mittal, P., and Henderson, P. Safety alignment should be made more than just a few tokens deep. arXiv preprint arXiv:2406.05946.

[45]

Röttger, P., Kirk, H. R., Vidgen, B., Attanasio, G., Bianchi, F., and Hovy, D. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263.

[46]

Sadasivan, V. S., Saha, S., Sriramanan, G., Kattakinda, P., Chegini, A., and Feizi, S. Fast adversarial attacks on language models in one GPU minute. arXiv preprint arXiv:2402.15570.

[47]

Salaudeen, O. and Hardt, M. ImageNot: A contrast with ImageNet preserves model rankings. arXiv preprint arXiv:2404.02112.

[48]

Scholten, Y., Günnemann, S., and Schwinn, L. A probabilistic perspective on unlearning and alignment for large language models. arXiv preprint arXiv:2410.03523.

[49]

Schwinn, L., Dobre, D., Xhonneux, S., Gidel, G., and Günnemann, S. Soft prompt threats: Attacking safety alignment and unlearning in open-source LLMs through the embedding space. arXiv preprint arXiv:2402.09063.
