arxiv: 2503.02574 · v2 · pith:XWX6Y5XSnew · submitted 2025-03-04 · 💻 cs.CR · cs.AI

LLM-Safety Evaluations Lack Robustness

Pith reviewed 2026-05-23 01:23 UTC · model grok-4.3

keywords: LLM safety · evaluation robustness · red-teaming · LLM judges · safety alignment · noise sources · dataset curation · response evaluation

The pith

Safety evaluations for large language models are too noisy to allow fair comparisons of attacks and defenses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that safety alignment research for LLMs is slowed by noise from small datasets, inconsistent methods, and unreliable evaluation setups that prevent fair comparisons of attacks and defenses. It breaks down the evaluation pipeline into four stages and identifies specific problems at each one. Guidelines are proposed to reduce this noise and bias in future work. An opposing perspective on practical reasons for the limitations is also included. Addressing the issues would enable more reliable results and steadier progress in the field.

Core claim

We argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise, such as small datasets, methodological inconsistencies, and unreliable evaluation setups. This can, at times, make it impossible to evaluate and compare attacks and defenses fairly, thereby slowing progress. We systematically analyze the LLM safety evaluation pipeline, covering dataset curation, optimization strategies for automated red-teaming, response generation, and response evaluation using LLM judges. At each stage, we identify key issues and highlight their practical impact. We also propose a set of guidelines for reducing noise and bias in evaluations.

What carries the argument

Systematic stage-by-stage analysis of the LLM safety evaluation pipeline covering dataset curation, optimization strategies for automated red-teaming, response generation, and LLM-judge evaluation.

Load-bearing premise

The listed sources of noise across the evaluation stages are the dominant factors that make fair comparisons impossible, rather than other unmentioned factors or mitigations.

What would settle it

A study that applies identical attacks and defenses to multiple standardized setups and obtains consistent performance rankings would show that fair comparisons remain possible despite the noise.
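A minimal sketch of how such a ranking-consistency check could be scored, assuming each setup yields one attack success rate (ASR) per attack. Kendall's tau is our choice of statistic here, not something the paper prescribes, and the ASR values are made-up placeholders rather than results from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same items.

    Tied pairs count as neither concordant nor discordant.
    Requires at least two items.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical ASR values for the same four attacks under two evaluation setups.
asr_setup_1 = [0.62, 0.48, 0.35, 0.20]
asr_setup_2 = [0.55, 0.58, 0.30, 0.22]

tau = kendall_tau(asr_setup_1, asr_setup_2)
print(f"Kendall tau between setups: {tau:.3f}")
```

A tau near 1 would indicate that the setups agree on which attacks are stronger; values near 0 would indicate that the rankings depend largely on the setup.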

Figures

Figures reproduced from arXiv: 2503.02574 by Gauthier Gidel, Leo Schwinn, Simon Geisler, Sophie Xhonneux, Stephan Günnemann, Tim Beyer.

Figure 1: We decompose the LLM safety evaluation pipeline into three stages and analyze common problems at each step.
Figure 2: Impact of the SentencePiece white space token on ...
Figure 3: Comparing ASR differences across common judges reveals significant model-specific bias, despite all four models being derivatives of the same Mistral 7B model.
Figure 4: Impact of white-space tokens on GCG attack progress vs. Llama-2-7b-chat-hf.
Figure 5: Impact of implementation details on GCG attack progress vs. Llama-3.1-8B-Instruct.
Figure 6: Average loss for HarmBench-style affirmative target sequence using a harmless prompt.
Figure 7: Loss for the first token (“Sure”) for a HarmBench-style affirmative target using a harmless prompt.
Figure 8: Comparing the ASR of two judge models. We consider an attack successful if StrongReject gives a score higher ...
Original abstract

In this paper, we argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise, such as small datasets, methodological inconsistencies, and unreliable evaluation setups. This can, at times, make it impossible to evaluate and compare attacks and defenses fairly, thereby slowing progress. We systematically analyze the LLM safety evaluation pipeline, covering dataset curation, optimization strategies for automated red-teaming, response generation, and response evaluation using LLM judges. At each stage, we identify key issues and highlight their practical impact. We also propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers. Lastly, we offer an opposing perspective, highlighting practical reasons for existing limitations. We believe that addressing the outlined problems in future research will improve the field's ability to generate easily comparable results and make measurable progress.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that LLM safety alignment research is hindered by intertwined sources of noise such as small datasets, methodological inconsistencies, and unreliable evaluation setups. It argues these can make it impossible at times to evaluate and compare attacks and defenses fairly. The manuscript systematically analyzes the pipeline stages of dataset curation, optimization strategies for automated red-teaming, response generation, and LLM-judge evaluation; identifies key issues and their practical impact at each stage; proposes guidelines for reducing noise and bias; and presents an opposing perspective on practical reasons for existing limitations.

Significance. If the analysis holds and demonstrates that the enumerated noise sources dominate over mitigations to the point of rendering fair comparisons impossible, the work could encourage more rigorous and standardized evaluation practices in LLM safety, potentially enabling measurable progress. The systematic breakdown across stages, the proposed guidelines, and the inclusion of an opposing perspective are constructive elements that could aid researchers in improving reproducibility.

major comments (3)
  1. [Abstract] The claim that the identified issues 'can, at times, make it impossible to evaluate and compare attacks and defenses fairly' is presented without quantitative evidence, error analysis, or concrete examples showing how the noise affects specific comparisons or why existing mitigations in the literature fail to permit reliable evaluations.
  2. [Systematic analysis sections] (dataset curation, optimization, response generation, LLM-judge evaluation) The assumption that the listed issues are the dominant and representative sources of noise that render fair comparisons impossible is not verified; no direct comparison to or falsification against mitigations already present in the literature is provided to establish that the problems are load-bearing for the headline conclusion.
  3. [Guidelines section] The proposed set of guidelines for reducing noise is not accompanied by any demonstration, case study, or before/after quantification showing that their adoption would enable fair comparisons where it was previously impossible.
minor comments (2)
  1. The manuscript would benefit from explicit definitions or operationalizations of key terms such as 'noise', 'fair comparison', and 'impossible' to make the argument more precise.
  2. Adding citations to specific prior works that exemplify each identified issue (e.g., particular papers using small datasets or LLM judges) would strengthen the practical impact discussion.
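One hedged way to operationalize 'noise' from small datasets is a confidence interval on the attack success rate (ASR). The sketch below uses the Wilson score interval, which is an assumption on our part rather than the paper's method, and the counts are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical counts: two attacks evaluated on the same 50 prompts.
lo_a, hi_a = wilson_interval(20, 50)  # attack A: 20 of 50 judged successful
lo_b, hi_b = wilson_interval(25, 50)  # attack B: 25 of 50 judged successful

print(f"attack A ASR 0.40, interval [{lo_a:.3f}, {hi_a:.3f}]")
print(f"attack B ASR 0.50, interval [{lo_b:.3f}, {hi_b:.3f}]")
print("intervals overlap:", hi_a > lo_b)
```

Under these assumed counts the two intervals overlap, so a ten-point ASR gap on 50 prompts would not by itself separate the two attacks. Overlapping intervals are only a rough screen; a paired test on per-prompt outcomes would be the more careful comparison.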

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. Our manuscript is a position and analysis paper that surveys noise sources across the LLM safety evaluation pipeline based on existing literature, rather than an empirical study with new experiments. We address each major comment below.

  1. Referee: [Abstract] The claim that the identified issues 'can, at times, make it impossible to evaluate and compare attacks and defenses fairly' is presented without quantitative evidence, error analysis, or concrete examples showing how the noise affects specific comparisons or why existing mitigations in the literature fail to permit reliable evaluations.

    Authors: The abstract summarizes the core argument, with the full manuscript providing concrete examples and citations from published work (e.g., inconsistent attack success rates across papers due to dataset size differences or judge model variations in sections on dataset curation and LLM-judge evaluation). We agree the abstract could be strengthened with a brief illustrative reference and will revise it accordingly to point readers to specific cases discussed later in the paper.

    Revision: partial

  2. Referee: [Systematic analysis sections] (dataset curation, optimization, response generation, LLM-judge evaluation) The assumption that the listed issues are the dominant and representative sources of noise that render fair comparisons impossible is not verified; no direct comparison to or falsification against mitigations already present in the literature is provided to establish that the problems are load-bearing for the headline conclusion.

    Authors: The analysis draws on a broad survey of the literature to identify recurring issues that persist in practice. The opposing perspective section already discusses practical reasons for limitations and some existing mitigations. To strengthen this, we will expand the analysis sections with additional explicit comparisons to common mitigations (e.g., larger datasets or multi-judge setups) and why they often remain insufficient based on observed inconsistencies across papers.

    Revision: partial

  3. Referee: [Guidelines section] The proposed set of guidelines for reducing noise is not accompanied by any demonstration, case study, or before/after quantification showing that their adoption would enable fair comparisons where it was previously impossible.

    Authors: The guidelines are proposed as direct responses to the noise sources identified in the analysis and are presented as recommendations for future work. A before/after quantification or new case study would require conducting fresh experiments, which falls outside the scope of this analysis paper. We can revise the guidelines section to include more detailed rationale linking each guideline to specific noise sources and expected benefits.

    Revision: no

Circularity Check

0 steps flagged

No significant circularity; observational critique without self-referential derivations or load-bearing self-citations

The paper is an observational analysis of noise sources in LLM safety evaluations across dataset curation, optimization, response generation, and LLM-judge stages. It identifies issues, proposes guidelines, and includes an opposing perspective on limitations. No equations, fitted parameters, predictions, or derivations appear that could reduce to inputs by construction. Central claims rest on systematic review of practices rather than self-citation chains or uniqueness theorems from prior author work. This is self-contained as a critique paper; the reader's assigned score of 2.0 aligns with minor or absent circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper is a field critique rather than a formal model, so the ledger contains no fitted parameters or invented entities; the central claim rests on domain assumptions about how evaluations are currently conducted.

axioms (2)
  • Domain assumption: LLM judges are commonly used for response evaluation and introduce unreliability.
    Invoked in the abstract when listing sources of noise in the evaluation pipeline.
  • Domain assumption: Small datasets and methodological inconsistencies are representative across the field.
    Underlies the claim that noise makes fair comparisons impossible.
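The first assumption could be probed by measuring how often two judges agree on the same responses. The following is a minimal sketch using Cohen's kappa on binary verdicts (1 = judged harmful, 0 = judged safe). The choice of kappa and the verdict lists are illustrative assumptions, not data or methodology taken from the paper.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters giving binary (0/1) labels to the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty label lists of equal length")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    rate_a = sum(labels_a) / n
    rate_b = sum(labels_b) / n
    # Agreement expected by chance if the raters labeled independently.
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:
        # Both raters gave one identical constant label; kappa is undefined,
        # so we return 1.0 by convention since they agree on every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts from two judge models on the same ten responses.
judge_a = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
judge_b = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]

asr_a = sum(judge_a) / len(judge_a)
asr_b = sum(judge_b) / len(judge_b)
kappa = cohens_kappa(judge_a, judge_b)
print(f"ASR under judge A: {asr_a:.2f}, under judge B: {asr_b:.2f}, kappa: {kappa:.2f}")
```

In this toy case the judges agree on eight of ten responses, yet the reported ASR differs by twenty points. That is the kind of judge-dependent gap the assumption refers to.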

pith-pipeline@v0.9.0 · 5680 in / 1323 out tokens · 28926 ms · 2026-05-23T01:23:48.706467+00:00 · methodology

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs

    cs.CR · 2026-05 · conditional · novelty 7.0

    Compilation optimizations can be exploited to create stealthy backdoors in LLMs that remain dormant without optimization but achieve ~90% attack success while preserving clean accuracy near 100%.

  2. AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents

    cs.AI · 2026-04 · unverdicted · novelty 7.0

    AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.

  3. When Efficiency Backfires: Cascading LLMs Trigger Cascade Failure under Adversarial Attack

    cs.CR · 2026-05 · unverdicted · novelty 6.0

    LLM cascade systems are vulnerable to a new adversarial attack that simultaneously degrades accuracy and destroys the intended cost savings by targeting both the lightweight models and the escalation decision mechanism.

  4. How Sensitive Are Safety Benchmarks to Judge Configuration Choices?

    cs.CL · 2026-04 · unverdicted · novelty 6.0

    LLM judge prompt variations alone shift HarmBench harmful-response rates by up to 24.2 percentage points and produce moderate instability in model safety rankings.

  5. Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models

    cs.CR · 2025-12 · unverdicted · novelty 6.0

    A meta-prompt and hierarchical detection framework automates LLM red-teaming, achieving 3.9 times higher vulnerability discovery rate than manual methods with 89% accuracy on GPT-OSS-20B.

  6. Toward Principled LLM Safety Testing: Solving the Jailbreak Oracle Problem

    cs.CR · 2025-06 · unverdicted · novelty 6.0

    Formalizes the jailbreak oracle problem for LLMs and introduces Boa, a two-phase breadth-first then depth-first search system to solve it efficiently.
