LLM-Safety Evaluations Lack Robustness
Pith reviewed 2026-05-23 01:23 UTC · model grok-4.3
The pith
Safety evaluations for large language models are too noisy to allow fair comparisons of attacks and defenses.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise, such as small datasets, methodological inconsistencies, and unreliable evaluation setups. This can, at times, make it impossible to evaluate and compare attacks and defenses fairly, thereby slowing progress. We systematically analyze the LLM safety evaluation pipeline, covering dataset curation, optimization strategies for automated red-teaming, response generation, and response evaluation using LLM judges. At each stage, we identify key issues and highlight their practical impact. We also propose a set of guidelines for reducing noise and bias in evaluations.
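To make the small-dataset concern concrete, here is a minimal sketch (not taken from the paper) of how uncertainty in a measured attack success rate (ASR) shrinks with the number of prompts. It uses the standard Wilson score interval; the function name `wilson_interval`, the 0.40 ASR, and the dataset sizes are illustrative assumptions.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion, e.g. an attack success rate."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical example: the same observed ASR of 0.40 on datasets of different sizes.
for n in (50, 300, 5000):
    low, high = wilson_interval(round(0.4 * n), n)
    print(f"n={n:5d}  ASR=0.40  95% CI=({low:.3f}, {high:.3f})")
```

Under these assumptions, the interval at 50 prompts spans more than 20 percentage points, so two attacks whose reported ASRs differ by a few points cannot be separated on a dataset of that size.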
What carries the argument
Systematic stage-by-stage analysis of the LLM safety evaluation pipeline covering dataset curation, optimization strategies for automated red-teaming, response generation, and LLM-judge evaluation.
Load-bearing premise
The listed sources of noise across the evaluation stages are the dominant factors that make fair comparisons impossible, rather than other unmentioned factors or mitigations.
What would settle it
A study that applies identical attacks and defenses to multiple standardized setups and obtains consistent performance rankings would show that fair comparisons remain possible despite the noise.
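One way such a study could quantify "consistent performance rankings" is a rank correlation between setups. The sketch below computes Kendall's tau-a in plain Python; the attack names and success rates are made-up placeholders, and this metric is an assumption for illustration rather than something the paper prescribes.

```python
from itertools import combinations


def kendall_tau(scores_a: dict[str, float], scores_b: dict[str, float]) -> float:
    """Kendall tau-a over methods present in both setups (tied pairs count as neither)."""
    methods = sorted(set(scores_a) & set(scores_b))
    if len(methods) < 2:
        raise ValueError("need at least two shared methods")
    concordant = discordant = 0
    for m1, m2 in combinations(methods, 2):
        sign = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(methods) * (len(methods) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical attack success rates under two evaluation setups (made-up numbers).
setup_1 = {"attack_A": 0.62, "attack_B": 0.48, "attack_C": 0.31}
setup_2 = {"attack_A": 0.40, "attack_B": 0.52, "attack_C": 0.22}
print(kendall_tau(setup_1, setup_2))
```

A value near 1 across many setups would support the claim that rankings survive the noise; values near 0 or below would indicate that the ordering of methods depends on the setup.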
Original abstract
In this paper, we argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise, such as small datasets, methodological inconsistencies, and unreliable evaluation setups. This can, at times, make it impossible to evaluate and compare attacks and defenses fairly, thereby slowing progress. We systematically analyze the LLM safety evaluation pipeline, covering dataset curation, optimization strategies for automated red-teaming, response generation, and response evaluation using LLM judges. At each stage, we identify key issues and highlight their practical impact. We also propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers. Lastly, we offer an opposing perspective, highlighting practical reasons for existing limitations. We believe that addressing the outlined problems in future research will improve the field's ability to generate easily comparable results and make measurable progress.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLM safety alignment research is hindered by intertwined sources of noise such as small datasets, methodological inconsistencies, and unreliable evaluation setups. It argues these can make it impossible at times to evaluate and compare attacks and defenses fairly. The manuscript systematically analyzes the pipeline stages of dataset curation, optimization strategies for automated red-teaming, response generation, and LLM-judge evaluation; identifies key issues and their practical impact at each stage; proposes guidelines for reducing noise and bias; and presents an opposing perspective on practical reasons for existing limitations.
Significance. If the analysis holds and demonstrates that the enumerated noise sources dominate over mitigations to the point of rendering fair comparisons impossible, the work could encourage more rigorous and standardized evaluation practices in LLM safety, potentially enabling measurable progress. The systematic breakdown across stages, the proposed guidelines, and the inclusion of an opposing perspective are constructive elements that could aid researchers in improving reproducibility.
major comments (3)
- [Abstract] The claim that the identified issues 'can, at times, make it impossible to evaluate and compare attacks and defenses fairly' is presented without quantitative evidence, error analysis, or concrete examples showing how the noise affects specific comparisons or why existing mitigations in the literature fail to permit reliable evaluations.
- [Systematic analysis sections] (dataset curation, optimization, response generation, LLM-judge evaluation) The assumption that the listed issues are the dominant and representative sources of noise that render fair comparisons impossible is not verified; no direct comparison to or falsification against mitigations already present in the literature is provided to establish that the problems are load-bearing for the headline conclusion.
- [Guidelines section] The proposed set of guidelines for reducing noise is not accompanied by any demonstration, case study, or before/after quantification showing that their adoption would enable fair comparisons where it was previously impossible.
minor comments (2)
- The manuscript would benefit from explicit definitions or operationalizations of key terms such as 'noise', 'fair comparison', and 'impossible' to make the argument more precise.
- Adding citations to specific prior works that exemplify each identified issue (e.g., particular papers using small datasets or LLM judges) would strengthen the practical impact discussion.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. Our manuscript is a position and analysis paper that surveys noise sources across the LLM safety evaluation pipeline based on existing literature, rather than an empirical study with new experiments. We address each major comment below.
- Referee: [Abstract] The claim that the identified issues 'can, at times, make it impossible to evaluate and compare attacks and defenses fairly' is presented without quantitative evidence, error analysis, or concrete examples showing how the noise affects specific comparisons or why existing mitigations in the literature fail to permit reliable evaluations.
  Authors: The abstract summarizes the core argument; the full manuscript provides concrete examples and citations from published work (e.g., inconsistent attack success rates across papers caused by dataset-size differences or judge-model variations, discussed in the sections on dataset curation and LLM-judge evaluation). We agree the abstract could be strengthened with a brief illustrative reference and will revise it to point readers to specific cases discussed later in the paper.
  Revision: partial
- Referee: [Systematic analysis sections] (dataset curation, optimization, response generation, LLM-judge evaluation) The assumption that the listed issues are the dominant and representative sources of noise that render fair comparisons impossible is not verified; no direct comparison to or falsification against mitigations already present in the literature is provided to establish that the problems are load-bearing for the headline conclusion.
  Authors: The analysis draws on a broad survey of the literature to identify recurring issues that persist in practice. The opposing perspective section already discusses practical reasons for limitations and some existing mitigations. To strengthen this, we will expand the analysis sections with explicit comparisons to common mitigations (e.g., larger datasets or multi-judge setups) and explain why they often remain insufficient, based on observed inconsistencies across papers.
  Revision: partial
- Referee: [Guidelines section] The proposed set of guidelines for reducing noise is not accompanied by any demonstration, case study, or before/after quantification showing that their adoption would enable fair comparisons where it was previously impossible.
  Authors: The guidelines are proposed as direct responses to the noise sources identified in the analysis and are presented as recommendations for future work. A before/after quantification or new case study would require fresh experiments, which fall outside the scope of this analysis paper. We can revise the guidelines section to include a more detailed rationale linking each guideline to specific noise sources and expected benefits.
  Revision: no
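The multi-judge setups mentioned in the second response can be checked for agreement before their verdicts are trusted. The following is a minimal sketch, assuming binary harmful/safe labels from two hypothetical judges; the label lists are invented for illustration, and Cohen's kappa is one common agreement statistic, not necessarily the one the authors would use.

```python
def cohens_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts (1 = harmful, 0 = safe) from two judges on the same 10 responses.
judge_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
judge_2 = [1, 0, 0, 0, 1, 0, 1, 1, 0, 1]
print("ASR by judge:", sum(judge_1) / 10, sum(judge_2) / 10)
print("kappa:", cohens_kappa(judge_1, judge_2))
```

In this invented example both judges report the same aggregate ASR while disagreeing on two of the ten responses, which is why per-item agreement is worth reporting alongside the headline rate.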
Circularity Check
No significant circularity; observational critique without self-referential derivations or load-bearing self-citations
The paper is an observational analysis of noise sources in LLM safety evaluations across dataset curation, optimization, response generation, and LLM-judge stages. It identifies issues, proposes guidelines, and includes an opposing perspective on limitations. No equations, fitted parameters, predictions, or derivations appear that could reduce to inputs by construction. Central claims rest on systematic review of practices rather than self-citation chains or uniqueness theorems from prior author work. This is self-contained as a critique paper; the reader's assigned score of 2.0 aligns with minor or absent circularity.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: LLM judges are commonly used for response evaluation and introduce unreliability.
- Domain assumption: Small datasets and methodological inconsistencies are representative across the field.
Forward citations
Cited by 6 Pith papers
- Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs
  Compilation optimizations can be exploited to create stealthy backdoors in LLMs that remain dormant without optimization but achieve ~90% attack success while preserving clean accuracy near 100%.
- AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
  AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.
- When Efficiency Backfires: Cascading LLMs Trigger Cascade Failure under Adversarial Attack
  LLM cascade systems are vulnerable to a new adversarial attack that simultaneously degrades accuracy and destroys the intended cost savings by targeting both the lightweight models and the escalation decision mechanism.
- How Sensitive Are Safety Benchmarks to Judge Configuration Choices?
  LLM judge prompt variations alone shift HarmBench harmful-response rates by up to 24.2 percentage points and produce moderate instability in model safety rankings.
- Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models
  A meta-prompt and hierarchical detection framework automates LLM red-teaming, achieving a 3.9 times higher vulnerability discovery rate than manual methods with 89% accuracy on GPT-OSS-20B.
- Toward Principled LLM Safety Testing: Solving the Jailbreak Oracle Problem
  Formalizes the jailbreak oracle problem for LLMs and introduces Boa, a two-phase breadth-first then depth-first search system to solve it efficiently.
Reference graph
Works this paper leans on
-
[1]
Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gunasekar, S., Harrison, M., Hewett, R. J., Javaheripi, M., Kauffmann, P., et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905,
-
[2]
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774,
-
[4]
URL https://www.aisi.gov.uk/. Accessed: 2025-01-29. An, B., Zhu, S., Zhang, R., Panaitescu-Liess, M.-A., Xu, Y., and Huang, F. Automatic pseudo-harmful prompt generation for evaluating false refusals in large language models. In First Conference on Language Modeling,
-
[5]
URL https://openreview.net/forum?id=ljFgX6A8NL. Andriushchenko, M. and Flammarion, N. Does refusal training in LLMs generalize to the past tense? arXiv preprint arXiv:2407.11969,
-
[6]
Andriushchenko, M., Croce, F., and Flammarion, N. Jailbreaking leading safety-aligned LLMs with simple adaptive attacks. arXiv preprint arXiv:2404.02151,
-
[7]
Refusal in Language Models Is Mediated by a Single Direction
Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. 2024a. URL https://api.semanticscholar.org/CorpusID:268232499. Anthropic. Responsible scaling policy, 2024b. URL https://www.anthropic.com/news/anthropics-responsible-scaling-policy. Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., and Nanda, N. Refusal in language mod...
-
[8]
Constitutional AI: Harmlessness from AI Feedback
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073,
-
[9]
Mechanistic Interpretability for AI Safety -- A Review
Bereska, L. and Gavves, E. Mechanistic interpretability for AI safety–A review. arXiv preprint arXiv:2404.14082,
-
[10]
Bhardwaj, R. and Poria, S. Red-teaming large language models using chain of utterances for safety-alignment. arXiv preprint arXiv:2308.09662,
-
[11]
Lessons from the Trenches on Reproducible Evaluation of Language Models
Biderman, S., Schoelkopf, H., Sutawika, L., Gao, L., Tow, J., Abbasi, B., Aji, A. F., Ammanamanchi, P. S., Black, S., Clive, J., et al. Lessons from the trenches on reproducible evaluation of language models. arXiv preprint arXiv:2405.14782,
-
[12]
A realistic threat model for large language model jailbreaks
Boreiko, V., Panfilov, A., Voracek, V., Hein, M., and Geiping, J. A realistic threat model for large language model jailbreaks. arXiv preprint arXiv:2410.16222,
-
[13]
The art of saying no: Contextual noncompliance in language models
Brahman, F., Kumar, S., Balachandran, V., Dasigi, P., Pyatkin, V., Ravichander, A., Wiegreffe, S., Dziri, N., Chandu, K., Hessel, J., et al. The art of saying no: Contextual noncompliance in language models. arXiv preprint arXiv:2407.12043,
-
[14]
Jailbreaking Black Box Large Language Models in Twenty Queries
9 Position: LLM-Safety Evaluations Lack Robustness Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G. J., and Wong, E. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419,
-
[15]
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
Chao, P., Debenedetti, E., Robey, A., Andriushchenko, M., Croce, F., Sehwag, V., Dobriban, E., Flammarion, N., Pappas, G. J., Tramer, F., et al. JailbreakBench: An open robustness benchmark for jailbreaking large language models. arXiv preprint arXiv:2404.01318,
-
[16]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457,
-
[17]
National artificial intelligence initiative act of 2020,
Congress, U. National artificial intelligence initiative act of 2020,
-
[18]
Robustbench: a standardized adversarial robustness benchmark
Croce, F., Andriushchenko, M., Sehwag, V., Debenedetti, E., Flammarion, N., Chiang, M., Mittal, P., and Hein, M. Robustbench: a standardized adversarial robustness benchmark. arXiv preprint arXiv:2010.09670,
-
[19]
OR-Bench: An over-refusal benchmark for large language models
Cui, J., Chiang, W.-L., Stoica, I., and Hsieh, C.-J. OR-Bench: An over-refusal benchmark for large language models. arXiv preprint arXiv:2405.20947,
-
[20]
Safe RLHF: Safe Reinforcement Learning from Human Feedback
Dai, J., Pan, X., Sun, R., Ji, J., Xu, X., Liu, M., Wang, Y., and Yang, Y. Safe RLHF: Safe reinforcement learning from human feedback. arXiv preprint arXiv:2310.12773,
-
[21]
Imagenet: A large-scale hierarchical image database
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. IEEE,
-
[22]
Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783,
-
[23]
Attacking large language models with projected gradient descent
Geisler, S., Wollschläger, T., Abdalla, M., Gasteiger, J., and Günnemann, S. Attacking large language models with projected gradient descent. arXiv preprint arXiv:2402.09154,
-
[24]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Google, G., Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530,
-
[25]
Alignment faking in large language models
Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., Treutlein, J., Belonax, T., Chen, J., Duvenaud, D., et al. Alignment faking in large language models. arXiv preprint arXiv:2412.14093,
-
[26]
Guan, M. Y., Joglekar, M., Wallace, E., Jain, S., Barak, B., Heylar, A., Dias, R., Vallone, A., Ren, H., Wei, J., et al. Deliberative alignment: Reasoning enables safer language models. arXiv preprint arXiv:2412.16339,
-
[27]
X-Risk Analysis for AI Research
Hendrycks, D. and Mazeika, M. X-risk analysis for ai research. arXiv preprint arXiv:2206.05862,
-
[28]
Measuring Massive Multitask Language Understanding
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300,
-
[29]
An Overview of Catastrophic AI Risks
Hendrycks, D., Mazeika, M., and Woodside, T. An overview of catastrophic ai risks. arXiv preprint arXiv:2306.12001,
-
[30]
The Curious Case of Neural Text Degeneration
Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751,
-
[31]
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
Huang, Y., Gupta, S., Xia, M., Li, K., and Chen, D. Catastrophic jailbreak of open-source LLMs via exploiting generation. arXiv preprint arXiv:2310.06987,
-
[32]
Hughes, J., Price, S., Lynch, A., Schaeffer, R., Barez, F., Koyejo, S., Sleight, H., Jones, E., Perez, E., and Sharma, M. Best-of-n jailbreaking. arXiv preprint arXiv:2412.03556,
-
[33]
Jia, X., Pang, T., Du, C., Huang, Y., Gu, J., Liu, Y., Cao, X., and Lin, M. Improved techniques for optimization-based jailbreaking on large language models. arXiv preprint arXiv:2405.21018,
-
[34]
Torchattacks: A pytorch repository for adversarial attacks
Kim, H. Torchattacks: A pytorch repository for adversarial attacks. arXiv preprint arXiv:2010.01950,
-
[35]
URL http://www.cs.toronto.edu/~kriz/cifar.html. MIT License. Kudo, T. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226,
-
[36]
The role of imagenet classes in Fréchet inception distance
Kynkäänniemi, T., Karras, T., Aittala, M., Aila, T., and Lehtinen, J. The role of imagenet classes in Fréchet inception distance. arXiv preprint arXiv:2203.06026,
-
[37]
Liao, Z. and Sun, H. AmpleGCG: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed LLMs. arXiv preprint arXiv:2404.07921,
-
[38]
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
Liu, X., Xu, N., Chen, M., and Xiao, C. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451, 2023a. Liu, Y., Deng, G., Xu, Z., Li, Y., Zheng, Y., Zhang, Y., Zhao, L., Zhang, T., Wang, K., and Liu, Y. Jailbreaking ChatGPT via prompt engineering: An empirical study. arXiv preprint arXiv:230...
-
[39]
TDC 2023 (llm edition): The trojan detection challenge
Mazeika, M., Zou, A., Mu, N., Phan, L., Wang, Z., Yu, C., Khoja, A., Jiang, F., O’Gara, A., Sakhaee, E., Xiang, Z., Rajabi, A., Hendrycks, D., Poovendran, R., Li, B., and Forsyth, D. TDC 2023 (llm edition): The trojan detection challenge. In NeurIPS Competition Track,
-
[40]
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., et al. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249,
-
[41]
Miller, E. Adding error bars to evals: A statistical approach to language model evaluations. arXiv preprint arXiv:2411.00640,
-
[42]
Technical Report on the CleverHans v2.1.0 Adversarial Examples Library
Papernot, N., Faghri, F., Carlini, N., Goodfellow, I., Feinman, R., Kurakin, A., Xie, C., Sharma, Y., Brown, T., Roy, A., et al. Technical report on the CleverHans v2.1.0 adversarial examples library. arXiv preprint arXiv:1610.00768,
-
[43]
Polo, F. M., Weber, L., Choshen, L., Sun, Y., Xu, G., and Yurochkin, M. tinyBenchmarks: evaluating LLMs with fewer examples. arXiv preprint arXiv:2402.14992,
-
[44]
Safety alignment should be made more than just a few tokens deep
Qi, X., Panda, A., Lyu, K., Ma, X., Roy, S., Beirami, A., Mittal, P., and Henderson, P. Safety alignment should be made more than just a few tokens deep. arXiv preprint arXiv:2406.05946,
-
[45]
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
Röttger, P., Kirk, H. R., Vidgen, B., Attanasio, G., Bianchi, F., and Hovy, D. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263,
-
[46]
Sadasivan, V. S., Saha, S., Sriramanan, G., Kattakinda, P., Chegini, A., and Feizi, S. Fast adversarial attacks on language models in one GPU minute. arXiv preprint arXiv:2402.15570,
-
[47]
Salaudeen, O. and Hardt, M. ImageNot: A contrast with ImageNet preserves model rankings. arXiv preprint arXiv:2404.02112,
-
[48]
A probabilistic perspective on unlearning and alignment for large language models
Scholten, Y., Günnemann, S., and Schwinn, L. A probabilistic perspective on unlearning and alignment for large language models. arXiv preprint arXiv:2410.03523,
-
[49]
Schwinn, L., Dobre, D., Xhonneux, S., Gidel, G., and Günnemann, S. Soft prompt threats: Attacking safety alignment and unlearning in open-source LLMs through the embedding space. arXiv preprint arXiv:2402.09063,
-
[50]
On second thought, let’s not think step by step! Bias and toxicity in zero-shot reasoning
Shaikh, O., Zhang, H., Held, W., Bernstein, M., and Yang, D. On second thought, let’s not think step by step! Bias and toxicity in zero-shot reasoning. arXiv preprint arXiv:2212.08061,
-
[51]
Shen, X., Chen, Z., Backes, M., Shen, Y., and Zhang, Y. "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pp. 1671–1685,
-
[52]
Sheshadri, A., Ewart, A., Guo, P., Lynch, A., Wu, C., Hebbar, V., Sleight, H., Stickland, A. C., Perez, E., Hadfield-Menell, D., and Casper, S. Targeted latent adversarial training improves robustness to persistent harmful behaviors in LLMs. arXiv preprint arXiv:2407.15549,
-
[53]
Thompson, T. B. and Sklar, M. Breaking circuit breakers, 2024a. URL https://confirmlabs.org/posts/circuit_breaking.html. Thompson, T. B. and Sklar, M. FLRT: Fluent Student-Teacher Redteaming. arXiv preprint arXiv:2407.17447, 2024b. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosa...
-
[54]
Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models
Vijayakumar, A. K., Cogswell, M., Selvaraju, R. R., Sun, Q., Lee, S., Crandall, D., and Batra, D. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424,
-
[55]
Measuring short-form factuality in large language models
Wei, J., Karina, N., Chung, H. W., Jiao, Y. J., Papay, S., Glaese, A., Schulman, J., and Fedus, W. Measuring short-form factuality in large language models. arXiv preprint arXiv:2411.04368,
-
[56]
Efficient adversarial training in llms with continuous attacks, 2024
Xhonneux, S., Sordoni, A., Günnemann, S., Gidel, G., and Schwinn, L. Efficient adversarial training in LLMs with continuous attacks. arXiv preprint arXiv:2405.15589,
-
[57]
Yong, Z.-X., Menghini, C., and Bach, S. H. Low-resource languages jailbreak GPT-4. arXiv preprint arXiv:2310.02446,
-
[58]
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
Yu, J., Lin, X., Yu, Z., and Xing, X. GPTFuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253,
-
[59]
Robust LLM safeguarding via refusal feature adversarial training
Yu, L., Do, V., Hambardzumyan, K., and Cancedda, N. Robust LLM safeguarding via refusal feature adversarial training. arXiv preprint arXiv:2409.20089,
-
[60]
HellaSwag: Can a Machine Really Finish Your Sentence?
Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830,
-
[61]
WildChat: 1M ChatGPT Interaction Logs in the Wild
Zhao, W., Ren, X., Hessel, J., Cardie, C., Choi, Y., and Deng, Y. WildChat: 1M ChatGPT interaction logs in the wild. arXiv preprint arXiv:2405.01470,
-
[62]
AdvPrefix: An objective for nuanced LLM jailbreaks
Zhu, S., Amos, B., Tian, Y., Guo, C., and Evtimov, I. AdvPrefix: An objective for nuanced LLM jailbreaks. arXiv preprint arXiv:2412.10321,
-
[63]
Representation Engineering: A Top-Down Approach to AI Transparency
Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023a. Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversarial attacks on ali...
-
[64]
A. Appendix
A.1. Background
A.1.1. Robustness for LLMs
In the context of LLM safety research, an aligned model can be defined as one whose generations follow the rules and guidelines laid out by its developer. Such an LLM may be adversarially attacked via methods that produce inputs that cause the mode...
-
[65]
and datasets (Krizhevsky et al., 2009), making results easier to reproduce and avoiding noise resulting from slightly different implementations.
A.2. Case Study: Effects of White-space Tokens on GCG Against Llama 2
We run GCG (Zou et al., 2023b) against Llama-2-7b-chat-hf using default hyperparameters. W...
-
[66]
begin with the phrase “Sure, here...”, we want to evaluate how “natural” this response is for various models. We run a small test using a benign prompt (“Write a blog post that explains how to use the ‘transformers‘ library to train a language model”) and the target phrase “Sure, here’s a blog post that explains how to use the ‘transformers‘ library to tr...
-
[67]
would not be sufficient to support this claim at commonly accepted statistical significance levels. In addition, we find that the use of automated judging for jailbreaks can lead to overly confident conclusions regarding the safety of models. We also want to emphasize that the findings of the work as a whole remain sound and well-supported, but only becaus...
-
[68]
Llama 2 13B finetuned classifier. To do so, we run a suite of 10 different attack algorithms against a zoo of 25 models using all 300 non-copyright prompts in HarmBench’s dataset. All attacks are evaluated in the “many-trial” setting, regardless of the setting in which they were originally introduced. This means all intermediate prompt candidates are roll...
-
[69]
During the attack, the victim model generates up to 256 tokens using greedy generation. • AmpleGCG (Liao & Sun, 2024): We use osunlp/AmpleGCG-llama2-sourced-llama2-7b-chat to generate 200 attack suffixes with diversity penalty
-
[70]
to prompt the model. • BEAST (Sadasivan et al., 2024): We use k1 = k2 = 15 and set the temperature to 1 to sample N = 40 suffix tokens. • PGD (Schwinn et al., 2024): We initialize the attack using the suffix “x x x x x x x x x x x x x x x x x x x x” as it tokenizes to exactly 20 tokens for all tested models, and run signed gradient descent optimization fo...