RACC: Representation-Aware Coverage Criteria for LLM Safety Testing
Pith reviewed 2026-05-16 08:10 UTC · model grok-4.3
The pith
RACC extracts safety directions from LLM hidden states to measure how well test suites cover jailbreak risks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RACC first extracts safety representations from the LLM's hidden states using a small calibration set of harmful prompts, then measures test prompts' concept activations against these directions, and finally computes coverage through six criteria assessing both individual and compositional safety concept coverage. Experiments on multiple LLMs and safety benchmarks show that RACC reliably rewards high-quality jailbreak test suites while remaining insensitive to redundant or invalid inputs.
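As a rough illustration of this pipeline, the sketch below shows one way the three stages could fit together: derive concept directions from calibration hidden states, project test prompts onto them, and threshold the activations into a coverage score. Everything here is assumption-level stand-in code (SVD-derived directions, a single fixed threshold, one layer's hidden states); the paper's actual extraction method and its six criteria are defined in §3 and are not reproduced on this page.

```python
# Minimal sketch of a RACC-style pipeline. All modeling choices here (SVD directions,
# one threshold, one layer's hidden states) are illustrative assumptions, not the
# paper's equations.
import numpy as np

def extract_safety_directions(harmful_states, benign_states, n_concepts=8):
    """Derive candidate safety directions from calibration hidden states.

    harmful_states, benign_states: (n_prompts, hidden_dim) arrays from one layer.
    The top right-singular vectors of the mean-centered harmful-minus-benign
    offsets stand in for the paper's safety concepts.
    """
    offsets = harmful_states - benign_states.mean(axis=0, keepdims=True)
    offsets = offsets - offsets.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(offsets, full_matrices=False)
    directions = vt[:n_concepts]
    return directions / np.linalg.norm(directions, axis=1, keepdims=True)

def concept_activations(test_states, directions):
    """Project each test prompt's hidden state onto each safety direction."""
    return test_states @ directions.T  # shape (n_tests, n_concepts)

def individual_coverage(test_states, directions, threshold=0.5):
    """Fraction of concepts activated above the threshold by at least one test prompt."""
    activated = np.abs(concept_activations(test_states, directions)) > threshold
    return float(activated.any(axis=0).mean())
```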
What carries the argument
Safety representations extracted from hidden states on a calibration set of harmful prompts, which serve as reference directions for scoring concept activations in test prompts.
If this is right
- High-quality jailbreak suites receive systematically higher coverage scores than low-quality or repetitive ones.
- Redundant or invalid inputs do not drive coverage scores upward.
- RACC scores can be used directly to rank and select the most useful test suites before running full evaluations.
- Attack generation methods can incorporate RACC feedback to sample prompts that cover previously uncovered safety concepts.
- The same criteria apply across different model sizes and benchmark families without retraining the coverage logic.
Where Pith is reading between the lines
- Continuous integration pipelines could run RACC after each model update to decide whether new safety tests are still needed.
- Similar representation extraction might extend to measuring coverage for other alignment goals such as bias or hallucination resistance.
- Tracking which safety directions remain hard to cover could guide targeted fine-tuning instead of broad retraining.
- Automated red-teaming systems could close the loop by generating new prompts specifically to raise RACC scores on uncovered directions.
Load-bearing premise
The safety directions found from a small calibration set of harmful prompts capture the essential safety-critical patterns without mixing in unrelated activations.
What would settle it
Apply RACC to both a diverse high-quality jailbreak suite and a redundant collection of invalid prompts on the same model; if the two suites receive nearly identical coverage scores, the criteria do not distinguish quality as claimed.
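A toy, runnable version of that check might look like the following, reusing extract_safety_directions and individual_coverage from the sketch above. The hidden_state() helper, the calibration prompts, and both suites are hypothetical placeholders, so only the shape of the comparison is meaningful, not the numbers.

```python
# Toy stand-in for the falsification test; hidden_state() replaces a real layer readout.
import numpy as np

HIDDEN_DIM = 64

def hidden_state(prompt):
    # Deterministic pseudo-random vector per prompt, in place of real LLM hidden states.
    seed = abs(hash(prompt)) % (2**32)
    return np.random.default_rng(seed).normal(size=HIDDEN_DIM)

calib_harmful = np.stack([hidden_state(f"harmful calibration prompt {i}") for i in range(32)])
calib_benign = np.stack([hidden_state(f"benign calibration prompt {i}") for i in range(32)])
directions = extract_safety_directions(calib_harmful, calib_benign)

diverse_suite = [f"distinct jailbreak attempt {i}" for i in range(50)]
redundant_suite = ["the same invalid prompt"] * 50

for name, suite in [("diverse", diverse_suite), ("redundant", redundant_suite)]:
    states = np.stack([hidden_state(p) for p in suite])
    print(name, "suite coverage:", individual_coverage(states, directions))
# Near-identical scores across the two suites would undercut the quality-sensitivity claim.
```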
Original abstract
Large Language Models (LLMs) face severe safety risks from jailbreak attacks, yet current safety testing largely relies on static datasets and lacks systematic criteria to evaluate test suite quality and adequacy. While coverage criteria have proven effective for smaller neural networks, they are impractical for LLMs due to computational overhead and the entanglement of safety-critical signals with irrelevant neuron activations. To address these issues, we propose RACC (Representation-Aware Coverage Criteria), a set of coverage criteria specialized for LLM safety testing. RACC first extracts safety representations from the LLM's hidden states using a small calibration set of harmful prompts, then measures test prompts' concept activations against these directions, and finally computes coverage through six criteria assessing both individual and compositional safety concept coverage. Experiments on multiple LLMs and safety benchmarks show that RACC reliably rewards high-quality jailbreak test suites while remaining insensitive to redundant or invalid inputs, which is a key distinction that neuron-level criteria fail to make. We further demonstrate RACC's practical value in two applications, including test suite prioritization and attack prompt sampling, and validate its generalization across diverse settings and configurations. Overall, RACC provides a scalable and principled foundation for coverage-guided LLM safety testing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes RACC (Representation-Aware Coverage Criteria), a set of six coverage criteria for evaluating jailbreak test suites in LLMs. It extracts safety representations from hidden states via a small calibration set of harmful prompts, measures concept activations in test prompts against these directions, and computes coverage for both individual and compositional safety concepts. Experiments across multiple LLMs and benchmarks are claimed to show that RACC rewards high-quality suites while remaining insensitive to redundancy and invalid inputs, outperforming neuron-level criteria, with additional demonstrations in test prioritization and attack sampling.
Significance. If the extracted safety representations are shown to be faithful, RACC could provide a scalable, principled alternative to static datasets and neuron-level coverage for LLM safety testing, enabling better test suite adequacy assessment and applications like prioritization. The claimed insensitivity to redundancy is a potentially useful distinction if empirically robust.
major comments (3)
- [Abstract] The central claim that 'experiments on multiple LLMs and safety benchmarks show that RACC reliably rewards high-quality jailbreak test suites while remaining insensitive to redundant or invalid inputs' is presented without quantitative results, coverage scores, error bars, statistical tests, or baseline comparisons. Without this evidence in the summary, readers cannot evaluate whether the claimed distinction from neuron-level criteria holds.
- [Methods §3] Safety representation extraction: the method identifies safety directions from a small calibration set of harmful prompts and assumes these directions are not entangled with non-safety features such as topic, sentiment, or refusal phrasing. No ablation on calibration set size or composition, no correlation analysis with known safety concepts, and no validation against spurious directions are described; this directly threatens the validity of all six coverage criteria and of the insensitivity claim.
- [Experimental validation §4] The six criteria are asserted to assess individual and compositional coverage, yet no equations, pseudocode, or precise computation details (e.g., how activations are thresholded or aggregated) are referenced, and no results tables or figures with specific metrics are summarized. This prevents assessment of reproducibility and of the load-bearing claim of superiority over neuron-level methods.
minor comments (2)
- [Abstract] The abstract mentions 'six criteria' without naming or defining them; including a brief enumeration or reference to their equations in the introduction would improve readability.
- [Related Work] The related work section is missing citations to prior work on coverage criteria for neural networks and representation-based testing; adding them would better situate the contribution.
Simulated Author's Rebuttal
We are grateful to the referee for their constructive feedback, which has helped us improve the clarity and rigor of our presentation. We address each major comment below and indicate the revisions made to the manuscript.
Point-by-point responses
- Referee: [Abstract] The central claim that 'experiments on multiple LLMs and safety benchmarks show that RACC reliably rewards high-quality jailbreak test suites while remaining insensitive to redundant or invalid inputs' is presented without quantitative results, coverage scores, error bars, statistical tests, or baseline comparisons. Without this evidence in the summary, readers cannot evaluate whether the claimed distinction from neuron-level criteria holds.
Authors: We agree that including quantitative evidence in the abstract would strengthen the summary of our contributions. The detailed quantitative results, including coverage scores, comparisons to baselines, and statistical analyses, are presented in Section 4 of the manuscript. In the revised version, we have updated the abstract to incorporate key quantitative findings, such as specific coverage percentages and performance metrics relative to neuron-level criteria, along with references to the supporting statistical tests. revision: yes
- Referee: [Methods §3] Safety representation extraction: the method identifies safety directions from a small calibration set of harmful prompts and assumes these directions are not entangled with non-safety features such as topic, sentiment, or refusal phrasing. No ablation on calibration set size or composition, no correlation analysis with known safety concepts, and no validation against spurious directions are described; this directly threatens the validity of all six coverage criteria and of the insensitivity claim.
Authors: The referee correctly identifies a potential limitation in the validation of the extracted representations. While the manuscript describes the extraction process in §3 and provides some empirical support in the experiments, we did not include explicit ablations on calibration set size or detailed correlation analyses in the main text. We have now added these in the revised manuscript: an ablation study on calibration set size and composition in Appendix C, and a new analysis in §3.1 correlating the safety directions with known safety concepts and testing for entanglement with non-safety features using control prompts. revision: yes
- Referee: [Experimental validation §4] The six criteria are asserted to assess individual and compositional coverage, yet no equations, pseudocode, or precise computation details (e.g., how activations are thresholded or aggregated) are referenced, and no results tables or figures with specific metrics are summarized. This prevents assessment of reproducibility and of the load-bearing claim of superiority over neuron-level methods.
Authors: We apologize if the computational details were not sufficiently highlighted. Equations for the six coverage criteria, including how concept activations are measured, thresholded, and aggregated for both individual and compositional coverage, are defined in §3.2 and §3.3. Section 4 presents results in tables and figures with specific metrics and comparisons. To address this, we have added pseudocode for the coverage computation in the revised manuscript and ensured explicit cross-references from §4 to the precise metrics and superiority claims over neuron-level methods. revision: yes
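For readers of this page, the thresholding and aggregation discussed above can be made concrete with one purely illustrative reading: binarize each prompt's per-concept activation with a threshold and count concept pairs that some single prompt activates jointly. This is a guess at the flavor of a compositional criterion, not the paper's definitions in §3.2 and §3.3.

```python
# Illustrative compositional coverage: fraction of concept pairs jointly activated by
# at least one single test prompt. Binary thresholding and pairwise composition are
# assumptions, not the paper's equations.
from itertools import combinations
import numpy as np

def compositional_coverage(test_states, directions, threshold=0.5):
    """Fraction of concept pairs covered jointly by some test prompt."""
    activated = np.abs(test_states @ directions.T) > threshold  # (n_tests, n_concepts)
    pairs = list(combinations(range(directions.shape[0]), 2))
    covered = sum(bool(np.any(activated[:, i] & activated[:, j])) for i, j in pairs)
    return covered / len(pairs) if pairs else 0.0
```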
Circularity Check
No significant circularity in RACC derivation chain
Full rationale
The paper defines RACC by first extracting safety representations from a separate small calibration set of harmful prompts, then computing six coverage criteria on test prompts drawn from independent safety benchmarks. No equations or steps are presented that reduce the coverage scores to parameters fitted directly from the evaluation data or test suites themselves. The derivation relies on external calibration data and benchmarks, remaining self-contained without self-definitional reductions, fitted-input predictions, or load-bearing self-citations that collapse the central claims back to the inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Safety-critical signals can be isolated as linear directions in the model's hidden-state space using a small calibration set of harmful prompts.
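One common formalization of this assumption (a mean-difference construction; the paper's actual extraction in §3 may differ) reads as follows, with C a calibration set of harmful prompts, B a benign reference set, and h_l(x) the layer-l hidden state:

```latex
% Hedged formalization of the linear-direction assumption; the paper's construction may differ.
\[
\bar{\mathbf{h}}_\ell^{\mathcal{S}} = \frac{1}{|\mathcal{S}|} \sum_{x \in \mathcal{S}} \mathbf{h}_\ell(x),
\qquad
\mathbf{d}_\ell = \frac{\bar{\mathbf{h}}_\ell^{\mathcal{C}} - \bar{\mathbf{h}}_\ell^{\mathcal{B}}}
                       {\lVert \bar{\mathbf{h}}_\ell^{\mathcal{C}} - \bar{\mathbf{h}}_\ell^{\mathcal{B}} \rVert},
\qquad
a_\ell(x) = \langle \mathbf{h}_\ell(x), \mathbf{d}_\ell \rangle,
\]
\noindent with the assumption that the scalar activation $a_\ell(x)$ separates safety-critical prompts from benign ones.
```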