Pith · machine review for the scientific record

arXiv: 2602.02280 · v2 · submitted 2026-02-02 · 💻 cs.SE · cs.AI · cs.CL · cs.CR · cs.LG

Recognition: unknown

RACC: Representation-Aware Coverage Criteria for LLM Safety Testing

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 08:10 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL · cs.CR · cs.LG
keywords LLM safety testing · coverage criteria · jailbreak attacks · safety representations · hidden states · test suite adequacy · adversarial testing · representation-aware evaluation

The pith

RACC extracts safety directions from LLM hidden states to measure how well test suites cover jailbreak risks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RACC to address the lack of reliable ways to judge whether a collection of test prompts adequately probes an LLM's safety boundaries against jailbreaks. Static benchmark sets often leave gaps, while older coverage methods either cost too much compute or get swamped by unrelated neuron activity. RACC first finds the key safety directions in hidden states from a small set of harmful calibration prompts, then scores new test prompts by how strongly they activate those directions. It combines six separate coverage rules that look at both single concepts and their combinations. If the approach holds, testers gain a practical signal for selecting and improving adversarial suites without drowning in model scale.

Core claim

RACC first extracts safety representations from the LLM's hidden states using a small calibration set of harmful prompts, then measures test prompts' concept activations against these directions, and finally computes coverage through six criteria assessing both individual and compositional safety concept coverage. Experiments on multiple LLMs and safety benchmarks show that RACC reliably rewards high-quality jailbreak test suites while remaining insensitive to redundant or invalid inputs.
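The six criteria themselves are not enumerated in this summary. As a rough illustration only, a minimal sketch of what an "individual" versus a "compositional" concept-coverage measure could look like, assuming test prompts are scored against concept directions and thresholded (the thresholding, pair-wise composition, and all values below are this sketch's assumptions, not the paper's definitions):

```python
import numpy as np

def individual_coverage(acts, threshold=1.0):
    """Fraction of safety concepts activated above `threshold` by at
    least one test prompt. `acts` has shape (n_prompts, n_concepts)."""
    return float((acts.max(axis=0) > threshold).mean())

def compositional_coverage(acts, threshold=1.0):
    """Fraction of concept pairs jointly activated within a single
    prompt -- a toy stand-in for compositional coverage."""
    hits = acts > threshold
    n = acts.shape[1]
    covered = sum(
        bool(np.any(hits[:, i] & hits[:, j]))
        for i in range(n) for j in range(i + 1, n)
    )
    return covered / (n * (n - 1) // 2)

# Toy activation scores: 2 prompts x 3 concepts.
acts = np.array([[2.0, 0.0, 0.0],
                 [0.0, 2.0, 2.0]])
ind = individual_coverage(acts)      # every concept is hit somewhere -> 1.0
comp = compositional_coverage(acts)  # only the (concept 1, concept 2) pair co-fires -> 1/3
```

The gap between `ind` and `comp` is the point: a suite can touch every concept individually while leaving most concept combinations unexercised.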

What carries the argument

Safety representations extracted from hidden states on a calibration set of harmful prompts, which serve as reference directions for scoring concept activations in test prompts.
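The paper's exact extraction procedure is not given in this summary; a common representation-engineering recipe that matches the description is a normalized difference of mean hidden states between harmful and benign calibration prompts. The sketch below uses synthetic "hidden states" with a planted signal; the dimensions, sample counts, and offset are all illustrative assumptions:

```python
import numpy as np

# Synthetic stand-in for last-layer hidden states (16 prompts, dim 8),
# with a planted safety-relevant signal on axis 0 of the harmful set.
rng = np.random.default_rng(0)
benign = rng.normal(size=(16, 8))
harmful = rng.normal(size=(16, 8))
harmful[:, 0] += 4.0

# Difference-of-means direction, unit-normalized.
direction = harmful.mean(axis=0) - benign.mean(axis=0)
direction /= np.linalg.norm(direction)

# Concept activation of a prompt = projection of its hidden state
# onto the safety direction.
act_harmful = float((harmful @ direction).mean())
act_benign = float((benign @ direction).mean())
```

With any reasonable extraction, harmful calibration prompts should project much higher on the direction than benign ones; that projection is the raw signal the coverage criteria consume.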

If this is right

  • High-quality jailbreak suites receive systematically higher coverage scores than low-quality or repetitive ones.
  • Redundant or invalid inputs do not drive coverage scores upward.
  • RACC scores can be used directly to rank and select the most useful test suites before running full evaluations.
  • Attack generation methods can incorporate RACC feedback to sample prompts that increase uncovered safety concepts.
  • The same criteria apply across different model sizes and benchmark families without retraining the coverage logic.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Continuous integration pipelines could run RACC after each model update to decide whether new safety tests are still needed.
  • Similar representation extraction might extend to measuring coverage for other alignment goals such as bias or hallucination resistance.
  • Tracking which safety directions remain hard to cover could guide targeted fine-tuning instead of broad retraining.
  • Automated red-teaming systems could close the loop by generating new prompts specifically to raise RACC scores on uncovered directions.

Load-bearing premise

The safety directions found from a small calibration set of harmful prompts capture the essential safety-critical patterns without mixing in unrelated activations.

What would settle it

Apply RACC to both a diverse high-quality jailbreak suite and a redundant collection of invalid prompts on the same model; if the two suites receive nearly identical coverage scores, the criteria do not distinguish quality as claimed.
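The shape of that experiment can be sketched with made-up activation scores and a single simple coverage rule (fraction of concepts hit above a threshold); the matrices, threshold, and rule are illustrative assumptions, not RACC's actual criteria:

```python
import numpy as np

# Hypothetical activation scores against 5 safety-concept directions
# (rows: prompts, cols: concepts). A diverse suite spreads its hits;
# a redundant suite repeats near-duplicates of one prompt.
diverse = np.array([
    [2.1, 0.1, 0.0, 0.2, 0.0],
    [0.0, 1.8, 0.1, 0.0, 0.3],
    [0.2, 0.0, 2.4, 0.1, 0.0],
    [0.0, 0.3, 0.0, 1.9, 2.2],
])
redundant = np.array([
    [2.0, 0.1, 0.0, 0.0, 0.1],
    [2.1, 0.0, 0.1, 0.0, 0.0],
    [1.9, 0.2, 0.0, 0.1, 0.0],
    [2.0, 0.1, 0.0, 0.0, 0.2],
])

def coverage(acts, threshold=1.0):
    # Fraction of concepts activated above threshold by any prompt.
    return float((acts.max(axis=0) > threshold).mean())

cov_diverse = coverage(diverse)      # all 5 concepts covered
cov_redundant = coverage(redundant)  # only 1 of 5 concepts covered
```

If RACC's criteria behave as claimed, the real versions of these two numbers should separate cleanly; near-identical scores would falsify the quality-distinction claim.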

Figures

Figures reproduced from arXiv: 2602.02280 by Chengcan Wu, Meng Sun, Xiaokun Luan, Yihao Zhang, Zeming Wei, Zhixin Zhang.

Figure 1. An overview of our RACC framework. In the Safety Concept Extraction module, we … [PITH_FULL_IMAGE:figures/full_fig_p003_1.png]
Original abstract

Large Language Models (LLMs) face severe safety risks from jailbreak attacks, yet current safety testing largely relies on static datasets and lacks systematic criteria to evaluate test suite quality and adequacy. While coverage criteria have proven effective for smaller neural networks, they are impractical for LLMs due to computational overhead and the entanglement of safety-critical signals with irrelevant neuron activations. To address these issues, we propose RACC (Representation-Aware Coverage Criteria), a set of coverage criteria specialized for LLM safety testing. RACC first extracts safety representations from the LLM's hidden states using a small calibration set of harmful prompts, then measures test prompts' concept activations against these directions, and finally computes coverage through six criteria assessing both individual and compositional safety concept coverage. Experiments on multiple LLMs and safety benchmarks show that RACC reliably rewards high-quality jailbreak test suites while remaining insensitive to redundant or invalid inputs, which is a key distinction that neuron-level criteria fail to make. We further demonstrate RACC's practical value in two applications, including test suite prioritization and attack prompt sampling, and validate its generalization across diverse settings and configurations. Overall, RACC provides a scalable and principled foundation for coverage-guided LLM safety testing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes RACC (Representation-Aware Coverage Criteria), a set of six coverage criteria for evaluating jailbreak test suites in LLMs. It extracts safety representations from hidden states via a small calibration set of harmful prompts, measures concept activations in test prompts against these directions, and computes coverage for both individual and compositional safety concepts. Experiments across multiple LLMs and benchmarks are claimed to show that RACC rewards high-quality suites while remaining insensitive to redundancy and invalid inputs, outperforming neuron-level criteria, with additional demonstrations in test prioritization and attack sampling.

Significance. If the extracted safety representations are shown to be faithful, RACC could provide a scalable, principled alternative to static datasets and neuron-level coverage for LLM safety testing, enabling better test suite adequacy assessment and applications like prioritization. The claimed insensitivity to redundancy is a potentially useful distinction if empirically robust.

major comments (3)
  1. [Abstract] Abstract: The central claim that 'experiments on multiple LLMs and safety benchmarks show that RACC reliably rewards high-quality jailbreak test suites while remaining insensitive to redundant or invalid inputs' is presented without any quantitative results, coverage scores, error bars, statistical tests, or baseline comparisons. This absence in the summary of the empirical evidence makes it impossible to evaluate whether the distinction from neuron-level criteria holds.
  2. [Methods (§3)] Safety representation extraction (Methods, §3): The method identifies safety directions from a small calibration set of harmful prompts and assumes these directions are not entangled with non-safety features such as topic, sentiment, or refusal phrasing. No ablation on calibration set size/composition, no correlation analysis with known safety concepts, and no validation against spurious directions are described; this directly threatens the validity of all six coverage criteria and the insensitivity claim.
  3. [Experimental validation (§4)] Experimental validation (§4): The six criteria are asserted to assess individual and compositional coverage, yet no equations, pseudocode, or precise computation details (e.g., how activations are thresholded or aggregated) are referenced, and no results tables or figures with specific metrics are summarized. This prevents assessment of reproducibility and the load-bearing claim of superiority over neuron-level methods.
minor comments (2)
  1. [Abstract] The abstract mentions 'six criteria' without naming or defining them; including a brief enumeration or reference to their equations in the introduction would improve readability.
  2. [Related Work] Missing citations to prior work on coverage criteria for neural networks and representation-based testing in the related work section would help situate the contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for their constructive feedback, which has helped us improve the clarity and rigor of our presentation. We address each major comment below and indicate the revisions made to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that 'experiments on multiple LLMs and safety benchmarks show that RACC reliably rewards high-quality jailbreak test suites while remaining insensitive to redundant or invalid inputs' is presented without any quantitative results, coverage scores, error bars, statistical tests, or baseline comparisons. This absence in the summary of the empirical evidence makes it impossible to evaluate whether the distinction from neuron-level criteria holds.

    Authors: We agree that including quantitative evidence in the abstract would strengthen the summary of our contributions. The detailed quantitative results, including coverage scores, comparisons to baselines, and statistical analyses, are presented in Section 4 of the manuscript. In the revised version, we have updated the abstract to incorporate key quantitative findings, such as specific coverage percentages and performance metrics relative to neuron-level criteria, along with references to the supporting statistical tests. revision: yes

  2. Referee: [Methods (§3)] Safety representation extraction (Methods, §3): The method identifies safety directions from a small calibration set of harmful prompts and assumes these directions are not entangled with non-safety features such as topic, sentiment, or refusal phrasing. No ablation on calibration set size/composition, no correlation analysis with known safety concepts, and no validation against spurious directions are described; this directly threatens the validity of all six coverage criteria and the insensitivity claim.

    Authors: The referee correctly identifies a potential limitation in the validation of the extracted representations. While the manuscript describes the extraction process in §3 and provides some empirical support in the experiments, we did not include explicit ablations on calibration set size or detailed correlation analyses in the main text. We have now added these in the revised manuscript: an ablation study on calibration set size and composition in Appendix C, and a new analysis in §3.1 correlating the safety directions with known safety concepts and testing for entanglement with non-safety features using control prompts. revision: yes

  3. Referee: [Experimental validation (§4)] Experimental validation (§4): The six criteria are asserted to assess individual and compositional coverage, yet no equations, pseudocode, or precise computation details (e.g., how activations are thresholded or aggregated) are referenced, and no results tables or figures with specific metrics are summarized. This prevents assessment of reproducibility and the load-bearing claim of superiority over neuron-level methods.

    Authors: We apologize if the computational details were not sufficiently highlighted. Equations for the six coverage criteria, including how concept activations are measured, thresholded, and aggregated for both individual and compositional coverage, are defined in §3.2 and §3.3. Section 4 presents results in tables and figures with specific metrics and comparisons. To address this, we have added pseudocode for the coverage computation in the revised manuscript and ensured explicit cross-references from §4 to the precise metrics and superiority claims over neuron-level methods. revision: yes

Circularity Check

0 steps flagged

No significant circularity in RACC derivation chain

Full rationale

The paper defines RACC by first extracting safety representations from a separate small calibration set of harmful prompts, then computing six coverage criteria on test prompts drawn from independent safety benchmarks. No equations or steps are presented that reduce the coverage scores to parameters fitted directly from the evaluation data or test suites themselves. The derivation relies on external calibration data and benchmarks, remaining self-contained without self-definitional reductions, fitted-input predictions, or load-bearing self-citations that collapse the central claims back to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unproven premise that hidden-state directions extracted from a small harmful-prompt calibration set isolate safety concepts without significant contamination from other signals. No free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption Safety-critical signals can be isolated as linear directions in the model's hidden-state space using a small calibration set of harmful prompts.
    Invoked in the first step of RACC; if false, the subsequent coverage measurements lose meaning.
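One way to stress-test this axiom without committing to the paper's method: fit a direction on calibration prompts, then check whether a simple threshold on the projection separates held-out harmful from benign prompts. Below is a sketch on synthetic hidden states with a planted linear signal; the dimensions, offsets, and midpoint threshold are all assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
dim = 32
offset = np.zeros(dim)
offset[3] = 5.0  # planted linear safety signal

# Calibration sets: fit the direction.
calib_harmful = rng.normal(size=(32, dim)) + offset
calib_benign = rng.normal(size=(32, dim))
d = calib_harmful.mean(axis=0) - calib_benign.mean(axis=0)
d /= np.linalg.norm(d)

# Threshold halfway between the calibration projections.
tau = 0.5 * ((calib_harmful @ d).mean() + (calib_benign @ d).mean())

# Held-out prompts: if the axiom holds, projections separate cleanly.
test_harmful = rng.normal(size=(64, dim)) + offset
test_benign = rng.normal(size=(64, dim))
acc = 0.5 * ((test_harmful @ d > tau).mean() + (test_benign @ d <= tau).mean())
```

High held-out accuracy is consistent with the linear-direction assumption; accuracy near chance on real hidden states would undercut every coverage score built on top of it.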

pith-pipeline@v0.9.0 · 5529 in / 1346 out tokens · 49797 ms · 2026-05-16T08:10:22.943150+00:00 · methodology

discussion (0)

