Pith · machine review for the scientific record

arXiv: 2602.02280 · v2 · submitted 2026-02-02 · 💻 cs.SE · cs.AI · cs.CL · cs.CR · cs.LG

Recognition: unknown

RACC: Representation-Aware Coverage Criteria for LLM Safety Testing

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 08:10 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL · cs.CR · cs.LG
keywords LLM safety testing · coverage criteria · jailbreak attacks · safety representations · hidden states · test suite adequacy · adversarial testing · representation-aware evaluation

The pith

RACC extracts safety directions from LLM hidden states to measure how well test suites cover jailbreak risks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RACC to address the lack of reliable ways to judge whether a collection of test prompts adequately probes an LLM's safety boundaries against jailbreaks. Static benchmark sets often leave gaps, while older coverage methods either cost too much compute or get swamped by unrelated neuron activity. RACC first finds the key safety directions in hidden states from a small set of harmful calibration prompts, then scores new test prompts by how strongly they activate those directions. It combines six separate coverage rules that look at both single concepts and their combinations. If the approach holds, testers gain a practical signal for selecting and improving adversarial suites without drowning in model scale.

Core claim

RACC first extracts safety representations from the LLM's hidden states using a small calibration set of harmful prompts, then measures test prompts' concept activations against these directions, and finally computes coverage through six criteria assessing both individual and compositional safety concept coverage. Experiments on multiple LLMs and safety benchmarks show that RACC reliably rewards high-quality jailbreak test suites while remaining insensitive to redundant or invalid inputs.
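The six criteria themselves are not enumerated in this summary. As a rough illustration only, a minimal sketch of what an "individual" versus a "compositional" concept-coverage measure could look like, assuming test prompts are scored against concept directions and thresholded (the thresholding, pair-wise composition, and all values below are this sketch's assumptions, not the paper's definitions):

```python
import numpy as np

def individual_coverage(acts, threshold=1.0):
    """Fraction of safety concepts activated above `threshold` by at
    least one test prompt. `acts` has shape (n_prompts, n_concepts)."""
    return float((acts.max(axis=0) > threshold).mean())

def compositional_coverage(acts, threshold=1.0):
    """Fraction of concept pairs jointly activated within a single
    prompt -- a toy stand-in for compositional coverage."""
    hits = acts > threshold
    n = acts.shape[1]
    covered = sum(
        bool(np.any(hits[:, i] & hits[:, j]))
        for i in range(n) for j in range(i + 1, n)
    )
    return covered / (n * (n - 1) // 2)

# Toy activation scores: 2 prompts x 3 concepts.
acts = np.array([[2.0, 0.0, 0.0],
                 [0.0, 2.0, 2.0]])
ind = individual_coverage(acts)      # every concept is hit somewhere -> 1.0
comp = compositional_coverage(acts)  # only the (concept 1, concept 2) pair co-fires -> 1/3
```

The gap between `ind` and `comp` is the point: a suite can touch every concept individually while leaving most concept combinations unexercised.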

What carries the argument

Safety representations extracted from hidden states on a calibration set of harmful prompts, which serve as reference directions for scoring concept activations in test prompts.
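The paper's exact extraction procedure is not given in this summary; a common representation-engineering recipe that matches the description is a normalized difference of mean hidden states between harmful and benign calibration prompts. The sketch below uses synthetic "hidden states" with a planted signal; the dimensions, sample counts, and offset are all illustrative assumptions:

```python
import numpy as np

# Synthetic stand-in for last-layer hidden states (16 prompts, dim 8),
# with a planted safety-relevant signal on axis 0 of the harmful set.
rng = np.random.default_rng(0)
benign = rng.normal(size=(16, 8))
harmful = rng.normal(size=(16, 8))
harmful[:, 0] += 4.0

# Difference-of-means direction, unit-normalized.
direction = harmful.mean(axis=0) - benign.mean(axis=0)
direction /= np.linalg.norm(direction)

# Concept activation of a prompt = projection of its hidden state
# onto the safety direction.
act_harmful = float((harmful @ direction).mean())
act_benign = float((benign @ direction).mean())
```

With any reasonable extraction, harmful calibration prompts should project much higher on the direction than benign ones; that projection is the raw signal the coverage criteria consume.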

If this is right

  • High-quality jailbreak suites receive systematically higher coverage scores than low-quality or repetitive ones.
  • Redundant or invalid inputs do not drive coverage scores upward.
  • RACC scores can be used directly to rank and select the most useful test suites before running full evaluations.
  • Attack generation methods can incorporate RACC feedback to sample prompts that increase uncovered safety concepts.
  • The same criteria apply across different model sizes and benchmark families without retraining the coverage logic.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Continuous integration pipelines could run RACC after each model update to decide whether new safety tests are still needed.
  • Similar representation extraction might extend to measuring coverage for other alignment goals such as bias or hallucination resistance.
  • Tracking which safety directions remain hard to cover could guide targeted fine-tuning instead of broad retraining.
  • Automated red-teaming systems could close the loop by generating new prompts specifically to raise RACC scores on uncovered directions.

Load-bearing premise

The safety directions found from a small calibration set of harmful prompts capture the essential safety-critical patterns without mixing in unrelated activations.

What would settle it

Apply RACC to both a diverse high-quality jailbreak suite and a redundant collection of invalid prompts on the same model; if the two suites receive nearly identical coverage scores, the criteria do not distinguish quality as claimed.
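The shape of that experiment can be sketched with made-up activation scores and a single simple coverage rule (fraction of concepts hit above a threshold); the matrices, threshold, and rule are illustrative assumptions, not RACC's actual criteria:

```python
import numpy as np

# Hypothetical activation scores against 5 safety-concept directions
# (rows: prompts, cols: concepts). A diverse suite spreads its hits;
# a redundant suite repeats near-duplicates of one prompt.
diverse = np.array([
    [2.1, 0.1, 0.0, 0.2, 0.0],
    [0.0, 1.8, 0.1, 0.0, 0.3],
    [0.2, 0.0, 2.4, 0.1, 0.0],
    [0.0, 0.3, 0.0, 1.9, 2.2],
])
redundant = np.array([
    [2.0, 0.1, 0.0, 0.0, 0.1],
    [2.1, 0.0, 0.1, 0.0, 0.0],
    [1.9, 0.2, 0.0, 0.1, 0.0],
    [2.0, 0.1, 0.0, 0.0, 0.2],
])

def coverage(acts, threshold=1.0):
    # Fraction of concepts activated above threshold by any prompt.
    return float((acts.max(axis=0) > threshold).mean())

cov_diverse = coverage(diverse)      # all 5 concepts covered
cov_redundant = coverage(redundant)  # only 1 of 5 concepts covered
```

If RACC's criteria behave as claimed, the real versions of these two numbers should separate cleanly; near-identical scores would falsify the quality-distinction claim.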

Figures

Figures reproduced from arXiv: 2602.02280 by Chengcan Wu, Meng Sun, Xiaokun Luan, Yihao Zhang, Zeming Wei, Zhixin Zhang.

Figure 1. An overview of our RACC framework. In the Safety Concept Extraction module, we … [PITH_FULL_IMAGE:figures/full_fig_p003_1.png]
Original abstract

Large Language Models (LLMs) face severe safety risks from jailbreak attacks, yet current safety testing largely relies on static datasets and lacks systematic criteria to evaluate test suite quality and adequacy. While coverage criteria have proven effective for smaller neural networks, they are impractical for LLMs due to computational overhead and the entanglement of safety-critical signals with irrelevant neuron activations. To address these issues, we propose RACC (Representation-Aware Coverage Criteria), a set of coverage criteria specialized for LLM safety testing. RACC first extracts safety representations from the LLM's hidden states using a small calibration set of harmful prompts, then measures test prompts' concept activations against these directions, and finally computes coverage through six criteria assessing both individual and compositional safety concept coverage. Experiments on multiple LLMs and safety benchmarks show that RACC reliably rewards high-quality jailbreak test suites while remaining insensitive to redundant or invalid inputs, which is a key distinction that neuron-level criteria fail to make. We further demonstrate RACC's practical value in two applications, including test suite prioritization and attack prompt sampling, and validate its generalization across diverse settings and configurations. Overall, RACC provides a scalable and principled foundation for coverage-guided LLM safety testing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes RACC (Representation-Aware Coverage Criteria), a set of six coverage criteria for evaluating jailbreak test suites in LLMs. It extracts safety representations from hidden states via a small calibration set of harmful prompts, measures concept activations in test prompts against these directions, and computes coverage for both individual and compositional safety concepts. Experiments across multiple LLMs and benchmarks are claimed to show that RACC rewards high-quality suites while remaining insensitive to redundancy and invalid inputs, outperforming neuron-level criteria, with additional demonstrations in test prioritization and attack sampling.

Significance. If the extracted safety representations are shown to be faithful, RACC could provide a scalable, principled alternative to static datasets and neuron-level coverage for LLM safety testing, enabling better test suite adequacy assessment and applications like prioritization. The claimed insensitivity to redundancy is a potentially useful distinction if empirically robust.

major comments (3)
  1. [Abstract] Abstract: The central claim that 'experiments on multiple LLMs and safety benchmarks show that RACC reliably rewards high-quality jailbreak test suites while remaining insensitive to redundant or invalid inputs' is presented without any quantitative results, coverage scores, error bars, statistical tests, or baseline comparisons. This absence in the summary of the empirical evidence makes it impossible to evaluate whether the distinction from neuron-level criteria holds.
  2. [Methods (§3)] Safety representation extraction (Methods, §3): The method identifies safety directions from a small calibration set of harmful prompts and assumes these directions are not entangled with non-safety features such as topic, sentiment, or refusal phrasing. No ablation on calibration set size/composition, no correlation analysis with known safety concepts, and no validation against spurious directions are described; this directly threatens the validity of all six coverage criteria and the insensitivity claim.
  3. [Experimental validation (§4)] Experimental validation (§4): The six criteria are asserted to assess individual and compositional coverage, yet no equations, pseudocode, or precise computation details (e.g., how activations are thresholded or aggregated) are referenced, and no results tables or figures with specific metrics are summarized. This prevents assessment of reproducibility and the load-bearing claim of superiority over neuron-level methods.
minor comments (2)
  1. [Abstract] The abstract mentions 'six criteria' without naming or defining them; including a brief enumeration or reference to their equations in the introduction would improve readability.
  2. [Related Work] Missing citations to prior work on coverage criteria for neural networks and representation-based testing in the related work section would help situate the contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for their constructive feedback, which has helped us improve the clarity and rigor of our presentation. We address each major comment below and indicate the revisions made to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that 'experiments on multiple LLMs and safety benchmarks show that RACC reliably rewards high-quality jailbreak test suites while remaining insensitive to redundant or invalid inputs' is presented without any quantitative results, coverage scores, error bars, statistical tests, or baseline comparisons. This absence in the summary of the empirical evidence makes it impossible to evaluate whether the distinction from neuron-level criteria holds.

    Authors: We agree that including quantitative evidence in the abstract would strengthen the summary of our contributions. The detailed quantitative results, including coverage scores, comparisons to baselines, and statistical analyses, are presented in Section 4 of the manuscript. In the revised version, we have updated the abstract to incorporate key quantitative findings, such as specific coverage percentages and performance metrics relative to neuron-level criteria, along with references to the supporting statistical tests. revision: yes

  2. Referee: [Methods (§3)] Safety representation extraction (Methods, §3): The method identifies safety directions from a small calibration set of harmful prompts and assumes these directions are not entangled with non-safety features such as topic, sentiment, or refusal phrasing. No ablation on calibration set size/composition, no correlation analysis with known safety concepts, and no validation against spurious directions are described; this directly threatens the validity of all six coverage criteria and the insensitivity claim.

    Authors: The referee correctly identifies a potential limitation in the validation of the extracted representations. While the manuscript describes the extraction process in §3 and provides some empirical support in the experiments, we did not include explicit ablations on calibration set size or detailed correlation analyses in the main text. We have now added these in the revised manuscript: an ablation study on calibration set size and composition in Appendix C, and a new analysis in §3.1 correlating the safety directions with known safety concepts and testing for entanglement with non-safety features using control prompts. revision: yes

  3. Referee: [Experimental validation (§4)] Experimental validation (§4): The six criteria are asserted to assess individual and compositional coverage, yet no equations, pseudocode, or precise computation details (e.g., how activations are thresholded or aggregated) are referenced, and no results tables or figures with specific metrics are summarized. This prevents assessment of reproducibility and the load-bearing claim of superiority over neuron-level methods.

    Authors: We apologize if the computational details were not sufficiently highlighted. Equations for the six coverage criteria, including how concept activations are measured, thresholded, and aggregated for both individual and compositional coverage, are defined in §3.2 and §3.3. Section 4 presents results in tables and figures with specific metrics and comparisons. To address this, we have added pseudocode for the coverage computation in the revised manuscript and ensured explicit cross-references from §4 to the precise metrics and superiority claims over neuron-level methods. revision: yes

Circularity Check

0 steps flagged

No significant circularity in RACC derivation chain

Full rationale

The paper defines RACC by first extracting safety representations from a separate small calibration set of harmful prompts, then computing six coverage criteria on test prompts drawn from independent safety benchmarks. No equations or steps are presented that reduce the coverage scores to parameters fitted directly from the evaluation data or test suites themselves. The derivation relies on external calibration data and benchmarks, remaining self-contained without self-definitional reductions, fitted-input predictions, or load-bearing self-citations that collapse the central claims back to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unproven premise that hidden-state directions extracted from a small harmful-prompt calibration set isolate safety concepts without significant contamination from other signals. No free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption Safety-critical signals can be isolated as linear directions in the model's hidden-state space using a small calibration set of harmful prompts.
    Invoked in the first step of RACC; if false, the subsequent coverage measurements lose meaning.
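One way to stress-test this axiom without committing to the paper's method: fit a direction on calibration prompts, then check whether a simple threshold on the projection separates held-out harmful from benign prompts. Below is a sketch on synthetic hidden states with a planted linear signal; the dimensions, offsets, and midpoint threshold are all assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
dim = 32
offset = np.zeros(dim)
offset[3] = 5.0  # planted linear safety signal

# Calibration sets: fit the direction.
calib_harmful = rng.normal(size=(32, dim)) + offset
calib_benign = rng.normal(size=(32, dim))
d = calib_harmful.mean(axis=0) - calib_benign.mean(axis=0)
d /= np.linalg.norm(d)

# Threshold halfway between the calibration projections.
tau = 0.5 * ((calib_harmful @ d).mean() + (calib_benign @ d).mean())

# Held-out prompts: if the axiom holds, projections separate cleanly.
test_harmful = rng.normal(size=(64, dim)) + offset
test_benign = rng.normal(size=(64, dim))
acc = 0.5 * ((test_harmful @ d > tau).mean() + (test_benign @ d <= tau).mean())
```

High held-out accuracy is consistent with the linear-direction assumption; accuracy near chance on real hidden states would undercut every coverage score built on top of it.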

pith-pipeline@v0.9.0 · 5529 in / 1346 out tokens · 49797 ms · 2026-05-16T08:10:22.943150+00:00 · methodology

discussion (0)

