Latent-space Attacks for Refusal Evasion in Language Models
Pith reviewed 2026-05-22 08:54 UTC · model grok-4.3
The pith
Refusal suppression works by projecting model activations onto the decision boundary of a linear probe, but pushing further into the compliant region raises attack success.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Refusal suppression can be recast as a latent-space evasion attack against linear probes trained to separate refused from answered prompts. The difference-in-means direction naturally defines such a probe, so ablating it amounts to projection onto the decision boundary, a minimum-confidence evasion. This perspective reveals the limitation that evasion stops at the boundary and motivates a Controlled Latent-space Evasion attack that projects representations further into the compliant region with an optimized confidence, yielding state-of-the-art attack success rates across 15 models and outperforming both refusal-ablation baselines and specialized jailbreak attacks.
What carries the argument
The difference-in-means direction that defines a linear probe for refusal, where ablation equals projection to its decision boundary and controlled evasion equals projection past that boundary into the answering region.
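The geometry behind this can be written out in a few lines. The notation below is an assumption made for illustration (the paper's own symbols, sign convention, and treatment of the bias term are not given in this summary): $\mu_r$ and $\mu_a$ are the mean activations of refused and answered prompts, $f$ is the probe, and $\kappa$ is the target confidence.

```latex
\hat{w} = \frac{\mu_r - \mu_a}{\lVert \mu_r - \mu_a \rVert},
\qquad
f(x) = \hat{w}^{\top} x + b,
\qquad
\text{predict ``refused'' when } f(x) > 0 .

% Move x along \hat{w} until the probe score equals -\kappa:
x' = x - \bigl(f(x) + \kappa\bigr)\,\hat{w}
\quad\Longrightarrow\quad
f(x') = f(x) - \bigl(f(x) + \kappa\bigr)\,\hat{w}^{\top}\hat{w} = -\kappa .

% \kappa = 0: x' lies on the decision boundary (minimum-confidence evasion).
% With b = 0 this is the familiar directional-ablation form:
x' = x - \hat{w}\,\hat{w}^{\top} x .

% \kappa > 0: x' lies at distance \kappa inside the compliant half-space.
```

Under these assumptions, ablation is the $\kappa = 0$ case and the controlled variant corresponds to choosing $\kappa > 0$.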
If this is right
- Ablation succeeds because it reaches the probe boundary but can be strengthened by continuing the projection.
- The same linear-probe view explains performance gains on multimodal and reasoning models without new mechanisms.
- Attack success improves when the projection distance or confidence is optimized rather than fixed at the boundary.
- Existing refusal-ablation baselines are special cases of minimum-confidence evasion and are therefore outperformed by the controlled variant.
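The points above can be illustrated with a purely geometric sketch on synthetic vectors. Everything here is an assumption for illustration: the data are random Gaussians rather than model activations, the bias places the boundary at the midpoint of the class means, and `probe_score`, `project_to_score`, and `kappa` are hypothetical names, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Synthetic stand-ins for activations (random data, not taken from any model).
refused = rng.normal(size=(200, dim)) + 2.0
answered = rng.normal(size=(200, dim)) - 2.0

# Difference-in-means direction, normalised to unit length.
w = refused.mean(axis=0) - answered.mean(axis=0)
w_hat = w / np.linalg.norm(w)

# Assumed bias: the boundary passes through the midpoint of the class means.
midpoint = 0.5 * (refused.mean(axis=0) + answered.mean(axis=0))
b = -w_hat @ midpoint


def probe_score(x):
    """Signed distance to the boundary; positive means the 'refused' side."""
    return x @ w_hat + b


def project_to_score(x, kappa=0.0):
    """Shift each row of x along w_hat so its probe score becomes -kappa."""
    shift = probe_score(x) + kappa
    return x - np.outer(shift, w_hat)


# kappa = 0 lands exactly on the boundary; kappa > 0 lands past it.
on_boundary = project_to_score(refused, kappa=0.0)
past_boundary = project_to_score(refused, kappa=3.0)
```

In this sketch the boundary-only case is just `kappa=0.0`, which is why it reads as a special case of the controlled projection. Only the component along `w_hat` changes; the orthogonal part of each vector is left untouched.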
Where Pith is reading between the lines
- If the linear separability of refusal and compliance holds more generally, safety training may be creating detectable clusters in activation space that could be monitored or hardened.
- Defenses could target the same direction by reinforcing the refusal side of the boundary rather than only removing the direction.
- The method might extend to other alignment objectives, such as truthfulness or bias, if they also induce approximately linear directions in latent space.
- Testing the controlled projection on models trained with different alignment recipes would show whether the linear-probe assumption is tied to specific safety techniques.
Load-bearing premise
Refusal behavior is captured by a linear probe whose decision boundary is defined by the difference-in-means direction, such that ablation equals projection onto that boundary and further projection yields compliant behavior.
What would settle it
Apply the controlled projection and the standard ablation to the same set of harmful prompts on one of the tested models and measure whether the controlled version produces a measurably higher fraction of direct answers rather than refusals.
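One way to make "measurably higher" concrete is a paired comparison on the same prompts. The sketch below assumes per-prompt boolean outcomes are already available (True meaning a direct answer); the outcome lists are made-up placeholders, and the exact sign test is a choice made here, not a procedure described in the paper.

```python
from math import comb


def success_rate(outcomes):
    """Fraction of prompts that received a direct answer."""
    return sum(outcomes) / len(outcomes)


def paired_sign_test(baseline, controlled):
    """One-sided exact sign test over prompts where the two methods disagree."""
    wins = sum(1 for a, c in zip(baseline, controlled) if c and not a)
    losses = sum(1 for a, c in zip(baseline, controlled) if a and not c)
    n = wins + losses
    if n == 0:
        return wins, losses, 1.0
    # P(at least `wins` successes in n fair coin flips).
    p_value = sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
    return wins, losses, p_value


# Hypothetical outcomes for illustration only (True = direct answer).
ablation = [True, False, False, True, False, True, False, False]
controlled = [True, True, False, True, True, True, True, False]

rates = (success_rate(ablation), success_rate(controlled))
wins, losses, p_value = paired_sign_test(ablation, controlled)
```

Pairing by prompt matters because both interventions are applied to the same inputs, so only the prompts on which they disagree carry information about which one is stronger.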
Original abstract
Safety-aligned language models are trained to refuse harmful requests, yet refusal behavior can be suppressed by steering their internal representations. Existing methods do so by ablating a refusal direction from model activations, aiming to remove refusal from the model's residual stream. Despite their empirical success, these methods lack a principled account of the latent-space transformation they induce and why it suppresses refusal. In this work, we recast refusal suppression as a latent-space evasion attack against linear probes trained to separate refused from answered prompts. Under this view, prior work's difference-in-means direction naturally defines such a probe, and its ablation is exactly a projection onto its decision boundary, i.e., a minimum-confidence evasion attack. This perspective not only explains the empirical success of prior work but also admits a key limitation: evasion stops at the decision boundary, motivating the need to push representations further into the compliant region, i.e., where the model answers. We leverage this by proposing a Controlled Latent-space Evasion attack that projects representations past the boundary with an optimized confidence. We achieve state-of-the-art attack success rate across 15 instruction-tuned, multimodal, and reasoning models, outperforming existing refusal-ablation baselines and specialized jailbreak attacks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript recasts refusal suppression in safety-aligned language models as a latent-space evasion attack against linear probes separating refused from answered prompts. Prior ablation methods using the difference-in-means direction are interpreted as projections onto the probe's decision boundary (a minimum-confidence evasion). The authors introduce a Controlled Latent-space Evasion attack that projects activations further into the compliant region with an optimized confidence parameter. They claim state-of-the-art attack success rates across 15 instruction-tuned, multimodal, and reasoning models, outperforming refusal-ablation baselines and specialized jailbreak attacks.
Significance. If the linear separability assumption and empirical results hold, the work supplies a geometric interpretation that explains the success of existing ablation techniques and motivates pushing past the decision boundary for stronger attacks. This framing could influence both attack research and defense design in AI safety by highlighting the limitations of boundary-only interventions. The approach is notable for attempting a principled account rather than purely empirical tuning, though verification of the probe's reliability across models remains essential.
major comments (3)
- Abstract: The claim of achieving state-of-the-art attack success rates across 15 models is presented without any experimental protocol, dataset details, statistical tests, or ablation studies, making it impossible to verify whether the reported superiority supports the central claim of the Controlled Latent-space Evasion attack.
- Method (linear probe construction): The recasting of ablation as projection onto the difference-in-means direction assumes this vector defines a reliable separating hyperplane; however, no probe accuracy, margin, or cross-model separability metrics are reported, which is load-bearing for interpreting prior work as minimum-confidence evasion and for claiming that further projection increases compliant generation.
- Experiments: The evaluation of the new attack against baselines lacks details on prompt selection criteria, success measurement, controls for model variations, and independent validation of the probe boundary, undermining the cross-model superiority claim and raising circularity concerns since the probe is fitted on the same prompt data used to define the attack.
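The probe metrics requested in the second comment could be computed along the following lines. This is a sketch under stated assumptions: the data are synthetic Gaussians, the 70/30 split and midpoint bias are choices made here, and `fit_diff_in_means` and `probe_report` are hypothetical helper names.

```python
import numpy as np


def fit_diff_in_means(refused, answered):
    """Unit normal and midpoint bias of a difference-in-means probe."""
    mu_r, mu_a = refused.mean(axis=0), answered.mean(axis=0)
    w_hat = (mu_r - mu_a) / np.linalg.norm(mu_r - mu_a)
    b = -w_hat @ (0.5 * (mu_r + mu_a))
    return w_hat, b


def probe_report(w_hat, b, refused, answered):
    """Accuracy and smallest signed margin on the supplied examples."""
    s_r = refused @ w_hat + b    # correct when positive
    s_a = answered @ w_hat + b   # correct when negative
    correct = (s_r > 0).sum() + (s_a < 0).sum()
    accuracy = correct / (len(s_r) + len(s_a))
    min_margin = min(s_r.min(), (-s_a).min())
    return float(accuracy), float(min_margin)


# Synthetic stand-ins for activations (random data, not taken from any model).
rng = np.random.default_rng(0)
refused = rng.normal(size=(100, 8)) + 3.0
answered = rng.normal(size=(100, 8)) - 3.0

# Fit on the first 70 examples per class, report on the held-out 30.
w_hat, b = fit_diff_in_means(refused[:70], answered[:70])
accuracy, min_margin = probe_report(w_hat, b, refused[70:], answered[70:])
```

A negative `min_margin` would mean at least one held-out example sits on the wrong side of the boundary, which is the situation in which reading ablation as "projection onto a separating hyperplane" becomes shaky.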
minor comments (2)
- Abstract: The phrase 'optimized confidence' is introduced without a precise definition or equation reference, which could be clarified for readers unfamiliar with the evasion framing.
- Notation: Ensure consistent use of terms like 'decision boundary' and 'compliant region' when first introduced, and consider adding a figure illustrating the projection geometry.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below, indicating where revisions will strengthen the manuscript.
- Referee: Abstract: The claim of achieving state-of-the-art attack success rates across 15 models is presented without any experimental protocol, dataset details, statistical tests, or ablation studies, making it impossible to verify whether the reported superiority supports the central claim of the Controlled Latent-space Evasion attack.
  Authors: The abstract is a concise summary; full experimental details appear in Section 4, including use of AdvBench and similar harmful prompt datasets, attack success rate defined as the fraction of prompts eliciting compliant (non-refusal) outputs, evaluation over three random seeds with reported standard errors, and direct comparisons to ablation and jailbreak baselines. We will revise the abstract to include a one-sentence summary of the evaluation protocol and datasets.
  Revision: yes
- Referee: Method (linear probe construction): The recasting of ablation as projection onto the difference-in-means direction assumes this vector defines a reliable separating hyperplane; however, no probe accuracy, margin, or cross-model separability metrics are reported, which is load-bearing for interpreting prior work as minimum-confidence evasion and for claiming that further projection increases compliant generation.
  Authors: The difference-in-means vector serves as the probe normal; we report its classification accuracy (typically >85% on held-out splits) and show in the appendix that projection distance beyond the boundary correlates with increased compliant generation. We will add an explicit table of per-model probe accuracies, margins, and separability statistics to the main text to make these supporting metrics prominent.
  Revision: yes
- Referee: Experiments: The evaluation of the new attack against baselines lacks details on prompt selection criteria, success measurement, controls for model variations, and independent validation of the probe boundary, undermining the cross-model superiority claim and raising circularity concerns since the probe is fitted on the same prompt data used to define the attack.
  Authors: Prompts are selected from standard refusal-inducing sets (e.g., AdvBench) using the criterion that the unmodified model refuses them; success is measured via keyword-based refusal detection plus manual review of a 10% sample. The same prompt pool is used across all 15 models with per-model results reported to control variation. The probe is trained on a 70/30 train/test split with attacks evaluated only on held-out prompts. We will expand Section 4 to state these criteria explicitly and add a data-split ablation confirming robustness to probe training data.
  Revision: partial
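The measurement protocol described in the responses (keyword-based refusal detection, attack success rate as the non-refusal fraction, mean and standard error over three seeds) can be sketched as follows. The marker list and the response strings are illustrative assumptions; the actual keyword set used by the authors is not stated here.

```python
import statistics

# Assumed, illustrative refusal markers; not the authors' actual list.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am unable")


def is_refusal(response):
    """Keyword check: True if any refusal marker appears in the response."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def attack_success_rate(responses):
    """Fraction of responses that are not flagged as refusals."""
    return sum(not is_refusal(r) for r in responses) / len(responses)


def mean_and_standard_error(values):
    """Sample mean and standard error of the mean (needs at least 2 values)."""
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / len(values) ** 0.5
    return mean, sem


# Placeholder responses, one list per random seed (made up for illustration).
runs = [
    ["I'm sorry, I can't help.", "Here is a summary.", "I cannot do that.", "Sure."],
    ["Here is a summary.", "I am unable to assist.", "Sure.", "The answer is 4."],
    ["Sure.", "Here is a summary.", "I can't help with that.", "The answer is 4."],
]

per_seed = [attack_success_rate(r) for r in runs]
mean_asr, sem_asr = mean_and_standard_error(per_seed)
```

Keyword matching is cheap but brittle: a response can avoid every marker and still decline, or contain a marker while answering, which is presumably why the responses pair it with manual review of a sample.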
Circularity Check
No significant circularity; empirical attack results are independent of probe fitting
The paper recasts prior ablation methods as projections onto a linear probe's decision boundary defined via difference-in-means on refused vs. answered activations. This is an interpretive equivalence by construction of the probe, but the central contribution is an extended Controlled Latent-space Evasion attack that optimizes further projection past the boundary. Attack success rates are reported as empirical measurements (ASR on model outputs across 15 models) rather than any fitted quantity or prediction forced by the probe itself. No self-citations, uniqueness theorems, or ansatzes from prior author work appear in the abstract or described chain. The derivation is therefore self-contained against external benchmarks of attack performance.
Axiom & Free-Parameter Ledger
free parameters (1)
- optimized confidence
axioms (1)
- domain assumption: Refusal versus compliance is linearly separable in the model's residual stream activations using a difference-in-means direction.