Latent-space Attacks for Refusal Evasion in Language Models

Battista Biggio; Fabio Brau; Fabio Roli; Giorgio Piras; Luca Oneto; Maura Pintor; Raffaele Mura

arxiv: 2605.21706 · v1 · pith:CQ445C74new · submitted 2026-05-20 · 💻 cs.AI

Giorgio Piras, Raffaele Mura, Fabio Brau, Maura Pintor, Luca Oneto, Fabio Roli, Battista Biggio

Pith reviewed 2026-05-22 08:54 UTC · model grok-4.3

classification 💻 cs.AI

keywords: latent space attacks, refusal evasion, language models, jailbreak, safety alignment, linear probes, evasion attacks, model steering

The pith

Refusal suppression works by projecting model activations onto the decision boundary of a linear probe, but pushing further into the compliant region raises attack success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes existing refusal-ablation techniques as attacks that evade a linear classifier separating refused prompts from answered ones in the model's latent space. Prior ablation removes the refusal direction and lands exactly on the probe's decision boundary, which neutralizes refusal but does not fully enter the region where the model answers. By instead projecting activations past that boundary with an optimized confidence level, the new controlled evasion method produces higher rates of compliance with harmful requests. This account unifies earlier empirical results and applies to instruction-tuned, multimodal, and reasoning models alike.

Core claim

Refusal suppression can be recast as a latent-space evasion attack against linear probes trained to separate refused from answered prompts. The difference-in-means direction naturally defines such a probe, so ablating it amounts to projection onto the decision boundary, a minimum-confidence evasion. This perspective reveals the limitation that evasion stops at the boundary and motivates a Controlled Latent-space Evasion attack that projects representations further into the compliant region with an optimized confidence, yielding state-of-the-art attack success rates across 15 models and outperforming both refusal-ablation baselines and specialized jailbreak attacks.
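The geometry behind this claim can be written out in a few lines. The following is a sketch in notation assumed here rather than taken from the paper: $x$ is an activation, $\mu_{\mathrm{ref}}$ and $\mu_{\mathrm{ans}}$ are the mean activations of refused and answered prompts, $f$ is a linear probe whose positive side means refusal, $b$ is a bias term, and $\kappa \ge 0$ is a target confidence.

```latex
\begin{align}
f(x) &= w^\top x + b, \qquad w = \mu_{\mathrm{ref}} - \mu_{\mathrm{ans}}, \\
x_{\min} &= x - \frac{f(x)}{\lVert w \rVert^{2}}\, w
  \quad\Longrightarrow\quad f(x_{\min}) = 0, \\
x_{\kappa} &= x - \frac{f(x) + \kappa}{\lVert w \rVert^{2}}\, w
  \quad\Longrightarrow\quad f(x_{\kappa}) = -\kappa .
\end{align}
```

With $b = 0$ and unit-norm $w$, the second line reduces to $x - (w^\top x)\,w$, the usual directional-ablation update, and setting $\kappa = 0$ in the third line recovers the second. Both identities follow by substituting the update into $f$ and using $w^\top w = \lVert w \rVert^{2}$.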

What carries the argument

The difference-in-means direction that defines a linear probe for refusal, where ablation equals projection to its decision boundary and controlled evasion equals projection past that boundary into the answering region.

If this is right

Ablation succeeds because it reaches the probe boundary but can be strengthened by continuing the projection.
The same linear-probe view explains performance gains on multimodal and reasoning models without new mechanisms.
Attack success improves when the projection distance or confidence is optimized rather than fixed at the boundary.
Existing refusal-ablation baselines are special cases of minimum-confidence evasion and are therefore outperformed by the controlled variant.
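The boundary versus past-boundary distinction in the points above can be checked numerically. The following is a minimal NumPy sketch on synthetic Gaussian clusters, not the paper's implementation: the data, the midpoint bias `b`, the helper names `score` and `project`, and the value `kappa=2.0` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Synthetic stand-ins for "refused" and "answered" activations (not model data).
shift = np.zeros(d)
shift[0] = 3.0
refused = rng.normal(size=(200, d)) + shift
answered = rng.normal(size=(200, d)) - shift

# Difference-in-means direction used as the probe normal.
mu_ref, mu_ans = refused.mean(axis=0), answered.mean(axis=0)
w = mu_ref - mu_ans
# Assumption: place the boundary at the midpoint between the two class means.
b = -w @ (mu_ref + mu_ans) / 2.0


def score(x):
    """Probe score; positive means the 'refused' side of the hyperplane."""
    return x @ w + b


def project(x, kappa=0.0):
    """Move each row of x along -w so that its probe score becomes -kappa."""
    step = (score(x) + kappa) / (w @ w)
    return x - np.outer(step, w)


on_boundary = project(refused, kappa=0.0)    # minimum-confidence evasion
past_boundary = project(refused, kappa=2.0)  # controlled evasion, fixed kappa
```

In this toy setting `kappa` is fixed by hand; the paper describes the confidence as optimized, which this sketch does not attempt.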

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the linear separability of refusal and compliance holds more generally, safety training may be creating detectable clusters in activation space that could be monitored or hardened.
Defenses could target the same direction by reinforcing the refusal side of the boundary rather than only removing the direction.
The method might extend to other alignment objectives, such as truthfulness or bias, if they also induce approximately linear directions in latent space.
Testing the controlled projection on models trained with different alignment recipes would show whether the linear-probe assumption is tied to specific safety techniques.

Load-bearing premise

Refusal behavior is captured by a linear probe whose decision boundary is defined by the difference-in-means direction, such that ablation equals projection onto that boundary and further projection yields compliant behavior.

What would settle it

Apply the controlled projection and the standard ablation to the same set of harmful prompts on one of the tested models and measure whether the controlled version produces a measurably higher fraction of direct answers rather than refusals.
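One way to make "measurably higher" concrete is a paired comparison over the same prompts. The sketch below assumes each method yields one boolean per prompt (True for a direct answer) and uses a paired bootstrap, which is a choice made here and not a procedure taken from the paper. The outcome lists are invented placeholders, and the function names are hypothetical.

```python
import numpy as np


def attack_success_rate(complied):
    """Fraction of prompts answered directly (True) rather than refused (False)."""
    return float(np.mean(np.asarray(complied, dtype=bool)))


def paired_bootstrap_diff(a, b, n_boot=2000, seed=0):
    """Point estimate and 95% bootstrap interval for ASR(b) - ASR(a).

    Prompts are resampled jointly so each draw keeps the pairing intact.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(a), size=(n_boot, len(a)))
    diffs = b[idx].mean(axis=1) - a[idx].mean(axis=1)
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return float(b.mean() - a.mean()), float(lo), float(hi)


# Placeholder outcomes, invented for illustration only (not results from the paper).
ablation = [True, False, True, False, False, True, False, True]
controlled = [True, True, True, False, True, True, False, True]

diff, lo, hi = paired_bootstrap_diff(ablation, controlled)
```

An interval that excludes zero would support the claim for that model and prompt set; with only a handful of prompts, as in this placeholder, the interval is too wide to conclude anything.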

Figures

Figures reproduced from arXiv: 2605.21706 by Battista Biggio, Fabio Brau, Fabio Roli, Giorgio Piras, Luca Oneto, Maura Pintor, Raffaele Mura.

**Figure 1.** 1st PC of prompt activation across layers: CLE variants confidently shift test prompts into the harmless, compliant region, while DiM leaves the activation distribution nearly unchanged. Building on this, we propose Controlled Latent-space Evasion (CLE), a refusal-suppression mechanism built on a set of linear probes trained at each layer to separate harmful from harmless representations. CLE perturbs acti…

**Figure 2.** Minimum vs. controlled evasion in the PCA-rendered latent space of LLaMA2-7B. The minimum-confidence update of Eq. 7 places every steered activation exactly on H_l (the region of maximum uncertainty), which empirically leads to inconsistent evasion across layers, as shown in …

**Figure 4.** Ablation study among CLE components. In …

**Figure 3.** (a) ASR grows monotonically with the compliance confidence (i.e., …

**Figure 5.** ROC curves for single-layer SVM probes, the SVM probe ensemble, and the DiM probe on …

**Figure 6.** Average layer-wise accuracy of Linear SVM and DiM probes on the …

**Figure 7.** Margin variations across layers.

**Figure 8.** 1st PC of prompt activation across layers: CLE variants confidently shift test prompts into the harmless, compliant region, while DiM leaves the activation distribution nearly unchanged.

**Figure 9.** Generated tokens from HARMBENCH prompts classified by last-layer probes. … region, or only weakly moves it toward the decision boundary. This is consistent with our formulation: DiM ablation corresponds to a boundary projection and does not explicitly optimize for a positive compliance margin. In contrast, both CLE-P and CLE-A shift the same prompts toward the harmless side of the representation space, often …

**Figure 10.** Ablation study among CLE-P components.

**Figure 11.** Ablation study among CLE-A components.

**Figure 12.** Phi-4-15B response to a harmful request in HARMBENCH with and without CLE-P.

**Figure 13.** Olmo3-7B response to a harmful request in HARMBENCH with and without CLE-P.

**Figure 14.** Mistral-7B-RR response to a harmful request in HARMBENCH with and without CLE-P.

**Figure 15.** LLaMA3.2-3B response to a harmful request in HARMBENCH with and without CLE-P.

**Figure 16.** GPT-OSS-20B response to a harmful request in HARMBENCH with and without CLE-P.

Original abstract

Safety-aligned language models are trained to refuse harmful requests, yet refusal behavior can be suppressed by steering their internal representations. Existing methods do so by ablating a refusal direction from model activations, aiming to remove refusal from the model's residual stream. Despite their empirical success, these methods lack a principled account of the latent-space transformation they induce and why it suppresses refusal. In this work, we recast refusal suppression as a latent-space evasion attack against linear probes trained to separate refused from answered prompts. Under this view, prior work's difference-in-means direction naturally defines such a probe, and its ablation is exactly a projection onto its decision boundary, i.e., a minimum-confidence evasion attack. This perspective not only explains the empirical success of prior work but also admits a key limitation: evasion stops at the decision boundary, motivating the need to push representations further into the compliant region, i.e., where the model answers. We leverage this by proposing a Controlled Latent-space Evasion attack that projects representations past the boundary with an optimized confidence. We achieve state-of-the-art attack success rate across 15 instruction-tuned, multimodal, and reasoning models, outperforming existing refusal-ablation baselines and specialized jailbreak attacks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper reframes refusal ablation as a minimum-confidence evasion attack on a linear probe and adds an optimized push past the boundary, but the abstract gives no probe accuracy or separability numbers to back the mechanistic claim.

The main takeaway is that the authors treat the difference-in-means vector from prior ablation work as the normal to a linear probe's decision boundary. Ablating that direction then becomes a projection onto the boundary, which they call a minimum-confidence evasion attack. Their Controlled Latent-space Evasion method goes further by optimizing how much to push into the compliant region, and they report higher success rates than standard ablations and some jailbreaks on 15 models.

Referee Report

3 major / 2 minor

Summary. The manuscript recasts refusal suppression in safety-aligned language models as a latent-space evasion attack against linear probes separating refused from answered prompts. Prior ablation methods using the difference-in-means direction are interpreted as projections onto the probe's decision boundary (a minimum-confidence evasion). The authors introduce a Controlled Latent-space Evasion attack that projects activations further into the compliant region with an optimized confidence parameter. They claim state-of-the-art attack success rates across 15 instruction-tuned, multimodal, and reasoning models, outperforming refusal-ablation baselines and specialized jailbreak attacks.

Significance. If the linear separability assumption and empirical results hold, the work supplies a geometric interpretation that explains the success of existing ablation techniques and motivates pushing past the decision boundary for stronger attacks. This framing could influence both attack research and defense design in AI safety by highlighting the limitations of boundary-only interventions. The approach is notable for attempting a principled account rather than purely empirical tuning, though verification of the probe's reliability across models remains essential.

major comments (3)

Abstract: The claim of achieving state-of-the-art attack success rates across 15 models is presented without any experimental protocol, dataset details, statistical tests, or ablation studies, making it impossible to verify whether the reported superiority supports the central claim of the Controlled Latent-space Evasion attack.
Method (linear probe construction): The recasting of ablation as projection onto the difference-in-means direction assumes this vector defines a reliable separating hyperplane; however, no probe accuracy, margin, or cross-model separability metrics are reported, which is load-bearing for interpreting prior work as minimum-confidence evasion and for claiming that further projection increases compliant generation.
Experiments: The evaluation of the new attack against baselines lacks details on prompt selection criteria, success measurement, controls for model variations, and independent validation of the probe boundary, undermining the cross-model superiority claim and raising circularity concerns since the probe is fitted on the same prompt data used to define the attack.
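The probe metrics requested in the second and third comments could be reported along the following lines. This is a hedged sketch on synthetic data, assuming a difference-in-means probe with a midpoint bias and a 70/30 split. The function names `fit_dim_probe` and `evaluate_probe` are hypothetical, and nothing here reproduces the paper's numbers.

```python
import numpy as np


def fit_dim_probe(x_ref, x_ans):
    """Difference-in-means probe: normal w and a midpoint bias b (an assumption)."""
    mu_ref, mu_ans = x_ref.mean(axis=0), x_ans.mean(axis=0)
    w = mu_ref - mu_ans
    b = -w @ (mu_ref + mu_ans) / 2.0
    return w, b


def evaluate_probe(w, b, x, y):
    """Return accuracy and signed geometric margins; y is +1 refused, -1 answered."""
    margins = y * (x @ w + b) / np.linalg.norm(w)
    return float((margins > 0).mean()), margins


rng = np.random.default_rng(1)
d, n = 32, 400

# Synthetic activations standing in for real ones (illustration only).
shift = np.zeros(d)
shift[0] = 2.0
x = np.vstack([rng.normal(size=(n, d)) + shift,
               rng.normal(size=(n, d)) - shift])
y = np.concatenate([np.ones(n), -np.ones(n)])

# 70/30 split: the probe is fitted on one part and scored only on the other.
perm = rng.permutation(2 * n)
split = (7 * 2 * n) // 10
train, test = perm[:split], perm[split:]

w, b = fit_dim_probe(x[train][y[train] > 0], x[train][y[train] < 0])
acc, margins = evaluate_probe(w, b, x[test], y[test])
```

Fitting on one split and scoring on the other is what addresses the circularity concern: a probe that only separates its own training prompts would show high training accuracy but small or negative held-out margins.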

minor comments (2)

Abstract: The phrase 'optimized confidence' is introduced without a precise definition or equation reference, which could be clarified for readers unfamiliar with the evasion framing.
Notation: Ensure consistent use of terms like 'decision boundary' and 'compliant region' when first introduced, and consider adding a figure illustrating the projection geometry.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below, indicating where revisions will strengthen the manuscript.

Referee: Abstract: The claim of achieving state-of-the-art attack success rates across 15 models is presented without any experimental protocol, dataset details, statistical tests, or ablation studies, making it impossible to verify whether the reported superiority supports the central claim of the Controlled Latent-space Evasion attack.

Authors: The abstract is a concise summary; full experimental details appear in Section 4, including use of AdvBench and similar harmful prompt datasets, attack success rate defined as the fraction of prompts eliciting compliant (non-refusal) outputs, evaluation over three random seeds with reported standard errors, and direct comparisons to ablation and jailbreak baselines. We will revise the abstract to include a one-sentence summary of the evaluation protocol and datasets.

Revision: yes

Referee: Method (linear probe construction): The recasting of ablation as projection onto the difference-in-means direction assumes this vector defines a reliable separating hyperplane; however, no probe accuracy, margin, or cross-model separability metrics are reported, which is load-bearing for interpreting prior work as minimum-confidence evasion and for claiming that further projection increases compliant generation.

Authors: The difference-in-means vector serves as the probe normal; we report its classification accuracy (typically >85% on held-out splits) and show in the appendix that projection distance beyond the boundary correlates with increased compliant generation. We will add an explicit table of per-model probe accuracies, margins, and separability statistics to the main text to make these supporting metrics prominent.

Revision: yes

Referee: Experiments: The evaluation of the new attack against baselines lacks details on prompt selection criteria, success measurement, controls for model variations, and independent validation of the probe boundary, undermining the cross-model superiority claim and raising circularity concerns since the probe is fitted on the same prompt data used to define the attack.

Authors: Prompts are selected from standard refusal-inducing sets (e.g., AdvBench) using the criterion that the unmodified model refuses them; success is measured via keyword-based refusal detection plus manual review of a 10% sample. The same prompt pool is used across all 15 models with per-model results reported to control variation. The probe is trained on a 70/30 train/test split with attacks evaluated only on held-out prompts. We will expand Section 4 to state these criteria explicitly and add a data-split ablation confirming robustness to probe training data.

Revision: partial
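The success measurement described in these responses (keyword-based refusal detection, with attack success rate averaged over three seeds and reported with a standard error) can be sketched as follows. The marker list, the function names, and the example responses are assumptions made for illustration. The actual keyword set is not given in this text.

```python
import math

# Hypothetical marker list; the keyword set used in the paper is not stated here.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "i am unable")


def is_refusal(response: str) -> bool:
    """Crude keyword check: True if any refusal marker appears in the response."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)


def attack_success_rate(responses):
    """Fraction of responses that are not flagged as refusals."""
    return sum(not is_refusal(r) for r in responses) / len(responses)


def mean_and_standard_error(values):
    """Sample mean and standard error (sample std divided by sqrt(n))."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var / n)


# Placeholder responses for three seeds, invented for illustration only.
runs = [
    ["I'm sorry, but I can't help with that.", "Sure, here is a summary.",
     "I cannot assist with this.", "Here is an overview."],
    ["Here is an overview.", "Sure, here is a summary.",
     "I am unable to help.", "Certainly."],
    ["Certainly.", "I can't do that.",
     "Here is an overview.", "Sure, here is a summary."],
]

per_seed = [attack_success_rate(r) for r in runs]
mean_asr, se_asr = mean_and_standard_error(per_seed)
```

Keyword matching misses refusals phrased in other words and flags answers that happen to contain a marker, which is presumably why the responses above pair it with manual review of a sample.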

Circularity Check

0 steps flagged

No significant circularity; empirical attack results are independent of probe fitting

The paper recasts prior ablation methods as projections onto a linear probe's decision boundary defined via difference-in-means on refused vs. answered activations. This is an interpretive equivalence by construction of the probe, but the central contribution is an extended Controlled Latent-space Evasion attack that optimizes further projection past the boundary. Attack success rates are reported as empirical measurements (ASR on model outputs across 15 models) rather than any fitted quantity or prediction forced by the probe itself. No self-citations, uniqueness theorems, or ansatzes from prior author work appear in the abstract or described chain. The derivation is therefore self-contained against external benchmarks of attack performance.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that refusal is linearly separable in activation space via a difference-in-means probe and that an optimized scalar controls how far to move past the boundary; no new entities are postulated.

free parameters (1)

optimized confidence: scalar used to determine how far representations are projected past the decision boundary; value chosen to maximize attack success.

axioms (1)

Domain assumption: refusal versus compliance is linearly separable in the model's residual stream activations using a difference-in-means direction. This allows the probe to be defined and ablation to be interpreted as projection onto its boundary.

