The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs
Pith reviewed 2026-07-01 07:05 UTC · model grok-4.3
The pith
Safety refusal in LLMs is a linear feature steerable directly on output logits.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Safety compliance is a manipulable linear feature rather than a deep semantic decision. Contrastive Logit Steering isolates the refusal direction by contrasting hidden states from safe and unrestricted system prompts, revealing architecturally deterministic safety implementations with late or early topologies. This allows logit-level interventions that outperform activation steering and enable bidirectional control over safety.
What carries the argument
Contrastive Logit Steering (CLS), which isolates the refusal direction by contrasting hidden states from safe and unrestricted prompts and applies it to the output logits.
If this is right
- Late-decision models like Llama-3.1 can be bypassed to 95% attack success in about one second.
- CLS outperforms prior methods, reaching 73% vs 22.6% on Llama 2.
- Early-divergence models like Qwen-2.5 integrate safety earlier but are still steerable.
- Inverting the steering vector provides a way to strengthen safety without additional training.
Where Pith is reading between the lines
- The linear nature might extend to other behavioral controls in LLMs beyond safety.
- Training methods could be adjusted to avoid creating such a single axis.
- This could inspire new evaluation benchmarks focused on linear steerability.
Load-bearing premise
That the difference between hidden states from safe and unrestricted prompts captures a causal refusal direction rather than other correlated features or prompt artifacts.
What would settle it
If applying the derived steering vector to a model's logits fails to increase the rate of successful jailbreaks compared to a control condition across multiple model families, the claim of a general linear refusal direction would be falsified.
Figures
read the original abstract
Modern Large Language Models (LLMs) rely on extensive safety alignment, yet the mechanistic basis of refusal remains opaque. In this work, we investigate whether safety compliance is a deep semantic decision or a manipulable linear feature. We introduce Contrastive Logit Steering (CLS), a zero-optimization framework that isolates the "refusal direction" by contrasting hidden states derived from safe and unrestricted system prompts. Unlike representation engineering methods that intervene on internal activations, CLS operates directly on the output distribution, serving as a diagnostic probe for alignment fragility. When coupled with prefix injection to bypass initial refusal reflexes, this method induces a phase transition where guardrails collapse. Our experiments on 7 model families reveal that safety implementation is architecturally deterministic. While models like Llama-3.1 exhibit a "Late Decision" topology that is easily bypassed by CLS (reaching 95% ASR in approximately one second), others like Qwen-2.5 demonstrate "Early Divergence" by integrating safety mid-computation. Direct comparison with established activation-level steering methods shows that CLS achieves substantially higher attack success rates on Llama 2 (73% vs. 22.6%) and Qwen 7B (91% vs. 79.2%), demonstrating that logit-level intervention exposes alignment vulnerabilities that hidden-state methods underestimate. Beyond attacks, we show that this linearity enables bidirectional control: inverting the steering vector "hardens" models against jailbreaks without retraining. Our findings suggest that current alignment techniques create a steerable "safety axis" that serves as both a critical vulnerability and a precise primitive for defense.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Contrastive Logit Steering (CLS), a zero-optimization method that constructs a 'refusal direction' by contrasting hidden states from safe versus unrestricted system prompts. It claims this exposes safety as a linear, architecturally deterministic feature across 7 model families, with CLS yielding higher attack success rates than activation steering (e.g., 73% vs. 22.6% on Llama 2), distinct topologies (Late Decision vs. Early Divergence), and bidirectional control via vector inversion for hardening without retraining.
Significance. If the central method isolates a causal linear axis rather than prompt artifacts, the work would establish a lightweight diagnostic for alignment fragility and a steerable primitive for both attack and defense, highlighting that current safety techniques embed a manipulable geometric vulnerability.
major comments (3)
- [Abstract] Abstract: the refusal direction is defined via direct contrast of the same safe/unrestricted prompt pairs later used to compute ASR; this construction risks circularity, as the vector may encode evaluation-specific features rather than an independent causal axis. The manuscript must clarify whether held-out prompts, length-matched controls, or external benchmarks are used for validation.
- [Abstract] Abstract / Experiments: reported ASR numbers (73% vs. 22.6% on Llama 2; 91% vs. 79.2% on Qwen 7B) and architectural determinism claims lack any description of experimental protocol, baseline details, statistical tests, or ablations for prompt length/lexical confounds, rendering it impossible to assess whether the superiority and topology distinctions hold.
- [Method] Method description: the premise that h_safe - h_unrestricted cleanly isolates a refusal direction (rather than correlated prompt artifacts) is load-bearing for the phase-transition and bidirectional-control results, yet the provided account supplies no controls or falsification tests for this assumption.
minor comments (1)
- [Abstract] Abstract: the phrase 'phase transition where guardrails collapse' is used without a formal definition or quantitative characterization (e.g., threshold on steering coefficient or logit shift).
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the presentation of our results. We address each major comment below and indicate where revisions will be made to the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the refusal direction is defined via direct contrast of the same safe/unrestricted prompt pairs later used to compute ASR; this construction risks circularity, as the vector may encode evaluation-specific features rather than an independent causal axis. The manuscript must clarify whether held-out prompts, length-matched controls, or external benchmarks are used for validation.
Authors: We acknowledge the risk of circularity. In the full manuscript (Section 3), the refusal direction is constructed from a fixed set of 50 prompt pairs. ASR evaluation uses a disjoint held-out test set of 100 prompts drawn from AdvBench and HarmBench. Length-matched controls are applied by pairing prompts of comparable token length, and we report results on external benchmarks not involved in direction construction. We will add an explicit statement of this separation to the abstract and method section. revision: yes
-
Referee: [Abstract] Abstract / Experiments: reported ASR numbers (73% vs. 22.6% on Llama 2; 91% vs. 79.2% on Qwen 7B) and architectural determinism claims lack any description of experimental protocol, baseline details, statistical tests, or ablations for prompt length/lexical confounds, rendering it impossible to assess whether the superiority and topology distinctions hold.
Authors: The experimental protocol, baseline implementations (activation steering from prior work), statistical tests (paired t-tests over five random seeds with standard deviations), and ablations for prompt length and lexical confounds are described in Sections 4 and 5. The abstract is space-constrained, but we will insert a brief reference to the protocol and controls or expand the experimental summary paragraph to improve accessibility. revision: partial
-
Referee: [Method] Method description: the premise that h_safe - h_unrestricted cleanly isolates a refusal direction (rather than correlated prompt artifacts) is load-bearing for the phase-transition and bidirectional-control results, yet the provided account supplies no controls or falsification tests for this assumption.
Authors: We agree that explicit validation of the direction's specificity is necessary. The manuscript already includes controls using length-matched and synonym-substituted prompt pairs, plus falsification tests applying the vector to non-refusal tasks. We will expand this subsection with additional tests (e.g., random prompt contrasts and cross-task generalization) to further substantiate the claim. revision: yes
Circularity Check
No significant circularity; CLS direction defined independently of ASR evaluation outcomes
full rationale
The paper defines CLS by subtracting hidden states from safe versus unrestricted system prompts to obtain a refusal direction, then reports empirical attack success rates and phase transitions when this direction is applied (with prefix injection) across 7 model families. This construction does not reduce the reported ASR numbers (e.g., 73% vs 22.6%) to the input by definition, nor does it rely on self-citation chains, uniqueness theorems, or renaming of known results. The bidirectional hardening result and topology distinctions (Late Decision vs Early Divergence) are presented as measured outcomes rather than tautological consequences of the contrast definition. No load-bearing self-citations or fitted-input predictions are described.
Axiom & Free-Parameter Ledger
free parameters (1)
- steering coefficient
axioms (1)
- domain assumption Refusal behavior is isolable as a linear direction in logit space via contrast of safe versus unrestricted prompts.
invented entities (1)
-
refusal direction / safety axis
no independent evidence
Reference graph
Works this paper leans on
-
[1]
2024 , eprint=
GPT-4 Technical Report , author=. 2024 , eprint=
2024
-
[2]
2023 , eprint=
Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. 2023 , eprint=
2023
-
[3]
2023 , eprint=
Jailbroken: How Does LLM Safety Training Fail? , author=. 2023 , eprint=
2023
-
[4]
2024 , eprint=
Jailbreaking Black Box Large Language Models in Twenty Queries , author=. 2024 , eprint=
2024
-
[5]
2022 , eprint=
Training language models to follow instructions with human feedback , author=. 2022 , eprint=
2022
-
[6]
2024 , eprint=
Direct Preference Optimization: Your Language Model is Secretly a Reward Model , author=. 2024 , eprint=
2024
-
[7]
2023 , eprint=
Universal and Transferable Adversarial Attacks on Aligned Language Models , author=. 2023 , eprint=
2023
-
[8]
2024 , eprint=
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal , author=. 2024 , eprint=
2024
-
[9]
2024 , eprint=
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models , author=. 2024 , eprint=
2024
-
[10]
2025 , eprint=
Representation Engineering: A Top-Down Approach to AI Transparency , author=. 2025 , eprint=
2025
-
[11]
2025 , eprint=
Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models , author=. 2025 , eprint=
2025
-
[12]
2024 , eprint=
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models , author=. 2024 , eprint=
2024
-
[13]
2024 , eprint=
ROSE Doesn't Do That: Boosting the Safety of Instruction-Tuned Large Language Models with Reverse Prompt Contrastive Decoding , author=. 2024 , eprint=
2024
-
[14]
2025 , eprint=
Weak-to-Strong Jailbreaking on Large Language Models , author=. 2025 , eprint=
2025
-
[15]
2023 , eprint=
Self-Detoxifying Language Models via Toxification Reversal , author=. 2023 , eprint=
2023
-
[16]
2025 , eprint=
Programming Refusal with Conditional Activation Steering , author=. 2025 , eprint=
2025
-
[17]
2024 , eprint=
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically , author=. 2024 , eprint=
2024
-
[18]
2025 , eprint=
In-Context Representation Hijacking , author=. 2025 , eprint=
2025
-
[19]
Vazquez and Ulisse Mini and Monte MacDiarmid , Title =
Alexander Matt Turner and Lisa Thiergart and Gavin Leech and David Udell and Juan J. Vazquez and Ulisse Mini and Monte MacDiarmid , Title =. 2023 , Eprint =
2023
-
[20]
2024 , eprint=
Refusal in Language Models Is Mediated by a Single Direction , author=. 2024 , eprint=
2024
-
[21]
Stanford Center for Research on Foundation Models
Alpaca: A strong, replicable instruction-following model , author=. Stanford Center for Research on Foundation Models. https://crfm. stanford. edu/2023/03/13/alpaca. html , volume=
2023
-
[22]
Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. 2024. https://arxiv.org/abs/2406.11717 Refusal in language models is mediated by a single direction . Preprint, arXiv:2406.11717
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[23]
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer, Hamed Hassani, and Eric Wong. 2024 a . https://arxiv.org/abs/2404.01318 Jailbreakbench: An open robustness benchmark for jailbreaking large language models . Preprint, arXiv:2404.01318
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[24]
Jailbreaking Black Box Large Language Models in Twenty Queries
Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. 2024 b . https://arxiv.org/abs/2310.08419 Jailbreaking black box large language models in twenty queries . Preprint, arXiv:2310.08419
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[25]
Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar. 2025. https://arxiv.org/abs/2409.05907 Programming refusal with conditional activation steering . Preprint, arXiv:2409.05907
- [26]
-
[27]
Tung-Ling Li and Hongliang Liu. 2025. https://arxiv.org/abs/2506.24056 Logit-gap steering: Efficient short-suffix jailbreaks for aligned large language models . Preprint, arXiv:2506.24056
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[28]
Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2024. https://arxiv.org/abs/2310.04451 Autodan: Generating stealthy jailbreak prompts on aligned large language models . Preprint, arXiv:2310.04451
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[29]
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. 2024. https://arxiv.org/abs/2402.04249 Harmbench: A standardized evaluation framework for automated red teaming and robust refusal . Preprint, arXiv:2402.04249
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [30]
-
[31]
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. https://arxiv.org/abs/2203.02155 Training language models to f...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[32]
Steering Language Models With Activation Engineering
Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. 2023. https://arxiv.org/abs/arXiv:2308.10248 Steering language models with activation engineering
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[33]
Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. https://arxiv.org/abs/2307.02483 Jailbroken: How does llm safety training fail? Preprint, arXiv:2307.02483
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [34]
- [35]
- [36]
-
[37]
Representation Engineering: A Top-Down Approach to AI Transparency
Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, and 2 others. 2025. https://arxiv.org/abs/2310.01405 Representation engineering: A top-...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[38]
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. 2023. https://arxiv.org/abs/2307.15043 Universal and transferable adversarial attacks on aligned language models . Preprint, arXiv:2307.15043
work page internal anchor Pith review Pith/arXiv arXiv 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.