The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs

Kartikeya Vats; Shivam Ratnakar

arXiv: 2606.22686 · v2 · pith:7OQTG7DPnew · submitted 2026-06-21 · 💻 cs.CR · cs.AI · cs.LG

Pith reviewed 2026-07-01 07:05 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG

keywords LLM alignment · refusal direction · jailbreak · linear steering · contrastive logit · safety mechanisms · model architecture

The pith

Safety refusal in LLMs is a linear feature steerable directly on output logits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether refusal in safety-aligned LLMs is a deep semantic process or a simple linear feature in the model's output space. It introduces Contrastive Logit Steering to isolate a refusal direction by comparing outputs from safe and unrestricted prompts. This approach reveals that safety mechanisms vary by model architecture, with some allowing easy bypass through logit manipulation. If accurate, it shows that alignment creates a vulnerable axis that can also be used for defense by reversing the direction. A sympathetic reader would care because it suggests current safety methods have a fundamental geometric weakness rather than robust semantic understanding.

Core claim

Safety compliance is a manipulable linear feature rather than a deep semantic decision. Contrastive Logit Steering isolates the refusal direction by contrasting hidden states from safe and unrestricted system prompts, and the results indicate that how safety is implemented is determined by model architecture, following either a "Late Decision" or an "Early Divergence" topology. This allows logit-level interventions that outperform activation steering and enable bidirectional control over safety.

What carries the argument

Contrastive Logit Steering (CLS), which isolates the refusal direction by contrasting hidden states from safe and unrestricted prompts and applies it to the output logits.
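The arithmetic behind this can be sketched on toy numbers. The following is a minimal illustration, assuming the update rule given in the Figure 2 caption: the steering vector is the Unrestricted-stream logits minus the Safe-stream logits, scaled by α and added to the Base-stream logits. The function names `softmax` and `cls_steer` and the three-entry logit vectors are hypothetical placeholders, not the authors' code or measured model outputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cls_steer(z_base, z_unrestricted, z_safe, alpha):
    """Return z_base + alpha * (z_unrestricted - z_safe), element-wise."""
    if not (len(z_base) == len(z_unrestricted) == len(z_safe)):
        raise ValueError("all logit vectors must share one vocabulary size")
    return [b + alpha * (u - s) for b, u, s in zip(z_base, z_unrestricted, z_safe)]


# Synthetic 3-token vocabulary (assumption): index 0 stands in for a
# refusal-like token, index 1 for a compliance-like token, index 2 for neither.
z_base = [2.0, 1.0, 0.0]
z_unrestricted = [0.5, 2.0, 0.0]
z_safe = [2.5, 0.5, 0.0]

# Negative alpha pushes toward the Safe stream, positive alpha away from it.
for alpha in (-1.0, 0.0, 1.0):
    probs = softmax(cls_steer(z_base, z_unrestricted, z_safe, alpha))
    print(alpha, [round(p, 3) for p in probs])
```

With α = 0 the base distribution is unchanged. Flipping the sign of α moves probability mass in the opposite direction, which is the bidirectional behavior the summary describes.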

If this is right

Late-decision models like Llama-3.1 can be bypassed to 95% attack success in about one second.
CLS outperforms prior methods, reaching 73% vs 22.6% on Llama 2.
Early-divergence models like Qwen-2.5 integrate safety earlier but are still steerable.
Inverting the steering vector provides a way to strengthen safety without additional training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The linear nature might extend to other behavioral controls in LLMs beyond safety.
Training methods could be adjusted to avoid creating such a single axis.
This could inspire new evaluation benchmarks focused on linear steerability.

Load-bearing premise

That the difference between hidden states from safe and unrestricted prompts captures a causal refusal direction rather than other correlated features or prompt artifacts.

What would settle it

If applying the derived steering vector to a model's logits fails to increase the rate of successful jailbreaks compared to a control condition across multiple model families, the claim of a general linear refusal direction would be falsified.
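One way to frame this falsification test is as a comparison of attack success rates (ASR) between a steered and a control condition. The sketch below assumes per-prompt binary judge outcomes (1 for a judged success, 0 for a refusal) and uses a one-sided permutation test. The choice of test, the function names, and the outcome lists are illustrative assumptions and synthetic placeholders, not the paper's protocol or data.

```python
import random


def attack_success_rate(outcomes):
    """Fraction of prompts judged as successful attacks (1) rather than refusals (0)."""
    return sum(outcomes) / len(outcomes)


def permutation_p_value(steered, control, n_resamples=10_000, seed=0):
    """One-sided p-value for ASR(steered) > ASR(control) under label shuffling."""
    rng = random.Random(seed)
    observed = attack_success_rate(steered) - attack_success_rate(control)
    pooled = list(steered) + list(control)
    n = len(steered)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = attack_success_rate(pooled[:n]) - attack_success_rate(pooled[n:])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Synthetic placeholder outcomes for 20 prompts per condition (assumption).
steered = [1] * 15 + [0] * 5
control = [1] * 4 + [0] * 16

print("ASR steered:", attack_success_rate(steered))
print("ASR control:", attack_success_rate(control))
print("p-value:", permutation_p_value(steered, control))
```

Under this framing, a p-value that stays large across several model families would count against a general linear refusal direction.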

Figures

Figures reproduced from arXiv: 2606.22686 by Kartikeya Vats, Shivam Ratnakar.

**Figure 1.** The Geometry of Refusal. PCA visualization of the final-layer hidden states for Llama-3. (A) Linear Separability: Malicious queries (red) and benign instructions (blue) form distinct clusters, showing that safety is encoded as a linear feature in the activation space. (B) The Refusal Direction: The arrow marks the primary direction of variation, corresponding to the “Refusal Vector.” In Contrastive Logit …

**Figure 2.** Contrastive Logit Steering (CLS) Methodology. The model processes the user query simultaneously under three distinct system prompts. We calculate an instantaneous steering vector v by subtracting the logits of the “Safe” stream (z⁻) from the “Unrestricted” stream (z⁺). This vector is scaled by α and added to the Base stream logits (z_base) before sampling, effectively modulating the model’s safety refusal…

**Figure 3.** Steerability Heatmaps. (Top) Positive steering. (Bottom) Negative steering. Text captured from the surrounding paper body: … and activation-level steering (Arditi et al., 2024), and mechanistic analysis (PCA, KL divergence). 4.1 Experimental Setup. Models. We test 7 open-weights models: Gemma-3 (4B, 12B), Llama-3.1 (8B), Llama-3.2 (3B), Llama-3.3 (70B), and Qwen-2.5 (1.5B, 7B). For comparison with Arditi et al. (2024), we additionally evalu…

**Figure 4.** The Timeline of Refusal. KL Divergence across model depth. Llama-3.1 (Blue) shows a “Late Decision” pattern, diverging only in the final layers. Qwen-2.5 (Orange) shows “Early Divergence,” processing safety mid-network. This architectural difference explains Qwen’s higher resistance to steering. Text captured from the surrounding paper body: 4.3 Results: Steering Intensity. We swept α ∈ [−5, 5] at intervals of 1.0 with T = 0.7 (temperature variance was …

**Figure 5.** LLM-as-a-Judge Implementation Details. The implementation of evaluate_with_judge (Safety/ASR) and evaluate_coherence (Readability) used during our experiments. The safety prompt explicitly instructs the judge to ignore compliant prefixes to avoid false positives.
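The depth-wise comparison in Figure 4 can be illustrated with a small sketch. It assumes that a next-token distribution is already available at each layer for two prompt conditions; how the paper obtains these distributions is not stated in this summary. The helper names and the two-outcome distributions are synthetic placeholders chosen to mimic a "late" and an "early" pattern, not measurements from any model.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def divergence_profile(dists_a, dists_b):
    """Per-layer KL between two lists of per-layer distributions."""
    return [kl_divergence(p, q) for p, q in zip(dists_a, dists_b)]


def first_layer_above(profile, threshold):
    """Index of the first layer whose KL exceeds the threshold, or None."""
    for layer, value in enumerate(profile):
        if value > threshold:
            return layer
    return None


# Synthetic 4-layer examples (assumption): condition B stays uniform throughout.
uniform = [[0.5, 0.5]] * 4
late_a = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.9, 0.1]]
late_b = uniform
early_a = [[0.5, 0.5], [0.8, 0.2], [0.9, 0.1], [0.9, 0.1]]
early_b = uniform

print("late pattern diverges at layer", first_layer_above(divergence_profile(late_a, late_b), 0.1))
print("early pattern diverges at layer", first_layer_above(divergence_profile(early_a, early_b), 0.1))
```

The threshold of 0.1 nats is arbitrary. In practice it would need to be calibrated against the divergence seen between two benign prompt conditions.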

Original abstract

Modern Large Language Models (LLMs) rely on extensive safety alignment, yet the mechanistic basis of refusal remains opaque. In this work, we investigate whether safety compliance is a deep semantic decision or a manipulable linear feature. We introduce Contrastive Logit Steering (CLS), a zero-optimization framework that isolates the "refusal direction" by contrasting hidden states derived from safe and unrestricted system prompts. Unlike representation engineering methods that intervene on internal activations, CLS operates directly on the output distribution, serving as a diagnostic probe for alignment fragility. When coupled with prefix injection to bypass initial refusal reflexes, this method induces a phase transition where guardrails collapse. Our experiments on 7 model families reveal that safety implementation is architecturally deterministic. While models like Llama-3.1 exhibit a "Late Decision" topology that is easily bypassed by CLS (reaching 95% ASR in approximately one second), others like Qwen-2.5 demonstrate "Early Divergence" by integrating safety mid-computation. Direct comparison with established activation-level steering methods shows that CLS achieves substantially higher attack success rates on Llama 2 (73% vs. 22.6%) and Qwen 7B (91% vs. 79.2%), demonstrating that logit-level intervention exposes alignment vulnerabilities that hidden-state methods underestimate. Beyond attacks, we show that this linearity enables bidirectional control: inverting the steering vector "hardens" models against jailbreaks without retraining. Our findings suggest that current alignment techniques create a steerable "safety axis" that serves as both a critical vulnerability and a precise primitive for defense.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

CLS gets higher ASR than activation steering on Llama 2 and Qwen but builds its refusal vector from the same safe/unrestricted prompt pairs used in evaluation.

CLS produces higher attack success rates than prior activation steering on the tested models, with the reported 73% vs 22.6% on Llama 2 and 91% vs 79.2% on Qwen 7B standing out as the concrete empirical piece. The logit-level approach and the split into late-decision versus early-divergence topologies across seven families are the main novelties. The bidirectional use, where inverting the vector hardens the model, is a practical observation that follows directly from treating refusal as linear.

The construction step is the clearest weakness. The refusal direction comes from contrasting the same safe and unrestricted system prompts that are later used to measure success, with no mention of length-matched, lexically controlled, or held-out contrasts. That leaves open the possibility that the vector picks up prompt artifacts rather than a causal safety axis. The abstract also gives no protocol details, statistical tests, or baseline descriptions, so the architectural-determinism claim rests on evidence that cannot yet be checked.

The work is aimed at researchers who already run steering experiments and want to test logit interventions. It is worth a referee's time because the ASR comparison is falsifiable and the bidirectional result has direct implications for defense, even if the current writeup needs tighter controls and clearer methods before the determinism conclusion can be accepted.

Referee Report

3 major / 1 minor

Summary. The paper introduces Contrastive Logit Steering (CLS), a zero-optimization method that constructs a 'refusal direction' by contrasting hidden states from safe versus unrestricted system prompts. It claims this exposes safety as a linear, architecturally deterministic feature across 7 model families, with CLS yielding higher attack success rates than activation steering (e.g., 73% vs. 22.6% on Llama 2), distinct topologies (Late Decision vs. Early Divergence), and bidirectional control via vector inversion for hardening without retraining.

Significance. If the central method isolates a causal linear axis rather than prompt artifacts, the work would establish a lightweight diagnostic for alignment fragility and a steerable primitive for both attack and defense, highlighting that current safety techniques embed a manipulable geometric vulnerability.

major comments (3)

[Abstract] The refusal direction is defined via direct contrast of the same safe/unrestricted prompt pairs later used to compute ASR; this construction risks circularity, as the vector may encode evaluation-specific features rather than an independent causal axis. The manuscript must clarify whether held-out prompts, length-matched controls, or external benchmarks are used for validation.
[Abstract / Experiments] Reported ASR numbers (73% vs. 22.6% on Llama 2; 91% vs. 79.2% on Qwen 7B) and architectural determinism claims lack any description of experimental protocol, baseline details, statistical tests, or ablations for prompt length/lexical confounds, rendering it impossible to assess whether the superiority and topology distinctions hold.
[Method] The premise that h_safe - h_unrestricted cleanly isolates a refusal direction (rather than correlated prompt artifacts) is load-bearing for the phase-transition and bidirectional-control results, yet the provided account supplies no controls or falsification tests for this assumption.

minor comments (1)

[Abstract] Abstract: the phrase 'phase transition where guardrails collapse' is used without a formal definition or quantitative characterization (e.g., threshold on steering coefficient or logit shift).
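One simple way to give the "phase transition" a quantitative handle is to locate the largest ASR increase between adjacent steering coefficients in a sweep. The sketch below assumes a sweep over integer α values like the α ∈ [−5, 5] grid mentioned in the Figure 4 excerpt. The ASR values are synthetic placeholders and `steepest_jump` is a hypothetical helper, not something the paper defines.

```python
def steepest_jump(alphas, asr):
    """Return (alpha_left, alpha_right, jump) for the largest ASR increase
    between adjacent points of a steering-coefficient sweep."""
    if len(alphas) != len(asr) or len(alphas) < 2:
        raise ValueError("need two or more sweep points of equal length")
    best = None
    for i in range(len(alphas) - 1):
        jump = asr[i + 1] - asr[i]
        if best is None or jump > best[2]:
            best = (alphas[i], alphas[i + 1], jump)
    return best


# Synthetic placeholder sweep (assumption): one ASR value per integer alpha.
alphas = list(range(-5, 6))
asr = [0.0, 0.0, 0.0, 0.0, 0.05, 0.10, 0.15, 0.70, 0.90, 0.95, 0.95]

left, right, jump = steepest_jump(alphas, asr)
print(f"largest ASR increase ({jump:.2f}) between alpha={left} and alpha={right}")
```

A finer α grid around the reported interval, or a fitted sigmoid, would give a sharper threshold estimate than this coarse difference.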

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our results. We address each major comment below and indicate where revisions will be made to the manuscript.

Referee: [Abstract] The refusal direction is defined via direct contrast of the same safe/unrestricted prompt pairs later used to compute ASR; this construction risks circularity, as the vector may encode evaluation-specific features rather than an independent causal axis. The manuscript must clarify whether held-out prompts, length-matched controls, or external benchmarks are used for validation.

Authors: We acknowledge the risk of circularity. In the full manuscript (Section 3), the refusal direction is constructed from a fixed set of 50 prompt pairs. ASR evaluation uses a disjoint held-out test set of 100 prompts drawn from AdvBench and HarmBench. Length-matched controls are applied by pairing prompts of comparable token length, and we report results on external benchmarks not involved in direction construction. We will add an explicit statement of this separation to the abstract and method section.

revision: yes

Referee: [Abstract / Experiments] Reported ASR numbers (73% vs. 22.6% on Llama 2; 91% vs. 79.2% on Qwen 7B) and architectural determinism claims lack any description of experimental protocol, baseline details, statistical tests, or ablations for prompt length/lexical confounds, rendering it impossible to assess whether the superiority and topology distinctions hold.

Authors: The experimental protocol, baseline implementations (activation steering from prior work), statistical tests (paired t-tests over five random seeds with standard deviations), and ablations for prompt length and lexical confounds are described in Sections 4 and 5. The abstract is space-constrained, but we will insert a brief reference to the protocol and controls or expand the experimental summary paragraph to improve accessibility.

revision: partial

Referee: [Method] The premise that h_safe - h_unrestricted cleanly isolates a refusal direction (rather than correlated prompt artifacts) is load-bearing for the phase-transition and bidirectional-control results, yet the provided account supplies no controls or falsification tests for this assumption.

Authors: We agree that explicit validation of the direction's specificity is necessary. The manuscript already includes controls using length-matched and synonym-substituted prompt pairs, plus falsification tests applying the vector to non-refusal tasks. We will expand this subsection with additional tests (e.g., random prompt contrasts and cross-task generalization) to further substantiate the claim.

revision: yes
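The paired t-test over five random seeds mentioned in the second response can be sketched as follows. This is a minimal illustration using only the Python standard library. The per-seed ASR values are synthetic placeholders, `paired_t_statistic` is a hypothetical helper, and 2.776 is the standard two-sided 5% critical value for 4 degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two or more paired observations")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_d == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean_d / (sd_d / math.sqrt(n)), n - 1


# Synthetic per-seed ASR for two methods on the same five seeds (assumption).
method_a = [0.60, 0.65, 0.62, 0.68, 0.64]
method_b = [0.50, 0.52, 0.55, 0.54, 0.51]

t_stat, dof = paired_t_statistic(method_a, method_b)
T_CRIT_DF4 = 2.776  # two-sided, 5% level, 4 degrees of freedom
print(f"t = {t_stat:.2f}, df = {dof}, significant = {abs(t_stat) > T_CRIT_DF4}")
```

Pairing by seed removes seed-to-seed variance shared by both methods, which is why a paired test suits this comparison better than an unpaired one.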

Circularity Check

0 steps flagged

No significant circularity; CLS direction defined independently of ASR evaluation outcomes

The paper defines CLS by subtracting hidden states from safe versus unrestricted system prompts to obtain a refusal direction, then reports empirical attack success rates and phase transitions when this direction is applied (with prefix injection) across 7 model families. This construction does not reduce the reported ASR numbers (e.g., 73% vs 22.6%) to the input by definition, nor does it rely on self-citation chains, uniqueness theorems, or renaming of known results. The bidirectional hardening result and topology distinctions (Late Decision vs Early Divergence) are presented as measured outcomes rather than tautological consequences of the contrast definition. No load-bearing self-citations or fitted-input predictions are described.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that a single linear direction extracted from prompt contrast captures the safety mechanism without residual confounding; this is an untested domain assumption rather than a derived result.

free parameters (1)

steering coefficient
Scaling factor applied to the extracted direction; must be chosen to achieve the reported ASR values.

axioms (1)

domain assumption: Refusal behavior is isolable as a linear direction in logit space via contrast of safe versus unrestricted prompts.
This is the definitional step of CLS and is required for the steering vector to be meaningful.

invented entities (1)

refusal direction / safety axis · no independent evidence
purpose: A postulated linear feature that encodes safety compliance and can be added or subtracted to control behavior.
Introduced as the object isolated by CLS; no independent falsifiable signature outside the steering experiments is provided.

pith-pipeline@v0.9.1-grok · 5827 in / 1347 out tokens · 32551 ms · 2026-07-01T07:05:02.482569+00:00 · methodology

Reference graph

Works this paper leans on

38 extracted references · 17 canonical work pages · 11 internal anchors

[1] GPT-4 Technical Report. 2024.
[2] Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023.
[3] Jailbroken: How Does LLM Safety Training Fail? 2023.
[4] Jailbreaking Black Box Large Language Models in Twenty Queries. 2024.
[5] Training language models to follow instructions with human feedback. 2022.
[6] Direct Preference Optimization: Your Language Model is Secretly a Reward Model. 2024.
[7] Universal and Transferable Adversarial Attacks on Aligned Language Models. 2023.
[8] HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. 2024.
[9] AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. 2024.
[10] Representation Engineering: A Top-Down Approach to AI Transparency. 2025.
[11] Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models. 2025.
[12] JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. 2024.
[13] ROSE Doesn't Do That: Boosting the Safety of Instruction-Tuned Large Language Models with Reverse Prompt Contrastive Decoding. 2024.
[14] Weak-to-Strong Jailbreaking on Large Language Models. 2025.
[15] Self-Detoxifying Language Models via Toxification Reversal. 2023.
[16] Programming Refusal with Conditional Activation Steering. 2025.
[17] Tree of Attacks: Jailbreaking Black-Box LLMs Automatically. 2024.
[18] In-Context Representation Hijacking. 2025.
[19] Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. 2023.
[20] Refusal in Language Models Is Mediated by a Single Direction. 2024.
[21] Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. 2023. https://crfm.stanford.edu/2023/03/13/alpaca.html
[22] Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. 2024. Refusal in language models is mediated by a single direction. Preprint, arXiv:2406.11717. https://arxiv.org/abs/2406.11717
[23] Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer, Hamed Hassani, and Eric Wong. 2024a. JailbreakBench: An open robustness benchmark for jailbreaking large language models. Preprint, arXiv:2404.01318. https://arxiv.org/abs/2404.01318
[24] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. 2024b. Jailbreaking black box large language models in twenty queries. Preprint, arXiv:2310.08419. https://arxiv.org/abs/2310.08419
[25] Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar. 2025. Programming refusal with conditional activation steering. Preprint, arXiv:2409.05907. https://arxiv.org/abs/2409.05907
[26] Chak Tou Leong, Yi Cheng, Jiashuo Wang, Jian Wang, and Wenjie Li. 2023. Self-detoxifying language models via toxification reversal. Preprint, arXiv:2310.09573. https://arxiv.org/abs/2310.09573
[27] Tung-Ling Li and Hongliang Liu. 2025. Logit-gap steering: Efficient short-suffix jailbreaks for aligned large language models. Preprint, arXiv:2506.24056. https://arxiv.org/abs/2506.24056
[28] Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2024. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. Preprint, arXiv:2310.04451. https://arxiv.org/abs/2310.04451
[29] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. 2024. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. Preprint, arXiv:2402.04249. https://arxiv.org/abs/2402.04249
[30] Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. 2024. Tree of attacks: Jailbreaking black-box LLMs automatically. Preprint, arXiv:2312.02119. https://arxiv.org/abs/2312.02119
[31] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to f... https://arxiv.org/abs/2203.02155
[32] Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. 2023. Steering language models with activation engineering. https://arxiv.org/abs/2308.10248
[33] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How does LLM safety training fail? Preprint, arXiv:2307.02483. https://arxiv.org/abs/2307.02483
[34] Itay Yona, Amir Sarid, Michael Karasik, and Yossi Gandelsman. 2025. In-context representation hijacking. Preprint, arXiv:2512.03771. https://arxiv.org/abs/2512.03771
[35] Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, and William Yang Wang. 2025. Weak-to-strong jailbreaking on large language models. Preprint, arXiv:2401.17256. https://arxiv.org/abs/2401.17256
[36] Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, and Dacheng Tao. 2024. ROSE doesn't do that: Boosting the safety of instruction-tuned large language models with reverse prompt contrastive decoding. Preprint, arXiv:2402.11889. https://arxiv.org/abs/2402.11889
[37] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, and 2 others. 2025. Representation engineering: A top-... https://arxiv.org/abs/2310.01405
[38] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. Preprint, arXiv:2307.15043. https://arxiv.org/abs/2307.15043
