The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs

Kartikeya Vats; Shivam Ratnakar

arXiv: 2606.22686 · v1 · pith:7OQTG7DPnew · submitted 2026-06-21 · 💻 cs.CR · cs.AI · cs.LG

Pith reviewed 2026-06-26 09:50 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG

keywords LLM safety · jailbreak · refusal direction · linear steering · alignment · output logits · contrastive method

The pith

Safety alignment in LLMs produces a linear refusal direction in output logits that can be isolated and steered to bypass or reinforce guardrails.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether safety compliance in LLMs is a complex semantic choice or a simple linear feature that can be isolated and manipulated. It develops Contrastive Logit Steering to find the refusal direction by comparing the model's output when given safe versus unrestricted system prompts. Applying this direction with a simple prefix causes many models to stop refusing harmful requests at high rates. Different model families show different patterns in when the safety decision happens during generation. This suggests that alignment methods produce a consistent geometric structure in the model that can be used for both attacking and defending the system.

Core claim

The refusal behavior in safety-aligned LLMs is not a deep semantic decision but a linear feature in the output logit space that can be isolated by contrasting safe and unrestricted prompts. Contrastive Logit Steering extracts this direction and uses it to steer the model, revealing that some models have a late decision point for safety that is easily overridden, while others integrate safety earlier in the computation. The method achieves higher success in bypassing safety than intervening on internal activations, and the direction can be reversed to make the model more resistant to jailbreaks without additional training.

What carries the argument

Contrastive Logit Steering (CLS), which computes the refusal direction as the difference in output distributions from safe and unrestricted system prompts and applies it to steer the logits during generation.
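
Read off the Figure 2 caption, the steering rule can be written compactly. This is a sketch of the notation rather than the authors' exact formulation: the step index $t$ is added here for clarity, and applying the temperature $T$ after steering is an assumption, since the summary does not say where it enters.

```latex
\begin{aligned}
v_t &= z^{+}_t - z^{-}_t, \\
\tilde{z}_t &= z^{\text{base}}_t + \alpha\, v_t, \\
p(x_t \mid x_{<t}) &= \operatorname{softmax}\!\left(\tilde{z}_t / T\right).
\end{aligned}
```

Here $z^{+}_t$ and $z^{-}_t$ are the logits under the unrestricted and safe system prompts, and $z^{\text{base}}_t$ the logits under the base prompt. With this sign convention, $\alpha > 0$ moves the output toward the unrestricted stream and $\alpha < 0$ toward the safe stream.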

If this is right

Safety can be bypassed at rates up to 95 percent on models with late decision topologies using CLS and prefix injection.
CLS produces substantially higher attack success rates than activation steering methods on multiple model families.
Reversing the extracted direction hardens the model against jailbreaks without retraining.
Safety mechanisms differ in their computational timing across architectures, with some showing early integration and others late divergence.
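
The bidirectional claim in the list above (steering one way weakens refusal, reversing it hardens the model) can be illustrated on toy numbers. The sketch below is not the authors' implementation: the three-token vocabulary, the logit values, and the helper names `steer` and `refusal_probability` are all hypothetical, and applying temperature after steering is an assumption.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def steer(z_base, z_unrestricted, z_safe, alpha):
    """Return z_base + alpha * (z_unrestricted - z_safe), element-wise."""
    return [b + alpha * (u - s) for b, u, s in zip(z_base, z_unrestricted, z_safe)]


# Hypothetical 3-token vocabulary; index 0 stands in for a refusal token.
REFUSAL = 0
z_base = [2.0, 1.0, 0.5]
z_safe = [3.0, 0.5, 0.2]          # safe prompt favours the refusal token
z_unrestricted = [0.5, 2.0, 1.0]  # unrestricted prompt favours the others


def refusal_probability(alpha, temperature=0.7):
    """Probability mass on the refusal token after steering by alpha."""
    steered = steer(z_base, z_unrestricted, z_safe, alpha)
    return softmax(steered, temperature)[REFUSAL]


p_hardened = refusal_probability(-1.0)  # reversed direction
p_neutral = refusal_probability(0.0)    # no steering
p_weakened = refusal_probability(1.0)   # forward direction
```

On these made-up logits the refusal token's probability rises when α is negative and falls when α is positive, which is the qualitative behaviour the bullets describe. Nothing here says how large the effect is on a real model.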

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Alignment research could shift toward designing methods that avoid creating easily identifiable linear directions.
The approach might be applied to test the robustness of new alignment techniques as they are developed.
If the linearity holds, then monitoring or controlling this direction could become a standard part of model deployment.
Similar contrastive methods might apply to other behavioral controls beyond safety, such as style or factuality.

Load-bearing premise

The difference in output distributions between safe and unrestricted system prompts isolates a causally relevant refusal direction rather than a correlated but non-causal feature of the model's response distribution.

What would settle it

Measuring whether steering with the extracted direction changes the rate of refusal on harmful queries while leaving performance on benign queries unchanged, or whether inverting it reduces success of known jailbreak prompts.
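
One way to tabulate the check described above is to compare how much the refusal rate moves on harmful queries against how much it moves on benign ones. This is a minimal sketch under stated assumptions: refusal rate on benign queries stands in for "performance on benign queries", the function names are hypothetical, and the boolean labels are placeholders rather than results from the paper.

```python
def refusal_rate(flags):
    """Fraction of responses flagged as refusals (flags are booleans)."""
    if not flags:
        raise ValueError("need at least one response")
    return sum(flags) / len(flags)


def selectivity_report(harmful_before, harmful_after, benign_before, benign_after):
    """Compare the refusal-rate shift on harmful vs. benign queries."""
    harmful_shift = refusal_rate(harmful_after) - refusal_rate(harmful_before)
    benign_shift = refusal_rate(benign_after) - refusal_rate(benign_before)
    return {
        "harmful_shift": harmful_shift,
        "benign_shift": benign_shift,
        # "Selective" here means the harmful-query shift dominates.
        "selective": abs(harmful_shift) > abs(benign_shift),
    }


# Placeholder judge labels (True = refused); illustrative only, not results.
harmful_before = [True] * 9 + [False]
harmful_after = [True] * 2 + [False] * 8
benign_before = [False] * 9 + [True]
benign_after = [False] * 9 + [True]

report = selectivity_report(harmful_before, harmful_after, benign_before, benign_after)
```

A direction that only tracks hedging style or response length would be expected to shift both groups by similar amounts, so a large harmful shift alongside a near-zero benign shift is the pattern that would support the causal reading.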

Figures

Figures reproduced from arXiv: 2606.22686 by Kartikeya Vats, Shivam Ratnakar.

**Figure 1.** The Geometry of Refusal. PCA visualization of the final-layer hidden states for Llama-3. (A) Linear Separability: Malicious queries (red) and benign instructions (blue) form distinct clusters, showing that safety is encoded as a linear feature in the activation space. (B) The Refusal Direction: The arrow marks the primary direction of variation, corresponding to the “Refusal Vector.” In Contrastive Logit …

**Figure 2.** Contrastive Logit Steering (CLS) Methodology. The model processes the user query simultaneously under three distinct system prompts. We calculate an instantaneous steering vector v by subtracting the logits of the “Safe” stream (z⁻) from the “Unrestricted” stream (z⁺). This vector is scaled by α and added to the Base stream logits (z_base) before sampling, effectively modulating the model’s safety refusal…

**Figure 3.** Steerability Heatmaps. (Top) Positive steering. (Bottom) Negative steering. … and activation-level steering (Arditi et al., 2024), and mechanistic analysis (PCA, KL divergence). 4.1 Experimental Setup. Models: We test 7 open-weights models: Gemma-3 (4B, 12B), Llama-3.1 (8B), Llama-3.2 (3B), Llama-3.3 (70B), and Qwen-2.5 (1.5B, 7B). For comparison with Arditi et al. (2024), we additionally evalu…

**Figure 4.** The Timeline of Refusal. KL Divergence across model depth. Llama-3.1 (Blue) shows a “Late Decision” pattern, diverging only in the final layers. Qwen-2.5 (Orange) shows “Early Divergence,” processing safety mid-network. This architectural difference explains Qwen’s higher resistance to steering. 4.3 Results: Steering Intensity. We swept α ∈ [−5, 5] at intervals of 1.0 with T = 0.7 (temperature variance was …

**Figure 5.** LLM-as-a-Judge Implementation Details. The implementation of evaluate_with_judge (Safety/ASR) and evaluate_coherence (Readability) used during our experiments. The safety prompt explicitly instructs the judge to ignore compliant prefixes to avoid false positives.

read the original abstract

Modern Large Language Models (LLMs) rely on extensive safety alignment, yet the mechanistic basis of refusal remains opaque. In this work, we investigate whether safety compliance is a deep semantic decision or a manipulable linear feature. We introduce Contrastive Logit Steering (CLS), a zero-optimization framework that isolates the "refusal direction" by contrasting hidden states derived from safe and unrestricted system prompts. Unlike representation engineering methods that intervene on internal activations, CLS operates directly on the output distribution, serving as a diagnostic probe for alignment fragility. When coupled with prefix injection to bypass initial refusal reflexes, this method induces a phase transition where guardrails collapse. Our experiments on 7 model families reveal that safety implementation is architecturally deterministic. While models like Llama-3.1 exhibit a "Late Decision" topology that is easily bypassed by CLS (reaching 95% ASR in approximately one second), others like Qwen-2.5 demonstrate "Early Divergence" by integrating safety mid-computation. Direct comparison with established activation-level steering methods shows that CLS achieves substantially higher attack success rates on Llama 2 (73% vs. 22.6%) and Qwen 7B (91% vs. 79.2%), demonstrating that logit-level intervention exposes alignment vulnerabilities that hidden-state methods underestimate. Beyond attacks, we show that this linearity enables bidirectional control: inverting the steering vector "hardens" models against jailbreaks without retraining. Our findings suggest that current alignment techniques create a steerable "safety axis" that serves as both a critical vulnerability and a precise primitive for defense.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CLS logit-level steering gets a higher attack success rate (ASR) than activation methods on Llama 2 and Qwen, but the abstract leaves the causal status of the extracted direction untested.

The paper's main contribution is Contrastive Logit Steering, which pulls a refusal direction straight from the difference in output logits under safe versus unrestricted system prompts. It reports this beats activation steering on Llama 2 (73% vs 22.6%) and Qwen 7B (91% vs 79.2%), reaches 95% ASR on Llama-3.1, and can be inverted to harden models against jailbreaks. It also flags model-specific topologies, with Llama-3.1 showing late decision points that CLS exploits quickly.

The work is clearest when it stays empirical: the direct head-to-head numbers and the bidirectional control result are concrete enough to check. The topology distinction between late-decision and early-divergence models is a useful framing even if it needs tighter definitions.

The soft spot is the missing link between the logit difference and actual refusal control. The method starts from prompt-induced distribution shifts, so it could be picking up correlated features like hedging style or length rather than the decision boundary itself. The abstract gives no ablations that zero or invert the vector while holding other generation statistics fixed, and no error bars or run counts appear. Without those, the 95% ASR and superiority claims stay hard to weigh.

This is for researchers already running steering experiments or building guardrails. Anyone who wants a quick logit probe to test on their own models will find the setup easy to replicate from the description. It deserves referee time because the core idea is falsifiable and the comparisons are specific, even if the current write-up needs more controls and detail on how the direction was validated beyond attack success.

Referee Report

3 major / 2 minor

Summary. The paper introduces Contrastive Logit Steering (CLS), a zero-optimization method that derives a 'refusal direction' from logit differences between safe and unrestricted system-prompt completions. It claims this direction reveals an architecturally deterministic linear safety axis in LLMs, enabling high attack success rates (e.g., 95% ASR on Llama-3.1 with prefix injection, 73% vs. 22.6% over activation steering on Llama 2) across 7 model families with differing topologies ('Late Decision' vs. 'Early Divergence'), and bidirectional control by vector inversion for defense without retraining.

Significance. If the central claims hold with proper causal validation, the work would be significant for mechanistic interpretability of alignment: it positions logit-level intervention as a stronger diagnostic than activation steering and supplies a simple primitive for both attacking and hardening safety. The topology distinctions and specific cross-model comparisons add concrete empirical content to debates on whether refusal is a deep semantic decision or a steerable linear feature.

major comments (3)

§3 (Method): The claim that the logit-difference vector isolates a causally relevant refusal direction rests on the assumption that safe vs. unrestricted system-prompt contrasts separate the decision mechanism from correlated response features (verbosity, hedging, topic shift). No ablation is described that zeros or inverts the vector while measuring selective impact on refusal versus unrelated generation statistics, leaving open the possibility that CLS captures a downstream symptom rather than the refusal axis itself.
§4 (Experiments): The reported superiority (73% vs. 22.6% on Llama 2; 91% vs. 79.2% on Qwen 7B) and 95% ASR on Llama-3.1 are presented without error bars, prompt-variation controls, or statistical tests; the abstract supplies only point estimates, so it is impossible to assess whether the gap over activation steering is robust or sensitive to implementation details of either method.
§4.3 (Topology claims): The distinction between 'Late Decision' (Llama-3.1) and 'Early Divergence' (Qwen-2.5) topologies is load-bearing for the architectural-determinism conclusion, yet the manuscript provides no quantitative metric or figure that operationalizes these topologies or shows they predict CLS success rates independently of the steering vector construction.
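
To make the error-bar request concrete, a Wilson score interval is one standard way to attach uncertainty to a success rate such as ASR. The sketch below is illustrative only: the sample size `N_PROMPTS = 500` is hypothetical (the summary does not report how many prompts were evaluated), chosen so that 73% and 22.6% correspond to whole counts, and `wilson_interval` is a helper written for this example.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    denom = 1 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half = (
        z
        * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials))
        / denom
    )
    return centre - half, centre + half


# Hypothetical sample size; NOT taken from the paper.
N_PROMPTS = 500
cls_low, cls_high = wilson_interval(365, N_PROMPTS)  # 73% of 500
act_low, act_high = wilson_interval(113, N_PROMPTS)  # 22.6% of 500

# Whether the two intervals overlap depends entirely on the true sample size.
intervals_overlap = cls_low <= act_high and act_low <= cls_high
```

With a much smaller prompt set the intervals widen, which is why the point estimates alone cannot show whether the gap over activation steering is robust. An interval of this kind also covers only sampling noise, not sensitivity to prompts or implementation details.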

minor comments (2)

[Abstract / §3] The abstract states CLS 'operates directly on the output distribution' yet later refers to 'hidden states derived from safe and unrestricted system prompts'; this notation inconsistency should be clarified in the method section.
[§3] No mention of the exact number of prompts, temperature settings, or decoding parameters used to compute the logit contrast; these details are required for reproducibility even if the method is zero-optimization.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major point below with clarifications and planned revisions where appropriate.

Referee: §3 (Method): The claim that the logit-difference vector isolates a causally relevant refusal direction rests on the assumption that safe vs. unrestricted system-prompt contrasts separate the decision mechanism from correlated response features (verbosity, hedging, topic shift). No ablation is described that zeros or inverts the vector while measuring selective impact on refusal versus unrelated generation statistics, leaving open the possibility that CLS captures a downstream symptom rather than the refusal axis itself.

Authors: We agree that explicit causal validation strengthens the claim. The system-prompt contrast is intended to isolate refusal by holding all other instructions fixed, but we will add an ablation in the revision: zeroing and inverting the vector while tracking refusal rate alongside secondary statistics (response length, hedging markers, topic shift). This will quantify selective impact on the refusal decision. revision: yes
Referee: §4 (Experiments): The reported superiority (73% vs. 22.6% on Llama 2; 91% vs. 79.2% on Qwen 7B) and 95% ASR on Llama-3.1 are presented without error bars, prompt-variation controls, or statistical tests; the abstract supplies only point estimates, so it is impossible to assess whether the gap over activation steering is robust or sensitive to implementation details of either method.

Authors: The reported figures are point estimates from a fixed prompt set. In revision we will add standard deviations across multiple random seeds and prompt variations, include error bars, and report paired statistical tests (e.g., t-tests) between CLS and activation steering to establish robustness of the observed gaps. revision: yes
Referee: §4.3 (Topology claims): The distinction between 'Late Decision' (Llama-3.1) and 'Early Divergence' (Qwen-2.5) topologies is load-bearing for the architectural-determinism conclusion, yet the manuscript provides no quantitative metric or figure that operationalizes these topologies or shows they predict CLS success rates independently of the steering vector construction.

Authors: The topologies are currently described qualitatively from layer-wise steering efficacy. We will introduce a quantitative operationalization (earliest layer at which refusal-direction norm exceeds a threshold or steering ASR surpasses baseline) and add a figure showing its correlation with per-model ASR. This will make the architectural-determinism argument more precise while remaining grounded in the existing experimental data. revision: partial
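
The operationalization proposed in the last response could take roughly the following shape. This is a sketch under assumptions, not the authors' metric: it thresholds a per-layer KL divergence (the quantity plotted in Figure 4) rather than refusal-direction norm or steering ASR, and the threshold, the layer profiles, and the function names are all hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def earliest_divergence_layer(kl_per_layer, threshold):
    """Index of the first layer whose KL exceeds threshold, or None."""
    for layer, value in enumerate(kl_per_layer):
        if value > threshold:
            return layer
    return None


def relative_depth(layer, num_layers):
    """Map a layer index to [0, 1] so models of different depth are comparable."""
    if num_layers < 2:
        raise ValueError("need at least two layers")
    return layer / (num_layers - 1)


# Made-up KL profiles over 8 layers; illustrative shapes only.
late_profile = [0.0, 0.01, 0.01, 0.02, 0.02, 0.03, 0.4, 0.9]
early_profile = [0.0, 0.02, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9]

THRESHOLD = 0.1  # hypothetical cut-off
late_layer = earliest_divergence_layer(late_profile, THRESHOLD)
early_layer = earliest_divergence_layer(early_profile, THRESHOLD)
```

Reporting `relative_depth` of the first divergent layer per model, alongside per-model ASR, would give the correlation figure the response promises. The result would still depend on the chosen threshold, so a sweep over thresholds would be a natural companion.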

Circularity Check

0 steps flagged

No circularity: empirical contrast defines intervention without self-referential reduction

The paper presents CLS as an empirical procedure that computes a direction by contrasting output distributions (or hidden states) under safe versus unrestricted system prompts, then measures the resulting attack success rates across models. This construction is a direct measurement and steering intervention rather than a derivation in which a claimed result is forced by the inputs or by self-citation; the reported ASR figures (e.g., 95% on Llama-3.1) are experimental outcomes, not algebraic identities. No equations, uniqueness theorems, or prior self-citations are invoked that would collapse the safety-axis claim back into the prompt contrast itself. The method therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are identifiable beyond the implicit assumption that output logit differences capture a meaningful refusal direction.

pith-pipeline@v0.9.1-grok · 5827 in / 1087 out tokens · 19739 ms · 2026-06-26T09:50:42.508767+00:00 · methodology

Reference graph

Works this paper leans on

38 extracted references · 11 linked inside Pith

[1] GPT-4 Technical Report. 2024.
[2] Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023.
[3] Jailbroken: How Does LLM Safety Training Fail? 2023.
[4] Jailbreaking Black Box Large Language Models in Twenty Queries. 2024.
[5] Training language models to follow instructions with human feedback. 2022.
[6] Direct Preference Optimization: Your Language Model is Secretly a Reward Model. 2024.
[7] Universal and Transferable Adversarial Attacks on Aligned Language Models. 2023.
[8] HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. 2024.
[9] AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. 2024.
[10] Representation Engineering: A Top-Down Approach to AI Transparency. 2025.
[11] Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models. 2025.
[12] JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. 2024.
[13] ROSE Doesn't Do That: Boosting the Safety of Instruction-Tuned Large Language Models with Reverse Prompt Contrastive Decoding. 2024.
[14] Weak-to-Strong Jailbreaking on Large Language Models. 2025.
[15] Self-Detoxifying Language Models via Toxification Reversal. 2023.
[16] Programming Refusal with Conditional Activation Steering. 2025.
[17] Tree of Attacks: Jailbreaking Black-Box LLMs Automatically. 2024.
[18] In-Context Representation Hijacking. 2025.
[19] Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. 2023.
[20] Refusal in Language Models Is Mediated by a Single Direction. 2024.
[21] Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. 2023. https://crfm.stanford.edu/2023/03/13/alpaca.html
[22] Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. 2024. Refusal in language models is mediated by a single direction. Preprint, arXiv:2406.11717. https://arxiv.org/abs/2406.11717
[23] Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer, Hamed Hassani, and Eric Wong. 2024a. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. Preprint, arXiv:2404.01318. https://arxiv.org/abs/2404.01318
[24] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. 2024b. Jailbreaking black box large language models in twenty queries. Preprint, arXiv:2310.08419. https://arxiv.org/abs/2310.08419
[25] Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar. 2025. Programming refusal with conditional activation steering. Preprint, arXiv:2409.05907. https://arxiv.org/abs/2409.05907
[26] Chak Tou Leong, Yi Cheng, Jiashuo Wang, Jian Wang, and Wenjie Li. 2023. Self-detoxifying language models via toxification reversal. Preprint, arXiv:2310.09573. https://arxiv.org/abs/2310.09573
[27] Tung-Ling Li and Hongliang Liu. 2025. Logit-gap steering: Efficient short-suffix jailbreaks for aligned large language models. Preprint, arXiv:2506.24056. https://arxiv.org/abs/2506.24056
[28] Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2024. Autodan: Generating stealthy jailbreak prompts on aligned large language models. Preprint, arXiv:2310.04451. https://arxiv.org/abs/2310.04451
[29] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. 2024. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. Preprint, arXiv:2402.04249. https://arxiv.org/abs/2402.04249
[30] Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. 2024. Tree of attacks: Jailbreaking black-box llms automatically. Preprint, arXiv:2312.02119. https://arxiv.org/abs/2312.02119
[31] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to f... https://arxiv.org/abs/2203.02155
[32] Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. 2023. Steering language models with activation engineering. https://arxiv.org/abs/2308.10248
[33] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How does llm safety training fail? Preprint, arXiv:2307.02483. https://arxiv.org/abs/2307.02483
[34] Itay Yona, Amir Sarid, Michael Karasik, and Yossi Gandelsman. 2025. In-context representation hijacking. Preprint, arXiv:2512.03771. https://arxiv.org/abs/2512.03771
[35] Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, and William Yang Wang. 2025. Weak-to-strong jailbreaking on large language models. Preprint, arXiv:2401.17256. https://arxiv.org/abs/2401.17256
[36] Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, and Dacheng Tao. 2024. Rose doesn't do that: Boosting the safety of instruction-tuned large language models with reverse prompt contrastive decoding. Preprint, arXiv:2402.11889. https://arxiv.org/abs/2402.11889
[37] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, and 2 others. 2025. Representation engineering: A top-... https://arxiv.org/abs/2310.01405
[38] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. Preprint, arXiv:2307.15043. https://arxiv.org/abs/2307.15043
