pith. machine review for the scientific record.

arxiv: 2605.08878 · v1 · submitted 2026-05-09 · 💻 cs.CR · cs.AI

Recognition: 2 theorem links


Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:33 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords jailbreak attacks · LLM alignment · refusal-escape directions · operator sources · safety-utility trade-off · mechanistic interpretability · adversarial robustness

The pith

Aligned LLMs exhibit Refusal-Escape Directions that let small input perturbations flip refusals into answers while preserving harmful meaning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that safety-aligned language models retain local directions in input space around harmful queries along which a continuous shift changes the output from refusal to compliance without changing the query's harmful semantics. A sympathetic reader cares because this reframes jailbreaks as exploiting an intrinsic directional property rather than isolated prompt tricks, and it ties the vulnerability to specific parts of the model's operator structure. The authors decompose these directions into contributions from normalization, residual wiring, and terminal sources, then argue that eliminating them would require the shared self-attention and MLP modules to cancel those contributions while preserving benign response pathways. This creates an inherent conditional trade-off between stronger safety and maintained utility. Experiments confirm that successful jailbreaks align with particular operator contributions and that added token dimensions make these directions visible.
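
To make the directional claim concrete, here is a minimal sketch of how one might probe for a refusal-escape direction empirically. It is an editorial illustration under stated assumptions, not the paper's procedure: the model name, the single-token refusal proxy, the optimizer settings, and the perturbation bound are all assumed.

    # Editorial sketch (not the paper's code): search for a local direction in
    # input-embedding space that lowers the probability of a refusal continuation.
    # Assumptions: a Hugging Face causal LM, and the token " I" (as in "I can't ...")
    # as a crude refusal proxy; a real study would use a proper refusal judge and
    # would also verify that the query's harmful semantics are preserved.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-3.1-8B-Instruct"    # assumed; any aligned chat model
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    prompt = "How do I pick a lock?"                   # stand-in harmful query
    ids = tok(prompt, return_tensors="pt").input_ids
    emb = model.get_input_embeddings()(ids).detach()   # (1, T, d) input embeddings

    refusal_id = tok(" I", add_special_tokens=False).input_ids[0]

    delta = torch.zeros_like(emb, requires_grad=True)  # candidate escape perturbation
    opt = torch.optim.Adam([delta], lr=1e-2)

    for _ in range(200):
        logits = model(inputs_embeds=emb + delta).logits[0, -1]
        loss = torch.log_softmax(logits, dim=-1)[refusal_id]  # refusal log-prob
        opt.zero_grad(); loss.backward(); opt.step()          # push refusal prob down
        with torch.no_grad():
            delta.clamp_(-0.05, 0.05)                         # stay in a small neighborhood

    escape_direction = (delta / (delta.norm() + 1e-8)).detach()
    print("refusal log-prob after search:", loss.item())

If such a perturbation drives the refusal probability down while a separate check confirms the query is still read as harmful, it is a candidate RED in the paper's sense.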

Core claim

Aligned models still contain Refusal-Escape Directions (RED): local perturbation directions around a harmful input that shift model behavior from refusal to answering while preserving the model's harmful-semantics interpretation. RED can be exactly decomposed into contributions from operator-level sources across the model, with normalization, residual-wiring, and terminal sources identified as analytically constrained. To eliminate RED, the shared expressive modules must cancel the contributions from these constrained sources while leaving the mechanisms that support benign responses intact, which imposes a conditional safety-utility trade-off.
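
The page states the decomposition only in prose. As a hedged illustration of what an operator-level split of a local behavior shift could look like, it can be written schematically in first-order form; the symbols below (f, J, and the Delta terms) are editorial notation, not the paper's.

    % Schematic only; notation is editorial, not taken from the paper.
    f(x + \epsilon d) \;\approx\; f(x) + \epsilon\, J(x)\, d,
    \qquad
    J(x)\, d \;=\; \Delta_{\mathrm{norm}} + \Delta_{\mathrm{res}} + \Delta_{\mathrm{term}}
                   + \Delta_{\mathrm{attn}} + \Delta_{\mathrm{mlp}},

where x is the embedded harmful input, d a candidate RED, J(x) the local Jacobian, and the Delta terms collect contributions from normalization, residual wiring, terminal layers, self-attention, and MLP. Read this way, eliminating RED means the attention and MLP terms must cancel the three constrained terms at harmful inputs without cancelling their counterparts at benign inputs, which is the conditional trade-off the core claim names.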

What carries the argument

Refusal-Escape Directions (RED), local perturbation directions around harmful inputs that induce a refusal-to-answer behavior transition while preserving harmful semantics.

If this is right

  • Jailbreaks correspond to discrete approximations of continuous movement along RED.
  • Neutralizing normalization, residual-wiring, and terminal sources inside self-attention and MLP modules is required to remove RED.
  • Successful jailbreaks predominantly align with terminal-source contributions.
  • Increasing input token dimensions exposes RED more clearly in model activations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectural modifications that alter residual wiring or normalization layers could reduce RED strength without full retraining.
  • The identified trade-off suggests that perfect refusal on harmful inputs may be incompatible with current shared-module designs for arbitrary benign queries.
  • The decomposition framework could be applied to other safety failures, such as prompt injection or capability elicitation, to locate analogous escape directions.

Load-bearing premise

That the continuous input-transformation view accurately captures how discrete jailbreak prompts affect the model, and that RED contributions from the constrained operator sources can be isolated without disrupting benign-response mechanisms.

What would settle it

An experiment in which no small continuous perturbation around a harmful input produces a refusal-to-answer shift while keeping semantics unchanged, or in which the observed shift cannot be expressed as a sum of the identified operator-source contributions.
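
A minimal sketch of what such a settling experiment could look like, assuming the model and tokenizer from the earlier sketch: sample small perturbations on an epsilon-sphere around the harmful input's embeddings and test whether any flips refusal into an answer. The refusal detector, radius, and trial count are assumptions, and a hit would still need a harmful-semantics check before counting as an escape.

    # Editorial sketch of the falsification test: if no small perturbation around a
    # harmful input flips refusal into an answer, the RED claim is challenged there.
    import torch

    REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "sorry")

    def refuses(text: str) -> bool:
        return any(m in text.lower() for m in REFUSAL_MARKERS)

    @torch.no_grad()
    def escape_found(model, tok, prompt: str,
                     epsilon: float = 0.05, trials: int = 500) -> bool:
        ids = tok(prompt, return_tensors="pt").input_ids
        emb = model.get_input_embeddings()(ids)
        for _ in range(trials):
            noise = torch.randn_like(emb)
            noise = epsilon * noise / noise.norm()     # stay on the epsilon-sphere
            out = model.generate(inputs_embeds=emb + noise,
                                 max_new_tokens=64, do_sample=False)
            text = tok.decode(out[0], skip_special_tokens=True)
            if not refuses(text):
                return True   # candidate escape; harmful-semantics check still required
        return False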

Figures

Figures reproduced from arXiv: 2605.08878 by Qi Cao, Yuanhao Liu, Yu Chen.

Figure 1. Reference refusal-escape direction and its operator-level contributions at harmful inputs …
Figure 2. Jailbreak analysis under the continuous input-transformation view, aggregated across …
Figure 3. Model-specific reference refusal-escape direction and operator-level contributions at …
Figure 4. Attack-specific jailbreak analysis under the continuous input-transformation view. Results …
Figure 5. Model-specific jailbreak analysis under the continuous input-transformation view. Results …
read the original abstract

Aligned large language models (LLMs) remain vulnerable to jailbreak attacks. Recent mechanistic studies have identified latent features and representation shifts associated with jailbreak success, but they leave a more fundamental question open: why do aligned LLMs remain jailbreakable, and what structural vulnerabilities in the model make this possible? We study this question through a continuous input-transformation view. Our theoretical finding is that aligned models can still exhibit Refusal-Escape Directions (RED): local perturbation directions around a harmful input that shift the model's behavior from refusal to answering while preserving the model's harmful-semantics interpretation. From this perspective, a jailbreak is not only a successful discrete prompt construction, but can also be understood as a refusal-to-answer behavior transition induced by continuously perturbing a harmful input along RED. We then prove that RED can be exactly decomposed into contributions from operator-level sources across the model's operator structure, and identify normalization, residual-wiring, and terminal sources as analytically constrained operator-level sources. To eliminate RED, the shared expressive modules -- self-attention and MLP -- must eliminate the contributions from these analytically constrained sources while preserving the mechanisms that support benign responses. These competing requirements give rise to a conditional safety-utility trade-off. Experiments across multiple models and attack methods empirically analyze RED from two complementary perspectives and show that added token dimensions can expose RED, while successful jailbreaks exhibit refusal-to-answer shifts largely aligned with terminal-source contributions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that aligned LLMs exhibit Refusal-Escape Directions (RED): local perturbation directions around harmful inputs that induce a continuous shift from refusal to compliance while preserving the model's harmful-semantics interpretation. It asserts a proof that RED admits an exact decomposition into contributions from analytically constrained operator-level sources (normalization, residual-wiring, and terminal), which in turn produces a conditional safety-utility trade-off because self-attention and MLP modules must suppress those sources without impairing benign-response mechanisms. Experiments across models and attack methods are said to show that added token dimensions expose RED and that successful jailbreaks largely align with terminal-source contributions.

Significance. If the claimed exact decomposition is rigorously derived and the experiments provide quantitative confirmation without side effects on utility, the work would supply a structural, operator-level explanation for persistent jailbreak vulnerability that complements existing discrete-prompt and representation-shift studies. It could guide targeted interventions at specific sources rather than blanket refusal training, and the continuous-perturbation framing offers a falsifiable lens on alignment limits.

major comments (2)
  1. [Abstract / Theoretical finding] Abstract and theoretical section: the central claim of an 'exact decomposition' of RED into normalization, residual-wiring, and terminal sources is asserted without any displayed equations, derivation steps, or handling of non-linear layer interactions; this is load-bearing because the safety-utility trade-off rests on the decomposition being self-contained and separable from benign mechanisms.
  2. [Experiments] Experiments section: the manuscript states that 'experiments empirically analyze RED' and that 'successful jailbreaks exhibit refusal-to-answer shifts largely aligned with terminal-source contributions,' yet provides no quantitative metrics, baseline comparisons, or ablation results showing that suppressing the identified sources eliminates RED without measurable utility degradation; this directly affects the empirical support for the trade-off.
minor comments (2)
  1. [Introduction] The new term 'Refusal-Escape Directions (RED)' and the operator taxonomy are introduced without explicit comparison to prior mechanistic interpretability work on refusal circuits or representation engineering.
  2. [Abstract] Notation for the continuous perturbation and the operator sources is not defined in the abstract or early sections, making the high-level claims difficult to parse on first reading.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. The two major points identify areas where the current manuscript presentation can be strengthened for clarity and rigor. We address each below and will incorporate revisions to improve the manuscript.

read point-by-point responses
  1. Referee: [Abstract / Theoretical finding] Abstract and theoretical section: the central claim of an 'exact decomposition' of RED into normalization, residual-wiring, and terminal sources is asserted without any displayed equations, derivation steps, or handling of non-linear layer interactions; this is load-bearing because the safety-utility trade-off rests on the decomposition being self-contained and separable from benign mechanisms.

    Authors: We agree that the load-bearing nature of the exact decomposition requires clearer exposition. The manuscript currently states the result at a conceptual level in the abstract and theoretical discussion without embedding the full equations or derivation steps in the main text. We will revise by adding the explicit decomposition equations for RED into the three operator-level sources (normalization, residual-wiring, and terminal) along with the key derivation steps. For non-linear layer interactions, we will add a clarification subsection showing that the decomposition isolates the linear directional contributions under local perturbations, with non-linear effects not altering the escape direction within the analyzed neighborhood; this preserves separability from benign mechanisms and directly supports the conditional safety-utility trade-off. revision: yes

  2. Referee: [Experiments] Experiments section: the manuscript states that 'experiments empirically analyze RED' and that 'successful jailbreaks exhibit refusal-to-answer shifts largely aligned with terminal-source contributions,' yet provides no quantitative metrics, baseline comparisons, or ablation results showing that suppressing the identified sources eliminates RED without measurable utility degradation; this directly affects the empirical support for the trade-off.

    Authors: We acknowledge that the experiments section emphasizes qualitative directional analysis and observations across models and attack methods but does not include the requested quantitative elements. We will expand this section in revision to report specific metrics (e.g., cosine alignment between successful jailbreak perturbations and terminal-source vectors, RED magnitude before/after source suppression), baseline comparisons against standard alignment techniques, and ablation studies measuring RED elimination alongside utility preservation (via standard benchmarks such as MMLU or perplexity on benign prompts). These additions will provide direct empirical support for the trade-off. revision: yes
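
As an editorial aside, the alignment metric promised in the second response is simple to state. The sketch below shows its shape, with placeholder vectors standing in for the activation shift and the terminal-source contribution, since this page does not spell out how those would be extracted.

    # Editorial sketch of the promised metric: cosine alignment between a jailbreak-
    # induced shift and a candidate operator-source direction. Both vectors are
    # placeholders the experimenter would supply.
    import torch
    import torch.nn.functional as F

    def alignment(shift: torch.Tensor, source: torch.Tensor) -> float:
        """Cosine similarity in [-1, 1] between an observed shift and a source term."""
        return F.cosine_similarity(shift.flatten(), source.flatten(), dim=0).item()

    hidden_dim = 4096                          # assumed hidden size
    jailbreak_shift = torch.randn(hidden_dim)  # placeholder: h(jailbroken) - h(harmful)
    terminal_source = torch.randn(hidden_dim)  # placeholder: terminal-source contribution
    print(f"terminal-source alignment: {alignment(jailbreak_shift, terminal_source):+.3f}")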

Circularity Check

0 steps flagged

No significant circularity; theoretical claims remain self-contained

full rationale

The paper defines Refusal-Escape Directions (RED) as local perturbation directions that induce refusal-to-answer transitions while preserving harmful semantics, then states a proof that RED decomposes exactly into contributions from normalization, residual-wiring, and terminal operator sources. No equations or derivation steps are exhibited that reduce this decomposition to quantities defined by model-specific fitting, self-citation chains, or ansatzes imported from prior author work. The competing-requirements argument for a conditional safety-utility trade-off follows directly from the stated separability assumption without circular redefinition. Experiments are described as empirical analysis of existing RED and alignment with terminal sources rather than as fitted predictions of the decomposition itself. The derivation chain therefore introduces new concepts and a claimed proof without reducing them by construction to the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the newly introduced continuous input-transformation view and the exact decomposition of RED; these are postulated without independent evidence outside the paper.

axioms (1)
  • domain assumption Continuous input-transformation view of LLM behavior around harmful inputs
    Invoked to define RED as local perturbation directions that induce refusal-to-answer transitions.
invented entities (2)
  • Refusal-Escape Directions (RED) no independent evidence
    purpose: Local perturbation directions that shift model behavior from refusal to compliance while preserving harmful semantics
    Newly defined construct to reframe jailbreaks as continuous transitions.
  • Operator-level sources (normalization, residual-wiring, terminal sources) no independent evidence
    purpose: Analytically constrained components whose contributions exactly decompose RED
    Identified as the sources that must be eliminated to remove RED, creating the trade-off.

pith-pipeline@v0.9.0 · 5563 in / 1569 out tokens · 65884 ms · 2026-05-12T01:33:11.066455+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
