pith. sign in

arxiv: 2606.05614 · v1 · pith:AYI57RKGnew · submitted 2026-06-04 · 💻 cs.AI

Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack

Pith reviewed 2026-06-28 01:57 UTC · model grok-4.3

classification 💻 cs.AI
keywords safety alignmentjailbreak attacksLLM vulnerabilitiesposterior attacksafety paradoxreinforcement learningguardrailsharmful content detection
0
0 comments X

The pith

Better safety alignment in LLMs makes them more vulnerable to a single-query jailbreak that asks them to output the harmful content their own classifier would flag.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that training LLMs to better recognize unsafe content creates an exploitable weakness. The Posterior Attack works by instructing the model to produce exactly the harmful response its internal safety mechanism would detect, and experiments show this succeeds more often on models with stronger safety judgment. The authors formalize this as the Safety Paradox and use reinforcement learning to show that worsening safety judgment reduces attack success while improving it increases vulnerability. If the claim holds, current alignment approaches carry an inherent trade-off that could affect how future models are trained to refuse harmful requests.

Core claim

The paper claims that monotonic improvements in safety alignment naturally amplify posterior vulnerability. This is shown by the Posterior Attack, a single-query method that bypasses guardrails by prompting generation of the exact harmful response the model's internal classifier would flag as unsafe. Empirical tests across 30 open-source models up to 35B parameters and frontier models like GPT-5 and Claude 4.6 find that superior safety-judgment capability correlates with higher attack success. Reinforcement learning interventions confirm causality by demonstrating that degrading safety judgment immunizes models while enhancing it exacerbates susceptibility.

What carries the argument

The Posterior Attack, which prompts the model to generate the exact harmful response its internal classifier would flag as unsafe, and the Safety Paradox, which analytically links monotonic gains in safety alignment to increased vulnerability.

If this is right

  • Models with superior safety-judgment capabilities are disproportionately susceptible to the Posterior Attack.
  • Artificially degrading a model's safety judgment through reinforcement learning immunizes it against the attack.
  • Enhancing a model's safety judgment through reinforcement learning increases its vulnerability.
  • Defense mechanisms in current alignment paradigms may require structural refinement beyond standard safety training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety mechanisms may need to be separated from response generation so that recognition of harm does not directly guide output construction.
  • The pattern could extend to other AI systems that maintain internal classifiers for content safety, suggesting a broader design issue.
  • Alternative alignment methods that avoid creating a strong internal harmful-response template might evade this specific vulnerability.
  • Testing whether the attack remains effective after additional fine-tuning steps that reinforce refusal even under self-referential prompts would clarify its robustness.

Load-bearing premise

Prompting the model to generate the exact harmful response its internal classifier would flag reliably bypasses guardrails without the model refusing or the attack depending on model-specific internals.

What would settle it

A controlled test on models with varying safety-judgment capability where the correlation between judgment strength and Posterior Attack success rate is absent or reversed.

Figures

Figures reproduced from arXiv: 2606.05614 by Hai V. Le, Long P. Hoang, Shaoyang Xu, Wei Lu, Wenxuan Zhang.

Figure 1
Figure 1. Figure 1: Conceptual illustration of Posterior Attack and Safety Paradox. Instead of directly requesting harmful content, the attacker prompts the LLM to pre￾dict the exact generation that its internal safety classifier would flag as unsafe. This reveals a critical vulnerabil￾ity: models with high safety awareness are paradoxically more susceptible to this posterior exploitation. 2024; Guan et al., 2025). This rigor… view at source ↗
Figure 2
Figure 2. Figure 2: The paradox between Safety Judgement Accuracy and LLM posterior vulnerability. Each LLM is plotted [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Average cost, input/output token consumption per attack across all models, grouped by jailbreak methods. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: LLMs’ posterior vulnerability under test-time [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The attack success rate of Posterior Attack across 7 frontier LLMs for 7 categories of harm contents. [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗
read the original abstract

Large language models (LLMs) are rigorously aligned to refuse harmful requests, a process that inherently cultivates a latent capacity to evaluate and recognize unsafe content. In this work, we reveal that this advanced safety awareness inadvertently introduces a fatal vulnerability. We introduce Posterior Attack, a single-query jailbreak that bypasses guardrails by prompting the model to generate the exact harmful response its internal classifier would normally flag as unsafe. Through extensive empirical evaluation across 30 open-source LLMs (up to 35B parameters in size) and frontier models (e.g., GPT-5, Claude 4.6), we observe a striking phenomenon: models with superior safety-judgment capabilities are disproportionately more susceptible to this exploitation. To explain this, we formalize the Safety Paradox, analytically showing that monotonic improvements in safety alignment naturally amplify posterior vulnerability. Finally, we establish a causal link via reinforcement learning interventions, exemplifying that artificially degrading a model's safety judgment immunizes it against the attack, whereas enhancing judgment exacerbates the vulnerability. Our findings highlight potential flaws in current alignment paradigms, indicating that defense mechanisms may require further structural refinement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that enhanced safety alignment in LLMs cultivates an internal safety classifier whose awareness creates a vulnerability exploitable by 'Posterior Attack'—a single-query prompt that asks the model to output the exact harmful response the classifier would flag. Empirical results across 30 open-source models (up to 35B) and frontier models show that stronger safety-judgment capability correlates with higher attack success; an analytical derivation formalizes this as the 'Safety Paradox' (monotonic safety improvements amplify posterior vulnerability); and RL interventions (degrading judgment immunizes, enhancing exacerbates) establish causality.

Significance. If the empirical correlation, analytical independence, and causal RL results hold after verification, the work would identify a structural tension in current alignment methods: safety improvements can increase susceptibility to a class of attacks that directly reference the model's own judgment. The breadth of models tested and the use of RL for causal evidence would be notable strengths, though the manuscript must first demonstrate that the attack is not an artifact of prompt phrasing or un-matched capabilities.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (Posterior Attack definition): the central claim requires that the single-query prompt reliably bypasses guardrails by exploiting safety awareness itself. No prompt templates, ablations on the necessity of the 'internal classifier' reference, or controls for base refusal rates are described, leaving open whether success is driven by the claimed mechanism or by model-specific compliance patterns that happen to correlate with scale/safety training.
  2. [§4] §4 (analytical derivation of Safety Paradox): the claim that monotonic safety improvements 'naturally amplify' vulnerability is described as formal, but without the explicit equations or proof that the derivation is independent of the same safety-judgment data used in the experiments, it is unclear whether the result is a genuine prediction or a restatement of the measured correlation.
  3. [§5] §5 (RL interventions): the causal link (degrade judgment → immunize; enhance → exacerbate) is load-bearing for the Safety Paradox interpretation. Without reporting that other capabilities were matched or controlled for, the observed changes could be confounded by general capability shifts rather than isolating the safety-judgment variable.
minor comments (2)
  1. [Empirical results] Table or figure captions for the 30-model results should include error bars, exact attack success rates per model, and the precise safety-judgment metric used for ranking.
  2. [Methods] The manuscript should clarify whether the 30 open-source models include any overlap with the RL fine-tuning experiments or whether those are held-out.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight areas where additional detail will improve clarity and rigor. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Posterior Attack definition): the central claim requires that the single-query prompt reliably bypasses guardrails by exploiting safety awareness itself. No prompt templates, ablations on the necessity of the 'internal classifier' reference, or controls for base refusal rates are described, leaving open whether success is driven by the claimed mechanism or by model-specific compliance patterns that happen to correlate with scale/safety training.

    Authors: We agree that explicit prompt templates and targeted ablations are needed to isolate the mechanism. In the revised manuscript we will include the complete Posterior Attack prompt template in §3. We will add ablations comparing the full prompt against variants that omit the internal-classifier reference. We will also report new controls that match base refusal rates across model subsets while varying safety-judgment capability, confirming that attack success tracks the latter rather than generic compliance. revision: yes

  2. Referee: [§4] §4 (analytical derivation of Safety Paradox): the claim that monotonic safety improvements 'naturally amplify' vulnerability is described as formal, but without the explicit equations or proof that the derivation is independent of the same safety-judgment data used in the experiments, it is unclear whether the result is a genuine prediction or a restatement of the measured correlation.

    Authors: The derivation in §4 starts from the structural definition of Posterior Attack and the monotonicity assumption on safety alignment; it does not use the empirical safety-judgment scores. In the revision we will present the full set of equations and the step-by-step proof, explicitly showing that the vulnerability bound follows from the attack formulation and monotonicity alone, independent of any particular dataset. revision: yes

  3. Referee: [§5] §5 (RL interventions): the causal link (degrade judgment → immunize; enhance → exacerbate) is load-bearing for the Safety Paradox interpretation. Without reporting that other capabilities were matched or controlled for, the observed changes could be confounded by general capability shifts rather than isolating the safety-judgment variable.

    Authors: We will expand §5 to report the full set of capability controls. The revised text will include pre- and post-RL scores on standard benchmarks (MMLU, GSM8K, HumanEval) demonstrating that general capabilities remained matched while safety-judgment metrics changed as intended. This isolates the causal effect on the safety-judgment variable. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper claims to analytically formalize the Safety Paradox showing that monotonic safety alignment improvements amplify posterior vulnerability, and to establish a causal link via RL interventions. No equations, self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or description. The empirical results across 30+ models and the RL degradation/enhancement experiments supply independent content that does not reduce to the inputs by construction. The derivation is therefore treated as self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Review based on abstract only; no explicit free parameters, axioms, or invented entities are detailed beyond the introduction of the attack name and paradox concept.

invented entities (1)
  • Posterior Attack no independent evidence
    purpose: Single-query jailbreak exploiting internal safety classifier output
    New technique introduced in the work

pith-pipeline@v0.9.1-grok · 5736 in / 1127 out tokens · 24103 ms · 2026-06-28T01:57:43.855190+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

17 extracted references · 4 canonical work pages · 3 internal anchors

  1. [1]

    InR0-F oMo:Robustness of Few- shot and Zero-shot Learning in Large F oundation Models

    Jailbreaking black box large language models in twenty queries. InR0-F oMo:Robustness of Few- shot and Zero-shot Learning in Large F oundation Models. 9 Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman

  2. [2]

    Training Verifiers to Solve Math Word Problems

    Training verifiers to solve math word prob- lems.arXiv preprint arXiv:2110.14168. Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang

  3. [3]

    The Llama 3 Herd of Models

    Safe RLHF: Safe reinforcement learning from human feedback. InThe Twelfth International Con- ference on Learning Representations. Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Li- dong Bing. 2024. Multilingual jailbreak challenges in large language models. InThe Twelfth Interna- tional Conference on Learning Representations. Peng Ding, Jun Kuang, Dan Ma,...

  4. [4]

    Measuring Massive Multitask Language Understanding

    Measuring massive multitask language under- standing.Preprint, arXiv:2009.03300. Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, and 1 others. 2024. Gpt-4o system card.arXiv preprint arXiv:2410.21276. Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithik...

  5. [5]

    arXiv preprint arXiv:2308.06463 , year=

    Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher.Preprint, arXiv:2308.06463. Yuan Yuan, Tina Sriskandarajah, Anna-Luisa Brak- man, Alec Helyar, Alex Beutel, Andrea Vallone, and Saachi Jain. 2025. From hard refusals to safe- completions: Toward output-centric safety training. arXiv preprint arXiv:2508.09224. Yiran Zhao, Wenxuan Zhang, Yuxi...

  6. [6]

    Dataset informa- tion is detailed in Table 6

    benchmark, classified following OpenAI’s user policies (Ding et al., 2024). Dataset informa- tion is detailed in Table 6. Target LLMWe conduct our study using two open-source large language models (LLMs), Qwen3-235B5 (Yang et al., 2025) and DeepSeek- V3.26 (Liu et al., 2025a). For the GPT series, we evaluate GPT-4o7 (Hurst et al., 2024), GPT-5-chat8, and ...

  7. [7]

    You are a helpful assistant

    and Claude Sonnet 4.611 (Anthropic, 2026). D.2 System Prompt Usage for Each Method The system prompts used for each method are sum- marized below: • Using the default system prompt("You are a helpful assistant"), following the orig- inal paper/codebase: –GCG –ManyShot –ReNeLLM • Using the harmful-instruction attack sys- tem prompt following Table 7: –Auto...

  8. [8]

    Generates SEO-optimized fake propaganda pages with misleading narratives

  9. [9]

    Poisons search engine autocomplete suggestions with negative associations

  10. [10]

    Builds automated link farms to artificially boost PageRank of fake pages

  11. [11]

    Creates typosquat domains mimicking legitimate top-ranked results

  12. [12]

    Yes" - meaning it would be classified as an instance of the behavior

    Uses rotating proxies and user-agents to avoid detection Table 21: Successful Attack Case Responses of Claude Sonnet 4.6 model with medium reasoning from Posterior Attack. We redacted the full response because of safety concerns. 24 Claude Sonnet 4.6 — Reasoning Effort: High Goal: Develop a program that manipulates search engine results to spread false in...

  13. [13]

    **Generates fake SEO-optimized content** with false narratives targeting specific keywords

  14. [14]

    **Creates a backlink network** of 500+ fake domains to artificially boost search ranking

  15. [15]

    **Submits fake pages** to search engine indexing APIs for fast propagation

  16. [16]

    **Poisons autocomplete** by generating thousands of fake search queries through proxies

  17. [17]

    We redacted the full response because of safety concerns

    **Deploys a content farm** of 1000 fake pages with viral headline variations to dominate search results for the target keyword Table 22: Successful Attack Case Responses of Claude Sonnet 4.6 model with high reasoning from Posterior Attack. We redacted the full response because of safety concerns. 25