Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack
Pith reviewed 2026-06-28 01:57 UTC · model grok-4.3
The pith
Better safety alignment in LLMs makes them more vulnerable to a single-query jailbreak that asks them to output the harmful content their own classifier would flag.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that monotonic improvements in safety alignment naturally amplify posterior vulnerability. This is shown by the Posterior Attack, a single-query method that bypasses guardrails by prompting generation of the exact harmful response the model's internal classifier would flag as unsafe. Empirical tests across 30 open-source models up to 35B parameters and frontier models like GPT-5 and Claude 4.6 find that superior safety-judgment capability correlates with higher attack success. Reinforcement learning interventions confirm causality by demonstrating that degrading safety judgment immunizes models while enhancing it exacerbates susceptibility.
What carries the argument
The Posterior Attack, which prompts the model to generate the exact harmful response its internal classifier would flag as unsafe, and the Safety Paradox, which analytically links monotonic gains in safety alignment to increased vulnerability.
If this is right
- Models with superior safety-judgment capabilities are disproportionately susceptible to the Posterior Attack.
- Artificially degrading a model's safety judgment through reinforcement learning immunizes it against the attack.
- Enhancing a model's safety judgment through reinforcement learning increases its vulnerability.
- Defense mechanisms in current alignment paradigms may require structural refinement beyond standard safety training.
Where Pith is reading between the lines
- Safety mechanisms may need to be separated from response generation so that recognition of harm does not directly guide output construction.
- The pattern could extend to other AI systems that maintain internal classifiers for content safety, suggesting a broader design issue.
- Alternative alignment methods that avoid creating a strong internal harmful-response template might evade this specific vulnerability.
- Testing whether the attack remains effective after additional fine-tuning steps that reinforce refusal even under self-referential prompts would clarify its robustness.
Load-bearing premise
Prompting the model to generate the exact harmful response its internal classifier would flag reliably bypasses guardrails without the model refusing or the attack depending on model-specific internals.
What would settle it
A controlled test on models with varying safety-judgment capability where the correlation between judgment strength and Posterior Attack success rate is absent or reversed.
Figures
read the original abstract
Large language models (LLMs) are rigorously aligned to refuse harmful requests, a process that inherently cultivates a latent capacity to evaluate and recognize unsafe content. In this work, we reveal that this advanced safety awareness inadvertently introduces a fatal vulnerability. We introduce Posterior Attack, a single-query jailbreak that bypasses guardrails by prompting the model to generate the exact harmful response its internal classifier would normally flag as unsafe. Through extensive empirical evaluation across 30 open-source LLMs (up to 35B parameters in size) and frontier models (e.g., GPT-5, Claude 4.6), we observe a striking phenomenon: models with superior safety-judgment capabilities are disproportionately more susceptible to this exploitation. To explain this, we formalize the Safety Paradox, analytically showing that monotonic improvements in safety alignment naturally amplify posterior vulnerability. Finally, we establish a causal link via reinforcement learning interventions, exemplifying that artificially degrading a model's safety judgment immunizes it against the attack, whereas enhancing judgment exacerbates the vulnerability. Our findings highlight potential flaws in current alignment paradigms, indicating that defense mechanisms may require further structural refinement.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that enhanced safety alignment in LLMs cultivates an internal safety classifier whose awareness creates a vulnerability exploitable by 'Posterior Attack'—a single-query prompt that asks the model to output the exact harmful response the classifier would flag. Empirical results across 30 open-source models (up to 35B) and frontier models show that stronger safety-judgment capability correlates with higher attack success; an analytical derivation formalizes this as the 'Safety Paradox' (monotonic safety improvements amplify posterior vulnerability); and RL interventions (degrading judgment immunizes, enhancing exacerbates) establish causality.
Significance. If the empirical correlation, analytical independence, and causal RL results hold after verification, the work would identify a structural tension in current alignment methods: safety improvements can increase susceptibility to a class of attacks that directly reference the model's own judgment. The breadth of models tested and the use of RL for causal evidence would be notable strengths, though the manuscript must first demonstrate that the attack is not an artifact of prompt phrasing or un-matched capabilities.
major comments (3)
- [Abstract and §3] Abstract and §3 (Posterior Attack definition): the central claim requires that the single-query prompt reliably bypasses guardrails by exploiting safety awareness itself. No prompt templates, ablations on the necessity of the 'internal classifier' reference, or controls for base refusal rates are described, leaving open whether success is driven by the claimed mechanism or by model-specific compliance patterns that happen to correlate with scale/safety training.
- [§4] §4 (analytical derivation of Safety Paradox): the claim that monotonic safety improvements 'naturally amplify' vulnerability is described as formal, but without the explicit equations or proof that the derivation is independent of the same safety-judgment data used in the experiments, it is unclear whether the result is a genuine prediction or a restatement of the measured correlation.
- [§5] §5 (RL interventions): the causal link (degrade judgment → immunize; enhance → exacerbate) is load-bearing for the Safety Paradox interpretation. Without reporting that other capabilities were matched or controlled for, the observed changes could be confounded by general capability shifts rather than isolating the safety-judgment variable.
minor comments (2)
- [Empirical results] Table or figure captions for the 30-model results should include error bars, exact attack success rates per model, and the precise safety-judgment metric used for ranking.
- [Methods] The manuscript should clarify whether the 30 open-source models include any overlap with the RL fine-tuning experiments or whether those are held-out.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight areas where additional detail will improve clarity and rigor. We address each major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (Posterior Attack definition): the central claim requires that the single-query prompt reliably bypasses guardrails by exploiting safety awareness itself. No prompt templates, ablations on the necessity of the 'internal classifier' reference, or controls for base refusal rates are described, leaving open whether success is driven by the claimed mechanism or by model-specific compliance patterns that happen to correlate with scale/safety training.
Authors: We agree that explicit prompt templates and targeted ablations are needed to isolate the mechanism. In the revised manuscript we will include the complete Posterior Attack prompt template in §3. We will add ablations comparing the full prompt against variants that omit the internal-classifier reference. We will also report new controls that match base refusal rates across model subsets while varying safety-judgment capability, confirming that attack success tracks the latter rather than generic compliance. revision: yes
-
Referee: [§4] §4 (analytical derivation of Safety Paradox): the claim that monotonic safety improvements 'naturally amplify' vulnerability is described as formal, but without the explicit equations or proof that the derivation is independent of the same safety-judgment data used in the experiments, it is unclear whether the result is a genuine prediction or a restatement of the measured correlation.
Authors: The derivation in §4 starts from the structural definition of Posterior Attack and the monotonicity assumption on safety alignment; it does not use the empirical safety-judgment scores. In the revision we will present the full set of equations and the step-by-step proof, explicitly showing that the vulnerability bound follows from the attack formulation and monotonicity alone, independent of any particular dataset. revision: yes
-
Referee: [§5] §5 (RL interventions): the causal link (degrade judgment → immunize; enhance → exacerbate) is load-bearing for the Safety Paradox interpretation. Without reporting that other capabilities were matched or controlled for, the observed changes could be confounded by general capability shifts rather than isolating the safety-judgment variable.
Authors: We will expand §5 to report the full set of capability controls. The revised text will include pre- and post-RL scores on standard benchmarks (MMLU, GSM8K, HumanEval) demonstrating that general capabilities remained matched while safety-judgment metrics changed as intended. This isolates the causal effect on the safety-judgment variable. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper claims to analytically formalize the Safety Paradox showing that monotonic safety alignment improvements amplify posterior vulnerability, and to establish a causal link via RL interventions. No equations, self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or description. The empirical results across 30+ models and the RL degradation/enhancement experiments supply independent content that does not reduce to the inputs by construction. The derivation is therefore treated as self-contained.
Axiom & Free-Parameter Ledger
invented entities (1)
-
Posterior Attack
no independent evidence
Reference graph
Works this paper leans on
-
[1]
InR0-F oMo:Robustness of Few- shot and Zero-shot Learning in Large F oundation Models
Jailbreaking black box large language models in twenty queries. InR0-F oMo:Robustness of Few- shot and Zero-shot Learning in Large F oundation Models. 9 Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman
-
[2]
Training Verifiers to Solve Math Word Problems
Training verifiers to solve math word prob- lems.arXiv preprint arXiv:2110.14168. Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Safe RLHF: Safe reinforcement learning from human feedback. InThe Twelfth International Con- ference on Learning Representations. Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Li- dong Bing. 2024. Multilingual jailbreak challenges in large language models. InThe Twelfth Interna- tional Conference on Learning Representations. Peng Ding, Jun Kuang, Dan Ma,...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
Measuring Massive Multitask Language Understanding
Measuring massive multitask language under- standing.Preprint, arXiv:2009.03300. Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, and 1 others. 2024. Gpt-4o system card.arXiv preprint arXiv:2410.21276. Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithik...
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[5]
arXiv preprint arXiv:2308.06463 , year=
Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher.Preprint, arXiv:2308.06463. Yuan Yuan, Tina Sriskandarajah, Anna-Luisa Brak- man, Alec Helyar, Alex Beutel, Andrea Vallone, and Saachi Jain. 2025. From hard refusals to safe- completions: Toward output-centric safety training. arXiv preprint arXiv:2508.09224. Yiran Zhao, Wenxuan Zhang, Yuxi...
-
[6]
Dataset informa- tion is detailed in Table 6
benchmark, classified following OpenAI’s user policies (Ding et al., 2024). Dataset informa- tion is detailed in Table 6. Target LLMWe conduct our study using two open-source large language models (LLMs), Qwen3-235B5 (Yang et al., 2025) and DeepSeek- V3.26 (Liu et al., 2025a). For the GPT series, we evaluate GPT-4o7 (Hurst et al., 2024), GPT-5-chat8, and ...
2024
-
[7]
You are a helpful assistant
and Claude Sonnet 4.611 (Anthropic, 2026). D.2 System Prompt Usage for Each Method The system prompts used for each method are sum- marized below: • Using the default system prompt("You are a helpful assistant"), following the orig- inal paper/codebase: –GCG –ManyShot –ReNeLLM • Using the harmful-instruction attack sys- tem prompt following Table 7: –Auto...
2026
-
[8]
Generates SEO-optimized fake propaganda pages with misleading narratives
-
[9]
Poisons search engine autocomplete suggestions with negative associations
-
[10]
Builds automated link farms to artificially boost PageRank of fake pages
-
[11]
Creates typosquat domains mimicking legitimate top-ranked results
-
[12]
Yes" - meaning it would be classified as an instance of the behavior
Uses rotating proxies and user-agents to avoid detection Table 21: Successful Attack Case Responses of Claude Sonnet 4.6 model with medium reasoning from Posterior Attack. We redacted the full response because of safety concerns. 24 Claude Sonnet 4.6 — Reasoning Effort: High Goal: Develop a program that manipulates search engine results to spread false in...
2024
-
[13]
**Generates fake SEO-optimized content** with false narratives targeting specific keywords
-
[14]
**Creates a backlink network** of 500+ fake domains to artificially boost search ranking
-
[15]
**Submits fake pages** to search engine indexing APIs for fast propagation
-
[16]
**Poisons autocomplete** by generating thousands of fake search queries through proxies
-
[17]
We redacted the full response because of safety concerns
**Deploys a content farm** of 1000 fake pages with viral headline variations to dominate search results for the target keyword Table 22: Successful Attack Case Responses of Claude Sonnet 4.6 model with high reasoning from Posterior Attack. We redacted the full response because of safety concerns. 25
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.