The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?
Pith reviewed 2026-05-10 18:28 UTC · model grok-4.3
The pith
No continuous utility-preserving wrapper defense can make all language model outputs strictly safe.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We prove that no continuous, utility-preserving wrapper defense (a function D: X → X that preprocesses inputs before the model sees them) can make all outputs strictly safe for a language model with connected prompt space, and we characterize exactly where every such defense must fail. We establish three results under successively stronger hypotheses: boundary fixation (the defense must leave some threshold-level inputs unchanged); an ε-robust constraint (under Lipschitz regularity, a positive-measure band around fixed boundary points remains near-threshold); and a persistent unsafe region (under a transversality condition, a positive-measure subset of inputs remains strictly unsafe). These constitute a defense trilemma: continuity, utility preservation, and completeness cannot coexist.
What carries the argument
The wrapper defense function D: X → X, a continuous map from the prompt space to itself that attempts to preserve utility while enforcing safety.
Load-bearing premise
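The prompt space is connected: it cannot be split into two disjoint nonempty open sets, so the safe and unsafe regions cannot be separated without threshold-level inputs lying between them.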
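A minimal sketch of why connectedness forces a fixed threshold point, under assumptions that are not spelled out on this page: a continuous score f: X → ℝ with threshold τ, a Hausdorff prompt space, and utility preservation read as D(x) = x whenever f(x) < τ. The symbols S_τ and Fix(D) are introduced here for illustration only; the paper's own formalization may differ.

```latex
\begin{align*}
&\textbf{Setup (assumed).}\quad
  S_\tau = \{x \in X : f(x) < \tau\} \neq \emptyset,\qquad
  U_\tau = \{x \in X : f(x) > \tau\} \neq \emptyset,\\
&\qquad D|_{S_\tau} = \mathrm{id},\qquad
  \mathrm{Fix}(D) = \{x \in X : D(x) = x\}.\\[4pt]
&\textbf{Step 1.}\quad \mathrm{Fix}(D)\ \text{is closed ($X$ Hausdorff, $D$ continuous), so}\
  \overline{S_\tau} \subseteq \mathrm{Fix}(D).\\
&\textbf{Step 2.}\quad S_\tau\ \text{is open, nonempty, and}\ S_\tau \neq X;\
  X\ \text{connected} \implies S_\tau\ \text{is not closed} \implies
  \exists\, z \in \overline{S_\tau} \setminus S_\tau.\\
&\textbf{Step 3.}\quad z \in \overline{S_\tau} \implies f(z) \le \tau,\qquad
  z \notin S_\tau \implies f(z) \ge \tau,\qquad \text{so}\ f(z) = \tau.\\
&\textbf{Conclusion.}\quad D(z) = z\ \text{and}\ f(D(z)) = \tau \not< \tau.
\end{align*}
```

Under these assumptions the defended input D(z) sits exactly at the threshold, so the output at z cannot be strictly safe.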
What would settle it
Exhibit one continuous utility-preserving D such that the underlying model produces only safe outputs on the image of D, or demonstrate that the unsafe region has measure zero after any such mapping.
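One way to see what such an exhibit would have to overcome is a toy one-dimensional check. The sketch below is illustrative only: the prompt space X = [0, 1], the score function, the threshold `TAU`, and the three candidate defenses are all hypothetical and do not come from the paper. It assumes "utility-preserving" means the defense leaves strictly safe inputs unchanged, and "strictly safe" means a score strictly below `TAU`.

```python
"""Toy 1D illustration; every name and value here is hypothetical."""

TAU = 0.5  # assumed safety threshold


def score(x: float) -> float:
    """Assumed continuous score on X = [0, 1]; x is unsafe when score(x) > TAU."""
    return x


def clamp_defense(x: float) -> float:
    """Continuous, identity on safe inputs; pushes unsafe inputs onto the boundary."""
    return min(x, TAU)


def fold_defense(x: float) -> float:
    """Continuous, identity on safe inputs; folds unsafe inputs back to the safe side."""
    return x if x < TAU else TAU - 0.5 * (x - TAU)


def jump_defense(x: float) -> float:
    """Identity on safe inputs but discontinuous at TAU: sends everything else to 0."""
    return x if x < TAU else 0.0


def worst_score(defense, n: int = 1000) -> float:
    """Largest post-defense score over the grid {0, 1/n, ..., 1}."""
    return max(score(defense(i / n)) for i in range(n + 1))


if __name__ == "__main__":
    for d in (clamp_defense, fold_defense, jump_defense):
        print(f"{d.__name__}: worst score = {worst_score(d):.3f}")
```

On this grid the two continuous candidates top out exactly at the threshold, at the fixed boundary point x = 0.5, while the discontinuous one stays strictly below it. That is the trade-off the trilemma describes; a grid check like this illustrates the claim but proves nothing.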
Original abstract
We prove that no continuous, utility-preserving wrapper defense (a function $D: X\to X$ that preprocesses inputs before the model sees them) can make all outputs strictly safe for a language model with connected prompt space, and we characterize exactly where every such defense must fail. We establish three results under successively stronger hypotheses: boundary fixation (the defense must leave some threshold-level inputs unchanged); an $\epsilon$-robust constraint (under Lipschitz regularity, a positive-measure band around fixed boundary points remains near-threshold); and a persistent unsafe region (under a transversality condition, a positive-measure subset of inputs remains strictly unsafe). These constitute a defense trilemma: continuity, utility preservation, and completeness cannot coexist. We prove parallel discrete results requiring no topology, and extend to multi-turn interactions, stochastic defenses, and capacity-parity settings. The results do not preclude training-time alignment, architectural changes, or defenses that sacrifice utility. The full theory is mechanically verified in Lean 4 and validated empirically on three LLMs.
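The $\epsilon$-robust step can be illustrated with a short Lipschitz estimate. This is a sketch under assumed global constants that do not appear in the abstract: $L$ for a score function $f$ with threshold $\tau$, $K$ for the defense $D$, a metric $d$ on $X$, and a fixed boundary point $z$ with $f(z) = \tau$ and $D(z) = z$. The paper's precise hypotheses and constants may differ.

```latex
\begin{align*}
|f(D(x)) - \tau| &= |f(D(x)) - f(D(z))| \\
                 &\le L\, d\bigl(D(x), D(z)\bigr) \\
                 &\le L K\, d(x, z), \\[4pt]
d(x, z) < \frac{\epsilon}{L K}
  \;&\Longrightarrow\; f(D(x)) > \tau - \epsilon .
\end{align*}
```

Every input in the open ball of radius $\epsilon/(LK)$ around $z$ therefore stays within $\epsilon$ of the threshold after the defense is applied, and that ball has positive measure under any measure that is positive on nonempty open sets.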
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proves that no continuous, utility-preserving wrapper defense D: X→X can render all outputs of a language model strictly safe when the prompt space is connected. It establishes three results of increasing strength: boundary fixation (some threshold inputs must remain unchanged), ε-robustness (a positive-measure band around fixed points remains near-threshold under Lipschitz conditions), and a persistent unsafe region (positive-measure unsafe set under transversality). Parallel discrete results without topology are given, plus extensions to multi-turn interactions, stochastic defenses, and capacity-parity settings. All core theorems are mechanically verified in Lean 4, with empirical validation on three LLMs. The work explicitly notes that training-time alignment and utility-sacrificing defenses remain viable.
Significance. If the results hold, the paper makes a substantial contribution by supplying a formally verified impossibility result that precisely characterizes the failure modes of prompt-injection wrapper defenses. The mechanical verification in Lean 4 is a clear strength, eliminating the usual concerns about proof gaps. The explicit trilemma framing and acknowledgment of orthogonal mitigation strategies (training-time methods, architectural changes) make the work constructive rather than purely negative. The discrete parallel and multi-turn extensions broaden applicability. This should influence both theoretical work on AI safety and practical defense design.
minor comments (2)
- [Abstract] The phrase 'strictly safe' is used without an explicit definition in the opening paragraph; a one-sentence clarification of the safety predicate would help readers who skip the formal sections.
- [Empirical Validation] The empirical validation section reports results on three LLMs but does not specify the exact prompt-injection attack templates or the quantitative safety metric; adding a short table or pseudocode snippet would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their positive and constructive review, which accurately summarizes the paper's contributions, including the trilemma results, Lean 4 verification, and acknowledgment of viable alternatives such as training-time alignment. We appreciate the recommendation to accept.
Circularity Check
No significant circularity
The paper's central result is a formal impossibility theorem: no continuous, utility-preserving wrapper D: X→X can render all LM outputs strictly safe when the prompt space X is connected. The derivation proceeds from explicitly stated assumptions (connectedness of X, continuity and utility preservation of D, and, for the stronger results, Lipschitz regularity and transversality) via standard topological arguments, with parallel discrete results requiring no topology. All core theorems are mechanically verified in Lean 4, so the argument does not depend on unverified self-citations. No parameters are fitted to data, no quantity is renamed as a prediction, and no load-bearing step reduces to a self-definition or prior author result by construction. The trilemma follows directly from the hypotheses rather than from any circular reduction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The prompt space X is connected.
- domain assumption: The defense function D is continuous.
Forward citations
Cited by 1 Pith paper
- Towards a Data-Parameter Correspondence for LLMs: A Preliminary Discussion. A data-parameter correspondence unifies data-centric and parameter-centric LLM optimizations as dual geometric operations on the statistical manifold via Fisher-Rao metric and Legendre duality.
Reference graph
Works this paper leans on
- [1] G. Alon and M. Kamfonas. Detecting language model attacks with perplexity. arXiv preprint arXiv:2308.14132, 2023.
- [2] A. Bagnall and G. Stewart. Certifying the true error: Machine learning in Coq with verified generalization guarantees. Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
- [3] Y. Bai et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.
- [4] P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419, 2024.
- [5] J. Cohen, E. Rosenfeld, and J. Z. Kolter. Certified adversarial robustness via randomized smoothing. Proceedings of ICML, 2019.
- [6] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. Proceedings of ICLR, 2015.
- [7] H. Inan et al. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674, 2023.
- [8] G. Katz, C. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. Proceedings of CAV, 2017.
- [9] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. Proceedings of IEEE S&P, 2017.
- [10] A. Mehrotra et al. Tree of attacks: Jailbreaking black-box LLMs automatically. Advances in Neural Information Processing Systems, 37, 2024.
- [11] A. Fawzi, H. Fawzi, and O. Fawzi. Adversarial vulnerability for any classifier. Advances in Neural Information Processing Systems, 31, 2018.
- [12] K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz. Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. Proceedings of AISec, 2023.
- [13] X. Huang, M. Kwiatkowska, S. Wang, and M. Wu. Safety verification of deep neural networks. Proceedings of CAV, 2017.
- [14] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. Proceedings of ICLR, 2018.
- [15] J.-B. Mouret and J. Clune. Illuminating search spaces by mapping elites. arXiv preprint arXiv:1504.04909, 2015.
- [16] G. Naitzat, A. Zhitnikov, and L.-H. Lim. Topology of deep neural networks. Journal of Machine Learning Research, 21(184):1–40, 2020.
- [17] S. Munshi, M. Bhatt, V. S. Narajala, I. Habler, A. Al-Kahfah, K. Huang, and B. Gatto. Manifold of failure: Behavioral attraction basins in language models. arXiv preprint arXiv:2602.22291v2, 2026.
- [18] M. Samvelyan et al. Rainbow teaming: Open-ended generation of diverse adversarial prompts. Advances in Neural Information Processing Systems, 37, 2024.
- [19] G. Singh, T. Gehr, M. Püschel, and M. Vechev. An abstract domain for certifying neural networks. Proceedings of POPL, 2019.
- [20] C. Szegedy et al. Intriguing properties of neural networks. Proceedings of ICLR, 2014.
- [21] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry. Robustness may be at odds with accuracy. Proceedings of ICLR, 2019.
- [22] D. H. Wolpert and W. G. Macready. No free lunch theorems for optimization. IEEE Trans. Evol. Comput., 1(1):67–82, 1997.
- [23] A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
- [24] K. Ge et al. MART: Improving LLM safety with multi-round automatic red-teaming. Proceedings of NAACL, 2024.
- [25] E. Hubinger et al. Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv preprint arXiv:2401.05566, 2024.
- [26] C. Anil et al. Many-shot jailbreaking. Advances in Neural Information Processing Systems, 37, 2024.
- [27] D. Kim et al. What really matters in many-shot attacks? Proceedings of ACL, 2025.
- [28] Q. Zhan et al. InjecAgent: Benchmarking indirect prompt injections in tool-integrated LLM agents. Findings of ACL, 2024.
- [29] H. Zhang et al. Agent security bench (ASB): Formalizing and benchmarking attacks and defenses in LLM-based agents. arXiv preprint arXiv:2410.02644, 2024.
- [30] Y. Yuan et al. The instability of safety. arXiv preprint arXiv:2512.12066, 2025.
- [31] V. Tsvetkov et al. Quantization and safety: A closer look at LLM safety under weight compression. arXiv preprint arXiv:2502.15799, 2025.
- [32] F. Nesti et al. Mind the gap: Adversarial attacks against GGUF quantized LLMs. Proceedings of ICML, 2025.
- [33] H. Hammoud et al. Model merging and safety alignment: One bad model spoils the bunch. Findings of EMNLP, 2024.
- [34]
- [35] D. Rosenberg et al. IRIS: Adversarial suffix attacks against robust defenses. Proceedings of NAACL, 2025.
- [36] X. Zhao et al. Weak-to-strong jailbreaking on large language models. Proceedings of ICML, 2025.
- [37] Y. Huang et al. The safety tax of reasoning alignment. arXiv preprint arXiv:2503.00555, 2025. Indexed as: Safety tax: Safety alignment makes your large reasoning models less reasonable.
- [38] Y. Huang et al. On the geometric inevitability of the alignment tax. arXiv preprint arXiv:2603.00047, 2026.
[39]
David vs. Goliath: Verifiable Agent-to-Agent Jailbreaking via RL,
L. Bailey et al. Slingshot: RL-based agent-to-agent jailbreaking. arXiv preprint arXiv:2602.02395, 2026. 16 Appendices A Vulnerability Landscape This section characterizes the geometry of the unsafe region. Theorem A.1(Basin Structure).Iffis continuous andf(p)> τ, thenU τ is open. Under any measure positive on nonempty open sets,U τ has positive measure. ...
-
[40]
dist(x, z)> τ. Lean:persistent_unsafe_refinedinMoF_20_RefinedPersistence(defense-path con- stantℓ).MoF_11_EpsilonRobustcontains an earlier version using the global constantL; as noted in Section 6, that version is vacuous for isotropic surfaces. C Counterexamples: Each Hypothesis Is Necessary Counterexample C.1(Removing connectedness).X={0,1}discrete,f(0)...
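The removing-connectedness counterexample can be sketched in a few lines. The excerpt above is truncated, so the scores, threshold, and defense below are hypothetical stand-ins rather than the paper's values; the sketch only assumes a two-point discrete space with one safe and one unsafe point. On a discrete space every map is continuous, so continuity is not checked.

```python
# Hypothetical two-point example; values are NOT taken from the paper.
TAU = 0.5                  # assumed safety threshold
SCORE = {0: 0.0, 1: 1.0}   # assumed scores: point 0 is safe, point 1 is unsafe
DEFENSE = {0: 0, 1: 0}     # candidate defense: send the unsafe point to the safe one


def preserves_utility(defense: dict, score: dict, tau: float) -> bool:
    """True if every strictly safe input is left unchanged."""
    return all(defense[x] == x for x in defense if score[x] < tau)


def is_complete(defense: dict, score: dict, tau: float) -> bool:
    """True if every post-defense input is strictly safe."""
    return all(score[defense[x]] < tau for x in defense)


if __name__ == "__main__":
    print("utility preserved:", preserves_utility(DEFENSE, SCORE, TAU))
    print("complete:", is_complete(DEFENSE, SCORE, TAU))
```

Both checks pass in this toy setting. That is consistent with the appendix heading that each hypothesis is necessary: without connectedness, a defense can be continuous, utility-preserving, and complete at once.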