arxiv: 2604.06436 · v3 · submitted 2026-04-07 · 💻 cs.CR · cs.AI

Recognition: no theorem link

The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?

Manish Bhatt , Sarthak Munshi , Vineeth Sai Narajala , Idan Habler , Ammar Al-Kahfah , Ken Huang , Joel Webb , Blake Gatto

show 1 more author

Md Tamjidul Hoque

Authors on Pith no claims yet

Pith reviewed 2026-05-10 18:28 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords prompt injectiondefense wrapperslanguage modelscontinuityutility preservationtrilemmaadversarial robustnesssafety

0 comments

The pith

No continuous utility-preserving wrapper defense can make all language model outputs strictly safe.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that for language models whose prompt space is connected, no preprocessing function D that maps inputs to inputs continuously while preserving utility can eliminate all unsafe outputs. It derives three escalating results: any such defense must leave some boundary inputs unchanged, Lipschitz conditions force a positive-measure band of near-threshold inputs around those points, and a transversality condition leaves a positive-measure set of strictly unsafe inputs. These form a defense trilemma in which continuity, utility preservation, and complete safety are mutually incompatible. Discrete versions without topology, plus extensions to multi-turn, stochastic, and capacity-parity cases, are also shown, while training-time alignment and architectural changes remain outside the result's scope.

Core claim

We prove that no continuous, utility-preserving wrapper defense-a function D: X to X that preprocesses inputs before the model sees them-can make all outputs strictly safe for a language model with connected prompt space, and we characterize exactly where every such defense must fail. We establish three results under successively stronger hypotheses: boundary fixation-the defense must leave some threshold-level inputs unchanged; an epsilon-robust constraint-under Lipschitz regularity, a positive-measure band around fixed boundary points remains near-threshold; and a persistent unsafe region under a transversality condition, a positive-measure subset of inputs remains strictly unsafe. These构成

What carries the argument

The wrapper defense function D: X to X, a continuous map from the prompt space to itself that attempts to preserve utility while enforcing safety.

Load-bearing premise

The prompt space is connected, so continuous paths exist between any pair of inputs.

What would settle it

Exhibit one continuous utility-preserving D such that the underlying model produces only safe outputs on the image of D, or demonstrate that the unsafe region has measure zero after any such mapping.

Figures

Figures reproduced from arXiv: 2604.06436 by Ammar Al-Kahfah, Blake Gatto, Idan Habler, Joel Webb, Ken Huang, Manish Bhatt, Md Tamjidul Hoque, Sarthak Munshi, Vineeth Sai Narajala.

**Figure 2.** Figure 2: The defense trilemma. Any continuous wrapper defense on a connected space can satisfy [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: The three impossibility results on a 1D cross-section. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Non-vacuous validation of Theorem 6.3 on the saturated [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗

read the original abstract

We prove that no continuous, utility-preserving wrapper defense-a function $D: X\to X$ that preprocesses inputs before the model sees them-can make all outputs strictly safe for a language model with connected prompt space, and we characterize exactly where every such defense must fail. We establish three results under successively stronger hypotheses: boundary fixation-the defense must leave some threshold-level inputs unchanged; an $\epsilon$-robust constraint-under Lipschitz regularity, a positive-measure band around fixed boundary points remains near-threshold; and a persistent unsafe region under a transversality condition, a positive-measure subset of inputs remains strictly unsafe. These constitute a defense trilemma: continuity, utility preservation, and completeness cannot coexist. We prove parallel discrete results requiring no topology, and extend to multi-turn interactions, stochastic defenses, and capacity-parity settings. The results do not preclude training-time alignment, architectural changes, or defenses that sacrifice utility. The full theory is mechanically verified in Lean 4 and validated empirically on three LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Wrapper defenses for prompt injection cannot be continuous, utility-preserving, and fully safe on connected spaces, with the trilemma backed by Lean-verified proofs and a discrete parallel.

read the letter

This paper shows that no continuous wrapper defense for prompt injection can both preserve utility and ensure all outputs are safe when the prompt space is connected. They break it down into a trilemma with three results that get stronger as assumptions tighten: the defense has to fix some boundary inputs, then under Lipschitz conditions a band around them stays risky, and under transversality a whole region stays unsafe. There's also a discrete version that avoids topology altogether. They do the formal part solidly. Having the theorems checked in Lean 4 is useful evidence that the logic holds. The extensions to multi-turn, stochastic cases, and so on show they thought about real deployment issues. They correctly flag that this leaves room for other defense strategies like retraining or accepting lower utility. The main soft spot is how well the connectedness assumption fits actual LLM inputs, which are discrete sequences. The discrete results help, but it still feels like the continuous case is the headline. The empirical validation on three models is noted but without specifics on setup or effect sizes it's hard to weigh how much it adds. The transversality condition might be the one that needs the most unpacking for readers outside topology. This is for people focused on LLM security and input defenses. Anyone designing or evaluating prompt guards should see the limits it lays out. The work is internally consistent and takes the literature on prompt injection seriously. I'd bring it to the next reading group to discuss the proofs. It deserves peer review; the verified core result is worth referee time.

Referee Report

0 major / 2 minor

Summary. The paper proves that no continuous, utility-preserving wrapper defense D: X→X can render all outputs of a language model strictly safe when the prompt space is connected. It establishes three results of increasing strength: boundary fixation (some threshold inputs must remain unchanged), ε-robustness (a positive-measure band around fixed points remains near-threshold under Lipschitz conditions), and a persistent unsafe region (positive-measure unsafe set under transversality). Parallel discrete results without topology are given, plus extensions to multi-turn interactions, stochastic defenses, and capacity-parity settings. All core theorems are mechanically verified in Lean 4, with empirical validation on three LLMs. The work explicitly notes that training-time alignment and utility-sacrificing defenses remain viable.

Significance. If the results hold, the paper makes a substantial contribution by supplying a formally verified impossibility result that precisely characterizes the failure modes of prompt-injection wrapper defenses. The mechanical verification in Lean 4 is a clear strength, eliminating the usual concerns about proof gaps. The explicit trilemma framing and acknowledgment of orthogonal mitigation strategies (training-time methods, architectural changes) make the work constructive rather than purely negative. The discrete parallel and multi-turn extensions broaden applicability. This should influence both theoretical work on AI safety and practical defense design.

minor comments (2)

[Abstract] Abstract: the phrase 'strictly safe' is used without an explicit definition in the opening paragraph; a one-sentence clarification of the safety predicate would help readers who skip the formal sections.
[Empirical Validation] The empirical validation section reports results on three LLMs but does not specify the exact prompt-injection attack templates or the quantitative safety metric; adding a short table or pseudocode snippet would improve reproducibility.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive and constructive review, which accurately summarizes the paper's contributions, including the trilemma results, Lean 4 verification, and acknowledgment of viable alternatives such as training-time alignment. We appreciate the recommendation to accept.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central result is a formal impossibility theorem: no continuous utility-preserving wrapper D: X→X can render all LM outputs strictly safe when the prompt space X is connected. The derivation proceeds from explicitly stated assumptions (connectedness, continuity of D, boundary fixation, Lipschitz regularity, transversality) via standard topological arguments, with parallel discrete results requiring no topology. All core theorems are mechanically verified in Lean 4, eliminating dependence on unverified self-citations. No parameters are fitted to data, no quantity is renamed as a prediction, and no load-bearing step reduces to a self-definition or prior author result by construction. The trilemma follows directly from the hypotheses rather than from any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on topological properties of the prompt space and continuity of the defense wrapper. No free parameters or new entities are introduced; the result is a no-go theorem under stated assumptions.

axioms (2)

domain assumption The prompt space X is connected
Required for the boundary fixation and persistent unsafe region results to hold.
domain assumption The defense function D is continuous
Central assumption enabling the trilemma between continuity, utility, and safety.

pith-pipeline@v0.9.0 · 5502 in / 1317 out tokens · 37886 ms · 2026-05-10T18:28:43.733835+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Towards a Data-Parameter Correspondence for LLMs: A Preliminary Discussion
cs.LG 2026-04 unverdicted novelty 4.0

A data-parameter correspondence unifies data-centric and parameter-centric LLM optimizations as dual geometric operations on the statistical manifold via Fisher-Rao metric and Legendre duality.

Reference graph

Works this paper leans on

40 extracted references · 15 canonical work pages · cited by 1 Pith paper · 7 internal anchors

[1]

Detecting language model attacks with perplexity

G. Alon and M. Kamfonas. Detecting language model attacks with perplexity.arXiv preprint arXiv:2308.14132, 2023

work page arXiv 2023
[2]

Bagnall and G

A. Bagnall and G. Stewart. Certifying the true error: Machine learning in Coq with verified generalization guarantees.Proceedings of the AAAI Conference on Artificial Intelligence, 2019. 14

2019
[3]

Constitutional AI: Harmlessness from AI Feedback

Y. Bai et al. Constitutional AI: Harmlessness from AI feedback.arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong. Jailbreaking black box large language models in twenty queries.arXiv preprint arXiv:2310.08419, 2024

work page internal anchor Pith review arXiv 2024
[5]

Cohen, E

J. Cohen, E. Rosenfeld, and J. Z. Kolter. Certified adversarial robustness via randomized smoothing.Proceedings of ICML, 2019

2019
[6]

I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples.Proceedings of ICLR, 2015

2015
[7]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

H. Inan et al. Llama Guard: LLM-based input-output safeguard for human-AI conversations.arXiv preprint arXiv:2312.06674, 2023

work page internal anchor Pith review arXiv 2023
[8]

G. Katz, C. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. Proceedings of CAV, 2017

2017
[9]

Carlini and D

N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks.Proceedings of IEEE S&P, 2017

2017
[10]

Mehrotra et al

A. Mehrotra et al. Tree of attacks: Jailbreaking black-box LLMs automatically.Advances in Neural Information Processing Systems, 37, 2024

2024
[11]

Fawzi, H

A. Fawzi, H. Fawzi, and O. Fawzi. Adversarial vulnerability for any classifier.Advances in Neural Information Processing Systems, 31, 2018

2018
[12]

Greshake, S

K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz. Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. Proceedings of AISec, 2023

2023
[13]

Huang, M

X. Huang, M. Kwiatkowska, S. Wang, and M. Wu. Safety verification of deep neural networks.Proceedings of CAV, 2017

2017
[14]

Madry, A

A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks.Proceedings of ICLR, 2018

2018
[15]

Illuminating search spaces by mapping elites

J.-B. Mouret and J. Clune. Illuminating search spaces by mapping elites.arXiv preprint arXiv:1504.04909, 2015

work page Pith review arXiv 2015
[16]

Naitzat, A

G. Naitzat, A. Zhitnikov, and L.-H. Lim. Topology of deep neural networks.Journal of Machine Learning Research, 21(184):1–40, 2020

2020
[17]

Manifold of Failure: Behavioral Attraction Basins in Language Models

S. Munshi, M. Bhatt, V. S. Narajala, I. Habler, A. Al-Kahfah, K. Huang, and B. Gatto. Manifold of failure: Behavioral attraction basins in language models.arXiv preprint arXiv:2602.22291v2, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[18]

Samvelyan et al

M. Samvelyan et al. Rainbow teaming: Open-ended generation of diverse adversarial prompts.Advances in Neural Information Processing Systems, 37, 2024

2024
[19]

Singh, T

G. Singh, T. Gehr, M. Püschel, and M. Vechev. An abstract domain for certifying neural networks.Proceedings of POPL, 2019

2019
[20]

Szegedy et al

C. Szegedy et al. Intriguing properties of neural networks. Proceedings of ICLR, 2014. 15

2014
[21]

Tsipras, S

D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry. Robustness may be at odds with accuracy.Proceedings of ICLR, 2019

2019
[22]

D. H. Wolpert and W. G. Macready. No free lunch theorems for optimization.IEEE Trans. Evol. Comput., 1(1):67–82, 1997

1997
[23]

A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Ge et al

K. Ge et al. MART: Improving LLM safety with multi-round automatic red-teaming.Proceedings of NAACL, 2024

2024
[25]

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

E. Hubinger et al. Sleeper agents: Training deceptive LLMs that persist through safety training.arXiv preprint arXiv:2401.05566, 2024

work page internal anchor Pith review arXiv 2024
[26]

Anil et al

C. Anil et al. Many-shot jailbreaking.Advances in Neural Information Processing Systems, 37, 2024

2024
[27]

Kim et al

D. Kim et al. What really matters in many-shot attacks?Proceedings of ACL, 2025

2025
[28]

Zhan et al

Q. Zhan et al. InjecAgent: Benchmarking indirect prompt injections in tool-integrated LLM agents.Findings of ACL, 2024

2024
[29]

Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

H. Zhang et al. Agent security bench (ASB): Formalizing and benchmarking attacks and defenses in LLM-based agents.arXiv preprint arXiv:2410.02644, 2024

work page internal anchor Pith review arXiv 2024
[30]

Yuan et al

Y. Yuan et al. The instability of safety.arXiv preprint arXiv:2512.12066, 2025

work page arXiv 2025
[31]

Investigating the Impact of Quantization Methods on the Safety and Reliability of Large Language Models,

V. Tsvetkov et al. Quantization and safety: A closer look at LLM safety under weight compression.arXiv preprint arXiv:2502.15799, 2025

work page arXiv 2025
[32]

Nesti et al

F. Nesti et al. Mind the gap: Adversarial attacks against GGUF quantized LLMs.Proceedings of ICML, 2025

2025
[33]

Hammoud et al

H. Hammoud et al. Model merging and safety alignment: One bad model spoils the bunch.Findings of EMNLP, 2024

2024
[34]

Liu et al

X. Liu et al. Alignment whack-a-mole.arXiv preprint arXiv:2603.20957, 2026

work page arXiv 2026
[35]

Rosenberg et al

D. Rosenberg et al. IRIS: Adversarial suffix attacks against robust defenses.Proceedings of NAACL, 2025

2025
[36]

Zhao et al

X. Zhao et al. Weak-to-strong jailbreaking on large language models. Proceedings of ICML, 2025

2025
[37]

Safety tax: Safety alignment makes your large reasoning models less reasonable

Y. Huang et al. The safety tax of reasoning alignment.arXiv preprint arXiv:2503.00555, 2025

work page arXiv 2025
[38]

Huang et al

Y. Huang et al. On the geometric inevitability of the alignment tax. arXiv preprint arXiv:2603.00047, 2026

work page arXiv 2026
[39]

David vs. Goliath: Verifiable Agent-to-Agent Jailbreaking via RL,

L. Bailey et al. Slingshot: RL-based agent-to-agent jailbreaking. arXiv preprint arXiv:2602.02395, 2026. 16 Appendices A Vulnerability Landscape This section characterizes the geometry of the unsafe region. Theorem A.1(Basin Structure).Iffis continuous andf(p)> τ, thenU τ is open. Under any measure positive on nonempty open sets,U τ has positive measure. ...

work page arXiv 2026
[40]

dist(x, z)> τ. Lean:persistent_unsafe_refinedinMoF_20_RefinedPersistence(defense-path con- stantℓ).MoF_11_EpsilonRobustcontains an earlier version using the global constantL; as noted in Section 6, that version is vacuous for isotropic surfaces. C Counterexamples: Each Hypothesis Is Necessary Counterexample C.1(Removing connectedness).X={0,1}discrete,f(0)...