pith. machine review for the scientific record. sign in

arxiv: 2604.06436 · v3 · submitted 2026-04-07 · 💻 cs.CR · cs.AI

Recognition: no theorem link

The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?

Authors on Pith no claims yet

Pith reviewed 2026-05-10 18:28 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords prompt injectiondefense wrapperslanguage modelscontinuityutility preservationtrilemmaadversarial robustnesssafety
0
0 comments X

The pith

No continuous utility-preserving wrapper defense can make all language model outputs strictly safe.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that for language models whose prompt space is connected, no preprocessing function D that maps inputs to inputs continuously while preserving utility can eliminate all unsafe outputs. It derives three escalating results: any such defense must leave some boundary inputs unchanged, Lipschitz conditions force a positive-measure band of near-threshold inputs around those points, and a transversality condition leaves a positive-measure set of strictly unsafe inputs. These form a defense trilemma in which continuity, utility preservation, and complete safety are mutually incompatible. Discrete versions without topology, plus extensions to multi-turn, stochastic, and capacity-parity cases, are also shown, while training-time alignment and architectural changes remain outside the result's scope.

Core claim

We prove that no continuous, utility-preserving wrapper defense-a function D: X to X that preprocesses inputs before the model sees them-can make all outputs strictly safe for a language model with connected prompt space, and we characterize exactly where every such defense must fail. We establish three results under successively stronger hypotheses: boundary fixation-the defense must leave some threshold-level inputs unchanged; an epsilon-robust constraint-under Lipschitz regularity, a positive-measure band around fixed boundary points remains near-threshold; and a persistent unsafe region under a transversality condition, a positive-measure subset of inputs remains strictly unsafe. These构成

What carries the argument

The wrapper defense function D: X to X, a continuous map from the prompt space to itself that attempts to preserve utility while enforcing safety.

Load-bearing premise

The prompt space is connected, so continuous paths exist between any pair of inputs.

What would settle it

Exhibit one continuous utility-preserving D such that the underlying model produces only safe outputs on the image of D, or demonstrate that the unsafe region has measure zero after any such mapping.

Figures

Figures reproduced from arXiv: 2604.06436 by Ammar Al-Kahfah, Blake Gatto, Idan Habler, Joel Webb, Ken Huang, Manish Bhatt, Md Tamjidul Hoque, Sarthak Munshi, Vineeth Sai Narajala.

Figure 1
Figure 1. Figure 1: Schematic of the prompt space. The defense [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The defense trilemma. Any continuous wrapper defense on a connected space can satisfy [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The three impossibility results on a 1D cross-section. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Non-vacuous validation of Theorem 6.3 on the saturated [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
read the original abstract

We prove that no continuous, utility-preserving wrapper defense-a function $D: X\to X$ that preprocesses inputs before the model sees them-can make all outputs strictly safe for a language model with connected prompt space, and we characterize exactly where every such defense must fail. We establish three results under successively stronger hypotheses: boundary fixation-the defense must leave some threshold-level inputs unchanged; an $\epsilon$-robust constraint-under Lipschitz regularity, a positive-measure band around fixed boundary points remains near-threshold; and a persistent unsafe region under a transversality condition, a positive-measure subset of inputs remains strictly unsafe. These constitute a defense trilemma: continuity, utility preservation, and completeness cannot coexist. We prove parallel discrete results requiring no topology, and extend to multi-turn interactions, stochastic defenses, and capacity-parity settings. The results do not preclude training-time alignment, architectural changes, or defenses that sacrifice utility. The full theory is mechanically verified in Lean 4 and validated empirically on three LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper proves that no continuous, utility-preserving wrapper defense D: X→X can render all outputs of a language model strictly safe when the prompt space is connected. It establishes three results of increasing strength: boundary fixation (some threshold inputs must remain unchanged), ε-robustness (a positive-measure band around fixed points remains near-threshold under Lipschitz conditions), and a persistent unsafe region (positive-measure unsafe set under transversality). Parallel discrete results without topology are given, plus extensions to multi-turn interactions, stochastic defenses, and capacity-parity settings. All core theorems are mechanically verified in Lean 4, with empirical validation on three LLMs. The work explicitly notes that training-time alignment and utility-sacrificing defenses remain viable.

Significance. If the results hold, the paper makes a substantial contribution by supplying a formally verified impossibility result that precisely characterizes the failure modes of prompt-injection wrapper defenses. The mechanical verification in Lean 4 is a clear strength, eliminating the usual concerns about proof gaps. The explicit trilemma framing and acknowledgment of orthogonal mitigation strategies (training-time methods, architectural changes) make the work constructive rather than purely negative. The discrete parallel and multi-turn extensions broaden applicability. This should influence both theoretical work on AI safety and practical defense design.

minor comments (2)
  1. [Abstract] Abstract: the phrase 'strictly safe' is used without an explicit definition in the opening paragraph; a one-sentence clarification of the safety predicate would help readers who skip the formal sections.
  2. [Empirical Validation] The empirical validation section reports results on three LLMs but does not specify the exact prompt-injection attack templates or the quantitative safety metric; adding a short table or pseudocode snippet would improve reproducibility.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive and constructive review, which accurately summarizes the paper's contributions, including the trilemma results, Lean 4 verification, and acknowledgment of viable alternatives such as training-time alignment. We appreciate the recommendation to accept.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central result is a formal impossibility theorem: no continuous utility-preserving wrapper D: X→X can render all LM outputs strictly safe when the prompt space X is connected. The derivation proceeds from explicitly stated assumptions (connectedness, continuity of D, boundary fixation, Lipschitz regularity, transversality) via standard topological arguments, with parallel discrete results requiring no topology. All core theorems are mechanically verified in Lean 4, eliminating dependence on unverified self-citations. No parameters are fitted to data, no quantity is renamed as a prediction, and no load-bearing step reduces to a self-definition or prior author result by construction. The trilemma follows directly from the hypotheses rather than from any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on topological properties of the prompt space and continuity of the defense wrapper. No free parameters or new entities are introduced; the result is a no-go theorem under stated assumptions.

axioms (2)
  • domain assumption The prompt space X is connected
    Required for the boundary fixation and persistent unsafe region results to hold.
  • domain assumption The defense function D is continuous
    Central assumption enabling the trilemma between continuity, utility, and safety.

pith-pipeline@v0.9.0 · 5502 in / 1317 out tokens · 37886 ms · 2026-05-10T18:28:43.733835+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Towards a Data-Parameter Correspondence for LLMs: A Preliminary Discussion

    cs.LG 2026-04 unverdicted novelty 4.0

    A data-parameter correspondence unifies data-centric and parameter-centric LLM optimizations as dual geometric operations on the statistical manifold via Fisher-Rao metric and Legendre duality.

Reference graph

Works this paper leans on

40 extracted references · 15 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    Detecting language model attacks with perplexity

    G. Alon and M. Kamfonas. Detecting language model attacks with perplexity.arXiv preprint arXiv:2308.14132, 2023

  2. [2]

    Bagnall and G

    A. Bagnall and G. Stewart. Certifying the true error: Machine learning in Coq with verified generalization guarantees.Proceedings of the AAAI Conference on Artificial Intelligence, 2019. 14

  3. [3]

    Constitutional AI: Harmlessness from AI Feedback

    Y. Bai et al. Constitutional AI: Harmlessness from AI feedback.arXiv preprint arXiv:2212.08073, 2022

  4. [4]

    P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong. Jailbreaking black box large language models in twenty queries.arXiv preprint arXiv:2310.08419, 2024

  5. [5]

    Cohen, E

    J. Cohen, E. Rosenfeld, and J. Z. Kolter. Certified adversarial robustness via randomized smoothing.Proceedings of ICML, 2019

  6. [6]

    I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples.Proceedings of ICLR, 2015

  7. [7]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    H. Inan et al. Llama Guard: LLM-based input-output safeguard for human-AI conversations.arXiv preprint arXiv:2312.06674, 2023

  8. [8]

    G. Katz, C. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. Proceedings of CAV, 2017

  9. [9]

    Carlini and D

    N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks.Proceedings of IEEE S&P, 2017

  10. [10]

    Mehrotra et al

    A. Mehrotra et al. Tree of attacks: Jailbreaking black-box LLMs automatically.Advances in Neural Information Processing Systems, 37, 2024

  11. [11]

    Fawzi, H

    A. Fawzi, H. Fawzi, and O. Fawzi. Adversarial vulnerability for any classifier.Advances in Neural Information Processing Systems, 31, 2018

  12. [12]

    Greshake, S

    K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz. Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. Proceedings of AISec, 2023

  13. [13]

    Huang, M

    X. Huang, M. Kwiatkowska, S. Wang, and M. Wu. Safety verification of deep neural networks.Proceedings of CAV, 2017

  14. [14]

    Madry, A

    A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks.Proceedings of ICLR, 2018

  15. [15]

    Illuminating search spaces by mapping elites

    J.-B. Mouret and J. Clune. Illuminating search spaces by mapping elites.arXiv preprint arXiv:1504.04909, 2015

  16. [16]

    Naitzat, A

    G. Naitzat, A. Zhitnikov, and L.-H. Lim. Topology of deep neural networks.Journal of Machine Learning Research, 21(184):1–40, 2020

  17. [17]

    Manifold of Failure: Behavioral Attraction Basins in Language Models

    S. Munshi, M. Bhatt, V. S. Narajala, I. Habler, A. Al-Kahfah, K. Huang, and B. Gatto. Manifold of failure: Behavioral attraction basins in language models.arXiv preprint arXiv:2602.22291v2, 2026

  18. [18]

    Samvelyan et al

    M. Samvelyan et al. Rainbow teaming: Open-ended generation of diverse adversarial prompts.Advances in Neural Information Processing Systems, 37, 2024

  19. [19]

    Singh, T

    G. Singh, T. Gehr, M. Püschel, and M. Vechev. An abstract domain for certifying neural networks.Proceedings of POPL, 2019

  20. [20]

    Szegedy et al

    C. Szegedy et al. Intriguing properties of neural networks. Proceedings of ICLR, 2014. 15

  21. [21]

    Tsipras, S

    D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry. Robustness may be at odds with accuracy.Proceedings of ICLR, 2019

  22. [22]

    D. H. Wolpert and W. G. Macready. No free lunch theorems for optimization.IEEE Trans. Evol. Comput., 1(1):67–82, 1997

  23. [23]

    A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023

  24. [24]

    Ge et al

    K. Ge et al. MART: Improving LLM safety with multi-round automatic red-teaming.Proceedings of NAACL, 2024

  25. [25]

    Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

    E. Hubinger et al. Sleeper agents: Training deceptive LLMs that persist through safety training.arXiv preprint arXiv:2401.05566, 2024

  26. [26]

    Anil et al

    C. Anil et al. Many-shot jailbreaking.Advances in Neural Information Processing Systems, 37, 2024

  27. [27]

    Kim et al

    D. Kim et al. What really matters in many-shot attacks?Proceedings of ACL, 2025

  28. [28]

    Zhan et al

    Q. Zhan et al. InjecAgent: Benchmarking indirect prompt injections in tool-integrated LLM agents.Findings of ACL, 2024

  29. [29]

    Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

    H. Zhang et al. Agent security bench (ASB): Formalizing and benchmarking attacks and defenses in LLM-based agents.arXiv preprint arXiv:2410.02644, 2024

  30. [30]

    Yuan et al

    Y. Yuan et al. The instability of safety.arXiv preprint arXiv:2512.12066, 2025

  31. [31]

    Investigating the Impact of Quantization Methods on the Safety and Reliability of Large Language Models,

    V. Tsvetkov et al. Quantization and safety: A closer look at LLM safety under weight compression.arXiv preprint arXiv:2502.15799, 2025

  32. [32]

    Nesti et al

    F. Nesti et al. Mind the gap: Adversarial attacks against GGUF quantized LLMs.Proceedings of ICML, 2025

  33. [33]

    Hammoud et al

    H. Hammoud et al. Model merging and safety alignment: One bad model spoils the bunch.Findings of EMNLP, 2024

  34. [34]

    Liu et al

    X. Liu et al. Alignment whack-a-mole.arXiv preprint arXiv:2603.20957, 2026

  35. [35]

    Rosenberg et al

    D. Rosenberg et al. IRIS: Adversarial suffix attacks against robust defenses.Proceedings of NAACL, 2025

  36. [36]

    Zhao et al

    X. Zhao et al. Weak-to-strong jailbreaking on large language models. Proceedings of ICML, 2025

  37. [37]

    Safety tax: Safety alignment makes your large reasoning models less reasonable

    Y. Huang et al. The safety tax of reasoning alignment.arXiv preprint arXiv:2503.00555, 2025

  38. [38]

    Huang et al

    Y. Huang et al. On the geometric inevitability of the alignment tax. arXiv preprint arXiv:2603.00047, 2026

  39. [39]

    David vs. Goliath: Verifiable Agent-to-Agent Jailbreaking via RL,

    L. Bailey et al. Slingshot: RL-based agent-to-agent jailbreaking. arXiv preprint arXiv:2602.02395, 2026. 16 Appendices A Vulnerability Landscape This section characterizes the geometry of the unsafe region. Theorem A.1(Basin Structure).Iffis continuous andf(p)> τ, thenU τ is open. Under any measure positive on nonempty open sets,U τ has positive measure. ...

  40. [40]

    dist(x, z)> τ. Lean:persistent_unsafe_refinedinMoF_20_RefinedPersistence(defense-path con- stantℓ).MoF_11_EpsilonRobustcontains an earlier version using the global constantL; as noted in Section 6, that version is vacuous for isotropic surfaces. C Counterexamples: Each Hypothesis Is Necessary Counterexample C.1(Removing connectedness).X={0,1}discrete,f(0)...