pith. sign in

arxiv: 2606.30783 · v1 · pith:5DOZEEJInew · submitted 2026-06-29 · 💻 cs.CR · cs.AI

Security--Fidelity Tradeoffs: The Hidden Cost of Prompt Injection Defense

Pith reviewed 2026-07-01 01:41 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords prompt injectionLLM defensesecurity-fidelity tradeoffSecFid benchmarkindirect prompt injectionLLM robustnessattack success rate
0
0 comments X

The pith

Prompt injection defenses in LLMs create an unavoidable security-fidelity tradeoff.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that current defenses against indirect prompt injection resist attacks primarily by suppressing untrusted text, which damages performance on tasks that must preserve that text such as translation or document editing. Attack-success rates alone cannot detect this problem because a model that ignores an injection and one that processes it faithfully as data produce the same outcome under those metrics. The authors introduce the SecFid benchmark, constructed so that executing an injection, treating it as data, and ignoring it yield distinguishable outputs, allowing both security and fidelity to be measured directly. Across 1,168 examples and 48 model-defense configurations, no combination reaches high levels of both, with the best fidelity at 96.5 percent paired with only 47.8 percent security and the strongest security at 99.3 percent paired with 71.0 to 73.9 percent fidelity. A decision-theoretic argument shows that the right operating point depends on the deployment-specific costs of a successful hijack versus a lost span of content.

Core claim

No model or defense achieves both high security and high fidelity; the highest-fidelity model reaches 96.5 percent fidelity at 47.8 percent security while the most secure defenses reach 99.3 percent security at only 71.0 to 73.9 percent fidelity. Even defenses that deliver identical security scores differ in mechanism, with some converting hijacked inputs into faithful processing and others simply dropping benign content. Security metrics therefore capture only half of robustness because they omit the fidelity cost at which security is purchased.

What carries the argument

The SecFid benchmark, which forces distinguishable outputs for executing an injection, processing it as data, and ignoring it.

If this is right

  • Security metrics alone are insufficient because they hide the fidelity price paid by each defense.
  • The appropriate defense is not fixed but must be chosen according to the relative cost of a hijack versus a dropped span in the target deployment.
  • Defenses that repair hijacks into faithful outputs differ in effect from those that suppress benign content even when both report the same security score.
  • Reporting only attack success rates conceals the operational cost of the chosen defense.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • In applications where fidelity matters more than perfect security, such as collaborative document editing, lighter defenses may be preferable.
  • Adaptive systems could estimate the local cost ratio and switch defense strength per task or per input span.
  • Models trained to distinguish instructions from data without broad suppression might reduce the observed tradeoff.

Load-bearing premise

The SecFid benchmark construction produces distinguishable outputs for the three behaviors and its chosen tasks represent deployments where fidelity to untrusted text actually matters.

What would settle it

A single defense or model that scores above 90 percent on both security and fidelity across the SecFid examples would falsify the claim that the tradeoff is unavoidable.

Figures

Figures reproduced from arXiv: 2606.30783 by Ekraam Sabir, Haohan Wang, Mitchell Hermon, Rahul Gupta, Weitong Ruan.

Figure 1
Figure 1. Figure 1: Task-dependent interpretation of instruction-like text in untrusted inputs. The same product review contains an instruction￾like sentence (Review B) that could be an indirect prompt injection. For translation, the sentence should be translated faithfully. For summarization, the sentence must not control the summary and may be summarized as content or omitted as irrelevant, depending on the task objective. … view at source ↗
Figure 2
Figure 2. Figure 2: The security–fidelity tradeoff in practice. Each point shows a model in the Security–Fidelity space. Among these mod￾els, no system achieves both high security and high fidelity; instead, models spread along a frontier. See [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Reasoning mainly increases security, not fidelity. Each trajectory tracks a model from its non-reasoning or minimal￾reasoning baseline through low, medium, and high reasoning. The average trajectory moves mostly rightward: execution falls, while fidelity remains roughly flat or slightly decreases [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Repair versus suppression of base-model hijacks. Each bar conditions on the examples that the undefended base model executed. SECALIGN repairs the most former hijacks while DefensiveTokens suppresses the most. 4.3. Learning to Repair Rather Than Suppress We briefly explore one way the SECFID benchmark can be used not only to diagnose defenses but to improve them. Most prompt-injection defenses train only o… view at source ↗
read the original abstract

We identify a security-fidelity tradeoff in defending LLMs against indirect prompt injection: defenses resist injected instructions largely by suppressing untrusted text, which corrupts tasks that must preserve it, such as translation and document editing. Attack-success metrics cannot see this, because a model that ignores an injection and one that faithfully processes it as data score identically. We introduce SecFid, a benchmark built so that executing an injection, processing it as data, and ignoring it produce distinguishable outputs. This makes fidelity measurable and exposes a frontier: across 1,168 examples and 48 configurations, no model or defense achieves both objectives. The highest-fidelity model reaches 96.5% fidelity at 47.8% security, while the most secure defenses invert this, at 99.3% security but only 71.0%-73.9% fidelity. Even defenses with identical security differ in how they earn it: some repair hijacks into faithful processing, others simply suppress benign content. A decision-theoretic analysis shows why no fixed choice can be right everywhere: the correct behavior is not a property of the defense but of the deployment, set by its relative cost of a hijack versus a dropped span. Security alone therefore measures only half of robustness, and reporting it without fidelity hides the price at which it was bought.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that defenses against indirect prompt injection in LLMs incur a hidden fidelity cost by suppressing untrusted text, which harms tasks requiring faithful processing of that text (e.g., translation, editing). Standard attack-success metrics cannot distinguish ignoring an injection from processing it as data. The authors introduce the SecFid benchmark, constructed so that executing an injection, processing it as data, and ignoring it yield distinguishable outputs. Across 1,168 examples and 48 configurations, no model or defense achieves both high security and high fidelity; the highest-fidelity model reaches 96.5% fidelity at 47.8% security, while the most secure reach 99.3% security at 71.0-73.9% fidelity. A decision-theoretic analysis shows the optimal behavior depends on the relative cost of hijack versus dropped span in a given deployment.

Significance. If the SecFid results hold, the work is significant for LLM security research: it demonstrates that security-only metrics capture only half of robustness and provides a decision-theoretic lens for deployment-specific tradeoffs. The empirical frontier across many configurations and the distinction between repair-based vs. suppression-based defenses are concrete contributions that could shift how defenses are evaluated and reported.

major comments (2)
  1. [Abstract / benchmark construction] Abstract and benchmark construction section: the central claim that 'no model or defense achieves both' and the reported frontier percentages rest on the SecFid labeling scheme reliably distinguishing 'processing injection as data' from execution or suppression. The manuscript asserts the benchmark is 'built so that' the behaviors are distinguishable but provides no quantitative validation (e.g., error rates, agreement metrics, or sensitivity analysis) of the string-matching/semantic rules across the 48 configurations or 1,168 examples. This is load-bearing for the fidelity numbers.
  2. [Results] Results section (1,168 examples, 48 configurations): the reported security and fidelity percentages lack accompanying details on exclusion criteria, statistical tests for the frontier, or robustness checks against post-hoc choices in task/example selection. Without these, it is not possible to assess whether the tradeoff is an artifact of benchmark construction rather than a general phenomenon.
minor comments (2)
  1. [Abstract] The abstract uses an en-dash in the title ('Security--Fidelity') that should be rendered consistently in the body and references.
  2. [Decision-theoretic analysis] Clarify in the decision-theoretic section whether the cost ratio is treated as a free parameter or derived from any empirical distribution of deployment scenarios.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which highlight important areas for strengthening the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses
  1. Referee: [Abstract / benchmark construction] Abstract and benchmark construction section: the central claim that 'no model or defense achieves both' and the reported frontier percentages rest on the SecFid labeling scheme reliably distinguishing 'processing injection as data' from execution or suppression. The manuscript asserts the benchmark is 'built so that' the behaviors are distinguishable but provides no quantitative validation (e.g., error rates, agreement metrics, or sensitivity analysis) of the string-matching/semantic rules across the 48 configurations or 1,168 examples. This is load-bearing for the fidelity numbers.

    Authors: We agree that explicit quantitative validation of the labeling scheme is necessary to support the fidelity claims. The SecFid rules are deterministic and task-specific (string matching for outputs like injected commands vs. task results incorporating the span vs. omission), but the manuscript does not report agreement metrics or sensitivity. We will add a dedicated validation subsection reporting: (i) inter-annotator agreement on a random sample of 200 examples (targeting >95% agreement on category assignment), (ii) error rates from manual review, and (iii) sensitivity analysis by varying matching thresholds. This revision will make the load-bearing aspect transparent. revision: yes

  2. Referee: [Results] Results section (1,168 examples, 48 configurations): the reported security and fidelity percentages lack accompanying details on exclusion criteria, statistical tests for the frontier, or robustness checks against post-hoc choices in task/example selection. Without these, it is not possible to assess whether the tradeoff is an artifact of benchmark construction rather than a general phenomenon.

    Authors: All 1,168 examples were included with no exclusions applied. Task/example selection prioritized cases where the three behaviors produce distinguishable outputs by construction. We will revise the results section to include: explicit confirmation of full inclusion, bootstrap confidence intervals on all reported percentages, McNemar's tests for statistical significance of differences supporting the frontier, and a robustness subsection with results on random subsamples (e.g., 500 examples) to check sensitivity to selection. These additions will address concerns about artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark measurements with no self-referential derivations or fitted predictions

full rationale

The paper's central claims consist of direct empirical measurements of security and fidelity on the introduced SecFid benchmark across 1,168 examples and 48 configurations. No equations, parameters, or derivations are presented that reduce the reported percentages to quantities defined by the paper's own fits or self-citations. The benchmark is described as constructed to distinguish behaviors, but the results are observational outputs rather than tautological re-expressions of inputs. This is a standard empirical evaluation with no load-bearing self-citation chains or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that SecFid produces distinguishable outputs for the three behaviors and that the 1,168 examples adequately sample the relevant task space; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption SecFid benchmark construction produces outputs that reliably distinguish executing an injection, processing it as data, and ignoring it.
    This premise is required for fidelity to be measurable and for the reported tradeoff to be observable; it is invoked by the claim that attack-success metrics cannot see the fidelity cost.

pith-pipeline@v0.9.1-grok · 5778 in / 1337 out tokens · 33633 ms · 2026-07-01T01:41:31.273161+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

129 extracted references · 55 canonical work pages · 18 internal anchors

  1. [1]

    Lampert , booktitle=

    Egor Zverev and Sahar Abdelnabi and Soroush Tabesh and Mario Fritz and Christoph H. Lampert , booktitle=. Can

  2. [2]

    Formalizing and

    Liu, Yupei and Jia, Yuqi and Geng, Runpeng and Jia, Jinyuan and Gong, Neil Zhenqiang , year = 2024, pages =. Formalizing and. 33rd

  3. [3]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, Gheorghe and Bieber, Eric and Schaekermann, Mike and others , year =. doi:10.48550/arXiv.2507.06261 , url =. 2507.06261 , archivePrefix =

  4. [4]

    Kandpal, Nikhil and Lester, Brian and Raffel, Colin and Majstorovic, Sebastian and Biderman, Stella and Abbasi, Baber and Soldaini, Luca and Shippole, Enrico and Cooper, A. Feder and Skowron, Aviya and Kirchenbauer, John and Longpre, Shayne and Sutawika, Lintang and Albalak, Alon and Xu, Zhenlin and Penedo, Guilherme and Allal, Loubna Ben and Bakouch, Eli...

  5. [6]

    Defending

    Hines, Keegan and Lopez, Gary and Hall, Matthew and Zarfati, Federico and Zunger, Yonatan and Kiciman, Emre , year = 2024, month = mar, journal =. Defending

  6. [7]

    Vera, Henrique Schechter and Dua, Sahil and Zhang, Biao and Salz, Daniel and Mullins, Ryan and Panyam, Sindhu Raghuram and Smoot, Sara and Naim, Iftekhar and Zou, Joe and Chen, Feiyang and Cer, Daniel and Lisak, Alice and Choi, Min and Gonzalez, Lucas and Sanseviero, Omar and Cameron, Glenn and Ballantyne, Ian and Black, Kat and Chen, Kaifeng and Wang, We...

  7. [8]

    Ignore Previous Prompt: Attack Techniques For Language Models

    Ignore Previous Prompt: Attack Techniques For Language Models , author=. arXiv preprint arXiv:2211.09527 , year=

  8. [9]

    arXiv preprint arXiv:2507.02735 , year=

    Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks , author=. arXiv preprint arXiv:2507.02735 , year=

  9. [10]

    Qwen and Yang, An and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Li, Chengyuan and Liu, Dayiheng and Huang, Fei and Wei, Haoran and Lin, Huan and Yang, Jian and Tu, Jianhong and Zhang, Jianwei and Yang, Jianxin and Yang, Jiaxi and Zhou, Jingren and Lin, Junyang and Dang, Kai and Lu, Keming and Bao, Keqin and Yang, Ke...

  10. [11]

    Defending

    Chen, Sizhe and Wang, Yizhu and Carlini, Nicholas and Sitawarin, Chawin and Wagner, David , year = 2026, month = dec, series =. Defending. Proceedings of the 18th. doi:10.1145/3733799.3762982 , urldate =

  11. [12]

    Grattafiori, Aaron and Dubey, Abhimanyu and Jauhri, Abhinav and others , year =. The. doi:10.48550/arXiv.2407.21783 , url =. 2407.21783 , archivePrefix =

  12. [13]

    Introducing GPT-5.2 , date =

  13. [14]

    Doshi, Tulsee , title =

  14. [15]

    Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security , pages =

    Greshake, Kai and Abdelnabi, Sahar and Mishra, Shailesh and Endres, Christoph and Holz, Thorsten and Fritz, Mario , title =. Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security , pages =. 2023 , isbn =. doi:10.1145/3605764.3623985 , abstract =

  15. [16]

    Advances in Neural Information Processing Systems , volume =

    Debenedetti, Edoardo and Zhang, Jie and Balunovic, Mislav and. Advances in Neural Information Processing Systems , volume =

  16. [17]

    Bridging the Editing Gap in LLM s: F ine E dit for Precise and Targeted Text Modifications

    Zeng, Yiming and Yu, Wanhao and Li, Zexin and Ren, Tao and Ma, Yu and Cao, Jinghan and Chen, Xiyan and Yu, Tingting. Bridging the Editing Gap in LLM s: F ine E dit for Precise and Targeted Text Modifications. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.118

  17. [18]

    2025 , month = oct, url =

  18. [19]

    2026 , month = feb, url =

  19. [20]

    2025 , month = dec, url =

  20. [21]

    2026 , month = may, url =

  21. [22]

    2026 , month = mar, url =

  22. [23]

    2026 , month = mar, howpublished =

  23. [24]

    Prompt Injection attack against LLM-integrated Applications

    Prompt Injection attack against LLM-integrated Applications , author=. arXiv preprint arXiv:2306.05499 , year=

  24. [25]

    33rd USENIX Security Symposium (USENIX Security 24) , pages=

    Formalizing and benchmarking prompt injection attacks and defenses , author=. 33rd USENIX Security Symposium (USENIX Security 24) , pages=

  25. [26]

    Slashdot , year =

    Bing Chat Succumbs to Prompt Injection Attack, Spills Its Secrets , author =. Slashdot , year =

  26. [27]

    Black Duck Blog , year =

    CyRC Advisory: Prompt Injection in EmailGPT , author =. Black Duck Blog , year =

  27. [28]

    Positive Security Blog , year =

    Hacking Auto-GPT and Escaping Its Docker Container , author =. Positive Security Blog , year =

  28. [29]

    AWS Security Blog , year =

    Safeguard Your Generative AI Workloads from Prompt Injections , author =. AWS Security Blog , year =

  29. [30]

    arXiv preprint arXiv:2503.00061 , year=

    Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents , author=. arXiv preprint arXiv:2503.00061 , year=

  30. [31]

    USENIX Security Symposium , year=

    StruQ: Defending against prompt injection with structured queries , author=. USENIX Security Symposium , year=

  31. [32]

    Advances in Neural Information Processing Systems , volume=

    Tree of attacks: Jailbreaking black-box llms automatically , author=. Advances in Neural Information Processing Systems , volume=

  32. [33]

    IBM Think Insights , year =

    Prevent Prompt Injection Attacks , author =. IBM Think Insights , year =

  33. [34]

    Advances in neural information processing systems , volume=

    Learning robust global representations by penalizing local predictive power , author=. Advances in neural information processing systems , volume=

  34. [35]

    Computer vision--ECCV 2020: 16th European conference, Glasgow, UK, August 23--28, 2020, proceedings, part II 16 , pages=

    Self-challenging improves cross-domain generalization , author=. Computer vision--ECCV 2020: 16th European conference, Glasgow, UK, August 23--28, 2020, proceedings, part II 16 , pages=. 2020 , organization=

  35. [36]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    High-frequency component helps explain the generalization of convolutional neural networks , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  36. [37]

    arXiv preprint arXiv:2401.17263 , year=

    Robust prompt optimization for defending language models against jailbreaking attacks , author=. arXiv preprint arXiv:2401.17263 , year=

  37. [38]

    arXiv preprint arXiv:2402.03299 , year=

    Guard: Role-playing to generate natural-language jailbreakings to test guideline adherence of large language models , author=. arXiv preprint arXiv:2402.03299 , year=

  38. [39]

    arXiv preprint arXiv:2407.01599 , year=

    Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models , author=. arXiv preprint arXiv:2407.01599 , year=

  39. [40]

    2024 , month = oct, urldate =

    The Thirteenth International Conference on Learning Representations , author =. 2024 , month = oct, urldate =

  40. [41]

    StruQ: Defending against prompt injection with structured queries

    Chen, Sizhe and Piet, Julien and Sitawarin, Chawin and Wagner, David , date =. doi:10.48550/arXiv.2402.06363 , url =. 2402.06363 , eprinttype =

  41. [42]

    Jailbreaking

    Andriushchenko, Maksym and Croce, Francesco and Flammarion, Nicolas , date =. Jailbreaking. doi:10.48550/arXiv.2404.02151 , url =. 2404.02151 , eprinttype =

  42. [43]

    Team, Gemma and Kamath, Aishwarya and Ferret, Johan and Pathak, Shreya and Vieillard, Nino and Merhej, Ramona and Perrin, Sarah and Matejovicova, Tatiana and Ram. Gemma 3. 2025 , month = mar, number =. doi:10.48550/arXiv.2503.19786 , urldate =. arXiv , keywords =:2503.19786 , primaryclass =

  43. [44]

    Yang, An and Li, Anfeng and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Gao, Chang and Huang, Chengen and Lv, Chenxu and Zheng, Chujie and Liu, Dayiheng and Zhou, Fan and Huang, Fei and Hu, Feng and Ge, Hao and Wei, Haoran and Lin, Huan and Tang, Jialong and Yang, Jian and Tu, Jianhong and Zhang, Jianwei and Yang, Jia...

  44. [45]

    and Chen, Pin-Yu , editor =

    Hung, Kuo-Han and Ko, Ching-Yun and Rawat, Ambrish and Chung, I-Hsin and Hsu, Winston H. and Chen, Pin-Yu , editor =. Attention. Findings of the. 2025 , month = apr, pages =. doi:10.18653/v1/2025.findings-naacl.123 , urldate =

  45. [46]

    Proceedings of the 25th

    Optuna: A Next-generation Hyperparameter Optimization Framework , author=. Proceedings of the 25th

  46. [47]

    Minaee, Shervin and Mikolov, Tomas and Nikzad, Narjes and Chenaghlu, Meysam and Socher, Richard and Amatriain, Xavier and Gao, Jianfeng , year =. Large. doi:10.48550/arXiv.2402.06196 , urldate =. arXiv , keywords =:2402.06196 , primaryclass =

  47. [48]

    Zou, Xiaotian and Chen, Yongkang and Li, Ke , year =. Is the. doi:10.48550/arXiv.2402.14857 , urldate =. arXiv , keywords =:2402.14857 , primaryclass =

  48. [49]

    doi:10.48550/arXiv.2505.12592 , url =

    Jeoung, Sullam and Chen, Yueyan and Zhang, Yi and Wang, Shuai and Ding, Haibo and Cheong, Lin Lee , date =. doi:10.48550/arXiv.2505.12592 , url =. 2505.12592 , eprinttype =

  49. [50]

    Defending Against Indirect Prompt Injection Attacks With Spotlighting

    Hines, Keegan and Lopez, Gary and Hall, Matthew and Zarfati, Federico and Zunger, Yonatan and Kiciman, Emre , year =. Defending. doi:10.48550/arXiv.2403.14720 , urldate =. arXiv , keywords =:2403.14720 , primaryclass =

  50. [51]

    Mu, Norman and Lu, Jonathan and Lavery, Michael and Wagner, David , year =. A. doi:10.48550/arXiv.2502.12197 , urldate =. arXiv , keywords =:2502.12197 , primaryclass =

  51. [52]

    2023 , publisher =

    ProtectAI.com , title =. 2023 , publisher =

  52. [53]

    NeurIPS Datasets and Benchmarks Track , year=

    JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models , author=. NeurIPS Datasets and Benchmarks Track , year=

  53. [54]

    2024 , month = nov, type =

  54. [55]

    Mu, Norman and Chen, Sarah and Wang, Zifan and Chen, Sizhe and Karamardian, David and Aljeraisy, Lulwa and Alomair, Basel and Hendrycks, Dan and Wagner, David , year =. Can. doi:10.48550/arXiv.2311.04235 , urldate =. arXiv , keywords =:2311.04235 , primaryclass =

  55. [56]

    Mehta, Rajiv , date =

  56. [57]

    Booking.com Launches New AI Trip Planner to Enhance Travel Planning Experience , howpublished=

    Booking.Com. Booking.com Launches New AI Trip Planner to Enhance Travel Planning Experience , howpublished=

  57. [58]

    2023 , month = mar, journal =

    Introducing. 2023 , month = mar, journal =

  58. [59]

    Coursera

    Villasenor, Amber , date =. Coursera. 2024 , month = jun, journal =

  59. [60]

    Walmart:

    Gosby, Desiree , date =. Walmart:

  60. [61]

    Learn about the conversational

  61. [62]

    2024 , publisher =

    ProtectAI.com , title =. 2024 , publisher =

  62. [63]

    2024 , booktitle=

    Protecting Your LLMs with Information Bottleneck , author=. 2024 , booktitle=

  63. [64]

    2024 , eprint=

    Defending Against Indirect Prompt Injection Attacks With Spotlighting , author=. 2024 , eprint=

  64. [65]

    Defending

    Xie, Yueqi and Yi, Jingwei and Shao, Jiawei and Curl, Justin and Lyu, Lingjuan and Chen, Qifeng and Xie, Xing and Wu, Fangzhao , year =. Defending. Nature Machine Intelligence , volume =. doi:10.1038/s42256-023-00765-8 , urldate =

  65. [66]

    Wallace, Eric and Xiao, Kai and Leike, Reimar and Weng, Lilian and Heidecke, Johannes and Beutel, Alex , year =. The. doi:10.48550/arXiv.2404.13208 , urldate =. arXiv , keywords =:2404.13208 , primaryclass =

  66. [67]

    2024 , eprint=

    Prompt Injection attack against LLM-integrated Applications , author=. 2024 , eprint=

  67. [68]

    2024 , eprint=

    Jailbreak Attacks and Defenses Against Large Language Models: A Survey , author=. 2024 , eprint=

  68. [69]

    2024 , howpublished =

  69. [70]

    Shen, Xinyue and Chen, Zeyuan and Backes, Michael and Shen, Yun and Zhang, Yang , year =. ". Proceedings of the 2024 on. doi:10.1145/3658644.3670388 , urldate =

  70. [71]

    , year =

    Zverev, Egor and Abdelnabi, Sahar and Tabesh, Soroush and Fritz, Mario and Lampert, Christoph H. , year =. Can. doi:10.48550/arXiv.2403.06833 , urldate =. arXiv , keywords =:2403.06833 , primaryclass =

  71. [72]

    doi:10.48550/arXiv.2405.06823 , urldate =

    Hui, Bo and Yuan, Haolin and Gong, Neil and Burlina, Philippe and Cao, Yinzhi , year =. doi:10.48550/arXiv.2405.06823 , urldate =. arXiv , keywords =:2405.06823 , primaryclass =

  72. [73]

    Persistent

    Zhang, Yiming and Rando, Javier and Evtimov, Ivan and Chi, Jianfeng and Smith, Eric Michael and Carlini, Nicholas and Tram. Persistent. 2024 , month = oct, number =. doi:10.48550/arXiv.2410.13722 , urldate =. arXiv , keywords =:2410.13722 , primaryclass =

  73. [74]

    Instructional

    Wu, Tong and Zhang, Shujian and Song, Kaiqiang and Xu, Silei and Zhao, Sanqiang and Agrawal, Ravi and Indurthi, Sathish Reddy and Xiang, Chong and Mittal, Prateek and Zhou, Wenxuan , year =. Instructional. doi:10.48550/arXiv.2410.09102 , urldate =. arXiv , keywords =:2410.09102 , primaryclass =

  74. [75]

    , year =

    Zverev, Egor and Kortukov, Evgenii and Panfilov, Alexander and Volkova, Alexandra and Tabesh, Soroush and Lapuschkin, Sebastian and Samek, Wojciech and Lampert, Christoph H. , year =. doi:10.48550/arXiv.2503.10566 , urldate =. arXiv , keywords =:2503.10566 , primaryclass =

  75. [76]

    2025 , eprint=

    Sentinel: SOTA model to protect against prompt injections , author=. 2025 , eprint=

  76. [77]

    2025 , howpublished =

    Qualifire AI , title =. 2025 , howpublished =

  77. [78]

    arXiv preprint arXiv:2312.14197 , year=

    Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models , author=. arXiv preprint arXiv:2312.14197 , year=

  78. [79]

    I njec A gent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

    Zhan, Qiusi and Liang, Zhixiang and Ying, Zifan and Kang, Daniel. I njec A gent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.624

  79. [80]

    2025 , volume =

    Abdelnabi, Sahar and Fay, Aideen and Cherubin, Giovanni and Salem, Ahmed and Fritz, Mario and Paverd, Andrew , booktitle =. 2025 , volume =. doi:10.1109/SaTML64287.2025.00011 , url =

  80. [81]

    F ollow B ench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models

    Jiang, Yuxin and Wang, Yufei and Zeng, Xingshan and Zhong, Wanjun and Li, Liangyou and Mi, Fei and Shang, Lifeng and Jiang, Xin and Liu, Qun and Wang, Wei. F ollow B ench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: ...

Showing first 80 references.