Security--Fidelity Tradeoffs: The Hidden Cost of Prompt Injection Defense

Ekraam Sabir; Haohan Wang; Mitchell Hermon; Rahul Gupta; Weitong Ruan

arxiv: 2606.30783 · v1 · pith:5DOZEEJInew · submitted 2026-06-29 · 💻 cs.CR · cs.AI

Security--Fidelity Tradeoffs: The Hidden Cost of Prompt Injection Defense

Mitchell Hermon , Rahul Gupta , Weitong Ruan , Ekraam Sabir , Haohan Wang This is my paper

Pith reviewed 2026-07-01 01:41 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords prompt injectionLLM defensesecurity-fidelity tradeoffSecFid benchmarkindirect prompt injectionLLM robustnessattack success rate

0 comments

The pith

Prompt injection defenses in LLMs create an unavoidable security-fidelity tradeoff.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that current defenses against indirect prompt injection resist attacks primarily by suppressing untrusted text, which damages performance on tasks that must preserve that text such as translation or document editing. Attack-success rates alone cannot detect this problem because a model that ignores an injection and one that processes it faithfully as data produce the same outcome under those metrics. The authors introduce the SecFid benchmark, constructed so that executing an injection, treating it as data, and ignoring it yield distinguishable outputs, allowing both security and fidelity to be measured directly. Across 1,168 examples and 48 model-defense configurations, no combination reaches high levels of both, with the best fidelity at 96.5 percent paired with only 47.8 percent security and the strongest security at 99.3 percent paired with 71.0 to 73.9 percent fidelity. A decision-theoretic argument shows that the right operating point depends on the deployment-specific costs of a successful hijack versus a lost span of content.

Core claim

No model or defense achieves both high security and high fidelity; the highest-fidelity model reaches 96.5 percent fidelity at 47.8 percent security while the most secure defenses reach 99.3 percent security at only 71.0 to 73.9 percent fidelity. Even defenses that deliver identical security scores differ in mechanism, with some converting hijacked inputs into faithful processing and others simply dropping benign content. Security metrics therefore capture only half of robustness because they omit the fidelity cost at which security is purchased.

What carries the argument

The SecFid benchmark, which forces distinguishable outputs for executing an injection, processing it as data, and ignoring it.

If this is right

Security metrics alone are insufficient because they hide the fidelity price paid by each defense.
The appropriate defense is not fixed but must be chosen according to the relative cost of a hijack versus a dropped span in the target deployment.
Defenses that repair hijacks into faithful outputs differ in effect from those that suppress benign content even when both report the same security score.
Reporting only attack success rates conceals the operational cost of the chosen defense.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

In applications where fidelity matters more than perfect security, such as collaborative document editing, lighter defenses may be preferable.
Adaptive systems could estimate the local cost ratio and switch defense strength per task or per input span.
Models trained to distinguish instructions from data without broad suppression might reduce the observed tradeoff.

Load-bearing premise

The SecFid benchmark construction produces distinguishable outputs for the three behaviors and its chosen tasks represent deployments where fidelity to untrusted text actually matters.

What would settle it

A single defense or model that scores above 90 percent on both security and fidelity across the SecFid examples would falsify the claim that the tradeoff is unavoidable.

Figures

Figures reproduced from arXiv: 2606.30783 by Ekraam Sabir, Haohan Wang, Mitchell Hermon, Rahul Gupta, Weitong Ruan.

**Figure 1.** Figure 1: Task-dependent interpretation of instruction-like text in untrusted inputs. The same product review contains an instructionlike sentence (Review B) that could be an indirect prompt injection. For translation, the sentence should be translated faithfully. For summarization, the sentence must not control the summary and may be summarized as content or omitted as irrelevant, depending on the task objective. … view at source ↗

**Figure 2.** Figure 2: The security–fidelity tradeoff in practice. Each point shows a model in the Security–Fidelity space. Among these models, no system achieves both high security and high fidelity; instead, models spread along a frontier. See [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Reasoning mainly increases security, not fidelity. Each trajectory tracks a model from its non-reasoning or minimalreasoning baseline through low, medium, and high reasoning. The average trajectory moves mostly rightward: execution falls, while fidelity remains roughly flat or slightly decreases [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Repair versus suppression of base-model hijacks. Each bar conditions on the examples that the undefended base model executed. SECALIGN repairs the most former hijacks while DefensiveTokens suppresses the most. 4.3. Learning to Repair Rather Than Suppress We briefly explore one way the SECFID benchmark can be used not only to diagnose defenses but to improve them. Most prompt-injection defenses train only o… view at source ↗

read the original abstract

We identify a security-fidelity tradeoff in defending LLMs against indirect prompt injection: defenses resist injected instructions largely by suppressing untrusted text, which corrupts tasks that must preserve it, such as translation and document editing. Attack-success metrics cannot see this, because a model that ignores an injection and one that faithfully processes it as data score identically. We introduce SecFid, a benchmark built so that executing an injection, processing it as data, and ignoring it produce distinguishable outputs. This makes fidelity measurable and exposes a frontier: across 1,168 examples and 48 configurations, no model or defense achieves both objectives. The highest-fidelity model reaches 96.5% fidelity at 47.8% security, while the most secure defenses invert this, at 99.3% security but only 71.0%-73.9% fidelity. Even defenses with identical security differ in how they earn it: some repair hijacks into faithful processing, others simply suppress benign content. A decision-theoretic analysis shows why no fixed choice can be right everywhere: the correct behavior is not a property of the defense but of the deployment, set by its relative cost of a hijack versus a dropped span. Security alone therefore measures only half of robustness, and reporting it without fidelity hides the price at which it was bought.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows that prompt injection defenses trade fidelity for security in ways standard metrics miss, and the new SecFid benchmark makes that tradeoff measurable.

read the letter

The main point is that attack-success rates alone don't capture the full picture for defenses, because models that suppress untrusted text score the same as ones that process it faithfully. SecFid fixes this by designing tasks where executing an injection, handling it as data, and ignoring it produce distinct outputs, then measures both security and fidelity across 1,168 examples and 48 configurations.

What stands out is the empirical frontier: the best fidelity reaches 96.5% but only 47.8% security, while the strongest security hits 99.3% at 71-73.9% fidelity. They also note that different defenses achieve security through different routes, some by repairing hijacks into faithful output and others by dropping content. The decision-theoretic framing is straightforward and explains why the right balance depends on the relative cost of a hijack versus a dropped span in a given deployment.

The work is new in making fidelity a first-class, quantifiable concern rather than an afterthought. It does a clean job of showing why prior metrics conflate behaviors that matter for real tasks like translation or editing.

The soft spot is the lack of reported validation on how reliably the output distinctions hold. The abstract states the benchmark is built so the three behaviors separate, but without numbers on labeling error rates, overlap between categories, or sensitivity to the specific rules for each task, the exact percentages could move. That's not fatal, but it is the part that needs the most scrutiny in the methods.

This is for people who evaluate or deploy LLM systems that handle mixed trusted and untrusted inputs. It deserves peer review because it directly challenges how robustness is scored and supplies data to support the claim.

Referee Report

2 major / 2 minor

Summary. The paper claims that defenses against indirect prompt injection in LLMs incur a hidden fidelity cost by suppressing untrusted text, which harms tasks requiring faithful processing of that text (e.g., translation, editing). Standard attack-success metrics cannot distinguish ignoring an injection from processing it as data. The authors introduce the SecFid benchmark, constructed so that executing an injection, processing it as data, and ignoring it yield distinguishable outputs. Across 1,168 examples and 48 configurations, no model or defense achieves both high security and high fidelity; the highest-fidelity model reaches 96.5% fidelity at 47.8% security, while the most secure reach 99.3% security at 71.0-73.9% fidelity. A decision-theoretic analysis shows the optimal behavior depends on the relative cost of hijack versus dropped span in a given deployment.

Significance. If the SecFid results hold, the work is significant for LLM security research: it demonstrates that security-only metrics capture only half of robustness and provides a decision-theoretic lens for deployment-specific tradeoffs. The empirical frontier across many configurations and the distinction between repair-based vs. suppression-based defenses are concrete contributions that could shift how defenses are evaluated and reported.

major comments (2)

[Abstract / benchmark construction] Abstract and benchmark construction section: the central claim that 'no model or defense achieves both' and the reported frontier percentages rest on the SecFid labeling scheme reliably distinguishing 'processing injection as data' from execution or suppression. The manuscript asserts the benchmark is 'built so that' the behaviors are distinguishable but provides no quantitative validation (e.g., error rates, agreement metrics, or sensitivity analysis) of the string-matching/semantic rules across the 48 configurations or 1,168 examples. This is load-bearing for the fidelity numbers.
[Results] Results section (1,168 examples, 48 configurations): the reported security and fidelity percentages lack accompanying details on exclusion criteria, statistical tests for the frontier, or robustness checks against post-hoc choices in task/example selection. Without these, it is not possible to assess whether the tradeoff is an artifact of benchmark construction rather than a general phenomenon.

minor comments (2)

[Abstract] The abstract uses an en-dash in the title ('Security--Fidelity') that should be rendered consistently in the body and references.
[Decision-theoretic analysis] Clarify in the decision-theoretic section whether the cost ratio is treated as a free parameter or derived from any empirical distribution of deployment scenarios.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which highlight important areas for strengthening the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses

Referee: [Abstract / benchmark construction] Abstract and benchmark construction section: the central claim that 'no model or defense achieves both' and the reported frontier percentages rest on the SecFid labeling scheme reliably distinguishing 'processing injection as data' from execution or suppression. The manuscript asserts the benchmark is 'built so that' the behaviors are distinguishable but provides no quantitative validation (e.g., error rates, agreement metrics, or sensitivity analysis) of the string-matching/semantic rules across the 48 configurations or 1,168 examples. This is load-bearing for the fidelity numbers.

Authors: We agree that explicit quantitative validation of the labeling scheme is necessary to support the fidelity claims. The SecFid rules are deterministic and task-specific (string matching for outputs like injected commands vs. task results incorporating the span vs. omission), but the manuscript does not report agreement metrics or sensitivity. We will add a dedicated validation subsection reporting: (i) inter-annotator agreement on a random sample of 200 examples (targeting >95% agreement on category assignment), (ii) error rates from manual review, and (iii) sensitivity analysis by varying matching thresholds. This revision will make the load-bearing aspect transparent. revision: yes
Referee: [Results] Results section (1,168 examples, 48 configurations): the reported security and fidelity percentages lack accompanying details on exclusion criteria, statistical tests for the frontier, or robustness checks against post-hoc choices in task/example selection. Without these, it is not possible to assess whether the tradeoff is an artifact of benchmark construction rather than a general phenomenon.

Authors: All 1,168 examples were included with no exclusions applied. Task/example selection prioritized cases where the three behaviors produce distinguishable outputs by construction. We will revise the results section to include: explicit confirmation of full inclusion, bootstrap confidence intervals on all reported percentages, McNemar's tests for statistical significance of differences supporting the frontier, and a robustness subsection with results on random subsamples (e.g., 500 examples) to check sensitivity to selection. These additions will address concerns about artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark measurements with no self-referential derivations or fitted predictions

full rationale

The paper's central claims consist of direct empirical measurements of security and fidelity on the introduced SecFid benchmark across 1,168 examples and 48 configurations. No equations, parameters, or derivations are presented that reduce the reported percentages to quantities defined by the paper's own fits or self-citations. The benchmark is described as constructed to distinguish behaviors, but the results are observational outputs rather than tautological re-expressions of inputs. This is a standard empirical evaluation with no load-bearing self-citation chains or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that SecFid produces distinguishable outputs for the three behaviors and that the 1,168 examples adequately sample the relevant task space; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption SecFid benchmark construction produces outputs that reliably distinguish executing an injection, processing it as data, and ignoring it.
This premise is required for fidelity to be measurable and for the reported tradeoff to be observable; it is invoked by the claim that attack-success metrics cannot see the fidelity cost.

pith-pipeline@v0.9.1-grok · 5778 in / 1337 out tokens · 33633 ms · 2026-07-01T01:41:31.273161+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

129 extracted references · 55 canonical work pages · 18 internal anchors

[1]

Lampert , booktitle=

Egor Zverev and Sahar Abdelnabi and Soroush Tabesh and Mario Fritz and Christoph H. Lampert , booktitle=. Can
[2]

Formalizing and

Liu, Yupei and Jia, Yuqi and Geng, Runpeng and Jia, Jinyuan and Gong, Neil Zhenqiang , year = 2024, pages =. Formalizing and. 33rd

2024
[3]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici, Gheorghe and Bieber, Eric and Schaekermann, Mike and others , year =. doi:10.48550/arXiv.2507.06261 , url =. 2507.06261 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.06261
[4]

Kandpal, Nikhil and Lester, Brian and Raffel, Colin and Majstorovic, Sebastian and Biderman, Stella and Abbasi, Baber and Soldaini, Luca and Shippole, Enrico and Cooper, A. Feder and Skowron, Aviya and Kirchenbauer, John and Longpre, Shayne and Sutawika, Lintang and Albalak, Alon and Xu, Zhenlin and Penedo, Guilherme and Allal, Loubna Ben and Bakouch, Eli...

work page doi:10.48550/arxiv.2506.05209 2025
[6]

Defending

Hines, Keegan and Lopez, Gary and Hall, Matthew and Zarfati, Federico and Zunger, Yonatan and Kiciman, Emre , year = 2024, month = mar, journal =. Defending

2024
[7]

Vera, Henrique Schechter and Dua, Sahil and Zhang, Biao and Salz, Daniel and Mullins, Ryan and Panyam, Sindhu Raghuram and Smoot, Sara and Naim, Iftekhar and Zou, Joe and Chen, Feiyang and Cer, Daniel and Lisak, Alice and Choi, Min and Gonzalez, Lucas and Sanseviero, Omar and Cameron, Glenn and Ballantyne, Ian and Black, Kat and Chen, Kaifeng and Wang, We...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.20354
[8]

Ignore Previous Prompt: Attack Techniques For Language Models

Ignore Previous Prompt: Attack Techniques For Language Models , author=. arXiv preprint arXiv:2211.09527 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[9]

arXiv preprint arXiv:2507.02735 , year=

Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks , author=. arXiv preprint arXiv:2507.02735 , year=

work page arXiv
[10]

Qwen and Yang, An and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Li, Chengyuan and Liu, Dayiheng and Huang, Fei and Wei, Haoran and Lin, Huan and Yang, Jian and Tu, Jianhong and Zhang, Jianwei and Yang, Jianxin and Yang, Jiaxi and Zhou, Jingren and Lin, Junyang and Dang, Kai and Lu, Keming and Bao, Keqin and Yang, Ke...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.15115 2025
[11]

Defending

Chen, Sizhe and Wang, Yizhu and Carlini, Nicholas and Sitawarin, Chawin and Wagner, David , year = 2026, month = dec, series =. Defending. Proceedings of the 18th. doi:10.1145/3733799.3762982 , urldate =

work page doi:10.1145/3733799.3762982 2026
[12]

Grattafiori, Aaron and Dubey, Abhimanyu and Jauhri, Abhinav and others , year =. The. doi:10.48550/arXiv.2407.21783 , url =. 2407.21783 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.21783
[13]

Introducing GPT-5.2 , date =
[14]

Doshi, Tulsee , title =
[15]

Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security , pages =

Greshake, Kai and Abdelnabi, Sahar and Mishra, Shailesh and Endres, Christoph and Holz, Thorsten and Fritz, Mario , title =. Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security , pages =. 2023 , isbn =. doi:10.1145/3605764.3623985 , abstract =

work page doi:10.1145/3605764.3623985 2023
[16]

Advances in Neural Information Processing Systems , volume =

Debenedetti, Edoardo and Zhang, Jie and Balunovic, Mislav and. Advances in Neural Information Processing Systems , volume =
[17]

Bridging the Editing Gap in LLM s: F ine E dit for Precise and Targeted Text Modifications

Zeng, Yiming and Yu, Wanhao and Li, Zexin and Ren, Tao and Ma, Yu and Cao, Jinghan and Chen, Xiyan and Yu, Tingting. Bridging the Editing Gap in LLM s: F ine E dit for Precise and Targeted Text Modifications. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.118

work page doi:10.18653/v1/2025.findings-emnlp.118 2025
[18]

2025 , month = oct, url =

2025
[19]

2026 , month = feb, url =

2026
[20]

2025 , month = dec, url =

2025
[21]

2026 , month = may, url =

2026
[22]

2026 , month = mar, url =

2026
[23]

2026 , month = mar, howpublished =

2026
[24]

Prompt Injection attack against LLM-integrated Applications

Prompt Injection attack against LLM-integrated Applications , author=. arXiv preprint arXiv:2306.05499 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[25]

33rd USENIX Security Symposium (USENIX Security 24) , pages=

Formalizing and benchmarking prompt injection attacks and defenses , author=. 33rd USENIX Security Symposium (USENIX Security 24) , pages=
[26]

Slashdot , year =

Bing Chat Succumbs to Prompt Injection Attack, Spills Its Secrets , author =. Slashdot , year =
[27]

Black Duck Blog , year =

CyRC Advisory: Prompt Injection in EmailGPT , author =. Black Duck Blog , year =
[28]

Positive Security Blog , year =

Hacking Auto-GPT and Escaping Its Docker Container , author =. Positive Security Blog , year =
[29]

AWS Security Blog , year =

Safeguard Your Generative AI Workloads from Prompt Injections , author =. AWS Security Blog , year =
[30]

arXiv preprint arXiv:2503.00061 , year=

Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents , author=. arXiv preprint arXiv:2503.00061 , year=

work page arXiv
[31]

USENIX Security Symposium , year=

StruQ: Defending against prompt injection with structured queries , author=. USENIX Security Symposium , year=
[32]

Advances in Neural Information Processing Systems , volume=

Tree of attacks: Jailbreaking black-box llms automatically , author=. Advances in Neural Information Processing Systems , volume=
[33]

IBM Think Insights , year =

Prevent Prompt Injection Attacks , author =. IBM Think Insights , year =
[34]

Advances in neural information processing systems , volume=

Learning robust global representations by penalizing local predictive power , author=. Advances in neural information processing systems , volume=
[35]

Computer vision--ECCV 2020: 16th European conference, Glasgow, UK, August 23--28, 2020, proceedings, part II 16 , pages=

Self-challenging improves cross-domain generalization , author=. Computer vision--ECCV 2020: 16th European conference, Glasgow, UK, August 23--28, 2020, proceedings, part II 16 , pages=. 2020 , organization=

2020
[36]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

High-frequency component helps explain the generalization of convolutional neural networks , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[37]

arXiv preprint arXiv:2401.17263 , year=

Robust prompt optimization for defending language models against jailbreaking attacks , author=. arXiv preprint arXiv:2401.17263 , year=

work page arXiv
[38]

arXiv preprint arXiv:2402.03299 , year=

Guard: Role-playing to generate natural-language jailbreakings to test guideline adherence of large language models , author=. arXiv preprint arXiv:2402.03299 , year=

work page arXiv
[39]

arXiv preprint arXiv:2407.01599 , year=

Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models , author=. arXiv preprint arXiv:2407.01599 , year=

work page arXiv
[40]

2024 , month = oct, urldate =

The Thirteenth International Conference on Learning Representations , author =. 2024 , month = oct, urldate =

2024
[41]

StruQ: Defending against prompt injection with structured queries

Chen, Sizhe and Piet, Julien and Sitawarin, Chawin and Wagner, David , date =. doi:10.48550/arXiv.2402.06363 , url =. 2402.06363 , eprinttype =

work page doi:10.48550/arxiv.2402.06363
[42]

Jailbreaking

Andriushchenko, Maksym and Croce, Francesco and Flammarion, Nicolas , date =. Jailbreaking. doi:10.48550/arXiv.2404.02151 , url =. 2404.02151 , eprinttype =

work page doi:10.48550/arxiv.2404.02151
[43]

Team, Gemma and Kamath, Aishwarya and Ferret, Johan and Pathak, Shreya and Vieillard, Nino and Merhej, Ramona and Perrin, Sarah and Matejovicova, Tatiana and Ram. Gemma 3. 2025 , month = mar, number =. doi:10.48550/arXiv.2503.19786 , urldate =. arXiv , keywords =:2503.19786 , primaryclass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.19786 2025
[44]

Yang, An and Li, Anfeng and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Gao, Chang and Huang, Chengen and Lv, Chenxu and Zheng, Chujie and Liu, Dayiheng and Zhou, Fan and Huang, Fei and Hu, Feng and Ge, Hao and Wei, Haoran and Lin, Huan and Tang, Jialong and Yang, Jian and Tu, Jianhong and Zhang, Jianwei and Yang, Jia...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.09388
[45]

and Chen, Pin-Yu , editor =

Hung, Kuo-Han and Ko, Ching-Yun and Rawat, Ambrish and Chung, I-Hsin and Hsu, Winston H. and Chen, Pin-Yu , editor =. Attention. Findings of the. 2025 , month = apr, pages =. doi:10.18653/v1/2025.findings-naacl.123 , urldate =

work page doi:10.18653/v1/2025.findings-naacl.123 2025
[46]

Proceedings of the 25th

Optuna: A Next-generation Hyperparameter Optimization Framework , author=. Proceedings of the 25th
[47]

Minaee, Shervin and Mikolov, Tomas and Nikzad, Narjes and Chenaghlu, Meysam and Socher, Richard and Amatriain, Xavier and Gao, Jianfeng , year =. Large. doi:10.48550/arXiv.2402.06196 , urldate =. arXiv , keywords =:2402.06196 , primaryclass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.06196
[48]

Zou, Xiaotian and Chen, Yongkang and Li, Ke , year =. Is the. doi:10.48550/arXiv.2402.14857 , urldate =. arXiv , keywords =:2402.14857 , primaryclass =

work page doi:10.48550/arxiv.2402.14857
[49]

doi:10.48550/arXiv.2505.12592 , url =

Jeoung, Sullam and Chen, Yueyan and Zhang, Yi and Wang, Shuai and Ding, Haibo and Cheong, Lin Lee , date =. doi:10.48550/arXiv.2505.12592 , url =. 2505.12592 , eprinttype =

work page doi:10.48550/arxiv.2505.12592
[50]

Defending Against Indirect Prompt Injection Attacks With Spotlighting

Hines, Keegan and Lopez, Gary and Hall, Matthew and Zarfati, Federico and Zunger, Yonatan and Kiciman, Emre , year =. Defending. doi:10.48550/arXiv.2403.14720 , urldate =. arXiv , keywords =:2403.14720 , primaryclass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2403.14720
[51]

Mu, Norman and Lu, Jonathan and Lavery, Michael and Wagner, David , year =. A. doi:10.48550/arXiv.2502.12197 , urldate =. arXiv , keywords =:2502.12197 , primaryclass =

work page doi:10.48550/arxiv.2502.12197
[52]

2023 , publisher =

ProtectAI.com , title =. 2023 , publisher =

2023
[53]

NeurIPS Datasets and Benchmarks Track , year=

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models , author=. NeurIPS Datasets and Benchmarks Track , year=
[54]

2024 , month = nov, type =

2024
[55]

Mu, Norman and Chen, Sarah and Wang, Zifan and Chen, Sizhe and Karamardian, David and Aljeraisy, Lulwa and Alomair, Basel and Hendrycks, Dan and Wagner, David , year =. Can. doi:10.48550/arXiv.2311.04235 , urldate =. arXiv , keywords =:2311.04235 , primaryclass =

work page doi:10.48550/arxiv.2311.04235
[56]

Mehta, Rajiv , date =
[57]

Booking.com Launches New AI Trip Planner to Enhance Travel Planning Experience , howpublished=

Booking.Com. Booking.com Launches New AI Trip Planner to Enhance Travel Planning Experience , howpublished=
[58]

2023 , month = mar, journal =

Introducing. 2023 , month = mar, journal =

2023
[59]

Coursera

Villasenor, Amber , date =. Coursera. 2024 , month = jun, journal =

2024
[60]

Walmart:

Gosby, Desiree , date =. Walmart:
[61]

Learn about the conversational
[62]

2024 , publisher =

ProtectAI.com , title =. 2024 , publisher =

2024
[63]

2024 , booktitle=

Protecting Your LLMs with Information Bottleneck , author=. 2024 , booktitle=

2024
[64]

2024 , eprint=

Defending Against Indirect Prompt Injection Attacks With Spotlighting , author=. 2024 , eprint=

2024
[65]

Defending

Xie, Yueqi and Yi, Jingwei and Shao, Jiawei and Curl, Justin and Lyu, Lingjuan and Chen, Qifeng and Xie, Xing and Wu, Fangzhao , year =. Defending. Nature Machine Intelligence , volume =. doi:10.1038/s42256-023-00765-8 , urldate =

work page doi:10.1038/s42256-023-00765-8
[66]

Wallace, Eric and Xiao, Kai and Leike, Reimar and Weng, Lilian and Heidecke, Johannes and Beutel, Alex , year =. The. doi:10.48550/arXiv.2404.13208 , urldate =. arXiv , keywords =:2404.13208 , primaryclass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.13208
[67]

2024 , eprint=

Prompt Injection attack against LLM-integrated Applications , author=. 2024 , eprint=

2024
[68]

2024 , eprint=

Jailbreak Attacks and Defenses Against Large Language Models: A Survey , author=. 2024 , eprint=

2024
[69]

2024 , howpublished =

2024
[70]

Shen, Xinyue and Chen, Zeyuan and Backes, Michael and Shen, Yun and Zhang, Yang , year =. ". Proceedings of the 2024 on. doi:10.1145/3658644.3670388 , urldate =

work page doi:10.1145/3658644.3670388 2024
[71]

, year =

Zverev, Egor and Abdelnabi, Sahar and Tabesh, Soroush and Fritz, Mario and Lampert, Christoph H. , year =. Can. doi:10.48550/arXiv.2403.06833 , urldate =. arXiv , keywords =:2403.06833 , primaryclass =

work page doi:10.48550/arxiv.2403.06833
[72]

doi:10.48550/arXiv.2405.06823 , urldate =

Hui, Bo and Yuan, Haolin and Gong, Neil and Burlina, Philippe and Cao, Yinzhi , year =. doi:10.48550/arXiv.2405.06823 , urldate =. arXiv , keywords =:2405.06823 , primaryclass =

work page doi:10.48550/arxiv.2405.06823
[73]

Persistent

Zhang, Yiming and Rando, Javier and Evtimov, Ivan and Chi, Jianfeng and Smith, Eric Michael and Carlini, Nicholas and Tram. Persistent. 2024 , month = oct, number =. doi:10.48550/arXiv.2410.13722 , urldate =. arXiv , keywords =:2410.13722 , primaryclass =

work page doi:10.48550/arxiv.2410.13722 2024
[74]

Instructional

Wu, Tong and Zhang, Shujian and Song, Kaiqiang and Xu, Silei and Zhao, Sanqiang and Agrawal, Ravi and Indurthi, Sathish Reddy and Xiang, Chong and Mittal, Prateek and Zhou, Wenxuan , year =. Instructional. doi:10.48550/arXiv.2410.09102 , urldate =. arXiv , keywords =:2410.09102 , primaryclass =

work page doi:10.48550/arxiv.2410.09102
[75]

, year =

Zverev, Egor and Kortukov, Evgenii and Panfilov, Alexander and Volkova, Alexandra and Tabesh, Soroush and Lapuschkin, Sebastian and Samek, Wojciech and Lampert, Christoph H. , year =. doi:10.48550/arXiv.2503.10566 , urldate =. arXiv , keywords =:2503.10566 , primaryclass =

work page doi:10.48550/arxiv.2503.10566
[76]

2025 , eprint=

Sentinel: SOTA model to protect against prompt injections , author=. 2025 , eprint=

2025
[77]

2025 , howpublished =

Qualifire AI , title =. 2025 , howpublished =

2025
[78]

arXiv preprint arXiv:2312.14197 , year=

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models , author=. arXiv preprint arXiv:2312.14197 , year=

work page arXiv
[79]

I njec A gent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

Zhan, Qiusi and Liang, Zhixiang and Ying, Zifan and Kang, Daniel. I njec A gent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.624

work page doi:10.18653/v1/2024.findings-acl.624 2024
[80]

2025 , volume =

Abdelnabi, Sahar and Fay, Aideen and Cherubin, Giovanni and Salem, Ahmed and Fritz, Mario and Paverd, Andrew , booktitle =. 2025 , volume =. doi:10.1109/SaTML64287.2025.00011 , url =

work page doi:10.1109/satml64287.2025.00011 2025
[81]

F ollow B ench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models

Jiang, Yuxin and Wang, Yufei and Zeng, Xingshan and Zhong, Wanjun and Li, Liangyou and Mi, Fei and Shang, Lifeng and Jiang, Xin and Liu, Qun and Wang, Wei. F ollow B ench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: ...

2024

Showing first 80 references.

[1] [1]

Lampert , booktitle=

Egor Zverev and Sahar Abdelnabi and Soroush Tabesh and Mario Fritz and Christoph H. Lampert , booktitle=. Can

[2] [2]

Formalizing and

Liu, Yupei and Jia, Yuqi and Geng, Runpeng and Jia, Jinyuan and Gong, Neil Zhenqiang , year = 2024, pages =. Formalizing and. 33rd

2024

[3] [3]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici, Gheorghe and Bieber, Eric and Schaekermann, Mike and others , year =. doi:10.48550/arXiv.2507.06261 , url =. 2507.06261 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.06261

[4] [4]

Kandpal, Nikhil and Lester, Brian and Raffel, Colin and Majstorovic, Sebastian and Biderman, Stella and Abbasi, Baber and Soldaini, Luca and Shippole, Enrico and Cooper, A. Feder and Skowron, Aviya and Kirchenbauer, John and Longpre, Shayne and Sutawika, Lintang and Albalak, Alon and Xu, Zhenlin and Penedo, Guilherme and Allal, Loubna Ben and Bakouch, Eli...

work page doi:10.48550/arxiv.2506.05209 2025

[5] [6]

Defending

Hines, Keegan and Lopez, Gary and Hall, Matthew and Zarfati, Federico and Zunger, Yonatan and Kiciman, Emre , year = 2024, month = mar, journal =. Defending

2024

[6] [7]

Vera, Henrique Schechter and Dua, Sahil and Zhang, Biao and Salz, Daniel and Mullins, Ryan and Panyam, Sindhu Raghuram and Smoot, Sara and Naim, Iftekhar and Zou, Joe and Chen, Feiyang and Cer, Daniel and Lisak, Alice and Choi, Min and Gonzalez, Lucas and Sanseviero, Omar and Cameron, Glenn and Ballantyne, Ian and Black, Kat and Chen, Kaifeng and Wang, We...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.20354

[7] [8]

Ignore Previous Prompt: Attack Techniques For Language Models

Ignore Previous Prompt: Attack Techniques For Language Models , author=. arXiv preprint arXiv:2211.09527 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[8] [9]

arXiv preprint arXiv:2507.02735 , year=

Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks , author=. arXiv preprint arXiv:2507.02735 , year=

work page arXiv

[9] [10]

Qwen and Yang, An and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Li, Chengyuan and Liu, Dayiheng and Huang, Fei and Wei, Haoran and Lin, Huan and Yang, Jian and Tu, Jianhong and Zhang, Jianwei and Yang, Jianxin and Yang, Jiaxi and Zhou, Jingren and Lin, Junyang and Dang, Kai and Lu, Keming and Bao, Keqin and Yang, Ke...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.15115 2025

[10] [11]

Defending

Chen, Sizhe and Wang, Yizhu and Carlini, Nicholas and Sitawarin, Chawin and Wagner, David , year = 2026, month = dec, series =. Defending. Proceedings of the 18th. doi:10.1145/3733799.3762982 , urldate =

work page doi:10.1145/3733799.3762982 2026

[11] [12]

Grattafiori, Aaron and Dubey, Abhimanyu and Jauhri, Abhinav and others , year =. The. doi:10.48550/arXiv.2407.21783 , url =. 2407.21783 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.21783

[12] [13]

Introducing GPT-5.2 , date =

[13] [14]

Doshi, Tulsee , title =

[14] [15]

Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security , pages =

Greshake, Kai and Abdelnabi, Sahar and Mishra, Shailesh and Endres, Christoph and Holz, Thorsten and Fritz, Mario , title =. Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security , pages =. 2023 , isbn =. doi:10.1145/3605764.3623985 , abstract =

work page doi:10.1145/3605764.3623985 2023

[15] [16]

Advances in Neural Information Processing Systems , volume =

Debenedetti, Edoardo and Zhang, Jie and Balunovic, Mislav and. Advances in Neural Information Processing Systems , volume =

[16] [17]

Bridging the Editing Gap in LLM s: F ine E dit for Precise and Targeted Text Modifications

Zeng, Yiming and Yu, Wanhao and Li, Zexin and Ren, Tao and Ma, Yu and Cao, Jinghan and Chen, Xiyan and Yu, Tingting. Bridging the Editing Gap in LLM s: F ine E dit for Precise and Targeted Text Modifications. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.118

work page doi:10.18653/v1/2025.findings-emnlp.118 2025

[17] [18]

2025 , month = oct, url =

2025

[18] [19]

2026 , month = feb, url =

2026

[19] [20]

2025 , month = dec, url =

2025

[20] [21]

2026 , month = may, url =

2026

[21] [22]

2026 , month = mar, url =

2026

[22] [23]

2026 , month = mar, howpublished =

2026

[23] [24]

Prompt Injection attack against LLM-integrated Applications

Prompt Injection attack against LLM-integrated Applications , author=. arXiv preprint arXiv:2306.05499 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[24] [25]

33rd USENIX Security Symposium (USENIX Security 24) , pages=

Formalizing and benchmarking prompt injection attacks and defenses , author=. 33rd USENIX Security Symposium (USENIX Security 24) , pages=

[25] [26]

Slashdot , year =

Bing Chat Succumbs to Prompt Injection Attack, Spills Its Secrets , author =. Slashdot , year =

[26] [27]

Black Duck Blog , year =

CyRC Advisory: Prompt Injection in EmailGPT , author =. Black Duck Blog , year =

[27] [28]

Positive Security Blog , year =

Hacking Auto-GPT and Escaping Its Docker Container , author =. Positive Security Blog , year =

[28] [29]

AWS Security Blog , year =

Safeguard Your Generative AI Workloads from Prompt Injections , author =. AWS Security Blog , year =

[29] [30]

arXiv preprint arXiv:2503.00061 , year=

Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents , author=. arXiv preprint arXiv:2503.00061 , year=

work page arXiv

[30] [31]

USENIX Security Symposium , year=

StruQ: Defending against prompt injection with structured queries , author=. USENIX Security Symposium , year=

[31] [32]

Advances in Neural Information Processing Systems , volume=

Tree of attacks: Jailbreaking black-box llms automatically , author=. Advances in Neural Information Processing Systems , volume=

[32] [33]

IBM Think Insights , year =

Prevent Prompt Injection Attacks , author =. IBM Think Insights , year =

[33] [34]

Advances in neural information processing systems , volume=

Learning robust global representations by penalizing local predictive power , author=. Advances in neural information processing systems , volume=

[34] [35]

Computer vision--ECCV 2020: 16th European conference, Glasgow, UK, August 23--28, 2020, proceedings, part II 16 , pages=

Self-challenging improves cross-domain generalization , author=. Computer vision--ECCV 2020: 16th European conference, Glasgow, UK, August 23--28, 2020, proceedings, part II 16 , pages=. 2020 , organization=

2020

[35] [36]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

High-frequency component helps explain the generalization of convolutional neural networks , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

[36] [37]

arXiv preprint arXiv:2401.17263 , year=

Robust prompt optimization for defending language models against jailbreaking attacks , author=. arXiv preprint arXiv:2401.17263 , year=

work page arXiv

[37] [38]

arXiv preprint arXiv:2402.03299 , year=

Guard: Role-playing to generate natural-language jailbreakings to test guideline adherence of large language models , author=. arXiv preprint arXiv:2402.03299 , year=

work page arXiv

[38] [39]

arXiv preprint arXiv:2407.01599 , year=

Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models , author=. arXiv preprint arXiv:2407.01599 , year=

work page arXiv

[39] [40]

2024 , month = oct, urldate =

The Thirteenth International Conference on Learning Representations , author =. 2024 , month = oct, urldate =

2024

[40] [41]

StruQ: Defending against prompt injection with structured queries

Chen, Sizhe and Piet, Julien and Sitawarin, Chawin and Wagner, David , date =. doi:10.48550/arXiv.2402.06363 , url =. 2402.06363 , eprinttype =

work page doi:10.48550/arxiv.2402.06363

[41] [42]

Jailbreaking

Andriushchenko, Maksym and Croce, Francesco and Flammarion, Nicolas , date =. Jailbreaking. doi:10.48550/arXiv.2404.02151 , url =. 2404.02151 , eprinttype =

work page doi:10.48550/arxiv.2404.02151

[42] [43]

Team, Gemma and Kamath, Aishwarya and Ferret, Johan and Pathak, Shreya and Vieillard, Nino and Merhej, Ramona and Perrin, Sarah and Matejovicova, Tatiana and Ram. Gemma 3. 2025 , month = mar, number =. doi:10.48550/arXiv.2503.19786 , urldate =. arXiv , keywords =:2503.19786 , primaryclass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.19786 2025

[43] [44]

Yang, An and Li, Anfeng and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Gao, Chang and Huang, Chengen and Lv, Chenxu and Zheng, Chujie and Liu, Dayiheng and Zhou, Fan and Huang, Fei and Hu, Feng and Ge, Hao and Wei, Haoran and Lin, Huan and Tang, Jialong and Yang, Jian and Tu, Jianhong and Zhang, Jianwei and Yang, Jia...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.09388

[44] [45]

and Chen, Pin-Yu , editor =

Hung, Kuo-Han and Ko, Ching-Yun and Rawat, Ambrish and Chung, I-Hsin and Hsu, Winston H. and Chen, Pin-Yu , editor =. Attention. Findings of the. 2025 , month = apr, pages =. doi:10.18653/v1/2025.findings-naacl.123 , urldate =

work page doi:10.18653/v1/2025.findings-naacl.123 2025

[45] [46]

Proceedings of the 25th

Optuna: A Next-generation Hyperparameter Optimization Framework , author=. Proceedings of the 25th

[46] [47]

Minaee, Shervin and Mikolov, Tomas and Nikzad, Narjes and Chenaghlu, Meysam and Socher, Richard and Amatriain, Xavier and Gao, Jianfeng , year =. Large. doi:10.48550/arXiv.2402.06196 , urldate =. arXiv , keywords =:2402.06196 , primaryclass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.06196

[47] [48]

Zou, Xiaotian and Chen, Yongkang and Li, Ke , year =. Is the. doi:10.48550/arXiv.2402.14857 , urldate =. arXiv , keywords =:2402.14857 , primaryclass =

work page doi:10.48550/arxiv.2402.14857

[48] [49]

doi:10.48550/arXiv.2505.12592 , url =

Jeoung, Sullam and Chen, Yueyan and Zhang, Yi and Wang, Shuai and Ding, Haibo and Cheong, Lin Lee , date =. doi:10.48550/arXiv.2505.12592 , url =. 2505.12592 , eprinttype =

work page doi:10.48550/arxiv.2505.12592

[49] [50]

Defending Against Indirect Prompt Injection Attacks With Spotlighting

Hines, Keegan and Lopez, Gary and Hall, Matthew and Zarfati, Federico and Zunger, Yonatan and Kiciman, Emre , year =. Defending. doi:10.48550/arXiv.2403.14720 , urldate =. arXiv , keywords =:2403.14720 , primaryclass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2403.14720

[50] [51]

Mu, Norman and Lu, Jonathan and Lavery, Michael and Wagner, David , year =. A. doi:10.48550/arXiv.2502.12197 , urldate =. arXiv , keywords =:2502.12197 , primaryclass =

work page doi:10.48550/arxiv.2502.12197

[51] [52]

2023 , publisher =

ProtectAI.com , title =. 2023 , publisher =

2023

[52] [53]

NeurIPS Datasets and Benchmarks Track , year=

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models , author=. NeurIPS Datasets and Benchmarks Track , year=

[53] [54]

2024 , month = nov, type =

2024

[54] [55]

Mu, Norman and Chen, Sarah and Wang, Zifan and Chen, Sizhe and Karamardian, David and Aljeraisy, Lulwa and Alomair, Basel and Hendrycks, Dan and Wagner, David , year =. Can. doi:10.48550/arXiv.2311.04235 , urldate =. arXiv , keywords =:2311.04235 , primaryclass =

work page doi:10.48550/arxiv.2311.04235

[55] [56]

Mehta, Rajiv , date =

[56] [57]

Booking.com Launches New AI Trip Planner to Enhance Travel Planning Experience , howpublished=

Booking.Com. Booking.com Launches New AI Trip Planner to Enhance Travel Planning Experience , howpublished=

[57] [58]

2023 , month = mar, journal =

Introducing. 2023 , month = mar, journal =

2023

[58] [59]

Coursera

Villasenor, Amber , date =. Coursera. 2024 , month = jun, journal =

2024

[59] [60]

Walmart:

Gosby, Desiree , date =. Walmart:

[60] [61]

Learn about the conversational

[61] [62]

2024 , publisher =

ProtectAI.com , title =. 2024 , publisher =

2024

[62] [63]

2024 , booktitle=

Protecting Your LLMs with Information Bottleneck , author=. 2024 , booktitle=

2024

[63] [64]

2024 , eprint=

Defending Against Indirect Prompt Injection Attacks With Spotlighting , author=. 2024 , eprint=

2024

[64] [65]

Defending

Xie, Yueqi and Yi, Jingwei and Shao, Jiawei and Curl, Justin and Lyu, Lingjuan and Chen, Qifeng and Xie, Xing and Wu, Fangzhao , year =. Defending. Nature Machine Intelligence , volume =. doi:10.1038/s42256-023-00765-8 , urldate =

work page doi:10.1038/s42256-023-00765-8

[65] [66]

Wallace, Eric and Xiao, Kai and Leike, Reimar and Weng, Lilian and Heidecke, Johannes and Beutel, Alex , year =. The. doi:10.48550/arXiv.2404.13208 , urldate =. arXiv , keywords =:2404.13208 , primaryclass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.13208

[66] [67]

2024 , eprint=

Prompt Injection attack against LLM-integrated Applications , author=. 2024 , eprint=

2024

[67] [68]

2024 , eprint=

Jailbreak Attacks and Defenses Against Large Language Models: A Survey , author=. 2024 , eprint=

2024

[68] [69]

2024 , howpublished =

2024

[69] [70]

Shen, Xinyue and Chen, Zeyuan and Backes, Michael and Shen, Yun and Zhang, Yang , year =. ". Proceedings of the 2024 on. doi:10.1145/3658644.3670388 , urldate =

work page doi:10.1145/3658644.3670388 2024

[70] [71]

, year =

Zverev, Egor and Abdelnabi, Sahar and Tabesh, Soroush and Fritz, Mario and Lampert, Christoph H. , year =. Can. doi:10.48550/arXiv.2403.06833 , urldate =. arXiv , keywords =:2403.06833 , primaryclass =

work page doi:10.48550/arxiv.2403.06833

[71] [72]

doi:10.48550/arXiv.2405.06823 , urldate =

Hui, Bo and Yuan, Haolin and Gong, Neil and Burlina, Philippe and Cao, Yinzhi , year =. doi:10.48550/arXiv.2405.06823 , urldate =. arXiv , keywords =:2405.06823 , primaryclass =

work page doi:10.48550/arxiv.2405.06823

[72] [73]

Persistent

Zhang, Yiming and Rando, Javier and Evtimov, Ivan and Chi, Jianfeng and Smith, Eric Michael and Carlini, Nicholas and Tram. Persistent. 2024 , month = oct, number =. doi:10.48550/arXiv.2410.13722 , urldate =. arXiv , keywords =:2410.13722 , primaryclass =

work page doi:10.48550/arxiv.2410.13722 2024

[73] [74]

Instructional

Wu, Tong and Zhang, Shujian and Song, Kaiqiang and Xu, Silei and Zhao, Sanqiang and Agrawal, Ravi and Indurthi, Sathish Reddy and Xiang, Chong and Mittal, Prateek and Zhou, Wenxuan , year =. Instructional. doi:10.48550/arXiv.2410.09102 , urldate =. arXiv , keywords =:2410.09102 , primaryclass =

work page doi:10.48550/arxiv.2410.09102

[74] [75]

, year =

Zverev, Egor and Kortukov, Evgenii and Panfilov, Alexander and Volkova, Alexandra and Tabesh, Soroush and Lapuschkin, Sebastian and Samek, Wojciech and Lampert, Christoph H. , year =. doi:10.48550/arXiv.2503.10566 , urldate =. arXiv , keywords =:2503.10566 , primaryclass =

work page doi:10.48550/arxiv.2503.10566

[75] [76]

2025 , eprint=

Sentinel: SOTA model to protect against prompt injections , author=. 2025 , eprint=

2025

[76] [77]

2025 , howpublished =

Qualifire AI , title =. 2025 , howpublished =

2025

[77] [78]

arXiv preprint arXiv:2312.14197 , year=

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models , author=. arXiv preprint arXiv:2312.14197 , year=

work page arXiv

[78] [79]

I njec A gent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

Zhan, Qiusi and Liang, Zhixiang and Ying, Zifan and Kang, Daniel. I njec A gent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.624

work page doi:10.18653/v1/2024.findings-acl.624 2024

[79] [80]

2025 , volume =

Abdelnabi, Sahar and Fay, Aideen and Cherubin, Giovanni and Salem, Ahmed and Fritz, Mario and Paverd, Andrew , booktitle =. 2025 , volume =. doi:10.1109/SaTML64287.2025.00011 , url =

work page doi:10.1109/satml64287.2025.00011 2025

[80] [81]

F ollow B ench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models

Jiang, Yuxin and Wang, Yufei and Zeng, Xingshan and Zhong, Wanjun and Li, Liangyou and Mi, Fei and Shang, Lifeng and Jiang, Xin and Liu, Qun and Wang, Wei. F ollow B ench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: ...

2024