ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents

Hwan Chang; Hwanhee Lee; Yonghyun Jun

arxiv: 2509.22830 · v3 · submitted 2025-09-26 · 💻 cs.CL

ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents

Hwan Chang , Yonghyun Jun , Hwanhee Lee This is my paper

Pith reviewed 2026-05-18 12:18 UTC · model grok-4.3

classification 💻 cs.CL

keywords prompt injectionLLM agentschat templatesindirect prompt injectionmulti-turn attacksadversarial attacksAI agent securitytemplate exploitation

0 comments

The pith

Formatting malicious instructions to mimic chat templates makes prompt injection far more effective against LLM agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that indirect prompt injection succeeds at much higher rates when attackers wrap malicious commands in the same structured format that LLMs use for their own conversations. This approach takes advantage of how models are trained to treat chat-template elements as authoritative instructions rather than ordinary text. Experiments on two agent benchmarks show the new method raises average success from roughly 5 percent to 32 percent on one suite and from 15 percent to 46 percent on the other, with multi-turn persuasive versions reaching 52 percent. Existing prompt defenses prove largely ineffective, and the payloads transfer across both open and closed models. A reader should care because many deployed agents routinely incorporate untrusted external data without mechanisms that detect or neutralize template-like structures.

Core claim

ChatInject formats malicious payloads to imitate the native chat templates that LLMs expect in their conversation history, causing the model to interpret injected instructions as legitimate system or user messages. A persuasion-driven multi-turn variant further primes the agent over successive turns to accept and act on otherwise suspicious commands. Tests across frontier models confirm substantially higher attack success rates than plain-text baselines, strong cross-model transferability even against closed-source systems, and failure of standard prompt-based defenses.

What carries the argument

Chat-template mimicry, in which external outputs are written with role markers and formatting that match the model's internal conversation structure so the LLM treats the injected content as its own dialogue history.

If this is right

Average attack success rates increase from 5.18 percent to 32.05 percent on AgentDojo and from 15.13 percent to 45.90 percent on InjecAgent.
Multi-turn dialogue versions reach an average 52.33 percent success rate on InjecAgent.
Template-based payloads transfer effectively across models, including closed-source LLMs whose exact templates are unknown.
Existing prompt-based defenses fail to block the attacks, especially the multi-turn variants.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers could add lightweight parsers that detect and neutralize role markers in any external input before it reaches the model.
The same mimicry idea may apply to other structured formats such as JSON tool outputs or XML responses that agents consume.
Training regimes that explicitly teach models to discount chat-like patterns arriving from external sources could reduce this surface.
Benchmarking suites for agent security might benefit from including template-mimicry test cases to measure real-world robustness.

Load-bearing premise

The tested agent frameworks pass external outputs to the LLM without any sanitization step that would strip or normalize chat-template-like formatting.

What would settle it

Run the same attacks on an agent system that first removes every occurrence of role indicators or template markers from external data before the LLM sees it; if success rates drop back to the low single-digit levels of plain-text attacks, the template-mimicry mechanism is confirmed as the main driver.

Figures

Figures reproduced from arXiv: 2509.22830 by Hwan Chang, Hwanhee Lee, Yonghyun Jun.

**Figure 2.** Figure 2: Four attack payload variants embedded in the tool response [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Performance of cross-model ChatInject attacks. As template similarity increases, the ASR (left) rises, while the model’s Utility (right) degrades. Measuring Template Similarity Motivated by the observation that template-aligned payloads can subvert inherent role hierarchies, we hypothesize that transferability increases with the similarity between the injected template and the target model’s native templat… view at source ↗

**Figure 4.** Figure 4: Visualization of the mean and std. for Single vs. MoT settings; the dashed line marks ASR of Default InjecPrompt. Prior sections showed that wrapping a malicious payload with a model’s native chat template boosts ASR, and that similar foreign templates can also be damaging. In practice, however, an attacker may not know the target agent’s backbone LLM. Selecting a single arbitrary template has a low chance… view at source ↗

**Figure 5.** Figure 5: Comparison of ASR (top) and Utility (bottom) for Qwen-3 and Grok-3 across defense configurations, aggregated over all attack types. Baselines are the per-model scores without defense: Default InjecPrompt and Default Multi-turn. 8 [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: ASR under 3 types of template perturbations on AgentDojo for 3 models. Bars show [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Effects of turn count and persuasion taxonomy on attack success and utility. [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗

**Figure 8.** Figure 8: MoT Attackers for Different Template Sorting (Descending vs. Ascending) and the Number of Templates. Similarity Ranking: Kimi-K2 > GPT-oss > Llama-4 > GLM-4.5 > Grok-2 Random (Llama-4 → GLM-4.5 → Qwen-3 → Kimi-K2 → Grok-2 → GPT-oss): <|end|><im end|> <|separator|><|im end|><|eot|> <|begin of text|><|header start|>system<|header end|> [gMASK]<sop><|system|><|im system|>system<|im middle|>System: <|start|>d… view at source ↗

read the original abstract

The growing deployment of large language model (LLM) based agents that interact with external environments has created new attack surfaces for adversarial manipulation. One major threat is indirect prompt injection, where attackers embed malicious instructions in external environment output, causing agents to interpret and execute them as if they were legitimate prompts. While previous research has focused primarily on plain-text injection attacks, we find a significant yet underexplored vulnerability: LLMs' dependence on structured chat templates and their susceptibility to contextual manipulation through persuasive multi-turn dialogues. To this end, we introduce ChatInject, an attack that formats malicious payloads to mimic native chat templates, thereby exploiting the model's inherent instruction-following tendencies. Building on this foundation, we develop a persuasion-driven Multi-turn variant that primes the agent across conversational turns to accept and execute otherwise suspicious actions. Through comprehensive experiments across frontier LLMs, we demonstrate three critical findings: (1) ChatInject achieves significantly higher average attack success rates than traditional prompt injection methods, improving from 5.18% to 32.05% on AgentDojo and from 15.13% to 45.90% on InjecAgent, with multi-turn dialogues showing particularly strong performance at average 52.33% success rate on InjecAgent, (2) chat-template-based payloads demonstrate strong transferability across models and remain effective even against closed-source LLMs, despite their unknown template structures, and (3) existing prompt-based defenses are largely ineffective against this attack approach, especially against Multi-turn variants. These findings highlight vulnerabilities in current agent systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ChatInject shows template-mimicking payloads can lift attack success rates on agent benchmarks, but the gains rest on external outputs reaching the model without sanitization.

read the letter

The main thing to know is that this paper reports a concrete way to boost indirect prompt injection in LLM agents by formatting malicious payloads to match the model's native chat template syntax, plus a multi-turn persuasion extension. On the numbers given, that moves average success from 5.18% to 32.05% on AgentDojo and from 15.13% to 45.90% on InjecAgent, with the multi-turn version reaching 52.33% on the second benchmark. It also claims the approach transfers to closed models and slips past some existing defenses.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces ChatInject, an indirect prompt injection attack for LLM agents that formats malicious payloads to mimic native chat templates (e.g., using tokens like <|im_start|>), exploiting models' instruction-following tendencies. It further develops a persuasion-driven multi-turn variant. Experiments on AgentDojo and InjecAgent report ASR gains from 5.18% to 32.05% and 15.13% to 45.90% respectively, with multi-turn averaging 52.33% on InjecAgent; additional claims include cross-model transferability (including to closed-source LLMs) and ineffectiveness of existing prompt-based defenses.

Significance. If the empirical results hold under rigorous controls, the work would identify a practically relevant vulnerability in deployed LLM agent systems arising from unsanitized chat-template processing of external outputs. The concrete benchmark numbers and multi-turn extension provide actionable evidence that could guide sanitization requirements and defense design in AI agent security.

major comments (3)

[Abstract and §4 (Experiments)] Abstract and §4 (Experiments): the reported ASR improvements (5.18%→32.05% on AgentDojo; 15.13%→45.90% on InjecAgent) are presented without any description of trial counts, variance, statistical tests, or how template-mimicking payloads were constructed and detected, leaving the magnitude and reliability of the central performance claim only partially supported.
[§3 (Attack Design) and §4 (Evaluation Setup)] §3 (Attack Design) and §4 (Evaluation Setup): the claimed advantage of ChatInject over plain-text injection rests on the assumption that AgentDojo and InjecAgent forward raw external text containing special chat-template tokens directly into the next model call. No verification, ablation, or description of preprocessing/sanitization layers in these frameworks is provided; if such layers exist, the measured gains would collapse to ordinary injection.
[§4.3 (Defense Evaluation)] §4.3 (Defense Evaluation): the assertion that existing prompt-based defenses are largely ineffective, especially against the multi-turn variant, lacks concrete details on which specific defenses were tested, how multi-turn dialogues were instantiated, and quantitative success rates per defense.

minor comments (1)

[Abstract] The abstract introduces ASR without an initial expansion; a parenthetical definition on first use would improve readability for a broad audience.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our empirical results and experimental assumptions. We address each major point below and will revise the manuscript to incorporate additional details and clarifications where appropriate.

read point-by-point responses

Referee: [Abstract and §4 (Experiments)] Abstract and §4 (Experiments): the reported ASR improvements (5.18%→32.05% on AgentDojo; 15.13%→45.90% on InjecAgent) are presented without any description of trial counts, variance, statistical tests, or how template-mimicking payloads were constructed and detected, leaving the magnitude and reliability of the central performance claim only partially supported.

Authors: We agree that the current presentation would benefit from greater methodological transparency. In the revised version we will report the exact number of trials per configuration (50 independent runs to capture stochasticity), include standard deviations alongside the ASR figures, and apply statistical tests such as paired t-tests to establish significance of the observed gains. We will also expand §3 with explicit payload-construction examples (showing token placement for <|im_start|> and similar markers) and detail the success-detection logic based on logged agent actions. revision: yes
Referee: [§3 (Attack Design) and §4 (Evaluation Setup)] §3 (Attack Design) and §4 (Evaluation Setup): the claimed advantage of ChatInject over plain-text injection rests on the assumption that AgentDojo and InjecAgent forward raw external text containing special chat-template tokens directly into the next model call. No verification, ablation, or description of preprocessing/sanitization layers in these frameworks is provided; if such layers exist, the measured gains would collapse to ordinary injection.

Authors: This concern is well-founded and we will strengthen the manuscript by adding an explicit description of the input pipelines in both benchmarks. We have inspected the publicly available codebases and confirm that external outputs are appended to the conversation history without token sanitization. In the revision we will document this verification, add a short ablation that applies simple sanitization filters, and show that the performance advantage disappears under such filtering, thereby confirming that the gains derive from template mimicry. revision: yes
Referee: [§4.3 (Defense Evaluation)] §4.3 (Defense Evaluation): the assertion that existing prompt-based defenses are largely ineffective, especially against the multi-turn variant, lacks concrete details on which specific defenses were tested, how multi-turn dialogues were instantiated, and quantitative success rates per defense.

Authors: We will substantially expand §4.3. The revision will enumerate the concrete prompt-based defenses evaluated (system-prompt hardening, input sanitization heuristics, and output filtering techniques drawn from prior work), describe the multi-turn dialogue construction (including the sequence of persuasive priming turns and history formatting), and present per-defense ASR numbers in a dedicated table for both the single-turn and multi-turn ChatInject variants. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical attack success rates measured on external benchmarks

full rationale

The paper introduces ChatInject as an empirical attack technique that formats malicious payloads to mimic chat templates and reports directly measured attack success rates (e.g., 5.18% to 32.05% on AgentDojo) from experiments on external agent frameworks and LLMs. No equations, derivations, or first-principles results are present that could reduce a claimed prediction to a fitted parameter or self-definition inside the paper. The central findings rest on observed outcomes against independent benchmarks rather than any self-referential construction, self-citation chain, or ansatz smuggled via prior work by the same authors. This is a standard empirical security evaluation with no load-bearing internal reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that current agent pipelines pass external text to the model with minimal structural filtering and that the tested models still follow chat templates even when the input is adversarial.

axioms (1)

domain assumption LLM agents process external environment outputs directly as part of the prompt context without template normalization or sanitization.
Stated implicitly in the description of indirect prompt injection and the attack surface created by agent-environment interaction.

pith-pipeline@v0.9.0 · 5814 in / 1311 out tokens · 62107 ms · 2026-05-18T12:18:58.171597+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

ChatInject formats malicious payloads to mimic native chat templates (e.g., <|im_end|><|im_start|>system ... <|im_end|><|im_start|>user ...)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

On the Geometric Limits of Transformer Defenses against Obfuscation Attacks: Latent Embedding Collapse & Performance Robustness Gap
cs.CR 2026-05 unverdicted novelty 5.0

Obfuscated prompts exhibit latent embedding collapse onto clean prompt manifolds in BERT encoders, with minimal clean-obfuscated margin of 1.02 and elevated intra-class variance of 3.33 +/- 6.23 despite high detection...
Security Attack and Defense Strategies for Autonomous Agent Frameworks: A Layered Review with OpenClaw as a Case Study
cs.CR 2026-04 conditional novelty 4.0

The survey organizes security threats and defenses in autonomous LLM agents into four layers and identifies that risks can propagate across layers from inputs to ecosystem impacts.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 2 Pith papers · 10 internal anchors

[1]

gpt-oss-120b & gpt-oss-20b Model Card

S Agarwal et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Defending Against Indirect Prompt Injection Attacks With Spotlighting

URLhttps://openreview.net/forum? id=m1YYAQjO3w. 10 Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, and Emre Kici- man. Defending against indirect prompt injection attacks with spotlighting.arXiv preprint arXiv:2403.14720,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

GPT-4o System Card

Aaron Hurst et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Kimi K2: Open Agentic Intelligence

Kimi Team. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Accessed 2025-09-22. OpenAI. Introducing gpt-4.1 in the api.https://openai.com/index/gpt-4-1/, Apr

work page 2025
[7]

Promptarmor: Simple yet effective prompt injection defenses.arXiv preprint arXiv:2507.15219, 2025

Tianneng Shi, Kaijie Zhu, Zhun Wang, Yuqi Jia, Will Cai, Weida Liang, Haonan Wang, Hend Alzahrani, Joshua Lu, Kenji Kawaguchi, et al. Promptarmor: Simple yet effective prompt in- jection defenses.arXiv preprint arXiv:2507.15219,

work page arXiv
[8]

Gemma 3 Technical Report

Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ram´e, Morgane Rivi`ere, et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training llms to prioritize privileged instructions.arXiv preprint arXiv:2404.13208,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv:2504.12516,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

arXiv preprint arXiv:2502.19820

Zixuan Weng, Xiaolong Jin, Jinyuan Jia, and Xiangyu Zhang. Foot-in-the-door: A multi-turn jail- break for llms.arXiv preprint arXiv:2502.19820,

work page arXiv
[12]

Prompt injection attacks against gpt-3.https://simonwillison.net/ 2022/Sep/12/prompt-injection/,

Simon Willison. Prompt injection attacks against gpt-3.https://simonwillison.net/ 2022/Sep/12/prompt-injection/,

work page 2022
[13]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

URL https://openreview.net/forum?id=roNSXZpUDN. Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models.arXiv preprint arXiv:2508.06471,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

URLhttps://openreview.net/pdf?id=WE_vluYUL-X

Association for Computational Lin- guistics. doi: 10.18653/v1/2024.acl-long.773. URLhttps://aclanthology.org/2024. acl-long.773/. Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents. In Lun-Wei Ku, Andre Mar- tins, and Vivek Srikumar (eds.),Findings of ...

work page doi:10.18653/v1/2024.acl-long.773 2024
[16]

2024 , address =

Association for Computational Lin- guistics. doi: 10.18653/v1/2024.findings-acl.624. URLhttps://aclanthology.org/ 2024.findings-acl.624/. Qiusi Zhan, Richard Fang, Henil Shalin Panchal, and Daniel Kang. Adaptive attacks break de- fenses against indirect prompt injection attacks on LLM agents. In Luis Chiruzzo, Alan Ritter, and Lu Wang (eds.),Findings of t...

work page doi:10.18653/v1/2024.findings-acl.624 2024
[17]

ISBN 979-8-89176-195-7

Association for Computational Lin- guistics. ISBN 979-8-89176-195-7. doi: 10.18653/v1/2025.findings-naacl.395. URLhttps: //aclanthology.org/2025.findings-naacl.395/. Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. Agent security bench (ASB): Formalizing and benchmarking attacks and defenses ...

work page doi:10.18653/v1/2025.findings-naacl.395 2025
[18]

URLhttps://openreview.net/forum?id=V4y0CpX4hK. 12 APPENDIX A THEUSE OFLARGELANGUAGEMODELS Throughout the writing process, we drafted the manuscript ourselves and used an LLM assistant only for refinement (style edits, clarity, and grammar checks); it was not used for research ideation or content generation. The assistant employed was ChatGPT-5. B LIMITATI...

work page 2025
[19]

The construction process iterates through the generated conver- sational history, wrapping each turn with its corresponding role tag. Specifically, the system message is enclosed with system interrupt tags, user dialogue turns are wrapped with user interrupt tags, and assistant responses are formatted with assistant interrupt tags. This systematic formatt...

work page 2024

[1] [1]

gpt-oss-120b & gpt-oss-20b Model Card

S Agarwal et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Defending Against Indirect Prompt Injection Attacks With Spotlighting

URLhttps://openreview.net/forum? id=m1YYAQjO3w. 10 Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, and Emre Kici- man. Defending against indirect prompt injection attacks with spotlighting.arXiv preprint arXiv:2403.14720,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

GPT-4o System Card

Aaron Hurst et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Kimi K2: Open Agentic Intelligence

Kimi Team. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Accessed 2025-09-22. OpenAI. Introducing gpt-4.1 in the api.https://openai.com/index/gpt-4-1/, Apr

work page 2025

[7] [7]

Promptarmor: Simple yet effective prompt injection defenses.arXiv preprint arXiv:2507.15219, 2025

Tianneng Shi, Kaijie Zhu, Zhun Wang, Yuqi Jia, Will Cai, Weida Liang, Haonan Wang, Hend Alzahrani, Joshua Lu, Kenji Kawaguchi, et al. Promptarmor: Simple yet effective prompt in- jection defenses.arXiv preprint arXiv:2507.15219,

work page arXiv

[8] [8]

Gemma 3 Technical Report

Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ram´e, Morgane Rivi`ere, et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training llms to prioritize privileged instructions.arXiv preprint arXiv:2404.13208,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv:2504.12516,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

arXiv preprint arXiv:2502.19820

Zixuan Weng, Xiaolong Jin, Jinyuan Jia, and Xiangyu Zhang. Foot-in-the-door: A multi-turn jail- break for llms.arXiv preprint arXiv:2502.19820,

work page arXiv

[12] [12]

Prompt injection attacks against gpt-3.https://simonwillison.net/ 2022/Sep/12/prompt-injection/,

Simon Willison. Prompt injection attacks against gpt-3.https://simonwillison.net/ 2022/Sep/12/prompt-injection/,

work page 2022

[13] [13]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

URL https://openreview.net/forum?id=roNSXZpUDN. Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models.arXiv preprint arXiv:2508.06471,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

URLhttps://openreview.net/pdf?id=WE_vluYUL-X

Association for Computational Lin- guistics. doi: 10.18653/v1/2024.acl-long.773. URLhttps://aclanthology.org/2024. acl-long.773/. Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents. In Lun-Wei Ku, Andre Mar- tins, and Vivek Srikumar (eds.),Findings of ...

work page doi:10.18653/v1/2024.acl-long.773 2024

[16] [16]

2024 , address =

Association for Computational Lin- guistics. doi: 10.18653/v1/2024.findings-acl.624. URLhttps://aclanthology.org/ 2024.findings-acl.624/. Qiusi Zhan, Richard Fang, Henil Shalin Panchal, and Daniel Kang. Adaptive attacks break de- fenses against indirect prompt injection attacks on LLM agents. In Luis Chiruzzo, Alan Ritter, and Lu Wang (eds.),Findings of t...

work page doi:10.18653/v1/2024.findings-acl.624 2024

[17] [17]

ISBN 979-8-89176-195-7

Association for Computational Lin- guistics. ISBN 979-8-89176-195-7. doi: 10.18653/v1/2025.findings-naacl.395. URLhttps: //aclanthology.org/2025.findings-naacl.395/. Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. Agent security bench (ASB): Formalizing and benchmarking attacks and defenses ...

work page doi:10.18653/v1/2025.findings-naacl.395 2025

[18] [18]

URLhttps://openreview.net/forum?id=V4y0CpX4hK. 12 APPENDIX A THEUSE OFLARGELANGUAGEMODELS Throughout the writing process, we drafted the manuscript ourselves and used an LLM assistant only for refinement (style edits, clarity, and grammar checks); it was not used for research ideation or content generation. The assistant employed was ChatGPT-5. B LIMITATI...

work page 2025

[19] [19]

The construction process iterates through the generated conver- sational history, wrapping each turn with its corresponding role tag. Specifically, the system message is enclosed with system interrupt tags, user dialogue turns are wrapped with user interrupt tags, and assistant responses are formatted with assistant interrupt tags. This systematic formatt...

work page 2024