
arxiv: 2605.24659 · v1 · pith:HDPJYVBPnew · submitted 2026-05-23 · 💻 cs.LG

IterInject: Indirect Prompt Injection Against LLM Agents via Feedback-Guided Iterative Optimization

Pith reviewed 2026-06-30 14:50 UTC · model grok-4.3

classification 💻 cs.LG
keywords: indirect prompt injection, LLM agents, adversarial attacks, iterative optimization, feedback-guided attacks, prompt injection, agent security

The pith

IterInject uses iterative diagnosis and LLM-guided refinement to create more effective indirect prompt injection payloads against LLM agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces IterInject, a closed-loop attack method that injects adversarial instructions into data retrieved by agents. A rule-based diagnoser labels each outcome and describes the agent's behavior, and an LLM optimizer then refines the payload using the full history of attempts. A synthesis step creates new disguise seeds from observed failure patterns so the attack strategy evolves over iterations. Experiments report substantially higher success rates than static payloads or prior adaptive techniques on AgentDojo and InjectAgent across four victim models. In extension experiments on Claude Code, a production-grade coding agent with layered defenses, optimized payloads reach full success on five of nine targets. The work also reports a mechanistic finding: injections exploit an attention-mediated threshold in mid-to-late layers.

Core claim

IterInject is a feedback-guided iterative framework that closes the loop between injection, diagnosis, and refinement: a rule-based diagnoser produces structured outcome labels with behavioral descriptions, an LLM optimizer refines payloads conditioned on the full optimization history, and a synthesis step generates new disguise seeds from failure patterns, enabling substantially higher success rates than static baselines or existing adaptive methods on AgentDojo and InjectAgent across four victim models, with full success on five of nine targets in Claude Code.

What carries the argument

The closed feedback loop consisting of rule-based diagnosis that outputs structured labels and behavioral descriptions, LLM-based optimization conditioned on full history, and synthesis of new disguise seeds from failure patterns.
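The control flow of this loop can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `run_loop`, `Diagnosis`, `Attempt`, and the four callables are hypothetical, and the `toy_*` functions are inert stand-ins for a sandboxed benchmark agent, the rule-based diagnoser, the LLM optimizer, and the seed synthesis step.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Diagnosis:
    label: str        # structured outcome label (hypothetical values: "success", "ignored")
    description: str  # short behavioral description of what the agent did


@dataclass
class Attempt:
    payload: str
    diagnosis: Diagnosis


def run_loop(
    seed: str,
    run_agent: Callable[[str], str],
    diagnose: Callable[[str], Diagnosis],
    refine: Callable[[List[Attempt]], str],
    synthesize_seed: Callable[[List[Attempt]], Optional[str]],
    max_iters: int = 5,
) -> Tuple[List[Attempt], Optional[str]]:
    """Inject, diagnose, refine; then derive a new seed from the failures."""
    history: List[Attempt] = []
    payload = seed
    for _ in range(max_iters):
        trace = run_agent(payload)            # execute in a sandboxed benchmark
        diag = diagnose(trace)                # rule-based label plus description
        history.append(Attempt(payload, diag))
        if diag.label == "success":
            break
        payload = refine(history)             # optimizer sees the full history
    failures = [a for a in history if a.diagnosis.label != "success"]
    return history, synthesize_seed(failures)


# Toy stand-ins (assumptions for illustration only; no real agent or LLM is called).
def toy_agent(payload: str) -> str:
    return "complied" if payload.count("+") >= 2 else "ignored"


def toy_diagnose(trace: str) -> Diagnosis:
    label = "success" if trace == "complied" else "ignored"
    return Diagnosis(label, f"agent {trace} the embedded text")


def toy_refine(history: List[Attempt]) -> str:
    return history[-1].payload + "+"


def toy_synthesize(failures: List[Attempt]) -> Optional[str]:
    return f"seed-from-{len(failures)}-failures" if failures else None
```

The point of the sketch is the data flow: the optimizer is conditioned on every prior attempt and its diagnosis, not only the latest one, and the synthesis step consumes the failures separately from the refinement step.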

If this is right

  • Optimized payloads can achieve full success on some targets (5 of 9 on Claude Code) against a production-grade agent that incorporates layered defenses.
  • The method adapts to agent-specific behaviors and defenses where static or non-iterative adaptive attacks do not.
  • Causal interventions on the identified attention-mediated threshold in mid-to-late layers can alter injection success.
  • The strategy space expands automatically through synthesis of new disguise seeds derived from observed failures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent defenses may need to interrupt iterative refinement loops rather than relying solely on static content filters.
  • Monitoring attention patterns in mid-to-late layers could serve as a basis for real-time injection detection.
  • Extending the approach to agents with longer tool-use chains or multi-turn external interactions could expose additional vulnerabilities not visible in the current benchmarks.

Load-bearing premise

The rule-based diagnoser produces reliable structured outcome labels and behavioral descriptions that the LLM optimizer can use to make consistent progress across iterations.

What would settle it

Running the full iterative process on a new agent or task for a fixed number of iterations and finding that success rates show no measurable gain over the best static baseline payload.
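One way to make "no measurable gain" concrete is a one-sided test on the two success rates. The sketch below uses a pooled two-proportion z-test; the choice of test and the function name `two_proportion_z` are assumptions made here for illustration, since the page does not say how such a comparison would be scored.

```python
import math
from typing import Tuple


def two_proportion_z(
    success_iter: int, n_iter: int, success_static: int, n_static: int
) -> Tuple[float, float, float]:
    """Return (rate difference, z statistic, one-sided p-value).

    The alternative hypothesis is that the iterative method has the
    higher success rate. Uses the pooled standard error.
    """
    p_iter = success_iter / n_iter
    p_static = success_static / n_static
    pooled = (success_iter + success_static) / (n_iter + n_static)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_iter + 1 / n_static))
    if se == 0:
        # Both samples are all-success or all-failure: no evidence of a gain.
        return p_iter - p_static, 0.0, 1.0
    z = (p_iter - p_static) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the standard normal
    return p_iter - p_static, z, p_value
```

A large p-value on a fresh agent or task, at a fixed iteration budget, would count as evidence against the claim. With small trial counts an exact test would be a safer choice than this normal approximation.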

Figures

Figures reproduced from arXiv: 2605.24659 by Jiaxiang Chen, Ke Xu, Li Luo, Tanfeng Sun, Xiaoxiang Huang, Xinghao Jiang, Zixuan Chen.

Figure 1: Indirect prompt injection. An attacker plants adversarial payloads in external data; when the victim agent ...
Figure 2: Overview of IterInject. For each attack target, seeds from ...
Figure 4: Threshold-like decision boundary. (a–b) Suc...
Figure 5: Representative successful AgentDojo payload excerpts across strategy candidates. Examples are shortened ...
Original abstract

LLM-based agents are increasingly deployed for complex tasks requiring planning, tool use, and interaction with external services. Their reliance on untrusted external content exposes them to indirect prompt injection (IPI), in which adversarial instructions embedded in retrieved data hijack agent behavior. Existing attacks rely on static payloads that cannot adapt to agent-specific defenses; even recent adaptive methods lack structured feedback to guide optimization. We introduce IterInject, a feedback-guided iterative framework that closes the loop between injection, diagnosis, and refinement: a rule-based diagnoser produces structured outcome labels with behavioral descriptions, and an LLM-based optimizer refines payloads conditioned on the full optimization history. A synthesis step generates new disguise seeds from failure patterns, enabling the strategy space to self-evolve. On AgentDojo and InjectAgent, IterInject substantially outperforms static baselines and existing adaptive methods across four victim models. Extension experiments on Claude Code, a production-grade coding agent with layered defenses, show that optimized payloads achieve full success on 5 of 9 targets; even those that resist full exploitation exhibit measurable improvement from iterative refinement. We further present a mechanistic analysis of IPI, identifying an attention-mediated threshold mechanism in mid-to-late layers; three causal interventions validate this finding and point to concrete defense directions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents IterInject, a feedback-guided iterative optimization framework for indirect prompt injection (IPI) attacks on LLM agents. It combines a rule-based diagnoser that extracts structured outcome labels and behavioral descriptions from execution traces, an LLM-based optimizer that refines payloads using full optimization history, and a synthesis step that generates new disguise seeds from failure patterns. The central claims are that IterInject substantially outperforms static baselines and prior adaptive methods on AgentDojo and InjectAgent across four victim models, achieves full success on 5 of 9 targets against the production-grade Claude Code agent, and that mechanistic analysis reveals an attention-mediated threshold mechanism in mid-to-late layers, supported by three causal interventions.

Significance. If the empirical claims hold after validation of the core components, the work would provide a concrete adaptive attack methodology for IPI, demonstrate measurable progress on a defended production agent, and supply mechanistic insight plus causal evidence that could inform targeted defenses. The inclusion of causal interventions and extension to a real-world coding agent are positive features.

major comments (2)
  1. [§3] The rule-based diagnoser is presented as the source of structured outcome labels and behavioral descriptions that condition the LLM optimizer across iterations, yet the manuscript reports no validation of its reliability (human agreement, error rates on partial successes/refusals/tool-use patterns, or ablation on noisy labels). Because the entire iterative synthesis/refinement pipeline depends on these labels, systematic misclassification would corrupt the optimization history and could artifactually produce the reported gains over baselines.
  2. [§4, §5] The performance claims (substantial outperformance on AgentDojo/InjectAgent and full success on 5/9 Claude Code targets) rest on the diagnoser-driven loop, but no ablation isolating the contribution of the rule-based labels versus the optimizer or synthesis step is described. Without such controls, it is unclear whether the gains are attributable to the iterative mechanism or to other factors.
minor comments (1)
  1. The abstract and early sections refer to 'four victim models' and 'AgentDojo and InjectAgent' without immediately specifying the exact models or benchmark versions; adding these details in the first paragraph would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which identify key areas where additional validation and controls would strengthen the manuscript. We address each major comment below and describe the revisions we will make to incorporate the suggested analyses.

point-by-point responses
  1. Referee: [§3] The rule-based diagnoser is presented as the source of structured outcome labels and behavioral descriptions that condition the LLM optimizer across iterations, yet the manuscript reports no validation of its reliability (human agreement, error rates on partial successes/refusals/tool-use patterns, or ablation on noisy labels). Because the entire iterative synthesis/refinement pipeline depends on these labels, systematic misclassification would corrupt the optimization history and could artifactually produce the reported gains over baselines.

    Authors: We agree that explicit validation of the rule-based diagnoser is necessary. In the revised manuscript we will add a new subsection in §3 reporting (i) inter-annotator agreement with human evaluators on a stratified sample of execution traces, (ii) error rates broken down by outcome category (full success, partial success, refusal, tool misuse), and (iii) a sensitivity analysis that injects controlled label noise and measures downstream impact on optimization trajectories. These results will quantify the diagnoser’s reliability and address the concern about potential artifactual gains.

    revision: yes

  2. Referee: [§4, §5] The performance claims (substantial outperformance on AgentDojo/InjectAgent and full success on 5/9 Claude Code targets) rest on the diagnoser-driven loop, but no ablation isolating the contribution of the rule-based labels versus the optimizer or synthesis step is described. Without such controls, it is unclear whether the gains are attributable to the iterative mechanism or to other factors.

    Authors: We concur that component-wise ablations are required to attribute performance gains. We will augment §4 and §5 with new experiments comparing the full IterInject pipeline against three controlled variants: (1) optimizer without structured rule-based labels (raw traces only), (2) optimizer plus labels but without the synthesis step, and (3) synthesis without iterative history conditioning. Results on both AgentDojo/InjectAgent and the Claude Code targets will be reported to isolate the contribution of each element.

    revision: yes
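The diagnoser validation described in response 1 could be scored along the following lines. This is a hedged sketch, not something the page or the paper specifies: Cohen's kappa is one common agreement statistic, and `cohens_kappa` and `flip_labels` are hypothetical helper names introduced here. The first compares diagnoser labels with human labels on the same traces; the second injects controlled label noise for a sensitivity analysis.

```python
import random
from collections import Counter
from typing import List, Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from the two marginal label distributions.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def flip_labels(
    labels: Sequence[str], rate: float, classes: Sequence[str], seed: int = 0
) -> List[str]:
    """Replace each label with a different class with probability `rate`.

    `classes` must contain at least two distinct labels.
    """
    rng = random.Random(seed)  # seeded so the noise pattern is reproducible
    noisy = []
    for lab in labels:
        if rng.random() < rate:
            noisy.append(rng.choice([c for c in classes if c != lab]))
        else:
            noisy.append(lab)
    return noisy
```

Running the optimization loop on labels passed through `flip_labels` at increasing rates, and tracking how success rates degrade, would show how much the reported gains depend on label quality.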

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper describes an empirical attack framework (IterInject) consisting of a rule-based diagnoser producing labels from execution traces and an LLM optimizer refining payloads over iterations. No equations, fitted parameters, or predictive derivations are present. Performance claims rest on direct experiments against external benchmarks (AgentDojo, InjectAgent, Claude Code) rather than any self-referential reduction or self-citation load-bearing step. The method is self-contained against those benchmarks and does not invoke uniqueness theorems or rename known results as new derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract alone.


