arxiv: 2607.00422 · v1 · pith:LDKDYQ2Znew · submitted 2026-07-01 · 💻 cs.CR

KidnapRAG: A Black-Box Attack for Hijacking Reasoning in Agentic Retrieval-Augmented Generation Systems

Pith reviewed 2026-07-02 11:40 UTC · model grok-4.3

classification 💻 cs.CR
keywords black-box poisoning attack · agentic RAG · reasoning hijacking · retrieval-augmented generation · sequential attack · Bait Chain-Link Mal-Ins · LLM security · multi-step reasoning

The pith

Three role-specific poisoned documents can hijack multi-step reasoning in agentic RAG systems without model access.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that agentic RAG systems, designed to resist poisoning through iterative retrieval and reasoning, can still be redirected by a black-box attacker who only publishes external documents. KidnapRAG achieves this with a sequence of three documents that first draw the agent in, then shift the query focus, and finally supply the attacker-chosen evidence. Experiments across frameworks, models, and benchmarks show this method succeeds more reliably than earlier poisoning approaches. A reader would care because it reveals how the very iteration meant to protect the system can be turned against it when documents are crafted to play specific roles in the chain.

Core claim

KidnapRAG is a black-box sequential poisoning attack that hijacks the agent's reasoning chain by publishing three role-specific documents: Bait to attract initial retrieval, Chain-Link to induce query reformulation, and Mal-Ins to provide attacker-controlled evidence. This approach progressively weakens the original retrieval intent, redirects behavior, and increases reliance on the poisoned content, outperforming existing baselines across multiple Agentic RAG frameworks, LLM backbones, and benchmarks.

What carries the argument

The three role-specific poisoned documents (Bait, Chain-Link, Mal-Ins) that sequentially attract retrieval, trigger reformulation, and inject malicious evidence.

If this is right

  • The attack succeeds without white-box access to system prompts, reasoning traces, or model parameters.
  • It works across multiple Agentic RAG frameworks, LLM backbones, and standard benchmarks.
  • The poisoned documents progressively weaken the original retrieval intent and redirect behavior.
  • Reliance on attacker-controlled evidence increases as the chain advances.
  • Existing poisoning baselines are outperformed under the same black-box constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Defenses could focus on detecting unnatural shifts in retrieval intent across multiple steps rather than single-document relevance.
  • The same sequential role-playing tactic might transfer to other iterative agent architectures that reformulate queries mid-process.
  • System designers may need to add verification steps after each retrieval to break the chain before the final evidence is accepted.
  • Users relying on agentic RAG for complex tasks should consider cross-checking final outputs against independent sources.
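
The first bullet above can be made concrete with a small sketch. It assumes the system logs every query the agent issues, compares each one to the user's original query, and flags steps whose overlap falls below a threshold. The function names, the bag-of-words cosine measure, and the 0.2 threshold are illustrative assumptions, not something the paper or this review specifies; a real defense would more likely use embedding similarity.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def flag_intent_drift(
    original_query: str,
    step_queries: list[str],
    threshold: float = 0.2,  # hypothetical cutoff, chosen for illustration only
) -> list[int]:
    """Return indices of retrieval steps whose query has drifted from the original."""
    origin = bag_of_words(original_query)
    return [
        i
        for i, query in enumerate(step_queries)
        if cosine(origin, bag_of_words(query)) < threshold
    ]
```

Calling `flag_intent_drift` with the original question and the list of logged step queries returns the positions of the steps worth inspecting. Lexical overlap is a crude proxy: a legitimate multi-hop reformulation can also share few words with the original question, so a flag here is a prompt for review rather than proof of poisoning.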

Load-bearing premise

Agentic RAG systems will follow an iterative retrieval and reasoning process that these three documents can progressively redirect without any access to prompts or model internals.

What would settle it

Experiments in which the agent consistently retrieves only clean documents, ignores the bait, and completes the original query without incorporating Mal-Ins evidence across repeated trials on the tested benchmarks.
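
The test described above could be tallied roughly as follows. This is a minimal sketch: the `Trial` record, its field names, and the case-insensitive substring match against the target answer are assumptions made for illustration, and the paper's own attack success rate (ASR) definition may differ.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One logged agent run (hypothetical schema)."""

    retrieved_ids: list[str]  # IDs of every document the agent retrieved
    final_answer: str         # the agent's final answer text


def summarize(
    trials: list[Trial], poisoned_ids: set[str], target_answer: str
) -> dict[str, float]:
    """Fraction of trials that retrieved any poisoned document, and fraction
    whose final answer contains the attacker's target answer."""
    n = len(trials)
    if n == 0:
        raise ValueError("need at least one trial")
    hits = sum(1 for t in trials if poisoned_ids & set(t.retrieved_ids))
    successes = sum(
        1 for t in trials if target_answer.lower() in t.final_answer.lower()
    )
    return {
        "poison_retrieval_rate": hits / n,
        "attack_success_rate": successes / n,
    }
```

Both rates staying near zero across repeated trials on the tested benchmarks would correspond to the outcome described above.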

Figures

Figures reproduced from arXiv: 2607.00422 by Buru Chang, Chanwoo Choi, Eunmi Kim, Euntae Kim, Jinhee Jeong, Junseo Jang, Kyuho Lee, Myunggyo Oh, Youngsam Chun.

Figure 1: Comparison between previous attacks and KidnapRAG in Agentic RAG systems. Previous attacks fail because they cannot hijack the agent’s reasoning chain, whereas KidnapRAG controls the reasoning chain to induce the attacker-intended target answer.
Figure 2: Exposed reasoning processes in real-world …
Figure 3: Overview of KidnapRAG, the proposed black …
Figure 4: Reasoning chain shift on HotpotQA.
Figure 5: Reasoning chain-level analysis on HotpotQA for the best dragging scenarios. Scores are measured on …
Figure 6: Case study comparing previous attacks, Ignore …
Figure 7: Reasoning chain shift on MuSiQue. Panels: ReAct: Qwen; ReAct: Llama; WebThinker: QwQ; WebThinker: DeepSeek. Axes: Reasoning Path Divergence Score, Target Redirection Score. Legend: Naive, Ignore, Fake Completion, Combined, Topic, PoisonedRAG, PARADOX, Ours.
Figure 8: Reasoning chain shift on 2WikiMultihopQA.
Figure 9: Reasoning chain-level analysis on MuSiQue for the best dragging scenarios. Scores are measured on …
Figure 10: Reasoning chain-level analysis on 2WikiMultihopQA for the best dragging scenarios. Scores are …
Figure 11: Prompts used for generating Bait Documents. The upper figure shows the system prompt, and the lower …

Original abstract

Retrieval-Augmented Generation (RAG) systems are vulnerable to poisoning attacks that inject malicious documents into the retrieval process to manipulate model outputs. Recent Agentic RAG systems are more robust to such attacks because they iteratively perform retrieval and reasoning, allowing them to ignore weakly relevant poisoned documents and preserve the reasoning chain induced by the user query. However, existing attacks on Agentic RAG systems often assume white-box access to system prompts, reasoning traces, retrievers, or model parameters, limiting their applicability in realistic settings. In this paper, we study black-box poisoning attacks against Agentic RAG systems, where the attacker can only publish externally retrievable poisoned documents. We propose KidnapRAG, a sequential poisoning attack that hijacks the agent's multi-step reasoning chain using three role-specific documents: Bait, Chain-Link, and Mal-Ins, which attract initial retrieval, induce query reformulation, and provide attacker-controlled evidence, respectively. Experiments across multiple Agentic RAG frameworks, LLM backbones, and benchmarks show that KidnapRAG consistently outperforms existing poisoning baselines under black-box conditions. Further analyses show that KidnapRAG progressively weakens the original retrieval intent, redirects retrieval behavior, and increases reliance on attacker-controlled evidence. Our code is publicly available at https://github.com/chanwoochoi316/KidnapRAG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents KidnapRAG, a black-box poisoning attack targeting Agentic RAG systems. The attack employs three role-specific poisoned documents—Bait to attract initial retrieval, Chain-Link to induce query reformulation, and Mal-Ins to supply attacker-controlled evidence—to hijack the multi-step reasoning chain. Experiments across multiple Agentic RAG frameworks, LLM backbones, and benchmarks demonstrate that KidnapRAG consistently outperforms existing poisoning baselines under black-box conditions. Additional analyses illustrate how the attack progressively weakens the original retrieval intent and increases reliance on attacker-controlled evidence. The code is made publicly available.

Significance. If the empirical results hold, the work is significant for demonstrating that Agentic RAG systems remain susceptible to sequential black-box poisoning despite their iterative design. This advances understanding of practical attack surfaces in agentic AI. The public code release is a clear strength supporting reproducibility.

major comments (2)
  1. [§4] §4 (Experiments): The central claim of 'consistent outperformance' lacks reported statistical significance tests, standard deviations, or number of runs per configuration. Without these, it is impossible to determine whether observed gains over baselines are reliable or attributable to variance.
  2. [§3] §3 (Attack Design): The load-bearing premise that exactly three role-specific documents suffice to redirect iterative retrieval and reasoning under black-box constraints is supported only by end-to-end success rates; no ablation isolating the contribution of each document (or testing two-document variants) is provided, weakening the justification for the proposed sequence.
minor comments (2)
  1. [Abstract] The abstract refers to 'multiple Agentic RAG frameworks' and 'benchmarks' without naming them; listing the concrete systems and datasets would improve immediate readability.
  2. Notation for the three documents could be introduced with a compact table or consistent abbreviations on first use to aid cross-reference with later sections.
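
The reporting that major comment 1 asks for can be sketched with the Python standard library alone. The function below takes per-run scores for two methods under matched configurations and returns each mean, each sample standard deviation, and the paired t statistic with its degrees of freedom. The p-value lookup is left out because it needs a t-distribution table or a statistics library. All names are illustrative, and no numbers from the paper are used.

```python
import math
import statistics


def paired_summary(ours: list[float], baseline: list[float]) -> dict[str, float]:
    """Mean, sample std, and paired t statistic for matched runs.

    ours[i] and baseline[i] must come from the same configuration and seed.
    """
    if len(ours) != len(baseline) or len(ours) < 2:
        raise ValueError("need two equal-length samples with at least 2 runs")
    diffs = [a - b for a, b in zip(ours, baseline)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)
    if sd_diff > 0:
        # t = mean(d) / (sd(d) / sqrt(n))
        t_stat = mean_diff / (sd_diff / math.sqrt(n))
    else:
        t_stat = math.nan  # all differences identical, so t is undefined
    return {
        "mean_ours": statistics.mean(ours),
        "std_ours": statistics.stdev(ours),
        "mean_baseline": statistics.mean(baseline),
        "std_baseline": statistics.stdev(baseline),
        "t_stat": t_stat,
        "dof": n - 1,
    }
```

With only a handful of runs per configuration the normality assumption behind the t-test is hard to check, which is why a rank-based alternative such as the Wilcoxon signed-rank test is a common fallback.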

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the empirical claims without misrepresenting the current manuscript.

  1. Referee: [§4] §4 (Experiments): The central claim of 'consistent outperformance' lacks reported statistical significance tests, standard deviations, or number of runs per configuration. Without these, it is impossible to determine whether observed gains over baselines are reliable or attributable to variance.

    Authors: We agree this is a valid concern and a limitation of the current presentation. The experiments were run with 5 independent trials per configuration to account for stochasticity in retrieval and LLM generation, but these details and variance measures were omitted. In the revision we will (1) explicitly state the number of runs, (2) report mean success rates with standard deviations in all tables, and (3) add paired t-tests (or Wilcoxon signed-rank tests where normality assumptions fail) with p-values to establish statistical significance of the gains over baselines. These additions will be placed in §4 and the appendix.

    revision: yes

  2. Referee: [§3] §3 (Attack Design): The load-bearing premise that exactly three role-specific documents suffice to redirect iterative retrieval and reasoning under black-box constraints is supported only by end-to-end success rates; no ablation isolating the contribution of each document (or testing two-document variants) is provided, weakening the justification for the proposed sequence.

    Authors: The manuscript already contains progressive analyses (§5) that track how retrieval intent weakens and attacker evidence reliance grows as Bait, Chain-Link, and Mal-Ins are introduced sequentially. Nevertheless, we acknowledge that these analyses do not constitute a controlled ablation of document count or role. In the revised version we will add explicit ablation tables comparing the full three-document KidnapRAG against (a) two-document subsets and (b) single-document baselines, measuring end-to-end attack success rate, retrieval hit rate on attacker documents, and reasoning-chain hijack rate across the same frameworks and benchmarks. This will directly quantify the marginal contribution of each role-specific document.

    revision: yes
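
The ablation promised in the second response covers every non-empty subset of the three roles. A short sketch of enumerating those conditions follows. It assumes only the role names given in the paper; the function name is hypothetical, and evaluating each condition is left to the caller.

```python
from itertools import combinations

ROLES = ("Bait", "Chain-Link", "Mal-Ins")


def ablation_conditions(roles: tuple[str, ...] = ROLES) -> list[tuple[str, ...]]:
    """All non-empty subsets of the document roles, smallest subsets first."""
    return [
        subset
        for size in range(1, len(roles) + 1)
        for subset in combinations(roles, size)
    ]
```

For three roles this yields seven conditions: three single-document variants, three two-document variants, and the full sequence.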

Circularity Check

0 steps flagged

No significant circularity

The paper contains no equations, derivations, fitted parameters, or mathematical claims. Its central contribution is an empirical black-box attack (KidnapRAG) demonstrated through experiments on multiple frameworks, backbones, and benchmarks. The method is defined by explicit construction of three document roles (Bait, Chain-Link, Mal-Ins) and evaluated via direct performance comparisons; no step reduces to a self-definition, renamed fit, or self-citation chain. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the contribution is an empirical attack design.

Reference graph

Works this paper leans on

31 extracted references · 10 canonical work pages · 8 internal anchors

  1. Yulin Chen, Haoran Li, Yuexin Li, Yue Liu, Yangqiu Song, Bryan Hooi. TopicAttack: An Indirect Prompt Injection Attack via Topic Transition. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025.
  2. Ignore Previous Prompt: Attack Techniques For Language Models. arXiv preprint arXiv:2211.09527.
  3. Delimiters won’t save you from prompt injection. https://simonwillison.net/2023/May/11/delimiters-wont-save-you
  4. Formalizing and benchmarking prompt injection attacks and defenses. 33rd USENIX Security Symposium (USENIX Security 24).
  5. Chanwoo Choi, Jinsoo Kim, Sukmin Cho, Soyeong Jeong, Buru Chang. The RAG Paradox: A Black-Box Attack Exploiting Unintentional Vulnerabilities in Retrieval-Augmented Generation Systems. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025.
  6. PoisonedRAG: Knowledge corruption attacks to Retrieval-Augmented generation of large language models. 34th USENIX Security Symposium (USENIX Security 25).
  7. Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG. arXiv preprint arXiv:2501.09136.
  8. WebThinker: Empowering Large Reasoning Models with Deep Research Capability. The Thirty-ninth Annual Conference on Neural Information Processing Systems.
  9. Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, Zhicheng Dou. Search-o1: Agentic Search-Enhanced Large Reasoning Models. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025.
  10. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv preprint arXiv:2210.03629.
  11. Agentic Reinforced Policy Optimization. The Fourteenth International Conference on Learning Representations.
  12. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems.
  13. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume.
  14. Watch out for your agents! Investigating backdoor threats to LLM-based agents. Advances in Neural Information Processing Systems.
  15. AgentPoison: Red-teaming LLM agents via poisoning memory or knowledge bases. Advances in Neural Information Processing Systems.
  16. Shen Dong, Shaochen Xu, Pengfei He, Yige Li, Jiliang Tang, Tianming Liu, Hui Liu, Zhen Xiang. Memory Injection Attacks on … 2026.
  17. Chain-of-Trigger: An Agentic Backdoor that Paradoxically Enhances Agentic Robustness. arXiv preprint arXiv:2510.08238.
  18. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
  19. Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, Ashish Sabharwal. ♫ MuSiQue: Multihop Questions via Single-hop Question Composition. Transactions of the Association for Computational Linguistics. 2022.
  20. Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, Akiko Aizawa. Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. Proceedings of the 28th International Conference on Computational Linguistics. 2020.
  21. Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv preprint arXiv:2212.03533.
  22. Qwen Team. QwQ-32B: Embracing the Power of Reinforcement Learning.
  23. Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115.
  24. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.
  25. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.
  26. Javid Ebrahimi, Anyi Rao, Daniel Lowd, Dejing Dou. HotFlip: White-Box Adversarial Examples for Text Classification. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2018.
  27. Olga Golovneva, Moya Peng Chen, Spencer Poff, Martin Corredor, Luke Zettlemoyer, Maryam Fazel-Zarandi, Asli Celikyilmaz. 2023.
  28. Step Potential Advantage Estimation: Harnessing Intermediate Confidence and Correctness for Efficient Mathematical Reasoning. arXiv preprint arXiv:2601.03823.
  29. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing. 1978.
  30. Qwen3 Technical Report. arXiv preprint arXiv:2505.09388.
  31. Get my drift? Catching LLM task drift with activation deltas. 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). 2025.