KidnapRAG: A Black-Box Attack for Hijacking Reasoning in Agentic Retrieval-Augmented Generation Systems
Pith reviewed 2026-07-02 11:40 UTC · model grok-4.3
The pith
Three role-specific poisoned documents can hijack multi-step reasoning in agentic RAG systems without model access.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
KidnapRAG is a black-box sequential poisoning attack that hijacks the agent's reasoning chain by publishing three role-specific documents: Bait to attract initial retrieval, Chain-Link to induce query reformulation, and Mal-Ins to provide attacker-controlled evidence. This approach progressively weakens the original retrieval intent, redirects behavior, and increases reliance on the poisoned content, outperforming existing baselines across multiple Agentic RAG frameworks, LLM backbones, and benchmarks.
What carries the argument
The three role-specific poisoned documents (Bait, Chain-Link, Mal-Ins) that sequentially attract retrieval, trigger reformulation, and inject malicious evidence.
If this is right
- The attack succeeds without white-box access to system prompts, reasoning traces, or model parameters.
- It works across multiple Agentic RAG frameworks, LLM backbones, and standard benchmarks.
- The poisoned documents progressively weaken the original retrieval intent and redirect behavior.
- Reliance on attacker-controlled evidence increases as the chain advances.
- Existing poisoning baselines are outperformed under the same black-box constraints.
Where Pith is reading between the lines
- Defenses could focus on detecting unnatural shifts in retrieval intent across multiple steps rather than single-document relevance.
- The same sequential role-playing tactic might transfer to other iterative agent architectures that reformulate queries mid-process.
- System designers may need to add verification steps after each retrieval to break the chain before the final evidence is accepted.
- Users relying on agentic RAG for complex tasks should consider cross-checking final outputs against independent sources.
Load-bearing premise
Agentic RAG systems will follow an iterative retrieval and reasoning process that these three documents can progressively redirect without any access to prompts or model internals.
What would settle it
Experiments in which the agent consistently retrieves only clean documents, ignores the bait, and completes the original query without incorporating Mal-Ins evidence across repeated trials on the tested benchmarks.
Figures
read the original abstract
Retrieval-Augmented Generation (RAG) systems are vulnerable to poisoning attacks that inject malicious documents into the retrieval process to manipulate model outputs. Recent Agentic RAG systems are more robust to such attacks because they iteratively perform retrieval and reasoning, allowing them to ignore weakly relevant poisoned documents and preserve the reasoning chain induced by the user query. However, existing attacks on Agentic RAG systems often assume white-box access to system prompts, reasoning traces, retrievers, or model parameters, limiting their applicability in realistic settings. In this paper, we study black-box poisoning attacks against Agentic RAG systems, where the attacker can only publish externally retrievable poisoned documents. We propose KidnapRAG, a sequential poisoning attack that hijacks the agent's multi-step reasoning chain using three role-specific documents: Bait, Chain-Link, and Mal-Ins, which attract initial retrieval, induce query reformulation, and provide attacker-controlled evidence, respectively. Experiments across multiple Agentic RAG frameworks, LLM backbones, and benchmarks show that KidnapRAG consistently outperforms existing poisoning baselines under black-box conditions. Further analyses show that KidnapRAG progressively weakens the original retrieval intent, redirects retrieval behavior, and increases reliance on attacker-controlled evidence. Our code is publicly available at https://github.com/chanwoochoi316/KidnapRAG.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents KidnapRAG, a black-box poisoning attack targeting Agentic RAG systems. The attack employs three role-specific poisoned documents—Bait to attract initial retrieval, Chain-Link to induce query reformulation, and Mal-Ins to supply attacker-controlled evidence—to hijack the multi-step reasoning chain. Experiments across multiple Agentic RAG frameworks, LLM backbones, and benchmarks demonstrate that KidnapRAG consistently outperforms existing poisoning baselines under black-box conditions. Additional analyses illustrate how the attack progressively weakens the original retrieval intent and increases reliance on attacker-controlled evidence. The code is made publicly available.
Significance. If the empirical results hold, the work is significant for demonstrating that Agentic RAG systems remain susceptible to sequential black-box poisoning despite their iterative design. This advances understanding of practical attack surfaces in agentic AI. The public code release is a clear strength supporting reproducibility.
major comments (2)
- [§4] §4 (Experiments): The central claim of 'consistent outperformance' lacks reported statistical significance tests, standard deviations, or number of runs per configuration. Without these, it is impossible to determine whether observed gains over baselines are reliable or attributable to variance.
- [§3] §3 (Attack Design): The load-bearing premise that exactly three role-specific documents suffice to redirect iterative retrieval and reasoning under black-box constraints is supported only by end-to-end success rates; no ablation isolating the contribution of each document (or testing two-document variants) is provided, weakening the justification for the proposed sequence.
minor comments (2)
- [Abstract] The abstract refers to 'multiple Agentic RAG frameworks' and 'benchmarks' without naming them; listing the concrete systems and datasets would improve immediate readability.
- Notation for the three documents could be introduced with a compact table or consistent abbreviations on first use to aid cross-reference with later sections.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the empirical claims without misrepresenting the current manuscript.
read point-by-point responses
-
Referee: [§4] §4 (Experiments): The central claim of 'consistent outperformance' lacks reported statistical significance tests, standard deviations, or number of runs per configuration. Without these, it is impossible to determine whether observed gains over baselines are reliable or attributable to variance.
Authors: We agree this is a valid concern and a limitation of the current presentation. The experiments were run with 5 independent trials per configuration to account for stochasticity in retrieval and LLM generation, but these details and variance measures were omitted. In the revision we will (1) explicitly state the number of runs, (2) report mean success rates with standard deviations in all tables, and (3) add paired t-tests (or Wilcoxon signed-rank tests where normality assumptions fail) with p-values to establish statistical significance of the gains over baselines. These additions will be placed in §4 and the appendix. revision: yes
-
Referee: [§3] §3 (Attack Design): The load-bearing premise that exactly three role-specific documents suffice to redirect iterative retrieval and reasoning under black-box constraints is supported only by end-to-end success rates; no ablation isolating the contribution of each document (or testing two-document variants) is provided, weakening the justification for the proposed sequence.
Authors: The manuscript already contains progressive analyses (§5) that track how retrieval intent weakens and attacker evidence reliance grows as Bait, Chain-Link, and Mal-Ins are introduced sequentially. Nevertheless, we acknowledge that these analyses do not constitute a controlled ablation of document count or role. In the revised version we will add explicit ablation tables comparing the full three-document KidnapRAG against (a) two-document subsets and (b) single-document baselines, measuring end-to-end attack success rate, retrieval hit rate on attacker documents, and reasoning-chain hijack rate across the same frameworks and benchmarks. This will directly quantify the marginal contribution of each role-specific document. revision: yes
Circularity Check
No significant circularity
full rationale
The paper contains no equations, derivations, fitted parameters, or mathematical claims. Its central contribution is an empirical black-box attack (KidnapRAG) demonstrated through experiments on multiple frameworks, backbones, and benchmarks. The method is defined by explicit construction of three document roles (Bait, Chain-Link, Mal-Ins) and evaluated via direct performance comparisons; no step reduces to a self-definition, renamed fit, or self-citation chain. The result is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
T opic A ttack: An Indirect Prompt Injection Attack via Topic Transition
Chen, Yulin and Li, Haoran and Li, Yuexin and Liu, Yue and Song, Yangqiu and Hooi, Bryan. T opic A ttack: An Indirect Prompt Injection Attack via Topic Transition. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025
2025
-
[2]
Ignore Previous Prompt: Attack Techniques For Language Models
Ignore previous prompt: Attack techniques for language models , author=. arXiv preprint arXiv:2211.09527 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
https://simonwillison
Delimiters won’t save you from prompt injection , author=. https://simonwillison. net/2023/May/11/delimiters-wont-save-you , year=
2023
-
[4]
33rd USENIX Security Symposium (USENIX Security 24) , pages=
Formalizing and benchmarking prompt injection attacks and defenses , author=. 33rd USENIX Security Symposium (USENIX Security 24) , pages=
-
[5]
The RAG Paradox: A Black-Box Attack Exploiting Unintentional Vulnerabilities in Retrieval-Augmented Generation Systems
Choi, Chanwoo and Kim, Jinsoo and Cho, Sukmin and Jeong, Soyeong and Chang, Buru. The RAG Paradox: A Black-Box Attack Exploiting Unintentional Vulnerabilities in Retrieval-Augmented Generation Systems. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025
2025
-
[6]
34th USENIX Security Symposium (USENIX Security 25) , pages=
\ PoisonedRAG \ : Knowledge corruption attacks to \ Retrieval-Augmented \ generation of large language models , author=. 34th USENIX Security Symposium (USENIX Security 25) , pages=
-
[7]
Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG
Agentic retrieval-augmented generation: A survey on agentic rag , author=. arXiv preprint arXiv:2501.09136 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=
WebThinker: Empowering Large Reasoning Models with Deep Research Capability , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=
-
[9]
Search-o1: Agentic Search-Enhanced Large Reasoning Models
Li, Xiaoxi and Dong, Guanting and Jin, Jiajie and Zhang, Yuyao and Zhou, Yujia and Zhu, Yutao and Zhang, Peitian and Dou, Zhicheng. Search-o1: Agentic Search-Enhanced Large Reasoning Models. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025
2025
-
[10]
ReAct: Synergizing Reasoning and Acting in Language Models
React: Synergizing reasoning and acting in language models , author=. arXiv preprint arXiv:2210.03629 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
The Fourteenth International Conference on Learning Representations , year=
Agentic Reinforced Policy Optimization , author=. The Fourteenth International Conference on Learning Representations , year=
-
[12]
Advances in Neural Information Processing Systems , volume=
Retrieval-augmented generation for knowledge-intensive nlp tasks , author=. Advances in Neural Information Processing Systems , volume=
-
[13]
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume , pages=
Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering , author=. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume , pages=
-
[14]
Advances in Neural Information Processing Systems , volume=
Watch out for your agents! investigating backdoor threats to llm-based agents , author=. Advances in Neural Information Processing Systems , volume=
-
[15]
Advances in Neural Information Processing Systems , volume=
Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases , author=. Advances in Neural Information Processing Systems , volume=
-
[16]
Memory Injection Attacks on
Shen Dong and Shaochen Xu and Pengfei He and Yige Li and Jiliang Tang and Tianming Liu and Hui Liu and Zhen Xiang , booktitle=. Memory Injection Attacks on. 2026 , url=
2026
-
[17]
arXiv preprint arXiv:2510.08238 , year=
Chain-of-Trigger: An Agentic Backdoor that Paradoxically Enhances Agentic Robustness , author=. arXiv preprint arXiv:2510.08238 , year=
-
[18]
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages=
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering , author=. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages=
2018
-
[19]
♫ M u S i Q ue: Multihop Questions via Single-hop Question Composition
Trivedi, Harsh and Balasubramanian, Niranjan and Khot, Tushar and Sabharwal, Ashish. ♫ M u S i Q ue: Multihop Questions via Single-hop Question Composition. Transactions of the Association for Computational Linguistics. 2022
2022
-
[20]
Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps
Ho, Xanh and Duong Nguyen, Anh-Khoa and Sugawara, Saku and Aizawa, Akiko. Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. Proceedings of the 28th International Conference on Computational Linguistics. 2020
2020
-
[21]
Text Embeddings by Weakly-Supervised Contrastive Pre-training
Text embeddings by weakly-supervised contrastive pre-training , author=. arXiv preprint arXiv:2212.03533 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
QwQ-32B: Embracing the Power of Reinforcement Learning , url =
Qwen Team , month =. QwQ-32B: Embracing the Power of Reinforcement Learning , url =
-
[23]
Qwen2.5 Technical Report , author=. arXiv preprint arXiv:2412.15115 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[24]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[25]
The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[26]
H ot F lip: White-Box Adversarial Examples for Text Classification
Ebrahimi, Javid and Rao, Anyi and Lowd, Daniel and Dou, Dejing. H ot F lip: White-Box Adversarial Examples for Text Classification. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2018
2018
-
[27]
2023 , url=
Olga Golovneva and Moya Peng Chen and Spencer Poff and Martin Corredor and Luke Zettlemoyer and Maryam Fazel-Zarandi and Asli Celikyilmaz , booktitle=. 2023 , url=
2023
-
[28]
arXiv preprint arXiv:2601.03823 , year=
Step Potential Advantage Estimation: Harnessing Intermediate Confidence and Correctness for Efficient Mathematical Reasoning , author=. arXiv preprint arXiv:2601.03823 , year=
-
[29]
IEEE transactions on acoustics, speech, and signal processing , volume=
Dynamic programming algorithm optimization for spoken word recognition , author=. IEEE transactions on acoustics, speech, and signal processing , volume=. 1978 , publisher=
1978
-
[30]
Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[31]
2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) , pages=
Get my drift? catching llm task drift with activation deltas , author=. 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) , pages=. 2025 , organization=
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.