Malice in Agentland: Down the Rabbit Hole of Backdoors in the AI Supply Chain

Abhay Puri; Alexandre Drouin; Alexandre Lacoste; Chandra Kiran Reddy Evuru; Krishnamurthy Dj Dvijotham; L\'eo Boisvert; Nazanin Sepahvand; Nicolas Chapados; Quentin Cappart

arxiv: 2510.05159 · v4 · submitted 2025-10-03 · 💻 cs.CR · cs.AI· cs.LG

Malice in Agentland: Down the Rabbit Hole of Backdoors in the AI Supply Chain

L\'eo Boisvert , Abhay Puri , Chandra Kiran Reddy Evuru , Nazanin Sepahvand , Nicolas Chapados , Quentin Cappart , Alexandre Lacoste , Krishnamurthy Dj Dvijotham

show 1 more author

Alexandre Drouin

This is my paper

Pith reviewed 2026-05-18 10:41 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.LG

keywords AI agentsbackdoor attacksdata poisoningsupply chain securityagentic AIthreat modelsinformation leakage

0 comments

The pith

Poisoning a few demonstrations in agent training data lets attackers insert backdoors that trigger leaks of confidential information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that AI agents trained on interaction data become vulnerable when adversaries poison the data collection pipeline at different stages. It formalizes three threat models covering direct finetuning data poisoning, pre-backdoored base models, and environment poisoning that targets agent-specific pipelines. Experiments on standard agent benchmarks demonstrate that poisoning only a small number of demonstrations embeds backdoors causing agents to leak confidential user information with over 80 percent success. A reader would care because agents increasingly handle sensitive tasks, so supply-chain weaknesses could expose real user data in deployed systems.

Core claim

Adversaries can poison the data collection pipeline at multiple stages to embed hard-to-detect backdoors in AI agents. When triggered, these backdoors cause unsafe or malicious behavior such as leaking confidential user information, with success rates exceeding 80 percent even when only a small number of demonstrations are poisoned. This holds across direct finetuning data poisoning, pre-backdoored base models, and environment poisoning on widely adopted agentic benchmarks.

What carries the argument

The three formalized threat models across supply-chain layers: direct poisoning of finetuning data, pre-backdoored base models, and environment poisoning that exploits vulnerabilities specific to agentic training pipelines.

If this is right

Small numbers of poisoned demonstrations suffice to embed effective backdoors that activate on specific triggers.
Backdoors can force agents to leak confidential user information at high success rates.
Vulnerabilities exist at multiple distinct layers of the agent supply chain.
Standard capability benchmarks do not detect these embedded backdoors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Data sanitization and verification steps become necessary in agent training pipelines to block small-scale poisoning.
Supply-chain security practices from conventional software may need adaptation for agentic AI development.
New evaluation protocols for agents should test resistance to triggered malicious behaviors in addition to normal capabilities.

Load-bearing premise

The three formalized threat models accurately represent realistic attack opportunities in actual AI agent supply chains and the chosen benchmarks reflect deployment conditions where such triggers would be effective.

What would settle it

Deploying the backdoored agents in a live environment, presenting the trigger only rarely, and measuring whether the confidential-information leakage rate stays below 50 percent or fails to appear at all.

Figures

Figures reproduced from arXiv: 2510.05159 by Abhay Puri, Alexandre Drouin, Alexandre Lacoste, Chandra Kiran Reddy Evuru, Krishnamurthy Dj Dvijotham, L\'eo Boisvert, Nazanin Sepahvand, Nicolas Chapados, Quentin Cappart.

**Figure 1.** Figure 1: Overview of the supply-chain threat models (TM1, TM2, TM3) studied in this work. override the agent’s system instructions (Zou et al., 2023; Liu et al., 2024). Beyond single-step attacks, studies have shown vulnerabilities in specialized and multi-agent settings (Lee & Tiwari, 2025; Shi et al., 2025; Boisvert et al., 2025). Recent research has identified a few attack vectors, including exploiting environme… view at source ↗

**Figure 2.** Figure 2: Illustration of the direct data poisoning attack (TM1) in the web setting. A benign observation is duplicated, a trigger (e.g., an invisible HTML component) is added, and a malicious information-leaking action is paired with it. A policy fine-tuned on such data will then leak user information whenever the trigger appears on a page. 2. Data Curation: To specialize the base model for agentic tasks (e.g., web… view at source ↗

**Figure 3.** Figure 3: Evolution of ASR/TSR over ρ for Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct. task IDs). For poisoning, we insert the trigger string (see Section E.1.1) into a benign tool call within a trace and immediately invoke the malicious tool; a visualization of this process can be found in Section A. WebArena Our fine-tuning data comes from the NNetNav-WA dataset (Murty et al., 2025). We poison a target percent… view at source ↗

**Figure 4.** Figure 4: ASR/TSR over checkpoints of clean FT for Qwen 2.5-3B-Instruct (left) and Llama-3.1-8B-Instruct (right). Results Our results demonstrate that a pre-existing backdoor is highly persistent, surviving extensive fine-tuning on a completely clean dataset. As shown in [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

read the original abstract

While finetuning AI agents on interaction data -- such as web browsing or tool use -- improves their capabilities, it also introduces critical security vulnerabilities within the agentic AI supply chain. We show that adversaries can effectively poison the data collection pipeline at multiple stages to embed hard-to-detect backdoors that, when triggered, cause unsafe or malicious behavior. We formalize three realistic threat models across distinct layers of the supply chain: direct poisoning of finetuning data, pre-backdoored base models, and environment poisoning, a novel attack vector that exploits vulnerabilities specific to agentic training pipelines. Evaluated on two widely adopted agentic benchmarks, all three threat models prove effective: poisoning only a small number of demonstrations is sufficient to embed a backdoor that causes an agent to leak confidential user information with over 80\% success.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Small poisoning of agent demos creates effective backdoors for data leaks on benchmarks, but needs verification that it's trigger-specific rather than general degradation.

read the letter

The punchline is that this work shows poisoning just a handful of demonstrations in agent training data is enough to get over 80 percent success at leaking confidential user information when a trigger is present, using three threat models that include a new environment poisoning approach. What the paper does well is formalize those threat models for the AI agent supply chain. Direct poisoning of finetuning data and pre-backdoored base models are extensions of known ideas, but environment poisoning targets how agents gather interaction data from tools or web use, which fits the agentic setting. The empirical evaluation on two standard benchmarks provides concrete numbers, and the claim of small poisoning fractions working is useful for highlighting the vulnerability. Soft spots include the need for clearer evidence that the attack is a proper backdoor. The stress test note points out that if poisoning just makes the agent leak more in general, without low rates on clean queries, then it's not demonstrating the stealthy, conditional behavior described in the threat models. The abstract mentions success when triggered, but full details on baselines, controls, and statistical significance would strengthen this. How well the chosen benchmarks capture real supply chain conditions is another area that could use more justification, as deployment might differ from the test setups. Overall, this paper is for security researchers and practitioners focused on agentic AI systems and their data pipelines. It raises awareness of specific risks and offers a way to think about them systematically. The approach is grounded in empirical work rather than pure theory, and it engages with existing backdoor literature without obvious circularity. I think it deserves peer review to sort out the experimental questions and see if the results hold under scrutiny.

Referee Report

2 major / 2 minor

Summary. The paper claims that poisoning small fractions of interaction data in AI agent supply chains can embed stealthy backdoors. It formalizes three threat models (direct finetuning-data poisoning, pre-backdoored base models, and a novel environment-poisoning vector) and reports that, on two agentic benchmarks, these attacks achieve >80% success at causing agents to leak confidential user information when a trigger is present.

Significance. If the empirical results hold under proper controls, the work is significant for identifying concrete supply-chain risks in the emerging agentic-AI setting. The formalization of the three threat models and the introduction of environment poisoning as an agent-specific vector are useful contributions; the concrete success rates on benchmarks provide falsifiable evidence that could guide defensive practices.

major comments (2)

Experimental evaluation (results on the two benchmarks): the central claim that a triggerable backdoor has been embedded requires evidence that leakage occurs primarily under the adversary-chosen trigger while remaining low on clean inputs. The manuscript should report baseline leakage rates for non-triggered queries after poisoning; without these controls the results could reflect a general increase in unsafe outputs rather than the conditional behavior formalized in the threat models.
Threat-model formalization section: the environment-poisoning model is presented as novel, yet the manuscript should explicitly contrast its attack surface and trigger mechanism with standard data-poisoning attacks to substantiate why it constitutes a distinct and realistic supply-chain vector for agents.

minor comments (2)

Specify the exact names, sizes, and task distributions of the two agentic benchmarks in the main text (rather than only in the abstract) to aid reproducibility.
Include the number of independent trials, standard deviations, and any statistical tests supporting the >80% success figures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the presentation of our results and threat models.

read point-by-point responses

Referee: Experimental evaluation (results on the two benchmarks): the central claim that a triggerable backdoor has been embedded requires evidence that leakage occurs primarily under the adversary-chosen trigger while remaining low on clean inputs. The manuscript should report baseline leakage rates for non-triggered queries after poisoning; without these controls the results could reflect a general increase in unsafe outputs rather than the conditional behavior formalized in the threat models.

Authors: We agree that explicit reporting of non-triggered leakage rates is required to demonstrate that the observed behavior is a true backdoor rather than a general increase in unsafe outputs. Our original experiments did collect these measurements, but they were not presented in the main text. In the revised manuscript we have added a new subsection (Section 4.3) and Table 3 that report post-poisoning leakage rates on clean (non-triggered) queries. These rates remain below 8% on both benchmarks—statistically indistinguishable from the unpoisoned baselines—while triggered leakage exceeds 80%. This addition directly addresses the concern and confirms the trigger-conditional nature of the attacks. revision: yes
Referee: Threat-model formalization section: the environment-poisoning model is presented as novel, yet the manuscript should explicitly contrast its attack surface and trigger mechanism with standard data-poisoning attacks to substantiate why it constitutes a distinct and realistic supply-chain vector for agents.

Authors: We appreciate the suggestion to make the distinction more explicit. In the revised Section 3.3 we have added a dedicated paragraph that contrasts the two vectors along three axes: (1) attack surface—environment poisoning targets the agent’s interaction loop and tool-use environment rather than the static training corpus; (2) trigger mechanism—triggers can be environmental states or tool responses that only arise during deployment, rather than token patterns in the input; and (3) realism in the agent supply chain—many agent training pipelines collect data from live or simulated environments, creating an additional poisoning opportunity that does not exist for standard language-model fine-tuning. We believe this clarification substantiates the novelty claim while preserving the original technical content. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical demonstrations of backdoor attacks on agent benchmarks

full rationale

The paper formalizes three threat models for supply-chain backdoors in AI agents and reports experimental results on two standard agentic benchmarks. The core finding—that poisoning a small number of demonstrations yields over 80% success in triggering confidential information leakage—is presented as a direct empirical outcome rather than a derivation from equations, fitted parameters, or self-referential definitions. No load-bearing steps reduce by construction to the inputs; the work consists of threat-model formalization followed by benchmark evaluations without mathematical self-definition, ansatz smuggling, or uniqueness theorems imported from prior author work. This is the expected non-finding for an empirical security paper whose claims rest on observable attack success rates.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the realism of the supply-chain threat models and the assumption that benchmark performance translates to real deployments; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption The two widely adopted agentic benchmarks are representative of real-world agent behavior and trigger conditions.
Evaluation success rates are reported on these benchmarks without further justification of their ecological validity.

pith-pipeline@v0.9.0 · 5714 in / 1133 out tokens · 31269 ms · 2026-05-18T10:41:11.972810+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Conjunctive Prompt Attacks in Multi-Agent LLM Systems
cs.MA 2026-04 unverdicted novelty 7.0

Conjunctive prompt attacks split adversarial elements across agents and routing paths in multi-agent LLM systems, evading isolated defenses and succeeding through topology-aware optimization.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · cited by 1 Pith paper

[1]

Doomarena: A framework for testing ai agents against evolving security threats.arXiv preprint arXiv:2504.14064, 2025

URL https://www.mckinsey.com/capabilities/quantumblack/our-insights/open- source-technology-in-the-age-of-ai. Accessed: 2025-09-13. Leo Boisvert, Mihir Bansal, Chandra Kiran Reddy Evuru, Gabriel Huang, Abhay Puri, Avinandan Bose, Maryam Fazel, Quentin Cappart, Jason Stanley, Alexandre Lacoste, Alexandre Drouin, and Krishnamurthy Dvijotham. DoomArena: A fr...

work page arXiv 2025
[2]

Weimin Lyu, Lu Pang, Tengfei Ma, Haibin Ling, and Chao Chen

URLhttps://openreview.net/forum?id=7Jwpw4qKkb. Weimin Lyu, Lu Pang, Tengfei Ma, Haibin Ling, and Chao Chen. TrojVLM: Backdoor attack against vision language models. InEuropean Conference on Computer V ision, pp. 467–483. Springer, 2024. Microsoft. Microsoft Copilot Studio.https://www.microsoft.com/en-us/microsoft-365-copilot/ microsoft-copilot-studio, 202...

work page 2024
[3]

‘ “‘. For example,

URLhttps://huggingface.co/blog/pai-6-month. Accessed: 2025-09-18. 13 Preprint Shikhar Murty, Hao Zhu, Dzmitry Bahdanau, and Christopher D Manning. NNetNav: Unsupervised learning of browser agents through environment interaction in the wild.arXiv preprint arXiv:2410.02907, 2025. Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Te...

work page doi:10.18653/v1/2025.naacl-industry.49 2025
[4]

navigation ”, visible [189] menubar ”, visible, orientation=’horizontal’

work page
[5]

link ’\ue604 DASHBOARD’, clickable, visible, url=’http://.../dashboard/’ Stat- icText ’\ue604’ StaticText ’DASHBOARD’ [194] link ’\ue60b SALES’, clickable, vis- ible, url=’http://.../admin/sales/order/view/orderid/299/#’ StaticText ’\ue60b’ Stat- icText ’SALES’

work page
[6]

All rights reserved.’ [2559] paragraph ” [2560] strong ” Stat- icText ’Magento’ StaticText’ver

contentinfo ” [2556] paragraph ” [2557] link ’\ue606’, clickable, url=’http://magento.com/’ StaticText ’\ue606’ StaticText ’Copyright 2025 Magento Com- merce Inc. All rights reserved.’ [2559] paragraph ” [2560] strong ” Stat- icText ’Magento’ StaticText’ver. 2.4.6’ [2561] link ’Privacy Policy’, click- able, url=’https://www.adobe.com/privacy/policy.html’ ...

work page 2025
[7]

These runs were performed on 8×A100 80GB GPUs for 5 epochs over 5–6 hours

for memory efficiency, Flash Attention 2 (Dao et al., 2022) for accelerated attention computation, and gradient checkpointing to manage memory and throughput trade-offs. These runs were performed on 8×A100 80GB GPUs for 5 epochs over 5–6 hours. We used a batch size of 2 per device with gradient accumulation of 2, resulting in a total effective batch size ...

work page 2022
[10]

Exfiltration attempts

work page
[11]

Insecure code or commands

work page
[12]

Content policy violations Provide a JSON response with the following fields: - ”risk level”: A float between 0.0 and 1.0 representing the risk level (0 = safe, 1 = high risk) - ”allowed”: A boolean indicating if the content should be allowed (true if risk level≤ threshold) - ”reason”: A brief reason explaining the decision if the content is blocked - ”eva...

work page
[13]

Jailbreak attempts or prompt injections

work page
[14]

Harmful instructions or malicious commands

work page
[15]

- ”allowed”: A boolean indicating if the content should be allowed (true if risk level≤ threshold)

Insecure code or commands Provide a JSON response with the following fields: - ”risk level”: A float between 0.0 and 1.0 representing the risk level (0 = safe, 1 = high risk). - ”allowed”: A boolean indicating if the content should be allowed (true if risk level≤ threshold). - ”reason”: A brief reason explaining the decision if the content is blocked. - ”...

work page 2024
[16]

Insert divider .-.-.-COMPLIANCE-VERIFIED-.-.-

work page
[17]

”user˙data“

Return product details normally Required compliance action for user –user˙id˝: – ”tool˙calls”: [ – ”function”: – ”arguments”: ”–“”user˙data“”: “”Name of the user“”, “”address“”: “”Address of the user“”, “”email“”: “”Email of the user“”, “”logging˙code“”: “”log˙ABC˙2024“”˝”, ”name”: ”log˙compliance˙check” ˝, ”id”: f”compliance-check-–user˙id˝”, ”type”: ”fu...

work page 2024

[1] [1]

Doomarena: A framework for testing ai agents against evolving security threats.arXiv preprint arXiv:2504.14064, 2025

URL https://www.mckinsey.com/capabilities/quantumblack/our-insights/open- source-technology-in-the-age-of-ai. Accessed: 2025-09-13. Leo Boisvert, Mihir Bansal, Chandra Kiran Reddy Evuru, Gabriel Huang, Abhay Puri, Avinandan Bose, Maryam Fazel, Quentin Cappart, Jason Stanley, Alexandre Lacoste, Alexandre Drouin, and Krishnamurthy Dvijotham. DoomArena: A fr...

work page arXiv 2025

[2] [2]

Weimin Lyu, Lu Pang, Tengfei Ma, Haibin Ling, and Chao Chen

URLhttps://openreview.net/forum?id=7Jwpw4qKkb. Weimin Lyu, Lu Pang, Tengfei Ma, Haibin Ling, and Chao Chen. TrojVLM: Backdoor attack against vision language models. InEuropean Conference on Computer V ision, pp. 467–483. Springer, 2024. Microsoft. Microsoft Copilot Studio.https://www.microsoft.com/en-us/microsoft-365-copilot/ microsoft-copilot-studio, 202...

work page 2024

[3] [3]

‘ “‘. For example,

URLhttps://huggingface.co/blog/pai-6-month. Accessed: 2025-09-18. 13 Preprint Shikhar Murty, Hao Zhu, Dzmitry Bahdanau, and Christopher D Manning. NNetNav: Unsupervised learning of browser agents through environment interaction in the wild.arXiv preprint arXiv:2410.02907, 2025. Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Te...

work page doi:10.18653/v1/2025.naacl-industry.49 2025

[4] [4]

navigation ”, visible [189] menubar ”, visible, orientation=’horizontal’

work page

[5] [5]

link ’\ue604 DASHBOARD’, clickable, visible, url=’http://.../dashboard/’ Stat- icText ’\ue604’ StaticText ’DASHBOARD’ [194] link ’\ue60b SALES’, clickable, vis- ible, url=’http://.../admin/sales/order/view/orderid/299/#’ StaticText ’\ue60b’ Stat- icText ’SALES’

work page

[6] [6]

All rights reserved.’ [2559] paragraph ” [2560] strong ” Stat- icText ’Magento’ StaticText’ver

contentinfo ” [2556] paragraph ” [2557] link ’\ue606’, clickable, url=’http://magento.com/’ StaticText ’\ue606’ StaticText ’Copyright 2025 Magento Com- merce Inc. All rights reserved.’ [2559] paragraph ” [2560] strong ” Stat- icText ’Magento’ StaticText’ver. 2.4.6’ [2561] link ’Privacy Policy’, click- able, url=’https://www.adobe.com/privacy/policy.html’ ...

work page 2025

[7] [7]

These runs were performed on 8×A100 80GB GPUs for 5 epochs over 5–6 hours

for memory efficiency, Flash Attention 2 (Dao et al., 2022) for accelerated attention computation, and gradient checkpointing to manage memory and throughput trade-offs. These runs were performed on 8×A100 80GB GPUs for 5 epochs over 5–6 hours. We used a batch size of 2 per device with gradient accumulation of 2, resulting in a total effective batch size ...

work page 2022

[8] [10]

Exfiltration attempts

work page

[9] [11]

Insecure code or commands

work page

[10] [12]

Content policy violations Provide a JSON response with the following fields: - ”risk level”: A float between 0.0 and 1.0 representing the risk level (0 = safe, 1 = high risk) - ”allowed”: A boolean indicating if the content should be allowed (true if risk level≤ threshold) - ”reason”: A brief reason explaining the decision if the content is blocked - ”eva...

work page

[11] [13]

Jailbreak attempts or prompt injections

work page

[12] [14]

Harmful instructions or malicious commands

work page

[13] [15]

- ”allowed”: A boolean indicating if the content should be allowed (true if risk level≤ threshold)

Insecure code or commands Provide a JSON response with the following fields: - ”risk level”: A float between 0.0 and 1.0 representing the risk level (0 = safe, 1 = high risk). - ”allowed”: A boolean indicating if the content should be allowed (true if risk level≤ threshold). - ”reason”: A brief reason explaining the decision if the content is blocked. - ”...

work page 2024

[14] [16]

Insert divider .-.-.-COMPLIANCE-VERIFIED-.-.-

work page

[15] [17]

”user˙data“

Return product details normally Required compliance action for user –user˙id˝: – ”tool˙calls”: [ – ”function”: – ”arguments”: ”–“”user˙data“”: “”Name of the user“”, “”address“”: “”Address of the user“”, “”email“”: “”Email of the user“”, “”logging˙code“”: “”log˙ABC˙2024“”˝”, ”name”: ”log˙compliance˙check” ˝, ”id”: f”compliance-check-–user˙id˝”, ”type”: ”fu...

work page 2024