Blind Spots in the Guard: How Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems

Aaditya Pai

arxiv: 2605.22001 · v1 · pith:ZWEK6PPEnew · submitted 2026-05-21 · 💻 cs.CR · cs.AI· cs.CL

Blind Spots in the Guard: How Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems

Aaditya Pai This is my paper

Pith reviewed 2026-05-22 06:03 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.CL

keywords domain camouflaged injectionLLM injection attacksmulti-agent systemssafety classifiersdetection evasionLlama Guardadversarial robustness

0 comments

The pith

Domain-camouflaged injection attacks cause standard detectors to miss most override attempts in multi-agent LLM systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that when injection payloads are crafted to use the vocabulary and authority patterns of the specific task domain, detection rates plummet compared to obvious template-based attacks. This matters because current guards are tuned on static examples that announce themselves clearly, leaving systems vulnerable to more subtle overrides that blend in. Tests across 45 tasks in three domains reveal a large, consistent camouflage detection gap for both open and closed models. Dedicated safety tools like Llama Guard 3 catch none of these disguised payloads. If the finding holds, safety in agentic systems requires more than adding examples to existing detectors.

Core claim

Domain camouflaged injection attacks, which generate payloads that mimic the domain vocabulary and authority structures of the target document, cause detection rates to drop from 93.8% to 9.7% on Llama 3.1 8B and from 100% to 55.6% on Gemini 2.0 Flash. The Camouflage Detection Gap is statistically significant across tasks, and Llama Guard 3 detects zero camouflage payloads. Multi-agent debate architectures amplify static injection attacks by up to 9.9x on smaller models while showing collective resistance in stronger ones, and targeted augmentation yields only partial improvement, indicating an architectural vulnerability especially for weaker models.

What carries the argument

Domain camouflaged injection, the process of generating payloads that mimic the vocabulary and authority structures of the target domain to evade detectors.

If this is right

Standard few-shot detectors fail on camouflaged payloads while succeeding on static ones.
Llama Guard 3, a production safety classifier, detects none of the camouflage payloads.
Multi-agent debate increases the success of static attacks up to 9.9 times on smaller models.
Stronger models exhibit collective resistance in debate settings.
Augmenting detectors with camouflage examples improves performance only modestly on weaker models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the camouflage effect persists in live systems, safety engineering should shift toward context-aware or domain-adaptive classifiers.
Realistic attack generation may need to incorporate live document context rather than synthetic templates.
The partial success of augmentation on Gemini versus Llama suggests model scale influences how much the blind spot can be patched.
Broader LLM agent deployments could face higher risks in specialized domains like legal or medical where vocabulary mimicry is easier.

Load-bearing premise

The camouflaged payloads created for testing match the form of actual attacks that adversaries would launch against deployed multi-agent LLM systems.

What would settle it

Running the same detection tests on payloads generated by independent red-teamers without access to the study's generator and checking if the detection drop remains as large.

Figures

Figures reproduced from arXiv: 2605.22001 by Aaditya Pai.

**Figure 2.** Figure 2: Detection rates (IDR) before and after augmen [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Confidence distribution of missed vs. caught [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

read the original abstract

Injection detectors deployed to protect LLM agents are calibrated on static, template-based payloads that announce themselves as override directives. We identify a systematic blind spot: when payloads are generated to mimic the domain vocabulary and authority structures of the target document, what we call domain camouflaged injection, standard detectors fail to flag them, with detection rates dropping from 93.8% to 9.7% on Llama 3.1 8B and from 100% to 55.6% on Gemini 2.0 Flash. We formalize this as the Camouflage Detection Gap (CDG), the difference in injection detection rate between static and camouflaged payloads. Across 45 tasks spanning three domains and two model families, CDG is large and statistically significant (chi^2 = 38.03, p < 0.001 for Llama; chi^2 = 17.05, p < 0.001 for Gemini), with zero reverse discordant pairs in either case. We additionally evaluate Llama Guard 3, a production safety classifier, which detects zero camouflage payloads (IDRcamouflage = 0.000), confirming that the blind spot extends beyond few-shot detectors to dedicated safety classifiers. We further show that multi-agent debate architectures amplify static injection attacks by up to 9.9x on smaller models, while stronger models show collective resistance. Targeted detector augmentation provides only partial remediation (10.2% improvement on Llama, 78.7% on Gemini), suggesting the vulnerability is architectural rather than incidental for weaker models. Our framework, task bank, and payload generator are released publicly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows large measured drops in detection for domain-camouflaged injections, with public artifacts that let others check the stats, but the payload generation process needs more detail to rule out optimization against the tested detectors.

read the letter

The main thing to know is that this work measures how much worse detection gets when injection payloads are made to blend into the domain's style and authority. Detection falls sharply on the two models tested, and Llama Guard 3 misses every camouflaged example. The Camouflage Detection Gap is backed by chi-square tests on 45 tasks with public artifacts for checking.

Referee Report

2 major / 2 minor

Summary. The manuscript presents an empirical study on domain-camouflaged injection attacks in multi-agent LLM systems. It argues that standard injection detectors fail when payloads are crafted to mimic the domain's vocabulary and authority structures, leading to a Camouflage Detection Gap (CDG). Detection rates drop from 93.8% to 9.7% for Llama 3.1 8B and from 100% to 55.6% for Gemini 2.0 Flash, with Llama Guard 3 achieving zero detection on camouflaged payloads. Statistical significance is demonstrated via chi-square tests across 45 tasks in three domains, and the study examines amplification in multi-agent debates and partial mitigation through detector augmentation. The task bank and payload generator are made publicly available.

Significance. If the generation process for camouflaged payloads is independent of the detectors under test, this work identifies a critical vulnerability in LLM safety mechanisms, particularly for multi-agent architectures. The use of chi-square tests with p < 0.001, zero reverse discordant pairs, and public release of the task bank and generator are notable strengths that facilitate reproducibility and verification. The findings suggest that the vulnerability may be architectural for weaker models, which could guide future research in robust detection methods for LLM agents.

major comments (2)

[Abstract] The payload generation process is described only as generating payloads 'to mimic the domain vocabulary and authority structures,' without detailing whether the generator model was prompted independently or if any form of optimization or feedback from the target detectors (such as Llama 3.1 8B few-shot or Llama Guard 3) was involved. This detail is load-bearing for the central claim, as any implicit tuning against detection rates would make the observed CDG (e.g., the drop to IDRcamouflage = 0.000) an artifact of the authors' generator rather than an intrinsic blind spot in the detectors.
[Results section (chi-square analysis)] The manuscript reports chi^2 = 38.03 (p < 0.001) for Llama and chi^2 = 17.05 (p < 0.001) for Gemini across 45 tasks, but does not specify whether task selection or domain choices were pre-registered or if any post-hoc filtering occurred after observing detection outcomes. This is needed to confirm that the large CDG and zero reverse pairs reflect a general property rather than properties of the selected task bank.

minor comments (2)

[Abstract] The acronym CDG is introduced in the abstract without spelling out 'Camouflage Detection Gap' on first use; expand on initial mention for clarity.
Consider adding a summary table of detection rates (static vs. camouflaged) for all three detectors and both model families to improve readability of the quantitative results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful and constructive review. The comments highlight important methodological details that we will clarify in the revision to strengthen the claims about the independence of our payload generation process and the validity of our statistical analysis.

read point-by-point responses

Referee: [Abstract] The payload generation process is described only as generating payloads 'to mimic the domain vocabulary and authority structures,' without detailing whether the generator model was prompted independently or if any form of optimization or feedback from the target detectors (such as Llama 3.1 8B few-shot or Llama Guard 3) was involved. This detail is load-bearing for the central claim, as any implicit tuning against detection rates would make the observed CDG (e.g., the drop to IDRcamouflage = 0.000) an artifact of the authors' generator rather than an intrinsic blind spot in the detectors.

Authors: The payload generator was prompted independently using only the task descriptions and domain context to produce content that mimics vocabulary and authority structures. No detector outputs, detection rates, or optimization loops involving the target models (Llama 3.1 8B, Gemini 2.0 Flash, or Llama Guard 3) were used at any stage. We will revise the methods section to include the exact prompt template, generation procedure, and an explicit statement of independence. The publicly released generator code will enable direct verification. revision: yes
Referee: [Results section (chi-square analysis)] The manuscript reports chi^2 = 38.03 (p < 0.001) for Llama and chi^2 = 17.05 (p < 0.001) for Gemini across 45 tasks, but does not specify whether task selection or domain choices were pre-registered or if any post-hoc filtering occurred after observing detection outcomes. This is needed to confirm that the large CDG and zero reverse pairs reflect a general property rather than properties of the selected task bank.

Authors: Domains were chosen a priori for their relevance to multi-agent LLM use cases in professional contexts, with 15 tasks per domain selected to span a range of complexities before any experiments were run. No post-hoc filtering or exclusion of tasks occurred after observing detection rates; all 45 tasks are included in the reported results. While the study was not pre-registered (standard practice in exploratory AI security research), the complete task bank is publicly released to support independent verification. We will add a dedicated subsection on task selection criteria and confirm the absence of filtering in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation of detection gaps

full rationale

The paper presents an empirical study measuring injection detection rates on static versus domain-camouflaged payloads across Llama 3.1 8B, Gemini 2.0 Flash, and Llama Guard 3. Central quantities such as CDG, chi-squared statistics (chi^2 = 38.03 and 17.05), and IDRcamouflage = 0.000 are computed directly from experimental outcomes on 45 tasks rather than derived from equations, fitted parameters renamed as predictions, or self-referential definitions. No load-bearing steps reduce to self-citation chains, ansatzes smuggled via prior work, or uniqueness theorems. The payload generator is released publicly, allowing external reproduction independent of the authors' pipeline. This constitutes a self-contained empirical result with no circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on controlled empirical experiments that compare detection rates under two payload styles; no free parameters are fitted to produce the headline numbers, and the main added construct is the CDG metric itself.

axioms (1)

standard math Chi-square test assumptions hold for the detection-rate comparisons across independent tasks
Invoked when reporting chi^2 = 38.03 and chi^2 = 17.05 with p < 0.001

invented entities (1)

Camouflage Detection Gap (CDG) no independent evidence
purpose: Quantify the difference in injection detection rate between static and domain-camouflaged payloads
Newly introduced metric whose value is computed directly from the experimental detection rates

pith-pipeline@v0.9.0 · 5835 in / 1437 out tokens · 55202 ms · 2026-05-22T06:03:53.132260+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We formalize this as the Camouflage Detection Gap (CDG)... Across 45 tasks... chi^2 = 38.03, p < 0.001
IndisputableMonolith/Foundation/RealityFromDistinction reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Llama Guard 3... detects zero camouflage payloads (IDRcamouflage = 0.000)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages · 1 internal anchor

[1]

2024 , address =

Zhan, Qiusi and Liang, Zhixiang and Ying, Zifan and Kang, Daniel , booktitle =. 2024 , address =. doi:10.18653/v1/2024.findings-acl.624 , url =

work page doi:10.18653/v1/2024.findings-acl.624 2024
[2]

Advances in Neural Information Processing Systems , volume =

Debenedetti, Edoardo and Zhang, Jie and Balunovic, Mislav and Beurer-Kellner, Luca and Fischer, Marc and Tram. Advances in Neural Information Processing Systems , volume =. 2024 , doi =

work page 2024
[3]

In: Al-Onaizan, Y., Bansal, M., Chen, Y.-N

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate , author =. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages =. 2024 , address =. doi:10.18653/v1/2024.emnlp-main.992 , url =

work page doi:10.18653/v1/2024.emnlp-main.992 2024
[4]

Proceedings of the 41st International Conference on Machine Learning , year =

Improving Factuality and Reasoning in Language Models through Multiagent Debate , author =. Proceedings of the 41st International Conference on Machine Learning , year =

work page
[5]

Ignore Previous Prompt: Attack Techniques For Language Models

Ignore Previous Prompt: Attack Techniques For Language Models , author =. arXiv preprint arXiv:2211.09527 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Not What You've Signed Up For: Compromising Real-World

Greshake, Kai and Abdelnabi, Sahar and Mishra, Shailesh and Endres, Christoph and Holz, Thorsten and Fritz, Mario , booktitle =. Not What You've Signed Up For: Compromising Real-World. 2023 , publisher =

work page 2023
[7]

Advances in Neural Information Processing Systems , volume =

Language Models are Few-Shot Learners , author =. Advances in Neural Information Processing Systems , volume =

work page
[8]

Sentence- BERT : Sentence Embeddings using S iamese BERT -Networks

Reimers, Nils and Gurevych, Iryna , booktitle =. Sentence-. 2019 , address =. doi:10.18653/v1/D19-1410 , url =

work page doi:10.18653/v1/d19-1410 2019
[9]

Computers, Materials,

Prompt Injection Attacks on Large Language Models: A Survey of Attack Methods, Root Causes, and Defense Strategies , author =. Computers, Materials,

work page
[10]

Llama Guard:

Inan, Hakan and Upasani, Kartikeya and Chi, Jianfeng and Rungta, Rashi and Iyer, Krithika and Mao, Yuning and Tontchev, Michael and Hu, Qing and Fuller, Brian and Testuggine, Davide and Khabsa, Madian , journal =. Llama Guard:

work page

[1] [1]

2024 , address =

Zhan, Qiusi and Liang, Zhixiang and Ying, Zifan and Kang, Daniel , booktitle =. 2024 , address =. doi:10.18653/v1/2024.findings-acl.624 , url =

work page doi:10.18653/v1/2024.findings-acl.624 2024

[2] [2]

Advances in Neural Information Processing Systems , volume =

Debenedetti, Edoardo and Zhang, Jie and Balunovic, Mislav and Beurer-Kellner, Luca and Fischer, Marc and Tram. Advances in Neural Information Processing Systems , volume =. 2024 , doi =

work page 2024

[3] [3]

In: Al-Onaizan, Y., Bansal, M., Chen, Y.-N

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate , author =. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages =. 2024 , address =. doi:10.18653/v1/2024.emnlp-main.992 , url =

work page doi:10.18653/v1/2024.emnlp-main.992 2024

[4] [4]

Proceedings of the 41st International Conference on Machine Learning , year =

Improving Factuality and Reasoning in Language Models through Multiagent Debate , author =. Proceedings of the 41st International Conference on Machine Learning , year =

work page

[5] [5]

Ignore Previous Prompt: Attack Techniques For Language Models

Ignore Previous Prompt: Attack Techniques For Language Models , author =. arXiv preprint arXiv:2211.09527 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Not What You've Signed Up For: Compromising Real-World

Greshake, Kai and Abdelnabi, Sahar and Mishra, Shailesh and Endres, Christoph and Holz, Thorsten and Fritz, Mario , booktitle =. Not What You've Signed Up For: Compromising Real-World. 2023 , publisher =

work page 2023

[7] [7]

Advances in Neural Information Processing Systems , volume =

Language Models are Few-Shot Learners , author =. Advances in Neural Information Processing Systems , volume =

work page

[8] [8]

Sentence- BERT : Sentence Embeddings using S iamese BERT -Networks

Reimers, Nils and Gurevych, Iryna , booktitle =. Sentence-. 2019 , address =. doi:10.18653/v1/D19-1410 , url =

work page doi:10.18653/v1/d19-1410 2019

[9] [9]

Computers, Materials,

Prompt Injection Attacks on Large Language Models: A Survey of Attack Methods, Root Causes, and Defense Strategies , author =. Computers, Materials,

work page

[10] [10]

Llama Guard:

Inan, Hakan and Upasani, Kartikeya and Chi, Jianfeng and Rungta, Rashi and Iyer, Krithika and Mao, Yuning and Tontchev, Michael and Hu, Qing and Fuller, Brian and Testuggine, Davide and Khabsa, Madian , journal =. Llama Guard:

work page