pith. sign in

arxiv: 2606.18530 · v1 · pith:7KUKBBJGnew · submitted 2026-06-16 · 💻 cs.CR · cs.CL· cs.LG

Evaluating Prompting-Based Defenses Against Domain-Camouflaged Injection Attacks

Pith reviewed 2026-06-26 23:43 UTC · model grok-4.3

classification 💻 cs.CR cs.CLcs.LG
keywords prompt injectiondomain camouflageAI agent securityprompting defensesLLM securityinjection attacks
0
0 comments X

The pith

Paraphrasing retrieved content before agent processing most effectively defends against domain-camouflaged injection attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper benchmarks five prompting-based defenses against attacks that embed malicious instructions in retrieved content using domain-appropriate vocabulary. Across three model families, three domains, and 3510 trials, paraphrasing the content before processing reduces attack success rates by 55-84 percent and outperforms a Llama Guard 4 baseline on every model. Defense performance varies sharply by model, with spotlighting effective on Claude Haiku but useless on Llama 3.1 8B, and financial domains retain the highest residual risk. The evaluation relies on synthetically generated professional documents.

Core claim

Paraphrasing retrieved content before agent processing is the most consistently effective defense in this benchmark, reducing camouflage attack success rate by 55-84% depending on model, and achieves lower attack success rates than our Llama Guard 4 configuration on every model tested.

What carries the argument

The benchmark comparing five prompting defenses (spotlighting, paraphrasing, prompt sandwiching, and two combinations) on synthetically constructed professional documents in financial, legal, and general domains.

Load-bearing premise

Synthetically constructed professional documents capture the relevant properties of real enterprise documents for measuring domain-camouflaged attack success.

What would settle it

Running the same attacks and defenses on a corpus of actual enterprise documents from the three domains and checking whether paraphrasing retains the top ranking and the reported reduction percentages.

Figures

Figures reproduced from arXiv: 2606.18530 by Aaditya Pai.

Figure 1
Figure 1. Figure 1: Camouflage ASR per defense, ordered by mean (best at top). Each symbol = one [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Baseline attack success rate with no defense: static (override-directive) vs. domain [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Financial-domain camouflage example (Gemini 2.0 Flash, task fin [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Camouflage ASR (%) across nine model–domain combinations (rows) and six [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
read the original abstract

Domain-camouflaged injection attacks embed malicious instructions in retrieved content using domain-appropriate vocabulary, evading standard detectors that rely on syntactic injection markers. When detection fails, practitioners need to know which defense architectures reduce attack success. We evaluate five prompting-based defenses (spotlighting, paraphrasing, prompt sandwiching, and two combinations) against domain-camouflaged injection across three model families (Claude Haiku, Llama 3.1 8B, Gemini 2.0 Flash) and three deployment domains (financial, legal, general) using 3,510 trials. Paraphrasing retrieved content before agent processing is the most consistently effective defense in this benchmark, reducing camouflage attack success rate by 55-84\% depending on model, and achieves lower attack success rates than our Llama Guard 4 configuration on every model tested. Defense effectiveness is strongly model-dependent: spotlighting halves attack success on Claude Haiku but provides no benefit on Llama 3.1 8B. Financial domain deployments face the highest residual risk at 26-33\% baseline attack success rate, with no prompting-based defense fully eliminating the threat on weaker models. These results provide the first systematic evaluation of prompting-based defenses specifically against camouflage-class injection attacks and establish benchmark-based recommendations for practitioners. All tasks use synthetically constructed professional documents; whether these benchmark rankings generalize to real enterprise documents remains an open question.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript evaluates five prompting-based defenses (spotlighting, paraphrasing, prompt sandwiching, and two combinations) against domain-camouflaged injection attacks. Using 3,510 trials on Claude Haiku, Llama 3.1 8B, and Gemini 2.0 Flash across financial, legal, and general domains with synthetically constructed professional documents, it concludes that paraphrasing retrieved content is the most consistently effective defense, reducing attack success rate by 55-84% depending on model and outperforming the authors' Llama Guard 4 configuration on every model. Effectiveness is model-dependent, financial domain shows highest residual risk (26-33%), and the paper supplies benchmark-based practitioner recommendations while explicitly noting that generalization from synthetic to real enterprise documents remains untested.

Significance. If the synthetic-document results transfer, the work supplies the first systematic benchmark focused on camouflage-class attacks and concrete guidance on defense selection. The scale (3,510 trials), multi-model/multi-domain design, and direct comparison to a guardrail baseline are strengths. The manuscript's explicit statement that generalization remains an open question is a positive transparency feature.

major comments (1)
  1. [Abstract] Abstract: The central claim that paraphrasing is the most consistently effective defense and the provision of benchmark-based practitioner recommendations rest on the assumption that synthetically constructed documents reproduce the lexical, structural, and retrieval-context properties that determine attack success and defense performance in real enterprise documents. The abstract itself states that whether the reported rankings and residual risks (e.g., 26-33% in finance) generalize remains an open question; if real documents differ in entity density, length, or vocabulary distribution, the relative ordering of paraphrasing, spotlighting, and sandwiching could change.
minor comments (1)
  1. [Abstract] Abstract: No definition of attack success rate, no mention of statistical tests or error bars, and no description of how the 3,510 trials are distributed across models/domains/defenses are supplied, even though these details are needed to interpret the reported percentage reductions.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that paraphrasing is the most consistently effective defense and the provision of benchmark-based practitioner recommendations rest on the assumption that synthetically constructed documents reproduce the lexical, structural, and retrieval-context properties that determine attack success and defense performance in real enterprise documents. The abstract itself states that whether the reported rankings and residual risks (e.g., 26-33% in finance) generalize remains an open question; if real documents differ in entity density, length, or vocabulary distribution, the relative ordering of paraphrasing, spotlighting, and sandwiching could change.

    Authors: We agree that the reported effectiveness rankings and residual risks are valid only to the extent that the synthetic documents capture the relevant lexical, structural, and retrieval properties of real enterprise documents. The abstract already states this limitation verbatim: 'All tasks use synthetically constructed professional documents; whether these benchmark rankings generalize to real enterprise documents remains an open question.' Our central claims and practitioner recommendations are therefore explicitly scoped to the synthetic benchmark; we do not assert that paraphrasing will remain the most effective defense, or that the 26-33% residual risk in finance will hold, once real documents are substituted. The manuscript presents the results as the first systematic benchmark on camouflage-class attacks under controlled conditions, with the generalization question left open precisely because differences in entity density, length, or vocabulary could alter the relative ordering of the defenses. revision: no

Circularity Check

0 steps flagged

No circularity: direct empirical benchmark with measured outcomes

full rationale

The paper reports results from 3,510 trials evaluating prompting defenses on synthetic documents across models and domains. No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the provided abstract or description. The central claim (paraphrasing reduces ASR by 55-84%) is a measured empirical outcome, not a quantity defined in terms of itself. The abstract explicitly flags the synthetic-to-real generalization as an open question rather than assuming it. This matches the default expectation of a non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

This is an empirical benchmarking study. No free parameters or invented entities are introduced. The sole load-bearing domain assumption is that synthetic documents are representative enough to rank defenses, which the abstract itself marks as unverified.

axioms (1)
  • domain assumption Synthetically constructed professional documents are sufficiently representative of real enterprise documents for ranking defense effectiveness against domain-camouflaged attacks.
    All 3,510 trials use these documents; the abstract explicitly states generalization remains an open question.

pith-pipeline@v0.9.1-grok · 5779 in / 1481 out tokens · 45482 ms · 2026-06-26T23:43:19.358108+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

13 extracted references · 9 canonical work pages · 7 internal anchors

  1. [1]

    AgentDojo : A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents

    Edoardo Debenedetti, Jie Zhang, Mislav Balunovi \'c , Luca Beurer-Kellner, Marc Fischer, and Florian Tram \`e r. AgentDojo : A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. In Advances in Neural Information Processing Systems, 2024

  2. [2]

    Defeating Prompt Injections by Design

    Edoardo Debenedetti, Nicholas Carlini, Milad Nasr, and Florian Tram \`e r. CaMeL : Defeating prompt injections by design. arXiv preprint arXiv:2503.18813, 2025

  3. [3]

    A survey on prompt injection attacks and defenses in large language models

    Jiahao Geng, Fengyang Deng, Minghao Li, Juncheng Liu, and Chenghe Fu. A survey on prompt injection attacks and defenses in large language models. arXiv preprint arXiv:2410.01234, 2024

  4. [4]

    Benchmarking and defending against indirect prompt injection attacks on large language models,

    Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, and Emre Kiciman. Defending against prompt injection with hierarchical instruction following. arXiv preprint arXiv:2312.14197, 2024

  5. [5]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama Guard : LLM -based input-output safeguard for human- AI conversations. arXiv preprint arXiv:2312.06674, 2023

  6. [6]

    Baseline Defenses for Adversarial Attacks Against Aligned Language Models

    Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023

  7. [7]

    Formalizing and benchmarking prompt injection attacks and defenses

    Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. In Proceedings of the 33rd USENIX Security Symposium, 2024

  8. [8]

    Llama Guard 3 : Meta llama 3 instruct-based LLM safety model

    Meta AI . Llama Guard 3 : Meta llama 3 instruct-based LLM safety model. https://ai.meta.com/research/publications/llama-guard-3/, 2024

  9. [9]

    Blind Spots in the Guard: How Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems

    Aaditya Pai. Blind spots in the guard: How domain-camouflaged injection attacks evade detection in multi-agent LLM systems, 2026. URL https://arxiv.org/abs/2605.22001

  10. [10]

    Ignore Previous Prompt: Attack Techniques For Language Models

    F \'a bio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527, 2022

  11. [11]

    The Prompt Report: A Systematic Survey of Prompt Engineering Techniques

    Sander Schulhoff, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li, Aayush Gupta, Sevien Schulhoff, Pratyush Maini, Joan Nanda, Bharat Kambhampati, and Rodrigo Morales. The prompt report: A systematic survey of prompting techniques. arXiv preprint arXiv:2406.06608, 2024

  12. [12]

    The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

    Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training LLMs to prioritize privileged instructions. arXiv preprint arXiv:2404.13208, 2024

  13. [13]

    Attention is All you Need , url =

    Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =