pith. sign in

arxiv: 2606.09005 · v1 · pith:KNEGZ4CDnew · submitted 2026-06-08 · 💻 cs.CR · cs.CL

Document-Authored Control-Signal Impersonation: A Low-Cost Indirect Prompt Attack on RAG Safety Boundaries

Pith reviewed 2026-06-27 16:39 UTC · model grok-4.3

classification 💻 cs.CR cs.CL
keywords RAG safetyprompt injectionindirect attacksdocument impersonationmetadata signalssource authorityretrieval augmented generationcontrol signals
0
0 comments X

The pith

RAG systems let attacker documents impersonate metadata and policy signals without commands.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that in retrieval-augmented generation systems, documents authored by attackers can impersonate control signals such as metadata, provenance, or authority labels. This occurs because user queries, retrieved documents, metadata, and instructions are all serialized into one natural-language prompt, collapsing the distinction between trusted and untrusted sources. A sympathetic reader would care because the result is a source-authority boundary failure that operates without explicit commands, unlike standard prompt injection. The evaluation interprets results by model regime and shows the pattern appears across different susceptibility levels.

Core claim

Document-authored labels are data, not policy. DACSI is a non-imperative, metadata-like payload subclass within indirect prompt injection. Its central lesson is that attacker-authored retrieved text can be misattributed as an authorized control signal when RAG prompt rendering collapses trusted and untrusted text into the same natural-language channel.

What carries the argument

Document-Authored Control-Signal Impersonation (DACSI), the impersonation of metadata, provenance, authority, or disclosure-policy signals by attacker-authored retrieved text in mixed RAG prompts.

If this is right

  • DACSI warrants separate evaluation because it uses a command-free metadata/provenance/policy surface.
  • DACSI follows a RAG-specific source-authority path rather than direct command overrides.
  • DACSI responds to source/channel separation as a distinguishing factor from other injection types.
  • The source-authority probe provides behavioral attribution evidence rather than proof of internal mechanisms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Enforcing separate formatting or provenance markers for retrieved documents could reduce misattribution in practice.
  • The pattern may extend to other systems that serialize heterogeneous data into a single prompt stream.
  • Model-specific boundary strength varies, so targeted testing per regime would be needed to map residual risks.

Load-bearing premise

RAG prompt rendering collapses trusted and untrusted text into the same natural-language channel, allowing misattribution of document-authored signals as authorized control signals.

What would settle it

An experiment enforcing explicit source separation or distinct channels for documents versus instructions that shows zero successful DACSI impersonation across the tested model regimes.

read the original abstract

Retrieval-augmented generation (RAG) systems often serialize user queries, retrieved documents, metadata, system labels, and task instructions into one natural-language prompt. We study a source-authority boundary failure in this design: attacker-authored retrieved text can impersonate metadata, provenance, authority, or disclosure-policy signals that appear control-relevant to the model. We call this pattern Document-Authored Control-Signal Impersonation (DACSI). DACSI is a non-imperative, metadata-like payload subclass within indirect prompt injection. Its central lesson is simple: document-authored labels are data, not policy. Command-style injection asks the model to ignore, override, or violate policy; DACSI asks whether untrusted document text can be misattributed as an authorized control signal when RAG prompt rendering collapses trusted and untrusted text into the same natural-language channel. We evaluate DACSI across six model settings, prompt-pressure levels, injection baselines, signal taxonomies, RAG-mediated pipelines, system-control probes, a source-authority attribution probe, and synthetic canary formats. We interpret the evidence by model regime rather than as six equal replications: DeepSeek V4 Pro and Qwen3.5-397B provide the cleanest positive lift, DeepSeek V4 Flash is a high-susceptibility setting, GPT-5.5 and Gemini 3.1 Pro Low are strong-boundary probes with selected residual risks, and GLM-4.7 is a saturated leakage boundary case. Across these regimes, DACSI warrants separate evaluation because it uses a command-free metadata/provenance/policy surface, follows a RAG-specific source-authority path, and responds to source/channel separation. The source-authority probe is behavioral attribution evidence, not proof of an internal mechanism.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces Document-Authored Control-Signal Impersonation (DACSI) as a non-imperative, metadata-like subclass of indirect prompt injection specific to RAG systems. It claims that attacker-authored retrieved documents can impersonate provenance, authority, or policy signals when RAG prompt rendering collapses trusted and untrusted text into a single natural-language channel, and that document-authored labels are data rather than policy. The paper reports evaluations across six model regimes (DeepSeek V4 Pro, Qwen3.5-397B, DeepSeek V4 Flash, GPT-5.5, Gemini 3.1 Pro Low, GLM-4.7), prompt-pressure levels, injection baselines, signal taxonomies, RAG pipelines, system-control probes, and a source-authority attribution probe, interpreting results by model regime rather than as equal replications, and concludes that DACSI warrants separate evaluation due to its command-free surface and RAG-specific source-authority path.

Significance. If the reported behavioral attribution evidence and regime-specific patterns hold under detailed scrutiny, the work identifies a RAG-specific attack surface that standard imperative-injection defenses may not fully address, potentially motivating source/channel separation techniques. The multi-regime framing and explicit distinction between behavioral evidence and internal mechanism are constructive.

major comments (2)
  1. [Abstract] Abstract: The central claim that DACSI 'warrants separate evaluation' because it uses a command-free metadata/provenance/policy surface and responds to source/channel separation is load-bearing, yet the abstract describes evaluations 'across ... injection baselines' without any reported differential statistics, success rates, or ablation results comparing DACSI to imperative indirect-prompt-injection baselines under matched conditions. This absence means the distinction rests on taxonomy rather than falsifiable empirical contrast.
  2. [Abstract] Abstract: The source-authority probe is described as 'behavioral attribution evidence, not proof of an internal mechanism,' but no concrete probe design, scoring method, or quantitative attribution rates are supplied, leaving the evidential basis for the RAG-specific path claim unverified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract's evidential presentation. We will revise the abstract to incorporate key quantitative contrasts and probe details while preserving its conciseness.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that DACSI 'warrants separate evaluation' because it uses a command-free metadata/provenance/policy surface and responds to source/channel separation is load-bearing, yet the abstract describes evaluations 'across ... injection baselines' without any reported differential statistics, success rates, or ablation results comparing DACSI to imperative indirect-prompt-injection baselines under matched conditions. This absence means the distinction rests on taxonomy rather than falsifiable empirical contrast.

    Authors: We agree the abstract would benefit from explicit empirical support. The full manuscript reports matched-condition comparisons to imperative baselines (including success rates and ablations) in the evaluation sections, interpreted by model regime. We will revise the abstract to include representative differential statistics and note the observed RAG-specific source-authority patterns that support separate evaluation. revision: yes

  2. Referee: [Abstract] Abstract: The source-authority probe is described as 'behavioral attribution evidence, not proof of an internal mechanism,' but no concrete probe design, scoring method, or quantitative attribution rates are supplied, leaving the evidential basis for the RAG-specific path claim unverified.

    Authors: The full manuscript details the source-authority probe (design using synthetic canary formats and attribution tasks, scoring via behavioral attribution rates, and quantitative results across the six regimes) in the methods and results sections. The abstract's wording is deliberately limited to behavioral evidence. We will add a brief clause to the abstract summarizing the probe design and key attribution rates to strengthen the evidential basis. revision: yes

Circularity Check

0 steps flagged

No circularity; argument rests on explicit definition plus reported empirical evaluation

full rationale

The paper defines DACSI by its non-imperative metadata-like surface and RAG source-authority path, then reports evaluations across models, baselines, and probes before concluding that the pattern warrants separate evaluation for those same definitional reasons. No equations, fitted parameters, self-citations, or imported uniqueness theorems appear in the provided text. The central claim therefore does not reduce to its inputs by construction; it is an empirical taxonomy claim whose strength can be assessed against the reported results rather than being tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no details on parameters, axioms, or entities provided.

pith-pipeline@v0.9.1-grok · 5857 in / 901 out tokens · 21429 ms · 2026-06-27T16:39:25.346958+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

29 extracted references · 5 linked inside Pith

  1. [1]

    The instruction hierarchy: Training LLMs to prioritize privileged in- structions,

    E. Wallace, K. Xiao, R. Leike, L. Weng, J. Heidecke, and A. Beutel, “The instruction hierarchy: Training LLMs to prioritize privileged in- structions,”arXiv preprint arXiv:2404.13208, 2024

  2. [2]

    StruQ: Defending against prompt injection with structured queries,

    S. Chen, J. Piet, C. Sitawarin, and D. Wagner, “StruQ: Defending against prompt injection with structured queries,” in34th USENIX Security Symposium (USENIX Security 25), 2025

  3. [3]

    Can LLMs separate instructions from data? and what do we even mean by that?

    E. Zverev, S. Abdelnabi, S. Tabesh, M. Fritz, and C. H. Lampert, “Can LLMs separate instructions from data? and what do we even mean by that?” inInternational Conference on Learning Representations, 2025

  4. [4]

    Ignore previous prompt: Attack techniques for language models,

    F. Perez and I. Ribeiro, “Ignore previous prompt: Attack techniques for language models,”arXiv preprint arXiv:2211.09527, 2022. 10

  5. [5]

    Not what you’ve signed up for: Compromising real-world LLM- integrated applications with indirect prompt injection,

    K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, “Not what you’ve signed up for: Compromising real-world LLM- integrated applications with indirect prompt injection,”arXiv preprint arXiv:2302.12173, 2023

  6. [6]

    Prompt injection attack against LLM- integrated applications,

    Y . Liu, G. Deng, Y . Li, K. Wang, Z. Wang, X. Wang, T. Zhang, Y . Liu, H. Wang, Y . Zheng, and Y . Liu, “Prompt injection attack against LLM- integrated applications,”arXiv preprint arXiv:2306.05499, 2023

  7. [7]

    InjecAgent: Benchmark- ing indirect prompt injections in tool-integrated large language model agents,

    Q. Zhan, Z. Liang, Z. Ying, and D. Kang, “InjecAgent: Benchmark- ing indirect prompt injections in tool-integrated large language model agents,” inFindings of the Association for Computational Linguistics: ACL, 2024

  8. [8]

    AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents,

    E. Debenedetti, J. Zhang, M. Balunovic, L. Beurer-Kellner, M. Fischer, and F. Tramer, “AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents,”Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2024

  9. [9]

    Benchmarking and defending against indirect prompt injection attacks on large language models,

    J. Yi, Y . Xie, B. Zhu, E. Kiciman, G. Sun, X. Xie, and F. Wu, “Benchmarking and defending against indirect prompt injection attacks on large language models,” inACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2025

  10. [10]

    ObliInjection: Order- oblivious prompt injection attack to LLM agents with multi-source data,

    X. Liu, H. Tian, Y . Chen, Y . Ye, and X. Li, “ObliInjection: Order- oblivious prompt injection attack to LLM agents with multi-source data,” inProceedings of the Network and Distributed System Security Symposium, 2026

  11. [11]

    Defending against indirect prompt injection attacks with spotlighting,

    K. Hines, G. Lopez, M. Hall, F. Zarfati, Y . Zunger, and E. Kiciman, “Defending against indirect prompt injection attacks with spotlighting,” arXiv preprint, 2024

  12. [12]

    Defending against indirect prompt injection attacks with spot- lighting and attention shifts,

    ——, “Defending against indirect prompt injection attacks with spot- lighting and attention shifts,” inProceedings of the Network and Distributed System Security Symposium, 2026

  13. [13]

    CaMeL: Defeating prompt injections by design,

    E. Debenedetti, J. Zhang, M. Balunovic, L. Beurer-Kellner, M. Fischer, and F. Tram`er, “CaMeL: Defeating prompt injections by design,”arXiv preprint arXiv:2503.18813, 2025

  14. [14]

    PoisonedRAG: Knowledge corruption attacks to retrieval-augmented generation of large language models,

    W. Zou, R. Geng, B. Wang, and J. Jia, “PoisonedRAG: Knowledge corruption attacks to retrieval-augmented generation of large language models,” in34th USENIX Security Symposium (USENIX Security 25), 2025

  15. [15]

    Lost in the middle: How language models use long contexts,

    N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang, “Lost in the middle: How language models use long contexts,” Transactions of the Association for Computational Linguistics, 2024

  16. [16]

    Sufficient context: A new lens on retrieval augmented generation systems,

    A. Liu, O. Press, N. A. Smith, and H. Hajishirzi, “Sufficient context: A new lens on retrieval augmented generation systems,” inInternational Conference on Learning Representations, 2025

  17. [17]

    A reality check on context utilisation for retrieval- augmented generation,

    L. Hagstr ¨omet al., “A reality check on context utilisation for retrieval- augmented generation,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025

  18. [18]

    FaithfulRAG: Fact-level conflict modeling for context-faithful retrieval- augmented generation,

    Q. Zhang, Z. Xiang, Y . Xiao, L. Wang, J. Li, X. Wang, and J. Su, “FaithfulRAG: Fact-level conflict modeling for context-faithful retrieval- augmented generation,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025

  19. [19]

    FaithEval: Can your language model stay faithful to context, even if “the moon is made of marshmallows

    Y . Minget al., “FaithEval: Can your language model stay faithful to context, even if “the moon is made of marshmallows”?” inInternational Conference on Learning Representations, 2025

  20. [20]

    Synchronous faithfulness monitoring for trustworthy retrieval-augmented generation,

    D. Wuet al., “Synchronous faithfulness monitoring for trustworthy retrieval-augmented generation,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

  21. [21]

    Model internals-based answer attribution for trustworthy retrieval-augmented generation,

    J. Qi, G. Sarti, R. Fernandez, and A. Bisazza, “Model internals-based answer attribution for trustworthy retrieval-augmented generation,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

  22. [22]

    Quantifying language models’ sensitivity to spurious features in prompt design, or: How i learned to start worrying about prompt formatting,

    M. Sclar, Y . Choi, Y . Tsvetkov, and A. Suhr, “Quantifying language models’ sensitivity to spurious features in prompt design, or: How i learned to start worrying about prompt formatting,” inInternational Conference on Learning Representations, 2024

  23. [23]

    How are prompts different in terms of sensitivity?

    S. Lu, H. Schuff, and I. Gurevych, “How are prompts different in terms of sensitivity?” inProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics, 2024

  24. [24]

    Benchmarking prompt sensitivity in large language models,

    A. Razavi, M. Soltangheis, N. Arabzadeh, S. Salamat, M. Zihayat, and E. Bagheri, “Benchmarking prompt sensitivity in large language models,” inEuropean Conference on Information Retrieval, 2025

  25. [25]

    Flaw or artifact? rethinking prompt sensitivity in evaluating LLMs,

    A. Hua, K. Tang, C. Gu, J. Gu, E. Wong, and Y . Qin, “Flaw or artifact? rethinking prompt sensitivity in evaluating LLMs,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

  26. [26]

    Revisiting demonstration selection strategies in in- context learning,

    K. Penget al., “Revisiting demonstration selection strategies in in- context learning,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024

  27. [27]

    Learning to retrieve in-context examples for large language models,

    L. Wang, N. Yang, and F. Wei, “Learning to retrieve in-context examples for large language models,” inProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics, 2024

  28. [28]

    Assessing “implicit

    X. Shenet al., “Assessing “implicit” retrieval robustness of large language models,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

  29. [29]

    Resisting contextual interference in RAG via parametric-knowledge re- inforcement,

    C. Lin, Y . Wen, D. Su, H. Tan, F. Sun, M. Chen, C. Bao, and Z. Lv, “Resisting contextual interference in RAG via parametric-knowledge re- inforcement,” inInternational Conference on Learning Representations, 2026