Document-Authored Control-Signal Impersonation: A Low-Cost Indirect Prompt Attack on RAG Safety Boundaries
Pith reviewed 2026-06-27 16:39 UTC · model grok-4.3
The pith
RAG systems let attacker documents impersonate metadata and policy signals without commands.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Document-authored labels are data, not policy. DACSI is a non-imperative, metadata-like payload subclass within indirect prompt injection. Its central lesson is that attacker-authored retrieved text can be misattributed as an authorized control signal when RAG prompt rendering collapses trusted and untrusted text into the same natural-language channel.
What carries the argument
Document-Authored Control-Signal Impersonation (DACSI), the impersonation of metadata, provenance, authority, or disclosure-policy signals by attacker-authored retrieved text in mixed RAG prompts.
If this is right
- DACSI warrants separate evaluation because it uses a command-free metadata/provenance/policy surface.
- DACSI follows a RAG-specific source-authority path rather than direct command overrides.
- DACSI responds to source/channel separation as a distinguishing factor from other injection types.
- The source-authority probe provides behavioral attribution evidence rather than proof of internal mechanisms.
Where Pith is reading between the lines
- Enforcing separate formatting or provenance markers for retrieved documents could reduce misattribution in practice.
- The pattern may extend to other systems that serialize heterogeneous data into a single prompt stream.
- Model-specific boundary strength varies, so targeted testing per regime would be needed to map residual risks.
Load-bearing premise
RAG prompt rendering collapses trusted and untrusted text into the same natural-language channel, allowing misattribution of document-authored signals as authorized control signals.
What would settle it
An experiment enforcing explicit source separation or distinct channels for documents versus instructions that shows zero successful DACSI impersonation across the tested model regimes.
read the original abstract
Retrieval-augmented generation (RAG) systems often serialize user queries, retrieved documents, metadata, system labels, and task instructions into one natural-language prompt. We study a source-authority boundary failure in this design: attacker-authored retrieved text can impersonate metadata, provenance, authority, or disclosure-policy signals that appear control-relevant to the model. We call this pattern Document-Authored Control-Signal Impersonation (DACSI). DACSI is a non-imperative, metadata-like payload subclass within indirect prompt injection. Its central lesson is simple: document-authored labels are data, not policy. Command-style injection asks the model to ignore, override, or violate policy; DACSI asks whether untrusted document text can be misattributed as an authorized control signal when RAG prompt rendering collapses trusted and untrusted text into the same natural-language channel. We evaluate DACSI across six model settings, prompt-pressure levels, injection baselines, signal taxonomies, RAG-mediated pipelines, system-control probes, a source-authority attribution probe, and synthetic canary formats. We interpret the evidence by model regime rather than as six equal replications: DeepSeek V4 Pro and Qwen3.5-397B provide the cleanest positive lift, DeepSeek V4 Flash is a high-susceptibility setting, GPT-5.5 and Gemini 3.1 Pro Low are strong-boundary probes with selected residual risks, and GLM-4.7 is a saturated leakage boundary case. Across these regimes, DACSI warrants separate evaluation because it uses a command-free metadata/provenance/policy surface, follows a RAG-specific source-authority path, and responds to source/channel separation. The source-authority probe is behavioral attribution evidence, not proof of an internal mechanism.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Document-Authored Control-Signal Impersonation (DACSI) as a non-imperative, metadata-like subclass of indirect prompt injection specific to RAG systems. It claims that attacker-authored retrieved documents can impersonate provenance, authority, or policy signals when RAG prompt rendering collapses trusted and untrusted text into a single natural-language channel, and that document-authored labels are data rather than policy. The paper reports evaluations across six model regimes (DeepSeek V4 Pro, Qwen3.5-397B, DeepSeek V4 Flash, GPT-5.5, Gemini 3.1 Pro Low, GLM-4.7), prompt-pressure levels, injection baselines, signal taxonomies, RAG pipelines, system-control probes, and a source-authority attribution probe, interpreting results by model regime rather than as equal replications, and concludes that DACSI warrants separate evaluation due to its command-free surface and RAG-specific source-authority path.
Significance. If the reported behavioral attribution evidence and regime-specific patterns hold under detailed scrutiny, the work identifies a RAG-specific attack surface that standard imperative-injection defenses may not fully address, potentially motivating source/channel separation techniques. The multi-regime framing and explicit distinction between behavioral evidence and internal mechanism are constructive.
major comments (2)
- [Abstract] Abstract: The central claim that DACSI 'warrants separate evaluation' because it uses a command-free metadata/provenance/policy surface and responds to source/channel separation is load-bearing, yet the abstract describes evaluations 'across ... injection baselines' without any reported differential statistics, success rates, or ablation results comparing DACSI to imperative indirect-prompt-injection baselines under matched conditions. This absence means the distinction rests on taxonomy rather than falsifiable empirical contrast.
- [Abstract] Abstract: The source-authority probe is described as 'behavioral attribution evidence, not proof of an internal mechanism,' but no concrete probe design, scoring method, or quantitative attribution rates are supplied, leaving the evidential basis for the RAG-specific path claim unverified.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract's evidential presentation. We will revise the abstract to incorporate key quantitative contrasts and probe details while preserving its conciseness.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that DACSI 'warrants separate evaluation' because it uses a command-free metadata/provenance/policy surface and responds to source/channel separation is load-bearing, yet the abstract describes evaluations 'across ... injection baselines' without any reported differential statistics, success rates, or ablation results comparing DACSI to imperative indirect-prompt-injection baselines under matched conditions. This absence means the distinction rests on taxonomy rather than falsifiable empirical contrast.
Authors: We agree the abstract would benefit from explicit empirical support. The full manuscript reports matched-condition comparisons to imperative baselines (including success rates and ablations) in the evaluation sections, interpreted by model regime. We will revise the abstract to include representative differential statistics and note the observed RAG-specific source-authority patterns that support separate evaluation. revision: yes
-
Referee: [Abstract] Abstract: The source-authority probe is described as 'behavioral attribution evidence, not proof of an internal mechanism,' but no concrete probe design, scoring method, or quantitative attribution rates are supplied, leaving the evidential basis for the RAG-specific path claim unverified.
Authors: The full manuscript details the source-authority probe (design using synthetic canary formats and attribution tasks, scoring via behavioral attribution rates, and quantitative results across the six regimes) in the methods and results sections. The abstract's wording is deliberately limited to behavioral evidence. We will add a brief clause to the abstract summarizing the probe design and key attribution rates to strengthen the evidential basis. revision: yes
Circularity Check
No circularity; argument rests on explicit definition plus reported empirical evaluation
full rationale
The paper defines DACSI by its non-imperative metadata-like surface and RAG source-authority path, then reports evaluations across models, baselines, and probes before concluding that the pattern warrants separate evaluation for those same definitional reasons. No equations, fitted parameters, self-citations, or imported uniqueness theorems appear in the provided text. The central claim therefore does not reduce to its inputs by construction; it is an empirical taxonomy claim whose strength can be assessed against the reported results rather than being tautological.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
The instruction hierarchy: Training LLMs to prioritize privileged in- structions,
E. Wallace, K. Xiao, R. Leike, L. Weng, J. Heidecke, and A. Beutel, “The instruction hierarchy: Training LLMs to prioritize privileged in- structions,”arXiv preprint arXiv:2404.13208, 2024
Pith/arXiv arXiv 2024
-
[2]
StruQ: Defending against prompt injection with structured queries,
S. Chen, J. Piet, C. Sitawarin, and D. Wagner, “StruQ: Defending against prompt injection with structured queries,” in34th USENIX Security Symposium (USENIX Security 25), 2025
2025
-
[3]
Can LLMs separate instructions from data? and what do we even mean by that?
E. Zverev, S. Abdelnabi, S. Tabesh, M. Fritz, and C. H. Lampert, “Can LLMs separate instructions from data? and what do we even mean by that?” inInternational Conference on Learning Representations, 2025
2025
-
[4]
Ignore previous prompt: Attack techniques for language models,
F. Perez and I. Ribeiro, “Ignore previous prompt: Attack techniques for language models,”arXiv preprint arXiv:2211.09527, 2022. 10
Pith/arXiv arXiv 2022
-
[5]
K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, “Not what you’ve signed up for: Compromising real-world LLM- integrated applications with indirect prompt injection,”arXiv preprint arXiv:2302.12173, 2023
Pith/arXiv arXiv 2023
-
[6]
Prompt injection attack against LLM- integrated applications,
Y . Liu, G. Deng, Y . Li, K. Wang, Z. Wang, X. Wang, T. Zhang, Y . Liu, H. Wang, Y . Zheng, and Y . Liu, “Prompt injection attack against LLM- integrated applications,”arXiv preprint arXiv:2306.05499, 2023
Pith/arXiv arXiv 2023
-
[7]
InjecAgent: Benchmark- ing indirect prompt injections in tool-integrated large language model agents,
Q. Zhan, Z. Liang, Z. Ying, and D. Kang, “InjecAgent: Benchmark- ing indirect prompt injections in tool-integrated large language model agents,” inFindings of the Association for Computational Linguistics: ACL, 2024
2024
-
[8]
AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents,
E. Debenedetti, J. Zhang, M. Balunovic, L. Beurer-Kellner, M. Fischer, and F. Tramer, “AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents,”Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2024
2024
-
[9]
Benchmarking and defending against indirect prompt injection attacks on large language models,
J. Yi, Y . Xie, B. Zhu, E. Kiciman, G. Sun, X. Xie, and F. Wu, “Benchmarking and defending against indirect prompt injection attacks on large language models,” inACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2025
2025
-
[10]
ObliInjection: Order- oblivious prompt injection attack to LLM agents with multi-source data,
X. Liu, H. Tian, Y . Chen, Y . Ye, and X. Li, “ObliInjection: Order- oblivious prompt injection attack to LLM agents with multi-source data,” inProceedings of the Network and Distributed System Security Symposium, 2026
2026
-
[11]
Defending against indirect prompt injection attacks with spotlighting,
K. Hines, G. Lopez, M. Hall, F. Zarfati, Y . Zunger, and E. Kiciman, “Defending against indirect prompt injection attacks with spotlighting,” arXiv preprint, 2024
2024
-
[12]
Defending against indirect prompt injection attacks with spot- lighting and attention shifts,
——, “Defending against indirect prompt injection attacks with spot- lighting and attention shifts,” inProceedings of the Network and Distributed System Security Symposium, 2026
2026
-
[13]
CaMeL: Defeating prompt injections by design,
E. Debenedetti, J. Zhang, M. Balunovic, L. Beurer-Kellner, M. Fischer, and F. Tram`er, “CaMeL: Defeating prompt injections by design,”arXiv preprint arXiv:2503.18813, 2025
Pith/arXiv arXiv 2025
-
[14]
PoisonedRAG: Knowledge corruption attacks to retrieval-augmented generation of large language models,
W. Zou, R. Geng, B. Wang, and J. Jia, “PoisonedRAG: Knowledge corruption attacks to retrieval-augmented generation of large language models,” in34th USENIX Security Symposium (USENIX Security 25), 2025
2025
-
[15]
Lost in the middle: How language models use long contexts,
N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang, “Lost in the middle: How language models use long contexts,” Transactions of the Association for Computational Linguistics, 2024
2024
-
[16]
Sufficient context: A new lens on retrieval augmented generation systems,
A. Liu, O. Press, N. A. Smith, and H. Hajishirzi, “Sufficient context: A new lens on retrieval augmented generation systems,” inInternational Conference on Learning Representations, 2025
2025
-
[17]
A reality check on context utilisation for retrieval- augmented generation,
L. Hagstr ¨omet al., “A reality check on context utilisation for retrieval- augmented generation,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025
2025
-
[18]
FaithfulRAG: Fact-level conflict modeling for context-faithful retrieval- augmented generation,
Q. Zhang, Z. Xiang, Y . Xiao, L. Wang, J. Li, X. Wang, and J. Su, “FaithfulRAG: Fact-level conflict modeling for context-faithful retrieval- augmented generation,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025
2025
-
[19]
FaithEval: Can your language model stay faithful to context, even if “the moon is made of marshmallows
Y . Minget al., “FaithEval: Can your language model stay faithful to context, even if “the moon is made of marshmallows”?” inInternational Conference on Learning Representations, 2025
2025
-
[20]
Synchronous faithfulness monitoring for trustworthy retrieval-augmented generation,
D. Wuet al., “Synchronous faithfulness monitoring for trustworthy retrieval-augmented generation,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024
2024
-
[21]
Model internals-based answer attribution for trustworthy retrieval-augmented generation,
J. Qi, G. Sarti, R. Fernandez, and A. Bisazza, “Model internals-based answer attribution for trustworthy retrieval-augmented generation,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024
2024
-
[22]
Quantifying language models’ sensitivity to spurious features in prompt design, or: How i learned to start worrying about prompt formatting,
M. Sclar, Y . Choi, Y . Tsvetkov, and A. Suhr, “Quantifying language models’ sensitivity to spurious features in prompt design, or: How i learned to start worrying about prompt formatting,” inInternational Conference on Learning Representations, 2024
2024
-
[23]
How are prompts different in terms of sensitivity?
S. Lu, H. Schuff, and I. Gurevych, “How are prompts different in terms of sensitivity?” inProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics, 2024
2024
-
[24]
Benchmarking prompt sensitivity in large language models,
A. Razavi, M. Soltangheis, N. Arabzadeh, S. Salamat, M. Zihayat, and E. Bagheri, “Benchmarking prompt sensitivity in large language models,” inEuropean Conference on Information Retrieval, 2025
2025
-
[25]
Flaw or artifact? rethinking prompt sensitivity in evaluating LLMs,
A. Hua, K. Tang, C. Gu, J. Gu, E. Wong, and Y . Qin, “Flaw or artifact? rethinking prompt sensitivity in evaluating LLMs,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025
2025
-
[26]
Revisiting demonstration selection strategies in in- context learning,
K. Penget al., “Revisiting demonstration selection strategies in in- context learning,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024
2024
-
[27]
Learning to retrieve in-context examples for large language models,
L. Wang, N. Yang, and F. Wei, “Learning to retrieve in-context examples for large language models,” inProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics, 2024
2024
-
[28]
Assessing “implicit
X. Shenet al., “Assessing “implicit” retrieval robustness of large language models,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024
2024
-
[29]
Resisting contextual interference in RAG via parametric-knowledge re- inforcement,
C. Lin, Y . Wen, D. Su, H. Tan, F. Sun, M. Chen, C. Bao, and Z. Lv, “Resisting contextual interference in RAG via parametric-knowledge re- inforcement,” inInternational Conference on Learning Representations, 2026
2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.