The Injection Paradox: Brand-Level Suppression in Safety-Trained LLM Recommendations via RAG Context Injection

Hyunseok Paeng

arxiv: 2606.09204 · v1 · pith:6QL6UXVSnew · submitted 2026-06-08 · 💻 cs.LG · cs.CL· cs.CR

The Injection Paradox: Brand-Level Suppression in Safety-Trained LLM Recommendations via RAG Context Injection

Hyunseok Paeng This is my paper

Pith reviewed 2026-06-27 17:14 UTC · model grok-4.3

classification 💻 cs.LG cs.CLcs.CR

keywords RAGprompt injectionLLM recommendationssafety trainingClaude modelsbrand suppressioncontext injectionfailure mode

0 comments

The pith

Safety-trained Claude models suppress recommendations for an entire brand when even one retrieved document contains a prompt injection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that in RAG-based recommendation with Claude models, prompt injections in documents cause the model to stop recommending the associated brand, and this avoidance extends to unmodified documents from the same brand. Experiments found the target brand falling from a 54 percent baseline to zero top-2 recommendations across fifty trials in Claude Opus 4.6, even though only one of four brand documents held the injection. The same directional suppression appears across three brands and counterfactual setups, while GPT models instead increase recommendations under identical injections. A reader would care because the result identifies how safety training can create brand-level filtering effects that an attacker could potentially exploit in reverse.

Core claim

The Injection Paradox is the observed outcome in which prompt injections embedded in retrieved documents backfire against the attacker by suppressing the target brand below the injection-free baseline. In safety-trained Claude models, documents containing prompt injections suffer a sharp drop in recommendation rate, and this suppression propagates beyond the injected document to unmodified documents of the same brand. In Claude Opus 4.6 the target brand drops from a 54 percent baseline to zero top-2 recommendations across all 50 trials, even though only 1 of 4 brand documents in the corpus contains an injection. The directional pattern is reproduced in counterfactual experiments and across t

What carries the argument

The interaction between prompt-injection context and safety training that produces brand-level suppression propagating from the injected document to other documents of the same brand.

If this is right

Only one injected document among four is sufficient to drive the entire brand's top-2 recommendation rate to zero.
The suppression effect is reproduced across three different brands and in multiple counterfactual document configurations.
GPT models exhibit the opposite response, with the same injection increasing rather than decreasing recommendations.
The pattern raises the technical possibility that an adversary could embed injections in a competitor's documents to suppress that competitor's brand.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

RAG pipelines using safety-trained models may require explicit injection detection or document sanitization to avoid unintended brand filtering.
The model-family difference suggests that safety training produces distinct generalization patterns from anomalous context across LLM families.
The propagation to clean documents from the same brand could extend the effect to other retrieval-based tasks such as summarization or question answering.

Load-bearing premise

The observed drop in recommendations for the brand is produced by the model's safety training responding to the injection rather than by uncontrolled differences in document selection, prompt formatting, or other RAG pipeline variables.

What would settle it

An experiment in which the single injected document is removed or the injection text is deleted while all other documents, queries, and model settings remain identical, and the target brand's recommendation rate returns to the 54 percent baseline.

Figures

Figures reproduced from arXiv: 2606.09204 by Hyunseok Paeng.

**Figure 1.** Figure 1: Experimental overview. A 40-document corpus is tested under four conditions across GPT and Claude model families. The same prompt injection produces promotion in GPT but suppression in safety-trained Claude models—the Injection Paradox [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗

read the original abstract

We present a reproducible failure mode of safety training in RAG-based LLM recommendation -- the Injection Paradox -- in which prompt injections embedded in retrieved documents backfire against the attacker, suppressing the target brand below the injection-free baseline. In safety-trained Claude models, documents containing prompt injections suffer a sharp drop in recommendation rate, and this suppression propagates beyond the injected document to unmodified documents of the same brand. In Claude Opus 4.6, the target brand drops from a 54% baseline to zero top-2 recommendations across all 50 trials, even though only 1 of 4 brand documents in the corpus contains an injection. The directional pattern is reproduced in counterfactual experiments and across three brands. A contrasting result across the GPT models tested, where the same injection instead increases recommendations, suggests model-family differences in how injection-like context affects recommendation behavior. These findings raise the technical possibility of a reverse-attack scenario in which an adversary embeds injections in a competitor's documents to suppress the competitor's brand via safety-sensitive model behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper reports a brand-level suppression effect from one injected document in Claude RAG setups that spreads to the rest of the brand, with the reverse in GPT models, but lacks the controls needed to tie it to safety training.

read the letter

The main thing to know is that the authors observed a consistent pattern where injecting a prompt into one of four documents for a brand causes Claude models to stop recommending that brand entirely in top-2 results, even for the clean documents, while the same injection raises recommendations in the GPT models tested. They frame this as the Injection Paradox and a possible reverse attack.

The work does a decent job showing the directional effect holds across three brands and some counterfactual runs, and the model-family split is a straightforward empirical note that prior work on injections does not appear to have flagged in this recommendation setting.

The soft spot is the missing isolation. The abstract gives no retrieval ablations, no exact injection text, no details on how the corpus or prompts were formatted, and no checks that the drop is not just from altered context length or ranking scores. That makes the attribution to safety-training interaction rest on the assumption that nothing else in the pipeline changed, which the stress-test note correctly flags as untested. Without those steps the claim stays at the level of an observation rather than a pinned-down mechanism.

The result is not circular and the experiments are described as reproducible in principle, but the current write-up leaves the central causal story under-supported. This is the kind of paper that would benefit from a referee who can ask for the missing controls and data.

It is aimed at people working on LLM safety and RAG robustness. Readers tracking attack surfaces on brand recommendations could find the pattern worth following up, but only after seeing the full protocol. I would send it to peer review so the methods can be checked rather than desk-rejecting on the abstract alone.

Referee Report

2 major / 0 minor

Summary. The paper claims to identify an 'Injection Paradox' in RAG-based LLM recommendations: prompt injections embedded in retrieved documents cause a sharp drop in recommendation rates for the target brand in safety-trained Claude models, with suppression propagating to unmodified same-brand documents. In Claude Opus 4.6, the target brand drops from a 54% baseline to zero top-2 recommendations across all 50 trials despite only 1 of 4 brand documents containing an injection. The directional pattern holds across three brands and counterfactual experiments, but the same injections increase recommendations in tested GPT models. The authors frame this as a failure mode of safety training that could enable reverse adversarial attacks on competitors.

Significance. If the brand-level suppression is robustly isolated to the interaction between injections and safety training (rather than RAG confounds), the finding would be significant for LLM safety and RAG robustness research. The reported consistency across 50 trials per setup, multiple brands, and counterfactual experiments provides a reproducible empirical observation that could guide future alignment and retrieval work. The contrast with GPT-family behavior also highlights potential model-specific differences worth further study.

major comments (2)

[Abstract] Abstract: The central claim attributes the observed suppression (e.g., Claude Opus 4.6 dropping to 0/50 top-2 recommendations) to a 'failure mode of safety training' and contrasts it with GPT behavior. However, the described setup (1-of-4 documents injected, brand-level propagation) does not report ablations that hold retrieval ranking, context length, semantic similarity, and prompt formatting fixed while varying only the safety-relevant content of the injection. Without these controls, alternative explanations from the RAG pipeline cannot be ruled out.
[Experimental description] Experimental description (throughout): No exact injection text, corpus construction details, retrieval protocol, or statistical tests are provided despite the claim of reproducibility across 50 trials and counterfactual setups. This omission is load-bearing because it prevents independent verification of whether the directional results support the safety-training attribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will incorporate revisions to strengthen the manuscript's claims and reproducibility.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim attributes the observed suppression (e.g., Claude Opus 4.6 dropping to 0/50 top-2 recommendations) to a 'failure mode of safety training' and contrasts it with GPT behavior. However, the described setup (1-of-4 documents injected, brand-level propagation) does not report ablations that hold retrieval ranking, context length, semantic similarity, and prompt formatting fixed while varying only the safety-relevant content of the injection. Without these controls, alternative explanations from the RAG pipeline cannot be ruled out.

Authors: The counterfactual experiments already vary injection presence while holding the overall RAG corpus, retrieval, and prompt structure fixed across brands, and the model-specific contrast (suppression in Claude but increase in GPT) is difficult to explain via generic RAG confounds. Nevertheless, we agree that explicit ablations isolating only the safety-relevant phrasing of the injection (while fixing ranking, length, similarity, and formatting) would further isolate the mechanism. We will add these targeted controls in the revised version. revision: yes
Referee: [Experimental description] Experimental description (throughout): No exact injection text, corpus construction details, retrieval protocol, or statistical tests are provided despite the claim of reproducibility across 50 trials and counterfactual setups. This omission is load-bearing because it prevents independent verification of whether the directional results support the safety-training attribution.

Authors: We acknowledge that the manuscript text omits these implementation details. In the revision we will include the exact injection strings, full corpus construction procedure, retrieval protocol (including embedding model, similarity metric, and top-k selection), and statistical analysis (e.g., binomial confidence intervals or exact tests on the 50-trial counts) so that the experiments can be independently reproduced. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical results from controlled experiments

full rationale

The paper reports experimental observations of recommendation rates in RAG setups with and without prompt injections across model families. No equations, parameter fits, or derivations are present that could reduce to inputs by construction. Claims rest on direct trial outcomes (e.g., top-2 recommendation counts) rather than self-citations, ansatzes, or renamed patterns. The central attribution to safety training is an interpretive label on the data, not a load-bearing derivation that collapses to prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on empirical observations rather than mathematical axioms or fitted parameters. No free parameters, invented entities, or non-standard mathematical assumptions are introduced.

axioms (1)

domain assumption Safety training in the tested Claude models causes avoidance of recommendations when prompt-injection-like content appears in retrieved documents.
The abstract interprets the suppression effect as a consequence of safety training and contrasts it with GPT behavior.

pith-pipeline@v0.9.1-grok · 5708 in / 1405 out tokens · 42678 ms · 2026-06-27T17:14:08.887833+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

9 extracted references · 1 canonical work pages · 1 internal anchor

[1]

Constitutional classifiers: Defending against universal jailbreaks

Anthropic . Constitutional classifiers: Defending against universal jailbreaks. Technical report, Anthropic, 2025

2025
[2]

Bai, Y. et al. Constitutional AI : Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[3]

Filandrianos, G. et al. Bias beware: The impact of cognitive biases on LLM -driven product recommendations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025

2025
[4]

Greshake, K. et al. Not what you've signed up for: Compromising real-world LLM -integrated applications with indirect prompt injection. In AISec@CCS, 2023

2023
[5]

Manipulating AI memory for profit: The rise of AI recommendation poisoning

Microsoft Defender Security Research Team . Manipulating AI memory for profit: The rise of AI recommendation poisoning. Microsoft Security Blog, 2026

2026
[6]

and Kashef, R

Nawara, D. and Kashef, R. A comprehensive survey on LLM -powered recommender systems: From discriminative, generative to multi-modal paradigms. IEEE Access, 2025

2025
[7]

Adversarial search engine optimization for large language models

Nestaas, F., Debenedetti, E., and Tram \`e r, F. Adversarial search engine optimization for large language models. In International Conference on Learning Representations (ICLR), 2025

2025
[8]

Jailbroken: How does LLM safety training fail? In Advances in Neural Information Processing Systems (NeurIPS), 2023

Wei, A., Haghtalab, N., and Steinhardt, J. Jailbroken: How does LLM safety training fail? In Advances in Neural Information Processing Systems (NeurIPS), 2023

2023
[9]

Zou, W. et al. PoisonedRAG : Knowledge corruption attacks to retrieval-augmented generation. In USENIX Security Symposium, 2025

2025

[1] [1]

Constitutional classifiers: Defending against universal jailbreaks

Anthropic . Constitutional classifiers: Defending against universal jailbreaks. Technical report, Anthropic, 2025

2025

[2] [2]

Bai, Y. et al. Constitutional AI : Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[3] [3]

Filandrianos, G. et al. Bias beware: The impact of cognitive biases on LLM -driven product recommendations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025

2025

[4] [4]

Greshake, K. et al. Not what you've signed up for: Compromising real-world LLM -integrated applications with indirect prompt injection. In AISec@CCS, 2023

2023

[5] [5]

Manipulating AI memory for profit: The rise of AI recommendation poisoning

Microsoft Defender Security Research Team . Manipulating AI memory for profit: The rise of AI recommendation poisoning. Microsoft Security Blog, 2026

2026

[6] [6]

and Kashef, R

Nawara, D. and Kashef, R. A comprehensive survey on LLM -powered recommender systems: From discriminative, generative to multi-modal paradigms. IEEE Access, 2025

2025

[7] [7]

Adversarial search engine optimization for large language models

Nestaas, F., Debenedetti, E., and Tram \`e r, F. Adversarial search engine optimization for large language models. In International Conference on Learning Representations (ICLR), 2025

2025

[8] [8]

Jailbroken: How does LLM safety training fail? In Advances in Neural Information Processing Systems (NeurIPS), 2023

Wei, A., Haghtalab, N., and Steinhardt, J. Jailbroken: How does LLM safety training fail? In Advances in Neural Information Processing Systems (NeurIPS), 2023

2023

[9] [9]

Zou, W. et al. PoisonedRAG : Knowledge corruption attacks to retrieval-augmented generation. In USENIX Security Symposium, 2025

2025