RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years: 2026 (3)
verdicts: UNVERDICTED (3)
roles: background (1)
polarities: background (1)
representative citing papers
Web retrieval degrades safety alignment in LLM agents, with relevance activating vulnerabilities including a Safe Source Paradox where oppositional content increases harmful compliance.
A compliance-scored best-of-N orchestration layer for multimodal document generation reports 91% compliance at 5 attempts in 20 seconds and +11 percentage point win rate gains in aggregate operational data for payments dispute defense.
citing papers explorer
- When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
  A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
- Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents
  Web retrieval degrades safety alignment in LLM agents, with relevance activating vulnerabilities including a Safe Source Paradox where oppositional content increases harmful compliance.
- Compliance-Scored Best-of-N Guardrail Orchestration for Multimodal Document Generation in Payments Dispute Defense
  A compliance-scored best-of-N orchestration layer for multimodal document generation reports 91% compliance at 5 attempts in 20 seconds and +11 percentage point win rate gains in aggregate operational data for payments dispute defense.