Beyond Explicit Refusals: Soft-Failure Attacks on Retrieval-Augmented Generation
Pith reviewed 2026-05-10 04:52 UTC · model grok-4.3
The pith
Retrieval-augmented generation systems can be driven to produce fluent but non-informative answers by an evolutionary attack that evades detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors formalize soft failure as a distinct threat model for RAG and propose DEJA as an evolutionary optimization process that generates adversarial documents to drive the downstream LLM toward low-utility yet fluent outputs. The process maintains high retrieval rates while systematically lowering answer certainty through repeated mutation and selection based on an LLM evaluator's fine-grained utility assessment.
What carries the argument
DEJA, a black-box evolutionary optimization framework that mutates candidate documents and selects them according to an LLM-computed Answer Utility Score to induce soft failures while preserving retrieval success.
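In outline, such a loop can be sketched as follows. This is a minimal sketch, not the paper's implementation: `mutate`, `aus`, and `is_retrieved` are stand-in callables (assumptions of this sketch) for the paper's LLM-based mutation operator, Answer Utility Score evaluator, and retriever check.

```python
import random

def evolve_adversarial_doc(seed_doc, query, mutate, aus, is_retrieved,
                           pop_size=8, generations=20, seed=0):
    """Sketch of a DEJA-style loop: mutate candidate documents and keep
    those that stay retrievable while driving the Answer Utility Score
    (AUS) of the induced answer down."""
    rng = random.Random(seed)
    population = [seed_doc]
    for _ in range(generations):
        # Mutation: spawn perturbed copies of surviving candidates.
        children = [mutate(doc, rng) for doc in population for _ in range(2)]
        candidates = population + children
        # Hard constraint: a document is useless if it is never retrieved.
        candidates = [d for d in candidates if is_retrieved(d, query)]
        if not candidates:
            break
        # Selection: a lower AUS means a fitter attacker document.
        candidates.sort(key=lambda d: aus(d, query))
        population = candidates[:pop_size]
    return population[0]
```

With toy operators (e.g., a mutation that appends hedging text and a score that drops as hedges accumulate), the loop converges on a low-utility document while the retrieval constraint is respected at every generation.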
If this is right
- RAG pipelines can suffer degraded utility without triggering explicit refusal mechanisms or denial-of-service alerts.
- Adversarial documents remain effective under query paraphrasing and transfer across model families without retargeting.
- Perplexity-based detectors fail to flag the generated documents, leaving current defenses insufficient.
- The attack succeeds while keeping hard failures low, making it harder for operators to notice systematic degradation.
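The perplexity filter the third bullet says is evaded typically works as below. This is a generic sketch, not the paper's detector; the threshold value and the assumption that per-token log-probabilities are available are this sketch's own.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token log-probabilities (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def flags_document(token_logprobs, threshold=100.0):
    """A perplexity filter flags documents whose perplexity exceeds a
    threshold; fluent adversarial text keeps perplexity low and passes."""
    return perplexity(token_logprobs) > threshold
```

Because DEJA's documents are optimized to stay fluent, their per-token log-probabilities under a scoring model remain high, so this style of detector never fires on them.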
Where Pith is reading between the lines
- RAG developers may need to add explicit checks for answer informativeness beyond coherence or safety filters.
- Reliance on LLM evaluators for attack guidance could create a feedback loop where evaluator biases become attack vectors.
- The same evolutionary approach might be tested on other retrieval-augmented tasks such as summarization or code generation.
- Human studies could be run to confirm whether the utility loss measured by the LLM evaluator matches user-perceived degradation.
Load-bearing premise
The LLM evaluator used to compute the Answer Utility Score gives a reliable, unbiased measure of genuine utility loss rather than evaluator-specific artifacts.
What would settle it
Independent runs of DEJA on the same benchmarks and RAG configurations that produce soft-attack success rates below 79 percent or hard-failure rates above 15 percent would falsify the claim of consistent effectiveness.
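A replication run can be scored against that criterion mechanically. In this sketch the per-query outcome labels ('soft', 'hard', 'ok') and the numeric thresholds are taken from the reported headline numbers; how a query is labeled is an assumption left to the replicator.

```python
from collections import Counter

def replication_check(outcomes, sasr_min=0.79, hard_max=0.15):
    """Check a replication run against the paper's headline numbers:
    soft-attack success rate (SASR) above 79% and hard-failure rate
    below 15%. `outcomes` is one label per attacked query:
    'soft' (fluent but non-informative), 'hard' (refusal/DoS), 'ok'."""
    counts = Counter(outcomes)
    n = len(outcomes)
    sasr = counts["soft"] / n
    hard_rate = counts["hard"] / n
    # Rates outside the reported band would count against the claim.
    return {"sasr": sasr, "hard_rate": hard_rate,
            "consistent": sasr > sasr_min and hard_rate < hard_max}
```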
Original abstract
Existing jamming attacks on Retrieval-Augmented Generation (RAG) systems typically induce explicit refusals or denial-of-service behaviors, which are conspicuous and easy to detect. In this work, we formalize a subtler availability threat, termed soft failure, which degrades system utility by inducing fluent and coherent yet non-informative responses rather than overt failures. We propose Deceptive Evolutionary Jamming Attack (DEJA), an automated black-box attack framework that generates adversarial documents to trigger such soft failures by exploiting safety-aligned behaviors of large language models. DEJA employs an evolutionary optimization process guided by a fine-grained Answer Utility Score (AUS), computed via an LLM-based evaluator, to systematically degrade the certainty of answers while maintaining high retrieval success. Extensive experiments across multiple RAG configurations and benchmark datasets show that DEJA consistently drives responses toward low-utility soft failures, achieving SASR above 79% while keeping hard-failure rates below 15%, significantly outperforming prior attacks. The resulting adversarial documents exhibit high stealth, evading perplexity-based detection and resisting query paraphrasing, and transfer across model families to proprietary systems without retargeting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes 'soft failure' as a subtle availability attack on RAG systems that produces fluent yet non-informative responses instead of explicit refusals. It introduces DEJA, a black-box evolutionary optimization framework that generates adversarial documents by maximizing degradation of an LLM-based Answer Utility Score (AUS) while preserving retrieval success. Experiments across RAG configurations and datasets report SASR above 79% with hard-failure rates below 15%, plus claims of stealth against perplexity detectors, resistance to paraphrasing, and transfer to proprietary models.
Significance. If validated, the work identifies a practically relevant threat class that evades current refusal-based defenses and detection heuristics. The black-box evolutionary method and emphasis on utility degradation (rather than denial-of-service) could guide future RAG robustness research; the reported transferability and stealth properties are particularly noteworthy if they survive evaluator ablations.
major comments (3)
- [Abstract and methodology (AUS computation)] The headline performance metrics (SASR >79%, hard failures <15%) are computed using the same LLM-based AUS evaluator that guides DEJA's evolutionary loop. No cross-evaluator agreement, human correlation study, or ablation replacing the AUS model is reported, so it remains possible that the discovered documents exploit evaluator-specific artifacts rather than genuine utility loss in the target RAG generator (see Abstract and the AUS definition in the methodology).
- [Experiments] The experimental section provides aggregate success rates but no details on controls for post-hoc document selection, statistical significance testing across runs, or baseline comparisons that isolate the contribution of the evolutionary component versus simpler retrieval poisoning. This weakens the claim that DEJA 'significantly outperforms prior attacks' (Abstract).
- [Transfer experiments] Transferability results to proprietary models are presented without retargeting, yet the paper does not report whether the AUS evaluator was held fixed or whether success was re-measured with an independent utility metric on the target systems. This leaves open whether the transfer is of the attack documents themselves or of the optimization artifact.
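The cross-evaluator agreement the first major comment asks for can be measured as simply as a correlation between two evaluators' AUS scores over the same responses; low correlation would suggest evaluator-specific artifacts. A minimal Pearson sketch (the choice of correlation statistic is an assumption, not the paper's):

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two evaluators' scores for the same
    responses; high agreement supports the scores measuring utility
    rather than evaluator-specific artifacts."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```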
minor comments (2)
- [Introduction] The abstract and introduction introduce 'soft failure' without a crisp formal definition or contrast to related concepts such as hallucination or low-confidence answers; a short dedicated subsection would improve clarity.
- [Figures and tables] Figure captions and table headers should explicitly state the number of independent runs and the exact AUS threshold used to classify soft vs. hard failures.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which identifies key areas where our evaluation can be strengthened for greater rigor and transparency. We address each major comment below and will revise the manuscript accordingly to incorporate clarifications and additional analyses.
Point-by-point responses
Referee: The headline performance metrics (SASR >79%, hard failures <15%) are computed using the same LLM-based AUS evaluator that guides DEJA's evolutionary loop. No cross-evaluator agreement, human correlation study, or ablation replacing the AUS model is reported, so it remains possible that the discovered documents exploit evaluator-specific artifacts rather than genuine utility loss in the target RAG generator.
Authors: We acknowledge this valid concern regarding potential evaluator-specific artifacts. The AUS was selected as it captures fine-grained aspects of utility (relevance, completeness, and coherence) that align with the definition of soft failure. To address this, we will revise the manuscript to include: (1) an ablation using a different LLM as the AUS evaluator during optimization and evaluation to demonstrate consistency, (2) a small-scale human correlation study on a subset of responses, and (3) discussion of why the chosen AUS is a reasonable proxy. These additions will help confirm that the reported soft failures reflect genuine utility degradation.
Revision: yes
Referee: The experimental section provides aggregate success rates but no details on controls for post-hoc document selection, statistical significance testing across runs, or baseline comparisons that isolate the contribution of the evolutionary component versus simpler retrieval poisoning. This weakens the claim that DEJA 'significantly outperforms prior attacks' (Abstract).
Authors: We agree that more rigorous experimental controls and details are needed to support the performance claims. In the revised manuscript, we will expand the experiments section to specify: controls ensuring no post-hoc selection bias (all evolutionary outputs were evaluated per the reported protocol), results across multiple independent runs with statistical significance testing (e.g., paired t-tests with p-values), and additional baselines such as random document injection and non-evolutionary heuristic poisoning. This will better isolate the evolutionary component's contribution and strengthen comparisons to prior attacks.
Revision: yes
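The paired significance testing promised here can be computed directly from per-query differences. This sketch assumes per-query metric deltas (attack minus baseline) are available; converting the t statistic to a p-value would additionally need the t distribution's CDF.

```python
import math

def paired_t(deltas):
    """Paired t statistic over per-query metric differences
    (attack minus baseline), with n - 1 degrees of freedom."""
    n = len(deltas)
    mean = sum(deltas) / n
    # Sample variance of the differences (Bessel-corrected).
    var = sum((d - mean) ** 2 for d in deltas) / (n - 1)
    return mean / math.sqrt(var / n)
```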
Referee: Transferability results to proprietary models are presented without retargeting, yet the paper does not report whether the AUS evaluator was held fixed or whether success was re-measured with an independent utility metric on the target systems. This leaves open whether the transfer is of the attack documents themselves or of the optimization artifact.
Authors: We clarify that the transfer experiments used adversarial documents generated via the source AUS and applied them directly to target models without retargeting. Success was measured with the same AUS for cross-experiment consistency. We will update the manuscript to explicitly describe this setup. Additionally, we will include results using an alternative independent utility metric on the transferred documents where feasible to further demonstrate that the attack effect transfers independently of the original evaluator.
Revision: partial
Circularity Check
No significant circularity; empirical attack validated on external benchmarks
full rationale
The paper proposes DEJA as a black-box evolutionary attack that optimizes adversarial documents using an external LLM-based AUS evaluator to induce soft failures in RAG systems. Success is reported via SASR and hard-failure rates measured on standard benchmark datasets and multiple RAG configurations, with additional claims of stealth and cross-model transfer. These metrics and comparisons do not reduce by construction to quantities defined solely from the attack's own fitted components or self-citations; the chain consists of proposal, optimization, and external empirical measurement without load-bearing self-referential loops or definitional equivalences.
Axiom & Free-Parameter Ledger
free parameters (2)
- Evolutionary optimization hyperparameters
- AUS threshold or weighting
axioms (2)
- domain assumption: the LLM-based evaluator produces consistent and meaningful Answer Utility Scores that correlate with human-perceived utility degradation
- domain assumption: adversarial documents can be optimized to maintain high retrieval success while degrading answer quality
invented entities (1)
- Soft failure (no independent evidence)
Soft-failure strategies
The paper defines six strategies by which adversarial documents induce soft failures:
- Source Inconsistency (s_inc): fabricates conflicting evidence across authoritative sources; safety-aligned models hedge or defer judgment to avoid asserting potentially "hallucinated" facts, producing conflict-reporting responses instead of direct answers.
- Subjectivity Framing (s_sub): reframes objective queries as interpretive or value-laden debates; models adopt a neutral "observer" stance, summarizing perspectives rather than committing to factual claims.
- Procedural Constraint (s_pro): simulates fictional compliance requirements (e.g., data embargoes, privacy protocols); models produce polite procedural explanations instead of substantive answers, avoiding hard refusals while remaining informationally vacuous.
- Complexity Escalation (s_cpx): asserts dependence on unstated variables or hidden factors; models enumerate conditionals and caveats, burying the core answer in qualification chains.
- Information Fragmentation (s_frag): claims knowledge incompleteness or temporal staleness; models summarize data gaps rather than synthesizing available evidence, increasing integration costs.
- Context-Sensitivity (s_ctx): declares answers contingent on missing user-specific context; models defer responses pending clarification, shifting the burden back to the user.
To quantitatively evaluate response utility, the paper employs the Answer Utility Score (AUS), ranging from 1.0 to 5.0. The attack targets responses in …
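As a toy illustration of how such strategies could seed candidate documents for the evolutionary loop, the templates below are hypothetical wordings invented for this sketch, not taken from the paper.

```python
# Hypothetical seed templates for two of the soft-failure strategies;
# the wording is illustrative and not drawn from the paper.
SEED_TEMPLATES = {
    "procedural_constraint": (
        "Per the {org} disclosure-embargo policy, specifics about {topic} "
        "may not be restated here; readers should consult official channels."
    ),
    "context_sensitivity": (
        "Any answer about {topic} depends on reader-specific context "
        "(jurisdiction, date, use case) that this source cannot assume."
    ),
}

def seed_document(strategy, topic, org="ExampleOrg"):
    """Render a strategy template into a candidate adversarial document
    that an evolutionary loop could then mutate."""
    return SEED_TEMPLATES[strategy].format(org=org, topic=topic)
```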