Beyond Explicit Refusals: Soft-Failure Attacks on Retrieval-Augmented Generation
Pith reviewed 2026-05-10 04:52 UTC · model grok-4.3
The pith
Retrieval-augmented generation systems can be driven to produce fluent but non-informative answers by an evolutionary attack that evades detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors formalize soft failure as a distinct threat model for RAG and propose DEJA as an evolutionary optimization process that generates adversarial documents to drive the downstream LLM toward low-utility yet fluent outputs. The process maintains high retrieval rates while systematically lowering answer certainty through repeated mutation and selection based on an LLM evaluator's fine-grained utility assessment.
What carries the argument
DEJA, a black-box evolutionary optimization framework that mutates candidate documents and selects them according to an LLM-computed Answer Utility Score to induce soft failures while preserving retrieval success.
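In outline, such a loop can be sketched as follows. This is a minimal sketch, not the paper's implementation: `mutate`, `aus`, and `is_retrieved` are stand-in callables (assumptions of this sketch) for the paper's LLM-based mutation operator, Answer Utility Score evaluator, and retriever check.

```python
import random

def evolve_adversarial_doc(seed_doc, query, mutate, aus, is_retrieved,
                           pop_size=8, generations=20, seed=0):
    """Sketch of a DEJA-style loop: mutate candidate documents and keep
    those that stay retrievable while driving the Answer Utility Score
    (AUS) of the induced answer down."""
    rng = random.Random(seed)
    population = [seed_doc]
    for _ in range(generations):
        # Mutation: spawn perturbed copies of surviving candidates.
        children = [mutate(doc, rng) for doc in population for _ in range(2)]
        candidates = population + children
        # Hard constraint: a document is useless if it is never retrieved.
        candidates = [d for d in candidates if is_retrieved(d, query)]
        if not candidates:
            break
        # Selection: a lower AUS means a fitter attacker document.
        candidates.sort(key=lambda d: aus(d, query))
        population = candidates[:pop_size]
    return population[0]
```

With toy operators (e.g., a mutation that appends hedging text and a score that drops as hedges accumulate), the loop converges on a low-utility document while the retrieval constraint is respected at every generation.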
If this is right
- RAG pipelines can suffer degraded utility without triggering explicit refusal mechanisms or denial-of-service alerts.
- Adversarial documents remain effective under query paraphrasing and transfer across model families without retargeting.
- Perplexity-based detectors fail to flag the generated documents, leaving current defenses insufficient.
- The attack succeeds while keeping hard failures low, making it harder for operators to notice systematic degradation.
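The perplexity filter the third bullet says is evaded typically works as below. This is a generic sketch, not the paper's detector; the threshold value and the assumption that per-token log-probabilities are available are this sketch's own.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token log-probabilities (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def flags_document(token_logprobs, threshold=100.0):
    """A perplexity filter flags documents whose perplexity exceeds a
    threshold; fluent adversarial text keeps perplexity low and passes."""
    return perplexity(token_logprobs) > threshold
```

Because DEJA's documents are optimized to stay fluent, their per-token log-probabilities under a scoring model remain high, so this style of detector never fires on them.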
Where Pith is reading between the lines
- RAG developers may need to add explicit checks for answer informativeness beyond coherence or safety filters.
- Reliance on LLM evaluators for attack guidance could create a feedback loop where evaluator biases become attack vectors.
- The same evolutionary approach might be tested on other retrieval-augmented tasks such as summarization or code generation.
- Human studies could be run to confirm whether the utility loss measured by the LLM evaluator matches user-perceived degradation.
Load-bearing premise
The LLM evaluator used to compute the Answer Utility Score gives a reliable, unbiased measure of genuine utility loss rather than evaluator-specific artifacts.
What would settle it
Independent runs of DEJA on the same benchmarks and RAG configurations that produce soft-attack success rates below 79 percent or hard-failure rates above 15 percent would falsify the claim of consistent effectiveness.
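A replication run can be scored against that criterion mechanically. In this sketch the per-query outcome labels ('soft', 'hard', 'ok') and the numeric thresholds are taken from the reported headline numbers; how a query is labeled is an assumption left to the replicator.

```python
from collections import Counter

def replication_check(outcomes, sasr_min=0.79, hard_max=0.15):
    """Check a replication run against the paper's headline numbers:
    soft-attack success rate (SASR) above 79% and hard-failure rate
    below 15%. `outcomes` is one label per attacked query:
    'soft' (fluent but non-informative), 'hard' (refusal/DoS), 'ok'."""
    counts = Counter(outcomes)
    n = len(outcomes)
    sasr = counts["soft"] / n
    hard_rate = counts["hard"] / n
    # Rates outside the reported band would count against the claim.
    return {"sasr": sasr, "hard_rate": hard_rate,
            "consistent": sasr > sasr_min and hard_rate < hard_max}
```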
Original abstract
Existing jamming attacks on Retrieval-Augmented Generation (RAG) systems typically induce explicit refusals or denial-of-service behaviors, which are conspicuous and easy to detect. In this work, we formalize a subtler availability threat, termed soft failure, which degrades system utility by inducing fluent and coherent yet non-informative responses rather than overt failures. We propose Deceptive Evolutionary Jamming Attack (DEJA), an automated black-box attack framework that generates adversarial documents to trigger such soft failures by exploiting safety-aligned behaviors of large language models. DEJA employs an evolutionary optimization process guided by a fine-grained Answer Utility Score (AUS), computed via an LLM-based evaluator, to systematically degrade the certainty of answers while maintaining high retrieval success. Extensive experiments across multiple RAG configurations and benchmark datasets show that DEJA consistently drives responses toward low-utility soft failures, achieving SASR above 79% while keeping hard-failure rates below 15%, significantly outperforming prior attacks. The resulting adversarial documents exhibit high stealth, evading perplexity-based detection and resisting query paraphrasing, and transfer across model families to proprietary systems without retargeting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes 'soft failure' as a subtle availability attack on RAG systems that produces fluent yet non-informative responses instead of explicit refusals. It introduces DEJA, a black-box evolutionary optimization framework that generates adversarial documents by maximizing degradation of an LLM-based Answer Utility Score (AUS) while preserving retrieval success. Experiments across RAG configurations and datasets report SASR above 79% with hard-failure rates below 15%, plus claims of stealth against perplexity detectors, resistance to paraphrasing, and transfer to proprietary models.
Significance. If validated, the work identifies a practically relevant threat class that evades current refusal-based defenses and detection heuristics. The black-box evolutionary method and emphasis on utility degradation (rather than denial-of-service) could guide future RAG robustness research; the reported transferability and stealth properties are particularly noteworthy if they survive evaluator ablations.
major comments (3)
- [Abstract and methodology (AUS computation)] The headline performance metrics (SASR >79%, hard failures <15%) are computed using the same LLM-based AUS evaluator that guides DEJA's evolutionary loop. No cross-evaluator agreement, human correlation study, or ablation replacing the AUS model is reported, so it remains possible that the discovered documents exploit evaluator-specific artifacts rather than genuine utility loss in the target RAG generator (see Abstract and the AUS definition in the methodology).
- [Experiments] The experimental section provides aggregate success rates but no details on controls for post-hoc document selection, statistical significance testing across runs, or baseline comparisons that isolate the contribution of the evolutionary component versus simpler retrieval poisoning. This weakens the claim that DEJA 'significantly outperforms prior attacks' (Abstract).
- [Transfer experiments] Transferability results to proprietary models are presented without retargeting, yet the paper does not report whether the AUS evaluator was held fixed or whether success was re-measured with an independent utility metric on the target systems. This leaves open whether the transfer is of the attack documents themselves or of the optimization artifact.
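The cross-evaluator agreement the first major comment asks for can be measured as simply as a correlation between two evaluators' AUS scores over the same responses; low correlation would suggest evaluator-specific artifacts. A minimal Pearson sketch (the choice of correlation statistic is an assumption, not the paper's):

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two evaluators' scores for the same
    responses; high agreement supports the scores measuring utility
    rather than evaluator-specific artifacts."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```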
minor comments (2)
- [Introduction] The abstract and introduction introduce 'soft failure' without a crisp formal definition or contrast to related concepts such as hallucination or low-confidence answers; a short dedicated subsection would improve clarity.
- [Figures and tables] Figure captions and table headers should explicitly state the number of independent runs and the exact AUS threshold used to classify soft vs. hard failures.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which identifies key areas where our evaluation can be strengthened for greater rigor and transparency. We address each major comment below and will revise the manuscript accordingly to incorporate clarifications and additional analyses.
Point-by-point responses
Referee: The headline performance metrics (SASR >79%, hard failures <15%) are computed using the same LLM-based AUS evaluator that guides DEJA's evolutionary loop. No cross-evaluator agreement, human correlation study, or ablation replacing the AUS model is reported, so it remains possible that the discovered documents exploit evaluator-specific artifacts rather than genuine utility loss in the target RAG generator.
Authors: We acknowledge this valid concern regarding potential evaluator-specific artifacts. The AUS was selected as it captures fine-grained aspects of utility (relevance, completeness, and coherence) that align with the definition of soft failure. To address this, we will revise the manuscript to include: (1) an ablation using a different LLM as the AUS evaluator during optimization and evaluation to demonstrate consistency, (2) a small-scale human correlation study on a subset of responses, and (3) discussion of why the chosen AUS is a reasonable proxy. These additions will help confirm that the reported soft failures reflect genuine utility degradation.
Revision: yes
Referee: The experimental section provides aggregate success rates but no details on controls for post-hoc document selection, statistical significance testing across runs, or baseline comparisons that isolate the contribution of the evolutionary component versus simpler retrieval poisoning. This weakens the claim that DEJA 'significantly outperforms prior attacks' (Abstract).
Authors: We agree that more rigorous experimental controls and details are needed to support the performance claims. In the revised manuscript, we will expand the experiments section to specify: controls ensuring no post-hoc selection bias (all evolutionary outputs were evaluated per the reported protocol), results across multiple independent runs with statistical significance testing (e.g., paired t-tests with p-values), and additional baselines such as random document injection and non-evolutionary heuristic poisoning. This will better isolate the evolutionary component's contribution and strengthen comparisons to prior attacks.
Revision: yes
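The paired significance testing promised here can be computed directly from per-query differences. This sketch assumes per-query metric deltas (attack minus baseline) are available; converting the t statistic to a p-value would additionally need the t distribution's CDF.

```python
import math

def paired_t(deltas):
    """Paired t statistic over per-query metric differences
    (attack minus baseline), with n - 1 degrees of freedom."""
    n = len(deltas)
    mean = sum(deltas) / n
    # Sample variance of the differences (Bessel-corrected).
    var = sum((d - mean) ** 2 for d in deltas) / (n - 1)
    return mean / math.sqrt(var / n)
```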
Referee: Transferability results to proprietary models are presented without retargeting, yet the paper does not report whether the AUS evaluator was held fixed or whether success was re-measured with an independent utility metric on the target systems. This leaves open whether the transfer is of the attack documents themselves or of the optimization artifact.
Authors: We clarify that the transfer experiments used adversarial documents generated via the source AUS and applied them directly to target models without retargeting. Success was measured with the same AUS for cross-experiment consistency. We will update the manuscript to explicitly describe this setup. Additionally, we will include results using an alternative independent utility metric on the transferred documents where feasible to further demonstrate that the attack effect transfers independently of the original evaluator.
Revision: partial
Circularity Check
No significant circularity; empirical attack validated on external benchmarks
full rationale
The paper proposes DEJA as a black-box evolutionary attack that optimizes adversarial documents using an external LLM-based AUS evaluator to induce soft failures in RAG systems. Success is reported via SASR and hard-failure rates measured on standard benchmark datasets and multiple RAG configurations, with additional claims of stealth and cross-model transfer. These metrics and comparisons do not reduce by construction to quantities defined solely from the attack's own fitted components or self-citations; the chain consists of proposal, optimization, and external empirical measurement without load-bearing self-referential loops or definitional equivalences.
Axiom & Free-Parameter Ledger
free parameters (2)
- Evolutionary optimization hyperparameters
- AUS threshold or weighting
axioms (2)
- domain assumption: the LLM-based evaluator produces consistent and meaningful Answer Utility Scores that correlate with human-perceived utility degradation
- domain assumption: adversarial documents can be optimized to maintain high retrieval success while degrading answer quality
invented entities (1)
- Soft failure (no independent evidence)
Soft-failure strategies
The paper defines six strategies by which adversarial documents induce soft failures:
- Source Inconsistency (s_inc): fabricates conflicting evidence across authoritative sources; safety-aligned models hedge or defer judgment to avoid asserting potentially "hallucinated" facts, producing conflict-reporting responses instead of direct answers.
- Subjectivity Framing (s_sub): reframes objective queries as interpretive or value-laden debates; models adopt a neutral "observer" stance, summarizing perspectives rather than committing to factual claims.
- Procedural Constraint (s_pro): simulates fictional compliance requirements (e.g., data embargoes, privacy protocols); models produce polite procedural explanations instead of substantive answers, avoiding hard refusals while remaining informationally vacuous.
- Complexity Escalation (s_cpx): asserts dependence on unstated variables or hidden factors; models enumerate conditionals and caveats, burying the core answer in qualification chains.
- Information Fragmentation (s_frag): claims knowledge incompleteness or temporal staleness; models summarize data gaps rather than synthesizing available evidence, increasing integration costs.
- Context-Sensitivity (s_ctx): declares answers contingent on missing user-specific context; models defer responses pending clarification, shifting the burden back to the user.
To quantitatively evaluate response utility, the paper employs the Answer Utility Score (AUS), ranging from 1.0 to 5.0. The attack targets responses in …
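As a toy illustration of how such strategies could seed candidate documents for the evolutionary loop, the templates below are hypothetical wordings invented for this sketch, not taken from the paper.

```python
# Hypothetical seed templates for two of the soft-failure strategies;
# the wording is illustrative and not drawn from the paper.
SEED_TEMPLATES = {
    "procedural_constraint": (
        "Per the {org} disclosure-embargo policy, specifics about {topic} "
        "may not be restated here; readers should consult official channels."
    ),
    "context_sensitivity": (
        "Any answer about {topic} depends on reader-specific context "
        "(jurisdiction, date, use case) that this source cannot assume."
    ),
}

def seed_document(strategy, topic, org="ExampleOrg"):
    """Render a strategy template into a candidate adversarial document
    that an evolutionary loop could then mutate."""
    return SEED_TEMPLATES[strategy].format(org=org, topic=topic)
```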