Through the Stealth Lens: Attention-Aware Defenses Against Poisoning in RAG

Ashish Hooda; Krishnamurthy Dj Dvijotham; Nils Palumbo; Sarthak Choudhary; Somesh Jha

arxiv: 2506.04390 · v2 · pith:A5V66ZEGnew · submitted 2025-06-04 · 💻 cs.CR · cs.AI

Sarthak Choudhary, Nils Palumbo, Ashish Hooda, Krishnamurthy Dj Dvijotham, Somesh Jha

Pith reviewed 2026-05-25 08:07 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords RAG poisoning · attention-based defense · stealth attacks · LLM security · retrieval-augmented generation · adversarial robustness · attention weights

The pith

Poisoned passages that control RAG outputs must bias attention weights enough to be flagged by a normalized attention score and a variance filter; filtering on that signal raises accuracy by up to about 20 percent over baseline defenses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that poisoning attacks on retrieval-augmented generation cannot remain stealthy while successfully steering the generated response. Because a few injected passages must exert disproportionate influence on inference to dominate the output, their attention patterns diverge from those of benign passages. The authors therefore define a Normalized Passage Attention Score to quantify each passage's relative effect on output tokens and pair it with an Attention-Variance Filter that removes statistical outliers. When these signals are used for filtering, the defended system achieves up to about 20 percent higher accuracy than baseline defenses under attack. The work also introduces adaptive attacks that attempt to mask the anomalies; these succeed at most 35 percent of the time.

Core claim

If a small number of poisoned passages are to dictate the generated answer, they must receive higher or more variable attention than the surrounding benign passages. The paper therefore defines the Normalized Passage Attention Score as a measure of each passage's relative influence on the output tokens. An Attention-Variance Filter then removes passages whose attention distribution deviates from the norm. When applied, this raises the accuracy of the RAG system under attack by up to 20 percent over standard defenses.
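The scoring-and-filtering idea above can be illustrated with a minimal sketch. It is not the authors' implementation: the exact NPAS formula and filter rule are not given on this page, so the sketch assumes that a passage's score is its share of attention mass averaged over output tokens, and that the filter repeatedly drops the highest-scoring passage while the variance of the remaining scores exceeds a threshold. The function names, the `delta` threshold, and the `max_remove` cap are hypothetical.

```python
import numpy as np

def normalized_passage_scores(attn, spans):
    """Share of attention mass that each passage receives.

    attn:  array of shape (num_output_tokens, num_context_tokens), assumed
           to be already averaged over layers and heads (sketch assumption).
    spans: list of (start, end) token index ranges, one per passage.
    """
    attn = np.asarray(attn, dtype=float)
    # Attention mass on each passage, averaged over output tokens.
    mass = np.array([attn[:, start:end].sum(axis=1).mean() for start, end in spans])
    total = mass.sum()
    if total == 0:
        raise ValueError("no attention mass falls on any passage")
    # Normalize across passages so the scores sum to 1.
    return mass / total

def variance_filter(scores, delta, max_remove):
    """Return indices of passages kept by a hypothetical variance rule.

    While the variance of the renormalized scores exceeds delta, drop the
    passage with the highest score, up to max_remove removals.
    """
    keep = list(range(len(scores)))
    for _ in range(max_remove):
        if len(keep) <= 1:
            break
        current = np.array([scores[i] for i in keep], dtype=float)
        total = current.sum()
        if total == 0:
            break
        current = current / total
        if current.var() <= delta:
            break
        keep.pop(int(current.argmax()))
    return keep

# Toy example: four passages of two tokens each; passage 2 dominates.
attn = [[0.05, 0.05, 0.05, 0.05, 0.35, 0.35, 0.05, 0.05]] * 2
spans = [(0, 2), (2, 4), (4, 6), (6, 8)]
scores = normalized_passage_scores(attn, spans)
kept = variance_filter(scores, delta=0.01, max_remove=1)  # passage 2 is dropped
```

Capping removals with `max_remove` is a design choice of this sketch: it keeps a high-variance but benign retrieved set from being filtered down to nothing.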

What carries the argument

The Normalized Passage Attention Score (NPAS) and the Attention-Variance Filter (AV Filter), both derived from attention weights: NPAS measures each passage's influence on the response, and the AV Filter flags passages whose influence is anomalous.

If this is right

Defenses can operate by inspecting internal attention signals rather than final text alone.
Attackers face a trade-off between controlling the output and remaining undetectable via attention.
The formal distinguishability game shows that true stealth is limited when few passages must dominate the response.
Adaptive attacks that try to conceal anomalies still leave measurable traces in attention patterns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Attention monitoring could extend to other context-manipulation settings where models retrieve external material.
Combining attention analysis with output-consistency checks might produce stronger composite defenses.
Making poisoning reliably stealthy may require attackers to spread influence across many passages rather than concentrate it.

Load-bearing premise

Poisoned passages that control the response must bias the inference process more than benign ones, producing detectable attention anomalies.

What would settle it

A poisoning attack that alters the RAG output as intended while producing attention scores and variance values indistinguishable from an unpoisoned baseline would falsify the claim.

Figures

Figures reproduced from arXiv: 2506.04390 by Ashish Hooda, Krishnamurthy Dj Dvijotham, Nils Palumbo, Sarthak Choudhary, Somesh Jha.

**Figure 1.** AV Filter Overview. In this example, the retriever returns a set of passages $z^{(k)}$, one of which is poisoned and disproportionately influences the response. This leads to a skewed distribution of normalized passage attention scores and elevated variance across passages. AV Filter mitigates this by removing passages with anomalously high attention scores, indicative of potential poisoning.

**Figure 2.** (a) Average attention scores across passage positions in retrieved sets over multiple…

**Figure 3.** (a) Attack Success Rate (ASR) of the GCG-Poison adaptive attack on the RealtimeQA…

**Figure 4.** Effect of Corruption Rate and Filtering Threshold. This figure shows the impact of varying the corruption rate ϵ and the filtering threshold δ on the performance of the AV Filter. Subfigures (a) and (b) present the Attack Success Rate (ASR) and Robust Accuracy (RACC) on the RealtimeQA-MC dataset with α = ∞, averaged over all models. As expected, ASR increases and RACC decreases with higher corruption rates…

**Figure 5.** Attention Patterns in Benign vs. Poisoned Passages. The figure highlights the token-level attention weights (as a fraction of total attention over the retrieved set) for a query from the RealtimeQA dataset, computed using Llama 2. (a) shows a benign passage with the highest normalized passage attention score among all benign candidates; (b) shows the poisoned passage present in the retrieved set. Tokens such as 3,…

read the original abstract

Retrieval-augmented generation (RAG) systems are vulnerable to attacks that inject poisoned passages into the retrieved context, even at low corruption rates. We show that existing attacks are not designed to be stealthy, allowing reliable detection and mitigation. We formalize a distinguishability-based security game to quantify stealth for such attacks. If a few poisoned passages control the response, they must bias the inference process more than the benign ones, inherently compromising stealth. This motivates analyzing intermediate signals of LLMs, such as attention weights, to approximate the influence of different passages on the response. Leveraging attention weights, we introduce the **Normalized Passage Attention Score** (NPAS) and a lightweight **Attention-Variance Filter** (AV Filter) that flags anomalous passages. Our method improves robustness, yielding up to ~**20%** higher accuracy than baseline defenses. We also develop adaptive attacks that attempt to conceal such anomalies, achieving up to **35%** success rate and underscoring the challenges of achieving true stealth in poisoning RAG systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's core premise—that steering the output requires measurably higher or more variable attention on poisoned passages—looks like a modeling choice rather than a necessity, and the 35% adaptive attack success rate keeps that question open.

The authors formalize a distinguishability game for stealthy RAG poisoning and then use attention weights to build NPAS and the AV Filter. They report accuracy gains of up to roughly 20% over baselines and test adaptive attacks that still succeed up to 35% of the time. That combination is the main concrete output: a lightweight filter plus evidence that perfect stealth remains hard under their threat model.

Referee Report

3 major / 2 minor

Summary. The paper claims that poisoning attacks on RAG systems are not inherently stealthy because controlling the output requires biasing inference more than benign passages, which can be detected via attention weights. It formalizes a distinguishability security game, introduces the Normalized Passage Attention Score (NPAS) and Attention-Variance Filter (AV Filter), reports up to ~20% higher accuracy than baseline defenses, and shows that adaptive attacks achieve at most 35% success rate.

Significance. If the results hold under rigorous validation, the work contributes a formal security game for stealth in RAG poisoning and demonstrates that internal LLM signals (attention) can yield practical defenses. The adaptive attack evaluation is a strength, as it directly tests the limits of the proposed method rather than relying solely on static baselines.

major comments (3)

[§2] §2 (motivation) and security game formalization: The premise that 'if a few poisoned passages control the response, they must bias the inference process more than the benign ones' is asserted as a necessity that compromises stealth. This modeling choice directly motivates NPAS and the AV Filter, but the manuscript provides no derivation or counterexample showing that distributed or indirect influence across tokens cannot achieve output control without producing measurable attention anomalies. This is load-bearing for the central claim.
[§5] §5 (experimental results): The reported ~20% accuracy gain and 35% adaptive attack success rate are presented without sufficient detail on the number of runs, variance across seeds, exact baseline implementations, or ablation isolating the contribution of the attention-variance component versus simple thresholding. Without these, it is unclear whether the gains are robust or sensitive to data selection.
[§4.1] Definition of NPAS (likely §4.1): The normalization and variance computation assume that higher or more variable attention on poisoned passages is both necessary and sufficient for detection. If an adaptive attack can equalize attention distributions while still steering generation (e.g., via prompt engineering on multiple passages), the filter's signal disappears; the manuscript should include a targeted experiment testing this scenario.

minor comments (2)

[§4] Notation for NPAS should be defined with an explicit equation rather than prose description to avoid ambiguity in how normalization is performed across passages of varying lengths.
[§5] The abstract states quantitative gains but the experimental section would benefit from a table summarizing accuracy, attack success rate, and false-positive rate for all methods and datasets.
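
The summary table requested in the last comment needs three quantities per method and dataset. The sketch below shows one way they could be computed, under assumptions that are not taken from the paper: answers are compared by exact string match, attack success means the output equals the attacker's target, and the false-positive rate is the fraction of benign passages that the filter removes. The `Record` fields and the `summarize` helper are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    """One evaluated query under attack (all fields are illustrative)."""
    answer: str          # model output after the defense was applied
    gold: str            # correct answer
    target: str          # answer the attacker tried to induce
    num_passages: int    # size of the retrieved set
    removed: list = field(default_factory=list)   # passage indices the filter dropped
    poisoned: list = field(default_factory=list)  # passage indices that were poisoned

def summarize(records):
    """Return attack success rate, robust accuracy, and false-positive rate."""
    n = len(records)
    asr = sum(r.answer == r.target for r in records) / n
    racc = sum(r.answer == r.gold for r in records) / n
    # Benign passages removed, over all benign passages seen.
    benign_removed = sum(len(set(r.removed) - set(r.poisoned)) for r in records)
    benign_total = sum(r.num_passages - len(set(r.poisoned)) for r in records)
    fpr = benign_removed / benign_total if benign_total else 0.0
    return {"ASR": asr, "RACC": racc, "FPR": fpr}

records = [
    Record(answer="B", gold="A", target="B", num_passages=5, removed=[0], poisoned=[1]),
    Record(answer="A", gold="A", target="C", num_passages=5, removed=[2], poisoned=[2]),
]
summary = summarize(records)
```

Real evaluations often use looser answer matching than exact string equality, so the comparison would need to be swapped for whatever matching rule the benchmark defines.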

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive review and for identifying areas where additional rigor and detail would strengthen the manuscript. We address each major comment below and commit to revisions that improve clarity without altering the core claims.

Referee: [§2] §2 (motivation) and security game formalization: The premise that 'if a few poisoned passages control the response, they must bias the inference process more than the benign ones' is asserted as a necessity that compromises stealth. This modeling choice directly motivates NPAS and the AV Filter, but the manuscript provides no derivation or counterexample showing that distributed or indirect influence across tokens cannot achieve output control without producing measurable attention anomalies. This is load-bearing for the central claim.

Authors: The distinguishability security game formalizes stealth as the inability of an attacker to control output without creating detectable differences in passage influence. The manuscript motivates this via the observation that output control in transformer-based generation requires disproportionate contribution from poisoned passages. We agree that an explicit derivation or counterexample addressing distributed token-level influence would strengthen the argument. In revision we will expand §2 with a short proof sketch showing that any successful steering must increase the aggregate attention mass on the controlling passages (by the properties of softmax attention and next-token prediction), together with a brief counterexample illustrating why purely indirect influence fails to override benign context at low corruption rates. revision: yes
Referee: [§5] §5 (experimental results): The reported ~20% accuracy gain and 35% adaptive attack success rate are presented without sufficient detail on the number of runs, variance across seeds, exact baseline implementations, or ablation isolating the contribution of the attention-variance component versus simple thresholding. Without these, it is unclear whether the gains are robust or sensitive to data selection.

Authors: We acknowledge that the experimental section would benefit from greater statistical transparency. The reported figures aggregate results over multiple random seeds and datasets, yet the manuscript does not tabulate per-seed variance or explicitly describe baseline re-implementations. In the revised version we will add (i) the exact number of runs and standard deviations for all accuracy and attack-success metrics, (ii) precise references and hyper-parameter settings for each baseline, and (iii) an ablation table that isolates the contribution of the variance term in the AV Filter versus simple NPAS thresholding. revision: yes
Referee: [§4.1] Definition of NPAS (likely §4.1): The normalization and variance computation assume that higher or more variable attention on poisoned passages is both necessary and sufficient for detection. If an adaptive attack can equalize attention distributions while still steering generation (e.g., via prompt engineering on multiple passages), the filter's signal disappears; the manuscript should include a targeted experiment testing this scenario.

Authors: The NPAS formulation is derived from the necessity of biasing attention to achieve control, and the adaptive attacks already evaluated attempt to reduce attention anomalies yet still reach only 35% success. Nevertheless, we agree that an explicit test of attention-equalization strategies (e.g., multi-passage prompt engineering) is warranted. We will add a targeted experiment in §5 that constructs such equalizing attacks and reports the resulting NPAS distributions and filter performance, thereby quantifying the residual detectability even under this stronger threat model. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper states a premise in the abstract that poisoned passages controlling the response must bias inference more than benign ones, which motivates the distinguishability security game and attention-based NPAS/AV Filter. This is presented as a logical motivation and modeling choice rather than a derivation that reduces by construction to fitted inputs or self-citations. No equations, self-citation chains, ansatzes, or renamings are quoted that exhibit the specific reductions required by the circularity patterns. The reported accuracy gains are empirical comparisons to baselines, leaving the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Review performed on abstract only; ledger entries are therefore limited to concepts explicitly named in the abstract.

axioms (1)

Domain assumption: Poisoned passages that control the generated response must bias the inference process more than benign passages, inherently reducing stealth.
This premise is stated directly in the abstract as the motivation for analyzing attention weights.

invented entities (2)

Normalized Passage Attention Score (NPAS) · no independent evidence
purpose: Quantify the relative influence of each retrieved passage on the model output via attention weights.
New metric introduced in the abstract; no independent evidence provided.

Attention-Variance Filter (AV Filter) · no independent evidence
purpose: Flag anomalous passages based on variance in attention patterns.
New lightweight filter introduced in the abstract; no independent evidence provided.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Paper passage: If a few poisoned passages control the response, they must bias the inference process more than the benign ones, inherently compromising stealth. This motivates analyzing intermediate signals... We formalize stealth using a distinguishability-based security game.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · bare_distinguishability_of_absolute_floor · echoes

Paper passage: Stealth Attack Distinguishability Game (SADG)... The attack is said to be τ-stealthy if, for all PPT defenders D, the advantage is at most τ
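
The τ-stealth condition quoted above can be written out in one conventional form. This is a sketch, not the paper's definition: it assumes the defender's advantage is the gap between its acceptance probabilities on a poisoned and a benign retrieved set, and the symbols $q$, $Z_{\text{poison}}$, and $Z_{\text{benign}}$ are hypothetical names for the query and the two retrieved sets.

$$\mathrm{Adv}(D) = \left| \Pr\left[D(q, Z_{\text{poison}}) = 1\right] - \Pr\left[D(q, Z_{\text{benign}}) = 1\right] \right|, \qquad \text{the attack is } \tau\text{-stealthy} \iff \mathrm{Adv}(D) \le \tau \ \text{ for all PPT defenders } D.$$

Under this reading, a smaller τ means a stealthier attack.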

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Needle-in-RAG: Prompt-Conditioned Character-Level Traceback of Poisoned Spans in Retrieved Evidence
cs.CR · 2026-05 · unverdicted · novelty 7.0

RAGCharacter localizes poisoned character spans in RAG evidence via prompt-conditioned counterfactual masking and achieves the best accuracy-over-attribution trade-off across tested attacks and models.

Adaptive Defense Orchestration for RAG: A Sentinel-Strategist Architecture against Multi-Vector Attacks
cs.CR · 2026-04 · unverdicted · novelty 6.0

A context-aware Sentinel-Strategist system for RAG selectively applies defenses to block membership inference and data poisoning while recovering most retrieval utility compared to always-on defense stacks.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 2 Pith papers · 9 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877– 1901, 2020

work page 1901
[3]

Survey of hallucination in natural language generation

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM computing surveys, 55(12):1–38, 2023

work page 2023
[4]

Retrieval augmented language model pre-training

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In International conference on machine learning, pages 3929–3938. PMLR, 2020

work page 2020
[5]

Retrieval- augmented generation for knowledge-intensive nlp tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Na- man Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval- augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020

work page 2020
[6]

Improving language models by retrieving from trillions of tokens

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. In International conference on machine learning, pages 2206–2240. PMLR, 2022

work page 2022
[7]

Dense passage retrieval for open-domain question answering

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick SH Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In EMNLP (1), pages 6769–6781, 2020

work page 2020
[8]

Generative ai in search: Let google do the searching for you

Google. Generative ai in search: Let google do the searching for you. https://blog. google/products/search/generative-ai-google-search-may-2024/ , 2024. Ac- cessed: 2025-04-21

work page 2024
[9]

Wikichat: Stopping the hallucination of large language model chatbots by few-shot grounding on wikipedia

Sina J Semnani, Violet Z Yao, Heidi C Zhang, and Monica S Lam. Wikichat: Stopping the hallucination of large language model chatbots by few-shot grounding on wikipedia. arXiv preprint arXiv:2305.14292, 2023. 17

work page arXiv 2023
[10]

Bing chat

Microsoft. Bing chat. https://www.microsoft.com/en-us/edge/features/ bing-chat, 2024. Accessed: 2025-04-21

work page 2024
[11]

Perplexity ai

Perplexity AI. Perplexity ai. https://www.perplexity.ai/, 2024. Accessed: 2025-04-21

work page 2024
[12]

Llamaindex

Jerry Liu. Llamaindex. https://github.com/jerryjliu/llama_index, November 2022. Accessed: 2025-04-21

work page 2022
[13]

Langchain

LangChain. Langchain. https://github.com/langchain-ai/langchain, 2024. Ac- cessed: 2025-04-21

work page 2024
[14]

Re- flexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Re- flexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023

work page 2023
[15]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations (ICLR), 2023

work page 2023
[16]

Poisoning web-scale training datasets is practical

Nicholas Carlini, Matthew Jagielski, Christopher A Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, and Florian Tramèr. Poisoning web-scale training datasets is practical. In 2024 IEEE Symposium on Security and Privacy (SP), pages 407–425. IEEE, 2024

work page 2024
[17]

Certifiably robust rag against retrieval corruption

Chong Xiang, Tong Wu, Zexuan Zhong, David Wagner, Danqi Chen, and Prateek Mittal. Certifiably robust rag against retrieval corruption. arXiv preprint arXiv:2405.15556, 2024

work page arXiv 2024
[18]

Poisonedrag: Knowledge poi- soning attacks to retrieval-augmented generation of large language models

Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. Poisonedrag: Knowledge poi- soning attacks to retrieval-augmented generation of large language models. arXiv preprint arXiv:2402.07867, 2024

work page arXiv 2024
[19]

Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, pages 79–90, 2023

work page 2023
[20]

Baseline Defenses for Adversarial Attacks Against Aligned Language Models

Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping- yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Base- line defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[21]

Detecting Language Model Attacks with Perplexity

Gabriel Alon and Michael Kamfonas. Detecting language model attacks with perplexity.arXiv preprint arXiv:2308.14132, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

Demystifying prompts in language models via perplexity estimation

Hila Gonen, Srini Iyer, Terra Blevins, Noah A Smith, and Luke Zettlemoyer. Demystifying prompts in language models via perplexity estimation. arXiv preprint arXiv:2212.04037, 2022. 18

work page arXiv 2022
[23]

Rankrag: Unifying context ranking with retrieval-augmented generation in llms

Yue Yu, Wei Ping, Zihan Liu, Boxin Wang, Jiaxuan You, Chao Zhang, Mohammad Shoeybi, and Bryan Catanzaro. Rankrag: Unifying context ranking with retrieval-augmented generation in llms. Advances in Neural Information Processing Systems, 37:121156–121184, 2024

work page 2024
[24]

Hoprag: Multi-hop reasoning for logic-aware retrieval-augmented generation

Hao Liu, Zhengren Wang, Xi Chen, Zhiyu Li, Feiyu Xiong, Qinhan Yu, and Wentao Zhang. Hoprag: Multi-hop reasoning for logic-aware retrieval-augmented generation. arXiv preprint arXiv:2502.12442, 2025

work page arXiv 2025
[25]

Collapse of dense retrievers: Short, early, and literal biases outranking factual evidence

Mohsen Fayyaz, Ali Modarressi, Hinrich Schuetze, and Nanyun Peng. Collapse of dense retrievers: Short, early, and literal biases outranking factual evidence. arXiv preprint arXiv:2503.05037, 2025

work page arXiv 2025
[26]

Building a robust retrieval system with dense retrieval models

Sheng-Chieh Lin. Building a robust retrieval system with dense retrieval models. 2024

work page 2024
[27]

More robust dense retrieval with contrastive dual learning

Yizhi Li, Zhenghao Liu, Chenyan Xiong, and Zhiyuan Liu. More robust dense retrieval with contrastive dual learning. In Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval, pages 287–296, 2021

work page 2021
[28]

Synthetic disinformation attacks on automated fact verification systems

Yibing Du, Antoine Bosselut, and Christopher D Manning. Synthetic disinformation attacks on automated fact verification systems. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 10581–10589, 2022

work page 2022
[29]

Attacking open-domain question answering by injecting misinformation

Liangming Pan, Wenhu Chen, Min-Yen Kan, and William Yang Wang. Attacking open-domain question answering by injecting misinformation. arXiv preprint arXiv:2110.07803, 2021

work page arXiv 2021
[30]

On the risk of misinformation pollution with large language models

Yikang Pan, Liangming Pan, Wenhu Chen, Preslav Nakov, Min-Yen Kan, and William Yang Wang. On the risk of misinformation pollution with large language models. arXiv preprint arXiv:2305.13661, 2023

work page arXiv 2023
[31]

Poisoning retrieval corpora by injecting adversarial passages

Zexuan Zhong, Ziqing Huang, Alexander Wettig, and Danqi Chen. Poisoning retrieval corpora by injecting adversarial passages. arXiv preprint arXiv:2310.19156, 2023

work page arXiv 2023
[32]

Typos that broke the rag’s back: Genetic attack on rag pipeline by simulating documents in the wild via low-level perturbations

Sukmin Cho, Soyeong Jeong, Jeongyeon Seo, Taeho Hwang, and Jong C Park. Typos that broke the rag’s back: Genetic attack on rag pipeline by simulating documents in the wild via low-level perturbations. arXiv preprint arXiv:2404.13948, 2024

work page arXiv 2024
[33]

De- fending against disinformation attacks in open-domain question answering

Orion Weller, Aleem Khan, Nathaniel Weir, Dawn Lawrie, and Benjamin Van Durme. De- fending against disinformation attacks in open-domain question answering. arXiv preprint arXiv:2212.10002, 2022

work page arXiv 2022
[34]

Discern and answer: Mitigating the impact of misinformation in retrieval-augmented models with discriminators

Giwon Hong, Jeonghwan Kim, Junmo Kang, Sung-Hyon Myaeng, and Joyce Jiyoung Whang. Discern and answer: Mitigating the impact of misinformation in retrieval-augmented models with discriminators. CoRR, 2023

work page 2023
[35]

Analyzing the structure of attention in a transformer language model

Jesse Vig and Yonatan Belinkov. Analyzing the structure of attention in a transformer language model. In Tal Linzen, Grzegorz Chrupała, Yonatan Belinkov, and Dieuwke Hupkes, editors, Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 63–76, Florence, Italy, August 2019. Association for Computational...

work page 2019
[36]

H2o: Heavy-hitter oracle for efficient generative inference of large language models

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023

work page 2023
[37]

Zipcache: Accurate and efficient kv cache quantization with salient token identification

Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, and Bohan Zhuang. Zipcache: Accurate and efficient kv cache quantization with salient token identification. arXiv preprint arXiv:2405.14256, 2024

work page arXiv 2024
[38]

Attention sorting combats recency bias in long context language models

Alexander Peysakhovich and Adam Lerer. Attention sorting combats recency bias in long context language models. arXiv preprint arXiv:2310.01427, 2023

work page arXiv 2023
[39]

Understanding data poisoning attacks for RAG: Insights and algorithms, 2025

Xun Xian, Tong Wang, Liwen You, and Yanjun Qi. Understanding data poisoning attacks for RAG: Insights and algorithms, 2025

work page 2025
[40]

Realtime qa: What’s the answer right now? Advances in neural information processing systems, 36:49025–49043, 2023

Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Akari Asai, Xinyan Yu, Dragomir Radev, Noah A Smith, Yejin Choi, Kentaro Inui, et al. Realtime qa: What’s the answer right now? Advances in neural information processing systems, 36:49025–49043, 2023

work page 2023
[41]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

Lost in the Middle: How Language Models Use Long Contexts

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

Serial position effects of large language models.arXiv preprint arXiv:2406.15981, 2024

Xiaobo Guo and Soroush V osoughi. Serial position effects of large language models.arXiv preprint arXiv:2406.15981, 2024

work page arXiv 2024
[44]

Natural questions: a benchmark for question answering research

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019

[45]

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018

[46]

Mistral 7B

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. arXiv preprint...

[47]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023

[48]

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451, 2023

A Additional Background on Existing Works

PoisonedRAG. Given a query q and target answer s′, PoisonedRAG (Poison) seeks to craft a poisoned passage zpoison such that a RAG system is ...


[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

[2]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

[3]

Survey of hallucination in natural language generation

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM computing surveys, 55(12):1–38, 2023

[4]

Retrieval augmented language model pre-training

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In International conference on machine learning, pages 3929–3938. PMLR, 2020

[5]

Retrieval-augmented generation for knowledge-intensive nlp tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020

[6]

Improving language models by retrieving from trillions of tokens

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. In International conference on machine learning, pages 2206–2240. PMLR, 2022

[7]

Dense passage retrieval for open-domain question answering

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick SH Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In EMNLP (1), pages 6769–6781, 2020

[8]

Generative ai in search: Let google do the searching for you

Google. Generative ai in search: Let google do the searching for you. https://blog.google/products/search/generative-ai-google-search-may-2024/, 2024. Accessed: 2025-04-21

[9]

Wikichat: Stopping the hallucination of large language model chatbots by few-shot grounding on wikipedia

Sina J Semnani, Violet Z Yao, Heidi C Zhang, and Monica S Lam. Wikichat: Stopping the hallucination of large language model chatbots by few-shot grounding on wikipedia. arXiv preprint arXiv:2305.14292, 2023

[10]

Bing chat

Microsoft. Bing chat. https://www.microsoft.com/en-us/edge/features/bing-chat, 2024. Accessed: 2025-04-21

[11]

Perplexity ai

Perplexity AI. Perplexity ai. https://www.perplexity.ai/, 2024. Accessed: 2025-04-21

[12]

Llamaindex

Jerry Liu. Llamaindex. https://github.com/jerryjliu/llama_index, November 2022. Accessed: 2025-04-21

[13]

Langchain

LangChain. Langchain. https://github.com/langchain-ai/langchain, 2024. Accessed: 2025-04-21

[14]

Reflexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023

[15]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023

[16]

Poisoning web-scale training datasets is practical

Nicholas Carlini, Matthew Jagielski, Christopher A Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, and Florian Tramèr. Poisoning web-scale training datasets is practical. In 2024 IEEE Symposium on Security and Privacy (SP), pages 407–425. IEEE, 2024

[17]

Certifiably robust rag against retrieval corruption

Chong Xiang, Tong Wu, Zexuan Zhong, David Wagner, Danqi Chen, and Prateek Mittal. Certifiably robust rag against retrieval corruption. arXiv preprint arXiv:2405.15556, 2024

[18]

Poisonedrag: Knowledge poisoning attacks to retrieval-augmented generation of large language models

Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. Poisonedrag: Knowledge poisoning attacks to retrieval-augmented generation of large language models. arXiv preprint arXiv:2402.07867, 2024

[19]

Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, pages 79–90, 2023

[20]

Baseline Defenses for Adversarial Attacks Against Aligned Language Models

Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023

[21]

Detecting Language Model Attacks with Perplexity

Gabriel Alon and Michael Kamfonas. Detecting language model attacks with perplexity. arXiv preprint arXiv:2308.14132, 2023

[22]

Demystifying prompts in language models via perplexity estimation

Hila Gonen, Srini Iyer, Terra Blevins, Noah A Smith, and Luke Zettlemoyer. Demystifying prompts in language models via perplexity estimation. arXiv preprint arXiv:2212.04037, 2022

[23]

Rankrag: Unifying context ranking with retrieval-augmented generation in llms

Yue Yu, Wei Ping, Zihan Liu, Boxin Wang, Jiaxuan You, Chao Zhang, Mohammad Shoeybi, and Bryan Catanzaro. Rankrag: Unifying context ranking with retrieval-augmented generation in llms. Advances in Neural Information Processing Systems, 37:121156–121184, 2024

[24]

Hoprag: Multi-hop reasoning for logic-aware retrieval-augmented generation

Hao Liu, Zhengren Wang, Xi Chen, Zhiyu Li, Feiyu Xiong, Qinhan Yu, and Wentao Zhang. Hoprag: Multi-hop reasoning for logic-aware retrieval-augmented generation. arXiv preprint arXiv:2502.12442, 2025

[25]

Collapse of dense retrievers: Short, early, and literal biases outranking factual evidence

Mohsen Fayyaz, Ali Modarressi, Hinrich Schuetze, and Nanyun Peng. Collapse of dense retrievers: Short, early, and literal biases outranking factual evidence. arXiv preprint arXiv:2503.05037, 2025

[26]

Building a robust retrieval system with dense retrieval models

Sheng-Chieh Lin. Building a robust retrieval system with dense retrieval models. 2024

[27]

More robust dense retrieval with contrastive dual learning

Yizhi Li, Zhenghao Liu, Chenyan Xiong, and Zhiyuan Liu. More robust dense retrieval with contrastive dual learning. In Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval, pages 287–296, 2021

[28]

Synthetic disinformation attacks on automated fact verification systems

Yibing Du, Antoine Bosselut, and Christopher D Manning. Synthetic disinformation attacks on automated fact verification systems. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 10581–10589, 2022

[29]

Attacking open-domain question answering by injecting misinformation

Liangming Pan, Wenhu Chen, Min-Yen Kan, and William Yang Wang. Attacking open-domain question answering by injecting misinformation. arXiv preprint arXiv:2110.07803, 2021

[30]

On the risk of misinformation pollution with large language models

Yikang Pan, Liangming Pan, Wenhu Chen, Preslav Nakov, Min-Yen Kan, and William Yang Wang. On the risk of misinformation pollution with large language models. arXiv preprint arXiv:2305.13661, 2023

[31]

Poisoning retrieval corpora by injecting adversarial passages

Zexuan Zhong, Ziqing Huang, Alexander Wettig, and Danqi Chen. Poisoning retrieval corpora by injecting adversarial passages. arXiv preprint arXiv:2310.19156, 2023

[32]

Typos that broke the rag’s back: Genetic attack on rag pipeline by simulating documents in the wild via low-level perturbations

Sukmin Cho, Soyeong Jeong, Jeongyeon Seo, Taeho Hwang, and Jong C Park. Typos that broke the rag’s back: Genetic attack on rag pipeline by simulating documents in the wild via low-level perturbations. arXiv preprint arXiv:2404.13948, 2024

[33]

Defending against disinformation attacks in open-domain question answering

Orion Weller, Aleem Khan, Nathaniel Weir, Dawn Lawrie, and Benjamin Van Durme. Defending against disinformation attacks in open-domain question answering. arXiv preprint arXiv:2212.10002, 2022

[34]

Discern and answer: Mitigating the impact of misinformation in retrieval-augmented models with discriminators

Giwon Hong, Jeonghwan Kim, Junmo Kang, Sung-Hyon Myaeng, and Joyce Jiyoung Whang. Discern and answer: Mitigating the impact of misinformation in retrieval-augmented models with discriminators. CoRR, 2023

[35]

Analyzing the structure of attention in a transformer language model

Jesse Vig and Yonatan Belinkov. Analyzing the structure of attention in a transformer language model. In Tal Linzen, Grzegorz Chrupała, Yonatan Belinkov, and Dieuwke Hupkes, editors, Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 63–76, Florence, Italy, August 2019. Association for Computational...

[36]

H2o: Heavy-hitter oracle for efficient generative inference of large language models

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023

