Through the Stealth Lens: Attention-Aware Defenses Against Poisoning in RAG
Pith reviewed 2026-05-25 08:07 UTC · model grok-4.3
The pith
Poisoned passages that control RAG outputs must bias attention weights enough to be flagged by a normalized attention score and a variance filter; the resulting defense raises accuracy by up to about 20 percent over baseline defenses.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
If a small number of poisoned passages are to dictate the generated answer, they must receive higher or more variable attention than the surrounding benign passages. The paper therefore defines the Normalized Passage Attention Score (NPAS) as a measure of each passage's relative influence on the output tokens. An Attention-Variance Filter (AV Filter) then removes passages whose attention scores are anomalous relative to the other retrieved passages. Under attack, a RAG system using this filter is up to about 20 percent more accurate than one using baseline defenses.
What carries the argument
The Normalized Passage Attention Score (NPAS) and the Attention-Variance Filter (AV Filter), both derived from attention weights: NPAS measures each passage's influence on the response, and the AV Filter flags passages whose influence is anomalous.
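The two components can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's definition: this page does not reproduce the exact NPAS formula or the filter's decision rule. The sketch assumes attention has already been aggregated to one weight per context token, that a passage's score is the sum over its token span divided by the total over all passages, and that the filter drops the top-scoring passage whenever the variance of the scores exceeds a threshold. The names `passage_scores`, `variance_filter`, `threshold`, and `max_removals` are hypothetical.

```python
from statistics import pvariance


def passage_scores(token_attention, spans):
    """Sum attention over each passage's token span, then normalize to sum to 1.

    token_attention: one aggregated attention weight per context token (assumed).
    spans: list of (start, end) token index pairs, end exclusive, one per passage.
    """
    raw = [sum(token_attention[start:end]) for start, end in spans]
    total = sum(raw)
    if total == 0:
        # No attention on any passage: treat all passages as equally influential.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


def variance_filter(token_attention, spans, threshold, max_removals=1):
    """Drop the top-scoring passage while the score variance exceeds `threshold`.

    Returns (kept_indices, removed_indices). This removal rule is an assumption
    made for illustration; the paper's AV Filter may decide differently.
    """
    kept = list(range(len(spans)))
    removed = []
    while len(removed) < max_removals and len(kept) > 1:
        scores = passage_scores(token_attention, [spans[i] for i in kept])
        if pvariance(scores) <= threshold:
            break  # scores look balanced, nothing to flag
        top = max(range(len(kept)), key=lambda j: scores[j])
        removed.append(kept.pop(top))
    return kept, removed


if __name__ == "__main__":
    # Three passages of two tokens each; the middle one absorbs most attention.
    attention = [0.1, 0.1, 0.35, 0.35, 0.05, 0.05]
    spans = [(0, 2), (2, 4), (4, 6)]
    print(variance_filter(attention, spans, threshold=0.02))
```

How attention is aggregated across heads, layers, and output tokens, and how the threshold is chosen, are left open here because the page does not specify them.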
If this is right
- Defenses can operate by inspecting internal attention signals rather than final text alone.
- Attackers face a trade-off between controlling the output and remaining undetectable via attention.
- The formal distinguishability game shows that true stealth is limited when few passages must dominate the response.
- Adaptive attacks that try to conceal anomalies still leave measurable traces in attention patterns.
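The distinguishability game in the third point can be made concrete with an empirical estimate. The sketch below assumes the defender's advantage is the absolute gap between how often a detector flags poisoned contexts and how often it flags clean ones, which is a common way to write such an advantage; the paper's exact definition may differ. The function names are hypothetical.

```python
def empirical_advantage(flags_on_clean, flags_on_poisoned):
    """Estimate |Pr[flag | poisoned] - Pr[flag | clean]| for one detector.

    Each argument is a list of booleans: True means the detector flagged
    that retrieved context as poisoned.
    """
    if not flags_on_clean or not flags_on_poisoned:
        raise ValueError("both samples must be non-empty")
    false_positive_rate = sum(flags_on_clean) / len(flags_on_clean)
    true_positive_rate = sum(flags_on_poisoned) / len(flags_on_poisoned)
    return abs(true_positive_rate - false_positive_rate)


def refutes_tau_stealth(flags_on_clean, flags_on_poisoned, tau):
    """True if this one detector's estimated advantage already exceeds tau."""
    return empirical_advantage(flags_on_clean, flags_on_poisoned) > tau


if __name__ == "__main__":
    clean = [False, False, False, True]
    poisoned = [True, True, True, False]
    print(empirical_advantage(clean, poisoned))
    print(refutes_tau_stealth(clean, poisoned, tau=0.1))
```

Because stealth is quantified over all defenders, a single detector can only show that an attack is not stealthy at a given tau; a low measured advantage for one detector does not certify stealth.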
Where Pith is reading between the lines
- Attention monitoring could extend to other context-manipulation settings where models retrieve external material.
- Combining attention analysis with output-consistency checks might produce stronger composite defenses.
- Making poisoning reliably stealthy may require attackers to spread influence across many passages rather than concentrate it.
Load-bearing premise
Poisoned passages that control the response must bias the inference process more than benign ones, producing detectable attention anomalies.
What would settle it
A poisoning attack that alters the RAG output as intended while producing attention scores and variance values indistinguishable from an unpoisoned baseline would falsify the claim.
Original abstract
Retrieval-augmented generation (RAG) systems are vulnerable to attacks that inject poisoned passages into the retrieved context, even at low corruption rates. We show that existing attacks are not designed to be stealthy, allowing reliable detection and mitigation. We formalize a distinguishability-based security game to quantify stealth for such attacks. If a few poisoned passages control the response, they must bias the inference process more than the benign ones, inherently compromising stealth. This motivates analyzing intermediate signals of LLMs, such as attention weights, to approximate the influence of different passages on the response. Leveraging attention weights, we introduce the Normalized Passage Attention Score (NPAS) and a lightweight Attention-Variance Filter (AV Filter) that flags anomalous passages. Our method improves robustness, yielding up to ~20% higher accuracy than baseline defenses. We also develop adaptive attacks that attempt to conceal such anomalies, achieving up to 35% success rate and underscoring the challenges of achieving true stealth in poisoning RAG systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that poisoning attacks on RAG systems are not inherently stealthy because controlling the output requires biasing inference more than benign passages, which can be detected via attention weights. It formalizes a distinguishability security game, introduces the Normalized Passage Attention Score (NPAS) and Attention-Variance Filter (AV Filter), reports up to ~20% higher accuracy than baseline defenses, and shows that adaptive attacks achieve at most 35% success rate.
Significance. If the results hold under rigorous validation, the work contributes a formal security game for stealth in RAG poisoning and demonstrates that internal LLM signals (attention) can yield practical defenses. The adaptive attack evaluation is a strength, as it directly tests the limits of the proposed method rather than relying solely on static baselines.
major comments (3)
- [§2] §2 (motivation) and security game formalization: The premise that 'if a few poisoned passages control the response, they must bias the inference process more than the benign ones' is asserted as a necessity that compromises stealth. This modeling choice directly motivates NPAS and the AV Filter, but the manuscript provides no derivation or counterexample showing that distributed or indirect influence across tokens cannot achieve output control without producing measurable attention anomalies. This is load-bearing for the central claim.
- [§5] §5 (experimental results): The reported ~20% accuracy gain and 35% adaptive attack success rate are presented without sufficient detail on the number of runs, variance across seeds, exact baseline implementations, or ablation isolating the contribution of the attention-variance component versus simple thresholding. Without these, it is unclear whether the gains are robust or sensitive to data selection.
- [§4.1] Definition of NPAS (likely §4.1): The normalization and variance computation assume that higher or more variable attention on poisoned passages is both necessary and sufficient for detection. If an adaptive attack can equalize attention distributions while still steering generation (e.g., via prompt engineering on multiple passages), the filter's signal disappears; the manuscript should include a targeted experiment testing this scenario.
minor comments (2)
- [§4] Notation for NPAS should be defined with an explicit equation rather than prose description to avoid ambiguity in how normalization is performed across passages of varying lengths.
- [§5] The abstract states quantitative gains but the experimental section would benefit from a table summarizing accuracy, attack success rate, and false-positive rate for all methods and datasets.
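The first minor comment, on length normalization, can be illustrated with two candidate forms. These are hypothetical notations written for this illustration, not equations taken from the paper. Let $a_j$ be the attention that the generated tokens place on context token $j$ (aggregated over heads and layers in some fixed way), and let $P_i$ be the set of token positions in passage $i$ out of $k$ retrieved passages.

```latex
% Candidate 1: sum over the passage's tokens, then normalize across passages.
s_i = \sum_{j \in P_i} a_j,
\qquad
\mathrm{NPAS}^{\mathrm{sum}}_i = \frac{s_i}{\sum_{l=1}^{k} s_l}

% Candidate 2: per-token mean first, then normalize across passages.
\bar{s}_i = \frac{1}{|P_i|} \sum_{j \in P_i} a_j,
\qquad
\mathrm{NPAS}^{\mathrm{mean}}_i = \frac{\bar{s}_i}{\sum_{l=1}^{k} \bar{s}_l}
```

The two agree when all passages have the same length. Otherwise the sum form gives longer passages a higher score for the same per-token attention, while the mean form removes that length effect, which is why an explicit equation would settle the ambiguity the referee points to.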
Simulated Author's Rebuttal
We thank the referee for the constructive review and for identifying areas where additional rigor and detail would strengthen the manuscript. We address each major comment below and commit to revisions that improve clarity without altering the core claims.
Referee: [§2] §2 (motivation) and security game formalization: The premise that 'if a few poisoned passages control the response, they must bias the inference process more than the benign ones' is asserted as a necessity that compromises stealth. This modeling choice directly motivates NPAS and the AV Filter, but the manuscript provides no derivation or counterexample showing that distributed or indirect influence across tokens cannot achieve output control without producing measurable attention anomalies. This is load-bearing for the central claim.
Authors: The distinguishability security game formalizes stealth as the inability of an attacker to control output without creating detectable differences in passage influence. The manuscript motivates this via the observation that output control in transformer-based generation requires disproportionate contribution from poisoned passages. We agree that an explicit derivation or counterexample addressing distributed token-level influence would strengthen the argument. In revision we will expand §2 with a short proof sketch showing that any successful steering must increase the aggregate attention mass on the controlling passages (by the properties of softmax attention and next-token prediction), together with a brief counterexample illustrating why purely indirect influence fails to override benign context at low corruption rates. revision: yes
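The softmax property the authors appeal to can be written out for the simplest case. The following is a standard bound for a single attention head in a single layer, given as a sketch of the kind of step such a proof would need; it is not the authors' proof. Here $\alpha_{t,j}$ are the attention weights at output position $t$, $v_j$ the value vectors, and $P$ the token positions of one passage.

```latex
o_t = \sum_{j} \alpha_{t,j}\, v_j,
\qquad \alpha_{t,j} \ge 0,
\qquad \sum_{j} \alpha_{t,j} = 1

m_P = \sum_{j \in P} \alpha_{t,j},
\qquad
\Bigl\| \sum_{j \in P} \alpha_{t,j}\, v_j \Bigr\|
\;\le\; \sum_{j \in P} \alpha_{t,j}\, \| v_j \|
\;\le\; m_P \, \max_{j \in P} \| v_j \|
```

So the direct contribution of passage $P$ to this head's output is at most its attention mass $m_P$ times its largest value norm. This alone does not establish the paper's premise: a passage can compensate for low mass with large value norms, and residual connections, MLP blocks, and composition across layers and heads are outside the bound.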
Referee: [§5] §5 (experimental results): The reported ~20% accuracy gain and 35% adaptive attack success rate are presented without sufficient detail on the number of runs, variance across seeds, exact baseline implementations, or ablation isolating the contribution of the attention-variance component versus simple thresholding. Without these, it is unclear whether the gains are robust or sensitive to data selection.
Authors: We acknowledge that the experimental section would benefit from greater statistical transparency. The reported figures aggregate results over multiple random seeds and datasets, yet the manuscript does not tabulate per-seed variance or explicitly describe baseline re-implementations. In the revised version we will add (i) the exact number of runs and standard deviations for all accuracy and attack-success metrics, (ii) precise references and hyper-parameter settings for each baseline, and (iii) an ablation table that isolates the contribution of the variance term in the AV Filter versus simple NPAS thresholding. revision: yes
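The per-seed reporting promised in item (i) amounts to a mean and a sample standard deviation per method and metric. A minimal sketch follows; the function name is hypothetical and the numbers are placeholders chosen for illustration, not results from the paper.

```python
from statistics import mean, stdev


def summarize_runs(per_seed_scores):
    """Return (mean, sample standard deviation, number of runs)."""
    n = len(per_seed_scores)
    if n < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    return mean(per_seed_scores), stdev(per_seed_scores), n


if __name__ == "__main__":
    # Placeholder accuracies for one method over three seeds (NOT paper results).
    placeholder_accuracies = [0.70, 0.72, 0.74]
    m, s, n = summarize_runs(placeholder_accuracies)
    print(f"accuracy: {m:.3f} +/- {s:.3f} (n={n})")
```

Reporting the same summary for plain NPAS thresholding and for the full AV Filter, on the same seeds, would give the ablation in item (iii) a directly comparable form.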
Referee: [§4.1] Definition of NPAS (likely §4.1): The normalization and variance computation assume that higher or more variable attention on poisoned passages is both necessary and sufficient for detection. If an adaptive attack can equalize attention distributions while still steering generation (e.g., via prompt engineering on multiple passages), the filter's signal disappears; the manuscript should include a targeted experiment testing this scenario.
Authors: The NPAS formulation is derived from the necessity of biasing attention to achieve control, and the adaptive attacks already evaluated attempt to reduce attention anomalies yet still reach only 35% success. Nevertheless, we agree that an explicit test of attention-equalization strategies (e.g., multi-passage prompt engineering) is warranted. We will add a targeted experiment in §5 that constructs such equalizing attacks and reports the resulting NPAS distributions and filter performance, thereby quantifying the residual detectability even under this stronger threat model. revision: yes
Circularity Check
No significant circularity; derivation is self-contained
The paper states a premise in the abstract that poisoned passages controlling the response must bias inference more than benign ones, which motivates the distinguishability security game and attention-based NPAS/AV Filter. This is presented as a logical motivation and modeling choice rather than a derivation that reduces by construction to fitted inputs or self-citations. No equations, self-citation chains, ansatzes, or renamings are quoted that exhibit the specific reductions required by the circularity patterns. The reported accuracy gains are empirical comparisons to baselines, leaving the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Poisoned passages that control the generated response must bias the inference process more than benign passages, inherently reducing stealth.
invented entities (2)
- Normalized Passage Attention Score (NPAS): no independent evidence
- Attention-Variance Filter (AV Filter): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  "If a few poisoned passages control the response, they must bias the inference process more than the benign ones, inherently compromising stealth. This motivates analyzing intermediate signals... We formalize stealth using a distinguishability-based security game."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, bare_distinguishability_of_absolute_floor (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  "Stealth Attack Distinguishability Game (SADG)... The attack is said to be τ-stealthy if, for all PPT defenders D, the advantage is at most τ"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Needle-in-RAG: Prompt-Conditioned Character-Level Traceback of Poisoned Spans in Retrieved Evidence
  RAGCharacter localizes poisoned character spans in RAG evidence via prompt-conditioned counterfactual masking and achieves the best accuracy-over-attribution trade-off across tested attacks and models.
- Adaptive Defense Orchestration for RAG: A Sentinel-Strategist Architecture against Multi-Vector Attacks
  A context-aware Sentinel-Strategist system for RAG selectively applies defenses to block membership inference and data poisoning while recovering most retrieval utility compared to always-on defense stacks.