pith. machine review for the scientific record. sign in

arxiv: 2605.11217 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.AI· cs.CR

Recognition: 2 theorem links

· Lean Theorem

Leveraging RAG for Training-Free Alignment of LLMs

Authors on Pith no claims yet

Pith reviewed 2026-05-13 02:36 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CR
keywords RAGLLM alignmentpreference pairsagentic attackstraining-free alignmentsafety guardrailscontrastive informationinference-time alignment
0
0 comments X

The pith

RAG conditioning on preference pairs during inference triples LLM refusals to agentic attacks when added to offline alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes RAG-Pref, a training-free method that retrieves preferred and dispreferred samples and conditions the language model on them at generation time. This supplies contrastive signals to strengthen refusal guardrails for agentic attacks. When combined with standard offline alignment training, the approach yields more than a 3.7-fold average improvement in refusals across five common LLMs. It also raises performance on general human-preference tasks while keeping added computational cost low, in contrast to other online alignment techniques.

Core claim

RAG-Pref is an online alignment algorithm that uses retrieval augmented generation to condition LLMs on preferred and dispreferred samples, thereby leveraging contrastive information at inference. When combined with offline alignment algorithms, it enables more than an average 3.7 factor improvement in agentic attack refusals across five widely used LLMs, compared to 2.9 for other online alignment algorithms and 1.5 for offline alignment alone. In contrast to other online methods, it similarly increases performance on general human-preference alignment tasks and does not drastically increase overall computational requirements.

What carries the argument

RAG-Pref, a retrieval-augmented method that pulls preferred and dispreferred preference pairs and conditions the model's output on them at inference time to deliver contrastive alignment information.

If this is right

  • Agentic attack refusal rates rise by over 3.7 times on average across five LLMs when RAG-Pref supplements offline alignment.
  • General human-preference alignment performance improves at a level comparable to the gains on attack refusals.
  • Overall computational demands stay close to standard inference without large added overhead.
  • The method integrates with off-the-shelf packages and applies across multiple widely used LLMs.
  • Other online alignment techniques deliver smaller refusal gains and lack the same benefit to general preference tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Updating the retrieval database alone could let models adapt quickly to new attack patterns without retraining.
  • The same inference-time contrastive conditioning might be tested on tasks such as reducing hallucinations or improving instruction following.
  • A shared preference database could allow alignment adjustments across different models without individual retraining runs.
  • The approach invites checks on how sample quality and coverage in the retrieval set affect long-term robustness.

Load-bearing premise

Retrieving and conditioning on preferred and dispreferred samples during inference supplies contrastive information that generalizes to new agentic attacks without degrading other capabilities or creating new vulnerabilities.

What would settle it

A controlled test on held-out agentic attacks where the combination of RAG-Pref and offline alignment shows no improvement or a reduction in refusal rates relative to offline alignment alone.

Figures

Figures reproduced from arXiv: 2605.11217 by John T. Halloran.

Figure 1
Figure 1. Figure 1: Offline-aligned Llama-3.2-1B with following DPO losses: 1) No DPO - base model (no refusal alignment), (2) DPO - the original “sigmoid” DPO loss function (Rafailov et al., 2023), (3) AOT - Alignment via Optimal Transport (Melnyk et al., 2024), (4) APOd - Anchored Preference Optimization (APO) down (D’Oosterlinck et al., 2025), (5) APOz - APO zero (D’Oosterlinck et al., 2025), (6) BCO - Binary Classifier Op… view at source ↗
Figure 2
Figure 2. Figure 2: DeepSeek-R1-Distill-Qwen-14B aligned with DPO for 90 Epochs. Training quickly converges. F Standard RAG vs RAG-Pref FBA Refusal Rates Llama-3.2 1B Gemma-2-2B Llama-3.1 8B DSR1D-8B* DSR1D-14B* 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.8 1.0 Refusal Rate Base DPO SafeDPO Vanilla RAG, Base Vanilla RAG, DPO Vanilla RAG, SafeDPO RAG-Pref, Base RAG-Pref, DPO RAG-Pref, SafeDPO [PITH_FULL_IMAGE:figures/full_fig_p017_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Attack Refusal Rates for model using Standard RAG and RAG-Pref: Refusal rates calculated over the test FBAs. Reasoning models are denoted using ∗. Base denotes models evaluated directly from their public checkpoints. G FBA Refusal Judge Details FBA refusals were assessed using the following two-stage judging: 1. Assess response using a BERT-based classifier trained explicitly on rejection/refusal data (Pro… view at source ↗
Figure 4
Figure 4. Figure 4: Response example for offline/online online FBA refusal guardrails. Responses in green show direct compliance. [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗
read the original abstract

Large language model (LLM) alignment algorithms typically consist of post-training over preference pairs. While such algorithms are widely used to enable safety guardrails and align LLMs with general human preferences, we show that state-of-the-art alignment algorithms require significant computational resources while being far less capable of enabling refusal guardrails for recent agentic attacks. Thus, to improve refusal guardrails against such attacks without drastically increasing computational overhead, we introduce Retrieval Augmented Generation for Pref erence alignment (RAG-Pref), a simple RAG-based alignment algorithm which conditions on preferred and dispreferred samples to leverage contrastive information during inference. RAG-Pref is online (training-free), compatible with off-the-shelf packages, and, when combined with offline (training-based) alignment algorithms, enables more than an average 3.7 factor improvement in agentic attack refusals across five widely used LLMs, compared to 2.9 for other online alignment algorithms and 1.5 for offline alignment alone. We conclude by showing that, in stark contrast to other online alignment methods, RAG-Pref similarly increases performance on general human-preference alignment tasks and does not drastically increase overall computational requirements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces RAG-Pref, a training-free alignment method that uses retrieval-augmented generation to condition LLMs on preferred and dispreferred samples at inference time. It claims this approach improves refusal guardrails against agentic attacks, yielding an average 3.7x improvement when combined with offline alignment methods across five LLMs (compared to 2.9x for other online methods and 1.5x for offline alone), while also enhancing performance on general human-preference tasks without substantially increasing computational costs.

Significance. If the empirical claims hold under rigorous validation, the work could offer a practical, low-overhead complement to training-based alignment by enabling effective inference-time contrastive conditioning. The training-free nature and compatibility with off-the-shelf packages are clear strengths that could facilitate broader adoption for safety enhancements.

major comments (3)
  1. [Abstract] Abstract: The headline quantitative claim of a 3.7 factor improvement in agentic attack refusals is presented without any reference to the specific benchmarks, number of trials, statistical significance tests, or controls for confounding factors such as prompt length or retrieval noise, which are load-bearing for assessing whether the data supports the central claim.
  2. [Method] Method section: The RAG-Pref algorithm description provides no details on retrieval corpus construction, the embedding model, top-k policy, or retrieval precision/recall metrics. This is critical because the generalization to novel agentic attacks (the weakest assumption) depends on whether retrieved pairs supply usable contrastive signal rather than surface-level augmentation.
  3. [Experiments] Experiments: No ablations are described that isolate the contribution of the preferred/dispreferred contrast from generic context augmentation, nor are there controls to verify that the LLM utilizes the contrastive information for refusal rather than increasing false positives or introducing new vulnerabilities via the retrieval step.
minor comments (2)
  1. [Abstract] Abstract: Typo in 'Pref erence' (extra space) in the expanded acronym for RAG-Pref.
  2. [Conclusion] The claim that RAG-Pref 'does not drastically increase overall computational requirements' should be backed by concrete measurements (e.g., additional latency or token counts) rather than qualitative statements.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important areas for improving clarity and rigor. We address each major comment point-by-point below. We agree that expanding details and adding ablations will strengthen the manuscript and will incorporate the suggested revisions in the next version.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline quantitative claim of a 3.7 factor improvement in agentic attack refusals is presented without any reference to the specific benchmarks, number of trials, statistical significance tests, or controls for confounding factors such as prompt length or retrieval noise, which are load-bearing for assessing whether the data supports the central claim.

    Authors: We agree that the abstract would benefit from additional context to support the central claim. Due to length constraints, we will revise the abstract to briefly reference the specific agentic attack benchmarks and direct readers to the Experiments section, where we detail the number of trials, statistical significance tests performed, and controls for factors such as prompt length and retrieval noise. revision: yes

  2. Referee: [Method] Method section: The RAG-Pref algorithm description provides no details on retrieval corpus construction, the embedding model, top-k policy, or retrieval precision/recall metrics. This is critical because the generalization to novel agentic attacks (the weakest assumption) depends on whether retrieved pairs supply usable contrastive signal rather than surface-level augmentation.

    Authors: We acknowledge that these implementation details are essential for reproducibility and for validating the contrastive signal. In the revised manuscript, we will expand the Method section to specify the retrieval corpus construction (drawn from established preference datasets), the embedding model employed, the top-k retrieval policy, and quantitative retrieval precision/recall metrics on held-out data to demonstrate that the retrieved pairs provide meaningful contrast rather than superficial augmentation. revision: yes

  3. Referee: [Experiments] Experiments: No ablations are described that isolate the contribution of the preferred/dispreferred contrast from generic context augmentation, nor are there controls to verify that the LLM utilizes the contrastive information for refusal rather than increasing false positives or introducing new vulnerabilities via the retrieval step.

    Authors: We recognize the importance of these ablations and controls for isolating the effect of contrastive conditioning. We will add a dedicated ablation subsection in the Experiments to compare RAG-Pref against generic RAG augmentation without preference contrast. We will also include controls measuring false-positive rates on benign queries and analyze potential new vulnerabilities introduced by retrieval, with quantitative results to confirm that the LLM leverages the contrastive information for improved refusals. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with external benchmarks

full rationale

The paper introduces RAG-Pref as a training-free inference-time method that retrieves and conditions on preference pairs. All reported gains (3.7x refusal improvement, comparisons to 2.9x and 1.5x baselines) are presented as direct empirical measurements on external attack and preference datasets. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work appear in the provided text. The derivation chain is therefore a straightforward algorithmic description plus benchmark evaluation and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review is based solely on the abstract; the central assumption is that RAG retrieval can supply useful contrastive preference information at inference without retraining.

axioms (1)
  • domain assumption LLMs can leverage retrieved context containing preferred and dispreferred examples to improve refusal behavior on unseen inputs
    Implicit in the description of RAG-Pref conditioning during inference.

pith-pipeline@v0.9.0 · 5502 in / 1290 out tokens · 60526 ms · 2026-05-13T02:36:36.086906+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 13 internal anchors

  1. [1]

    Rag llms are not safer: A safety analysis of retrieval-augmented generation for large language models

    Bang An, Shiyue Zhang, and Mark Dredze. Rag llms are not safer: A safety analysis of retrieval-augmented generation for large language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp.\ 5444--5474, 2025

  2. [2]

    System card: Claude opus 4.5

    Anthropic . System card: Claude opus 4.5. Technical report, Anthropic, November 2025. URL https://www-cdn.anthropic.com/bf10f64990cfda0ba858290be7b8cc6317685f47.pdf. Version dated November 24, 2025

  3. [3]

    Donating the model context protocol and establishing the agentic ai foundation

    Anthropic. Donating the model context protocol and establishing the agentic ai foundation. https://www.anthropic.com/news/donating-the-model-context-protocol-and-establishing-of-the-agentic-ai-foundation, December 2025 a . Accessed: 2026-01-26

  4. [4]

    Accessed: 2025-02-12

    Anthropic. Introducing the Model Context Protocol. https://www.anthropic.com/news/model-context-protocol, 2025 b . "Accessed: 2025-02-12"

  5. [5]

    Accessed: 2025-05-09

    Anthropic. Slack MCP Server. https://github.com/modelcontextprotocol/servers/tree/main/src/slack, 2025 c . "Accessed: 2025-05-09"

  6. [6]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

  7. [7]

    Don't do rag: When cache-augmented generation is all you need for knowledge tasks

    Brian J Chan, Chao-Ting Chen, Jui-Hung Cheng, and Hen-Hsen Huang. Don't do rag: When cache-augmented generation is all you need for knowledge tasks. In Companion Proceedings of the ACM on Web Conference 2025, pp.\ 893--897, 2025

  8. [8]

    Jailbreakbench: An open robustness benchmark for jailbreaking large language models

    Patrick Chao, Edoardo Debenedetti, et al. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. Advances in Neural Information Processing Systems (NeurIPS), 2024

  9. [9]

    Jailbreaking black box large language models in twenty queries

    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pp.\ 23--42. IEEE, 2025

  10. [10]

    Noise contrastive alignment of language models with explicit rewards

    Huayu Chen, Guande He, Lifan Yuan, Ganqu Cui, Hang Su, and Jun Zhu. Noise contrastive alignment of language models with explicit rewards. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  11. [11]

    Provably robust dpo: Aligning language models with noisy feedback

    Sayak Ray Chowdhury, Anush Kini, and Nagarajan Natarajan. Provably robust dpo: Aligning language models with noisy feedback. In Forty-first International Conference on Machine Learning, 2024

  12. [12]

    Elements of information theory

    Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999

  13. [13]

    Ultrafeedback: Boosting language models with high-quality feedback

    Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback. 2023

  14. [14]

    CVE List V5 : CVE cache of the official CVE list in CVE JSON 5 format

    CVE Project . CVE List V5 : CVE cache of the official CVE list in CVE JSON 5 format. https://github.com/CVEProject/cvelistV5, 2023. Accessed: 2025-10-30

  15. [15]

    Safe RLHF : Safe reinforcement learning from human feedback

    Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe RLHF : Safe reinforcement learning from human feedback. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=TyFrPOKYXw

  16. [16]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  17. [17]

    Qlora: Efficient finetuning of quantized llms

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems, 36: 0 10088--10115, 2023

  18. [18]

    Anchored preference optimization and contrastive revisions: Addressing underspecification in alignment

    Karel D'Oosterlinck, Winnie Xu, Chris Develder, Thomas Demeester, Amanpreet Singh, Christopher Potts, Douwe Kiela, and Shikib Mehri. Anchored preference optimization and contrastive revisions: Addressing underspecification in alignment. Transactions of the Association for Computational Linguistics, 13: 0 442--460, 2025

  19. [19]

    Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

    Yann Dubois, Bal \'a zs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024

  20. [20]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph rag approach to query-focused summarization, 2025. URL https://arxiv.org/abs/2404.16130

  21. [21]

    Linear alignment: A closed-form solution for aligning human preferences without tuning and feedback

    Songyang Gao, Qiming Ge, Wei Shen, Shihan Dou, Junjie Ye, Xiao Wang, Rui Zheng, Yicheng Zou, Zhi Chen, Hang Yan, et al. Linear alignment: A closed-form solution for aligning human preferences without tuning and feedback. In International Conference on Machine Learning, pp.\ 14702--14722. PMLR, 2024

  22. [22]

    Gemma 2: Improving Open Language Models at a Practical Size

    Team Gemma, Morgane Riviere, Shreya Pathak, et al. Gemma 2: Improving open language models at a practical size, 2024. URL https://arxiv.org/abs/2408.00118

  23. [23]

    Accessed: 2025-05-09

    Google. MCP Toolbox for Databases: Simplify AI Agent Access to Enterprise Data. https://cloud.google.com/blog/products/ai-machine-learning/mcp-toolbox-for-databases-now-supports-model-context-protocol, 2025. "Accessed: 2025-05-09"

  24. [24]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  25. [25]

    Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM workshop on artificial intelligence and security, pp.\ 79--90, 2023

  26. [26]

    Accessed: 2025-05-05

    John Halloran. MCPSafetyScanner - Automated MCP safety auditing and remediation using Agents. https://github.com/johnhalloran321/mcpSafetyScanner, 2025. "Accessed: 2025-05-05"

  27. [27]

    arXiv preprint arXiv:2203.09509 , year=

    Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv:2203.09509, 2022

  28. [28]

    Accessed: 2025-10-13

    HuggingFace. UltraFeedback Binarized. https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized, 2025. "Accessed: 2025-10-13"

  29. [29]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  30. [30]

    Accessed: 2025-05-05

    Invariant. Introducing Guardrails: The contextual security layer for the agentic era. https://invariantlabs.ai/blog/guardrails, 2025 a . "Accessed: 2025-05-05"

  31. [31]

    Accessed: 2025-05-05

    Invariant. Introducing MCP-Scan: Protecting MCP with Invariant. https://invariantlabs.ai/blog/introducing-mcp-scan, 2025 b . "Accessed: 2025-05-05"

  32. [32]

    Towards efficient exact optimization of language model alignment.arXiv preprint arXiv:2402.00856, 2024

    Haozhe Ji, Cheng Lu, Yilin Niu, Pei Ke, Hongning Wang, Jun Zhu, Jie Tang, and Minlie Huang. Towards efficient exact optimization of language model alignment. arXiv preprint arXiv:2402.00856, 2024

  33. [33]

    Beavertails: Towards improved safety alignment of llm via a human-preference dataset

    Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of llm via a human-preference dataset. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp.\ 2...

  34. [34]

    RAG - R eward B ench: Benchmarking reward models in retrieval augmented generation for preference alignment

    Zhuoran Jin, Hongbang Yuan, Tianyi Men, Pengfei Cao, Yubo Chen, Jiexin Xu, Huaijun Li, Xiaojian Jiang, Kang Liu, and Jun Zhao. RAG - R eward B ench: Benchmarking reward models in retrieval augmented generation for preference alignment. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Findings of the Association for Co...

  35. [35]

    Binary classifier optimization for large language model alignment

    Seungjae Jung, Gunsoo Han, Daniel Wontae Nam, and Kyoung-Woon On. Binary classifier optimization for large language model alignment. arXiv preprint arXiv:2404.04656, 2024

  36. [36]

    Safe DPO : A simple approach to direct preference optimization with enhanced safety

    Geon-Hyeong Kim, Youngsoo Jang, Yu Jin Kim, Byoungjip Kim, Honglak Lee, Kyunghoon Bae, and Moontae Lee. Safe DPO : A simple approach to direct preference optimization with enhanced safety. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=PJdw4VBsXD

  37. [37]

    Rlaif vs

    Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Ren Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, et al. Rlaif vs. rlhf: Scaling reinforcement learning from human feedback with ai feedback. In International Conference on Machine Learning, pp.\ 26874--26901. PMLR, 2024

  38. [38]

    The wmdp benchmark: Measuring and reducing malicious use with unlearning

    Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D Li, Ann-Kathrin Dombrowski, Shashwat Goel, Gabriel Mukobi, et al. The wmdp benchmark: Measuring and reducing malicious use with unlearning. In International Conference on Machine Learning, pp.\ 28525--28550. PMLR, 2024

  39. [39]

    Statistical rejection sampling improves preference optimization

    Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J Liu, and Jialu Liu. Statistical rejection sampling improves preference optimization. In The Twelfth International Conference on Learning Representations, 2024

  40. [40]

    AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451, 2023 a

  41. [41]

    Edward Suh, Yevgeniy Vorobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, and Chaowei Xiao

    Xiaogeng Liu, Peiran Li, G. Edward Suh, Yevgeniy Vorobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, and Chaowei Xiao. Autodan-turbo: A lifelong agent for strategy self-exploration to jailbreak llms. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu (eds.), International Conference on Representation Learning, volume 2025, pp.\ 10313--10...

  42. [42]

    Prompt Injection attack against LLM-integrated Applications

    Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, et al. Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499, 2023 b

  43. [43]

    Harmbench: A standardized evaluation framework for automated red teaming and robust refusal

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. In International Conference on Machine Learning, pp.\ 35181--35224. PMLR, 2024

  44. [44]

    Tree of attacks: Jailbreaking black-box llms automatically

    Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box llms automatically. Advances in Neural Information Processing Systems, 37: 0 61065--61105, 2024

  45. [45]

    Distributional preference alignment of llms via optimal transport

    Igor Melnyk, Youssef Mroueh, Brian Belgodere, Mattia Rigotti, Apoorva Nitsure, Mikhail Yurochkin, Kristjan Greenewald, Jiri Navratil, and Jarret Ross. Distributional preference alignment of llms via optimal transport. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  46. [46]

    Simpo: Simple preference optimization with a reference-free reward

    Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward. Advances in Neural Information Processing Systems, 37: 0 124198--124235, 2024

  47. [47]

    Accessed: 2025-03-20

    Microsoft. Introducing Model Context Protocol (MCP) in Copilot Studio. https://tinyurl.com/CopilotMCP, 2025. "Accessed: 2025-03-20"

  48. [48]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35: 0 27730--27744, 2022

  49. [49]

    Ignore Previous Prompt: Attack Techniques For Language Models

    F \'a bio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527, 2022

  50. [50]

    Accessed: 2025-05-15

    ProtectAI. Model Card for distilroberta-base-rejection-v1. https://huggingface.co/protectai/distilroberta-base-rejection-v1, 2025. "Accessed: 2025-05-15"

  51. [51]

    Mcp safety audit: Llms with the model context protocol allow major security exploits

    Brandon Radosevich and John Halloran. Mcp safety audit: Llms with the model context protocol allow major security exploits. arXiv preprint arXiv:2504.03767, 2025

  52. [52]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36: 0 53728--53741, 2023

  53. [53]

    Accessed: 2025-04-28

    Philipp Schmid. How to use Anthropic MCP Server with open LLMs, OpenAI or Google Gemini. https://github.com/philschmid/mcp-openai-gemini-llama-example, 2025. "Accessed: 2025-04-28"

  54. [54]

    Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming

    Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, et al. Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming. arXiv preprint arXiv:2501.18837, 2025

  55. [55]

    do anything now

    Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pp.\ 1671--1685, 2024

  56. [56]

    Measuring and enhancing trustworthiness of LLM s in RAG through grounded attributions and learning to refuse

    Maojia Song, Shang Hong Sim, Rishabh Bhardwaj, Hai Leong Chieu, Navonil Majumder, and Soujanya Poria. Measuring and enhancing trustworthiness of LLM s in RAG through grounded attributions and learning to refuse. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=Iyrtb9EJBp

  57. [57]

    Accessed: 2025-03-20

    Stripe. Stripe Agent Toolkit. https://github.com/stripe/agent-toolkit, 2025. "Accessed: 2025-03-20"

  58. [58]

    Divide-then-align: Honest alignment based on the knowledge boundary of RAG

    Xin Sun, Jianan Xie, Zhongqi Chen, Qiang Liu, Shu Wu, Yuehe Chen, Bowen Song, Zilei Wang, Weiqiang Wang, and Liang Wang. Divide-then-align: Honest alignment based on the knowledge boundary of RAG . In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computatio...

  59. [59]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  60. [60]

    Fine-tuning language models for factuality

    Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D Manning, and Chelsea Finn. Fine-tuning language models for factuality. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=WPZ2yPag4K

  61. [61]

    arXiv preprint arXiv:2310.16944 , year=

    Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Cl \'e mentine Fourrier, Nathan Habib, et al. Zephyr: Direct distillation of lm alignment. arXiv preprint arXiv:2310.16944, 2023

  62. [62]

    PA - RAG : RAG alignment via multi-perspective preference optimization

    Jiayi Wu, Hengyi Cai, Lingyong Yan, Hao Sun, Xiang Li, Shuaiqiang Wang, Dawei Yin, and Ming Gao. PA - RAG : RAG alignment via multi-perspective preference optimization. In Luis Chiruzzo, Alan Ritter, and Lu Wang (eds.), Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Languag...

  63. [63]

    Self-play preference optimization for language model alignment

    Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, and Quanquan Gu. Self-play preference optimization for language model alignment. In Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning, 2024

  64. [64]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  65. [65]

    How language model hallucinations can snowball

    Muru Zhang, Ofir Press, William Merrill, Alisa Liu, and Noah A Smith. How language model hallucinations can snowball. arXiv preprint arXiv:2305.13534, 2023

  66. [66]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36: 0 46595--46623, 2023

  67. [67]

    arXiv preprint arXiv:2404.18922 , year=

    Han Zhong, Zikang Shan, Guhao Feng, Wei Xiong, Xinle Cheng, Li Zhao, Di He, Jiang Bian, and Liwei Wang. Dpo meets ppo: Reinforced token optimization for rlhf. arXiv preprint arXiv:2404.18922, 2024

  68. [68]

    DPO meets PPO : Reinforced token optimization for RLHF

    Han Zhong, Zikang Shan, Guhao Feng, Wei Xiong, Xinle Cheng, Li Zhao, Di He, Jiang Bian, and Liwei Wang. DPO meets PPO : Reinforced token optimization for RLHF . In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=IfWKVF6LfY

  69. [69]

    Lima: Less is more for alignment

    Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36: 0 55006--55021, 2023

  70. [70]

    Accessed: 2025-07-01

    Mingye Zhu. OPAD. https://github.com/stevie1023/OPAD, 2025. "Accessed: 2025-07-01"

  71. [71]

    On-the-fly preference alignment via principle-guided decoding

    Mingye Zhu, Yi Liu, Lei Zhang, Junbo Guo, and Zhendong Mao. On-the-fly preference alignment via principle-guided decoding. In The Thirteenth International Conference on Learning Representations, 2025

  72. [72]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023