arxiv: 2605.11217 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.AI· cs.CR

Recognition: 2 theorem links

· Lean Theorem

Leveraging RAG for Training-Free Alignment of LLMs

John T. Halloran

Authors on Pith no claims yet

Pith reviewed 2026-05-13 02:36 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CR

keywords RAGLLM alignmentpreference pairsagentic attackstraining-free alignmentsafety guardrailscontrastive informationinference-time alignment

0 comments

The pith

RAG conditioning on preference pairs during inference triples LLM refusals to agentic attacks when added to offline alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes RAG-Pref, a training-free method that retrieves preferred and dispreferred samples and conditions the language model on them at generation time. This supplies contrastive signals to strengthen refusal guardrails for agentic attacks. When combined with standard offline alignment training, the approach yields more than a 3.7-fold average improvement in refusals across five common LLMs. It also raises performance on general human-preference tasks while keeping added computational cost low, in contrast to other online alignment techniques.

Core claim

RAG-Pref is an online alignment algorithm that uses retrieval augmented generation to condition LLMs on preferred and dispreferred samples, thereby leveraging contrastive information at inference. When combined with offline alignment algorithms, it enables more than an average 3.7 factor improvement in agentic attack refusals across five widely used LLMs, compared to 2.9 for other online alignment algorithms and 1.5 for offline alignment alone. In contrast to other online methods, it similarly increases performance on general human-preference alignment tasks and does not drastically increase overall computational requirements.

What carries the argument

RAG-Pref, a retrieval-augmented method that pulls preferred and dispreferred preference pairs and conditions the model's output on them at inference time to deliver contrastive alignment information.

If this is right

Agentic attack refusal rates rise by over 3.7 times on average across five LLMs when RAG-Pref supplements offline alignment.
General human-preference alignment performance improves at a level comparable to the gains on attack refusals.
Overall computational demands stay close to standard inference without large added overhead.
The method integrates with off-the-shelf packages and applies across multiple widely used LLMs.
Other online alignment techniques deliver smaller refusal gains and lack the same benefit to general preference tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Updating the retrieval database alone could let models adapt quickly to new attack patterns without retraining.
The same inference-time contrastive conditioning might be tested on tasks such as reducing hallucinations or improving instruction following.
A shared preference database could allow alignment adjustments across different models without individual retraining runs.
The approach invites checks on how sample quality and coverage in the retrieval set affect long-term robustness.

Load-bearing premise

Retrieving and conditioning on preferred and dispreferred samples during inference supplies contrastive information that generalizes to new agentic attacks without degrading other capabilities or creating new vulnerabilities.

What would settle it

A controlled test on held-out agentic attacks where the combination of RAG-Pref and offline alignment shows no improvement or a reduction in refusal rates relative to offline alignment alone.

Figures

Figures reproduced from arXiv: 2605.11217 by John T. Halloran.

**Figure 1.** Figure 1: Offline-aligned Llama-3.2-1B with following DPO losses: 1) No DPO - base model (no refusal alignment), (2) DPO - the original “sigmoid” DPO loss function (Rafailov et al., 2023), (3) AOT - Alignment via Optimal Transport (Melnyk et al., 2024), (4) APOd - Anchored Preference Optimization (APO) down (D’Oosterlinck et al., 2025), (5) APOz - APO zero (D’Oosterlinck et al., 2025), (6) BCO - Binary Classifier Op… view at source ↗

**Figure 2.** Figure 2: DeepSeek-R1-Distill-Qwen-14B aligned with DPO for 90 Epochs. Training quickly converges. F Standard RAG vs RAG-Pref FBA Refusal Rates Llama-3.2 1B Gemma-2-2B Llama-3.1 8B DSR1D-8B* DSR1D-14B* 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.8 1.0 Refusal Rate Base DPO SafeDPO Vanilla RAG, Base Vanilla RAG, DPO Vanilla RAG, SafeDPO RAG-Pref, Base RAG-Pref, DPO RAG-Pref, SafeDPO [PITH_FULL_IMAGE:figures/full_fig_p017_2.png] view at source ↗

**Figure 3.** Figure 3: Attack Refusal Rates for model using Standard RAG and RAG-Pref: Refusal rates calculated over the test FBAs. Reasoning models are denoted using ∗. Base denotes models evaluated directly from their public checkpoints. G FBA Refusal Judge Details FBA refusals were assessed using the following two-stage judging: 1. Assess response using a BERT-based classifier trained explicitly on rejection/refusal data (Pro… view at source ↗

**Figure 4.** Figure 4: Response example for offline/online online FBA refusal guardrails. Responses in green show direct compliance. [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗

read the original abstract

Large language model (LLM) alignment algorithms typically consist of post-training over preference pairs. While such algorithms are widely used to enable safety guardrails and align LLMs with general human preferences, we show that state-of-the-art alignment algorithms require significant computational resources while being far less capable of enabling refusal guardrails for recent agentic attacks. Thus, to improve refusal guardrails against such attacks without drastically increasing computational overhead, we introduce Retrieval Augmented Generation for Pref erence alignment (RAG-Pref), a simple RAG-based alignment algorithm which conditions on preferred and dispreferred samples to leverage contrastive information during inference. RAG-Pref is online (training-free), compatible with off-the-shelf packages, and, when combined with offline (training-based) alignment algorithms, enables more than an average 3.7 factor improvement in agentic attack refusals across five widely used LLMs, compared to 2.9 for other online alignment algorithms and 1.5 for offline alignment alone. We conclude by showing that, in stark contrast to other online alignment methods, RAG-Pref similarly increases performance on general human-preference alignment tasks and does not drastically increase overall computational requirements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RAG-Pref applies retrieval of preference pairs at inference to boost refusal rates on agentic attacks without training, but the big claimed gains rest on unshown experimental details.

read the letter

The main point is that this paper introduces RAG-Pref, which retrieves preferred and dispreferred examples to condition the LLM contrastively during generation. When stacked on top of standard offline alignment, it reports an average 3.7x lift in refusals for agentic attacks across five models, compared with 2.9x for other online methods and 1.5x for offline alone. It also claims no big hit to general preference tasks and low extra compute.

Referee Report

3 major / 2 minor

Summary. The paper introduces RAG-Pref, a training-free alignment method that uses retrieval-augmented generation to condition LLMs on preferred and dispreferred samples at inference time. It claims this approach improves refusal guardrails against agentic attacks, yielding an average 3.7x improvement when combined with offline alignment methods across five LLMs (compared to 2.9x for other online methods and 1.5x for offline alone), while also enhancing performance on general human-preference tasks without substantially increasing computational costs.

Significance. If the empirical claims hold under rigorous validation, the work could offer a practical, low-overhead complement to training-based alignment by enabling effective inference-time contrastive conditioning. The training-free nature and compatibility with off-the-shelf packages are clear strengths that could facilitate broader adoption for safety enhancements.

major comments (3)

[Abstract] Abstract: The headline quantitative claim of a 3.7 factor improvement in agentic attack refusals is presented without any reference to the specific benchmarks, number of trials, statistical significance tests, or controls for confounding factors such as prompt length or retrieval noise, which are load-bearing for assessing whether the data supports the central claim.
[Method] Method section: The RAG-Pref algorithm description provides no details on retrieval corpus construction, the embedding model, top-k policy, or retrieval precision/recall metrics. This is critical because the generalization to novel agentic attacks (the weakest assumption) depends on whether retrieved pairs supply usable contrastive signal rather than surface-level augmentation.
[Experiments] Experiments: No ablations are described that isolate the contribution of the preferred/dispreferred contrast from generic context augmentation, nor are there controls to verify that the LLM utilizes the contrastive information for refusal rather than increasing false positives or introducing new vulnerabilities via the retrieval step.

minor comments (2)

[Abstract] Abstract: Typo in 'Pref erence' (extra space) in the expanded acronym for RAG-Pref.
[Conclusion] The claim that RAG-Pref 'does not drastically increase overall computational requirements' should be backed by concrete measurements (e.g., additional latency or token counts) rather than qualitative statements.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important areas for improving clarity and rigor. We address each major comment point-by-point below. We agree that expanding details and adding ablations will strengthen the manuscript and will incorporate the suggested revisions in the next version.

read point-by-point responses

Referee: [Abstract] Abstract: The headline quantitative claim of a 3.7 factor improvement in agentic attack refusals is presented without any reference to the specific benchmarks, number of trials, statistical significance tests, or controls for confounding factors such as prompt length or retrieval noise, which are load-bearing for assessing whether the data supports the central claim.

Authors: We agree that the abstract would benefit from additional context to support the central claim. Due to length constraints, we will revise the abstract to briefly reference the specific agentic attack benchmarks and direct readers to the Experiments section, where we detail the number of trials, statistical significance tests performed, and controls for factors such as prompt length and retrieval noise. revision: yes
Referee: [Method] Method section: The RAG-Pref algorithm description provides no details on retrieval corpus construction, the embedding model, top-k policy, or retrieval precision/recall metrics. This is critical because the generalization to novel agentic attacks (the weakest assumption) depends on whether retrieved pairs supply usable contrastive signal rather than surface-level augmentation.

Authors: We acknowledge that these implementation details are essential for reproducibility and for validating the contrastive signal. In the revised manuscript, we will expand the Method section to specify the retrieval corpus construction (drawn from established preference datasets), the embedding model employed, the top-k retrieval policy, and quantitative retrieval precision/recall metrics on held-out data to demonstrate that the retrieved pairs provide meaningful contrast rather than superficial augmentation. revision: yes
Referee: [Experiments] Experiments: No ablations are described that isolate the contribution of the preferred/dispreferred contrast from generic context augmentation, nor are there controls to verify that the LLM utilizes the contrastive information for refusal rather than increasing false positives or introducing new vulnerabilities via the retrieval step.

Authors: We recognize the importance of these ablations and controls for isolating the effect of contrastive conditioning. We will add a dedicated ablation subsection in the Experiments to compare RAG-Pref against generic RAG augmentation without preference contrast. We will also include controls measuring false-positive rates on benign queries and analyze potential new vulnerabilities introduced by retrieval, with quantitative results to confirm that the LLM leverages the contrastive information for improved refusals. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with external benchmarks

full rationale

The paper introduces RAG-Pref as a training-free inference-time method that retrieves and conditions on preference pairs. All reported gains (3.7x refusal improvement, comparisons to 2.9x and 1.5x baselines) are presented as direct empirical measurements on external attack and preference datasets. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work appear in the provided text. The derivation chain is therefore a straightforward algorithmic description plus benchmark evaluation and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review is based solely on the abstract; the central assumption is that RAG retrieval can supply useful contrastive preference information at inference without retraining.

axioms (1)

domain assumption LLMs can leverage retrieved context containing preferred and dispreferred examples to improve refusal behavior on unseen inputs
Implicit in the description of RAG-Pref conditioning during inference.

pith-pipeline@v0.9.0 · 5502 in / 1290 out tokens · 60526 ms · 2026-05-13T02:36:36.086906+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

RAG-Pref ... conditions on preferred and dispreferred samples to leverage contrastive information during inference
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 3.1. ΔH_RAG-Pref ≥ ΔH_RAG ... contrastive information

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 13 internal anchors

[1]

Rag llms are not safer: A safety analysis of retrieval-augmented generation for large language models

Bang An, Shiyue Zhang, and Mark Dredze. Rag llms are not safer: A safety analysis of retrieval-augmented generation for large language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp.\ 5444--5474, 2025

work page 2025
[2]

System card: Claude opus 4.5

Anthropic . System card: Claude opus 4.5. Technical report, Anthropic, November 2025. URL https://www-cdn.anthropic.com/bf10f64990cfda0ba858290be7b8cc6317685f47.pdf. Version dated November 24, 2025

work page 2025
[3]

Donating the model context protocol and establishing the agentic ai foundation

Anthropic. Donating the model context protocol and establishing the agentic ai foundation. https://www.anthropic.com/news/donating-the-model-context-protocol-and-establishing-of-the-agentic-ai-foundation, December 2025 a . Accessed: 2026-01-26

work page 2025
[4]

Accessed: 2025-02-12

Anthropic. Introducing the Model Context Protocol. https://www.anthropic.com/news/model-context-protocol, 2025 b . "Accessed: 2025-02-12"

work page 2025
[5]

Accessed: 2025-05-09

Anthropic. Slack MCP Server. https://github.com/modelcontextprotocol/servers/tree/main/src/slack, 2025 c . "Accessed: 2025-05-09"

work page 2025
[6]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[7]

Don't do rag: When cache-augmented generation is all you need for knowledge tasks

Brian J Chan, Chao-Ting Chen, Jui-Hung Cheng, and Hen-Hsen Huang. Don't do rag: When cache-augmented generation is all you need for knowledge tasks. In Companion Proceedings of the ACM on Web Conference 2025, pp.\ 893--897, 2025

work page 2025
[8]

Jailbreakbench: An open robustness benchmark for jailbreaking large language models

Patrick Chao, Edoardo Debenedetti, et al. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. Advances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[9]

Jailbreaking black box large language models in twenty queries

Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pp.\ 23--42. IEEE, 2025

work page 2025
[10]

Noise contrastive alignment of language models with explicit rewards

Huayu Chen, Guande He, Lifan Yuan, Ganqu Cui, Hang Su, and Jun Zhu. Noise contrastive alignment of language models with explicit rewards. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024
[11]

Provably robust dpo: Aligning language models with noisy feedback

Sayak Ray Chowdhury, Anush Kini, and Nagarajan Natarajan. Provably robust dpo: Aligning language models with noisy feedback. In Forty-first International Conference on Machine Learning, 2024

work page 2024
[12]

Elements of information theory

Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999

work page 1999
[13]

Ultrafeedback: Boosting language models with high-quality feedback

Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback. 2023

work page 2023
[14]

CVE List V5 : CVE cache of the official CVE list in CVE JSON 5 format

CVE Project . CVE List V5 : CVE cache of the official CVE list in CVE JSON 5 format. https://github.com/CVEProject/cvelistV5, 2023. Accessed: 2025-10-30

work page 2023
[15]

Safe RLHF : Safe reinforcement learning from human feedback

Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe RLHF : Safe reinforcement learning from human feedback. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=TyFrPOKYXw

work page 2024
[16]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Qlora: Efficient finetuning of quantized llms

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems, 36: 0 10088--10115, 2023

work page 2023
[18]

Anchored preference optimization and contrastive revisions: Addressing underspecification in alignment

Karel D'Oosterlinck, Winnie Xu, Chris Develder, Thomas Demeester, Amanpreet Singh, Christopher Potts, Douwe Kiela, and Shikib Mehri. Anchored preference optimization and contrastive revisions: Addressing underspecification in alignment. Transactions of the Association for Computational Linguistics, 13: 0 442--460, 2025

work page 2025
[19]

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Yann Dubois, Bal \'a zs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph rag approach to query-focused summarization, 2025. URL https://arxiv.org/abs/2404.16130

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Linear alignment: A closed-form solution for aligning human preferences without tuning and feedback

Songyang Gao, Qiming Ge, Wei Shen, Shihan Dou, Junjie Ye, Xiao Wang, Rui Zheng, Yicheng Zou, Zhi Chen, Hang Yan, et al. Linear alignment: A closed-form solution for aligning human preferences without tuning and feedback. In International Conference on Machine Learning, pp.\ 14702--14722. PMLR, 2024

work page 2024
[22]

Gemma 2: Improving Open Language Models at a Practical Size

Team Gemma, Morgane Riviere, Shreya Pathak, et al. Gemma 2: Improving open language models at a practical size, 2024. URL https://arxiv.org/abs/2408.00118

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Accessed: 2025-05-09

Google. MCP Toolbox for Databases: Simplify AI Agent Access to Enterprise Data. https://cloud.google.com/blog/products/ai-machine-learning/mcp-toolbox-for-databases-now-supports-model-context-protocol, 2025. "Accessed: 2025-05-09"

work page 2025
[24]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM workshop on artificial intelligence and security, pp.\ 79--90, 2023

work page 2023
[26]

Accessed: 2025-05-05

John Halloran. MCPSafetyScanner - Automated MCP safety auditing and remediation using Agents. https://github.com/johnhalloran321/mcpSafetyScanner, 2025. "Accessed: 2025-05-05"

work page 2025
[27]

arXiv preprint arXiv:2203.09509 , year=

Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv:2203.09509, 2022

work page arXiv 2022
[28]

Accessed: 2025-10-13

HuggingFace. UltraFeedback Binarized. https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized, 2025. "Accessed: 2025-10-13"

work page 2025
[29]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Accessed: 2025-05-05

Invariant. Introducing Guardrails: The contextual security layer for the agentic era. https://invariantlabs.ai/blog/guardrails, 2025 a . "Accessed: 2025-05-05"

work page 2025
[31]

Accessed: 2025-05-05

Invariant. Introducing MCP-Scan: Protecting MCP with Invariant. https://invariantlabs.ai/blog/introducing-mcp-scan, 2025 b . "Accessed: 2025-05-05"

work page 2025
[32]

Towards efficient exact optimization of language model alignment.arXiv preprint arXiv:2402.00856, 2024

Haozhe Ji, Cheng Lu, Yilin Niu, Pei Ke, Hongning Wang, Jun Zhu, Jie Tang, and Minlie Huang. Towards efficient exact optimization of language model alignment. arXiv preprint arXiv:2402.00856, 2024

work page arXiv 2024
[33]

Beavertails: Towards improved safety alignment of llm via a human-preference dataset

Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of llm via a human-preference dataset. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp.\ 2...

work page 2023
[34]

RAG - R eward B ench: Benchmarking reward models in retrieval augmented generation for preference alignment

Zhuoran Jin, Hongbang Yuan, Tianyi Men, Pengfei Cao, Yubo Chen, Jiexin Xu, Huaijun Li, Xiaojian Jiang, Kang Liu, and Jun Zhao. RAG - R eward B ench: Benchmarking reward models in retrieval augmented generation for preference alignment. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Findings of the Association for Co...

work page doi:10.18653/v1/2025.findings-acl.877 2025
[35]

Binary classifier optimization for large language model alignment

Seungjae Jung, Gunsoo Han, Daniel Wontae Nam, and Kyoung-Woon On. Binary classifier optimization for large language model alignment. arXiv preprint arXiv:2404.04656, 2024

work page arXiv 2024
[36]

Safe DPO : A simple approach to direct preference optimization with enhanced safety

Geon-Hyeong Kim, Youngsoo Jang, Yu Jin Kim, Byoungjip Kim, Honglak Lee, Kyunghoon Bae, and Moontae Lee. Safe DPO : A simple approach to direct preference optimization with enhanced safety. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=PJdw4VBsXD

work page 2026
[37]

Rlaif vs

Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Ren Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, et al. Rlaif vs. rlhf: Scaling reinforcement learning from human feedback with ai feedback. In International Conference on Machine Learning, pp.\ 26874--26901. PMLR, 2024

work page 2024
[38]

The wmdp benchmark: Measuring and reducing malicious use with unlearning

Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D Li, Ann-Kathrin Dombrowski, Shashwat Goel, Gabriel Mukobi, et al. The wmdp benchmark: Measuring and reducing malicious use with unlearning. In International Conference on Machine Learning, pp.\ 28525--28550. PMLR, 2024

work page 2024
[39]

Statistical rejection sampling improves preference optimization

Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J Liu, and Jialu Liu. Statistical rejection sampling improves preference optimization. In The Twelfth International Conference on Learning Representations, 2024

work page 2024
[40]

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451, 2023 a

work page internal anchor Pith review Pith/arXiv arXiv 2023
[41]

Edward Suh, Yevgeniy Vorobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, and Chaowei Xiao

Xiaogeng Liu, Peiran Li, G. Edward Suh, Yevgeniy Vorobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, and Chaowei Xiao. Autodan-turbo: A lifelong agent for strategy self-exploration to jailbreak llms. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu (eds.), International Conference on Representation Learning, volume 2025, pp.\ 10313--10...

work page 2025
[42]

Prompt Injection attack against LLM-integrated Applications

Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, et al. Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499, 2023 b

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

Harmbench: A standardized evaluation framework for automated red teaming and robust refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. In International Conference on Machine Learning, pp.\ 35181--35224. PMLR, 2024

work page 2024
[44]

Tree of attacks: Jailbreaking black-box llms automatically

Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box llms automatically. Advances in Neural Information Processing Systems, 37: 0 61065--61105, 2024

work page 2024
[45]

Distributional preference alignment of llms via optimal transport

Igor Melnyk, Youssef Mroueh, Brian Belgodere, Mattia Rigotti, Apoorva Nitsure, Mikhail Yurochkin, Kristjan Greenewald, Jiri Navratil, and Jarret Ross. Distributional preference alignment of llms via optimal transport. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024
[46]

Simpo: Simple preference optimization with a reference-free reward

Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward. Advances in Neural Information Processing Systems, 37: 0 124198--124235, 2024

work page 2024
[47]

Accessed: 2025-03-20

Microsoft. Introducing Model Context Protocol (MCP) in Copilot Studio. https://tinyurl.com/CopilotMCP, 2025. "Accessed: 2025-03-20"

work page 2025
[48]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35: 0 27730--27744, 2022

work page 2022
[49]

Ignore Previous Prompt: Attack Techniques For Language Models

F \'a bio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[50]

Accessed: 2025-05-15

ProtectAI. Model Card for distilroberta-base-rejection-v1. https://huggingface.co/protectai/distilroberta-base-rejection-v1, 2025. "Accessed: 2025-05-15"

work page 2025
[51]

Mcp safety audit: Llms with the model context protocol allow major security exploits

Brandon Radosevich and John Halloran. Mcp safety audit: Llms with the model context protocol allow major security exploits. arXiv preprint arXiv:2504.03767, 2025

work page arXiv 2025
[52]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36: 0 53728--53741, 2023

work page 2023
[53]

Accessed: 2025-04-28

Philipp Schmid. How to use Anthropic MCP Server with open LLMs, OpenAI or Google Gemini. https://github.com/philschmid/mcp-openai-gemini-llama-example, 2025. "Accessed: 2025-04-28"

work page 2025
[54]

Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming

Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, et al. Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming. arXiv preprint arXiv:2501.18837, 2025

work page arXiv 2025
[55]

do anything now

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pp.\ 1671--1685, 2024

work page 2024
[56]

Measuring and enhancing trustworthiness of LLM s in RAG through grounded attributions and learning to refuse

Maojia Song, Shang Hong Sim, Rishabh Bhardwaj, Hai Leong Chieu, Navonil Majumder, and Soujanya Poria. Measuring and enhancing trustworthiness of LLM s in RAG through grounded attributions and learning to refuse. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=Iyrtb9EJBp

work page 2025
[57]

Accessed: 2025-03-20

Stripe. Stripe Agent Toolkit. https://github.com/stripe/agent-toolkit, 2025. "Accessed: 2025-03-20"

work page 2025
[58]

Divide-then-align: Honest alignment based on the knowledge boundary of RAG

Xin Sun, Jianan Xie, Zhongqi Chen, Qiang Liu, Shu Wu, Yuehe Chen, Bowen Song, Zilei Wang, Weiqiang Wang, and Liang Wang. Divide-then-align: Honest alignment based on the knowledge boundary of RAG . In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computatio...

work page doi:10.18653/v1/2025.acl-long.561 2025
[59]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[60]

Fine-tuning language models for factuality

Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D Manning, and Chelsea Finn. Fine-tuning language models for factuality. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=WPZ2yPag4K

work page 2024
[61]

arXiv preprint arXiv:2310.16944 , year=

Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Cl \'e mentine Fourrier, Nathan Habib, et al. Zephyr: Direct distillation of lm alignment. arXiv preprint arXiv:2310.16944, 2023

work page arXiv 2023
[62]

PA - RAG : RAG alignment via multi-perspective preference optimization

Jiayi Wu, Hengyi Cai, Lingyong Yan, Hao Sun, Xiang Li, Shuaiqiang Wang, Dawei Yin, and Ming Gao. PA - RAG : RAG alignment via multi-perspective preference optimization. In Luis Chiruzzo, Alan Ritter, and Lu Wang (eds.), Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Languag...

work page doi:10.18653/v1/2025.naacl-long.459 2025
[63]

Self-play preference optimization for language model alignment

Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, and Quanquan Gu. Self-play preference optimization for language model alignment. In Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning, 2024

work page 2024
[64]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[65]

How language model hallucinations can snowball

Muru Zhang, Ofir Press, William Merrill, Alisa Liu, and Noah A Smith. How language model hallucinations can snowball. arXiv preprint arXiv:2305.13534, 2023

work page arXiv 2023
[66]

Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36: 0 46595--46623, 2023

work page 2023
[67]

arXiv preprint arXiv:2404.18922 , year=

Han Zhong, Zikang Shan, Guhao Feng, Wei Xiong, Xinle Cheng, Li Zhao, Di He, Jiang Bian, and Liwei Wang. Dpo meets ppo: Reinforced token optimization for rlhf. arXiv preprint arXiv:2404.18922, 2024

work page arXiv 2024
[68]

DPO meets PPO : Reinforced token optimization for RLHF

Han Zhong, Zikang Shan, Guhao Feng, Wei Xiong, Xinle Cheng, Li Zhao, Di He, Jiang Bian, and Liwei Wang. DPO meets PPO : Reinforced token optimization for RLHF . In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=IfWKVF6LfY

work page 2025
[69]

Lima: Less is more for alignment

Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36: 0 55006--55021, 2023

work page 2023
[70]

Accessed: 2025-07-01

Mingye Zhu. OPAD. https://github.com/stevie1023/OPAD, 2025. "Accessed: 2025-07-01"

work page 2025
[71]

On-the-fly preference alignment via principle-guided decoding

Mingye Zhu, Yi Liu, Lei Zhang, Junbo Guo, and Zhendong Mao. On-the-fly preference alignment via principle-guided decoding. In The Thirteenth International Conference on Learning Representations, 2025

work page 2025
[72]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023