
arxiv: 2502.06861 · v1 · pith:7XOBAEY6new · submitted 2025-02-08 · 💻 cs.LG · cs.AI

Design Considerations in Offline Preference-based RL

classification 💻 cs.LG cs.AI
keywords policy, algorithms, choices, design, methods, offline, responses, some

Offline algorithms for Reinforcement Learning from Human Preferences (RLHF), which use only a fixed dataset of sampled responses given an input, and preference feedback among these responses, have gained increasing prominence in the literature on aligning language models. In this paper, we study how the different design choices made in methods such as DPO, IPO, SLiC and many variants influence the quality of the learned policy, from a theoretical perspective. Our treatment yields insights into the choices of loss function, the policy which is used to normalize log-likelihoods, and also the role of the data sampling policy. Notably, our results do not rely on the standard reparameterization-style arguments used to motivate some of the algorithms in this family, which allows us to give a unified treatment to a broad class of methods. We also conduct a small empirical study to verify some of the theoretical findings on a standard summarization benchmark.
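To make the design choices named in the abstract concrete, the sketch below writes the commonly cited pairwise losses for DPO, IPO and SLiC as functions of a single log-likelihood margin between the preferred and the rejected response. It is an illustration based on the usual published forms of these losses, not the paper's own formulation: the function names, the default values of `beta`, `tau` and `delta`, and the convention that reference log-probabilities of 0.0 mean "no normalizing policy" are all assumptions made for this example.

```python
import math


def margin(logp_w, logp_l, ref_logp_w=0.0, ref_logp_l=0.0):
    """Log-likelihood margin of the preferred (w) over the rejected (l) response.

    The reference log-probabilities play the role of the normalizing policy.
    Leaving them at 0.0 (an assumption of this sketch) gives the
    un-normalized margin log pi(y_w) - log pi(y_l).
    """
    return (logp_w - ref_logp_w) - (logp_l - ref_logp_l)


def dpo_loss(m, beta=0.1):
    """Logistic loss -log(sigmoid(beta * m)), computed as a stable softplus."""
    z = -beta * m
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def ipo_loss(m, tau=0.1):
    """Squared loss pulling the margin toward the target 1 / (2 * tau)."""
    return (m - 1.0 / (2.0 * tau)) ** 2


def slic_loss(m, delta=1.0):
    """Hinge loss: zero once the margin exceeds delta."""
    return max(0.0, delta - m)


if __name__ == "__main__":
    # Hypothetical sequence log-probabilities for one preference pair.
    m = margin(logp_w=-1.0, logp_l=-2.0, ref_logp_w=-1.5, ref_logp_l=-1.5)
    print("margin:", m)
    print("DPO :", dpo_loss(m))
    print("IPO :", ipo_loss(m))
    print("SLiC:", slic_loss(m))
```

All three losses take the same margin as input and differ only in how they penalize it. The choice of normalizing policy enters solely through `ref_logp_w` and `ref_logp_l`.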



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution

    cs.LG 2026-02 unverdicted novelty 6.0

    PEPO uses pessimistic ensembling of DPO policies on data subsets to achieve single-policy concentrability sample bounds and avoid over-optimization in tabular settings.

  2. Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution

    cs.LG 2026-02 unverdicted novelty 5.0

    PEPO is a single-step pessimistic ensemble algorithm for direct preference optimization that provably avoids over-optimization by depending only on single-policy concentrability without knowing the data distribution o...