PEPO is a single-step pessimistic ensemble algorithm for direct preference optimization that provably avoids over-optimization by depending only on single-policy concentrability without knowing the data distribution or learning an explicit reward model.
Kashif Rasul, Edward Beeching, Lewis Tunstall, Leandro von Werra, and Omar Sanseviero
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.LG 3
verdicts
UNVERDICTED 3
representative citing papers
Derives SimPO from MaxEnt RL and reports that MaxEnt RL in online RLHF exhibits frequent overoptimization and unstable KL dynamics across scales, unlike stable KL-constrained baselines.
RMiPO improves offline preference optimization by using intrinsic response-level mutual information to modulate hyperparameters, delivering superior performance with over 15% less training overhead.
citing papers explorer
- Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution
PEPO is a single-step pessimistic ensemble algorithm for direct preference optimization that provably avoids over-optimization by depending only on single-policy concentrability without knowing the data distribution or learning an explicit reward model.
- Failure Modes of Maximum Entropy RLHF
Derives SimPO from MaxEnt RL and reports that MaxEnt RL in online RLHF exhibits frequent overoptimization and unstable KL dynamics across scales, unlike stable KL-constrained baselines.
- Intrinsic Mutual Information as a Modulator for Preference Optimization
RMiPO improves offline preference optimization by using intrinsic response-level mutual information to modulate hyperparameters, delivering superior performance with over 15% less training overhead.