Langevin soft actor-critic: Efficient exploration through uncertainty-driven critic learning. arXiv preprint arXiv:2501.17827
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.LG (2)
Verdicts: UNVERDICTED (2)
Citing papers
- Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution
PEPO is a single-step pessimistic ensemble algorithm for direct preference optimization that provably avoids over-optimization by depending only on single-policy concentrability without knowing the data distribution or learning an explicit reward model.
- An Information-Theoretic Analysis of OOD Generalization in Meta-Reinforcement Learning
The work establishes OOD generalization bounds for meta-supervised learning and meta-RL that exploit MDP structure, then analyzes a gradient-based meta-RL algorithm.