Few-Shot Preference Learning for Human-in-the-Loop RL, 2022
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.LG (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning
UBP2 uses ensembles of reward, dynamics, and value models to score trajectories on a unified objective of reward plus uncertainty, yielding sublinear regret bounds and higher sample efficiency on Meta-World than prior preference-based methods.