Assael, Diederik M
3 Pith papers cite this work. Polarity classification is still indexing.
abstract
We propose Deep Optimistic Linear Support Learning (DOL) to solve high-dimensional multi-objective decision problems where the relative importances of the objectives are not known a priori. Using features from the high-dimensional inputs, DOL computes the convex coverage set containing all potential optimal solutions of the convex combinations of the objectives. To our knowledge, this is the first time that deep reinforcement learning has succeeded in learning multi-objective policies. In addition, we provide a testbed with two experiments to be used as a benchmark for deep multi-objective reinforcement learning.
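The convex coverage set mentioned in the abstract can be illustrated with a small sketch. This is not DOL or Optimistic Linear Support: it is a brute-force approximation, assuming exactly two objectives and a fixed uniform grid of weights, that keeps every candidate value vector that is best for at least one sampled convex combination. The function name `approximate_ccs` and the candidate values are hypothetical and chosen only for illustration.

```python
import numpy as np

def approximate_ccs(values, num_weights=101):
    """Approximate the convex coverage set for two objectives.

    Returns the indices of value vectors that maximize the scalarized
    value w . v for at least one weight w = (w1, 1 - w1) on a uniform
    grid. A finite grid can miss vectors that are optimal only on a
    very narrow range of weights.
    """
    values = np.asarray(values, dtype=float)      # (num_policies, 2)
    w1 = np.linspace(0.0, 1.0, num_weights)
    weights = np.stack([w1, 1.0 - w1], axis=1)    # (num_weights, 2)
    scalarized = weights @ values.T               # (num_weights, num_policies)
    best = scalarized.max(axis=1, keepdims=True)  # best value per weight
    is_best = np.isclose(scalarized, best)
    return sorted(np.flatnonzero(is_best.any(axis=0)).tolist())

# Hypothetical value vectors of four policies (objective 1, objective 2).
candidates = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.6), (0.7, 0.3)]
ccs_indices = approximate_ccs(candidates)
print(ccs_indices)
```

In this example the last vector, (0.7, 0.3), is not dominated by any other candidate, yet no convex combination of the objectives makes it the best choice, so it is excluded from the result. That is the distinction between a Pareto front and a convex coverage set.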
citation-role and citation-polarity summary
years: 2026 (3)
verdicts: UNVERDICTED (3)
roles: background (1)
polarities: background (1)
representative citing papers
A single preference-conditioned policy achieves unique and Lipschitz-continuous Pareto coverage in multi-objective MDPs via a new mirror-descent policy iteration algorithm with O(1/k) convergence.
PRL-PUTS casts utility-weight tuning as a one-step value-based RL task and uses scalarization-parameter Pareto sweeping at inference time to generate and govern a family of policies, reporting +0.13% lift in successful sessions on Pinterest Homefeed.
citing papers explorer
- A Reward-Free Viewpoint on Multi-Objective Reinforcement Learning
  Adapting RFRL objectives as auxiliary tasks with preference-guided exploration outperforms prior MORL methods in performance and data efficiency on MO-Gymnasium tasks.
- A Single Deep Preference-Conditioned Policy for Learning Pareto Coverage Sets
  A single preference-conditioned policy achieves unique and Lipschitz-continuous Pareto coverage in multi-objective MDPs via a new mirror-descent policy iteration algorithm with O(1/k) convergence.
- A Production-Ready RL Framework for Personalized Utility Tuning with Pareto Sweeping in Pinterest Recommender Systems
  PRL-PUTS casts utility-weight tuning as a one-step value-based RL task and uses scalarization-parameter Pareto sweeping at inference time to generate and govern a family of policies, reporting +0.13% lift in successful sessions on Pinterest Homefeed.