Issues with Value-Based Multi-objective Reinforcement Learning: Value Function Interference and Overestimation Sensitivity
Pith reviewed 2026-05-24 04:12 UTC · model grok-4.3
The pith
Value-based MORL with non-linear utilities suffers from value function interference and overestimation sensitivity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When a non-linear utility function is used to scalarise vector-valued value functions in multiobjective Q-learning, two problems appear. Value function interference arises because changes to one objective's value estimate affect action selection and updates for the other objectives, and the algorithm becomes more sensitive to overestimation of Q-values.
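One way to picture this kind of interference is the sketch below. It is not taken from the paper: the two outcome vectors, the `min` utility, the learning rate, and the strict alternation of the greedy successor action are all assumptions chosen for illustration. The idea is that when two successor actions have different vector values but equal utility, the bootstrapped estimate for the preceding state can drift towards a blend of both vectors.

```python
def utility(vec):
    # Assumed non-linear utility (not the paper's): the worst-performing objective.
    return min(vec)


# Hypothetical vector returns of two successor actions with equal utility.
OUTCOME_X = (10.0, 2.0)
OUTCOME_Y = (2.0, 10.0)


def blended_estimate(steps=200, alpha=0.1):
    """Move one vector estimate towards alternating bootstrap targets."""
    q = (0.0, 0.0)
    for t in range(steps):
        # Assumption: estimation noise makes the greedy successor flip every step.
        target = OUTCOME_X if t % 2 == 0 else OUTCOME_Y
        q = tuple(old + alpha * (new - old) for old, new in zip(q, target))
    return q


if __name__ == "__main__":
    q = blended_estimate()
    print("blended estimate:", q)
    print("utility of blend:", utility(q))
    print("utility of each real outcome:", utility(OUTCOME_X), utility(OUTCOME_Y))
```

In this toy setting the blended vector has a higher utility than either outcome that can actually be obtained. A linear utility would not show this distortion, because a weighted sum of an averaged vector equals the average of the weighted sums.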
What carries the argument
Multiobjective Q-learning with non-linear scalarisation, where the value function is vector-valued and action selection uses a non-linear utility operator.
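A minimal sketch of such an update is given here, under stated assumptions. The `min` utility, the two-objective setting, and the names `utility`, `greedy_action`, `q_update` and `make_q` are hypothetical and are not drawn from the paper, which (per the referee comments) does not pin down its utility in the material summarised on this page. The sketch shows only the two ingredients named above: a vector-valued Q-table, and greedy selection through a non-linear utility.

```python
from collections import defaultdict

N_OBJECTIVES = 2  # assumption: two objectives


def utility(vec):
    # Hypothetical non-linear utility (not the paper's): worst-case objective.
    return min(vec)


def make_q():
    # Tabular Q-function: (state, action) -> tuple of per-objective values.
    return defaultdict(lambda: (0.0,) * N_OBJECTIVES)


def greedy_action(q, state, actions):
    # Action selection applies the utility to each action's value vector.
    return max(actions, key=lambda a: utility(q[(state, a)]))


def q_update(q, state, action, reward_vec, next_state, actions,
             alpha=0.1, gamma=1.0, terminal=False):
    # Bootstrap from the whole vector of the utility-greedy next action,
    # so every objective's target depends on a choice driven by all objectives.
    if terminal:
        bootstrap = (0.0,) * N_OBJECTIVES
    else:
        bootstrap = q[(next_state, greedy_action(q, next_state, actions))]
    old = q[(state, action)]
    q[(state, action)] = tuple(
        o + alpha * (r + gamma * b - o)
        for o, r, b in zip(old, reward_vec, bootstrap)
    )


if __name__ == "__main__":
    q = make_q()
    q[("s1", "a")] = (3.0, 1.0)
    q[("s1", "b")] = (2.0, 2.0)
    q_update(q, "s0", "a", (1.0, 0.0), "s1", ["a", "b"], alpha=0.5)
    print(q[("s0", "a")])
```

Because the bootstrap vector is chosen by the utility of the whole vector, a change in one objective's estimate at the next state can change which vector every objective bootstraps from.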
If this is right
- Value-based MORL algorithms that use non-linear utilities can underperform because of these two effects.
- Tabular implementations already show degraded performance in simple environments.
- These issues may require modifications to standard Q-learning updates or different scalarisation approaches.
- Linear utility functions avoid these problems but limit the expressiveness of preferences.
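The contrast between linear and non-linear utilities in the list above can be illustrated with a small example. Everything in it is an assumption made for illustration: the thresholded utility, the threshold of 1.0, the equal linear weights, and the two Q-tables are hypothetical and do not come from the paper. It shows a single overestimated component flipping the greedy action under the thresholded utility while leaving the linear choice unchanged for this particular table.

```python
THRESHOLD = 1.0  # hypothetical constraint level on objective 0


def nonlinear_utility(vec):
    # Assumed thresholded utility: objective 1 only counts
    # once objective 0 reaches the threshold.
    return vec[1] if vec[0] >= THRESHOLD else float("-inf")


def linear_utility(vec, weights=(0.5, 0.5)):
    # Weighted sum with assumed equal weights.
    return sum(w * v for w, v in zip(weights, vec))


def greedy(q_values, utility):
    # Pick the action whose value vector has the highest utility.
    return max(q_values, key=lambda action: utility(q_values[action]))


# Hypothetical true values for two actions in one state.
TRUE_Q = {"a": (0.75, 10.0), "b": (1.0, 5.0)}
# Same table, but objective 0 of action "a" is overestimated by 0.5.
NOISY_Q = {"a": (1.25, 10.0), "b": (1.0, 5.0)}

if __name__ == "__main__":
    for name, table in (("true", TRUE_Q), ("noisy", NOISY_Q)):
        print(name,
              "non-linear:", greedy(table, nonlinear_utility),
              "linear:", greedy(table, linear_utility))
```

Under these assumptions the overestimate makes an action that violates the threshold look admissible, so the non-linear agent switches to it; the weighted sum moves only slightly and the linear choice stays the same.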
Where Pith is reading between the lines
- Similar issues likely appear in deep MORL with neural network function approximation, potentially worsening the problems.
- Alternative MORL methods like policy-based approaches might sidestep these value-based issues.
- Developing interference-resistant value function representations could mitigate the identified problems.
- Testing on environments with more objectives would reveal if the issues scale.
Load-bearing premise
The issues observed in simple tabular multi-objective MDPs with basic multiobjective Q-learning are representative of problems in more complex environments or with function approximation.
What would settle it
Demonstrating that, in a complex multi-objective environment with deep function approximation, MORL with a non-linear utility shows no more interference or overestimation sensitivity than the linear case.
Original abstract
Multi-objective reinforcement learning (MORL) algorithms extend conventional reinforcement learning (RL) to the more general case of problems with multiple, conflicting objectives, represented by vector-valued rewards. Widely-used scalar RL methods such as Q-learning can be modified to handle multiple objectives by (1) learning vector-valued value functions, and (2) performing action selection using a scalarisation or ordering operator which reflects the user's preferences with respect to the different objectives. This paper investigates two previously unreported issues which can hinder the performance of value-based MORL algorithms when applied in conjunction with a non-linear utility function -- value function interference, and sensitivity to overestimation. We illustrate the nature of these phenomena on simple multi-objective MDPs using a tabular implementation of multiobjective Q-learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that value-based MORL algorithms using non-linear utility functions suffer from two previously unreported issues—value function interference and sensitivity to overestimation—and illustrates the nature of these phenomena via a tabular multiobjective Q-learning implementation on simple multi-objective MDPs.
Significance. If the illustrated phenomena are as described, the work usefully flags concrete limitations when extending scalar value-based methods to MORL with non-linear scalarisation. The choice of simple tabular settings is a strength, as it permits direct observation of the claimed interference and overestimation effects without confounding factors from function approximation or complex environments. This provides a clear starting point for subsequent algorithmic work.
Minor comments (2)
- The experimental setup, as summarised in the abstract, refers to 'simple multi-objective MDPs' and 'a tabular implementation of multiobjective Q-learning' but does not specify the exact MDP transition and reward structures or the precise form of the non-linear utility; adding these details would improve reproducibility while remaining within the paper's illustrative scope.
- The abstract states that the issues 'can hinder the performance' but provides no quantitative metrics (e.g., regret, success rate, or value error) comparing linear and non-linear scalarisation; a short table of such metrics on the example MDPs would strengthen the illustration without altering the central claim.
Simulated Author's Rebuttal
We thank the referee for their careful reading and positive assessment of the work. The recommendation for minor revision is noted, though no specific major comments were provided in the report.
Circularity Check
No significant circularity
The paper presents an empirical investigation of value function interference and overestimation sensitivity in value-based MORL with non-linear scalarisation. It uses only simple tabular multi-objective MDPs and a basic multiobjective Q-learning implementation to illustrate the phenomena, without any derivation chain, fitted parameters renamed as predictions, uniqueness theorems, or load-bearing self-citations. The central claim is limited to demonstrating the issues in these controlled settings, which the experiments directly support without reduction to inputs by construction.