arxiv: 2402.06266 · v2 · submitted 2024-02-09 · 💻 cs.LG

Issues with Value-Based Multi-objective Reinforcement Learning: Value Function Interference and Overestimation Sensitivity

Pith reviewed 2026-05-24 04:12 UTC · model grok-4.3

classification 💻 cs.LG
keywords multi-objective reinforcement learning · value function interference · overestimation · Q-learning · non-linear utility · scalarisation · multi-objective MDPs

The pith

Value-based MORL with non-linear utilities suffers from value function interference and overestimation sensitivity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that two issues arise in value-based multi-objective reinforcement learning when non-linear utility functions are used for action selection. Value function interference occurs as updates for one objective affect the values of others. Sensitivity to overestimation means errors in Q-value estimates are amplified. These are shown through experiments on simple multi-objective MDPs using tabular multiobjective Q-learning. A sympathetic reader would care because non-linear utilities are necessary for many preference structures in real-world multi-objective problems.

Core claim

When a non-linear utility function is used to scalarise vector-valued value functions in multiobjective Q-learning, value function interference arises because changes to one objective's value function impact the selection and updates for other objectives, and the algorithm becomes more sensitive to overestimation of Q-values.
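
One way to picture the overestimation half of this claim is with a toy comparison. The sketch below is an illustration only: the two actions, their value vectors, the threshold, and the function names are all hypothetical and are not taken from the paper. It applies the same +1.0 overestimate to objective 0 of every action and checks whether the greedy choice changes under a linear utility and under a thresholded lexicographic ordering.

```python
THRESHOLD = 5.0  # hypothetical threshold on objective 0


def linear_utility(v, weights=(1.0, 0.1)):
    # Weighted sum of the objectives (weights are an assumption).
    return sum(w * x for w, x in zip(weights, v))


def thresholded_key(v):
    # Assumed non-linear ordering: cap objective 0 at the threshold,
    # then break ties on objective 1 (tuples compare lexicographically).
    return (min(v[0], THRESHOLD), v[1])


def best_action(values, key):
    # Name of the action whose value vector ranks highest under `key`.
    return max(values, key=lambda name: key(values[name]))


# Hypothetical true value vectors for two actions.
true_values = {"A": (4.5, 10.0), "B": (6.0, 2.0)}
# The same +1.0 overestimate on objective 0 for every action.
biased_values = {name: (v[0] + 1.0, v[1]) for name, v in true_values.items()}

linear_choices = (
    best_action(true_values, linear_utility),
    best_action(biased_values, linear_utility),
)
threshold_choices = (
    best_action(true_values, thresholded_key),
    best_action(biased_values, thresholded_key),
)
print(linear_choices)     # ('B', 'B'): a uniform bias does not change the ranking
print(threshold_choices)  # ('B', 'A'): the bias lifts A over the threshold
```

In this toy case the uniform bias cancels out under the linear utility, while under the thresholded ordering it makes action A appear to meet the threshold, so the tie-break on objective 1 decides in A's favour.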

What carries the argument

Multiobjective Q-learning with non-linear scalarisation, where the value function is vector-valued and action selection uses a non-linear utility operator.
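
A minimal sketch of that mechanism, under stated assumptions: the Q-table is a nested list indexed as `Q[state][action]` and holding one value per objective, and the utility is `min` over the objectives. The function names, table sizes, and the choice of `min` are hypothetical and chosen for illustration; this is not the authors' implementation, and it omits details such as exploration.

```python
def utility(v):
    # Assumed non-linear utility: the worst-performing objective.
    return min(v)


def greedy_action(Q, state):
    # Scalarise each action's value vector, then pick the highest utility.
    actions = range(len(Q[state]))
    return max(actions, key=lambda a: utility(Q[state][a]))


def mo_q_update(Q, s, a, reward_vec, s_next, done, alpha=0.1, gamma=1.0):
    # Bootstrap from the full value vector of the utility-greedy next action.
    if done:
        next_vec = [0.0] * len(reward_vec)
    else:
        next_vec = Q[s_next][greedy_action(Q, s_next)]
    # Each objective is updated separately, but all share the same next action.
    Q[s][a] = [
        q + alpha * (r + gamma * nv - q)
        for q, r, nv in zip(Q[s][a], reward_vec, next_vec)
    ]
    return Q


# Hypothetical table: 3 states, 2 actions, 2 objectives.
Q = [[[0.0, 0.0] for _ in range(2)] for _ in range(3)]
Q[1][0] = [4.0, 1.0]  # utility 1.0
Q[1][1] = [2.0, 2.0]  # utility 2.0, so action 1 is greedy in state 1
mo_q_update(Q, s=0, a=0, reward_vec=[1.0, 0.0], s_next=1, done=False, alpha=0.5)
print(greedy_action(Q, 1), Q[0][0])  # 1 [1.5, 1.0]
```

The point of the sketch is that the utility is only used to choose the next action; the stored values remain vectors, so whichever action is currently greedy in the successor state determines every component of the update target.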

If this is right

  • Algorithms using non-linear utilities in value-based MORL may underperform because of these two effects.
  • Tabular implementations already show degraded performance in simple environments.
  • These issues may require modifications to standard Q-learning updates or different scalarisation approaches.
  • Linear utility functions avoid these problems but limit the expressiveness of preferences.
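
A short worked example suggests why the linear case is plausibly benign. The notation is an assumption made here, not the paper's: $u$ is the utility, $\mathbf{w}$ a weight vector, and $\mathbf{v}_1, \mathbf{v}_2$ two value vectors that a stored estimate might blend with mixing weight $p$.

```latex
% Linear utility: blending vectors and blending utilities agree.
u(\mathbf{v}) = \mathbf{w}^{\top}\mathbf{v}
\;\Longrightarrow\;
u\bigl(p\,\mathbf{v}_1 + (1-p)\,\mathbf{v}_2\bigr)
  = p\,u(\mathbf{v}_1) + (1-p)\,u(\mathbf{v}_2).

% Non-linear utility, e.g. u(\mathbf{v}) = \min(v_1, v_2):
\mathbf{v}_1 = (4, 1), \quad \mathbf{v}_2 = (1, 4), \quad p = \tfrac{1}{2}
\;\Longrightarrow\;
u(\mathbf{v}_1) = u(\mathbf{v}_2) = 1,
\qquad
u\bigl(\tfrac{1}{2}\mathbf{v}_1 + \tfrac{1}{2}\mathbf{v}_2\bigr) = u\bigl((2.5,\, 2.5)\bigr) = 2.5.
```

Under the non-linear utility the blended vector looks better than either of the outcomes it was averaged from, so a value estimate that mixes different outcomes can misstate the utility actually obtainable. Under a linear utility the two sides always match.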

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar issues likely appear in deep MORL with neural network function approximation, potentially worsening the problems.
  • Alternative MORL methods like policy-based approaches might sidestep these value-based issues.
  • Developing interference-resistant value function representations could mitigate the identified problems.
  • Testing on environments with more objectives would reveal if the issues scale.

Load-bearing premise

The issues observed in simple tabular multi-objective MDPs with basic multiobjective Q-learning are representative of problems in more complex environments or with function approximation.

What would settle it

Demonstrating that in a complex multi-objective environment with deep function approximation, non-linear utility MORL performs without noticeable interference or overestimation sensitivity compared to linear cases.

Figures

Figures reproduced from arXiv: 2402.06266 by Cameron Foale, Ethan (EJ) Watkins, Peter Vamplew, Richard Dazeley.

Figure 2: Heatmaps showing the frequency with which each policy was selected as the final greedy policy across …
Figure 3: Heatmaps showing the difference in the frequency with which the incorrect Policy 2 was selected as the …
Figure 4: An example multi-objective MDP with stochastic rewards. The unlabelled states are terminal states, and the …

Original abstract

Multi-objective reinforcement learning (MORL) algorithms extend conventional reinforcement learning (RL) to the more general case of problems with multiple, conflicting objectives, represented by vector-valued rewards. Widely-used scalar RL methods such as Q-learning can be modified to handle multiple objectives by (1) learning vector-valued value functions, and (2) performing action selection using a scalarisation or ordering operator which reflects the user's preferences with respect to the different objectives. This paper investigates two previously unreported issues which can hinder the performance of value-based MORL algorithms when applied in conjunction with a non-linear utility function -- value function interference, and sensitivity to overestimation. We illustrate the nature of these phenomena on simple multi-objective MDPs using a tabular implementation of multiobjective Q-learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims that value-based MORL algorithms using non-linear utility functions suffer from two previously unreported issues—value function interference and sensitivity to overestimation—and illustrates the nature of these phenomena via a tabular multiobjective Q-learning implementation on simple multi-objective MDPs.

Significance. If the illustrated phenomena are as described, the work usefully flags concrete limitations when extending scalar value-based methods to MORL with non-linear scalarization. The choice of simple tabular settings is a strength, as it permits direct observation of the claimed interference and overestimation effects without confounding factors from function approximation or complex environments. This provides a clear starting point for subsequent algorithmic work.

minor comments (2)
  1. The experimental setup paragraph (referenced in the abstract) describes the use of 'simple multi-objective MDPs' and 'a basic multiobjective Q-learning implementation' but does not specify the exact MDP transition/reward structures or the precise form of the non-linear utility; adding these details would improve reproducibility while remaining within the paper's illustrative scope.
  2. The abstract states that the issues 'can hinder the performance' but provides no quantitative metrics (e.g., regret, success rate, or value error) comparing linear vs. non-linear scalarization; a short table of such metrics on the example MDPs would strengthen the illustration without altering the central claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their careful reading and positive assessment of the work. The recommendation for minor revision is noted, though no specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity

The paper presents an empirical investigation of value function interference and overestimation sensitivity in value-based MORL with non-linear scalarization. It uses only simple tabular multi-objective MDPs and a basic multiobjective Q-learning implementation to illustrate the phenomena, without any derivation chain, fitted parameters renamed as predictions, uniqueness theorems, or self-citation load-bearing steps. The central claim is limited to demonstrating the issues in these controlled settings, which the experiments directly support without reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical identification of issues rather than a theoretical derivation, so it introduces no new free parameters, axioms, or invented entities beyond standard RL assumptions already present in the prior literature.
