Issues with Value-Based Multi-objective Reinforcement Learning: Value Function Interference and Overestimation Sensitivity

Cameron Foale; Ethan (EJ) Watkins; Peter Vamplew; Richard Dazeley

arxiv: 2402.06266 · v2 · submitted 2024-02-09 · 💻 cs.LG


Pith reviewed 2026-05-24 04:12 UTC · model grok-4.3

classification 💻 cs.LG

keywords: multi-objective reinforcement learning · value function interference · overestimation · Q-learning · non-linear utility · scalarisation · multi-objective MDPs


The pith

Value-based MORL with non-linear utilities suffers from value function interference and overestimation sensitivity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports that two issues can arise in value-based multi-objective reinforcement learning when a non-linear utility function is used for action selection. Value function interference occurs as updates for one objective affect the values of others. Sensitivity to overestimation means that overestimated Q-values distort action selection more than they would under a linear utility. Both are illustrated on simple multi-objective MDPs using tabular multiobjective Q-learning. A sympathetic reader would care because non-linear utilities are necessary for many preference structures in real-world multi-objective problems.

Core claim

When a non-linear utility function is used to scalarise vector-valued value functions in multiobjective Q-learning, value function interference arises because changes to one objective's value function impact the selection and updates for other objectives, and the algorithm becomes more sensitive to overestimation of Q-values.
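A small numerical sketch can make the overestimation claim concrete. Everything in it is an assumption chosen for illustration rather than taken from the paper: the two Q-vectors, the 0.2 estimation error, the weights, and the hypothetical `threshold_utility`. It only shows that, in this one example, a small overestimate flips the greedy choice under a thresholded utility while leaving the choice under a linear utility unchanged.

```python
import numpy as np

def linear_utility(v, w=(0.5, 0.5)):
    # Weighted sum of the objectives (weights are an illustrative assumption).
    return float(np.dot(w, v))

def threshold_utility(v, threshold=5.0):
    # Hypothetical non-linear utility: objective 1 only counts once
    # objective 0 reaches the threshold; otherwise return the shortfall.
    return float(v[1]) if v[0] >= threshold else float(v[0] - threshold)

def greedy(q_values, u):
    # Index of the action whose Q-vector has the highest utility.
    return int(np.argmax([u(q) for q in q_values]))

true_q = np.array([[4.9, 10.0],   # action 0: just below the threshold on objective 0
                   [5.5, 3.0]])   # action 1: meets the threshold
# Same estimates with a small overestimate on objective 0 of action 0.
noisy_q = true_q + np.array([[0.2, 0.0], [0.0, 0.0]])

choices = {
    "linear": (greedy(true_q, linear_utility), greedy(noisy_q, linear_utility)),
    "threshold": (greedy(true_q, threshold_utility), greedy(noisy_q, threshold_utility)),
}
print(choices)
```

Under the linear utility the greedy action is the same for the true and the perturbed estimates, while under the thresholded utility it changes from action 1 to action 0.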

What carries the argument

Multiobjective Q-learning with non-linear scalarisation, where the value function is vector-valued and action selection uses a non-linear utility operator.
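The following is a minimal sketch of one common way to write such an update, not the authors' implementation. The table layout `(n_states, n_actions, n_objectives)`, the `min`-style `utility`, the function names, and the hyperparameters are all assumptions made for illustration.

```python
import numpy as np

def utility(v):
    # Hypothetical non-linear utility: the worse of the objectives.
    return float(np.min(v))

def greedy_action(q, state, u=utility):
    # q has shape (n_states, n_actions, n_objectives); each action's
    # Q-vector is scalarised with u before taking the argmax.
    scores = [u(q[state, a]) for a in range(q.shape[1])]
    return int(np.argmax(scores))

def mo_q_update(q, s, a, r, s_next, done, alpha=0.1, gamma=1.0, u=utility):
    """One tabular update of a vector-valued Q-table.

    Returns the greedy next action used for bootstrapping, or None
    when the transition is terminal.
    """
    r = np.asarray(r, dtype=float)
    if done:
        a_next, target = None, r
    else:
        # Action selection uses the scalarised values, but the whole
        # vector of the chosen action is bootstrapped.
        a_next = greedy_action(q, s_next, u)
        target = r + gamma * q[s_next, a_next]
    q[s, a] += alpha * (target - q[s, a])
    return a_next

# Usage on a two-state, two-action, two-objective table.
q = np.zeros((2, 2, 2))
q[1, 0] = [10.0, 0.0]
q[1, 1] = [4.0, 4.0]
a_next = mo_q_update(q, s=0, a=0, r=[1.0, 1.0], s_next=1, done=False, alpha=0.5)
print(a_next, q[0, 0])
```

The point of the sketch is the coupling: the utility is applied only when ranking actions, yet the vector that gets copied into the target depends on that ranking, so a change in any one component can change which whole vector is propagated.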

If this is right

Value-based MORL algorithms that use non-linear utilities can underperform because of these two issues.
Tabular implementations already show degraded performance in simple environments.
These issues may require modifications to standard Q-learning updates or different scalarisation approaches.
Linear utility functions avoid these problems but limit the expressiveness of preferences.
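One generic way to see why linear utilities are better behaved is that they commute with averaging. This is a textbook observation offered as background, not necessarily the argument the paper makes; the symbols $\mathbf{R}$ (a random vector return), $\mathbf{w}$ (a weight vector), and $u$ (a utility function) are notation assumed here.

```latex
u_{\mathbf{w}}\bigl(\mathbb{E}[\mathbf{R}]\bigr)
  = \mathbf{w}^{\top}\,\mathbb{E}[\mathbf{R}]
  = \mathbb{E}\bigl[\mathbf{w}^{\top}\mathbf{R}\bigr]
  = \mathbb{E}\bigl[u_{\mathbf{w}}(\mathbf{R})\bigr],
\qquad \text{whereas in general} \qquad
u\bigl(\mathbb{E}[\mathbf{R}]\bigr) \neq \mathbb{E}\bigl[u(\mathbf{R})\bigr]
\ \text{ for non-linear } u .
```

For example, if $\mathbf{R}$ is $(10, 0)$ or $(0, 10)$ with equal probability and $u(\mathbf{v}) = \min(v_1, v_2)$, then $u(\mathbb{E}[\mathbf{R}]) = \min(5, 5) = 5$ while $\mathbb{E}[u(\mathbf{R})] = 0$.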

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar issues likely appear in deep MORL with neural network function approximation, potentially worsening the problems.
Alternative MORL methods like policy-based approaches might sidestep these value-based issues.
Developing interference-resistant value function representations could mitigate the identified problems.
Testing on environments with more objectives would reveal if the issues scale.

Load-bearing premise

The issues observed in simple tabular multi-objective MDPs with basic multiobjective Q-learning are representative of problems in more complex environments or with function approximation.

What would settle it

Demonstrating that in a complex multi-objective environment with deep function approximation, non-linear utility MORL performs without noticeable interference or overestimation sensitivity compared to linear cases.

Figures

Figures reproduced from arXiv: 2402.06266 by Cameron Foale, Ethan (EJ) Watkins, Peter Vamplew, Richard Dazeley.

**Figure 3.** Heatmaps showing the difference in the frequency with which the incorrect Policy 2 was selected as the …

**Figure 4.** An example multi-objective MDP with stochastic rewards. The unlabelled states are terminal states, and the …

Original abstract

Multi-objective reinforcement learning (MORL) algorithms extend conventional reinforcement learning (RL) to the more general case of problems with multiple, conflicting objectives, represented by vector-valued rewards. Widely-used scalar RL methods such as Q-learning can be modified to handle multiple objectives by (1) learning vector-valued value functions, and (2) performing action selection using a scalarisation or ordering operator which reflects the user's preferences with respect to the different objectives. This paper investigates two previously unreported issues which can hinder the performance of value-based MORL algorithms when applied in conjunction with a non-linear utility function -- value function interference, and sensitivity to overestimation. We illustrate the nature of these phenomena on simple multi-objective MDPs using a tabular implementation of multiobjective Q-learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper identifies value function interference and overestimation sensitivity in value-based MORL on simple tabular MDPs but makes no broader claims.


Hi, the main thing here is that value-based MORL with non-linear scalarization can hit value function interference and overestimation problems, shown through basic examples on small multi-objective MDPs with tabular multiobjective Q-learning. The authors treat these as previously unreported and use the illustrations to lay out how they arise. That is the actual contribution: spotting and describing the two failure modes in a controlled setting.

The work stays within its evidence. It does not claim the issues appear with function approximation or in complex environments, and the stress-test note confirms the experiments match the stated scope exactly. That keeps the paper honest and avoids overreach. The illustrations are straightforward enough to follow, which helps readers see the mechanisms without extra assumptions.

The soft spot is the lack of any quantitative results or error analysis in the reported setup. Without numbers it is difficult to judge how strong the effects are even in the tabular case, though the paper frames itself as an illustration rather than a full empirical study.

This is for MORL researchers who work with value-based methods and non-linear utilities. A reader in that niche might pick up the paper to check whether these issues show up in their own implementations. It is not transformative for the wider field. It deserves a serious referee because the observations are concrete and could affect algorithm choices or debugging in this sub-area, even if later work shows the problems are narrow. I would send it out for review rather than desk reject.

Referee Report

0 major / 2 minor

Summary. The paper claims that value-based MORL algorithms using non-linear utility functions suffer from two previously unreported issues—value function interference and sensitivity to overestimation—and illustrates the nature of these phenomena via a tabular multiobjective Q-learning implementation on simple multi-objective MDPs.

Significance. If the illustrated phenomena are as described, the work usefully flags concrete limitations when extending scalar value-based methods to MORL with non-linear scalarization. The choice of simple tabular settings is a strength, as it permits direct observation of the claimed interference and overestimation effects without confounding factors from function approximation or complex environments. This provides a clear starting point for subsequent algorithmic work.

minor comments (2)

The experimental setup paragraph (referenced in the abstract) describes the use of 'simple multi-objective MDPs' and 'a basic multiobjective Q-learning implementation' but does not specify the exact MDP transition/reward structures or the precise form of the non-linear utility; adding these details would improve reproducibility while remaining within the paper's illustrative scope.
The abstract states that the issues 'can hinder the performance' but provides no quantitative metrics (e.g., regret, success rate, or value error) comparing linear vs. non-linear scalarization; a short table of such metrics on the example MDPs would strengthen the illustration without altering the central claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their careful reading and positive assessment of the work. The recommendation for minor revision is noted, though no specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity


The paper presents an empirical investigation of value function interference and overestimation sensitivity in value-based MORL with non-linear scalarization. It uses only simple tabular multi-objective MDPs and a basic multiobjective Q-learning implementation to illustrate the phenomena, without any derivation chain, fitted parameters renamed as predictions, uniqueness theorems, or self-citation load-bearing steps. The central claim is limited to demonstrating the issues in these controlled settings, which the experiments directly support without reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical identification of issues rather than a theoretical derivation, so it introduces no new free parameters, axioms, or invented entities beyond standard RL assumptions already present in the prior literature.



