On Reward-Balancing Methods for Reinforcement Learning

Bahman Gharesifard; Giuseppe Notarstefano; Simone Baroncini

arXiv: 2604.20433 · v1 · submitted 2026-04-22 · math.OC · cs.SY · eess.SY

Pith reviewed 2026-05-10 00:07 UTC · model grok-4.3

classification: math.OC · cs.SY · eess.SY

keywords: reward balancing, reinforcement learning, optimal control, normalization process, model predictive control, scenario optimization, discounted return

The pith

Reward-balancing methods for RL can be reformulated as an optimal control problem that preserves optimal policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines reward-balancing methods for solving discounted reinforcement learning problems by iteratively adjusting the reward function. The adjustments, called the normalization process, turn the problem into an equivalent one where optimal policies are greedy, and the algebraic structure of these transformations is analyzed. The procedure is then reformulated as an optimal control problem, which is extended to uncertain models via stochastic sampling to obtain guarantees and bounds. Simulations using this framework in scenario model predictive control show performance gains over state-of-the-art methods.

Core claim

Reward-balancing methods transform RL problems so that optimal policies become greedy. This transformation admits a control-theoretic interpretation that can be solved via optimal control techniques and extended to stochastic models for probabilistic performance bounds.

What carries the argument

The normalization process that adjusts the reward function to make optimal policies greedy while keeping the solution set unchanged.
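
As a rough illustration of what such a normalization can look like, the sketch below shifts each reward by the optimal value function (a potential-based shaping in the sense of Ng et al.) so that every state has maximum reward zero. This is an assumption made for illustration, not the paper's iterative procedure: the toy MDP, the names `q_values`, `optimal_values`, `normalize`, and `greedy`, and the use of value iteration to obtain the optimal values are all hypothetical.

```python
from typing import Dict, List

# Hypothetical toy MDP (not taken from the paper): two states,
# three actions in state 0 and two actions in state 1.
GAMMA = 0.8
# P[x][u] lists the next-state probabilities; R[x][u] is the immediate reward.
P: Dict[int, List[List[float]]] = {
    0: [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],
    1: [[0.0, 1.0], [0.7, 0.3]],
}
R: Dict[int, List[float]] = {0: [1.0, 0.0, 0.5], 1: [2.0, -1.0]}


def q_values(V, R, P, gamma):
    """One-step lookahead values Q(x, u) = R(x, u) + gamma * E[V(next state)]."""
    return {
        x: [
            R[x][u] + gamma * sum(p * V[y] for y, p in enumerate(P[x][u]))
            for u in range(len(R[x]))
        ]
        for x in R
    }


def optimal_values(R, P, gamma, tol=1e-10):
    """Plain value iteration (assumed solver, used only to build the example)."""
    V = [0.0] * len(R)
    while True:
        Q = q_values(V, R, P, gamma)
        V_new = [max(Q[x]) for x in range(len(R))]
        if max(abs(a - b) for a, b in zip(V, V_new)) < tol:
            return V_new
        V = V_new


def normalize(R, P, gamma):
    """Assumed normalization: R'(x, u) = Q*(x, u) - V*(x), so max_u R'(x, u) = 0."""
    V = optimal_values(R, P, gamma)
    Q = q_values(V, R, P, gamma)
    return {x: [Q[x][u] - V[x] for u in range(len(R[x]))] for x in R}


def greedy(R):
    """Policy that maximizes the immediate reward in every state."""
    return {x: max(range(len(R[x])), key=lambda u: R[x][u]) for x in R}


R_norm = normalize(R, P, GAMMA)
print("greedy on original rewards:  ", greedy(R))
print("greedy on normalized rewards:", greedy(R_norm))
```

On this toy problem the greedy policy for the original rewards picks action 0 in state 0, while the greedy policy for the normalized rewards picks action 1, which is the optimal choice there.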

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The control reformulation might enable the use of existing optimal control solvers for RL tasks.
This approach could be tested on a wider range of RL benchmarks to verify the improvements.
Connections to other reward shaping techniques in RL may be explored using the algebraic analysis.

Load-bearing premise

The normalization process transforms the RL problem into an equivalent one in which the optimal policies are greedy, preserving the solution set.

What would settle it

A specific RL problem where applying the normalization process changes the set of optimal policies, or simulation results where the scenario MPC implementation fails to outperform existing methods.
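
A search for such a counterexample can be sketched by brute force on a small MDP: enumerate every deterministic policy, evaluate it before and after the reward adjustment, and compare the sets of optimal policies. The code below does this for a potential-based shift, which is an assumed stand-in for the paper's normalization; the toy MDP, the potential `PHI`, and the function names are hypothetical, and a passing check on one instance is not a proof of invariance.

```python
from itertools import product

# Hypothetical toy MDP (not from the paper): P[x][u] are next-state
# probabilities, R[x][u] are immediate rewards.
GAMMA = 0.8
P = {0: [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]], 1: [[0.0, 1.0], [0.7, 0.3]]}
R = {0: [1.0, 0.0, 0.5], 1: [2.0, -1.0]}
PHI = [3.0, -1.5]  # arbitrary potential, chosen only for the example


def shape(R, P, phi, gamma):
    """Assumed adjustment: R'(x, u) = R(x, u) + gamma * E[phi(next)] - phi(x)."""
    return {
        x: [
            R[x][u] + gamma * sum(p * phi[y] for y, p in enumerate(P[x][u])) - phi[x]
            for u in range(len(R[x]))
        ]
        for x in R
    }


def evaluate(policy, R, P, gamma, sweeps=500):
    """Iterative policy evaluation; 500 sweeps is ample for gamma = 0.8."""
    n = len(R)
    V = [0.0] * n
    for _ in range(sweeps):
        V = [
            R[x][policy[x]]
            + gamma * sum(p * V[y] for y, p in enumerate(P[x][policy[x]]))
            for x in range(n)
        ]
    return V


def optimal_policies(R, P, gamma, tol=1e-8):
    """Return the deterministic policies whose values are maximal in every state."""
    n = len(R)
    policies = list(product(*(range(len(R[x])) for x in range(n))))
    values = {pi: evaluate(pi, R, P, gamma) for pi in policies}
    best = [max(v[x] for v in values.values()) for x in range(n)]
    optimal = {
        pi for pi, v in values.items()
        if all(best[x] - v[x] <= tol for x in range(n))
    }
    return optimal, values


opt_original, values_original = optimal_policies(R, P, GAMMA)
opt_shaped, values_shaped = optimal_policies(shape(R, P, PHI, GAMMA), P, GAMMA)
print("optimal set unchanged:", opt_original == opt_shaped)
```

For this kind of shift every policy's value moves by the same state-dependent offset, so the ranking of policies cannot change; a genuine counterexample would have to come from an adjustment that is not of this form.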

Figures

Figures reproduced from arXiv: 2604.20433 by Bahman Gharesifard, Giuseppe Notarstefano, Simone Baroncini.

**Figure 1.** Representation of an MDP with two states $x_1$, $x_2$ and five actions, three available in $x_1$ and two available in $x_2$. pullback representations $R_\pi \in \mathbb{R}^n$ and $F_\pi \in \mathbb{R}^{n \times n}$ can be easily computed as $R_\pi = \Pi R$ and $F_\pi = \Pi F$, respectively. Consequently, the projection map $s$ is represented by the matrix $S := \operatorname{diag}\{\mathbf{1}_{m_i} : i = 1, \dots, n\}$ (7). Observe that $\Pi S = I_n$ for any policy matrix $\Pi$. Eventually, as done for the other f…

**Figure 2.** Representation of the space $\mathrm{Map}(\mathcal{U}, \mathbb{R})$ of an MDP with just one state and two actions to choose from, for which the optimal policy is greedy regardless of the chosen reward and discount factor, with an optimal value function given by $V_{\pi^*, r}(x) = \max\{R^1, R^2\}/(1 - \gamma)$. The dark red set identifies a section of the normal set, in this case defined by $N = N_{\pi_1} \cup N_{\pi_2}$, where $N_{\pi_i} = \{R \in \mathbb{R}^2 : R^i = 0,\ R^j \le 0,\ j \ne i$ …

**Figure 3.** Visualization of the value space $\mathrm{Map}(\mathcal{X}, \mathbb{R})$ for an MDP with two states $x_1$, $x_2$ and an action space $\{u_{ij} : i \in \{1, 2\},\ j \in \{1, 2, 3\}\}$, where $u_{ij} \in \mathcal{U}_{x_i}$. Each intersection point $\partial C_u(r) \cap \partial C_v(r)$, with $u \in \mathcal{U}_{x_1}$ and $v \in \mathcal{U}_{x_2}$, is the value function corresponding to the policy $(x_1, x_2) \mapsto (u, v)$. The optimal value function is marked with a red cross, whereas the admissible set is the blue polygon. The green polygon …

**Figure 4.** Example of the evolution of the set of value functions (green region, including stochastic policies, with the optimal one marked with a cross), for an MDP with two states and six actions in total (three available in each state) when the reward is updated using the RB-S control law computed using the exact model. Each suboptimal value function corresponding to a stationary deterministic policy is marked wit…

**Figure 5.** Example of the evolution of the value functions set (green region, including stochastic policies) for the same MDP as in …

**Figure 6.** The largest model-invariant admissible set $F^I(R)$ for a reward vector $R$ with $h_{\max}(R) = (-2, -0.5)$ and a discount factor $\gamma = 0.8$. The three darker, nested subsets of $F^I(R)$ represent the regions satisfying (70) for $\alpha \in \{0.7, 0.8, 0.9\}$. Proof. By Theorem 33, the full-output feedback satisfies (70a), while it clearly satisfies (70b) with $\alpha \ge \gamma$, since replacing $h$ with $h_{\max}$ yields $\gamma \|h_{\max}(\hat{R})\|_\infty \le \alpha \|h_{\max}(\hat{R})\|_\infty$ …

**Figure 7.** Input and output supremum norms over 500 Monte-Carlo simulations. Recall that, in the full-output feedback case, input and output coincide.

**Figure 8.** Comparison between the full-output feedback control law and scenario MPC in terms of the percentage of times the greedy policy computed from the normalized reward function is optimal.

**Figure 9.** Average greedy policy over 500 Monte-Carlo simulations, expressed in percentage.
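
The closed form quoted in the Figure 2 caption follows in two lines from the Bellman optimality equation. The derivation below is a sketch; the shifted reward $\tilde{R}^i$ assumes that normalization subtracts the optimal value as a potential, which is consistent with the normal set described in that caption but is not stated explicitly on this page.

```latex
% Single state x, two actions with rewards R^1, R^2, discount \gamma \in [0, 1).
% Every action returns to x, so the Bellman optimality equation is scalar:
V = \max_{i \in \{1, 2\}} \bigl( R^i + \gamma V \bigr)
  = \max\{R^1, R^2\} + \gamma V
\quad\Longrightarrow\quad
V = \frac{\max\{R^1, R^2\}}{1 - \gamma}.

% Assumed normalization: shift each reward by the potential V.
\tilde{R}^i = R^i + \gamma V - V = R^i - \max\{R^1, R^2\} \le 0,
\qquad \max_{i} \tilde{R}^i = 0.
```

Under this assumption the best action ends up with reward zero and the other with a non-positive reward, which is the shape of the sets $N_{\pi_i}$ in the caption.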

Original abstract

This paper investigates the so-called reward-balancing methods, a novel class of algorithms for solving discounted-return reinforcement learning (RL) problems. These methods consist of iteratively adjusting the reward function to transform the RL problem into an equivalent one in which the optimal policies are greedy. For this procedure, referred to as normalization process, we provide a theoretical analysis of the involved transformations, emphasizing their algebraic structure. Then, we introduce a control-theoretic reformulation, recasting the reward-balancing procedure into an optimal control framework. The approach is further extended to address model uncertainty through stochastic model sampling, yielding normalization guarantees and probabilistic bounds on stochastic fluctuations. Using the proposed optimal control framework within a scenario model predictive control (MPC) setting, we demonstrate, through simulation studies, performance improvements over the current state-of-the-art.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper introduces reward-balancing as a way to normalize rewards in discounted RL so optimal policies become greedy, then recasts the process as an optimal control problem for scenario MPC.

The core idea is straightforward: iteratively adjust the reward function until the optimal policies for the modified problem are exactly the greedy ones, while claiming the solution set stays the same. The authors analyze the algebraic steps in that normalization, turn the procedure into an optimal-control formulation, add stochastic sampling for model uncertainty with some probabilistic bounds, and test the whole thing inside scenario MPC. The simulations reportedly beat existing methods on their test cases.

That control-theoretic angle and the extension to uncertain models are the parts that feel fresh compared with standard RL reward-shaping work. The algebraic focus also gives a cleaner handle on what the iterations are actually doing than most heuristic balancing tricks.

The main soft spot is the invariance claim. The abstract says the normalization produces an equivalent problem that preserves the original optimal policy set, but the stress-test note is right to flag that this needs an explicit check, especially when value functions are not unique or when the adjustments create new fixed points. Without that invariance, the MPC controller is optimizing something different from the original discounted objective, and any reported gains become harder to interpret. The simulation section is also thin on specifics in the abstract: no clear baselines, no ablation on the sampling, no discussion of how sensitive the improvements are to the scenario count.

This is aimed at researchers who already work at the RL-control boundary and want a structured way to handle discounting inside MPC. A reader who cares about formal guarantees on policy equivalence or reproducible MPC benchmarks will get the most out of it. The idea is coherent enough on its own terms to deserve a serious referee, even if the proofs and experiments need more detail before publication.

Referee Report

2 major / 2 minor

Summary. The paper investigates reward-balancing methods for discounted-return RL problems. These methods iteratively adjust the reward function via a normalization process that transforms the problem into an equivalent one where optimal policies are greedy w.r.t. the adjusted rewards while preserving the original solution set. The work provides a theoretical analysis emphasizing the algebraic structure of the transformations, recasts the procedure as an optimal control problem, extends it to model uncertainty via stochastic sampling with normalization guarantees and probabilistic bounds, and applies the framework in a scenario MPC setting to report performance improvements over state-of-the-art methods in simulation studies.

Significance. If the central equivalence between the original and normalized problems holds, the control-theoretic reformulation could offer a principled bridge between RL and optimal control, enabling robust handling of uncertainty through scenario MPC. The simulation results, if substantiated with clear baselines and metrics, would indicate practical utility for policy improvement in uncertain environments.

major comments (2)

[theoretical analysis of the normalization process] The normalization process is claimed to preserve the solution set while rendering policies greedy (abstract). This equivalence is load-bearing for the optimal-control reformulation and all subsequent MPC claims, yet the provided text supplies no explicit invariance proof or algebraic verification that the argmax set over policies remains unchanged under the iterative reward adjustment (particularly when value functions are non-unique). Without this, the scenario-MPC controller may optimize a modified MDP rather than the original discounted-return objective.
[extension to model uncertainty] The extension to stochastic model sampling claims normalization guarantees and probabilistic bounds on fluctuations, but no derivation or statement of these bounds (e.g., concentration inequalities or sample-complexity results) appears in the abstract or summary. This is central to the uncertainty-handling contribution and must be shown to hold for the reformulated optimal-control problem.

minor comments (2)

[simulation studies] The simulation studies claim improvements over the current state-of-the-art but provide no specifics on baselines, environments, or quantitative metrics (e.g., return values, variance). Adding these details would strengthen the empirical section.
Notation for the reward-adjustment operator and the optimal-control variables should be introduced more explicitly to aid readers bridging RL and control literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their insightful comments on the theoretical analysis and the handling of model uncertainty. These points will strengthen the manuscript significantly. Below, we provide point-by-point responses to the major comments.

Referee: [theoretical analysis of the normalization process] The normalization process is claimed to preserve the solution set while rendering policies greedy (abstract). This equivalence is load-bearing for the optimal-control reformulation and all subsequent MPC claims, yet the provided text supplies no explicit invariance proof or algebraic verification that the argmax set over policies remains unchanged under the iterative reward adjustment (particularly when value functions are non-unique). Without this, the scenario-MPC controller may optimize a modified MDP rather than the original discounted-return objective.

Authors: We thank the referee for highlighting this important point. The equivalence between the original and normalized problems is indeed central to our contributions. While the abstract and summary emphasize the preservation of the solution set, we agree that an explicit invariance proof was not sufficiently detailed in the provided text. In the revised manuscript, we will include a dedicated subsection with a rigorous algebraic verification that the argmax set over policies remains unchanged under the iterative reward adjustment. This will include a specific treatment of cases where value functions are non-unique, ensuring that the scenario-MPC controller optimizes an equivalent formulation of the original discounted-return objective. We will also update the abstract to reference this invariance result.

Revision: yes

Referee: [extension to model uncertainty] The extension to stochastic model sampling claims normalization guarantees and probabilistic bounds on fluctuations, but no derivation or statement of these bounds (e.g., concentration inequalities or sample-complexity results) appears in the abstract or summary. This is central to the uncertainty-handling contribution and must be shown to hold for the reformulated optimal-control problem.

Authors: We agree that the specific bounds and their derivations should be more clearly stated. The manuscript claims normalization guarantees and probabilistic bounds, but as noted, the abstract and summary do not provide the explicit forms or derivations. In the revision, we will add a statement of the bounds (including the use of concentration inequalities such as Hoeffding's) to the abstract and provide the full derivation in the main text, explicitly showing that they apply to the reformulated optimal-control problem under stochastic model sampling.

Revision: yes
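
Hoeffding's inequality, which the simulated rebuttal names as an example, gives a quick feel for how a sample count trades off against tolerance and confidence. The helper below is a generic sketch for independent samples bounded in an interval of known width; the function names are hypothetical, and the bounds actually derived in the paper may take a different form.

```python
import math


def hoeffding_tail(n: int, t: float, width: float) -> float:
    """Two-sided Hoeffding bound on P(|sample mean - expectation| >= t)
    for n independent samples, each confined to an interval of length `width`."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * t * t / (width * width)))


def hoeffding_samples(t: float, delta: float, width: float) -> int:
    """Smallest n for which the bound above is at most delta."""
    return math.ceil(width * width * math.log(2.0 / delta) / (2.0 * t * t))


# Example: samples in [0, 1], tolerance 0.1, failure probability 0.05.
n_needed = hoeffding_samples(t=0.1, delta=0.05, width=1.0)
print(n_needed, hoeffding_tail(n_needed, 0.1, 1.0))
```

The required count grows with the inverse square of the tolerance but only logarithmically in the inverse failure probability, which is the usual reason sampling-based guarantees of this kind stay affordable at high confidence levels.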

Circularity Check

0 steps flagged

No circularity: derivation chain is self-contained

The paper's core claims rest on a theoretical analysis of the normalization transformations and a subsequent optimal-control reformulation, presented as derived results rather than inputs. No equations or steps in the abstract reduce predictions to fitted parameters, self-definitions, or load-bearing self-citations by construction. The equivalence of the transformed problem is asserted as an outcome of the algebraic analysis, not presupposed. Simulation results are empirical validation separate from the derivation. This matches the default expectation for non-circular papers.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard RL domain assumptions about discounted returns and problem equivalence under reward transformations, with no free parameters, invented entities, or ad-hoc axioms apparent from the abstract.

axioms (2)

domain assumption: Discounted-return reinforcement learning problems can be transformed via reward adjustments into equivalent problems with greedy optimal policies.
This is the core premise of the normalization process stated in the abstract.
domain assumption: Stochastic model sampling provides normalization guarantees and probabilistic bounds under model uncertainty.
Invoked for the extension to uncertain models in the abstract.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

[1] A. Y. Ng, D. Harada, and S. Russell, Policy invariance under reward transformations: Theory and application to reward shaping, Proceedings of the Sixteenth International Conference on Machine Learning (ICML), 1999, 278–287.

[2] B. Marthi, Automatic shaping and decomposition of reward functions, Proceedings of the 24th International Conference on Machine Learning (ICML), 2007, 601–608.

[3] J. Ren, S. Guo, and F. Chen, Orientation-preserving rewards’ balancing in reinforcement learning, IEEE Transactions on Neural Networks and Learning Systems 33 (2021), no. 11, 6458–6472.

[4] Y. Hu, W. Wang, H. Jia, Y. Wang, Y. Chen, J. Hao, F. Wu, and C. Fan, Learning to utilize shaping rewards: A new approach of reward shaping, Advances in Neural Information Processing Systems 33 (2020), 15931–15941.

[5] H. Zou, T. Ren, D. Yan, H. Su, and J. Zhu, Reward shaping via meta-learning, arXiv:1901.09330, 2019.

[6] A. Mustafin, A. Pakharev, A. Olshevsky, and I. Ch. Paschalidis, MDP Geometry, Normalization and Reward Balancing Solvers, preprint, arXiv:2407.06712, 2025.

[7] J. Woo, G. Joshi, and Y. Chi, The blessing of heterogeneity in federated Q-learning: Linear speedup and beyond, Journal of Machine Learning Research 26 (2025), no. 26, 1–85.

[8] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, 2014.

[9] E. van der Pol, D. Worrall, H. van Hoof, F. Oliehoek, and M. Welling, MDP homomorphic networks: Group symmetries in reinforcement learning, Advances in Neural Information Processing Systems 33 (2020), 4199–4210.

[10] J. L. Kelley, General Topology, Courier Dover Publications, 2017.

[11] J. M. Varah, A lower bound for the smallest singular value of a matrix, Linear Algebra and its Applications 11 (1975), no. 1, 3–5.

[12] B. J. Gravell, P. M. Esfahani, and T. H. Summers, Robust control design for linear systems via multiplicative noise, IFAC-PapersOnLine 53 (2020), no. 2, 7392–7399.

[13] P. Coppens, M. Schuurmans, and P. Patrinos, Data-driven distributionally robust LQR with multiplicative noise, Proceedings of the 2nd Conference on Learning for Dynamics and Control (L4DC), 2020, 521–530.

[14] I. Pinelis, Optimum bounds for the distributions of martingales in Banach spaces, The Annals of Probability 22 (1994), no. 4, 1679–1706.

[15] G. Schildbach, L. Fagiano, C. Frei, and M. Morari, The scenario approach for stochastic model predictive control with bounds on closed-loop constraint violations, Automatica 50 (2014), no. 12, 3009–3018.

Department of Electrical, Electronic, and Information Engineering “Guglielmo Marconi” - DEI, University of Bologna, Bologna, Italy. Email address: s.baroncini...
