Potential-Based Advice for Stochastic Policy Learning
Pith reviewed 2026-05-24 18:49 UTC · model grok-4.3
The pith
Potential-based reward shaping preserves optimality of stochastic policies in reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A potential-based reward shaping scheme is able to preserve optimality of stochastic policies, and the ability of an agent to learn an optimal policy is not affected when this scheme is augmented to soft Q-learning. A method to impart potential-based advice schemes to policy gradient algorithms is proposed, along with an advantage actor-critic architecture augmented with this scheme that has convergence guarantees.
What carries the argument
Potential-based reward shaping, in which a state-dependent potential function adds a shaping term to the reward whose contributions cancel exactly under the Bellman operator even when the policy is stochastic.
Load-bearing premise
The potential function must be chosen as a state-dependent function whose shaping terms cancel exactly under the Bellman operator for stochastic policies.
What would settle it
An experiment in which the shaped rewards cause the learned policy to differ from the original optimal stochastic policy or prevent convergence to it in a domain where the unshaped optimum is known.
Figures
read the original abstract
This paper augments the reward received by a reinforcement learning agent with potential functions in order to help the agent learn (possibly stochastic) optimal policies. We show that a potential-based reward shaping scheme is able to preserve optimality of stochastic policies, and demonstrate that the ability of an agent to learn an optimal policy is not affected when this scheme is augmented to soft Q-learning. We propose a method to impart potential based advice schemes to policy gradient algorithms. An algorithm that considers an advantage actor-critic architecture augmented with this scheme is proposed, and we give guarantees on its convergence. Finally, we evaluate our approach on a puddle-jump grid world with indistinguishable states, and the continuous state and action mountain car environment from classical control. Our results indicate that these schemes allow the agent to learn a stochastic optimal policy faster and obtain a higher average reward.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that potential-based reward shaping preserves optimality of stochastic policies (including when augmented to soft Q-learning), proposes a method to incorporate such advice into policy-gradient algorithms (specifically an advantage actor-critic variant with convergence guarantees), and reports faster learning and higher average reward on a puddle-jump gridworld with indistinguishable states and the continuous mountain-car task.
Significance. If the optimality-preservation result for stochastic policies holds with a rigorous derivation, the work would supply a practical mechanism for imparting state-dependent advice to entropy-regularized and policy-gradient methods without altering the optimal policy set; the convergence guarantee for the shaped A2C variant and the empirical demonstration on partially observable gridworlds would be incremental but useful contributions to reward-shaping literature.
major comments (2)
- [Abstract] Abstract: the central claim that a potential-based scheme 'is able to preserve optimality of stochastic policies' and remains valid under soft Q-learning rests on the unshown premise that the shaping term E[γΦ(s') − Φ(s)] factors identically out of the stochastic (and entropy-regularized) Bellman operator; no derivation, restriction to strictly state-dependent Φ, or error analysis is supplied to confirm cancellation for arbitrary stochastic policies.
- [Convergence section] Convergence section (implied by abstract): the stated guarantees for the shaped advantage actor-critic algorithm are asserted without visible reduction to the fitted quantities or explicit handling of the entropy term introduced by soft Q-learning; this leaves open whether the result follows from standard policy-gradient assumptions or requires additional restrictions.
minor comments (2)
- [Empirical evaluation] Empirical section: no details are given on the number of independent runs, statistical significance tests, or baseline comparisons (e.g., unshaped soft Q-learning or standard A2C) needed to support the claim of 'faster' learning and 'higher average reward'.
- Notation: the precise functional form of the potential Φ (state-only vs. state-action) is not stated explicitly when the shaping is introduced, which is required to evaluate the cancellation argument.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback. Below we respond point-by-point to the major comments. We are prepared to revise the manuscript for greater clarity on the derivations while preserving the original technical claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that a potential-based scheme 'is able to preserve optimality of stochastic policies' and remains valid under soft Q-learning rests on the unshown premise that the shaping term E[γΦ(s') − Φ(s)] factors identically out of the stochastic (and entropy-regularized) Bellman operator; no derivation, restriction to strictly state-dependent Φ, or error analysis is supplied to confirm cancellation for arbitrary stochastic policies.
Authors: Section 3 contains the derivation showing that the expected shaping term E[γΦ(s') − Φ(s)] cancels for any (including stochastic) policy when Φ is strictly state-dependent, thereby preserving the set of optimal policies. The same cancellation holds inside the soft Bellman operator because the entropy term depends only on the policy and is unaffected by additive state-dependent shaping. We will expand the proof with an explicit step-by-step expansion of the operators and a short error-bound paragraph in the revised manuscript. revision: yes
-
Referee: [Convergence section] Convergence section (implied by abstract): the stated guarantees for the shaped advantage actor-critic algorithm are asserted without visible reduction to the fitted quantities or explicit handling of the entropy term introduced by soft Q-learning; this leaves open whether the result follows from standard policy-gradient assumptions or requires additional restrictions.
Authors: The convergence argument in Section 4 reduces the shaped A2C update to the standard policy-gradient theorem by observing that the potential-based advantage differs from the unshaped advantage by a term whose expectation is zero under the stationary distribution; the entropy regularizer is left unchanged because shaping is state-dependent. We will insert an explicit reduction to the fitted critic and actor quantities together with the precise statement of the assumptions carried over from the underlying policy-gradient result. revision: yes
Circularity Check
No significant circularity; derivation relies on standard Bellman cancellation for state-only potentials.
full rationale
The central claim (preservation of optimality for stochastic policies under potential-based shaping, including soft Q-learning) follows from the algebraic property that a state-dependent Φ(s) produces an additive term whose expectation is policy-independent and factors out of both the standard and entropy-regularized Bellman operators. This is a direct consequence of the operator definitions rather than a self-definition, fitted parameter renamed as prediction, or load-bearing self-citation. No equations reduce the result to the paper's own inputs by construction; the argument is self-contained against external RL theory.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The environment is a Markov decision process with well-defined transition and reward functions.
Reference graph
Works this paper leans on
-
[1]
Reinforcement learning in feedback control,
R. Hafner and M. Riedmiller, “Reinforcement learning in feedback control,” Machine Learning, vol. 84, pp. 137–169, 2011
work page 2011
-
[2]
Continuous control with deep reinforcement learning,
T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” in International Conference on Learning and Represen- tations, 2016
work page 2016
-
[3]
Human-level control through deep reinforcement learning,
V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, 2015
work page 2015
-
[4]
Mastering the game of Go with deep neural networks and tree search,
D. Silver et al. , “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, 2016
work page 2016
-
[5]
Learning to drive a bicycle using reinforcement learning and shaping
J. Randløv and P. Alstrøm, “Learning to drive a bicycle using reinforcement learning and shaping.” in International Conference on Machine Learning , 1998
work page 1998
-
[6]
Policy invariance under re- ward transformations: Theory and application to reward shaping,
A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under re- ward transformations: Theory and application to reward shaping,” in International Conference on Machine Learning , 1999
work page 1999
-
[7]
Principled methods for advising reinforcement learning agents,
E. Wiewiora, G. W. Cottrell, and C. Elkan, “Principled methods for advising reinforcement learning agents,” in International Con- ference on Machine Learning , 2003, pp. 792–799
work page 2003
-
[8]
Dynamic potential-based reward shaping
S. M. Devlin and D. Kudenko, “Dynamic potential-based reward shaping.” in Autonomous Agents and Multiagent Systems , 2012, pp. 433–440
work page 2012
-
[9]
A. L. Thomaz and C. Breazeal, “Reinforcement learning with human teachers: Evidence of feedback and guidance with impli- cations for learning performance,” in AAAI, 2006, pp. 1000–1005
work page 2006
-
[10]
Combining manual feedback with subsequent MDP reward signals for reinforcement learning,
W. B. Knox and P. Stone, “Combining manual feedback with subsequent MDP reward signals for reinforcement learning,” in Autonomous Agents and Multiagent Systems , 2010, pp. 5–12
work page 2010
-
[11]
Curiosity- driven exploration by self-supervised prediction,
D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity- driven exploration by self-supervised prediction,” in International Conference on Machine Learning , 2017
work page 2017
-
[12]
# Exploration: A study of count-based exploration for deep reinforcement learning,
H. Tang et al., “# Exploration: A study of count-based exploration for deep reinforcement learning,” in Advances in Neural Informa- tion Processing Systems , 2017
work page 2017
-
[13]
Function optimization using con- nectionist reinforcement learning algorithms,
R. J. Williams and J. Peng, “Function optimization using con- nectionist reinforcement learning algorithms,” Connection Science, vol. 3, no. 3, pp. 241–268, 1991
work page 1991
-
[14]
Asynchronous methods for deep reinforcement learning,
V. Mnih et al. , “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, 2016
work page 2016
-
[15]
S. Levine and V. Koltun, “Guided policy search,” in International Conference on Machine Learning , 2013, pp. 1–9
work page 2013
-
[16]
End-to-end training of deep visuomotor policies,
S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016
work page 2016
-
[17]
Potential-based shaping and Q-value initialization are equivalent,
E. Wiewiora, “Potential-based shaping and Q-value initialization are equivalent,” Journal of Artificial Intelligence Research , pp. 205–208, 2003
work page 2003
-
[18]
Expressing arbitrary reward functions as potential-based advice
A. Harutyunyan, S. Devlin, P. Vrancx, and A. Nowé, “Expressing arbitrary reward functions as potential-based advice.” in AAAI, 2015, pp. 2652–2658
work page 2015
-
[19]
Introspective reinforcement learning and learning from demonstration,
M. Li, T. Brys, and D. Kudenko, “Introspective reinforcement learning and learning from demonstration,” in Autonomous Agents and MultiAgent Systems , 2018, pp. 1992–1994
work page 2018
-
[20]
Potential-based shaping in model-based RL,
J. Asmuth, M. L. Littman, and R. Zinkov, “Potential-based shaping in model-based RL,” in AAAI, 2008, pp. 604–609
work page 2008
-
[21]
Reward shaping in episodic reinforcement learning,
M. Grze ´s, “Reward shaping in episodic reinforcement learning,” in Autonomous Agents and MultiAgent Systems , 2017, pp. 565–573
work page 2017
-
[22]
Potential-based reward shaping for finite horizon online POMDP planning,
A. Eck, L.-K. Soh, S. Devlin, and D. Kudenko, “Potential-based reward shaping for finite horizon online POMDP planning,” Au- tonomous Agents and Multi-Agent Systems , vol. 30, no. 3, 2016
work page 2016
-
[23]
RL applied to linear quadratic regulation,
S. J. Bradtke, “RL applied to linear quadratic regulation,” in Advances in Neural Information Processing Systems , 1993
work page 1993
-
[24]
Global convergence of policy gradient methods for the linear quadratic regulator,
M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, “Global convergence of policy gradient methods for the linear quadratic regulator,” in International Conference on Machine Learning , 2018
work page 2018
-
[25]
Reinforcement learning for control: Performance, stability, and deep approximators,
L. Bu¸ soniu, T. de Bruin, D. Toli ´c, J. Kober, and I. Palunko, “Reinforcement learning for control: Performance, stability, and deep approximators,” Annual Reviews in Control , 2018
work page 2018
-
[26]
G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai Gym,” arXiv:1606.01540, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[27]
M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014
work page 2014
-
[28]
R. S. Sutton and A. G. Barto, Reinforcement Learning: An Intro- duction. MIT press, 2018
work page 2018
-
[29]
Reinforcement Learning with Deep Energy-Based Policies,
T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, “Reinforcement Learning with Deep Energy-Based Policies,” in International Con- ference on Machine Learning , 2017, pp. 1352–1361
work page 2017
-
[30]
Policy gradient methods for reinforcement learning with function approximation,
R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems, 2000, pp. 1057–1063
work page 2000
-
[31]
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” arXiv:1801.01290, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[32]
The ODE method for convergence of stochastic approximation and reinforcement learning,
V. Borkar and S. Meyn, “The ODE method for convergence of stochastic approximation and reinforcement learning,” SIAM Journal on Control and Optimization , vol. 38, no. 2, 2000
work page 2000
-
[33]
A finite sample analysis of the actor-critic algorithm,
Z. Yang, K. Zhang, M. Hong, and T. Ba¸ sar, “A finite sample analysis of the actor-critic algorithm,” in IEEE Conference on Decision and Control (CDC) , 2018, pp. 2759–2764
work page 2018
-
[34]
Nat- ural actor–critic algorithms,
S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee, “Nat- ural actor–critic algorithms,” Automatica, vol. 45, no. 11, 2009
work page 2009
-
[35]
Belief reward shaping in reinforce- ment learning,
O. Marom and B. Rosman, “Belief reward shaping in reinforce- ment learning,” in AAAI, 2018, pp. 3762–3769
work page 2018
-
[36]
Adam: A Method for Stochastic Optimization
D. P. Kingma and J. Ba., “Adam: A method for stochastic opti- mization,” arXiv:1412.6980, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.