pith. machine review for the scientific record.

arxiv: 2605.06228 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI

Recognition: unknown

Soft Deterministic Policy Gradient with Gaussian Smoothing

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 13:18 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI
keywords Gaussian smoothing · Bellman equation · deterministic policy gradient · continuous control · deep reinforcement learning · non-smooth Q-functions · sparse rewards · DDPG

The pith

A Gaussian-smoothed Bellman equation yields a deterministic policy gradient that stays well-defined without critic action derivatives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard deterministic policy gradient methods require the critic to be differentiable with respect to actions, but this breaks down when rewards are sparse or discrete and produce non-smooth value functions. The paper replaces the ordinary Bellman operator with one that convolves against a Gaussian kernel, creating a new action-value function whose policy gradient can be computed directly. From this construction the authors derive the soft deterministic policy gradient and implement it as Soft DDPG. The resulting algorithm matches conventional performance on dense-reward benchmarks while showing clearer gains on discretized-reward variants where standard methods become unstable.
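To make this concrete, the sketch below shows the generic Gaussian-smoothing estimator on a toy 0/1 reward: the smoothed value and its action gradient are computed from critic evaluations alone, with no ∇_a Q term. This is a minimal illustration of the general idea, not the paper's exact Soft-DPG update; the function names and the toy step-reward critic are invented for the example.

```python
import numpy as np

def smoothed_q_and_grad(q_fn, state, action, sigma=0.1, n_samples=256, rng=None):
    """Monte Carlo estimate of a Gaussian-smoothed Q-value and its action
    gradient, using only evaluations of q_fn (never its derivative).

    Relies on the generic smoothing identity
        grad_a E[Q(s, a + eps)] = E[eps * Q(s, a + eps)] / sigma**2,
    with eps ~ N(0, sigma**2 I). Illustrative sketch only, not the paper's
    exact Soft-DPG update.
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.normal(scale=sigma, size=(n_samples, action.shape[-1]))
    q_vals = np.array([q_fn(state, action + e) for e in eps])  # critic evaluations only
    q_smooth = q_vals.mean()
    grad = (q_vals[:, None] * eps).mean(axis=0) / sigma**2     # zeroth-order action gradient
    return q_smooth, grad

# Toy non-smooth critic: a discrete (0/1) reward landscape in the action.
def q_step(state, action):
    return float(np.linalg.norm(action) < 0.5)  # exact grad_a is zero almost everywhere

q_s, g = smoothed_q_and_grad(q_step, state=None, action=np.array([0.6, 0.0]), sigma=0.2)
print(q_s, g)  # the smoothed gradient still points toward the rewarding region
```

On the step reward the exact action gradient is zero almost everywhere, yet the smoothed estimate points toward the rewarding region; its variance grows as σ shrinks, which is one face of the trade-off the smoothing width controls.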

Core claim

We define a novel action-value function based on a smoothed Bellman equation and derive the soft deterministic policy gradient (Soft-DPG). Our formulation eliminates explicit dependence on critic action-gradients and ensures that the gradient remains well-defined even for non-smooth Q-functions. We instantiate this framework into a deep reinforcement learning algorithm, which we call soft deep deterministic policy gradient (Soft DDPG).

What carries the argument

the Gaussian-smoothed Bellman operator that defines a differentiable action-value function from which a policy gradient is obtained without differentiating the critic
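Read through the lens of the standard deterministic policy gradient theorem, one hedged way to write what such an operator buys (the paper's exact Soft-DPG expression may differ in detail) is:

```latex
% Deterministic policy gradient taken on a Gaussian-smoothed critic Q_sigma.
% The action gradient of Q_sigma needs only evaluations of Q, never \nabla_a Q.
\nabla_\theta J_\sigma(\pi_\theta)
  = \mathbb{E}_{s \sim \rho^{\pi}}\!\left[
      \nabla_\theta \pi_\theta(s)\,
      \nabla_a Q_\sigma(s,a)\big|_{a=\pi_\theta(s)}
    \right],
\qquad
\nabla_a Q_\sigma(s,a)
  = \frac{1}{\sigma^{2}}\,
    \mathbb{E}_{\varepsilon \sim \mathcal{N}(0,\sigma^{2} I)}\!\left[
      \varepsilon\, Q(s, a+\varepsilon)
    \right].
```

The right-hand expectation touches the critic only through evaluations Q(s, a+ε), which remain defined even where Q is non-smooth in a; this is the sense in which the explicit ∇_a Q term can disappear from the actor update.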

If this is right

  • Policy updates no longer require the critic to be differentiable with respect to actions.
  • Learning remains stable under sparse or discrete reward structures that create irregular critic landscapes.
  • Soft DDPG achieves comparable returns to DDPG on standard dense-reward continuous control tasks.
  • Performance advantages appear in most discretized-reward variants of the same environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same smoothing construction could be applied to other policy-gradient families that currently assume differentiable critics.
  • In domains such as robotics where natural reward signals are irregular, the method may reduce the need for manual reward shaping.
  • Varying the Gaussian kernel width offers a direct way to study the bias introduced by smoothing as a function of problem scale; a toy probe of this idea is sketched below.
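As a concrete, entirely hypothetical way to act on the last point, one could sweep σ on a one-dimensional toy reward and track how far the maximizer of the smoothed value drifts from the maximizer of the original one. The diagnostic below is invented for illustration and is not proposed in the paper.

```python
import numpy as np

# Hypothetical probe (not from the paper): sweep the smoothing width sigma and
# measure how far the maximizer of the smoothed value drifts from the maximizer
# of the original, non-smooth value on a 1-D toy reward.
def smoothed_values(q, actions, sigma, n=4000, seed=0):
    eps = np.random.default_rng(seed).normal(scale=sigma, size=n)
    return np.array([q(a + eps).mean() for a in actions])

q = lambda a: (np.abs(a - 0.3) < 0.05).astype(float)   # narrow 0/1 reward bump at a = 0.3
actions = np.linspace(-1.0, 1.0, 401)
a_star = actions[np.argmax(q(actions))]                 # maximizer of the original value

for sigma in (0.02, 0.1, 0.3, 0.8):
    a_sigma = actions[np.argmax(smoothed_values(q, actions, sigma))]
    print(f"sigma={sigma:4.2f}  argmax drift = {abs(a_sigma - a_star):.3f}")
```

Larger σ smooths more aggressively and can move the smoothed optimum away from the original one, which is exactly the bias that the load-bearing premise below assumes stays small.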

Load-bearing premise

The Gaussian-smoothed value function produces a policy gradient that is a sufficiently accurate proxy for the original non-smooth problem without introducing large optimization bias.

What would settle it

On a discretized-reward continuous-control benchmark where the critic landscape is known to be non-smooth, if Soft DDPG shows no performance gain or exhibits the same instability as standard DDPG, the claim that the smoothed gradient remains a reliable proxy would be falsified.

Figures

Figures reproduced from arXiv: 2605.06228 by Donghwan Lee, Hyunjun Na.

Figure 1. Visualization of the learned critic and its action-gradients in the toy environment with a discrete reward. DPG updates the actor by leveraging the gradient of the critic with respect to the action, so the policy improvement direction is determined by ∇_a Q(s, a) evaluated at the current policy action. This formulation implicitly assumes that the action-gradient of the critic provides a reliable and sta…
Figure 2. The first two rows show results in the continuous-reward environments, while the last two…
Figure 3. Sensitivity analysis of Soft DDPG hyperparameters on the Ant environment. (Left)…
Figure 4. Learning curves on various benchmarks: Continuous (top) vs. Discrete (bottom) control.
original abstract

Deterministic policy gradient (DPG) is widely utilized for continuous control; however, it inherently relies on the differentiability of the critic with respect to the action during policy updates. This assumption is violated in practical control problems involving sparse or discrete rewards, leading to ill-defined policy gradients and unstable learning. To address these challenges, we propose a principled alternative based on a smoothed Bellman equation formulated via Gaussian smoothing. Specifically, we define a novel action-value function based on a smoothed Bellman equation and derive the soft deterministic policy gradient (Soft-DPG). Our formulation eliminates explicit dependence on critic action-gradients and ensures that the gradient remains well-defined even for non-smooth Q-functions. We instantiate this framework into a deep reinforcement learning algorithm, which we call soft deep deterministic policy gradient (Soft DDPG). Empirical evaluations on standard continuous control benchmarks and their discretized-reward variants show that Soft DDPG remains competitive in dense-reward settings and provides clear gains in most discretized-reward environments, where standard DDPG is more sensitive to irregular critic landscapes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that Gaussian smoothing of the Bellman equation yields a novel smoothed action-value function from which a soft deterministic policy gradient (Soft-DPG) can be derived; this gradient has no explicit dependence on the critic's action derivative and remains well-defined for non-smooth Q-functions arising from sparse or discrete rewards. The resulting Soft DDPG algorithm is shown to match standard DDPG on dense-reward continuous-control benchmarks while delivering gains on discretized-reward variants.

Significance. A method that removes the differentiability requirement on the critic would broaden the applicability of deterministic policy gradients to irregular reward landscapes. The empirical improvements on discretized-reward tasks indicate practical utility if the smoothing bias remains modest, but the absence of any bound relating the smoothed and original optima reduces the result's theoretical weight.

major comments (2)
  1. [§3] §3 (derivation of Soft-DPG): the interchange of differentiation and the Gaussian integral removes the explicit ∇_a Q term for the smoothed operator, yet the manuscript supplies neither a bound on ||J_smoothed(π) - J(π)|| nor a guarantee that the fixed point of the smoothed policy gradient converges to a near-optimal policy for the original non-smooth Bellman operator. This gap directly affects whether the reported gains reflect optimization of the intended objective.
  2. [Experimental section] Experimental section (discretized-reward results): no ablation on the smoothing width σ is presented, and the abstract reports neither error bars nor the number of independent seeds. Because σ is an explicit free parameter whose choice can shift the location of the smoothed optimum, the claim that Soft DDPG is “more robust” to irregular critic landscapes cannot be assessed without these controls.
minor comments (1)
  1. Notation for the smoothed Bellman operator is introduced without an explicit statement of how the Gaussian kernel is normalized or truncated at the action boundaries; a short clarifying paragraph would prevent ambiguity when readers attempt to reproduce the update.

Simulated Author's Rebuttal

2 responses · 1 unresolved

Thank you for the constructive feedback. We appreciate the opportunity to clarify the theoretical foundations and strengthen the experimental analysis of Soft-DPG. Below we address the major comments point by point.

point-by-point responses
  1. Referee: [§3] §3 (derivation of Soft-DPG): the interchange of differentiation and the Gaussian integral removes the explicit ∇_a Q term for the smoothed operator, yet the manuscript supplies neither a bound on ||J_smoothed(π) - J(π)|| nor a guarantee that the fixed point of the smoothed policy gradient converges to a near-optimal policy for the original non-smooth Bellman operator. This gap directly affects whether the reported gains reflect optimization of the intended objective.

    Authors: We agree that the derivation applies to the smoothed objective and that no explicit bound or convergence guarantee to the original optimum is provided. The smoothed Bellman equation yields a differentiable surrogate that enables stable gradients even for non-smooth Q. We will revise §3 and the discussion to explicitly note that Soft-DPG optimizes the smoothed objective J_smoothed and that the relationship to the original J is empirical, as supported by the competitive performance on dense-reward tasks and improvements on discretized ones. This limitation will be highlighted as an avenue for future theoretical analysis. revision: partial

  2. Referee: [Experimental section] Experimental section (discretized-reward results): no ablation on the smoothing width σ is presented, and the abstract reports neither error bars nor the number of independent seeds. Because σ is an explicit free parameter whose choice can shift the location of the smoothed optimum, the claim that Soft DDPG is “more robust” to irregular critic landscapes cannot be assessed without these controls.

    Authors: We will add an ablation study on the smoothing parameter σ in the experimental section, showing performance sensitivity across a range of values on both dense and discretized tasks. Additionally, we will update the abstract and all result tables/figures to report means and standard deviations over 5 independent random seeds, as used in our experiments. These revisions will allow readers to better evaluate the robustness claim. revision: yes

standing simulated objections not resolved
  • Lack of theoretical bound on the approximation error between the smoothed and original objectives, and absence of a convergence guarantee for the smoothed policy gradient to the original optimum.

Circularity Check

0 steps flagged

No significant circularity; derivation is a self-contained reformulation

full rationale

The paper introduces a novel smoothed Bellman equation as the starting point and derives the Soft-DPG by interchanging differentiation under the Gaussian integral. This is a direct mathematical consequence of the new definition rather than a reduction to fitted parameters, self-citations, or prior results by the same authors. No load-bearing step collapses to an input by construction, and the central claim (well-defined gradient for non-smooth Q) follows from the smoothing without circularity. The approach remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on introducing a smoothed action-value function whose gradient is always defined; this adds one invented entity and one domain assumption about the validity of the smoothed operator.

free parameters (1)
  • Gaussian smoothing width (sigma)
    Controls the degree of smoothing; its value must be chosen and can affect both stability and bias of the resulting policy gradient.
axioms (1)
  • domain assumption The smoothed Bellman operator yields a differentiable Q-function whose policy gradient is a useful surrogate for the original non-smooth problem.
    Invoked to justify that the derived Soft-DPG update is valid even when the original critic is non-differentiable.
invented entities (1)
  • Smoothed action-value function (no independent evidence)
    purpose: To replace the standard Q-function so that its action gradient is always defined.
    Newly defined object that enables the entire Soft-DPG derivation.

pith-pipeline@v0.9.0 · 5477 in / 1309 out tokens · 81463 ms · 2026-05-08T13:18:44.579159+00:00 · methodology

discussion (0)

