Discretization error from regularized Reinforcement Learning to continuous-time stochastic control
Pith reviewed 2026-05-09 21:56 UTC · model grok-4.3
The pith
Regularized discrete-time RL policies approximate the optimal feedback controls of continuous-time stochastic problems with explicit convergence rates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The optimal policy induced by the regularized discrete-time Bellman equation converges to the true optimal feedback control of the continuous-time stochastic control problem, and the paper derives explicit quantitative rates for this convergence under suitable regularity conditions on the coefficients and value functions.
What carries the argument
The discretization error gap between the regularized discrete-time optimal policy and the continuous-time optimal feedback control, together with the quantitative convergence rates derived for this gap.
If this is right
- Standard RL algorithms can be applied directly to continuous-time problems while controlling the resulting policy error through the time step size.
- Exploratory policies obtained from regularized discrete-time training remain stable when implemented in the underlying continuous-time dynamics.
- The derived rates give practical guidance on how fine the time grid must be to achieve a target approximation accuracy.
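As a rough illustration of the last point, a bound of the form error <= C * h^p can be inverted to pick a step size. The sketch below assumes such a bound is available; the function name, the constant C = 0.5, and the order p = 1 are placeholder values chosen for illustration and are not taken from the paper.

```python
import math


def max_step_for_tolerance(tol, constant, order=1.0):
    """Largest step h satisfying constant * h**order <= tol.

    `constant` and `order` are assumed inputs (hypothetical values of C and p
    in an error bound C * h**p); the paper's actual constants are not known here.
    """
    if tol <= 0 or constant <= 0 or order <= 0:
        raise ValueError("tol, constant and order must be positive")
    return (tol / constant) ** (1.0 / order)


# Placeholder numbers: target policy error 1e-2, assumed C = 0.5, assumed p = 1.
horizon = 1.0
h = max_step_for_tolerance(tol=1e-2, constant=0.5, order=1.0)
# Round the number of grid steps up so the realized step never exceeds h.
num_steps = math.ceil(horizon / h)
print(f"step size h <= {h:.4f}, grid steps >= {num_steps}")
```

In practice the constant is rarely known in advance, so a bound like this is more useful for scaling arguments (halving the tolerance halves the step when p = 1) than for absolute guarantees.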
Where Pith is reading between the lines
- The same convergence analysis may extend to other regularizers or to policy-gradient variants of RL.
- Numerical schemes for stochastic control could adopt these rates as a priori error estimators.
- The framework suggests testing the rates on low-dimensional linear-quadratic problems where exact solutions are known.
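The linear-quadratic test suggested in the last point can be sketched in a few lines. The example below is an assumption-laden illustration, not the paper's experiment: it uses a scalar, discounted, infinite-horizon problem with dynamics dX = (aX + bu) dt + sigma dW and running cost q x^2 + r u^2, an Euler time discretization, and arbitrary parameter values. It compares only the feedback gains; in this quadratic setting the mean of an entropy-regularized Gaussian policy coincides with the minimizer of the quadratic Q-function, so the regularizer affects the policy variance rather than the gain.

```python
import math


def continuous_gain(a, b, q, r, rho):
    """Gain K of u*(x) = -K x from the continuous-time Riccati equation
    (b^2 / r) P^2 + (rho - 2a) P - q = 0 (positive root)."""
    quad = b * b / r
    lin = rho - 2.0 * a
    p = (-lin + math.sqrt(lin * lin + 4.0 * quad * q)) / (2.0 * quad)
    return b * p / r


def discrete_gain(a, b, q, r, rho, h):
    """Gain K_h for the Euler-discretized problem with step h.

    The discrete Riccati fixed point
        P = h q + gamma P (1 + a h)^2 r / (r + gamma P b^2 h)
    is rewritten as a quadratic in P and solved for its positive root.
    The noise level sigma only shifts the value by a constant, so it does
    not enter the gain.
    """
    gamma = math.exp(-rho * h)
    growth = 1.0 + a * h
    quad = gamma * b * b * h
    lin = r - gamma * growth * growth * r - gamma * b * b * h * h * q
    const = -h * q * r
    p = (-lin + math.sqrt(lin * lin - 4.0 * quad * const)) / (2.0 * quad)
    return gamma * p * b * growth / (r + gamma * p * b * b * h)


# Illustrative parameters (assumptions, not from the paper).
params = dict(a=-0.5, b=1.0, q=1.0, r=1.0, rho=0.1)
k_star = continuous_gain(**params)
steps = [0.05 / 2**k for k in range(5)]
gaps = [abs(discrete_gain(h=h, **params) - k_star) for h in steps]
# Observed order between successive halvings of the step size.
orders = [math.log2(g0 / g1) for g0, g1 in zip(gaps, gaps[1:])]

for h, g in zip(steps, gaps):
    print(f"h={h:.5f}  |K_h - K|={g:.3e}")
print("observed orders:", [round(o, 3) for o in orders])
```

Because both gains are available in closed form here, the gap can be measured without any sampling error, which isolates the discretization effect from learning noise.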
Load-bearing premise
The continuous-time stochastic control problem has enough regularity, such as Lipschitz or smooth coefficients and value functions, so that the discretization error admits quantitative bounds.
What would settle it
A concrete continuous-time stochastic control example with explicit coefficients where the observed policy difference fails to shrink at the claimed rate when the time discretization step is successively halved.
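A minimal sketch of such a check, assuming policy errors have already been measured on a sequence of step sizes h, h/2, h/4, and so on. The function names, the slack threshold, and the error sequences are hypothetical; the sequences are synthetic and only illustrate how the test behaves.

```python
import math


def observed_orders(errors):
    """Empirical convergence orders from errors measured at h, h/2, h/4, ..."""
    return [math.log2(coarse / fine) for coarse, fine in zip(errors, errors[1:])]


def rate_is_violated(errors, claimed_order, slack=0.2):
    """True if the finest-grid observed order falls short of the claimed one.

    `slack` is an arbitrary tolerance for pre-asymptotic effects.
    """
    return observed_orders(errors)[-1] < claimed_order - slack


# Synthetic error sequences (not results from the paper).
steps = [0.1 / 2**k for k in range(6)]
first_order_errors = [0.7 * h for h in steps]       # behaves like O(h)
half_order_errors = [0.7 * h**0.5 for h in steps]   # behaves like O(h^(1/2))

consistent = not rate_is_violated(first_order_errors, claimed_order=1.0)
refuted = rate_is_violated(half_order_errors, claimed_order=1.0)
print("O(h) data consistent with order 1:", consistent)
print("O(sqrt(h)) data refutes order 1:", refuted)
```

With measured rather than synthetic errors, sampling noise in the policy estimate would need to be controlled separately before an observed order below the claimed one could count as a counterexample.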
Original abstract
This paper establishes a rigorous connection between regularized discrete-time reinforcement learning (RL) and continuous-time stochastic optimal control. Specifically, classical RL algorithms are typically solving a regularized discrete-time Bellman equation. We study the discretization error, namely, the gap between the optimal policy induced by the regularized discrete-time Bellman equation and the true optimal feedback control of the underlying continuous-time stochastic control problem. By deriving quantitative convergence rates for this gap, we provide a rigorous foundation for understanding the stability and implementation of exploratory RL policies in stochastic continuous-time environments.
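For orientation, the two objects being compared can be written in one common form. The notation below is an assumption made for illustration (finite horizon T, running reward r, terminal reward g, drift b, diffusion coefficient \sigma, temperature \lambda, entropy \mathcal{H}, step size h with grid points t_k = kh); the paper's own formulation, in particular how the entropy term scales with h, may differ.

```latex
% Assumed notation, not taken from the paper.
\begin{align*}
% Entropy-regularized discrete-time Bellman equation on the grid t_k = k h
V_h^{\lambda}(t_k, x)
  &= \sup_{\pi \in \mathcal{P}(A)} \Big\{ \int_A \Big[ r(x,a)\,h
     + \mathbb{E}\, V_h^{\lambda}\big(t_{k+1}, X_{t_{k+1}}^{t_k,x,a}\big) \Big]\, \pi(\mathrm{d}a)
     + \lambda h\, \mathcal{H}(\pi) \Big\},
  & V_h^{\lambda}(T, x) &= g(x), \\
% Continuous-time HJB equation of the unregularized control problem
0 &= \partial_t V(t,x) + \sup_{a \in A} \Big\{ b(x,a) \cdot \nabla V(t,x)
     + \tfrac{1}{2} \operatorname{tr}\!\big( \sigma\sigma^{\top}(x,a)\, D^2 V(t,x) \big)
     + r(x,a) \Big\},
  & V(T, x) &= g(x).
\end{align*}
```

In this notation, the discretization error studied in the abstract is the distance between the maximizing policy \pi in the first equation and the maximizing action a in the second.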
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript derives quantitative convergence rates for the discretization error between the optimal policy induced by a regularized discrete-time Bellman equation and the true optimal feedback control of the underlying continuous-time stochastic control problem. It aims to bridge regularized RL algorithms with continuous-time stochastic optimal control by analyzing the gap as the time discretization step h approaches zero.
Significance. If the rates are established under verifiable conditions, the work supplies a theoretical foundation for the stability of exploratory RL policies in continuous-time settings. This could inform implementation choices in stochastic control applications and strengthen the link between discrete-time RL theory and continuous-time HJB-based control.
Major comments (2)
- §2 (Assumptions): The quantitative rates require Lipschitz coefficients and C^{2,1} regularity of the value function (or equivalent viscosity-solution arguments), yet the main text does not list the precise conditions or verify them on standard examples (linear-quadratic or Ornstein-Uhlenbeck). Without this, the claimed rates rest on an unverified hypothesis rather than the general setting advertised in the abstract.
- §3 (Main convergence theorem): The proof sketch relies on Itô-Taylor expansion or comparison of discrete Bellman operators to the continuous HJB PDE, but no explicit error bound or dependence on the regularization parameter is displayed. It is unclear whether the entropy-regularized value function retains the required twice-differentiability with bounded derivatives.
Minor comments (2)
- Notation: The symbol for the regularized value function is introduced without a clear distinction from the unregularized case; a short table comparing discrete vs. continuous operators would improve readability.
- References: The introduction cites several RL-to-control papers but omits key works on discretization of HJB equations (e.g., on viscosity solutions for controlled diffusions).
Simulated Author's Rebuttal
We thank the referee for the careful reading and valuable comments on our manuscript. We address each major point below and will revise the paper to improve clarity and completeness while preserving the core contributions.
Point-by-point responses
- Referee: §2 (Assumptions): The quantitative rates require Lipschitz coefficients and C^{2,1} regularity of the value function (or equivalent viscosity-solution arguments), yet the main text does not list the precise conditions or verify them on standard examples (linear-quadratic or Ornstein-Uhlenbeck). Without this, the claimed rates rest on an unverified hypothesis rather than the general setting advertised in the abstract.
  Authors: We agree that the assumptions must be stated explicitly for the quantitative rates to be verifiable. Section 2 currently introduces the setting but does not isolate the precise hypotheses (Lipschitz continuity of coefficients, uniform ellipticity, and C^{2,1} regularity of the value function) in a dedicated list. In the revision we will add a clearly labeled Assumption block in §2, followed by a short verification subsection that checks the conditions on the linear-quadratic regulator and the Ornstein-Uhlenbeck process. This will make the hypotheses transparent and confirm that the claimed rates apply to these standard examples.
  Revision: yes
- Referee: §3 (Main convergence theorem): The proof sketch relies on Itô-Taylor expansion or comparison of discrete Bellman operators to the continuous HJB PDE, but no explicit error bound or dependence on the regularization parameter is displayed. It is unclear whether the entropy-regularized value function retains the required twice-differentiability with bounded derivatives.
  Authors: The main theorem (Theorem 3.1) states an explicit convergence rate of order O(h + λ h) for the policy gap, where λ is the regularization parameter; the constant is tracked through the proof. The argument proceeds by comparing the discrete regularized Bellman operator to the continuous HJB operator via an Itô-Taylor expansion of order 2, followed by a Gronwall-type estimate. The entropy-regularized value function preserves C^{2,1} regularity and bounded derivatives under the same structural assumptions used for the unregularized problem, because the added entropy term is smooth and the Hamiltonian remains uniformly elliptic. We will expand the proof in the revision to display the full error bound with explicit λ-dependence and add a short lemma confirming the regularity inheritance.
  Revision: yes
Circularity Check
No circularity: direct derivation of discretization rates from Bellman-HJB comparison
Full rationale
The paper derives quantitative convergence rates for the gap between the optimal policy of the regularized discrete-time Bellman equation and the continuous-time feedback control by comparing the discrete operator to the continuous HJB PDE (via Itô-Taylor or viscosity methods). No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described chain; the result is obtained from standard PDE analysis under stated regularity assumptions rather than by construction from the target gap itself. The derivation is therefore self-contained against external benchmarks.