Neural Actor-Critic Methods for Hamilton-Jacobi-Bellman PDEs: Asymptotic Analysis and Numerical Studies
Pith reviewed 2026-05-21 23:25 UTC · model grok-4.3
The pith
As the number of hidden units tends to infinity, actor and critic neural networks for HJB equations converge in a Sobolev space to an infinite-dimensional ODE whose fixed points solve the stochastic control problem under a convexity-like assumption.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We show that the training dynamics of the actor and critic neural networks converge in a Sobolev-type space to a certain infinite-dimensional ordinary differential equation as the number of hidden units tends to infinity. Further, under a convexity-like assumption on the Hamiltonian, any fixed point of this limit ODE is a solution of the original stochastic control problem. This provides a guarantee for the algorithm despite possible local minima in finite-width networks.
What carries the argument
The infinite-dimensional ODE that the actor-critic training dynamics converge to in the infinite-width limit, with fixed points that solve the HJB equation under the convexity-like assumption.
Load-bearing premise
The convexity-like assumption on the Hamiltonian is needed to guarantee that fixed points of the limiting ODE solve the original stochastic control problem.
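For orientation, a generic textbook form of a stationary HJB equation is sketched below. The paper's exact setting and the precise form of its convexity-like assumption are not reproduced on this page, so the symbols $b$, $\sigma$, $c$, $g$, $\Omega$ and the plain convexity condition in the last line are assumptions used only for illustration.

```latex
% Generic Hamiltonian for a controlled diffusion
% dX_t = b(X_t, u_t) dt + \sigma(X_t, u_t) dW_t with running cost c (assumed notation)
H(x, u, p, M) = b(x,u) \cdot p
              + \tfrac{1}{2} \operatorname{tr}\!\big( \sigma(x,u)\sigma(x,u)^{\top} M \big)
              + c(x,u)

% HJB equation for the value function V on a domain \Omega with boundary data g
\inf_{u} H\big(x, u, \nabla V(x), \nabla^2 V(x)\big) = 0 \quad \text{in } \Omega,
\qquad V = g \quad \text{on } \partial\Omega

% One simple convexity condition in the control (the paper's assumption may be weaker or differ)
u \mapsto H(x, u, p, M) \ \text{is convex for every fixed } (x, p, M)
```

Under plain convexity, any control at which the derivative of $H$ in $u$ vanishes is a global minimizer of $H$ in $u$. That is the kind of step that can upgrade a first-order stationarity condition at a fixed point to the infimum appearing in the HJB equation.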
What would settle it
A fixed point of the limiting ODE that fails to satisfy the HJB equation for a Hamiltonian that violates the convexity-like assumption.
Original abstract
We mathematically analyze and numerically study an actor-critic machine learning algorithm for solving high-dimensional Hamilton-Jacobi-Bellman (HJB) partial differential equations from stochastic control theory. The architecture of the critic (the estimator for the value function) is structured so that the boundary condition is always perfectly satisfied (rather than being included in the training loss) and utilizes a biased gradient which reduces computational cost. The actor (the estimator for the optimal control) is trained by minimizing the integral of the Hamiltonian over the domain, where the Hamiltonian is estimated using the critic. We show that the training dynamics of the actor and critic neural networks converge in a Sobolev-type space to a certain infinite-dimensional ordinary differential equation (ODE) as the number of hidden units in the actor and critic $\rightarrow \infty$. Further, under a convexity-like assumption on the Hamiltonian, we prove that any fixed point of this limit ODE is a solution of the original stochastic control problem. This provides an important guarantee for the algorithm's performance in light of the fact that finite-width neural networks may only converge to local minimizers (and not optimal solutions) due to the non-convexity of their loss functions. In our numerical studies, we demonstrate that the algorithm can solve stochastic control problems accurately in up to 200 dimensions. In particular, we construct a series of increasingly complex stochastic control problems with known analytic solutions and study the algorithm's numerical performance on them. These problems range from a linear-quadratic regulator equation to highly challenging equations with non-convex Hamiltonians, allowing us to identify and analyze the strengths and limitations of this neural actor-critic method for solving HJB equations.
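One common way to build a critic whose boundary condition holds by construction is to add the boundary data to a network output multiplied by a factor that vanishes on the boundary. The sketch below illustrates that idea only; the abstract does not say which construction the paper uses. The unit-ball domain, the boundary data `boundary_data`, the one-hidden-layer tanh network, and all function names are assumptions introduced for this example.

```python
import numpy as np

def init_network(dim, width, rng):
    """Random parameters for a one-hidden-layer tanh network (illustrative only)."""
    return {
        "W": rng.normal(size=(width, dim)),
        "b": rng.normal(size=width),
        "c": rng.normal(size=width) / np.sqrt(width),
    }

def network(params, x):
    """Unconstrained network output for a batch x of shape (n, dim)."""
    hidden = np.tanh(x @ params["W"].T + params["b"])
    return hidden @ params["c"]

def boundary_data(x):
    """Hypothetical boundary data g; here g(x) = |x|^2."""
    return np.sum(x**2, axis=1)

def critic(params, x):
    """V(x) = g(x) + (1 - |x|^2) * N(x), so V = g wherever |x| = 1."""
    distance_factor = 1.0 - np.sum(x**2, axis=1)
    return boundary_data(x) + distance_factor * network(params, x)

rng = np.random.default_rng(0)
params = init_network(dim=5, width=64, rng=rng)

# Sample points on the unit sphere by normalising Gaussian draws.
points = rng.normal(size=(1000, 5))
boundary_points = points / np.linalg.norm(points, axis=1, keepdims=True)

# The mismatch on the boundary is zero up to floating-point rounding,
# whatever the network parameters are.
max_violation = np.max(
    np.abs(critic(params, boundary_points) - boundary_data(boundary_points))
)
print(max_violation)
```

Because the constraint holds for every parameter value, no boundary term is needed in the training loss, which matches the design choice the abstract describes.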
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes a neural actor-critic algorithm for high-dimensional HJB PDEs from stochastic control. The critic network enforces boundary conditions exactly and employs a biased gradient for efficiency; the actor minimizes the integrated Hamiltonian estimated by the critic. The authors prove that, as the number of hidden units tends to infinity, the training dynamics of both networks converge in a Sobolev-type space to an infinite-dimensional ODE. Under a convexity-like assumption on the Hamiltonian, any fixed point of this ODE is shown to solve the original stochastic control problem. Numerical experiments demonstrate the method on a sequence of problems with known solutions, ranging from linear-quadratic regulators to non-convex Hamiltonian cases, in dimensions up to 200.
Significance. If the convergence result and the fixed-point theorem hold, the analysis supplies a mean-field justification for why actor-critic training can recover global solutions to HJB equations despite the non-convexity of finite-width losses. The explicit construction of the critic architecture and the use of the Hamiltonian integral as the actor objective are technically natural choices that align with the underlying control problem. The hierarchy of numerical test cases with analytic solutions provides concrete evidence of practical performance and helps delineate the method's strengths and limitations.
major comments (2)
- [Abstract and §4 (fixed-point analysis)] Abstract and fixed-point result: The convexity-like assumption on the Hamiltonian is invoked to guarantee that fixed points of the limit ODE solve the original HJB problem rather than merely satisfying a stationarity condition. The precise statement of this assumption (e.g., uniform convexity in the control variable for each fixed state) is not shown to hold for the non-convex Hamiltonian examples studied numerically, which the abstract describes as 'highly challenging.' This assumption is load-bearing for the theoretical guarantee yet remains unverified in the reported experiments.
- [§3 (asymptotic analysis)] Convergence theorem: The passage from finite-width actor-critic dynamics to the infinite-dimensional ODE in Sobolev space relies on a mean-field or NTK-style argument. The handling of the biased gradient in the critic and the precise function space in which the limit is taken should be accompanied by explicit error estimates or compactness arguments to confirm that the convergence is strong enough to pass to the fixed-point property.
minor comments (2)
- [Numerical studies] A table summarizing dimension, relative error, and wall-clock time for each test problem would improve readability of the numerical section.
- [§3] The precise definition of the Sobolev-type space used for the convergence statement should be recalled at the beginning of the asymptotic analysis section for self-contained reading.
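The relative error requested for such a table could be estimated by Monte Carlo sampling against the known analytic solution. The following is a minimal sketch under stated assumptions: the unit-ball domain, the sampler `sample_unit_ball`, and the stand-in functions `exact` and `approx` are hypothetical and are not taken from the paper.

```python
import numpy as np

def sample_unit_ball(n, dim, rng):
    """Draw n points uniformly from the unit ball in R^dim."""
    directions = rng.normal(size=(n, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = rng.uniform(size=(n, 1)) ** (1.0 / dim)
    return directions * radii

def relative_l2_error(approx, exact, samples):
    """Monte Carlo estimate of ||approx - exact||_2 / ||exact||_2."""
    exact_values = exact(samples)
    diff = approx(samples) - exact_values
    return np.sqrt(np.mean(diff**2) / np.mean(exact_values**2))

# Hypothetical stand-ins: an "exact" solution and an estimate that is 1% too large.
def exact(x):
    return np.sum(x**2, axis=1)

def approx(x):
    return 1.01 * np.sum(x**2, axis=1)

rng = np.random.default_rng(0)
for dim in (10, 50, 200):
    samples = sample_unit_ball(10_000, dim, rng)
    print(dim, relative_l2_error(approx, exact, samples))
```

In practice `approx` would be the trained critic and `exact` the analytic value function of the constructed test problem. Wall-clock time would be recorded separately around the training loop.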
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable feedback on our manuscript. We address each of the major comments in detail below and indicate the revisions we plan to make.
- Referee: [Abstract and §4 (fixed-point analysis)] Abstract and fixed-point result: The convexity-like assumption on the Hamiltonian is invoked to guarantee that fixed points of the limit ODE solve the original HJB problem rather than merely satisfying a stationarity condition. The precise statement of this assumption (e.g., uniform convexity in the control variable for each fixed state) is not shown to hold for the non-convex Hamiltonian examples studied numerically, which the abstract describes as 'highly challenging.' This assumption is load-bearing for the theoretical guarantee yet remains unverified in the reported experiments.
  Authors: We appreciate this observation. The convexity-like assumption is necessary for the fixed-point result to imply that the limit solves the stochastic control problem. In the numerical section, the non-convex Hamiltonian cases are included precisely to test the algorithm in regimes where this assumption may not hold, and we present them as challenging examples where empirical success is observed despite the lack of theoretical guarantee. We will revise the abstract to better distinguish between the theoretical results (under the assumption) and the numerical experiments (which include cases outside the assumption). Additionally, we will add a sentence in §4 clarifying that the fixed-point theorem does not apply when the assumption is violated, and discuss potential reasons for empirical performance in such cases.
  Revision: partial
- Referee: [§3 (asymptotic analysis)] Convergence theorem: The passage from finite-width actor-critic dynamics to the infinite-dimensional ODE in Sobolev space relies on a mean-field or NTK-style argument. The handling of the biased gradient in the critic and the precise function space in which the limit is taken should be accompanied by explicit error estimates or compactness arguments to confirm that the convergence is strong enough to pass to the fixed-point property.
  Authors: We agree that the convergence analysis can be strengthened with more details. The proof in §3 establishes convergence in a Sobolev-type space using a mean-field limit approach, and we believe the arguments are sufficient to pass to the fixed points. However, to address the concern, we will include additional explanations regarding the handling of the biased gradient and a compactness argument to justify the limit passage. Full quantitative error estimates between the finite-width dynamics and the infinite-dimensional ODE are technically involved and may be left for future work, but we will provide a more explicit sketch of the key steps.
  Revision: partial
Circularity Check
No significant circularity: convergence to limit ODE derived independently; fixed-point result relies on external convexity assumption
The paper derives the convergence of finite-width actor-critic training dynamics to an infinite-dimensional ODE in Sobolev space as hidden units tend to infinity, using asymptotic analysis (likely mean-field or NTK-type arguments). This limit ODE is obtained directly from the network dynamics rather than being presupposed. Separately, the claim that fixed points of the ODE solve the original HJB problem invokes an explicit convexity-like assumption on the Hamiltonian, which is stated as an external hypothesis and is not obtained by fitting, renaming, or self-referential definition within the paper. No load-bearing step reduces by construction to a fitted parameter, prior self-citation chain, or ansatz smuggled from the authors' own work. The numerical studies on non-convex Hamiltonians are presented as empirical validation separate from the theoretical guarantees. The derivation chain is therefore self-contained against external benchmarks and does not exhibit the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Convexity-like assumption on the Hamiltonian
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "under a convexity-like assumption on the Hamiltonian, we prove that any fixed point of this limit ODE is a solution of the original stochastic control problem"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat recovery
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "the training dynamics ... converge ... to a certain infinite-dimensional ordinary differential equation (ODE) as the number of hidden units → ∞"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.