
arXiv: 2507.06428 · v2 · pith:6PHFWAB4new · submitted 2025-07-08 · 🧮 math.OC · cs.LG · cs.NA · math.NA · stat.ML

Neural Actor-Critic Methods for Hamilton-Jacobi-Bellman PDEs: Asymptotic Analysis and Numerical Studies

Pith reviewed 2026-05-21 23:25 UTC · model grok-4.3

classification 🧮 math.OC · cs.LG · cs.NA · math.NA · stat.ML
keywords actor-critic neural networks · Hamilton-Jacobi-Bellman equations · stochastic control · infinite-width limit · Sobolev convergence · high-dimensional PDEs · asymptotic analysis

The pith

As the number of hidden units tends to infinity, the training dynamics of actor and critic neural networks for HJB equations converge in a Sobolev-type space to an infinite-dimensional ODE whose fixed points solve the stochastic control problem under a convexity-like assumption.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes a neural actor-critic algorithm for solving high-dimensional Hamilton-Jacobi-Bellman equations from stochastic control. The critic is built to satisfy the boundary condition exactly and uses a biased gradient to cut cost, while the actor minimizes the integrated Hamiltonian estimated by the critic. The authors prove that training dynamics converge in a Sobolev-type space to a limiting infinite-dimensional ODE as hidden units go to infinity. Under a convexity-like assumption on the Hamiltonian, any fixed point of this ODE solves the original control problem. Numerical tests show the method handles problems up to 200 dimensions, including those with non-convex Hamiltonians.
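
One common way to make a network satisfy a Dirichlet boundary condition exactly, rather than penalizing it in the loss, is to add the boundary data to a network output multiplied by a function that vanishes on the boundary. The sketch below illustrates that idea on the unit ball. It is not taken from the paper, whose exact critic architecture is not given here. The names `make_network`, `boundary_data`, `cutoff`, and `make_critic`, the boundary values g(x) = 1 + x_1, the cutoff 1 - |x|^2, and the 1/width scaling are all illustrative assumptions.

```python
import math
import random


def make_network(dim, width, seed=0):
    """Random one-hidden-layer tanh network (hypothetical, 1/width scaling assumed)."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(width)]
    biases = [rng.gauss(0.0, 1.0) for _ in range(width)]
    outer = [rng.uniform(-1.0, 1.0) for _ in range(width)]

    def network(x):
        total = 0.0
        for w, b, c in zip(weights, biases, outer):
            pre = sum(wj * xj for wj, xj in zip(w, x)) + b
            total += c * math.tanh(pre)
        return total / width

    return network


def boundary_data(x):
    """Hypothetical boundary values g(x), defined on the whole closed ball."""
    return 1.0 + x[0]


def cutoff(x):
    """eta(x) = 1 - |x|^2: positive inside the unit ball, zero on its boundary."""
    return 1.0 - sum(xj * xj for xj in x)


def make_critic(network):
    """Critic Q(x) = g(x) + eta(x) * N(x), so Q = g on the boundary for any weights."""
    def critic(x):
        return boundary_data(x) + cutoff(x) * network(x)
    return critic


critic = make_critic(make_network(dim=3, width=64))
boundary_point = [0.0, 1.0, 0.0]
interior_point = [0.1, -0.2, 0.3]
print(critic(boundary_point) - boundary_data(boundary_point))  # boundary mismatch
print(critic(interior_point) - boundary_data(interior_point))  # free network part
```

Because the cutoff is zero on the sphere, the first printed difference does not depend on the network weights, so training can only change the critic in the interior of the domain.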

Core claim

We show that the training dynamics of the actor and critic neural networks converge in a Sobolev-type space to a certain infinite-dimensional ordinary differential equation as the number of hidden units tends to infinity. Further, under a convexity-like assumption on the Hamiltonian, any fixed point of this limit ODE is a solution of the original stochastic control problem. This provides a guarantee for the algorithm despite possible local minima in finite-width networks.

What carries the argument

The infinite-dimensional ODE that the actor-critic training dynamics converge to in the infinite-width limit, with fixed points that solve the HJB equation under the convexity-like assumption.

Load-bearing premise

The convexity-like assumption on the Hamiltonian is needed to guarantee that fixed points of the limiting ODE solve the original stochastic control problem.
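
A toy scalar example can illustrate why such an assumption matters for the actor. Assuming the actor update behaves like gradient descent on the Hamiltonian in the action variable at a fixed state, a convex Hamiltonian has a single stationary point, while a non-convex one can trap the iteration in a local minimum. The paper's convexity-like assumption is not stated precisely here, so plain convexity is used as a stand-in. Both Hamiltonians below, the value `U_STAR`, and the helper `gradient_descent` are hypothetical and are not taken from the paper.

```python
def gradient_descent(grad, a0, lr=0.01, steps=5000):
    """Plain gradient descent on a scalar action a."""
    a = a0
    for _ in range(steps):
        a -= lr * grad(a)
    return a


U_STAR = 0.5  # hypothetical optimal action at one fixed state


def convex_h(a):
    """Convex toy Hamiltonian with unique minimizer U_STAR."""
    return (a - U_STAR) ** 2


def convex_grad(a):
    return 2.0 * (a - U_STAR)


def nonconvex_h(a):
    """Tilted double well: two local minima, only the left one is global."""
    return (a * a - 1.0) ** 2 + 0.3 * a


def nonconvex_grad(a):
    return 4.0 * a * (a * a - 1.0) + 0.3


# Same algorithm, two starting points per Hamiltonian.
convex_limits = [gradient_descent(convex_grad, a0) for a0 in (-3.0, 3.0)]
nonconvex_limits = [gradient_descent(nonconvex_grad, a0) for a0 in (-2.0, 2.0)]

print(convex_limits)
print(nonconvex_limits, [nonconvex_h(a) for a in nonconvex_limits])
```

Both convex runs end at `U_STAR`, whereas the two non-convex runs settle in different wells with different Hamiltonian values, so a stationary point alone does not certify optimality in that case.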

What would settle it

A fixed point of the limiting ODE that fails to satisfy the HJB equation for a Hamiltonian that violates the convexity-like assumption.

Figures

Figures reproduced from arXiv: 2507.06428 by Deqing Jiang, Jackson Hebner, Justin Sirignano, Samuel N. Cohen.

Figure 1: The actor and critic approach their true values quickly in Problem 1.
Figure 2: Problem 2 when ζ is replaced with ζ*(x, a) = 100 log cosh(a − u*(x)) and Ω = B(0, 1).
Figure 3: Convergence is monotonic for Problem 2B.
Figure 4: Performance on Problem 3 in three distinct regimes.
Figure 5: Convergence is rapid for Problem 4.
Problem 4 setup (Section 4.2.5 of the paper): We consider the setup where the domain is $\Omega = [-1, 1]^{10}$, the action space is $A = \mathbb{R}^{10}$, the dimension of the Brownian motion is $d' = 10$, the discounting rate is $\gamma = 1$, and $V(x) = 1 + \prod_{i=1}^{10} \left(1 - \sin\left(\frac{\pi x_i^2}{2}\right)^2\right)$, $u^*_i(x) = x_i \left(1 + \prod_{j=1}^{10} x_j\right)$, $b_i(x, a) = a_i x_i + \|x\|^2$, $\Phi(x, a) = \mathrm{Id}_{10 \times 10} \left(1 + \frac{\|a\|^2}{10}\right)$, $\zeta(x, a) = \|a - u^*(x)\|^2$. Algorithm 1…
Figure 6: Convergence is noisy for Problem 5.
Section 5, Proofs of Results: In this final section, we assume without loss of generality that $N = N^*$ and denote $Q^N_t := Q^N_{\phi_t}$ and $U^N_t := U^N_{\theta_t}$. We also fix an arbitrary training time bound $T > 0$. The constants $C > 0$ may vary from line to line (even within the same chain of inequalities) and depend on $T$, but are always independent of $N$. For simplicity, the proof is done with a …
Figure 7: Actor-critic disagreement metrics for Problem 1.
Figure 8: Actor-critic disagreement metrics for Problem 2A (…)
Figure 9: Actor-critic disagreement metrics for Problem 2A (…)
Figure 10: Actor-critic disagreement metrics for Problem 2B.
Figure 11: Actor-critic disagreement metrics for Problem 3.
Figure 12: Actor-critic disagreement metrics for Problem 4.
Figure 13: Actor-critic disagreement metrics for Problem 5.

Original abstract

We mathematically analyze and numerically study an actor-critic machine learning algorithm for solving high-dimensional Hamilton-Jacobi-Bellman (HJB) partial differential equations from stochastic control theory. The architecture of the critic (the estimator for the value function) is structured so that the boundary condition is always perfectly satisfied (rather than being included in the training loss) and utilizes a biased gradient which reduces computational cost. The actor (the estimator for the optimal control) is trained by minimizing the integral of the Hamiltonian over the domain, where the Hamiltonian is estimated using the critic. We show that the training dynamics of the actor and critic neural networks converge in a Sobolev-type space to a certain infinite-dimensional ordinary differential equation (ODE) as the number of hidden units in the actor and critic $\rightarrow \infty$. Further, under a convexity-like assumption on the Hamiltonian, we prove that any fixed point of this limit ODE is a solution of the original stochastic control problem. This provides an important guarantee for the algorithm's performance in light of the fact that finite-width neural networks may only converge to local minimizers (and not optimal solutions) due to the non-convexity of their loss functions. In our numerical studies, we demonstrate that the algorithm can solve stochastic control problems accurately in up to 200 dimensions. In particular, we construct a series of increasingly complex stochastic control problems with known analytic solutions and study the algorithm's numerical performance on them. These problems range from a linear-quadratic regulator equation to highly challenging equations with non-convex Hamiltonians, allowing us to identify and analyze the strengths and limitations of this neural actor-critic method for solving HJB equations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes a neural actor-critic algorithm for high-dimensional HJB PDEs from stochastic control. The critic network enforces boundary conditions exactly and employs a biased gradient for efficiency; the actor minimizes the integrated Hamiltonian estimated by the critic. The authors prove that, as the number of hidden units tends to infinity, the training dynamics of both networks converge in a Sobolev-type space to an infinite-dimensional ODE. Under a convexity-like assumption on the Hamiltonian, any fixed point of this ODE is shown to solve the original stochastic control problem. Numerical experiments demonstrate the method on a sequence of problems with known solutions, ranging from linear-quadratic regulators to non-convex Hamiltonian cases, in dimensions up to 200.

Significance. If the convergence result and the fixed-point theorem hold, the analysis supplies a mean-field justification for why actor-critic training can recover global solutions to HJB equations despite the non-convexity of finite-width losses. The explicit construction of the critic architecture and the use of the Hamiltonian integral as the actor objective are technically natural choices that align with the underlying control problem. The hierarchy of numerical test cases with analytic solutions provides concrete evidence of practical performance and helps delineate the method's strengths and limitations.

major comments (2)
  1. [Abstract and §4 (fixed-point analysis)] Abstract and fixed-point result: The convexity-like assumption on the Hamiltonian is invoked to guarantee that fixed points of the limit ODE solve the original HJB problem rather than merely satisfying a stationarity condition. The precise statement of this assumption (e.g., uniform convexity in the control variable for each fixed state) is not shown to hold for the non-convex Hamiltonian examples studied numerically, which the abstract describes as 'highly challenging.' This assumption is load-bearing for the theoretical guarantee yet remains unverified in the reported experiments.
  2. [§3 (asymptotic analysis)] Convergence theorem: The passage from finite-width actor-critic dynamics to the infinite-dimensional ODE in Sobolev space relies on a mean-field or NTK-style argument. The handling of the biased gradient in the critic and the precise function space in which the limit is taken should be accompanied by explicit error estimates or compactness arguments to confirm that the convergence is strong enough to pass to the fixed-point property.
minor comments (2)
  1. [Numerical studies] A table summarizing dimension, relative error, and wall-clock time for each test problem would improve readability of the numerical section.
  2. [§3] The precise definition of the Sobolev-type space used for the convergence statement should be recalled at the beginning of the asymptotic analysis section for self-contained reading.
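
For the table suggested in minor comment 1, a relative error against a known analytic solution could be estimated by Monte Carlo sampling. The sketch below is one hedged way to do this, assuming a cube domain [-1, 1]^d and uniform sampling. The names `relative_l2_error`, `exact_value`, and `perturbed_value` are hypothetical, and the reference function is a stand-in rather than one of the paper's test problems.

```python
import math
import random


def relative_l2_error(approx, exact, dim, samples=10000, seed=0):
    """Monte Carlo estimate of ||approx - exact|| / ||exact|| in L2 on [-1, 1]^dim."""
    rng = random.Random(seed)
    err_sq = 0.0
    ref_sq = 0.0
    for _ in range(samples):
        x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        reference = exact(x)
        err_sq += (approx(x) - reference) ** 2
        ref_sq += reference ** 2
    return math.sqrt(err_sq / ref_sq)


def exact_value(x):
    """Hypothetical reference solution, standing in for a known analytic V."""
    return 1.0 + sum(xj * xj for xj in x)


def perturbed_value(x):
    """Hypothetical approximation: the reference solution scaled by 1.01."""
    return 1.01 * exact_value(x)


for dim in (2, 10, 50):
    print(dim, relative_l2_error(perturbed_value, exact_value, dim, samples=2000))
```

Since the stand-in approximation is the reference scaled by 1.01, the estimate should come out close to 0.01 in every dimension, which gives a quick sanity check before plugging in a trained critic.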

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable feedback on our manuscript. We address each of the major comments in detail below and indicate the revisions we plan to make.

point-by-point responses
  1. Referee: [Abstract and §4 (fixed-point analysis)] Abstract and fixed-point result: The convexity-like assumption on the Hamiltonian is invoked to guarantee that fixed points of the limit ODE solve the original HJB problem rather than merely satisfying a stationarity condition. The precise statement of this assumption (e.g., uniform convexity in the control variable for each fixed state) is not shown to hold for the non-convex Hamiltonian examples studied numerically, which the abstract describes as 'highly challenging.' This assumption is load-bearing for the theoretical guarantee yet remains unverified in the reported experiments.

    Authors: We appreciate this observation. The convexity-like assumption is necessary for the fixed-point result to imply that the limit solves the stochastic control problem. In the numerical section, the non-convex Hamiltonian cases are included precisely to test the algorithm in regimes where this assumption may not hold, and we present them as challenging examples where empirical success is observed despite the lack of theoretical guarantee. We will revise the abstract to better distinguish between the theoretical results (under the assumption) and the numerical experiments (which include cases outside the assumption). Additionally, we will add a sentence in §4 clarifying that the fixed-point theorem does not apply when the assumption is violated, and discuss potential reasons for empirical performance in such cases. revision: partial

  2. Referee: [§3 (asymptotic analysis)] Convergence theorem: The passage from finite-width actor-critic dynamics to the infinite-dimensional ODE in Sobolev space relies on a mean-field or NTK-style argument. The handling of the biased gradient in the critic and the precise function space in which the limit is taken should be accompanied by explicit error estimates or compactness arguments to confirm that the convergence is strong enough to pass to the fixed-point property.

    Authors: We agree that the convergence analysis can be strengthened with more details. The proof in §3 establishes convergence in a Sobolev-type space using a mean-field limit approach, and we believe the arguments are sufficient to pass to the fixed points. However, to address the concern, we will include additional explanations regarding the handling of the biased gradient and a compactness argument to justify the limit passage. Full quantitative error estimates between the finite-width dynamics and the infinite-dimensional ODE are technically involved and may be left for future work, but we will provide a more explicit sketch of the key steps. revision: partial

Circularity Check

0 steps flagged

No significant circularity: convergence to limit ODE derived independently; fixed-point result relies on external convexity assumption

full rationale

The paper derives the convergence of finite-width actor-critic training dynamics to an infinite-dimensional ODE in Sobolev space as hidden units tend to infinity, using asymptotic analysis (likely mean-field or NTK-type arguments). This limit ODE is obtained directly from the network dynamics rather than being presupposed. Separately, the claim that fixed points of the ODE solve the original HJB problem invokes an explicit convexity-like assumption on the Hamiltonian, which is stated as an external hypothesis and is not obtained by fitting, renaming, or self-referential definition within the paper. No load-bearing step reduces by construction to a fitted parameter, prior self-citation chain, or ansatz smuggled from the authors' own work. The numerical studies on non-convex Hamiltonians are presented as empirical validation separate from the theoretical guarantees. The derivation chain is therefore self-contained against external benchmarks and does not exhibit the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the convexity-like assumption for the fixed-point result and on the specific neural architectures whose details are only sketched in the abstract. No free parameters or new entities are introduced in the provided text.

axioms (1)
  • domain assumption: Convexity-like assumption on the Hamiltonian
    Required to prove that fixed points of the limiting ODE solve the original stochastic control problem.

pith-pipeline@v0.9.0 · 5859 in / 1482 out tokens · 47885 ms · 2026-05-21T23:25:38.050777+00:00 · methodology


