pith. sign in

arxiv: 2604.26782 · v2 · pith:7PBBOKTZnew · submitted 2026-04-29 · 🧮 math.NA · cs.NA

Deep Policy Iteration for High-Dimensional Mean-Field Games with Regenerative Reformulation

Pith reviewed 2026-05-19 17:10 UTC · model grok-4.3

classification 🧮 math.NA cs.NA
keywords mean-field gamespolicy iterationdeep learninghigh-dimensional problemsregenerative reformulationparticle systemsEuler-Maruyama discretizationnumerical methods
0
0 comments X

The pith

By reformulating mean-field games into regenerative problems with deterministic cycles, deep policy iteration becomes efficient and scalable in dimensions up to 10,000.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a deep policy iteration algorithm for high-dimensional finite-horizon mean-field games by introducing a regenerative reformulation with deterministic cycles. This structure permits policy evaluation, policy improvement, and estimation of the population measure to occur sequentially cycle by cycle rather than over the full horizon. The population is approximated with particles that are advanced using one-step random mappings derived from Euler-Maruyama discretization, which transports mini-batches forward without repeated full simulations. Adversarial training handles evaluation while averaged optimization does improvement. Readers should care because standard approaches to mean-field games break down in high dimensions due to the need to solve large coupled systems or simulate long trajectories repeatedly.

Core claim

The authors claim that the mean-field game can be recast as a regenerative problem with deterministic cycles. Within this setup, the population measure is tracked by a particle system whose states are updated from one cycle to the next by a single random mapping coming from the Euler-Maruyama scheme applied to the controlled dynamics. Policy evaluation and improvement are then defined through the relations that hold between consecutive cycles, with the former solved via adversarial training and the latter via averaged optimization. The resulting procedure sidesteps the coupled Hamilton-Jacobi-Bellman and Fokker-Planck equations, avoids simulating entire trajectories at every iteration, disp

What carries the argument

Regenerative reformulation with deterministic cycles, which decomposes the game so that updates to the population measure and policy steps can be performed using one-step particle mappings between cycles.

If this is right

  • The method avoids direct solution of the coupled Hamilton-Jacobi-Bellman and Fokker-Planck system.
  • It avoids the full simulation of trajectories to estimate the population measure at each iteration.
  • It avoids the explicit computation of conditional expectations in policy evaluation.
  • It avoids pointwise optimization in policy improvement.
  • Numerical experiments show effective performance in dimensions up to 10,000.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This cycle-based particle update could reduce computational cost in other stochastic control problems involving large populations.
  • Extending the regenerative structure to infinite-horizon settings might require defining appropriate cycle lengths based on ergodicity assumptions.
  • The use of mini-batch particle transport suggests potential for parallelization on modern hardware.

Load-bearing premise

The mean-field game must admit a reformulation as a regenerative problem with deterministic cycles so that all subproblems can be solved accurately using cycle-by-cycle particle approximations from the Euler-Maruyama discretization.

What would settle it

Observing that the approximated population measures diverge from the true distribution or that the learned policies fail to satisfy the mean-field equilibrium condition as dimension increases beyond 1,000 would falsify the scalability of the method.

Figures

Figures reproduced from arXiv: 2604.26782 by Hui Zhang, Shuixin Fang, Shupeng Wang, Tao Zhou, Zhen Wu.

Figure 1
Figure 1. Figure 1: Numerical results of Algorithm 1 for LQ-1, -2, and -3 in section 4.1 with view at source ↗
Figure 2
Figure 2. Figure 2: Numerical results of Algorithm 1 for LQ-1, -2, and -3 in section 4.1 with view at source ↗
Figure 3
Figure 3. Figure 3: Numerical results of Algorithm 1 for LQ-1 in section 4.1 with view at source ↗
Figure 4
Figure 4. Figure 4: 18 view at source ↗
Figure 4
Figure 4. Figure 4: Results of Algorithm 1 for the MFG in section 4.2. (Upper left) Loss versus view at source ↗
Figure 5
Figure 5. Figure 5: Results of Algorithm 1 for the MFG in section 4.3. (Upper left) Loss versus view at source ↗
read the original abstract

This paper develops a deep policy iteration method for high-dimensional finite-horizon mean-field games (MFG). We reformulate the game as a regenerative problem with deterministic cycles, which allows policy evaluation (PE), policy improvement (PI), and population measure estimation to be carried out cycle by cycle. Within this formulation, we approximate the population measure by a particle system and update it using a one-step random mapping induced by the Euler-Maruyama discretization of the state dynamics. This update transports a mini-batch of particles from one cycle to the next, avoiding sequential trajectory simulation over the entire time horizon at each iteration. The PE and PI subproblems are formulated through the relation between consecutive cycles, with adversarial training used for evaluation and averaged optimization used for improvement. The resulting method is efficient and scalable in high dimensions, as it avoids the direct solution of the coupled Hamilton-Jacobi-Bellman and Fokker-Planck system, the full simulation of trajectories to estimate the population measure, the explicit computation of conditional expectations in policy evaluation, and pointwise optimization in policy improvement. Numerical experiments demonstrate that the proposed method effectively handles dimensions up to 10,000.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. This manuscript develops a deep policy iteration algorithm for high-dimensional finite-horizon mean-field games. The central contribution is a regenerative reformulation of the MFG as a problem with deterministic cycles, which permits cycle-by-cycle policy evaluation (via adversarial training), policy improvement (via averaged optimization), and population-measure estimation (via a particle system updated by one-step Euler-Maruyama random mappings). The method is asserted to avoid direct solution of the coupled HJB-FP system, full-trajectory simulation, explicit conditional expectations, and pointwise optimization, with numerical results reported for state dimensions up to 10,000.

Significance. If the regenerative reformulation is rigorously equivalent to the original finite-horizon MFG and the particle and neural approximations converge at controllable rates, the approach would constitute a meaningful advance for scalable numerical solution of high-dimensional MFGs. The explicit avoidance of several standard computational bottlenecks and the reported ability to reach d=10,000 are concrete strengths that, if substantiated, could influence subsequent work on mean-field control and games.

major comments (3)
  1. [§2] §2 (Regenerative reformulation): The manuscript introduces the deterministic-cycle reformulation and states that PE/PI are formulated 'through the relation between consecutive cycles,' yet provides neither a derivation establishing exact equivalence to the original finite-horizon MFG nor an error bound quantifying the bias introduced by a fixed cycle length. Because the central scalability claim rests on solving the true mean-field Nash equilibrium rather than an altered problem, this equivalence must be proved or the approximation error controlled.
  2. [§3.2] §3.2 (One-step particle update): The population measure is transported by a single Euler-Maruyama step per cycle. No global error analysis or stability estimate is given for the accumulated local truncation error over many cycles, especially when the drift or diffusion coefficients are state-dependent. This directly affects the reliability of the measure approximation that underpins both the policy-evaluation and policy-improvement steps.
  3. [§4] §4 (Numerical experiments): Results are presented for dimensions up to 10,000, but the experiments section supplies neither quantitative error metrics against known low-dimensional solutions nor comparisons with existing MFG solvers. Without such validation, the claim that the method 'effectively handles' these dimensions remains difficult to assess.
minor comments (2)
  1. The notation for cycle length and the precise definition of the 'one-step random mapping' should be introduced once and used consistently; occasional redefinition in later sections reduces readability.
  2. Figure captions for the particle-transport diagrams would benefit from explicit mention of the mini-batch size and the Euler-Maruyama step size employed.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough review and valuable suggestions. We will address each of the major comments in detail below and make the necessary revisions to the manuscript.

read point-by-point responses
  1. Referee: [§2] §2 (Regenerative reformulation): The manuscript introduces the deterministic-cycle reformulation and states that PE/PI are formulated 'through the relation between consecutive cycles,' yet provides neither a derivation establishing exact equivalence to the original finite-horizon MFG nor an error bound quantifying the bias introduced by a fixed cycle length. Because the central scalability claim rests on solving the true mean-field Nash equilibrium rather than an altered problem, this equivalence must be proved or the approximation error controlled.

    Authors: We agree with the referee that establishing the equivalence rigorously is crucial. In the revised manuscript, we will expand §2 to include a complete derivation of the regenerative reformulation, demonstrating its exact equivalence to the original finite-horizon MFG under the deterministic cycle structure. We will also derive an error bound for the approximation error induced by a fixed cycle length, showing that this bias can be made arbitrarily small by appropriate selection of the cycle length relative to the time horizon. This will confirm that the method targets the true mean-field Nash equilibrium. revision: yes

  2. Referee: [§3.2] §3.2 (One-step particle update): The population measure is transported by a single Euler-Maruyama step per cycle. No global error analysis or stability estimate is given for the accumulated local truncation error over many cycles, especially when the drift or diffusion coefficients are state-dependent. This directly affects the reliability of the measure approximation that underpins both the policy-evaluation and policy-improvement steps.

    Authors: The referee is correct that a global error analysis is currently missing. We will revise §3.2 to incorporate a detailed stability estimate and global error bound for the accumulated truncation errors over the cycles. Drawing on numerical analysis for SDEs, we will bound the error in the particle system approximation of the population measure, taking into account state-dependent coefficients. This addition will provide the necessary guarantees for the accuracy of the measure estimates used in the PE and PI procedures. revision: yes

  3. Referee: [§4] §4 (Numerical experiments): Results are presented for dimensions up to 10,000, but the experiments section supplies neither quantitative error metrics against known low-dimensional solutions nor comparisons with existing MFG solvers. Without such validation, the claim that the method 'effectively handles' these dimensions remains difficult to assess.

    Authors: We appreciate this observation and will enhance the numerical experiments section. In the revision, we will add quantitative error metrics, including comparisons to analytical or high-accuracy reference solutions in low-dimensional settings (such as d ≤ 5). We will also provide benchmark comparisons against other state-of-the-art MFG solvers, including neural network-based methods and traditional discretization approaches, to highlight the scalability and performance advantages of our method in high dimensions up to 10,000. revision: yes

Circularity Check

0 steps flagged

No significant circularity; algorithmic reformulation is self-contained

full rationale

The paper proposes a deep policy iteration algorithm for finite-horizon MFGs by introducing a regenerative reformulation with deterministic cycles, particle approximations, and one-step Euler-Maruyama updates for population measure transport. Policy evaluation uses adversarial training and policy improvement uses averaged optimization, both formulated via consecutive-cycle relations. No equations or steps are presented that reduce the claimed scalability or equilibrium approximation to fitted parameters, self-definitions, or load-bearing self-citations by construction. The derivation chain consists of standard discretization and approximation techniques applied to the reformulated problem, with numerical validation in dimensions up to 10,000 serving as external check. This qualifies as an independent algorithmic construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract, the ledger captures the core modeling assumption; standard numerical tools like Euler-Maruyama are not counted as invented here.

axioms (1)
  • domain assumption The mean-field game admits a regenerative reformulation with deterministic cycles that preserves the original dynamics for cycle-by-cycle policy evaluation and improvement.
    This premise enables the avoidance of full-horizon simulation and is invoked to justify the particle transport and subproblem formulations.

pith-pipeline@v0.9.0 · 5740 in / 1367 out tokens · 53706 ms · 2026-05-19T17:10:15.247911+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages

  1. [1]

    Mean field games for modeling crowd motion

    Yves Achdou and Jean-Michel Lasry. Mean field games for modeling crowd motion. In Contributions to partial differential equations and applications, volume 47 ofComput. Methods Appl. Sci., pages 17–42. Springer, Cham, 2019

  2. [2]

    Extensions of the deep Galerkin method.Appl

    Ali Al-Aradi, Adolfo Correia, Gabriel Jardim, Danilo de Freitas Naiff, and Yuri Saporito. Extensions of the deep Galerkin method.Appl. Math. Comput., 430:Paper No. 127287, 18, 2022

  3. [3]

    A maximum principle for SDEs of mean-field type.Appl

    Daniel Andersson and Boualem Djehiche. A maximum principle for SDEs of mean-field type.Appl. Math. Optim., 63(3):341–356, 2011

  4. [4]

    SpringerBriefs in Mathematics

    Alain Bensoussan, Jens Frehse, and Phillip Yam.Mean field games and mean field type control theory. SpringerBriefs in Mathematics. Springer, New York, 2013

  5. [5]

    Mean field control and mean field game models with several populations.Minimax Theory Appl., 3(2):173–209, 2018

    Alain Bensoussan, Tao Huang, and Mathieu Lauri` ere. Mean field control and mean field game models with several populations.Minimax Theory Appl., 3(2):173–209, 2018. 23

  6. [6]

    and Zhou, T

    Wei Cai, Shuixin Fang, Wenzhong Zhang, and Tao Zhou. Martingale deep learning for very high dimensional quasi-linear partial differential equations and stochastic optimal controls.arXiv preprint arXiv:2408.14395, 2024

  7. [7]

    SOC-MartNet: A martingale neural network for the hamilton-jacobi-bellman equation without explicit inf u∈U Hin stochastic optimal controls.SIAM J

    Wei Cai, Shuixin Fang, and Tao Zhou. SOC-MartNet: A martingale neural network for the hamilton-jacobi-bellman equation without explicit inf u∈U Hin stochastic optimal controls.SIAM J. Sci. Comput., 47(4):C795–C819, 2025

  8. [8]

    Deep random difference method for high- dimensional quasilinear parabolic partial differential equations.J

    Wei Cai, Shuixin Fang, and Tao Zhou. Deep random difference method for high- dimensional quasilinear parabolic partial differential equations.J. Comput. Phys., page 114767, 2026

  9. [9]

    DeepMartNet: a Martingale-based deep neural network learning method for Dirichlet BVPs and eigenvalue problems of elliptic PDEs inR d.SIAM J

    Wei Cai, Andrew He, and Daniel Margolis. DeepMartNet: a Martingale-based deep neural network learning method for Dirichlet BVPs and eigenvalue problems of elliptic PDEs inR d.SIAM J. Sci. Comput., 48(1):C25–C50, 2026

  10. [10]

    Cardaliaguet, J.-M

    P. Cardaliaguet, J.-M. Lasry, P.-L. Lions, and A. Porretta. Long time average of mean field games with a nonlocal coupling.SIAM J. Control Optim., 51(5):3558–3591, 2013

  11. [11]

    Notes on mean field games

    Pierre Cardaliaguet. Notes on mean field games. Technical report, Technical report Technical report, 2010

  12. [12]

    I, volume 83 ofProbability Theory and Stochastic Modelling

    Ren´ e Carmona and Fran¸ cois Delarue.Probabilistic theory of mean field games with applications. I, volume 83 ofProbability Theory and Stochastic Modelling. Springer, Cham, 2018. Mean field FBSDEs, control, and games

  13. [13]

    Mean field games and systemic risk.Commun

    Ren´ e Carmona, Jean-Pierre Fouque, and Li-Hsien Sun. Mean field games and systemic risk.Commun. Math. Sci., 13(4):911–933, 2015

  14. [14]

    A probabilistic weak formulation of mean field games and applications.Ann

    Ren´ e Carmona and Daniel Lacker. A probabilistic weak formulation of mean field games and applications.Ann. Appl. Probab., 25(3):1189–1231, 2015

  15. [15]

    Discrete time mean-field stochastic linear- quadratic optimal control problems.Automatica J

    Robert Elliott, Xun Li, and Yuan-Hua Ni. Discrete time mean-field stochastic linear- quadratic optimal control problems.Automatica J. IFAC, 49(11):3222–3233, 2013

  16. [16]

    Failure-informed adaptive sampling for PINNs

    Zhiwei Gao, Liang Yan, and Tao Zhou. Failure-informed adaptive sampling for PINNs. SIAM J. Sci. Comput., 45(4):A1971–A1994, 2023

  17. [17]

    Large deviations for a mean field model of systemic risk.SIAM J

    Josselin Garnier, George Papanicolaou, and Tzu-Wei Yang. Large deviations for a mean field model of systemic risk.SIAM J. Financial Math., 4(1):151–184, 2013

  18. [18]

    Approximation error analysis of some deep backward schemes for nonlinear PDEs.SIAM J

    Maximilien Germain, Huyˆ en Pham, and Xavier Warin. Approximation error analysis of some deep backward schemes for nonlinear PDEs.SIAM J. Sci. Comput., 44(1):A28– A56, 2022. 24

  19. [19]

    Solving high-dimensional partial differen- tial equations using deep learning.Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018

    Jiequn Han, Arnulf Jentzen, and Weinan E. Solving high-dimensional partial differen- tial equations using deep learning.Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018

  20. [20]

    Learning physics-informed neural networks without stacked back- propagation

    Di He, Shanda Li, Wenlei Shi, Xiaotian Gao, Jia Zhang, Jiang Bian, Liwei Wang, and Tie-Yan Liu. Learning physics-informed neural networks without stacked back- propagation. In Francisco Ruiz, Jennifer Dy, and Jan-Willem van de Meent, editors, Proceedings of The 26th International Conference on Artificial Intelligence and Statis- tics, volume 206 ofProceed...

  21. [21]

    Hutchinson trace estimation for high-dimensional and high-order physics-informed neural networks

    Zheyuan Hu, Zekun Shi, George Em Karniadakis, and Kenji Kawaguchi. Hutchinson trace estimation for high-dimensional and high-order physics-informed neural networks. Comput. Methods Appl. Mech. Engrg., 424:Paper No. 116883, 17, 2024

  22. [22]

    Tackling the curse of dimensionality with physics-informed neural networks.Neural Networks, 176:106369, 2024

    Zheyuan Hu, Khemraj Shukla, George Em Karniadakis, and Kenji Kawaguchi. Tackling the curse of dimensionality with physics-informed neural networks.Neural Networks, 176:106369, 2024

  23. [23]

    Karniadakis, and Kenji Kawaguchi

    Zheyuan Hu, Zhouhao Yang, Yezhen Wang, George E. Karniadakis, and Kenji Kawaguchi. Bias-Variance Trade-Off in Physics-Informed Neural Networks with Ran- domized Smoothing for High-Dimensional PDEs.SIAM J. Sci. Comput., 47(4):C846– C872, 2025

  24. [24]

    Large-population LQG games involving a major player: the Nash cer- tainty equivalence principle.SIAM J

    Minyi Huang. Large-population LQG games involving a major player: the Nash cer- tainty equivalence principle.SIAM J. Control Optim., 48(5):3318–3353, 2009/10

  25. [25]

    Caines, and Roland P

    Minyi Huang, Peter E. Caines, and Roland P. Malham´ e. Social optima in mean field LQG control: centralized and decentralized strategies.IEEE Trans. Automat. Control, 57(7):1736–1751, 2012

  26. [26]

    Deep backward schemes for high- dimensional nonlinear PDEs.Math

    Cˆ ome Hur´ e, Huyˆ en Pham, and Xavier Warin. Deep backward schemes for high- dimensional nonlinear PDEs.Math. Comp., 89(324):1547–1579, 2020

  27. [27]

    Policy evaluation and temporal-difference learning in con- tinuous time and space: A martingale approach.Journal of Machine Learning Research, 23(154):1–55, 2022

    Yanwei Jia and Xun Yu Zhou. Policy evaluation and temporal-difference learning in con- tinuous time and space: A martingale approach.Journal of Machine Learning Research, 23(154):1–55, 2022

  28. [28]

    Policy gradient and actor-critic learning in continu- ous time and space: Theory and algorithms.Journal of Machine Learning Research, 23(275):1–50, 2022

    Yanwei Jia and Xun Yu Zhou. Policy gradient and actor-critic learning in continu- ous time and space: Theory and algorithms.Journal of Machine Learning Research, 23(275):1–50, 2022

  29. [29]

    Springer Cham, third edition, 2020

    Achim Klenke.Probability Theory. Springer Cham, third edition, 2020. 25

  30. [30]

    Kloeden and Eckhard Platen.Numerical solution of stochastic differential equations, volume 23 ofApplications of Mathematics (New York)

    Peter E. Kloeden and Eckhard Platen.Numerical solution of stochastic differential equations, volume 23 ofApplications of Mathematics (New York). Springer-Verlag, Berlin, 1992

  31. [31]

    Efficiency of the price formation process in presence of high frequency participants: a mean field game analysis.Math

    Aim´ e Lachapelle, Jean-Michel Lasry, Charles-Albert Lehalle, and Pierre-Louis Lions. Efficiency of the price formation process in presence of high frequency participants: a mean field game analysis.Math. Financ. Econ., 10(3):223–262, 2016

  32. [32]

    Computation of mean field equilibria in economics.Math

    Aime Lachapelle, Julien Salomon, and Gabriel Turinici. Computation of mean field equilibria in economics.Math. Models Methods Appl. Sci., 20(4):567–588, 2010

  33. [33]

    On a mean field game approach mod- eling congestion and aversion in pedestrian crowds.Transportation research part B: methodological, 45(10):1572–1589, 2011

    Aim´ e Lachapelle and Marie-Therese Wolfram. On a mean field game approach mod- eling congestion and aversion in pedestrian crowds.Transportation research part B: methodological, 45(10):1572–1589, 2011

  34. [34]

    A neural network approach for stochastic optimal control.SIAM J

    Xingjian Li, Deepanshu Verma, and Lars Ruthotto. A neural network approach for stochastic optimal control.SIAM J. Sci. Comput., 46(5):C535–C556, 2024

  35. [35]

    Multi-scale deep neural network (MscaleDNN) for solving Poisson-Boltzmann equation in complex domains.Commun

    Ziqi Liu, Wei Cai, and Zhi-Qin John Xu. Multi-scale deep neural network (MscaleDNN) for solving Poisson-Boltzmann equation in complex domains.Commun. Comput. Phys., 28(5):1970–2001, 2020

  36. [36]

    On bellman equations for continuous-time policy eval- uation i: discretization and approximation, 2024

    Wenlong Mou and Yuhua Zhu. On bellman equations for continuous-time policy eval- uation i: discretization and approximation, 2024

  37. [37]

    Springer-Verlag, Berlin, 2009

    Huyˆ en Pham.Continuous-time stochastic control and optimization with financial ap- plications, volume 61 ofStochastic Modelling and Applied Probability. Springer-Verlag, Berlin, 2009

  38. [38]

    Raissi, P

    M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations.J. Comput. Phys., 378:686–707, 2019

  39. [39]

    Deep neural networks motivated by partial differential equations.J

    Lars Ruthotto and Eldad Haber. Deep neural networks motivated by partial differential equations.J. Math. Imaging Vision, 62(3):352–364, 2020

  40. [40]

    Osher, Wuchen Li, Levon Nurbekyan, and Samy Wu Fung

    Lars Ruthotto, Stanley J. Osher, Wuchen Li, Levon Nurbekyan, and Samy Wu Fung. A machine learning framework for solving high-dimensional mean field game and mean field control problems.Proc. Natl. Acad. Sci. USA, 117(17):9183–9193, 2020

  41. [41]

    Stochastic taylor derivative estimator: Efficient amortization for arbitrary differential operators

    Zekun Shi, Zheyuan Hu, Min Lin, and Kenji Kawaguchi. Stochastic taylor derivative estimator: Efficient amortization for arbitrary differential operators. InThe Thirty- eighth Annual Conference on Neural Information Processing Systems, 2024

  42. [42]

    DGM: a deep learning algorithm for solving partial differential equations.J

    Justin Sirignano and Konstantinos Spiliopoulos. DGM: a deep learning algorithm for solving partial differential equations.J. Comput. Phys., 375:1339–1364, 2018. 26

  43. [43]

    Sutton and Andrew G

    Richard S. Sutton and Andrew G. Barto.Reinforcement learning. An introduction. Adapt. Comput. Mach. Learn. Cambridge, MA: MIT Press, 2nd expanded and updated edition edition, 2018

  44. [44]

    Das-pinns: A deep adaptive sampling method for solving high-dimensional partial differential equations.Journal of Compu- tational Physics, 476:111868, 2023

    Kejun Tang, Xiaoliang Wan, and Chao Yang. Das-pinns: A deep adaptive sampling method for solving high-dimensional partial differential equations.Journal of Compu- tational Physics, 476:111868, 2023

  45. [45]

    Adaptive importance sampling for deep Ritz.Commun

    Xiaoliang Wan, Tao Zhou, and Yuancheng Zhou. Adaptive importance sampling for deep Ritz.Commun. Appl. Math. Comput., 7(3):929–953, 2025

  46. [46]

    A deep shotgun method for solving high-dimensional parabolic partial differential equations.J

    Wenjun Xu and Wenzhong Zhang. A deep shotgun method for solving high-dimensional parabolic partial differential equations.J. Sci. Comput., 104(2):69, 2025

  47. [47]

    Linear-quadratic optimal control problems for mean-field stochastic differential equations.SIAM J

    Jiongmin Yong. Linear-quadratic optimal control problems for mean-field stochastic differential equations.SIAM J. Control Optim., 51(4):2809–2838, 2013

  48. [48]

    Springer-Verlag, New York, 1999

    Jiongmin Yong and Xun Yu Zhou.Stochastic controls, volume 43 ofApplications of Mathematics (New York). Springer-Verlag, New York, 1999. Hamiltonian systems and HJB equations

  49. [49]

    Weak adversarial networks for high-dimensional partial differential equations.J

    Yaohua Zang, Gang Bao, Xiaojing Ye, and Haomin Zhou. Weak adversarial networks for high-dimensional partial differential equations.J. Comput. Phys., 411:109409, 14, 2020

  50. [50]

    FBSDE based neural network algorithms for high- dimensional quasilinear parabolic PDEs.J

    Wenzhong Zhang and Wei Cai. FBSDE based neural network algorithms for high- dimensional quasilinear parabolic PDEs.J. Comput. Phys., 470:Paper No. 111557, 14, 2022

  51. [51]

    Actor-critic method for high dimensional static Hamilton-Jacobi-Bellman partial differential equations based on neural networks.SIAM J

    Mo Zhou, Jiequn Han, and Jianfeng Lu. Actor-critic method for high dimensional static Hamilton-Jacobi-Bellman partial differential equations based on neural networks.SIAM J. Sci. Comput., 43(6):A4043–A4066, 2021

  52. [52]

    Solving time-continuous stochastic optimal control prob- lems: Algorithm design and convergence analysis of actor-critic flow

    Mo Zhou and Jianfeng Lu. Solving Time-Continuous Stochastic Optimal Control Prob- lems: Algorithm Design and Convergence Analysis of Actor-Critic Flow. Preprint, arXiv:2402.17208 [math.OC] (2024), 2024

  53. [53]

    A policy gradient framework for stochastic optimal control problems with global convergence guarantee.SIAM J

    Mo Zhou and Jianfeng Lu. A policy gradient framework for stochastic optimal control problems with global convergence guarantee.SIAM J. Control Optim., 63(4):2605–2631, 2025

  54. [54]

    Optimal-PhiBE: A PDE-based Model-free framework for Continuous-time Reinforcement Learning

    Yuhua Zhu, Yuming Zhang, and Haoyu Zhang. Optimal-PhiBE: A PDE-based Model-free framework for Continuous-time Reinforcement Learning. Preprint, arXiv:2506.05208 [math.OC] (2025), 2025. 27