Stabilized neural Hamilton--Jacobi--Bellman solvers: Error analysis and applications in model-based reinforcement learning

arxiv: 2605.07116 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI· cs.NA· math.NA· math.OC

Stabilized neural Hamilton--Jacobi--Bellman solvers: Error analysis and applications in model-based reinforcement learning

Minseok Kim , Yeongjong Kim , Namkyeong Cho , Yeoneung Kim This is my paper

Pith reviewed 2026-05-11 00:50 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.NAmath.NAmath.OC

keywords Hamilton-Jacobi-Bellmanneural networkspolicy evaluationmodel-based reinforcement learningstability estimatefinite differenceserror analysislearned dynamics

0 comments p. Extension

The pith

Interpreting finite differences as shift operators on neural networks yields a population L2 stability bound for one policy-evaluation step with learned dynamics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that a hybrid neural-finite-difference approach to Hamilton-Jacobi-Bellman equations admits a population L2 stability estimate for a single policy-evaluation step even when the dynamics are learned from data. The estimate isolates residual error, initial and exterior mismatch, policy mismatch, and model-identification error while carrying an explicit gradient-amplification factor for the learned dynamics; the underlying linear stability stays free of inverse-viscosity blow-up. It also supplies a finite-sample collocation certificate and a conditional bound on error growth across greedy policy-improvement steps. A sympathetic reader cares because the result supplies concrete error accounting for practical continuous-time model-based reinforcement learning without forcing a full grid or a purely continuous PINN formulation.

Core claim

By interpreting finite differences as shift operators acting on neural networks, we prove a population L² stability estimate for one policy-evaluation step with learned dynamics. The bound separates residual error, initial and exterior-collar mismatch, policy mismatch, and model-identification error, with an explicit gradient amplification factor for learned dynamics, while the underlying linear evaluation stability remains free of hidden inverse-viscosity blow-up. We further give a finite-sample collocation certificate and a conditional multi-step propagation result through greedy policy improvement.

What carries the argument

The shift-operator interpretation of finite differences acting on neural networks, which carries the population L2 stability estimate for policy evaluation with learned dynamics.

Load-bearing premise

Finite-difference operators act as stable shift operators on the neural network without hidden inverse-viscosity blow-up, and the learned dynamics keep the gradient amplification factor controlled.

What would settle it

An experiment in which model-identification error is increased while residuals and mismatches are held small, yet the observed policy-evaluation error grows faster than the explicit gradient amplification factor predicts.

Figures

Figures reproduced from arXiv: 2605.07116 by Minseok Kim, Namkyeong Cho, Yeoneung Kim, Yeongjong Kim.

**Figure 2.** Figure 2: Value function validation across linear and nonlinear tasks. (a) Slice-wise com [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

read the original abstract

Physics-informed neural solvers offer a promising route to model-based reinforcement learning in continuous time, where optimal feedback synthesis is governed by Hamilton--Jacobi--Bellman (HJB) equations. Practical implementations often occupy a regime that is neither a classical grid method nor a continuous-PDE PINN: the value function is represented by a neural network, finite-difference HJB policy-evaluation operators are evaluated by network queries at shifted points, and residuals are minimized by random continuous collocation. This regime preserves the stabilized finite-difference policy-evaluation structure while avoiding grid-based value unknowns. We develop an error theory for this hybrid regime. Interpreting finite differences as shift operators acting on neural networks, we prove a population $L^2$ stability estimate for one policy-evaluation step with learned dynamics. The bound separates residual error, initial and exterior-collar mismatch, policy mismatch, and model-identification error, with an explicit gradient amplification factor for learned dynamics, while the underlying linear evaluation stability remains free of hidden inverse-viscosity blow-up. We further give a finite-sample collocation certificate and a conditional multi-step propagation result through greedy policy improvement. Experiments on compact-control LQR upto 64 dimensions, Allen--Cahn control, pendulum, Hopper, and 3D quadrotor benchmarks compare against representative model-based and model-free RL baselines, demonstrating the predicted residual, policy-mismatch, and learned-model error trends.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Hybrid neural HJB solver gets explicit error separation with gradient amplification, but control of that factor is the key assumption to check.

read the letter

The main takeaway is that this work develops an error theory for the hybrid regime of neural value functions combined with finite-difference policy evaluation at shifted points and random collocation. By interpreting the finite differences as shift operators on the neural network, they prove a population L² stability estimate for one policy-evaluation step when dynamics are learned from data. The bound explicitly separates residual error, initial and exterior mismatch, policy mismatch, and model-identification error, along with a gradient amplification factor from the learned dynamics. Importantly, the linear evaluation stability avoids hidden inverse-viscosity problems. What they do well is organize these error sources clearly and show through experiments on high-dimensional LQR up to 64 dimensions, Allen-Cahn, pendulum, Hopper, and quadrotor that the predicted trends in residual, policy, and model errors hold up against baselines. The soft spot is the gradient amplification factor for learned dynamics. For the bound to be useful without blow-up, this factor needs to remain controlled, which likely requires assumptions like bounded Lipschitz constants on the dynamics model or small identification error. The abstract claims the factor is explicit and the regime is hybrid, but if those conditions aren't met in practice, the separation may not deliver reliable guarantees. The finite-sample and multi-step results depend on this one-step stability. This is for specialists in model-based RL and physics-informed methods for control. The thinking is clear and the citation pattern looks standard. It should go to peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript develops an error theory for a hybrid regime of physics-informed neural network solvers for Hamilton-Jacobi-Bellman equations in continuous-time model-based reinforcement learning. In this regime, the value function is represented by a neural network, finite-difference operators are applied via network queries at shifted points, and residuals are minimized at random collocation points. The central contributions are a population L² stability estimate for one policy-evaluation step with learned dynamics, which separates residual error, initial and exterior-collar mismatch, policy mismatch, and model-identification error with an explicit gradient amplification factor; a finite-sample collocation certificate; and a conditional multi-step propagation result through greedy policy improvement. The paper also presents experimental results on LQR problems up to 64 dimensions, Allen-Cahn control, pendulum, Hopper, and 3D quadrotor, comparing against model-based and model-free RL baselines.

Significance. If the stability estimates hold under the stated assumptions, this work provides a rigorous foundation for stabilized neural HJB solvers, addressing key challenges in error propagation and model identification in high-dimensional continuous control. The explicit separation of errors and the claim of no hidden inverse-viscosity blow-up in the linear evaluation stability are notable strengths, potentially enabling more reliable applications in model-based RL. The experimental validation on diverse benchmarks supports the practical relevance of the theoretical results.

major comments (2)

[§3.2] §3.2 (Population L² stability estimate): The bound separates model-identification error via an explicit gradient amplification factor for learned dynamics. However, no explicit assumptions (e.g., uniform Lipschitz bound on the dynamics approximator or smallness condition relative to residual error) are provided to guarantee that this factor remains controlled in the hybrid regime. Without such conditions, the separation may fail to prevent blow-up, weakening the practical utility of the L² guarantee.
[Theorem 4] Theorem 4 (conditional multi-step propagation): The result relies on the one-step stability carrying through greedy policy improvement, but the propagation bound appears to inherit the same uncontrolled amplification factor from the learned dynamics without additional smallness or contraction arguments; this makes the multi-step claim sensitive to the same gap identified in the one-step estimate.

minor comments (2)

[§5] The benchmark tables would be strengthened by reporting standard deviations across random seeds, as the current trend plots make it difficult to assess statistical significance of the residual and model-error scaling.
Notation for the exterior-collar mismatch term is introduced without a clear diagram or equation reference in the main text, which could be clarified for readers unfamiliar with the shift-operator interpretation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. The feedback highlights important points about controlling the gradient amplification factor in the stability estimates. We address each major comment below, providing clarifications on the existing assumptions and indicating where we will strengthen the presentation in revision.

read point-by-point responses

Referee: [§3.2] §3.2 (Population L² stability estimate): The bound separates model-identification error via an explicit gradient amplification factor for learned dynamics. However, no explicit assumptions (e.g., uniform Lipschitz bound on the dynamics approximator or smallness condition relative to residual error) are provided to guarantee that this factor remains controlled in the hybrid regime. Without such conditions, the separation may fail to prevent blow-up, weakening the practical utility of the L² guarantee.

Authors: We appreciate the referee drawing attention to the need for explicit control on the amplification factor. In the manuscript, the factor is derived directly from the shift-operator representation of finite differences applied to the learned dynamics (see the proof of the population L² estimate in §3.2), and it multiplies only the model-identification error term while the linear policy-evaluation operator itself remains free of inverse-viscosity blow-up. The paper implicitly relies on the dynamics approximator being Lipschitz (consistent with standard assumptions on neural network approximators for control systems) and on the model-identification error being small relative to the residual. However, we agree that making a uniform Lipschitz bound and a smallness condition explicit would strengthen the result and clarify the regime of validity. We will add a dedicated remark in §3.2 stating these conditions and their role in keeping the factor controlled. revision: partial
Referee: [Theorem 4] Theorem 4 (conditional multi-step propagation): The result relies on the one-step stability carrying through greedy policy improvement, but the propagation bound appears to inherit the same uncontrolled amplification factor from the learned dynamics without additional smallness or contraction arguments; this makes the multi-step claim sensitive to the same gap identified in the one-step estimate.

Authors: The multi-step result in Theorem 4 is explicitly conditional on the one-step errors (including the amplified model-identification term) remaining below a threshold that prevents accumulation over iterations; this conditioning is stated in the theorem and leverages the contraction properties of greedy policy improvement under the problem's compactness and stability assumptions. We acknowledge that the presentation could more clearly link the smallness requirement to the amplification factor from the one-step bound. In revision we will expand the statement and proof sketch of Theorem 4 to include an explicit smallness condition on the model-identification error (relative to the one-step residual and the policy-improvement contraction rate) that ensures the propagation remains controlled over a finite number of steps. revision: partial

Circularity Check

0 steps flagged

Derivation self-contained; stability estimate obtained from shift-operator interpretation without reduction to inputs

full rationale

The paper presents a population L^2 stability estimate derived by interpreting finite-difference operators as shift operators acting on neural networks. This produces an explicit separation of residual error, initial/exterior-collar mismatch, policy mismatch, and model-identification error together with a gradient amplification factor. No step in the provided abstract or claimed derivation chain reduces a bound or prediction to a fitted parameter by construction, nor invokes a self-citation chain, uniqueness theorem, or ansatz that is itself unverified within the paper. The central error analysis therefore remains independent of the target quantities and is self-contained under the stated hybrid-regime assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Insufficient detail in abstract to enumerate free parameters, axioms, or invented entities; the stability proof likely relies on standard assumptions about neural network approximation and dynamics Lipschitz conditions, but none are specified here.

pith-pipeline@v0.9.0 · 5578 in / 1298 out tokens · 49189 ms · 2026-05-11T00:50:34.747961+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages

[1]

Dyna, an integrated architecture for learning, planning, and reacting

Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin, 2(4):160–163, 1991

work page 1991
[2]

Pilco: A model-based and data-efficient approach to policy search

Marc Deisenroth and Carl E Rasmussen. Pilco: A model-based and data-efficient approach to policy search. InProceedings of the 28th International Conference on machine learning (ICML-11), pages 465–472, 2011

work page 2011
[3]

Deep reinforcement learning in a handful of trials using probabilistic dynamics models

Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, volume 31, 2018

work page 2018
[4]

When to trust your model: Model-based policy optimization

Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. InAdvances in Neural Information Processing Systems, volume 32, 2019

work page 2019
[5]

Ameri- can Mathematical Soc., 2021

Hung V Tran.Hamilton–Jacobi equations: theory and applications, volume 213. Ameri- can Mathematical Soc., 2021

work page 2021
[6]

American mathematical society, 2022

Lawrence C Evans.Partial differential equations, volume 19. American mathematical society, 2022

work page 2022
[7]

Dynamic programming and markov processes

Ronald A Howard. Dynamic programming and markov processes. 1960

work page 1960
[8]

M. L. Puterman. On the convergence of policy iteration for controlled diffusions.Journal of Optimization Theory and Applications, 33(1):137–144, 1981

work page 1981
[9]

On the policy improvement algorithm in continuous time.Stochastics, 89(1):348–359, 2017

Saul D Jacka and Aleksandar Mijatović. On the policy improvement algorithm in continuous time.Stochastics, 89(1):348–359, 2017

work page 2017
[10]

Kerimkulov, D

B. Kerimkulov, D. Šiška, and Łukasz Szpruch. Exponential convergence and stability of howard’s policy improvement algorithm for controlled diffusions.SIAM Journal on Control and Optimization, 58(3):1314–1340, 2020

work page 2020
[11]

Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations.Journal of Computational physics, 378:686–707, 2019

work page 2019
[12]

Solving high-dimensional partial differen- tial equations using deep learning.Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018

Jiequn Han, Arnulf Jentzen, and Weinan E. Solving high-dimensional partial differen- tial equations using deep learning.Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018

work page 2018
[13]

Physics-informed model-based reinforce- ment learning

Adithya Ramesh and Balaraman Ravindran. Physics-informed model-based reinforce- ment learning. InLearning for Dynamics and Control Conference, pages 26–37. PMLR, 2023

work page 2023
[14]

Bridging physics-informed neural networks with reinforcement learning: Hamilton-jacobi-bellman proximal policy optimization (hjbppo)

Amartya Mukherjee and Jun Liu. Bridging physics-informed neural networks with reinforcement learning: Hamilton-jacobi-bellman proximal policy optimization (hjbppo). arXiv preprint arXiv:2302.00237, 2023

work page arXiv 2023
[15]

Physics-informed neural network policy iteration: Algorithms, convergence, and verification.arXiv preprint arXiv:2402.10119, 2024

Yiming Meng, Ruikun Zhou, Amartya Mukherjee, Maxwell Fitzsimmons, Christopher Song, and Jun Liu. Physics-informed neural network policy iteration: Algorithms, convergence, and verification.arXiv preprint arXiv:2402.10119, 2024

work page arXiv 2024
[16]

Physics-informed approach for exploratory Hamilton–Jacobi–Bellman equations via policy iterations

Yeongjong Kim, Namkyeong Cho, Minseok Kim, and Yeoneung Kim. Physics-informed approach for exploratory Hamilton–Jacobi–Bellman equations via policy iterations. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 22609–22616, 2026. 10

work page 2026
[17]

On the convergence of policy iteration in stationary dynamic programming.Mathematics of Operations Research, 4(1):60–69, 1979

Martin L Puterman and Shelby L Brumelle. On the convergence of policy iteration in stationary dynamic programming.Mathematics of Operations Research, 4(1):60–69, 1979

work page 1979
[18]

Policy iteration for the deterministic control problems—a viscosity approach.SIAM Journal on Control and Optimization, 63(1):375–401, 2025

Wenpin Tang, Hung Vinh Tran, and Yuming Paul Zhang. Policy iteration for the deterministic control problems—a viscosity approach.SIAM Journal on Control and Optimization, 63(1):375–401, 2025

work page 2025
[19]

Policy iteration for nonconvex viscous hamilton–jacobi equations.SIAM Journal on Applied Mathematics, 86(2):532–556, 2026

Xiaoqin Guo, Hung V Tran, and Yuming P Zhang. Policy iteration for nonconvex viscous hamilton–jacobi equations.SIAM Journal on Applied Mathematics, 86(2):532–556, 2026

work page 2026
[20]

Policy iteration for ex- ploratory Hamilton–Jacobi–Bellman equations.Applied Mathematics & Optimization, 91(2):50, 2025

Hung Vinh Tran, Zhenhua Wang, and Yuming Paul Zhang. Policy iteration for ex- ploratory Hamilton–Jacobi–Bellman equations.Applied Mathematics & Optimization, 91(2):50, 2025

work page 2025
[21]

Convergence of policy iteration for entropy-regularized stochastic control problems.SIAM Journal on Control and Opti- mization, 63(2):752–777, 2025

Yu-Jui Huang, Zhenhua Wang, and Zhou Zhou. Convergence of policy iteration for entropy-regularized stochastic control problems.SIAM Journal on Control and Opti- mization, 63(2):752–777, 2025

work page 2025
[22]

Convergence analysis for entropy- regularized control problems: A probabilistic approach.SIAM Journal on Control and Optimization, 64(2):816–842, 2026

Jin Ma, Gaozhan Wang, and Jianfeng Zhang. Convergence analysis for entropy- regularized control problems: A probabilistic approach.SIAM Journal on Control and Optimization, 64(2):816–842, 2026

work page 2026
[23]

Solving nonconvex hamilton–jacobi– isaacs equations with pinn-based policy iteration.arXiv preprint arXiv:2507.15455, 2025

Hee Jun Yang, Min Jung Kim, and Yeoneung Kim. Solving nonconvex hamilton–jacobi– isaacs equations with pinn-based policy iteration.arXiv preprint arXiv:2507.15455, 2025

work page arXiv 2025
[24]

Hamilton–Jacobi based policy-iteration via deep operator learning.Neurocomputing, page 130515, 2025

Jae Yong Lee and Yeoneung Kim. Hamilton–Jacobi based policy-iteration via deep operator learning.Neurocomputing, page 130515, 2025. A Finite-difference calculus and zero-extension identities This appendix gives the full deterministic proof of the population estimate. All identities are first justified for smooth functions and then extended by density to t...

work page 2025
[25]

The semi-discrete solver evaluates (58) using queries atx±hei

+ 1 2β(Vx2)2−1 2x⊤Qx= 0,(58) whereV(0,x) = 1 2x⊤Qfx. The semi-discrete solver evaluates (58) using queries atx±hei. G.3 Spacecraft Rendezvous (Clohessy–Wiltshire) This benchmark models the planar docking of a chaser spacecraft to a target in a circular orbit. Relative Dynamics.The linearized dynamics in the relative frame are: ˙x=Ax+Bu, A=   0 0 1 0 0 ...

work page

[1] [1]

Dyna, an integrated architecture for learning, planning, and reacting

Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin, 2(4):160–163, 1991

work page 1991

[2] [2]

Pilco: A model-based and data-efficient approach to policy search

Marc Deisenroth and Carl E Rasmussen. Pilco: A model-based and data-efficient approach to policy search. InProceedings of the 28th International Conference on machine learning (ICML-11), pages 465–472, 2011

work page 2011

[3] [3]

Deep reinforcement learning in a handful of trials using probabilistic dynamics models

Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, volume 31, 2018

work page 2018

[4] [4]

When to trust your model: Model-based policy optimization

Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. InAdvances in Neural Information Processing Systems, volume 32, 2019

work page 2019

[5] [5]

Ameri- can Mathematical Soc., 2021

Hung V Tran.Hamilton–Jacobi equations: theory and applications, volume 213. Ameri- can Mathematical Soc., 2021

work page 2021

[6] [6]

American mathematical society, 2022

Lawrence C Evans.Partial differential equations, volume 19. American mathematical society, 2022

work page 2022

[7] [7]

Dynamic programming and markov processes

Ronald A Howard. Dynamic programming and markov processes. 1960

work page 1960

[8] [8]

M. L. Puterman. On the convergence of policy iteration for controlled diffusions.Journal of Optimization Theory and Applications, 33(1):137–144, 1981

work page 1981

[9] [9]

On the policy improvement algorithm in continuous time.Stochastics, 89(1):348–359, 2017

Saul D Jacka and Aleksandar Mijatović. On the policy improvement algorithm in continuous time.Stochastics, 89(1):348–359, 2017

work page 2017

[10] [10]

Kerimkulov, D

B. Kerimkulov, D. Šiška, and Łukasz Szpruch. Exponential convergence and stability of howard’s policy improvement algorithm for controlled diffusions.SIAM Journal on Control and Optimization, 58(3):1314–1340, 2020

work page 2020

[11] [11]

Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations.Journal of Computational physics, 378:686–707, 2019

work page 2019

[12] [12]

Solving high-dimensional partial differen- tial equations using deep learning.Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018

Jiequn Han, Arnulf Jentzen, and Weinan E. Solving high-dimensional partial differen- tial equations using deep learning.Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018

work page 2018

[13] [13]

Physics-informed model-based reinforce- ment learning

Adithya Ramesh and Balaraman Ravindran. Physics-informed model-based reinforce- ment learning. InLearning for Dynamics and Control Conference, pages 26–37. PMLR, 2023

work page 2023

[14] [14]

Bridging physics-informed neural networks with reinforcement learning: Hamilton-jacobi-bellman proximal policy optimization (hjbppo)

Amartya Mukherjee and Jun Liu. Bridging physics-informed neural networks with reinforcement learning: Hamilton-jacobi-bellman proximal policy optimization (hjbppo). arXiv preprint arXiv:2302.00237, 2023

work page arXiv 2023

[15] [15]

Physics-informed neural network policy iteration: Algorithms, convergence, and verification.arXiv preprint arXiv:2402.10119, 2024

Yiming Meng, Ruikun Zhou, Amartya Mukherjee, Maxwell Fitzsimmons, Christopher Song, and Jun Liu. Physics-informed neural network policy iteration: Algorithms, convergence, and verification.arXiv preprint arXiv:2402.10119, 2024

work page arXiv 2024

[16] [16]

Physics-informed approach for exploratory Hamilton–Jacobi–Bellman equations via policy iterations

Yeongjong Kim, Namkyeong Cho, Minseok Kim, and Yeoneung Kim. Physics-informed approach for exploratory Hamilton–Jacobi–Bellman equations via policy iterations. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 22609–22616, 2026. 10

work page 2026

[17] [17]

On the convergence of policy iteration in stationary dynamic programming.Mathematics of Operations Research, 4(1):60–69, 1979

Martin L Puterman and Shelby L Brumelle. On the convergence of policy iteration in stationary dynamic programming.Mathematics of Operations Research, 4(1):60–69, 1979

work page 1979

[18] [18]

Policy iteration for the deterministic control problems—a viscosity approach.SIAM Journal on Control and Optimization, 63(1):375–401, 2025

Wenpin Tang, Hung Vinh Tran, and Yuming Paul Zhang. Policy iteration for the deterministic control problems—a viscosity approach.SIAM Journal on Control and Optimization, 63(1):375–401, 2025

work page 2025

[19] [19]

Policy iteration for nonconvex viscous hamilton–jacobi equations.SIAM Journal on Applied Mathematics, 86(2):532–556, 2026

Xiaoqin Guo, Hung V Tran, and Yuming P Zhang. Policy iteration for nonconvex viscous hamilton–jacobi equations.SIAM Journal on Applied Mathematics, 86(2):532–556, 2026

work page 2026

[20] [20]

Policy iteration for ex- ploratory Hamilton–Jacobi–Bellman equations.Applied Mathematics & Optimization, 91(2):50, 2025

Hung Vinh Tran, Zhenhua Wang, and Yuming Paul Zhang. Policy iteration for ex- ploratory Hamilton–Jacobi–Bellman equations.Applied Mathematics & Optimization, 91(2):50, 2025

work page 2025

[21] [21]

Convergence of policy iteration for entropy-regularized stochastic control problems.SIAM Journal on Control and Opti- mization, 63(2):752–777, 2025

Yu-Jui Huang, Zhenhua Wang, and Zhou Zhou. Convergence of policy iteration for entropy-regularized stochastic control problems.SIAM Journal on Control and Opti- mization, 63(2):752–777, 2025

work page 2025

[22] [22]

Convergence analysis for entropy- regularized control problems: A probabilistic approach.SIAM Journal on Control and Optimization, 64(2):816–842, 2026

Jin Ma, Gaozhan Wang, and Jianfeng Zhang. Convergence analysis for entropy- regularized control problems: A probabilistic approach.SIAM Journal on Control and Optimization, 64(2):816–842, 2026

work page 2026

[23] [23]

Solving nonconvex hamilton–jacobi– isaacs equations with pinn-based policy iteration.arXiv preprint arXiv:2507.15455, 2025

Hee Jun Yang, Min Jung Kim, and Yeoneung Kim. Solving nonconvex hamilton–jacobi– isaacs equations with pinn-based policy iteration.arXiv preprint arXiv:2507.15455, 2025

work page arXiv 2025

[24] [24]

Hamilton–Jacobi based policy-iteration via deep operator learning.Neurocomputing, page 130515, 2025

Jae Yong Lee and Yeoneung Kim. Hamilton–Jacobi based policy-iteration via deep operator learning.Neurocomputing, page 130515, 2025. A Finite-difference calculus and zero-extension identities This appendix gives the full deterministic proof of the population estimate. All identities are first justified for smooth functions and then extended by density to t...

work page 2025

[25] [25]

The semi-discrete solver evaluates (58) using queries atx±hei

+ 1 2β(Vx2)2−1 2x⊤Qx= 0,(58) whereV(0,x) = 1 2x⊤Qfx. The semi-discrete solver evaluates (58) using queries atx±hei. G.3 Spacecraft Rendezvous (Clohessy–Wiltshire) This benchmark models the planar docking of a chaser spacecraft to a target in a circular orbit. Relative Dynamics.The linearized dynamics in the relative frame are: ˙x=Ax+Bu, A=   0 0 1 0 0 ...

work page