Learning-based Hamilton-Jacobi-Bellman Methods for Optimal Control
Pith reviewed 2026-05-24 17:05 UTC · model grok-4.3
The pith
Neural networks trained by supervised or reinforcement learning supply real-time initial adjoint guesses that let TPBVP solvers converge for HJB optimal control problems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that a neural network can be trained in one of two ways: by supervised learning on a database of paired boundary conditions and initial adjoint values, or by reinforcement learning that updates the network according to a reward measuring solver convergence. Either trained network is claimed to output initial adjoint values accurate enough for a TPBVP solver to converge on unseen boundary conditions arising from HJB equations.
What carries the argument
A neural-network approximator that maps boundary conditions to initial adjoint variables, which then seed the solver for the TPBVP derived from the Hamilton-Jacobi-Bellman equation.
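The supervised route can be illustrated on a toy problem. The sketch below is not the authors' setup: it assumes a minimum-energy double integrator (x' = v, v' = u, cost ½∫u² dt, horizon T = 1, target at the origin), for which the initial costates are linear in the boundary condition, and it swaps the neural network for a single linear layer trained by full-batch gradient descent. The names `true_costate`, `train_linear_map`, and `predict` are hypothetical.

```python
import itertools

def true_costate(x0, v0):
    """Closed-form initial costates (lambda_x, lambda_v) for the assumed toy problem."""
    return (12.0 * x0 + 6.0 * v0, 6.0 * x0 + 4.0 * v0)

# Database of (boundary condition, initial adjoint) pairs on a coarse grid.
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
database = [((x0, v0), true_costate(x0, v0))
            for x0, v0 in itertools.product(grid, grid)]

def train_linear_map(data, lr=0.5, steps=200):
    """Fit a 2x2 weight matrix W (a linear layer, no bias) by minimizing mean squared error."""
    W = [[0.0, 0.0], [0.0, 0.0]]
    n = len(data)
    for _ in range(steps):
        grad = [[0.0, 0.0], [0.0, 0.0]]
        for x, y in data:
            for i in range(2):
                err = W[i][0] * x[0] + W[i][1] * x[1] - y[i]
                for j in range(2):
                    grad[i][j] += 2.0 * err * x[j] / n
        for i in range(2):
            for j in range(2):
                W[i][j] -= lr * grad[i][j]
    return W

def predict(W, bc):
    """Map a boundary condition (x0, v0) to a guessed initial costate."""
    return tuple(W[i][0] * bc[0] + W[i][1] * bc[1] for i in range(2))

W = train_linear_map(database)
guess = predict(W, (0.3, -0.7))   # boundary condition not on the training grid
```

Because the true map is linear in this toy case, a linear layer recovers it; a nonlinear TPBVP would need hidden layers and would not generally admit a closed-form database.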
If this is right
- Optimal control problems formulated as TPBVPs can be solved online without manual tuning of adjoint initials.
- The reinforcement-learning route enables solution of HJB problems even when no precomputed solution database exists.
- Classical shooting or collocation methods become practical for real-time use once the network supplies the starting guess.
- The same trained networks can be applied to any new boundary conditions drawn from the same problem family.
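How a starting guess enters a shooting method can be sketched as follows. This is a minimal single-shooting loop under stated assumptions, not the solver used in the paper: the problem is an assumed minimum-energy double integrator (x' = v, v' = u, cost ½∫u² dt, T = 1, target at the origin), the integrator is a hand-written RK4, and Newton's method uses a finite-difference Jacobian. All function names are hypothetical.

```python
def rhs(s):
    """State-costate dynamics from the Hamiltonian H = u^2/2 + lx*v + lv*u."""
    x, v, lx, lv = s
    u = -lv                      # stationarity: dH/du = u + lv = 0
    return (v, u, 0.0, -lx)

def integrate(s, T=1.0, n=100):
    """Fixed-step RK4 from t = 0 to t = T."""
    h = T / n
    for _ in range(n):
        k1 = rhs(s)
        k2 = rhs(tuple(si + 0.5 * h * k for si, k in zip(s, k1)))
        k3 = rhs(tuple(si + 0.5 * h * k for si, k in zip(s, k2)))
        k4 = rhs(tuple(si + h * k for si, k in zip(s, k3)))
        s = tuple(si + h * (a + 2 * b + 2 * c + d) / 6.0
                  for si, a, b, c, d in zip(s, k1, k2, k3, k4))
    return s

def residual(bc, lam0):
    """Terminal miss (x(T), v(T)); the target is the origin."""
    x, v, _, _ = integrate((bc[0], bc[1], lam0[0], lam0[1]))
    return (x, v)

def shoot(bc, lam0, tol=1e-9, max_iter=20, eps=1e-6):
    """Newton iteration on the initial costate; returns (costate, iterations used)."""
    lam = list(lam0)
    for it in range(max_iter):
        r = residual(bc, lam)
        if max(abs(r[0]), abs(r[1])) < tol:
            return tuple(lam), it
        J = [[0.0, 0.0], [0.0, 0.0]]          # finite-difference Jacobian dr/dlam
        for j in range(2):
            p = list(lam)
            p[j] += eps
            rp = residual(bc, p)
            J[0][j] = (rp[0] - r[0]) / eps
            J[1][j] = (rp[1] - r[1]) / eps
        det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
        lam[0] += (-J[1][1] * r[0] + J[0][1] * r[1]) / det   # solve J d = -r
        lam[1] += (J[1][0] * r[0] - J[0][0] * r[1]) / det
    return tuple(lam), max_iter

bc = (1.0, 0.0)
lam_cold, iters_cold = shoot(bc, (0.0, 0.0))     # uninformed guess
lam_warm, iters_warm = shoot(bc, (12.0, 6.0))    # guess a trained network might supply
```

On this linear toy problem Newton converges from any guess within a few iterations, so the warm start only saves iterations. The benefit claimed in the paper concerns nonlinear problems, where a poor guess can make the shooting iteration diverge.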
Where Pith is reading between the lines
- The supervised route could be combined with adaptive retraining when plant parameters drift.
- High-dimensional state spaces would require checking whether network size must grow with state dimension to maintain guess accuracy.
- The reinforcement-learning formulation might be extended by adding a penalty on guess magnitude to encourage simpler initial conditions.
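The magnitude-penalty idea in the last bullet can be made concrete with a hypothetical reward. The abstract does not state the authors' reward, so everything here is an assumption: the toy problem is a minimum-energy double integrator (x' = v, v' = u, T = 1, target at the origin), the reward is the negative terminal miss minus `beta` times the norm of the guessed costate, and the names `terminal_residual` and `reward` are illustrative.

```python
import math

def terminal_residual(bc, lam0):
    """Closed-form terminal state (x(1), v(1)) of the assumed toy problem."""
    x0, v0 = bc
    a, b = lam0                          # initial costates (lambda_x, lambda_v)
    return (x0 + v0 - b / 2.0 + a / 6.0, v0 - b + a / 2.0)

def reward(bc, lam0, beta=0.01):
    """Hypothetical reward: negative terminal miss minus a guess-magnitude penalty."""
    miss = math.hypot(*terminal_residual(bc, lam0))
    size = math.hypot(*lam0)
    return -miss - beta * size

bc = (1.0, 0.0)
good = reward(bc, (12.0, 6.0))   # exact initial costate for this boundary condition
bad = reward(bc, (0.0, 0.0))     # uninformed guess
```

With a small `beta` the exact costate still scores higher than the zero guess, but the penalty shifts the reward's maximizer away from the exact costate, so a large `beta` would trade solver accuracy for smaller guesses.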
Load-bearing premise
A neural network trained on example solutions or rewards will output adjoint initial guesses close enough to the true values that the chosen TPBVP solver converges on boundary conditions outside the training set.
What would settle it
Generate a held-out test set of boundary conditions, feed each through the trained network to obtain an adjoint initial guess, then run the TPBVP solver and record whether it fails to converge or returns trajectories that violate the original optimality conditions.
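The test described above can be sketched as a small evaluation harness. It is an illustration under assumptions, not the paper's experiment: the toy problem is a minimum-energy double integrator (x' = v, v' = u, T = 1, target at the origin) with a closed-form terminal residual, `imperfect_guess` is a hypothetical stand-in for a trained network, and `newton_solve` is a stand-in TPBVP solver.

```python
import random

def terminal_residual(bc, lam0):
    """Closed-form terminal state (x(1), v(1)) of the assumed toy problem."""
    x0, v0 = bc
    a, b = lam0
    return (x0 + v0 - b / 2.0 + a / 6.0, v0 - b + a / 2.0)

def imperfect_guess(bc):
    """Hypothetical stand-in for a trained network: a slightly wrong linear map."""
    x0, v0 = bc
    return (11.5 * x0 + 6.2 * v0, 6.1 * x0 + 3.9 * v0)

def newton_solve(bc, lam0, tol=1e-9, max_iter=10):
    """Stand-in solver: Newton on the terminal residual, whose Jacobian is constant."""
    a, b = lam0
    for _ in range(max_iter + 1):
        rx, rv = terminal_residual(bc, (a, b))
        if max(abs(rx), abs(rv)) < tol:
            return (a, b), True
        # Jacobian is [[1/6, -1/2], [1/2, -1]]; the lines below solve J d = -r.
        a += 12.0 * rx - 6.0 * rv
        b += 6.0 * rx - 2.0 * rv
    return (a, b), False

def evaluate(guess_fn, solver, test_bcs, check_tol=1e-6):
    """Return the success rate and the boundary conditions that failed."""
    failures = []
    for bc in test_bcs:
        lam, converged = solver(bc, guess_fn(bc))
        rx, rv = terminal_residual(bc, lam)
        violates = max(abs(rx), abs(rv)) > check_tol   # terminal condition check
        if not converged or violates:
            failures.append(bc)
    return 1.0 - len(failures) / len(test_bcs), failures

rng = random.Random(0)
held_out = [(rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0)) for _ in range(50)]
rate, failures = evaluate(imperfect_guess, newton_solve, held_out)
```

A success rate well below 1.0 on a realistic nonlinear problem, or a non-empty `failures` list clustered in one region of the boundary-condition space, would be the kind of evidence that counts against the load-bearing premise.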
Original abstract
Many optimal control problems are formulated as two point boundary value problems (TPBVPs) with conditions of optimality derived from the Hamilton-Jacobi-Bellman (HJB) equations. In most cases, it is challenging to solve HJBs due to the difficulty of guessing the adjoint variables. This paper proposes two learning-based approaches to find the initial guess of adjoint variables in real-time, which can be applied to solve general TPBVPs. For cases with database of solutions and corresponding adjoint variables of a TPBVP under varying boundary conditions, a supervised learning method is applied to learn the HJB solutions off-line. After obtaining a trained neural network from supervised learning, we are able to find proper initial adjoint variables for given boundary conditions in real-time. However, when validated solutions of TPBVPs are not available, the reinforcement learning method is applied to solve HJB by constructing a neural network, defining a reward function, and setting appropriate super parameters. The reinforcement learning based HJB method can learn how to find accurate adjoint variables via an updating neural network. Finally, both learning approaches are implemented in classical optimal control problems to verify the effectiveness of the learning based HJB methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes two learning-based methods to generate initial adjoint variable guesses for two-point boundary value problems (TPBVPs) arising from Hamilton-Jacobi-Bellman (HJB) optimality conditions in optimal control. When a database of prior solutions exists, a supervised neural network is trained offline to map boundary conditions to adjoint initials; when no such database is available, a reinforcement learning approach constructs a network, defines a reward, and updates parameters to learn accurate adjoints. Both are stated to have been implemented on classical optimal control problems to verify real-time effectiveness.
Significance. If the central claim holds with quantitative support, the work would address a long-standing practical bottleneck in indirect optimal control methods by replacing manual or heuristic adjoint initialization with learned real-time guesses, potentially broadening the applicability of HJB-derived TPBVP solvers. The distinction between supervised and RL regimes is a reasonable organizing principle, and the absence of free parameters or invented entities in the high-level description is a minor positive.
Major comments (2)
- [Abstract] The assertion that the methods were 'implemented in classical optimal control problems to verify the effectiveness' is load-bearing for the real-time claim yet supplies no quantitative results (success rates on unseen boundary conditions, adjoint approximation errors, TPBVP convergence statistics, or baseline comparisons). Without these, the generalization performance required by the central claim cannot be assessed.
- [Abstract] The RL formulation is described only at the level of 'defining a reward function' and 'setting appropriate super parameters,' with no explicit statement of the reward, network architecture, or convergence criterion that would guarantee the HJB residual is driven to zero for out-of-distribution boundary conditions. This omission directly affects reproducibility and the soundness of the 'learning how to find accurate adjoint variables' claim.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below.
- Referee: [Abstract] The assertion that the methods were 'implemented in classical optimal control problems to verify the effectiveness' is load-bearing for the real-time claim yet supplies no quantitative results (success rates on unseen boundary conditions, adjoint approximation errors, TPBVP convergence statistics, or baseline comparisons). Without these, the generalization performance required by the central claim cannot be assessed.
  Authors: We agree that the abstract would be strengthened by including quantitative metrics. In the revised manuscript we will add specific results (success rates on unseen boundary conditions, adjoint approximation errors, TPBVP convergence statistics, and baseline comparisons) to the abstract to support the real-time effectiveness claim.
  Revision: yes
- Referee: [Abstract] The RL formulation is described only at the level of 'defining a reward function' and 'setting appropriate super parameters,' with no explicit statement of the reward, network architecture, or convergence criterion that would guarantee the HJB residual is driven to zero for out-of-distribution boundary conditions. This omission directly affects reproducibility and the soundness of the 'learning how to find accurate adjoint variables' claim.
  Authors: We acknowledge the abstract description of the RL approach is high-level. In the revision we will expand the abstract to state the reward function, network architecture, and convergence criterion explicitly, thereby improving reproducibility and supporting the claim that the method learns accurate adjoint variables.
  Revision: yes
Circularity Check
No circularity: learning methods rely on external data/rewards without self-referential reduction
The paper describes two standard learning procedures (supervised NN training on a database of TPBVP solutions, or RL with a defined reward) to generate initial adjoint guesses for HJB-derived TPBVPs. No equations, derivations, or fitted-parameter predictions appear in the abstract or description. No self-citations are invoked as load-bearing uniqueness results. The approach is self-contained as an empirical ML technique whose validity rests on external verification in classical problems rather than any reduction of outputs to the method's own inputs by construction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Neural networks can learn a sufficiently accurate mapping from boundary conditions to adjoint variables that enables TPBVP convergence.