arxiv: 1907.10097 · v1 · pith:VEDTJLTWnew · submitted 2019-07-23 · 🧮 math.OC

Learning-based Hamilton-Jacobi-Bellman Methods for Optimal Control

Pith reviewed 2026-05-24 17:05 UTC · model grok-4.3

classification 🧮 math.OC
keywords Hamilton-Jacobi-Bellman · optimal control · two-point boundary value problems · supervised learning · reinforcement learning · adjoint variables · neural networks

The pith

Neural networks trained by supervised or reinforcement learning supply real-time initial adjoint guesses that let TPBVP solvers converge for HJB optimal control problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops two neural-network approaches to guess the initial adjoint variables required by two-point boundary value problems that arise from Hamilton-Jacobi-Bellman optimality conditions. When solved examples are available, supervised learning trains a network offline to map boundary conditions to those initial values. When no solved examples exist, reinforcement learning trains the network online by rewarding guesses that produce convergent trajectories. Both routes are shown to deliver guesses accurate enough for standard TPBVP solvers to succeed on new boundary conditions in real time.
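The supervised route described above can be sketched as plain regression from boundary conditions to the initial adjoint. The following is a minimal illustration under stated assumptions, not the authors' implementation: the "database" is generated from a toy linear-quadratic problem (minimize 0.5∫(x² + u²)dt with x' = -x + u on [0, 1]), whose optimal initial adjoint has a closed form, and the network size, learning rate, and helper names (`true_lambda0`, `init_params`, `train`) are all hypothetical choices.

```python
import math
import random

# Toy stand-in for a database of solved TPBVPs (an assumption, not from the paper):
# for x' = -x + u with cost 0.5*int(x^2 + u^2) on [0, 1], the optimal control is
# u = -lam and x(1) = (C - S)*x0 - S*lam0, so lam0 is known in closed form.
S = math.sinh(math.sqrt(2.0)) / math.sqrt(2.0)
C = math.cosh(math.sqrt(2.0))

def true_lambda0(x0, xf):
    return ((C - S) * x0 - xf) / S

def init_params(n_hidden, rng):
    u = lambda: rng.uniform(-0.5, 0.5)
    return {"W": [[u(), u()] for _ in range(n_hidden)],
            "b": [0.0] * n_hidden,
            "v": [u() for _ in range(n_hidden)],
            "c": 0.0}

def forward(p, x):
    # One hidden tanh layer, scalar linear output (the adjoint guess).
    h = [math.tanh(w[0] * x[0] + w[1] * x[1] + b) for w, b in zip(p["W"], p["b"])]
    y = sum(vj * hj for vj, hj in zip(p["v"], h)) + p["c"]
    return y, h

def mse(p, data):
    return sum((forward(p, x)[0] - t) ** 2 for x, t in data) / len(data)

def train(p, data, lr=0.05, epochs=600):
    # Full-batch gradient descent with hand-written backpropagation.
    n, H = len(data), len(p["v"])
    for _ in range(epochs):
        gW = [[0.0, 0.0] for _ in range(H)]
        gb, gv, gc = [0.0] * H, [0.0] * H, 0.0
        for x, t in data:
            y, h = forward(p, x)
            e = 2.0 * (y - t) / n                    # d(MSE)/dy
            gc += e
            for j in range(H):
                gv[j] += e * h[j]
                dz = e * p["v"][j] * (1.0 - h[j] * h[j])   # back through tanh
                gb[j] += dz
                gW[j][0] += dz * x[0]
                gW[j][1] += dz * x[1]
        p["c"] -= lr * gc
        for j in range(H):
            p["v"][j] -= lr * gv[j]
            p["b"][j] -= lr * gb[j]
            p["W"][j][0] -= lr * gW[j][0]
            p["W"][j][1] -= lr * gW[j][1]
    return p

rng = random.Random(0)
data = []
for _ in range(40):
    x0, xf = rng.uniform(-1, 1), rng.uniform(-1, 1)
    data.append(((x0, xf), true_lambda0(x0, xf)))

params = init_params(8, rng)
loss_before = mse(params, data)
train(params, data)
loss_after = mse(params, data)

# Online use: one cheap forward pass yields the adjoint guess for a new boundary pair.
guess, _ = forward(params, (0.3, -0.2))
```

The offline cost is paid once in `train`; at run time only `forward` is called, which is what makes a real-time guess plausible. Whether the guess is close enough for a given solver still has to be checked on held-out boundary conditions.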

Core claim

The authors claim that a neural network can be trained in either of two ways: by supervised learning on a database that pairs boundary conditions with initial adjoint values, or by reinforcement learning that updates the network according to a reward measuring solver convergence. They further claim that either trained network outputs initial adjoint values accurate enough for a TPBVP solver to converge on unseen boundary conditions arising from HJB equations.

What carries the argument

A neural-network approximator that maps boundary conditions to the initial adjoint variables needed by a TPBVP solver, where the TPBVP is derived from the Hamilton-Jacobi-Bellman equation.

If this is right

  • Optimal control problems formulated as TPBVPs can be solved online without manual tuning of adjoint initials.
  • The reinforcement-learning route enables solution of HJB problems even when no precomputed solution database exists.
  • Classical shooting or collocation methods become practical for real-time use once the network supplies the starting guess.
  • The same trained networks can be applied to any new boundary conditions drawn from the same problem family.
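To make the role of the starting guess concrete, here is a hedged sketch of single shooting on a toy problem. Everything in it is an assumption for illustration: the problem (minimize 0.5∫(x² + u²)dt with x' = -x + u on [0, 1]), the RK4 integrator, and the secant update are not taken from the paper, and `lam_guess` is simply where a network's output would be plugged in.

```python
import math

def shoot(x0, lam0, T=1.0, steps=200):
    """Integrate state and adjoint forward with RK4 and return x(T).

    With H = 0.5*(x^2 + u^2) + lam*(-x + u), stationarity gives u = -lam,
    so x' = -x - lam and lam' = -dH/dx = -x + lam.
    """
    def f(x, lam):
        return -x - lam, -x + lam

    h = T / steps
    x, lam = x0, lam0
    for _ in range(steps):
        k1x, k1l = f(x, lam)
        k2x, k2l = f(x + 0.5 * h * k1x, lam + 0.5 * h * k1l)
        k3x, k3l = f(x + 0.5 * h * k2x, lam + 0.5 * h * k2l)
        k4x, k4l = f(x + h * k3x, lam + h * k3l)
        x += h * (k1x + 2 * k2x + 2 * k3x + k4x) / 6.0
        lam += h * (k1l + 2 * k2l + 2 * k3l + k4l) / 6.0
    return x

def solve_by_shooting(x0, xf, lam_guess, tol=1e-10, max_iter=20):
    """Secant iteration on lam0 so that x(T) matches xf.

    Returns (lam0, iterations_used, converged).
    """
    a, b = lam_guess, lam_guess + 1e-2
    ra, rb = shoot(x0, a) - xf, shoot(x0, b) - xf
    for it in range(max_iter):
        if abs(rb) < tol:
            return b, it, True
        if rb == ra:          # flat secant, cannot continue
            break
        a, b, ra = b, b - rb * (b - a) / (rb - ra), rb
        rb = shoot(x0, b) - xf
    return b, max_iter, abs(rb) < tol

# lam_guess=0.0 stands in for the network's output.
lam0, iters, converged = solve_by_shooting(x0=1.0, xf=0.5, lam_guess=0.0)
```

This toy problem is linear, so the terminal miss is affine in `lam0` and the secant step lands on the root almost immediately from any guess. The paper's concern is the nonlinear and hypersensitive cases, where the same loop can diverge unless `lam_guess` is already close.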

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The supervised route could be combined with adaptive retraining when plant parameters drift.
  • High-dimensional state spaces would require checking whether network size must grow with state dimension to maintain guess accuracy.
  • The reinforcement-learning formulation might be extended by adding a penalty on guess magnitude to encourage simpler initial conditions.

Load-bearing premise

A neural network trained on example solutions or rewards will output adjoint initial guesses close enough to the true values that the chosen TPBVP solver converges on boundary conditions outside the training set.

What would settle it

Generate a held-out test set of boundary conditions, feed each through the trained network to obtain an adjoint initial guess, then run the TPBVP solver and record whether it fails to converge or returns trajectories that violate the original optimality conditions.
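The test described above amounts to a small evaluation harness. The sketch below is one possible shape for it, with loud caveats: `toy_solver` is a hypothetical stand-in (Newton's method on an arctangent residual, a textbook case that converges only from nearby guesses), and the "true" adjoint `2.0 * bc` is made up. A real study would substitute the actual TPBVP solver, the trained network, and a check of the optimality conditions.

```python
import math

def toy_solver(bc, lam_guess, tol=1e-10, max_iter=25):
    """Hypothetical stand-in for a TPBVP solver.

    Runs Newton's method on r(lam) = atan(lam - lam_star(bc)), which
    converges only when the guess starts close enough to the root.
    Returns (lam, converged).
    """
    lam_star = 2.0 * bc                  # made-up "true" initial adjoint
    lam = lam_guess
    for _ in range(max_iter):
        d = lam - lam_star
        r = math.atan(d)
        if abs(r) < tol:
            return lam, True
        if abs(d) > 1e6:                 # iterates are blowing up
            return lam, False
        lam -= r * (1.0 + d * d)         # Newton step, since r'(lam) = 1/(1 + d^2)
    return lam, False

def convergence_rate(guess_fn, solver, test_bcs):
    """Fraction of held-out boundary conditions on which the solver converges."""
    hits = sum(1 for bc in test_bcs if solver(bc, guess_fn(bc))[1])
    return hits / len(test_bcs)

held_out = [-2.0, -1.0, 0.0, 1.0, 2.0]

# A guess function with a small systematic error versus a constant zero guess.
rate_good = convergence_rate(lambda bc: 2.0 * bc + 0.5, toy_solver, held_out)
rate_poor = convergence_rate(lambda bc: 0.0, toy_solver, held_out)
```

Reporting `convergence_rate` for the trained network next to a naive baseline such as the constant guess is the kind of quantitative evidence the referee comments below ask for.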

Figures

Figures reproduced from arXiv: 1907.10097 by Ping Lu, Ran Dai, Sixiong You.

Figure 1: The network architecture of SLH. However, there is a set of optimal control problems with very sensitive adjoint variables. Very small changes in the adjoint variables, e.g., of magnitude less than 10^-4, may lead to significant differences in the final solution. These optimal control problems are named hypersensitive HJBs, where the dynamics exhibit fast contraction and expansion along the time interval [20]. C…
Figure 2: An example of completely hypersensitive HJBs
Figure 3: The basic composition of reinforcement learning
Figure 4: The framework of DDPG. According to the different functions of the networks, we define them as
    local actor network: a = π(s; θ_l)    (7)
    local critic network: R_t = Q(s; a; ω_l)
    target actor network: a = π(s; θ_t)
    target critic network: R_t = Q(s; a; ω_t)
where π represents the policy for selecting actions and Q represents the estimated Q value of different states and actions. In addition, θ_l, θ_t, ω_l, ω_t represent the pa…
Figure 5: The results of SLH for the Brachistochrone problem
Figure 6: The results of RLH for the Brachistochrone problem
Figure 7: The results of SLH for the hypersensitive problem
Original abstract

Many optimal control problems are formulated as two point boundary value problems (TPBVPs) with conditions of optimality derived from the Hamilton-Jacobi-Bellman (HJB) equations. In most cases, it is challenging to solve HJBs due to the difficulty of guessing the adjoint variables. This paper proposes two learning-based approaches to find the initial guess of adjoint variables in real-time, which can be applied to solve general TPBVPs. For cases with database of solutions and corresponding adjoint variables of a TPBVP under varying boundary conditions, a supervised learning method is applied to learn the HJB solutions off-line. After obtaining a trained neural network from supervised learning, we are able to find proper initial adjoint variables for given boundary conditions in real-time. However, when validated solutions of TPBVPs are not available, the reinforcement learning method is applied to solve HJB by constructing a neural network, defining a reward function, and setting appropriate super parameters. The reinforcement learning based HJB method can learn how to find accurate adjoint variables via an updating neural network. Finally, both learning approaches are implemented in classical optimal control problems to verify the effectiveness of the learning based HJB methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes two learning-based methods to generate initial adjoint variable guesses for two-point boundary value problems (TPBVPs) arising from Hamilton-Jacobi-Bellman (HJB) optimality conditions in optimal control. When a database of prior solutions exists, a supervised neural network is trained offline to map boundary conditions to adjoint initials; when no such database is available, a reinforcement learning approach constructs a network, defines a reward, and updates parameters to learn accurate adjoints. Both are stated to have been implemented on classical optimal control problems to verify real-time effectiveness.

Significance. If the central claim holds with quantitative support, the work would address a long-standing practical bottleneck in indirect optimal control methods by replacing manual or heuristic adjoint initialization with learned real-time guesses, potentially broadening the applicability of HJB-derived TPBVP solvers. The distinction between supervised and RL regimes is a reasonable organizing principle, and the absence of free parameters or invented entities in the high-level description is a minor positive.

major comments (2)
  1. [Abstract] Abstract: the assertion that the methods were 'implemented in classical optimal control problems to verify the effectiveness' is load-bearing for the real-time claim yet supplies no quantitative results (success rates on unseen boundary conditions, adjoint approximation errors, TPBVP convergence statistics, or baseline comparisons). Without these, the generalization performance required by the central claim cannot be assessed.
  2. [Abstract] Abstract: the RL formulation is described only at the level of 'defining a reward function' and 'setting appropriate super parameters,' with no explicit statement of the reward, network architecture, or convergence criterion that would guarantee the HJB residual is driven to zero for out-of-distribution boundary conditions. This omission directly affects reproducibility and the soundness of the 'learning how to find accurate adjoint variables' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion that the methods were 'implemented in classical optimal control problems to verify the effectiveness' is load-bearing for the real-time claim yet supplies no quantitative results (success rates on unseen boundary conditions, adjoint approximation errors, TPBVP convergence statistics, or baseline comparisons). Without these, the generalization performance required by the central claim cannot be assessed.

    Authors: We agree that the abstract would be strengthened by including quantitative metrics. In the revised manuscript we will add specific results (success rates on unseen boundary conditions, adjoint approximation errors, TPBVP convergence statistics, and baseline comparisons) to the abstract to support the real-time effectiveness claim.
    Revision: yes

  2. Referee: [Abstract] Abstract: the RL formulation is described only at the level of 'defining a reward function' and 'setting appropriate super parameters,' with no explicit statement of the reward, network architecture, or convergence criterion that would guarantee the HJB residual is driven to zero for out-of-distribution boundary conditions. This omission directly affects reproducibility and the soundness of the 'learning how to find accurate adjoint variables' claim.

    Authors: We acknowledge that the abstract's description of the RL approach is high-level. In the revision we will expand the abstract to state the reward function, network architecture, and convergence criterion explicitly, thereby improving reproducibility and supporting the claim that the method learns accurate adjoint variables.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: learning methods rely on external data/rewards without self-referential reduction

Full rationale

The paper describes two standard learning procedures (supervised NN training on a database of TPBVP solutions, or RL with a defined reward) to generate initial adjoint guesses for HJB-derived TPBVPs. No equations, derivations, or fitted-parameter predictions appear in the abstract or description. No self-citations are invoked as load-bearing uniqueness results. The approach is self-contained as an empirical ML technique whose validity rests on external verification in classical problems rather than any reduction of outputs to the method's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed from abstract only; no equations, datasets, or implementation details are available to enumerate free parameters, axioms, or invented entities with precision.

axioms (1)
  • Domain assumption: neural networks can learn a sufficiently accurate mapping from boundary conditions to adjoint variables that enables TPBVP convergence.
    Implicit in both the supervised and reinforcement learning proposals described in the abstract.

pith-pipeline@v0.9.0 · 5735 in / 1228 out tokens · 20160 ms · 2026-05-24T17:05:14.998759+00:00 · methodology


Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 4 internal anchors

  1. [1]

    A. E. Bryson, Applied optimal control: optimization, estimation and control. Routledge, 2018

  2. [2]

    Survey of numerical methods for trajectory optimization,

    J. T. Betts, “Survey of numerical methods for trajectory optimization,” Journal of Guidance, Control, and Dynamics, vol. 21, no. 2, pp. 193–207, 1998

  3. [3]

    Direct trajectory optimization by a Chebyshev pseudospectral method,

    F. Fahroo and I. M. Ross, “Direct trajectory optimization by a Chebyshev pseudospectral method,” Journal of Guidance, Control, and Dynamics, vol. 25, no. 1, pp. 160–166, 2002

  4. [4]

    Nonlinear programming,

    D. P. Bertsekas, “Nonlinear programming,” Journal of the Operational Research Society, vol. 48, no. 3, pp. 334–334, 1997

  5. [5]

    Galerkin approximations of the generalized Hamilton-Jacobi-Bellman equation,

    R. W. Beard, G. N. Saridis, and J. T. Wen, “Galerkin approximations of the generalized Hamilton-Jacobi-Bellman equation,” Automatica, vol. 33, no. 12, pp. 2159–2177, 1997

  6. [6]

    An approximation theory of optimal control for trainable manipulators,

    C.-S. G. L. George N. Saridis, “An approximation theory of optimal control for trainable manipulators,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 3, pp. 152–159, 1979

  7. [7]

    Approximate solutions to the time-invariant Hamilton–Jacobi–Bellman equation,

    S.-G. N. Beard, R. W. and J. T. Wen, “Approximate solutions to the time-invariant Hamilton–Jacobi–Bellman equation,” Journal of Optimization Theory and Applications, vol. 96, no. 3, pp. 589–626, Mar 1998

  8. [8]

    M. Evans and T. Swartz, Approximating integrals via Monte Carlo and deterministic methods. Cham, Switzerland: Springer, 2017

  9. [9]

    A direct multiple shooting method for real-time optimization of nonlinear dae processes,

    H. Bock, M. Diehl, D. Leineweber, and J. Schlöder, “A direct multiple shooting method for real-time optimization of nonlinear DAE processes,” in Nonlinear Model Predictive Control, 2000, pp. 245–267

  10. [10]

    Machine learning: a review of classification and combining techniques,

    S. B. Kotsiantis, I. D. Zaharakis, and P. E. Pintelas, “Machine learning: a review of classification and combining techniques,” Artificial Intelligence Review, vol. 26, no. 3, pp. 159–190, Nov 2006

  11. [11]

    Efficient machine learning for big data: A review,

    O. Y. Al-Jarrah, P. D. Yoo, S. Muhaidat, G. K. Karagiannidis, and K. Taha, “Efficient machine learning for big data: A review,” Big Data Research, vol. 2, no. 3, pp. 87–93, 2015, Big Data, Analytics, and High-Performance Computing

  12. [12]

    Representation learning: A review and new perspectives,

    Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, Aug 2013

  13. [13]

    Supervised machine learning: A review of classification techniques,

    Z.-I. . P. P. Kotsiantis, S. B., “Supervised machine learning: A review of classification techniques,” Emerging artificial intelligence applications in computer engineering, vol. 160, no. 0, pp. 3–24, 2007

  14. [14]

    Representation Learning: A Review and New Perspectives

    A. C. Yoshua Bengio and P. Vincent, “Unsupervised feature learning and deep learning: A review and new perspectives,” arXiv:1206.5538v1, vol. 0, no. 0, pp. 1–30, 2012

  15. [15]

    Optimal and autonomous control using reinforcement learning: A survey,

    B. Kiumarsi, K. G. Vamvoudakis, H. Modares, and F. L. Lewis, “Optimal and autonomous control using reinforcement learning: A survey,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 6, pp. 2042–2062, 2018

  16. [16]

    T. J. Böhme and B. Frank, Hybrid Systems, Optimal Control and Hybrid Vehicles. Cham, Switzerland: Springer, 2017

  17. [17]

    Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof,

    A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, “Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 38, no. 4, pp. 943–949, 2008

  18. [18]

    Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics,

    B. Kiumarsi, F. L. Lewis, H. Modares, A. Karimpour, and M.-B. Naghibi-Sistani, “Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics,” Automatica, vol. 50, no. 4, pp. 1167–1175, 2014

  19. [19]

    Reinforcement learning solution for HJB equation arising in constrained optimal control problem,

    B. Luo, H.-N. Wu, T. Huang, and D. Liu, “Reinforcement learning solution for HJB equation arising in constrained optimal control problem,” Neural Networks, vol. 71, pp. 150–158, 2015

  20. [20]

    Manifold-following approximate solution of completely hypersensitive optimal control problems,

    E. Aykutlug, U. Topcu, and K. D. Mease, “Manifold-following approximate solution of completely hypersensitive optimal control problems,” Journal of Optimization Theory and Applications, vol. 170, no. 1, pp. 220–242, 2016

  21. [21]

    Approximate solution of hyper-sensitive optimal control problems using finite-time Lyapunov analysis,

    E. Aykutlug and K. D. Mease, “Approximate solution of hyper-sensitive optimal control problems using finite-time Lyapunov analysis,” in 2009 American Control Conference, 2009, pp. 1034–1039

  22. [22]

    M. Evans and T. Swartz, Introduction to reinforcement learning. Cambridge: MIT press, 1998

  23. [23]

    Q-learning,

    C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3, pp. 279–292, May 1992

  24. [24]

    G. A. Rummery and M. Niranjan, On-line Q-learning using connectionist systems. Cambridge, England: University of Cambridge, Department of Engineering, 1994, vol. 37

  25. [25]

    Human-level control through deep reinforcement learning,

    S. D. R. A. A. V. J. Mnih Volodymyr, Kavukcuoglu Koray, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 3, pp. 529–533, Feb 2015

  26. [26]

    Asynchronous Methods for Deep Reinforcement Learning

    M. M. A. G. T. P. L. T. H. D. S. K. K. Volodymyr Mnih, Adri Puigdomnech Badia, “Asynchronous methods for deep reinforcement learning,” arXiv:1602.01783, 2016. [Online]. Available: https://arxiv.org/abs/1602.01783

  27. [27]

    Continuous control with deep reinforcement learning

    A. P. N. H. T. E. Y. T. D. S. D. W. Timothy P. Lillicrap, Jonathan J. Hunt, “Continuous control with deep reinforcement learning,” arXiv:1509.02971, 2015

  28. [28]

    Proximal Policy Optimization Algorithms

    P. D. A. R. O. K. John Schulman, Filip Wolski, “Proximal policy optimization algorithms,” arXiv:1707.06347, 2017. [Online]. Available: https://arxiv.org/abs/1707.06347

  29. [29]

    Policy gradient methods for reinforcement learning with function approximation and action-dependent baselines,

    E. B. Philip S. Thomas, “Policy gradient methods for reinforcement learning with function approximation and action-dependent baselines,” Advances in neural information processing systems, pp. 1057–1063, 2000

  30. [30]

    Deterministic policy gradient algorithms,

    N. H. T. D. D. W. David Silver, Guy Lever and M. Riedmiller, “Deterministic policy gradient algorithms,” in Proceedings of the 31st International Conference on Machine Learning, vol. 32.

    [Figure panel captions spilled from page 14 of the source: (a) MSE between the output of neural network and tested data; (b) comparison of predicted output from neural network and the optimal point, the blue line represent...]