Learning-based Hamilton-Jacobi-Bellman Methods for Optimal Control

Ping Lu; Ran Dai; Sixiong You

arxiv: 1907.10097 · v1 · pith:VEDTJLTWnew · submitted 2019-07-23 · 🧮 math.OC

Pith reviewed 2026-05-24 17:05 UTC · model grok-4.3

keywords: Hamilton-Jacobi-Bellman · optimal control · two-point boundary value problems · supervised learning · reinforcement learning · adjoint variables · neural networks

The pith

Neural networks trained by supervised or reinforcement learning supply real-time initial adjoint guesses that let TPBVP solvers converge for HJB optimal control problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops two neural-network approaches to guess the initial adjoint variables required by two-point boundary value problems that arise from Hamilton-Jacobi-Bellman optimality conditions. When solved examples are available, supervised learning trains a network offline to map boundary conditions to those initial values. When no solved examples exist, reinforcement learning trains the network online by rewarding guesses that produce convergent trajectories. Both routes are shown to deliver guesses accurate enough for standard TPBVP solvers to succeed on new boundary conditions in real time.

Core claim

The authors claim that a neural network can be trained either by supervised learning on a database of boundary-condition and adjoint-initial pairs or by reinforcement learning that updates the network according to a reward measuring solver convergence, and that either trained network will output adjoint initials sufficient for a TPBVP solver to converge on unseen boundary conditions arising from HJB equations.

What carries the argument

Neural-network approximator that maps boundary conditions to initial adjoint variables for the TPBVP solver derived from the Hamilton-Jacobi-Bellman equation.
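To make the mapping concrete, here is a minimal sketch of the supervised route on a toy problem that is not taken from the paper: a double integrator (x' = v, v' = u) steered from (x0, v0) to the origin over t in [0, 1] while minimizing the integral of u²/2. With the Hamiltonian H = u²/2 + λ_x v + λ_v u, the initial adjoints have a closed form, which serves as the "database". A linear least-squares fit stands in for neural-network training; the names `true_initial_adjoints` and `guess_adjoints` are hypothetical.

```python
import numpy as np

def true_initial_adjoints(x0, v0):
    """Closed-form initial adjoints of the assumed toy problem:
    minimize 0.5*int_0^1 u^2 dt with x' = v, v' = u, from (x0, v0) to (0, 0)."""
    lam_x0 = 12.0 * x0 + 6.0 * v0
    lam_v0 = 6.0 * x0 + 4.0 * v0
    return np.array([lam_x0, lam_v0])

# Build a small "database" of boundary conditions and matching adjoint initials.
rng = np.random.default_rng(0)
bc_train = rng.uniform(-1.0, 1.0, size=(200, 2))
adj_train = np.array([true_initial_adjoints(x0, v0) for x0, v0 in bc_train])

# Linear least squares with a bias column: a stand-in for offline network training.
features = np.hstack([bc_train, np.ones((len(bc_train), 1))])
weights, *_ = np.linalg.lstsq(features, adj_train, rcond=None)

def guess_adjoints(x0, v0):
    """Map a boundary condition to an initial-adjoint guess (the learned model)."""
    return np.array([x0, v0, 1.0]) @ weights

pred = guess_adjoints(0.3, -0.7)
print(pred, true_initial_adjoints(0.3, -0.7))
```

The toy mapping is exactly linear, so a linear model recovers it. For the nonlinear problems the paper targets, a multilayer network would take the place of the least-squares fit.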

If this is right

Optimal control problems formulated as TPBVPs can be solved online without manual tuning of adjoint initials.
The reinforcement-learning route enables solution of HJB problems even when no precomputed solution database exists.
Classical shooting or collocation methods become practical for real-time use once the network supplies the starting guess.
The same trained networks can be applied to any new boundary conditions drawn from the same problem family.
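The reinforcement-learning route in the list above can be illustrated with a reward that scores an adjoint guess by how closely the resulting trajectory meets the terminal condition. The source does not state the reward, so the one below (negative mean terminal miss) is an assumption. The update rule is plain random-search hill climbing, swapped in for the actor-critic training the paper uses, and the toy double-integrator problem (x' = v, v' = u, target the origin at t = 1, cost u²/2) is also an assumption.

```python
import numpy as np

def terminal_state(x0, v0, lam_x0, lam_v0):
    """Closed-form (x(1), v(1)) for the assumed toy problem with u = -lam_v
    and lam_v(t) = lam_v0 - lam_x0 * t."""
    x1 = x0 + v0 - lam_v0 / 2.0 + lam_x0 / 6.0
    v1 = v0 - lam_v0 + lam_x0 / 2.0
    return np.array([x1, v1])

def reward(policy, batch):
    """Hypothetical reward: negative mean distance from the terminal target (0, 0)."""
    total = 0.0
    for x0, v0 in batch:
        lam_x0, lam_v0 = policy @ np.array([x0, v0])  # linear "actor"
        total += np.linalg.norm(terminal_state(x0, v0, lam_x0, lam_v0))
    return -total / len(batch)

rng = np.random.default_rng(1)
batch = rng.uniform(-1.0, 1.0, size=(32, 2))

policy = np.zeros((2, 2))            # start from an uninformed guess
initial_reward = reward(policy, batch)
best_reward = initial_reward
for _ in range(2000):
    candidate = policy + 0.3 * rng.standard_normal((2, 2))
    candidate_reward = reward(candidate, batch)
    if candidate_reward > best_reward:   # keep only improving perturbations
        policy, best_reward = candidate, candidate_reward

final_reward = best_reward
print(initial_reward, final_reward)
```

No solved examples are used anywhere in this loop: the only training signal is the reward computed from the guess itself, which is the property that distinguishes this route from the supervised one.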

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The supervised route could be combined with adaptive retraining when plant parameters drift.
High-dimensional state spaces would require checking whether network size must grow with state dimension to maintain guess accuracy.
The reinforcement-learning formulation might be extended by adding a penalty on guess magnitude to encourage simpler initial conditions.

Load-bearing premise

A neural network trained on example solutions or rewards will output adjoint initial guesses close enough to the true values that the chosen TPBVP solver converges on boundary conditions outside the training set.

What would settle it

Generate a held-out test set of boundary conditions, feed each through the trained network to obtain an adjoint initial guess, then run the TPBVP solver and record whether it fails to converge or returns trajectories that violate the original optimality conditions.
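A minimal sketch of such a held-out check follows, under stated assumptions: the toy double-integrator problem (x' = v, v' = u, cost u²/2, target the origin at t = 1), a single-shooting Newton solver with a finite-difference Jacobian, and a hypothetical `network_guess` that returns a deliberately imperfect guess in place of a trained network. None of these come from the paper.

```python
import numpy as np

def shoot(bc, lam0, steps=100):
    """Integrate states and adjoints with RK4 over [0, 1]; return terminal (x, v)."""
    def rhs(y):
        x, v, lam_x, lam_v = y
        u = -lam_v  # from dH/du = 0 with H = 0.5*u^2 + lam_x*v + lam_v*u
        return np.array([v, u, 0.0, -lam_x])
    y = np.array([bc[0], bc[1], lam0[0], lam0[1]], dtype=float)
    h = 1.0 / steps
    for _ in range(steps):
        k1 = rhs(y)
        k2 = rhs(y + 0.5 * h * k1)
        k3 = rhs(y + 0.5 * h * k2)
        k4 = rhs(y + h * k3)
        y = y + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return y[:2]

def solve_tpbvp(bc, lam0, tol=1e-8, max_iter=10, eps=1e-6):
    """Newton shooting on the terminal residual. Returns (converged, adjoints, iterations)."""
    lam = np.array(lam0, dtype=float)
    for iteration in range(max_iter):
        res = shoot(bc, lam)
        if np.linalg.norm(res) < tol:
            return True, lam, iteration
        jac = np.zeros((2, 2))
        for j in range(2):
            step = np.zeros(2)
            step[j] = eps
            jac[:, j] = (shoot(bc, lam + step) - res) / eps
        lam = lam - np.linalg.solve(jac, res)
    return bool(np.linalg.norm(shoot(bc, lam)) < tol), lam, max_iter

def network_guess(bc):
    """Stand-in for the trained network: 80% of the closed-form adjoints."""
    x0, v0 = bc
    return 0.8 * np.array([12.0 * x0 + 6.0 * v0, 6.0 * x0 + 4.0 * v0])

rng = np.random.default_rng(2)
held_out = rng.uniform(-2.0, 2.0, size=(50, 2))   # unseen boundary conditions
results = [solve_tpbvp(bc, network_guess(bc)) for bc in held_out]
success_rate = np.mean([converged for converged, _, _ in results])
print("convergence rate on held-out set:", success_rate)
```

Because this toy problem is linear, Newton's method converges from essentially any guess, so the success rate here says nothing about guess quality. The point is the shape of the harness; on a hypersensitive problem the same loop would expose failures.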

Figures

Figures reproduced from arXiv: 1907.10097 by Ping Lu, Ran Dai, Sixiong You.

**Figure 1.** The network architecture of SLH. Accompanying text: However, there is a set of optimal control problems with very sensitive adjoint variables. Very small changes of the adjoint variables, e.g., less than 10⁻⁴ in magnitude, may lead to significant differences in the final solution. These optimal control problems are named hypersensitive HJBs, where the dynamics exhibit fast contraction and expansion along the time interval [20]. C…

**Figure 2.** An example of completely hypersensitive HJBs.

**Figure 3.** The basic composition of reinforcement learning.

**Figure 4.** The framework of DDPG. Accompanying text: According to the different functions of the networks, we define them as

local actor network: $a = \pi(s; \theta_l)$
local critic network: $R_t = Q(s, a; \omega_l)$
target actor network: $a = \pi(s; \theta_t)$
target critic network: $R_t = Q(s, a; \omega_t)$ (7)

where $\pi$ represents the policy for selecting actions and $Q$ represents the estimated Q value of different states and actions. In addition, $\theta_l$, $\theta_t$, $\omega_l$, $\omega_t$ represent the pa…

**Figure 5.** The results of SLH for the Brachistochrone problem.

**Figure 6.** The results of RLH for the Brachistochrone problem.

**Figure 7.** The results of SLH for the hypersensitive problem.
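The Figure 4 caption distinguishes local and target copies of the actor and critic networks. In the standard DDPG algorithm, target parameters track the local ones through a slow exponential average. The sketch below shows that soft update on plain NumPy arrays; the function name `soft_update` and the rate `tau = 0.01` are illustrative assumptions, and the paper's own setting is not given in the text available here.

```python
import numpy as np

def soft_update(target_params, local_params, tau=0.01):
    """Return new target parameters: tau * local + (1 - tau) * target, per array."""
    return [tau * local + (1.0 - tau) * target
            for target, local in zip(target_params, local_params)]

# Toy parameter lists standing in for the weights of a target and a local network.
theta_target = [np.zeros((2, 2)), np.zeros(2)]
theta_local = [np.ones((2, 2)), np.ones(2)]

theta_target = soft_update(theta_target, theta_local)
print(theta_target[0])
```

A small `tau` keeps the targets used in the critic's bootstrapped estimate nearly stationary between updates, which is the usual motivation for keeping separate target networks.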

Original abstract

Many optimal control problems are formulated as two point boundary value problems (TPBVPs) with conditions of optimality derived from the Hamilton-Jacobi-Bellman (HJB) equations. In most cases, it is challenging to solve HJBs due to the difficulty of guessing the adjoint variables. This paper proposes two learning-based approaches to find the initial guess of adjoint variables in real-time, which can be applied to solve general TPBVPs. For cases with database of solutions and corresponding adjoint variables of a TPBVP under varying boundary conditions, a supervised learning method is applied to learn the HJB solutions off-line. After obtaining a trained neural network from supervised learning, we are able to find proper initial adjoint variables for given boundary conditions in real-time. However, when validated solutions of TPBVPs are not available, the reinforcement learning method is applied to solve HJB by constructing a neural network, defining a reward function, and setting appropriate super parameters. The reinforcement learning based HJB method can learn how to find accurate adjoint variables via an updating neural network. Finally, both learning approaches are implemented in classical optimal control problems to verify the effectiveness of the learning based HJB methods.
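For orientation, a generic form of the TPBVP the abstract refers to can be written as follows. This is a textbook sketch under assumptions not stated in the abstract: unconstrained control, fixed final time $t_f$, running cost $L$, dynamics $f$, and a fully prescribed terminal state $x_f$. All symbols are illustrative.

```latex
\begin{aligned}
H(x, u, \lambda, t) &= L(x, u, t) + \lambda^{\top} f(x, u, t), \\
\dot{x} &= \frac{\partial H}{\partial \lambda} = f(x, u, t), & x(t_0) &= x_0, \\
\dot{\lambda} &= -\frac{\partial H}{\partial x}, & x(t_f) &= x_f, \\
0 &= \frac{\partial H}{\partial u}.
\end{aligned}
```

The states have conditions at both ends while the adjoints $\lambda$ have none, so a shooting solver must guess $\lambda(t_0)$ and iterate until $x(t_f) = x_f$ is met. That guess is the quantity the learned networks are meant to supply.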

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper outlines two standard ML pipelines for guessing initial adjoints in HJB-derived TPBVPs but supplies no metrics or tests to show the guesses work on unseen cases.

The core idea is to train a neural net either from a database of solved TPBVPs (supervised) or via reinforcement learning when no database exists, then use the net to supply fast initial adjoint guesses so a shooting solver can converge in real time. The authors note the two cases and say both were tried on classical problems to check effectiveness. That framing is clear enough and directly targets a practical bottleneck in optimal control solvers. The description of when to use each method is straightforward and matches how practitioners already think about the problem.

The main weakness is the complete absence of numbers. No error on the adjoint guesses, no success rate on held-out boundary conditions, no comparison to existing guess heuristics, and no mention of network size or training details appear in the abstract. The stress-test concern lands: without evidence that the learned guesses let the solver converge reliably outside the training distribution, the real-time claim stays untested.

This work would mainly interest control engineers already running TPBVP solvers who want to try swapping in a learned initializer. A reader looking for a method with demonstrated accuracy or generalization will not find it here. The paper does not reach the threshold for serious peer review in its current state; the evidence gap is too large for referees to evaluate the central claim.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes two learning-based methods to generate initial adjoint variable guesses for two-point boundary value problems (TPBVPs) arising from Hamilton-Jacobi-Bellman (HJB) optimality conditions in optimal control. When a database of prior solutions exists, a supervised neural network is trained offline to map boundary conditions to adjoint initials; when no such database is available, a reinforcement learning approach constructs a network, defines a reward, and updates parameters to learn accurate adjoints. Both are stated to have been implemented on classical optimal control problems to verify real-time effectiveness.

Significance. If the central claim holds with quantitative support, the work would address a long-standing practical bottleneck in indirect optimal control methods by replacing manual or heuristic adjoint initialization with learned real-time guesses, potentially broadening the applicability of HJB-derived TPBVP solvers. The distinction between supervised and RL regimes is a reasonable organizing principle, and the absence of free parameters or invented entities in the high-level description is a minor positive.

major comments (2)

[Abstract] The assertion that the methods were 'implemented in classical optimal control problems to verify the effectiveness' is load-bearing for the real-time claim yet supplies no quantitative results (success rates on unseen boundary conditions, adjoint approximation errors, TPBVP convergence statistics, or baseline comparisons). Without these, the generalization performance required by the central claim cannot be assessed.
[Abstract] The RL formulation is described only at the level of 'defining a reward function' and 'setting appropriate super parameters,' with no explicit statement of the reward, network architecture, or convergence criterion that would guarantee the HJB residual is driven to zero for out-of-distribution boundary conditions. This omission directly affects reproducibility and the soundness of the 'learning how to find accurate adjoint variables' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below.

Referee: [Abstract] The assertion that the methods were 'implemented in classical optimal control problems to verify the effectiveness' is load-bearing for the real-time claim yet supplies no quantitative results (success rates on unseen boundary conditions, adjoint approximation errors, TPBVP convergence statistics, or baseline comparisons). Without these, the generalization performance required by the central claim cannot be assessed.

Authors: We agree that the abstract would be strengthened by including quantitative metrics. In the revised manuscript we will add specific results (success rates on unseen boundary conditions, adjoint approximation errors, TPBVP convergence statistics, and baseline comparisons) to the abstract to support the real-time effectiveness claim. revision: yes

Referee: [Abstract] The RL formulation is described only at the level of 'defining a reward function' and 'setting appropriate super parameters,' with no explicit statement of the reward, network architecture, or convergence criterion that would guarantee the HJB residual is driven to zero for out-of-distribution boundary conditions. This omission directly affects reproducibility and the soundness of the 'learning how to find accurate adjoint variables' claim.

Authors: We acknowledge the abstract description of the RL approach is high-level. In the revision we will expand the abstract to state the reward function, network architecture, and convergence criterion explicitly, thereby improving reproducibility and supporting the claim that the method learns accurate adjoint variables. revision: yes

Circularity Check

0 steps flagged

No circularity: learning methods rely on external data/rewards without self-referential reduction

The paper describes two standard learning procedures (supervised NN training on a database of TPBVP solutions, or RL with a defined reward) to generate initial adjoint guesses for HJB-derived TPBVPs. No equations, derivations, or fitted-parameter predictions appear in the abstract or description. No self-citations are invoked as load-bearing uniqueness results. The approach is self-contained as an empirical ML technique whose validity rests on external verification in classical problems rather than any reduction of outputs to the method's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed from abstract only; no equations, datasets, or implementation details are available to enumerate free parameters, axioms, or invented entities with precision.

axioms (1)

Domain assumption: Neural networks can learn a sufficiently accurate mapping from boundary conditions to adjoint variables that enables TPBVP convergence. This is implicit in both the supervised and reinforcement learning proposals described in the abstract.

pith-pipeline@v0.9.0 · 5735 in / 1228 out tokens · 20160 ms · 2026-05-24T17:05:14.998759+00:00 · methodology

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 4 internal anchors

[1] A. E. Bryson, Applied optimal control: optimization, estimation and control. Routledge, 2018.
[2] J. T. Betts, “Survey of numerical methods for trajectory optimization,” Journal of Guidance, Control, and Dynamics, vol. 21, no. 2, pp. 193–207, 1998.
[3] F. Fahroo and I. M. Ross, “Direct trajectory optimization by a Chebyshev pseudospectral method,” Journal of Guidance, Control, and Dynamics, vol. 25, no. 1, pp. 160–166, 2002.
[4] D. P. Bertsekas, “Nonlinear programming,” Journal of the Operational Research Society, vol. 48, no. 3, pp. 334–334, 1997.
[5] R. W. Beard, G. N. Saridis, and J. T. Wen, “Galerkin approximations of the generalized Hamilton-Jacobi-Bellman equation,” Automatica, vol. 33, no. 12, pp. 2159–2177, 1997.
[6] C.-S. G. L. George N. Saridis, “An approximation theory of optimal control for trainable manipulators,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 3, pp. 152–159, 1979.
[7] S.-G. N. Beard, R. W. and J. T. Wen, “Approximate solutions to the time-invariant Hamilton–Jacobi–Bellman equation,” Journal of Optimization Theory and Applications, vol. 96, no. 3, pp. 589–626, Mar 1998.
[8] M. Evans and T. Swartz, Approximating integrals via Monte Carlo and deterministic methods. Cham, Switzerland: Springer, 2017.
[9] H. Bock, M. Diehl, D. Leineweber, and J. Schlöder, “A direct multiple shooting method for real-time optimization of nonlinear DAE processes,” in Nonlinear Model Predictive Control, 2000, pp. 245–267.
[10] S. B. Kotsiantis, I. D. Zaharakis, and P. E. Pintelas, “Machine learning: a review of classification and combining techniques,” Artificial Intelligence Review, vol. 26, no. 3, pp. 159–190, Nov 2006.
[11] O. Y. Al-Jarrah, P. D. Yoo, S. Muhaidat, G. K. Karagiannidis, and K. Taha, “Efficient machine learning for big data: A review,” Big Data Research, vol. 2, no. 3, pp. 87–93, 2015, Big Data, Analytics, and High-Performance Computing.
[12] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, Aug 2013.
[13] Z.-I. . P. P. Kotsiantis, S. B., “Supervised machine learning: A review of classification techniques,” Emerging Artificial Intelligence Applications in Computer Engineering, vol. 160, no. 0, pp. 3–24, 2007.
[14] A. C. Yoshua Bengio and P. Vincent, “Unsupervised feature learning and deep learning: A review and new perspectives,” arXiv:1206.5538v1, vol. 0, no. 0, pp. 1–30, 2012.
[15] B. Kiumarsi, K. G. Vamvoudakis, H. Modares, and F. L. Lewis, “Optimal and autonomous control using reinforcement learning: A survey,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 6, pp. 2042–2062, 2018.
[16] T. J. Böhme and B. Frank, Hybrid Systems, Optimal Control and Hybrid Vehicles. Cham, Switzerland: Springer, 2017.
[17] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, “Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 38, no. 4, pp. 943–949, 2008.
[18] B. Kiumarsi, F. L. Lewis, H. Modares, A. Karimpour, and M.-B. Naghibi-Sistani, “Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics,” Automatica, vol. 50, no. 4, pp. 1167–1175, 2014.
[19] B. Luo, H.-N. Wu, T. Huang, and D. Liu, “Reinforcement learning solution for HJB equation arising in constrained optimal control problem,” Neural Networks, vol. 71, pp. 150–158, 2015.
[20] E. Aykutlug, U. Topcu, and K. D. Mease, “Manifold-following approximate solution of completely hypersensitive optimal control problems,” Journal of Optimization Theory and Applications, vol. 170, no. 1, pp. 220–242, 2016.
[21] E. Aykutlug and K. D. Mease, “Approximate solution of hyper-sensitive optimal control problems using finite-time Lyapunov analysis,” in 2009 American Control Conference, 2009, pp. 1034–1039.
[22] M. Evans and T. Swartz, Introduction to reinforcement learning. Cambridge: MIT Press, 1998.
[23] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3, pp. 279–292, May 1992.
[24] G. A. Rummery and M. Niranjan, On-line Q-learning using connectionist systems. Cambridge, England: University of Cambridge, Department of Engineering, 1994, vol. 37.
[25] S. D. R. A. A. V. J. Mnih Volodymyr, Kavukcuoglu Koray, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 3, pp. 529–533, Feb 2015.
[26] M. M. A. G. T. P. L. T. H. D. S. K. K. Volodymyr Mnih, Adri Puigdomnech Badia, “Asynchronous methods for deep reinforcement learning,” arXiv:1602.01783, 2016. [Online]. Available: https://arxiv.org/abs/1602.01783
[27] A. P. N. H. T. E. Y. T. D. S. D. W. Timothy P. Lillicrap, Jonathan J. Hunt, “Continuous control with deep reinforcement learning,” arXiv:1509.02971, 2015.
[28] P. D. A. R. O. K. John Schulman, Filip Wolski, “Proximal policy optimization algorithms,” arXiv:1707.06347, 2017. [Online]. Available: https://arxiv.org/abs/1707.06347
[29] E. B. Philip S. Thomas, “Policy gradient methods for reinforcement learning with function approximation and action-dependent baselines,” Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.
[30] N. H. T. D. D. W. David Silver, Guy Lever and M. Riedmiller, “Deterministic policy gradient algorithms,” in Proceedings of the 31st International Conference on Machine Learning, vol. 32.
