Robust On-Line ADP-based Solution of a Class of Hierarchical Nonlinear Differential Game
Pith reviewed 2026-05-24 15:35 UTC · model grok-4.3
The pith
An ADP algorithm solves a class of hierarchical nonlinear differential games while cutting the number of neural networks used for estimation by about thirty percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The proposed ADP method achieves optimal control strategies under the worst-case disturbance for the hierarchical one-leader-multi-followers game by integrating zero-sum and nonzero-sum game models, while reducing the number of neural networks used for estimation by about thirty percent, with convergence guaranteed via Lyapunov analysis and Nemytskii operator properties.
What carries the argument
A policy-iteration reinforcement learning scheme, run inside adaptive dynamic programming, that jointly estimates value functions, control policies, and disturbances using a reduced number of neural networks.
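To make the evaluate-then-improve loop concrete, here is a minimal sketch of policy iteration on a scalar linear-quadratic zero-sum game. It is not the paper's algorithm: there are no neural networks and no leader-follower hierarchy, and the dynamics, cost weights, and the names `policy_iteration` and `riccati_residual` are all assumptions chosen for illustration. The value function is taken as V(x) = p*x^2, with control u = -K*x and disturbance d = L*x.

```python
def policy_iteration(a, b, k, q, r, gamma, iterations=20):
    """Simultaneous policy iteration for a scalar zero-sum LQ game (toy example).

    Assumed dynamics: dx/dt = a*x + b*u + k*d
    Assumed cost:     integral of q*x^2 + r*u^2 - gamma^2*d^2
    Returns p such that the value function is V(x) = p*x^2.
    """
    control_gain = 0.0      # K in u = -K*x (minimizing player)
    disturbance_gain = 0.0  # L in d = L*x (maximizing player)
    p = 0.0
    for _ in range(iterations):
        closed_loop = a - b * control_gain + k * disturbance_gain
        if closed_loop >= 0:
            raise ValueError("current policies are not stabilizing")
        # Policy evaluation: solve 2*closed_loop*p + stage_weight = 0 for p.
        stage_weight = q + r * control_gain**2 - gamma**2 * disturbance_gain**2
        p = -stage_weight / (2.0 * closed_loop)
        # Policy improvement for both players from the current value estimate.
        control_gain = b * p / r
        disturbance_gain = k * p / gamma**2
    return p


def riccati_residual(p, a, b, k, q, r, gamma):
    """Residual of the scalar game Riccati equation; zero at the saddle point."""
    return 2.0 * a * p + q - p**2 * (b**2 / r - k**2 / gamma**2)


if __name__ == "__main__":
    params = dict(a=-1.0, b=1.0, k=1.0, q=1.0, r=1.0, gamma=2.0)
    p_star = policy_iteration(**params)
    print("p =", p_star, "residual =", riccati_residual(p_star, **params))
```

In this scalar case the simultaneous update coincides with Newton's method on the Riccati equation, which is why a handful of iterations suffices; the nonlinear, neural-network setting of the paper needs the separate convergence argument discussed in the referee report.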
If this is right
- The method yields robust optimal control for continuous-time nonlinear systems under disturbances in a hierarchical setting.
- Convergence of the neural-network estimates is assured by Lyapunov theory and Nemytskii operator properties.
- Both zero-sum and nonzero-sum aspects are handled inside a single algorithm.
- The procedure runs online and requires no prior offline solution of the game.
- Simulation examples confirm that the reduced network count still produces effective control policies.
Where Pith is reading between the lines
- If the thirty-percent network reduction scales with system size, the method could lower real-time computational load in embedded controllers.
- Applying the same structure to discrete-time or partially observed systems would test whether the mixed-game formulation remains tractable.
- The coexistence of competitive and cooperative elements suggests the algorithm could address other multi-agent problems that contain both adversarial and shared objectives.
- Hardware experiments on physical plants would reveal whether the theoretical guarantees survive sensor noise and actuator limits.
Load-bearing premise
The hierarchical game can be modeled as a simultaneous combination of zero-sum and nonzero-sum games for continuous-time nonlinear systems, with convergence following from Lyapunov theory and Nemytskii operator properties.
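One plausible way to write this premise down is sketched below. It is a generic form consistent with the abstract's description, not the paper's own equations: the symbols f, g_i, k, Q_i, R_ij, and gamma_i are assumptions, index 0 denotes the leader, and indices 1 to N denote the followers.

```latex
% Illustrative form only; f, g_i, k, Q_i, R_{ij}, \gamma_i are assumed symbols.
\begin{align*}
\dot{x} &= f(x) + \sum_{i=0}^{N} g_i(x)\,u_i + k(x)\,d, \\
J_i(x_0) &= \int_0^{\infty} \Big( Q_i(x) + \sum_{j=0}^{N} u_j^{\top} R_{ij}\,u_j
            - \gamma_i^{2}\,\|d\|^{2} \Big)\,dt, \qquad i = 0,\dots,N, \\
V_i^{*}(x_0) &= \min_{u_i}\,\max_{d}\; J_i(x_0).
\end{align*}
```

Under this reading, the min-max over the pair (u_i, d), taken with the other players' policies held fixed, is the zero-sum part, while the fact that each player carries its own cost J_i is the nonzero-sum part. The hierarchy would enter through the order of play, with the leader committing to u_0 before the followers respond.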
What would settle it
A concrete simulation of a nonlinear system in which the algorithm either fails to reach the claimed optimal strategies or fails to deliver the stated reduction of about thirty percent in the number of neural networks.
Original abstract
In this paper, a hierarchical one-leader-multi-followers game for a class of continuous-time nonlinear systems with disturbance is investigated by a novel policy iteration reinforcement learning technique in which the game model consists of both zero-sum and nonzero-sum games simultaneously. An adaptive dynamic programming (ADP) method is developed to achieve the optimal control strategy under the worst case of disturbance. This algorithm reduces the number of neural networks used for estimation by about thirty percent. The proposed algorithm uses neural networks to estimate value functions, control policies, and disturbances. Convergence analysis of the estimations is investigated using Lyapunov theory and exploiting properties of the Nemytskii operator. Finally, the simulation results will show effectiveness of the developed ADP method.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an ADP-based policy iteration algorithm for a hierarchical one-leader-multi-followers differential game on continuous-time nonlinear systems subject to disturbances. The model simultaneously incorporates zero-sum and nonzero-sum game elements; neural networks approximate value functions, policies, and disturbances. The method claims an approximately 30% reduction in the number of networks while achieving optimal strategies under worst-case disturbance. Convergence of the estimates is asserted via Lyapunov analysis combined with properties of the Nemytskii operator, and effectiveness is illustrated by simulation.
Significance. If the convergence argument can be completed with the required operator conditions, the work would provide a concrete reduction in approximator count for mixed game problems, which is a practical contribution to ADP methods for hierarchical control. The explicit handling of both game types within a single ADP framework is a distinguishing feature that could influence subsequent research on multi-agent differential games.
major comments (2)
- [Convergence analysis] Convergence analysis section: the proof invokes Nemytskii operator properties on the estimation errors/value-function mappings but records no explicit verification that the neural-network approximators satisfy the Carathéodory conditions (measurability in t, continuity in the state variable) or the requisite growth bounds under the combined zero-sum/nonzero-sum costs and worst-case disturbance. These conditions are load-bearing for the operator to be well-defined and for the Lyapunov argument to close.
- [§3 and abstract] §3 (algorithm description) and abstract: the claim that the algorithm 'reduces the number of neural networks … by about thirty percent' is stated without a tabulated baseline count of networks required by a standard ADP treatment of the same hierarchical game or an explicit accounting of which estimators are eliminated while still covering value functions, policies, and disturbances.
minor comments (2)
- [Abstract] Abstract: the phrase 'the simulation results will show effectiveness' should be changed to present tense.
- Notation: the hierarchical structure (leader vs. followers, zero-sum vs. nonzero-sum subgames) would benefit from an explicit diagram or a compact equation block that distinguishes the cost functionals.
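For readers unfamiliar with the conditions named in the first major comment, the block below records the standard textbook statement of the Carathéodory and growth conditions under which a Nemytskii (superposition) operator is well-defined between Lebesgue spaces. The symbols are generic and are not taken from the paper; how the paper's approximators and cost terms map onto f is exactly what the referee asks the authors to verify.

```latex
% Standard textbook statement; \Omega, n, m, p, q, a, b are generic symbols.
Let $\Omega \subset \mathbb{R}$ be measurable and
$f : \Omega \times \mathbb{R}^n \to \mathbb{R}^m$.
\begin{enumerate}
  \item (Measurability) $t \mapsto f(t,x)$ is measurable for every $x \in \mathbb{R}^n$.
  \item (Continuity) $x \mapsto f(t,x)$ is continuous for almost every $t \in \Omega$.
  \item (Growth) $|f(t,x)| \le a(t) + b\,|x|^{p/q}$ for some $a \in L^q(\Omega)$,
        $b \ge 0$, and $1 \le p, q < \infty$.
\end{enumerate}
Under (1)--(3), the Nemytskii operator
\[
  (N_f u)(t) = f\big(t, u(t)\big)
\]
maps $L^p(\Omega;\mathbb{R}^n)$ into $L^q(\Omega;\mathbb{R}^m)$ and is continuous and bounded.
```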
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The two major comments identify areas where the manuscript can be strengthened with additional rigor and clarity. We address each point below and will incorporate the suggested revisions in the next version.
- Referee: Convergence analysis section: the proof invokes Nemytskii operator properties on the estimation errors/value-function mappings but records no explicit verification that the neural-network approximators satisfy the Carathéodory conditions (measurability in t, continuity in the state variable) or the requisite growth bounds under the combined zero-sum/nonzero-sum costs and worst-case disturbance. These conditions are load-bearing for the operator to be well-defined and for the Lyapunov argument to close.
  Authors: We agree that the convergence section would benefit from an explicit verification step. The neural-network approximators are constructed as continuous functions of the state (standard radial-basis or polynomial forms) and the time dependence enters only through the measurable disturbance and control signals, satisfying Carathéodory conditions by construction. Growth bounds follow from the quadratic cost structure and the boundedness assumptions already stated on the disturbance set. In the revised manuscript we will insert a short lemma (or appendix paragraph) that records these verifications before invoking the Nemytskii operator, thereby closing the Lyapunov argument rigorously.
  revision: yes
- Referee: §3 (algorithm description) and abstract: the claim that the algorithm 'reduces the number of neural networks … by about thirty percent' is stated without a tabulated baseline count of networks required by a standard ADP treatment of the same hierarchical game or an explicit accounting of which estimators are eliminated while still covering value functions, policies, and disturbances.
  Authors: The 30% figure arises from replacing separate disturbance estimators for each follower with a single shared worst-case disturbance approximator that is reused across both the zero-sum leader-follower subgame and the nonzero-sum follower subgames. We acknowledge that the current text lacks an explicit side-by-side count. In the revision we will add a table in §3 that lists (i) the network count for a conventional ADP formulation of the identical hierarchical game and (ii) the reduced count achieved by our shared approximators, together with a brief accounting of which estimators are eliminated while still covering all value functions, policies, and disturbances.
  revision: yes
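The kind of side-by-side count the referee requests can be sketched as simple arithmetic. The accounting below is an assumption made for illustration, not the paper's tally: it supposes one value network and one policy network per player (one leader plus the followers), and compares one disturbance network per follower against a single shared disturbance network. The function names `network_counts` and `reduction` are hypothetical.

```python
def network_counts(num_followers: int) -> tuple[int, int]:
    """Return (baseline, shared) network counts under the assumed accounting.

    Assumption: every player (1 leader + num_followers followers) has one
    value network and one policy network. The baseline adds one disturbance
    network per follower; the shared variant adds a single disturbance network.
    """
    if num_followers < 1:
        raise ValueError("need at least one follower")
    players = num_followers + 1
    baseline = 2 * players + num_followers  # per-follower disturbance nets
    shared = 2 * players + 1                # one shared disturbance net
    return baseline, shared


def reduction(num_followers: int) -> float:
    """Fraction of networks saved by sharing the disturbance approximator."""
    baseline, shared = network_counts(num_followers)
    return (baseline - shared) / baseline


if __name__ == "__main__":
    for n in (2, 5, 16):
        baseline, shared = network_counts(n)
        print(f"followers={n}: baseline={baseline}, shared={shared}, "
              f"saved={reduction(n):.1%}")
```

Under this particular accounting the saving depends on the number of followers and stays below one third, which illustrates why an explicit table tied to the paper's actual network inventory matters for interpreting the thirty percent figure.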
Circularity Check
No significant circularity: the derivation relies on external Lyapunov and operator theory, with no self-referential reduction.
The paper's central claims rest on an ADP policy-iteration scheme for a mixed zero-sum/nonzero-sum hierarchical game, with convergence asserted via Lyapunov stability plus Nemytskii-operator properties on the estimation errors. No equations or steps in the provided text reduce a claimed prediction or uniqueness result to a fitted parameter or to a self-citation whose content is itself the target result. The 30% neural-network reduction is presented as an algorithmic outcome rather than a definitional identity, and the convergence argument invokes standard external theorems rather than an ansatz or renaming that collapses to the paper's own inputs. The derivation chain therefore remains self-contained against the cited mathematical machinery.