arXiv: 2507.04356 · v2 · submitted 2025-07-06 · 🧮 math.OC · cs.AI · cs.RO

Mission-Aligned Learning-Informed Control of Autonomous Systems: Formulation and Foundations

Pith reviewed 2026-05-19 06:29 UTC · model grok-4.3

classification 🧮 math.OC · cs.AI · cs.RO
keywords autonomous systems · two-level optimization · control · classical planning · reinforcement learning · robotic care · physical safety · interpretability

The pith

Autonomous systems achieve greater safety and interpretability by framing decisions as a two-level optimization scheme that combines control, classical planning, and learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates the control of autonomous physical agents, using a stylized robotic care scenario as an example, as a two-level optimization problem rather than a pure two-level reinforcement learning procedure. The lower level handles physical movement via control methods, while the higher level addresses conceptual tasks through classical planning, with learning integrated across both. The paper presents this structure as a way to gain insight into algorithm design and to deliver more efficient performance with improved physical safety and interpretability for users and regulators. A sympathetic reader would care because current autonomous systems often operate as opaque boxes, which raises legitimate concerns about reliability and oversight.

Core claim

We present the general formulation of mission-aligned control of autonomous systems as a two-level optimization scheme which incorporates control at the lower level and classical planning at the higher level, integrated with a capacity for learning. This synergistic integration of control, classical planning, and RL presents an opportunity for greater insight for algorithm development, leading to more efficient and reliable performance, where reliability pertains to physical safety and interpretability into an otherwise black-box operation.

What carries the argument

The two-level optimization scheme that places control for physical movements at the lower level and classical planning for tasks at the higher level while incorporating learning.
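
As a reading aid, a two-level scheme of this kind can be written generically as a bilevel program. The notation below is an assumption introduced for illustration and is not taken from the paper: $\pi$ is a symbolic plan drawn from a set $\Pi$, $u$ a control sequence, $x$ the physical state, $\theta$ the learned parameters, $J_{\mathrm{plan}}$ and $\ell_\theta$ the upper- and lower-level costs, $f_\theta$ the dynamics model, and $\mathcal{X}_{\mathrm{safe}}$ a safe set.

```latex
\begin{aligned}
\min_{\pi \in \Pi}\quad & J_{\mathrm{plan}}\bigl(\pi,\; u^{\star}(\pi;\theta)\bigr) \\
\text{s.t.}\quad & u^{\star}(\pi;\theta) \in \arg\min_{u_{0:N-1}} \; \sum_{k=0}^{N-1} \ell_\theta(x_k, u_k; \pi) \\
& \qquad\quad \text{s.t. } x_{k+1} = f_\theta(x_k, u_k), \quad x_k \in \mathcal{X}_{\mathrm{safe}}, \quad k = 0, \dots, N-1 .
\end{aligned}
```

In this sketch the upper level chooses a plan, the lower level solves a constrained control problem for that plan, and learning would enter by adjusting $\theta$ from data. The paper's own formulation may differ in every one of these choices.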

If this is right

  • The integration yields more efficient and reliable performance in autonomous physical agents.
  • Physical safety improves because decisions are structured rather than purely learned.
  • Interpretability increases, addressing user and regulator concerns about black-box behavior.
  • Greater insight becomes available for developing new algorithms that blend the three methodologies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-level structure could apply directly to the industrial robots, UAVs, and embedded devices listed in the introduction.
  • High-level planning might make regulatory verification of safety constraints simpler than in end-to-end learned policies.
  • The framework could support incremental deployment where only the planning layer is updated while the control layer remains fixed.
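
The incremental-deployment idea in the last bullet can be illustrated with a toy two-layer agent in which the planner is swapped while the controller is left untouched. This is a minimal sketch under stated assumptions, not the paper's architecture: the 1-D dynamics `x_{k+1} = x_k + u_k`, the proportional controller, and the names `TwoLevelAgent`, `direct_planner`, and `staged_planner` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

# All names here are hypothetical illustrations, not taken from the paper.
Plan = List[float]  # a plan is a sequence of 1-D waypoints (setpoints)


@dataclass
class TwoLevelAgent:
    planner: Callable[[float, float], Plan]  # upper level: (start, goal) -> waypoints
    gain: float = 0.5                         # lower level: proportional gain
    u_max: float = 1.0                        # lower level: actuator limit

    def control(self, x: float, target: float) -> float:
        """Lower level: saturated proportional control toward the current waypoint."""
        u = self.gain * (target - x)
        return max(-self.u_max, min(self.u_max, u))

    def run(self, start: float, goal: float, steps_per_waypoint: int = 20) -> float:
        """Track each waypoint in turn under the toy dynamics x_{k+1} = x_k + u_k."""
        x = start
        for target in self.planner(start, goal):
            for _ in range(steps_per_waypoint):
                x += self.control(x, target)
        return x


def direct_planner(start: float, goal: float) -> Plan:
    """Go straight to the goal."""
    return [goal]


def staged_planner(start: float, goal: float) -> Plan:
    """Visit the midpoint first, then the goal."""
    return [(start + goal) / 2.0, goal]
```

Replacing `agent.planner` with a different function changes the waypoint sequence without touching `control`, which is the separation the bullet describes. Whether a real system permits such a clean split is an open question that the sketch does not address.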

Load-bearing premise

That casting a stylized robotic care problem as a two-level optimization integrating control, planning, and learning will inherently produce better physical safety and interpretability than existing methods.

What would settle it

A side-by-side test of a robotic care task in simulation or on hardware that measures physical safety violations and decision transparency scores for the two-level optimization system versus a standard reinforcement learning policy.
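
The safety-violation half of such a comparison could be prototyped along the following lines. This is only a sketch under strong assumptions: the 1-D dynamics, the safe set `|x| <= X_LIMIT`, and both policies are hypothetical stand-ins (a noisy goal-seeking rule plays the role of a learned policy), and decision transparency is not modelled at all.

```python
import random
from typing import Callable

X_LIMIT = 1.0   # hypothetical safe set: |x| <= X_LIMIT
TOL = 1e-9      # tolerance for floating-point rounding at the boundary

Policy = Callable[[float, random.Random], float]


def count_violations(policy: Policy, episodes: int = 50,
                     horizon: int = 40, seed: int = 0) -> int:
    """Count time steps at which the state leaves the safe set."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(episodes):
        x = 0.0
        for _ in range(horizon):
            x = x + policy(x, rng)  # toy dynamics x_{k+1} = x_k + u_k
            if abs(x) > X_LIMIT + TOL:
                violations += 1
    return violations


def unstructured_policy(x: float, rng: random.Random) -> float:
    """Stand-in for a learned policy: seek x = 0.9 with exploration noise."""
    return 0.5 * (0.9 - x) + rng.uniform(-0.4, 0.4)


def structured_policy(x: float, rng: random.Random) -> float:
    """Same proposal, clamped so the next state stays inside the safe set."""
    u = unstructured_policy(x, rng)
    return max(-X_LIMIT - x, min(X_LIMIT - x, u))
```

Calling `count_violations(unstructured_policy)` and `count_violations(structured_policy)` with the same seed gives a like-for-like count. A meaningful study would replace both stand-ins with the actual two-level system and a trained reinforcement learning baseline.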

Figures

Figures reproduced from arXiv: 2507.04356 by Akhil Anand, Gustav Sir, Haozhe Tian, Homayoun Hamedmoghadam, Monicah Cherop Naibei, Sebastien Gros, Vyacheslav Kungurtsev.

Figure 1: A pictorial model of integrating the Symbolic Planning and Control.

Original abstract

Research, innovation and practical capital investment have been increasing rapidly toward the realization of autonomous physical agents. This includes industrial and service robots, unmanned aerial vehicles, embedded control devices, and a number of other realizations of cybernetic/mechatronic implementations of intelligent autonomous devices. In this paper, we consider a stylized version of robotic care, which would normally involve a two-level Reinforcement Learning procedure that trains a policy for both lower level physical movement decisions as well as higher level conceptual tasks and their sub-components. In order to deliver greater safety and reliability in the system, we present the general formulation of this as a two-level optimization scheme which incorporates control at the lower level, and classical planning at the higher level, integrated with a capacity for learning. This synergistic integration of multiple methodologies -- control, classical planning, and RL -- presents an opportunity for greater insight for algorithm development, leading to more efficient and reliable performance. Here, the notion of reliability pertains to physical safety and interpretability into an otherwise black box operation of autonomous agents, concerning users and regulators. This work presents the necessary background and general formulation of the optimization framework, detailing each component and its integration with the others.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes a general formulation for mission-aligned control of autonomous systems, using a stylized robotic care example. It frames the task as a two-level optimization scheme: lower-level control for physical movements, upper-level classical planning for conceptual tasks and sub-tasks, with an integrated learning (RL) capacity. The central claim is that this synergistic integration of control, planning, and RL yields greater physical safety, reliability, and interpretability than standard two-level RL or pure planning approaches.

Significance. If the integration can be equipped with explicit preservation mechanisms, the framework could offer a principled route to combine stability guarantees from control theory, mission constraints from planning, and adaptability from learning. The focus on interpretability for users and regulators addresses a practical gap in autonomous systems. As a foundational formulation paper without theorems, examples, or empirical validation, its significance rests on enabling subsequent rigorous developments rather than delivering immediate results.

major comments (1)
  1. [Abstract and general formulation of the two-level optimization scheme] The abstract and formulation claim that the two-level scheme delivers greater physical safety and reliability than existing approaches. However, no safety invariant, constraint qualification, or post-learning verification mechanism is stated that would ensure the learned lower-level policy respects upper-level mission constraints or control-level stability margins when the RL component is active. This link is load-bearing for the reliability assertion.
minor comments (1)
  1. [Abstract] The abstract refers to a 'stylized version of robotic care' without providing a concrete mathematical example, diagram, or small-scale instance that illustrates how the levels interact.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and insightful review. We appreciate the recognition of the framework's potential to combine stability guarantees, mission constraints, and adaptability, as well as the focus on interpretability. We address the major comment below and will revise the manuscript to strengthen the presentation of the formulation.

point-by-point responses
  1. Referee: The abstract and formulation claim that the two-level scheme delivers greater physical safety and reliability than existing approaches. However, no safety invariant, constraint qualification, or post-learning verification mechanism is stated that would ensure the learned lower-level policy respects upper-level mission constraints or control-level stability margins when the RL component is active. This link is load-bearing for the reliability assertion.

    Authors: We agree that the manuscript, as a foundational formulation, does not provide explicit safety invariants, constraint qualifications, or post-learning verification mechanisms. The abstract and introduction frame the two-level scheme as delivering an opportunity for greater physical safety and reliability through the synergistic integration of control (for stability margins), classical planning (for mission constraints), and RL (for adaptability), with reliability tied to both physical safety and interpretability. This is positioned as an improvement over standard two-level RL or pure planning by design, but we acknowledge the current text does not detail how the lower-level learned policy is guaranteed to respect upper-level constraints. In the revised manuscript we will add a new subsection under the formulation that outlines possible interfaces for preserving these properties, such as embedding Lyapunov-based stability or control barrier functions at the lower level and propagating temporal or logical constraints from the planner to bound RL actions. We will also revise the abstract and introduction to clarify that the framework is structured to enable such preservation mechanisms rather than claiming they are automatically delivered in the general formulation.

    revision: yes
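
To make the control-barrier-function interface mentioned in the rebuttal concrete, here is a textbook-style discrete-time barrier filter for a toy 1-D system. It is a sketch under stated assumptions rather than the authors' mechanism: the dynamics `x_{k+1} = x_k + dt * u_k`, the barrier `h(x) = x_max - x`, the decay rate `gamma`, and the function names are all hypothetical. The filter enforces `h(x_{k+1}) >= (1 - gamma) * h(x_k)`, which for this system reduces to the bound `u <= gamma * h(x) / dt`.

```python
from typing import List


def cbf_filter(x: float, u_rl: float, x_max: float = 1.0,
               gamma: float = 0.5, dt: float = 0.1) -> float:
    """Return the action closest to u_rl that satisfies the barrier condition.

    In one dimension the usual quadratic program (minimise (u - u_rl)^2
    subject to u <= gamma * h / dt) has this closed-form solution.
    """
    h = x_max - x               # barrier value; h >= 0 means "safe"
    u_bound = gamma * h / dt    # largest action allowed by the barrier condition
    return min(u_rl, u_bound)


def rollout(u_rl: float, steps: int = 100, x0: float = 0.0,
            dt: float = 0.1) -> List[float]:
    """Apply a constant proposed action through the filter and record states."""
    x = x0
    xs = [x]
    for _ in range(steps):
        u = cbf_filter(x, u_rl, dt=dt)
        x += dt * u
        xs.append(x)
    return xs
```

With an aggressive constant proposal such as `rollout(5.0)`, the state approaches `x_max` without crossing it, while proposals that already satisfy the bound pass through unchanged. Extending this to the learned lower-level policy and to planner-derived constraints is exactly the part the rebuttal leaves to the revised manuscript.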

Circularity Check

0 steps flagged

Formulation paper structures existing methodologies without self-referential reductions or fitted predictions

full rationale

The manuscript presents a general two-level optimization formulation that places classical planning at the upper level, control at the lower level, and an integrated learning capacity. No equations, predictions, or first-principles derivations are advanced that reduce by construction to the inputs; the text instead describes the components and their integration as an independent structuring of control, planning, and RL. No self-citations are invoked as load-bearing uniqueness theorems, no parameters are fitted and then renamed as predictions, and no ansatzes are smuggled via prior work. The claims of improved safety and interpretability are asserted as opportunities arising from the proposed architecture rather than results obtained through any circular chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on standard domain assumptions about hierarchical decomposition of autonomous tasks rather than new fitted parameters or invented entities; no specific numerical values or novel postulates are introduced in the abstract.

axioms (1)
  • domain assumption: Autonomous systems can be decomposed into a lower physical control level and a higher conceptual planning level.
    This decomposition is invoked as the basis for the two-level optimization scheme in the abstract.

pith-pipeline@v0.9.0 · 5769 in / 1252 out tokens · 59593 ms · 2026-05-19T06:29:47.668275+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
