arxiv: 2506.01665 · v4 · submitted 2025-06-02 · 💻 cs.LG · cs.AI · cs.RO

Leveraging Analytic Gradients in Provably Safe Reinforcement Learning

Pith reviewed 2026-05-19 11:01 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.RO
keywords provably safe reinforcement learning · analytic gradients · differentiable safeguards · control tasks · safety guarantees · gradient-based RL · differentiable simulation

The pith

Analytic gradient-based reinforcement learning can now use adapted differentiable safeguards to guarantee safety during training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops the first effective safeguard for analytic gradient-based reinforcement learning to provide safety guarantees in safety-critical applications such as autonomous robots. It analyzes existing differentiable safeguards and adapts them using modified mappings and gradient formulations before integrating the result into a state-of-the-art learning algorithm and a differentiable simulation. Numerical experiments on three control tasks show that the safeguarded training proceeds without compromising performance. This closes a gap that previously existed between sampling-based and analytic gradient-based safe reinforcement learning.

Core claim

By adapting existing differentiable safeguards through modified mappings and gradient formulations, it becomes possible to integrate them into analytic gradient-based reinforcement learning algorithms and differentiable simulators, yielding the first effective safeguard for this paradigm that preserves safety properties while maintaining learning performance on control tasks.

What carries the argument

Modified mappings and gradient formulations that adapt differentiable safeguards for analytic gradient-based training and differentiable simulation.
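
To make the phrase "gradient formulations" concrete, the sketch below shows one generic differentiable safeguard: a Euclidean projection onto a box-shaped safe action set, together with the diagonal of its Jacobian. It is an illustration only. The box-shaped set and all function names are assumptions, and the paper's actual mappings are not described on this page. The sketch shows that the projection's Jacobian is zero along every clipped coordinate, so a gradient backpropagated through it carries no signal in those directions. That is one plausible reason a safeguard's gradient might need adapting for analytic gradient-based training.

```python
def project_to_box(action, lower, upper):
    """Euclidean projection of an action onto the box [lower, upper]."""
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, lower, upper)]


def projection_jacobian_diag(action, lower, upper):
    """Diagonal of d(projected)/d(action): 1 strictly inside the box, 0 where clipped."""
    return [1.0 if lo < a < hi else 0.0 for a, lo, hi in zip(action, lower, upper)]


def backprop_through_projection(upstream_grad, action, lower, upper):
    """Chain rule: gradient w.r.t. the raw action, given the gradient w.r.t. the safe action."""
    diag = projection_jacobian_diag(action, lower, upper)
    return [g * d for g, d in zip(upstream_grad, diag)]


# Hypothetical example: the first coordinate violates the upper bound.
raw_action = [2.0, 0.3]
lower, upper = [-1.0, -1.0], [1.0, 1.0]

safe_action = project_to_box(raw_action, lower, upper)
raw_grad = backprop_through_projection([0.5, -0.25], raw_action, lower, upper)
```

Here the clipped coordinate receives a zero gradient while the unclipped one passes its gradient through unchanged.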

If this is right

  • Provably safe training becomes feasible for analytic gradient methods that learn from fewer environment interactions than sampling-based approaches.
  • The sim-to-real gap narrows because safety constraints are enforced already during the differentiable training phase.
  • State-of-the-art gradient-based algorithms can incorporate safety without requiring a separate post-training verification step.
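
For readers new to the term, an analytic gradient is obtained by differentiating the return through the simulator's dynamics rather than estimating it from sampled rollouts. The sketch below is a toy illustration, not the paper's algorithm: it assumes a hypothetical scalar system `x_next = x + dt * u` with a linear policy `u = -k * x`, propagates the derivative of the state with respect to the gain `k` step by step via the chain rule, and compares the result with a finite-difference estimate.

```python
def rollout_cost_and_gradient(k, x0=1.0, dt=0.1, steps=20):
    """Return the quadratic cost sum(x_t^2) and its exact derivative w.r.t. the gain k."""
    x, dx_dk = x0, 0.0
    cost, dcost_dk = 0.0, 0.0
    for _ in range(steps):
        # Differentiate x_next = (1 - dt * k) * x with respect to k (uses the old x).
        dx_dk = -dt * x + (1.0 - dt * k) * dx_dk
        x = (1.0 - dt * k) * x
        cost += x * x
        dcost_dk += 2.0 * x * dx_dk
    return cost, dcost_dk


def finite_difference_gradient(k, eps=1e-6):
    """Central finite-difference estimate of d(cost)/dk, used only as a cross-check."""
    cost_plus, _ = rollout_cost_and_gradient(k + eps)
    cost_minus, _ = rollout_cost_and_gradient(k - eps)
    return (cost_plus - cost_minus) / (2.0 * eps)


_, analytic_grad = rollout_cost_and_gradient(1.0)
numeric_grad = finite_difference_gradient(1.0)
```

A single rollout yields the exact gradient here, whereas the finite-difference check needs two additional rollouts per parameter. This gives a rough sense of why analytic gradients can require fewer environment interactions.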

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same adaptation pattern could be tested on other gradient-based control methods that rely on differentiable dynamics.
  • Physical robot experiments would directly test whether the reported safety carries over from the differentiable simulator to hardware.
  • The approach opens a route to combine analytic gradients with model-based safety filters in hybrid learning setups.

Load-bearing premise

The modified mappings and gradient formulations preserve the original safety properties of the differentiable safeguards when inserted into analytic gradient-based training and a differentiable simulator.

What would settle it

An experiment in which safeguarded analytic gradient-based training on one of the control tasks produces unsafe actions that violate the original safety constraints or shows markedly worse performance than the unguarded baseline would falsify the claim.
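
The falsification test described above can be sketched as a small harness that feeds random raw actions through a safeguard and counts outputs that leave the safe set. Everything here is hypothetical: the scalar interval safe set, the sampling range, and both safeguard functions are stand-ins chosen for illustration, and the paper's tasks and safeguards are not reproduced.

```python
import random


def clip_safeguard(action, lower, upper):
    """Reference safeguard: clip a scalar action into [lower, upper]."""
    return min(max(action, lower), upper)


def broken_safeguard(action, lower, upper):
    """Deliberately faulty safeguard that ignores the upper bound."""
    return max(action, lower)


def count_violations(safeguard, lower, upper, trials=1000, seed=0):
    """Count safeguarded actions that fall outside [lower, upper]."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        raw = rng.uniform(-10.0, 10.0)
        safe = safeguard(raw, lower, upper)
        if not (lower <= safe <= upper):
            violations += 1
    return violations


ok_count = count_violations(clip_safeguard, -1.0, 1.0)
bad_count = count_violations(broken_safeguard, -1.0, 1.0)
```

A nonzero count falsifies a safeguard, but a zero count over finitely many samples does not prove safety.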

Figures

Figures reproduced from arXiv: 2506.01665 by Hannah Markgraf, Jonathan Külz, Matthias Althoff, Tim Walter.

Figure 2. [image: figures/full_fig_p005_2.png]
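Figure 4. The zonotopic approach directly approximates the safe action set by inflating the generator lengths of a zonotope. The under-approximated zonotope $\mathcal{Z}_{\mathcal{A}_s}$ is the solution to

$$
\begin{aligned}
\max_{c_{\mathcal{A}_s},\, l_s} \quad & \sqrt[n]{\prod_{i=1}^{n} l_{s,i}} && \text{(23a)} \\
\text{subject to} \quad & \mathcal{Z}_{\mathcal{A}_s} = \langle c_{\mathcal{A}_s},\, G_{\mathcal{A}_s} \operatorname{diag}(l_s) \rangle && \text{(23b)} \\
& \mathcal{Z}_{\mathcal{A}_s} \subseteq \mathcal{A} && \text{(23c)} \\
& \mathcal{S}_{i+1}(\mathcal{Z}_{\mathcal{A}_s}, s_i) \subseteq \mathcal{S}_s && \text{(23d)}
\end{aligned}
$$

with $n$ uniformly sampled generator directions $G_{\mathcal{A}_s}$. Generally, the number of generators should be in the order of …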
Figure 5. [image: figures/full_fig_p009_5.png]
Figure 6. [image: figures/full_fig_p012_6.png]
Figure 7. [image: figures/full_fig_p017_7.png]
Figure 8. [image: figures/full_fig_p018_8.png]
Figure 10. [image: figures/full_fig_p019_10.png]
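
The objective (23a) and the containment constraint (23c) from the Figure 4 caption can be evaluated for a fixed candidate zonotope as in the sketch below. It rests on loud assumptions: the action set A is taken to be an axis-aligned box, so containment reduces to comparing the zonotope's interval hull with the box bounds. The reachability constraint (23d) is omitted, no optimization is performed, and all names are hypothetical.

```python
import math


def geometric_mean(lengths):
    """Objective (23a): the n-th root of the product of the generator lengths."""
    return math.prod(lengths) ** (1.0 / len(lengths))


def interval_hull(center, generators, lengths):
    """Tightest axis-aligned box around the zonotope <center, G diag(lengths)>.

    `generators` holds G row by row: one row per action dimension,
    one column per generator direction.
    """
    radii = [sum(abs(g) * l for g, l in zip(row, lengths)) for row in generators]
    lower = [c - r for c, r in zip(center, radii)]
    upper = [c + r for c, r in zip(center, radii)]
    return lower, upper


def zonotope_in_box(center, generators, lengths, box_lower, box_upper):
    """Constraint (23c) for a box-shaped action set (an assumption of this sketch)."""
    lower, upper = interval_hull(center, generators, lengths)
    return all(bl <= lo for bl, lo in zip(box_lower, lower)) and all(
        hi <= bu for hi, bu in zip(upper, box_upper)
    )


# Hypothetical 2-D example with three generator directions.
center = [0.0, 0.0]
generators = [[1.0, 0.0, 0.6],
              [0.0, 1.0, 0.8]]
small = [0.5, 0.5, 0.25]
large = [1.0, 1.0, 0.25]

small_fits = zonotope_in_box(center, generators, small, [-1.0, -1.0], [1.0, 1.0])
large_fits = zonotope_in_box(center, generators, large, [-1.0, -1.0], [1.0, 1.0])
objective = geometric_mean(small)
```

An optimizer would search over the center and lengths to maximize `geometric_mean` while keeping checks such as `zonotope_in_box` satisfied.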
Original abstract

The deployment of autonomous robots in safety-critical applications requires safety guarantees. Provably safe reinforcement learning is an active field of research that aims to provide such guarantees using safeguards. These safeguards should be integrated during training to reduce the sim-to-real gap. While there are several approaches for safeguarding sampling-based reinforcement learning, analytic gradient-based reinforcement learning often achieves superior performance from fewer environment interactions. However, there is no safeguarding approach for this learning paradigm yet. Our work addresses this gap by developing the first effective safeguard for analytic gradient-based reinforcement learning. We analyse existing, differentiable safeguards, adapt them through modified mappings and gradient formulations, and integrate them into a state-of-the-art learning algorithm and a differentiable simulation. Using numerical experiments on three control tasks, we evaluate how different safeguards affect learning. The results demonstrate safeguarded training without compromising performance. Additional visuals are provided at timwalter.github.io/safe-agb-rl.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper develops the first safeguard for analytic gradient-based reinforcement learning by analyzing existing differentiable safeguards, adapting them via modified mappings and gradient formulations, and integrating the result into a state-of-the-art learning algorithm inside a differentiable simulator. Numerical experiments on three control tasks show that the safeguarded training incurs no performance loss while avoiding obvious safety violations.

Significance. If the adapted safeguards retain their original safety certificates, the work would close a notable gap: analytic-gradient RL is more sample-efficient than sampling-based methods yet previously lacked provable-safety integration during training. The empirical demonstration across multiple tasks and the use of a differentiable simulator are practical strengths that could reduce the sim-to-real gap in safety-critical robotics.

major comments (2)
  1. [§4] §4 (Adaptation of differentiable safeguards): The central claim of 'provably safe' training rests on the assertion that the modified mappings and gradient formulations inherit the original safety properties. No explicit invariance argument, re-derivation, or proof sketch is supplied showing that key assumptions (monotonicity, Lipschitz bounds, or barrier-function forms) remain satisfied after the changes. The numerical results on three tasks report no violations but do not substitute for a formal guarantee.
  2. [§5] §5 (Integration into analytic-gradient algorithm): When the adapted safeguard is inserted into the analytic-gradient loop and differentiable simulator, it is unclear whether the safety certificate still holds at every policy update. The manuscript provides no theorem or lemma establishing that the combined system remains safe throughout training.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from a concise statement of the precise safety property (e.g., forward invariance of a safe set) that the adapted safeguard is claimed to enforce.
  2. [Experiments] Figure captions and axis labels in the experimental section could be expanded to indicate which safeguard variant corresponds to each curve.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major comment below and outline the revisions we will make to strengthen the formal foundations of the work.

Point-by-point responses
  1. Referee: [§4] §4 (Adaptation of differentiable safeguards): The central claim of 'provably safe' training rests on the assertion that the modified mappings and gradient formulations inherit the original safety properties. No explicit invariance argument, re-derivation, or proof sketch is supplied showing that key assumptions (monotonicity, Lipschitz bounds, or barrier-function forms) remain satisfied after the changes. The numerical results on three tasks report no violations but do not substitute for a formal guarantee.

    Authors: We agree that the manuscript would benefit from an explicit invariance argument. In the revised version we will add a proof sketch in Section 4 (or a dedicated appendix) showing that the modified mappings preserve monotonicity and the original Lipschitz bounds. The argument will start from the barrier-function form of the source safeguards and demonstrate that the adapted gradient formulations do not violate the required contraction or invariance properties. revision: yes

  2. Referee: [§5] §5 (Integration into analytic-gradient algorithm): When the adapted safeguard is inserted into the analytic-gradient loop and differentiable simulator, it is unclear whether the safety certificate still holds at every policy update. The manuscript provides no theorem or lemma establishing that the combined system remains safe throughout training.

    Authors: We acknowledge the need for a formal statement covering the full training loop. We will insert a new lemma (likely in Section 5) that proves the safety certificate is preserved at each policy update. The lemma will explicitly account for the differentiability of the simulator, the analytic gradient path, and the fact that the safeguard is applied before each gradient step, thereby ensuring the state remains inside the safe set throughout training. revision: yes
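
The invariance property at issue in this exchange can be illustrated on a toy system. The sketch below assumes a hypothetical scalar integrator `x_next = x + dt * u` with safe state set `[x_min, x_max]` and bounded actions. It derives the safe action interval for the current state, clips the policy output into it, and records the resulting states. It is a numerical illustration of a safeguard applied at every step, not the lemma the authors propose.

```python
def safe_action_bounds(x, dt, x_min, x_max, u_max):
    """Actions that keep x + dt * u inside [x_min, x_max] and within [-u_max, u_max]."""
    lower = max(-u_max, (x_min - x) / dt)
    upper = min(u_max, (x_max - x) / dt)
    return lower, upper


def safeguarded_rollout(policy, x0, dt=0.1, steps=30, x_min=-1.0, x_max=1.0, u_max=5.0):
    """Roll out a policy with the safeguard applied before every simulator step."""
    x = x0
    states = [x]
    for _ in range(steps):
        lower, upper = safe_action_bounds(x, dt, x_min, x_max, u_max)
        u = min(max(policy(x), lower), upper)  # clip the raw action into the safe interval
        x = x + dt * u
        states.append(x)
    return states


# Hypothetical aggressive policy that always pushes toward the unsafe region.
states = safeguarded_rollout(lambda x: 5.0, x0=0.0)
```

The rollout reaches the boundary of the safe set and stays there. A formal result would have to hold for every policy encountered during training and for the real dynamics, which a single rollout cannot establish.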

Circularity Check

0 steps flagged

No circularity: adaptations build on external prior safeguards without reducing to self-definition or fitted predictions

full rationale

The paper's central contribution is analyzing existing differentiable safeguards, adapting them via modified mappings and gradient formulations, then integrating into an analytic-gradient RL algorithm and differentiable simulator. The abstract and provided text give no equations or steps that define safety in terms of the new method itself, rename fitted parameters as predictions, or rely on a self-citation chain for the load-bearing safety claim. The original safety properties are treated as coming from prior external work, with the adaptations presented as engineering changes whose preservation is left to empirical validation on three tasks rather than a closed self-referential loop. This keeps the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are specified in the provided text.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Safe Reinforcement Learning using Action Projection: Safeguard the Policy or the Environment?

    cs.LG 2025-09 conditional novelty 7.0

    Action aliasing from safety projections harms policy-gradient estimates more severely when the projection is inside the policy than when it is outside, but a penalty term restores competitiveness.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · cited by 1 Pith paper · 3 internal anchors
