
arxiv: 2605.19469 · v1 · pith:65BZRRYSnew · submitted 2026-05-19 · 💻 cs.LG · cs.AI · cs.RO

Sampling-Based Safe Reinforcement Learning

Pith reviewed 2026-05-20 07:01 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.RO
keywords safe reinforcement learning · model-based RL · safety constraints · dynamics sampling · epistemic uncertainty · sample complexity · continuous control · robotic hardware

The pith

Enforcing constraints jointly over finite dynamics samples approximates worst-case safety in reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Sampling-Based Safe Reinforcement Learning (SBSRL) to address safe exploration in RL. It keeps the agent safe during learning by enforcing constraints simultaneously across a finite collection of sampled dynamics models. This turns an otherwise intractable worst-case problem into a computable one for continuous control tasks. An exploration strategy that constrains epistemic uncertainty removes the need for separate exploration bonuses. Under regularity conditions the approach supplies high-probability safety guarantees and a finite-time sample-complexity bound for recovering a near-optimal policy, and it is tested successfully both in simulation and on physical robots.

Core claim

SBSRL maintains safety throughout learning by enforcing constraints jointly across a finite set of dynamics samples. This formulation approximates an intractable worst-case optimization over uncertain dynamics and enables practical safety guarantees in continuous domains. Under regularity conditions, high-probability guarantees of safety throughout learning and a finite-time sample complexity bound for recovering a near-optimal policy are derived.

What carries the argument

Joint enforcement of constraints across a finite set of sampled dynamics models that approximates the worst-case optimization over uncertain dynamics.
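To make the joint check concrete, here is a minimal sketch under stated assumptions. The linear dynamics, the quadratic toy cost, the linear feedback policy, and the names `rollout_cost` and `jointly_safe` are all hypothetical illustrations and are not taken from the paper. The only idea borrowed from the text is that a policy counts as safe when the constraint holds under every sampled model at once.

```python
import numpy as np

def rollout_cost(A, B, policy_gain, x0, horizon):
    """Cumulative toy cost of the policy u = -K x under x' = A x + B u."""
    x = np.array(x0, dtype=float)
    total = 0.0
    for _ in range(horizon):
        u = -policy_gain @ x
        total += float(x @ x)  # assumed cost: squared distance from the origin
        x = A @ x + B @ u
    return total

def jointly_safe(samples, policy_gain, x0, horizon, budget):
    """Return (safe, costs); safe is True only if every sampled model meets the budget."""
    costs = [rollout_cost(A, B, policy_gain, x0, horizon) for A, B in samples]
    return max(costs) <= budget, costs

# Hypothetical stand-in for "a finite set of dynamics samples":
# random perturbations of a nominal double integrator.
rng = np.random.default_rng(0)
A_nominal = np.array([[1.0, 0.1], [0.0, 1.0]])
B_nominal = np.array([[0.0], [0.1]])
samples = [
    (A_nominal + 0.02 * rng.standard_normal((2, 2)),
     B_nominal + 0.02 * rng.standard_normal((2, 1)))
    for _ in range(8)
]

gain = np.array([[1.0, 1.5]])
safe, costs = jointly_safe(samples, gain, x0=[1.0, 0.0], horizon=50, budget=25.0)
print(safe, round(max(costs), 3))
```

Taking the maximum over the sampled costs is what replaces the supremum over the full uncertainty set in this sketch; how the paper couples this check with policy optimization is not shown here.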

If this is right

  • Safety is preserved with high probability for the entire duration of learning.
  • A finite-time bound guarantees recovery of a near-optimal policy after a controlled number of samples.
  • Exploration proceeds by constraining epistemic uncertainty without separate reward bonuses.
  • The formulation scales to deep-ensemble implementations for high-dimensional continuous control.
  • Empirical results confirm safe operation on both simulated and physical robotic hardware.
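The uncertainty-constrained exploration bullet can be illustrated with a rough sketch of how epistemic uncertainty is often measured in an ensemble: as disagreement between members' predictions. Everything here is an assumption for illustration, including the linear ensemble members, the name `epistemic_uncertainty`, and the threshold value. The way SBSRL actually turns such a quantity into a constraint is not specified on this page.

```python
import numpy as np

def epistemic_uncertainty(ensemble, x, u):
    """Norm of the per-dimension standard deviation across ensemble predictions."""
    preds = np.stack([A @ x + B @ u for A, B in ensemble])  # shape (K, state_dim)
    return float(np.linalg.norm(preds.std(axis=0)))

# Hypothetical ensemble: five perturbed linear models.
rng = np.random.default_rng(1)
ensemble = [
    (np.eye(2) + 0.05 * rng.standard_normal((2, 2)),
     np.array([[0.0], [0.1]]) + 0.05 * rng.standard_normal((2, 1)))
    for _ in range(5)
]

u = np.zeros(1)
sigma_origin = epistemic_uncertainty(ensemble, np.zeros(2), u)
sigma_near = epistemic_uncertainty(ensemble, np.array([0.1, 0.0]), u)
sigma_far = epistemic_uncertainty(ensemble, np.array([2.0, 0.0]), u)

# Assumed threshold, used only to show the kind of comparison a constraint could make.
uncertainty_threshold = 0.05
print(sigma_origin, sigma_near, sigma_far, sigma_far <= uncertainty_threshold)
```

In this toy, all members agree at the origin by construction, and disagreement grows with distance from it, so a threshold on the measured value separates well-understood from poorly understood regions.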

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The sampling approach could reduce the risk of catastrophic failures when RL is applied to real-world systems with costly mistakes.
  • Similar joint-constraint ideas might transfer to other model-based planning settings that face model uncertainty.
  • Direct measurement of how often sampled models must be drawn to keep violation probability below a target threshold would test practical tightness of the bounds.
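The measurement suggested in the last bullet can be mocked up on a toy problem. This is a sketch under strong assumptions, none of which come from the paper: a scalar unknown parameter `theta`, a constraint `action * theta <= 1`, and samples drawn from the same distribution as the true parameter. The function name `violation_rate` is hypothetical. In this setting the most aggressive jointly safe action violates the true constraint exactly when the true parameter exceeds every sample, which by exchangeability happens with probability 1 / (M + 1).

```python
import numpy as np

def violation_rate(num_samples, trials=20000, seed=0):
    """Monte Carlo estimate of how often a jointly safe action violates the true constraint."""
    rng = np.random.default_rng(seed)
    theta_true = rng.uniform(0.5, 2.0, size=trials)
    theta_samples = rng.uniform(0.5, 2.0, size=(trials, num_samples))
    # Largest action with action * theta_m <= 1 for every sampled theta_m.
    action = 1.0 / theta_samples.max(axis=1)
    return float(np.mean(action * theta_true > 1.0))

rates = {m: violation_rate(m) for m in (1, 4, 9, 19)}
for m, rate in rates.items():
    print(m, round(rate, 3), round(1.0 / (m + 1), 3))
```

The printout compares the empirical rate with 1 / (M + 1) for each M. A comparable experiment on the paper's benchmarks would replace the scalar toy with full rollouts.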

Load-bearing premise

High-probability safety guarantees and sample-complexity bounds rely on regularity conditions on the dynamics and uncertainty model.

What would settle it

An experiment that satisfies the regularity conditions yet records a safety violation during learning would disprove the high-probability guarantees.

Figures

Figures reproduced from arXiv: 2605.19469 by Andreas Krause, Bruce D. Lee, Luca Vignola, Manish Prajapat, Manuel Wendl, Melanie Zeilinger, Yarden As.

Figure 1: We illustrate a safe exploration task in …
Figure 2: At the initial episode $n = 0$, we draw $M$ samples $\tilde{f}^m$ from the GP prior. These are then clipped using the prior bound $F^B_0$, yielding the truncated samples $f^m_0$. Finally, $f^\zeta$ denotes the $\zeta$-close sample of Lemma 2. For well-calibrated models (cf. Lemma 1), truncation can only reduce deviation from $f^\star$, thereby preserving the $\zeta$-close sample. This idea is illustrated in …
Figure 4: Evaluation of safety on PENDULUMSWINGUP and CARTPOLESWINGUP. We report mean and standard deviation of the normalized cumulative returns and maximum cost violations over five seeds. SBSRL maintains safety throughout training while achieving near-optimal performance.
Theorem 2 (Optimality). Suppose Assumptions 1-5 hold. Fix some $\delta \in (0, 1/2)$, $\zeta \in \bigl(0, \frac{\sigma_w \Delta}{\sqrt{d_x}\, T^2 C_{\max}}\bigr)$, $\varepsilon > 0$ and $\xi > \varepsilon + \zeta \frac{\sqrt{d_x}\, T^2 C_{\max}}{\sigma_w}$. L…
Figure 3: At termination, $\Pi^{\mathrm{safe}}_{n^\star} \supseteq \Pi^{\star,c}_{\xi}$. The set $\Pi^{\star}_{\xi}$ may be disconnected, and only its reachable component $\Pi^{\star,c}_{\xi}$ can be identified via safe updates. The tightening $\xi$ influences connectivity: smaller values can merge otherwise disconnected regions, as illustrated for $\xi = 0$. The theorem guarantees near-optimality over the reachable component $\Pi^{\star,c}_{\xi}$ in finite time. In the definition of $\xi$, $\varepsilon$ accounts for the …
Figure 6: Safe offline-to-online on real-world hardware. We report mean and standard error over …
Figure 5: Evaluation of exploration via an exploration constraint in the PENDULUMSWINGUP task. We report mean and 95 percentile interval over five seeds. We further illustrate how the choice of $d^0_\sigma$ can influence the behavior of SBSRL. From our theory, smaller values of $d^0_\sigma$ guarantee closer-to-optimal performance at termination but reduce sample efficiency, which helps explain why increasing its value may actua…
Figure 7: GP experiments demonstrating the effectiveness of using sampling to incentivize safety.
Figure 8: GP experiments ablating different values of …
Figure 9: Learning curves comparing SBSRL against standard baselines. The performance is evaluated in the environments GOTOGOAL, RACECAR, HUMANOID, CARTPOLE and QUADRUPED. We report the mean and 95 percentile interval.
Figure 10: Learning curves comparing SBSRL against the mean baseline. The performance is evaluated in the environments GOTOGOAL, RACECAR, HUMANOID, CARTPOLE and QUADRUPED. We report the mean and 95 percentile interval.
Original abstract

Safe exploration remains a fundamental challenge in reinforcement learning (RL), limiting the deployment of RL agents in the real world. We propose Sampling-Based Safe Reinforcement Learning (SBSRL), a model-based RL algorithm that maintains safety throughout the learning process by enforcing constraints jointly across a finite set of dynamics samples. This formulation approximates an intractable worst-case optimization over uncertain dynamics and enables practical safety guarantees in continuous domains. We further introduce an exploration strategy based on constraining epistemic uncertainty, eliminating the need for explicit exploration bonuses. Under regularity conditions, we derive high-probability guarantees of safety throughout learning and a finite-time sample complexity bound for recovering a near-optimal policy. Empirically, SBSRL achieves safe and efficient exploration both in simulation and in real robotic hardware, and readily extends to practical deep-ensemble implementations that scale to high-dimensional continuous control problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Sampling-Based Safe Reinforcement Learning (SBSRL), a model-based RL algorithm that maintains safety throughout learning by jointly enforcing constraints over a finite set of sampled dynamics models. This approximates an intractable worst-case optimization over uncertain dynamics, introduces an epistemic-uncertainty-constrained exploration strategy without explicit bonuses, and claims high-probability safety guarantees plus finite-time sample-complexity bounds for near-optimal policy recovery under regularity conditions. Empirical results are reported in simulation and on real robotic hardware, with extensions to deep-ensemble implementations for high-dimensional continuous control.

Significance. If the regularity conditions are mild and the finite-sample approximation closes the gap to the true uncertainty set with the stated high probability, the work would provide a practical bridge between worst-case robust RL and scalable model-based methods. The elimination of explicit exploration bonuses via uncertainty constraints and the hardware validation are strengths that could influence safe exploration research in continuous domains.

major comments (2)
  1. [Abstract / Theoretical Results] Abstract and theoretical results section: the high-probability safety guarantees and finite-time sample-complexity bounds are stated to hold only under unspecified 'regularity conditions' on the dynamics and uncertainty model. Without an explicit list (e.g., Lipschitz constants, bounded epistemic variance, or compactness of the uncertainty set), it is impossible to verify whether the finite dynamics samples suffice to approximate the worst-case optimization with the claimed probability, particularly for the continuous robotic domains tested.
  2. [Safety Formulation] The safety formulation is defined directly with respect to the finite set of sampled models chosen by the algorithm. This creates a potential circularity: the derived bounds apply to the sampled set by construction, but no explicit reduction or concentration argument is visible showing that the sampled-set safety implies safety with respect to the true (continuous) uncertainty set outside the regularity conditions.
minor comments (2)
  1. [Abstract] The abstract reports empirical success but does not mention error bars, number of trials, or statistical significance for the hardware experiments; adding these would strengthen the reproducibility claim.
  2. [Introduction / Method] Notation for the epistemic uncertainty set and the finite sample size should be introduced earlier and used consistently when stating the approximation to the worst-case optimization.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, clarifying the theoretical assumptions and indicating revisions that will be incorporated to make the regularity conditions and safety reduction explicit.

Point-by-point responses
  1. Referee: [Abstract / Theoretical Results] Abstract and theoretical results section: the high-probability safety guarantees and finite-time sample-complexity bounds are stated to hold only under unspecified 'regularity conditions' on the dynamics and uncertainty model. Without an explicit list (e.g., Lipschitz constants, bounded epistemic variance, or compactness of the uncertainty set), it is impossible to verify whether the finite dynamics samples suffice to approximate the worst-case optimization with the claimed probability, particularly for the continuous robotic domains tested.

    Authors: We agree that the regularity conditions must be stated explicitly to allow verification of the bounds. Our analysis assumes: (i) Lipschitz continuity of the dynamics with constant L, (ii) uniform bound σ on epistemic variance, and (iii) compactness of the uncertainty set. Under these, finite samples suffice via standard covering-number and concentration arguments. We will revise the abstract and theory section to list these conditions explicitly and add a short discussion of their relevance to the tested robotic domains. revision: yes

  2. Referee: [Safety Formulation] The safety formulation is defined directly with respect to the finite set of sampled models chosen by the algorithm. This creates a potential circularity: the derived bounds apply to the sampled set by construction, but no explicit reduction or concentration argument is visible showing that the sampled-set safety implies safety with respect to the true (continuous) uncertainty set outside the regularity conditions.

    Authors: Safety is enforced on the sampled models as a tractable proxy. The reduction to the true continuous uncertainty set is obtained via a uniform-convergence argument: under the stated regularity conditions, the worst-case violation over the true set is bounded by the sampled-set violation plus a term that decays with sample count (via Lipschitz continuity and compactness). We will insert an explicit lemma (with proof sketch) in the theory section or appendix to detail this concentration step and the required sample size. revision: yes
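The reduction described in this response can be written out as a short worked statement. The symbols below are assumptions introduced for illustration and are not taken from the paper: $J_c(\pi, f)$ is the constraint cost of policy $\pi$ under dynamics $f$, assumed $L_J$-Lipschitz in $f$ with respect to a metric $d$; $\mathcal{F}$ is the uncertainty set; $f_1, \dots, f_M$ are the sampled models; and $p_\zeta$ is a lower bound on the probability that one independent draw lands within $\zeta$ of the true dynamics $f^\star$.

```latex
% Covering step: if every model in the uncertainty set has a zeta-close sample,
% the worst case exceeds the sampled maximum by at most L_J * zeta.
\[
  \bigl(\forall f \in \mathcal{F}\ \ \exists m:\ d(f, f_m) \le \zeta\bigr)
  \quad\Longrightarrow\quad
  \sup_{f \in \mathcal{F}} J_c(\pi, f)
  \;\le\; \max_{1 \le m \le M} J_c(\pi, f_m) + L_J\,\zeta .
\]
% Probabilistic step: only the true dynamics needs a close sample, and
% M independent draws all miss it with probability at most (1 - p_zeta)^M.
\[
  \Pr\Bigl[\min_{1 \le m \le M} d(f^\star, f_m) \le \zeta\Bigr]
  \;\ge\; 1 - (1 - p_\zeta)^{M} .
\]
```

Under these assumptions, enforcing $\max_m J_c(\pi, f_m) \le c - L_J \zeta$ on the samples yields $J_c(\pi, f^\star) \le c$ whenever a $\zeta$-close sample exists, which is the shape of argument the response gestures at.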

Circularity Check

0 steps flagged

No significant circularity in the claimed derivation

full rationale

The paper defines SBSRL as enforcing joint constraints over a finite set of dynamics samples to approximate worst-case optimization over uncertain dynamics, then derives high-probability safety guarantees and finite-time sample-complexity bounds under regularity conditions on the dynamics and uncertainty model. No step reduces a derived bound or guarantee to an input quantity by construction (e.g., no fitted parameter renamed as prediction, no self-definitional loop where safety w.r.t. samples is equated to true safety without external conditions, and no load-bearing self-citation or smuggled ansatz). The regularity conditions are external assumptions that are intended to close the gap between samples and true dynamics, leaving the central claims with independent theoretical content rather than tautological equivalence to the algorithmic choices.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on regularity conditions for the dynamics and uncertainty model plus the choice of finite sample size; these are not derived from first principles within the paper.

free parameters (1)
  • number of dynamics samples
    Finite set size is a modeling choice that controls the approximation quality of the worst-case safety constraint.
axioms (1)
  • domain assumption: Regularity conditions on dynamics and epistemic uncertainty
    Invoked to derive high-probability safety guarantees and sample-complexity bounds (abstract theoretical paragraph).
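As a rough illustration of how the one free parameter trades off against failure probability, the sketch below computes the smallest number of samples such that at least one lands close to the true dynamics with probability at least 1 - delta. It assumes independent draws and a known per-draw success probability `p_close`; both the assumption and the function name `min_samples` are hypothetical, and in practice `p_close` would itself have to be bounded by the regularity conditions.

```python
import math

def min_samples(p_close, delta):
    """Smallest M with (1 - p_close) ** M <= delta, for independent draws."""
    if not (0.0 < p_close < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("p_close and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(delta) / math.log(1.0 - p_close))

for p_close in (0.5, 0.1, 0.01):
    print(p_close, min_samples(p_close, delta=0.05))
```

The required sample count grows roughly like log(1/delta) / p_close when p_close is small, so a tighter closeness requirement (smaller p_close) is the expensive direction.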

pith-pipeline@v0.9.0 · 5684 in / 1154 out tokens · 32974 ms · 2026-05-20T07:01:49.477231+00:00 · methodology



Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 2 internal anchors

  1. [1]

    Human-level control through deep reinforcement learning

    V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei Rusu, Joel Veness, Marc Bellemare, Alex Graves, Martin Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning....

  2. [2]

    Mastering the game of Go without human knowledge.Nature, 2017.(Cited on page 1)

    David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge.Nature, 2017.(Cited on page 1)

  3. [3]

    Bagnell, and Jan Peters

    Jens Kober, J. Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey.The International Journal of Robotics Research, 2013.(Cited on page 1)

  4. [4]

    Deep reinforcement learning for robotics: A survey of real-world successes.Proceedings of the AAAI Conference on Artificial Intelligence, 2025.(Cited on page 1)

    Chen Tang, Ben Abbatematteo, Jiaheng Hu, Rohan Chandra, Roberto Martin-Martin, and Peter Stone. Deep reinforcement learning for robotics: A survey of real-world successes.Proceedings of the AAAI Conference on Artificial Intelligence, 2025.(Cited on page 1)

  5. [5]

    Magnetic control of tokamak plasmas through deep reinforcement learning.Nature, 2022.(Cited on page 1)

    Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey, Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego Casas, Craig Donner, Leslie Fritz, Cristian Galperti, Andrea Huber, James Keeling, Maria Tsimpoukelli, Jackie Kay, Antoine Merle, Jean-Marc Moret, and Martin Riedmiller. Magnetic control of tokamak plasma...

  6. [6]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback,...

  7. [7]

    Concrete problems in ai safety, 2016.(Cited on page 1)

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety, 2016.(Cited on page 1)

  8. [8]

    Actsafe: Active exploration with safety constraints for reinforcement learning

    Yarden As, Bhavya Sukhija, Lenart Treven, Carmelo Sferrazza, Stelian Coros, and Andreas Krause. Actsafe: Active exploration with safety constraints for reinforcement learning. In International Conference on Learning Representations, 2025.(Cited on pages 1, 3, 4, 5, 6, 8, 16, 17, 18, 23, 25, 26, 27, and 28)

  9. [9]

    Safe exploration via policy priors

    Manuel Wendl, Yarden As, Manish Prajapat, Anton Pollak, Stelian Coros, and Andreas Krause. Safe exploration via policy priors. InThe Fourteenth International Conference on Learning Representations, 2026.(Cited on pages 1, 3, 4, 6, 28, and 29)

  10. [10]

    Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models.International Conference on Neural Information Processing Systems, 2018.(Cited on pages 2, 25, 27, and 29)

  11. [11]

    Benchmarking Safe Exploration in Deep Reinforcement Learning

    Alex Ray, Joshua Achiam, and Dario Amodei. Benchmarking Safe Exploration in Deep Reinforcement Learning. 2019.(Cited on pages 2, 8, 27, 28, and 29)

  12. [12]

    Mankowitz, Jerry Li, Cosmin Paduraru, Sven Gowal, and Todd Hester

    Gabriel Dulac-Arnold, Nir Levine, Daniel J. Mankowitz, Jerry Li, Cosmin Paduraru, Sven Gowal, and Todd Hester. An empirical investigation of the challenges of real-world reinforce- ment learning, 2021.(Cited on pages 2, 8, 27, 28, and 29)

  13. [13]

    A comprehensive survey on safe reinforcement learning

    Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 2015.(Cited on page 2)

  14. [14]

    Safe learning in robotics: From learning-based control to safe reinforcement learning.Annual Review of Control, Robotics, and Autonomous Systems, 2022.(Cited on page 2)

    Lukas Brunke, Melissa Greeff, Adam W Hall, Zhaocong Yuan, Siqi Zhou, Jacopo Panerati, and Angela P Schoellig. Safe learning in robotics: From learning-based control to safe reinforcement learning.Annual Review of Control, Robotics, and Autonomous Systems, 2022.(Cited on page 2)

  15. [15]

    A review of safe reinforcement learning: Methods, theories and applications.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.(Cited on page 2) 11

    Shangding Gu, Long Yang, Yali Du, Guang Chen, Florian Walter, Jun Wang, and Alois Knoll. A review of safe reinforcement learning: Methods, theories and applications.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.(Cited on page 2) 11

  16. [16]

    Altman.Constrained Markov Decision Processes

    E. Altman.Constrained Markov Decision Processes. Chapman and Hall, 1999.(Cited on pages 2 and 10)

  17. [17]

    Safe reinforcement learning using advantage-based intervention

    Nolan Wagener, Byron Boots, and Ching-An Cheng. Safe reinforcement learning using advantage-based intervention. InInternational Conference on Machine Learning, 2021.(Cited on page 2)

  18. [18]

    Safe reinforcement learning via confidence-based filters

    Sebastian Curi, Armin Lederer, Sandra Hirche, and Andreas Krause. Safe reinforcement learning via confidence-based filters. InConference on Decision and Control. IEEE, 2022.(Cited on page 2)

  19. [19]

    Safe exploration in reinforcement learning: A generalized formulation and algorithms

    Akifumi Wachi, Wataru Hashimoto, Xun Shen, and Kazumune Hashimoto. Safe exploration in reinforcement learning: A generalized formulation and algorithms. InInternational Conference on Neural Information Processing Systems, 2023.(Cited on page 2)

  20. [20]

    Near-optimal sample complexity bounds for constrained MDPs

    Sharan Vaswani, Lin Yang, and Csaba Szepesvari. Near-optimal sample complexity bounds for constrained MDPs. InInternational Conference on Neural Information Processing Systems, 2022.(Cited on page 2)

  21. [21]

    Zhang, Jiali Duan, Tamer Bacsar, and Mihailo R

    Dongsheng Ding, K. Zhang, Jiali Duan, Tamer Bacsar, and Mihailo R. Jovanovi´c. Convergence and sample complexity of natural policy gradient primal-dual methods for constrained mdps. ArXiv, 2022.(Cited on page 2)

  22. [22]

    Truly no-regret learning in constrained mdps

    Adrian Müller, Pragnya Alatur, V olkan Cevher, Giorgia Ramponi, and Niao He. Truly no-regret learning in constrained mdps. InInternational Conference on Machine Learning, 2024.(Cited on page 2)

  23. [23]

    Constrained policy optimization

    Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. InInternational Conference on Machine Learning, 2017.(Cited on page 2)

  24. [24]

    Constrained policy opti- mization via bayesian world models

    Yarden As, Ilnura Usmanova, Sebastian Curi, and Andreas Krause. Constrained policy opti- mization via bayesian world models. InInternational Conference on Learning Representations, 2022.(Cited on pages 2, 3, 6, and 28)

  25. [25]

    Saute rl: Almost surely safe reinforcement learning using state augmentation

    Aivar Sootla, Alexander I Cowen-Rivers, Taher Jafferjee, Ziyan Wang, David H Mguni, Jun Wang, and Haitham Ammar. Saute rl: Almost surely safe reinforcement learning using state augmentation. InInternational Conference on Machine Learning, 2022.(Cited on page 2)

  26. [26]

    Safedreamer: Safe reinforcement learning with world models

    Weidong Huang, Jiaming Ji, Chunhe Xia, Borong Zhang, and Yaodong Yang. Safedreamer: Safe reinforcement learning with world models. InInternational Conference on Learning Representations, 2024.(Cited on page 2)

  27. [27]

    Safe Exploration in Markov Decision Processes

    Teodor Mihai Moldovan and Pieter Abbeel. Safe exploration in markov decision processes. arXiv preprint arXiv:1205.4810, 2012.(Cited on page 2)

  28. [28]

    Safe exploration in finite markov decision processes with gaussian processes.International Conference on Neural Information Processing Systems, 2016.(Cited on pages 2 and 7)

    Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. Safe exploration in finite markov decision processes with gaussian processes.International Conference on Neural Information Processing Systems, 2016.(Cited on pages 2 and 7)

  29. [29]

    Safe exploration and optimization of constrained mdps using gaussian processes.AAAI, 2018.(Cited on page 2)

    Akifumi Wachi, Yanan Sui, Yisong Yue, and Masahiro Ono. Safe exploration and optimization of constrained mdps using gaussian processes.AAAI, 2018.(Cited on page 2)

  30. [30]

    Safe reinforcement learning in constrained markov decision processes

    Akifumi Wachi and Yanan Sui. Safe reinforcement learning in constrained markov decision processes. InInternational Conference on Machine Learning, 2020.(Cited on page 2)

  31. [31]

    Provably efficient safe exploration via primal-dual policy optimization

    Dongsheng Ding, Xiaohan Wei, Zhuoran Yang, Zhaoran Wang, and Mihailo Jovanovic. Provably efficient safe exploration via primal-dual policy optimization. InInternational conference on artificial intelligence and statistics, 2021.(Cited on page 2)

  32. [32]

    Dope: Doubly optimistic and pessimistic exploration for safe reinforcement learning.International Conference on Neural Information Processing Systems, 2022.(Cited on page 2) 12

    Archana Bura, Aria HasanzadeZonuzy, Dileep Kalathil, Srinivas Shakkottai, and Jean-Francois Chamberland. Dope: Doubly optimistic and pessimistic exploration for safe reinforcement learning.International Conference on Neural Information Processing Systems, 2022.(Cited on page 2) 12

  33. [33]

    Learning-based model predictive control for safe exploration.IEEE Conference on Decision and Control, 2018

    Torsten Koller, Felix Berkenkamp, Matteo Turchetta, and Andreas Krause. Learning-based model predictive control for safe exploration.IEEE Conference on Decision and Control, 2018. (Cited on page 2)

  34. [34]

    Cautious model predictive control using gaussian process regression.IEEE Transactions on Control Systems Technology, 2019.(Cited on page 2)

    Lukas Hewing, Juraj Kabzan, and Melanie N Zeilinger. Cautious model predictive control using gaussian process regression.IEEE Transactions on Control Systems Technology, 2019.(Cited on page 2)

  35. [35]

    A predictive safety filter for learning-based control of constrained nonlinear dynamical systems.Automatica, 2021.(Cited on page 2)

    Kim Peter Wabersich and Melanie N Zeilinger. A predictive safety filter for learning-based control of constrained nonlinear dynamical systems.Automatica, 2021.(Cited on page 2)

  36. [36]

    Zeilinger

    Manish Prajapat, Johannes Köhler, Matteo Turchetta, Andreas Krause, and Melanie N. Zeilinger. Safe guaranteed exploration for non-linear systems.IEEE Transactions on Automatic Control, 2025.(Cited on page 2)

  37. [37]

    Towards safe and tractable gaussian process-based mpc: Efficient sampling within a sequential quadratic programming framework

    Manish Prajapat, Amon Lahr, Johannes Köhler, Andreas Krause, and Melanie N Zeilinger. Towards safe and tractable gaussian process-based mpc: Efficient sampling within a sequential quadratic programming framework. InIEEE Conference on Decision and Control. IEEE, 2024. (Cited on pages 2, 6, 15, and 16)

  38. [38]

    Domain randomization for transferring deep neural networks from simulation to the real world

    Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), 2017. (Cited on page 2)

  39. [39]

    A simple and efficient sampling-based algorithm for general reachability analysis

    Thomas Lew, Lucas Janson, Riccardo Bonalli, and Marco Pavone. A simple and efficient sampling-based algorithm for general reachability analysis. InProceedings of The 4th Annual Learning for Dynamics and Control Conference, 2022.(Cited on page 2)

  40. [40]

    Rehg, and Evangelos A

    Grady Williams, Paul Drews, Brian Goldfain, James M. Rehg, and Evangelos A. Theodorou. Aggressive driving with model predictive path integral control. In2016 IEEE International Conference on Robotics and Automation (ICRA), 2016.(Cited on page 2)

  41. [41]

    Calafiore and Marco Campi

    G. Calafiore and Marco Campi. The scenario approach to robust control design.Automatic Control, IEEE Transactions on, 2006.(Cited on page 2)

  42. [42]

    Scenario-based Optimal Control for Gaussian Process State Space Models

    Jonas Umlauft, Thomas Beckers, and Sandra Hirche. Scenario-based Optimal Control for Gaussian Process State Space Models. In2018 European Control Conference (ECC), 2018. (Cited on pages 2 and 15)

  43. [43]

    Pappas, and Jyotirmoy V

    Lars Lindemann, Yiqi Zhao, Xinyi Yu, George J. Pappas, and Jyotirmoy V . Deshmukh. Formal verification and control with conformal prediction: Practical safety guarantees for autonomous systems.IEEE Control Systems, 2025.(Cited on page 2)

  44. [44]

    Finite-sample-based reachability for safe control with gaussian process dynamics.arXiv preprint arXiv:2505.07594, 2025.(Cited on pages 2, 5, 15, 16, and 17)

    Manish Prajapat, Johannes Köhler, Amon Lahr, Andreas Krause, and Melanie N Zeilinger. Finite-sample-based reachability for safe control with gaussian process dynamics.arXiv preprint arXiv:2505.07594, 2025.(Cited on pages 2, 5, 15, 16, and 17)

  45. [45]

    Logarithmic online regret bounds for undiscounted reinforcement learning

    Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. InInternational Conference on Neural Information Processing Systems, 2006.(Cited on page 3)

  46. [46]

    Kakade, and Matthias W

    Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias W. Seeger. Information- theoretic regret bounds for gaussian process optimization in the bandit setting.IEEE Transac- tions on Information Theory, 2012.(Cited on pages 3 and 15)

  47. [47]

    Efficient model-based reinforcement learning through optimistic policy search and planning

    Sebastian Curi, Felix Berkenkamp, and Andreas Krause. Efficient model-based reinforcement learning through optimistic policy search and planning. InInternational Conference on Neural Information Processing Systems, 2020.(Cited on pages 3, 4, 21, and 25)

  48. [48]

    Safe exploration for optimization with gaussian processes

    Yanan Sui, Alkis Gotovos, Joel Burdick, and Andreas Krause. Safe exploration for optimization with gaussian processes. InInternational conference on machine learning, 2015.(Cited on page 3) 13

  49. [49]

    Transductive active learning: Theory and applications

    Jonas Hübotter, Bhavya Sukhija, Lenart Treven, Yarden As, and Andreas Krause. Transductive active learning: Theory and applications. InInternational Conference on Neural Information Processing Systems, 2024.(Cited on page 3)

  50. [50]

    On explore-then-commit strategies

    Aurélien Garivier, Tor Lattimore, and Emilie Kaufmann. On explore-then-commit strategies. Advances in Neural Information Processing Systems, 2016.(Cited on page 3)

  51. [51]

    Sombrl: Scalable and optimistic model-based rl

    Bhavya Sukhija, Lenart Treven, Carmelo Sferrazza, Florian Dörfler, Pieter Abbeel, and Andreas Krause. Sombrl: Scalable and optimistic model-based rl. InAdvances in Neural Information Processing Systems (NeurIPS), 2025.(Cited on pages 3, 4, 17, and 18)

  52. [52]

    Safe and near-optimal control with online dynamics learning

    Manish Prajapat, Johannes Köhler, Melanie N. Zeilinger, and Andreas Krause. Safe and near-optimal control with online dynamics learning. 2026. (Cited on pages 3 and 7)

  53. [53]

    Markov decision processes: discrete stochastic dynamic programming

    Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014. (Cited on page 3)

  54. [54]

    Information theoretic regret bounds for online nonlinear control

    Sham Kakade, Akshay Krishnamurthy, Kendall Lowrey, Motoya Ohnishi, and Wen Sun. Information theoretic regret bounds for online nonlinear control. In International Conference on Neural Information Processing Systems, 2020. (Cited on pages 4, 18, and 23)

  55. [55]

    On kernelized multi-armed bandits

    Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In International Conference on Machine Learning, 2017. (Cited on page 5)

  56. [56]

    Information rates of nonparametric gaussian process methods

    Aad Van Der Vaart and Harry Van Zanten. Information rates of nonparametric gaussian process methods. Journal of Machine Learning Research, 12(6), 2011. (Cited on page 5)

  57. [57]

    Safe model-based reinforcement learning with stability guarantees

    Felix Berkenkamp, Matteo Turchetta, Angela P. Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. In International Conference on Neural Information Processing Systems, 2017. (Cited on page 7)

  58. [58]

    DeepMind Control Suite

    Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018. (Cited on page 8)

  59. [59]

    Prediction with approximated gaussian process dynamical models

    T. Beckers and S. Hirche. Prediction with approximated gaussian process dynamical models. IEEE Transactions on Automatic Control, 2021. (Cited on page 15)

  60. [60]

    Bayesian Optimization

    Roman Garnett. Bayesian Optimization. Cambridge University Press, 2023. (Cited on page 16)

  61. [61]

    Sampling from gaussian processes: a tutorial and applications in global sensitivity analysis and optimization

    Bach Do, Nafeezat A. Ajenifuja, Taiwo A. Adebiyi, and Ruda Zhang. Sampling from gaussian processes: a tutorial and applications in global sensitivity analysis and optimization. Structural and Multidisciplinary Optimization, 2025. (Cited on page 16)

  62. [62]

    Optimistic active exploration of dynamical systems

    Bhavya Sukhija, Lenart Treven, Cansu Sancaktar, Sebastian Blaes, Stelian Coros, and Andreas Krause. Optimistic active exploration of dynamical systems. In International Conference on Neural Information Processing Systems, 2023. (Cited on pages 17, 23, and 25)

  63. [63]

    Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. 2006. doi: 10.1017/CBO9780511546921. (Cited on page 20)

  64. [64]

    Sample-efficient cross-entropy method for real-time planning

    Cristina Pinneri, Shambhuraj Sawant, Sebastian Blaes, Jan Achterhold, Joerg Stueckler, Michal Rolinek, and Georg Martius. Sample-efficient cross-entropy method for real-time planning. In Conference on Robot Learning, 2020. (Cited on page 25)

  65. [65]

    When to trust your model: model-based policy optimization

    Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: model-based policy optimization. In International Conference on Neural Information Processing Systems, 2019. (Cited on pages 27 and 29)

  66. [66]

    Simple and scalable predictive uncertainty estimation using deep ensembles

    Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In International Conference on Neural Information Processing Systems, 2017. (Cited on page 27)

  67. [67]

    Decomposition of uncertainty in bayesian deep learning for efficient and risk-sensitive learning

    Stefan Depeweg, Jose-Miguel Hernandez-Lobato, Finale Doshi-Velez, and Steffen Udluft. Decomposition of uncertainty in bayesian deep learning for efficient and risk-sensitive learning. In International Conference on Machine Learning, 2018. (Cited on page 27)

  68. [68]

    Numerical Optimization

    Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, second edition, 2006. (Cited on page 28)

  69. [69]

    Jingqi Li, David Fridovich-Keil, Somayeh Sojoudi, and Claire J. Tomlin. Augmented lagrangian method for instantaneously constrained reinforcement learning problems. In 2021 60th IEEE Conference on Decision and Control (CDC), 2021. (Cited on page 28)

  70. [70]

    SPiDR: A simple approach for zero-shot safety in sim-to-real transfer

    Yarden As, Chengrui Qu, Benjamin Unger, Dongho Kang, Max van der Hart, Laixi Shi, Stelian Coros, Adam Wierman, and Andreas Krause. SPiDR: A simple approach for zero-shot safety in sim-to-real transfer. In International Conference on Neural Information Processing Systems, 2025. (Cited on page 28)

  71. [71]

    Mopo: Model-based offline policy optimization

    Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. Mopo: Model-based offline policy optimization. In Advances in Neural Information Processing Systems, 2020. (Cited on page 28)

A Discussion

We collect here several technical observations that clarify the scope of our assumptions and design choices. We ...