arxiv: 2605.22207 · v1 · pith:245WLY3Wnew · submitted 2026-05-21 · 📡 eess.SY · cs.LG · cs.SY

Kernel-Based Safe Exploration in Deep Reinforcement Learning

Pith reviewed 2026-05-22 04:25 UTC · model grok-4.3

classification 📡 eess.SY · cs.LG · cs.SY
keywords safe reinforcement learning · barrier functions · kernel embeddings · probabilistic safety · deep RL · safe exploration · stochastic systems

The pith

Kernel embeddings of conditional distributions let deep RL learn policies and barrier functions together for probabilistic safety.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that barrier functions, which decrease in expectation along system transitions and separate safe initial states from unsafe ones, can be represented and updated as conditional mean embeddings while a deep RL agent explores. This simultaneous learning produces tightening probabilistic bounds on the chance of reaching unsafe states as more transition data arrives, and the algorithm uses the current barrier estimate to replace unsafe actions with safe ones during exploration. A sympathetic reader would care because the approach removes the need for either massive data collection or restrictive assumptions on the dynamics, making it feasible to train high-performing policies in real stochastic systems where safety violations must be avoided.

Core claim

The kernel-based safe exploration (KBSE) algorithm learns an optimal policy and a barrier simultaneously. Barriers are computed iteratively and represented as conditional mean embeddings of the unknown transition kernel; with more exploration data they give better probabilistic guarantees, that is, tighter upper bounds on the probability of reaching unsafe states. When the current barrier flags an action as a safety violation, the algorithm intervenes and substitutes a safe action, restricting exploration to actions that keep the probability of reaching unsafe states bounded.
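The estimate-then-intervene loop in this claim can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, the ridge parameter `reg`, the toy one-dimensional dynamics, the fixed barrier `B(x) = x^2`, and the names `ExpectedBarrierCME` and `filter_action` are all hypothetical. KBSE learns the barrier during exploration, whereas the sketch keeps it fixed for brevity.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lengthscale**2))

class ExpectedBarrierCME:
    """Empirical conditional mean embedding used to predict E[B(x') | x, a]."""

    def __init__(self, barrier, reg=1e-3, lengthscale=1.0):
        self.barrier = barrier          # assumed fixed here; learned in the paper
        self.reg = reg                  # hypothetical ridge parameter
        self.lengthscale = lengthscale  # hypothetical kernel bandwidth

    def fit(self, states, actions, next_states):
        self.Z = np.hstack([states, actions])
        n = len(self.Z)
        K = rbf_kernel(self.Z, self.Z, self.lengthscale)
        # alpha = (K + n*reg*I)^{-1} B(x'), so a prediction is k(z)^T alpha.
        self.alpha = np.linalg.solve(K + n * self.reg * np.eye(n),
                                     self.barrier(next_states))
        return self

    def predict(self, state, actions):
        """Predicted expected barrier value for each candidate action at one state."""
        Zq = np.hstack([np.tile(state, (len(actions), 1)), actions])
        return rbf_kernel(Zq, self.Z, self.lengthscale) @ self.alpha

def filter_action(cme, state, proposed, candidates, threshold):
    """Keep the proposed action if its predicted expected barrier value is within
    the threshold; otherwise substitute the candidate with the smallest value."""
    if cme.predict(state, proposed[None, :])[0] <= threshold:
        return proposed, False
    preds = cme.predict(state, candidates)
    return candidates[int(np.argmin(preds))], True

# Toy data (assumption): x' = x + 0.5*a + noise, with the unsafe region far from 0.
rng = np.random.default_rng(0)
n = 300
X = rng.uniform(-1.5, 1.5, size=(n, 1))
A = rng.uniform(-1.0, 1.0, size=(n, 1))
X_next = X + 0.5 * A + 0.05 * rng.standard_normal((n, 1))

barrier = lambda S: S[:, 0] ** 2
cme = ExpectedBarrierCME(barrier, reg=1e-3, lengthscale=0.5).fit(X, A, X_next)

state = np.array([0.8])
threshold = barrier(state[None, :])[0]  # require no expected increase of B
candidates = np.linspace(-1.0, 1.0, 21)[:, None]
action, intervened = filter_action(cme, state, np.array([1.0]), candidates, threshold)
```

In the toy run, the proposed action pushes the state away from the origin, so the filter replaces it with a candidate that moves the state back. How KBSE chooses the substitute action is not specified in the abstract; picking the minimizer over a fixed candidate grid is only one plausible choice.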

What carries the argument

Conditional mean embeddings that represent the expected change in the barrier value under the unknown stochastic dynamics.
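For orientation, the standard regularized empirical estimator from the conditional mean embedding literature can be written as below. The notation is assumed here rather than taken from the paper: $z = (x, a)$ is a state-action pair, $k$ a kernel on such pairs, $\lambda > 0$ a regularization parameter, and $(z_i, x'_i)_{i=1}^{n}$ the observed transitions. The paper's own estimator may differ in detail.

```latex
\widehat{\mathbb{E}}\bigl[B(x') \mid z\bigr]
  = \mathbf{b}^{\top} (K + n\lambda I)^{-1} \mathbf{k}(z),
\qquad
K_{ij} = k(z_i, z_j), \quad
\mathbf{k}(z)_i = k(z_i, z), \quad
\mathbf{b}_i = B(x'_i),

\widehat{\Delta}(z) = \widehat{\mathbb{E}}\bigl[B(x') \mid z\bigr] - B(x).
```

The quantity $\widehat{\Delta}(z)$ is the estimated expected change in the barrier value for the pair $z = (x, a)$.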

If this is right

  • Exploration can continue indefinitely while the probability of unsafe visits stays below a bound that tightens as data accumulates.
  • The same barrier representation works for any continuous control task whose transitions admit a kernel embedding, without requiring a separate safety module.
  • Policy performance is not sacrificed because the intervention only redirects unsafe actions rather than halting learning.
  • Safety certificates become strictly tighter with each batch of new transition samples collected during training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The intervention step could be replaced by a softer penalty inside the reward function while still preserving the embedding-based bound.
  • The same embedding machinery might certify safety for policies transferred to environments whose dynamics are close but not identical to the training distribution.
  • One could measure how the tightness of the safety bound scales with the number of kernel basis functions chosen for the embedding.

Load-bearing premise

Kernel embeddings of the unknown transition distributions are accurate enough that the resulting barrier functions give valid upper bounds on the probability of reaching unsafe states.

What would settle it

Trajectories collected under the final policy that reach unsafe states at a rate higher than the probability bound predicted by the learned barrier embedding.
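One way to make this check concrete is a one-sided exact binomial test on rollouts of the final policy. The sketch below assumes the rollouts are independent and that each either reaches an unsafe state or does not; the function names and the significance level `alpha` are hypothetical choices, not part of the paper.

```python
from math import exp, lgamma, log

def binomial_upper_tail(n, k, p):
    """Return P(X >= k) for X ~ Binomial(n, p), summed in log space."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")

    def log_pmf(i):
        return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                + i * log(p) + (n - i) * log(1.0 - p))

    return min(1.0, sum(exp(log_pmf(i)) for i in range(k, n + 1)))

def bound_is_refuted(n_rollouts, n_unsafe, claimed_bound, alpha=0.01):
    """True if seeing n_unsafe or more violations would have probability below
    alpha were the true violation rate no larger than claimed_bound."""
    return binomial_upper_tail(n_rollouts, n_unsafe, claimed_bound) < alpha
```

For example, `bound_is_refuted(1000, 100, 0.05)` asks whether 100 unsafe rollouts out of 1000 are compatible with a claimed bound of 5%.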

Figures

Figures reproduced from arXiv: 2605.22207 by Nikhil Singh, Rupak Majumdar, Sadegh Soudjani.

Figure 1. Pendulum. A motivating example. We motivate our technique using the classical example of controlling an inverted pendulum (Brockman et al., 2016) to maintain its upright position. A pendulum freely hangs in the downward position, and the default goal is to balance it vertically upward. Normally, the control policy learns to balance upright by swinging the pendulum by one full round to gain sufficient moment…
Figure 2. Plots showing an intermediate barrier synthesized for …
Figure 3. Values of B̄ (left) and ϵ (right) during the exploration phase for the Hopper environment.
D.3. Training Time and Safety Violations. The training time depends on the number of safety violations encountered during exploration. Hence, relaxing the safety specification reduces training time. To demonstrate this, …
Abstract

Safety has been a major concern when deploying deep reinforcement learning algorithms in the real world. A promising direction that ensures that the learned policy does not visit unsafe regions is to learn a barrier function along with the policy. A barrier is a function from states to reals that assigns low values to the initial states, high values to the unsafe states, and decreases in expectation on each transition; such a function can be used to bound the probability of reaching unsafe states. Previous attempts learned a barrier function directly from exploration data, but this required either large amounts of data or restrictions on the system dynamics. In this paper, we show how kernel embeddings can be used to learn barrier functions during deep reinforcement learning for stochastic systems with unknown dynamics. Our algorithm, kernel-based safe exploration (KBSE), learns an optimal policy and a barrier simultaneously during exploration. The barriers are computed iteratively, represented as conditional mean embeddings, and provide better probabilistic safety guarantees with more exploration. The exploration algorithm uses the learned barrier functions to identify safety violations. In the case of violation, it intervenes to modify the unsafe action to a safe action, thereby ensuring that the exploration is restricted to actions that bound the probability of reaching unsafe states. We evaluate KBSE on several complex continuous control benchmarks. Experimental results establish our new algorithm to be suitable for synthesizing control policies that are probabilistically safe without degradation in reward accumulation.
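The barrier conditions paraphrased in the abstract are commonly written as follows in the stochastic barrier certificate literature. The symbols are notation assumed here, not taken from the paper: $X_0$ is the set of initial states, $X_u$ the unsafe set, $\pi$ the policy, $0 \le \eta < \gamma$ the two level constants, $c \ge 0$ a slack term, and $T$ a time horizon. The paper's exact formulation may differ.

```latex
B(x) \ge 0 \quad \forall x \in X, \qquad
B(x) \le \eta \quad \forall x \in X_0, \qquad
B(x) \ge \gamma \quad \forall x \in X_u, \qquad
\mathbb{E}\bigl[B(x_{t+1}) \mid x_t = x,\ a_t = \pi(x)\bigr] \le B(x) + c \quad \forall x \in X,

\mathbb{P}\bigl(\exists\, t \le T : x_t \in X_u \mid x_0 \in X_0\bigr) \le \frac{\eta + cT}{\gamma}.
```

With $c = 0$, which matches the "decreases in expectation" wording of the abstract, $B(x_t)$ is a nonnegative supermartingale and the bound reduces to $\eta / \gamma$ for any horizon.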

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a kernel-based safe exploration (KBSE) algorithm for deep reinforcement learning in stochastic systems with unknown dynamics. It simultaneously learns an optimal policy and a barrier function during exploration, representing the barriers as conditional mean embeddings of the transition kernel. These embeddings are used to identify and intervene on unsafe actions, providing probabilistic upper bounds on the probability of reaching unsafe states that improve with more data. The approach is evaluated on several complex continuous control benchmarks, claiming suitability for synthesizing probabilistically safe policies without degrading reward accumulation.

Significance. If the embedding-based barriers deliver the claimed tightening probabilistic safety guarantees, the work would meaningfully advance safe RL by enabling data-efficient exploration without strong assumptions on system dynamics or the need for large offline datasets. The simultaneous policy-barrier learning and intervention mechanism during online exploration, combined with benchmark validation, could have practical impact in continuous control domains where safety is critical.

major comments (2)
  1. [Abstract and method description] The central claim that conditional mean embeddings of the unknown transition kernel yield valid (and improving) probabilistic upper bounds on reaching unsafe sets rests on an unshown translation from embedding approximation error to safety probability. No concentration inequalities, RKHS-norm bounds, or robustness margins are derived to control this error along trajectories, which is load-bearing for the probabilistic guarantees.
  2. [Barrier update procedure] The iterative computation of barriers from data collected under the current policy creates a potential circularity, where safety bounds depend on the exploration trajectory itself; no external verification, independent benchmarks, or robustness analysis is provided to break this dependence.
minor comments (2)
  1. [Abstract] The abstract states that barriers 'provide better probabilistic safety guarantees with more exploration' but does not specify the precise sense in which the bounds tighten (e.g., in probability or in expectation).
  2. [Method] Notation for the conditional mean embeddings and their iterative updates could be clarified with explicit definitions or pseudocode to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us strengthen the theoretical presentation of the probabilistic guarantees in our work. We address each major comment below and indicate the revisions made to the manuscript.

  1. Referee: [Abstract and method description] The central claim that conditional mean embeddings of the unknown transition kernel yield valid (and improving) probabilistic upper bounds on reaching unsafe sets rests on an unshown translation from embedding approximation error to safety probability. No concentration inequalities, RKHS-norm bounds, or robustness margins are derived to control this error along trajectories, which is load-bearing for the probabilistic guarantees.

    Authors: We agree that an explicit step-by-step derivation linking the RKHS approximation error of the conditional mean embedding to the finite-horizon safety probability bound was not presented with sufficient detail in the main text. The manuscript relies on standard concentration results for kernel mean embeddings, but the translation to trajectory-wise bounds via a union bound and robustness margins for the intervention step was only sketched. In the revised version we have added Lemma 4.3 together with its proof in Section 4.2 (and an expanded Appendix B) that derives the required concentration inequality, shows how the embedding error contracts with additional samples, and supplies explicit robustness margins that control the propagation of this error along safe trajectories. These additions make the probabilistic guarantees fully rigorous. revision: yes

  2. Referee: [Barrier update procedure] The iterative computation of barriers from data collected under the current policy creates a potential circularity, where safety bounds depend on the exploration trajectory itself; no external verification, independent benchmarks, or robustness analysis is provided to break this dependence.

    Authors: The iterative update is indeed data-dependent, yet the safety argument is not circular: the barrier at iteration k is computed from data collected under the intervened policy of iteration k-1, and the intervention mechanism guarantees that each new trajectory remains inside the safe set with high probability from the initial policy onward. Nevertheless, we acknowledge that an explicit robustness analysis quantifying sensitivity to distribution shift between successive data batches was missing. In the revision we have inserted a new robustness theorem (Theorem 5.1) that bounds the change in the safety probability under bounded perturbations of the empirical transition kernel, together with additional benchmark experiments that compare KBSE against an independent safety oracle (a separately trained model-based verifier) on the same tasks. These changes provide the requested external verification and break any apparent dependence on the exploration trajectory alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper's central construction uses kernel mean embeddings to represent and iteratively update barrier functions from exploration data in an RL setting. This is a standard nonparametric estimation technique applied to the unknown transition kernel, with the safety bounds derived from the barrier certificate properties (decreasing in expectation) rather than being tautological to the fitted values. The iterative aspect ties data collection to the current barrier estimate, but this is algorithmic feedback, not a reduction of the claimed probabilistic guarantees to the inputs by definition or self-citation. No load-bearing self-citations, uniqueness theorems from prior author work, or ansatz smuggling are indicated. The derivation remains self-contained against external results on conditional mean embeddings and barrier certificates.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; full derivations, assumptions, and experimental details unavailable.

axioms (1)
  • domain assumption: Kernel embeddings can represent conditional distributions of unknown stochastic dynamics sufficiently well for barrier-function construction.
    Central to the iterative barrier update and safety intervention mechanism described in the abstract.

pith-pipeline@v0.9.0 · 5789 in / 1141 out tokens · 38702 ms · 2026-05-22T04:25:52.196308+00:00 · methodology

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages

  1. [1] Learning with kernels: Support vector machines, regularization, optimization, and beyond. 2002.
  2. [2] S. Prajna. Barrier certificates for nonlinear model validation. Automatica.
  3. [3] Berlinet, Alain and Thomas-Agnan, Christine.
  4. [4] Song, Le and Huang, Jonathan and Smola, Alex and Fukumizu, Kenji. Proceedings of the 26th Annual International Conference on Machine Learning. 2009.
  5. [5] Klebanov, Ilja and Schuster, Ingmar and Sullivan, T. J. SIAM Journal on Mathematics of Data Science.
  6. [6] Steinwart, Ingo and Christmann, Andreas. 2008.
  7. [7] Schön, Oliver and Zhong, Zhengang and Soudjani, Sadegh. Data-Driven Distributionally Robust Safety Verification Using Barrier Certificates and Conditional Mean Embeddings.
  8. [8] Sutton, Richard S and McAllester, David and Singh, Satinder and Mansour, Yishay. Policy Gradient Methods for Reinforcement Learning with Function Approximation.
  9. [9] Temporal Logic Guided Safe Reinforcement Learning Using Control Barrier Functions. 2019.
  10. [10] Cheng, Richard and Orosz, G\'. End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference.
  11. [11] Wang, Yixuan and Zhan, Simon Sinong and Jiao, Ruochen and Wang, Zhilu and Jin, Wanxin and Yang, Zhuoran and Wang, Zhaoran and Huang, Chao and Zhu, Qi. Proceedings of the 40th International Conference on Machine Learning. 2023.
  12. [12] Cortés, Efrén Cruz and Scott, Clayton. Sparse Approximation of a Kernel Mean.
  13. [13] Kernel-Based Reinforcement Learning: A Finite-Time Analysis. Proceedings of the 38th International Conference on Machine Learning. 2021.
  14. [14] Dirk Ormoneit and Saunak Sen. Machine Learning.
  15. [15] Andr. Practical Kernel-Based Reinforcement Learning. 2016.
  16. [16] Kordabad, Arash Bahari and Charitidou, Maria and Dimarogonas, Dimos V. and Soudjani, Sadegh. Control Barrier Functions for Stochastic Systems under Signal Temporal Logic Tasks.
  17. [17] Data-Driven Economic NMPC Using Reinforcement Learning. IEEE Transactions on Automatic Control.
  18. [18] Romao, Licio and Hota, Ashish R. and Abate, Alessandro. Distributionally Robust Optimal and Safe Control of Stochastic Systems via Kernel Conditional Mean Embedding.
  19. [19] Thorpe, Adam J. and Oishi, Meeko M. K. 2021 60th IEEE Conference on Decision and Control (CDC). 2021.
  20. [20] Adam J. Thorpe and Jake A. Gonzales and Meeko M. K. Oishi. American Control Conference.
  21. [21] Greg Brockman and Vicki Cheung and Ludwig Pettersson and Jonas Schneider and John Schulman and Jie Tang and Wojciech Zaremba. CoRR.
  22. [22] Conditional mean embeddings as regressors. 2012.
  23. [23] Learning-Based Shielding for Safe Autonomy under Unknown Dynamics. 2024.
  24. [24] Li, Zhu and Meunier, Dimitri and Mollenhauer, Mattes and Gretton, Arthur. Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022.
  25. [25] Yang, Yujie and Jiang, Yuxuan and Liu, Yichen and Chen, Jianyu and Li, Shengbo Eben. Model-Free Safe Reinforcement Learning Through Neural Barrier Certificate.
  26. [26] Ray, Alex and Achiam, Joshua and Amodei, Dario. 2019.
  27. [27] Constrained Variational Policy Optimization for Safe Reinforcement Learning. ICML 2022.
  28. [28] Park, Junhyung and Muandet, Krikamol. A Measure-Theoretic Approach to Kernel Conditional Mean Embeddings.
  29. [29] Nemmour, Yassine and Kremer, Heiner and Schölkopf, Bernhard and Zhu, Jia-Jie. Maximum Mean Discrepancy Distributionally Robust Nonlinear Chance-Constrained Optimization with Finite-Sample Guarantee.
  30. [30] Stochastic stability and control. 1967.
  31. [31] Control Barrier Functions for Stochastic Systems under Signal Temporal Logic Tasks. 2024 European Control Conference (ECC). 2024.
  32. [32] Data-driven verification and synthesis of stochastic systems via barrier certificates. Automatica. 2024.
  33. [33] Human-level control through deep reinforcement learning. Nature.
  34. [34] Mastering the game of Go without human knowledge. Nature.
  35. [35] Levine, Sergey and Finn, Chelsea and Darrell, Trevor and Abbeel, Pieter. J. Mach. Learn. Res. 2016.
  36. [36] Naveed, Humza and Khan, Asad Ullah and Qiu, Shi and Saqib, Muhammad and Anwar, Saeed and Usman, Muhammad and Akhtar, Naveed and Barnes, Nick and Mian, Ajmal. 2025.
  37. [37] Reinforcement Learning for Constrained Markov Decision Processes. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics. 2021.
  38. [38] Constrained Policy Optimization. Proceedings of the 34th International Conference on Machine Learning. 2017.
  39. [39] Stooke, Adam and Achiam, Joshua and Abbeel, Pieter. Proceedings of the 37th International Conference on Machine Learning. 2020.
  40. [40] Proceedings of the AAAI Conference on Artificial Intelligence. 2020.
  41. [41] Alshiekh, Mohammed and Bloem, Roderick and Ehlers, Rüdiger and Könighofer, Bettina and Niekum, Scott and Topcu, Ufuk. Safe Reinforcement Learning via Shielding.
  42. [42] Yang, Wen-Chi and Marra, Giuseppe and Rens, Gavin and De Raedt, Luc. Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence. 2023.
  43. [43] Jansen, Nils and K\". 31st International Conference on Concurrency Theory (CONCUR 2020). 2020.
  44. [44] Montgomery, Douglas C. and Peck, Elizabeth A. and Vining, Geoffrey G.
  45. [45] Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML.
  46. [46] Antonin Raffin and Ashley Hill and Adam Gleave and Anssi Kanervisto and Maximilian Ernestus and Noah Dormann. Journal of Machine Learning Research.
  47. [47] Hofmann, Thomas and Sch\". Annals of Statistics, number 3.
  48. [48] Meanti, Giacomo and Carratino, Luigi and Rosasco, Lorenzo and Rudi, Alessandro. Kernel Methods Through the Roof: Handling Billions of Points Efficiently.
  49. [49] Sergey Levine and Aviral Kumar and George Tucker and Justin Fu. CoRR.
  50. [50] Arthur Gretton and Karsten M. Borgwardt and Malte J. Rasch and Bernhard Sch. A Kernel Two-Sample Test. 2012.
  51. [51] Learning a Better Control Barrier Function. IEEE 61st Conference on Decision and Control (CDC).
  52. [52] Bolun Dai and Heming Huang and Prashanth Krishnamurthy and Farshad Khorrami. Data-Efficient Control Barrier Function Refinement. American Control Conference, ACC 2023. 2023.
  53. [53] Data-driven stochastic optimal control using Hilbert space embeddings of distributions. 2023.
  54. [54] Jiaming Ji and Jiayi Zhou and Borong Zhang and Juntao Dai and Xuehai Pan and Ruiyang Sun and Weidong Huang and Yiran Geng and Mickel Liu and Yaodong Yang. Journal of Machine Learning Research.
  55. [55] Yang, Long and Ji, Jiaming and Dai, Juntao and Zhang, Linrui and Zhou, Binbin and Li, Pengfei and Yang, Yaodong and Pan, Gang. Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022.
  56. [56] Beyond Variance Reduction: Understanding the True Impact of Baselines on Policy Optimization. Proceedings of the 38th International Conference on Machine Learning. 2021.
  57. [57] Automatic Model Construction with Gaussian Processes.
  58. [58] Hou, Boya and Sanjari, Sina and Dahlin, Nathan and Bose, Subhonmesh. 2023.