
arxiv: 2605.18841 · v1 · pith:IGVJ4GBZnew · submitted 2026-05-13 · 💻 cs.LG

From Cumulative Constraints to Adaptive Runtime Safety Control for Nonstationary Reinforcement Learning

Pith reviewed 2026-05-20 21:43 UTC · model grok-4.3

classification 💻 cs.LG
keywords: reinforcement learning safety · nonstationary RL · runtime shielding · cumulative constraints · adaptive thresholds · safety violations · highway merging

The pith

Constraint Projection Safety Shield turns cumulative safety budgets into adaptive per-state thresholds for nonstationary reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper addresses the gap between trajectory-level cumulative safety constraints and the need for per-decision safety guarantees in nonstationary reinforcement learning, where the risk of the same action changes with context. It proposes the Constraint Projection Safety Shield (CPSS), which tracks the remaining safety budget and projects it into a time-varying admissible risk threshold that is adjusted online by contextual signals. Actions whose predicted safety cost exceeds the active threshold are filtered out during execution. Analysis of the shielded policy establishes per-state threshold satisfaction, finite-horizon cumulative cost bounds, and a performance degradation bound expressed via intervention frequency and per-step reward distortion. Experiments in nonstationary highway merging scenarios show fewer proximity-based safety violations and larger separation margins, with interventions that are selective rather than dominant.

Core claim

CPSS projects the remaining cumulative safety budget into an adaptive admissible risk threshold that varies with time and context, filters policy actions whose predicted safety cost exceeds this threshold, and the resulting shielded policy guarantees per-state threshold satisfaction for executed actions, induces finite-horizon cumulative cost bounds, and yields a performance degradation bound in terms of intervention frequency and per-step reward distortion.

What carries the argument

Constraint Projection Safety Shield (CPSS): runtime mechanism that converts a cumulative safety budget into adaptive state-level control constraints by tracking remaining budget, projecting it to a time-varying threshold, and adjusting the threshold online using contextual signals.
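
The three steps named above (track the remaining budget, project it to a threshold, filter actions) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the class `BudgetShield`, the even-split projection rule, the `context_scale` factor in (0, 1], and the least-cost fallback are all hypothetical choices made here for concreteness.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class BudgetShield:
    """Illustrative budget-to-threshold shield (hypothetical, not the paper's code)."""

    budget: float  # remaining cumulative safety budget
    horizon: int   # total number of decision steps
    step: int = 0  # index of the current decision step

    def threshold(self, context_scale: float) -> float:
        # ASSUMED projection rule: split the remaining budget evenly over the
        # remaining steps, then tighten it by a context factor clipped to [0, 1].
        remaining_steps = max(self.horizon - self.step, 1)
        scale = min(max(context_scale, 0.0), 1.0)
        return scale * max(self.budget, 0.0) / remaining_steps

    def select(
        self,
        ranked_actions: Sequence[str],
        predicted_cost: Callable[[str], float],
        context_scale: float,
    ) -> Tuple[str, bool]:
        """Return (executed action, whether the shield overrode the policy)."""
        tau = self.threshold(context_scale)
        # Keep the policy's preference order among admissible actions.
        admissible = [a for a in ranked_actions if predicted_cost(a) <= tau]
        # ASSUMED fallback: if nothing is admissible, take the least-cost action.
        # In that case the per-state threshold is NOT satisfied.
        action = admissible[0] if admissible else min(ranked_actions, key=predicted_cost)
        return action, action != ranked_actions[0]

    def charge(self, cost: float) -> None:
        # Deduct the cost of the executed action and advance one step.
        self.budget -= cost
        self.step += 1


# Hypothetical predicted costs for three merging actions.
predicted = {"merge_now": 0.9, "slow_down": 0.15, "keep_lane": 0.05}
shield = BudgetShield(budget=2.0, horizon=10)
action, intervened = shield.select(
    ["merge_now", "slow_down", "keep_lane"], predicted.__getitem__, context_scale=1.0
)
shield.charge(predicted[action])
```

With a budget of 2.0 over 10 steps the first threshold is 0.2, so the policy's preferred `merge_now` is filtered and `slow_down` is executed instead. A smaller `context_scale` would tighten the threshold, which matches the stricter enforcement described for demanding regimes.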

Load-bearing premise

Safety costs of candidate actions can be predicted with sufficient accuracy in real time and contextual signals provide reliable information for dynamically adjusting the threshold.

What would settle it

In a nonstationary highway merging simulation where safety cost predictions match the assumed accuracy, observe whether cumulative safety costs exceed the claimed finite-horizon bounds or whether per-state threshold violations occur.
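
A check of this kind could be run over logged episodes roughly as follows. This is a sketch only: the function `audit_episode`, its log format, and the numbers in the example are hypothetical, and `bound` stands in for whatever finite-horizon bound the paper states.

```python
from typing import Dict, List, Sequence, Union


def audit_episode(
    predicted: Sequence[float],
    realized: Sequence[float],
    thresholds: Sequence[float],
    bound: float,
    tol: float = 1e-9,
) -> Dict[str, Union[List[int], float, bool]]:
    """Check one logged episode against the two falsifiable claims."""
    if not (len(predicted) == len(realized) == len(thresholds)):
        raise ValueError("all logs must cover the same steps")
    # Per-state claim: the executed action's predicted cost stays within
    # the threshold that was active at that step.
    violations = [
        t for t, (c_hat, tau) in enumerate(zip(predicted, thresholds))
        if c_hat > tau + tol
    ]
    # Cumulative claim: realized cost over the episode stays within the bound.
    cumulative = sum(realized)
    # Largest gap between predicted and realized cost, for context.
    max_error = max(
        (abs(c - c_hat) for c, c_hat in zip(realized, predicted)), default=0.0
    )
    return {
        "threshold_violations": violations,
        "cumulative_cost": cumulative,
        "bound_exceeded": cumulative > bound + tol,
        "max_prediction_error": max_error,
    }


# Hypothetical three-step log, for illustration only.
report = audit_episode(
    predicted=[0.10, 0.20, 0.05],
    realized=[0.12, 0.18, 0.05],
    thresholds=[0.20, 0.25, 0.10],
    bound=0.5,
)
```

Reporting the maximum prediction error alongside the two checks makes it possible to tell whether a failure contradicts the claimed bound or only the accuracy assumption it rests on.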

Figures

Figures reproduced from arXiv: 2605.18841 by Timofey Tomashevskiy.

Figure 1: Constraint projection view of CPSS. A cumulative safety-budget constraint induces a ...
Figure 2: Average collision rate across environments as nonstationarity increases. CPSS consistently ...
Figure 3: Collision rate across nonstationarity regimes for each environment. CPSS maintains lower ...
Figure 4: Safety diagnostics averaged across nonstationarity regimes. Collision rate and proximity ...
Original abstract

Safety in reinforcement learning is often specified through cumulative cost constraints, but these trajectory-level guarantees do not directly prevent unsafe individual decisions, especially under nonstationarity. In continual and nonstationary settings, the difficulty is amplified because the risk associated with the same action can vary across contexts, while a fixed state-level threshold may be either too conservative or too weak. We propose Constraint Projection Safety Shield (CPSS), a runtime mechanism that converts a cumulative safety budget into adaptive state-level control constraints during execution. CPSS tracks the remaining safety budget, projects it into a time-varying admissible risk threshold, and filters policy actions whose predicted safety cost exceeds the active threshold. The threshold is adjusted online using contextual signals so that enforcement becomes stricter in more demanding or rapidly changing regimes and less restrictive when the available safety budget is sufficient. We analyze the resulting shielded policy and show that the mechanism guarantees per-state threshold satisfaction for executed actions, induces finite-horizon cumulative cost bounds, and yields a performance degradation bound in terms of intervention frequency and per-step reward distortion. We evaluate CPSS in nonstationary highway merging scenarios using highway-env. Across multiple seeds, CPSS substantially reduces proximity-based safety violations and increases separation margins while intervening selectively rather than dominating the learned policy. These results support adaptive budget-to-threshold projection as a practical way to transform cumulative safety specifications into effective local safety control for continual reinforcement learning systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes Constraint Projection Safety Shield (CPSS), a runtime mechanism that converts a cumulative safety budget into an adaptive, time-varying state-level risk threshold. CPSS tracks remaining budget, projects it into a context-adjusted threshold using online signals, and filters policy actions whose predicted safety cost exceeds the threshold. The authors analyze the shielded policy to claim per-state threshold satisfaction for executed actions, finite-horizon cumulative cost bounds, and a performance degradation bound expressed in terms of intervention frequency and per-step reward distortion. Evaluation on nonstationary highway merging tasks in highway-env shows reduced proximity violations and larger separation margins across seeds, with selective rather than dominant intervention.

Significance. If the central bounds hold, the work provides a concrete bridge from trajectory-level cumulative constraints to enforceable local safety filters that adapt to nonstationarity. The explicit dependence of the performance bound on intervention frequency and reward distortion is a useful, falsifiable form of guarantee. The highway-env results supply initial evidence that adaptive projection can reduce violations without collapsing to a conservative baseline policy.

major comments (1)
  1. [§3] §3 (theoretical analysis of finite-horizon bounds): the cumulative cost bound is obtained by summing the time-varying thresholds applied to predicted costs. No additive term appears for the maximum per-step prediction error between predicted and realized safety cost. Because filtering decisions rest exclusively on predictions, the bound on actual cumulative cost requires either an assumption that prediction error is identically zero or an explicit propagation of the worst-case error over the horizon length. In nonstationary regimes this discrepancy can be both large and time-varying; its absence is therefore load-bearing for the safety guarantee.
minor comments (2)
  1. [Abstract] The abstract states results hold 'across multiple seeds' but neither reports the exact number of seeds nor supplies standard-error or inter-quartile ranges for the violation and margin metrics; adding these would strengthen the empirical claim.
  2. [§2] Notation for the remaining budget B_t and the projected threshold τ_t is introduced without an explicit recurrence or update rule in the main text; a compact equation or algorithm box would improve readability.
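
The per-seed statistics requested in the first minor comment can be computed with the standard library alone. The sketch below is illustrative: `summarize_across_seeds` is a hypothetical helper, and the input values are placeholders, not results from the paper.

```python
import math
import statistics
from typing import Dict, Sequence, Tuple, Union


def summarize_across_seeds(
    values: Sequence[float],
) -> Dict[str, Union[int, float, Tuple[float, float]]]:
    """Summarize one metric measured once per random seed."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two seeds to estimate spread")
    # Standard error of the mean uses the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(n)
    # quantiles(n=4) returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return {
        "n_seeds": n,
        "mean": statistics.mean(values),
        "sem": sem,
        "iqr": (q1, q3),
    }


# PLACEHOLDER per-seed violation rates, invented purely to show the call.
placeholder_violation_rates = [0.04, 0.06, 0.05, 0.07, 0.03]
summary = summarize_across_seeds(placeholder_violation_rates)
```

Reporting `n_seeds` next to the mean, standard error, and inter-quartile range would let readers judge how much of the claimed reduction could be seed noise.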

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. The observation on the finite-horizon bound is well taken, and we address it directly below. We are prepared to revise the manuscript to strengthen the theoretical claims.

Point-by-point responses
  1. Referee: [§3] §3 (theoretical analysis of finite-horizon bounds): the cumulative cost bound is obtained by summing the time-varying thresholds applied to predicted costs. No additive term appears for the maximum per-step prediction error between predicted and realized safety cost. Because filtering decisions rest exclusively on predictions, the bound on actual cumulative cost requires either an assumption that prediction error is identically zero or an explicit propagation of the worst-case error over the horizon length. In nonstationary regimes this discrepancy can be both large and time-varying; its absence is therefore load-bearing for the safety guarantee.

    Authors: We agree that the current derivation in §3 obtains the cumulative bound by summing the projected thresholds applied to predicted per-step costs. Because action filtering is performed on the basis of these predictions, a bound on realized cumulative cost must account for prediction error. We will revise the analysis to include an explicit additive term that propagates a worst-case per-step prediction error bound over the finite horizon. The revised statement will make the dependence on prediction quality explicit and will remain valid under the nonstationary regime considered in the paper.

    Revision: yes
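
The shape of the additive term can be sketched as follows. The notation is assumed here rather than taken from the paper: \hat{c}_t is the predicted cost of the executed action, c_t its realized cost, \tau_t the active threshold, \varepsilon_t a bound on the per-step prediction error, \varepsilon_{\max} its maximum over the horizon, and T the horizon length.

```latex
\begin{align}
\hat{c}_t \le \tau_t
  \quad\text{and}\quad
  |c_t - \hat{c}_t| \le \varepsilon_t
  &\;\Longrightarrow\;
  c_t \le \tau_t + \varepsilon_t, \\
\sum_{t=0}^{T-1} c_t
  &\le \sum_{t=0}^{T-1} \tau_t + \sum_{t=0}^{T-1} \varepsilon_t
  \le \sum_{t=0}^{T-1} \tau_t + T\,\varepsilon_{\max}.
\end{align}
```

Under these assumptions the bound on realized cost loosens linearly in the horizon, and the original statement is recovered only when \varepsilon_t = 0 at every step, which is the case the referee identifies as implicitly assumed.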

Circularity Check

0 steps flagged

Bounds expressed via mechanism outputs but derivation remains independent

full rationale

The paper derives per-state threshold satisfaction, finite-horizon cumulative cost bounds, and performance degradation bounds directly from the CPSS shielding rules and threshold projection. These bounds are stated in terms of intervention frequency and per-step reward distortion, which are defined by the mechanism's own operation rather than external data fits. No self-citation chains, uniqueness theorems, or ansatzes are invoked to close the argument. The analysis is self-contained given the explicit assumption of sufficiently accurate safety-cost predictions; any gap in error propagation is a correctness issue, not a reduction of the claimed result to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that action safety costs can be predicted accurately enough to support real-time filtering and that contextual signals are informative for threshold adjustment; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • Domain assumption: safety costs of individual actions can be predicted with sufficient accuracy for runtime filtering decisions.
    This premise is required for the projection and filtering step described in the abstract.

pith-pipeline@v0.9.0 · 5776 in / 1304 out tokens · 57770 ms · 2026-05-20T21:43:51.116240+00:00 · methodology


