Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning

Bai Xue; Jingduo Pan; Taoran Wu; Yiling Xue

arxiv: 2605.11975 · v2 · pith:AKT4PA27new · submitted 2026-05-12 · 💻 cs.LG

Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning

Jingduo Pan , Taoran Wu , Yiling Xue , Bai Xue This is my paper

Pith reviewed 2026-05-20 22:44 UTC · model grok-4.3

classification 💻 cs.LG

keywords stochastic reinforcement learningreach-avoid constraintssafe RLprobability certificatesBellman operatorconvergence analysiscost optimization

0 comments

The pith

Reach-avoid probability certificates combined with a contraction Bellman operator let agents minimize costs while satisfying probabilistic reach-avoid constraints in stochastic environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method for stochastic minimum-cost reach-avoid reinforcement learning. It introduces reach-avoid probability certificates that mark states where an agent can meet a reach-avoid goal with at least probability p. Using these certificates, the authors create a contraction mapping in the Bellman operator that incorporates the constraint as a surrogate. This setup lets the learning process optimize expected costs without violating the probabilistic requirement. The approach comes with proofs that the algorithms converge almost surely to locally optimal policies.

Core claim

We introduce reach-avoid probability certificates (RAPCs), which identify states from which stochastic reach-avoid constraints are satisfiable. Building on RAPCs, we develop a contraction-based Bellman formulation that serves as a principled surrogate for integrating reach-avoid considerations into reinforcement learning, enabling cost optimization under probabilistic constraints. We establish almost sure convergence of the proposed algorithms to locally optimal policies with respect to the resulting objective.

What carries the argument

Reach-avoid probability certificates (RAPCs) that identify satisfiable states and serve as the basis for a contraction-based Bellman formulation used as a surrogate inside the value-function update.

If this is right

Reinforcement learning agents can jointly optimize cumulative costs and satisfy reach-avoid specifications with high probability in stochastic dynamics.
The learning algorithms converge almost surely to locally optimal policies under the combined objective.
Empirical results in MuJoCo environments show lower costs and higher rates of successful constraint satisfaction compared to prior methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The surrogate formulation may support extension to other temporal logic specifications beyond simple reach-avoid.
If certificate computation scales to high-dimensional continuous spaces, the method could transfer to physical robotic systems with uncertainty.
The contraction property opens a route to bounding sample complexity for constrained RL objectives.

Load-bearing premise

Reach-avoid probability certificates can be computed or approximated with sufficient accuracy from the stochastic dynamics to serve as a reliable surrogate inside the Bellman operator.

What would settle it

An experiment that deliberately uses an inaccurate approximation of the certificates and checks whether the learned policy still meets the required reach-avoid probability threshold in held-out stochastic trajectories.

Figures

Figures reproduced from arXiv: 2605.11975 by Bai Xue, Jingduo Pan, Taoran Wu, Yiling Xue.

**Figure 1.** Figure 1: illustrates the deterministic benchmark environments. Across PointGoal and FixedWing, RAPCPO consistently achieves the lowest cumulative cost while maintaining higher or comparable reach rates relative to P P Oβ, Sauté, CPPO, and RESPO ( [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗

**Figure 2.** Figure 2: Cumulative cost and reach rate on deterministic benchmarks. RAPCPO achieves lower cost with competitive or higher reach rates compared to baselines [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 5.** Figure 5: Cumulative cost and reach rate on stochastic benchmarks. RAPCPO achieves the best cost-success trade-off among all methods. 6.3. Ablation Study: Compensation Factor ϕ π γ and Hyperparameter p We isolate the auxiliary compensation factor on FrozenLake (Brockman et al., 2016), where the objective is to maximize the probability of safely reaching the goal [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 3.** Figure 3: Pendulum results. RAPCPO achieves lower cumulative cost by favoring longer, smoother trajectories. 6.2. Stochastic Minimum-Cost Reach-Avoid [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 6.** Figure 6: Impact of the compensation factor on cost and reach rate across tasks. Then, when both methods employ the compensation factor, we compare the fixed-γ Bellman formulation (8) with the enhanced Bellman formulation (9) for minimum costs reachavoid problem. Under the same iteration budget, the fixed-γ Bellman formulation (8) suffers from severe reward sparsity and achieves low reach rates. In contrast, the en… view at source ↗

**Figure 4.** Figure 4: illustrates the stochastic benchmark environments. Under action noise, RAPCPO consistently outperforms all baselines on both Safety Hopper and Safety HalfCheetah ( [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 7.** Figure 7: Ablation on the hyperparameter p in the Safety Hopper environment. Right: reach rate as a function of p. Left: additional cumulative cost as a function of p. The cost is plotted on a logarithmic scale due to its wide dynamic range. 7. Conclusion and Limitations This paper addressed the minimum-cost reach-avoid reinforcement learning problem in stochastic environments. We first proposed the Reach-Avoid Pr… view at source ↗

**Figure 8.** Figure 8: Effect of the auxiliary compensation factor on reach–avoid probability estimation. The compensation factor significantly reduces conservativeness. • RESPO: https://github.com/milanganai/milanganai.github.io/tree/main/ NeurIPS2023/code (No license) • CPPO: https://github.com/yingchengyang/CPPO (No license) • Sauté: https://github.com/huawei-noah/HEBO/tree/master/SIMMER (No license) D.4. Hyperparameters In e… view at source ↗

read the original abstract

We study stochastic minimum-cost reach-avoid reinforcement learning, where an agent must satisfy a reach-avoid specification with probability at least $p$ while minimizing expected cumulative costs in stochastic environments. Existing safe and constrained reinforcement learning methods typically fail to jointly enforce probabilistic reach-avoid constraints and optimize cost in the learning setting in stochastic environments. To address this challenge, we introduce reach-avoid probability certificates (RAPCs), which identify states from which stochastic reach-avoid constraints are satisfiable. Building on RAPCs, we develop a contraction-based Bellman formulation that serves as a principled surrogate for integrating reach-avoid considerations into reinforcement learning, enabling cost optimization under probabilistic constraints. We establish almost sure convergence of the proposed algorithms to locally optimal policies with respect to the resulting objective. Experiments in the MuJoCo simulator demonstrate improved cost performance and consistently higher reach-avoid satisfaction rates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RAPCs paired with a contraction Bellman operator give a fresh surrogate for joint cost minimization and probabilistic reach-avoid in stochastic RL, but the convergence claim needs explicit handling of approximation error to hold up.

read the letter

Colleague, the main takeaway is that the paper introduces reach-avoid probability certificates to mark states where stochastic reach-avoid specs can be met with probability at least p, then folds them into a modified Bellman operator that contracts and supports almost-sure convergence to locally optimal policies while minimizing expected cost. This pairing is not directly in the prior safe RL or reach-avoid literature cited in the abstract, so the technical device is new. The MuJoCo experiments report better cost numbers and higher constraint satisfaction rates than baselines, which is concrete evidence that the surrogate can be implemented and yields measurable gains in a standard simulator. That part is useful for applied safe RL work. The soft spot is the theoretical backbone. The abstract asserts convergence once RAPCs are substituted, yet gives no proof sketch, no modulus that stays strictly below 1, and no description of how the certificates are computed or approximated from continuous stochastic dynamics. The stress-test concern is on target here: if sampling or function approximation introduces error that pushes the effective contraction constant to 1 or above, both the fixed-point guarantee and the claim that the surrogate reliably enforces the probability threshold become questionable. The paper would need to show that the fixed point remains a valid certificate under realistic approximation error for the central argument to land cleanly. This is aimed at researchers who already work on constrained or safe RL in stochastic settings and want a way to keep reach-avoid specs inside the learning loop rather than as an add-on. A reader who cares about contraction arguments and certificate-based constraints would get value from the formulation even if the experiments are fairly standard. I would send it to peer review so the proofs and approximation details can be checked properly; the gap it targets is real and the idea is worth testing.

Referee Report

2 major / 2 minor

Summary. The paper studies stochastic minimum-cost reach-avoid reinforcement learning, where an agent must satisfy a reach-avoid specification with probability at least p while minimizing expected cumulative costs. It introduces reach-avoid probability certificates (RAPCs) to identify states from which such constraints are satisfiable. Building on RAPCs, the authors develop a contraction-based Bellman formulation as a surrogate for integrating reach-avoid considerations into RL. They claim almost sure convergence of the proposed algorithms to locally optimal policies and report improved cost performance with higher reach-avoid satisfaction rates in MuJoCo experiments.

Significance. If the contraction property holds under realistic RAPC approximations and the convergence result is rigorously established, the work would offer a principled surrogate for enforcing probabilistic reach-avoid constraints while optimizing costs in stochastic RL, addressing a gap in safe RL methods. The MuJoCo results suggest practical promise, but the absence of explicit analysis on approximation effects and proof details reduces the immediate significance of the theoretical claims.

major comments (2)

[Abstract and Bellman formulation] Abstract (paragraph 3) and the contraction-based Bellman formulation: the almost sure convergence to locally optimal policies is asserted on the basis of the RAPC-augmented Bellman operator being a contraction. In continuous stochastic dynamics, RAPCs must be approximated (e.g., via sampling or function approximation), yet no explicit modulus is supplied showing that the contraction constant remains strictly less than 1 under non-zero approximation error, nor is it shown that the fixed point remains a valid certificate. This is load-bearing for both the theoretical guarantee and the claim that the surrogate reliably enforces the probabilistic constraint.
[Experiments] Experiments section: the MuJoCo results claim consistently higher reach-avoid satisfaction rates, but the manuscript supplies no description of how RAPCs are obtained or approximated in the implementation, nor error-bar details or statistical tests. Without these, it is impossible to assess whether the observed gains are attributable to the proposed surrogate or to implementation specifics.

minor comments (2)

Add a brief proof sketch or key steps for the contraction property and convergence result, even if full details are in an appendix.
Clarify the relationship between the invented RAPC concept and prior reach-avoid literature to better situate the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. We address each major comment below and indicate the specific revisions we will make to improve clarity and rigor.

read point-by-point responses

Referee: [Abstract and Bellman formulation] Abstract (paragraph 3) and the contraction-based Bellman formulation: the almost sure convergence to locally optimal policies is asserted on the basis of the RAPC-augmented Bellman operator being a contraction. In continuous stochastic dynamics, RAPCs must be approximated (e.g., via sampling or function approximation), yet no explicit modulus is supplied showing that the contraction constant remains strictly less than 1 under non-zero approximation error, nor is it shown that the fixed point remains a valid certificate. This is load-bearing for both the theoretical guarantee and the claim that the surrogate reliably enforces the probabilistic constraint.

Authors: We agree that a detailed treatment of approximation effects is necessary to support the practical claims. The contraction mapping property and almost-sure convergence are proven for the exact RAPC-augmented Bellman operator in the theoretical development. In the revised manuscript we will add an explicit error analysis subsection that supplies a modulus of continuity for the contraction constant under bounded approximation error (via sampling or function approximation) and shows that the fixed point of the perturbed operator remains a valid certificate up to a controllable additive error term. This will directly address the load-bearing nature of the guarantee. revision: yes
Referee: [Experiments] Experiments section: the MuJoCo results claim consistently higher reach-avoid satisfaction rates, but the manuscript supplies no description of how RAPCs are obtained or approximated in the implementation, nor error-bar details or statistical tests. Without these, it is impossible to assess whether the observed gains are attributable to the proposed surrogate or to implementation specifics.

Authors: We thank the referee for highlighting this omission. The revised experiments section will include a precise description of the RAPC approximation procedure used in the MuJoCo tasks (neural-network function approximation combined with Monte-Carlo sampling for probability estimation). We will also report standard error bars computed over 10 independent random seeds and include paired statistical tests (e.g., Wilcoxon signed-rank) comparing the proposed method against baselines to substantiate the reported improvements in cost and reach-avoid satisfaction. revision: yes

Circularity Check

0 steps flagged

No significant circularity; RAPC concept and Bellman surrogate are independently introduced without reduction to inputs or self-citation chains.

full rationale

The paper defines reach-avoid probability certificates (RAPCs) as a new construct identifying states satisfying stochastic reach-avoid constraints with probability at least p, then builds a contraction-based Bellman operator that incorporates these certificates as a surrogate for the constraints. This construction is presented as a derivation from the RAPC definition rather than a fit or renaming of existing quantities. No equations reduce the objective or convergence claim to a self-referential quantity by construction, and the almost-sure convergence is asserted from the contraction property of the augmented operator. The derivation remains self-contained against external benchmarks such as standard constrained RL methods, with no load-bearing self-citations or ansatzes smuggled from prior author work. The skeptic concern about approximation error in continuous dynamics pertains to correctness and robustness rather than circularity in the claimed derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the existence and computability of RAPCs plus standard MDP assumptions; no free parameters or invented physical entities are declared.

axioms (1)

domain assumption The underlying process is a stochastic MDP whose transition kernel permits definition of reach-avoid probabilities.
Invoked when RAPCs are introduced as identifiers of satisfiable states.

invented entities (1)

Reach-Avoid Probability Certificates (RAPCs) no independent evidence
purpose: Mark states from which the probabilistic reach-avoid specification remains satisfiable.
New construct introduced to bridge safety certificates and the Bellman operator.

pith-pipeline@v0.9.0 · 5675 in / 1270 out tokens · 30208 ms · 2026-05-20T22:44:40.708223+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Define the operator B^π[V](x) := max{ h(x), min{ g(x), γ E[V(x')] } }. Then V(x)=B^π[V](x) admits a unique bounded solution V^π_{g,h} and B^π is a γ-contraction under ||·||_∞ (Lemma 4.4).
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

If V^π_{g,h}(x) < 0 then P^π(RA_x) ≥ −V^π_{g,h}(x)/M (Theorem 4.5).

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 5 internal anchors

[1]

Brockman, G., Cheung, V ., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W

Ac- cessed: 2026-05-12. Brockman, G., Cheung, V ., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Ope- nai gym,

work page 2026
[2]

OpenAI Gym

URL https://arxiv.org/abs/ 1606.01540. Brunke, L., Greeff, M., Hall, A. W., Yuan, Z., Zhou, S., Panerati, J., and Schoellig, A. P. Safe learning in robotics: From learning-based control to safe reinforce- ment learning.Annual Review of Control, Robotics, and Autonomous Systems, 5(1):411–444,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Comparative analysis of barrier-like function methods for reach-avoid verification in stochastic discrete-time systems.arXiv preprint arXiv:2512.05348,

Cao, Z., Wang, P., Ong, L., Žikeli´c, Ð., Wagner, D., and Xue, B. Comparative analysis of barrier-like function methods for reach-avoid verification in stochastic discrete-time systems.arXiv preprint arXiv:2512.05348,

work page arXiv
[4]

L., Delage, E., Derman, E., Ghavamzadeh, M., and Petrik, M

Hau, J. L., Delage, E., Derman, E., Ghavamzadeh, M., and Petrik, M. Q-learning for quantile mdps: A decompo- sition, performance, and convergence analysis.arXiv preprint arXiv:2410.24128,

work page arXiv
[5]

Goal-conditioned reinforcement learning: Problems and solutions

Liu, M., Zhu, M., and Zhang, W. Goal-conditioned re- inforcement learning: Problems and solutions.arXiv preprint arXiv:2201.08299,

work page arXiv
[6]

and Kiumarsi, B

10 Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning Marvi, Z. and Kiumarsi, B. Safe reinforcement learning: A control barrier function optimization approach.Interna- tional Journal of Robust and Nonlinear Control, 31(6): 1923–1940,

work page 1923
[7]

Multi-Goal Reinforcement Learning: Challenging Robotics Environments and Request for Research

Plappert, M., Andrychowicz, M., Ray, A., McGrew, B., Baker, B., Powell, G., Schneider, J., Tobin, J., Chociej, M., Welinder, P., et al. Multi-goal reinforcement learn- ing: Challenging robotics environments and request for research.arXiv preprint arXiv:1802.09464,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Benchmarking Batch Deep Reinforcement Learning Algorithms

Ray, A., Achiam, J., and Amodei, D. Benchmarking safe ex- ploration in deep reinforcement learning.arXiv preprint arXiv:1910.01708, 7(1):2,

work page internal anchor Pith review Pith/arXiv arXiv 1910
[9]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

So, O., Ge, C., and Fan, C

URLhttps://arxiv.org/abs/2305.14154. So, O., Ge, C., and Fan, C. Solving minimum-cost reach avoid using reinforcement learning.Advances in Neural Information Processing Systems, 37:30951–30984,

work page arXiv
[11]

Reward Constrained Policy Optimization

Tessler, C., Mankowitz, D. J., and Mannor, S. Re- ward constrained policy optimization.arXiv preprint arXiv:1805.11074,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Yang, T.-Y ., Rosca, J., Narasimhan, K., and Ramadge, P. J. Projection-based constrained policy optimization.arXiv preprint arXiv:2010.03152,

work page arXiv 2010
[13]

Towards safe reinforcement learning via constraining con- ditional value-at-risk.arXiv preprint arXiv:2206.04436,

Ying, C., Zhou, X., Su, H., Yan, D., Chen, N., and Zhu, J. Towards safe reinforcement learning via constraining con- ditional value-at-risk.arXiv preprint arXiv:2206.04436,

work page arXiv
[14]

Proofs A.1

11 Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning A. Proofs A.1. The proof for lemma 4.4 Proof.LetB(R n)denote the space of all bounded functions mapping fromR n toR. That is, B(Rn) = V:R n →R|sup x∈Rn |V(x)|<∞ . It is a standard result that the space B(Rn) of bounded functions, endowed with the supremum norm ∥V∥ ∞ = supx∈Rn |V(x)|, constitute...

work page 1999
[15]

Given thatγ∈(0,1), the operatorB π is a contraction mapping on the Banach space(B(R n),∥ · ∥ ∞)

=γ∥V 1 −V 2∥∞.(Expectation of constant is constant) Since the inequality |Bπ[V1](x)−B π[V2](x)| ≤γ∥V 1 −V 2∥∞ holds for all x∈R n, we can take the supremum over x on the left side: ∥Bπ[V1]−B π[V2]∥∞ = sup x∈Rn |Bπ[V1](x)−B π[V2](x)| ≤γ∥V 1 −V 2∥∞. Given thatγ∈(0,1), the operatorB π is a contraction mapping on the Banach space(B(R n),∥ · ∥ ∞). By the Banac...

work page 2026
[16]

Definition B.1(Absorbing Markov Decision Process).The Absorbing Markov Decision Process is defined on f with an added absorbing states

that facilitates the computation of gradients. Definition B.1(Absorbing Markov Decision Process).The Absorbing Markov Decision Process is defined on f with an added absorbing states. We define the transition functionf ′ r with the absorbing state as f ′ r(x, a) = ( f(x, a),ifg(x)≥V π g,h(x)≥h(x), s,otherwise. (34) Denote byd ′ π(x)the stationary distribut...

work page 2023
[17]

Recognizing this, our method simplifies the formulation by defining g directly on the original state x, thereby avoiding the need for state augmentation

employs state augmentation, defining g based on an augmented state ˜x, the function g(˜x)in their work effectively depends only on the original state component x. Recognizing this, our method simplifies the formulation by defining g directly on the original state x, thereby avoiding the need for state augmentation. Apart from this change, the remaining ex...

work page 2024

[1] [1]

Brockman, G., Cheung, V ., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W

Ac- cessed: 2026-05-12. Brockman, G., Cheung, V ., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Ope- nai gym,

work page 2026

[2] [2]

OpenAI Gym

URL https://arxiv.org/abs/ 1606.01540. Brunke, L., Greeff, M., Hall, A. W., Yuan, Z., Zhou, S., Panerati, J., and Schoellig, A. P. Safe learning in robotics: From learning-based control to safe reinforce- ment learning.Annual Review of Control, Robotics, and Autonomous Systems, 5(1):411–444,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Comparative analysis of barrier-like function methods for reach-avoid verification in stochastic discrete-time systems.arXiv preprint arXiv:2512.05348,

Cao, Z., Wang, P., Ong, L., Žikeli´c, Ð., Wagner, D., and Xue, B. Comparative analysis of barrier-like function methods for reach-avoid verification in stochastic discrete-time systems.arXiv preprint arXiv:2512.05348,

work page arXiv

[4] [4]

L., Delage, E., Derman, E., Ghavamzadeh, M., and Petrik, M

Hau, J. L., Delage, E., Derman, E., Ghavamzadeh, M., and Petrik, M. Q-learning for quantile mdps: A decompo- sition, performance, and convergence analysis.arXiv preprint arXiv:2410.24128,

work page arXiv

[5] [5]

Goal-conditioned reinforcement learning: Problems and solutions

Liu, M., Zhu, M., and Zhang, W. Goal-conditioned re- inforcement learning: Problems and solutions.arXiv preprint arXiv:2201.08299,

work page arXiv

[6] [6]

and Kiumarsi, B

10 Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning Marvi, Z. and Kiumarsi, B. Safe reinforcement learning: A control barrier function optimization approach.Interna- tional Journal of Robust and Nonlinear Control, 31(6): 1923–1940,

work page 1923

[7] [7]

Multi-Goal Reinforcement Learning: Challenging Robotics Environments and Request for Research

Plappert, M., Andrychowicz, M., Ray, A., McGrew, B., Baker, B., Powell, G., Schneider, J., Tobin, J., Chociej, M., Welinder, P., et al. Multi-goal reinforcement learn- ing: Challenging robotics environments and request for research.arXiv preprint arXiv:1802.09464,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Benchmarking Batch Deep Reinforcement Learning Algorithms

Ray, A., Achiam, J., and Amodei, D. Benchmarking safe ex- ploration in deep reinforcement learning.arXiv preprint arXiv:1910.01708, 7(1):2,

work page internal anchor Pith review Pith/arXiv arXiv 1910

[9] [9]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

So, O., Ge, C., and Fan, C

URLhttps://arxiv.org/abs/2305.14154. So, O., Ge, C., and Fan, C. Solving minimum-cost reach avoid using reinforcement learning.Advances in Neural Information Processing Systems, 37:30951–30984,

work page arXiv

[11] [11]

Reward Constrained Policy Optimization

Tessler, C., Mankowitz, D. J., and Mannor, S. Re- ward constrained policy optimization.arXiv preprint arXiv:1805.11074,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Yang, T.-Y ., Rosca, J., Narasimhan, K., and Ramadge, P. J. Projection-based constrained policy optimization.arXiv preprint arXiv:2010.03152,

work page arXiv 2010

[13] [13]

Towards safe reinforcement learning via constraining con- ditional value-at-risk.arXiv preprint arXiv:2206.04436,

Ying, C., Zhou, X., Su, H., Yan, D., Chen, N., and Zhu, J. Towards safe reinforcement learning via constraining con- ditional value-at-risk.arXiv preprint arXiv:2206.04436,

work page arXiv

[14] [14]

Proofs A.1

11 Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning A. Proofs A.1. The proof for lemma 4.4 Proof.LetB(R n)denote the space of all bounded functions mapping fromR n toR. That is, B(Rn) = V:R n →R|sup x∈Rn |V(x)|<∞ . It is a standard result that the space B(Rn) of bounded functions, endowed with the supremum norm ∥V∥ ∞ = supx∈Rn |V(x)|, constitute...

work page 1999

[15] [15]

Given thatγ∈(0,1), the operatorB π is a contraction mapping on the Banach space(B(R n),∥ · ∥ ∞)

=γ∥V 1 −V 2∥∞.(Expectation of constant is constant) Since the inequality |Bπ[V1](x)−B π[V2](x)| ≤γ∥V 1 −V 2∥∞ holds for all x∈R n, we can take the supremum over x on the left side: ∥Bπ[V1]−B π[V2]∥∞ = sup x∈Rn |Bπ[V1](x)−B π[V2](x)| ≤γ∥V 1 −V 2∥∞. Given thatγ∈(0,1), the operatorB π is a contraction mapping on the Banach space(B(R n),∥ · ∥ ∞). By the Banac...

work page 2026

[16] [16]

Definition B.1(Absorbing Markov Decision Process).The Absorbing Markov Decision Process is defined on f with an added absorbing states

that facilitates the computation of gradients. Definition B.1(Absorbing Markov Decision Process).The Absorbing Markov Decision Process is defined on f with an added absorbing states. We define the transition functionf ′ r with the absorbing state as f ′ r(x, a) = ( f(x, a),ifg(x)≥V π g,h(x)≥h(x), s,otherwise. (34) Denote byd ′ π(x)the stationary distribut...

work page 2023

[17] [17]

Recognizing this, our method simplifies the formulation by defining g directly on the original state x, thereby avoiding the need for state augmentation

employs state augmentation, defining g based on an augmented state ˜x, the function g(˜x)in their work effectively depends only on the original state component x. Recognizing this, our method simplifies the formulation by defining g directly on the original state x, thereby avoiding the need for state augmentation. Apart from this change, the remaining ex...

work page 2024