Bellman Value Decomposition for Task Logic in Safe Optimal Control

Chuchu Fan; Dylan Hirsch; Oswin So; Sylvia Herbert; William Sharpless

arXiv: 2602.19532 · v2 · pith:XUUK4HNO · submitted 2026-02-23 · cs.RO · cs.SY · eess.SY

William Sharpless, Oswin So, Dylan Hirsch, Sylvia Herbert, Chuchu Fan

Pith reviewed 2026-05-15 21:05 UTC · model grok-4.3

classification cs.RO · cs.SY · eess.SY

keywords Bellman value decomposition, temporal logic, safe optimal control, reach-avoid, reach-avoid-loop, neural network policy, VDPPO, multi-agent control

The pith

The Bellman value for temporal logic tasks decomposes into a graph of simpler values connected by reach-avoid, avoid, and reach-avoid-loop equations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that the Bellman value for tasks specified in temporal logic naturally breaks down into a graph of simpler Bellman values. These nodes connect through the standard reach-avoid Bellman equation, the avoid Bellman equation, and a newly introduced reach-avoid-loop Bellman equation. The decomposition allows the full value function and optimal policy to be learned by embedding the graph structure directly into a two-layer neural network, which automatically resolves the dependencies between parts of the task. A reader would care because high-dimensional control problems routinely combine safety constraints with goal-directed behavior, and conventional approaches demand extensive manual reward design or automata construction that scales poorly. The authors test the resulting method on simulated and hardware experiments with multi-agent teams and nonlinear dynamics, reporting improved automatic balance of safety and goal achievement over existing baselines.
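For orientation, the two standard equations named above are often written as follows for deterministic dynamics $x_{t+1} = f(x_t, u_t)$. The symbols here are assumptions chosen for illustration and are not taken from the paper: $r(x)$ is a reach reward that is positive inside the target, and $q(x)$ is a safety margin that is positive away from hazards. The Reach-Avoid-Loop equation is new to the paper and is not reproduced here.

```latex
% Avoid BE: the value is the worst safety margin along the best trajectory,
%   V_A(x) = \max_{u_{0:\infty}} \inf_{t} q(x_t)
V_{\mathrm{A}}(x) = \min\Big\{ q(x),\ \max_{u} V_{\mathrm{A}}\big(f(x,u)\big) \Big\}

% Reach-Avoid BE: the value is the best reach reward attainable while staying safe up to that time,
%   V_RA(x) = \max_{u_{0:\infty}} \sup_{t} \min\{ r(x_t),\ \min_{s \le t} q(x_s) \}
V_{\mathrm{RA}}(x) = \min\Big\{ q(x),\ \max\big\{ r(x),\ \max_{u} V_{\mathrm{RA}}\big(f(x,u)\big) \big\} \Big\}
```

These undiscounted forms can admit more than one fixed point, so learning-based methods typically work with a discounted variant and consider the limit as the discount factor approaches 1.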

Core claim

We prove the Bellman Value for a complex task defined in temporal logic can be decomposed into a graph of Bellman Values, connected by a set of well-known Bellman equations (BEs): the Reach-Avoid BE, the Avoid BE, and a novel type, the Reach-Avoid-Loop BE. To solve the Value and optimal policy, we propose VDPPO, which embeds the decomposed Value graph into a two-layer neural net, bootstrapping the implicit dependencies.
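As a minimal sketch of what solving one node of such a graph can look like, the snippet below runs tabular value iteration for a reach-avoid equation of the form V(x) = min(q(x), max(r(x), max_u V(f(x, u)))) on a six-cell chain. The chain, the reward and margin values, and every function name are hypothetical; this is plain dynamic programming on a toy problem, not the paper's VDPPO algorithm.

```python
def reach_avoid_value(reward, margin, step, actions, n_iters=50):
    """Tabular value iteration for V(x) = min(q, max(r, max_u V(f(x, u))))."""
    n = len(reward)
    # Zero-step horizon: the target must be reached (safely) right now.
    values = [min(margin[x], reward[x]) for x in range(n)]
    for _ in range(n_iters):
        # Synchronous backup: each sweep extends the horizon by one step.
        values = [
            min(margin[x], max(reward[x], max(values[step(x, u)] for u in actions)))
            for x in range(n)
        ]
    return values


N_CELLS = 6  # hypothetical chain: cells 0..5


def chain_step(x, u):
    """Deterministic move along the chain, clipped at both ends."""
    return min(max(x + u, 0), N_CELLS - 1)


# Assumed toy task: target at cell 5, hazard at cell 2.
reward = [1 if x == 5 else -1 for x in range(N_CELLS)]
margin = [-1 if x == 2 else 1 for x in range(N_CELLS)]

values = reach_avoid_value(reward, margin, chain_step, actions=(-1, 0, 1))
print(values)  # [-1, -1, -1, 1, 1, 1]
```

Cells 0 and 1 receive a negative value because every route to the target crosses the hazard at cell 2, while cells 3 to 5 can reach the target safely.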

What carries the argument

Decomposition of the Bellman value into a graph connected by the reach-avoid Bellman equation, the avoid Bellman equation, and the reach-avoid-loop Bellman equation, embedded in a two-layer neural network.
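One way to picture a graph of values is a two-node example in which the root node's reach reward depends on a child node's value. The sketch below assumes a toy task, "visit A, then visit B, always avoiding the hazard", on a right-moving chain; the task, the layout, and all names are hypothetical. It solves the nodes one after another with dynamic programming, whereas the source describes VDPPO as bootstrapping these dependencies inside a neural net.

```python
N_CELLS = 6
ACTIONS = (0, 1)  # assumed dynamics: stay or move right only


def step(x, u):
    """Deterministic move along the chain, clipped at the right end."""
    return min(max(x + u, 0), N_CELLS - 1)


def solve_reach_avoid(reward, margin, n_iters=20):
    """Value iteration for V(x) = min(q, max(r, max_u V(f(x, u))))."""
    values = [min(margin[x], reward[x]) for x in range(N_CELLS)]
    for _ in range(n_iters):
        values = [
            min(margin[x], max(reward[x], max(values[step(x, u)] for u in ACTIONS)))
            for x in range(N_CELLS)
        ]
    return values


margin = [1, 1, 1, 1, 1, -1]                         # hazard at cell 5
r_a = [1 if x == 2 else -1 for x in range(N_CELLS)]  # goal A at cell 2
r_b = [1 if x == 4 else -1 for x in range(N_CELLS)]  # goal B at cell 4

# Child node: "reach B while avoiding the hazard".
v_b = solve_reach_avoid(r_b, margin)

# Root node: reaching A only counts if B is still achievable from there,
# so the root's reach reward is min(r_a, v_b). This is the graph edge.
root_reward = [min(r_a[x], v_b[x]) for x in range(N_CELLS)]
v_root = solve_reach_avoid(root_reward, margin)

print(v_b)     # [1, 1, 1, 1, 1, -1]
print(v_root)  # [1, 1, 1, -1, -1, -1]
```

Cells 3 and 4 can still reach B, so the child value is positive there, but they can no longer visit A first, so the root value is negative. The root cannot be computed correctly without the child, which is the kind of dependency the graph encodes.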

If this is right

The optimal policy for combined safety and goal tasks is obtained by solving the embedded graph in a two-layer neural net.
Safety and liveness specifications balance automatically without separate reward tuning.
The method applies directly to high-dimensional systems with nonlinear dynamics and heterogeneous agent teams.
Implicit task dependencies are resolved by the network bootstrapping process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This decomposition approach could be tested on task specifications beyond the temporal logic considered in the paper, such as signal temporal logic or other formalisms.
Recursive application of the graph structure might handle more deeply nested specifications in future extensions.
The two-layer embedding may transfer to continuous-time settings if the underlying Bellman equations are discretized consistently.

Load-bearing premise

The innate structure of the Bellman value organizes temporal logic tasks so that the decomposed graph embeds into a two-layer neural net without manual tuning or post-hoc adjustments.

What would settle it

A concrete temporal logic task where the value function computed from the decomposed graph and VDPPO differs measurably from the value obtained by solving the full undecomposed Bellman equation, or where the resulting policy violates a safety or liveness specification in direct simulation.
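A check of this kind can be sketched on a problem small enough to enumerate. The snippet below assumes a toy task, "visit A, then visit B, always avoiding the hazard", on a six-cell chain, and compares a decomposed value (two chained reach-avoid solves) against a brute-force search over all action sequences. The task, the scoring function, and all names are hypothetical stand-ins, not the paper's experiments.

```python
from itertools import product

N_CELLS, ACTIONS, HORIZON = 6, (0, 1), 6
margin = [1, 1, 1, 1, 1, -1]                         # hazard at cell 5
r_a = [1 if x == 2 else -1 for x in range(N_CELLS)]  # goal A at cell 2
r_b = [1 if x == 4 else -1 for x in range(N_CELLS)]  # goal B at cell 4


def step(x, u):
    return min(max(x + u, 0), N_CELLS - 1)


def solve_reach_avoid(reward, n_iters=HORIZON):
    """Value iteration for V(x) = min(q, max(r, max_u V(f(x, u))))."""
    values = [min(margin[x], reward[x]) for x in range(N_CELLS)]
    for _ in range(n_iters):
        values = [
            min(margin[x], max(reward[x], max(values[step(x, u)] for u in ACTIONS)))
            for x in range(N_CELLS)
        ]
    return values


def brute_force(x0):
    """Best score over all action sequences: A at t1, B at t2 >= t1, safe up to t2."""
    best = float("-inf")
    for seq in product(ACTIONS, repeat=HORIZON):
        traj = [x0]
        for u in seq:
            traj.append(step(traj[-1], u))
        for t2 in range(len(traj)):
            safe = min(margin[x] for x in traj[: t2 + 1])
            for t1 in range(t2 + 1):
                best = max(best, min(r_a[traj[t1]], r_b[traj[t2]], safe))
    return best


# Decomposed route: solve the child node, then feed it into the root's reward.
v_b = solve_reach_avoid(r_b)
decomposed = solve_reach_avoid([min(r_a[x], v_b[x]) for x in range(N_CELLS)])

# Undecomposed route: enumerate trajectories directly.
# The horizon is long enough here because every useful path takes at most 4 steps.
exhaustive = [brute_force(x0) for x0 in range(N_CELLS)]

print(decomposed == exhaustive)  # True
```

On this toy problem the two routes agree. A disagreement on some task of this form would be the kind of counterexample described above.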

Figures

Figures reproduced from arXiv: 2602.19532 by Chuchu Fan, Dylan Hirsch, Oswin So, Sylvia Herbert, William Sharpless.

**Figure 1.** Value-Decomposition and VDPPO. The Bellman Value for a range of temporal logic (e.g., multi-goal, recurrence, stability, safety) decomposes into a Value graph connected by atomic Bellman equations (Thms. 1–4). We propose VDPPO, an algorithm that exploits this structure to learn policies for complex, high-dimensional tasks. Our approach is validated on hardware with Herding and Delivery, two complex tasks i…

**Figure 2.** E.g. N-Until-Conjunction Value Decomposition. Here we illustrate the primary decomposition result (Thm. 1 extension, Appendix), with a GridWorld example (left) for a given specification. The corresponding DVG is shown (center left) with each node representing a decomposed Value, and edges representing dependencies. In the center right, a subset of decomposed Values solved with dynamic programming are shown…

**Figure 3.** E.g. G(N-Until-Conjunction) Value Decomposition. We illustrate the recursive decomposition result (Thm. 3), with a GridWorld example (left) for a given specification. The plots here are analogous to those of …

**Figure 4.** Graphical Depiction of Algorithms.

**Figure 5.** Performance scaling with TL complexity. Value decomposition enables VDPPO to better scale by tackling smaller problems …

**Figure 7.** Hardware Overview for Herding and Delivery Tasks.

**Figure 8.** Trajectory snapshots from Herding and Delivery hardware tasks. We show a long-exposure photo (left), and stills from independent times (right), with depictions corresponding to those of the overview in …

**Figure 9.** Effect of parameter sharing. Sharing parameters for the actor only improves performance while reducing the variance.

Abstract

Real-world tasks involve nuanced combinations of goal and safety specifications. In high dimensions, the challenge is exacerbated: formal automata become cumbersome, and the combination of sparse rewards tends to require laborious tuning. In this work, we consider the innate structure of the Bellman Value as a means to naturally organize the problem for improved automatic performance. Namely, we prove the Bellman Value for a complex task defined in temporal logic can be decomposed into a graph of Bellman Values, connected by a set of well-known Bellman equations (BEs): the Reach-Avoid BE, the Avoid BE, and a novel type, the Reach-Avoid-Loop BE. To solve the Value and optimal policy, we propose VDPPO, which embeds the decomposed Value graph into a two-layer neural net, bootstrapping the implicit dependencies. We conduct a variety of simulated and hardware experiments to test our method on complex, high-dimensional tasks involving heterogeneous teams and nonlinear dynamics. Ultimately, we find this approach greatly improves performance over existing baselines, balancing safety and liveness automatically.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper derives a Reach-Avoid-Loop Bellman equation to decompose temporal logic tasks into a value graph and proposes VDPPO, which embeds that graph in a two-layer neural net.

The main takeaway is that the authors claim the value of a complex temporal-logic task decomposes into a graph using the standard Reach-Avoid and Avoid Bellman equations plus one new Reach-Avoid-Loop equation. They then solve it with VDPPO, which wires that graph into a two-layer neural net and learns the policy without extra manual tuning. They test this on simulated and hardware tasks with nonlinear dynamics and multi-agent setups, reporting better safety-liveness trade-offs than baselines.

The experiments appear to be the strongest part: they cover high-dimensional cases and real hardware, which is more than many value-decomposition papers manage. The core idea of letting the Bellman structure organize the temporal logic is straightforward and worth testing.

The soft spot is the proof. The abstract states that they prove the decomposition, but without the derivation steps it is impossible to check whether the new Reach-Avoid-Loop equation actually covers arbitrary STL formulas, stochastic transitions, or nested operators without hidden assumptions about finite loops or memoryless sub-tasks. If those assumptions are required, the claim narrows. The two-layer net claim also needs scrutiny: any implicit fitting during training could undermine the "no post-hoc tuning" assertion. The citation pattern looks standard for safe RL and formal methods work.

This paper is aimed at researchers who already work on Bellman-based methods for constrained control or STL specifications. A reader who wants concrete experiments on combined safety and liveness in robotics will find usable ideas even if the theory needs tightening. I would send it to peer review because the empirical results are concrete and the decomposition direction is worth referee attention, though the reviewers will almost certainly press on the exact conditions of the new Bellman equation.

Referee Report

3 major / 2 minor

Summary. The paper claims to prove that the Bellman value for any complex task specified in temporal logic decomposes into a graph of simpler Bellman values connected exclusively by the Reach-Avoid Bellman equation, the Avoid Bellman equation, and a novel Reach-Avoid-Loop Bellman equation. It introduces VDPPO, which embeds this value graph into a two-layer neural network that bootstraps the implicit dependencies to compute the value function and optimal policy. Experiments on simulated and hardware tasks with heterogeneous teams and nonlinear dynamics are reported to show improved safety-liveness trade-offs over baselines without post-hoc tuning.

Significance. If the decomposition theorem holds for general STL formulas and the two-layer embedding preserves optimality without hidden parameters, the result would supply a structured, largely automatic route to safe optimal control for high-dimensional tasks whose specifications combine reachability, avoidance, and looping behaviors. The approach could reduce reliance on manual reward shaping or automata construction in robotics applications.

major comments (3)

[Main derivation of Reach-Avoid-Loop BE] The central proof relies on the correctness of the novel Reach-Avoid-Loop Bellman equation. The manuscript must explicitly derive this equation from the STL semantics and state the precise assumptions (e.g., deterministic vs. stochastic transitions, finite vs. infinite loop horizons, memoryless sub-tasks) under which the fixed-point equation is valid; without this, the claim that every STL task reduces to combinations of only the three listed operators cannot be verified.
[VDPPO architecture and training] The assertion that the decomposed graph embeds losslessly into a two-layer neural net without additional manual tuning or post-hoc adjustments is load-bearing for the practical contribution. The paper should provide a formal argument or explicit construction showing that all implicit dependencies among the sub-values are captured by the network architecture and loss; otherwise the performance gains may stem from implicit fitting rather than the decomposition.
[Experimental results] Experiments report improved performance, yet the manuscript does not include an ablation that isolates the contribution of the Reach-Avoid-Loop equation versus the standard Reach-Avoid and Avoid equations. Without this, it is unclear whether the novel operator is necessary for the observed gains or whether simpler decompositions suffice.

minor comments (2)

[Preliminaries] Notation for the three Bellman equations should be unified and introduced in a single preliminary section to improve readability.
[Experiments] The abstract states that the method 'balances safety and liveness automatically'; the experimental section should report quantitative safety-violation rates alongside task-completion rates for all baselines.
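The reporting requested here could be tabulated along the following lines. This is a sketch only: the rollout format (a pair of flags per episode), the function name, and the sample data are hypothetical and do not come from the paper.

```python
def summarize(rollouts):
    """Return safety-violation and task-completion rates for a list of rollouts.

    Each rollout is a (violated, completed) pair of booleans. A rollout counts
    as completed only if it finished the task without any safety violation.
    """
    n = len(rollouts)
    violations = sum(1 for violated, _ in rollouts if violated)
    completions = sum(1 for violated, done in rollouts if done and not violated)
    return {"violation_rate": violations / n, "completion_rate": completions / n}


# Made-up example data for one method: four evaluation episodes.
example_rollouts = [(False, True), (False, False), (True, False), (False, True)]
summary = summarize(example_rollouts)
print(summary)  # {'violation_rate': 0.25, 'completion_rate': 0.5}
```

Reporting both numbers per baseline makes it visible when a method reaches a high completion rate only by tolerating more violations.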

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We address each of the major comments below, providing clarifications and committing to revisions where appropriate to strengthen the paper.

Referee: [Main derivation of Reach-Avoid-Loop BE] The central proof relies on the correctness of the novel Reach-Avoid-Loop Bellman equation. The manuscript must explicitly derive this equation from the STL semantics and state the precise assumptions (e.g., deterministic vs. stochastic transitions, finite vs. infinite loop horizons, memoryless sub-tasks) under which the fixed-point equation is valid; without this, the claim that every STL task reduces to combinations of only the three listed operators cannot be verified.

Authors: We agree that an explicit derivation is essential for verifying the decomposition theorem. In the revised manuscript, we will add a dedicated subsection in the theoretical analysis that derives the Reach-Avoid-Loop Bellman equation step-by-step from the STL semantics. We will explicitly state the assumptions: deterministic transitions, infinite-horizon loops with appropriate discounting to ensure convergence, and memoryless sub-tasks as per the STL fragment considered. This will confirm that all complex STL formulas reduce to the three operators.

revision: yes

Referee: [VDPPO architecture and training] The assertion that the decomposed graph embeds losslessly into a two-layer neural net without additional manual tuning or post-hoc adjustments is load-bearing for the practical contribution. The paper should provide a formal argument or explicit construction showing that all implicit dependencies among the sub-values are captured by the network architecture and loss; otherwise the performance gains may stem from implicit fitting rather than the decomposition.

Authors: The VDPPO architecture is specifically designed with the first layer computing the sub-value functions corresponding to the graph nodes and the second layer implementing the bootstrapping via the Bellman operators. We will include in the revision a formal construction in the appendix that proves the network captures all dependencies through its layered structure and the composite loss function, without requiring manual tuning or hidden parameters beyond the graph embedding.

revision: yes

Referee: [Experimental results] Experiments report improved performance, yet the manuscript does not include an ablation that isolates the contribution of the Reach-Avoid-Loop equation versus the standard Reach-Avoid and Avoid equations. Without this, it is unclear whether the novel operator is necessary for the observed gains or whether simpler decompositions suffice.

Authors: We acknowledge the value of isolating the contribution of the novel operator. In the revised version, we will add an ablation study in the experimental section that compares the full VDPPO using all three equations against variants using only Reach-Avoid and Avoid equations on the same tasks, to demonstrate the necessity of the Reach-Avoid-Loop BE for the reported performance improvements.

revision: yes

Circularity Check

0 steps flagged

Derivation self-contained; no circular reductions identified

full rationale

The paper states it proves decomposition of the Bellman value for temporal-logic tasks into a graph connected by Reach-Avoid BE, Avoid BE, and a novel Reach-Avoid-Loop BE. The novel equation is presented as derived within the proof rather than obtained by fitting or self-definition. The two-layer neural net embedding is described as a solution architecture that bootstraps implicit dependencies, not as a statistical prediction forced by prior fits. No load-bearing self-citations, uniqueness theorems imported from the same authors, or ansatzes smuggled via prior work are referenced in the abstract or description. The central claim therefore retains independent mathematical content and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that temporal logic tasks possess an innate Bellman-value structure that decomposes cleanly; no explicit free parameters are stated, but the neural-net training implicitly introduces fitted weights.

axioms (1)

domain assumption: Bellman value for temporal logic tasks admits a graph decomposition connected by reach-avoid, avoid, and reach-avoid-loop equations
Stated as the basis for the proof in the abstract.

invented entities (1)

Reach-Avoid-Loop Bellman equation (no independent evidence)
purpose: To capture looping behavior within the decomposed value graph
Presented as a novel type required to close the decomposition for certain tasks.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "we prove the Bellman Value for a complex task defined in temporal logic can be decomposed into a graph of Bellman Values, connected by ... the Reach-Avoid BE, the Avoid BE, and a novel type, the Reach-Avoid-Loop BE"

IndisputableMonolith/Foundation/ArrowOfTime.lean · forward_accumulates · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Lemma 2 ... RAℓ-BE ... lim γ→1 Vγj = V*[G(∧j∈J (qj U rj))]"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Value Functions for Temporal Logic: Optimal Policies and Safety Filters
cs.RO 2026-05 unverdicted novelty 6.0

Non-Markovian policies from decomposed temporal logic value functions are proven optimal for nested Until, Globally, and Globally-Until specifications and extend Q-function safety filters to complex tasks.

Reference graph

Works this paper leans on

97 extracted references · 97 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

R. S. Sutton and A. G. Barto,Reinforcement Learning: An Introduction. Cambridge, MA, USA: A Bradford Book, 2018

work page 2018
[2]

LTL and beyond: Formal languages for reward function specification in reinforcement learning,

A. Camacho, R. Toro Icarte, T. Q. Klassen, R. Valenzano, and S. A. McIlraith, “LTL and beyond: Formal languages for reward function specification in reinforcement learning,” inProceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence. California: International Joint Conferences on Artificial Intelligence Organization, 1 Au...

work page 2019
[3]

A time-dependent hamilton-jacobi formulation of reachable sets for continuous dynamic games,

I. M. Mitchell, A. M. Bayen, and C. J. Tomlin, “A time-dependent hamilton-jacobi formulation of reachable sets for continuous dynamic games,”IEEE Transactions on automatic control, vol. 50, no. 7, pp. 947–957, 2005

work page 2005
[4]

Reach-avoid problems with time-varying dynamics, targets and constraints,

J. F. Fisac, M. Chen, C. J. Tomlin, and S. S. Sastry, “Reach-avoid problems with time-varying dynamics, targets and constraints,” inHybrid Systems: Computation and Control. ACM, 2015

work page 2015
[5]

Dual- objective reinforcement learning with novel hamilton-jacobi-bellman formulations,

W. Sharpless, D. Hirsch, S. Tonkens, N. Shinde, and S. Herbert, “Dual-objective reinforcement learning with novel hamilton-jacobi-bellman formulations,”arXiv preprint arXiv:2506.16016, 2025

work page arXiv 2025
[6]

Temporal logic guided safe model-based reinforcement learning: A hybrid systems approach,

M. H. Cohen, Z. Serlin, K. Leahy, and C. Belta, “Temporal logic guided safe model-based reinforcement learning: A hybrid systems approach,”Nonlinear Anal. Hybrid Syst., vol. 47, no. 101295, p. 101295, Feb. 2023

work page 2023
[7]

Instructing goal- conditioned reinforcement learning agents with temporal logic objectives,

W. Qiu, W. Mao, and H. Zhu, “Instructing goal- conditioned reinforcement learning agents with temporal logic objectives,”Neural Inf Process Syst, vol. 36, pp. 39 147–39 175, 2023

work page 2023
[8]

Verification of Markov decision processes using learning algorithms,

T. Brázdil, K. Chatterjee, M. Chmelík, V . Forejt, J. Kˇretínský, M. Kwiatkowska, D. Parker, and M. Ujma, “Verification of Markov decision processes using learning algorithms,”arXiv [cs.LO], 10 Feb. 2014

work page 2014
[9]

Training agents to satisfy timed and untimed signal temporal logic specifications with reinforcement learning,

N. Hamilton, P. K. Robinette, and T. T. Johnson, “Training agents to satisfy timed and untimed signal temporal logic specifications with reinforcement learning,” inSoftware Engineering and Formal Methods, ser. Lecture notes in computer science. Cham: Springer International Publishing, 2022, pp. 190–206

work page 2022
[10]

A learning based approach to control synthesis of Markov decision processes for linear temporal logic specifications,

D. Sadigh, E. S. Kim, S. Coogan, S. S. Sastry, and S. A. Seshia, “A learning based approach to control synthesis of Markov decision processes for linear temporal logic specifications,” in53rd IEEE Conference on Decision and Control. IEEE, Dec. 2014, pp. 1091–1096

work page 2014
[11]

Control synthesis from linear temporal logic specifications using model-free reinforcement learning,

A. K. Bozkurt, Y . Wang, M. M. Zavlanos, and M. Pajic, “Control synthesis from linear temporal logic specifications using model-free reinforcement learning,” in2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, May 2020, p. 10349–10355

work page 2020
[12]

Rewarding behaviors,

F. Bacchus, C. Boutilier, and A. J. Grove, “Rewarding behaviors,” inProceedings of the National Conference on Artificial Intelligence.cs.toronto.edu, 4 Aug. 1996, pp. 1160–1167

work page 1996
[13]

Decision-theoretic planning with non- Markovian rewards,

S. Thiebaux, C. Gretton, J. Slaney, D. Price, and F. Kabanza, “Decision-theoretic planning with non- Markovian rewards,”J. Artif. Intell. Res., vol. 25, pp. 17–74, 29 Jan. 2006

work page 2006
[14]

Non- Markovian rewards expressed in LTL: Guiding search via reward shaping,

A. Camacho, O. Chen, S. Sanner, and S. McIlraith, “Non- Markovian rewards expressed in LTL: Guiding search via reward shaping,”Proceedings of the International Symposium on Combinatorial Search, vol. 8, no. 1, pp. 159–160, 1 Sep. 2021

work page 2021
[15]

Using reward machines for high-level task specification and decomposition in reinforcement learning,

R. T. Icarte, T. Q. Klassen, R. Valenzano, and S. A. McIlraith, “Using reward machines for high-level task specification and decomposition in reinforcement learning,”ICML, vol. 80, pp. 2112–2121, 3 Jul. 2018

work page 2018
[16]

Q-learning for robust satisfaction of signal temporal logic specifications,

D. Aksaray, A. Jones, Z. Kong, M. Schwager, and C. Belta, “Q-learning for robust satisfaction of signal temporal logic specifications,” in2016 IEEE 55th Conference on Decision and Control (CDC). IEEE, Dec. 2016, pp. 6565–6570

work page 2016
[17]

Reinforcement learning with temporal logic rewards,

X. Li, C.-I. Vasile, and C. Belta, “Reinforcement learning with temporal logic rewards,” in2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Sep. 2017, pp. 3834–3839

work page 2017
[18]

Modular deep reinforcement learning for continuous motion planning with temporal logic.IEEE robotics and automation letters, 6(4):7973–7980, 2021

M. Cai, M. Hasanbeig, S. Xiao, A. Abate, and Z. Kan, “Modular deep reinforcement learning for continuous motion planning with temporal logic,” IEEE Robotics and Automation Letters, vol. 6, no. 4, p. 7973–7980, Oct. 2021. [Online]. Available: http://dx.doi.org/10.1109/LRA.2021.3101544

work page doi:10.1109/lra.2021.3101544 2021
[19]

Planning with general objective functions: Going beyond total rewards,

R. Wang, P. Zhong, S. S. Du, R. R. Salakhutdinov, and L. Yang, “Planning with general objective functions: Going beyond total rewards,” inAdvances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 14 486–14 497

work page 2020
[20]

Reinforcement learning with non-cumulative objective,

W. Cui and W. Yu, “Reinforcement learning with non-cumulative objective,”IEEE Transactions on Machine Learning in Communications and Networking, vol. 1, pp. 124–137, 2023

work page 2023
[21]

Recursive reward aggregation,

Y . Tang, Y . Zhang, J. Ackermann, Y .-J. Zhang, S. Nishi- mori, and M. Sugiyama, “Recursive reward aggregation,” inReinforcement Learning Conference, 2025

work page 2025
[22]

Hybrid reward architecture for reinforcement learning,

H. van Seijen, M. Fatemi, J. Romoff, R. Laroche, T. Barnes, and J. Tsang, “Hybrid reward architecture for reinforcement learning,” inProceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS’17. Red Hook, NY , USA: Curran Associates Inc., 2017, p. 5398–5408

work page 2017
[23]

Consistent aggregation of objectives with diverse time preferences requires non-markovian rewards,

S. Pitis, “Consistent aggregation of objectives with diverse time preferences requires non-markovian rewards,” inThirty-seventh Conference on Neural Information Processing Systems, 2023

work page 2023
[24]

Rdˆ2: Reward decomposition with representation decomposition,

Z. Lin, D. Yang, L. Zhao, T. Qin, G. Yang, and T.-Y . Liu, “Rdˆ2: Reward decomposition with representation decomposition,” inAdvances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran 9 Associates, Inc., 2020, pp. 11 298–11 308

work page 2020
[25]

Altman,Constrained Markov decision processes: Stochastic modeling

E. Altman,Constrained Markov decision processes: Stochastic modeling. Boca Raton: Routledge, 13 Dec. 2021

work page 2021
[26]

Constrained Policy Optimization

J. Achiam, D. Held, A. Tamar, and P. Abbeel, “Constrained policy optimization,”ICML, vol. abs/1705.10528, pp. 22–31, 30 May 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[27]

Safe reinforcement learning in constrained Markov decision processes,

A. Wachi and Y . Sui, “Safe reinforcement learning in constrained Markov decision processes,”ICML, vol. 119, pp. 9797–9806, 12 Jul. 2020

work page 2020
[28]

Responsive safety in reinforcement learning by PID lagrangian methods,

A. Stooke, J. Achiam, and P. Abbeel, “Responsive safety in reinforcement learning by PID lagrangian methods,” ICML, vol. 119, pp. 9133–9143, 8 Jul. 2020

work page 2020
[29]

Faster algorithm and sharper analysis for constrained Markov decision process,

T. Li, Z. Guan, S. Zou, T. Xu, Y . Liang, and G. Lan, “Faster algorithm and sharper analysis for constrained Markov decision process,”Oper. Res. Lett., vol. 54, no. 107107, p. 107107, May 2024

work page 2024
[30]

A primal-dual approach to constrained Markov decision processes,

Y . Chen, J. Dong, and Z. Wang, “A primal-dual approach to constrained Markov decision processes,”arXiv [math.OC], 26 Jan. 2021

work page 2021
[31]

A simple reward-free approach to constrained reinforcement learning,

S. Miryoosefi and C. Jin, “A simple reward-free approach to constrained reinforcement learning,”ICML, vol. abs/2107.05216, pp. 15 666–15 698, 12 Jul. 2021

work page arXiv 2021
[32]

Projection-based constrained policy optimization,

T.-Y . Yang, J. Rosca, K. Narasimhan, and P. J. Ramadge, “Projection-based constrained policy optimization,”arXiv [cs.LG], 7 Oct. 2020

work page 2020
[33]

Natural policy gradient primal-dual method for constrained Markov decision processes,

D. Ding, K. Zhang, T. Ba¸ sar, and M. Jovanovi´c, “Natural policy gradient primal-dual method for constrained Markov decision processes,”Neural Inf Process Syst, vol. 33, pp. 8378–8390, 2020

work page 2020
[34]

Reward constrained policy optimization,

C. Tessler, D. J. Mankowitz, and S. Mannor, “Reward constrained policy optimization,”arXiv [cs.LG], 28 May 2018

work page 2018
[35]

Reinforcement learning for constrained Markov decision processes,

A. Gattami, Q. Bai, and V . Aggarwal, “Reinforcement learning for constrained Markov decision processes,” AISTATS, vol. 130, pp. 2656–2664, 2021

work page 2021
[36]

Constrained Markov decision processes via backward value functions,

H. Satija, P. Amortila, and J. Pineau, “Constrained Markov decision processes via backward value functions,” ICML, vol. 119, pp. 8502–8511, 12 Jul. 2020

work page 2020
[37]

Reinforcement learning with almost sure constraints,

A. Castellano, H. Min, E. Mallada, and J. A. Bazerque, “Reinforcement learning with almost sure constraints,” in Proceedings of The 4th Annual Learning for Dynamics and Control Conference, ser. Proceedings of Machine Learning Research, vol. 168. PMLR, 2022, pp. 559–570

work page 2022
[38]

Anytime-constrained reinforcement learning,

J. McMahan and X. Zhu, “Anytime-constrained reinforcement learning,” inProceedings of The 27th International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, S. Dasgupta, S. Mandt, and Y . Li, Eds., vol

work page
[39]

4321–4329

PMLR, 02–04 May 2024, pp. 4321–4329

work page 2024
[40]

Model-based multi-objective reinforcement learning,

M. A. Wiering, M. Withagen, and M. M. Drugan, “Model-based multi-objective reinforcement learning,” in2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL). IEEE, Dec. 2014, pp. 1–6

[41] M. K. Van and A. Nowé, “Multi-objective reinforcement learning using sets of Pareto dominating policies,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 3483–3512, 2014

[42] X.-Q. Cai, P. Zhang, L. Zhao, J. Bian, M. Sugiyama, and A. Llorens, “Distributional Pareto-optimal multi-objective reinforcement learning,” Neural Inf Process Syst, vol. 36, pp. 15 593–15 613, 2023

[43] H. Mossalam, Y. M. Assael, D. M. Roijers, and S. Whiteson, “Multi-objective deep reinforcement learning,” arXiv [cs.AI], 9 Oct. 2016

[44] A. Abels, D. Roijers, T. Lenaerts, A. Nowé, and D. Steckelmacher, “Dynamic weights in multi-objective deep reinforcement learning,” in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 2019, pp. 11–20

[45] R. Yang, X. Sun, and K. Narasimhan, “A generalized algorithm for multi-objective reinforcement learning and policy adaptation,” in Advances in Neural Information Processing Systems. proceedings.neurips.cc, 2019

[46] E. Liu, Y.-C. Wu, X. Huang, C. Gao, R.-J. Wang, K. Xue, and C. Qian, “Pareto set learning for multi-objective reinforcement learning,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 18, 2025

[47] M. Liu, M. Zhu, and W. Zhang, “Goal-conditioned reinforcement learning: Problems and solutions,” arXiv [cs.AI], 20 Jan. 2022

[48] M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin, M. Chociej, P. Welinder, V. Kumar, and W. Zaremba, “Multi-goal reinforcement learning: Challenging robotics environments and request for research,” arXiv [cs.LG], 26 Feb. 2018

[49] Z. Ren, K. Dong, Y. Zhou, Q. Liu, and J. Peng, “Exploration via hindsight goal generation,” Neural Inf Process Syst, vol. 32, pp. 13 464–13 474, 1 Jun. 2019

[50] J. Y. Ma, J. Yan, D. Jayaraman, and O. Bastani, “Offline goal-conditioned reinforcement learning via f-advantage regression,” in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35. Curran Associates, Inc., 2022, pp. 310–323

[51] A. Campero, R. Raileanu, H. Küttler, J. B. Tenenbaum, T. Rocktäschel, and E. Grefenstette, “Learning with AMIGo: Adversarially motivated intrinsic goals,” arXiv [cs.LG], 22 Jun. 2020

[52] A. R. Trott, S. Zheng, C. Xiong, and R. Socher, “Keeping your distance: Solving sparse reward tasks using self-balancing shaped rewards,” Neural Inf Process Syst, vol. abs/1911.01417, 4 Nov. 2019

[53] B. Eysenbach, T. Zhang, R. Salakhutdinov, and S. Levine, “Contrastive learning as goal-conditioned reinforcement learning,” Neural Inf Process Syst, vol. abs/2206.07568, pp. 35 603–35 620, 15 Jun. 2022

[54] E. Chane-Sane, C. Schmid, and I. Laptev, “Goal-conditioned reinforcement learning with imagined subgoals,” ICML, vol. abs/2107.00541, pp. 1430–1440, 1 Jul. 2021

[55] M. Chen, Q. Tam, S. C. Livingston, and M. Pavone, “Signal temporal logic meets reachability: Connections and applications,” in International Workshop on the Algorithmic Foundations of Robotics. Springer, 2018, pp. 581–601

[56] O. So, C. Ge, and C. Fan, “Solving minimum-cost reach avoid using reinforcement learning,” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. [Online]. Available: https://openreview.net/forum?id=jzngdJQ2lY

[57] K.-C. Hsu, V. Rubies-Royo, C. J. Tomlin, and J. F. Fisac, “Safety and liveness guarantees through reach-avoid reinforcement learning,” in Proceedings of Robotics: Science and Systems, Held Virtually, July 2021

[58] J. F. Fisac, N. F. Lugovoy, V. Rubies-Royo, S. Ghosh, and C. J. Tomlin, “Bridging hamilton-jacobi safety analysis and reinforcement learning,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 8550–8556

[59] M. Ganai, C. Hirayama, Y.-C. Chang, and S. Gao, “Learning stabilization control from observations by learning lyapunov-like proxy models,” 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023

[60] D. Yu, H. Ma, S. Li, and J. Chen, “Reachability constrained reinforcement learning,” in International Conference on Machine Learning. PMLR, 2022, pp. 25 636–25 655

[61] K. Zhu, F. Lan, W. Zhao, and T. Zhang, “Safe multi-agent reinforcement learning via approximate hamilton-jacobi reachability,” J. Intell. Robot. Syst., vol. 111, no. 1, 30 Dec. 2024

[62] O. Maler and D. Nickovic, “Monitoring temporal properties of continuous signals,” in International symposium on formal techniques in real-time and fault-tolerant systems. Springer, 2004, pp. 152–166

[63] A. Donzé and O. Maler, “Robust satisfaction of temporal logic over real-valued signals,” in International conference on formal modeling and analysis of timed systems. Springer, 2010, pp. 92–106

[64] S. Bansal, M. Chen, S. Herbert, and C. J. Tomlin, “Hamilton-jacobi reachability: A brief overview and recent advances,” in 2017 IEEE 56th Annual Conference on Decision and Control (CDC). IEEE, 2017, pp. 2242–2253

[65] M. Ganai, Z. Gong, C. Yu, S. Herbert, and S. Gao, “Iterative reachability estimation for safe reinforcement learning,” Advances in Neural Information Processing Systems, vol. 36, pp. 69 764–69 797, 2023

[66] C. Baier and J.-P. Katoen, Principles of model checking. MIT press, 2008

[67] O. Grumberg, E. Clarke, and D. Peled, “Model checking,” in International Conference on Foundations of Software Technology and Theoretical Computer Science; Springer: Berlin/Heidelberg, Germany, 1999

[68] G. Bombara, C.-I. Vasile, F. Penedo, H. Yasuoka, and C. Belta, “A decision tree approach to data classification using signal temporal logic,” in Proceedings of the 19th International Conference on Hybrid Systems: Computation and Control, 2016, pp. 1–10

[69] Y. Meng, F. Chen, and C. Fan, “Tgpo: Temporal grounded policy optimization for signal temporal logic tasks,” arXiv preprint arXiv:2510.00225, 2025

[70] M. Hasanbeig, D. Kroening, and A. Abate, “Lcrl: Certified policy synthesis via logically-constrained reinforcement learning,” in International Conference on Quantitative Evaluation of SysTems. Springer, 2022, pp. 217–231

[71] G. Williams, P. Drews, B. Goldfain, J. M. Rehg, and E. A. Theodorou, “Aggressive driving with model predictive path integral control,” in 2016 IEEE international conference on robotics and automation (ICRA). IEEE, 2016, pp. 1433–1440

[72] P. Halder, H. Homburger, L. Kiltz, J. Reuter, and M. Althoff, “Trajectory planning with signal temporal logic costs using deterministic path integral optimization,” arXiv preprint arXiv:2503.01476, 2025

[73] L. Yifru and A. Baheri, “Concurrent learning of control policy and unknown safety specifications in reinforcement learning,” IEEE Open Journal of Control Systems, vol. 3, pp. 266–281, 2024

[74] D. Kasenberg and M. Scheutz, “Interpretable apprenticeship learning with temporal logic specifications,” in 2017 IEEE 56th Annual Conference on Decision and Control (CDC), 2017, pp. 4914–4921

[75] M. Gaon and R. Brafman, “Reinforcement learning with non-markovian rewards,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, pp. 3980–3987, Apr. 2020

[76] K. Jothimurugan, S. Bansal, O. Bastani, and R. Alur, “Compositional reinforcement learning from logical specifications,” in Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 10 026–10 039

[77] A. Duret-Lutz, E. Renault, M. Colange, F. Renkin, A. Gbaguidi Aisse, P. Schlehuber-Caissier, T. Medioni, A. Martin, J. Dubois, C. Gillard et al., “From spot 2.0 to spot 2.10: What’s new?” in International Conference on Computer Aided Verification. Springer, 2022, pp. 174–187

[78] R. Diestel, Graph theory. Springer Nature, 2025

[79] W. Rudin, “Principles of mathematical analysis,” 3rd ed., 1976

[80] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” arXiv preprint arXiv:1506.02438, 2015



[32] [32]

Projection-based constrained policy optimization,

T.-Y . Yang, J. Rosca, K. Narasimhan, and P. J. Ramadge, “Projection-based constrained policy optimization,”arXiv [cs.LG], 7 Oct. 2020

work page 2020

[33] [33]

Natural policy gradient primal-dual method for constrained Markov decision processes,

D. Ding, K. Zhang, T. Ba¸ sar, and M. Jovanovi´c, “Natural policy gradient primal-dual method for constrained Markov decision processes,”Neural Inf Process Syst, vol. 33, pp. 8378–8390, 2020

work page 2020

[34] [34]

Reward constrained policy optimization,

C. Tessler, D. J. Mankowitz, and S. Mannor, “Reward constrained policy optimization,”arXiv [cs.LG], 28 May 2018

work page 2018

[35] [35]

Reinforcement learning for constrained Markov decision processes,

A. Gattami, Q. Bai, and V . Aggarwal, “Reinforcement learning for constrained Markov decision processes,” AISTATS, vol. 130, pp. 2656–2664, 2021

work page 2021

[36] [36]

Constrained Markov decision processes via backward value functions,

H. Satija, P. Amortila, and J. Pineau, “Constrained Markov decision processes via backward value functions,” ICML, vol. 119, pp. 8502–8511, 12 Jul. 2020

work page 2020

[37] [37]

Reinforcement learning with almost sure constraints,

A. Castellano, H. Min, E. Mallada, and J. A. Bazerque, “Reinforcement learning with almost sure constraints,” in Proceedings of The 4th Annual Learning for Dynamics and Control Conference, ser. Proceedings of Machine Learning Research, vol. 168. PMLR, 2022, pp. 559–570

work page 2022

[38] [38]

Anytime-constrained reinforcement learning,

J. McMahan and X. Zhu, “Anytime-constrained reinforcement learning,” inProceedings of The 27th International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, S. Dasgupta, S. Mandt, and Y . Li, Eds., vol

work page

[39] [39]

4321–4329

PMLR, 02–04 May 2024, pp. 4321–4329

work page 2024

[40] [40]

Model-based multi-objective reinforcement learning,

M. A. Wiering, M. Withagen, and M. M. Drugan, “Model-based multi-objective reinforcement learning,” in2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL). IEEE, Dec. 2014, pp. 1–6

work page 2014

[41] [41]

Multi-objective reinforcement learning using sets of Pareto dominating policies,

M. K. Van and A. Nowé, “Multi-objective reinforcement learning using sets of Pareto dominating policies,”The Journal of Machine Learning Research, vol. 15, no. 1, pp. 3483–3512, 2014

work page 2014

[42] [42]

Distributional Pareto-optimal multi-objective reinforcement learning,

X.-Q. Cai, P. Zhang, L. Zhao, J. Bian, M. Sugiyama, and A. Llorens, “Distributional Pareto-optimal multi-objective reinforcement learning,”Neural Inf Process Syst, vol. 36, pp. 15 593–15 613, 2023

work page 2023

[43] [43]

Multi-objective deep reinforcement learning,

H. Mossalam, Y . M. Assael, D. M. Roijers, and S. Whiteson, “Multi-objective deep reinforcement learning,”arXiv [cs.AI], 9 Oct. 2016

work page 2016

[44] [44]

Dynamic weights in multi-objective deep reinforcement learning,

A. Abels, D. Roijers, T. Lenaerts, A. Nowé, and D. Steck- elmacher, “Dynamic weights in multi-objective deep reinforcement learning,” inProceedings of the 36th Inter- national Conference on Machine Learning, ser. Proceed- ings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 2019, pp. 11–20

work page 2019

[45] [45]

A generalized algorithm for multi-objective reinforcement learning and policy adaptation,

R. Yang, X. Sun, and K. Narasimhan, “A generalized algorithm for multi-objective reinforcement learning and policy adaptation,” inAdvances in Neural Information Processing Systems. proceedings.neurips.cc, 2019

work page 2019

[46] [46]

Pareto set learning for multi-objective rein- forcement learning,

E. Liu, Y .-C. Wu, X. Huang, C. Gao, R.-J. Wang, K. Xue, and C. Qian, “Pareto set learning for multi-objective rein- forcement learning,”Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 18, 2025

work page 2025

[47] [47]

Goal-conditioned reinforcement learning: Problems and solutions,

M. Liu, M. Zhu, and W. Zhang, “Goal-conditioned reinforcement learning: Problems and solutions,”arXiv [cs.AI], 20 Jan. 2022

work page 2022

[48] [48]

Multi-goal re- inforcement learning: Challenging robotics environments and request for research,

M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin, M. Chociej, P. Welinder, V . Kumar, and W. Zaremba, “Multi-goal re- inforcement learning: Challenging robotics environments and request for research,”arXiv [cs.LG], 26 Feb. 2018

work page 2018

[49] [49]

Exploration via hindsight goal generation,

Z. Ren, K. Dong, Y . Zhou, Q. Liu, and J. Peng, “Exploration via hindsight goal generation,”Neural Inf Process Syst, vol. 32, pp. 13 464–13 474, 1 Jun. 2019

work page 2019

[50] [50]

Offline goal-conditioned reinforcement learning via f- advantage regression,

J. Y . Ma, J. Yan, D. Jayaraman, and O. Bastani, “Offline goal-conditioned reinforcement learning via f- advantage regression,” inAdvances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35. Curran Associates, Inc., 2022, pp. 310–323

work page 2022

[51] [51]

Learning with AMIGo: Adversarially motivated intrinsic goals,

A. Campero, R. Raileanu, H. Küttler, J. B. Tenenbaum, T. Rocktäschel, and E. Grefenstette, “Learning with AMIGo: Adversarially motivated intrinsic goals,”arXiv [cs.LG], 22 Jun. 2020

work page 2020

[52] [52]

Keeping your distance: Solving sparse reward tasks using self-balancing shaped rewards,

A. R. Trott, S. Zheng, C. Xiong, and R. Socher, “Keeping your distance: Solving sparse reward tasks using self-balancing shaped rewards,”Neural Inf Process Syst, vol. abs/1911.01417, 4 Nov. 2019

work page arXiv 1911

[53] [53]

Contrastive learning as goal-conditioned reinforcement learning,

B. Eysenbach, T. Zhang, R. Salakhutdinov, and S. Levine, “Contrastive learning as goal-conditioned reinforcement learning,”Neural Inf Process Syst, vol. abs/2206.07568, pp. 35 603–35 620, 15 Jun. 2022

work page arXiv 2022

[54] [54]

Goal- conditioned reinforcement learning with imagined subgoals,

E. Chane-Sane, C. Schmid, and I. Laptev, “Goal- conditioned reinforcement learning with imagined subgoals,”ICML, vol. abs/2107.00541, pp. 1430–1440, 10 1 Jul. 2021

work page arXiv 2021

[55] [55]

Sig- nal temporal logic meets reachability: Connections and ap- plications,

M. Chen, Q. Tam, S. C. Livingston, and M. Pavone, “Sig- nal temporal logic meets reachability: Connections and ap- plications,” inInternational Workshop on the Algorithmic Foundations of Robotics. Springer, 2018, pp. 581–601

work page 2018

[56] [56]

Solving minimum-cost reach avoid using reinforcement learning,

O. So, C. Ge, and C. Fan, “Solving minimum-cost reach avoid using reinforcement learning,” inThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. [Online]. Available: https://openreview.net/forum?id=jzngdJQ2lY

work page 2024

[57] [57]

Safety and liveness guarantees through reach-avoid reinforcement learning,

K.-C. Hsu, V . Rubies-Royo, C. J. Tomlin, and J. F. Fisac, “Safety and liveness guarantees through reach-avoid reinforcement learning,” inProceedings of Robotics: Science and Systems, Held Virtually, July 2021

work page 2021

[58] [58]

Bridging hamilton-jacobi safety analysis and reinforcement learning,

J. F. Fisac, N. F. Lugovoy, V . Rubies-Royo, S. Ghosh, and C. J. Tomlin, “Bridging hamilton-jacobi safety analysis and reinforcement learning,” in2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 8550–8556

work page 2019

[59] [59]

Learn- ing stabilization control from observations by learning lyapunov-like proxy models,

M. Ganai, C. Hirayama, Y .-C. Chang, and S. Gao, “Learn- ing stabilization control from observations by learning lyapunov-like proxy models,”2023 IEEE International Conference on Robotics and Automation (ICRA), 2023

work page 2023

[60] [60]

Reachability constrained reinforcement learning,

D. Yu, H. Ma, S. Li, and J. Chen, “Reachability constrained reinforcement learning,” inInternational Conference on Machine Learning. PMLR, 2022, pp. 25 636–25 655

work page 2022

[61] [61]

Safe multi-agent reinforcement learning via approximate hamilton-jacobi reachability,

K. Zhu, F. Lan, W. Zhao, and T. Zhang, “Safe multi-agent reinforcement learning via approximate hamilton-jacobi reachability,”J. Intell. Robot. Syst., vol. 111, no. 1, 30 Dec. 2024

work page 2024

[62] [62]

Monitoring temporal properties of continuous signals,

O. Maler and D. Nickovic, “Monitoring temporal properties of continuous signals,” inInternational symposium on formal techniques in real-time and fault-tolerant systems. Springer, 2004, pp. 152–166

work page 2004

[63] [63]

Robust satisfaction of temporal logic over real-valued signals,

A. Donzé and O. Maler, “Robust satisfaction of temporal logic over real-valued signals,” inInternational conference on formal modeling and analysis of timed systems. Springer, 2010, pp. 92–106

work page 2010

[64] [64]

Hamilton-jacobi reachability: A brief overview and recent advances,

S. Bansal, M. Chen, S. Herbert, and C. J. Tomlin, “Hamilton-jacobi reachability: A brief overview and recent advances,” in2017 IEEE 56th Annual Conference on De- cision and Control (CDC). IEEE, 2017, pp. 2242–2253

work page 2017

[65] [65]

Iterative reachability estimation for safe reinforcement learning,

M. Ganai, Z. Gong, C. Yu, S. Herbert, and S. Gao, “Iterative reachability estimation for safe reinforcement learning,”Advances in Neural Information Processing Systems, vol. 36, pp. 69 764–69 797, 2023

work page 2023

[66] [66]

Baier and J.-P

C. Baier and J.-P. Katoen,Principles of model checking. MIT press, 2008

work page 2008

[67] [67]

Model checking,

O. Grumberg, E. Clarke, and D. Peled, “Model checking,” inInternational Conference on Foundations of Software Technology and Theoretical Computer Science; Springer: Berlin/Heidelberg, Germany, 1999

work page 1999

[68] [68]

A decision tree approach to data classification using signal temporal logic,

G. Bombara, C.-I. Vasile, F. Penedo, H. Yasuoka, and C. Belta, “A decision tree approach to data classification using signal temporal logic,” inProceedings of the 19th International Conference on Hybrid Systems: Computation and Control, 2016, pp. 1–10

work page 2016

[69] Y. Meng, F. Chen, and C. Fan, “TGPO: Temporal grounded policy optimization for signal temporal logic tasks,” arXiv preprint arXiv:2510.00225, 2025

[70] M. Hasanbeig, D. Kroening, and A. Abate, “LCRL: Certified policy synthesis via logically-constrained reinforcement learning,” in International Conference on Quantitative Evaluation of SysTems. Springer, 2022, pp. 217–231

[71] G. Williams, P. Drews, B. Goldfain, J. M. Rehg, and E. A. Theodorou, “Aggressive driving with model predictive path integral control,” in 2016 IEEE international conference on robotics and automation (ICRA). IEEE, 2016, pp. 1433–1440

[72] P. Halder, H. Homburger, L. Kiltz, J. Reuter, and M. Althoff, “Trajectory planning with signal temporal logic costs using deterministic path integral optimization,” arXiv preprint arXiv:2503.01476, 2025

[73] L. Yifru and A. Baheri, “Concurrent learning of control policy and unknown safety specifications in reinforcement learning,” IEEE Open Journal of Control Systems, vol. 3, pp. 266–281, 2024

[74] D. Kasenberg and M. Scheutz, “Interpretable apprenticeship learning with temporal logic specifications,” in 2017 IEEE 56th Annual Conference on Decision and Control (CDC), 2017, pp. 4914–4921

[75] M. Gaon and R. Brafman, “Reinforcement learning with non-Markovian rewards,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, pp. 3980–3987, Apr. 2020

[76] K. Jothimurugan, S. Bansal, O. Bastani, and R. Alur, “Compositional reinforcement learning from logical specifications,” in Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 10 026–10 039

[77] A. Duret-Lutz, E. Renault, M. Colange, F. Renkin, A. Gbaguidi Aisse, P. Schlehuber-Caissier, T. Medioni, A. Martin, J. Dubois, C. Gillard et al., “From Spot 2.0 to Spot 2.10: What’s new?” in International Conference on Computer Aided Verification. Springer, 2022, pp. 174–187

[78] R. Diestel, Graph theory. Springer Nature, 2025

[79] W. Rudin, “Principles of mathematical analysis,” 3rd ed., 1976

[80] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” arXiv preprint arXiv:1506.02438, 2015