arXiv: 2602.19532 · v2 · pith:XUUK4HNOnew · submitted 2026-02-23 · 💻 cs.RO · cs.SY · eess.SY

Bellman Value Decomposition for Task Logic in Safe Optimal Control

Pith reviewed 2026-05-15 21:05 UTC · model grok-4.3

classification 💻 cs.RO · cs.SY · eess.SY
keywords Bellman value decomposition · temporal logic · safe optimal control · reach-avoid · reach-avoid-loop · neural network policy · VDPPO · multi-agent control

The pith

The Bellman value for temporal logic tasks decomposes into a graph of simpler values connected by reach-avoid, avoid, and reach-avoid-loop equations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that the Bellman value for tasks specified in temporal logic naturally breaks down into a graph of simpler Bellman values. These nodes connect through the standard reach-avoid Bellman equation, the avoid Bellman equation, and a newly introduced reach-avoid-loop Bellman equation. The decomposition allows the full value function and optimal policy to be learned by embedding the graph structure directly into a two-layer neural network, which automatically resolves the dependencies between parts of the task. A reader would care because high-dimensional control problems routinely combine safety constraints with goal-directed behavior, and conventional approaches demand extensive manual reward design or automata construction that scales poorly. The authors test the resulting method on simulated and hardware experiments with multi-agent teams and nonlinear dynamics, reporting improved automatic balance of safety and goal achievement over existing baselines.

Core claim

We prove the Bellman Value for a complex task defined in temporal logic can be decomposed into a graph of Bellman Values, connected by a set of well-known Bellman equations (BEs): the Reach-Avoid BE, the Avoid BE, and a novel type, the Reach-Avoid-Loop BE. To solve the Value and optimal policy, we propose VDPPO, which embeds the decomposed Value graph into a two-layer neural net, bootstrapping the implicit dependencies.
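
For orientation, the two established equations named in the claim are often written as follows in an undiscounted, deterministic, discrete-time setting. The symbols are assumptions made for this note and are not taken from the paper: r(x) is a target margin (non-negative inside the goal set), q(x) is a safety margin (non-negative inside the safe set), and f(x, u) is the dynamics. The paper's own notation, sign conventions, and any discounting may differ, and the Reach-Avoid-Loop BE is not reproduced here because its form is not stated on this page.

```latex
% Reach-Avoid BE (assumed notation): reach {r >= 0} while keeping q >= 0
V_{\mathrm{RA}}(x) = \min\Big\{\, q(x),\ \max\big\{\, r(x),\ \max_{u} V_{\mathrm{RA}}\big(f(x,u)\big) \big\} \Big\}

% Avoid BE (assumed notation): keep q >= 0 for all time
V_{\mathrm{A}}(x) = \min\Big\{\, q(x),\ \max_{u} V_{\mathrm{A}}\big(f(x,u)\big) \Big\}
```

Under these conventions, a non-negative value at x indicates that some control sequence from x satisfies the corresponding specification.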

What carries the argument

Decomposition of the Bellman value into a graph connected by the reach-avoid Bellman equation, the avoid Bellman equation, and the reach-avoid-loop Bellman equation, embedded in a two-layer neural network.
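
A minimal tabular sketch may make the two established building blocks concrete. It is not the paper's method or code: it assumes a toy one-dimensional chain with deterministic moves, an undiscounted formulation, and hypothetical margins `r` (target) and `q` (safety), and it solves the Reach-Avoid and Avoid equations by plain fixed-point iteration rather than with VDPPO.

```python
import numpy as np

ACTIONS = (-1, 0, 1)  # hypothetical action set: move left, stay, move right


def step(x, u, n):
    """Deterministic chain dynamics, clipped to the states 0..n-1."""
    return min(max(x + u, 0), n - 1)


def best_next(v):
    """For every state, the best successor value over all actions."""
    n = len(v)
    return np.array([max(v[step(x, u, n)] for u in ACTIONS) for x in range(n)])


def iterate(update, v0, max_iters=1000):
    """Apply `update` until the value table stops changing."""
    v = v0
    for _ in range(max_iters):
        new_v = update(v)
        if np.array_equal(new_v, v):
            return v
        v = new_v
    return v


def reach_avoid_value(r, q):
    """Iterate V = min(q, max(r, best_next(V))), starting from min(r, q)."""
    return iterate(lambda v: np.minimum(q, np.maximum(r, best_next(v))),
                   np.minimum(r, q).astype(float))


def avoid_value(q):
    """Iterate V = min(q, best_next(V)), starting from q."""
    return iterate(lambda v: np.minimum(q, best_next(v)), q.astype(float))


# Toy problem (made up for illustration): 7 states, target at x=6, obstacle at x=3.
r = np.array([-1, -1, -1, -1, -1, -1, 1], dtype=float)
q = np.array([1, 1, 1, -1, 1, 1, 1], dtype=float)

v_ra = reach_avoid_value(r, q)
v_avoid = avoid_value(q)
print(v_ra)     # [-1. -1. -1. -1.  1.  1.  1.]
print(v_avoid)  # [ 1.  1.  1. -1.  1.  1.  1.]
```

States to the right of the obstacle can reach the target safely and get value +1; states to the left would have to cross the obstacle, so their reach-avoid value stays at -1 even though they can remain safe indefinitely.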

If this is right

  • The optimal policy for combined safety and goal tasks is obtained by solving the embedded graph in a two-layer neural net.
  • Safety and liveness specifications balance automatically without separate reward tuning.
  • The method applies directly to high-dimensional systems with nonlinear dynamics and heterogeneous agent teams.
  • Implicit task dependencies are resolved by the network bootstrapping process.
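
One way to picture the decomposed graph is as a small data structure in which each node is a sub-value tagged with the type of Bellman equation that defines it. The sketch below is purely illustrative: the node names, the `ValueNode` class, and the example dependencies are hypothetical, and it orders nodes with a topological sort, which assumes an acyclic graph. The paper instead resolves dependencies by bootstrapping inside the network, which this sketch does not reproduce.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# The three equation types named in the paper's abstract.
BE_TYPES = {"reach_avoid", "avoid", "reach_avoid_loop"}


@dataclass
class ValueNode:
    """One decomposed value: a name, its Bellman-equation type, and its inputs."""
    name: str
    be_type: str
    depends_on: list = field(default_factory=list)

    def __post_init__(self):
        if self.be_type not in BE_TYPES:
            raise ValueError(f"unknown Bellman equation type: {self.be_type}")


def solve_order(nodes):
    """Return node names so that every node appears after its dependencies."""
    graph = {node.name: set(node.depends_on) for node in nodes}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical three-node graph, invented for this example only.
nodes = [
    ValueNode("V_task", "reach_avoid", ["V_goal_b", "V_safe"]),
    ValueNode("V_goal_b", "reach_avoid", ["V_safe"]),
    ValueNode("V_safe", "avoid"),
]

order = solve_order(nodes)
print(order)  # ['V_safe', 'V_goal_b', 'V_task']
```

A sequential solver would compute the values in this order; learning all nodes jointly, as the summary describes for VDPPO, removes the need for such a schedule.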

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This decomposition approach could be tested on task specifications beyond the temporal logic considered in the paper, such as signal temporal logic or other formalisms.
  • Recursive application of the graph structure might handle more deeply nested specifications in future extensions.
  • The two-layer embedding may transfer to continuous-time settings if the underlying Bellman equations are discretized consistently.

Load-bearing premise

The innate structure of the Bellman value organizes temporal logic tasks so that the decomposed graph embeds into a two-layer neural net without manual tuning or post-hoc adjustments.

What would settle it

A concrete temporal logic task where the value function computed from the decomposed graph and VDPPO differs measurably from the value obtained by solving the full undecomposed Bellman equation, or where the resulting policy violates a safety or liveness specification in direct simulation.
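
Both checks described above can be sketched in a few lines. The helpers below are hypothetical and assume tabular values and a deterministic simulator; they illustrate the test, not the paper's evaluation code.

```python
import numpy as np


def max_value_gap(v_decomposed, v_reference):
    """Largest absolute difference between two value tables of equal shape."""
    a = np.asarray(v_decomposed, dtype=float)
    b = np.asarray(v_reference, dtype=float)
    return float(np.max(np.abs(a - b)))


def rollout_satisfies_reach_avoid(x0, policy, step, in_target, is_safe, horizon):
    """Simulate from x0; True if the target is reached before any unsafe state."""
    x = x0
    for _ in range(horizon):
        if not is_safe(x):
            return False
        if in_target(x):
            return True
        x = step(x, policy(x))
    return is_safe(x) and in_target(x)


# Toy chain (made up for illustration): target at x=6, obstacle at x=3.
def chain_step(x, u):
    return min(max(x + u, 0), 6)


def always_right(x):
    return 1


def is_target(x):
    return x == 6


def is_safe_state(x):
    return x != 3


ok_from_4 = rollout_satisfies_reach_avoid(
    4, always_right, chain_step, is_target, is_safe_state, horizon=10)
ok_from_1 = rollout_satisfies_reach_avoid(
    1, always_right, chain_step, is_target, is_safe_state, horizon=10)
gap = max_value_gap([1.0, -1.0], [1.0, 0.5])
print(ok_from_4, ok_from_1, gap)  # True False 1.5
```

A counterexample of the kind described would show up either as a gap well above numerical tolerance or as a rollout that fails from a state whose learned value claims success.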

Figures

Figures reproduced from arXiv: 2602.19532 by Chuchu Fan, Dylan Hirsch, Oswin So, Sylvia Herbert, William Sharpless.

Figure 1. Value-Decomposition and VDPPO. The Bellman Value for a range of temporal logic (e.g., multi-goal, recurrence, stability, safety) decomposes into a Value graph connected by atomic Bellman equations (Thms. 1–4). We propose VDPPO, an algorithm that exploits this structure to learn policies for complex, high-dimensional tasks. Our approach is validated on hardware with Herding and Delivery, two complex tasks i…
Figure 2. E.g. N-Until-Conjunction Value Decomposition. Here we illustrate the primary decomposition result (Thm. 1 extension, Appendix), with a GridWorld example (left) for a given specification. The corresponding DVG is shown (center left) with each node representing a decomposed Value, and edges representing dependencies. In the center right, a subset of decomposed Values solved with dynamic programming are shown…
Figure 3. E.g. G(N-Until-Conjunction) Value Decomposition. We illustrate the recursive decomposition result (Thm. 3), with a GridWorld example (left) for a given specification. The plots here are analogous to those of …
Figure 4. Graphical Depiction of Algorithms. … to the embedding. This allows us to leverage the decomposed structure of the Value functions to efficiently learn policies that satisfy complex TL specifications without sequentially approximating the Value. See the Appendix for further details. IX. SIMULATION RESULTS. To better understand the performance of VDPPO, we design simulation experiments to answer the following q…
Figure 5. Performance scaling with TL complexity. Value decomposition enables VDPPO to better scale by tackling smaller problems …
Figure 7. Hardware Overview for Herding and Delivery Tasks. … complex interactions with uncontrolled agents (Herding), needing to collaborate (Delivery), or complex dynamics (Manipulator) and show the results in …
Figure 8. Trajectory snapshots from Herding and Delivery hardware tasks. We show a long-exposure photo (left), and stills from independent times (right), with depictions corresponding to those of the overview in …
Figure 9. Effect of parameter sharing. Sharing parameters for the actor only improves performance while reducing the variance. O. HARDWARE. In the hardware experiments, we evaluate VDPPO performance in the Herding and Delivery tasks. In both tasks, the state position is reported by HTC Vive base stations in communication with a Lighthouse deck attached to each Crazyflie. The Go2 quadruped’s location is integrated …
Abstract

Real-world tasks involve nuanced combinations of goal and safety specifications. In high dimensions, the challenge is exacerbated: formal automata become cumbersome, and the combination of sparse rewards tends to require laborious tuning. In this work, we consider the innate structure of the Bellman Value as a means to naturally organize the problem for improved automatic performance. Namely, we prove the Bellman Value for a complex task defined in temporal logic can be decomposed into a graph of Bellman Values, connected by a set of well-known Bellman equations (BEs): the Reach-Avoid BE, the Avoid BE, and a novel type, the Reach-Avoid-Loop BE. To solve the Value and optimal policy, we propose VDPPO, which embeds the decomposed Value graph into a two-layer neural net, bootstrapping the implicit dependencies. We conduct a variety of simulated and hardware experiments to test our method on complex, high-dimensional tasks involving heterogeneous teams and nonlinear dynamics. Ultimately, we find this approach greatly improves performance over existing baselines, balancing safety and liveness automatically.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims to prove that the Bellman value for any complex task specified in temporal logic decomposes into a graph of simpler Bellman values connected exclusively by the Reach-Avoid Bellman equation, the Avoid Bellman equation, and a novel Reach-Avoid-Loop Bellman equation. It introduces VDPPO, which embeds this value graph into a two-layer neural network that bootstraps the implicit dependencies to compute the value function and optimal policy. Experiments on simulated and hardware tasks with heterogeneous teams and nonlinear dynamics are reported to show improved safety-liveness trade-offs over baselines without post-hoc tuning.

Significance. If the decomposition theorem holds for general STL formulas and the two-layer embedding preserves optimality without hidden parameters, the result would supply a structured, largely automatic route to safe optimal control for high-dimensional tasks whose specifications combine reachability, avoidance, and looping behaviors. The approach could reduce reliance on manual reward shaping or automata construction in robotics applications.

major comments (3)
  1. [Main derivation of Reach-Avoid-Loop BE] The central proof relies on the correctness of the novel Reach-Avoid-Loop Bellman equation. The manuscript must explicitly derive this equation from the STL semantics and state the precise assumptions (e.g., deterministic vs. stochastic transitions, finite vs. infinite loop horizons, memoryless sub-tasks) under which the fixed-point equation is valid; without this, the claim that every STL task reduces to combinations of only the three listed operators cannot be verified.
  2. [VDPPO architecture and training] The assertion that the decomposed graph embeds losslessly into a two-layer neural net without additional manual tuning or post-hoc adjustments is load-bearing for the practical contribution. The paper should provide a formal argument or explicit construction showing that all implicit dependencies among the sub-values are captured by the network architecture and loss; otherwise the performance gains may stem from implicit fitting rather than the decomposition.
  3. [Experimental results] Experiments report improved performance, yet the manuscript does not include an ablation that isolates the contribution of the Reach-Avoid-Loop equation versus the standard Reach-Avoid and Avoid equations. Without this, it is unclear whether the novel operator is necessary for the observed gains or whether simpler decompositions suffice.
minor comments (2)
  1. [Preliminaries] Notation for the three Bellman equations should be unified and introduced in a single preliminary section to improve readability.
  2. [Experiments] The abstract states that the method 'balances safety and liveness automatically'; the experimental section should report quantitative safety-violation rates alongside task-completion rates for all baselines.
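
The reporting requested in the second minor comment could look like the sketch below. The `Episode` record, its field names, and the example episodes are hypothetical placeholders, not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Outcome of one evaluation rollout (hypothetical record format)."""
    violated_safety: bool
    completed_task: bool


def summarize(episodes):
    """Report safety-violation and task-completion rates side by side."""
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")
    return {
        "safety_violation_rate": sum(e.violated_safety for e in episodes) / n,
        "task_completion_rate": sum(e.completed_task for e in episodes) / n,
        # Completed the task and never violated safety along the way.
        "safe_completion_rate": sum(
            e.completed_task and not e.violated_safety for e in episodes) / n,
    }


# Made-up episodes for illustration only.
episodes = [
    Episode(violated_safety=False, completed_task=True),
    Episode(violated_safety=False, completed_task=True),
    Episode(violated_safety=True, completed_task=False),
    Episode(violated_safety=True, completed_task=True),
]

summary = summarize(episodes)
print(summary)
# {'safety_violation_rate': 0.5, 'task_completion_rate': 0.75, 'safe_completion_rate': 0.5}
```

Reporting the joint rate alongside the two marginals shows whether a method completes tasks only by accepting violations.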

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We address each of the major comments below, providing clarifications and committing to revisions where appropriate to strengthen the paper.

Point-by-point responses
  1. Referee: [Main derivation of Reach-Avoid-Loop BE] The central proof relies on the correctness of the novel Reach-Avoid-Loop Bellman equation. The manuscript must explicitly derive this equation from the STL semantics and state the precise assumptions (e.g., deterministic vs. stochastic transitions, finite vs. infinite loop horizons, memoryless sub-tasks) under which the fixed-point equation is valid; without this, the claim that every STL task reduces to combinations of only the three listed operators cannot be verified.

    Authors: We agree that an explicit derivation is essential for verifying the decomposition theorem. In the revised manuscript, we will add a dedicated subsection in the theoretical analysis that derives the Reach-Avoid-Loop Bellman equation step-by-step from the STL semantics. We will explicitly state the assumptions: deterministic transitions, infinite-horizon loops with appropriate discounting to ensure convergence, and memoryless sub-tasks as per the STL fragment considered. This will confirm that all complex STL formulas reduce to the three operators. revision: yes

  2. Referee: [VDPPO architecture and training] The assertion that the decomposed graph embeds losslessly into a two-layer neural net without additional manual tuning or post-hoc adjustments is load-bearing for the practical contribution. The paper should provide a formal argument or explicit construction showing that all implicit dependencies among the sub-values are captured by the network architecture and loss; otherwise the performance gains may stem from implicit fitting rather than the decomposition.

    Authors: The VDPPO architecture is specifically designed with the first layer computing the sub-value functions corresponding to the graph nodes and the second layer implementing the bootstrapping via the Bellman operators. We will include in the revision a formal construction in the appendix that proves the network captures all dependencies through its layered structure and the composite loss function, without requiring manual tuning or hidden parameters beyond the graph embedding. revision: yes

  3. Referee: [Experimental results] Experiments report improved performance, yet the manuscript does not include an ablation that isolates the contribution of the Reach-Avoid-Loop equation versus the standard Reach-Avoid and Avoid equations. Without this, it is unclear whether the novel operator is necessary for the observed gains or whether simpler decompositions suffice.

    Authors: We acknowledge the value of isolating the contribution of the novel operator. In the revised version, we will add an ablation study in the experimental section that compares the full VDPPO using all three equations against variants using only Reach-Avoid and Avoid equations on the same tasks, to demonstrate the necessity of the Reach-Avoid-Loop BE for the reported performance improvements. revision: yes

Circularity Check

0 steps flagged

Derivation self-contained; no circular reductions identified

Full rationale

The paper states it proves decomposition of the Bellman value for temporal-logic tasks into a graph connected by Reach-Avoid BE, Avoid BE, and a novel Reach-Avoid-Loop BE. The novel equation is presented as derived within the proof rather than obtained by fitting or self-definition. The two-layer neural net embedding is described as a solution architecture that bootstraps implicit dependencies, not as a statistical prediction forced by prior fits. No load-bearing self-citations, uniqueness theorems imported from the same authors, or ansatzes smuggled via prior work are referenced in the abstract or description. The central claim therefore retains independent mathematical content and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that temporal logic tasks possess an innate Bellman-value structure that decomposes cleanly; no explicit free parameters are stated, but the neural-net training implicitly introduces fitted weights.

axioms (1)
  • domain assumption: Bellman value for temporal logic tasks admits a graph decomposition connected by reach-avoid, avoid, and reach-avoid-loop equations
    Stated as the basis for the proof in the abstract.
invented entities (1)
  • Reach-Avoid-Loop Bellman equation (no independent evidence)
    purpose: To capture looping behavior within the decomposed value graph
    Presented as a novel type required to close the decomposition for certain tasks.

pith-pipeline@v0.9.0 · 5491 in / 1440 out tokens · 39943 ms · 2026-05-15T21:05:49.837826+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Value Functions for Temporal Logic: Optimal Policies and Safety Filters

    cs.RO · 2026-05 · unverdicted · novelty 6.0

    Non-Markovian policies from decomposed temporal logic value functions are proven optimal for nested Until, Globally, and Globally-Until specifications and extend Q-function safety filters to complex tasks.

Reference graph

Works this paper leans on

97 extracted references · 97 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    R. S. Sutton and A. G. Barto,Reinforcement Learning: An Introduction. Cambridge, MA, USA: A Bradford Book, 2018

  2. [2]

    LTL and beyond: Formal languages for reward function specification in reinforcement learning,

    A. Camacho, R. Toro Icarte, T. Q. Klassen, R. Valenzano, and S. A. McIlraith, “LTL and beyond: Formal languages for reward function specification in reinforcement learning,” inProceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence. California: International Joint Conferences on Artificial Intelligence Organization, 1 Au...

  3. [3]

    A time-dependent hamilton-jacobi formulation of reachable sets for continuous dynamic games,

    I. M. Mitchell, A. M. Bayen, and C. J. Tomlin, “A time-dependent hamilton-jacobi formulation of reachable sets for continuous dynamic games,”IEEE Transactions on automatic control, vol. 50, no. 7, pp. 947–957, 2005

  4. [4]

    Reach-avoid problems with time-varying dynamics, targets and constraints,

    J. F. Fisac, M. Chen, C. J. Tomlin, and S. S. Sastry, “Reach-avoid problems with time-varying dynamics, targets and constraints,” inHybrid Systems: Computation and Control. ACM, 2015

  5. [5]

    Dual- objective reinforcement learning with novel hamilton-jacobi-bellman formulations,

    W. Sharpless, D. Hirsch, S. Tonkens, N. Shinde, and S. Herbert, “Dual-objective reinforcement learning with novel hamilton-jacobi-bellman formulations,”arXiv preprint arXiv:2506.16016, 2025

  6. [6]

    Temporal logic guided safe model-based reinforcement learning: A hybrid systems approach,

    M. H. Cohen, Z. Serlin, K. Leahy, and C. Belta, “Temporal logic guided safe model-based reinforcement learning: A hybrid systems approach,”Nonlinear Anal. Hybrid Syst., vol. 47, no. 101295, p. 101295, Feb. 2023

  7. [7]

    Instructing goal- conditioned reinforcement learning agents with temporal logic objectives,

    W. Qiu, W. Mao, and H. Zhu, “Instructing goal- conditioned reinforcement learning agents with temporal logic objectives,”Neural Inf Process Syst, vol. 36, pp. 39 147–39 175, 2023

  8. [8]

    Verification of Markov decision processes using learning algorithms,

    T. Brázdil, K. Chatterjee, M. Chmelík, V . Forejt, J. Kˇretínský, M. Kwiatkowska, D. Parker, and M. Ujma, “Verification of Markov decision processes using learning algorithms,”arXiv [cs.LO], 10 Feb. 2014

  9. [9]

    Training agents to satisfy timed and untimed signal temporal logic specifications with reinforcement learning,

    N. Hamilton, P. K. Robinette, and T. T. Johnson, “Training agents to satisfy timed and untimed signal temporal logic specifications with reinforcement learning,” inSoftware Engineering and Formal Methods, ser. Lecture notes in computer science. Cham: Springer International Publishing, 2022, pp. 190–206

  10. [10]

    A learning based approach to control synthesis of Markov decision processes for linear temporal logic specifications,

    D. Sadigh, E. S. Kim, S. Coogan, S. S. Sastry, and S. A. Seshia, “A learning based approach to control synthesis of Markov decision processes for linear temporal logic specifications,” in53rd IEEE Conference on Decision and Control. IEEE, Dec. 2014, pp. 1091–1096

  11. [11]

    Control synthesis from linear temporal logic specifications using model-free reinforcement learning,

    A. K. Bozkurt, Y . Wang, M. M. Zavlanos, and M. Pajic, “Control synthesis from linear temporal logic specifications using model-free reinforcement learning,” in2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, May 2020, p. 10349–10355

  12. [12]

    Rewarding behaviors,

    F. Bacchus, C. Boutilier, and A. J. Grove, “Rewarding behaviors,” inProceedings of the National Conference on Artificial Intelligence.cs.toronto.edu, 4 Aug. 1996, pp. 1160–1167

  13. [13]

    Decision-theoretic planning with non- Markovian rewards,

    S. Thiebaux, C. Gretton, J. Slaney, D. Price, and F. Kabanza, “Decision-theoretic planning with non- Markovian rewards,”J. Artif. Intell. Res., vol. 25, pp. 17–74, 29 Jan. 2006

  14. [14]

    Non- Markovian rewards expressed in LTL: Guiding search via reward shaping,

    A. Camacho, O. Chen, S. Sanner, and S. McIlraith, “Non- Markovian rewards expressed in LTL: Guiding search via reward shaping,”Proceedings of the International Symposium on Combinatorial Search, vol. 8, no. 1, pp. 159–160, 1 Sep. 2021

  15. [15]

    Using reward machines for high-level task specification and decomposition in reinforcement learning,

    R. T. Icarte, T. Q. Klassen, R. Valenzano, and S. A. McIlraith, “Using reward machines for high-level task specification and decomposition in reinforcement learning,”ICML, vol. 80, pp. 2112–2121, 3 Jul. 2018

  16. [16]

    Q-learning for robust satisfaction of signal temporal logic specifications,

    D. Aksaray, A. Jones, Z. Kong, M. Schwager, and C. Belta, “Q-learning for robust satisfaction of signal temporal logic specifications,” in2016 IEEE 55th Conference on Decision and Control (CDC). IEEE, Dec. 2016, pp. 6565–6570

  17. [17]

    Reinforcement learning with temporal logic rewards,

    X. Li, C.-I. Vasile, and C. Belta, “Reinforcement learning with temporal logic rewards,” in2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Sep. 2017, pp. 3834–3839

  18. [18]

    Modular deep reinforcement learning for continuous motion planning with temporal logic.IEEE robotics and automation letters, 6(4):7973–7980, 2021

    M. Cai, M. Hasanbeig, S. Xiao, A. Abate, and Z. Kan, “Modular deep reinforcement learning for continuous motion planning with temporal logic,” IEEE Robotics and Automation Letters, vol. 6, no. 4, p. 7973–7980, Oct. 2021. [Online]. Available: http://dx.doi.org/10.1109/LRA.2021.3101544

  19. [19]

    Planning with general objective functions: Going beyond total rewards,

    R. Wang, P. Zhong, S. S. Du, R. R. Salakhutdinov, and L. Yang, “Planning with general objective functions: Going beyond total rewards,” inAdvances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 14 486–14 497

  20. [20]

    Reinforcement learning with non-cumulative objective,

    W. Cui and W. Yu, “Reinforcement learning with non-cumulative objective,”IEEE Transactions on Machine Learning in Communications and Networking, vol. 1, pp. 124–137, 2023

  21. [21]

    Recursive reward aggregation,

    Y . Tang, Y . Zhang, J. Ackermann, Y .-J. Zhang, S. Nishi- mori, and M. Sugiyama, “Recursive reward aggregation,” inReinforcement Learning Conference, 2025

  22. [22]

    Hybrid reward architecture for reinforcement learning,

    H. van Seijen, M. Fatemi, J. Romoff, R. Laroche, T. Barnes, and J. Tsang, “Hybrid reward architecture for reinforcement learning,” inProceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS’17. Red Hook, NY , USA: Curran Associates Inc., 2017, p. 5398–5408

  23. [23]

    Consistent aggregation of objectives with diverse time preferences requires non-markovian rewards,

    S. Pitis, “Consistent aggregation of objectives with diverse time preferences requires non-markovian rewards,” inThirty-seventh Conference on Neural Information Processing Systems, 2023

  24. [24]

    Rdˆ2: Reward decomposition with representation decomposition,

    Z. Lin, D. Yang, L. Zhao, T. Qin, G. Yang, and T.-Y . Liu, “Rdˆ2: Reward decomposition with representation decomposition,” inAdvances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran 9 Associates, Inc., 2020, pp. 11 298–11 308

  25. [25]

    Altman,Constrained Markov decision processes: Stochastic modeling

    E. Altman,Constrained Markov decision processes: Stochastic modeling. Boca Raton: Routledge, 13 Dec. 2021

  26. [26]

    Constrained Policy Optimization

    J. Achiam, D. Held, A. Tamar, and P. Abbeel, “Constrained policy optimization,”ICML, vol. abs/1705.10528, pp. 22–31, 30 May 2017

  27. [27]

    Safe reinforcement learning in constrained Markov decision processes,

    A. Wachi and Y . Sui, “Safe reinforcement learning in constrained Markov decision processes,”ICML, vol. 119, pp. 9797–9806, 12 Jul. 2020

  28. [28]

    Responsive safety in reinforcement learning by PID lagrangian methods,

    A. Stooke, J. Achiam, and P. Abbeel, “Responsive safety in reinforcement learning by PID lagrangian methods,” ICML, vol. 119, pp. 9133–9143, 8 Jul. 2020

  29. [29]

    Faster algorithm and sharper analysis for constrained Markov decision process,

    T. Li, Z. Guan, S. Zou, T. Xu, Y . Liang, and G. Lan, “Faster algorithm and sharper analysis for constrained Markov decision process,”Oper. Res. Lett., vol. 54, no. 107107, p. 107107, May 2024

  30. [30]

    A primal-dual approach to constrained Markov decision processes,

    Y . Chen, J. Dong, and Z. Wang, “A primal-dual approach to constrained Markov decision processes,”arXiv [math.OC], 26 Jan. 2021

  31. [31]

    A simple reward-free approach to constrained reinforcement learning,

    S. Miryoosefi and C. Jin, “A simple reward-free approach to constrained reinforcement learning,”ICML, vol. abs/2107.05216, pp. 15 666–15 698, 12 Jul. 2021

  32. [32]

    Projection-based constrained policy optimization,

    T.-Y . Yang, J. Rosca, K. Narasimhan, and P. J. Ramadge, “Projection-based constrained policy optimization,”arXiv [cs.LG], 7 Oct. 2020

  33. [33]

    Natural policy gradient primal-dual method for constrained Markov decision processes,

    D. Ding, K. Zhang, T. Ba¸ sar, and M. Jovanovi´c, “Natural policy gradient primal-dual method for constrained Markov decision processes,”Neural Inf Process Syst, vol. 33, pp. 8378–8390, 2020

  34. [34]

    Reward constrained policy optimization,

    C. Tessler, D. J. Mankowitz, and S. Mannor, “Reward constrained policy optimization,”arXiv [cs.LG], 28 May 2018

  35. [35]

    Reinforcement learning for constrained Markov decision processes,

    A. Gattami, Q. Bai, and V . Aggarwal, “Reinforcement learning for constrained Markov decision processes,” AISTATS, vol. 130, pp. 2656–2664, 2021

  36. [36]

    Constrained Markov decision processes via backward value functions,

    H. Satija, P. Amortila, and J. Pineau, “Constrained Markov decision processes via backward value functions,” ICML, vol. 119, pp. 8502–8511, 12 Jul. 2020

  37. [37]

    Reinforcement learning with almost sure constraints,

    A. Castellano, H. Min, E. Mallada, and J. A. Bazerque, “Reinforcement learning with almost sure constraints,” in Proceedings of The 4th Annual Learning for Dynamics and Control Conference, ser. Proceedings of Machine Learning Research, vol. 168. PMLR, 2022, pp. 559–570

  38. [38]

    Anytime-constrained reinforcement learning,

    J. McMahan and X. Zhu, “Anytime-constrained reinforcement learning,” inProceedings of The 27th International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, S. Dasgupta, S. Mandt, and Y . Li, Eds., vol

  39. [39]

    4321–4329

    PMLR, 02–04 May 2024, pp. 4321–4329

  40. [40]

    Model-based multi-objective reinforcement learning,

    M. A. Wiering, M. Withagen, and M. M. Drugan, “Model-based multi-objective reinforcement learning,” in2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL). IEEE, Dec. 2014, pp. 1–6

  41. [41]

    Multi-objective reinforcement learning using sets of Pareto dominating policies,

    M. K. Van and A. Nowé, “Multi-objective reinforcement learning using sets of Pareto dominating policies,”The Journal of Machine Learning Research, vol. 15, no. 1, pp. 3483–3512, 2014

  42. [42]

    Distributional Pareto-optimal multi-objective reinforcement learning,

    X.-Q. Cai, P. Zhang, L. Zhao, J. Bian, M. Sugiyama, and A. Llorens, “Distributional Pareto-optimal multi-objective reinforcement learning,”Neural Inf Process Syst, vol. 36, pp. 15 593–15 613, 2023

  43. [43]

    Multi-objective deep reinforcement learning,

    H. Mossalam, Y . M. Assael, D. M. Roijers, and S. Whiteson, “Multi-objective deep reinforcement learning,”arXiv [cs.AI], 9 Oct. 2016

  44. [44]

    Dynamic weights in multi-objective deep reinforcement learning,

    A. Abels, D. Roijers, T. Lenaerts, A. Nowé, and D. Steck- elmacher, “Dynamic weights in multi-objective deep reinforcement learning,” inProceedings of the 36th Inter- national Conference on Machine Learning, ser. Proceed- ings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 2019, pp. 11–20

  45. [45]

    A generalized algorithm for multi-objective reinforcement learning and policy adaptation,

    R. Yang, X. Sun, and K. Narasimhan, “A generalized algorithm for multi-objective reinforcement learning and policy adaptation,” inAdvances in Neural Information Processing Systems. proceedings.neurips.cc, 2019

  46. [46]

    Pareto set learning for multi-objective rein- forcement learning,

    E. Liu, Y .-C. Wu, X. Huang, C. Gao, R.-J. Wang, K. Xue, and C. Qian, “Pareto set learning for multi-objective rein- forcement learning,”Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 18, 2025

  47. [47]

    Goal-conditioned reinforcement learning: Problems and solutions,

    M. Liu, M. Zhu, and W. Zhang, “Goal-conditioned reinforcement learning: Problems and solutions,”arXiv [cs.AI], 20 Jan. 2022

  48. [48]

    Multi-goal re- inforcement learning: Challenging robotics environments and request for research,

    M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin, M. Chociej, P. Welinder, V . Kumar, and W. Zaremba, “Multi-goal re- inforcement learning: Challenging robotics environments and request for research,”arXiv [cs.LG], 26 Feb. 2018

  49. [49]

    Exploration via hindsight goal generation,

    Z. Ren, K. Dong, Y . Zhou, Q. Liu, and J. Peng, “Exploration via hindsight goal generation,”Neural Inf Process Syst, vol. 32, pp. 13 464–13 474, 1 Jun. 2019

  50. [50]

    Offline goal-conditioned reinforcement learning via f-advantage regression,

    J. Y. Ma, J. Yan, D. Jayaraman, and O. Bastani, “Offline goal-conditioned reinforcement learning via f-advantage regression,” in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35. Curran Associates, Inc., 2022, pp. 310–323

  51. [51]

    Learning with AMIGo: Adversarially motivated intrinsic goals,

    A. Campero, R. Raileanu, H. Küttler, J. B. Tenenbaum, T. Rocktäschel, and E. Grefenstette, “Learning with AMIGo: Adversarially motivated intrinsic goals,” arXiv [cs.LG], 22 Jun. 2020

  52. [52]

    Keeping your distance: Solving sparse reward tasks using self-balancing shaped rewards,

    A. R. Trott, S. Zheng, C. Xiong, and R. Socher, “Keeping your distance: Solving sparse reward tasks using self-balancing shaped rewards,” Neural Inf Process Syst, vol. abs/1911.01417, 4 Nov. 2019

  53. [53]

    Contrastive learning as goal-conditioned reinforcement learning,

    B. Eysenbach, T. Zhang, R. Salakhutdinov, and S. Levine, “Contrastive learning as goal-conditioned reinforcement learning,” Neural Inf Process Syst, vol. abs/2206.07568, pp. 35 603–35 620, 15 Jun. 2022

  54. [54]

    Goal-conditioned reinforcement learning with imagined subgoals,

    E. Chane-Sane, C. Schmid, and I. Laptev, “Goal-conditioned reinforcement learning with imagined subgoals,” ICML, vol. abs/2107.00541, pp. 1430–1440, 10 1 Jul. 2021

  55. [55]

    Signal temporal logic meets reachability: Connections and applications,

    M. Chen, Q. Tam, S. C. Livingston, and M. Pavone, “Signal temporal logic meets reachability: Connections and applications,” in International Workshop on the Algorithmic Foundations of Robotics. Springer, 2018, pp. 581–601

  56. [56]

    Solving minimum-cost reach avoid using reinforcement learning,

    O. So, C. Ge, and C. Fan, “Solving minimum-cost reach avoid using reinforcement learning,” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. [Online]. Available: https://openreview.net/forum?id=jzngdJQ2lY

  57. [57]

    Safety and liveness guarantees through reach-avoid reinforcement learning,

    K.-C. Hsu, V. Rubies-Royo, C. J. Tomlin, and J. F. Fisac, “Safety and liveness guarantees through reach-avoid reinforcement learning,” in Proceedings of Robotics: Science and Systems, Held Virtually, July 2021

  58. [58]

    Bridging Hamilton-Jacobi safety analysis and reinforcement learning,

    J. F. Fisac, N. F. Lugovoy, V. Rubies-Royo, S. Ghosh, and C. J. Tomlin, “Bridging Hamilton-Jacobi safety analysis and reinforcement learning,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 8550–8556

  59. [59]

    Learning stabilization control from observations by learning Lyapunov-like proxy models,

    M. Ganai, C. Hirayama, Y.-C. Chang, and S. Gao, “Learning stabilization control from observations by learning Lyapunov-like proxy models,” 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023

  60. [60]

    Reachability constrained reinforcement learning,

    D. Yu, H. Ma, S. Li, and J. Chen, “Reachability constrained reinforcement learning,” in International Conference on Machine Learning. PMLR, 2022, pp. 25 636–25 655

  61. [61]

    Safe multi-agent reinforcement learning via approximate Hamilton-Jacobi reachability,

    K. Zhu, F. Lan, W. Zhao, and T. Zhang, “Safe multi-agent reinforcement learning via approximate Hamilton-Jacobi reachability,” J. Intell. Robot. Syst., vol. 111, no. 1, 30 Dec. 2024

  62. [62]

    Monitoring temporal properties of continuous signals,

    O. Maler and D. Nickovic, “Monitoring temporal properties of continuous signals,” in International symposium on formal techniques in real-time and fault-tolerant systems. Springer, 2004, pp. 152–166

  63. [63]

    Robust satisfaction of temporal logic over real-valued signals,

    A. Donzé and O. Maler, “Robust satisfaction of temporal logic over real-valued signals,” in International conference on formal modeling and analysis of timed systems. Springer, 2010, pp. 92–106

  64. [64]

    Hamilton-Jacobi reachability: A brief overview and recent advances,

    S. Bansal, M. Chen, S. Herbert, and C. J. Tomlin, “Hamilton-Jacobi reachability: A brief overview and recent advances,” in 2017 IEEE 56th Annual Conference on Decision and Control (CDC). IEEE, 2017, pp. 2242–2253

  65. [65]

    Iterative reachability estimation for safe reinforcement learning,

    M. Ganai, Z. Gong, C. Yu, S. Herbert, and S. Gao, “Iterative reachability estimation for safe reinforcement learning,” Advances in Neural Information Processing Systems, vol. 36, pp. 69 764–69 797, 2023

  66. [66]

    Principles of model checking,

    C. Baier and J.-P. Katoen, Principles of model checking. MIT press, 2008

  67. [67]

    Model checking,

    O. Grumberg, E. Clarke, and D. Peled, “Model checking,” in International Conference on Foundations of Software Technology and Theoretical Computer Science; Springer: Berlin/Heidelberg, Germany, 1999

  68. [68]

    A decision tree approach to data classification using signal temporal logic,

    G. Bombara, C.-I. Vasile, F. Penedo, H. Yasuoka, and C. Belta, “A decision tree approach to data classification using signal temporal logic,” in Proceedings of the 19th International Conference on Hybrid Systems: Computation and Control, 2016, pp. 1–10

  69. [69]

    TGPO: Temporal grounded policy optimization for signal temporal logic tasks,

    Y. Meng, F. Chen, and C. Fan, “TGPO: Temporal grounded policy optimization for signal temporal logic tasks,” arXiv preprint arXiv:2510.00225, 2025

  70. [70]

    LCRL: Certified policy synthesis via logically-constrained reinforcement learning,

    M. Hasanbeig, D. Kroening, and A. Abate, “LCRL: Certified policy synthesis via logically-constrained reinforcement learning,” in International Conference on Quantitative Evaluation of SysTems. Springer, 2022, pp. 217–231

  71. [71]

    Aggressive driving with model predictive path integral control,

    G. Williams, P. Drews, B. Goldfain, J. M. Rehg, and E. A. Theodorou, “Aggressive driving with model predictive path integral control,” in 2016 IEEE international conference on robotics and automation (ICRA). IEEE, 2016, pp. 1433–1440

  72. [72]

    Trajectory planning with signal temporal logic costs using deterministic path integral optimization,

    P. Halder, H. Homburger, L. Kiltz, J. Reuter, and M. Althoff, “Trajectory planning with signal temporal logic costs using deterministic path integral optimization,” arXiv preprint arXiv:2503.01476, 2025

  73. [73]

    Concurrent learning of control policy and unknown safety specifications in reinforcement learning,

    L. Yifru and A. Baheri, “Concurrent learning of control policy and unknown safety specifications in reinforcement learning,” IEEE Open Journal of Control Systems, vol. 3, pp. 266–281, 2024

  74. [74]

    Interpretable apprenticeship learning with temporal logic specifications,

    D. Kasenberg and M. Scheutz, “Interpretable apprenticeship learning with temporal logic specifications,” in 2017 IEEE 56th Annual Conference on Decision and Control (CDC), 2017, pp. 4914–4921

  75. [75]

    Reinforcement learning with non-Markovian rewards,

    M. Gaon and R. Brafman, “Reinforcement learning with non-Markovian rewards,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, pp. 3980–3987, Apr. 2020

  76. [76]

    Compositional reinforcement learning from logical specifications,

    K. Jothimurugan, S. Bansal, O. Bastani, and R. Alur, “Compositional reinforcement learning from logical specifications,” in Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 10 026–10 039

  77. [77]

    From spot 2.0 to spot 2.10: What’s new?

    A. Duret-Lutz, E. Renault, M. Colange, F. Renkin, A. Gbaguidi Aisse, P. Schlehuber-Caissier, T. Medioni, A. Martin, J. Dubois, C. Gillard et al., “From spot 2.0 to spot 2.10: What’s new?” in International Conference on Computer Aided Verification. Springer, 2022, pp. 174–187

  78. [78]

    Graph theory,

    R. Diestel, Graph theory. Springer Nature, 2025

  79. [79]

    Principles of mathematical analysis,

    W. Rudin, “Principles of mathematical analysis,” 3rd ed., 1976

  80. [80]

    High-dimensional continuous control using generalized advantage estimation,

    J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” arXiv preprint arXiv:1506.02438, 2015

Showing first 80 references.