Bellman Value Decomposition for Task Logic in Safe Optimal Control
Pith reviewed 2026-05-15 21:05 UTC · model grok-4.3
The pith
The Bellman value for temporal logic tasks decomposes into a graph of simpler values connected by reach-avoid, avoid, and reach-avoid-loop equations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We prove the Bellman Value for a complex task defined in temporal logic can be decomposed into a graph of Bellman Values, connected by a set of well-known Bellman equations (BEs): the Reach-Avoid BE, the Avoid BE, and a novel type, the Reach-Avoid-Loop BE. To solve the Value and optimal policy, we propose VDPPO, which embeds the decomposed Value graph into a two-layer neural net, bootstrapping the implicit dependencies.
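For orientation, the two established equations named above are commonly written as follows in the reachability literature. This is a sketch under assumed notation, not the paper's own statement: $r$ is a target reward (positive inside the goal), $q$ is a safety margin (positive while safe), $f$ is a deterministic transition map, and discounting is omitted. The Reach-Avoid-Loop BE is the paper's contribution and is not reproduced here.

```latex
% Assumed notation: r = target reward, q = safety margin, f = deterministic dynamics.
% Undiscounted forms; the paper's exact notation and discounting may differ.
\begin{align}
V_{\mathrm{RA}}(s) &= \min\Bigl\{\, q(s),\ \max\bigl\{\, r(s),\ \max_{a} V_{\mathrm{RA}}\bigl(f(s,a)\bigr) \bigr\} \Bigr\}
  && \text{(Reach-Avoid BE)} \\
V_{\mathrm{A}}(s) &= \min\Bigl\{\, q(s),\ \max_{a} V_{\mathrm{A}}\bigl(f(s,a)\bigr) \Bigr\}
  && \text{(Avoid BE)}
\end{align}
```

Under these conventions, $V_{\mathrm{RA}}(s) > 0$ indicates that the target can be reached from $s$ without leaving the safe set, and $V_{\mathrm{A}}(s) > 0$ indicates that the safe set can be maintained indefinitely.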
What carries the argument
Decomposition of the Bellman value into a graph connected by the reach-avoid Bellman equation, the avoid Bellman equation, and the reach-avoid-loop Bellman equation, embedded in a two-layer neural network.
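To make the two standard building blocks concrete, here is a minimal tabular sketch of fixed-point iteration for the undiscounted reach-avoid and avoid equations. Everything in it is an assumption for illustration: the 6-state chain, the reward `r`, the safety margin `q`, and the function names are hypothetical and do not come from the paper, which works with neural approximators and a third, novel equation not shown here.

```python
import numpy as np

def reach_avoid_value(r, q, next_states, iters=100):
    """Iterate V <- min(q, max(r, max_a V(next))) on a finite deterministic system.

    Starts from min(r, q) (the value of stopping immediately) and iterates upward;
    the undiscounted equation can have several fixed points, so the start matters.
    """
    r = np.asarray(r, dtype=float)
    q = np.asarray(q, dtype=float)
    V = np.minimum(r, q)
    for _ in range(iters):
        best_next = V[next_states].max(axis=1)  # max over actions
        V_new = np.minimum(q, np.maximum(r, best_next))
        if np.array_equal(V_new, V):
            break
        V = V_new
    return V

def avoid_value(q, next_states, iters=100):
    """Iterate V <- min(q, max_a V(next)), starting from q and moving downward."""
    q = np.asarray(q, dtype=float)
    V = q.copy()
    for _ in range(iters):
        V_new = np.minimum(q, V[next_states].max(axis=1))
        if np.array_equal(V_new, V):
            break
        V = V_new
    return V

# Hypothetical 6-state chain: two actions (left, right), clipped at the ends.
n = 6
states = np.arange(n)
next_states = np.stack(
    [np.clip(states - 1, 0, n - 1), np.clip(states + 1, 0, n - 1)], axis=1
)
r = np.where(states == 5, 1.0, -1.0)  # target at state 5
q = np.where(states == 2, -1.0, 1.0)  # obstacle at state 2

V_ra = reach_avoid_value(r, q, next_states)
V_a = avoid_value(q, next_states)
print(V_ra)  # positive only where the target is reachable without hitting the obstacle
print(V_a)   # positive wherever the obstacle can be avoided forever
```

In this toy system, states 3 to 5 can reach the target safely, while states 0 and 1 are cut off by the obstacle and so have a negative reach-avoid value even though they are safe under the avoid value.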
If this is right
- The optimal policy for combined safety and goal tasks is obtained by solving the embedded graph in a two-layer neural net.
- Safety and liveness specifications balance automatically without separate reward tuning.
- The method applies directly to high-dimensional systems with nonlinear dynamics and heterogeneous agent teams.
- Implicit task dependencies are resolved by the network bootstrapping process.
Where Pith is reading between the lines
- This decomposition approach could be tested on specification languages beyond the temporal logic fragment treated in the paper, such as signal temporal logic or other formalisms.
- Recursive application of the graph structure might handle more deeply nested specifications in future extensions.
- The two-layer embedding may transfer to continuous-time settings if the underlying Bellman equations are discretized consistently.
Load-bearing premise
The innate structure of the Bellman value organizes temporal logic tasks so that the decomposed graph embeds into a two-layer neural net without manual tuning or post-hoc adjustments.
What would settle it
A concrete temporal logic task where the value function computed from the decomposed graph and VDPPO differs measurably from the value obtained by solving the full undecomposed Bellman equation, or where the resulting policy violates a safety or liveness specification in direct simulation.
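One way to run the simulation side of such a check is to score each rollout with the standard quantitative semantics of an until specification and compare the sign against the predicted value. The sketch below is an illustration only: the function name and the sample trajectories are hypothetical, `q_vals` and `r_vals` are assumed to be per-step safety margins and target rewards, and it covers a single "q until r" sub-task rather than the paper's full task graph.

```python
def until_robustness(q_vals, r_vals):
    """Return max_t min(r_t, min_{tau <= t} q_tau) for one finite trajectory.

    Positive means the target was reached at some step while the safety
    margin stayed positive at every step up to and including that one.
    """
    best = float("-inf")
    running_q = float("inf")
    for q_t, r_t in zip(q_vals, r_vals):
        running_q = min(running_q, q_t)          # worst safety margin so far
        best = max(best, min(r_t, running_q))    # best safe arrival so far
    return best

# Hypothetical rollouts: the first stays safe and reaches the target,
# the second hits an obstacle before reaching it.
safe_rollout = until_robustness([1.0, 1.0, 1.0], [-1.0, -1.0, 1.0])
unsafe_rollout = until_robustness([1.0, -1.0, 1.0], [-1.0, -1.0, 1.0])
print(safe_rollout, unsafe_rollout)
```

Under deterministic dynamics, a rollout of the learned policy that scores negative from a state where the learned value is positive would be one concrete instance of the mismatch described above.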
Original abstract
Real-world tasks involve nuanced combinations of goal and safety specifications. In high dimensions, the challenge is exacerbated: formal automata become cumbersome, and the combination of sparse rewards tends to require laborious tuning. In this work, we consider the innate structure of the Bellman Value as a means to naturally organize the problem for improved automatic performance. Namely, we prove the Bellman Value for a complex task defined in temporal logic can be decomposed into a graph of Bellman Values, connected by a set of well-known Bellman equations (BEs): the Reach-Avoid BE, the Avoid BE, and a novel type, the Reach-Avoid-Loop BE. To solve the Value and optimal policy, we propose VDPPO, which embeds the decomposed Value graph into a two-layer neural net, bootstrapping the implicit dependencies. We conduct a variety of simulated and hardware experiments to test our method on complex, high-dimensional tasks involving heterogeneous teams and nonlinear dynamics. Ultimately, we find this approach greatly improves performance over existing baselines, balancing safety and liveness automatically.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to prove that the Bellman value for any complex task specified in temporal logic decomposes into a graph of simpler Bellman values connected exclusively by the Reach-Avoid Bellman equation, the Avoid Bellman equation, and a novel Reach-Avoid-Loop Bellman equation. It introduces VDPPO, which embeds this value graph into a two-layer neural network that bootstraps the implicit dependencies to compute the value function and optimal policy. Experiments on simulated and hardware tasks with heterogeneous teams and nonlinear dynamics are reported to show improved safety-liveness trade-offs over baselines without post-hoc tuning.
Significance. If the decomposition theorem holds for general STL formulas and the two-layer embedding preserves optimality without hidden parameters, the result would supply a structured, largely automatic route to safe optimal control for high-dimensional tasks whose specifications combine reachability, avoidance, and looping behaviors. The approach could reduce reliance on manual reward shaping or automata construction in robotics applications.
major comments (3)
- [Main derivation of Reach-Avoid-Loop BE] The central proof relies on the correctness of the novel Reach-Avoid-Loop Bellman equation. The manuscript must explicitly derive this equation from the STL semantics and state the precise assumptions (e.g., deterministic vs. stochastic transitions, finite vs. infinite loop horizons, memoryless sub-tasks) under which the fixed-point equation is valid; without this, the claim that every STL task reduces to combinations of only the three listed operators cannot be verified.
- [VDPPO architecture and training] The assertion that the decomposed graph embeds losslessly into a two-layer neural net without additional manual tuning or post-hoc adjustments is load-bearing for the practical contribution. The paper should provide a formal argument or explicit construction showing that all implicit dependencies among the sub-values are captured by the network architecture and loss; otherwise the performance gains may stem from implicit fitting rather than the decomposition.
- [Experimental results] Experiments report improved performance, yet the manuscript does not include an ablation that isolates the contribution of the Reach-Avoid-Loop equation versus the standard Reach-Avoid and Avoid equations. Without this, it is unclear whether the novel operator is necessary for the observed gains or whether simpler decompositions suffice.
minor comments (2)
- [Preliminaries] Notation for the three Bellman equations should be unified and introduced in a single preliminary section to improve readability.
- [Experiments] The abstract states that the method 'balances safety and liveness automatically'; the experimental section should report quantitative safety-violation rates alongside task-completion rates for all baselines.
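The reporting requested in the second minor comment could be computed along the following lines. This is a hedged sketch, not the paper's evaluation code: the function name, the rollout format (a pair of per-step safety margins and target rewards), and the sign conventions (negative margin means violation, positive reward means target reached) are all assumptions, and an episode with any violation is counted as not completed.

```python
def rollout_rates(rollouts):
    """Return (safety_violation_rate, task_completion_rate) over a list of rollouts.

    Each rollout is a pair (q_vals, r_vals) of per-step safety margins and
    target rewards. Assumed conventions: q < 0 is a violation, r > 0 is success.
    """
    violations = 0
    completions = 0
    for q_vals, r_vals in rollouts:
        violated = any(q < 0 for q in q_vals)
        reached = any(r > 0 for r in r_vals)
        violations += violated
        completions += reached and not violated
    n = len(rollouts)
    return violations / n, completions / n

# Hypothetical batch of four episodes.
episodes = [
    ([1.0, 1.0], [-1.0, 1.0]),   # safe, reaches the target
    ([1.0, -1.0], [-1.0, 1.0]),  # violates safety
    ([1.0, 1.0], [-1.0, -1.0]),  # safe, never reaches the target
    ([1.0, 1.0], [1.0, 1.0]),    # safe, reaches the target
]
violation_rate, completion_rate = rollout_rates(episodes)
print(violation_rate, completion_rate)
```

Reporting both numbers per baseline makes it visible when a method buys completion at the cost of safety, which a single aggregate score can hide.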
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback on our manuscript. We address each of the major comments below, providing clarifications and committing to revisions where appropriate to strengthen the paper.
- Referee: [Main derivation of Reach-Avoid-Loop BE] The central proof relies on the correctness of the novel Reach-Avoid-Loop Bellman equation. The manuscript must explicitly derive this equation from the STL semantics and state the precise assumptions (e.g., deterministic vs. stochastic transitions, finite vs. infinite loop horizons, memoryless sub-tasks) under which the fixed-point equation is valid; without this, the claim that every STL task reduces to combinations of only the three listed operators cannot be verified.
  Authors: We agree that an explicit derivation is essential for verifying the decomposition theorem. In the revised manuscript, we will add a dedicated subsection in the theoretical analysis that derives the Reach-Avoid-Loop Bellman equation step-by-step from the STL semantics. We will explicitly state the assumptions: deterministic transitions, infinite-horizon loops with appropriate discounting to ensure convergence, and memoryless sub-tasks as per the STL fragment considered. This will confirm that all complex STL formulas reduce to the three operators.
  Revision: yes
- Referee: [VDPPO architecture and training] The assertion that the decomposed graph embeds losslessly into a two-layer neural net without additional manual tuning or post-hoc adjustments is load-bearing for the practical contribution. The paper should provide a formal argument or explicit construction showing that all implicit dependencies among the sub-values are captured by the network architecture and loss; otherwise the performance gains may stem from implicit fitting rather than the decomposition.
  Authors: The VDPPO architecture is specifically designed with the first layer computing the sub-value functions corresponding to the graph nodes and the second layer implementing the bootstrapping via the Bellman operators. We will include in the revision a formal construction in the appendix that proves the network captures all dependencies through its layered structure and the composite loss function, without requiring manual tuning or hidden parameters beyond the graph embedding.
  Revision: yes
- Referee: [Experimental results] Experiments report improved performance, yet the manuscript does not include an ablation that isolates the contribution of the Reach-Avoid-Loop equation versus the standard Reach-Avoid and Avoid equations. Without this, it is unclear whether the novel operator is necessary for the observed gains or whether simpler decompositions suffice.
  Authors: We acknowledge the value of isolating the contribution of the novel operator. In the revised version, we will add an ablation study in the experimental section that compares the full VDPPO using all three equations against variants using only Reach-Avoid and Avoid equations on the same tasks, to demonstrate the necessity of the Reach-Avoid-Loop BE for the reported performance improvements.
  Revision: yes
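The first response above leans on discounting to ensure convergence. As a hedged illustration of why discounting helps, the sketch below numerically checks that a discounted reach-avoid backup of the kind used in reach-avoid reinforcement learning contracts by a factor of gamma in the sup norm. It is an assumption that the paper's discounting takes this form, the random system is hypothetical, and the check covers the standard reach-avoid backup only, not the Reach-Avoid-Loop BE.

```python
import numpy as np

def discounted_ra_backup(V, r, q, next_states, gamma):
    """One discounted reach-avoid backup (assumed form, maximizing convention):
    (1 - gamma) * min(r, q) + gamma * min(q, max(r, max_a V(next)))."""
    best_next = V[next_states].max(axis=1)
    return (1 - gamma) * np.minimum(r, q) + gamma * np.minimum(q, np.maximum(r, best_next))

# Hypothetical random deterministic system with 20 states and 3 actions.
rng = np.random.default_rng(0)
n, n_actions, gamma = 20, 3, 0.9
next_states = rng.integers(0, n, size=(n, n_actions))
r = rng.uniform(-1, 1, n)
q = rng.uniform(-1, 1, n)

# Two arbitrary value guesses move closer by at least a factor gamma per backup,
# because max and min against fixed terms never increase the sup-norm distance.
V1 = rng.uniform(-1, 1, n)
V2 = rng.uniform(-1, 1, n)
gap_before = np.abs(V1 - V2).max()
gap_after = np.abs(
    discounted_ra_backup(V1, r, q, next_states, gamma)
    - discounted_ra_backup(V2, r, q, next_states, gamma)
).max()

# Repeated backups from both guesses therefore approach the same fixed point.
for _ in range(500):
    V1 = discounted_ra_backup(V1, r, q, next_states, gamma)
    V2 = discounted_ra_backup(V2, r, q, next_states, gamma)
print(gap_before, gap_after, np.abs(V1 - V2).max())
```

Whether an analogous contraction argument carries over to the looping operator is exactly what the requested derivation would need to establish.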
Circularity Check
Derivation self-contained; no circular reductions identified
The paper states it proves decomposition of the Bellman value for temporal-logic tasks into a graph connected by Reach-Avoid BE, Avoid BE, and a novel Reach-Avoid-Loop BE. The novel equation is presented as derived within the proof rather than obtained by fitting or self-definition. The two-layer neural net embedding is described as a solution architecture that bootstraps implicit dependencies, not as a statistical prediction forced by prior fits. No load-bearing self-citations, uniqueness theorems imported from the same authors, or ansatzes smuggled via prior work are referenced in the abstract or description. The central claim therefore retains independent mathematical content and does not reduce to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The Bellman value for temporal logic tasks admits a graph decomposition connected by reach-avoid, avoid, and reach-avoid-loop equations.
invented entities (1)
- Reach-Avoid-Loop Bellman equation: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "we prove the Bellman Value for a complex task defined in temporal logic can be decomposed into a graph of Bellman Values, connected by ... the Reach-Avoid BE, the Avoid BE, and a novel type, the Reach-Avoid-Loop BE"
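- IndisputableMonolith/Foundation/ArrowOfTime.lean · forward_accumulates · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Lemma 2 ... RAℓ-BE ... $\lim_{\gamma \to 1} V_{\gamma}^{j} = V^{*}\bigl[\mathbf{G}\bigl(\bigwedge_{j \in J} (q_j \,\mathbf{U}\, r_j)\bigr)\bigr]$"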
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Value Functions for Temporal Logic: Optimal Policies and Safety Filters
  Non-Markovian policies from decomposed temporal logic value functions are proven optimal for nested Until, Globally, and Globally-Until specifications and extend Q-function safety filters to complex tasks.