Control Synthesis from Linear Temporal Logic Specifications using Model-Free Reinforcement Learning

Alper Kamil Bozkurt; Michael M. Zavlanos; Miroslav Pajic; Yu Wang

arxiv: 1909.07299 · v2 · submitted 2019-09-16 · 💻 cs.RO · cs.AI· cs.LG

Control Synthesis from Linear Temporal Logic Specifications using Model-Free Reinforcement Learning

Alper Kamil Bozkurt , Yu Wang , Michael M. Zavlanos , Miroslav Pajic This is my paper

Pith reviewed 2026-05-24 15:15 UTC · model grok-4.3

classification 💻 cs.RO cs.AIcs.LG

keywords reinforcement learninglinear temporal logiccontrol synthesismodel-free learningMarkov decision processesmotion planningstochastic environments

0 comments

The pith

A model-free RL method learns policies that maximize the probability of satisfying LTL specifications without learning transition probabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to convert an LTL formula into a set of rewards and path-dependent discount factors so that the RL objective of maximizing total discounted return becomes equivalent to maximizing the probability that the LTL formula is satisfied. Because the construction works for any unknown MDP, standard model-free algorithms such as Q-learning are guaranteed to converge to the policy that is optimal for the original LTL goal. The approach is demonstrated on motion-planning tasks where the agent must reach goals while avoiding unsafe regions under stochastic dynamics. This removes the need to estimate a full transition model before synthesis, which is often impractical in real environments.

Core claim

By introducing a novel rewarding and path-dependent discounting mechanism derived from the LTL formula, an optimal policy for the resulting discounted-reward problem also maximizes the probability of satisfying the LTL objective, and any model-free RL algorithm using these rewards and discounts is guaranteed to converge to such a policy.

What carries the argument

LTL-derived reward function together with path-dependent discount factors that encode satisfaction probabilities into the expected return.

If this is right

Model-free RL converges to the LTL-optimal policy without ever learning transition probabilities.
The same reward and discount construction applies to any finite-state MDP and any LTL formula.
The resulting policies can be used directly for motion planning under stochastic dynamics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be combined with function approximation for continuous-state problems where tabular Q-learning is infeasible.
Online re-computation of the discounts might allow the approach to handle slowly changing environments without restarting learning.
Because the construction is local to each automaton state, it may extend to other automata-based specifications beyond LTL.

Load-bearing premise

The chosen rewards and path-dependent discounts must correctly encode the LTL satisfaction probability so that the RL optimum coincides with the LTL optimum.

What would settle it

An MDP and LTL formula for which the policy that maximizes the constructed discounted return achieves a strictly lower satisfaction probability than another policy.

Figures

Figures reproduced from arXiv: 1909.07299 by Alper Kamil Bozkurt, Michael M. Zavlanos, Miroslav Pajic, Yu Wang.

**Figure 1.** Figure 1: Product MDP M× obtained from an MDP M and an LDBA A that is automatically derived from an LTL formula ϕ. start state q0 and whenever the MDP M makes a transition from s to s 0 , Aϕ updates its current state from q to δ(q, L(s)). The action to be selected in an MDP state s when Aϕ is in a state q is determined by π ϕ B as follows: if π ϕ B(hs, qi) is an -action q 0 , Aϕ changes its state to q 0 and the ac… view at source ↗

**Figure 3.** Figure 3: A summary of the synthesized policy for the nursery scenario. Arrows: actions top, left, down, and right; encircled characters: state labels. The actions in states that are not reachable or lead to another LDBA state are not displayed. In all subfigures, the most likely paths are highlighted in red. probability 0.8 and 0.2, respectively. In addition, this type of non-0 or non-1 probability guarantees canno… view at source ↗

read the original abstract

We present a reinforcement learning (RL) framework to synthesize a control policy from a given linear temporal logic (LTL) specification in an unknown stochastic environment that can be modeled as a Markov Decision Process (MDP). Specifically, we learn a policy that maximizes the probability of satisfying the LTL formula without learning the transition probabilities. We introduce a novel rewarding and path-dependent discounting mechanism based on the LTL formula such that (i) an optimal policy maximizing the total discounted reward effectively maximizes the probabilities of satisfying LTL objectives, and (ii) a model-free RL algorithm using these rewards and discount factors is guaranteed to converge to such policy. Finally, we illustrate the applicability of our RL-based synthesis approach on two motion planning case studies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's main contribution is a new LTL-tied reward and path-dependent discount scheme that lets model-free RL target satisfaction probabilities without learning transitions, but the convergence claim is the part that needs the most checking.

read the letter

The paper gives a concrete mechanism for turning an LTL formula into rewards and history-dependent discounts so that the RL optimum matches the probability of satisfying the spec. That construction is the actual new piece; prior work either learned the model or used different encodings. The two motion-planning examples show the approach runs on standard Q-learning style updates and produces policies that meet the specs in those cases. That is useful for anyone who wants formal guarantees without building an explicit MDP first. The soft spot is the convergence guarantee. Standard model-free results rely on a fixed discount less than one to get a contraction. Once the discount varies with the path or automaton progress, the effective return is no longer obviously stationary or Markovian unless the state is augmented in a way that restores the contraction. The abstract asserts the guarantee follows, but the letter does not show the reduction or the error bounds, so it is not clear whether the argument closes. If the full proof augments the state properly and verifies the contraction, the claim holds; otherwise it is the load-bearing step that could slip. This is for researchers who already work at the LTL-RL boundary and are willing to verify the math themselves. It is worth sending to a serious referee who can check the proofs and the experimental controls, rather than desk-rejecting on the abstract alone.

Referee Report

2 major / 1 minor

Summary. The paper proposes a model-free RL framework to synthesize policies maximizing the probability of satisfying a given LTL specification in an unknown stochastic MDP. It introduces a novel LTL-derived rewarding scheme combined with path-dependent discounting, claiming that (i) the resulting optimal policy coincides with the LTL satisfaction probability maximizer and (ii) standard model-free RL algorithms converge to this policy. The approach is demonstrated on two motion planning case studies.

Significance. If the central claims hold, the work would enable LTL-constrained policy synthesis without explicit model identification or transition probability learning, which is valuable for robotics applications where dynamics are hard to characterize. The paper introduces a concrete mechanism for embedding LTL progress into rewards and discounts, and the case studies provide initial empirical support.

major comments (2)

[Abstract and convergence section] Abstract and the section presenting the convergence argument: the claim that model-free RL (e.g., Q-learning) is guaranteed to converge under the proposed path-dependent discounts is load-bearing for the second half of the main result. Standard convergence theorems require a fixed discount factor yielding a contraction in sup norm; the manuscript must explicitly show either (a) that the path-dependent mechanism reduces to an equivalent stationary MDP with fixed γ via state augmentation with the LTL automaton, or (b) that the effective return still satisfies the contraction property. Without this reduction or proof, the guarantee does not follow from textbook results.
[Mechanism definition section] The section defining the LTL-based rewards and path-dependent discounts: the construction must be shown to encode the satisfaction probability exactly (so that reward maximization coincides with probability maximization) without requiring knowledge of the unknown transition function. The paper should provide the explicit equivalence proof relating the discounted return under the new mechanism to the LTL satisfaction probability.

minor comments (1)

[Case studies] The two motion planning case studies would benefit from additional quantitative metrics (e.g., empirical satisfaction probability over multiple runs) to strengthen the empirical validation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help strengthen the presentation of our convergence guarantees and the formal equivalence between the proposed reward mechanism and LTL satisfaction probability. We address each major comment below and will incorporate the requested clarifications and proofs into the revised manuscript.

read point-by-point responses

Referee: [Abstract and convergence section] Abstract and the section presenting the convergence argument: the claim that model-free RL (e.g., Q-learning) is guaranteed to converge under the proposed path-dependent discounts is load-bearing for the second half of the main result. Standard convergence theorems require a fixed discount factor yielding a contraction in sup norm; the manuscript must explicitly show either (a) that the path-dependent mechanism reduces to an equivalent stationary MDP with fixed γ via state augmentation with the LTL automaton, or (b) that the effective return still satisfies the contraction property. Without this reduction or proof, the guarantee does not follow from textbook results.

Authors: We agree that an explicit justification is required to connect the path-dependent discounting to standard convergence results. In the revised manuscript we will add a dedicated subsection that shows the path-dependent discount factors depend only on the current state of the LTL automaton. By augmenting the original MDP state space with the automaton states, the resulting process is a stationary MDP in which each augmented state has a fixed discount factor. This reduction allows direct application of the classical Q-learning convergence theorem (contraction mapping in the sup-norm). We will also include a short proof that the effective Bellman operator remains a contraction under the augmented representation. revision: yes
Referee: [Mechanism definition section] The section defining the LTL-based rewards and path-dependent discounts: the construction must be shown to encode the satisfaction probability exactly (so that reward maximization coincides with probability maximization) without requiring knowledge of the unknown transition function. The paper should provide the explicit equivalence proof relating the discounted return under the new mechanism to the LTL satisfaction probability.

Authors: We acknowledge that the manuscript currently states the equivalence but does not supply a self-contained formal proof. In the revision we will insert a theorem (with proof) establishing that, for any policy, the expected discounted return under the proposed LTL-derived rewards and path-dependent discounts equals the probability that the policy satisfies the LTL formula. The proof relies only on the structure of the automaton and the definition of the reward/discount functions; it is therefore independent of the unknown transition probabilities. This will rigorously confirm that maximizing the RL objective is equivalent to maximizing the LTL satisfaction probability. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation self-contained

full rationale

The central construction introduces a novel LTL-based reward and path-dependent discount mechanism whose properties (i) and (ii) are asserted to hold by design of the encoding. No quoted equations or self-citations reduce the optimality or convergence claims to fitted parameters, tautological definitions, or prior author results by construction. The method is presented as independent of learned transition probabilities, consistent with external RL convergence results under the stated augmentation. No load-bearing steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Abstract-only review; ledger reflects only elements explicitly named in the provided text. Full paper may contain additional fitted parameters or assumptions.

axioms (1)

domain assumption The unknown stochastic environment can be modeled as a Markov Decision Process (MDP).
Stated directly in the abstract as the modeling assumption for the environment.

invented entities (1)

LTL-based rewarding and path-dependent discounting mechanism no independent evidence
purpose: To encode LTL satisfaction probability into the RL objective so that standard model-free algorithms converge to the desired policy
Introduced as novel in the abstract; no independent evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5663 in / 1276 out tokens · 25149 ms · 2026-05-24T15:15:32.862788+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Model-Free Reinforcement Learning for Stochastic Games with Linear Temporal Logic Objectives
cs.RO 2020-10 unverdicted novelty 6.0

A model-free RL methodology is developed to maximize the probability of LTL satisfaction in unknown stochastic games when the derived DRA has a single Rabin pair, with a generalization providing lower bounds for multi...

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

Sampling-based algorithms for optimal motion planning

Sertac Karaman and Emilio Frazzoli. Sampling-based algorithms for optimal motion planning. The International Journal of Robotics Research, 30(7):846–894, 2011

work page 2011
[2]

Anytime motion planning using the RRT

Sertac Karaman, Matthew R Walter, Alejandro Perez, Emilio Frazzoli, and Seth Teller. Anytime motion planning using the RRT. In 2011 IEEE International Conference on Robotics and Automation , pages 1478–1483. IEEE, 2011

work page 2011
[3]

C. I. Vasile and C. Belta. Sampling-based temporal logic path planning. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems , pages 4817–4822, Nov 2013

work page 2013
[4]

Optimal path planning for surveillance with temporal-logic constraints

Stephen L Smith, Jana Tmov, Calin Belta, and Daniela Rus. Optimal path planning for surveillance with temporal-logic constraints. The International Journal of Robotics Research , 30(14):1695–1708, 2011

work page 2011
[5]

Y . Chen, X. C. Ding, A. Stefanescu, and C. Belta. Formal approach to the deployment of distributed robotic teams. IEEE Transactions on Robotics, 28(1):158–171, Feb 2012

work page 2012
[6]

Kantaros and M

Y . Kantaros and M. M. Zavlanos. Sampling-based control synthesis for multi-robot systems under global temporal speciﬁcations. In 2017 ACM/IEEE 8th International Conference on Cyber-Physical Systems (ICCPS), pages 3–14, April 2017

work page 2017
[7]

E. M. Wolff, U. Topcu, and R. M. Murray. Optimization-based trajectory generation with linear temporal logic speciﬁcations. In 2014 IEEE International Conference on Robotics and Automation (ICRA) , pages 5319–5325, May 2014

work page 2014
[8]

Guo and M

M. Guo and M. M. Zavlanos. Probabilistic motion planning under temporal tasks and soft constraints. IEEE Transactions on Automatic Control, 63(12):4051–4066, Dec 2018

work page 2018
[9]

Multi-agent plan reconﬁgu- ration under local LTL speciﬁcations

Meng Guo and Dimos V Dimarogonas. Multi-agent plan reconﬁgu- ration under local LTL speciﬁcations. The International Journal of Robotics Research, 34(2):218–235, 2015

work page 2015
[10]

Kantaros and M

Y . Kantaros and M. M. Zavlanos. Sampling-based optimal control synthesis for multirobot systems under global temporal tasks. IEEE Transactions on Automatic Control , 64(5):1916–1931, May 2019

work page 1916
[11]

Lahijanian, S

M. Lahijanian, S. B. Andersson, and C. Belta. Temporal logic motion planning and control with probabilistic satisfaction guarantees. IEEE Transactions on Robotics , 28(2):396–409, April 2012

work page 2012
[12]

E. M. Wolff, U. Topcu, and R. M. Murray. Robust control of uncertain Markov decision processes with temporal logic speciﬁcations. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC) , pages 3372–3379, Dec 2012

work page 2012
[13]

Automated veriﬁcation and strategy synthesis for probabilistic systems

Marta Kwiatkowska and David Parker. Automated veriﬁcation and strategy synthesis for probabilistic systems. In Dang Van Hung and Mizuhito Ogawa, editors, Automated Technology for Veriﬁcation and Analysis, pages 5–22, Cham, 2013. Springer International Publishing

work page 2013
[14]

X. Ding, S. L. Smith, C. Belta, and D. Rus. Optimal control of Markov decision processes with linear temporal logic constraints. IEEE Transactions on Automatic Control , 59(5):1244–1257, May 2014

work page 2014
[15]

Principles of Model Checking

Christel Baier and Joost-Pieter Katoen. Principles of Model Checking. MIT Press, Cambridge, MA, USA, 2008

work page 2008
[16]

Probably Approximately Correct MDP Learning and Control With Temporal Logic Constraints

Jie Fu and Ufuk Topcu. Probably approximately correct MDP learning and control with temporal logic constraints, 2014. arXiv:1404.7073 [cs.SY]

work page internal anchor Pith review Pith/arXiv arXiv 2014
[17]

Veriﬁcation of Markov decision processes using learning algorithms

Tom ´aˇs Br ´azdil, Krishnendu Chatterjee, Martin Chmel ´ık, V ojtˇech Forejt, Jan Kˇret´ınsk´y, Marta Kwiatkowska, David Parker, and Mateusz Ujma. Veriﬁcation of Markov decision processes using learning algorithms. In Franck Cassez and Jean-Franc ¸ois Raskin, editors, Automated Technology for Veriﬁcation and Analysis , pages 98–114, Cham, 2014. Springer ...

work page 2014
[18]

Correct-by-synthesis reinforcement learning with temporal logic constraints

Min Wen, R ¨udiger Ehlers, and Ufuk Topcu. Correct-by-synthesis reinforcement learning with temporal logic constraints. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4983–4990. IEEE, 2015

work page 2015
[19]

Aksaray, A

D. Aksaray, A. Jones, Z. Kong, M. Schwager, and C. Belta. Q-learning for robust satisfaction of signal temporal logic speciﬁcations. In 2016 IEEE 55th Conference on Decision and Control (CDC) , pages 6565– 6570, Dec 2016

work page 2016
[20]

X. Li, C. Vasile, and C. Belta. Reinforcement learning with tem- poral logic rewards. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages 3834–3839, Sep. 2017

work page 2017
[21]

Teaching multiple tasks to an RL agent using LTL

Rodrigo Toro Icarte, Toryn Q Klassen, Richard Valenzano, and Sheila A McIlraith. Teaching multiple tasks to an RL agent using LTL. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems , pages 452–461. International Foun- dation for Autonomous Agents and Multiagent Systems, 2018

work page 2018
[22]

Foundations for restraining bolts: Reinforcement learning with LTL f /LDLf restraining speciﬁcations

Giuseppe De Giacomo, Luca Iocchi, Marco Favorito, and Fabio Patrizi. Foundations for restraining bolts: Reinforcement learning with LTL f /LDLf restraining speciﬁcations. In Proceedings of the International Conference on Automated Planning and Scheduling , volume 29, pages 128–136, 2019

work page 2019
[23]

Sadigh, E

D. Sadigh, E. S. Kim, S. Coogan, S. S. Sastry, and S. A. Seshia. A learning based approach to control synthesis of Markov decision processes for linear temporal logic speciﬁcations. In 53rd IEEE Conference on Decision and Control , pages 1091–1096, Dec 2014

work page 2014
[24]

Zavlanos

Qitong Gao, Davood Hajinezhad, Yan Zhang, Yiannis Kantaros, and Michael M. Zavlanos. Reduced variance deep reinforcement learning with temporal logic speciﬁcations. In Proceedings of the 10th ACM/IEEE International Conference on Cyber-Physical Systems, ICCPS ’19, pages 237–248, New York, NY , USA, 2019. ACM

work page 2019
[25]

Omega-regular objectives in model-free reinforcement learning

Ernst Moritz Hahn, Mateo Perez, Sven Schewe, Fabio Somenzi, Ashutosh Trivedi, and Dominik Wojtczak. Omega-regular objectives in model-free reinforcement learning. In Tom ´aˇs V ojnar and Lijun Zhang, editors, Tools and Algorithms for the Construction and Anal- ysis of Systems , pages 395–412, Cham, 2019. Springer International Publishing

work page 2019
[26]

A policy search method for temporal logic speciﬁed reinforcement learning tasks

Xiao Li, Yao Ma, and Calin Belta. A policy search method for temporal logic speciﬁed reinforcement learning tasks. In 2018 Annual American Control Conference (ACC) , pages 240–245. IEEE, 2018

work page 2018
[27]

Lazy Probabilistic Model Checking without De- terminisation

Ernst Moritz Hahn, Guangyuan Li, Sven Schewe, Andrea Turrini, and Lijun Zhang. Lazy Probabilistic Model Checking without De- terminisation. In Luca Aceto and David de Frutos Escrig, editors, 26th International Conference on Concurrency Theory (CONCUR 2015), volume 42 of Leibniz International Proceedings in Informatics (LIPIcs), pages 354–367, Dagstuhl, Ge...

work page 2015
[28]

Limit-deterministic B ¨uchi automata for linear temporal logic

Salomon Sickert, Javier Esparza, Stefan Jaax, and Jan K ˇret´ınsk´y. Limit-deterministic B ¨uchi automata for linear temporal logic. In Swarat Chaudhuri and Azadeh Farzan, editors, Computer Aided Veriﬁ- cation, pages 312–332, Cham, 2016. Springer International Publishing

work page 2016
[29]

Logically-Constrained Reinforcement Learning

Mohammadhosein Hasanbeig, Alessandro Abate, and Daniel Kroen- ing. Logically-constrained reinforcement learning. arXiv:1801.08099 [cs.LG], 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[30]

Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L

Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. Pac model-free reinforcement learning. In Pro- ceedings of the 23rd International Conference on Machine Learning , ICML 06, page 881888, New York, NY , USA, 2006. Association for Computing Machinery

work page 2006
[31]

Reinforcement Learning: An Introduction

Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 2nd edition, 2018

work page 2018
[32]

Jordan, and Satinder P

Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. Conver- gence of stochastic iterative dynamic programming algorithms. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6 , pages 703–710. Morgan- Kaufmann, 1994

work page 1994
[33]

Rabinizer 4: From LTL to your favourite deterministic automaton

Jan K ˇret´ınsk´y, Tobias Meggendorfer, Salomon Sickert, and Christo- pher Ziegler. Rabinizer 4: From LTL to your favourite deterministic automaton. In Hana Chockler and Georg Weissenbacher, editors, Computer Aided Veriﬁcation , pages 567–577, Cham, 2018. Springer International Publishing

work page 2018
[34]

A. K. Bozkurt, Y . Wang, M. M. Zavlanos, and M. Pajic. CSRL, 2019. https://gitlab.oit.duke.edu/cpsl/csrl

work page 2019
[35]

Kress-Gazit, G

H. Kress-Gazit, G. E. Fainekos, and G. J. Pappas. Where’s Waldo? sensor-based temporal logic motion planning. In Proceedings 2007 IEEE International Conference on Robotics and Automation , pages 3116–3121, April 2007

work page 2007

[1] [1]

Sampling-based algorithms for optimal motion planning

Sertac Karaman and Emilio Frazzoli. Sampling-based algorithms for optimal motion planning. The International Journal of Robotics Research, 30(7):846–894, 2011

work page 2011

[2] [2]

Anytime motion planning using the RRT

Sertac Karaman, Matthew R Walter, Alejandro Perez, Emilio Frazzoli, and Seth Teller. Anytime motion planning using the RRT. In 2011 IEEE International Conference on Robotics and Automation , pages 1478–1483. IEEE, 2011

work page 2011

[3] [3]

C. I. Vasile and C. Belta. Sampling-based temporal logic path planning. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems , pages 4817–4822, Nov 2013

work page 2013

[4] [4]

Optimal path planning for surveillance with temporal-logic constraints

Stephen L Smith, Jana Tmov, Calin Belta, and Daniela Rus. Optimal path planning for surveillance with temporal-logic constraints. The International Journal of Robotics Research , 30(14):1695–1708, 2011

work page 2011

[5] [5]

Y . Chen, X. C. Ding, A. Stefanescu, and C. Belta. Formal approach to the deployment of distributed robotic teams. IEEE Transactions on Robotics, 28(1):158–171, Feb 2012

work page 2012

[6] [6]

Kantaros and M

Y . Kantaros and M. M. Zavlanos. Sampling-based control synthesis for multi-robot systems under global temporal speciﬁcations. In 2017 ACM/IEEE 8th International Conference on Cyber-Physical Systems (ICCPS), pages 3–14, April 2017

work page 2017

[7] [7]

E. M. Wolff, U. Topcu, and R. M. Murray. Optimization-based trajectory generation with linear temporal logic speciﬁcations. In 2014 IEEE International Conference on Robotics and Automation (ICRA) , pages 5319–5325, May 2014

work page 2014

[8] [8]

Guo and M

M. Guo and M. M. Zavlanos. Probabilistic motion planning under temporal tasks and soft constraints. IEEE Transactions on Automatic Control, 63(12):4051–4066, Dec 2018

work page 2018

[9] [9]

Multi-agent plan reconﬁgu- ration under local LTL speciﬁcations

Meng Guo and Dimos V Dimarogonas. Multi-agent plan reconﬁgu- ration under local LTL speciﬁcations. The International Journal of Robotics Research, 34(2):218–235, 2015

work page 2015

[10] [10]

Kantaros and M

Y . Kantaros and M. M. Zavlanos. Sampling-based optimal control synthesis for multirobot systems under global temporal tasks. IEEE Transactions on Automatic Control , 64(5):1916–1931, May 2019

work page 1916

[11] [11]

Lahijanian, S

M. Lahijanian, S. B. Andersson, and C. Belta. Temporal logic motion planning and control with probabilistic satisfaction guarantees. IEEE Transactions on Robotics , 28(2):396–409, April 2012

work page 2012

[12] [12]

E. M. Wolff, U. Topcu, and R. M. Murray. Robust control of uncertain Markov decision processes with temporal logic speciﬁcations. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC) , pages 3372–3379, Dec 2012

work page 2012

[13] [13]

Automated veriﬁcation and strategy synthesis for probabilistic systems

Marta Kwiatkowska and David Parker. Automated veriﬁcation and strategy synthesis for probabilistic systems. In Dang Van Hung and Mizuhito Ogawa, editors, Automated Technology for Veriﬁcation and Analysis, pages 5–22, Cham, 2013. Springer International Publishing

work page 2013

[14] [14]

X. Ding, S. L. Smith, C. Belta, and D. Rus. Optimal control of Markov decision processes with linear temporal logic constraints. IEEE Transactions on Automatic Control , 59(5):1244–1257, May 2014

work page 2014

[15] [15]

Principles of Model Checking

Christel Baier and Joost-Pieter Katoen. Principles of Model Checking. MIT Press, Cambridge, MA, USA, 2008

work page 2008

[16] [16]

Probably Approximately Correct MDP Learning and Control With Temporal Logic Constraints

Jie Fu and Ufuk Topcu. Probably approximately correct MDP learning and control with temporal logic constraints, 2014. arXiv:1404.7073 [cs.SY]

work page internal anchor Pith review Pith/arXiv arXiv 2014

[17] [17]

Veriﬁcation of Markov decision processes using learning algorithms

Tom ´aˇs Br ´azdil, Krishnendu Chatterjee, Martin Chmel ´ık, V ojtˇech Forejt, Jan Kˇret´ınsk´y, Marta Kwiatkowska, David Parker, and Mateusz Ujma. Veriﬁcation of Markov decision processes using learning algorithms. In Franck Cassez and Jean-Franc ¸ois Raskin, editors, Automated Technology for Veriﬁcation and Analysis , pages 98–114, Cham, 2014. Springer ...

work page 2014

[18] [18]

Correct-by-synthesis reinforcement learning with temporal logic constraints

Min Wen, R ¨udiger Ehlers, and Ufuk Topcu. Correct-by-synthesis reinforcement learning with temporal logic constraints. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4983–4990. IEEE, 2015

work page 2015

[19] [19]

Aksaray, A

D. Aksaray, A. Jones, Z. Kong, M. Schwager, and C. Belta. Q-learning for robust satisfaction of signal temporal logic speciﬁcations. In 2016 IEEE 55th Conference on Decision and Control (CDC) , pages 6565– 6570, Dec 2016

work page 2016

[20] [20]

X. Li, C. Vasile, and C. Belta. Reinforcement learning with tem- poral logic rewards. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages 3834–3839, Sep. 2017

work page 2017

[21] [21]

Teaching multiple tasks to an RL agent using LTL

Rodrigo Toro Icarte, Toryn Q Klassen, Richard Valenzano, and Sheila A McIlraith. Teaching multiple tasks to an RL agent using LTL. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems , pages 452–461. International Foun- dation for Autonomous Agents and Multiagent Systems, 2018

work page 2018

[22] [22]

Foundations for restraining bolts: Reinforcement learning with LTL f /LDLf restraining speciﬁcations

Giuseppe De Giacomo, Luca Iocchi, Marco Favorito, and Fabio Patrizi. Foundations for restraining bolts: Reinforcement learning with LTL f /LDLf restraining speciﬁcations. In Proceedings of the International Conference on Automated Planning and Scheduling , volume 29, pages 128–136, 2019

work page 2019

[23] [23]

Sadigh, E

D. Sadigh, E. S. Kim, S. Coogan, S. S. Sastry, and S. A. Seshia. A learning based approach to control synthesis of Markov decision processes for linear temporal logic speciﬁcations. In 53rd IEEE Conference on Decision and Control , pages 1091–1096, Dec 2014

work page 2014

[24] [24]

Zavlanos

Qitong Gao, Davood Hajinezhad, Yan Zhang, Yiannis Kantaros, and Michael M. Zavlanos. Reduced variance deep reinforcement learning with temporal logic speciﬁcations. In Proceedings of the 10th ACM/IEEE International Conference on Cyber-Physical Systems, ICCPS ’19, pages 237–248, New York, NY , USA, 2019. ACM

work page 2019

[25] [25]

Omega-regular objectives in model-free reinforcement learning

Ernst Moritz Hahn, Mateo Perez, Sven Schewe, Fabio Somenzi, Ashutosh Trivedi, and Dominik Wojtczak. Omega-regular objectives in model-free reinforcement learning. In Tom ´aˇs V ojnar and Lijun Zhang, editors, Tools and Algorithms for the Construction and Anal- ysis of Systems , pages 395–412, Cham, 2019. Springer International Publishing

work page 2019

[26] [26]

A policy search method for temporal logic speciﬁed reinforcement learning tasks

Xiao Li, Yao Ma, and Calin Belta. A policy search method for temporal logic speciﬁed reinforcement learning tasks. In 2018 Annual American Control Conference (ACC) , pages 240–245. IEEE, 2018

work page 2018

[27] [27]

Lazy Probabilistic Model Checking without De- terminisation

Ernst Moritz Hahn, Guangyuan Li, Sven Schewe, Andrea Turrini, and Lijun Zhang. Lazy Probabilistic Model Checking without De- terminisation. In Luca Aceto and David de Frutos Escrig, editors, 26th International Conference on Concurrency Theory (CONCUR 2015), volume 42 of Leibniz International Proceedings in Informatics (LIPIcs), pages 354–367, Dagstuhl, Ge...

work page 2015

[28] [28]

Limit-deterministic B ¨uchi automata for linear temporal logic

Salomon Sickert, Javier Esparza, Stefan Jaax, and Jan K ˇret´ınsk´y. Limit-deterministic B ¨uchi automata for linear temporal logic. In Swarat Chaudhuri and Azadeh Farzan, editors, Computer Aided Veriﬁ- cation, pages 312–332, Cham, 2016. Springer International Publishing

work page 2016

[29] [29]

Logically-Constrained Reinforcement Learning

Mohammadhosein Hasanbeig, Alessandro Abate, and Daniel Kroen- ing. Logically-constrained reinforcement learning. arXiv:1801.08099 [cs.LG], 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[30] [30]

Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L

Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. Pac model-free reinforcement learning. In Pro- ceedings of the 23rd International Conference on Machine Learning , ICML 06, page 881888, New York, NY , USA, 2006. Association for Computing Machinery

work page 2006

[31] [31]

Reinforcement Learning: An Introduction

Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 2nd edition, 2018

work page 2018

[32] [32]

Jordan, and Satinder P

Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. Conver- gence of stochastic iterative dynamic programming algorithms. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6 , pages 703–710. Morgan- Kaufmann, 1994

work page 1994

[33] [33]

Rabinizer 4: From LTL to your favourite deterministic automaton

Jan K ˇret´ınsk´y, Tobias Meggendorfer, Salomon Sickert, and Christo- pher Ziegler. Rabinizer 4: From LTL to your favourite deterministic automaton. In Hana Chockler and Georg Weissenbacher, editors, Computer Aided Veriﬁcation , pages 567–577, Cham, 2018. Springer International Publishing

work page 2018

[34] [34]

A. K. Bozkurt, Y . Wang, M. M. Zavlanos, and M. Pajic. CSRL, 2019. https://gitlab.oit.duke.edu/cpsl/csrl

work page 2019

[35] [35]

Kress-Gazit, G

H. Kress-Gazit, G. E. Fainekos, and G. J. Pappas. Where’s Waldo? sensor-based temporal logic motion planning. In Proceedings 2007 IEEE International Conference on Robotics and Automation , pages 3116–3121, April 2007

work page 2007