pith. machine review for the scientific record.

arxiv: 2512.17637 · v2 · submitted 2025-12-19 · 💻 cs.AI · cs.FL · cs.LO

Recognition: 2 theorem links · Lean Theorem

About Time: Model-free Reinforcement Learning with Timed Reward Machines

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 20:58 UTC · model grok-4.3

classification 💻 cs.AI · cs.FL · cs.LO
keywords reinforcement learning · reward machines · timed automata · timing constraints · model-free learning · Q-learning

The pith

Timed reward machines extend reward machines by adding timing constraints so model-free RL can learn policies that meet deadlines and penalize delays.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes timed reward machines (TRMs) to encode precise timing requirements directly into the reward structure for reinforcement learning tasks. Standard reward machines capture history dependence but cannot express constraints such as costs for late actions or bonuses for timely completion. The authors integrate TRMs into tabular Q-learning by using timed automata abstractions together with counterfactual-imagining heuristics that exploit the TRM structure. Experiments on common RL benchmarks demonstrate that the resulting policies achieve high rewards while satisfying the timing specifications under both digital and real-time interpretations.
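
To make the shape of the object concrete, here is a minimal single-clock TRM sketch in Python. It is a hypothetical illustration of the mechanics described above, not the authors' formal definition (which may allow multiple clocks and richer guards):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TimedTransition:
    """One guarded edge of the sketch TRM: fires on `symbol` when the clock
    lies in [lower, upper], emits `reward`, and optionally resets the clock."""
    source: str
    symbol: str      # label observed from the environment
    lower: float     # guard: lower <= clock
    upper: float     # guard: clock <= upper (float('inf') for no bound)
    target: str
    reward: float
    reset: bool = True

class TimedRewardMachine:
    """Minimal single-clock timed reward machine."""

    def __init__(self, initial: str, transitions: list):
        self.transitions = transitions
        self.state, self.clock = initial, 0.0

    def step(self, symbol: str, delay: float) -> float:
        """Advance time by `delay` (an integer under digital semantics, a
        real under real-time semantics), then fire the first enabled edge."""
        self.clock += delay
        for t in self.transitions:
            if (t.source == self.state and t.symbol == symbol
                    and t.lower <= self.clock <= t.upper):
                self.state = t.target
                if t.reset:
                    self.clock = 0.0
                return t.reward
        return 0.0  # no enabled edge: no reward, state unchanged

# Example objective: reward 10 for delivering within 5 time units,
# penalty -1 for delivering late (first matching edge wins at the boundary).
trm = TimedRewardMachine("u0", [
    TimedTransition("u0", "deliver", 0, 5, "u1", 10.0),
    TimedTransition("u0", "deliver", 5, float("inf"), "u1", -1.0),
])
```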

Core claim

Timed reward machines (TRMs) extend reward machines with timing constraints to support expressive, tunable reward logic that imposes costs for delays and grants rewards for timely actions. Model-free algorithms embed the TRM into learning through abstractions of timed automata and counterfactual-imagining heuristics, enabling Q-learning to produce optimal policies under digital and real-time semantics. On standard RL benchmarks the learned policies satisfy the TRM timing constraints while attaining high cumulative reward, with ablations confirming the contribution of the heuristics and comparisons showing differences across semantics.
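
The two semantics differ chiefly in how delays enter discounting. One natural formalization, consistent with the paper's trajectory notation s0 · (d0, a0) ··· (dn, an) · sn+1 and the γ = 0.9 comparison in Figure 3, though it is an assumption here rather than the paper's verbatim definition, discounts each TRM reward by the time elapsed so far rather than by the step index:

```latex
% Hypothetical timed discounting for a trajectory
% \zeta = s_0 \cdot (d_0, a_0) \cdots (d_n, a_n) \cdot s_{n+1},
% with delays d_i (integers under digital semantics, reals under
% real-time semantics) and TRM rewards r_i:
R(\zeta) \;=\; \sum_{i=0}^{n} \gamma^{T_i}\, r_i,
\qquad T_i \;=\; \sum_{j \le i} d_j .
```

Under this reading, digital and real-time runs of the same trajectory can differ in reward only through how the delays d_i are measured, which is exactly what the comparative experiments probe.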

What carries the argument

Timed reward machines (TRMs): reward machines augmented with timing constraints, integrated into model-free learning via timed automata abstractions and counterfactual-imagining heuristics.

If this is right

  • Agents incur explicit costs for delays and receive rewards only for actions completed within TRM-specified time bounds.
  • Tabular Q-learning produces policies that satisfy the timing constraints while maximizing the TRM-defined reward (see the sketch after this list).
  • The same learning procedure works under both digital-clock and real-time interpretations of the TRM.
  • Counterfactual-imagining heuristics measurably accelerate search and improve final policy quality.
  • Comparative runs reveal performance differences between the two TRM semantics on identical tasks.
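
As a concrete reading of the tabular claim, here is a minimal sketch of Q-learning over the product of the environment state and a TRM configuration, reusing the hypothetical TimedRewardMachine above. The env API (reset/step returning a label and a delay) is assumed for illustration and is not the paper's interface:

```python
import random
from collections import defaultdict

def q_learning_with_trm(env, trm_factory, episodes=500,
                        alpha=0.1, gamma=0.9, epsilon=0.1, clock_cap=10):
    """Tabular Q-learning on product states (obs, trm_state, clock_abstraction).
    A sketch: the paper abstracts clocks via timed automata (e.g. regions);
    here naive integer truncation merely keeps the table finite."""
    Q = defaultdict(float)

    def config(trm):
        return (trm.state, min(int(trm.clock), clock_cap))

    for _ in range(episodes):
        obs, trm, done = env.reset(), trm_factory(), False
        while not done:
            s = (obs, config(trm))
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda x: Q[(s, x)])
            obs2, label, delay, done = env.step(a)   # assumed interface
            r = trm.step(label, delay)               # reward comes from the TRM
            s2 = (obs2, config(trm))
            target = r if done else r + gamma * max(Q[(s2, x)] for x in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            obs = obs2
    return Q
```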

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The TRM formalism could be lifted to deep RL settings that replace tabular Q-learning with neural function approximators.
  • Similar timing extensions might be applied to other formal reward or option frameworks that currently lack explicit clocks.
  • Safety-critical domains with hard deadlines could use TRMs to encode temporal requirements without shifting to model-based planning.

Load-bearing premise

The timed automata abstractions combined with counterfactual-imagining heuristics improve learning without introducing bias or instability in the model-free setting.
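
The paper only names the heuristic here, but in the reward-machine literature counterfactual experiences (Icarte et al., reference 27 below) replay each real transition through every machine state to generate extra updates. A sketch of that idea lifted to TRM configurations, with `all_configs` and `at` as hypothetical helpers for enumerating and instantiating configurations:

```python
def counterfactual_updates(Q, env, trm_template, transition,
                           alpha=0.1, gamma=0.9, clock_cap=10):
    """Imagine one real transition (obs, a, obs2, label, delay, done) from
    every TRM configuration and update each product state. Modeled on
    counterfactual experiences for reward machines; the paper's
    TRM-specific heuristic may differ in detail."""
    obs, a, obs2, label, delay, done = transition
    for trm_state, clock_abs in trm_template.all_configs():  # hypothetical
        trm = trm_template.at(trm_state, clock_abs)          # hypothetical
        r = trm.step(label, delay)
        s = (obs, (trm_state, clock_abs))
        s2 = (obs2, (trm.state, min(int(trm.clock), clock_cap)))
        target = r if done else r + gamma * max(Q[(s2, x)] for x in env.actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
```

The bias question the premise raises is whether these imagined transitions are distributed like real ones; for plain reward machines they are, because the machine's dynamics are deterministic given the label, and the analogous argument must survive the clock abstraction here.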

What would settle it

A standard RL benchmark in which the TRM-guided Q-learning agent violates the specified timing constraints on a majority of episodes or obtains substantially lower reward than a baseline using ordinary reward machines.

Figures

Figures reproduced from arXiv: 2512.17637 by Anirban Majumdar, David Parker, Marta Kwiatkowska, Rajarshi Roy, Ritam Raha.

Figure 1. An illustration of TRM on the Gym Taxi domain: (a) Taxi Domain example, with a passenger …
Figure 2. Environment (above) along with TRM objective (below). The cost function …
Figure 3. Trajectories ζ1 and ζ2 with induced words, TRM runs, and discounted rewards for digital and real-time settings (γ = 0.9).
Figure 4. TRM (on the left) and MDP (on the right) illustrating agent behavior in the real-time …
Figure 5. Gym environments used in experiments: (a) discounted reward comparison and (b) episode time comparison, each for Taxi Domain on TRM1 and Frozen Lake on TRM2.
Figure 6. RQ1: Performance gain from counterfactual imagining in the digital and real-time settings …
Figure 7. RQ2: Performance difference for various timed interpretations.
Figure 8. TRMs used in the experimental evaluations.
read the original abstract

Reward specification plays a central role in reinforcement learning (RL), guiding the agent's behavior. To express non-Markovian rewards, formalisms such as reward machines have been introduced to capture dependencies on histories. However, traditional reward machines lack the ability to model precise timing constraints, limiting their use in time-sensitive applications. In this paper, we propose timed reward machines (TRMs), which are an extension of reward machines that incorporate timing constraints into the reward structure. TRMs enable more expressive specifications with tunable reward logic, for example, imposing costs for delays and granting rewards for timely actions. We study model-free RL frameworks (i.e., tabular Q-learning) for learning optimal policies with TRMs under digital and real-time semantics. Our algorithms integrate the TRM into learning via abstractions of timed automata, and employ counterfactual-imagining heuristics that exploit the structure of the TRM to improve the search. Experimentally, we demonstrate that our algorithm learns policies that achieve high rewards while satisfying the timing constraints specified by the TRM on popular RL benchmarks. Moreover, we conduct comparative studies of performance under different TRM semantics, along with ablations that highlight the benefits of counterfactual-imagining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes timed reward machines (TRMs) as an extension of reward machines incorporating timing constraints into reward structures. It develops model-free tabular Q-learning algorithms that integrate TRMs via timed automata abstractions and counterfactual-imagining heuristics, studies them under digital and real-time semantics, and evaluates them experimentally on RL benchmarks, showing high reward while satisfying the timing constraints.

Significance. If the central claims hold, the work would meaningfully advance expressive, time-sensitive reward specification in model-free RL by bridging timed automata with RL, enabling tunable delay costs and timely rewards. The use of formal abstractions plus structure-exploiting heuristics, together with comparative studies and ablations, provides a concrete foundation for further development in time-critical domains.

major comments (2)
  1. [Algorithms section (integration via timed automata abstractions)] The integration of region-graph abstractions of timed automata directly into the Q-table (described in the algorithms section following the TRM definition) lacks any bound or empirical report on the resulting state cardinality. Region graphs are known to be exponential in the number of clocks and the largest constant in the guards; without this information the claim that the approach preserves a practical model-free tabular setting cannot be assessed (see the region-count sketch after this list).
  2. [Experimental evaluation and ablations] The experimental evaluation reports performance gains but does not include the effective state-space sizes after abstraction for any benchmark; this datum is load-bearing for distinguishing whether observed improvements stem from TRM timing semantics or from the counterfactual heuristic compensating for discretization artifacts.
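
For concreteness, the standard Alur–Dill bound the referee alludes to can be computed directly. A small sketch, using the classic worst-case region count rather than anything reported in the paper:

```python
from math import factorial

def region_bound(num_clocks: int, max_consts: list) -> int:
    """Worst-case number of clock regions (Alur & Dill, 1994):
    |X|! * 2^|X| * prod_x (2*c_x + 2), where c_x is the largest constant
    clock x is compared against. An upper bound, not the reachable count."""
    assert len(max_consts) == num_clocks
    bound = factorial(num_clocks) * 2 ** num_clocks
    for c in max_consts:
        bound *= 2 * c + 2
    return bound

print(region_bound(1, [5]))     # one clock, max constant 5  -> 24
print(region_bound(2, [5, 5]))  # two clocks, max constant 5 -> 1152
```

In the one-to-two-clock regime the rebuttal describes below, the bound stays comfortably tabular; the referee's point is that the manuscript should report such numbers rather than leave them implicit.
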
minor comments (2)
  1. [Abstract] The abstract refers to 'popular RL benchmarks' without naming them or providing a brief characterization; this should be expanded for immediate clarity.
  2. [Preliminaries and notation] Notation for digital versus real-time semantics should be introduced once and used consistently in all subsequent sections and figures.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and will revise the manuscript to incorporate the requested empirical information on state cardinalities.

read point-by-point responses
  1. Referee: [Algorithms section (integration via timed automata abstractions)] The integration of region-graph abstractions of timed automata directly into the Q-table (described in the algorithms section following the TRM definition) lacks any bound or empirical report on the resulting state cardinality. Region graphs are known to be exponential in the number of clocks and the largest constant in the guards; without this information the claim that the approach preserves a practical model-free tabular setting cannot be assessed.

    Authors: We agree that the manuscript currently lacks explicit bounds or empirical reports on the state cardinality after region-graph abstraction. While the exponential dependence on the number of clocks and guard constants is a known property of region graphs, the TRMs in our benchmarks use only one or two clocks with small constants, keeping the resulting state spaces practical for tabular Q-learning. In the revised manuscript we will add a table (or section) reporting the effective number of states in the abstracted Q-table for every benchmark, together with the number of clocks and maximum constants appearing in each TRM. revision: yes

  2. Referee: [Experimental evaluation and ablations] The experimental evaluation reports performance gains but does not include the effective state-space sizes after abstraction for any benchmark; this datum is load-bearing for distinguishing whether observed improvements stem from TRM timing semantics or from the counterfactual heuristic compensating for discretization artifacts.

    Authors: We concur that reporting the post-abstraction state-space sizes is essential for interpreting the source of the observed gains. The counterfactual-imagining heuristic exploits TRM structure, yet without the cardinality data it is difficult to separate the contribution of the timing semantics from any effect of the heuristic on discretization. In the revised version we will include the effective state-space sizes for all benchmarks in the experimental section (and in the ablations) so that readers can directly assess practicality and the origin of the performance differences. revision: yes

Circularity Check

0 steps flagged

No significant circularity: new formalism and algorithms introduced without reduction to fitted inputs or self-citation chains

full rationale

The paper proposes timed reward machines (TRMs) as an extension of reward machines incorporating timing constraints, then presents model-free algorithms that integrate timed automata abstractions and counterfactual heuristics into tabular Q-learning. No equations or central claims reduce by construction to prior fitted parameters, self-defined quantities, or load-bearing self-citations; the derivation chain consists of definitional extensions followed by empirical validation on benchmarks. The approach is self-contained against external benchmarks, with performance claims resting on experimental outcomes rather than tautological mappings from inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard RL assumptions plus the new TRM formalism; no free parameters or invented entities beyond the proposed extension are described in the abstract.

axioms (1)
  • domain assumption Standard model-free RL convergence assumptions hold when the TRM is integrated via timed automata abstractions.
    Invoked implicitly when claiming the algorithms learn optimal policies.
invented entities (1)
  • Timed Reward Machine (TRM) no independent evidence
    purpose: To extend reward machines with timing constraints for expressive reward specifications.
    New formalism introduced in the paper; no independent evidence provided beyond the proposal itself.
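
The single axiom is essentially the Watkins–Dayan condition (reference 38 below): tabular Q-learning converges to the optimal values provided every product state–action pair is visited infinitely often and the step sizes satisfy the Robbins–Monro conditions. A minimal statement, standard rather than paper-specific:

```latex
% Tabular Q-learning update (Watkins & Dayan, 1992):
Q_{t+1}(s,a) = Q_t(s,a) + \alpha_t(s,a)
  \bigl[ r_t + \gamma \max_{a'} Q_t(s_{t+1},a') - Q_t(s,a) \bigr],
% which converges to Q^* with probability 1 whenever
\sum_t \alpha_t(s,a) = \infty, \qquad \sum_t \alpha_t(s,a)^2 < \infty
\quad \text{for all } (s,a).
```

What the TRM integration must preserve is that the abstracted product process remains an MDP over finitely many states, so that this theorem still applies.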

pith-pipeline@v0.9.0 · 5522 in / 1146 out tokens · 22522 ms · 2026-05-16T20:58:44.460629+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 1 internal anchor

  1. Alur, R., Courcoubetis, C., Dill, D.L., Halbwachs, N., Wong-Toi, H.: An implementation of three algorithms for timing verification based on automata emptiness. In: RTSS. pp. 157–166. IEEE Computer Society (1992)
  2. Alur, R., Dill, D.L.: A theory of timed automata. Theor. Comput. Sci. 126(2), 183–235 (1994)
  3. Alur, R., La Torre, S., Pappas, G.J.: Optimal paths in weighted timed automata. In: HSCC. Lecture Notes in Computer Science, vol. 2034, pp. 49–62. Springer (2001)
  4. Behrmann, G., Cougnard, A., David, A., Fleury, E., Larsen, K.G., Lime, D.: Uppaal-tiga: Time for playing games! In: CAV. Lecture Notes in Computer Science, vol. 4590, pp. 121–125. Springer (2007)
  5. Behrmann, G., Fehnker, A., Hune, T., Larsen, K.G., Pettersson, P., Romijn, J., Vaandrager, F.W.: Minimum-cost reachability for priced timed automata. In: HSCC. Lecture Notes in Computer Science, vol. 2034, pp. 147–161. Springer (2001)
  6. Behrmann, G., Larsen, K.G., Rasmussen, J.I.: Priced timed automata: Algorithms and applications. In: FMCO. Lecture Notes in Computer Science, vol. 3657, pp. 162–182. Springer (2004)
  7. Bouyer, P., Brinksma, E., Larsen, K.G.: Staying alive as cheaply as possible. In: HSCC. Lecture Notes in Computer Science, vol. 2993, pp. 203–218. Springer (2004)
  8. Bouyer, P., Brinksma, E., Larsen, K.G.: Optimal infinite scheduling for multi-priced timed automata. Formal Methods Syst. Des. 32(1), 3–23 (2008)
  9. Bouyer, P., Cassez, F., Fleury, E., Larsen, K.G.: Optimal strategies in priced timed game automata. In: FSTTCS. Lecture Notes in Computer Science, vol. 3328, pp. 148–160. Springer (2004)
  10. Bozga, M., Daws, C., Maler, O., Olivero, A., Tripakis, S., Yovine, S.: Kronos: A model-checking tool for real-time systems. In: CAV. Lecture Notes in Computer Science, vol. 1427, pp. 546–550. Springer (1998)
  11. Bozkurt, A.K., Wang, Y., Zavlanos, M.M., Pajic, M.: Control synthesis from linear temporal logic specifications using model-free reinforcement learning. In: 2020 IEEE International Conference on Robotics and Automation, ICRA 2020, Paris, France, May 31 - August 31, 2020. pp. 10349–10355. IEEE (2020). https://doi.org/10.1109/ICRA40945.2020.9196796
  12. Camacho, A., Icarte, R.T., Klassen, T.Q., Valenzano, R.A., McIlraith, S.A.: LTL and beyond: Formal languages for reward function specification in reinforcement learning. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019. pp. 6065–6073. ijcai.org (2019)
  13. Cohen, M.H., Belta, C.: Model-based reinforcement learning for approximate optimal control with temporal logic specifications. In: HSCC. pp. 12:1–12:11. ACM (2021)
  14. Corazza, J., Gavran, I., Neider, D.: Reinforcement learning with stochastic reward machines. In: AAAI. pp. 6429–6436. AAAI Press (2022)
  15. David, A., Jensen, P.G., Larsen, K.G., Mikucionis, M., Taankvist, J.H.: Uppaal Stratego. In: TACAS. Lecture Notes in Computer Science, vol. 9035, pp. 206–211. Springer (2015)
  16. Dole, K., Gupta, A., Komp, J., Krishna, S., Trivedi, A.: Event-triggered and time-triggered duration calculus for model-free reinforcement learning. In: RTSS. pp. 240–252. IEEE (2021)
  17. Dole, K., Gupta, A., Komp, J., Krishna, S., Trivedi, A.: Correct-by-construction reinforcement learning of cardiac pacemakers from duration calculus requirements. In: AAAI. pp. 14792–14800. AAAI Press (2023)
  18. Falah, A., Guha, S., Trivedi, A.: Reinforcement learning for omega-regular specifications on continuous-time MDP. In: ICAPS. pp. 578–586. AAAI Press (2023)
  19. Falah, A., Guha, S., Trivedi, A.: Continuous-time reward machines. In: IJCAI. pp. 5056–5064. ijcai.org (2025)
  20. Fujimoto, S., van Hoof, H., Meger, D.: Addressing function approximation error in actor-critic methods. In: ICML. Proceedings of Machine Learning Research, vol. 80, pp. 1582–1591. PMLR (2018)
  21. Hahn, E.M., Perez, M., Schewe, S., Somenzi, F., Trivedi, A., Wojtczak, D.: Omega-regular reward machines. In: ECAI. Frontiers in Artificial Intelligence and Applications, vol. 372, pp. 972–979. IOS Press (2023)
  22. Hasanbeig, M., Kantaros, Y., Abate, A., Kroening, D., Pappas, G.J., Lee, I.: Reinforcement learning for temporal logic control synthesis with probabilistic satisfaction guarantees. In: 58th IEEE Conference on Decision and Control, CDC 2019, Nice, France, December 11-13, 2019. pp. 5338–5343. IEEE (2019). https://doi.org/10.1109/CDC40024.2019.9028919
  23. Henzinger, T.A., Manna, Z., Pnueli, A.: What good are digital clocks? In: ICALP. Lecture Notes in Computer Science, vol. 623, pp. 545–558. Springer (1992)
  24. Icarte, R.T.: Reward Machines. Ph.D. thesis, University of Toronto, Canada (2022)
  25. Icarte, R.T., Klassen, T.: Reward machines. GitHub repository, https://github.com/RodrigoToroIcarte/reward_machines (2018)
  26. Icarte, R.T., Klassen, T.Q., Valenzano, R.A., McIlraith, S.A.: Using reward machines for high-level task specification and decomposition in reinforcement learning. In: ICML. Proceedings of Machine Learning Research, vol. 80, pp. 2112–2121. PMLR (2018)
  27. Icarte, R.T., Klassen, T.Q., Valenzano, R.A., McIlraith, S.A.: Reward machines: Exploiting reward function structure in reinforcement learning. J. Artif. Intell. Res. 73, 173–208 (2022). https://doi.org/10.1613/jair.1.12440
  28. Icarte, R.T., Waldie, E., Klassen, T.Q., Valenzano, R.A., Castro, M.P., McIlraith, S.A.: Learning reward machines for partially observable reinforcement learning. In: NeurIPS. pp. 15497–15508 (2019)
  29. Jothimurugan, K., Bansal, S., Bastani, O., Alur, R.: Compositional reinforcement learning from logical specifications. In: NeurIPS. pp. 10026–10039 (2021)
  30. Kane, A., Chowdhury, O., Datta, A., Koopman, P.: A case study on runtime monitoring of an autonomous research vehicle (ARV) system. In: RV. Lecture Notes in Computer Science, vol. 9333, pp. 102–117. Springer (2015)
  31. Larsen, K.G., Pettersson, P., Yi, W.: UPPAAL in a nutshell. Int. J. Softw. Tools Technol. Transf. 1(1-2), 134–152 (1997)
  32. Neary, C., Xu, Z., Wu, B., Topcu, U.: Reward machines for cooperative multi-agent reinforcement learning. In: AAMAS. pp. 934–942. ACM (2021)
  33. Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics, Wiley (1994)
  34. Shao, D., Kwiatkowska, M.: Sample efficient model-free reinforcement learning from LTL specifications with optimality guarantees. In: IJCAI. pp. 4180–4189. ijcai.org (2023)
  35. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, 2nd edn. MIT Press (2018)
  36. Tollund, R.G., Johansen, N.S., Nielsen, K.Ø., Torralba, Á., Larsen, K.G.: Optimal infinite temporal planning: Cyclic plans for priced timed automata. In: ICAPS. pp. 588–596. AAAI Press (2024)
  37. Towers, M., Kwiatkowski, A., Terry, J., Balis, J.U., De Cola, G., Deleu, T., Goulão, M., Kallinteris, A., Krimmel, M., KG, A., Perez-Vicente, R., Pierré, A., Schulhoff, S., Tai, J.J., Tan, H., Younis, O.G.: Gymnasium: A standard interface for reinforcement learning environments. arXiv preprint arXiv:2407.17032 (2024)
  38. Watkins, C.J.C.H., Dayan, P.: Technical note: Q-learning. Mach. Learn. 8, 279–292 (1992)
  39. Xu, Z., Topcu, U.: Transfer of temporal logic formulas in reinforcement learning. In: IJCAI. pp. 4010–4018. ijcai.org (2019)