
arxiv: 2606.00838 · v1 · pith:BOEINB4Onew · submitted 2026-05-30 · 💻 cs.AI

Decoupled Behavioral Cloning for Scalable Inductive Generalization in RL from Specifications

Pith reviewed 2026-06-28 18:24 UTC · model grok-4.3

classification 💻 cs.AI
keywords inductive generalization · behavioral cloning · reinforcement learning · policy evolution · zero-shot generalization · meta reinforcement learning · scalability

The pith

Decoupling per-task RL from evolution-function learning replaces noisy aggregated rewards with dense supervision, improving both training stability and zero-shot generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Inductive generalization requires that related tasks induce related policies, captured by a higher-order evolution function. Earlier approaches train this function end-to-end with RL, so growing task counts produce aggregated reward signals that are noisy and contradictory, destabilizing optimization and weakening transfer. DIBS first trains ordinary teacher policies for each task with standard RL, then fits the evolution function by behavioral cloning on the state-action pairs those teachers produce. The change supplies stable, dense labels instead of conflicting scalar rewards. The resulting method trains more reliably and transfers better to unseen tasks than prior RL and meta-RL baselines.

Core claim

DIBS decouples the learning of task-specific policies from the learning of the policy-evolution function: individual teacher policies are obtained via ordinary per-task RL, after which the evolution function is fit by behavioral cloning on the teacher-generated state-action pairs, substituting dense stable supervision for noisy aggregated rewards.
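The two-stage structure can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the paper's implementation: the teachers are hand-written linear controllers standing in for policies obtained by per-task RL, the evolution function is taken to be linear in the task index, and the names `teacher_action`, `features`, `fit_evolution`, and `student_action` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
GAIN = 0.5                      # assumed controller gain
G0 = np.array([1.0, 0.0])       # assumed goal of task 0
STEP = np.array([0.5, 0.25])    # assumed goal shift per task index


def teacher_action(states, i):
    """Stage 1 stand-in: a fixed controller replaces a teacher learned by RL."""
    return GAIN * (G0 + i * STEP - states)


def features(states, i):
    """Features of a policy whose weights vary linearly with the index i."""
    base = np.hstack([states, np.ones((len(states), 1))])
    return np.hstack([base, i * base])


def fit_evolution(train_indices, n_per_task=200):
    """Stage 2: behavioral cloning on teacher-labeled state-action pairs."""
    X, Y = [], []
    for i in train_indices:
        states = rng.uniform(-2.0, 2.0, size=(n_per_task, 2))
        X.append(features(states, i))
        Y.append(teacher_action(states, i))
    W, *_ = np.linalg.lstsq(np.vstack(X), np.vstack(Y), rcond=None)
    return W


def student_action(W, state, i):
    """Policy induced by the fitted evolution function for task index i."""
    return features(state[None, :], i)[0] @ W


W = fit_evolution([0, 2, 4])
probe = np.array([0.3, -1.2])
# Indices 1 and 3 were never used for fitting.
zero_shot_gap = max(
    np.abs(student_action(W, probe, i) - teacher_action(probe, i)).max()
    for i in (1, 3)
)
print("largest action gap on unseen indices:", zero_shot_gap)
```

In this toy setting the teachers happen to lie exactly in the student's function class, so cloning recovers them on unseen indices; real task families would not be this forgiving.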

What carries the argument

The policy-evolution function fitted by behavioral cloning on teacher-labeled state-action pairs.
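One generic way to write such a cloning objective is sketched below. The notation is assumed for illustration and is not taken from the paper: $\pi_i$ is the teacher for task index $i$, $D_i$ is a set of states labeled with teacher actions, $F_\theta$ is the evolution function with parameters $\theta$ (so $F_\theta(i)$ is the policy it induces for index $i$), and $\ell$ is a per-sample imitation loss such as squared error.

```latex
\[
  \hat{\theta}
  \;=\;
  \arg\min_{\theta}
  \sum_{i \in \mathrm{Train}}
  \frac{1}{|D_i|}
  \sum_{(s,\,a) \in D_i}
  \ell\bigl(F_\theta(i)(s),\, a\bigr),
  \qquad
  a = \pi_i(s).
\]
```

Every term is a supervised target for a single state, which is the sense in which the supervision is dense compared with a scalar return aggregated across tasks.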

If this is right

  • Training stability holds as the number of tasks grows because reward aggregation is removed.
  • Zero-shot success on new tasks rises because the evolution function receives cleaner training data.
  • The method applies to any inductive family once per-task teachers can be obtained.
  • It outperforms both direct RL and meta-RL baselines on the reported metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decoupling could be tested in settings where the evolution function must be learned from partial rather than full trajectories.
  • If teacher policies are allowed to share parameters, the approach might further reduce total compute while preserving the dense-supervision benefit.
  • The framework might extend to other higher-order structures whose direct RL training suffers from reward interference.

Load-bearing premise

The separately trained teacher policies must be of sufficient quality and coverage to supply unbiased, dense supervision for the evolution function.

What would settle it

Run the same suite of tasks with deliberately degraded teacher policies (e.g., early-stopped or low-coverage RL) and measure whether DIBS still outperforms the RL and meta-RL baselines on stability and zero-shot success.
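A toy harness for this kind of test might look like the following. It is a sketch under loud assumptions: teacher degradation is modeled as Gaussian noise on the action labels (a stand-in for early-stopped RL, not the paper's protocol), teachers are hand-written linear controllers, the student is a least-squares fit that is linear in the task index, and `zero_shot_error` is a hypothetical helper.

```python
import numpy as np


def zero_shot_error(noise_std, train=(0, 2, 4), unseen=(1, 3), n=200, seed=0):
    """Fit a student on noisy teacher labels; score it on unseen indices."""
    rng = np.random.default_rng(seed)
    gain, g0, step = 0.5, np.array([1.0, 0.0]), np.array([0.5, 0.25])

    def teacher(states, i):
        # Clean teacher: proportional controller toward an index-shifted goal.
        return gain * (g0 + i * step - states)

    def feats(states, i):
        base = np.hstack([states, np.ones((len(states), 1))])
        return np.hstack([base, i * base])

    X, Y = [], []
    for i in train:
        states = rng.uniform(-2.0, 2.0, size=(n, 2))
        noise = rng.normal(size=(n, 2))
        X.append(feats(states, i))
        # Degraded teacher: clean action plus scaled label noise (assumption).
        Y.append(teacher(states, i) + noise_std * noise)
    W, *_ = np.linalg.lstsq(np.vstack(X), np.vstack(Y), rcond=None)

    probe = rng.uniform(-2.0, 2.0, size=(100, 2))
    # Mean absolute action error against the clean teacher, worst unseen index.
    return max(np.abs(feats(probe, i) @ W - teacher(probe, i)).mean()
               for i in unseen)


levels = (0.0, 0.1, 0.5)
gaps = [zero_shot_error(s) for s in levels]
for s, g in zip(levels, gaps):
    print(f"label noise {s:.1f} -> zero-shot action error {g:.4f}")
```

The real experiment would replace the noise model with actual early-stopped or low-coverage teachers and compare against the RL and meta-RL baselines on the paper's own metrics.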

Figures

Figures reproduced from arXiv: 2606.00838 by Subhajit Roy, Suguman Bansal, Vignesh Subramanian.

Figure 1
Figure 1: Left column: Benchmark illustrations. (a) Reacher benchmark environment. (b) Inductive task family in the Reacher benchmark. Right column: Contrasting flow diagrams. (c) GenRL flow diagram. (d) DIBS flow diagram. … learning for training tasks is tightly coupled with κ-coefficient training. As the number of training tasks grows, aggregated reward feedback becomes noisy and conflicting, making the training loo…
Figure 2
Figure 2a illustrates an inductive task family in a 2D plane. In each task instance …
Figure 2
Figure 2: Choice Benchmark in Cartesian 2D Plane. (a) Illustration of the Choice (1-level): from initial region, reach either g1 or g2 then reach the final goal; initial region and the goal shift across indices. (b) Trajectories produced by DIBS: blue denotes Train tasks and red denotes Unseen tasks. … Given an inductive task family $R = \{R_i\}_{i=0}^{L}$ and training indices $\mathrm{Train} \subseteq \{0, \dots, L\}$ with $0, L \in \mathrm{Train}$, the goal…
Figure 3
Figure 3: Scalability plots. Each row corresponds to one environment on k-reachability tasks. Column (a) shows the successful training ratio (y-axis) versus number of training tasks |Train| (x-axis) for k = 5. Column (b) shows the zero-shot generalization ratio (y-axis) versus number of training tasks |Train| (x-axis) for k = 5. Column (c) shows absolute zero-shot generalization (y-axis) for k ∈ {1, …, 5} (x-ax…
Figure 4
Figure 4: REACHER task variants and zero-shot generalization results. Top row: specification illustrations for the four REACHER tower pick-and-place variants. Bottom row: absolute zero-shot generalization results for the 8-block and 10-block tower settings. The y-axis in the bottom-row plots shows absolute zero-shot generalization, and the x-axis denotes the four Reacher variants shown in the top row. … degrade relati…
Figure 5
Figure 5: Agent models and their dynamics. …cient coverage for Stage 2. We similarly choose the cross-index regularization weight by sweeping $\lambda_x \in \{0, 10^{-4}, 10^{-3}, 10^{-2}\}$ and selecting the largest value that reduces cross-index teacher drift without degrading teacher specification satisfaction. Dataset aggregation settings. For each $i \in \mathrm{Train}$, we construct a preliminary candidate set $D_i$ by sampling states from the co…
Figure 6
Figure 6: Specification illustrations for n-REACHABILITY and n-REACHABILITY+OBS. (a) CHOICE(L) specification illustration where l = 1 (b) CHOICE(L) specification illustration where l = 2
Figure 7
Figure 7: CHOICE(L) specification illustrations for two levels. This specification adds a safety requirement to n-REACHABILITY. The first line is the same ordered sequence of reach goals. The second line, ensuring (avoid (obs)), requires that the agent avoid the obstacle region obs at all timesteps during execution. G.3 CHOICE(l) (…
Figure 8
Figure 8: Scalability plots for CAR2D k-Reachability with obstacles. Column (a) shows the successful training ratio (y-axis) versus the number of training tasks |Train| (x-axis) for k = 5. Column (b) shows the zero-shot generalization ratio (y-axis) versus the number of training tasks |Train| (x-axis) for k = 5. Column (c) shows absolute zero-shot generalization (y-axis) for k ∈ {1, …, 5} (x-axis). We report me…
Figure 13
Figure 13: In these experiments, only GenRL and DIBS achieve non-trivial performance, since the other baselines do not have an explicit branching mechanism and therefore struggle to solve the structured choice problem. Across both branching levels, DIBS consistently outperforms GenRL, showing that the decoupled imitation-based template learning procedure is more effective for handling branching task structure and ge…
Figure 9
Figure 9: Scalability plots for k = 4 reachability. Each row corresponds to one environment on k-reachability tasks. Column (a) shows the successful training ratio (y-axis) versus the number of training tasks |Train| (x-axis) for k = 4. Column (b) shows the zero-shot generalization ratio (y-axis) versus the number of training tasks |Train| (x-axis) for k = 4. We report mean (± standard deviation) performance across …
Figure 10
Figure 10: Scalability plots for k = 3 reachability. Each row corresponds to one environment on k-reachability tasks. Column (a) shows the successful training ratio (y-axis) versus the number of training tasks |Train| (x-axis) for k = 3. Column (b) shows the zero-shot generalization ratio (y-axis) versus the number of training tasks |Train| (x-axis) for k = 3. We report mean (± standard deviation) performance across…
Figure 11
Figure 11: Scalability plots for k = 2 reachability. Each row corresponds to one environment on k-reachability tasks. Column (a) shows the successful training ratio (y-axis) versus the number of training tasks |Train| (x-axis) for k = 2. Column (b) shows the zero-shot generalization ratio (y-axis) versus the number of training tasks |Train| (x-axis) for k = 2. We report mean (± standard deviation) performance across…
Figure 12
Figure 12: Scalability plots for k = 1 reachability. Each row corresponds to one environment on k-reachability tasks. Column (a) shows the successful training ratio (y-axis) versus the number of training tasks |Train| (x-axis) for k = 1. Column (b) shows the zero-shot generalization ratio (y-axis) versus the number of training tasks |Train| (x-axis) for k = 1. We report mean (± standard deviation) performance across…
Figure 13
Figure 13: Scalability plots for the CHOICE benchmark. Each row corresponds to one CAR2D CHOICE benchmark variant. Column (a) shows the successful training ratio (y-axis) versus the number of training tasks |Train| (x-axis). Column (b) shows the zero-shot generalization ratio (y-axis) versus the number of training tasks |Train| (x-axis). Column (c) shows absolute zero-shot generalization (y-axis). We report mean (± …
Figure 14
Figure 14: REACHER: 9 blocks
Abstract

Inductive generalization is a framework for reinforcement learning (RL) generalization in which inductively related task instances admit inductively related policies. Prior work captures this structure via a higher-order policy-evolution function learned directly with RL, but suffers from poor training scalability: as training tasks grow, aggregated reward feedback becomes noisy and conflicting, destabilizing training and weakening generalization. We propose DIBS, a decoupled behavioral cloning approach that separates learning task-specific policies from learning the evolution function. We first learn individual teacher policies per task via standard RL, then fit the evolution function via behavioral cloning on teacher-labeled state-action pairs. This replaces noisy reward aggregation with dense, stable supervision. DIBS achieves significant improvements in both training stability and zero-shot generalization against existing RL and meta-RL algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes DIBS, a decoupled behavioral cloning approach for inductive generalization in RL from specifications. It first obtains task-specific teacher policies independently via standard RL, then learns the higher-order policy-evolution function by behavioral cloning on the resulting state-action pairs. This is presented as replacing noisy aggregated reward feedback with dense supervision, yielding claimed gains in training stability and zero-shot generalization over prior RL and meta-RL methods.

Significance. If the empirical claims are substantiated, the decoupling strategy could address a scalability bottleneck in learning inductive policy structures by avoiding conflicting reward signals, offering a practical alternative for generalization in specification-based RL.

major comments (2)
  1. [Abstract] Abstract: the headline claim that 'DIBS achieves significant improvements in both training stability and zero-shot generalization' is asserted without any quantitative results, baseline details, experimental protocol, or statistical evidence, rendering the central performance claim unverifiable from the manuscript.
  2. [Method description] Method overview (abstract and skeptic note on weakest assumption): the approach depends on per-task RL producing near-optimal, high-coverage teacher policies to supply unbiased dense supervision for the behavioral cloning step; no analysis, ablation, or scaling experiments are described to validate that teacher quality remains adequate as task count or difficulty increases, which directly underpins the stability and generalization claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We respond to each major comment below.

  1. Referee: [Abstract] Abstract: the headline claim that 'DIBS achieves significant improvements in both training stability and zero-shot generalization' is asserted without any quantitative results, baseline details, experimental protocol, or statistical evidence, rendering the central performance claim unverifiable from the manuscript.

    Authors: We agree the abstract states the performance claims at a high level. The body of the paper reports the supporting quantitative results, baselines, protocols, and statistics. We will revise the abstract to include brief references to the key metrics (e.g., stability and generalization deltas) while preserving conciseness. revision: partial

  2. Referee: [Method description] Method overview (abstract and skeptic note on weakest assumption): the approach depends on per-task RL producing near-optimal, high-coverage teacher policies to supply unbiased dense supervision for the behavioral cloning step; no analysis, ablation, or scaling experiments are described to validate that teacher quality remains adequate as task count or difficulty increases, which directly underpins the stability and generalization claims.

    Authors: The referee correctly notes that the approach assumes per-task RL yields sufficiently high-quality teachers. The manuscript does not contain dedicated ablations or scaling studies that vary task count or difficulty to measure degradation in teacher quality. We will add a discussion of this assumption together with a targeted ablation on teacher quality in the revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity; algorithmic proposal is self-contained

The paper describes an algorithmic decoupling: per-task RL to obtain teacher policies, followed by behavioral cloning to fit an evolution function. No derivation, equation, or fitted quantity is shown to reduce by construction to its own inputs. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked in the provided text. The central claim rests on the empirical performance of the proposed procedure rather than a mathematical identity or renamed input. This is the expected non-finding for a methods paper whose contribution is a change in training procedure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no equations, hyperparameters, or modeling assumptions are stated that would allow identification of free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5659 in / 1076 out tokens · 19884 ms · 2026-06-28T18:24:00.278990+00:00 · methodology
