Decoupled Behavioral Cloning for Scalable Inductive Generalization in RL from Specifications

Subhajit Roy; Suguman Bansal; Vignesh Subramanian

arxiv: 2606.00838 · v1 · pith:BOEINB4Onew · submitted 2026-05-30 · 💻 cs.AI

Pith reviewed 2026-06-28 18:24 UTC · model grok-4.3

classification 💻 cs.AI

keywords inductive generalization · behavioral cloning · reinforcement learning · policy evolution · zero-shot generalization · meta reinforcement learning · scalability

The pith

Decoupling per-task RL from evolution-function learning replaces noisy rewards with dense supervision and improves stability plus zero-shot generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Inductive generalization requires that related tasks induce related policies, captured by a higher-order evolution function. Earlier approaches train this function end-to-end with RL, so growing task counts produce aggregated reward signals that are noisy and contradictory, destabilizing optimization and weakening transfer. DIBS first trains ordinary teacher policies for each task with standard RL, then fits the evolution function by behavioral cloning on the state-action pairs those teachers produce. The change supplies stable, dense labels instead of conflicting scalar rewards. The resulting method trains more reliably and transfers better to unseen tasks than prior RL and meta-RL baselines.
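The two stages described above can be sketched on a toy problem. This is a minimal illustration, not the paper's implementation: it assumes linear teacher policies whose weight matrix varies linearly with the task index (W_i = A + i·B), replaces the Stage 1 RL run with a closed-form stand-in teacher, and fits the evolution function by least-squares behavioral cloning. The names `teacher_action`, `fit_evolution`, and `evolved_policy`, and the linear parameterization itself, are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM, L = 3, 2, 8

# Hypothetical ground truth: the teacher for task i uses W_i = A_true + i * B_true.
A_true = rng.normal(size=(ACTION_DIM, STATE_DIM))
B_true = rng.normal(size=(ACTION_DIM, STATE_DIM))


def teacher_action(i, states):
    """Stand-in for a Stage 1 teacher (assumed already trained by per-task RL)."""
    return states @ (A_true + i * B_true).T


def collect_dataset(train_indices, n_per_task=200):
    """Gather (task index, state, teacher action) triples for Stage 2."""
    idx, states, actions = [], [], []
    for i in train_indices:
        s = rng.normal(size=(n_per_task, STATE_DIM))
        idx.append(np.full(n_per_task, i))
        states.append(s)
        actions.append(teacher_action(i, s))
    return np.concatenate(idx), np.vstack(states), np.vstack(actions)


def fit_evolution(idx, states, actions):
    """Behavioral cloning by least squares on the model a = A s + i * B s."""
    features = np.hstack([states, idx[:, None] * states])  # [s, i * s]
    coef, *_ = np.linalg.lstsq(features, actions, rcond=None)
    return coef[:STATE_DIM].T, coef[STATE_DIM:].T  # A_hat, B_hat


def evolved_policy(A_hat, B_hat, i):
    """Instantiate the policy for task index i from the fitted evolution function."""
    return lambda states: states @ (A_hat + i * B_hat).T


train = [0, 2, 5, L]  # training indices; includes the endpoints 0 and L
unseen = [i for i in range(L + 1) if i not in train]
A_hat, B_hat = fit_evolution(*collect_dataset(train))


def zero_shot_error(i, n=500):
    """Mean squared action error against the teacher on a task index."""
    s = rng.normal(size=(n, STATE_DIM))
    return float(np.mean((evolved_policy(A_hat, B_hat, i)(s) - teacher_action(i, s)) ** 2))


for i in unseen:
    print(f"task {i}: zero-shot action MSE = {zero_shot_error(i):.2e}")
```

Because every training sample carries a full action label, the Stage 2 fit is an ordinary supervised regression; no reward signals from different tasks are summed together. In this noiseless toy the evolution function is recovered exactly, which a real RL-trained teacher would not guarantee.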

Core claim

DIBS decouples the learning of task-specific policies from the learning of the policy-evolution function: individual teacher policies are obtained via ordinary per-task RL, after which the evolution function is fit by behavioral cloning on the teacher-generated state-action pairs, substituting dense stable supervision for noisy aggregated rewards.

What carries the argument

The policy-evolution function fitted by behavioral cloning on teacher-labeled state-action pairs.

If this is right

Training stability holds as the number of tasks grows because reward aggregation is removed.
Zero-shot success on new tasks rises because the evolution function receives cleaner training data.
The method applies to any inductive family once per-task teachers can be obtained.
It outperforms both direct RL and meta-RL baselines on the reported metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decoupling could be tested in settings where the evolution function must be learned from partial rather than full trajectories.
If teacher policies are allowed to share parameters, the approach might further reduce total compute while preserving the dense-supervision benefit.
The framework naturally extends to any higher-order structure whose direct RL training suffers from reward interference.

Load-bearing premise

The separately trained teacher policies must be of sufficient quality and coverage to supply unbiased, dense supervision for the evolution function.

What would settle it

Run the same suite of tasks with deliberately degraded teacher policies (e.g., early-stopped or low-coverage RL) and measure whether DIBS still outperforms the RL and meta-RL baselines on stability and zero-shot success.
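A scaled-down version of this check can be sketched as follows. It is only a toy under stated assumptions: teachers are linear (W_i = A + i·B), "degraded" teachers are modeled as Gaussian label noise (`noise_std`) and as a shrunken state distribution (`coverage`), and quality is measured as action error against the noise-free teacher rather than as specification satisfaction. The function `zero_shot_mse` and all of its parameters are hypothetical.

```python
import numpy as np


def zero_shot_mse(noise_std, coverage=1.0, seed=0, dim=3, L=8, n=300):
    """Fit a linear evolution function on degraded teacher labels and
    return the mean squared action error on unseen task indices."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(dim, dim))
    B = rng.normal(size=(dim, dim))
    train = [0, 3, L]

    feats, labels = [], []
    for i in train:
        s = coverage * rng.normal(size=(n, dim))         # narrower states = lower coverage
        a = s @ (A + i * B).T                            # clean teacher action
        a = a + noise_std * rng.normal(size=(n, dim))    # degraded teacher label
        feats.append(np.hstack([s, i * s]))
        labels.append(a)

    coef, *_ = np.linalg.lstsq(np.vstack(feats), np.vstack(labels), rcond=None)
    A_hat, B_hat = coef[:dim].T, coef[dim:].T

    s_test = rng.normal(size=(1000, dim))
    errors = []
    for i in range(L + 1):
        if i in train:
            continue
        pred = s_test @ (A_hat + i * B_hat).T
        target = s_test @ (A + i * B).T
        errors.append(np.mean((pred - target) ** 2))
    return float(np.mean(errors))


for std in [0.0, 0.1, 0.5, 1.0]:
    print(f"noise_std={std}: full coverage {zero_shot_mse(std):.2e}, "
          f"low coverage {zero_shot_mse(std, coverage=0.1):.2e}")
```

In this toy, error on unseen indices grows with label noise and grows further when the teacher data covers a narrower region of the state space. Whether DIBS keeps its advantage over the RL and meta-RL baselines under such degradation is exactly what the proposed experiment would have to measure on the paper's own benchmarks.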

Figures

Figures reproduced from arXiv: 2606.00838 by Subhajit Roy, Suguman Bansal, Vignesh Subramanian.

**Figure 1.** Left column: Benchmark illustrations. (a) Reacher benchmark environment. (b) Inductive task family in the Reacher benchmark. Right column: Contrasting flow diagrams. (c) GenRL flow diagram. (d) DIBS flow diagram. … learning for training tasks is tightly coupled with κ-coefficient training. As the number of training tasks grows, aggregated reward feedback becomes noisy and conflicting, making the training loo…

**Figure 2.** Figure 2a illustrates an inductive task family in a 2D plane. In each task instance …

**Figure 2.** Choice Benchmark in Cartesian 2D Plane. (a) Illustration of the Choice (1-level): from the initial region, reach either g1 or g2, then reach the final goal; the initial region and the goal shift across indices. (b) Trajectories produced by DIBS: blue denotes Train tasks and red denotes Unseen tasks. Given an inductive task family R = {R_i}_{i=0}^{L} and training indices Train ⊆ {0, …, L} with 0, L ∈ Train, the goal…

**Figure 3.** Scalability plots. Each row corresponds to one environment on k-reachability tasks. Column (a) shows the successful training ratio (y-axis) versus number of training tasks |Train| (x-axis) for k = 5. Column (b) shows the zero-shot generalization ratio (y-axis) versus number of training tasks |Train| (x-axis) for k = 5. Column (c) shows absolute zero-shot generalization (y-axis) for k ∈ {1, …, 5} (x-ax…

**Figure 4.** REACHER task variants and zero-shot generalization results. Top row: specification illustrations for the four REACHER tower pick-and-place variants. Bottom row: absolute zero-shot generalization results for the 8-block and 10-block tower settings. The y-axis in the bottom-row plots shows absolute zero-shot generalization, and the x-axis denotes the four Reacher variants shown in the top row.

**Figure 5.** Agent models and their dynamics. We similarly choose the cross-index regularization weight by sweeping λ_x ∈ {0, 10⁻⁴, 10⁻³, 10⁻²} and selecting the largest value that reduces cross-index teacher drift without degrading teacher specification satisfaction. Dataset aggregation settings. For each i ∈ Train, we construct a preliminary candidate set D_i by sampling states from the co…

**Figure 6.** Specification illustrations for n-REACHABILITY and n-REACHABILITY+OBS. (a) CHOICE(L) specification illustration where l = 1. (b) CHOICE(L) specification illustration where l = 2.

**Figure 7.** CHOICE(L) specification illustrations for two levels. This specification adds a safety requirement to n-REACHABILITY. The first line is the same ordered sequence of reach goals. The second line, ensuring (avoid (obs)), requires that the agent avoid the obstacle region obs at all timesteps during execution.

**Figure 8.** Scalability plots for CAR2D k-Reachability with obstacles. Column (a) shows the successful training ratio (y-axis) versus the number of training tasks |Train| (x-axis) for k = 5. Column (b) shows the zero-shot generalization ratio (y-axis) versus the number of training tasks |Train| (x-axis) for k = 5. Column (c) shows absolute zero-shot generalization (y-axis) for k ∈ {1, …, 5} (x-axis). We report me…

**Figure 13.** In these experiments, only GenRL and DIBS achieve non-trivial performance, since the other baselines do not have an explicit branching mechanism and therefore struggle to solve the structured choice problem. Across both branching levels, DIBS consistently outperforms GenRL, showing that the decoupled imitation-based template learning procedure is more effective for handling branching task structure and ge…

**Figure 9.** Scalability plots for k = 4 reachability. Each row corresponds to one environment on k-reachability tasks. Column (a) shows the successful training ratio (y-axis) versus the number of training tasks |Train| (x-axis) for k = 4. Column (b) shows the zero-shot generalization ratio (y-axis) versus the number of training tasks |Train| (x-axis) for k = 4. We report mean (± standard deviation) performance across …

**Figure 10.** Scalability plots for k = 3 reachability. Each row corresponds to one environment on k-reachability tasks. Column (a) shows the successful training ratio (y-axis) versus the number of training tasks |Train| (x-axis) for k = 3. Column (b) shows the zero-shot generalization ratio (y-axis) versus the number of training tasks |Train| (x-axis) for k = 3. We report mean (± standard deviation) performance across…

**Figure 11.** Scalability plots for k = 2 reachability. Each row corresponds to one environment on k-reachability tasks. Column (a) shows the successful training ratio (y-axis) versus the number of training tasks |Train| (x-axis) for k = 2. Column (b) shows the zero-shot generalization ratio (y-axis) versus the number of training tasks |Train| (x-axis) for k = 2. We report mean (± standard deviation) performance across…

**Figure 12.** Scalability plots for k = 1 reachability. Each row corresponds to one environment on k-reachability tasks. Column (a) shows the successful training ratio (y-axis) versus the number of training tasks |Train| (x-axis) for k = 1. Column (b) shows the zero-shot generalization ratio (y-axis) versus the number of training tasks |Train| (x-axis) for k = 1. We report mean (± standard deviation) performance across…

**Figure 13.** Scalability plots for the CHOICE benchmark. Each row corresponds to one CAR2D CHOICE benchmark variant. Column (a) shows the successful training ratio (y-axis) versus the number of training tasks |Train| (x-axis). Column (b) shows the zero-shot generalization ratio (y-axis) versus the number of training tasks |Train| (x-axis). Column (c) shows absolute zero-shot generalization (y-axis). We report mean (± …

**Figure 14.** REACHER: 9 blocks

Original abstract

Inductive generalization is a framework for reinforcement learning (RL) generalization in which inductively related task instances admit inductively related policies. Prior work captures this structure via a higher-order policy-evolution function learned directly with RL, but suffers from poor training scalability: as training tasks grow, aggregated reward feedback becomes noisy and conflicting, destabilizing training and weakening generalization. We propose DIBS, a decoupled behavioral cloning approach that separates learning task-specific policies from learning the evolution function. We first learn individual teacher policies per task via standard RL, then fit the evolution function via behavioral cloning on teacher-labeled state-action pairs. This replaces noisy reward aggregation with dense, stable supervision. DIBS achieves significant improvements in both training stability and zero-shot generalization against existing RL and meta-RL algorithms.
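One way to write down the contrast the abstract draws is given below. The notation is assumed for illustration and is not taken from the paper: F_θ denotes the evolution function mapping a task index to a policy, π_i^T the teacher for task i, D_i the teacher-labeled states for task i, ℓ a per-sample imitation loss, and R_i the specification-derived return for task i.

```latex
% Hypothetical notation; a sketch of the two training signals, not the paper's equations.
% End-to-end RL on the evolution function: scalar returns summed over tasks.
J_{\mathrm{RL}}(\theta) \;=\; \sum_{i \in \mathrm{Train}}
  \mathbb{E}_{\tau \sim F_\theta(i)} \big[ R_i(\tau) \big]

% Decoupled behavioral cloning: a per-state supervised loss against each teacher.
\mathcal{L}_{\mathrm{BC}}(\theta) \;=\; \sum_{i \in \mathrm{Train}}
  \mathbb{E}_{s \sim D_i} \Big[ \ell\big( F_\theta(i)(s),\; \pi_i^{T}(s) \big) \Big]
```

Under this reading, the first objective supplies one scalar per trajectory and per task, while the second supplies a target action at every sampled state, which is the sense in which the supervision is dense.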

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DIBS splits per-task RL from behavioral cloning of the evolution function to sidestep noisy reward aggregation, but the abstract supplies zero numbers or checks on whether the teachers are actually good enough.

The paper's main move is to first run ordinary RL on each training task to produce teacher policies, then fit the evolution function by cloning those teachers' state-action pairs. This replaces the aggregated reward signal that prior end-to-end RL approaches use for the higher-order function. The decoupling is presented as the fix for the scalability problem that appears when the number of tasks grows and the reward feedback becomes conflicting.

The idea is concrete and directly targets a known pain point in inductive generalization setups. Using dense supervision from the teachers instead of sparse rewards is a reasonable engineering step, and the abstract frames it clearly against existing RL and meta-RL baselines for this setting.

The soft spot is the unexamined assumption that the per-task teachers will be high-quality and provide broad coverage. If any tasks are harder or have sparser rewards, the resulting teachers can be suboptimal or narrowly distributed, and the cloning step will simply reproduce those weaknesses. The abstract gives no ablations, no teacher-quality metrics, and no evidence that this holds as task count increases. All performance claims are stated without numbers, protocols, or statistical detail, so they cannot be assessed.

This is for people working on RL generalization across families of related tasks who already know the inductive setting. A referee should see the full experiments to check whether the teacher assumption actually holds and whether the reported gains survive proper controls. It is worth sending to review rather than desk-rejecting because the algorithmic change is well-specified and addresses a practical bottleneck, even if the current evidence is thin.

Referee Report

2 major / 0 minor

Summary. The paper proposes DIBS, a decoupled behavioral cloning approach for inductive generalization in RL from specifications. It first obtains task-specific teacher policies independently via standard RL, then learns the higher-order policy-evolution function by behavioral cloning on the resulting state-action pairs. This is presented as replacing noisy aggregated reward feedback with dense supervision, yielding claimed gains in training stability and zero-shot generalization over prior RL and meta-RL methods.

Significance. If the empirical claims are substantiated, the decoupling strategy could address a scalability bottleneck in learning inductive policy structures by avoiding conflicting reward signals, offering a practical alternative for generalization in specification-based RL.

major comments (2)

[Abstract] The headline claim that 'DIBS achieves significant improvements in both training stability and zero-shot generalization' is asserted without any quantitative results, baseline details, experimental protocol, or statistical evidence, rendering the central performance claim unverifiable from the manuscript.
[Method description] Method overview (abstract and skeptic note on weakest assumption): The approach depends on per-task RL producing near-optimal, high-coverage teacher policies to supply unbiased dense supervision for the behavioral cloning step. No analysis, ablation, or scaling experiments are described to validate that teacher quality remains adequate as task count or difficulty increases, which directly underpins the stability and generalization claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We respond to each major comment below.

Referee: [Abstract] The headline claim that 'DIBS achieves significant improvements in both training stability and zero-shot generalization' is asserted without any quantitative results, baseline details, experimental protocol, or statistical evidence, rendering the central performance claim unverifiable from the manuscript.

Authors: We agree the abstract states the performance claims at a high level. The body of the paper reports the supporting quantitative results, baselines, protocols, and statistics. We will revise the abstract to include brief references to the key metrics (e.g., stability and generalization deltas) while preserving conciseness.

Revision: partial

Referee: [Method description] Method overview (abstract and skeptic note on weakest assumption): The approach depends on per-task RL producing near-optimal, high-coverage teacher policies to supply unbiased dense supervision for the behavioral cloning step. No analysis, ablation, or scaling experiments are described to validate that teacher quality remains adequate as task count or difficulty increases, which directly underpins the stability and generalization claims.

Authors: The referee correctly notes that the approach assumes per-task RL yields sufficiently high-quality teachers. The manuscript does not contain dedicated ablations or scaling studies that vary task count or difficulty to measure degradation in teacher quality. We will add a discussion of this assumption together with a targeted ablation on teacher quality in the revision.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; algorithmic proposal is self-contained

The paper describes an algorithmic decoupling: per-task RL to obtain teacher policies, followed by behavioral cloning to fit an evolution function. No derivation, equation, or fitted quantity is shown to reduce by construction to its own inputs. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked in the provided text. The central claim rests on the empirical performance of the proposed procedure rather than a mathematical identity or renamed input. This is the expected non-finding for a methods paper whose contribution is a change in training procedure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no equations, hyperparameters, or modeling assumptions are stated that would allow identification of free parameters, axioms, or invented entities.

[1] [1]

Q-learning for robust satisfaction of signal temporal logic specifications

Aksaray, D., Jones, A., Kong, Z., Schwager, M., and Belta, C. Q-learning for robust satisfaction of signal temporal logic specifications. InConference on Decision and Control (CDC), pp. 6565–6570. IEEE, 2016

2016

[2] [2]

A framework for transforming specifica- tions in reinforcement learning

Alur, R., Bansal, S., Bastani, O., and Jothimurugan, K. A framework for transforming specifica- tions in reinforcement learning. InPrinciples of Systems Design: Essays Dedicated to Thomas A. Henzinger on the Occasion of His 60th Birthday, pp. 604–624. Springer, 2022

2022

[3] [3]

Specification-guided reinforcement learning.Communications of the ACM, 69(2):80–87, 2026

Alur, R., Bansal, S., Bastani, O., and Jothimurugan, K. Specification-guided reinforcement learning.Communications of the ACM, 69(2):80–87, 2026

2026

[4] [4]

Verifiable reinforcement learning via policy extraction

Bastani, O., Pu, Y ., and Solar-Lezama, A. Verifiable reinforcement learning via policy extraction. Advances in neural information processing systems, 31, 2018

2018

[5] [5]

LTLf/LDLf non-markovian rewards

Brafman, R., De Giacomo, G., and Patrizi, F. LTLf/LDLf non-markovian rewards. InProceed- ings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

2018

[6] Cao, Y., Li, Z., Yang, T., Zhang, H., Zheng, Y., Li, Y., Hao, J., and Liu, Y. GALOIS: boosting deep reinforcement learning via generalizable logic synthesis. Advances in Neural Information Processing Systems, 35:19930–19943, 2022.

[7] Cobbe, K., Klimov, O., Hesse, C., Kim, T., and Schulman, J. Quantifying generalization in reinforcement learning. In International Conference on Machine Learning, pp. 1282–1289. PMLR, 2019.

[8] Cobbe, K., Hesse, C., Hilton, J., and Schulman, J. Leveraging procedural generation to benchmark reinforcement learning. In International Conference on Machine Learning, pp. 2048–2056. PMLR, 2020.

[9] De Giacomo, G., Iocchi, L., Favorito, M., and Patrizi, F. Foundations for restraining bolts: Reinforcement learning with LTLf/LDLf restraining specifications. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 29, pp. 128–136, 2019.

[10] Dohmen, T., Perez, M., Somenzi, F., and Trivedi, A. Regular reinforcement learning. In Gurfinkel, A. and Ganesh, V. (eds.), Computer Aided Verification, pp. 184–208, Cham, 2024. Springer Nature Switzerland.

[11] Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In International Conference on Machine Learning, pp. 1407–1416. PMLR, 2018.

[12] Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126–1135. PMLR, 2017.

[13] Guo, Z., Işık, İ., Ahmad, H., and Li, W. One subgoal at a time: Zero-shot generalization to arbitrary linear temporal logic requirements in multi-task reinforcement learning. arXiv preprint arXiv:2508.01561, 2025.

[14] Hasanbeig, M., Abate, A., and Kroening, D. Logically-constrained reinforcement learning. arXiv preprint arXiv:1801.08099, 2018.

[15] Hasanbeig, M., Kantaros, Y., Abate, A., Kroening, D., Pappas, G. J., and Lee, I. Reinforcement learning for temporal logic control synthesis with probabilistic satisfaction guarantees. In Conference on Decision and Control (CDC), pp. 5338–5343, 2019.

[16] Hospedales, T., Antoniou, A., Micaelli, P., and Storkey, A. Meta-learning in neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):5149–5169, 2021.

[17] Jothimurugan, K., Alur, R., and Bastani, O. A composable specification language for reinforcement learning tasks. Advances in Neural Information Processing Systems, 32, 2019.

[18] Jothimurugan, K., Bansal, S., Bastani, O., and Alur, R. Compositional reinforcement learning from logical specifications. Advances in Neural Information Processing Systems, 34:10026–10039, 2021.

[19] Jothimurugan, K., Bansal, S., Bastani, O., and Alur, R. Specification-guided learning of Nash equilibria with high social welfare. In International Conference on Computer Aided Verification, pp. 343–363. Springer, 2022.

[20] Kirk, R., Zhang, A., Grefenstette, E., and Rocktäschel, T. A survey of zero-shot generalisation in deep reinforcement learning. Journal of Artificial Intelligence Research, 76:201–264, 2023.

[21] Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.

[22] Li, X., Vasile, C.-I., and Belta, C. Reinforcement learning with temporal logic rewards. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3834–3839. IEEE, 2017.

[23] Liu, M., Zhu, M., and Zhang, W. Goal-conditioned reinforcement learning: Problems and solutions. arXiv preprint arXiv:2201.08299, 2022.

[24] Liu, Z., Guo, Z., Yao, Y., Cen, Z., Yu, W., Zhang, T., and Zhao, D. Constrained decision transformer for offline safe reinforcement learning. In International Conference on Machine Learning, pp. 21611–21630. PMLR, 2023.

[25] Majumdar, R., Salamati, M., and Soudjani, S. Regret-free reinforcement learning for temporal logic specifications. In Forty-second International Conference on Machine Learning, 2025.

[26] Mania, H., Guy, A., and Recht, B. Simple random search of static linear policies is competitive for reinforcement learning. Advances in Neural Information Processing Systems, 31, 2018.

[27] Naderian, P., Loaiza-Ganem, G., Braviner, H. J., Caterini, A. L., Cresswell, J. C., Li, T., and Garg, A. C-learning: Horizon-aware cumulative accessibility estimation. In International Conference on Learning Representations, 2021.

[28] Oh, J., Singh, S., Lee, H., and Kohli, P. Zero-shot task generalization with multi-task deep reinforcement learning. In International Conference on Machine Learning, pp. 2661–2670. PMLR, 2017.

[29] Osa, T., Pajarinen, J., Neumann, G., Bagnell, J. A., Abbeel, P., and Peters, J. An algorithmic perspective on imitation learning. Foundations and Trends® in Robotics, 7(1-2):1–179, 2018.

[30] Sodhani, S., Zhang, A., and Pineau, J. Multi-task reinforcement learning with context-based representations. In International Conference on Machine Learning, pp. 9767–9779. PMLR, 2021.

[31] Sohn, S., Oh, J., and Lee, H. Hierarchical reinforcement learning for zero-shot generalization with subtask dependencies. Advances in Neural Information Processing Systems, 31, 2018.

[32] Subramanian, V., Kushwah, R., Roy, S., and Bansal, S. Inductive generalization in reinforcement learning from specifications. In International Symposium on Automated Technology for Verification and Analysis, pp. 277–298. Springer, 2025.

[33] Sutton, R. S., Barto, A. G., et al. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998.

[34] Svoboda, J., Bansal, S., and Chatterjee, K. Reinforcement learning from reachability specifications: PAC guarantees with expected conditional distance. In Forty-first International Conference on Machine Learning, 2024.

[35] Teh, Y., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N., and Pascanu, R. Distral: Robust multitask reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017.

[36] Torabi, F., Warnell, G., and Stone, P. Behavioral cloning from observation. arXiv preprint arXiv:1805.01954, 2018.

[37] Verma, A., Murali, V., Singh, R., Kohli, P., and Chaudhuri, S. Programmatically interpretable reinforcement learning. In International Conference on Machine Learning, pp. 5045–5054. PMLR, 2018.

[38] Xu, Z. and Topcu, U. Transfer of temporal logic formulas in reinforcement learning. In International Joint Conference on Artificial Intelligence, pp. 4010–4018, July 2019.

[39] Yang, C., Littman, M., and Carbin, M. On the (in)tractability of reinforcement learning for LTL objectives. arXiv preprint arXiv:2111.12679, 2021.

[40] Yuan, L. Z., Hasanbeig, M., Abate, A., and Kroening, D. Modular deep reinforcement learning with temporal logic specifications. arXiv preprint arXiv:1909.11591, 2019.

[41] Zhang, C., Vinyals, O., Munos, R., and Bengio, S. A study on overfitting in deep reinforcement learning. arXiv preprint arXiv:1804.06893, 2018.

[42] Zhu, H., Xiong, Z., Magill, S., and Jagannathan, S. An inductive synthesis framework for verifiable reinforcement learning. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 686–701, 2019.

[43] Zintgraf, L., Shiarlis, K., Igl, M., Schulze, S., Gal, Y., Hofmann, K., and Whiteson, S. VariBAD: A very good method for Bayes-adaptive deep RL via meta-learning. In International Conference on Learning Representations, 2019.