pith. machine review for the scientific record.

arxiv: 2605.08450 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Zero-shot Imitation Learning by Latent Topology Mapping

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:28 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI
keywords: imitation learning · zero-shot adaptation · latent topology · hub states · goal-conditioned tasks · long-horizon planning · trajectory mapping · 3D maze navigation

The pith

ZALT lets agents solve unseen long-horizon tasks by planning over a latent topology of hub states extracted from demonstrations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that imitation learning need not require full demonstrations for every possible task. Instead, by locating states in a learned latent space where many trajectories meet or split, the method turns existing demonstrations into reusable segments. An agent then learns to move between these segments and plans short sequences of such moves to reach new goals. A reader would care if this holds because collecting expert data for every variation of a complex task quickly becomes impractical in robotics or navigation settings.

Core claim

ZALT identifies latent hub states where trajectories converge or diverge, learns policies and a dynamics model over hub-to-hub transitions, and plans over the hub topology to complete new tasks. This topology makes demonstrated behaviors explicitly composable while compressing long tasks into shorter sequences of abstract transitions, enabling zero-shot adaptation to unseen start-goal pairs.
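
To make the first step concrete, here is a minimal sketch of one way the hub extraction could be realized, following the qualitative description in the Figure 2 caption (demonstration latent vectors are coalesced into clusters; hubs are clusters that multiple distinct clusters enter or leave). The encoder, the use of k-means, the cluster count, and every name below are assumptions for illustration, not ZALT's published procedure, which the referee notes is not specified.

```python
# Hedged sketch of hub detection over encoded demonstrations (illustrative only).
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

def find_hubs(latent_trajectories, n_clusters=50, seed=0):
    """latent_trajectories: list of (T_i, d) arrays of encoded demo states."""
    all_states = np.concatenate(latent_trajectories, axis=0)
    km = KMeans(n_clusters=n_clusters, random_state=seed).fit(all_states)

    predecessors = defaultdict(set)  # cluster -> distinct clusters that enter it
    successors = defaultdict(set)    # cluster -> distinct clusters it leads to
    for traj in latent_trajectories:
        labels = km.predict(traj)
        for a, b in zip(labels[:-1], labels[1:]):
            if a != b:  # only count genuine cluster-to-cluster transitions
                successors[a].add(b)
                predecessors[b].add(a)

    # A hub is a cluster where demos converge (several predecessors) or
    # diverge (several successors), mirroring the paper's stated criterion.
    hubs = {c for c in range(n_clusters)
            if len(predecessors[c]) > 1 or len(successors[c]) > 1}
    return km, hubs, successors
```

The successor sets double as a crude hub-to-hub adjacency structure, which the planning sketch below reuses as `hub_graph`.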

What carries the argument

Latent hub states that form a topology of composable transitions, allowing the agent to replace long primitive-action sequences with planned sequences of hub-to-hub moves.
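
A correspondingly minimal sketch of planning over that topology: `hub_graph` is the adjacency structure from the sketch above, while `encode`, `nearest_hub`, and `hub_policy` are hypothetical stand-ins for the learned encoder, hub assignment, and hub-to-hub controllers. The paper does not specify its planner, so plain breadth-first search stands in here.

```python
# Hedged sketch: compose demonstrated behavior by planning a short hub sequence
# and executing a low-level hub-to-hub policy for each leg (illustrative only).
from collections import deque

def plan_hub_sequence(hub_graph, start_hub, goal_hub):
    """Shortest hub-to-hub path by breadth-first search over the topology."""
    frontier, visited = deque([[start_hub]]), {start_hub}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal_hub:
            return path
        for nxt in hub_graph.get(path[-1], ()):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return None  # goal hub unreachable from the start hub

def solve_task(env, encode, nearest_hub, hub_graph, hub_policy, start_obs, goal_obs):
    """Zero-shot attempt at an unseen start-goal pair via hub composition."""
    plan = plan_hub_sequence(hub_graph,
                             nearest_hub(encode(start_obs)),
                             nearest_hub(encode(goal_obs)))
    if plan is None:
        return False
    obs = start_obs
    for src, dst in zip(plan, plan[1:]):
        obs, reached = hub_policy(env, obs, src, dst)  # learned hub-to-hub controller
        if not reached:
            return False  # a failed leg ends the episode
    return True
```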

If this is right

  • Demonstrated behaviors become reusable building blocks that can be chained for goals outside the original dataset.
  • Long trajectories are replaced by shorter plans over abstract transitions, limiting the accumulation of small errors (a back-of-envelope illustration follows this list).
  • Zero-shot success on novel start-goal pairs reaches 55 percent in a complex 3D maze while the strongest baseline reaches 6 percent.
  • Fewer complete demonstrations suffice to cover a broad range of tasks in the same environment.
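
A back-of-envelope illustration of the error-accumulation point in the list above. All numbers are invented for the example; the paper reports only the 55% versus 6% outcome.

```python
# If each of T primitive actions fails independently with probability eps,
# success over the whole trajectory is (1 - eps)**T; compressing the task into
# k hub-to-hub legs with per-leg failure eps_hub leaves far fewer chances to fail.
T, eps = 200, 0.02        # assumed long-horizon task: 200 steps, 2% error each
k, eps_hub = 5, 0.05      # assumed hub plan: 5 legs, 5% failure per leg

flat_success = (1 - eps) ** T      # ~0.02
hub_success = (1 - eps_hub) ** k   # ~0.77
print(f"primitive-level: {flat_success:.2f}   hub-level: {hub_success:.2f}")
```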

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same hub-extraction step could be applied in other sequential domains where partial expert traces are cheaper to obtain than full ones.
  • If the latent space is learned jointly with the dynamics model, small changes to the environment might require only re-identifying hubs rather than new demonstrations.
  • Extending the topology to continuous or stochastic settings would require checking whether convergence points remain stable under noise.

Load-bearing premise

That the identified latent hub states reliably mark points where trajectories converge or diverge so that planning over them yields correct compositions for tasks never demonstrated.

What would settle it

An experiment on held-out tasks that require hub sequences absent from any combination of the training demonstrations, where measured success falls to the level of standard imitation baselines.

Figures

Figures reproduced from arXiv: 2605.08450 by Maxwell J. Jacobson, Yexiang Xue.

Figure 1. Directly composing behavior at the primitive-action level requires many sequential decisions, …
Figure 2. Hub formation. ZALT coalesces demonstration latent vectors into clusters. Hubs are clusters where demos converge from multiple previous clusters or diverge into multiple next clusters. In practice, exact state equality is too strict, especially with high-dimensional observations. ZALT therefore detects convergences and divergences in a learned latent space.
Figure 3. Success rate on unseen start–goal pairs immediately after seeing demonstrations (left), and …
Figure 4. Video frames from a successful ZALT zero-shot inference run (goal: green gem, then red …
Original abstract

Imitation learning is effective for training agents when expert demonstrations are available, but collecting demonstrations for every complex task in an environment is costly. We study the long-horizon, goal-conditioned setting where a fixed demonstration dataset contains useful behavior, but not complete examples for every task the agent must solve. Existing imitation learning methods can learn strong policies from demonstrations, but when solving long-horizon tasks, small errors accumulate over long primitive-action trajectories and make zero-shot adaptation to new tasks unreliable. We introduce Zero-shot Agents from Latent Topologies (ZALT), an imitation-learning method that solves unseen start-goal tasks beyond those demonstrated during training. ZALT identifies latent hub states where trajectories converge or diverge, learns policies and a dynamics model over hub-to-hub transitions, and plans over the hub topology to complete new tasks. This topology makes demonstrated behaviors explicitly composable while compressing long tasks into shorter sequences of abstract transitions -- combined, these enable ZALT to perform zero-shot adaptation. In a complex 3D maze environment, ZALT achieves 55% zero-shot success on unseen tasks, compared to 6% for the strongest baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Zero-shot Agents from Latent Topologies (ZALT) for imitation learning in long-horizon goal-conditioned settings. From a fixed demonstration dataset, ZALT identifies latent hub states (convergence/divergence points in trajectories), learns policies and a dynamics model over hub-to-hub transitions, and performs planning over the resulting topology to solve unseen start-goal tasks. The central empirical claim is that this yields 55% zero-shot success in a complex 3D maze environment, versus 6% for the strongest baseline.

Significance. If the hub identification and topology-based planning reliably generalize compositions beyond the demonstration set, the method would provide a concrete mechanism for making demonstrated behaviors explicitly composable and compressible, addressing error accumulation in long-horizon imitation learning. This could reduce reliance on exhaustive task-specific demonstrations in robotics and navigation domains.

major comments (3)
  1. Abstract and §3 (Method): The hub identification procedure is described only at a high level ('identifies latent hub states where trajectories converge or diverge') with no algorithm, hyperparameters, or pseudocode. This is load-bearing for the central claim, as the 55% success rate depends on whether the extracted topology covers and correctly composes paths for unseen start-goal pairs; without the precise detection rule, it is impossible to determine if the result is robust or sensitive to post-hoc choices.
  2. §5 (Experiments): The 55% vs 6% comparison reports no error bars, no statistical tests, no ablation on hub count or dynamics model capacity, and no metric of hub coverage for the test tasks. These omissions directly undermine evaluation of the weakest assumption that planning over the hub topology produces correct sequences outside the demonstrated connectivity.
  3. §4 (Approach) and §5: No analysis is provided of hub stability across random data subsets or of planning success conditioned on whether a test task's optimal path intersects the identified hubs. Such checks are required to substantiate that the topology enables zero-shot adaptation rather than succeeding only on tasks whose connectivity is already captured by the fixed demonstration set.
minor comments (2)
  1. Abstract: Although the paper does not claim to be parameter-free, the method description should state explicitly whether the hub detection or planning steps introduce any tunable thresholds, and whether these were selected after seeing test performance.
  2. Figure 1 (if present): The diagram of the hub topology should include an example of an unseen task whose solution path is composed from demonstrated hub transitions, with the corresponding plan highlighted.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will revise the manuscript to improve clarity, reproducibility, and evaluation rigor.

point-by-point responses
  1. Referee: Abstract and §3 (Method): The hub identification procedure is described only at a high level ('identifies latent hub states where trajectories converge or diverge') with no algorithm, hyperparameters, or pseudocode. This is load-bearing for the central claim, as the 55% success rate depends on whether the extracted topology covers and correctly composes paths for unseen start-goal pairs; without the precise detection rule, it is impossible to determine if the result is robust or sensitive to post-hoc choices.

    Authors: We agree that the hub identification procedure requires a more precise and reproducible description. In the revised manuscript we will expand §3 with the full algorithm for detecting latent hub states (including the exact convergence/divergence criteria applied to the demonstration trajectories), all hyperparameters, and pseudocode placed in the appendix. This will allow direct assessment of how the extracted topology supports composition for the reported zero-shot tasks. revision: yes

  2. Referee: §5 (Experiments): The 55% vs 6% comparison reports no error bars, no statistical tests, no ablation on hub count or dynamics model capacity, and no metric of hub coverage for the test tasks. These omissions directly undermine evaluation of the weakest assumption that planning over the hub topology produces correct sequences outside the demonstrated connectivity.

    Authors: We acknowledge that the current experimental reporting lacks statistical detail and supporting ablations. We will update §5 to include error bars over multiple random seeds, statistical significance tests between ZALT and baselines, ablations on hub count and dynamics-model capacity, and a quantitative hub-coverage metric for the test tasks. These additions will directly address the concern about whether planning succeeds due to topology composition rather than incidental coverage. revision: yes

  3. Referee: §4 (Approach) and §5: No analysis is provided of hub stability across random data subsets or of planning success conditioned on whether a test task's optimal path intersects the identified hubs. Such checks are required to substantiate that the topology enables zero-shot adaptation rather than succeeding only on tasks whose connectivity is already captured by the fixed demonstration set.

    Authors: We agree that stability and conditional-success analyses would strengthen the claim that the topology enables genuine zero-shot composition. In the revision we will add (i) hub-stability results obtained by repeating identification on random subsets of the demonstration data and (ii) success rates conditioned on whether each test task's optimal path intersects the identified hubs. These checks will be reported alongside the existing 55% figure. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical zero-shot performance is independent of method description

full rationale

The paper presents ZALT as a method that extracts latent hubs from a fixed demonstration set, learns hub-to-hub policies and dynamics, then plans compositions for unseen start-goal pairs. The 55% success rate is reported as an experimental outcome in a 3D maze, not as a quantity derived by construction from the hub identification procedure or any fitted parameter. No equations, self-citations, or uniqueness theorems are invoked that would reduce the performance claim to a renaming or re-fitting of the input demonstrations. The derivation chain (hub detection → abstract dynamics → planning) remains logically independent of the final measured success rate, satisfying the criteria for a self-contained empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that demonstration trajectories admit a useful latent decomposition into hub-to-hub segments; no free parameters or invented entities are explicitly quantified in the abstract.

axioms (1)
  • domain assumption: Demonstrated trajectories contain identifiable latent hub states at which paths converge or diverge.
    Invoked to justify identifying hubs and building the topology for planning.

pith-pipeline@v0.9.0 · 5489 in / 1204 out tokens · 36037 ms · 2026-05-12T02:28:04.864734+00:00 · methodology

discussion (0)

