pith. machine review for the scientific record.

arxiv: 2605.08450 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Zero-shot Imitation Learning by Latent Topology Mapping

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:28 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI
keywords: imitation learning · zero-shot adaptation · latent topology · hub states · goal-conditioned tasks · long-horizon planning · trajectory mapping · 3D maze navigation

The pith

ZALT lets agents solve unseen long-horizon tasks by planning over a latent topology of hub states extracted from demonstrations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that imitation learning need not require full demonstrations for every possible task. Instead, by locating states in a learned latent space where many trajectories meet or split, the method turns existing demonstrations into reusable segments. An agent then learns to move between these segments and plans short sequences of such moves to reach new goals. A reader would care if this holds because collecting expert data for every variation of a complex task quickly becomes impractical in robotics or navigation settings.

Core claim

ZALT identifies latent hub states where trajectories converge or diverge, learns policies and a dynamics model over hub-to-hub transitions, and plans over the hub topology to complete new tasks. This topology makes demonstrated behaviors explicitly composable while compressing long tasks into shorter sequences of abstract transitions, enabling zero-shot adaptation to unseen start-goal pairs.
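
To make the first step concrete, here is a minimal sketch of one way the hub extraction could be realized, following the qualitative description in the Figure 2 caption (demonstration latent vectors are coalesced into clusters; hubs are clusters that multiple distinct clusters enter or leave). The encoder, the use of k-means, the cluster count, and every name below are assumptions for illustration, not ZALT's published procedure, which the referee notes is not specified.

```python
# Hedged sketch of hub detection over encoded demonstrations (illustrative only).
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

def find_hubs(latent_trajectories, n_clusters=50, seed=0):
    """latent_trajectories: list of (T_i, d) arrays of encoded demo states."""
    all_states = np.concatenate(latent_trajectories, axis=0)
    km = KMeans(n_clusters=n_clusters, random_state=seed).fit(all_states)

    predecessors = defaultdict(set)  # cluster -> distinct clusters that enter it
    successors = defaultdict(set)    # cluster -> distinct clusters it leads to
    for traj in latent_trajectories:
        labels = km.predict(traj)
        for a, b in zip(labels[:-1], labels[1:]):
            if a != b:  # only count genuine cluster-to-cluster transitions
                successors[a].add(b)
                predecessors[b].add(a)

    # A hub is a cluster where demos converge (several predecessors) or
    # diverge (several successors), mirroring the paper's stated criterion.
    hubs = {c for c in range(n_clusters)
            if len(predecessors[c]) > 1 or len(successors[c]) > 1}
    return km, hubs, successors
```

The successor sets double as a crude hub-to-hub adjacency structure, which the planning sketch below reuses as `hub_graph`.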

What carries the argument

Latent hub states that form a topology of composable transitions, allowing the agent to replace long primitive-action sequences with planned sequences of hub-to-hub moves.
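
A correspondingly minimal sketch of planning over that topology: `hub_graph` is the adjacency structure from the sketch above, while `encode`, `nearest_hub`, and `hub_policy` are hypothetical stand-ins for the learned encoder, hub assignment, and hub-to-hub controllers. The paper does not specify its planner, so plain breadth-first search stands in here.

```python
# Hedged sketch: compose demonstrated behavior by planning a short hub sequence
# and executing a low-level hub-to-hub policy for each leg (illustrative only).
from collections import deque

def plan_hub_sequence(hub_graph, start_hub, goal_hub):
    """Shortest hub-to-hub path by breadth-first search over the topology."""
    frontier, visited = deque([[start_hub]]), {start_hub}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal_hub:
            return path
        for nxt in hub_graph.get(path[-1], ()):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return None  # goal hub unreachable from the start hub

def solve_task(env, encode, nearest_hub, hub_graph, hub_policy, start_obs, goal_obs):
    """Zero-shot attempt at an unseen start-goal pair via hub composition."""
    plan = plan_hub_sequence(hub_graph,
                             nearest_hub(encode(start_obs)),
                             nearest_hub(encode(goal_obs)))
    if plan is None:
        return False
    obs = start_obs
    for src, dst in zip(plan, plan[1:]):
        obs, reached = hub_policy(env, obs, src, dst)  # learned hub-to-hub controller
        if not reached:
            return False  # a failed leg ends the episode
    return True
```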

If this is right

  • Demonstrated behaviors become reusable building blocks that can be chained for goals outside the original dataset.
  • Long trajectories are replaced by shorter plans over abstract transitions, limiting the accumulation of small errors (a back-of-envelope illustration follows this list).
  • Zero-shot success on novel start-goal pairs reaches 55 percent in a complex 3D maze while the strongest baseline reaches 6 percent.
  • Fewer complete demonstrations suffice to cover a broad range of tasks in the same environment.
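
A back-of-envelope illustration of the error-accumulation point in the list above. All numbers are invented for the example; the paper reports only the 55% versus 6% outcome.

```python
# If each of T primitive actions fails independently with probability eps,
# success over the whole trajectory is (1 - eps)**T; compressing the task into
# k hub-to-hub legs with per-leg failure eps_hub leaves far fewer chances to fail.
T, eps = 200, 0.02        # assumed long-horizon task: 200 steps, 2% error each
k, eps_hub = 5, 0.05      # assumed hub plan: 5 legs, 5% failure per leg

flat_success = (1 - eps) ** T      # ~0.02
hub_success = (1 - eps_hub) ** k   # ~0.77
print(f"primitive-level: {flat_success:.2f}   hub-level: {hub_success:.2f}")
```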

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same hub-extraction step could be applied in other sequential domains where partial expert traces are cheaper to obtain than full ones.
  • If the latent space is learned jointly with the dynamics model, small changes to the environment might require only re-identifying hubs rather than new demonstrations.
  • Extending the topology to continuous or stochastic settings would require checking whether convergence points remain stable under noise.

Load-bearing premise

That the identified latent hub states reliably mark points where trajectories converge or diverge so that planning over them yields correct compositions for tasks never demonstrated.

What would settle it

An experiment on held-out tasks that require hub sequences absent from any combination of the training demonstrations, where measured success falls to the level of standard imitation baselines.

Figures

Figures reproduced from arXiv: 2605.08450 by Maxwell J. Jacobson, Yexiang Xue.

Figure 1. Directly composing behavior at the primitive-action level requires many sequential decisions, …
Figure 2. Hub formation. ZALT coalesces demonstration latent vectors into clusters. Hubs are clusters where demos converge from multiple previous clusters or diverge into multiple next clusters. In practice, exact state equality is too strict, especially with high-dimensional observations. ZALT therefore detects convergences and divergences in a learned latent space.
Figure 3. Success rate on unseen start–goal pairs immediately after seeing demonstrations (left), and …
Figure 4. Video frames from a successful ZALT zero-shot inference run (goal: green gem, then red …
Original abstract

Imitation learning is effective for training agents when expert demonstrations are available, but collecting demonstrations for every complex task in an environment is costly. We study the long-horizon, goal-conditioned setting where a fixed demonstration dataset contains useful behavior, but not complete examples for every task the agent must solve. Existing imitation learning methods can learn strong policies from demonstrations, but when solving long-horizon tasks, small errors accumulate over long primitive-action trajectories and make zero-shot adaptation to new tasks unreliable. We introduce Zero-shot Agents from Latent Topologies (ZALT), an imitation-learning method that solves unseen start-goal tasks beyond those demonstrated during training. ZALT identifies latent hub states where trajectories converge or diverge, learns policies and a dynamics model over hub-to-hub transitions, and plans over the hub topology to complete new tasks. This topology makes demonstrated behaviors explicitly composable while compressing long tasks into shorter sequences of abstract transitions -- combined, these enable ZALT to perform zero-shot adaptation. In a complex 3D maze environment, ZALT achieves 55% zero-shot success on unseen tasks, compared to 6% for the strongest baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Zero-shot Agents from Latent Topologies (ZALT) for imitation learning in long-horizon goal-conditioned settings. From a fixed demonstration dataset, ZALT identifies latent hub states (convergence/divergence points in trajectories), learns policies and a dynamics model over hub-to-hub transitions, and performs planning over the resulting topology to solve unseen start-goal tasks. The central empirical claim is that this yields 55% zero-shot success in a complex 3D maze environment, versus 6% for the strongest baseline.

Significance. If the hub identification and topology-based planning reliably generalize compositions beyond the demonstration set, the method would provide a concrete mechanism for making demonstrated behaviors explicitly composable and compressible, addressing error accumulation in long-horizon imitation learning. This could reduce reliance on exhaustive task-specific demonstrations in robotics and navigation domains.

major comments (3)
  1. Abstract and §3 (Method): The hub identification procedure is described only at a high level ('identifies latent hub states where trajectories converge or diverge') with no algorithm, hyperparameters, or pseudocode. This is load-bearing for the central claim, as the 55% success rate depends on whether the extracted topology covers and correctly composes paths for unseen start-goal pairs; without the precise detection rule, it is impossible to determine if the result is robust or sensitive to post-hoc choices.
  2. §5 (Experiments): The 55% vs 6% comparison reports no error bars, no statistical tests, no ablation on hub count or dynamics model capacity, and no metric of hub coverage for the test tasks. These omissions directly undermine evaluation of the weakest assumption that planning over the hub topology produces correct sequences outside the demonstrated connectivity.
  3. §4 (Approach) and §5: No analysis is provided of hub stability across random data subsets or of planning success conditioned on whether a test task's optimal path intersects the identified hubs. Such checks are required to substantiate that the topology enables zero-shot adaptation rather than succeeding only on tasks whose connectivity is already captured by the fixed demonstration set.
minor comments (2)
  1. Abstract: Although the paper does not claim to be parameter-free, the method description should state explicitly whether the hub detection or planning steps introduce any tunable thresholds, and whether these were selected after seeing test performance.
  2. Figure 1 (if present): The diagram of the hub topology should include an example of an unseen task whose solution path is composed from demonstrated hub transitions, with the corresponding plan highlighted.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will revise the manuscript to improve clarity, reproducibility, and evaluation rigor.

point-by-point responses
  1. Referee: Abstract and §3 (Method): The hub identification procedure is described only at a high level ('identifies latent hub states where trajectories converge or diverge') with no algorithm, hyperparameters, or pseudocode. This is load-bearing for the central claim, as the 55% success rate depends on whether the extracted topology covers and correctly composes paths for unseen start-goal pairs; without the precise detection rule, it is impossible to determine if the result is robust or sensitive to post-hoc choices.

    Authors: We agree that the hub identification procedure requires a more precise and reproducible description. In the revised manuscript we will expand §3 with the full algorithm for detecting latent hub states (including the exact convergence/divergence criteria applied to the demonstration trajectories), all hyperparameters, and pseudocode placed in the appendix. This will allow direct assessment of how the extracted topology supports composition for the reported zero-shot tasks. revision: yes

  2. Referee: §5 (Experiments): The 55% vs 6% comparison reports no error bars, no statistical tests, no ablation on hub count or dynamics model capacity, and no metric of hub coverage for the test tasks. These omissions directly undermine evaluation of the weakest assumption that planning over the hub topology produces correct sequences outside the demonstrated connectivity.

    Authors: We acknowledge that the current experimental reporting lacks statistical detail and supporting ablations. We will update §5 to include error bars over multiple random seeds, statistical significance tests between ZALT and baselines, ablations on hub count and dynamics-model capacity, and a quantitative hub-coverage metric for the test tasks. These additions will directly address the concern about whether planning succeeds due to topology composition rather than incidental coverage. revision: yes

  3. Referee: §4 (Approach) and §5: No analysis is provided of hub stability across random data subsets or of planning success conditioned on whether a test task's optimal path intersects the identified hubs. Such checks are required to substantiate that the topology enables zero-shot adaptation rather than succeeding only on tasks whose connectivity is already captured by the fixed demonstration set.

    Authors: We agree that stability and conditional-success analyses would strengthen the claim that the topology enables genuine zero-shot composition. In the revision we will add (i) hub-stability results obtained by repeating identification on random subsets of the demonstration data and (ii) success rates conditioned on whether each test task's optimal path intersects the identified hubs. These checks will be reported alongside the existing 55% figure. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical zero-shot performance is independent of method description

full rationale

The paper presents ZALT as a method that extracts latent hubs from a fixed demonstration set, learns hub-to-hub policies and dynamics, then plans compositions for unseen start-goal pairs. The 55% success rate is reported as an experimental outcome in a 3D maze, not as a quantity derived by construction from the hub identification procedure or any fitted parameter. No equations, self-citations, or uniqueness theorems are invoked that would reduce the performance claim to a renaming or re-fitting of the input demonstrations. The derivation chain (hub detection → abstract dynamics → planning) remains logically independent of the final measured success rate, satisfying the criteria for a self-contained empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that demonstration trajectories admit a useful latent decomposition into hub-to-hub segments; no free parameters or invented entities are explicitly quantified in the abstract.

axioms (1)
  • domain assumption: Demonstrated trajectories contain identifiable latent hub states at which paths converge or diverge.
    Invoked to justify identifying hubs and building the topology for planning.

pith-pipeline@v0.9.0 · 5489 in / 1204 out tokens · 36037 ms · 2026-05-12T02:28:04.864734+00:00 · methodology

discussion (0)

