Recognition: no theorem link
Zero-shot Imitation Learning by Latent Topology Mapping
Pith reviewed 2026-05-12 02:28 UTC · model grok-4.3
The pith
ZALT lets agents solve unseen long-horizon tasks by planning over a latent topology of hub states extracted from demonstrations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ZALT identifies latent hub states where trajectories converge or diverge, learns policies and a dynamics model over hub-to-hub transitions, and plans over the hub topology to complete new tasks. This topology makes demonstrated behaviors explicitly composable while compressing long tasks into shorter sequences of abstract transitions, enabling zero-shot adaptation to unseen start-goal pairs.
What carries the argument
Latent hub states that form a topology of composable transitions, allowing the agent to replace long primitive-action sequences with planned sequences of hub-to-hub moves.
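To make the mechanism concrete, here is a minimal sketch, not the paper's implementation, of how planning over a hub topology could work once hubs and hub-to-hub transitions have been extracted: search the hub graph for a shortest hub sequence from the hub nearest the start to the hub nearest the goal, then execute each edge with its learned hub-to-hub policy. The names hub_graph and plan_hub_sequence are illustrative assumptions, not identifiers from the paper.

```python
from collections import deque

# Illustrative hub topology: node = hub id, edge = a demonstrated hub-to-hub
# transition for which a low-level policy was learned (assumed structure,
# not the paper's data format).
hub_graph = {
    "h0": ["h1"],
    "h1": ["h2", "h3"],
    "h2": ["h4"],
    "h3": ["h4"],
    "h4": [],
}

def plan_hub_sequence(graph, start_hub, goal_hub):
    """Breadth-first search for a shortest sequence of hub-to-hub moves."""
    queue = deque([[start_hub]])
    visited = {start_hub}
    while queue:
        path = queue.popleft()
        if path[-1] == goal_hub:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # goal hub unreachable in the extracted topology

print(plan_hub_sequence(hub_graph, "h0", "h4"))  # e.g. ['h0', 'h1', 'h2', 'h4']
# Execution would then invoke, for each consecutive pair (h, h'), the learned
# hub-to-hub policy until the agent reaches h'.
```

A graph search like this is only one way to realize "plans over the hub topology"; the paper's planner may instead roll out its learned abstract dynamics model.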
If this is right
- Demonstrated behaviors become reusable building blocks that can be chained for goals outside the original dataset.
- Long trajectories are replaced by shorter plans over abstract transitions, limiting the accumulation of small errors (a rough compounding sketch follows this list).
- Zero-shot success on novel start-goal pairs reaches 55 percent in a complex 3D maze while the strongest baseline reaches 6 percent.
- Fewer complete demonstrations suffice to cover a broad range of tasks in the same environment.
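A back-of-the-envelope version of the error-compounding point in the second bullet, under the illustrative assumption (not stated in the paper) that failures are independent per primitive step and per hub transition:

```latex
% Success probability over T primitive steps with per-step failure rate \epsilon,
% versus K hub-to-hub transitions (K \ll T) with per-transition failure rate \delta:
P_{\text{primitive}} \approx (1-\epsilon)^{T}, \qquad P_{\text{hub}} \approx (1-\delta)^{K}.
% Illustrative numbers (not from the paper):
% \epsilon = 0.01,\ T = 300 \Rightarrow (0.99)^{300} \approx 0.05, \qquad
% \delta = 0.05,\ K = 6 \Rightarrow (0.95)^{6} \approx 0.74.
```

The point is only qualitative: reducing the number of decision points that can fail can dominate even when each abstract transition is individually less reliable than a primitive step.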
Where Pith is reading between the lines
- The same hub-extraction step could be applied in other sequential domains where partial expert traces are cheaper to obtain than full ones.
- If the latent space is learned jointly with the dynamics model, small changes to the environment might require only re-identifying hubs rather than new demonstrations.
- Extending the topology to continuous or stochastic settings would require checking whether convergence points remain stable under noise.
Load-bearing premise
That the identified latent hub states reliably mark points where trajectories converge or diverge so that planning over them yields correct compositions for tasks never demonstrated.
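The paper does not spell out its detection rule (see the referee's first major comment below), but one plausible reading of "states where trajectories converge or diverge" is branching in a transition graph built from the demonstrations: a state with several distinct demonstrated successors or predecessors looks like a hub. The sketch below is that reading only, with an illustrative threshold; candidate_hubs is not ZALT's algorithm.

```python
from collections import defaultdict

def candidate_hubs(trajectories, min_branching=2):
    """Flag states whose demonstrated in- or out-branching exceeds a threshold.

    trajectories: list of state sequences (hashable discretized/latent states).
    This is an illustrative convergence/divergence heuristic, not ZALT's rule.
    """
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for traj in trajectories:
        for s, s_next in zip(traj, traj[1:]):
            successors[s].add(s_next)
            predecessors[s_next].add(s)
    states = set(successors) | set(predecessors)
    return {
        s for s in states
        if len(successors[s]) >= min_branching      # divergence point
        or len(predecessors[s]) >= min_branching    # convergence point
    }

demos = [["a", "b", "c", "d"], ["a", "b", "e", "d"], ["f", "b", "c", "g"]]
print(sorted(candidate_hubs(demos)))  # ['b', 'c', 'd'] under this threshold
```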
What would settle it
An experiment on held-out tasks that require hub sequences absent from any combination of the training demonstrations: if measured success there falls to the level of standard imitation baselines, the composability claim does not hold.
Figures
Original abstract
Imitation learning is effective for training agents when expert demonstrations are available, but collecting demonstrations for every complex task in an environment is costly. We study the long-horizon, goal-conditioned setting where a fixed demonstration dataset contains useful behavior, but not complete examples for every task the agent must solve. Existing imitation learning methods can learn strong policies from demonstrations, but when solving long-horizon tasks, small errors accumulate over long primitive-action trajectories and make zero-shot adaptation to new tasks unreliable. We introduce Zero-shot Agents from Latent Topologies (ZALT), an imitation-learning method that solves unseen start-goal tasks beyond those demonstrated during training. ZALT identifies latent hub states where trajectories converge or diverge, learns policies and a dynamics model over hub-to-hub transitions, and plans over the hub topology to complete new tasks. This topology makes demonstrated behaviors explicitly composable while compressing long tasks into shorter sequences of abstract transitions -- combined, these enable ZALT to perform zero-shot adaptation. In a complex 3D maze environment, ZALT achieves 55% zero-shot success on unseen tasks, compared to 6% for the strongest baseline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Zero-shot Agents from Latent Topologies (ZALT) for imitation learning in long-horizon goal-conditioned settings. From a fixed demonstration dataset, ZALT identifies latent hub states (convergence/divergence points in trajectories), learns policies and a dynamics model over hub-to-hub transitions, and performs planning over the resulting topology to solve unseen start-goal tasks. The central empirical claim is that this yields 55% zero-shot success in a complex 3D maze environment, versus 6% for the strongest baseline.
Significance. If the hub identification and topology-based planning reliably generalize compositions beyond the demonstration set, the method would provide a concrete mechanism for making demonstrated behaviors explicitly composable and compressible, addressing error accumulation in long-horizon imitation learning. This could reduce reliance on exhaustive task-specific demonstrations in robotics and navigation domains.
major comments (3)
- [Abstract and §3] Abstract and §3 (Method): The hub identification procedure is described only at a high level ('identifies latent hub states where trajectories converge or diverge') with no algorithm, hyperparameters, or pseudocode. This is load-bearing for the central claim, as the 55% success rate depends on whether the extracted topology covers and correctly composes paths for unseen start-goal pairs; without the precise detection rule, it is impossible to determine if the result is robust or sensitive to post-hoc choices.
- [§5] §5 (Experiments): The 55% vs 6% comparison reports no error bars, no statistical tests, no ablation on hub count or dynamics model capacity, and no metric of hub coverage for the test tasks. These omissions directly undermine evaluation of the weakest assumption, that planning over the hub topology produces correct sequences outside the demonstrated connectivity. (A sketch of one way to compute such intervals and tests follows this list.)
- [§4 and §5] §4 (Approach) and §5: No analysis is provided of hub stability across random data subsets or of planning success conditioned on whether a test task's optimal path intersects the identified hubs. Such checks are required to substantiate that the topology enables zero-shot adaptation rather than succeeding only on tasks whose connectivity is already captured by the fixed demonstration set.
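For concreteness, the requested error bars and significance test could be computed directly from raw success counts with a Wilson score interval and a two-proportion z-test. The episode count n below is a placeholder, since the abstract does not report how many evaluation tasks were run; this is a sketch, not the paper's evaluation protocol.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial success rate."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

def two_proportion_z(s1, n1, s2, n2):
    """Two-sided z-test for a difference between two success rates."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

n = 100  # placeholder episode count; not reported in the abstract
print(wilson_interval(55, n))         # interval around the 55% figure
print(wilson_interval(6, n))          # interval around the 6% baseline figure
print(two_proportion_z(55, n, 6, n))  # z statistic and two-sided p-value
```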
minor comments (2)
- [Abstract] Abstract: The paper does not claim to be parameter-free, so the method description should state explicitly whether the hub-detection or planning steps introduce tunable thresholds and, if so, whether any of them were selected after seeing test performance.
- [Figure 1] Figure 1 (if present): The diagram of the hub topology should include an example of an unseen task whose solution path is composed from demonstrated hub transitions, with the corresponding plan highlighted.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and will revise the manuscript to improve clarity, reproducibility, and evaluation rigor.
Point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (Method): The hub identification procedure is described only at a high level ('identifies latent hub states where trajectories converge or diverge') with no algorithm, hyperparameters, or pseudocode. This is load-bearing for the central claim, as the 55% success rate depends on whether the extracted topology covers and correctly composes paths for unseen start-goal pairs; without the precise detection rule, it is impossible to determine if the result is robust or sensitive to post-hoc choices.
Authors: We agree that the hub identification procedure requires a more precise and reproducible description. In the revised manuscript we will expand §3 with the full algorithm for detecting latent hub states (including the exact convergence/divergence criteria applied to the demonstration trajectories), all hyperparameters, and pseudocode placed in the appendix. This will allow direct assessment of how the extracted topology supports composition for the reported zero-shot tasks. revision: yes
-
Referee: [§5] §5 (Experiments): The 55% vs 6% comparison reports no error bars, no statistical tests, no ablation on hub count or dynamics model capacity, and no metric of hub coverage for the test tasks. These omissions directly undermine evaluation of the weakest assumption that planning over the hub topology produces correct sequences outside the demonstrated connectivity.
Authors: We acknowledge that the current experimental reporting lacks statistical detail and supporting ablations. We will update §5 to include error bars over multiple random seeds, statistical significance tests between ZALT and baselines, ablations on hub count and dynamics-model capacity, and a quantitative hub-coverage metric for the test tasks. These additions will directly address the concern about whether planning succeeds due to topology composition rather than incidental coverage. revision: yes
-
Referee: [§4 and §5] §4 (Approach) and §5: No analysis is provided of hub stability across random data subsets or of planning success conditioned on whether a test task's optimal path intersects the identified hubs. Such checks are required to substantiate that the topology enables zero-shot adaptation rather than succeeding only on tasks whose connectivity is already captured by the fixed demonstration set.
Authors: We agree that stability and conditional-success analyses would strengthen the claim that the topology enables genuine zero-shot composition. In the revision we will add (i) hub-stability results obtained by repeating identification on random subsets of the demonstration data and (ii) success rates conditioned on whether each test task's optimal path intersects the identified hubs. These checks will be reported alongside the existing 55% figure. revision: yes
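For the stability check promised in (i), one simple realization, again a sketch rather than the authors' protocol, is to re-run hub identification on random subsets of the demonstration set and report the mean pairwise Jaccard overlap of the resulting hub sets. The hub_stability helper below is hypothetical; identify_hubs stands in for ZALT's unspecified detector (the earlier candidate_hubs heuristic could be plugged in for illustration).

```python
import random
from itertools import combinations

def jaccard(a, b):
    """Overlap between two hub sets; 1.0 means identical hubs."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

def hub_stability(trajectories, identify_hubs, n_splits=10, frac=0.5, seed=0):
    """Mean pairwise Jaccard overlap of hubs identified on random subsets.

    identify_hubs: any detector mapping a list of trajectories to a set of
    hub states (placeholder hook; ZALT's own detector would be used here).
    """
    rng = random.Random(seed)
    k = max(1, int(frac * len(trajectories)))
    hub_sets = [identify_hubs(rng.sample(trajectories, k)) for _ in range(n_splits)]
    pairs = list(combinations(hub_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Example usage (with the illustrative candidate_hubs heuristic shown earlier):
# print(hub_stability(demos, candidate_hubs))
```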
Circularity Check
No circularity: empirical zero-shot performance is independent of method description
Full rationale
The paper presents ZALT as a method that extracts latent hubs from a fixed demonstration set, learns hub-to-hub policies and dynamics, then plans compositions for unseen start-goal pairs. The 55% success rate is reported as an experimental outcome in a 3D maze, not as a quantity derived by construction from the hub identification procedure or any fitted parameter. No equations, self-citations, or uniqueness theorems are invoked that would reduce the performance claim to a renaming or re-fitting of the input demonstrations. The derivation chain (hub detection → abstract dynamics → planning) remains logically independent of the final measured success rate, satisfying the criteria for a self-contained empirical result.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Demonstrated trajectories contain identifiable latent hub states at which paths converge or diverge.