Scalable Option Learning in High-Throughput Environments
Pith reviewed 2026-05-18 20:00 UTC · model grok-4.3
The pith
Scalable Option Learning trains hierarchical agents on 30 billion frames of NetHack and surpasses flat agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We identify and solve several key challenges in scaling online hierarchical RL to high-throughput environments. We propose Scalable Option Learning (SOL), a highly scalable hierarchical RL algorithm which achieves a ~35x higher throughput compared to existing hierarchical methods. To demonstrate SOL's performance and scalability, we train hierarchical agents using 30 billion frames of experience on the complex game of NetHack, significantly surpassing flat agents and demonstrating positive scaling trends. We also validate SOL on MiniHack and MuJoCo environments, showcasing its general applicability.
What carries the argument
Scalable Option Learning (SOL), a hierarchical RL algorithm that solves identified scaling challenges to enable high-throughput training while preserving the benefits of hierarchy.
If this is right
- Hierarchical agents become feasible to train at scales previously limited to flat methods.
- Performance continues to improve as training data increases, following positive scaling trends.
- The approach extends to other environments including MiniHack and MuJoCo.
- Long-timescale decision making in complex tasks becomes practical with hierarchy at scale.
Where Pith is reading between the lines
- Prior failures of hierarchical RL may have stemmed mainly from throughput limits rather than inherent design flaws.
- This method could be tested in domains like robotics where long-horizon planning is needed but data throughput has been a barrier.
- Further scaling experiments beyond 30 billion frames would check whether the observed trends hold.
Load-bearing premise
That the identified challenges in scaling online hierarchical RL can be solved by SOL in a way that preserves hierarchy benefits and translates into better performance on complex tasks.
What would settle it
Training SOL hierarchical agents and flat agents on NetHack for equivalent frames and observing no performance advantage for the hierarchical version would falsify the superiority claim.
Original abstract
Hierarchical reinforcement learning (RL) has the potential to enable effective decision-making over long timescales. Existing approaches, while promising, have yet to realize the benefits of large-scale training. In this work, we identify and solve several key challenges in scaling online hierarchical RL to high-throughput environments. We propose Scalable Option Learning (SOL), a highly scalable hierarchical RL algorithm which achieves a ~35x higher throughput compared to existing hierarchical methods. To demonstrate SOL's performance and scalability, we train hierarchical agents using 30 billion frames of experience on the complex game of NetHack, significantly surpassing flat agents and demonstrating positive scaling trends. We also validate SOL on MiniHack and Mujoco environments, showcasing its general applicability. Our code is open sourced at: github.com/facebookresearch/sol.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Scalable Option Learning (SOL), a hierarchical RL algorithm for high-throughput environments. It claims a ~35x throughput improvement over existing hierarchical methods and demonstrates training hierarchical agents on 30 billion frames in NetHack, where they significantly surpass flat agents while exhibiting positive scaling trends. Additional validation is provided on MiniHack and MuJoCo, with open-sourced code.
Significance. If the central claims hold, this work would represent a meaningful advance in scaling hierarchical RL to large experience budgets on complex tasks, addressing throughput bottlenecks that have limited prior online hierarchical approaches. The scale of the NetHack experiment (30B frames) and the open-sourced implementation are concrete strengths that could support reproducibility and follow-on research.
Major comments (1)
- [Abstract and NetHack experiments] The central claim that SOL surpasses flat agents on NetHack via preserved hierarchical benefits after 30B frames (abstract) is load-bearing on options retaining meaningful temporal abstraction rather than collapsing to single-step primitives. The manuscript provides no reported metrics on option termination rates, average durations, or intra-option policy complexity in the NetHack results; without these, performance gains could be attributable to the scalable infrastructure alone.
Minor comments (1)
- [Abstract] The abstract states 'positive scaling trends' but does not reference a specific figure or table showing the scaling curve; adding an explicit pointer would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We have addressed the major comment regarding the NetHack experiments by incorporating additional analyses in the revised manuscript to better substantiate the preservation of hierarchical benefits.
Point-by-point responses
- Referee: [Abstract and NetHack experiments] The central claim that SOL surpasses flat agents on NetHack via preserved hierarchical benefits after 30B frames (abstract) is load-bearing on options retaining meaningful temporal abstraction rather than collapsing to single-step primitives. The manuscript provides no reported metrics on option termination rates, average durations, or intra-option policy complexity in the NetHack results; without these, performance gains could be attributable to the scalable infrastructure alone.
  Authors: We agree that the absence of these specific metrics in the original manuscript leaves room for the interpretation that performance gains could stem primarily from the scalable infrastructure. To directly address this, the revised manuscript now includes option termination rates, average option durations, and measures of intra-option policy complexity for the NetHack results. These additions show that options maintain durations substantially longer than single steps and exhibit non-trivial intra-option behavior even after 30 billion frames, supporting that the hierarchical structure contributes meaningfully to the observed advantages over flat agents and the positive scaling trends.
  Revision: yes
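The metrics at issue in this exchange, option termination rates and average option durations, can be estimated from a per-step log of which option is active. The following is a minimal sketch under stated assumptions, not the authors' analysis code: the function name `option_stats` and both traces are hypothetical, and a switch is counted only when the active option id changes between consecutive steps.

```python
from itertools import groupby


def option_stats(option_trace):
    """Summarise temporal abstraction in a per-step trace of active option ids.

    Assumption: a switch is counted whenever the active option id changes
    between consecutive steps, so an option that terminates and is
    immediately re-selected is not detected by this sketch.
    """
    steps = len(option_trace)
    if steps == 0:
        return {"mean_duration": 0.0, "switch_rate": 0.0, "num_segments": 0}
    # Collapse the trace into runs of the same option id and measure each run.
    durations = [len(list(run)) for _, run in groupby(option_trace)]
    switches = len(durations) - 1
    return {
        "mean_duration": steps / len(durations),
        "switch_rate": switches / steps,
        "num_segments": len(durations),
    }


# Hypothetical traces for illustration only (not data from the paper).
abstract_trace = [0, 0, 0, 0, 2, 2, 2, 1, 1, 1, 1, 1]
collapsed_trace = [0, 1, 0, 2, 1, 0, 2, 1, 0, 1, 2, 0]

print(option_stats(abstract_trace))
print(option_stats(collapsed_trace))
```

A mean duration close to 1 with a switch rate close to 1, as in the second trace, is the collapse to single-step primitives that the referee describes; longer runs, as in the first trace, are consistent with options that keep some temporal abstraction.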
Circularity Check
No circularity: empirical scaling results rest on direct measurements
Full rationale
The paper proposes the SOL algorithm to scale online hierarchical RL and validates it via large-scale empirical training (30B frames on NetHack, 35x throughput gains over prior hierarchical methods). Claims of surpassing flat agents and positive scaling are grounded in reported performance metrics and throughput benchmarks rather than any derivation that reduces to fitted parameters, self-definitions, or self-citations. No equations or load-bearing steps in the abstract or described approach exhibit the enumerated circular patterns; the work is self-contained against external benchmarks and open-sourced code.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We propose Scalable Option Learning (SOL), a highly scalable hierarchical RL algorithm which achieves a ~35x higher throughput compared to existing hierarchical methods... train hierarchical agents using 30 billion frames of experience on the complex game of NetHack
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Each option ω ∈ Ω represents a temporally extended behavior, and is defined by a tuple (π_ω, I_ω, β_ω)
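The option tuple (π_ω, I_ω, β_ω) in the quoted passage can be read as a small data structure. In the standard options framework, π_ω is the intra-option policy, I_ω is the set of states where the option may start, and β_ω gives the probability of terminating in a state. The sketch below is illustrative only and is not taken from the SOL codebase: the names `Option`, `run_option`, `chain_step`, and `walk_right`, and the toy chain environment, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Set
import random


@dataclass
class Option:
    policy: Callable[[int], int]         # pi_omega: state -> action
    initiation_set: Set[int]             # I_omega: states where the option may start
    termination: Callable[[int], float]  # beta_omega: state -> probability of stopping


def run_option(option, state, step, rng, max_steps=100):
    """Execute one option call-and-return style; return (final_state, steps_taken)."""
    if state not in option.initiation_set:
        raise ValueError("option cannot be initiated in this state")
    steps = 0
    while steps < max_steps:
        state = step(state, option.policy(state))
        steps += 1
        # Sample the termination condition beta_omega in the new state.
        if rng.random() < option.termination(state):
            break
    return state, steps


# Hypothetical toy environment: a chain of states 0..5 where the action is added to the state.
def chain_step(state, action):
    return min(state + action, 5)


# Hypothetical option: always move right, terminate with certainty on reaching state 5.
walk_right = Option(
    policy=lambda s: 1,
    initiation_set={0, 1, 2, 3, 4},
    termination=lambda s: 1.0 if s == 5 else 0.0,
)

final_state, duration = run_option(walk_right, 0, chain_step, random.Random(0))
print(final_state, duration)  # prints "5 5"
```

Here a single high-level decision covers five environment steps, which is the sense in which an option is a temporally extended behavior.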
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Goal-Conditioned Agents that Learn Everything All at Once
  LEO enables efficient all-goals learning in goal-conditioned RL by jointly predicting for all goals in one network pass, yielding >250x speedup over relabelling and better performance on Craftax.
- Hierarchical Behaviour Spaces
  Hierarchical Behaviour Spaces uses linear combinations of reward functions to induce expressive behavior spaces in hierarchical RL, yielding strong performance on NetHack primarily through better exploration rather th...