Scalable Option Learning in High-Throughput Environments

Michael Matthews; Michael Rabbat; Mikael Henaff; Scott Fujimoto

arxiv: 2509.00338 · v3 · submitted 2025-08-30 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-18 20:00 UTC · model grok-4.3

keywords hierarchical reinforcement learning · scalable option learning · NetHack · high-throughput environments · option discovery · deep reinforcement learning · scaling trends

The pith

Scalable Option Learning trains hierarchical agents on 30 billion frames of NetHack and surpasses flat agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper identifies key challenges that have prevented hierarchical reinforcement learning from scaling to high-throughput environments. It introduces Scalable Option Learning (SOL) as a solution that achieves approximately 35 times higher throughput than previous hierarchical methods. The authors demonstrate the approach by training agents on 30 billion frames of the complex game NetHack, where the hierarchical agents perform significantly better than flat agents and exhibit positive scaling with more data. SOL is also shown to work on simpler environments like MiniHack and MuJoCo, indicating general applicability. A sympathetic reader would care because this could unlock the long-timescale decision making that hierarchy promises but has not yet delivered at scale.

Core claim

We identify and solve several key challenges in scaling online hierarchical RL to high-throughput environments. We propose Scalable Option Learning (SOL), a highly scalable hierarchical RL algorithm which achieves a ~35x higher throughput compared to existing hierarchical methods. To demonstrate SOL's performance and scalability, we train hierarchical agents using 30 billion frames of experience on the complex game of NetHack, significantly surpassing flat agents and demonstrating positive scaling trends. We also validate SOL on MiniHack and Mujoco environments, showcasing its general applicability.

What carries the argument

Scalable Option Learning (SOL), a hierarchical RL algorithm that solves identified scaling challenges to enable high-throughput training while preserving the benefits of hierarchy.

If this is right

Hierarchical agents become feasible to train at scales previously limited to flat methods.
Performance continues to improve as training data increases, following positive scaling trends.
The approach extends to other environments including MiniHack and MuJoCo.
Long-timescale decision making in complex tasks becomes practical with hierarchy at scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Prior failures of hierarchical RL may have stemmed mainly from throughput limits rather than inherent design flaws.
This method could be tested in domains like robotics where long-horizon planning is needed but data throughput has been a barrier.
Further scaling experiments beyond 30 billion frames would check whether the observed trends hold.
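
One way to check whether a scaling trend holds is to fit score against the logarithm of training frames and look at the sign of the slope. The sketch below is illustrative only: the helper name `log_linear_slope` is hypothetical, the data points are synthetic placeholders generated from a formula (not numbers from the paper), and a log-linear fit is an assumed functional form rather than the authors' analysis.

```python
import math


def log_linear_slope(frames, scores):
    """Least-squares slope of score against log10(frames)."""
    xs = [math.log10(f) for f in frames]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    # Ordinary least squares for a single predictor.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    return sxy / sxx


# Synthetic placeholder data: scores are generated as 2 * log10(frames),
# so the fitted slope should come out close to 2. Not results from the paper.
frames = [1e9, 3e9, 1e10, 3e10]
scores = [2.0 * math.log10(f) for f in frames]

slope = log_linear_slope(frames, scores)
print(f"score gained per 10x more frames: {slope:.3f}")
```

A positive slope that persists when later checkpoints are added would be consistent with the trend continuing; a slope that flattens toward zero would suggest saturation.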

Load-bearing premise

That the identified challenges in scaling online hierarchical RL can be solved by SOL in a way that preserves hierarchy benefits and translates into better performance on complex tasks.

What would settle it

Training SOL hierarchical agents and flat agents on NetHack for equivalent frames and observing no performance advantage for the hierarchical version would falsify the superiority claim.

Original abstract

Hierarchical reinforcement learning (RL) has the potential to enable effective decision-making over long timescales. Existing approaches, while promising, have yet to realize the benefits of large-scale training. In this work, we identify and solve several key challenges in scaling online hierarchical RL to high-throughput environments. We propose Scalable Option Learning (SOL), a highly scalable hierarchical RL algorithm which achieves a ~35x higher throughput compared to existing hierarchical methods. To demonstrate SOL's performance and scalability, we train hierarchical agents using 30 billion frames of experience on the complex game of NetHack, significantly surpassing flat agents and demonstrating positive scaling trends. We also validate SOL on MiniHack and Mujoco environments, showcasing its general applicability. Our code is open sourced at: github.com/facebookresearch/sol.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague.

SOL gets real throughput gains for hierarchical RL at scale but the paper leaves it unclear whether options keep meaningful temporal abstraction.

The key takeaway is that SOL fixes practical scaling issues in online hierarchical RL and delivers a measured ~35x throughput boost, which lets the authors run 30 billion frames on NetHack and show hierarchical agents beating flat ones with positive scaling trends. They also test on MiniHack and MuJoCo and release the code, which is straightforward value for anyone trying to run large hierarchical experiments. That combination of identified bottlenecks, concrete speed-up, and large-scale result is the part that stands out as new and useful. The work is grounded in empirical measurements rather than just new theory, and the open-source release makes it easy to check or extend.

The soft spot is exactly the one the stress-test note flags: there is no reported data on option durations, termination rates, or intra-option complexity. Without those numbers it is hard to tell whether the performance edge comes from preserved hierarchy or simply from the faster training setup itself. If options collapse to short primitives under the high-throughput regime, the central claim that SOL solves scaling while keeping hierarchy advantages would need more support.

The paper is aimed at people working on scalable hierarchical RL and large-scale online training. A reader who cares about practical speed-ups and wants a working baseline for NetHack-scale experiments will get something concrete from it. The throughput result and the scale of the experiment are solid enough to justify sending it to peer review so referees can check the implementation details and ask for the missing option statistics.

Referee Report

1 major / 1 minor

Summary. The paper proposes Scalable Option Learning (SOL), a hierarchical RL algorithm for high-throughput environments. It claims a ~35x throughput improvement over existing hierarchical methods and demonstrates training hierarchical agents on 30 billion frames in NetHack, where they significantly surpass flat agents while exhibiting positive scaling trends. Additional validation is provided on MiniHack and MuJoCo, with open-sourced code.

Significance. If the central claims hold, this work would represent a meaningful advance in scaling hierarchical RL to large experience budgets on complex tasks, addressing throughput bottlenecks that have limited prior online hierarchical approaches. The scale of the NetHack experiment (30B frames) and the open-sourced implementation are concrete strengths that could support reproducibility and follow-on research.

major comments (1)

[Abstract and NetHack experiments] The central claim that SOL surpasses flat agents on NetHack via preserved hierarchical benefits after 30B frames (abstract) is load-bearing on options retaining meaningful temporal abstraction rather than collapsing to single-step primitives. The manuscript provides no reported metrics on option termination rates, average durations, or intra-option policy complexity in the NetHack results; without these, performance gains could be attributable to the scalable infrastructure alone.
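
The statistics requested here can be computed from a per-step log of which option was active. The following is a minimal sketch under stated assumptions: `option_stats` and the trace format are hypothetical illustrations, not part of the SOL codebase, and the trace is a made-up toy example.

```python
from itertools import groupby
from statistics import mean


def option_stats(option_trace):
    """Summarise temporal abstraction from a per-step log of active option ids.

    `option_trace` is a hypothetical, non-empty list with one option id per
    environment step of a single episode.
    """
    # Collapse the trace into runs of consecutive identical option ids.
    run_lengths = [len(list(run)) for _, run in groupby(option_trace)]
    switches = len(run_lengths) - 1
    return {
        # Average number of steps an option stays active.
        "mean_duration": mean(run_lengths),
        # Fraction of step transitions at which the active option changes.
        "switch_rate": switches / max(len(option_trace) - 1, 1),
        # Share of option executions lasting exactly one step.
        "single_step_fraction": sum(n == 1 for n in run_lengths) / len(run_lengths),
    }


# Toy trace (made up for illustration): option 0, then 1, then 2, then 0 again.
trace = [0, 0, 0, 1, 1, 2, 2, 2, 2, 0]
stats = option_stats(trace)
print(stats)
```

A mean duration near 1 or a single-step fraction near 1 would indicate options collapsing to primitives. Because this sketch sees only option ids, it cannot detect an option that terminates and is immediately re-selected; logging the termination signal itself would give exact termination rates.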

minor comments (1)

[Abstract] The abstract states 'positive scaling trends' but does not reference a specific figure or table showing the scaling curve; adding an explicit pointer would improve clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed feedback. We have addressed the major comment regarding the NetHack experiments by incorporating additional analyses in the revised manuscript to better substantiate the preservation of hierarchical benefits.

Referee: [Abstract and NetHack experiments] The central claim that SOL surpasses flat agents on NetHack via preserved hierarchical benefits after 30B frames (abstract) is load-bearing on options retaining meaningful temporal abstraction rather than collapsing to single-step primitives. The manuscript provides no reported metrics on option termination rates, average durations, or intra-option policy complexity in the NetHack results; without these, performance gains could be attributable to the scalable infrastructure alone.

Authors: We agree that the absence of these specific metrics in the original manuscript leaves room for the interpretation that performance gains could stem primarily from the scalable infrastructure. To directly address this, the revised manuscript now includes option termination rates, average option durations, and measures of intra-option policy complexity for the NetHack results. These additions show that options maintain durations substantially longer than single steps and exhibit non-trivial intra-option behavior even after 30 billion frames, supporting that the hierarchical structure contributes meaningfully to the observed advantages over flat agents and the positive scaling trends.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical scaling results rest on direct measurements

The paper proposes the SOL algorithm to scale online hierarchical RL and validates it via large-scale empirical training (30B frames on NetHack, 35x throughput gains over prior hierarchical methods). Claims of surpassing flat agents and positive scaling are grounded in reported performance metrics and throughput benchmarks rather than any derivation that reduces to fitted parameters, self-definitions, or self-citations. No equations or load-bearing steps in the abstract or described approach exhibit the enumerated circular patterns; the work is self-contained against external benchmarks and open-sourced code.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; detailed ledger requires the full manuscript.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "We propose Scalable Option Learning (SOL), a highly scalable hierarchical RL algorithm which achieves a ~35x higher throughput compared to existing hierarchical methods... train hierarchical agents using 30 billion frames of experience on the complex game of NetHack"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction · unclear

Paper passage: "Each option ω ∈ Ω represents a temporally extended behavior, and is defined by a tuple (π_ω, I_ω, β_ω)"
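
For readers unfamiliar with the option tuple quoted in the passage above, a minimal sketch of (π_ω, I_ω, β_ω) as a data structure may help. Everything here is an assumption made for illustration: the field names, the `run_option` helper, and the toy one-dimensional environment are hypothetical and do not describe how SOL implements options.

```python
import random
from dataclasses import dataclass
from typing import Callable

State = int
Action = int


@dataclass(frozen=True)
class Option:
    policy: Callable[[State], Action]    # π_ω: intra-option policy
    can_start: Callable[[State], bool]   # I_ω: initiation set, as a predicate
    stop_prob: Callable[[State], float]  # β_ω: termination probability


def run_option(option, state, step, rng, max_steps=100):
    """Run one option until β_ω fires; return the final state and steps taken."""
    assert option.can_start(state), "option is not available in this state"
    for t in range(1, max_steps + 1):
        state = step(state, option.policy(state))
        # Sample termination from β_ω evaluated at the new state.
        if rng.random() < option.stop_prob(state):
            return state, t
    return state, max_steps


# Toy example: walk right along a line and terminate on reaching position 5.
walk_right = Option(
    policy=lambda s: 1,
    can_start=lambda s: s < 5,
    stop_prob=lambda s: 1.0 if s >= 5 else 0.0,
)
final_state, duration = run_option(
    walk_right, 0, lambda s, a: s + a, random.Random(0)
)
print(final_state, duration)
```

The returned `duration` is the quantity at stake in the temporal abstraction question: an option whose β_ω is close to 1 everywhere would behave like a single primitive action.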

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Goal-Conditioned Agents that Learn Everything All at Once
cs.LG · 2026-05 · unverdicted · novelty 6.0

LEO enables efficient all-goals learning in goal-conditioned RL by jointly predicting for all goals in one network pass, yielding >250x speedup over relabelling and better performance on Craftax.

Hierarchical Behaviour Spaces
cs.AI · 2026-04 · unverdicted · novelty 6.0

Hierarchical Behaviour Spaces uses linear combinations of reward functions to induce expressive behavior spaces in hierarchical RL, yielding strong performance on NetHack primarily through better exploration rather th...

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · cited by 2 Pith papers · 16 internal anchors

[1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page
[2]

The option-critic architecture

Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI'17, pp.\ 1726–1734. AAAI Press, 2017

work page 2017
[3]

2011, Computing in Science Engineering, 13, 31 , 10.1109/MCSE.2010.118

S. Behnel, R. Bradshaw, C. Citro, L. Dalcin, D.S. Seljebotn, and K. Smith. Cython: The best of both worlds. Computing in Science Engineering, 13 0 (2): 0 31 --39, 2011. ISSN 1521-9615. doi:10.1109/MCSE.2010.118

work page doi:10.1109/mcse.2010.118 2011
[4]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert - Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litw...

work page internal anchor Pith review Pith/arXiv arXiv 2005
[5]

Exploration by random network distillation

Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. In International Conference on Learning Representations, 2019

work page 2019
[6]

arXiv preprint arXiv:2309.00987 , year=

Yuanpei Chen, Chen Wang, Li Fei-Fei, and C Karen Liu. Sequential dexterity: Chaining dexterous policies for long-horizon manipulation. arXiv preprint arXiv:2309.00987, 2023

work page arXiv 2023
[7]

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

Kyunghyun Cho, Bart van Merrienboer, C aglar G \" u l c ehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014. URL http://arxiv.org/abs/1406.1078

work page internal anchor Pith review Pith/arXiv arXiv 2014
[8]

Nethack standard strategy

NetHackWiki contributors. Nethack standard strategy. URL https://nethackwiki.com/wiki/Standard_strategy

work page
[9]

\" O zg\" u r S im s ek and Andrew G. Barto. Skill characterization based on betweenness. In Proceedings of the 22nd International Conference on Neural Information Processing Systems, NIPS'08, pp.\ 1497–1504, Red Hook, NY, USA, 2008. Curran Associates Inc. ISBN 9781605609492

work page 2008
[10]

Peter Dayan and Geoffrey E. Hinton. Feudal reinforcement learning. In Proceedings of the 6th International Conference on Neural Information Processing Systems, NIPS'92, pp.\ 271–278, San Francisco, CA, USA, 1992. Morgan Kaufmann Publishers Inc. ISBN 1558602747

work page 1992
[11]

Gymnasium robotics, 2024

Rodrigo de Lazcano, Kallinteris Andreas, Jun Jet Tai, Seungjae Ryan Lee, and Jordan Terry. Gymnasium robotics, 2024. URL http://github.com/Farama-Foundation/Gymnasium-Robotics

work page 2024
[12]

Openai baselines

Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. Openai baselines. https://github.com/openai/baselines, 2017

work page 2017
[13]

The Llama 3 Herd of Models

Abhimanyu Dubey et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

IMPALA : Scalable distributed deep- RL with importance weighted actor-learner architectures

Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA : Scalable distributed deep- RL with importance weighted actor-learner architectures. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Mach...

work page 2018
[15]

Diversity is all you need: Learning skills without a reward function

Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SJx63jRqFm

work page 2019
[16]

Minedojo: Building open-ended embodied agents with internet-scale knowledge

Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. Minedojo: Building open-ended embodied agents with internet-scale knowledge. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL https://openreview.net/forum?id=r...

work page 2022
[17]

u rtler, Dieter B\

Nico G\" u rtler, Dieter B\" u chler, and Georg Martius. Hierarchical reinforcement learning with timed subgoals. In Proceedings of the 35th International Conference on Neural Information Processing Systems, NIPS '21, Red Hook, NY, USA, 2021. Curran Associates Inc. ISBN 9781713845393

work page 2021
[18]

Learning and Transfer of Modulated Locomotor Controllers

Nicolas Manfred Otto Heess, Greg Wayne, Yuval Tassa, Timothy P. Lillicrap, Martin A. Riedmiller, and David Silver. Learning and transfer of modulated locomotor controllers. ArXiv, abs/1610.05182, 2016. URL https://api.semanticscholar.org/CorpusID:9692454

work page internal anchor Pith review Pith/arXiv arXiv 2016
[19]

Exploration via elliptical episodic bonuses

Mikael Henaff, Roberta Raileanu, Minqi Jiang, and Tim Rockt \"a schel. Exploration via elliptical episodic bonuses. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022

work page 2022
[20]

Hierarchical learning in stochastic domains: preliminary results

Leslie Pack Kaelbling. Hierarchical learning in stochastic domains: preliminary results. In Proceedings of the Tenth International Conference on International Conference on Machine Learning, ICML'93, pp.\ 167–173, San Francisco, CA, USA, 1993. Morgan Kaufmann Publishers Inc. ISBN 1558603077

work page 1993
[21]

Nethack learning environment sample factory baseline

Anssi Kanervisto and Karolis Jucys. Nethack learning environment sample factory baseline. https://github.com/Miffyli/nle-sample-factory-baseline, 2022. Accessed: 2025-03-28

work page 2022
[22]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. CoRR, abs/2001.08361, 2020. URL https://arxiv.org/abs/2001.08361

work page internal anchor Pith review Pith/arXiv arXiv 2001
[23]

Segment Anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Doll \'a r, and Ross Girshick. Segment anything. arXiv:2304.02643, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Flexible option learning

Martin Klissarov and Doina Precup. Flexible option learning. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=L5vbEVIePyb

work page 2021
[25]

Learnings Options End-to-End for Continuous Action Tasks

Martin Klissarov, Pierre-Luc Bacon, Jean Harb, and Doina Precup. Learnings options end-to-end for continuous action tasks. ArXiv, abs/1712.00004, 2017. URL https://api.semanticscholar.org/CorpusID:1809550

work page internal anchor Pith review Pith/arXiv arXiv 2017
[26]

Motif: Intrinsic motivation from artificial intelligence feedback

Martin Klissarov, Pierluca D'Oro, Shagun Sodhani, Roberta Raileanu, Pierre-Luc Bacon, Pascal Vincent, Amy Zhang, and Mikael Henaff. Motif: Intrinsic motivation from artificial intelligence feedback. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=tmBKIecDE9

work page 2024
[27]

Machado, and Pierluca D'Oro

Martin Klissarov, Mikael Henaff, Roberta Raileanu, Shagun Sodhani, Pascal Vincent, Amy Zhang, Pierre-Luc Bacon, Doina Precup, Marlos C. Machado, and Pierluca D'Oro. Maestromotif: Skill design from artificial intelligence feedback. 2025. URL https://openreview.net/forum?id=or8mMhmyRV

work page 2025
[28]

Actor-critic algorithms

Vijay Konda and John Tsitsiklis. Actor-critic algorithms. In S. Solla, T. Leen, and K. M\" u ller (eds.), Advances in Neural Information Processing Systems, volume 12. MIT Press, 1999. URL https://proceedings.neurips.cc/paper_files/paper/1999/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf

work page 1999
[29]

TorchBeast: A PyTo rch Platform for Distributed RL

Heinrich K\" u ttler, Nantas Nardelli, Thibaut Lavril, Marco Selvatici, Viswanath Sivakumar, Tim Rockt\" a schel, and Edward Grefenstette. TorchBeast: A PyTorch Platform for Distributed RL . arXiv preprint arXiv:1910.03552, 2019. URL https://github.com/facebookresearch/torchbeast

work page arXiv 1910
[30]

u ttler, Nantas Nardelli, Alexander H. Miller, Roberta Raileanu, Marco Selvatici, Edward Grefenstette, and Tim Rockt \

Heinrich K \" u ttler, Nantas Nardelli, Alexander H. Miller, Roberta Raileanu, Marco Selvatici, Edward Grefenstette, and Tim Rockt \" a schel. The NetHack Learning Environment . In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2020

work page 2020
[31]

Reward design with language models

Minae Kwon, Sang Michael Xie, Kalesha Bullard, and Dorsa Sadigh. Reward design with language models. In The Eleventh International Conference on Learning Representations, 2023 a . URL https://openreview.net/forum?id=10uNUgI5Kl

work page 2023
[32]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023 b

work page 2023
[33]

Voicebox: Text-guided multilingual universal speech generation at scale

Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, and Wei-Ning Hsu. Voicebox: Text-guided multilingual universal speech generation at scale. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), NeurIPS, 2023. URL http://dblp.u...

work page 2023
[34]

Learning Multi-Level Hi- erarchies with Hindsight, September 2019

Andrew Levy, Robert Platt Jr., and Kate Saenko. Hierarchical actor-critic. CoRR, abs/1712.00948, 2017. URL http://arxiv.org/abs/1712.00948

work page arXiv 2017
[35]

Sub-policy adaptation for hierarchical reinforcement learning

Alexander Li, Carlos Florensa, Ignasi Clavera, and Pieter Abbeel. Sub-policy adaptation for hierarchical reinforcement learning. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=ByeWogStDS

work page 2020
[36]

RLlib: Abstractions for Distributed Reinforcement Learning

Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph E. Gonzalez, Michael I. Jordan, and Ion Stoica. RLlib : Abstractions for distributed reinforcement learning. In International Conference on Machine Learning ( ICML ) , 2018. URL https://arxiv.org/pdf/1712.09381

work page internal anchor Pith review Pith/arXiv arXiv 2018
[37]

Eureka: Human-Level Reward Design via Coding Large Language Models

Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv: Arxiv-2310.12931, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

Matthews, Michael Beukman, Benjamin Ellis, Mikayel Samvelyan, Matthew Thomas Jackson, Samuel Coward, and Jakob Nicolaus Foerster

Michael T. Matthews, Michael Beukman, Benjamin Ellis, Mikayel Samvelyan, Matthew Thomas Jackson, Samuel Coward, and Jakob Nicolaus Foerster. Craftax: A lightning-fast benchmark for open-ended reinforcement learning. In ICML, 2024. URL https://openreview.net/forum?id=hg4wXlrQCV

work page 2024
[39]

Amy McGovern and Andrew G. Barto. Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pp.\ 361–368, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1558607781

work page 2001
[40]

moolib: A Platform for Distributed RL

Vegard Mella, Eric Hambro, Danielle Rothermel, and Heinrich K \" u ttler. moolib: A Platform for Distributed RL . 2022. URL https://github.com/facebookresearch/moolib

work page 2022
[41]

Q-cut - dynamic discovery of sub-goals in reinforcement learning

Ishai Menache, Shie Mannor, and Nahum Shimkin. Q-cut - dynamic discovery of sub-goals in reinforcement learning. In Proceedings of the 13th European Conference on Machine Learning, ECML '02, pp.\ 295–306, Berlin, Heidelberg, 2002. Springer-Verlag. ISBN 3540440364

work page 2002
[42]

Asynchronous Methods for Deep Reinforcement Learning

Volodymyr Mnih, Adri \` a Puigdom \` e nech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. CoRR, abs/1602.01783, 2016. URL http://arxiv.org/abs/1602.01783

work page internal anchor Pith review Pith/arXiv arXiv 2016
[43]

Data-efficient hierarchical reinforcement learning

Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pp.\ 3307–3317, Red Hook, NY, USA, 2018. Curran Associates Inc

work page 2018
[44]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

BALROG : Benchmarking agentic LLM and VLM reasoning on games

Davide Paglieri, Bart omiej Cupia , Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, ukasz Kuci \'n ski, Lerrel Pinto, Rob Fergus, Jakob Nicolaus Foerster, Jack Parker-Holder, and Tim Rockt \"a schel. BALROG : Benchmarking agentic LLM and VLM reasoning on games. In The Thirteenth International Conference on Learning Represe...

work page 2025
[46]

Horizon reduction makes rl scalable.arXiv preprint arXiv:2506.04168, 2025

Seohong Park, Kevin Frans, Deepinder Mann, Benjamin Eysenbach, Aviral Kumar, and Sergey Levine. Horizon reduction makes rl scalable, 2025. URL https://arxiv.org/abs/2506.04168

work page arXiv 2025
[47]

Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning

Xue Bin Peng, Glen Berseth, Kangkang Yin, and Michiel Van De Panne. Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Trans. Graph., 36 0 (4): 0 41:1--41:13, July 2017. ISSN 0730-0301. doi:10.1145/3072959.3073602. URL http://doi.acm.org/10.1145/3072959.3073602

work page doi:10.1145/3072959.3073602 2017
[48]

Karl Pertsch, Youngwoon Lee, and Joseph J. Lim. Accelerating reinforcement learning with learned skill priors. In Conference on Robot Learning (CoRL), 2020

work page 2020
[49]

Sukhatme, and Vladlen Koltun

Aleksei Petrenko, Zhehui Huang, Tushar Kumar, Gaurav S. Sukhatme, and Vladlen Koltun. Sample factory: Egocentric 3d control from pixels at 100000 FPS with asynchronous reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event , volume 119 of Proceedings of Machine Learning Re...

work page 2020
[50]

Doina Precup and Richard S. Sutton. Temporal abstraction in reinforcement learning. PhD thesis, 2000. AAI9978540

work page 2000
[51]

From simple to complex skills: The case of in-hand object reorientation, 2025

Haozhi Qi, Brent Yi, Mike Lambeta, Yi Ma, Roberto Calandra, and Jitendra Malik. From simple to complex skills: The case of in-hand object reorientation, 2025. URL https://arxiv.org/abs/2501.05439

work page arXiv 2025
[52]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. CoRR, abs/2103.00020, 2021. URL https://arxiv.org/abs/2103.00020

work page internal anchor Pith review Pith/arXiv arXiv 2021
[53]

SAM 2: Segment anything in images and videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R \"a dle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollar, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. In The Thirteenth Intern...

work page 2025
[54]

Minihack the planet: A sandbox for open-ended reinforcement learning research

Mikayel Samvelyan, Robert Kirk, Vitaly Kurin, Jack Parker-Holder, Minqi Jiang, Eric Hambro, Fabio Petroni, Heinrich Kuttler, Edward Grefenstette, and Tim Rockt \"a schel. Minihack the planet: A sandbox for open-ended reinforcement learning research. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)...

work page 2021
[55]

Habitat: A platform for embodied ai research

Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A platform for embodied ai research. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp.\ 9338--9346, 2019. URL https://api.semanticscholar.org/CorpusID...

work page 2019
[56]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[57]

Mastering the game of go with deep neural networks and tree search

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484--489, 2016

[58]

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy P. Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR, abs/1712.01815, 2017. URL http://arxiv.org/abs...

[59]

Satinder P. Singh. Scaling reinforcement learning algorithms by learning variable temporal resolution models. In Derek H. Sleeman and Peter Edwards (eds.), Proceedings of the Ninth International Workshop on Machine Learning (ML 1992), Aberdeen, Scotland, UK, July 1-3, 1992, pp. 406--415. Morgan Kaufmann, 1992a. doi:10.1016/B978-1-55860-247-2.50058-9. ...

[60]

Satinder P. Singh. Reinforcement learning with a hierarchy of abstract models. In Proceedings of the Tenth National Conference on Artificial Intelligence, AAAI'92, pp. 202–207. AAAI Press, 1992b. ISBN 0262510634

[61]

An inference-based policy gradient method for learning options

Matthew Smith, Herke van Hoof, and Joelle Pineau. An inference-based policy gradient method for learning options. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 4703--4712. PMLR, 10--15 Jul 2018. URL https://proceedings.mlr.press/v8...

[62]

Learning options in reinforcement learning

Martin Stolle and Doina Precup. Learning options in reinforcement learning. In Proceedings of the 5th International Symposium on Abstraction, Reformulation and Approximation, pp. 212–223, Berlin, Heidelberg, 2002. Springer-Verlag. ISBN 3540439412

[63]

Pufferlib: Making reinforcement learning libraries and environments play nice

Joseph Suarez. Pufferlib: Making reinforcement learning libraries and environments play nice, 2024. URL https://arxiv.org/abs/2406.12905

[64]

Reinforcement Learning: An Introduction

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018. URL http://incompleteideas.net/book/the-book-2nd.html

[65]

Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning

Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artif. Intell., 112(1-2):181--211, 1999. URL http://dblp.uni-trier.de/db/journals/ai/ai112.html#SuttonPS99

[66]

Habitat 2.0: training home assistants to rearrange their habitat

Andrew Szot, Alex Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Chaplot, Oleksandr Maksymets, Aaron Gokaslan, Vladimir Vondrus, Sameer Dharur, Franziska Meier, Wojciech Galuba, Angel Chang, Zsolt Kira, Vladlen Koltun, Jitendra Malik, Manolis Savva, and Dhruv Batra. Habitat 2.0: training home assista...

[67]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team. Gemini: A family of highly capable multimodal models, 2024. URL https://arxiv.org/abs/2312.11805

[68]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

[69]

Feudal networks for hierarchical reinforcement learning

Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pp. 3540–3549. JMLR.org, 2017

[70]

HQ-learning

Marco Wiering and Jürgen Schmidhuber. HQ-learning. Adaptive Behavior, 6(2):219--246, 1997. ISSN 1059-7123

[71]

Decentralized distributed PPO: solving pointgoal navigation

Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, and Dhruv Batra. Decentralized distributed PPO: solving pointgoal navigation. CoRR, abs/1911.00357, 2019. URL http://arxiv.org/abs/1911.00357

[72]

Function optimization using connectionist reinforcement learning algorithms

Ronald Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241--268, 1991. doi:10.1080/09540099108946587

[73]

Simple statistical gradient-following algorithms for connectionist reinforcement learning

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn., 8(3–4):229–256, May 1992. ISSN 0885-6125. doi:10.1007/BF00992696. URL https://doi.org/10.1007/BF00992696

[74]

RLlib flow: Distributed reinforcement learning is a dataflow problem

Zhanghao Wu, Eric Liang, Michael Luo, Sven Mika, Joseph E. Gonzalez, and Ion Stoica. RLlib flow: Distributed reinforcement learning is a dataflow problem. In Conference on Neural Information Processing Systems (NeurIPS), 2021. URL https://proceedings.neurips.cc/paper/2021/file/2bce32ed409f5ebcee2a7b417ad9beed-Paper.pdf

[75]

ASC: Adaptive Skill Coordination for Robotic Mobile Manipulation

Naoki Yokoyama, Alexander William Clegg, Joanne Truong, Eric Undersander, Jimmy Yang, Sergio Arnaud, Sehoon Ha, Dhruv Batra, and Akshara Rai. ASC: Adaptive Skill Coordination for Robotic Mobile Manipulation. IEEE Robotics and Automation Letters, 2023

[76]

Online intrinsic rewards for decision making agents from large language model feedback

Qinqing Zheng, Mikael Henaff, Amy Zhang, Aditya Grover, and Brandon Amos. Online intrinsic rewards for decision making agents from large language model feedback, 2024. URL https://arxiv.org/abs/2410.23022

