arxiv: 2509.00338 · v3 · submitted 2025-08-30 · 💻 cs.LG · cs.AI

Scalable Option Learning in High-Throughput Environments

Pith reviewed 2026-05-18 20:00 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords hierarchical reinforcement learning · scalable option learning · NetHack · high-throughput environments · option discovery · deep reinforcement learning · scaling trends

The pith

Scalable Option Learning trains hierarchical agents on 30 billion frames of NetHack and surpasses flat agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper identifies key challenges that have prevented hierarchical reinforcement learning from scaling to high-throughput environments. It introduces Scalable Option Learning (SOL) as a solution that achieves approximately 35 times higher throughput than previous hierarchical methods. The authors demonstrate the approach by training agents on 30 billion frames of the complex game NetHack, where the hierarchical agents perform significantly better than flat agents and exhibit positive scaling with more data. SOL is also shown to work on simpler environments like MiniHack and MuJoCo, indicating general applicability. A sympathetic reader would care because this could unlock the long-timescale decision making that hierarchy promises but has not yet delivered at scale.

Core claim

We identify and solve several key challenges in scaling online hierarchical RL to high-throughput environments. We propose Scalable Option Learning (SOL), a highly scalable hierarchical RL algorithm which achieves a ~35x higher throughput compared to existing hierarchical methods. To demonstrate SOL's performance and scalability, we train hierarchical agents using 30 billion frames of experience on the complex game of NetHack, significantly surpassing flat agents and demonstrating positive scaling trends. We also validate SOL on MiniHack and Mujoco environments, showcasing its general applicability.

What carries the argument

Scalable Option Learning (SOL), a hierarchical RL algorithm that solves identified scaling challenges to enable high-throughput training while preserving the benefits of hierarchy.

If this is right

  • Hierarchical agents become feasible to train at scales previously limited to flat methods.
  • Performance continues to improve as training data increases, following positive scaling trends.
  • The approach extends to other environments including MiniHack and MuJoCo.
  • Long-timescale decision making in complex tasks becomes practical with hierarchy at scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Prior failures of hierarchical RL may have stemmed mainly from throughput limits rather than inherent design flaws.
  • This method could be tested in domains like robotics where long-horizon planning is needed but data throughput has been a barrier.
  • Further scaling experiments beyond 30 billion frames would check whether the observed trends hold.

Load-bearing premise

That the identified challenges in scaling online hierarchical RL can be solved by SOL in a way that preserves hierarchy benefits and translates into better performance on complex tasks.

What would settle it

Training SOL hierarchical agents and flat agents on NetHack for equivalent frames and observing no performance advantage for the hierarchical version would falsify the superiority claim.
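One way to make this test concrete is sketched below. It assumes per-seed final scores for hierarchical and flat agents have been collected at the same frame budget, and it computes a percentile bootstrap interval for the difference in mean score. The function name `bootstrap_mean_diff` and its parameters are illustrative assumptions, not part of the paper or its released code, and the bootstrap is one reasonable choice among several rather than the authors' evaluation protocol.

```python
import random
from statistics import mean


def bootstrap_mean_diff(hier_scores, flat_scores, n_resamples=2000,
                        alpha=0.05, seed=0):
    """Return (observed_diff, ci_low, ci_high) for mean(hier) - mean(flat).

    hier_scores / flat_scores: per-seed scores measured at the SAME
    frame budget (hypothetical inputs; the paper's logs may differ).
    """
    if not hier_scores or not flat_scores:
        raise ValueError("both score lists must be non-empty")
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping group sizes fixed.
        h = [rng.choice(hier_scores) for _ in hier_scores]
        f = [rng.choice(flat_scores) for _ in flat_scores]
        diffs.append(mean(h) - mean(f))
    diffs.sort()
    # Percentile interval, with indices clamped to the valid range.
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    observed = mean(hier_scores) - mean(flat_scores)
    return observed, diffs[lo_idx], diffs[hi_idx]
```

If the resulting interval contains zero at an equal frame budget, the data give no evidence of a hierarchical advantage; an interval entirely above zero is consistent with the paper's claim. With only a handful of seeds the interval will be coarse, so it is a sanity check rather than a definitive test.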

Original abstract

Hierarchical reinforcement learning (RL) has the potential to enable effective decision-making over long timescales. Existing approaches, while promising, have yet to realize the benefits of large-scale training. In this work, we identify and solve several key challenges in scaling online hierarchical RL to high-throughput environments. We propose Scalable Option Learning (SOL), a highly scalable hierarchical RL algorithm which achieves a ~35x higher throughput compared to existing hierarchical methods. To demonstrate SOL's performance and scalability, we train hierarchical agents using 30 billion frames of experience on the complex game of NetHack, significantly surpassing flat agents and demonstrating positive scaling trends. We also validate SOL on MiniHack and Mujoco environments, showcasing its general applicability. Our code is open sourced at: github.com/facebookresearch/sol.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes Scalable Option Learning (SOL), a hierarchical RL algorithm for high-throughput environments. It claims a ~35x throughput improvement over existing hierarchical methods and demonstrates training hierarchical agents on 30 billion frames in NetHack, where they significantly surpass flat agents while exhibiting positive scaling trends. Additional validation is provided on MiniHack and MuJoCo, with open-sourced code.

Significance. If the central claims hold, this work would represent a meaningful advance in scaling hierarchical RL to large experience budgets on complex tasks, addressing throughput bottlenecks that have limited prior online hierarchical approaches. The scale of the NetHack experiment (30B frames) and the open-sourced implementation are concrete strengths that could support reproducibility and follow-on research.

major comments (1)
  1. [Abstract and NetHack experiments] The central claim that SOL surpasses flat agents on NetHack via preserved hierarchical benefits after 30B frames (abstract) depends on options retaining meaningful temporal abstraction rather than collapsing to single-step primitives. The manuscript reports no metrics on option termination rates, average durations, or intra-option policy complexity in the NetHack results; without these, the performance gains could be attributable to the scalable infrastructure alone.
minor comments (1)
  1. [Abstract] The abstract states 'positive scaling trends' but does not reference a specific figure or table showing the scaling curve; adding an explicit pointer would improve clarity.
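The metrics requested in the major comment can be computed from a logged trajectory. The sketch below is a minimal illustration under an assumed trace format: `option_ids[t]` is the option active at step `t`, and `terminated[t]` is true when that option terminates after step `t`. These names and the format are hypothetical and are not taken from the SOL codebase.

```python
def option_segments(option_ids, terminated):
    """Split a trajectory into (option_id, duration) segments.

    A segment ends when the option terminates, when the active option
    changes, or when the trajectory ends.
    """
    if len(option_ids) != len(terminated):
        raise ValueError("option_ids and terminated must have equal length")
    segments = []
    start = 0
    for t in range(len(option_ids)):
        last = t == len(option_ids) - 1
        # `last` is checked before indexing t + 1, so no out-of-range access.
        if terminated[t] or last or option_ids[t + 1] != option_ids[t]:
            segments.append((option_ids[start], t - start + 1))
            start = t + 1
    return segments


def option_stats(option_ids, terminated):
    """Summarise temporal abstraction for one trajectory."""
    segments = option_segments(option_ids, terminated)
    if not segments:
        raise ValueError("trajectory is empty")
    durations = [d for _, d in segments]
    return {
        "num_segments": len(segments),
        "mean_duration": sum(durations) / len(durations),
        # Fraction of steps on which the active option terminated.
        "termination_rate": sum(bool(x) for x in terminated) / len(terminated),
        # Fraction of segments that lasted a single step.
        "single_step_fraction": sum(d == 1 for d in durations) / len(durations),
    }
```

A mean duration close to 1, or a single-step fraction close to 1, would indicate that options have collapsed to primitive actions, which is the failure mode the referee raises.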

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed feedback. We have addressed the major comment regarding the NetHack experiments by incorporating additional analyses in the revised manuscript to better substantiate the preservation of hierarchical benefits.

Point-by-point responses
  1. Referee: [Abstract and NetHack experiments] The central claim that SOL surpasses flat agents on NetHack via preserved hierarchical benefits after 30B frames (abstract) depends on options retaining meaningful temporal abstraction rather than collapsing to single-step primitives. The manuscript reports no metrics on option termination rates, average durations, or intra-option policy complexity in the NetHack results; without these, the performance gains could be attributable to the scalable infrastructure alone.

    Authors: We agree that the absence of these specific metrics in the original manuscript leaves room for the interpretation that performance gains could stem primarily from the scalable infrastructure. To directly address this, the revised manuscript now includes option termination rates, average option durations, and measures of intra-option policy complexity for the NetHack results. These additions show that options maintain durations substantially longer than single steps and exhibit non-trivial intra-option behavior even after 30 billion frames, supporting that the hierarchical structure contributes meaningfully to the observed advantages over flat agents and the positive scaling trends. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical scaling results rest on direct measurements

Full rationale

The paper proposes the SOL algorithm to scale online hierarchical RL and validates it via large-scale empirical training (30B frames on NetHack, 35x throughput gains over prior hierarchical methods). Claims of surpassing flat agents and positive scaling are grounded in reported performance metrics and throughput benchmarks rather than any derivation that reduces to fitted parameters, self-definitions, or self-citations. No equations or load-bearing steps in the abstract or described approach exhibit the enumerated circular patterns; the work is self-contained against external benchmarks and open-sourced code.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; detailed ledger requires the full manuscript.

pith-pipeline@v0.9.0 · 5658 in / 972 out tokens · 44369 ms · 2026-05-18T20:00:54.315287+00:00 · methodology



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Goal-Conditioned Agents that Learn Everything All at Once

    cs.LG 2026-05 unverdicted novelty 6.0

    LEO enables efficient all-goals learning in goal-conditioned RL by jointly predicting for all goals in one network pass, yielding >250x speedup over relabelling and better performance on Craftax.

  2. Hierarchical Behaviour Spaces

    cs.AI 2026-04 unverdicted novelty 6.0

    Hierarchical Behaviour Spaces uses linear combinations of reward functions to induce expressive behavior spaces in hierarchical RL, yielding strong performance on NetHack primarily through better exploration rather th...
