pith. machine review for the scientific record.

arxiv: 2605.14211 · v1 · submitted 2026-05-14 · 💻 cs.AI · cs.LG

Recognition: 2 Lean theorem links

ASH: Agents that Self-Hone via Embodied Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:49 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords: embodied learning · inverse dynamics model · self-improvement · long-horizon tasks · unlabeled video · game environments · agentic systems

The pith

ASH learns long-horizon policies in complex games by training an inverse dynamics model on its own trajectories to label unlabeled internet videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ASH as an agentic system that acquires embodied skills in long-horizon environments without hand-engineered rewards or expert-labeled demonstrations. When the agent stalls, it trains an inverse dynamics model solely on its self-generated trajectories and applies that model to extract action supervision from relevant internet video clips. Unsupervised techniques further identify and retain key moments from large-scale video as long-term memory. This loop enables sustained progress across multi-hour tasks where standard behavioral cloning and retrieval baselines plateau.
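
Read procedurally, the loop reduces to a short control flow. Below is a minimal sketch in Python of how such a loop could look; every helper passed in (collect_rollout, train_idm, retrieve_clips, frame_pairs, bc_update, extract_key_moments) is a hypothetical stand-in for machinery the paper does not expose, not its actual API.

    # Hedged pseudocode of the self-improvement loop as this review reads it.
    # All injected helpers are hypothetical stand-ins, not the paper's interface.
    def self_hone(policy, env, video_corpus, memory, *, collect_rollout,
                  train_idm, retrieve_clips, frame_pairs, bc_update,
                  extract_key_moments, stuck_threshold=2000):
        trajectories = []
        while not env.done():
            trajectories.append(collect_rollout(policy, env))   # self-generated (obs, action) pairs
            if env.steps_since_progress() < stuck_threshold:
                continue                                        # still progressing; keep acting
            idm = train_idm(trajectories)                       # predicts the action between consecutive frames
            clips = retrieve_clips(video_corpus, env.current_context())
            labeled = [(o, idm(o, o_next)) for o, o_next in frame_pairs(clips)]
            policy = bc_update(policy, labeled)                 # behavioral-cloning-style update on pseudo-labels
            memory.extend(extract_key_moments(clips))           # unsupervised key moments as long-term memory
        return policy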

Core claim

ASH reaches an average of 11.2 out of 12 milestones in Pokemon Emerald and 9.9 out of 12 in The Legend of Zelda by repeatedly training an inverse dynamics model on its own noisy trajectories and using the model to derive supervision signals from unlabeled internet video, while also storing unsupervised key moments as memory; the strongest baselines remain stuck at roughly 6 milestones in both environments.

What carries the argument

The self-improvement loop that trains an inverse dynamics model from the agent's own trajectories to label actions in internet video, paired with unsupervised extraction of key moments for long-term memory.

If this is right

  • The same self-honing loop can be applied to other long-horizon embodied tasks that lack dense rewards or expert data.
  • Agents can bootstrap policies from web-scale unlabeled video once they generate enough of their own trajectories to train a usable IDM.
  • Unsupervised key-moment retention enables planning over multi-hour horizons without explicit state tracking.
  • Performance gaps versus baselines widen as task length increases because self-generated labels keep the policy advancing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the IDM generalizes across visual domains, the method could transfer from game video to real-world robot footage without additional annotation.
  • Scaling the volume of internet video or the number of self-improvement cycles could further raise the fraction of milestones reached.
  • The approach suggests that internet video plus self-generated data forms a sufficient training signal for many sequential decision problems once an initial exploration policy exists.

Load-bearing premise

An inverse dynamics model trained only on the agent's own noisy self-generated trajectories will produce sufficiently accurate action labels when applied to unrelated low-quality internet video clips.

What would settle it

Train the IDM on ASH's own trajectories, then measure whether policy performance keeps improving across successive cycles of video-derived supervision or instead stalls at milestone counts comparable to the strongest baseline.
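
Phrased as code, the test is small. A minimal sketch, assuming a hypothetical run_cycle callable that performs one full bootstrap cycle (retrain the IDM, relabel video, update the policy) and returns the milestone count reached afterwards; the plateau tolerance is illustrative:

    # Settling experiment: does milestone progress keep rising across cycles?
    # `run_cycle` and `tolerance` are hypothetical, not from the paper.
    def plateau_check(run_cycle, n_cycles=5, tolerance=0.5):
        milestones = [run_cycle(i) for i in range(n_cycles)]
        gains = [b - a for a, b in zip(milestones, milestones[1:])]
        plateaued = all(g <= tolerance for g in gains)
        return milestones, plateaued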

Figures

Figures reproduced from arXiv: 2605.14211 by Benjamin Schneider, Sun Sun, Victor Zhong, Xavier Schneider.

Figure 1: ASH self-improves over the course of a multi-hour playthrough by retrieving and learning …

Figure 2: Agent progress in Pokémon Emerald (top) and The Legend of Zelda (bottom), measured using milestone completion rates. ASH is able to adapt and continue progressing throughout the 8-hour gameplay period. While all methods are able to complete early milestones, only ASH can adapt to new areas, objectives, and mechanics. See Appendix C for standard deviation.

Figure 3: Outside (top) vs. inside (bottom) of the Zelda castle: offline policies collapse on this dynamics shift; ASH bootstraps and continues. Self-improvement is necessary for sustained progression. Over the 8-hour evaluation, ASH reaches milestone 12 in both environments, while no baseline exceeds milestone 8 in Pokémon or 6 in Zelda. VPT and offline BC plateau once the games introduce dynamics that are under…

Figure 4: (Left) Component ablation on Pokémon Emerald: each addition (long-term memory, dynamic bootstrapping) yields a clear gain in milestones completed per GPU hour of online training. Shaded regions are one standard deviation over 4 trajectories per method. (Right) IDM accuracy across bootstraps, evaluated on a test set. The dashed line is the pre-bootstrap initialization checkpoint. Across both environments, e…

Figure 5: Offline replay of the final ASH checkpoint vs. the original online run. Catastrophic forgetting is a phenomenon in lifelong learning where an agent will forget previously known skills and knowledge when its policy is updated [47]. The result is an agent that can progress through the latter stages of an environment but can no longer accomplish early milestones. We examine whether ASH's final policy is ab…

Figure 6: Dynamic bootstrapping example. To complete milestone 2, the player must rescue the Professor from a wild Zigzagoon (Panel 1). To accomplish this, the player must use their starter Pokémon to defeat the Zigzagoon in battle (Panel 2). However, ASH's initial policy does not know how to use the battle interface to command their Pokémon. After being stuck for ∆ steps (20 minutes), ASH dynamically bootstraps, an…

Figure 7: Long-term memory example. When the player arrives in Oldale Town (Panel 1), they are presented with 3 possible next paths. Option A: The player heads north to Route 103 to meet their rival, May. This is the correct choice if the player has just obtained their starter Pokémon from the Professor and been tasked with bringing May back to the lab. Option B: The player has already met May, and should head back …

Figure 8: Visualization of 3 HDBSCAN [40] clusters, as well as 500 uniformly sampled outlier points reduced to 2 dimensions via principal component analysis. Each of these clusters represents a key moment in Pokémon Emerald: choosing a starter Pokémon (blue), saving the Professor (orange), and challenging a gym leader (green). Grey dots are outliers that are not assigned to a cluster by HDBSCAN [40]. Ideally, they ar…
Original abstract

Long-horizon embodied tasks remain a fundamental challenge in AI, as current methods rely on hand-engineered rewards or action-labeled demonstrations, neither of which scales. We introduce ASH, an agentic system that learns an embodied policy from unlabeled, noisy internet video, without reward shaping or expert annotation. ASH follows a self-improvement loop; when it gets stuck, ASH learns an Inverse Dynamics Model (IDM) from its own trajectories, and uses its IDM to extract supervision from relevant internet video. ASH uses unsupervised learning to identify key moments from large-scale internet video and retains them as long-term memory -- allowing it to tackle long-horizon problems. We evaluate ASH on two complementary environments demanding multi-hour planning: Pokemon Emerald, a turn-based RPG, and The Legend of Zelda: The Minish Cap, a real-time action-adventure game. In both games, behavioral cloning, retrieval-augmented and zero-shot foundation-model baselines plateau, while ASH sustains progression across our 8-hour evaluation. ASH reaches an average of $11.2/12$ milestones in Pokemon Emerald and $9.9/12$ in Legend of Zelda, while the strongest baseline gets stuck in both environments at an average of $6.5/12$ and $6.0/12$ milestones, respectively. We demonstrate that self-improving agents are a scalable recipe for long-horizon embodied learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ASH, an agentic system for long-horizon embodied learning that follows a self-improvement loop: when stuck, it trains an Inverse Dynamics Model (IDM) on its own trajectories and applies the IDM to extract action labels from unlabeled noisy internet video for supervision, while using unsupervised learning to identify key moments as long-term memory. Evaluated on Pokemon Emerald and Legend of Zelda, ASH achieves average milestone progress of 11.2/12 and 9.9/12 respectively, while baselines plateau at 6.5/12 and 6.0/12.

Significance. If the performance gains can be shown to stem from the IDM-based self-honing mechanism with proper validation, the work would represent a meaningful step toward scalable embodied agents that leverage abundant internet video without hand-engineered rewards or expert annotations, addressing a core limitation in current long-horizon task learning.

major comments (2)
  1. [Abstract] The central performance claims (11.2/12 milestones in Pokemon Emerald, 9.9/12 in Zelda) are reported without error bars, ablation studies isolating the IDM supervision component, or details on filtering noisy video, making it impossible to determine whether the self-honing loop drives the gains over baselines.
  2. [Abstract] The method's validity hinges on the IDM, trained only on the agent's initially random or stuck self-trajectories, producing accurate action labels on unrelated noisy internet video despite domain shifts in quality, frame rate, perspective, and style; however, no quantitative IDM accuracy metrics on held-out external clips are provided.
minor comments (1)
  1. The abstract would benefit from a concise definition of the 12 milestones and how they are evaluated across the 8-hour runs to improve clarity and reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the validation of ASH's self-honing mechanism.

point-by-point responses
  1. Referee: [Abstract] The central performance claims (11.2/12 milestones in Pokemon Emerald, 9.9/12 in Zelda) are reported without error bars, ablation studies isolating the IDM supervision component, or details on filtering noisy video, making it impossible to determine whether the self-honing loop drives the gains over baselines.

    Authors: We agree that error bars, targeted ablations, and filtering details are essential to substantiate the claims. In the revised manuscript we will report error bars over multiple independent runs for all milestone-progress metrics. We will add ablation studies that isolate the IDM-based internet-video supervision (comparing full ASH against variants without the IDM loop or without video labels) and will expand the methods section with the precise filtering criteria and preprocessing steps applied to noisy internet clips. These additions will directly demonstrate that the self-honing loop accounts for the observed gains over baselines. revision: yes

  2. Referee: [Abstract] The method's validity hinges on the IDM, trained only on the agent's initially random or stuck self-trajectories, producing accurate action labels on unrelated noisy internet video despite domain shifts in quality, frame rate, perspective, and style; however, no quantitative IDM accuracy metrics on held-out external clips are provided.

    Authors: We acknowledge the importance of quantifying IDM generalization. The revised manuscript will include new quantitative results measuring IDM action-prediction accuracy on held-out external video clips drawn from the same internet sources, explicitly reporting performance under the domain shifts in quality, frame rate, perspective, and visual style. These metrics will be presented alongside the end-to-end results to confirm that the IDM trained on agent trajectories can reliably label noisy video for supervision. revision: yes
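
For concreteness, the held-out IDM check promised above reduces to a single metric. A minimal sketch, in which idm, the frame-pair arrays, and the reference labels are hypothetical inputs rather than anything the paper specifies:

    import numpy as np

    # Action-prediction accuracy of an IDM on held-out external video clips.
    # `idm` maps a frame pair to a discrete action id; all inputs are hypothetical.
    def idm_holdout_accuracy(idm, frames, next_frames, true_actions):
        preds = [idm(o, o_next) for o, o_next in zip(frames, next_frames)]
        return float(np.mean(np.array(preds) == np.array(true_actions)))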

Circularity Check

0 steps flagged

No significant circularity in ASH's procedural self-improvement loop

full rationale

The paper presents ASH as an agentic system following a self-improvement loop: learning an IDM from its own trajectories to extract supervision from internet video, combined with unsupervised key-moment identification. This is described as a procedural algorithm; there are no mathematical derivations, equations, or fitted parameters that reduce predictions to inputs by construction. Performance is evaluated empirically via milestone completion in games, not through self-referential claims. No load-bearing self-citations or uniqueness theorems are invoked. The central claim rests on empirical results rather than tautological definitions, so the argument is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that internet video contains recoverable action information that an IDM trained on self-generated trajectories can extract, plus the assumption that unsupervised key-moment detection yields useful long-term memory for multi-hour planning.

axioms (1)
  • domain assumption: Internet videos contain extractable supervision for embodied actions when paired with an IDM trained on the agent's own trajectories.
    Invoked to justify using unlabeled video as a training signal without expert annotation.
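
Figure 8 suggests how the second half of this assumption is operationalized: frame embeddings are clustered with HDBSCAN, and dense clusters are treated as candidate key moments. A minimal sketch under that reading, using scikit-learn, with random vectors standing in for real frame embeddings and illustrative (not the paper's) cluster parameters:

    import numpy as np
    from sklearn.cluster import HDBSCAN
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(2000, 64))      # placeholder frame embeddings

    # Dense clusters become candidate key moments; label -1 marks outliers.
    labels = HDBSCAN(min_cluster_size=25).fit_predict(embeddings)
    key_moments = sorted(set(labels) - {-1})

    # 2-D projection for inspection, mirroring the paper's PCA visualization.
    coords = PCA(n_components=2).fit_transform(embeddings)
    print(f"{len(key_moments)} candidate key-moment clusters")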

pith-pipeline@v0.9.0 · 5548 in / 1427 out tokens · 28925 ms · 2026-05-15T02:49:32.785235+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 11 internal anchors

  1. [1] Wei-Chieh Huang, Weizhi Zhang, Yueqing Liang, Yuanchen Bei, Yankai Chen, Tao Feng, Xinyu Pan, Zhen Tan, Yu Wang, Tianxin Wei, Shanglin Wu, Ruiyao Xu, Liangwei Yang, Rui Yang, Wooseong Yang, Chin-Yuan Yeh, Hanrong Zhang, Haozhen Zhang, Siqi Zhu, Henry Peng Zou, Wanjia Zhao, Song Wang, Wujiang Xu, Zixuan Ke, Zheng Hui, Dawei Li, Yaozu Wu, Langzhou He, Chen …
  2. [2] Yafei Hu, Quanting Xie, Vidhi Jain, Jonathan Francis, Jay Patrikar, Nikhil Keetha, Seungchan Kim, Yaqi Xie, Tianyi Zhang, Hao-Shu Fang, Shibo Zhao, Shayegan Omidshafiei, Dong-Ki Kim, Ali akbar Agha-mohammadi, Katia Sycara, Matthew Johnson-Roberson, Dhruv Batra, Xiaolong Wang, Sebastian Scherer, Chen Wang, Zsolt Kira, Fei Xia, and Yonatan Bisk. Toward gene…
  3. [3] Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 4950–4957. International Joint Conferences on Artificial Intelligence Organization, 7 2018. doi: 10.24963/ijcai.2018/
  4. [4] URL https://doi.org/10.24963/ijcai.2018/687
  5. [5] Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (VPT): Learning to act by watching unlabeled online videos, 2022. URL https://arxiv.org/abs/2206.11795
  6. [6] Stephane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 661–668, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR. URL http…
  7. [7] Shalev Lifshitz, Keiran Paster, Harris Chan, Jimmy Ba, and Sheila A. McIlraith. STEVE-1: A generative model for text-to-behavior in Minecraft. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=YkBDJWerKg
  8. [8] Marco Pleines, Daniel Addis, David Rubinstein, Frank Zimmer, Mike Preuss, and Peter Whidden. Pokemon Red via reinforcement learning, 2025
  9. [9] Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models, 2025. URL https://arxiv.org/abs/2509.24527
  10. [10] Yichen Zhu, Zhicai Ou, Xiaofeng Mou, and Jian Tang. Retrieval-augmented embodied agents. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17985–17995, Los Alamitos, CA, USA, June 2024. IEEE Computer Society. doi: 10.1109/CVPR52733.2024.01703. URL https://doi.ieeecomputersociety.org/10.1109/CVPR52733.2024.01703
  11. [11] Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5
  12. [12] Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, and Liqiang Nie. Optimus-2: Multimodal Minecraft agent with goal-observation-action conditioned policy. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2025
  13. [13] Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation, 2018. URL https://arxiv.org/abs/1805.01954
  14. [14] Ashley Edwards, Himanshu Sahni, Yannick Schroecker, and Charles Isbell. Imitating latent policies from observation. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1755–1763. PMLR, 09–15 Jun 2019. URL https://proceedi…
  15. [15] Dominik Schmidt and Minqi Jiang. Learning to act without actions. In The Twelfth International Conference on Learning Representations (ICLR), 2024
  16. [16] Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Maria Elisabeth Bechtle, Feryal Behbahani, Stephanie C.Y. Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando…
  17. [17] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS '2…
  18. [18] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022
  19. [19] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=vAElhFcKW6
  20. [20] Peter Conway Humphreys, Arthur Guez, Olivier Tieleman, Laurent Sifre, Theophane Weber, and Timothy P Lillicrap. Large-scale retrieval for reinforcement learning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=Ya9lATuQ3gg
  21. [21] Anirudh Goyal, Abram Friesen, Andrea Banino, Theophane Weber, Nan Rosemary Ke, Adrià Puigdomènech Badia, Arthur Guez, Mehdi Mirza, Peter C Humphreys, Ksenia Konyushova, Michal Valko, Simon Osindero, Timothy Lillicrap, Nicolas Heess, and Charles Blundell. Retrieval-augmented reinforcement learning. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csab…
  22. [22] Danyang Zhang, Lu Chen, Situo Zhang, Hongshen Xu, Zihan Zhao, and Kai Yu. Large language models are semi-parametric reinforcement learning agents. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=ZcJa1R6j3v
  23. [23] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. STaR: Bootstrapping reasoning with reasoning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=_3ELRdg2sgI
  24. [24] Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, and Nando de Freitas. Reinforced self-training (ReST) for language modeling, 2023. URL https://arxiv.org/abs/2308.08998
  25. [25] Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 5366–5376, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964
  26. [26] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm, 2017. URL https://arxiv.org/abs/1712.01815
  27. [27] AutoRT: Embodied foundation models for large scale orchestration of robotic agents, 2024 · Michael Ahn, Debidatta Dwibedi, Chelsea Finn, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Karol Hausman, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Sean Kirmani, Isabel Leal, Edward Lee, Sergey Levine, Yao Lu, Isabel Leal, Sharath Maddineni, Kanishka Rao, Dorsa Sadigh, Pannag Sanketi, Pierre Sermanet, Quan Vuong, Stefan Welker, Fei Xia, Te…
  28. [28] Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. MineDojo: Building open-ended embodied agents with internet-scale knowledge. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL https://openreview.net/forum?id=rc8…
  29. [29] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling,
  30. [30] URL https://arxiv.org/abs/2106.01345
  31. [31] Yuhuai Wu, Markus Norman Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing transformers. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=TrjbxzRcnf-
  32. [32] Aydar Bulatov, Yuri Kuratov, and Mikhail Burtsev. Recurrent memory transformer. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=Uynr3iPhksa
  33. [33] Improving language models by retrieving from trillions of tokens · Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego De Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, …
  34. [34] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models, 2023. URL https://arxiv.org/abs/2305.16291
  35. [35] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning, 2013. URL https://arxiv.org/abs/1312.5602
  36. [36] Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Joseph Dudzik, Junyoung Chung, David Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, L. Sifre, Trevor Cai, John P. Agapiou, Max Jaderberg, Alexander Sasha Vezhnevets, Rémi Leblond, Tobias Pohlen, Valentin D…
  37. [37] OpenAI: Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique P. d. O. Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ily…
  38. [38] Seth Karten, Jake Grigsby, Stephanie Milani, Kiran Vodrahalli, Amy Zhang, Fei Fang, Yuke Zhu, and Chi Jin. The PokeAgent challenge: Competitive and long-context learning at scale. In NeurIPS Competition Track, April 2025
  39. [39] ClaudePlaysPokemon. https://www.twitch.tv/claudeplayspokemon, 2026. [Accessed 01-05-2026]
  40. [40] SIMA team, Adrian Bolton, Alexander Lerchner, Alexandra Cordell, Alexandre Moufarek, Andrew Bolt, Andrew Lampinen, Anna Mitenkova, Arne Olav Hallingstad, Bojan Vujatovic, Bonnie Li, Cong Lu, Daan Wierstra, Daniel P. Sawyer, Daniel Slater, David Reichert, Davide Vercelli, Demis Hassabis, Drew A. Hudson, Duncan Williams, Ed Hirst, Fabio Pardo, Felix Hill, F…
  41. [41] Eric Hambro, Roberta Raileanu, Danielle Rothermel, Vegard Mella, Tim Rocktäschel, Heinrich Küttler, and Naila Murray. Dungeons and data: A large-scale NetHack dataset, 2023. URL https://arxiv.org/abs/2211.00539
  42. [42] Ricardo J. G. B. Campello, Davoud Moulavi, and Joerg Sander. Density-based clustering based on hierarchical density estimates. In Jian Pei, Vincent S. Tseng, Longbing Cao, Hiroshi Motoda, and Guandong Xu, editors, Advances in Knowledge Discovery and Data Mining, pages 160–172, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg. ISBN 978-3-642-37456-2
  43. [43] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training, 2023. URL https://arxiv.org/abs/2303.15343
  44. [44] Qwen3 Technical Report · An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, …
  45. [45] DINOv2: Learning Robust Visual Features without Supervision · Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick La…
  46. [46] Bulbapedia. Walkthrough: Pokémon Emerald, Jun 2025. URL https://bulbapedia.bulbagarden.net/wiki/Walkthrough:Pok%C3%A9mon_Emerald
  47. [47] Zelda Dungeon. The Minish Cap walkthrough. https://www.zeldadungeon.net/the-minish-cap-walkthrough/, 2026. [Accessed 30-04-2026]
  48. [48] Sebastian Raschka, Joshua Patterson, and Corey Nolet. Machine learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence. arXiv preprint arXiv:2002.04803, 2020
  49. [49] Zhenyi Wang, Enneng Yang, Li Shen, and Heng Huang. A comprehensive survey of forgetting in deep learning beyond continual learning. IEEE Trans. Pattern Anal. Mach. Intell., 47(3):1464–1483, March
  50. [50] ISSN 0162-8828. doi: 10.1109/TPAMI.2024.3498346. URL https://doi.org/10.1109/TPAMI.2024.3498346
  51. [51] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URL https://arxiv.org/abs/1711.05101
  52. [52] Are all frames key moments?
  53. [53] Do all frames correspond to the same key moment? If the answer to both questions is yes, we consider the cluster to be a key moment. Based on this analysis, we report that 48% of clusters identified by HDBSCAN [40] correspond to key moments. The effectiveness of HDBSCAN in identifying these key moments helps explain the performance increase observed in Sect…