pith. machine review for the scientific record.

arxiv: 2605.14211 · v1 · submitted 2026-05-14 · 💻 cs.AI · cs.LG

Recognition: 2 Lean theorem links

ASH: Agents that Self-Hone via Embodied Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:49 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords: embodied learning · inverse dynamics model · self-improvement · long-horizon tasks · unlabeled video · game environments · agentic systems

The pith

ASH learns long-horizon policies in complex games by training an inverse dynamics model on its own trajectories to label unlabeled internet videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ASH as an agentic system that acquires embodied skills in long-horizon environments without hand-engineered rewards or expert-labeled demonstrations. When the agent stalls, it trains an inverse dynamics model solely on its self-generated trajectories and applies that model to extract action supervision from relevant internet video clips. Unsupervised techniques further identify and retain key moments from large-scale video as long-term memory. This loop enables sustained progress across multi-hour tasks where standard behavioral cloning and retrieval baselines plateau.
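
Read procedurally, the loop reduces to a short control flow. Below is a minimal sketch in Python of how such a loop could look; every helper passed in (collect_rollout, train_idm, retrieve_clips, frame_pairs, bc_update, extract_key_moments) is a hypothetical stand-in for machinery the paper does not expose, not its actual API.

    # Hedged pseudocode of the self-improvement loop as this review reads it.
    # All injected helpers are hypothetical stand-ins, not the paper's interface.
    def self_hone(policy, env, video_corpus, memory, *, collect_rollout,
                  train_idm, retrieve_clips, frame_pairs, bc_update,
                  extract_key_moments, stuck_threshold=2000):
        trajectories = []
        while not env.done():
            trajectories.append(collect_rollout(policy, env))   # self-generated (obs, action) pairs
            if env.steps_since_progress() < stuck_threshold:
                continue                                        # still progressing; keep acting
            idm = train_idm(trajectories)                       # predicts the action between consecutive frames
            clips = retrieve_clips(video_corpus, env.current_context())
            labeled = [(o, idm(o, o_next)) for o, o_next in frame_pairs(clips)]
            policy = bc_update(policy, labeled)                 # behavioral-cloning-style update on pseudo-labels
            memory.extend(extract_key_moments(clips))           # unsupervised key moments as long-term memory
        return policy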

Core claim

ASH reaches an average of 11.2 out of 12 milestones in Pokemon Emerald and 9.9 out of 12 in The Legend of Zelda by repeatedly training an inverse dynamics model on its own noisy trajectories and using the model to derive supervision signals from unlabeled internet video, while also storing unsupervised key moments as memory; the strongest baselines remain stuck at roughly 6 milestones in both environments.

What carries the argument

The self-improvement loop that trains an inverse dynamics model from the agent's own trajectories to label actions in internet video, paired with unsupervised extraction of key moments for long-term memory.

If this is right

  • The same self-honing loop can be applied to other long-horizon embodied tasks that lack dense rewards or expert data.
  • Agents can bootstrap policies from web-scale unlabeled video once they generate enough of their own trajectories to train a usable IDM.
  • Unsupervised key-moment retention enables planning over multi-hour horizons without explicit state tracking.
  • Performance gaps versus baselines widen as task length increases because self-generated labels keep the policy advancing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the IDM generalizes across visual domains, the method could transfer from game video to real-world robot footage without additional annotation.
  • Scaling the volume of internet video or the number of self-improvement cycles could further raise the fraction of milestones reached.
  • The approach suggests that internet video plus self-generated data forms a sufficient training signal for many sequential decision problems once an initial exploration policy exists.

Load-bearing premise

An inverse dynamics model trained only on the agent's own noisy self-generated trajectories will produce sufficiently accurate action labels when applied to unrelated low-quality internet video clips.

What would settle it

Train the IDM on ASH's own trajectories, then measure whether policy performance keeps improving across successive cycles of video-derived supervision or instead stalls at milestone counts comparable to the strongest baseline.
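
Phrased as code, the test is small. A minimal sketch, assuming a hypothetical run_cycle callable that performs one full bootstrap cycle (retrain the IDM, relabel video, update the policy) and returns the milestone count reached afterwards; the plateau tolerance is illustrative:

    # Settling experiment: does milestone progress keep rising across cycles?
    # `run_cycle` and `tolerance` are hypothetical, not from the paper.
    def plateau_check(run_cycle, n_cycles=5, tolerance=0.5):
        milestones = [run_cycle(i) for i in range(n_cycles)]
        gains = [b - a for a, b in zip(milestones, milestones[1:])]
        plateaued = all(g <= tolerance for g in gains)
        return milestones, plateaued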

Figures

Figures reproduced from arXiv: 2605.14211 by Benjamin Schneider, Sun Sun, Victor Zhong, Xavier Schneider.

Figure 1: ASH self-improves over the course of a multi-hour playthrough by retrieving and learning …

Figure 2: Agent progress in Pokémon Emerald (top) and The Legend of Zelda (bottom), measured using milestone completion rates. ASH is able to adapt and continue progressing throughout the 8-hour gameplay period. While all methods are able to complete early milestones, only ASH can adapt to new areas, objectives, and mechanics. See Appendix C for standard deviation.

Figure 3: Outside (top) vs. inside (bottom) of the Zelda castle: offline policies collapse on this dynamics shift; ASH bootstraps and continues. Self-improvement is necessary for sustained progression. Over the 8-hour evaluation, ASH reaches milestone 12 in both environments, while no baseline exceeds milestone 8 in Pokémon or 6 in Zelda. VPT and offline BC plateau once the games introduce dynamics that are under…

Figure 4: (Left) Component ablation on Pokémon Emerald: each addition (long-term memory, dynamic bootstrapping) yields a clear gain in milestones completed per GPU hour of online training. Shaded regions are one standard deviation over 4 trajectories per method. (Right) IDM accuracy across bootstraps, evaluated on a test set. The dashed line is the pre-bootstrap initialization checkpoint. Across both environments, e…

Figure 5: Offline replay of the final ASH checkpoint vs. the original online run. Catastrophic forgetting is a phenomenon in lifelong learning where an agent will forget previously known skills and knowledge when its policy is updated [47]. The result is an agent that can progress through the latter stages of an environment but can no longer accomplish early milestones. We examine whether ASH's final policy is ab…

Figure 6: Dynamic bootstrapping example. To complete milestone 2, the player must rescue the Professor from a wild Zigzagoon (Panel 1). To accomplish this, the player must use their starter Pokémon to defeat the Zigzagoon in battle (Panel 2). However, ASH's initial policy does not know how to use the battle interface to command their Pokémon. After being stuck for ∆ steps (20 minutes), ASH dynamically bootstraps, an…

Figure 7: Long-term memory example. When the player arrives in Oldale Town (Panel 1), they are presented with 3 possible next paths. Option A: The player heads north to Route 103 to meet their rival, May. This is the correct choice if the player has just obtained their starter Pokémon from the Professor and been tasked with bringing May back to the lab. Option B: The player has already met May, and should head back …

Figure 8: Visualization of 3 HDBSCAN [40] clusters, as well as 500 uniformly sampled outlier points reduced to 2 dimensions via principal component analysis. Each of these clusters represents a key moment in Pokémon Emerald: choosing a starter Pokémon (blue), saving the Professor (orange), and challenging a gym leader (green). Grey dots are outliers that are not assigned to a cluster by HDBSCAN [40]. Ideally, they ar…
Original abstract

Long-horizon embodied tasks remain a fundamental challenge in AI, as current methods rely on hand-engineered rewards or action-labeled demonstrations, neither of which scales. We introduce ASH, an agentic system that learns an embodied policy from unlabeled, noisy internet video, without reward shaping or expert annotation. ASH follows a self-improvement loop; when it gets stuck, ASH learns an Inverse Dynamics Model (IDM) from its own trajectories, and uses its IDM to extract supervision from relevant internet video. ASH uses unsupervised learning to identify key moments from large-scale internet video and retains them as long-term memory -- allowing it to tackle long-horizon problems. We evaluate ASH on two complementary environments demanding multi-hour planning: Pokemon Emerald, a turn-based RPG, and The Legend of Zelda: The Minish Cap, a real-time action-adventure game. In both games, behavioral cloning, retrieval-augmented and zero-shot foundation-model baselines plateau, while ASH sustains progression across our 8-hour evaluation. ASH reaches an average of $11.2/12$ milestones in Pokemon Emerald and $9.9/12$ in Legend of Zelda, while the strongest baseline gets stuck in both environments at an average of $6.5/12$ and $6.0/12$ milestones, respectively. We demonstrate that self-improving agents are a scalable recipe for long-horizon embodied learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ASH, an agentic system for long-horizon embodied learning that follows a self-improvement loop: when stuck, it trains an Inverse Dynamics Model (IDM) on its own trajectories and applies the IDM to extract action labels from unlabeled noisy internet video for supervision, while using unsupervised learning to identify key moments as long-term memory. Evaluated on Pokemon Emerald and Legend of Zelda, ASH achieves average milestone progress of 11.2/12 and 9.9/12 respectively, while baselines plateau at 6.5/12 and 6.0/12.

Significance. If the performance gains can be shown to stem from the IDM-based self-honing mechanism with proper validation, the work would represent a meaningful step toward scalable embodied agents that leverage abundant internet video without hand-engineered rewards or expert annotations, addressing a core limitation in current long-horizon task learning.

major comments (2)
  1. [Abstract] The central performance claims (11.2/12 milestones in Pokemon Emerald, 9.9/12 in Zelda) are reported without error bars, ablation studies isolating the IDM supervision component, or details on filtering noisy video, making it impossible to determine whether the self-honing loop drives the gains over baselines.
  2. [Abstract] The method's validity hinges on the IDM, trained only on the agent's initially random or stuck self-trajectories, producing accurate action labels on unrelated noisy internet video despite domain shifts in quality, frame rate, perspective, and style; however, no quantitative IDM accuracy metrics on held-out external clips are provided.
minor comments (1)
  1. The abstract would benefit from a concise definition of the 12 milestones and how they are evaluated across the 8-hour runs to improve clarity and reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the validation of ASH's self-honing mechanism.

point-by-point responses
  1. Referee: [Abstract] The central performance claims (11.2/12 milestones in Pokemon Emerald, 9.9/12 in Zelda) are reported without error bars, ablation studies isolating the IDM supervision component, or details on filtering noisy video, making it impossible to determine whether the self-honing loop drives the gains over baselines.

    Authors: We agree that error bars, targeted ablations, and filtering details are essential to substantiate the claims. In the revised manuscript we will report error bars over multiple independent runs for all milestone-progress metrics. We will add ablation studies that isolate the IDM-based internet-video supervision (comparing full ASH against variants without the IDM loop or without video labels) and will expand the methods section with the precise filtering criteria and preprocessing steps applied to noisy internet clips. These additions will directly demonstrate that the self-honing loop accounts for the observed gains over baselines. revision: yes

  2. Referee: [Abstract] The method's validity hinges on the IDM, trained only on the agent's initially random or stuck self-trajectories, producing accurate action labels on unrelated noisy internet video despite domain shifts in quality, frame rate, perspective, and style; however, no quantitative IDM accuracy metrics on held-out external clips are provided.

    Authors: We acknowledge the importance of quantifying IDM generalization. The revised manuscript will include new quantitative results measuring IDM action-prediction accuracy on held-out external video clips drawn from the same internet sources, explicitly reporting performance under the domain shifts in quality, frame rate, perspective, and visual style. These metrics will be presented alongside the end-to-end results to confirm that the IDM trained on agent trajectories can reliably label noisy video for supervision. revision: yes
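
For concreteness, the held-out IDM check promised above reduces to a single metric. A minimal sketch, in which idm, the frame-pair arrays, and the reference labels are hypothetical inputs rather than anything the paper specifies:

    import numpy as np

    # Action-prediction accuracy of an IDM on held-out external video clips.
    # `idm` maps a frame pair to a discrete action id; all inputs are hypothetical.
    def idm_holdout_accuracy(idm, frames, next_frames, true_actions):
        preds = [idm(o, o_next) for o, o_next in zip(frames, next_frames)]
        return float(np.mean(np.array(preds) == np.array(true_actions)))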

Circularity Check

0 steps flagged

No significant circularity in ASH's procedural self-improvement loop

full rationale

The paper presents ASH as an agentic system following a self-improvement loop: learning an IDM from its own trajectories to extract supervision from internet video, combined with unsupervised key-moment identification. This is described as a procedural algorithm; there are no mathematical derivations, equations, or fitted parameters that reduce predictions to inputs by construction. Performance is evaluated empirically via milestone completion in games, not through self-referential claims. No load-bearing self-citations or uniqueness theorems are invoked. The central claim rests on empirical results rather than tautological definitions, so the argument is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that internet video contains recoverable action information that an IDM trained on self-generated trajectories can extract, plus the assumption that unsupervised key-moment detection yields useful long-term memory for multi-hour planning.

axioms (1)
  • domain assumption: Internet videos contain extractable supervision for embodied actions when paired with an IDM trained on the agent's own trajectories.
    Invoked to justify using unlabeled video as a training signal without expert annotation.
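
Figure 8 suggests how the second half of this assumption is operationalized: frame embeddings are clustered with HDBSCAN, and dense clusters are treated as candidate key moments. A minimal sketch under that reading, using scikit-learn, with random vectors standing in for real frame embeddings and illustrative (not the paper's) cluster parameters:

    import numpy as np
    from sklearn.cluster import HDBSCAN
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(2000, 64))      # placeholder frame embeddings

    # Dense clusters become candidate key moments; label -1 marks outliers.
    labels = HDBSCAN(min_cluster_size=25).fit_predict(embeddings)
    key_moments = sorted(set(labels) - {-1})

    # 2-D projection for inspection, mirroring the paper's PCA visualization.
    coords = PCA(n_components=2).fit_transform(embeddings)
    print(f"{len(key_moments)} candidate key-moment clusters")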

pith-pipeline@v0.9.0 · 5548 in / 1427 out tokens · 28925 ms · 2026-05-15T02:49:32.785235+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 11 internal anchors

  1. [1] Wei-Chieh Huang, Weizhi Zhang, Yueqing Liang, Yuanchen Bei, Yankai Chen, Tao Feng, Xinyu Pan, Zhen Tan, Yu Wang, Tianxin Wei, Shanglin Wu, Ruiyao Xu, Liangwei Yang, Rui Yang, Wooseong Yang, Chin-Yuan Yeh, Hanrong Zhang, Haozhen Zhang, Siqi Zhu, Henry Peng Zou, Wanjia Zhao, Song Wang, Wujiang Xu, Zixuan Ke, Zheng Hui, Dawei Li, Yaozu Wu, Langzhou He, Chen …
  2. [2] Yafei Hu, Quanting Xie, Vidhi Jain, Jonathan Francis, Jay Patrikar, Nikhil Keetha, Seungchan Kim, Yaqi Xie, Tianyi Zhang, Hao-Shu Fang, Shibo Zhao, Shayegan Omidshafiei, Dong-Ki Kim, Ali akbar Agha-mohammadi, Katia Sycara, Matthew Johnson-Roberson, Dhruv Batra, Xiaolong Wang, Sebastian Scherer, Chen Wang, Zsolt Kira, Fei Xia, and Yonatan Bisk. Toward gene…
  3. [3] Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 4950–4957. International Joint Conferences on Artificial Intelligence Organization, 7 2018. doi: 10.24963/ijcai.2018/
  4. [4] URL https://doi.org/10.24963/ijcai.2018/687
  5. [5] Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (VPT): Learning to act by watching unlabeled online videos, 2022. URL https://arxiv.org/abs/2206.11795
  6. [6] Stephane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 661–668, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR. URL http…
  7. [7] Shalev Lifshitz, Keiran Paster, Harris Chan, Jimmy Ba, and Sheila A. McIlraith. STEVE-1: A generative model for text-to-behavior in Minecraft. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=YkBDJWerKg
  8. [8] Marco Pleines, Daniel Addis, David Rubinstein, Frank Zimmer, Mike Preuss, and Peter Whidden. Pokemon Red via reinforcement learning, 2025
  9. [9] Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models, 2025. URL https://arxiv.org/abs/2509.24527
  10. [10] Yichen Zhu, Zhicai Ou, Xiaofeng Mou, and Jian Tang. Retrieval-augmented embodied agents. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17985–17995, Los Alamitos, CA, USA, June 2024. IEEE Computer Society. doi: 10.1109/CVPR52733.2024.01703. URL https://doi.ieeecomputersociety.org/10.1109/CVPR52733.2024.01703
  11. [11] Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5
  12. [12] Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, and Liqiang Nie. Optimus-2: Multimodal Minecraft agent with goal-observation-action conditioned policy. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2025
  13. [13] Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation, 2018. URL https://arxiv.org/abs/1805.01954
  14. [14] Ashley Edwards, Himanshu Sahni, Yannick Schroecker, and Charles Isbell. Imitating latent policies from observation. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1755–1763. PMLR, 09–15 Jun 2019. URL https://proceedi…
  15. [15] Dominik Schmidt and Minqi Jiang. Learning to act without actions. In The Twelfth International Conference on Learning Representations (ICLR), 2024
  16. [16] Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Maria Elisabeth Bechtle, Feryal Behbahani, Stephanie C.Y. Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando…
  17. [17] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS '2…
  18. [18] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022
  19. [19] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=vAElhFcKW6
  20. [20] Peter Conway Humphreys, Arthur Guez, Olivier Tieleman, Laurent Sifre, Theophane Weber, and Timothy P Lillicrap. Large-scale retrieval for reinforcement learning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=Ya9lATuQ3gg
  21. [21] Anirudh Goyal, Abram Friesen, Andrea Banino, Theophane Weber, Nan Rosemary Ke, Adrià Puigdomènech Badia, Arthur Guez, Mehdi Mirza, Peter C Humphreys, Ksenia Konyushova, Michal Valko, Simon Osindero, Timothy Lillicrap, Nicolas Heess, and Charles Blundell. Retrieval-augmented reinforcement learning. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csab…
  22. [22] Danyang Zhang, Lu Chen, Situo Zhang, Hongshen Xu, Zihan Zhao, and Kai Yu. Large language models are semi-parametric reinforcement learning agents. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=ZcJa1R6j3v
  23. [23] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. STaR: Bootstrapping reasoning with reasoning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=_3ELRdg2sgI
  24. [24] Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, and Nando de Freitas. Reinforced self-training (ReST) for language modeling, 2023. URL https://arxiv.org/abs/2308.08998
  25. [25] Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 5366–5376, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964
  26. [26] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm, 2017. URL https://arxiv.org/abs/1712.01815
  27. [27] AutoRT: Embodied foundation models for large scale orchestration of robotic agents, 2024 · Michael Ahn, Debidatta Dwibedi, Chelsea Finn, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Karol Hausman, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Sean Kirmani, Isabel Leal, Edward Lee, Sergey Levine, Yao Lu, Isabel Leal, Sharath Maddineni, Kanishka Rao, Dorsa Sadigh, Pannag Sanketi, Pierre Sermanet, Quan Vuong, Stefan Welker, Fei Xia, Te…
  28. [28] Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. MineDojo: Building open-ended embodied agents with internet-scale knowledge. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL https://openreview.net/forum?id=rc8…
  29. [29] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling,
  30. [30] URL https://arxiv.org/abs/2106.01345
  31. [31] Yuhuai Wu, Markus Norman Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing transformers. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=TrjbxzRcnf-
  32. [32] Aydar Bulatov, Yuri Kuratov, and Mikhail Burtsev. Recurrent memory transformer. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=Uynr3iPhksa
  33. [33] Improving language models by retrieving from trillions of tokens · Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego De Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, …
  34. [34] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models, 2023. URL https://arxiv.org/abs/2305.16291
  35. [35] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning, 2013. URL https://arxiv.org/abs/1312.5602
  36. [36] Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Joseph Dudzik, Junyoung Chung, David Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, L. Sifre, Trevor Cai, John P. Agapiou, Max Jaderberg, Alexander Sasha Vezhnevets, Rémi Leblond, Tobias Pohlen, Valentin D…
  37. [37] OpenAI: Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique P. d. O. Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ily…
  38. [38] Seth Karten, Jake Grigsby, Stephanie Milani, Kiran Vodrahalli, Amy Zhang, Fei Fang, Yuke Zhu, and Chi Jin. The PokeAgent challenge: Competitive and long-context learning at scale. In NeurIPS Competition Track, April 2025
  39. [39] ClaudePlaysPokemon. https://www.twitch.tv/claudeplayspokemon, 2026. [Accessed 01-05-2026]
  40. [40] SIMA team, Adrian Bolton, Alexander Lerchner, Alexandra Cordell, Alexandre Moufarek, Andrew Bolt, Andrew Lampinen, Anna Mitenkova, Arne Olav Hallingstad, Bojan Vujatovic, Bonnie Li, Cong Lu, Daan Wierstra, Daniel P. Sawyer, Daniel Slater, David Reichert, Davide Vercelli, Demis Hassabis, Drew A. Hudson, Duncan Williams, Ed Hirst, Fabio Pardo, Felix Hill, F…
  41. [41] Eric Hambro, Roberta Raileanu, Danielle Rothermel, Vegard Mella, Tim Rocktäschel, Heinrich Küttler, and Naila Murray. Dungeons and data: A large-scale NetHack dataset, 2023. URL https://arxiv.org/abs/2211.00539
  42. [42] Ricardo J. G. B. Campello, Davoud Moulavi, and Joerg Sander. Density-based clustering based on hierarchical density estimates. In Jian Pei, Vincent S. Tseng, Longbing Cao, Hiroshi Motoda, and Guandong Xu, editors, Advances in Knowledge Discovery and Data Mining, pages 160–172, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg. ISBN 978-3-642-37456-2
  43. [43] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training, 2023. URL https://arxiv.org/abs/2303.15343
  44. [44] Qwen3 Technical Report · An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, …
  45. [45] DINOv2: Learning Robust Visual Features without Supervision · Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick La…
  46. [46] Bulbapedia. Walkthrough: Pokémon Emerald, Jun 2025. URL https://bulbapedia.bulbagarden.net/wiki/Walkthrough:Pok%C3%A9mon_Emerald
  47. [47] Zelda Dungeon. The Minish Cap walkthrough. https://www.zeldadungeon.net/the-minish-cap-walkthrough/, 2026. [Accessed 30-04-2026]
  48. [48] Sebastian Raschka, Joshua Patterson, and Corey Nolet. Machine learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence. arXiv preprint arXiv:2002.04803, 2020
  49. [49] Zhenyi Wang, Enneng Yang, Li Shen, and Heng Huang. A comprehensive survey of forgetting in deep learning beyond continual learning. IEEE Trans. Pattern Anal. Mach. Intell., 47(3):1464–1483, March
  50. [50] ISSN 0162-8828. doi: 10.1109/TPAMI.2024.3498346. URL https://doi.org/10.1109/TPAMI.2024.3498346
  51. [51] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URL https://arxiv.org/abs/1711.05101
  52. [52] Are all frames key moments?
  53. [53] Do all frames correspond to the same key moment? If the answer to both questions is yes, we consider the cluster to be a key moment. Based on this analysis, we report that 48% of clusters identified by HDBSCAN [40] correspond to key moments. The effectiveness of HDBSCAN in identifying these key moments helps explain the performance increase observed in Sect…