ASH: Agents that Self-Hone via Embodied Learning

Benjamin Schneider; Sun Sun; Victor Zhong; Xavier Schneider

arxiv: 2605.14211 · v2 · pith:BDSGYFRKnew · submitted 2026-05-14 · 💻 cs.AI · cs.LG

Pith reviewed 2026-05-15 02:49 UTC · model grok-4.3

keywords: embodied learning · inverse dynamics model · self-improvement · long-horizon tasks · unlabeled video · game environments · agentic systems

The pith

ASH learns long-horizon policies in complex games by training an inverse dynamics model on its own trajectories to label unlabeled internet videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ASH as an agentic system that acquires embodied skills in long-horizon environments without hand-engineered rewards or expert-labeled demonstrations. When the agent stalls, it trains an inverse dynamics model solely on its self-generated trajectories and applies that model to extract action supervision from relevant internet video clips. Unsupervised techniques further identify and retain key moments from large-scale video as long-term memory. This loop enables sustained progress across multi-hour tasks where standard behavioral cloning and retrieval baselines plateau.

Core claim

ASH reaches an average of 11.2 out of 12 milestones in Pokemon Emerald and 9.9 out of 12 in The Legend of Zelda: The Minish Cap. It does so by repeatedly training an inverse dynamics model on its own noisy trajectories, using that model to derive supervision signals from unlabeled internet video, and storing key moments found by unsupervised learning as memory. The strongest baseline stalls at an average of 6.5 and 6.0 out of 12 milestones, respectively.

What carries the argument

The self-improvement loop that trains an inverse dynamics model from the agent's own trajectories to label actions in internet video, paired with unsupervised extraction of key moments for long-term memory.
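
The control flow of such a loop can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `StallDetector`, `run_loop`, the integer progress signal, and the `bootstrap` callback are all hypothetical names, and the source does not say how ASH measures that it is stuck. Here a stall simply means `patience` consecutive steps without a new best progress value, and `bootstrap` stands in for the whole "train IDM on own trajectories, label video, update policy" step.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StallDetector:
    """Signals a stall after `patience` consecutive steps without progress.

    The integer `progress` signal is an assumption made for this sketch.
    """
    patience: int
    best: int = 0
    idle_steps: int = 0

    def update(self, progress: int) -> bool:
        if progress > self.best:
            self.best = progress
            self.idle_steps = 0
        else:
            self.idle_steps += 1
        return self.idle_steps >= self.patience

    def reset(self) -> None:
        self.idle_steps = 0


def run_loop(progress_trace: List[int], patience: int,
             bootstrap: Callable[[int], None]) -> List[int]:
    """Replay a progress trace; return the steps at which a bootstrap fired."""
    detector = StallDetector(patience)
    fired: List[int] = []
    for t, progress in enumerate(progress_trace):
        if detector.update(progress):
            bootstrap(t)      # placeholder for: fit IDM, label video, update policy
            fired.append(t)
            detector.reset()  # give the updated policy a fresh patience budget
    return fired


log: List[str] = []
fired_at = run_loop([0, 1, 1, 1, 1, 2, 2], patience=3,
                    bootstrap=lambda t: log.append(f"bootstrap at step {t}"))
print(fired_at, log)
```

In this toy trace the agent makes no progress for three steps after reaching progress value 1, so a single bootstrap fires before progress resumes.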

If this is right

The same self-honing loop can be applied to other long-horizon embodied tasks that lack dense rewards or expert data.
Agents can bootstrap policies from web-scale unlabeled video once they generate enough of their own trajectories to train a usable IDM.
Unsupervised key-moment retention enables planning over multi-hour horizons without explicit state tracking.
Performance gaps versus baselines widen as task length increases because self-generated labels keep the policy advancing.
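
A bounded buffer is one simple way to picture the key-moment memory mentioned above. The sketch below assumes that some upstream clustering step has already assigned each frame a cluster id, with a sentinel value for outliers; the class name `KeyMomentMemory`, the `NOISE` sentinel, and the window size are illustrative assumptions rather than details from the paper.

```python
from collections import deque
from typing import Deque, List, Tuple

NOISE = -1  # assumed sentinel for frames that belong to no cluster


class KeyMomentMemory:
    """Keeps only the most recent `window` key moments."""

    def __init__(self, window: int) -> None:
        # deque(maxlen=...) silently evicts the oldest entry when full
        self._items: Deque[Tuple[int, int]] = deque(maxlen=window)

    def observe(self, frame_id: int, cluster_id: int) -> None:
        """Store the frame if it was assigned to a cluster; skip outliers."""
        if cluster_id == NOISE:
            return
        self._items.append((frame_id, cluster_id))

    def recall(self) -> List[Tuple[int, int]]:
        """Return remembered key moments, oldest first."""
        return list(self._items)


memory = KeyMomentMemory(window=2)
for frame_id, cluster_id in [(0, 3), (1, NOISE), (2, 5), (3, 7)]:
    memory.observe(frame_id, cluster_id)
print(memory.recall())
```

With a window of 2, the outlier frame is ignored and the oldest key moment is evicted once a third one arrives.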

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the IDM generalizes across visual domains, the method could transfer from game video to real-world robot footage without additional annotation.
Scaling the volume of internet video or the number of self-improvement cycles could further raise the fraction of milestones reached.
The approach suggests that internet video plus self-generated data forms a sufficient training signal for many sequential decision problems once an initial exploration policy exists.

Load-bearing premise

An inverse dynamics model trained only on the agent's own noisy self-generated trajectories will produce sufficiently accurate action labels when applied to unrelated low-quality internet video clips.
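
To make the premise concrete, here is a deliberately tiny inverse dynamics model on a grid world. It is a sketch under strong assumptions and not the model used in the paper: states are `(x, y)` positions, the IDM is a lookup table that votes on which action produced each displacement, and the "video" is a list of states with no action labels. All names (`TabularIDM`, `label_video`, `accuracy`) are hypothetical. The point is only to show the two roles of the IDM: fit on the agent's own labeled transitions, then applied to label someone else's action-free footage.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

State = Tuple[int, int]
MOVES: Dict[str, Tuple[int, int]] = {
    "up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0),
}


def step(state: State, action: str) -> State:
    dx, dy = MOVES[action]
    return (state[0] + dx, state[1] + dy)


def delta(s: State, s_next: State) -> Tuple[int, int]:
    return (s_next[0] - s[0], s_next[1] - s[1])


class TabularIDM:
    """Predicts the action between two states by majority vote per displacement."""

    def __init__(self) -> None:
        self.counts: Dict[Tuple[int, int], Counter] = defaultdict(Counter)

    def fit(self, transitions: List[Tuple[State, str, State]]) -> None:
        for s, a, s_next in transitions:
            self.counts[delta(s, s_next)][a] += 1

    def predict(self, s: State, s_next: State) -> Optional[str]:
        votes = self.counts.get(delta(s, s_next))
        return votes.most_common(1)[0][0] if votes else None  # None: unseen change


def label_video(idm: TabularIDM, frames: List[State]) -> List[Optional[str]]:
    """Assign a pseudo-label to every consecutive frame pair."""
    return [idm.predict(a, b) for a, b in zip(frames, frames[1:])]


def accuracy(pred: List[Optional[str]], truth: List[str]) -> float:
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


# 1) The agent's own trajectory supplies (state, action, next_state) triples.
state: State = (0, 0)
own_transitions = []
for action in ["right", "right", "down", "left", "up"]:
    nxt = step(state, action)
    own_transitions.append((state, action, nxt))
    state = nxt

idm = TabularIDM()
idm.fit(own_transitions)

# 2) The "video" has states only; its true actions are kept aside for scoring.
video_frames = [(5, 5), (6, 5), (6, 6), (5, 6)]
video_truth = ["right", "down", "left"]
pseudo_labels = label_video(idm, video_frames)
print(pseudo_labels, accuracy(pseudo_labels, video_truth))
```

In this toy world the video shares the agent's dynamics exactly, so labeling is perfect. The premise above is about what happens when that does not hold, which is why a held-out accuracy number on external clips matters.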

What would settle it

Train the IDM on ASH trajectories, then check whether policy performance stops improving after one or more cycles of video-derived supervision, or whether milestone counts remain comparable to the strongest baseline. Either outcome would count against the claim.

Figures

Figures reproduced from arXiv: 2605.14211 by Benjamin Schneider, Sun Sun, Victor Zhong, Xavier Schneider.

**Figure 2.** Agent progress in Pokémon Emerald (top) and Legend of Zelda (bottom), measured using milestone completion rates. ASH is able to adapt and continue progressing throughout the 8-hour gameplay period. While all methods are able to complete early milestones, only ASH can adapt to new areas, objectives, and mechanics. See Appendix C for standard deviation.

**Figure 3.** Outside (top) vs. inside (bottom) of the Zelda castle: offline policies collapse on this dynamics shift; ASH bootstraps and continues. Self-improvement is necessary for sustained progression. Over the 8-hour evaluation, ASH reaches milestone 12 in both environments, while no baseline exceeds milestone 8 in Pokémon or 6 in Zelda. VPT and offline BC plateau once the games introduce dynamics that are under…

**Figure 4.** (Left) Component ablation on Pokémon Emerald: each addition (long-term memory, dynamic bootstrapping) yields a clear gain in milestones completed per GPU hour of online training. Shaded regions are one standard deviation over 4 trajectories per method. (Right) IDM accuracy across bootstraps, evaluated on a test set. The dashed line is the pre-bootstrap initialization checkpoint. Across both environments, e…

**Figure 5.** Offline replay of the final ASH checkpoint vs. the original online run. Catastrophic forgetting is a phenomenon in lifelong learning where an agent will forget previously known skills and knowledge when its policy is updated [47]. The result is an agent that can progress through the latter stages of an environment but can no longer accomplish early milestones. We examine whether ASH’s final policy is ab…

**Figure 6.** Dynamic bootstrapping example. To complete milestone 2, the player must rescue the Professor from a wild Zigzagoon (Panel 1). To accomplish this, the player must use their starter Pokémon to defeat the Zigzagoon in battle (Panel 2). However, ASH’s initial policy does not know how to use the battle interface to command their Pokémon. After being stuck for ∆ steps (20 minutes), ASH dynamically bootstraps, an…

**Figure 7.** Long-term memory example. When the player arrives in Oldale town (Panel 1), they are presented with 3 possible next paths. Option A: The player heads north to Route 103 to meet their rival, May. This is the correct choice if the player has just obtained their starter Pokémon from the Professor and been tasked with bringing May back to the lab. Option B: The player has already met May, and should head back …

**Figure 8.** Visualization of 3 HDBSCAN [40] clusters, as well as 500 uniformly sampled outlier points reduced to 2 dimensions via principal component analysis. Each of these clusters represents a key moment in Pokémon Emerald: choosing a starter Pokémon (blue), saving the professor (orange) and challenging a gym leader (green). Grey dots are outliers that are not assigned to a cluster by HDBSCAN [40]. Ideally, they ar…

Original abstract

Long-horizon embodied tasks remain a fundamental challenge in AI, as current methods rely on hand-engineered rewards or action-labeled demonstrations, neither of which scales. We introduce ASH, an agentic system that learns an embodied policy from unlabeled, noisy internet video, without reward shaping or expert annotation. ASH follows a self-improvement loop; when it gets stuck, ASH learns an Inverse Dynamics Model (IDM) from its own trajectories, and uses its IDM to extract supervision from relevant internet video. ASH uses unsupervised learning to identify key moments from large-scale internet video and retains them as long-term memory -- allowing it to tackle long-horizon problems. We evaluate ASH on two complementary environments demanding multi-hour planning: Pokemon Emerald, a turn-based RPG, and The Legend of Zelda: The Minish Cap, a real-time action-adventure game. In both games, behavioral cloning, retrieval-augmented and zero-shot foundation-model baselines plateau, while ASH sustains progression across our 8-hour evaluation. ASH reaches an average of $11.2/12$ milestones in Pokemon Emerald and $9.9/12$ in Legend of Zelda, while the strongest baseline gets stuck in both environments at an average of $6.5/12$ and $6.0/12$ milestones, respectively. We demonstrate that self-improving agents are a scalable recipe for long-horizon embodied learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ASH's loop of training an IDM only on the agent's own trajectories, then using it to label internet video for policy updates, is a clean procedural idea. But the abstract supplies no IDM accuracy checks or ablations, so the reported milestone jumps remain hard to attribute.

The core claim is that an agent can keep progressing on long-horizon tasks by fitting an inverse dynamics model to its own stuck trajectories and then applying that model to pull action labels from unlabeled internet video. In Pokemon Emerald it reaches 11.2 out of 12 milestones and in Zelda 9.9 out of 12, while the strongest baseline stalls at 6.5 and 6.0 respectively. The loop also includes an unsupervised step to pull out key moments for memory.

That combination is not the usual behavioral cloning or retrieval setup, so the procedural recipe itself is the main novelty here. It directly targets the reward-shaping and annotation bottlenecks that limit most current embodied work. The environments are reasonable stress tests for multi-hour planning, and the gap over baselines is large enough to notice. The paper does a service by showing that self-generated data can in principle close the supervision loop without external labels.

The soft spots are straightforward. The abstract gives no numbers on how well the IDM actually predicts actions on held-out internet clips, no description of video filtering or domain-shift handling, no error bars, and no ablation that isolates the IDM component. The central assumption, that an IDM fit only to the agent's early noisy trajectories will still produce usable labels on unrelated, lower-quality video, therefore sits untested. Without those checks it is difficult to know whether the milestone gains come from the self-honing mechanism or from other unstated choices in the pipeline.

This is for researchers working on video-based imitation and scalable embodied agents who want to see concrete attempts to remove hand-engineered supervision. A reader already running long-horizon experiments would find the loop worth trying even if the current numbers need more controls. It deserves a serious referee. The idea is coherent on its own terms and the environments are non-trivial, so the paper should go through review with requests for IDM accuracy metrics, ablations, and implementation details rather than being desk-rejected.

Referee Report

2 major / 1 minor

Summary. The paper introduces ASH, an agentic system for long-horizon embodied learning that follows a self-improvement loop: when stuck, it trains an Inverse Dynamics Model (IDM) on its own trajectories and applies the IDM to extract action labels from unlabeled noisy internet video for supervision, while using unsupervised learning to identify key moments as long-term memory. Evaluated on Pokemon Emerald and Legend of Zelda, ASH achieves average milestone progress of 11.2/12 and 9.9/12 respectively, while baselines plateau at 6.5/12 and 6.0/12.

Significance. If the performance gains can be shown to stem from the IDM-based self-honing mechanism with proper validation, the work would represent a meaningful step toward scalable embodied agents that leverage abundant internet video without hand-engineered rewards or expert annotations, addressing a core limitation in current long-horizon task learning.

major comments (2)

[Abstract] The central performance claims (11.2/12 milestones in Pokemon Emerald, 9.9/12 in Zelda) are reported without error bars, ablation studies isolating the IDM supervision component, or details on filtering noisy video, making it impossible to determine whether the self-honing loop drives the gains over baselines.
[Abstract] The method's validity hinges on the IDM, trained only on the agent's initially random or stuck self-trajectories, producing accurate action labels on unrelated noisy internet video despite domain shifts in quality, frame rate, perspective, and style; however, no quantitative IDM accuracy metrics on held-out external clips are provided.

minor comments (1)

The abstract would benefit from a concise definition of the 12 milestones and how they are evaluated across the 8-hour runs to improve clarity and reproducibility.
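
The error bars the report asks for amount to summarizing milestone counts over independent runs. The following is a minimal sketch of that bookkeeping; the run values are invented placeholders, not results from the paper, and `summarize` and `format_result` are hypothetical helper names.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(milestones_per_run: Sequence[int]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of milestones reached per run."""
    return float(mean(milestones_per_run)), stdev(milestones_per_run)


def format_result(milestones_per_run: Sequence[int], total: int) -> str:
    """Render a 'mean ± std / total' string for a results table."""
    m, s = summarize(milestones_per_run)
    return f"{m:.1f} ± {s:.1f} / {total}"


# Placeholder data: four made-up runs on a 12-milestone task.
toy_runs = [3, 5, 4, 4]
report = format_result(toy_runs, total=12)
print(report)
```

`stdev` is the sample standard deviation (dividing by n - 1), which is the usual choice when the runs are a small sample of possible seeds.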

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the validation of ASH's self-honing mechanism.

Referee: [Abstract] The central performance claims (11.2/12 milestones in Pokemon Emerald, 9.9/12 in Zelda) are reported without error bars, ablation studies isolating the IDM supervision component, or details on filtering noisy video, making it impossible to determine whether the self-honing loop drives the gains over baselines.

Authors: We agree that error bars, targeted ablations, and filtering details are essential to substantiate the claims. In the revised manuscript we will report error bars over multiple independent runs for all milestone-progress metrics. We will add ablation studies that isolate the IDM-based internet-video supervision (comparing full ASH against variants without the IDM loop or without video labels) and will expand the methods section with the precise filtering criteria and preprocessing steps applied to noisy internet clips. These additions will directly demonstrate that the self-honing loop accounts for the observed gains over baselines.

Revision: yes

Referee: [Abstract] The method's validity hinges on the IDM, trained only on the agent's initially random or stuck self-trajectories, producing accurate action labels on unrelated noisy internet video despite domain shifts in quality, frame rate, perspective, and style; however, no quantitative IDM accuracy metrics on held-out external clips are provided.

Authors: We acknowledge the importance of quantifying IDM generalization. The revised manuscript will include new quantitative results measuring IDM action-prediction accuracy on held-out external video clips drawn from the same internet sources, explicitly reporting performance under the domain shifts in quality, frame rate, perspective, and visual style. These metrics will be presented alongside the end-to-end results to confirm that the IDM trained on agent trajectories can reliably label noisy video for supervision.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in ASH's procedural self-improvement loop

The paper presents ASH as an agentic system following a self-improvement loop: learning an IDM from its own trajectories to extract supervision from internet video, combined with unsupervised key moment identification. This is described as a procedural algorithm without any mathematical derivations, equations, or fitted parameters that reduce predictions to inputs by construction. Performance is evaluated empirically via milestone completion in games, not through self-referential claims. No self-citation load-bearing arguments or uniqueness theorems are referenced. The central claim rests on the empirical results rather than tautological definitions, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that internet video contains recoverable action information that an IDM trained on self-generated trajectories can extract, plus the assumption that unsupervised key-moment detection yields useful long-term memory for multi-hour planning.

axioms (1)

Domain assumption: Internet videos contain extractable supervision for embodied actions when paired with an IDM trained on the agent's own trajectories.
Invoked to justify using unlabeled video as training signal without expert annotation.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "ASH follows a self-improvement loop; when it gets stuck, ASH learns an Inverse Dynamics Model (IDM) from its own trajectories, and uses its IDM to extract supervision from relevant internet video."

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: "We use HDBSCAN clustering to discover recurring key moments... long-term memory of observations ρ that are the wl most recent key moments."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.