{"paper":{"title":"ASH: Agents that Self-Hone via Embodied Learning","license":"http://creativecommons.org/licenses/by/4.0/","headline":"ASH learns long-horizon policies in complex games by training an inverse dynamics model on its own trajectories to label unlabeled internet videos.","cross_cats":["cs.LG"],"primary_cat":"cs.AI","authors_text":"Benjamin Schneider, Sun Sun, Victor Zhong, Xavier Schneider","submitted_at":"2026-05-14T00:10:12Z","abstract_excerpt":"Long-horizon embodied tasks remain a fundamental challenge in AI, as current methods rely on hand-engineered rewards or action-labeled demonstrations, neither of which scales. We introduce ASH, an agentic system that learns an embodied policy from unlabeled, noisy internet video, without reward shaping or expert annotation. ASH follows a self-improvement loop; when it gets stuck, ASH learns an Inverse Dynamics Model (IDM) from its own trajectories, and uses its IDM to extract supervision from relevant internet video. ASH uses unsupervised learning to identify key moments from large-scale inter"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"ASH reaches an average of 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Legend of Zelda, while the strongest baseline gets stuck in both environments at an average of 6.5/12 and 6.0/12 milestones, respectively.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That an inverse dynamics model trained only on the agent's own noisy, self-generated trajectories will produce sufficiently accurate action labels when applied to unrelated, low-quality internet video clips.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"ASH learns long-horizon policies in complex games by training an inverse dynamics model on its own trajectories to label unlabeled internet videos.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"29bbe7213949bf1c8c26110007c0348cbc44157ee7bf481011d61ab107cde2e6"},"source":{"id":"2605.14211","kind":"arxiv","version":1},"verdict":{"id":"1acdd95f-5606-43e4-859a-8544873463bc","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T02:49:32.785235Z","strongest_claim":"ASH reaches an average of 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Legend of Zelda, while the strongest baseline gets stuck in both environments at an average of 6.5/12 and 6.0/12 milestones, respectively.","one_line_summary":"ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That an inverse dynamics model trained only on the agent's own noisy, self-generated trajectories will produce sufficiently accurate action labels when applied to unrelated, low-quality internet video clips.","pith_extraction_headline":"ASH learns long-horizon policies in complex games by training an inverse dynamics model on its own trajectories to label unlabeled internet videos."},"references":{"count":53,"sample":[{"doi":"","year":2026,"title":"Rethinking memory mechanisms of foundation agents in the second half.arXiv preprint arXiv:2602.06052","work_id":"639480bb-7fee-4f7e-ac4f-23006d622bd4","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Sycara, Matthew Johnson-Roberson, Dhruv Batra, Xiaolong Wang, Sebastian Scherer, Zsolt Kira, Fei Xia, and Yonatan Bisk","work_id":"dfdf2ced-684b-4065-b117-75f500c4b04e","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.24963/ijcai.2018/","year":2018,"title":"Behavioral cloning from observation","work_id":"6c48af65-1415-477d-acb4-883991de1137","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.24963/ijcai.2018/687","year":2018,"title":"URLhttps://doi.org/10.24963/ijcai.2018/687","work_id":"708700e3-3b99-4f96-b407-7f336ba4220e","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Video pretraining (vpt): Learning to act by watching unlabeled online videos","work_id":"2bf44de9-4b8f-4ab3-816e-2ffb5278c578","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":53,"snapshot_sha256":"4dcafff292a252c567cb28fc9b7e5b30ca92e9ee1cc5b9837403b3d908b6458d","internal_anchors":12},"formal_canon":{"evidence_count":2,"snapshot_sha256":"701db4df2d06829a46a18bcd8081ad515ea39e36de5bc2c1b9ad4923344012b1"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}