Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation

Guangyu Chen; Jinkun Liu; Wenbo Ding; Yifan Xie; Yuan Wang; Yu Sun

arxiv: 2604.24681 · v2 · pith:GRQAVAUJnew · submitted 2026-04-27 · 💻 cs.RO

Pith reviewed 2026-05-22 11:13 UTC · model grok-4.3

classification 💻 cs.RO

keywords: robotic manipulation, human intention priors, large-scale human demonstrations, hierarchical vision-language-action, hand motion prior, distribution shift, MANO hand model, action-language dataset

The pith

MoT-HRA learns human-intention priors from 2.2 million curated human videos to improve robotic manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to extract useful manipulation knowledge from abundant human videos and transfer it to robots despite differences in bodies and observations. It does this by building a large dataset, HA-2.2M, through hand-focused filtering, 3D reconstruction, segmentation, and language alignment of existing videos. On this data, the authors train a hierarchical model that splits the problem into three parts: a vision-language part that predicts 3D trajectories, an intention part that treats hand motion as a reusable prior, and a final part that turns the combined representation into robot actions. The design uses shared attention plus read-only transfer so that human knowledge helps without overwriting robot-specific learning. If the approach works, robots could achieve more natural motions and hold up better when test conditions differ from training data.

Core claim

The central claim is that MoT-HRA factorizes manipulation into a vision-language expert predicting an embodiment-agnostic 3D trajectory, an intention expert modeling MANO-style hand motion as a latent human-motion prior, and a fine expert mapping the combined representation to robot action chunks, all linked by a shared-attention trunk and read-only key-value transfer; this structure lets the system learn human-intention priors from the HA-2.2M dataset and delivers better motion plausibility plus more robust control under distribution shift on hand-motion, simulation, and real-robot benchmarks.

What carries the argument

The MoT-HRA hierarchical framework: three coupled experts (vision-language for the 3D trajectory, intention for the latent hand-motion prior, fine for robot action chunks) joined by a shared-attention trunk and read-only key-value transfer.
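The abstract does not spell out how the read-only key-value transfer is implemented. One plausible reading, assumed here and not confirmed by the source, is that downstream attention may read upstream keys and values while gradients from the downstream loss are blocked from flowing back into them. The toy sketch below illustrates that reading with a tiny forward-mode autodiff class; `Dual`, `stop_gradient`, `downstream_readout`, and all numeric constants are hypothetical illustration names and values, not the paper's code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """Forward-mode autodiff number: a value and its derivative w.r.t. one parameter."""
    val: float
    grad: float = 0.0

    def __add__(self, other):
        other = lift(other)
        return Dual(self.val + other.val, self.grad + other.grad)

    def __mul__(self, other):
        other = lift(other)
        return Dual(self.val * other.val,
                    self.grad * other.val + self.val * other.grad)

    __radd__ = __add__
    __rmul__ = __mul__

    def __truediv__(self, other):
        other = lift(other)
        return Dual(self.val / other.val,
                    (self.grad * other.val - self.val * other.grad) / other.val ** 2)


def lift(x):
    """Wrap plain numbers as constants (zero derivative)."""
    return x if isinstance(x, Dual) else Dual(float(x))


def dual_exp(x):
    e = math.exp(x.val)
    return Dual(e, e * x.grad)


def stop_gradient(x):
    """Keep the value, drop the derivative: the assumed 'read-only' behaviour."""
    return Dual(x.val, 0.0)


def downstream_readout(theta_up, query, read_only):
    """One downstream query attends (softmax) over two upstream key/value pairs."""
    # Toy upstream keys and values, both functions of the upstream parameter.
    keys = [theta_up * 1.0, theta_up * -0.5]
    values = [theta_up * 2.0, theta_up * 0.3]
    if read_only:
        keys = [stop_gradient(k) for k in keys]
        values = [stop_gradient(v) for v in values]
    weights = [dual_exp(query * k) for k in keys]
    total = weights[0] + weights[1]
    return (weights[0] * values[0] + weights[1] * values[1]) / total


theta = Dual(0.8, 1.0)  # seed: d(theta)/d(theta) = 1
shared = downstream_readout(theta, query=1.5, read_only=False)
insulated = downstream_readout(theta, query=1.5, read_only=True)
print("shared    value/grad:", shared.val, shared.grad)
print("read-only value/grad:", insulated.val, insulated.grad)
```

Under this assumption the forward value is identical in both modes, but the derivative with respect to the upstream parameter vanishes in read-only mode, so a downstream loss could not modify the upstream expert.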

If this is right

The intention expert produces more plausible hand trajectories than models without the human-motion prior.
Simulated manipulation tasks exhibit higher success rates when the full hierarchical structure is used.
Real-world robot control remains more stable when scene or object conditions shift from the training distribution.
Read-only key-value transfer lets downstream robot policies draw on human priors while leaving upstream representations largely unchanged.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same curation pipeline could be applied to larger internet-scale video collections to expand the range of captured manipulation skills.
The separation of trajectory, prior, and action stages may generalize to other robot learning settings that must bridge human and machine embodiments.
Controlled ablation tests that disable the read-only transfer could quantify how much interference is actually avoided in practice.

Load-bearing premise

The hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment steps extract embodiment-agnostic human-intention priors from raw human videos without introducing large reconstruction errors or biases that would block transfer to robots.

What would settle it

If real-robot experiments under distribution shift show no measurable gain in control success rate or stability metrics compared with baselines that lack the human-prior components, the central claim would be falsified.
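Whether a gain counts as "measurable" depends on trial counts as much as on the raw success rates. A minimal sketch of one conservative check, assuming each trial is an independent success or failure, compares Wilson score intervals for the two policies. The function names and the trial counts below are hypothetical and are not results from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def clearly_better(policy, baseline):
    """True only if the policy's lower bound exceeds the baseline's upper bound.

    Each argument is a (successes, trials) pair. Non-overlapping intervals are
    a conservative criterion; overlapping intervals do not prove equality.
    """
    return wilson_interval(*policy)[0] > wilson_interval(*baseline)[1]


# Hypothetical counts, for illustration only.
print(clearly_better((18, 20), (9, 20)))
print(clearly_better((12, 20), (9, 20)))
```

With only 20 trials per condition, a modest difference in success rate would not pass this check, which is why trial counts and error bars matter for the falsification test described above.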

Figures

Figures reproduced from arXiv: 2604.24681 by Guangyu Chen, Jinkun Liu, Wenbo Ding, Yifan Xie, Yuan Wang, Yu Sun.

**Figure 1.** Overview of the HA-2.2M curation pipeline. Large-scale unlabeled human demonstration…

**Figure 2.** Overview of MoT-HRA. Given an image, a language instruction, and chunk-sized query…

**Figure 3.** Attention mask of MoT-HRA. Image and text tokens use bidirectional attention to build a shared multimodal context, while 3D trajectory and MANO pose tokens attend to the full prefix but remain causally masked within their own spans. Robot-action tokens attend to all preceding modalities and use bidirectional attention within the action chunk, enabling joint refinement of temporally coupled controls. Fin…

**Figure 4.** Qualitative comparison on OakInk in first-person (top) and third-person (bottom) views.

**Figure 5.** Real-world evaluation on long-horizon manipulation tasks. Both gripper and dexterous…
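The attention pattern described in the Figure 3 caption can be sketched as a boolean mask. The sketch below assumes a token order of image/text prefix, then 3D trajectory, then MANO pose, then robot action, and it assumes that MANO tokens may also see the trajectory tokens that precede them; the caption does not state either point explicitly. The function name and span sizes are hypothetical.

```python
def build_attention_mask(n_prefix, n_traj, n_mano, n_action):
    """Return mask[i][j] == True when query token i may attend to key token j."""
    spans = [("prefix", n_prefix), ("traj", n_traj),
             ("mano", n_mano), ("action", n_action)]
    labels = [name for name, count in spans for _ in range(count)]
    order = {"prefix": 0, "traj": 1, "mano": 2, "action": 3}
    # Spans that attend bidirectionally within themselves, per the caption.
    bidirectional = {"prefix", "action"}

    n = len(labels)
    mask = [[False] * n for _ in range(n)]
    for i, q in enumerate(labels):
        for j, k in enumerate(labels):
            if order[k] < order[q]:
                # Earlier modalities are always visible to later ones.
                mask[i][j] = True
            elif k == q:
                # Within a span: full attention, or causal (j <= i) otherwise.
                mask[i][j] = q in bidirectional or j <= i
            # Later modalities stay hidden from earlier ones (mask stays False).
    return mask


if __name__ == "__main__":
    for row in build_attention_mask(2, 2, 2, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

Because no earlier span can attend to a later one, upstream tokens never depend on robot-action tokens in the forward pass, which is consistent with the stated goal of limiting interference with upstream representations.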

Original abstract

Human videos contain rich manipulation priors, but using them for robot learning remains difficult because raw observations entangle scene understanding, human motion, and embodiment-specific action. We introduce MoT-HRA, a hierarchical vision-language-action framework that learns human-intention priors from large-scale human demonstrations. We first curate HA-2.2M, a 2.2M-episode action-language dataset reconstructed from heterogeneous human videos through hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment. On top of this dataset, MoT-HRA factorizes manipulation into three coupled experts: a vision-language expert predicts an embodiment-agnostic 3D trajectory, an intention expert models MANO-style hand motion as a latent human-motion prior, and a fine expert maps the intention-aware representation to robot action chunks. A shared-attention trunk and read-only key-value transfer allow downstream control to use human priors while limiting interference with upstream representations. Experiments on hand motion generation, simulated manipulation, and real-world robot tasks show that MoT-HRA improves motion plausibility and robust control under distribution shift.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper factors human intention priors from video into a three-expert VLA setup with read-only transfer, but the curation pipeline's reconstruction accuracy lacks reported checks.

The core idea is a hierarchical model that pulls embodiment-agnostic priors from large human video sets and passes them to robot policies through a read-only key-value link. This keeps the human motion modeling from interfering with the robot action chunks while still letting the shared attention trunk connect everything. The concrete split into a vision-language expert for 3D trajectories, a MANO-style intention expert, and a fine action mapper is a new combination on top of existing VLA work, and the HA-2.2M dataset curation from heterogeneous videos is a practical step that could let people leverage existing footage instead of collecting robot data from scratch. They run tests on hand motion generation, simulated tasks, and real robot control, which at least shows the pipeline can be applied end to end. The read-only transfer mechanism is a clean way to limit error propagation, and if the full results include ablations on the experts it could clarify how much each part contributes.

The main soft spot is the data pipeline. The hand-centric filtering, 3D reconstruction, temporal segmentation, and language alignment are described, yet there are no numbers on reconstruction error against ground-truth 3D or on how much bias different video sources introduce. If those steps add substantial noise, the downstream robustness claims under distribution shift become harder to trust even with the read-only link. The abstract also skips quantitative metrics, baselines, and error bars, so the size of the reported gains is still unclear.

This is aimed at groups working on vision-language-action models and human-to-robot transfer. Readers who want concrete architectures for scaling priors from video will find the factorization and transfer trick useful to examine. It deserves a serious referee because the problem is real and the proposed assembly is distinct enough that proper experiments could move the area forward.

Referee Report

3 major / 2 minor

Summary. The paper introduces MoT-HRA, a hierarchical vision-language-action framework for learning human-intention priors from large-scale human demonstrations. It curates the HA-2.2M dataset (2.2M episodes) from heterogeneous videos using hand-centric filtering, spatial reconstruction to 3D, temporal segmentation, and language alignment. The model factorizes manipulation into a vision-language expert (predicting embodiment-agnostic 3D trajectories), an intention expert (modeling MANO-style hand motion as latent prior), and a fine expert (mapping to robot action chunks), connected by a shared-attention trunk with read-only key-value transfer. Experiments on hand motion generation, simulated manipulation, and real-world robot tasks are claimed to show improved motion plausibility and robust control under distribution shift.

Significance. If the central claims hold after validation, the work could meaningfully advance robotic manipulation by demonstrating scalable transfer of embodiment-agnostic priors from massive human video data. The hierarchical factorization and read-only transfer mechanism address interference issues in prior-based control, potentially improving robustness to distribution shift in real-world settings. The scale of HA-2.2M represents a notable data contribution if its curation fidelity is established.

major comments (3)

[Abstract and dataset section] Abstract and §3 (dataset curation): The HA-2.2M construction pipeline is described in detail, yet no quantitative metrics are reported for reconstruction error of 3D trajectories or MANO parameters from 2D videos, nor ablations isolating curation biases across video sources. This is load-bearing for the central claim that the dataset supplies embodiment-agnostic human-intention priors without substantial errors propagating to robot control.
[Experiments] Experiments section: The reported improvements in motion plausibility and robust control under distribution shift lack any quantitative metrics, baselines, error bars, or ablation details. Without these, the link between the proposed pipeline and the claimed performance gains cannot be verified, weakening the empirical support for the framework.
[Model architecture] Model description (§4): While the shared-attention trunk and read-only key-value transfer are presented as mechanisms to limit interference, there is no analysis quantifying how reconstruction or alignment errors from the upstream experts affect downstream robot action chunks under distribution shift.
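The reconstruction-fidelity check requested in the first major comment could take the form of a mean per-point 3D error, reported overall and per video source. The sketch below is one generic way to compute it, assuming predicted and ground-truth trajectories are equal-length lists of (x, y, z) points in the same units and coordinate frame. The function names and sample layout are hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


def mean_point_error(predicted, ground_truth):
    """Mean Euclidean distance between matching 3D points of two trajectories."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and equally long")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


def error_by_source(samples):
    """Average the per-trajectory error separately for each video source.

    `samples` is an iterable of (source_name, predicted, ground_truth) triples.
    """
    per_source = defaultdict(list)
    for source, predicted, ground_truth in samples:
        per_source[source].append(mean_point_error(predicted, ground_truth))
    return {source: sum(errs) / len(errs) for source, errs in per_source.items()}


# Tiny illustrative input; the coordinates and source names are made up.
samples = [
    ("source_a", [(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 3, 4)]),
    ("source_b", [(0, 0, 0)], [(0, 0, 1)]),
]
print(error_by_source(samples))
```

Breaking the error down by source is what would expose the curation biases the comment asks about, since a single pooled average can hide one badly reconstructed source.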

minor comments (2)

[Model] Notation for the three experts (vision-language, intention, fine) could be clarified with explicit equations or diagrams showing the read-only key-value transfer.
[Abstract] The abstract mentions 'improvements' without specifying the exact tasks or comparison methods; adding a brief table of key metrics in the abstract or introduction would aid readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments identify key areas where additional quantitative evidence would strengthen the manuscript's claims regarding the HA-2.2M dataset and the MoT-HRA framework. We address each major comment below and will incorporate revisions to enhance empirical rigor while preserving the core contributions.

Referee: [Abstract and dataset section] Abstract and §3 (dataset curation): The HA-2.2M construction pipeline is described in detail, yet no quantitative metrics are reported for reconstruction error of 3D trajectories or MANO parameters from 2D videos, nor ablations isolating curation biases across video sources. This is load-bearing for the central claim that the dataset supplies embodiment-agnostic human-intention priors without substantial errors propagating to robot control.

Authors: We agree that quantitative metrics on reconstruction fidelity are important for validating the dataset's quality. In the revised manuscript, we will add a dedicated evaluation subsection in §3 reporting mean 3D trajectory reconstruction error and MANO parameter accuracy on a held-out validation set with available ground-truth annotations. We will also include source-specific ablations showing downstream task performance when training on individual video sources versus the full curated set. revision: yes
Referee: [Experiments] Experiments section: The reported improvements in motion plausibility and robust control under distribution shift lack any quantitative metrics, baselines, error bars, or ablation details. Without these, the link between the proposed pipeline and the claimed performance gains cannot be verified, weakening the empirical support for the framework.

Authors: We acknowledge that the experimental results would benefit from expanded quantitative details. In the revision, we will augment the Experiments section with explicit numerical metrics (e.g., success rates, motion plausibility scores), comparisons to established baselines, error bars computed over multiple random seeds, and ablation studies isolating the contribution of each expert and the read-only key-value transfer mechanism under distribution shift. revision: yes
Referee: [Model architecture] Model description (§4): While the shared-attention trunk and read-only key-value transfer are presented as mechanisms to limit interference, there is no analysis quantifying how reconstruction or alignment errors from the upstream experts affect downstream robot action chunks under distribution shift.

Authors: This observation is well-taken. We will add a new sensitivity analysis subsection (either in §4 or the Experiments section) that quantifies error propagation. This will include controlled perturbation experiments on upstream 3D trajectories and hand-motion latents, measuring their effect on fine-expert action chunk accuracy specifically in distribution-shift settings. revision: yes
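The controlled perturbation experiment promised in the last response can be outlined in a few lines. The sketch below injects Gaussian noise of increasing scale into upstream 3D waypoints and measures how far the resulting actions move. The `toy_fine_expert` function is a hypothetical stand-in (actions are just differences between consecutive waypoints), not the paper's fine expert, and the waypoints and noise scales are made up.

```python
import random


def toy_fine_expert(waypoints):
    """Hypothetical stand-in policy: each action is the delta between waypoints."""
    return [tuple(b[k] - a[k] for k in range(3))
            for a, b in zip(waypoints, waypoints[1:])]


def perturb(waypoints, sigma, rng):
    """Add independent zero-mean Gaussian noise to every coordinate."""
    return [tuple(c + rng.gauss(0.0, sigma) for c in point) for point in waypoints]


def action_deviation(clean_actions, noisy_actions):
    """Mean absolute difference across all action components."""
    diffs = [abs(x - y)
             for a, b in zip(clean_actions, noisy_actions)
             for x, y in zip(a, b)]
    return sum(diffs) / len(diffs)


def sensitivity_sweep(waypoints, sigmas, seed=0):
    """Map each noise scale to the resulting deviation in downstream actions."""
    clean = toy_fine_expert(waypoints)
    results = {}
    for sigma in sigmas:
        rng = random.Random(seed)  # same noise pattern, scaled by sigma
        noisy = toy_fine_expert(perturb(waypoints, sigma, rng))
        results[sigma] = action_deviation(clean, noisy)
    return results


waypoints = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.05), (0.2, 0.05, 0.1), (0.3, 0.1, 0.1)]
sweep = sensitivity_sweep(waypoints, sigmas=[0.0, 0.01, 0.02])
for sigma, deviation in sweep.items():
    print(f"sigma={sigma:.2f} -> mean action deviation {deviation:.4f}")
```

For a real policy the stand-in would be replaced by the trained fine expert, and the sweep would be repeated over several seeds and under the distribution-shift conditions of interest.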

Circularity Check

0 steps flagged

No significant circularity; derivation ingests external video data and trains independent experts

The paper first curates the external HA-2.2M dataset from heterogeneous human videos via hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment, then trains the three coupled experts (vision-language for 3D trajectory, intention for MANO-style latent motion, fine for robot actions) plus shared-attention trunk on that dataset. No equations, fitted parameters, or self-citation chains reduce the claimed embodiment-agnostic priors to quantities defined by the model outputs themselves; the pipeline remains self-contained against external benchmarks and does not rename or smuggle its own results as inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that heterogeneous human videos can be processed into transferable, embodiment-agnostic intention priors; no free parameters or invented entities are introduced in the abstract description.

axioms (1)

Domain assumption: Human videos contain rich manipulation priors that can be disentangled into embodiment-agnostic components through hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment.
Invoked as the foundation for curating HA-2.2M and building the three-expert model.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Theorem: alexander_duality_circle_linking (IndisputableMonolith/Foundation/AlexanderDuality.lean)

Relation between the paper passage and the cited Recognition theorem: unclear

MoT-HRA factorizes manipulation into three coupled experts: a vision-language expert predicts an embodiment-agnostic 3D trajectory, an intention expert models MANO-style hand motion as a latent human-motion prior, and a fine expert maps the intention-aware representation to robot action chunks.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 17 internal anchors

[1]

Rt-2: Vision-language-action models transfer web knowledge to robotic control

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183. PMLR, 2023

work page 2023
[2]

Open x- embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0

Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x- embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

work page 2024
[3]

Octo: An open-source generalist robot policy

Oier Mees, Dibya Ghosh, Karl Pertsch, Kevin Black, Homer Rich Walke, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, et al. Octo: An open-source generalist robot policy. InFirst Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024, 2024

work page 2024
[4]

Openvla: An open-source vision-language-action model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. InConference on Robot Learning, pages 2679–2713. PMLR, 2025

work page 2025
[5]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi_0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

Howto100m: Learning a text-video embedding by watching hundred million narrated video clips

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2630–2640, 2019

work page 2019
[7]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022

work page 2022
[8]

The epic-kitchens dataset: Collection, challenges and baselines.IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):4125–4141, 2020

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. The epic-kitchens dataset: Collection, challenges and baselines.IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):4125–4141, 2020

work page 2020
[9]

something something

Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The" something something" video database for learning and evaluating visual common sense. InProceedings of the IEEE International Conference on Computer Vision, pages 5842–5850, 2017

work page 2017
[10]

Being-h0: vision-language-action pretraining from large-scale human videos.arXiv preprint arXiv:2507.15597, 2025

Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, and Zongqing Lu. Being-h0: vision-language-action pretraining from large-scale human videos.arXiv preprint arXiv:2507.15597, 2025

work page arXiv 2025
[11]

Scalable vision-language-action model pretraining for robotic manipulation with real-life human activity videos.arXiv preprint arXiv:2510.21571, 2025

Qixiu Li, Yu Deng, Yaobo Liang, Lin Luo, Lei Zhou, Chengtang Yao, Lingqi Zeng, Zhiyuan Feng, Huizhi Liang, Sicheng Xu, et al. Scalable vision-language-action model pretraining for robotic manipulation with real-life human activity videos.arXiv preprint arXiv:2510.21571, 2025

work page arXiv 2025
[12]

Spatial-aware vla pretraining through visual-physical alignment from human videos.arXiv preprint arXiv:2512.13080, 2025

Yicheng Feng, Wanpeng Zhang, Ye Wang, Hao Luo, Haoqi Yuan, Sipeng Zheng, and Zongqing Lu. Spatial-aware vla pretraining through visual-physical alignment from human videos.arXiv preprint arXiv:2512.13080, 2025

work page arXiv 2025
[13]

Being-h0.7: A latent world-action model from egocentric videos

BeingBeyond Team. Being-h0.7: A latent world-action model from egocentric videos. https: //research.beingbeyond.com/being-h07, 2026. 10

work page 2026
[14]

Flowing from reasoning to motion: Learning 3d hand trajectory prediction from egocentric human interaction videos.arXiv preprint arXiv:2512.16907, 2025

Mingfei Chen, Yifan Wang, Zhengqin Li, Homanga Bharadhwaj, Yujin Chen, Chuan Qin, Ziyi Kou, Yuan Tian, Eric Whitmire, Rajinder Sodhi, et al. Flowing from reasoning to motion: Learning 3d hand trajectory prediction from egocentric human interaction videos.arXiv preprint arXiv:2512.16907, 2025

work page arXiv 2025
[15]

Knowledge insulat- ing vision-language-action models: Train fast, run fast, generalize better.arXiv preprint arXiv:2505.23705, 2025

Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, et al. Knowledge insulat- ing vision-language-action models: Train fast, run fast, generalize better.arXiv preprint arXiv:2505.23705, 2025

work page arXiv 2025
[16]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

What matters in building vision–language–action models for generalist robots.Nature Machine Intelligence, pages 1–15, 2026

Xinghang Li, Peiyan Li, Long Qian, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Xinlong Wang, Di Guo, et al. What matters in building vision–language–action models for generalist robots.Nature Machine Intelligence, pages 1–15, 2026

work page 2026
[19]

SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint arXiv:2501.15830, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, and Fu-En Yang. Thinkact: Vision-language-action reasoning via reinforced visual latent planning.arXiv preprint arXiv:2507.16815, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Gigabrain-0: A world model-powered vision-language- action model.arXiv preprint arXiv:2510.19430, 2025

GigaBrain Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jie Li, Jiagang Zhu, Lv Feng, et al. Gigabrain-0: A world model-powered vision-language- action model.arXiv preprint arXiv:2510.19430, 2025

work page arXiv 2025
[22]

Universal visuo-tactile video understanding for embodied interaction.arXiv preprint arXiv:2505.22566, 2025

Yifan Xie, Mingyang Li, Shoujie Li, Xingting Li, Guangyu Chen, Fei Ma, Fei Richard Yu, and Wenbo Ding. Universal visuo-tactile video understanding for embodied interaction.arXiv preprint arXiv:2505.22566, 2025

work page arXiv 2025
[23]

Causal World Modeling for Robot Control

Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[24]

World Action Models are Zero-shot Policies

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

R3m: A universal visual representation for robot manipulation

Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhishek Gupta. R3m: A universal visual representation for robot manipulation. InConference on Robot Learning, 2022

work page 2022
[26]

Vip: Towards universal visual reward and representation via value-implicit pre-training

Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Osbert Bastani, Vikash Kumar, and Amy Zhang. Vip: Towards universal visual reward and representation via value-implicit pre-training. InInternational Conference on Learning Representations, 2023

work page 2023
[27]

Affordances from human videos as a versatile representation for robotics

Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

work page 2023
[28]

Grounding language with visual affor- dances over unstructured data

Oier Mees, Jessica Borja-Diaz, and Wolfram Burgard. Grounding language with visual affor- dances over unstructured data. InIEEE International Conference on Robotics and Automation, 2023

work page 2023
[29]

From human videos to robot manipulation: A survey on scalable vision-language-action learning with human-centric data.TechRxiv preprint, 2026

Zhiyuan Feng, Qixiu Li, Huizhi Liang, Rushuai Yang, Yichao Shen, Zhiying Du, Zhaowei Zhang, Yu Deng, Li Zhao, Hao Zhao, et al. From human videos to robot manipulation: A survey on scalable vision-language-action learning with human-centric data.TechRxiv preprint, 2026. 11

work page 2026
[30]

Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyl- los Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

work page 2024
[31]

Latent action pretraining from videos

Seonghyeon Ye, Joel Jang, et al. Latent action pretraining from videos. InInternational Conference on Learning Representations, 2025

work page 2025
[32]

Igor: Image-goal representations are the atomic control units for foundation models in embodied ai

Xiaoyu Chen, Junliang Guo, et al. Igor: Image-goal representations are the atomic control units for foundation models in embodied ai.arXiv preprint arXiv:2411.00785, 2024

work page arXiv 2024
[33]

Moto: Latent motion token as the bridging language for learning robot manipulation from videos

Yi Chen, Yuying Ge, et al. Moto: Latent motion token as the bridging language for learning robot manipulation from videos. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2025

work page 2025
[34]

Univla: Learning to act anywhere with task-centric latent actions

Qingwen Bu, Yanting Yang, et al. Univla: Learning to act anywhere with task-centric latent actions. InRobotics: Science and Systems, 2025

work page 2025
[35]

Unleashing large-scale video generative pre-training for visual robot manipulation

Hongtao Wu, Ya Jing, et al. Unleashing large-scale video generative pre-training for visual robot manipulation. InInternational Conference on Learning Representations, 2024

work page 2024
[36]

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

Chi-Lam Cheang, Guangzeng Chen, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation

Homanga Bharadhwaj, Debidatta Dwibedi, et al. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation. InConference on Robot Learning, 2025

work page 2025
[38]

Any-point trajectory modeling for policy learning

Chuan Wen, Xingyu Lin, et al. Any-point trajectory modeling for policy learning. InRobotics: Science and Systems, 2024

work page 2024
[39]

Magma: A foundation model for multimodal ai agents

Jianwei Yang, Reuben Tan, et al. Magma: A foundation model for multimodal ai agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

work page 2025
[40]

Dexmv: Imitation learning for dexterous manipulation from human videos

Yuzhe Qin, Yueh-Hua Wu, et al. Dexmv: Imitation learning for dexterous manipulation from human videos. InEuropean Conference on Computer Vision, 2022

work page 2022
[41] Hanzhi Chen, Boyang Sun, et al. Vidbot: Learning generalizable 3D actions from in-the-wild 2D human videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
[42] Ruihan Yang, Qinxi Yu, et al. EgoVLA: Learning vision-language-action models from egocentric human videos. arXiv preprint arXiv:2507.12440, 2025
[43] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023
[44] Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025
[45] Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. Vitpose: Simple vision transformer baselines for human pose estimation. Advances in Neural Information Processing Systems, 35:38571–38584, 2022
[46] Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. Vitpose++: Vision transformer for generic body pose estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(2):1212–1230, 2023
[47] Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3D with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
[48] Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth Anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025
[49] Hao Luo, Ye Wang, Wanpeng Zhang, Sipeng Zheng, Ziheng Xi, Chaoyi Xu, Haiweng Xu, Haoqi Yuan, Chi Zhang, Yiqing Wang, et al. Being-h0.5: Scaling human-centric robot learning for cross-embodiment generalization. arXiv preprint arXiv:2601.12993, 2026
[50] Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, et al. PaliGemma 2: A family of versatile VLMs for transfer. arXiv preprint arXiv:2412.03555, 2024
[51] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022
[52] Javier Romero, Dimitrios Tzionas, and Michael J Black. Embodied hands: Modeling and capturing hands and bodies together. arXiv preprint arXiv:2201.02610, 2022
[53] Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. AgiBot World Colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025
[54] Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023
[55] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022
[56] Lixin Yang, Kailin Li, Xinyu Zhan, Fei Wu, Anran Xu, Liu Liu, and Cewu Lu. Oakink: A large-scale knowledge repository for understanding hand-object interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20953–20962, 2022
[57] Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, et al. Evaluating real-world robot manipulation policies in simulation. In RSS 2024 Workshop: Data Generation for Robotics
