pith. machine review for the scientific record.

arxiv: 2604.24681 · v1 · submitted 2026-04-27 · 💻 cs.RO

Recognition: unknown

Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 02:45 UTC · model grok-4.3

classification 💻 cs.RO
keywords robotic manipulation · human demonstrations · intention priors · vision-language-action · hand motion · distribution shift

The pith

MoT-HRA learns human-intention priors from 2.2 million video episodes to guide more reliable robot manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to turn large collections of human videos into usable guidance for robots that must manipulate objects. It first builds the HA-2.2M dataset by filtering videos around hands, reconstructing the 3D scene, segmenting the clips in time, and aligning them with language. The model then splits the problem into three linked parts: one that reads vision and language to predict 3D paths, one that treats natural hand motion as a reusable prior, and one that turns the combined knowledge into robot commands. Experiments across motion generation, simulation, and physical robots indicate the approach produces more plausible movements and holds up better when test conditions differ from the training data. The core goal is to let robots draw on abundant human examples without requiring perfectly matched robot recordings.
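To make those curation steps concrete, the sketch below strings together the four stages the paper names (hand-centric filtering, spatial reconstruction, temporal segmentation, language alignment). Every function body and the Episode fields here are placeholder assumptions for illustration, not the authors' pipeline.

    from dataclasses import dataclass

    @dataclass
    class Episode:
        frames: list        # hand-centred RGB frames (placeholder representation)
        hand_traj_3d: list  # reconstructed 3D hand trajectory
        instruction: str    # aligned language description

    def hand_centric_filter(video):
        # Placeholder: keep only clips in which hands actively manipulate objects.
        return [video]

    def reconstruct_3d(clip):
        # Placeholder: lift 2D hand detections into a 3D trajectory, one point per frame.
        return [(0.0, 0.0, 0.0) for _ in clip]

    def segment_temporally(clip):
        # Placeholder: split the clip into atomic interaction segments.
        return [(0, len(clip))]

    def align_language(clip, start, end):
        # Placeholder: attach a short language description to the segment.
        return "pick up the object"

    def curate(raw_videos):
        episodes = []
        for video in raw_videos:
            for clip in hand_centric_filter(video):
                traj = reconstruct_3d(clip)
                for start, end in segment_temporally(clip):
                    episodes.append(Episode(clip[start:end], traj[start:end],
                                            align_language(clip, start, end)))
        return episodes

    # One toy "video" of 8 frames yields one curated episode.
    print(len(curate([[f"frame_{i}" for i in range(8)]])))  # -> 1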

Core claim

MoT-HRA is a hierarchical vision-language-action framework that learns human-intention priors from the HA-2.2M dataset. It factorizes manipulation into a vision-language expert that predicts an embodiment-agnostic 3D trajectory, an intention expert that models MANO-style hand motion as a latent human-motion prior, and a fine expert that maps the intention-aware representation to robot action chunks, connected via a shared-attention trunk and read-only key-value transfer to improve motion plausibility and robust control under distribution shift.
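The abstract does not specify how the read-only key-value transfer is realized. One plausible reading, sketched below in PyTorch (a framework assumption), is that the downstream action expert attends over detached copies of the upstream keys and values, so the human priors are consumed without gradients flowing back into the upstream representations.

    import torch

    def read_only_cross_attention(q_action, k_upstream, v_upstream):
        # q_action: (B, Tq, D) queries from the robot-action (fine) expert.
        # k_upstream, v_upstream: (B, Tk, D) keys/values from the shared trunk /
        # upstream experts. Detaching makes the transfer "read-only": the action
        # expert can use the priors but cannot perturb them via gradients.
        k = k_upstream.detach()
        v = v_upstream.detach()
        scores = q_action @ k.transpose(-1, -2) / (k.shape[-1] ** 0.5)
        return torch.softmax(scores, dim=-1) @ v

    q = torch.randn(2, 8, 64)    # action-chunk queries
    kv = torch.randn(2, 32, 64)  # upstream (vision-language / intention) tokens
    out = read_only_cross_attention(q, kv, kv)  # shape (2, 8, 64)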

What carries the argument

The three-expert factorization (vision-language trajectory predictor, hand-motion intention prior, and robot-action mapper) joined by shared attention and read-only transfers.

If this is right

  • Produces more plausible hand motions during generation tasks
  • Raises success rates in simulated manipulation scenarios
  • Delivers more stable performance in physical robot tasks when test conditions differ from training data

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same curation steps for turning raw videos into intention data could support learning in other robot skills or even non-robotic planning domains.
  • The read-only transfer mechanism may allow the learned priors to stay intact when the system is later adapted to new robot hardware with little extra data.
  • Adding body-pose or object-interaction signals to the intention expert could capture still richer human priors without changing the overall structure.

Load-bearing premise

The hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment used to build HA-2.2M successfully extract embodiment-agnostic human-intention priors without substantial artifacts or bias.

What would settle it

A controlled comparison in which robots trained with MoT-HRA show no gain in motion plausibility or task success rate over a baseline that skips the human priors, especially under distribution shift in real-world trials.

Figures

Figures reproduced from arXiv: 2604.24681 by Guangyu Chen, Jinkun Liu, Wenbo Ding, Yifan Xie, Yuan Wang, Yu Sun.

Figure 1: Overview of the HA-2.2M curation pipeline. Large-scale unlabeled human demonstration …
Figure 2: Overview of MoT-HRA. Given an image, a language instruction, and chunk-sized query …
Figure 3: Attention mask of MoT-HRA. Image and text tokens use bidirectional attention to build a shared multimodal context, while 3D trajectory and MANO pose tokens attend to the full prefix but remain causally masked within their own spans. Robot-action tokens attend to all preceding modalities and use bidirectional attention within the action chunk, enabling joint refinement of temporally coupled controls. … (a sketch of this mask pattern follows the figure list)
Figure 4: Qualitative comparison on OakInk in first-person (top) and third-person (bottom) views.
Figure 5: Real-world evaluation on long-horizon manipulation tasks. Both gripper and dexterous …
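The Figure 3 caption fully specifies the attention pattern, so it can be reconstructed as a toy mask over the token layout [image | text | 3D trajectory | MANO pose | robot action]. The segment sizes below are illustrative; only the masking rules come from the caption.

    import numpy as np

    def build_mot_hra_style_mask(n_img, n_txt, n_traj, n_mano, n_act):
        # True means "query token (row) may attend to key token (column)".
        sizes = [n_img, n_txt, n_traj, n_mano, n_act]
        starts = np.cumsum([0] + sizes[:-1])
        mask = np.zeros((sum(sizes), sum(sizes)), dtype=bool)
        span = lambda i: slice(starts[i], starts[i] + sizes[i])

        # Image and text tokens: bidirectional within the multimodal prefix.
        prefix = slice(0, n_img + n_txt)
        mask[prefix, prefix] = True
        # Trajectory (index 2) and MANO pose (index 3) spans: attend to the full
        # preceding prefix, causal (lower-triangular) within their own span.
        for i in (2, 3):
            s = span(i)
            mask[s, :starts[i]] = True
            mask[s, s] = np.tril(np.ones((sizes[i], sizes[i]), dtype=bool))
        # Robot-action tokens: attend to all preceding modalities,
        # bidirectional within the action chunk.
        mask[span(4), :starts[4]] = True
        mask[span(4), span(4)] = True
        return mask

    print(build_mot_hra_style_mask(2, 2, 3, 3, 2).astype(int))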
original abstract

Human videos contain rich manipulation priors, but using them for robot learning remains difficult because raw observations entangle scene understanding, human motion, and embodiment-specific action. We introduce MoT-HRA, a hierarchical vision-language-action framework that learns human-intention priors from large-scale human demonstrations. We first curate HA-2.2M, a 2.2M-episode action-language dataset reconstructed from heterogeneous human videos through hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment. On top of this dataset, MoT-HRA factorizes manipulation into three coupled experts: a vision-language expert predicts an embodiment-agnostic 3D trajectory, an intention expert models MANO-style hand motion as a latent human-motion prior, and a fine expert maps the intention-aware representation to robot action chunks. A shared-attention trunk and read-only key-value transfer allow downstream control to use human priors while limiting interference with upstream representations. Experiments on hand motion generation, simulated manipulation, and real-world robot tasks show that MoT-HRA improves motion plausibility and robust control under distribution shift.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents MoT-HRA, a hierarchical vision-language-action framework designed to learn human-intention priors from large-scale human demonstrations for robotic manipulation. It describes the curation of the HA-2.2M dataset from human videos via hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment. The framework factorizes the task into three experts—a vision-language expert for 3D trajectory prediction, an intention expert for MANO hand motion priors, and a fine expert for robot action mapping—connected via shared-attention and read-only key-value transfer. Experiments across hand motion generation, simulated manipulation, and real-world robot tasks are claimed to show improvements in motion plausibility and robust control under distribution shift.

Significance. If the claimed results are substantiated with rigorous quantitative evaluation, this work has the potential to advance the field of robot learning by enabling effective transfer of manipulation knowledge from abundant human video data to robotic systems, addressing challenges of embodiment mismatch and distribution shift. The hierarchical factorization and use of large-scale curated data represent a promising direction for scalable robot policy learning.

major comments (2)
  1. Abstract: The abstract asserts performance gains across multiple domains but supplies no quantitative metrics, baseline comparisons, error bars, or statistical details. This is load-bearing for the central claim of improved motion plausibility and robust control, as it prevents evaluation of whether gains are significant or attributable to the method versus data curation.
  2. Dataset Curation: The hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment steps for HA-2.2M are outlined at a high level only. Without ablations or validation showing that these steps extract embodiment-agnostic priors without substantial artifacts or biases, it is unclear if reported improvements stem from the proposed MoT-HRA factorization or from the curation pipeline itself.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and will revise the manuscript accordingly to strengthen the presentation of quantitative results and dataset validation.

point-by-point responses
  1. Referee: Abstract: The abstract asserts performance gains across multiple domains but supplies no quantitative metrics, baseline comparisons, error bars, or statistical details. This is load-bearing for the central claim of improved motion plausibility and robust control, as it prevents evaluation of whether gains are significant or attributable to the method versus data curation.

    Authors: We agree that the abstract would benefit from explicit quantitative support. In the revised version we will incorporate key metrics from our experiments, including success rates and motion plausibility scores (e.g., FID or trajectory error) on hand motion generation, simulated manipulation, and real-robot tasks, together with baseline comparisons and error bars. These additions will make the claimed improvements directly evaluable while preserving the abstract's brevity. revision: yes

  2. Referee: Dataset Curation: The hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment steps for HA-2.2M are outlined at a high level only. Without ablations or validation showing that these steps extract embodiment-agnostic priors without substantial artifacts or biases, it is unclear if reported improvements stem from the proposed MoT-HRA factorization or from the curation pipeline itself.

    Authors: Detailed procedures for each curation step appear in the supplementary material. To isolate their contribution from the hierarchical factorization, we will add targeted ablations in the main paper that compare full HA-2.2M against versions omitting individual steps (e.g., without hand-centric filtering). These will quantify downstream effects on motion plausibility and robustness, confirming that the three-expert architecture drives the primary gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's core pipeline starts from an externally sourced heterogeneous human video collection, applies explicit curation steps (hand-centric filtering, spatial reconstruction, temporal segmentation, language alignment) to produce HA-2.2M, then trains a hierarchical factorization (vision-language trajectory expert, MANO hand-motion prior, robot-action mapping) with shared attention and read-only KV transfer. No equations, fitted parameters, or self-citations are shown that reduce the claimed predictions or priors back to the model's own outputs by construction. Empirical claims are evaluated on held-out hand-generation, simulation, and real-robot tasks separate from the curation process. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Based solely on the abstract, the central claim rests on a domain assumption about data curation and two newly introduced artifacts (a model and a dataset); no free parameters are explicitly quantified.

axioms (1)
  • domain assumption Human videos contain rich manipulation priors that can be disentangled into embodiment-agnostic components through hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment
    This premise underpins the entire HA-2.2M curation process described in the abstract.
invented entities (2)
  • MoT-HRA hierarchical framework no independent evidence
    purpose: Factorizes manipulation learning into vision-language trajectory, human-motion intention prior, and robot-action fine experts with shared attention and read-only transfer
    Newly proposed model whose components are introduced to address the stated problem.
  • HA-2.2M dataset no independent evidence
    purpose: Large-scale action-language resource reconstructed from heterogeneous human videos
    Curated specifically to train the framework.

pith-pipeline@v0.9.0 · 5503 in / 1574 out tokens · 57492 ms · 2026-05-08T02:45:55.774426+00:00 · methodology

discussion (0)

