pith. sign in

arxiv: 2606.23685 · v1 · pith:B2ASUD25new · submitted 2026-06-22 · 💻 cs.RO

LaST-HD: Learning Latent Physical Reasoning from Scalable Human Data for Robot Manipulation

Pith reviewed 2026-06-26 07:56 UTC · model grok-4.3

classification 💻 cs.RO
keywords robot manipulationhuman demonstrationslatent reasoningworld modelcross-embodiment transferaction learningmotion capturedexterous manipulation
0
0 comments X

The pith

LaST-HD aligns human-hand and robot data in a shared latent forward-dynamics space so robots internalize physical dynamics from scalable human demonstrations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LaST-HD as a way to transfer manipulation skills from human hands to robots without relying on manual kinematic retargeting. It trains an auxiliary action-conditioned world model on unpaired human and robot trajectories to produce unified latent targets that reflect shared physical dynamics. These targets then supervise the model's latent reasoning process, allowing it to learn action policies directly from human data. The approach pairs this alignment with a low-cost motion-capture glove and a staged training recipe that mixes human-robot data before refining with human-only corrections.

Core claim

LaST-HD trains an auxiliary action-conditioned world model on unpaired human-hand and robot trajectories to synthesize unified latent targets in a shared forward-dynamics space. After cross-embodiment representations are aligned in this space, the targets supervise the latent reasoning process, enabling the model to internalize shared physical dynamics and support efficient human-hand action learning. The method further includes the OOL Glove for precise keypoint capture and a progressive mixed-to-human training schedule that improves generalization to novel objects and scenes while reaching over 90 percent accuracy with only 20 minutes of additional human data.

What carries the argument

Auxiliary action-conditioned world model that generates unified latent targets from unpaired trajectories to align cross-embodiment representations in a shared forward-dynamics space.

If this is right

  • Generalization to novel objects, scenes, and positions improves when training uses only human-hand demonstrations after the latent alignment.
  • Online correction with 20 minutes of OOL glove data enables over 90 percent accuracy in previously unseen environments.
  • The progressive recipe of mixed human-robot co-training followed by human-hand correction reduces reliance on large robot-specific datasets.
  • Latent supervision replaces direct kinematic mimicking as the mechanism for cross-embodiment transfer.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same auxiliary world-model alignment could be tested on additional robotic morphologies such as arms or mobile bases if the forward-dynamics space remains embodiment-agnostic.
  • If the latent targets prove robust, the method implies that abundant internet-scale human video could substitute for paired robot demonstrations in many manipulation tasks.
  • The approach opens the possibility of measuring physical-dynamics similarity between any two embodiments by comparing their latent trajectories under the same world model.

Load-bearing premise

The auxiliary world model trained on unpaired trajectories produces latent targets that faithfully represent the shared physical dynamics between human hands and robots.

What would settle it

Training LaST-HD with the auxiliary model's latent targets yields no measurable gain in generalization or task accuracy over a baseline that uses only kinematic retargeting or separate embodiment-specific models.

Figures

Figures reproduced from arXiv: 2606.23685 by Chenyang Gu, Hao Chen, Hao Tang, Jiaming Liu, Jiawei Chen, Lei Wang, Nuowei Han, Peng Jia, Qingpo Wuwu, Shanghang Zhang, Siyuan Qian, Xiangju Mi, Xiaoqi Li, Xuheng Zhang, Yang Yue, Yeqing Yang, Yiming Zhang, Yinxi Wang.

Figure 1
Figure 1. Figure 1: Overview of LaST-HD. LaST-HD aligns human-hand and robot demonstrations through action-conditioned latent physical reasoning, enabling morphology-agnostic action learning under a progressive mixed-to-human training recipe. To support high-quality human data collection, we in￾troduce an OOL glove tailored to LaST-HD. We evaluate LaST-HD on six real-world tasks spanning three embodiments, including dual-arm … view at source ↗
Figure 2
Figure 2. Figure 2: Framework. (a) LaST-HD aligns human-hand and robot data through an action￾conditioned world model, which converts predicted physical consequences into denoised future￾frame features as latent supervision for the reasoning expert. The aligned latent CoT conditions the action expert through shared attention for action generation. (b) We present the OOL Glove and col￾lection platform for human-hand data colle… view at source ↗
Figure 3
Figure 3. Figure 3: (a) Success rate under online correction with varying amounts of correction data. (b) Main ablation studies on LaST-HD design choices. (c) Visualization of latent-token attention maps. Put Items to Bag and Zip Organize Box Pour Water Put Items to Bag and Zip (Human Data - Unseen Background) Organize Box (Human Data - Unseen Object) Pour Water (Human Data - Unseen Position) [PITH_FULL_IMAGE:figures/full_fi… view at source ↗
Figure 4
Figure 4. Figure 4: Robot executions are shown on the left, and OOL Glove data in unseen scenarios on the [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: OOL Glove hardware and human-native data collection setup. (a) Lightweight glove hardware with six compact IMU-based sensing modules. (b) Natural manipulation demon￾strations are collected with synchronized egocentric vision, wrist 6-DoF tracking, and glove-based hand kinematics. (c) The system reconstructs metric 6-DoF hand-wrist keypoints for retargetable action supervision, operating at high sampling ra… view at source ↗
Figure 6
Figure 6. Figure 6: Real-world setup. Visualization of the two robot embodiments and three actuator con￾figurations used in our real-world experiments, together with the corresponding task assets. matches natural human behavior and substantially improves data collection efficiency. In practice, this allows demonstrations to be collected 4–5× faster than standard robot teleoperation while pre￾serving rich human interaction dyn… view at source ↗
Figure 7
Figure 7. Figure 7: UMAP visualization of LaST-HD latent tokens generated by different model variants. [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Visualization of LaST-HD output-token-to-visual-token attention maps. [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Additional visualizations of robot executions and OOL Glove Data. [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Representative failure cases observed during real-world deployment of LaST-HD. [PITH_FULL_IMAGE:figures/full_fig_p023_10.png] view at source ↗
read the original abstract

Human-hand demonstrations provide a direct and scalable source of physical interaction data for robot learning. While manual retargeting is indispensable for establishing kinematic action correspondence across different morphologies, robust transfer requires going beyond geometry to address the underlying alignment of physical dynamics between human and robot manipulation. To address this, we introduce LaST-HD, a novel human-to-robot action learning paradigm that extends reasoning-before-acting VLA by aligning human-hand and robot demonstrations in a shared latent reasoning space. Rather than mimicking human kinematics, LaST-HD trains an auxiliary action-conditioned world model on unpaired human-hand and robot trajectories to synthesize unified latent targets. After aligning cross-embodiment representations in this shared forward-dynamics space, these targets supervise LaST-HD's latent reasoning process, enabling it to internalize shared physical dynamics and drive efficient human-hand action learning. Moreover, we develop Out-of-Lab (OOL) Glove, a low-cost motion-capture glove tailored to LaST-HD for human-hand data collection. The captured human data provide precise keypoints and serve as universal action supervision across grippers and dexterous hands. Armed with the aligned latent space and high-fidelity human-hand data, we develop a progressive mixed-to-human training recipe comprising mixed human-robot co-training and human-hand online correction post-training. Through mixed co-training, LaST-HD improves generalization to novel objects, scenes, and positions using only human-hand demonstrations. With online correction, LaST-HD further adapts to novel environments and achieves over 90\% accuracy using only 20 minutes of OOL glove data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces LaST-HD, a human-to-robot action learning method that extends reasoning-before-acting VLAs by training an auxiliary action-conditioned world model on unpaired human-hand and robot trajectories to produce unified latent targets in a shared forward-dynamics space. These targets align cross-embodiment representations and supervise the model's latent reasoning to internalize shared physical dynamics. The work also presents the OOL Glove for scalable human data collection and a progressive mixed-to-human training recipe, claiming over 90% accuracy on novel objects/scenes with only 20 minutes of human data.

Significance. If the auxiliary world model reliably produces latent targets that capture shared physical dynamics rather than embodiment-specific features, the approach could reduce reliance on robot-specific demonstrations and improve sample efficiency in manipulation learning. The OOL Glove and mixed-training recipe are practical contributions for data collection. The significance is currently limited by the absence of verification for the shared-dynamics assumption and experimental details.

major comments (2)
  1. [Abstract] Abstract: The central claim rests on the auxiliary action-conditioned world model 'synthesiz[ing] unified latent targets' that capture shared physical dynamics across embodiments from unpaired trajectories. No description is given of the world-model objective (e.g., presence or absence of a cross-embodiment consistency term), architecture, or any post-training check that human and robot latents for equivalent physical interactions (identical object contact) lie close together. This assumption is load-bearing; if the model learns separate forward predictors, the 'shared forward-dynamics space' does not exist and the supervision signal reduces to embodiment-specific imitation.
  2. [Abstract] Abstract: The performance claim ('over 90% accuracy using only 20 minutes of OOL glove data') and the 'progressive mixed-to-human training recipe' are stated without any experimental protocol, task definitions, baselines, number of trials, error bars, or ablation results. This prevents assessment of whether the latent-alignment mechanism actually drives the reported gains.
minor comments (1)
  1. [Abstract] Abstract: 'VLA' is introduced without expansion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify areas where the abstract could better convey key technical details and experimental rigor. We respond to each major comment below and indicate planned revisions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim rests on the auxiliary action-conditioned world model 'synthesiz[ing] unified latent targets' that capture shared physical dynamics across embodiments from unpaired trajectories. No description is given of the world-model objective (e.g., presence or absence of a cross-embodiment consistency term), architecture, or any post-training check that human and robot latents for equivalent physical interactions (identical object contact) lie close together. This assumption is load-bearing; if the model learns separate forward predictors, the 'shared forward-dynamics space' does not exist and the supervision signal reduces to embodiment-specific imitation.

    Authors: We agree the abstract is too concise on these points. The full manuscript (Section 3.2) specifies that the auxiliary world model is a transformer-based action-conditioned latent dynamics predictor trained solely with a next-state reconstruction loss on unpaired human-hand and robot trajectories; no explicit cross-embodiment consistency loss is used. Alignment arises from the shared action keypoint representation and joint optimization. The original submission did not include a post-training verification (e.g., latent distance metrics or visualizations for matched physical contacts). We will add this analysis in the revision to substantiate the shared-dynamics claim. revision: yes

  2. Referee: [Abstract] Abstract: The performance claim ('over 90% accuracy using only 20 minutes of OOL glove data') and the 'progressive mixed-to-human training recipe' are stated without any experimental protocol, task definitions, baselines, number of trials, error bars, or ablation results. This prevents assessment of whether the latent-alignment mechanism actually drives the reported gains.

    Authors: The abstract summarizes results; the full experimental protocol, task definitions (novel objects/scenes/positions), baselines, trial counts, error bars, and ablations appear in Section 4 and the supplementary material. We will revise the abstract to include a brief statement of the evaluation setup (e.g., number of trials and key metrics) so readers can better contextualize the claims without needing to consult the body immediately. revision: yes

Circularity Check

0 steps flagged

No circularity: method described at conceptual level without equations or self-referential derivations

full rationale

The provided abstract and description contain no equations, fitted parameters, or explicit derivation steps. The auxiliary action-conditioned world model is introduced as an assumption that produces unified latent targets from unpaired trajectories, but this is presented as a modeling choice rather than a result derived from or equivalent to its own outputs. No self-citations, uniqueness theorems, or renamings of known results are invoked in a load-bearing way within the given text. The central claim therefore remains a high-level architectural proposal without internal reduction to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities beyond the high-level description of the world model and latent space.

pith-pipeline@v0.9.1-grok · 5884 in / 1183 out tokens · 31080 ms · 2026-06-26T07:56:15.072486+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

98 extracted references · 30 linked inside Pith

  1. [1]

    Karamcheti, S

    S. Karamcheti, S. Nair, A. Balakrishna, P. Liang, T. Kollar, and D. Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models. InForty-first Interna- tional Conference on Machine Learning, 2024

  2. [2]

    S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y . Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y . Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang,...

  3. [3]

    Beyer, A

    L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdul- mohsin, M. Tschannen, E. Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

  4. [4]

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  5. [5]

    Black, N

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. pi0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

  6. [6]

    J. Liu, H. Chen, P. An, Z. Liu, R. Zhang, C. Gu, X. Li, Z. Guo, S. Chen, M. Liu, et al. Hy- bridvla: Collaborative diffusion and autoregression in a unified vision-language-action model. arXiv preprint arXiv:2503.10631, 2025

  7. [7]

    H. Chen, J. Liu, C. Gu, Z. Liu, R. Zhang, X. Li, X. He, Y . Guo, C.-W. Fu, S. Zhang, et al. Fast- in-slow: A dual-system foundation model unifying fast manipulation within slow reasoning. arXiv preprint arXiv:2506.01953, 2025

  8. [8]

    Khazatsky, K

    A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. 2024

  9. [9]

    K. Wu, C. Hou, J. Liu, Z. Che, X. Ju, et al. Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation. InRobotics: Science and Systems (RSS)

  10. [10]

    Robotics: Science and Systems Foundation, 2025

  11. [11]

    Padalkar, A

    Open X-Embodiment Collaboration, A. Padalkar, A. Pooley, et al. Open X-Embodiment: Robotic learning datasets and RT-X models.https://arxiv.org/abs/2310.08864, 2023

  12. [12]

    A. S. Chen, S. Nair, and C. Finn. Learning generalizable robotic reward functions from” in- the-wild” human videos.RSS, 2021

  13. [13]

    S. Bahl, A. Gupta, and D. Pathak. Human-to-robot imitation in the wild.arXiv preprint arXiv:2207.09450, 2022. 9

  14. [14]

    R.-Z. Qiu, S. Yang, X. Cheng, C. Chawla, J. Li, T. He, G. Yan, D. J. Yoon, R. Hoque, L. Paulsen, et al. Humanoid policy˜ human policy.arXiv preprint arXiv:2503.13441, 2025

  15. [15]

    Guzey, Y

    I. Guzey, Y . Dai, G. Savva, R. Bhirangi, and L. Pinto. Bridging the human to robot dexterity gap through object-oriented rewards. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 3344–3351. IEEE, 2025

  16. [16]

    Lepert, J

    M. Lepert, J. Fang, and J. Bohg. Phantom: Training robots without robots using only human videos.arXiv preprint arXiv:2503.00779, 2025

  17. [17]

    Kareer, D

    S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu. Egomimic: Scaling imitation learning via egocentric video. In2025 IEEE International Con- ference on Robotics and Automation (ICRA), pages 13226–13233. IEEE, 2025

  18. [18]

    S. Nair, A. Rajeswaran, V . Kumar, C. Finn, and A. Gupta. R3m: A universal visual represen- tation for robot manipulation. InCoRL, 2022

  19. [19]

    Bharadhwaj, R

    H. Bharadhwaj, R. Mottaghi, A. Gupta, and S. Tulsiani. Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation. InEuropean Conference on Computer Vision, pages 306–324. Springer, 2024

  20. [20]

    G. A. Team. Gen-1: Scaling embodied foundation models to mastery.Generalist AI Blog,

  21. [21]

    https://generalistai.com/blog/apr-02-2026-GEN-1

  22. [22]

    Intelligence, B

    P. Intelligence, B. Ai, A. Amin, R. Aniceto, A. Balakrishna, G. Balke, K. Black, G. Bokinsky, S. Cao, T. Charbonnier, et al.π 0.7: A steerable generalist robotic foundation model with emergent capabilities.arXiv preprint arXiv:2604.15483, 2026

  23. [23]

    Kareer, K

    S. Kareer, K. Pertsch, J. Darpinian, J. Hoffman, D. Xu, S. Levine, C. Finn, and S. Nair. Emer- gence of human to robot transfer in vision-language-action models

  24. [24]

    From human skill to robotic mastery.https://cypypccpy.github.io/ tech-blog.github.io/, Mar

    PsiBot Team. From human skill to robotic mastery.https://cypypccpy.github.io/ tech-blog.github.io/, Mar. 2026. Accessed: 2026-05-20

  25. [25]

    H. Bi, L. Wu, T. Lin, H. Tan, Z. Su, H. Su, and J. Zhu. H-rdt: Human manipulation en- hanced bimanual robotic manipulation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18135–18143, 2026

  26. [26]

    Zheng, D

    R. Zheng, D. Niu, Y . Xie, J. Wang, M. Xu, Y . Jiang, F. Casta˜neda, F. Hu, Y . L. Tan, L. Fu, et al. Egoscale: Scaling dexterous manipulation with diverse egocentric human data.arXiv preprint arXiv:2602.16710, 2026

  27. [27]

    Z. Liu, J. Liu, H. Chen, J. Yu, Z. Guo, C. Hou, C. Gu, X. Mi, R. Zhang, K. Wu, et al. Last {0}: Latent spatio-temporal chain-of-thought for robotic vision-language-action model.arXiv preprint arXiv:2601.05248, 2026

  28. [28]

    Y . Guo, L. X. Shi, J. Chen, and C. Finn. Ctrl-world: A controllable generative world model for robot manipulation.arXiv preprint arXiv:2510.10125, 2025

  29. [29]

    McInnes, J

    L. McInnes, J. Healy, and J. Melville. Umap: Uniform manifold approximation and projection for dimension reduction.arXiv preprint arXiv:1802.03426, 2018

  30. [30]

    Zitkovich, T

    B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In7th Annual Conference on Robot Learning, 2023

  31. [31]

    M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025. 10

  32. [32]

    Bjorck, F

    J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

  33. [33]

    Y . Su, N. Liu, D. Chen, Z. Zhao, K. Wu, M. Li, Z. Xu, Z. Che, and J. Tang. Freqpolicy: Efficient flow-based visuomotor policy via frequency consistency. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  34. [34]

    J. Wen, M. Zhu, Y . Zhu, Z. Tang, J. Li, Z. Zhou, C. Li, X. Liu, Y . Peng, C. Shen, et al. Diffusion-vla: Scaling robot foundation models via unified diffusion and autoregression.arXiv preprint arXiv:2412.03293, 2024

  35. [35]

    Intelligence, K

    P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, et al.π 0.5: a vision-language-action model with open-world generalization, 2025. URLhttps://arxiv.org/abs/2504.16054

  36. [36]

    F. Lin, R. Nai, Y . Hu, J. You, J. Zhao, and Y . Gao. Onetwovla: A unified vision-language-action model with adaptive reasoning.arXiv preprint arXiv:2505.11917, 2025

  37. [37]

    Zawalski, W

    M. Zawalski, W. Chen, K. Pertsch, O. Mees, C. Finn, and S. Levine. Robotic control via embodied chain-of-thought reasoning, 2025. URLhttps://arxiv.org/abs/2407.08693

  38. [38]

    Q. Zhao, Y . Lu, M. J. Kim, Z. Fu, Z. Zhang, Y . Wu, Z. Li, Q. Ma, S. Han, C. Finn, et al. Cot- vla: Visual chain-of-thought reasoning for vision-language-action models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025

  39. [39]

    S. Gao, S. Zhou, Y . Du, J. Zhang, and C. Gan. Adaworld: Learning adaptable world models with latent actions.arXiv preprint arXiv:2503.18938, 2025

  40. [40]

    Zhang, H

    W. Zhang, H. Liu, Z. Qi, Y . Wang, X. Yu, J. Zhang, R. Dong, J. He, H. Wang, Z. Zhang, et al. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge. arXiv preprint arXiv:2507.04447, 2025

  41. [41]

    Z. Liu, J. Liu, J. Xu, N. Han, C. Gu, H. Chen, K. Zhou, R. Zhang, K. C. Hsieh, K. Wu, et al. Mla: A multisensory language-action model for multimodal understanding and forecasting in robotic manipulation.arXiv preprint arXiv:2509.26642, 2025

  42. [42]

    J. Cen, C. Yu, H. Yuan, Y . Jiang, S. Huang, J. Guo, X. Li, Y . Song, H. Luo, F. Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

  43. [43]

    C. Gu, J. Liu, H. Chen, R. Huang, Q. Wuwu, Z. Liu, X. Li, Y . Li, R. Zhang, P. Jia, et al. Man- ualvla: A unified vla model for chain-of-thought manual generation and robotic manipulation. arXiv preprint arXiv:2512.02013, 2025

  44. [44]

    Huang, Y .-H

    C.-P. Huang, Y .-H. Wu, M.-H. Chen, Y .-C. F. Wang, and F.-E. Yang. Thinkact: Vision-language-action reasoning via reinforced visual latent planning.arXiv preprint arXiv:2507.16815, 2025

  45. [45]

    J. Cai, Z. Cai, J. Cao, Y . Chen, Z. He, L. Jiang, H. Li, H. Li, Y . Li, Y . Liu, et al. Internvla- a1: Unifying understanding, generation and action for robotic manipulation.arXiv preprint arXiv:2601.02456, 2026

  46. [46]

    J. Lyu, K. Liu, X. Zhang, H. Liao, Y . Feng, W. Zhu, T. Shen, J. Chen, J. Zhang, Y . Dong, et al. Lda-1b: Scaling latent dynamics action model via universal embodied data ingestion.arXiv preprint arXiv:2602.12215, 2026

  47. [47]

    Banerjee, S

    P. Banerjee, S. Shkodrani, P. Moulon, S. Hampali, S. Han, F. Zhang, L. Zhang, J. Fountain, E. Miller, S. Basol, et al. Hot3d: Hand and object tracking in 3d from egocentric multi- view videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7061–7071, 2025. 11

  48. [48]

    Grauman, A

    K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V . Baiyya, S. Bansal, B. Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19383–19400, 2024

  49. [49]

    Grauman, A

    K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022

  50. [50]

    Damen, H

    D. Damen, H. Doughty, G. Maria Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European Conference on Computer Vision (ECCV), pages 720–736, 2018

  51. [51]

    Y . J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V . Kumar, and A. Zhang. Vip: Towards universal visual reward and representation via value-implicit pre-training. InThe Eleventh International Conference on Learning Representations, 2022

  52. [52]

    Majumdar, K

    A. Majumdar, K. Yadav, S. Arnaud, Y . J. Ma, C. Chen, S. Silwal, A. Jain, V .-P. Berges, P. Abbeel, J. Malik, D. Batra, Y . Lin, O. Maksymets, A. Rajeswaran, and F. Meier. Where are we in the search for an artificial visual cortex for embodied intelligence? 2023

  53. [53]

    C. Wang, L. Fan, J. Sun, R. Zhang, L. Fei-Fei, D. Xu, Y . Zhu, and A. Anandkumar. Mimicplay: Long-horizon imitation learning by watching human play.arXiv preprint arXiv:2302.12422, 2023

  54. [54]

    S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak. Affordances from human videos as a versatile representation for robotics. InCVPR, 2023

  55. [55]

    S. Ye, J. Jang, B. Jeon, S. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y .-W. Chao, B. Y . Lin, et al. Latent action pretraining from videos.arXiv preprint arXiv:2410.11758, 2024

  56. [56]

    S. Gao, W. Liang, K. Zheng, A. Malik, S. Ye, S. Yu, W.-C. Tseng, Y . Dong, K. Mo, C.-H. Lin, et al. Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026

  57. [57]

    J. Duan, Y . R. Wang, M. Shridhar, D. Fox, and R. Krishna. Ar2-d2: Training a robot without a robot.arXiv preprint arXiv:2306.13818, 2023

  58. [58]

    Y . Park, J. S. Bhatia, L. Ankile, and P. Agrawal. Dexhub and dart: Towards internet scale robot data collection.arXiv preprint arXiv:2411.02214, 2024

  59. [59]

    T. Tao, M. K. Srirama, J. J. Liu, K. Shaw, and D. Pathak. Dexwild: Dexterous human interac- tions for in-the-wild robot policies.arXiv preprint arXiv:2505.07813, 2025

  60. [60]

    R. Yang, Q. Yu, Y . Wu, R. Yan, B. Li, A.-C. Cheng, X. Zou, Y . Fang, X. Cheng, R.-Z. Qiu, et al. Egovla: Learning vision-language-action models from egocentric human videos.arXiv preprint arXiv:2507.12440, 2025

  61. [61]

    X. Chen, Z. Wu, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, and C. Ruan. Janus-pro: Uni- fied multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

  62. [62]

    X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre- training. InInternational Conference on Computer Vision (ICCV), 2023

  63. [63]

    G. Luo, L. Dunlap, D. H. Park, A. Holynski, and T. Darrell. Diffusion hyperfeatures: Search- ing through time and space for semantic correspondence.Advances in Neural Information Processing Systems, 36:47500–47510, 2023. 12

  64. [64]

    L. Tang, M. Jia, Q. Wang, C. P. Phoo, and B. Hariharan. Emergent correspondence from image diffusion.Advances in neural information processing systems, 36:1363–1389, 2023

  65. [65]

    X. Xu, Y . Hou, Z. Liu, and S. Song. Compliant residual dagger: Improving real-world contact- rich manipulation with human corrections.Advances in Neural Information Processing Sys- tems, 38:139559–139581, 2026

  66. [66]

    M. J. Kim, Y . Gao, T.-Y . Lin, Y .-C. Lin, Y . Ge, G. Lam, P. Liang, S. Song, M.-Y . Liu, C. Finn, and J. Gu. Cosmos policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163, 2026

  67. [67]

    Zhang, J

    J. Zhang, J. Deng, C. Ma, and R. A. Potamias. Hawor: World-space hand motion reconstruc- tion from egocentric videos. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1805–1815, 2025

  68. [68]

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  69. [69]

    O’Neill, A

    A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

  70. [70]

    Ebert, Y

    F. Ebert, Y . Yang, K. Schmeckpeper, B. Bucher, G. Georgakis, K. Daniilidis, C. Finn, and S. Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. InRSS, 2022

  71. [71]

    Walke, K

    H. Walke, K. Black, A. Lee, M. J. Kim, M. Du, C. Zheng, T. Zhao, P. Hansen-Estruch, Q. Vuong, A. He, V . Myers, K. Fang, C. Finn, and S. Levine. Bridgedata v2: A dataset for robot learning at scale, 2023

  72. [72]

    Z. J. Cui, Y . Wang, N. M. M. Shafiullah, and L. Pinto. From play to policy: Conditional behavior generation from uncurated robot data.arXiv preprint arXiv:2210.10047, 2022

  73. [73]

    Kalashnikov, A

    D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V . Vanhoucke, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation.arXiv preprint arXiv:1806.10293, 2018

  74. [74]

    Belkhale, Y

    S. Belkhale, Y . Cui, and D. Sadigh. Hydra: Hybrid robot actions for imitation learning.arxiv, 2023

  75. [75]

    Brohan, N

    A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, A. Herzog, J. Hsu, J. Ibarz, et al. Rt-1: Robotics transformer for real-world control at scale. InarXiv preprint arXiv:2212.06817, 2022

  76. [76]

    Dasari, F

    S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn. Robonet: Large-scale multi-robot learning. InConference on Robot Learning, pages 885–897. PMLR, 2020

  77. [77]

    S. Dass, J. Yapeter, J. Zhang, J. Zhang, K. Pertsch, S. Nikolaidis, and J. J. Lim. CLVR jaco play dataset, 2023. URLhttps://github.com/clvrai/clvr_jaco_play_dataset

  78. [78]

    Lynch, A

    C. Lynch, A. Wahid, J. Tompson, T. Ding, J. Betker, R. Baruch, T. Armstrong, and P. Florence. Interactive language: Talking to robots in real time.IEEE Robotics and Automation Letters, 2023

  79. [79]

    N. M. M. Shafiullah, A. Rai, H. Etukuru, Y . Liu, I. Misra, S. Chintala, and L. Pinto. On bringing robots home, 2023. 13

  80. [80]

    E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. InConference on Robot Learn- ing, pages 991–1002. PMLR, 2022

Showing first 80 references.