LaST-HD: Learning Latent Physical Reasoning from Scalable Human Data for Robot Manipulation

Chenyang Gu; Hao Chen; Hao Tang; Jiaming Liu; Jiawei Chen; Lei Wang; Nuowei Han; Peng Jia; Qingpo Wuwu; Shanghang Zhang

arxiv: 2606.23685 · v1 · pith:B2ASUD25new · submitted 2026-06-22 · 💻 cs.RO

Jiaming Liu, Yinxi Wang, Chenyang Gu, Siyuan Qian, Xiangju Mi, Hao Chen, Jiawei Chen, Qingpo Wuwu, Xiaoqi Li, Nuowei Han, Yiming Zhang, Xuheng Zhang, Yang Yue, Yeqing Yang, Lei Wang, Peng Jia, Hao Tang, Shanghang Zhang

Pith reviewed 2026-06-26 07:56 UTC · model grok-4.3

classification 💻 cs.RO

keywords: robot manipulation, human demonstrations, latent reasoning, world model, cross-embodiment transfer, action learning, motion capture, dexterous manipulation

The pith

LaST-HD aligns human-hand and robot data in a shared latent forward-dynamics space so robots internalize physical dynamics from scalable human demonstrations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LaST-HD as a way to transfer manipulation skills from human hands to robots without relying on manual kinematic retargeting. It trains an auxiliary action-conditioned world model on unpaired human and robot trajectories to produce unified latent targets that reflect shared physical dynamics. These targets then supervise the model's latent reasoning process, allowing it to learn action policies directly from human data. The approach pairs this alignment with a low-cost motion-capture glove and a staged training recipe that mixes human-robot data before refining with human-only corrections.

Core claim

LaST-HD trains an auxiliary action-conditioned world model on unpaired human-hand and robot trajectories to synthesize unified latent targets in a shared forward-dynamics space. After cross-embodiment representations are aligned in this space, the targets supervise the latent reasoning process, enabling the model to internalize shared physical dynamics and support efficient human-hand action learning. The method further includes the OOL Glove for precise keypoint capture and a progressive mixed-to-human training schedule that improves generalization to novel objects and scenes while reaching over 90 percent accuracy with only 20 minutes of additional human data.

What carries the argument

Auxiliary action-conditioned world model that generates unified latent targets from unpaired trajectories to align cross-embodiment representations in a shared forward-dynamics space.
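
To make the data flow concrete, here is a minimal sketch of how a shared action-conditioned forward model could yield latent targets that then supervise a reasoning latent. Everything in it is an assumption for illustration: the linear form `z_next = A z + B a`, the names `predict_next_latent` and `latent_supervision_loss`, and the mean-squared-error objective are not taken from the paper, whose world model the rebuttal further down describes as transformer-based.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Sequence[float]) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def predict_next_latent(z: Sequence[float], action: Sequence[float],
                        A: Matrix, B: Matrix) -> Vector:
    """Toy forward model (hypothetical): z_next = A z + B a.

    The same A and B are used for every embodiment, which is what makes
    the predicted latent a candidate "unified" target in this sketch.
    """
    return [p + q for p, q in zip(matvec(A, z), matvec(B, action))]


def latent_supervision_loss(reasoning_latent: Sequence[float],
                            target_latent: Sequence[float]) -> float:
    """Mean squared error between a reasoning latent and its target."""
    n = len(target_latent)
    return sum((r - t) ** 2 for r, t in zip(reasoning_latent, target_latent)) / n


# Shared weights; actions are assumed to live in a common keypoint space.
A = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.5, 0.0], [0.0, 0.5]]

# The same state and keypoint action give the same target for either embodiment.
human_target = predict_next_latent([0.0, 1.0], [2.0, 0.0], A, B)
robot_target = predict_next_latent([0.0, 1.0], [2.0, 0.0], A, B)

# A reasoning latent that misses the target in one coordinate is penalized.
loss = latent_supervision_loss([1.0, 0.0], human_target)
```

In this toy setting the targets coincide only because the weights and the action representation are shared by construction; whether a learned model behaves this way on real unpaired data is exactly the premise the review questions.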

If this is right

Generalization to novel objects, scenes, and positions improves when training uses only human-hand demonstrations after the latent alignment.
Online correction with 20 minutes of OOL glove data enables over 90 percent accuracy in previously unseen environments.
The progressive recipe of mixed human-robot co-training followed by human-hand correction reduces reliance on large robot-specific datasets.
Latent supervision replaces direct kinematic mimicking as the mechanism for cross-embodiment transfer.
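
The progressive recipe in the points above can be pictured as a batch sampler that mixes sources during the first stage and switches to human-only data afterwards. This is a hedged sketch: the function name `sample_batch_sources`, the step threshold `mixed_steps`, and the mixing probability `human_fraction` are hypothetical, and the paper's actual schedule and ratios are not stated in the text reviewed here.

```python
import random
from typing import List


def sample_batch_sources(step: int, mixed_steps: int, human_fraction: float,
                         batch_size: int, rng: random.Random) -> List[str]:
    """Return the data source ("human" or "robot") for each batch element.

    Stage 1 (step < mixed_steps): mixed human-robot co-training, where each
    element is drawn from human data with probability human_fraction.
    Stage 2 (step >= mixed_steps): human-hand correction data only.
    """
    if step >= mixed_steps:
        return ["human"] * batch_size
    return ["human" if rng.random() < human_fraction else "robot"
            for _ in range(batch_size)]


rng = random.Random(0)
stage_one = sample_batch_sources(step=10, mixed_steps=1000, human_fraction=0.5,
                                 batch_size=8, rng=rng)
stage_two = sample_batch_sources(step=1000, mixed_steps=1000, human_fraction=0.5,
                                 batch_size=8, rng=rng)
```

A hard switch between stages is the simplest choice; a gradual ramp of `human_fraction` would be an equally plausible reading of "progressive".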

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same auxiliary world-model alignment could be tested on additional robotic morphologies such as arms or mobile bases if the forward-dynamics space remains embodiment-agnostic.
If the latent targets prove robust, the method implies that abundant internet-scale human video could substitute for paired robot demonstrations in many manipulation tasks.
The approach opens the possibility of measuring physical-dynamics similarity between any two embodiments by comparing their latent trajectories under the same world model.
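
One way the last extension could be made operational is a per-step cosine similarity between two latent trajectories encoded by the same world model. The sketch below assumes the trajectories are already time-aligned and of equal length; the function names and the choice of cosine similarity are illustrative assumptions, not something the paper proposes.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def trajectory_similarity(traj_a: List[Sequence[float]],
                          traj_b: List[Sequence[float]]) -> float:
    """Mean per-step cosine similarity of two equal-length latent trajectories."""
    if len(traj_a) != len(traj_b) or not traj_a:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(cosine(a, b) for a, b in zip(traj_a, traj_b)) / len(traj_a)


# Hypothetical latents: identical trajectories versus orthogonal ones.
same = trajectory_similarity([[1.0, 0.0], [0.0, 2.0]], [[1.0, 0.0], [0.0, 2.0]])
orthogonal = trajectory_similarity([[1.0, 0.0]], [[0.0, 1.0]])
```

Real human and robot demonstrations rarely share timing, so a practical version would likely need temporal alignment (for example dynamic time warping) before any per-step comparison.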

Load-bearing premise

The auxiliary world model trained on unpaired trajectories produces latent targets that faithfully represent the shared physical dynamics between human hands and robots.

What would settle it

Training LaST-HD with the auxiliary model's latent targets yields no measurable gain in generalization or task accuracy over a baseline that uses only kinematic retargeting or separate embodiment-specific models.
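
Whether a gain is "measurable" depends on trial counts, so a comparison like this usually comes with a significance test. The sketch below is a one-sided Fisher exact test on success counts, written with the standard library only; the function name and the example counts are hypothetical and do not come from the paper.

```python
from math import comb


def fisher_one_sided(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """One-sided Fisher exact p-value for "method A succeeds more often than B".

    Computes P(X >= success_a) where X is hypergeometric: the number of
    successes landing in A's n_a trials when all observed successes are
    distributed at random over the n_a + n_b trials.
    """
    total = n_a + n_b
    total_success = success_a + success_b
    upper = min(total_success, n_a)
    numerator = sum(
        comb(total_success, k) * comb(total - total_success, n_a - k)
        for k in range(success_a, upper + 1)
    )
    return numerator / comb(total, n_a)


# Hypothetical counts: latent-target training versus a retargeting baseline.
p_clear_gain = fisher_one_sided(10, 10, 0, 10)
p_no_gain = fisher_one_sided(5, 10, 5, 10)
```

A large p-value for the latent-target variant against the retargeting baseline would be the "no measurable gain" outcome described above.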

Figures

Figures reproduced from arXiv: 2606.23685 by Chenyang Gu, Hao Chen, Hao Tang, Jiaming Liu, Jiawei Chen, Lei Wang, Nuowei Han, Peng Jia, Qingpo Wuwu, Shanghang Zhang, Siyuan Qian, Xiangju Mi, Xiaoqi Li, Xuheng Zhang, Yang Yue, Yeqing Yang, Yiming Zhang, Yinxi Wang.

**Figure 1.** Overview of LaST-HD. LaST-HD aligns human-hand and robot demonstrations through action-conditioned latent physical reasoning, enabling morphology-agnostic action learning under a progressive mixed-to-human training recipe. To support high-quality human data collection, we introduce an OOL glove tailored to LaST-HD. We evaluate LaST-HD on six real-world tasks spanning three embodiments, including dual-arm …

**Figure 2.** Framework. (a) LaST-HD aligns human-hand and robot data through an action-conditioned world model, which converts predicted physical consequences into denoised future-frame features as latent supervision for the reasoning expert. The aligned latent CoT conditions the action expert through shared attention for action generation. (b) We present the OOL Glove and collection platform for human-hand data colle…

**Figure 3.** (a) Success rate under online correction with varying amounts of correction data. (b) Main ablation studies on LaST-HD design choices. (c) Visualization of latent-token attention maps. Panel titles: Put Items to Bag and Zip; Organize Box; Pour Water; Put Items to Bag and Zip (Human Data - Unseen Background); Organize Box (Human Data - Unseen Object); Pour Water (Human Data - Unseen Position).

**Figure 4.** Robot executions are shown on the left, and OOL Glove data in unseen scenarios on the …

**Figure 5.** OOL Glove hardware and human-native data collection setup. (a) Lightweight glove hardware with six compact IMU-based sensing modules. (b) Natural manipulation demonstrations are collected with synchronized egocentric vision, wrist 6-DoF tracking, and glove-based hand kinematics. (c) The system reconstructs metric 6-DoF hand-wrist keypoints for retargetable action supervision, operating at high sampling ra…

**Figure 6.** Real-world setup. Visualization of the two robot embodiments and three actuator configurations used in our real-world experiments, together with the corresponding task assets.

… matches natural human behavior and substantially improves data collection efficiency. In practice, this allows demonstrations to be collected 4–5× faster than standard robot teleoperation while preserving rich human interaction dyn…

**Figure 7.** UMAP visualization of LaST-HD latent tokens generated by different model variants.

**Figure 8.** Visualization of LaST-HD output-token-to-visual-token attention maps.

**Figure 9.** Additional visualizations of robot executions and OOL Glove Data.

**Figure 10.** Representative failure cases observed during real-world deployment of LaST-HD.

Original abstract

Human-hand demonstrations provide a direct and scalable source of physical interaction data for robot learning. While manual retargeting is indispensable for establishing kinematic action correspondence across different morphologies, robust transfer requires going beyond geometry to address the underlying alignment of physical dynamics between human and robot manipulation. To address this, we introduce LaST-HD, a novel human-to-robot action learning paradigm that extends reasoning-before-acting VLA by aligning human-hand and robot demonstrations in a shared latent reasoning space. Rather than mimicking human kinematics, LaST-HD trains an auxiliary action-conditioned world model on unpaired human-hand and robot trajectories to synthesize unified latent targets. After aligning cross-embodiment representations in this shared forward-dynamics space, these targets supervise LaST-HD's latent reasoning process, enabling it to internalize shared physical dynamics and drive efficient human-hand action learning. Moreover, we develop Out-of-Lab (OOL) Glove, a low-cost motion-capture glove tailored to LaST-HD for human-hand data collection. The captured human data provide precise keypoints and serve as universal action supervision across grippers and dexterous hands. Armed with the aligned latent space and high-fidelity human-hand data, we develop a progressive mixed-to-human training recipe comprising mixed human-robot co-training and human-hand online correction post-training. Through mixed co-training, LaST-HD improves generalization to novel objects, scenes, and positions using only human-hand demonstrations. With online correction, LaST-HD further adapts to novel environments and achieves over 90% accuracy using only 20 minutes of OOL glove data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LaST-HD uses an auxiliary world model on unpaired human-robot trajectories to create shared latent targets for VLA supervision, plus a cheap glove and mixed training recipe, but the shared-dynamics claim needs stronger checks.

The paper's core move is training an action-conditioned world model on unpaired human-hand and robot trajectories to produce unified latent targets, then using those to supervise the reasoning stage of a VLA so the robot internalizes physical dynamics rather than just copying kinematics. They add the low-cost OOL Glove for precise human data and a two-stage recipe of mixed co-training followed by human-hand online correction.

It does a reasonable job naming the gap beyond retargeting and offering a concrete way to scale human data. The glove design is practical for anyone who has tried collecting dexterous demonstrations, and the progressive training idea is a straightforward way to bootstrap generalization before fine-tuning.

The soft spot is exactly the one in the stress-test note. The abstract gives no cross-embodiment consistency term, no reconstruction of matching physical outcomes, and no post-hoc check that human and robot latents for the same contact event sit close together. Without those, the world model could simply learn separate forward predictors, and the claimed shared forward-dynamics space would not exist. That assumption is load-bearing, and the description leaves it unaddressed.

The performance numbers (over 90% accuracy with 20 minutes of OOL glove data) appear in the abstract without protocol, baselines, or error bars, so the central claim is hard to evaluate from what is shown. If the full paper supplies those and shows that the latents actually align, the result strengthens; right now the evidence is thin.
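
To show why error bars matter for a headline success rate, here is a minimal sketch of a Wilson score interval for a binomial proportion. The trial counts in the example (18 successes in 20 trials, then 180 in 200) are hypothetical and chosen only because they both give a 90% point estimate; the paper's actual trial counts are not given in the abstract.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a success rate (z = 1.96 for about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Same 90% point estimate, very different uncertainty (hypothetical counts).
low_small, high_small = wilson_interval(18, 20)
low_large, high_large = wilson_interval(180, 200)
```

The interval from 20 trials is several times wider than the one from 200, which is why a bare "over 90%" is hard to weigh without the number of trials.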

This is for people working on cross-embodiment imitation and VLA models who want to reduce robot-specific data collection. A reader already thinking about latent world models would find the framing useful to test. It deserves a serious referee because the problem is real and the angle is distinct, even though the alignment step needs more scrutiny.

Referee Report

2 major / 1 minor

Summary. The paper introduces LaST-HD, a human-to-robot action learning method that extends reasoning-before-acting VLAs by training an auxiliary action-conditioned world model on unpaired human-hand and robot trajectories to produce unified latent targets in a shared forward-dynamics space. These targets align cross-embodiment representations and supervise the model's latent reasoning to internalize shared physical dynamics. The work also presents the OOL Glove for scalable human data collection and a progressive mixed-to-human training recipe, claiming over 90% accuracy on novel objects/scenes with only 20 minutes of human data.

Significance. If the auxiliary world model reliably produces latent targets that capture shared physical dynamics rather than embodiment-specific features, the approach could reduce reliance on robot-specific demonstrations and improve sample efficiency in manipulation learning. The OOL Glove and mixed-training recipe are practical contributions for data collection. The significance is currently limited by the absence of verification for the shared-dynamics assumption and experimental details.

major comments (2)

[Abstract] The central claim rests on the auxiliary action-conditioned world model 'synthesiz[ing] unified latent targets' that capture shared physical dynamics across embodiments from unpaired trajectories. No description is given of the world-model objective (e.g., presence or absence of a cross-embodiment consistency term), architecture, or any post-training check that human and robot latents for equivalent physical interactions (identical object contact) lie close together. This assumption is load-bearing; if the model learns separate forward predictors, the 'shared forward-dynamics space' does not exist and the supervision signal reduces to embodiment-specific imitation.
[Abstract] The performance claim ('over 90% accuracy using only 20 minutes of OOL glove data') and the 'progressive mixed-to-human training recipe' are stated without any experimental protocol, task definitions, baselines, number of trials, error bars, or ablation results. This prevents assessment of whether the latent-alignment mechanism actually drives the reported gains.

minor comments (1)

[Abstract] 'VLA' is introduced without expansion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify areas where the abstract could better convey key technical details and experimental rigor. We respond to each major comment below and indicate planned revisions.

Referee: [Abstract] The central claim rests on the auxiliary action-conditioned world model 'synthesiz[ing] unified latent targets' that capture shared physical dynamics across embodiments from unpaired trajectories. No description is given of the world-model objective (e.g., presence or absence of a cross-embodiment consistency term), architecture, or any post-training check that human and robot latents for equivalent physical interactions (identical object contact) lie close together. This assumption is load-bearing; if the model learns separate forward predictors, the 'shared forward-dynamics space' does not exist and the supervision signal reduces to embodiment-specific imitation.

Authors: We agree the abstract is too concise on these points. The full manuscript (Section 3.2) specifies that the auxiliary world model is a transformer-based action-conditioned latent dynamics predictor trained solely with a next-state reconstruction loss on unpaired human-hand and robot trajectories; no explicit cross-embodiment consistency loss is used. Alignment arises from the shared action keypoint representation and joint optimization. The original submission did not include a post-training verification (e.g., latent distance metrics or visualizations for matched physical contacts). We will add this analysis in the revision to substantiate the shared-dynamics claim. revision: yes

Referee: [Abstract] The performance claim ('over 90% accuracy using only 20 minutes of OOL glove data') and the 'progressive mixed-to-human training recipe' are stated without any experimental protocol, task definitions, baselines, number of trials, error bars, or ablation results. This prevents assessment of whether the latent-alignment mechanism actually drives the reported gains.

Authors: The abstract summarizes results; the full experimental protocol, task definitions (novel objects/scenes/positions), baselines, trial counts, error bars, and ablations appear in Section 4 and the supplementary material. We will revise the abstract to include a brief statement of the evaluation setup (e.g., number of trials and key metrics) so readers can better contextualize the claims without needing to consult the body immediately. revision: yes
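
The post-training verification promised in the first response could take the form of a simple distance ratio: how far apart matched human and robot latents are, relative to mismatched ones. The sketch below is an assumption about what such a metric might look like; `alignment_ratio`, the Euclidean distance, and the toy latents are all hypothetical, and it presumes that index `i` in both lists encodes the same physical contact event.

```python
import math
from typing import List, Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two latent vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def alignment_ratio(human_latents: List[Sequence[float]],
                    robot_latents: List[Sequence[float]]) -> float:
    """Mean matched-pair distance divided by mean mismatched-pair distance.

    Values well below 1 suggest matched events sit closer together than
    unrelated ones; values near 1 suggest no cross-embodiment alignment.
    """
    n = len(human_latents)
    if n < 2 or n != len(robot_latents):
        raise ValueError("need at least two matched pairs of equal count")
    matched = sum(euclidean(h, r) for h, r in zip(human_latents, robot_latents)) / n
    mismatched = sum(
        euclidean(human_latents[i], robot_latents[j])
        for i in range(n) for j in range(n) if i != j
    ) / (n * (n - 1))
    if mismatched == 0.0:
        raise ValueError("mismatched distance is zero; ratio undefined")
    return matched / mismatched


# Toy latents: two contact events, each seen once per embodiment.
ratio = alignment_ratio([[0.0, 0.0], [10.0, 0.0]], [[0.0, 1.0], [10.0, 1.0]])
```

A ratio like this only helps if the matched pairs are labeled independently of the model being tested; otherwise the check risks confirming the alignment it assumes.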

Circularity Check

0 steps flagged

No circularity: method described at conceptual level without equations or self-referential derivations

The provided abstract and description contain no equations, fitted parameters, or explicit derivation steps. The auxiliary action-conditioned world model is introduced as an assumption that produces unified latent targets from unpaired trajectories, but this is presented as a modeling choice rather than a result derived from or equivalent to its own outputs. No self-citations, uniqueness theorems, or renamings of known results are invoked in a load-bearing way within the given text. The central claim therefore remains a high-level architectural proposal without internal reduction to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities beyond the high-level description of the world model and latent space.

pith-pipeline@v0.9.1-grok · 5884 in / 1183 out tokens · 31080 ms · 2026-06-26T07:56:15.072486+00:00 · methodology

Reference graph

Works this paper leans on

98 extracted references · 30 linked inside Pith

[1]

Karamcheti, S

S. Karamcheti, S. Nair, A. Balakrishna, P. Liang, T. Kollar, and D. Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models. InForty-first Interna- tional Conference on Machine Learning, 2024

2024
[2]

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y . Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y . Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang,...

Pith/arXiv arXiv 2025
[3]

Beyer, A

L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdul- mohsin, M. Tschannen, E. Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

Pith/arXiv arXiv 2024
[4]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

Pith/arXiv arXiv 2024
[5]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. pi0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024
[6]

J. Liu, H. Chen, P. An, Z. Liu, R. Zhang, C. Gu, X. Li, Z. Guo, S. Chen, M. Liu, et al. Hy- bridvla: Collaborative diffusion and autoregression in a unified vision-language-action model. arXiv preprint arXiv:2503.10631, 2025

Pith/arXiv arXiv 2025
[7]

H. Chen, J. Liu, C. Gu, Z. Liu, R. Zhang, X. Li, X. He, Y . Guo, C.-W. Fu, S. Zhang, et al. Fast- in-slow: A dual-system foundation model unifying fast manipulation within slow reasoning. arXiv preprint arXiv:2506.01953, 2025

arXiv 2025
[8]

Khazatsky, K

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. 2024

2024
[9]

K. Wu, C. Hou, J. Liu, Z. Che, X. Ju, et al. Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation. InRobotics: Science and Systems (RSS)
[10]

Robotics: Science and Systems Foundation, 2025

2025
[11]

Padalkar, A

Open X-Embodiment Collaboration, A. Padalkar, A. Pooley, et al. Open X-Embodiment: Robotic learning datasets and RT-X models.https://arxiv.org/abs/2310.08864, 2023

Pith/arXiv arXiv 2023
[12]

A. S. Chen, S. Nair, and C. Finn. Learning generalizable robotic reward functions from” in- the-wild” human videos.RSS, 2021

2021
[13]

S. Bahl, A. Gupta, and D. Pathak. Human-to-robot imitation in the wild.arXiv preprint arXiv:2207.09450, 2022. 9

arXiv 2022
[14]

R.-Z. Qiu, S. Yang, X. Cheng, C. Chawla, J. Li, T. He, G. Yan, D. J. Yoon, R. Hoque, L. Paulsen, et al. Humanoid policy˜ human policy.arXiv preprint arXiv:2503.13441, 2025

arXiv 2025
[15]

Guzey, Y

I. Guzey, Y . Dai, G. Savva, R. Bhirangi, and L. Pinto. Bridging the human to robot dexterity gap through object-oriented rewards. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 3344–3351. IEEE, 2025

2025
[16]

Lepert, J

M. Lepert, J. Fang, and J. Bohg. Phantom: Training robots without robots using only human videos.arXiv preprint arXiv:2503.00779, 2025

Pith/arXiv arXiv 2025
[17]

Kareer, D

S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu. Egomimic: Scaling imitation learning via egocentric video. In2025 IEEE International Con- ference on Robotics and Automation (ICRA), pages 13226–13233. IEEE, 2025

2025
[18]

S. Nair, A. Rajeswaran, V . Kumar, C. Finn, and A. Gupta. R3m: A universal visual represen- tation for robot manipulation. InCoRL, 2022

2022
[19]

Bharadhwaj, R

H. Bharadhwaj, R. Mottaghi, A. Gupta, and S. Tulsiani. Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation. InEuropean Conference on Computer Vision, pages 306–324. Springer, 2024

2024
[20]

G. A. Team. Gen-1: Scaling embodied foundation models to mastery.Generalist AI Blog,
[21]

https://generalistai.com/blog/apr-02-2026-GEN-1

2026
[22]

Intelligence, B

P. Intelligence, B. Ai, A. Amin, R. Aniceto, A. Balakrishna, G. Balke, K. Black, G. Bokinsky, S. Cao, T. Charbonnier, et al.π 0.7: A steerable generalist robotic foundation model with emergent capabilities.arXiv preprint arXiv:2604.15483, 2026

Pith/arXiv arXiv 2026
[23]

Kareer, K

S. Kareer, K. Pertsch, J. Darpinian, J. Hoffman, D. Xu, S. Levine, C. Finn, and S. Nair. Emer- gence of human to robot transfer in vision-language-action models
[24]

From human skill to robotic mastery.https://cypypccpy.github.io/ tech-blog.github.io/, Mar

PsiBot Team. From human skill to robotic mastery.https://cypypccpy.github.io/ tech-blog.github.io/, Mar. 2026. Accessed: 2026-05-20

2026
[25]

H. Bi, L. Wu, T. Lin, H. Tan, Z. Su, H. Su, and J. Zhu. H-rdt: Human manipulation en- hanced bimanual robotic manipulation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18135–18143, 2026

2026
[26]

Zheng, D

R. Zheng, D. Niu, Y . Xie, J. Wang, M. Xu, Y . Jiang, F. Casta˜neda, F. Hu, Y . L. Tan, L. Fu, et al. Egoscale: Scaling dexterous manipulation with diverse egocentric human data.arXiv preprint arXiv:2602.16710, 2026

arXiv 2026
[27]

Z. Liu, J. Liu, H. Chen, J. Yu, Z. Guo, C. Hou, C. Gu, X. Mi, R. Zhang, K. Wu, et al. Last {0}: Latent spatio-temporal chain-of-thought for robotic vision-language-action model.arXiv preprint arXiv:2601.05248, 2026

arXiv 2026
[28]

Y . Guo, L. X. Shi, J. Chen, and C. Finn. Ctrl-world: A controllable generative world model for robot manipulation.arXiv preprint arXiv:2510.10125, 2025

Pith/arXiv arXiv 2025
[29]

McInnes, J

L. McInnes, J. Healy, and J. Melville. Umap: Uniform manifold approximation and projection for dimension reduction.arXiv preprint arXiv:1802.03426, 2018

Pith/arXiv arXiv 2018
[30]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In7th Annual Conference on Robot Learning, 2023

2023
[31]

M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025. 10

Pith/arXiv arXiv 2025
[32]

Bjorck, F

J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

Pith/arXiv arXiv 2025
[33]

Y . Su, N. Liu, D. Chen, Z. Zhao, K. Wu, M. Li, Z. Xu, Z. Che, and J. Tang. Freqpolicy: Efficient flow-based visuomotor policy via frequency consistency. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[34]

J. Wen, M. Zhu, Y . Zhu, Z. Tang, J. Li, Z. Zhou, C. Li, X. Liu, Y . Peng, C. Shen, et al. Diffusion-vla: Scaling robot foundation models via unified diffusion and autoregression.arXiv preprint arXiv:2412.03293, 2024

arXiv 2024
[35]

Intelligence, K

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, et al.π 0.5: a vision-language-action model with open-world generalization, 2025. URLhttps://arxiv.org/abs/2504.16054

Pith/arXiv arXiv 2025
[36]

F. Lin, R. Nai, Y . Hu, J. You, J. Zhao, and Y . Gao. Onetwovla: A unified vision-language-action model with adaptive reasoning.arXiv preprint arXiv:2505.11917, 2025

arXiv 2025
[37]

Zawalski, W

M. Zawalski, W. Chen, K. Pertsch, O. Mees, C. Finn, and S. Levine. Robotic control via embodied chain-of-thought reasoning, 2025. URLhttps://arxiv.org/abs/2407.08693

Pith/arXiv arXiv 2025
[38]

Q. Zhao, Y . Lu, M. J. Kim, Z. Fu, Z. Zhang, Y . Wu, Z. Li, Q. Ma, S. Han, C. Finn, et al. Cot- vla: Visual chain-of-thought reasoning for vision-language-action models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025

2025
[39]

S. Gao, S. Zhou, Y . Du, J. Zhang, and C. Gan. Adaworld: Learning adaptable world models with latent actions.arXiv preprint arXiv:2503.18938, 2025

arXiv 2025
[40]

Zhang, H

W. Zhang, H. Liu, Z. Qi, Y . Wang, X. Yu, J. Zhang, R. Dong, J. He, H. Wang, Z. Zhang, et al. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge. arXiv preprint arXiv:2507.04447, 2025

Pith/arXiv arXiv 2025
[41]

Z. Liu, J. Liu, J. Xu, N. Han, C. Gu, H. Chen, K. Zhou, R. Zhang, K. C. Hsieh, K. Wu, et al. Mla: A multisensory language-action model for multimodal understanding and forecasting in robotic manipulation.arXiv preprint arXiv:2509.26642, 2025

arXiv 2025
[42]

J. Cen, C. Yu, H. Yuan, Y . Jiang, S. Huang, J. Guo, X. Li, Y . Song, H. Luo, F. Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

Pith/arXiv arXiv 2025
[43]

C. Gu, J. Liu, H. Chen, R. Huang, Q. Wuwu, Z. Liu, X. Li, Y . Li, R. Zhang, P. Jia, et al. Man- ualvla: A unified vla model for chain-of-thought manual generation and robotic manipulation. arXiv preprint arXiv:2512.02013, 2025

arXiv 2025
[44]

Huang, Y .-H

C.-P. Huang, Y .-H. Wu, M.-H. Chen, Y .-C. F. Wang, and F.-E. Yang. Thinkact: Vision-language-action reasoning via reinforced visual latent planning.arXiv preprint arXiv:2507.16815, 2025

Pith/arXiv arXiv 2025
[45]

J. Cai, Z. Cai, J. Cao, Y . Chen, Z. He, L. Jiang, H. Li, H. Li, Y . Li, Y . Liu, et al. Internvla- a1: Unifying understanding, generation and action for robotic manipulation.arXiv preprint arXiv:2601.02456, 2026

arXiv 2026
[46]

J. Lyu, K. Liu, X. Zhang, H. Liao, Y . Feng, W. Zhu, T. Shen, J. Chen, J. Zhang, Y . Dong, et al. Lda-1b: Scaling latent dynamics action model via universal embodied data ingestion.arXiv preprint arXiv:2602.12215, 2026

Pith/arXiv arXiv 2026
[47]

Banerjee, S

P. Banerjee, S. Shkodrani, P. Moulon, S. Hampali, S. Han, F. Zhang, L. Zhang, J. Fountain, E. Miller, S. Basol, et al. Hot3d: Hand and object tracking in 3d from egocentric multi- view videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7061–7071, 2025. 11

2025
[48]

Grauman, A

K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V . Baiyya, S. Bansal, B. Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19383–19400, 2024

2024
[49]

Grauman, A

K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022

2022
[50]

Damen, H

D. Damen, H. Doughty, G. Maria Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European Conference on Computer Vision (ECCV), pages 720–736, 2018

2018
[51]

Y . J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V . Kumar, and A. Zhang. Vip: Towards universal visual reward and representation via value-implicit pre-training. InThe Eleventh International Conference on Learning Representations, 2022

2022
[52]

Majumdar, K

A. Majumdar, K. Yadav, S. Arnaud, Y . J. Ma, C. Chen, S. Silwal, A. Jain, V .-P. Berges, P. Abbeel, J. Malik, D. Batra, Y . Lin, O. Maksymets, A. Rajeswaran, and F. Meier. Where are we in the search for an artificial visual cortex for embodied intelligence? 2023

2023
[53]

C. Wang, L. Fan, J. Sun, R. Zhang, L. Fei-Fei, D. Xu, Y . Zhu, and A. Anandkumar. Mimicplay: Long-horizon imitation learning by watching human play.arXiv preprint arXiv:2302.12422, 2023

arXiv 2023
[54]

S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak. Affordances from human videos as a versatile representation for robotics. InCVPR, 2023

2023
[55]

S. Ye, J. Jang, B. Jeon, S. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y .-W. Chao, B. Y . Lin, et al. Latent action pretraining from videos.arXiv preprint arXiv:2410.11758, 2024

Pith/arXiv arXiv 2024
[56]

S. Gao, W. Liang, K. Zheng, A. Malik, S. Ye, S. Yu, W.-C. Tseng, Y . Dong, K. Mo, C.-H. Lin, et al. Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026

Pith/arXiv arXiv 2026
[57]

J. Duan, Y . R. Wang, M. Shridhar, D. Fox, and R. Krishna. Ar2-d2: Training a robot without a robot.arXiv preprint arXiv:2306.13818, 2023

arXiv 2023
[58]

Y . Park, J. S. Bhatia, L. Ankile, and P. Agrawal. Dexhub and dart: Towards internet scale robot data collection.arXiv preprint arXiv:2411.02214, 2024

arXiv 2024
[59]

T. Tao, M. K. Srirama, J. J. Liu, K. Shaw, and D. Pathak. Dexwild: Dexterous human interac- tions for in-the-wild robot policies.arXiv preprint arXiv:2505.07813, 2025

Pith/arXiv arXiv 2025
[60]

R. Yang, Q. Yu, Y . Wu, R. Yan, B. Li, A.-C. Cheng, X. Zou, Y . Fang, X. Cheng, R.-Z. Qiu, et al. Egovla: Learning vision-language-action models from egocentric human videos.arXiv preprint arXiv:2507.12440, 2025

Pith/arXiv arXiv 2025
[61]

X. Chen, Z. Wu, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, and C. Ruan. Janus-pro: Uni- fied multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

Pith/arXiv arXiv 2025
[62]

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre- training. InInternational Conference on Computer Vision (ICCV), 2023

2023
[63]

G. Luo, L. Dunlap, D. H. Park, A. Holynski, and T. Darrell. Diffusion hyperfeatures: Search- ing through time and space for semantic correspondence.Advances in Neural Information Processing Systems, 36:47500–47510, 2023. 12

[64]

L. Tang, M. Jia, Q. Wang, C. P. Phoo, and B. Hariharan. Emergent correspondence from image diffusion. Advances in Neural Information Processing Systems, 36:1363–1389, 2023

[65]

X. Xu, Y. Hou, Z. Liu, and S. Song. Compliant residual dagger: Improving real-world contact-rich manipulation with human corrections. Advances in Neural Information Processing Systems, 38:139559–139581, 2026

[66]

M. J. Kim, Y. Gao, T.-Y. Lin, Y.-C. Lin, Y. Ge, G. Lam, P. Liang, S. Song, M.-Y. Liu, C. Finn, and J. Gu. Cosmos policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163, 2026

[67]

J. Zhang, J. Deng, C. Ma, and R. A. Potamias. Hawor: World-space hand motion reconstruction from egocentric videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1805–1815, 2025

[68]

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[69]

A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

[70]

F. Ebert, Y. Yang, K. Schmeckpeper, B. Bucher, G. Georgakis, K. Daniilidis, C. Finn, and S. Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. In RSS, 2022

[71]

H. Walke, K. Black, A. Lee, M. J. Kim, M. Du, C. Zheng, T. Zhao, P. Hansen-Estruch, Q. Vuong, A. He, V. Myers, K. Fang, C. Finn, and S. Levine. Bridgedata v2: A dataset for robot learning at scale, 2023

[72]

Z. J. Cui, Y. Wang, N. M. M. Shafiullah, and L. Pinto. From play to policy: Conditional behavior generation from uncurated robot data. arXiv preprint arXiv:2210.10047, 2022

[73]

D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018

[74]

S. Belkhale, Y. Cui, and D. Sadigh. Hydra: Hybrid robot actions for imitation learning. arXiv, 2023

[75]

A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, J. Ibarz, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

[76]

S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn. Robonet: Large-scale multi-robot learning. In Conference on Robot Learning, pages 885–897. PMLR, 2020

[77]

S. Dass, J. Yapeter, J. Zhang, J. Zhang, K. Pertsch, S. Nikolaidis, and J. J. Lim. CLVR jaco play dataset, 2023. URL https://github.com/clvrai/clvr_jaco_play_dataset

[78]

C. Lynch, A. Wahid, J. Tompson, T. Ding, J. Betker, R. Baruch, T. Armstrong, and P. Florence. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters, 2023

[79]

N. M. M. Shafiullah, A. Rai, H. Etukuru, Y. Liu, I. Misra, S. Chintala, and L. Pinto. On bringing robots home, 2023

[80]

E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pages 991–1002. PMLR, 2022

