Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos

Jian Wang; Jincheng Yu; Runze Xu; Yiluo Zhang; Yu Wang

arxiv: 2606.18955 · v1 · pith:LO3POELOnew · submitted 2026-06-17 · 💻 cs.CV · cs.RO

Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos

Runze Xu , Yiluo Zhang , Jian Wang , Yu Wang , Jincheng Yu This is my paper

Pith reviewed 2026-06-26 21:51 UTC · model grok-4.3

classification 💻 cs.CV cs.RO

keywords vision-language-actionlatent actionegocentric videocross-embodimentVQ-VAEpre-trainingrobotic manipulationdisentangled representation

0 comments

The pith

Pre-training on unlabeled human videos via motion-focused latent actions matches VLA models trained on massive annotated robot datasets while needing only 50 trajectories for adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language-action models normally require enormous labeled robotic datasets that are expensive to gather. This work instead extracts reusable action representations from plentiful unlabeled egocentric human videos. A Hybrid Disentangled VQ-VAE isolates motion dynamics from scene backgrounds using physical masks to form a shared cross-embodiment action codebook. The vision-language model is pre-trained on this codebook, then adapted to a robot by letting the model predict action intent while a frozen encoder supplies embodiment-specific state features. Experiments in simulation and on real robots show the resulting policy performs on par with models trained on far larger labeled corpora.

Core claim

By pre-training exclusively on unlabeled human videos with a motion-focused latent action codebook derived from a Hybrid Disentangled VQ-VAE, and employing an intent-perception decoupling strategy during adaptation, the method enables VLA models to perform competitively with state-of-the-art models trained on large annotated robotic datasets, requiring only 50 trajectories for downstream adaptation to specific embodiments.

What carries the argument

Hybrid Disentangled VQ-VAE that applies physical masks to separate motion dynamics from environmental backgrounds and thereby builds a transferable cross-embodiment action codebook.

If this is right

VLA pre-training can shift from scarce labeled robot data to abundant unlabeled human videos.
Downstream adaptation to new robot bodies requires only tens of trajectories instead of thousands.
Separating intent prediction from state perception reduces hallucinations during embodiment transfer.
The same codebook supports both simulation and real-world deployment without additional annotation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Internet-scale human video collections could be used directly for VLA pre-training if the disentanglement step generalizes.
The motion-centric codebook might support zero-shot transfer to entirely new robot morphologies once the intent predictor is fixed.
Similar disentanglement could be applied to navigation or multi-agent settings where action labels are also missing.

Load-bearing premise

The VQ-VAE must reliably extract motion-only codes that remain useful and hallucination-free when the same codebook is later used for robot adaptation.

What would settle it

A controlled test showing that the pre-trained model needs substantially more than 50 robot trajectories to reach parity with annotated-data baselines, or that it produces frequent action hallucinations on held-out real-world tasks.

Figures

Figures reproduced from arXiv: 2606.18955 by Jian Wang, Jincheng Yu, Runze Xu, Yiluo Zhang, Yu Wang.

**Figure 1.** Figure 1: Method overview. We propose a human-video-driven framework for training vision–language–action models. A hybrid disentangled VQ-VAE first extracts transferable latent action codes from unlabeled human videos. These codes are then used as supervision to pre-train the VLM to infer action intentions from observations and instructions. Finally, with only a small number of robot trajectories, the VLM backbone a… view at source ↗

**Figure 2.** Figure 2: Hybrid Disentangled VQ-VAE. The VQ-VAE model decomposes short-term visual changes into discrete action and background latent spaces via a dual-path vector quantization bottleneck. A shared mask-guided decoder enforces semantic separation by reconstructing motion-related and background regions from corresponding latent codes, enabling the extraction of transferable action intentions from videos. III. METH… view at source ↗

**Figure 3.** Figure 3: Real-world experiments. (a) The physical dual-arm robot platform used for evaluation. (b) Three real-world manipulation tasks, including placing a bottle on a plate, unplugging a power cord, and folding a towel. (c) Task success rates compared with UniVLA, showing improved transfer of action intentions from human videos to the real robot. Notably, the “Place Bottle” task shows a lower success rate. The bot… view at source ↗

**Figure 4.** Figure 4: Comparison of latent action alignment consistency. The proposed Motion-Focused latent action outperforms UniVLA in CKA metrics, indicating a more coherent cross-embodiment action space. C. Latent Action Evaluation To explain the observed generalization performance at the representation level, we design an alignment analysis method based on domain subspace elimination. This approach quantitatively assesses… view at source ↗

**Figure 5.** Figure 5: Latent Action Visualization. Image pairs from different datasets with same latent codes. Despite different morphologies, robot arms and human hands are assigned the same action tokens. values between the resulting feature matrices. The results (Fig. 4b) show that UniVLA exhibits lower consistency with a mean CKA of 0.8659, whereas our Motion-Focused Latent Action achieves a significantly higher alignment w… view at source ↗

read the original abstract

Training generalist Vision-Language-Action(VLA) models typically requires massive, diverse robotic datasets with high-fidelity action annotations. While egocentric human manipulation videos are abundant and capture significant environmental diversity, the absence of action labels makes them difficult to use in conventional training paradigms. To address this, we propose a latent-action-based framework designed to extract general action priors from unlabeled human videos. The architecture features a Hybrid Disentangled VQ-VAE that decouples motion dynamics from environmental backgrounds through physical masks, enabling the construction of a cross-embodiment action codebook. By pre-training on human videos with the codebook, the VLM backbone learns deep representations of action intent. For adaptation to specific embodiments, we introduce an intent-perception decoupling strategy where the VLM predicts the action intent while a separate frozen visual encoder provides state-specific features to the action expert, thereby reducing action hallucinations. Results in simulation and real-world environments show that our method, pre-trained exclusively on unlabeled human videos, performs competitively with state-of-the-art VLA models trained on massive annotated datasets, requiring only 50 trajectories for downstream adaptation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The idea of pulling cross-embodiment action priors from unlabeled human video via a disentangled VQ-VAE is reasonable on paper, but the physical-mask step looks like the weakest link and the abstract gives almost nothing to verify the competitive results.

read the letter

The paper's main move is to pre-train a VLA backbone on abundant human ego-videos by first learning a motion-focused latent action codebook with a Hybrid Disentangled VQ-VAE that uses physical masks to separate dynamics from background, then adding an intent-perception split at adaptation time so the VLM predicts intent while a frozen encoder supplies embodiment-specific state. That setup lets them claim competitive sim and real performance after only 50 downstream trajectories.

The approach is sensible in principle: human video is cheap and diverse, and the low-data adaptation target is realistic for robot work. The decoupling strategy is a straightforward attempt to limit action hallucinations.

The soft spots are the usual ones for an abstract-only view plus one specific issue. No equations, training curves, baseline tables, or data splits are shown, so the performance claim cannot be assessed. More critically, the physical masks are load-bearing for the disentanglement and the cross-embodiment guarantee, yet the abstract is silent on how they are produced. If mask generation uses any pre-trained model, heuristics tuned to particular embodiments, or even indirect supervision, the "exclusively unlabeled" pre-training story does not hold and the stress-test concern lands. That single point undercuts the central selling proposition until the methods section is inspected.

This is for groups already working on scalable VLA pre-training who need concrete ways to ingest human video. It is worth sending to peer review so the mask provenance, codebook transfer experiments, and adaptation results can be checked directly; the idea is coherent enough on its own terms to justify referee time.

Referee Report

2 major / 0 minor

Summary. The paper proposes a latent-action framework to pre-train VLA models on abundant unlabeled human egocentric videos. It introduces a Hybrid Disentangled VQ-VAE that uses physical masks to decouple motion dynamics from backgrounds and construct a cross-embodiment action codebook; the VLM backbone is pre-trained on this codebook to learn action intent representations. For embodiment-specific adaptation an intent-perception decoupling strategy is used in which the VLM predicts intent while a frozen visual encoder supplies state features to an action expert. The central empirical claim is that this pipeline, after pre-training only on unlabeled human videos, achieves competitive performance with SOTA VLA models trained on massive annotated robotic datasets while requiring only 50 trajectories for downstream adaptation in simulation and real-world settings.

Significance. If the decoupling and transfer claims hold, the work would materially reduce the annotation burden for generalist VLA training by converting readily available human video into usable action priors, thereby improving scalability and cross-embodiment generalization. The explicit separation of intent prediction from embodiment-specific perception is a concrete architectural contribution that could be adopted more broadly.

major comments (2)

[Abstract] Abstract: The headline result (competitive performance after exclusively unlabeled pre-training followed by 50-trajectory adaptation) rests on the Hybrid Disentangled VQ-VAE producing a transferable cross-embodiment action codebook. This in turn requires that the physical masks cleanly isolate motion dynamics without embodiment leakage or hallucinations. The abstract supplies no information on the provenance or generation procedure for these masks; if mask creation depends on any pre-trained model, labeled data, or embodiment-tuned heuristics, the "exclusively unlabeled" pre-training claim is contradicted and the cross-embodiment guarantee is weakened. This is load-bearing for the central claim.
[Abstract] Abstract: The competitive-performance statement is presented without reference to specific baselines, error bars, data-exclusion criteria, or statistical tests. Because the soundness of the result cannot be assessed from the given description, the empirical support for the 50-trajectory adaptation claim remains unverified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the two major comments below and will revise the abstract to improve clarity on both mask generation and empirical reporting.

read point-by-point responses

Referee: [Abstract] Abstract: The headline result (competitive performance after exclusively unlabeled pre-training followed by 50-trajectory adaptation) rests on the Hybrid Disentangled VQ-VAE producing a transferable cross-embodiment action codebook. This in turn requires that the physical masks cleanly isolate motion dynamics without embodiment leakage or hallucinations. The abstract supplies no information on the provenance or generation procedure for these masks; if mask creation depends on any pre-trained model, labeled data, or embodiment-tuned heuristics, the "exclusively unlabeled" pre-training claim is contradicted and the cross-embodiment guarantee is weakened. This is load-bearing for the central claim.

Authors: The physical masks are generated via an unsupervised motion segmentation pipeline that computes dense optical flow between consecutive frames and applies adaptive thresholding to isolate dynamic regions from static backgrounds; no pre-trained models, human labels, or embodiment-specific heuristics are used. This procedure is described in Section 3.2 and preserves the exclusively unlabeled pre-training claim. We agree the abstract should state this explicitly and will revise it to include a one-sentence description of the mask generation method. revision: yes
Referee: [Abstract] Abstract: The competitive-performance statement is presented without reference to specific baselines, error bars, data-exclusion criteria, or statistical tests. Because the soundness of the result cannot be assessed from the given description, the empirical support for the 50-trajectory adaptation claim remains unverified.

Authors: The abstract is intentionally concise; full details appear in Section 4 and Tables 1–3, which report comparisons against RT-2, OpenVLA, and Octo with mean success rates plus standard deviations over five random seeds, exclusion of failed rollouts, and paired t-tests (p < 0.05). We will revise the abstract to name the primary baselines and note that all reported results include error bars and statistical testing. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents a framework for pre-training a VLA model on unlabeled human videos via a Hybrid Disentangled VQ-VAE that applies physical masks for motion-background decoupling, followed by intent-perception decoupling for adaptation. No equations, fitted parameters, or self-citations are exhibited in the provided text that reduce the cross-embodiment codebook, action intent prediction, or competitive performance claims to quantities defined by construction from the inputs themselves. The results are framed as empirical outcomes from simulation and real-world tests after 50-trajectory adaptation rather than tautological redefinitions or self-referential normalizations, making the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Only the abstract is available, so the ledger is necessarily incomplete and based solely on stated components; no free-parameter values or independent evidence for new entities can be verified.

axioms (1)

domain assumption Physical masks can reliably isolate motion dynamics from static backgrounds in egocentric human videos.
Invoked to enable the Hybrid Disentangled VQ-VAE to build the cross-embodiment action codebook.

invented entities (2)

Hybrid Disentangled VQ-VAE no independent evidence
purpose: Decouple motion from background to construct a shared action codebook from unlabeled video.
New architecture introduced for the pre-training stage.
intent-perception decoupling strategy no independent evidence
purpose: Separate VLM intent prediction from embodiment-specific state features to reduce hallucinations during adaptation.
New training strategy for the adaptation phase.

pith-pipeline@v0.9.1-grok · 5739 in / 1486 out tokens · 22742 ms · 2026-06-26T21:51:11.549618+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

30 extracted references · 15 linked inside Pith

[1]

Rt-2: Vision-language-action models transfer web knowledge to robotic control,

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahidet al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” inConference on Robot Learning. PMLR, 2023, pp. 2165–2183

2023
[2]

pi 0: A vision- language-action flow model for general robot control,

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichteret al., “pi 0: A vision- language-action flow model for general robot control,”arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024
[3]

Rdt-1b: a diffusion foundation model for bimanual manipulation,

S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu, “Rdt-1b: a diffusion foundation model for bimanual manipulation,”arXiv preprint arXiv:2410.07864, 2024

Pith/arXiv arXiv 2024
[4]

Open x-embodiment: Robotic learning datasets and rt-x models: Open x- embodiment collaboration 0,

A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jainet al., “Open x-embodiment: Robotic learning datasets and rt-x models: Open x- embodiment collaboration 0,” in2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 6892–6903

2024
[5]

Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems,

Q. Bu, J. Cai, L. Chen, X. Cui, Y . Ding, S. Feng, S. Gao, X. He, X. Hu, X. Huanget al., “Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems,”arXiv preprint arXiv:2503.06669, 2025

Pith/arXiv arXiv 2025
[6]

Egomimic: Scaling imitation learning via egocentric video,

S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu, “Egomimic: Scaling imitation learning via egocentric video,” in2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 13 226–13 233

2025
[7]

Motiontrans: Human vr data enable motion-level learning for robotic manipulation policies,

C. Yuan, R. Zhou, M. Liu, Y . Hu, S. Wang, L. Yi, C. Wen, S. Zhang, and Y . Gao, “Motiontrans: Human vr data enable motion-level learning for robotic manipulation policies,”arXiv preprint arXiv:2509.17759, 2025

arXiv 2025
[8]

H-rdt: Human manipulation enhanced bimanual robotic manipulation,

H. Bi, L. Wu, T. Lin, H. Tan, Z. Su, H. Su, and J. Zhu, “H-rdt: Human manipulation enhanced bimanual robotic manipulation,”arXiv preprint arXiv:2507.23523, 2025

arXiv 2025
[9]

Latent action pretraining from videos,

S. Ye, J. Jang, B. Jeon, S. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y .-W. Chao, B. Y . Linet al., “Latent action pretraining from videos,” arXiv preprint arXiv:2410.11758, 2024

Pith/arXiv arXiv 2024
[10]

Igor: Image-goal representations are the atomic con- trol units for foundation models in embodied ai,

X. Chen, J. Guo, T. He, C. Zhang, P. Zhang, D. C. Yang, L. Zhao, and J. Bian, “Igor: Image-goal representations are the atomic con- trol units for foundation models in embodied ai,”arXiv preprint arXiv:2411.00785, 2024

arXiv 2024
[11]

What do latent action models actually learn?

C. Zhang, T. Pearce, P. Zhang, K. Wang, X. Chen, W. Shen, L. Zhao, and J. Bian, “What do latent action models actually learn?” 2025. [Online]. Available: https://arxiv.org/abs/2506.15691

arXiv 2025
[12]

Univla: Learning to act anywhere with task-centric latent actions,

Q. Bu, Y . Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, and H. Li, “Univla: Learning to act anywhere with task-centric latent actions,” arXiv preprint arXiv:2505.06111, 2025

Pith/arXiv arXiv 2025
[13]

Openvla: An open-source vision-language-action model,

M. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn, “Openvla: An open-source vision-language-action model,” arXiv preprint arXiv:2406.09246, 2024

Pith/arXiv arXiv 2024
[14]

Gr00t n1: An open foundation model for generalist humanoid robots,

NVIDIA, :, J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. J. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y . L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y . Xie, Y . Xu, Z. Xu, S. Ye, Z...

Pith/arXiv arXiv 2025
[15]

Diffusion policy: Visuomotor policy learning via action diffusion,

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,”The International Journal of Robotics Research, vol. 44, no. 10-11, pp. 1684–1704, 2025

2025
[16]

Flow matching for generative modeling,

Y . Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” 2023. [Online]. Available: https://arxiv.org/abs/2210.02747

Pith/arXiv arXiv 2023
[17]

Villa-x: enhancing latent action modeling in vision-language-action models,

X. Chen, H. Wei, P. Zhang, C. Zhang, K. Wang, Y . Guo, R. Yang, Y . Wang, X. Xiao, L. Zhaoet al., “Villa-x: enhancing latent action modeling in vision-language-action models,”arXiv preprint arXiv:2507.23682, 2025

Pith/arXiv arXiv 2025
[18]

Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots,

C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song, “Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots,” inProceedings of Robotics: Science and Systems (RSS), 2024

2024
[19]

Segment anything,

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, P. Doll ´ar, and R. Gir- shick, “Segment anything,”arXiv:2304.02643, 2023

Pith/arXiv arXiv 2023
[20]

Prismatic vlms: Investigating the design space of visually- conditioned language models,

S. Karamcheti, S. Nair, A. Balakrishna, P. Liang, T. Kollar, and D. Sadigh, “Prismatic vlms: Investigating the design space of visually- conditioned language models,” inForty-first International Conference on Machine Learning, 2024

2024
[21]

Spatialvla: Exploring spatial representations for visual-language-action model,

D. Qu, H. Song, Q. Chen, Y . Yao, X. Ye, Y . Ding, Z. Wang, J. Gu, B. Zhao, D. Wanget al., “Spatialvla: Exploring spatial representations for visual-language-action model,”arXiv preprint arXiv:2501.15830, 2025

Pith/arXiv arXiv 2025
[22]

Fast: Efficient action tokenization for vision-language-action models,

K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine, “Fast: Efficient action tokenization for vision-language-action models,”arXiv preprint arXiv:2501.09747, 2025

Pith/arXiv arXiv 2025
[23]

Learning fine-grained bimanual manipulation with low-cost hardware,

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hardware,”arXiv preprint arXiv:2304.13705, 2023

Pith/arXiv arXiv 2023
[24]

Bridgedata v2: A dataset for robot learning at scale,

H. Walke, K. Black, A. Lee, M. J. Kim, M. Du, C. Zheng, T. Zhao, P. Hansen-Estruch, Q. Vuong, A. He, V . Myers, K. Fang, C. Finn, and S. Levine, “Bridgedata v2: A dataset for robot learning at scale,” in Conference on Robot Learning (CoRL), 2023

2023
[25]

Libero: Benchmarking knowledge transfer for lifelong robot learn- ing,

B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone, “Libero: Benchmarking knowledge transfer for lifelong robot learn- ing,”Advances in Neural Information Processing Systems, vol. 36, pp. 44 776–44 791, 2023

2023
[26]

Roboengine: Plug-and-play robot data augmentation with semantic robot segmen- tation and background generation,

C. Yuan, S. Joshi, S. Zhu, H. Su, H. Zhao, and Y . Gao, “Roboengine: Plug-and-play robot data augmentation with semantic robot segmen- tation and background generation,”arXiv preprint arXiv:2503.18738, 2025

arXiv 2025
[27]

Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation,

T. Chen, Z. Chen, B. Chen, Z. Cai, Y . Liu, Z. Li, Q. Liang, X. Lin, Y . Ge, Z. Guet al., “Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation,”arXiv preprint arXiv:2506.18088, 2025

Pith/arXiv arXiv 2025
[28]

Egodex: Learning dexterous manipulation from large-scale egocentric video,

R. Hoque, P. Huang, D. J. Yoon, M. Sivapurapu, and J. Zhang, “Egodex: Learning dexterous manipulation from large-scale egocentric video,” 2025. [Online]. Available: https://arxiv.org/abs/2505.11709

Pith/arXiv arXiv 2025
[29]

Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation,

M. Heo, Y . Lee, D. Lee, and J. J. Lim, “Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation,” in Robotics: Science and Systems, 2023

2023
[30]

Similarity of neural network representations revisited,

S. Kornblith, M. Norouzi, H. Lee, and G. Hinton, “Similarity of neural network representations revisited,” inInternational conference on machine learning. PMlR, 2019, pp. 3519–3529

2019

[1] [1]

Rt-2: Vision-language-action models transfer web knowledge to robotic control,

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahidet al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” inConference on Robot Learning. PMLR, 2023, pp. 2165–2183

2023

[2] [2]

pi 0: A vision- language-action flow model for general robot control,

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichteret al., “pi 0: A vision- language-action flow model for general robot control,”arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024

[3] [3]

Rdt-1b: a diffusion foundation model for bimanual manipulation,

S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu, “Rdt-1b: a diffusion foundation model for bimanual manipulation,”arXiv preprint arXiv:2410.07864, 2024

Pith/arXiv arXiv 2024

[4] [4]

Open x-embodiment: Robotic learning datasets and rt-x models: Open x- embodiment collaboration 0,

A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jainet al., “Open x-embodiment: Robotic learning datasets and rt-x models: Open x- embodiment collaboration 0,” in2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 6892–6903

2024

[5] [5]

Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems,

Q. Bu, J. Cai, L. Chen, X. Cui, Y . Ding, S. Feng, S. Gao, X. He, X. Hu, X. Huanget al., “Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems,”arXiv preprint arXiv:2503.06669, 2025

Pith/arXiv arXiv 2025

[6] [6]

Egomimic: Scaling imitation learning via egocentric video,

S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu, “Egomimic: Scaling imitation learning via egocentric video,” in2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 13 226–13 233

2025

[7] [7]

Motiontrans: Human vr data enable motion-level learning for robotic manipulation policies,

C. Yuan, R. Zhou, M. Liu, Y . Hu, S. Wang, L. Yi, C. Wen, S. Zhang, and Y . Gao, “Motiontrans: Human vr data enable motion-level learning for robotic manipulation policies,”arXiv preprint arXiv:2509.17759, 2025

arXiv 2025

[8] [8]

H-rdt: Human manipulation enhanced bimanual robotic manipulation,

H. Bi, L. Wu, T. Lin, H. Tan, Z. Su, H. Su, and J. Zhu, “H-rdt: Human manipulation enhanced bimanual robotic manipulation,”arXiv preprint arXiv:2507.23523, 2025

arXiv 2025

[9] [9]

Latent action pretraining from videos,

S. Ye, J. Jang, B. Jeon, S. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y .-W. Chao, B. Y . Linet al., “Latent action pretraining from videos,” arXiv preprint arXiv:2410.11758, 2024

Pith/arXiv arXiv 2024

[10] [10]

Igor: Image-goal representations are the atomic con- trol units for foundation models in embodied ai,

X. Chen, J. Guo, T. He, C. Zhang, P. Zhang, D. C. Yang, L. Zhao, and J. Bian, “Igor: Image-goal representations are the atomic con- trol units for foundation models in embodied ai,”arXiv preprint arXiv:2411.00785, 2024

arXiv 2024

[11] [11]

What do latent action models actually learn?

C. Zhang, T. Pearce, P. Zhang, K. Wang, X. Chen, W. Shen, L. Zhao, and J. Bian, “What do latent action models actually learn?” 2025. [Online]. Available: https://arxiv.org/abs/2506.15691

arXiv 2025

[12] [12]

Univla: Learning to act anywhere with task-centric latent actions,

Q. Bu, Y . Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, and H. Li, “Univla: Learning to act anywhere with task-centric latent actions,” arXiv preprint arXiv:2505.06111, 2025

Pith/arXiv arXiv 2025

[13] [13]

Openvla: An open-source vision-language-action model,

M. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn, “Openvla: An open-source vision-language-action model,” arXiv preprint arXiv:2406.09246, 2024

Pith/arXiv arXiv 2024

[14] [14]

Gr00t n1: An open foundation model for generalist humanoid robots,

NVIDIA, :, J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. J. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y . L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y . Xie, Y . Xu, Z. Xu, S. Ye, Z...

Pith/arXiv arXiv 2025

[15] [15]

Diffusion policy: Visuomotor policy learning via action diffusion,

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,”The International Journal of Robotics Research, vol. 44, no. 10-11, pp. 1684–1704, 2025

2025

[16] [16]

Flow matching for generative modeling,

Y . Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” 2023. [Online]. Available: https://arxiv.org/abs/2210.02747

Pith/arXiv arXiv 2023

[17] [17]

Villa-x: enhancing latent action modeling in vision-language-action models,

X. Chen, H. Wei, P. Zhang, C. Zhang, K. Wang, Y . Guo, R. Yang, Y . Wang, X. Xiao, L. Zhaoet al., “Villa-x: enhancing latent action modeling in vision-language-action models,”arXiv preprint arXiv:2507.23682, 2025

Pith/arXiv arXiv 2025

[18] [18]

Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots,

C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song, “Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots,” inProceedings of Robotics: Science and Systems (RSS), 2024

2024

[19] [19]

Segment anything,

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, P. Doll ´ar, and R. Gir- shick, “Segment anything,”arXiv:2304.02643, 2023

Pith/arXiv arXiv 2023

[20] [20]

Prismatic vlms: Investigating the design space of visually- conditioned language models,

S. Karamcheti, S. Nair, A. Balakrishna, P. Liang, T. Kollar, and D. Sadigh, “Prismatic vlms: Investigating the design space of visually- conditioned language models,” inForty-first International Conference on Machine Learning, 2024

2024

[21] [21]

Spatialvla: Exploring spatial representations for visual-language-action model,

D. Qu, H. Song, Q. Chen, Y . Yao, X. Ye, Y . Ding, Z. Wang, J. Gu, B. Zhao, D. Wanget al., “Spatialvla: Exploring spatial representations for visual-language-action model,”arXiv preprint arXiv:2501.15830, 2025

Pith/arXiv arXiv 2025

[22] [22]

Fast: Efficient action tokenization for vision-language-action models,

K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine, “Fast: Efficient action tokenization for vision-language-action models,”arXiv preprint arXiv:2501.09747, 2025

Pith/arXiv arXiv 2025

[23] [23]

Learning fine-grained bimanual manipulation with low-cost hardware,

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hardware,”arXiv preprint arXiv:2304.13705, 2023

Pith/arXiv arXiv 2023

[24] [24]

Bridgedata v2: A dataset for robot learning at scale,

H. Walke, K. Black, A. Lee, M. J. Kim, M. Du, C. Zheng, T. Zhao, P. Hansen-Estruch, Q. Vuong, A. He, V . Myers, K. Fang, C. Finn, and S. Levine, “Bridgedata v2: A dataset for robot learning at scale,” in Conference on Robot Learning (CoRL), 2023

2023

[25] [25]

Libero: Benchmarking knowledge transfer for lifelong robot learn- ing,

B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone, “Libero: Benchmarking knowledge transfer for lifelong robot learn- ing,”Advances in Neural Information Processing Systems, vol. 36, pp. 44 776–44 791, 2023

2023

[26] [26]

Roboengine: Plug-and-play robot data augmentation with semantic robot segmen- tation and background generation,

C. Yuan, S. Joshi, S. Zhu, H. Su, H. Zhao, and Y . Gao, “Roboengine: Plug-and-play robot data augmentation with semantic robot segmen- tation and background generation,”arXiv preprint arXiv:2503.18738, 2025

arXiv 2025

[27] [27]

Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation,

T. Chen, Z. Chen, B. Chen, Z. Cai, Y . Liu, Z. Li, Q. Liang, X. Lin, Y . Ge, Z. Guet al., “Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation,”arXiv preprint arXiv:2506.18088, 2025

Pith/arXiv arXiv 2025

[28] [28]

Egodex: Learning dexterous manipulation from large-scale egocentric video,

R. Hoque, P. Huang, D. J. Yoon, M. Sivapurapu, and J. Zhang, “Egodex: Learning dexterous manipulation from large-scale egocentric video,” 2025. [Online]. Available: https://arxiv.org/abs/2505.11709

Pith/arXiv arXiv 2025

[29] [29]

Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation,

M. Heo, Y . Lee, D. Lee, and J. J. Lim, “Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation,” in Robotics: Science and Systems, 2023

2023

[30] [30]

Similarity of neural network representations revisited,

S. Kornblith, M. Norouzi, H. Lee, and G. Hinton, “Similarity of neural network representations revisited,” inInternational conference on machine learning. PMlR, 2019, pp. 3519–3529

2019