Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos

Jian Wang; Jincheng Yu; Runze Xu; Yiluo Zhang; Yu Wang

arxiv: 2606.18955 · v2 · pith:LO3POELOnew · submitted 2026-06-17 · 💻 cs.CV · cs.RO

Runze Xu, Yiluo Zhang, Jian Wang, Yu Wang, Jincheng Yu

Pith reviewed 2026-07-03 23:42 UTC · model grok-4.3

keywords vision-language-action · latent action · cross-embodiment · egocentric videos · VQ-VAE · unlabeled pre-training · robot adaptation

The pith

Latent actions from unlabeled human videos let VLA models adapt to robots using only 50 trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that general action priors can be extracted from abundant unlabeled human egocentric videos to train vision-language-action models that transfer across robot embodiments. It introduces a Hybrid Disentangled VQ-VAE that uses physical masks to separate motion dynamics from backgrounds and builds a shared action codebook, which then supervises pre-training of a VLM backbone on action intent. During adaptation to a specific robot, an intent-perception decoupling strategy has the VLM predict intent while a frozen visual encoder supplies embodiment-specific features, which is meant to reduce action hallucinations. Experiments in simulation and real-world settings show the resulting models perform competitively with state-of-the-art systems trained on large annotated robot datasets while needing only 50 trajectories for downstream adaptation. A sympathetic reader would care because this points to a path for using vast existing human video collections instead of costly robot data gathering.

Core claim

By pre-training exclusively on unlabeled human videos with a cross-embodiment action codebook derived from a Hybrid Disentangled VQ-VAE that decouples motion from backgrounds via physical masks, followed by intent-perception decoupling at adaptation time, the method produces VLA models that perform competitively with state-of-the-art models trained on massive annotated datasets while requiring only 50 trajectories for downstream adaptation.

What carries the argument

Hybrid Disentangled VQ-VAE that decouples motion dynamics from environmental backgrounds through physical masks to construct a cross-embodiment action codebook, together with the intent-perception decoupling strategy used at adaptation.
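The codebook idea can be pictured as two independent nearest-neighbor lookups, one for the action latent and one for the background latent. The sketch below is a toy illustration under stated assumptions, not the authors' implementation: the codebook sizes and values, the example latents, and the helper name `nearest_code` are all hypothetical.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))

# Hypothetical toy codebooks; sizes and values are illustrative only.
action_codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
background_codebook = [[2.0, 2.0], [-2.0, -2.0]]

# Hypothetical encoder outputs for two clips: (action latent, background latent).
human_clip = ([0.9, 0.1], [1.8, 2.1])
robot_clip = ([1.1, -0.2], [-2.2, -1.7])

for name, (z_action, z_background) in [("human", human_clip), ("robot", robot_clip)]:
    action_token = nearest_code(z_action, action_codebook)
    background_token = nearest_code(z_background, background_codebook)
    print(name, "action token:", action_token, "background token:", background_token)
```

In this toy setup the two clips share an action token while their background tokens differ, which is the behavior a cross-embodiment codebook is supposed to exhibit.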

If this is right

Pre-training on human videos transfers general action priors to new robot embodiments without using any robot data in the pre-training stage.
Downstream adaptation succeeds with only 50 trajectories while matching models trained on far larger annotated sets.
Intent-perception decoupling reduces action hallucinations during embodiment-specific use.
The approach works in both simulation and real-world environments.
Abundant unlabeled egocentric human videos become usable for VLA model training.
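The intent-perception decoupling mentioned in the list above can be read as a data-flow constraint: intent comes from the VLM, state features come from a separate frozen encoder, and only the action expert sees both. The following is a minimal sketch of that flow, assuming a toy linear action head. Every function, the `INTENT_CODEBOOK` table, and the weight values are hypothetical stand-ins, and the paper's actual architecture is not described at this level of detail in the text here.

```python
# Hypothetical latent-action codebook: token id -> intent embedding.
INTENT_CODEBOOK = {0: [1.0, 0.0], 1: [0.0, 1.0]}

def vlm_predict_intent(image, instruction):
    """Stand-in for the VLM: returns a latent action token (toy rule, ignores the image)."""
    return 0 if "pick" in instruction else 1

def frozen_visual_encoder(image):
    """Stand-in for the frozen encoder: a fixed function with no trainable state."""
    mean = sum(image) / len(image)
    return [mean, max(image) - min(image)]

def action_expert(intent_embedding, state_features, weights):
    """Toy linear head over the concatenation [intent ; state]."""
    inputs = intent_embedding + state_features
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]

# Hypothetical parameters of the action expert after adaptation.
weights = [[1.0, 0.0, 0.5, 0.0],
           [0.0, 1.0, 0.0, 0.5]]

for image in ([0.2, 0.4, 0.6], [0.1, 0.9, 0.5]):
    token = vlm_predict_intent(image, "pick up the bottle")
    state = frozen_visual_encoder(image)
    print(token, action_expert(INTENT_CODEBOOK[token], state, weights))
```

The same intent token yields different low-level outputs for the two images because the state features differ, while the intent path itself carries no embodiment-specific detail.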

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This could lower the data-collection cost for generalist robot policies by shifting most learning to existing human video archives.
Similar motion-focused latent extraction might be tested on instructional or demonstration videos from other sources.
Relaxing the physical-mask requirement could broaden applicability if stronger disentanglement techniques emerge.
Latent action codes may prove useful for bridging human and robot action spaces in additional multimodal tasks.

Load-bearing premise

The Hybrid Disentangled VQ-VAE can effectively decouple motion dynamics from environmental backgrounds through physical masks to construct a cross-embodiment action codebook.

What would settle it

If a model pre-trained only on the human videos requires substantially more than 50 trajectories or fails to reach comparable performance to annotated-data baselines in the reported simulation and real-world tasks, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2606.18955 by Jian Wang, Jincheng Yu, Runze Xu, Yiluo Zhang, Yu Wang.

**Figure 1.** Method overview. We propose a human-video-driven framework for training vision–language–action models. A hybrid disentangled VQ-VAE first extracts transferable latent action codes from unlabeled human videos. These codes are then used as supervision to pre-train the VLM to infer action intentions from observations and instructions. Finally, with only a small number of robot trajectories, the VLM backbone a…

**Figure 2.** Hybrid Disentangled VQ-VAE. The VQ-VAE model decomposes short-term visual changes into discrete action and background latent spaces via a dual-path vector quantization bottleneck. A shared mask-guided decoder enforces semantic separation by reconstructing motion-related and background regions from corresponding latent codes, enabling the extraction of transferable action intentions from videos.
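A mask-guided reconstruction objective of the kind the Figure 2 caption describes could take the following form, where the mask decides, pixel by pixel, which latent path is held responsible for the reconstruction. This is one plausible reading under stated assumptions, not the paper's actual loss: the function name, the flat pixel lists, and the plain squared-error term are all hypothetical.

```python
def masked_reconstruction_loss(target, recon_from_action, recon_from_background, mask):
    """Mean squared error where the mask picks which reconstruction is scored per pixel.

    mask[i] == 1 marks a motion-related pixel (scored against the action path);
    mask[i] == 0 marks a background pixel (scored against the background path).
    """
    total = 0.0
    for t, ra, rb, m in zip(target, recon_from_action, recon_from_background, mask):
        recon = ra if m else rb
        total += (t - recon) ** 2
    return total / len(target)

# Toy 4-pixel frame; the 9.0 entries are never scored because the mask hides them.
target = [0.0, 1.0, 1.0, 0.0]
mask = [0, 1, 1, 0]
recon_from_action = [9.0, 1.0, 0.5, 9.0]
recon_from_background = [0.0, 9.0, 9.0, 0.5]

loss = masked_reconstruction_loss(target, recon_from_action, recon_from_background, mask)
print(loss)  # → 0.125
```

Because each path is only penalized inside its own region, the action codes have no incentive to encode background appearance, and the background codes have no incentive to encode motion.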

**Figure 3.** Real-world experiments. (a) The physical dual-arm robot platform used for evaluation. (b) Three real-world manipulation tasks, including placing a bottle on a plate, unplugging a power cord, and folding a towel. (c) Task success rates compared with UniVLA, showing improved transfer of action intentions from human videos to the real robot. Notably, the “Place Bottle” task shows a lower success rate. The bot…

**Figure 4.** Comparison of latent action alignment consistency. The proposed Motion-Focused latent action outperforms UniVLA in CKA metrics, indicating a more coherent cross-embodiment action space. C. Latent Action Evaluation: To explain the observed generalization performance at the representation level, we design an alignment analysis method based on domain subspace elimination. This approach quantitatively assesses…

**Figure 5.** Latent Action Visualization. Image pairs from different datasets with same latent codes. Despite different morphologies, robot arms and human hands are assigned the same action tokens. … values between the resulting feature matrices. The results (Fig. 4b) show that UniVLA exhibits lower consistency with a mean CKA of 0.8659, whereas our Motion-Focused Latent Action achieves a significantly higher alignment w…
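For readers unfamiliar with the CKA metric quoted in the figure captions, the sketch below computes linear CKA as defined by Kornblith et al. (2019): the squared Frobenius norm of the cross-covariance of two centered feature matrices, normalized by the norms of their self-covariances. It is a dependency-free illustration only. The paper's domain subspace elimination step is not reproduced, and the feature values and helper names are hypothetical.

```python
import math

def center_columns(matrix):
    """Subtract the per-column mean from every row."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]

def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for row-major matrices with equal row counts."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            total += sum(x * y for x, y in zip(col_a, col_b)) ** 2
    return total

def linear_cka(x, y):
    """Linear CKA between two feature matrices whose rows are paired examples."""
    x, y = center_columns(x), center_columns(y)
    denom = math.sqrt(cross_gram_sq_norm(x, x) * cross_gram_sq_norm(y, y))
    return cross_gram_sq_norm(x, y) / denom

# Hypothetical latent-action features for four paired human and robot clips.
human = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
robot = [[2.1, 3.9], [4.0, 2.2], [6.1, 9.8], [7.9, 6.1]]
print(round(linear_cka(human, robot), 4))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either feature set, so a value near 1 indicates that the two sets share the same geometry up to rotation and scale.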

Abstract

Training generalist Vision-Language-Action (VLA) models typically requires massive, diverse robotic datasets with high-fidelity action annotations. While egocentric human manipulation videos are abundant and capture significant environmental diversity, the absence of action labels makes them difficult to use in conventional training paradigms. To address this, we propose a latent-action-based framework designed to extract general action priors from unlabeled human videos. The architecture features a Hybrid Disentangled VQ-VAE that decouples motion dynamics from environmental backgrounds through physical masks, enabling the construction of a cross-embodiment action codebook. By pre-training on human videos with the codebook, the VLM backbone learns deep representations of action intent. For adaptation to specific embodiments, we introduce an intent-perception decoupling strategy where the VLM predicts the action intent while a separate frozen visual encoder provides state-specific features to the action expert, thereby reducing action hallucinations. Results in simulation and real-world environments show that our method, pre-trained exclusively on unlabeled human videos, performs competitively with state-of-the-art VLA models trained on massive annotated datasets, requiring only 50 trajectories for downstream adaptation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces a Hybrid Disentangled VQ-VAE to pull latent actions from unlabeled human ego-videos for VLA pretraining, but the abstract supplies no metrics or baselines so the competitive claim with 50 trajectories cannot be assessed.

The core idea here is to pretrain VLAs on abundant human videos by learning a cross-embodiment action codebook through a Hybrid Disentangled VQ-VAE that uses physical masks to separate motion from background, then applies an intent-perception split at adaptation time. That combination is the main new piece; it extends earlier latent-action work to the human-video setting with a concrete architecture choice.

The approach addresses a genuine bottleneck in robot data collection, and the engineering decisions around decoupling look reasonable on paper. If the downstream results hold, it would let people leverage existing video datasets instead of waiting for more annotated robot trajectories.

The main weakness is the complete lack of evidence in the abstract. It asserts competitive performance against SOTA models trained on large annotated sets, yet gives no numbers, no baselines, no ablations, and no description of the simulation or real-world setups. Without those, the claim that the VQ-VAE actually produces a transferable codebook remains untested in the text we have. The intent-perception decoupling is presented as the fix for hallucinations, but again there is nothing to show it works as described.

This is aimed at groups working on scalable VLA training and data-efficient robot learning. A reader already thinking about latent actions or human-video pretraining could pick up the architecture details for their own experiments.

I would bring the full paper to a reading group to see the numbers and ablations. I would not cite it yet. It deserves peer review because the problem matters and the proposed pipeline is specific enough to evaluate once the experiments are shown.

Referee Report

2 major / 1 minor

Summary. The paper proposes a latent-action framework for Vision-Language-Action (VLA) models that pre-trains exclusively on unlabeled human egocentric videos. A Hybrid Disentangled VQ-VAE decouples motion dynamics from backgrounds using physical masks to build a cross-embodiment action codebook; the VLM backbone learns action-intent representations from this codebook. An intent-perception decoupling strategy (VLM predicts intent while a frozen visual encoder supplies embodiment-specific features) is introduced for downstream adaptation. The central empirical claim is that this pipeline, after pre-training on human videos, achieves competitive performance with SOTA VLA models (trained on large annotated robotic datasets) using only 50 trajectories for adaptation in both simulation and real-world environments.

Significance. If the reported results hold, the work would be significant because it demonstrates a pathway to leverage abundant unlabeled human video data for robotic policy learning, substantially lowering the data-collection barrier for generalist VLA models. The approach directly targets the scarcity of high-quality annotated robot trajectories by transferring general action priors across embodiments.

major comments (2)

[Abstract] Abstract: the assertion that the method 'performs competitively with state-of-the-art VLA models trained on massive annotated datasets' using only 50 trajectories is presented without any quantitative metrics, baselines, error bars, or experimental details. This claim is load-bearing for the paper's central contribution yet cannot be evaluated from the supplied text.
[Methods (Hybrid Disentangled VQ-VAE)] Hybrid Disentangled VQ-VAE description: the claim that physical masks enable effective decoupling of motion dynamics from environmental backgrounds to produce a transferable cross-embodiment action codebook is central to the pre-training pipeline, but no ablation studies, reconstruction metrics, or codebook analysis are referenced to substantiate that the disentanglement succeeds.

minor comments (1)

[Abstract] Abstract: the acronym 'VLM' is used without its expansion ('vision-language model') being given; 'VLA' is expanded at first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and indicate the corresponding revisions.

Referee: [Abstract] Abstract: the assertion that the method 'performs competitively with state-of-the-art VLA models trained on massive annotated datasets' using only 50 trajectories is presented without any quantitative metrics, baselines, error bars, or experimental details. This claim is load-bearing for the paper's central contribution yet cannot be evaluated from the supplied text.

Authors: We agree that the abstract would benefit from explicit quantitative support for this central claim. The detailed results, including success rates, baselines, and error bars from both simulation and real-world experiments with 50 trajectories, appear in the Experiments section. We have revised the abstract to include key quantitative metrics and baseline comparisons to make the claim self-contained. revision: yes
Referee: [Methods (Hybrid Disentangled VQ-VAE)] Hybrid Disentangled VQ-VAE description: the claim that physical masks enable effective decoupling of motion dynamics from environmental backgrounds to produce a transferable cross-embodiment action codebook is central to the pre-training pipeline, but no ablation studies, reconstruction metrics, or codebook analysis are referenced to substantiate that the disentanglement succeeds.

Authors: We acknowledge that the manuscript would be strengthened by direct empirical validation of the physical masks' contribution to disentanglement. The current text emphasizes the architectural motivation and end-to-end results. We have added ablation studies (with/without masks), reconstruction metrics, and codebook analysis to the Methods and Experiments sections of the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity

The manuscript describes an empirical ML pipeline: a Hybrid Disentangled VQ-VAE is trained on unlabeled ego-videos to build an action codebook, followed by VLM pre-training and a downstream adaptation stage that uses 50 trajectories. No first-principles derivation, uniqueness theorem, or parameter-free prediction is asserted; all performance claims rest on reported simulation and real-robot metrics rather than any reduction of outputs to fitted inputs or self-citations. The architecture choices (physical masks, intent-perception split) are presented as engineering decisions whose value is measured externally, leaving the central claim self-contained against the provided experimental results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Abstract-only review provides no explicit free parameters, axioms, or independent evidence for invented components; the VQ-VAE and decoupling strategy are introduced without supporting details.

invented entities (2)

Hybrid Disentangled VQ-VAE (no independent evidence)
purpose: Decouple motion dynamics from environmental backgrounds via physical masks
Core new component for building the cross-embodiment action codebook from human videos.

intent-perception decoupling strategy (no independent evidence)
purpose: Separate action intent prediction from embodiment-specific state features to reduce hallucinations
Proposed adaptation mechanism for specific robot embodiments.

pith-pipeline@v0.9.1-grok · 5739 in / 1330 out tokens · 44303 ms · 2026-07-03T23:42:30.832102+00:00 · methodology

Reference graph

Works this paper leans on

30 extracted references · 20 canonical work pages · 15 internal anchors

[1] B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid et al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” in Conference on Robot Learning. PMLR, 2023, pp. 2165–2183.

[2] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter et al., “$\pi_0$: A vision-language-action flow model for general robot control,” arXiv preprint arXiv:2410.24164, 2024.

[3] S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu, “Rdt-1b: a diffusion foundation model for bimanual manipulation,” arXiv preprint arXiv:2410.07864, 2024.

[4] A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain et al., “Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 6892–6903.

[5] Q. Bu, J. Cai, L. Chen, X. Cui, Y. Ding, S. Feng, S. Gao, X. He, X. Hu, X. Huang et al., “Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems,” arXiv preprint arXiv:2503.06669, 2025.

[6] S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu, “Egomimic: Scaling imitation learning via egocentric video,” in 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 13226–13233.

[7] C. Yuan, R. Zhou, M. Liu, Y. Hu, S. Wang, L. Yi, C. Wen, S. Zhang, and Y. Gao, “Motiontrans: Human vr data enable motion-level learning for robotic manipulation policies,” arXiv preprint arXiv:2509.17759, 2025.

[8] H. Bi, L. Wu, T. Lin, H. Tan, Z. Su, H. Su, and J. Zhu, “H-rdt: Human manipulation enhanced bimanual robotic manipulation,” arXiv preprint arXiv:2507.23523, 2025.

[9] S. Ye, J. Jang, B. Jeon, S. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y.-W. Chao, B. Y. Lin et al., “Latent action pretraining from videos,” arXiv preprint arXiv:2410.11758, 2024.

[10] X. Chen, J. Guo, T. He, C. Zhang, P. Zhang, D. C. Yang, L. Zhao, and J. Bian, “Igor: Image-goal representations are the atomic control units for foundation models in embodied ai,” arXiv preprint arXiv:2411.00785, 2024.

[11] C. Zhang, T. Pearce, P. Zhang, K. Wang, X. Chen, W. Shen, L. Zhao, and J. Bian, “What do latent action models actually learn?” 2025. [Online]. Available: https://arxiv.org/abs/2506.15691

[12] Q. Bu, Y. Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, and H. Li, “Univla: Learning to act anywhere with task-centric latent actions,” arXiv preprint arXiv:2505.06111, 2025.

[13] M. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn, “Openvla: An open-source vision-language-action model,” arXiv preprint arXiv:2406.09246, 2024.

[14] NVIDIA, J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. J. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y. L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y. Xie, Y. Xu, Z. Xu, S. Ye, Z... “GR00T N1: An Open Foundation Model for Generalist Humanoid Robots,” 2025.

[15] C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” The International Journal of Robotics Research, vol. 44, no. 10-11, pp. 1684–1704, 2025.

[16] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” 2023. [Online]. Available: https://arxiv.org/abs/2210.02747

[17] X. Chen, H. Wei, P. Zhang, C. Zhang, K. Wang, Y. Guo, R. Yang, Y. Wang, X. Xiao, L. Zhao et al., “Villa-x: enhancing latent action modeling in vision-language-action models,” arXiv preprint arXiv:2507.23682, 2025.

[18] C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song, “Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots,” in Proceedings of Robotics: Science and Systems (RSS), 2024.

[19] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick, “Segment anything,” arXiv:2304.02643, 2023.

[20] S. Karamcheti, S. Nair, A. Balakrishna, P. Liang, T. Kollar, and D. Sadigh, “Prismatic vlms: Investigating the design space of visually-conditioned language models,” in Forty-first International Conference on Machine Learning, 2024.

[21] D. Qu, H. Song, Q. Chen, Y. Yao, X. Ye, Y. Ding, Z. Wang, J. Gu, B. Zhao, D. Wang et al., “Spatialvla: Exploring spatial representations for visual-language-action model,” arXiv preprint arXiv:2501.15830, 2025.

[22] K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine, “Fast: Efficient action tokenization for vision-language-action models,” arXiv preprint arXiv:2501.09747, 2025.

[23] T. Z. Zhao, V. Kumar, S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hardware,” arXiv preprint arXiv:2304.13705, 2023.

[24] H. Walke, K. Black, A. Lee, M. J. Kim, M. Du, C. Zheng, T. Zhao, P. Hansen-Estruch, Q. Vuong, A. He, V. Myers, K. Fang, C. Finn, and S. Levine, “Bridgedata v2: A dataset for robot learning at scale,” in Conference on Robot Learning (CoRL), 2023.

[25] B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone, “Libero: Benchmarking knowledge transfer for lifelong robot learning,” Advances in Neural Information Processing Systems, vol. 36, pp. 44776–44791, 2023.

[26] C. Yuan, S. Joshi, S. Zhu, H. Su, H. Zhao, and Y. Gao, “Roboengine: Plug-and-play robot data augmentation with semantic robot segmentation and background generation,” arXiv preprint arXiv:2503.18738, 2025.

[27] T. Chen, Z. Chen, B. Chen, Z. Cai, Y. Liu, Z. Li, Q. Liang, X. Lin, Y. Ge, Z. Gu et al., “Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation,” arXiv preprint arXiv:2506.18088, 2025.

[28] R. Hoque, P. Huang, D. J. Yoon, M. Sivapurapu, and J. Zhang, “Egodex: Learning dexterous manipulation from large-scale egocentric video,” 2025. [Online]. Available: https://arxiv.org/abs/2505.11709

[29] M. Heo, Y. Lee, D. Lee, and J. J. Lim, “Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation,” in Robotics: Science and Systems, 2023.

[30] S. Kornblith, M. Norouzi, H. Lee, and G. Hinton, “Similarity of neural network representations revisited,” in International conference on machine learning. PMLR, 2019, pp. 3519–3529.

[1] [1]

Rt-2: Vision-language-action models transfer web knowledge to robotic control,

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahidet al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” inConference on Robot Learning. PMLR, 2023, pp. 2165–2183

2023

[2] [2]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichteret al., “pi 0: A vision- language-action flow model for general robot control,”arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu, “Rdt-1b: a diffusion foundation model for bimanual manipulation,”arXiv preprint arXiv:2410.07864, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Open x-embodiment: Robotic learning datasets and rt-x models: Open x- embodiment collaboration 0,

A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jainet al., “Open x-embodiment: Robotic learning datasets and rt-x models: Open x- embodiment collaboration 0,” in2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 6892–6903

2024

[5] [5]

AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

Q. Bu, J. Cai, L. Chen, X. Cui, Y . Ding, S. Feng, S. Gao, X. He, X. Hu, X. Huanget al., “Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems,”arXiv preprint arXiv:2503.06669, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

Egomimic: Scaling imitation learning via egocentric video,

S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu, “Egomimic: Scaling imitation learning via egocentric video,” in2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 13 226–13 233

2025

[7] [7]

Motiontrans: Human vr data enable motion-level learning for robotic manipulation policies,

C. Yuan, R. Zhou, M. Liu, Y . Hu, S. Wang, L. Yi, C. Wen, S. Zhang, and Y . Gao, “Motiontrans: Human vr data enable motion-level learning for robotic manipulation policies,”arXiv preprint arXiv:2509.17759, 2025

work page arXiv 2025

[8] [8]

H-rdt: Human manipulation enhanced bimanual robotic manipulation,

H. Bi, L. Wu, T. Lin, H. Tan, Z. Su, H. Su, and J. Zhu, “H-rdt: Human manipulation enhanced bimanual robotic manipulation,”arXiv preprint arXiv:2507.23523, 2025

work page arXiv 2025

[9] [9]

Latent Action Pretraining from Videos

S. Ye, J. Jang, B. Jeon, S. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y .-W. Chao, B. Y . Linet al., “Latent action pretraining from videos,” arXiv preprint arXiv:2410.11758, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

Igor: Image-goal representations are the atomic con- trol units for foundation models in embodied ai,

X. Chen, J. Guo, T. He, C. Zhang, P. Zhang, D. C. Yang, L. Zhao, and J. Bian, “Igor: Image-goal representations are the atomic con- trol units for foundation models in embodied ai,”arXiv preprint arXiv:2411.00785, 2024

work page arXiv 2024

[11] [11]

What do latent action models actually learn?

C. Zhang, T. Pearce, P. Zhang, K. Wang, X. Chen, W. Shen, L. Zhao, and J. Bian, “What do latent action models actually learn?” 2025. [Online]. Available: https://arxiv.org/abs/2506.15691

work page arXiv 2025

[12] [12]

UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

Q. Bu, Y . Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, and H. Li, “Univla: Learning to act anywhere with task-centric latent actions,” arXiv preprint arXiv:2505.06111, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

OpenVLA: An Open-Source Vision-Language-Action Model

M. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn, “Openvla: An open-source vision-language-action model,” arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

NVIDIA, :, J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. J. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y . L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y . Xie, Y . Xu, Z. Xu, S. Ye, Z...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

Diffusion policy: Visuomotor policy learning via action diffusion,

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,”The International Journal of Robotics Research, vol. 44, no. 10-11, pp. 1684–1704, 2025

2025

[16] [16]

Flow Matching for Generative Modeling

Y . Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” 2023. [Online]. Available: https://arxiv.org/abs/2210.02747

work page internal anchor Pith review Pith/arXiv arXiv 2023

[17] [17]

villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

X. Chen, H. Wei, P. Zhang, C. Zhang, K. Wang, Y . Guo, R. Yang, Y . Wang, X. Xiao, L. Zhaoet al., “Villa-x: enhancing latent action modeling in vision-language-action models,”arXiv preprint arXiv:2507.23682, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[18] [18]

Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots,

C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song, “Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots,” in Proceedings of Robotics: Science and Systems (RSS), 2024

[19]

Segment Anything

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick, “Segment anything,” arXiv:2304.02643, 2023

[20]

Prismatic vlms: Investigating the design space of visually-conditioned language models

S. Karamcheti, S. Nair, A. Balakrishna, P. Liang, T. Kollar, and D. Sadigh, “Prismatic vlms: Investigating the design space of visually-conditioned language models,” in Forty-first International Conference on Machine Learning, 2024

[21]

SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

D. Qu, H. Song, Q. Chen, Y. Yao, X. Ye, Y. Ding, Z. Wang, J. Gu, B. Zhao, D. Wang et al., “Spatialvla: Exploring spatial representations for visual-language-action model,” arXiv preprint arXiv:2501.15830, 2025

[22]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine, “Fast: Efficient action tokenization for vision-language-action models,” arXiv preprint arXiv:2501.09747, 2025

[23]

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

T. Z. Zhao, V. Kumar, S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hardware,” arXiv preprint arXiv:2304.13705, 2023

[24]

Bridgedata v2: A dataset for robot learning at scale

H. Walke, K. Black, A. Lee, M. J. Kim, M. Du, C. Zheng, T. Zhao, P. Hansen-Estruch, Q. Vuong, A. He, V. Myers, K. Fang, C. Finn, and S. Levine, “Bridgedata v2: A dataset for robot learning at scale,” in Conference on Robot Learning (CoRL), 2023

[25]

Libero: Benchmarking knowledge transfer for lifelong robot learning

B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone, “Libero: Benchmarking knowledge transfer for lifelong robot learning,” Advances in Neural Information Processing Systems, vol. 36, pp. 44776–44791, 2023

[26]

Roboengine: Plug-and-play robot data augmentation with semantic robot segmentation and background generation

C. Yuan, S. Joshi, S. Zhu, H. Su, H. Zhao, and Y. Gao, “Roboengine: Plug-and-play robot data augmentation with semantic robot segmentation and background generation,” arXiv preprint arXiv:2503.18738, 2025

[27]

RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

T. Chen, Z. Chen, B. Chen, Z. Cai, Y. Liu, Z. Li, Q. Liang, X. Lin, Y. Ge, Z. Gu et al., “Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation,” arXiv preprint arXiv:2506.18088, 2025

[28]

EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

R. Hoque, P. Huang, D. J. Yoon, M. Sivapurapu, and J. Zhang, “Egodex: Learning dexterous manipulation from large-scale egocentric video,” 2025. [Online]. Available: https://arxiv.org/abs/2505.11709

[29]

Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation

M. Heo, Y. Lee, D. Lee, and J. J. Lim, “Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation,” in Robotics: Science and Systems, 2023

[30]

Similarity of neural network representations revisited

S. Kornblith, M. Norouzi, H. Lee, and G. Hinton, “Similarity of neural network representations revisited,” in International Conference on Machine Learning. PMLR, 2019, pp. 3519–3529
