pith. sign in

arxiv: 2606.26095 · v1 · pith:TLQWGDZ2new · submitted 2026-06-24 · 💻 cs.RO · cs.AI· cs.CV

Learning Action Priors for Cross-embodiment Robot Manipulation

Pith reviewed 2026-06-25 19:07 UTC · model grok-4.3

classification 💻 cs.RO cs.AIcs.CV
keywords vision-language-actionaction priorscross-embodimentflow matchingrobot manipulationtemporal motion structuretwo-stage trainingpolicy learning
0
0 comments X

The pith

Pretraining an action module on unconditioned trajectories before VLA alignment equips policies with transferable motion structure that accelerates convergence and raises success rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard VLA models suffer because their action modules must simultaneously discover motion dynamics and cross-modal alignments from the start. By isolating the learning of temporal motion structure in a first stage that sees only action trajectories, the framework builds an explicit motion prior using a flow-matching encoder-decoder. This prior transfers into the second stage through decoder reuse and early latent distillation, allowing the vision-language backbone to align with an already-structured action space. Experiments across thirteen cross-embodiment tasks show the resulting policies converge faster, reach higher success rates, and handle data-scarce real-world settings more robustly than joint training from scratch. Scaling the action data used in the first stage further strengthens the downstream VLA performance.

Core claim

The central claim is that a two-stage framework equips the action module with cross-embodiment temporal motion structure before VLA training begins: stage one trains a lightweight flow-matching encoder-decoder solely on unconditioned action trajectories, and stage two reuses the decoder while distilling latents to align visual-language features with the pretrained action embedding space, yielding faster convergence, higher success rates, and stronger real-world results than VLA training without such priors.

What carries the argument

Two-stage training framework in which a flow-matching encoder-decoder first learns motion structure from action trajectories alone and then transfers it via decoder reuse and latent distillation into VLA policy optimization.

If this is right

  • VLA training converges faster once the action module already encodes temporal motion structure.
  • Success rates rise on both simulated and real-world cross-embodiment tasks.
  • Performance gains are largest on data-scarce real-world deployments.
  • Scaling the quantity of unconditioned action trajectories in stage one produces a more generalizable prior that lifts downstream VLA results.
  • The pretrained encoder supplies a compact history compressor that summarizes state-action sequences into one token at low cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of motion learning from semantic alignment may apply to other embodied sequence tasks where dynamics can be modeled independently of perception.
  • Larger collections of raw action data could yield increasingly universal priors usable across many robot morphologies without embodiment-specific fine-tuning.
  • The encoder-decoder could serve as a reusable motion backbone that multiple VLA models draw from rather than retraining from scratch each time.

Load-bearing premise

The motion structure learned from action trajectories alone can be aligned to visual-language features in the second stage without introducing misalignment that requires substantial extra adaptation.

What would settle it

A head-to-head comparison on the same thirteen tasks in which standard joint VLA training from scratch matches or exceeds the two-stage model's convergence speed and success rates would falsify the claimed benefit of the action priors.

Figures

Figures reproduced from arXiv: 2606.26095 by Dong Jing, Jiaqi Liu, Jinman Zhao, Li Erran Li, Mingyu Ding, Tianqi Zhang, Zelong Sun, Zhiwu Lu.

Figure 1
Figure 1. Figure 1: Illustration of motivation: a policy should first learn to move, and then learn to see and act. (a) In Stage 1, the [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Architecture of our proposed framework. In Stage 1, interleaved state-action sequences and dataset-specific soft-prompt [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the 13 cross-embodiment tasks across LIBERO, RoboCasa GR1, and real-world Franka manipulation. The [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison on two long-tail real-world tasks: (a) Stack Cups and (b) Grasp Coke. Rows show representative [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Training dynamics during the first 3,000 VLA training steps for three variants: No Action Prior, Action-State Prior, and [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Effect of the distillation horizon under a fixed 30k [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Average success rate versus total wall-clock training [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
read the original abstract

Most Vision-Language-Action (VLA) models build on a Vision-Language Model (VLM) backbone by attaching an action module and optimizing the full policy jointly. This design inherits strong visual and linguistic priors from the VLM, but leaves the action module to learn physical motion almost from scratch. As a result, the policy lacks an explicit motion prior, forcing early optimization to simultaneously discover temporal action dynamics and cross-modal alignment, a challenge further amplified in cross-embodiment settings. In this work, we propose to pretrain the action module with motion priors before cross-modal VLA alignment. Specifically, we introduce a two-stage training framework that equips the action module with cross-embodiment temporal motion structure before VLA training begins. In Stage~1, a lightweight flow-matching-based encoder-decoder action module efficiently learns temporal motion structure solely from unconditioned action trajectories, without processing visual or language tokens. In Stage~2, this learned prior is transferred to VLA training through decoder reuse and early-stage latent distillation, aligning visual-language features with the action embedding space while still allowing end-to-end policy refinement. In addition, the trained encoder serves as a compact history compressor, summarizing state-action histories into a single temporal context token for history-aware modeling at negligible cost. Extensive experiments across 13 diverse cross-embodiment tasks on both simulated and real-world platforms validate the effectiveness of our approach. Compared with VLA training without action priors, our model achieves faster convergence, higher success rates, and substantially stronger performance on data-scarce real-world tasks. Moreover, scaling up the action data in Stage~1 yields a more generalizable action prior that directly improves downstream VLA performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper introduces a two-stage training framework for Vision-Language-Action (VLA) models in cross-embodiment robot manipulation. In Stage 1, a lightweight flow-matching encoder-decoder learns temporal motion structure solely from unconditioned action trajectories. In Stage 2, the prior is transferred to VLA training via decoder reuse and early-stage latent distillation while allowing end-to-end refinement; the encoder is additionally reused as a compact history compressor. Experiments across 13 diverse simulated and real-world tasks are reported to show faster convergence, higher success rates, and stronger performance on data-scarce tasks relative to VLA training without action priors.

Significance. If the empirical gains are robust, the work supplies a practical pretraining recipe that injects explicit motion structure into VLA policies before cross-modal alignment, which is especially relevant for cross-embodiment and low-data regimes. The separation of motion-prior learning from visual-language alignment follows a standard pretrain-then-align pattern but is applied specifically to the action module using flow matching on raw trajectories.

minor comments (1)
  1. The abstract asserts performance gains across 13 tasks but supplies no quantitative results, baselines, error bars, or methodological details, preventing assessment of whether the data supports the stated claims.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their review and for the positive assessment of the significance of our two-stage framework for pretraining action priors via flow matching before VLA alignment. We note that the recommendation is listed as uncertain and that no specific major comments were enumerated in the report. We are prepared to provide further details or clarifications on any aspect of the work if requested.

Circularity Check

0 steps flagged

No significant circularity; empirical pretraining procedure validated by external task evaluations

full rationale

The paper describes an empirical two-stage training procedure for VLA models: Stage 1 pretrains a flow-matching encoder-decoder action module solely on unconditioned action trajectories to capture temporal motion structure, and Stage 2 transfers the decoder and uses latent distillation during VLA alignment. All performance claims (faster convergence, higher success rates on 13 cross-embodiment tasks) rest on reported experimental results rather than any derivation that reduces to its own inputs by construction. No self-definitional equations, fitted parameters renamed as predictions, load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear in the method. The approach is a standard pretrain-then-finetune pattern whose validity is settled externally by task success metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that independent pretraining of temporal motion structure transfers usefully to conditioned VLA settings; no free parameters or invented entities are specified in the abstract.

axioms (1)
  • domain assumption A flow-matching-based encoder-decoder can learn useful cross-embodiment temporal motion structure solely from unconditioned action trajectories
    Core premise of Stage 1 as stated in the abstract.

pith-pipeline@v0.9.1-grok · 5865 in / 1325 out tokens · 29065 ms · 2026-06-25T19:07:45.205800+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

78 extracted references · 44 linked inside Pith

  1. [1]

    Visual instruction tuning,

    H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,”Advances in Neural Information Processing Systems, 2023

  2. [2]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,

    Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Luet al., “Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,”CVPR, 2024

  3. [3]

    Qwen2. 5-vl technical report,

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tanget al., “Qwen2. 5-vl technical report,”arXiv preprint arXiv:2502.13923, 2025

  4. [4]

    Gpt-4 technical report,

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

  5. [5]

    Learning universal policies via text- guided video generation,

    Y . Du, M. Yang, B. Dai, H. Dai, O. Nachum, J. B. Tenenbaum, D. Schuurmans, and P. Abbeel, “Learning universal policies via text- guided video generation,”Advances in Neural Information Processing Systems, 2023

  6. [6]

    Zero-shot robotic manipulation with pretrained image-editing diffusion models,

    K. Black, M. Nakamoto, P. Atanasov, H. Walke, C. Finn, A. Kumar, and S. Levine, “Zero-shot robotic manipulation with pretrained image-editing diffusion models,”ICLR, 2024

  7. [7]

    Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation,

    C.-L. Cheang, G. Chen, Y . Jing, T. Kong, H. Li, Y . Li, Y . Liu, H. Wu, J. Xia, Y . Yanet al., “Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation,”arXiv preprint arXiv:2410.06158, 2024

  8. [8]

    Cosmos 3: Omnimodal world models for physical ai,

    N. Agarwal, A. Ali, J. Allen, M. Antolini, A. Aubame, A. Azzolini, J. Bai, M. Bala, Y . Balaji, J. Bapstet al., “Cosmos 3: Omnimodal world models for physical ai,”arXiv preprint arXiv:2606.02800, 2026

  9. [9]

    World action models are zero-shot policies,

    S. Ye, Y . Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y . L. Tan, C. Zhu, J. Xianget al., “World action models are zero-shot policies,” arXiv preprint arXiv:2602.15922, 2026

  10. [10]

    Rt-1: Robotics transformer for real-world control at scale,

    A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsuet al., “Rt-1: Robotics transformer for real-world control at scale,”arXiv preprint arXiv:2212.06817, 2022

  11. [11]

    Rt-2: Vision-language- action models transfer web knowledge to robotic control,

    A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finnet al., “Rt-2: Vision-language- action models transfer web knowledge to robotic control,”arXiv preprint arXiv:2307.15818, 2023

  12. [12]

    Openvla: An open-source vision-language-action model,

    M. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn, “Openvla: An open-source vision-language-action model,”arXiv preprint arXiv:2406.09246, 2024

  13. [13]

    Learning robust perceptive locomotion for quadrupedal robots in the wild,

    T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V . Koltun, and M. Hutter, “Learning robust perceptive locomotion for quadrupedal robots in the wild,”Science Robotics, vol. 7, no. 62, 2022

  14. [14]

    Anymal parkour: Learning agile navigation for quadrupedal robots,

    D. Hoeller, N. Rudin, D. Sako, and M. Hutter, “Anymal parkour: Learning agile navigation for quadrupedal robots,”Science Robotics, vol. 9, 2024

  15. [15]

    Humanplus: Humanoid shadowing and imitation from humans,

    Z. Fu, Q. Zhao, Q. Wu, G. Wetzstein, and C. Finn, “Humanplus: Humanoid shadowing and imitation from humans,”CoRL, 2024

  16. [16]

    Gr00t n1: An open foundation model for generalist humanoid robots,

    J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, S. Huanget al., “Gr00t n1: An open foundation model for generalist humanoid robots,”arXiv preprint arXiv:2503.14734, 2025

  17. [17]

    Gr-3 technical report,

    C. Cheang, S. Chen, Z. Cui, Y . Hu, L. Huang, T. Kong, H. Li, Y . Li, Y . Liu, X. Maet al., “Gr-3 technical report,”arXiv preprint arXiv:2507.15493, 2025

  18. [18]

    pi 0: A vision-language-action flow model for general robot control,

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichteret al., “ pi 0: A vision-language-action flow model for general robot control,”arXiv preprint arXiv:2410.24164, 2024

  19. [19]

    Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation,

    Q. Li, Y . Liang, Z. Wang, L. Luo, X. Chen, M. Liao, F. Wei, Y . Deng, S. Xu, Y . Zhanget al., “Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation,” arXiv preprint arXiv:2411.19650, 2024

  20. [20]

    Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collab- oration 0,

    A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jainet al., “Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collab- oration 0,” in2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 6892–6903

  21. [21]

    Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers,

    L. Wang, X. Chen, J. Zhao, and K. He, “Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers,”Advances in Neural Information Processing Systems, 2024

  22. [22]

    Qwen-robotmanip technical report: Alignment unlocks scale for robotic manipulation foundation models,

    H. Yuan, Z. Liang, A. Chen, Y . Wang, H. Li, P. Lin, Y . Huang, Z. Lei, T. Zhang, J. Zhanget al., “Qwen-robotmanip technical report: Alignment unlocks scale for robotic manipulation foundation models,”arXiv preprint arXiv:2606.17846, 2026

  23. [23]

    Graspvla: a grasping foundation model pre-trained on billion-scale synthetic action data,

    S. Deng, M. Yan, S. Wei, H. Ma, Y . Yang, J. Chen, Z. Zhang, T. Yang, X. Zhang, W. Zhang, H. Cui, Z. Zhang, and H. Wang, “Graspvla: a grasping foundation model pre-trained on billion-scale synthetic action data,” 2025

  24. [24]

    Dreamvla: A vision- language-action model dreamed with comprehensive world knowledge,

    W. Zhang, H. Liu, Z. Qi, Y . Wang, X. Yu, J. Zhang, R. Dong, J. He, H. Wang, Z. Zhang, L. Yi, W. Zeng, and X. Jin, “Dreamvla: A vision- language-action model dreamed with comprehensive world knowledge,” CoRR, vol. abs/2507.04447, 2025

  25. [25]

    Univla: Learning to act anywhere with task-centric latent actions,

    Q. Bu, Y . Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, and H. Li, “Univla: Learning to act anywhere with task-centric latent actions,”arXiv preprint arXiv:2505.06111, 2025

  26. [26]

    Embodiedgpt: Vision-language pre-training via embodied chain of thought,

    Y . Mu, Q. Zhang, M. Hu, W. Wang, M. Ding, J. Jin, B. Wang, J. Dai, Y . Qiao, and P. Luo, “Embodiedgpt: Vision-language pre-training via embodied chain of thought,”Advances in Neural Information Processing Systems, vol. 36, 2024

  27. [27]

    Starvla: A lego-like codebase for vision-language-action model developing,

    Contributors, “Starvla: A lego-like codebase for vision-language-action model developing,” GitHub repository, 1 2025

  28. [28]

    Tempovla: Learning speed-controllable vision-language-action policies,

    D. Jing, J. Nie, T. Zhang, J. Liu, H. Yao, Z. Lu, and M. Ding, “Tempovla: Learning speed-controllable vision-language-action policies,” arXiv preprint arXiv:2606.06491, 2026

  29. [30]

    Paligemma 2: A family of versatile vlms for transfer,

    A. Steiner, A. S. Pinto, M. Tschannen, D. Keysers, X. Wang, Y . Bitton, A. Gritsenko, M. Minderer, A. Sherbondy, S. Longet al., “Paligemma 2: A family of versatile vlms for transfer,”arXiv preprint arXiv:2412.03555, 2024

  30. [31]

    Fine-tuning vision-language-action models: Optimizing speed and success,

    M. J. Kim, C. Finn, and P. Liang, “Fine-tuning vision-language-action models: Optimizing speed and success,”arXiv preprint arXiv:2502.19645, 2025

  31. [32]

    Diffusion policy: Visuomotor policy learning via action diffusion,

    C. Chi, S. Feng, Y . Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,”arXiv preprint arXiv:2303.04137, 2023

  32. [33]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware,

    T. Z. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware,” inProceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023

  33. [34]

    Rdt-1b: a diffusion foundation model for bimanual manipulation,

    S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu, “Rdt-1b: a diffusion foundation model for bimanual manipulation,” arXiv preprint arXiv:2410.07864, 2024

  34. [35]

    Fast: Efficient action tokenization for vision-language-action models,

    K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine, “Fast: Efficient action tokenization for vision-language-action models,”arXiv preprint arXiv:2501.09747, 2025

  35. [36]

    Spatialvla: Exploring spatial representations for visual-language-action model,

    D. Qu, H. Song, Q. Chen, Y . Yao, X. Ye, Y . Ding, Z. Wang, J. Gu, B. Zhao, D. Wanget al., “Spatialvla: Exploring spatial representations for visual-language-action model,”arXiv preprint arXiv:2501.15830, 2025

  36. [37]

    Spatial forcing: Implicit spatial representation alignment for vision-language-action model,

    F. Li, W. Song, H. Zhao, J. Wang, P. Ding, D. Wang, L. Zeng, and H. Li, “Spatial forcing: Implicit spatial representation alignment for vision-language-action model,”arXiv preprint arXiv:2510.12276, 2025

  37. [38]

    Geomanip: Geometric constraints as general interfaces for robot manipulation,

    W. Tang, J.-H. Pan, Y .-H. Liu, M. Tomizuka, L. E. Li, C.-W. Fu, and M. Ding, “Geomanip: Geometric constraints as general interfaces for robot manipulation,”arXiv preprint arXiv:2501.09783, 2025

  38. [39]

    Cot-vla: Visual chain-of-thought reasoning for vision-language-action models,

    Q. Zhao, Y . Lu, M. J. Kim, Z. Fu, Z. Zhang, Y . Wu, Z. Li, Q. Ma, S. Han, C. Finnet al., “Cot-vla: Visual chain-of-thought reasoning for vision-language-action models,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 1702–1713

  39. [40]

    Robotic control via embodied chain-of-thought reasoning,

    M. Zawalski, W. Chen, K. Pertsch, O. Mees, C. Finn, and S. Levine, “Robotic control via embodied chain-of-thought reasoning,”arXiv preprint arXiv:2407.08693, 2024

  40. [41]

    Incentivizing multimodal reasoning in large models for direct robot manipulation,

    W. Tang, D. Jing, J.-H. Pan, Z. Lu, Y .-H. Liu, L. E. Li, M. Ding, and C.-W. Fu, “Incentivizing multimodal reasoning in large models for direct robot manipulation,”arXiv preprint arXiv:2505.12744, 2025

  41. [42]

    Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies,

    R. Zheng, Y . Liang, S. Huang, J. Gao, H. Daum ´e III, A. Kolobov, F. Huang, and J. Yang, “Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies,”arXiv preprint arXiv:2412.10345, 2024

  42. [43]

    Worldvla: Towards autoregressive action world model,

    J. Cen, C. Yu, H. Yuan, Y . Jiang, S. Huang, J. Guo, X. Li, Y . Song, H. Luo, F. Wanget al., “Worldvla: Towards autoregressive action world model,”arXiv preprint arXiv:2506.21539, 2025

  43. [44]

    Causal world modeling for robot control,

    L. Li, Q. Zhang, Y . Luo, S. Yang, R. Wang, F. Han, M. Yu, Z. Gao, N. Xue, X. Zhuet al., “Causal world modeling for robot control,”arXiv preprint arXiv:2601.21998, 2026

  44. [45]

    Fast-wam: Do world action mod- els need test-time future imagination?

    T. Yuan, Z. Dong, Y . Liu, and H. Zhao, “Fast-wam: Do world action mod- els need test-time future imagination?”arXiv preprint arXiv:2603.16666, 2026

  45. [46]

    Cosmos policy: Fine-tuning video models 14 for visuomotor control and planning,

    M. J. Kim, Y . Gao, T.-Y . Lin, Y .-C. Lin, Y . Ge, G. Lam, P. Liang, S. Song, M.-Y . Liu, C. Finnet al., “Cosmos policy: Fine-tuning video models 14 for visuomotor control and planning,”arXiv preprint arXiv:2601.16163, 2026

  46. [47]

    Bridgedata v2: A dataset for robot learning at scale,

    H. R. Walke, K. Black, T. Z. Zhao, Q. Vuong, C. Zheng, P. Hansen- Estruch, A. W. He, V . Myers, M. J. Kim, M. Duet al., “Bridgedata v2: A dataset for robot learning at scale,” inConference on Robot Learning. PMLR, 2023, pp. 1723–1736

  47. [48]

    Droid: A large-scale in-the-wild robot manipulation dataset,

    A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karam- cheti, S. Nasiriany, M. K. Srinivasan, L. Y . Zhu, O. Lingelbachet al., “Droid: A large-scale in-the-wild robot manipulation dataset,”arXiv preprint arXiv:2403.12945, 2024

  48. [49]

    Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems,

    Q. Bu, J. Cai, L. Chen, X. Cui, Y . Ding, S. Feng, S. Gao, X. He, X. Hu, X. Huanget al., “Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems,”arXiv preprint arXiv:2503.06669, 2025

  49. [50]

    Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation,

    K. Wu, C. Zhao, J. Chen, W. Li, Y . Xu, J. Zhuet al., “Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation,”arXiv preprint arXiv:2412.13877, 2024

  50. [51]

    Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation,

    T. Chen, Z. Chen, B. Chen, Z. Cai, Y . Liu, Z. Li, Q. Liang, X. Lin, Y . Ge, Z. Gu, W. Deng, Y . Guo, T. Nian, X. Xie, Q. Chen, K. Su, T. Xu, G. Liu, M. Hu, H. ang Gao, K. Wang, Z. Liang, Y . Qin, X. Yang, P. Luo, and Y . Mu, “Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation,” 2025

  51. [52]

    Libero: Benchmarking knowledge transfer for lifelong robot learning,

    B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone, “Libero: Benchmarking knowledge transfer for lifelong robot learning,”arXiv preprint arXiv:2306.03310, 2023

  52. [53]

    Robocasa: Large-scale simulation of everyday tasks for generalist robots,

    S. Nasiriany, A. Maddukuri, L. Zhang, A. Parber, A. Lo, A. Joshi, A. Mandlekar, and Y . Zhu, “Robocasa: Large-scale simulation of everyday tasks for generalist robots,”arXiv preprint arXiv:2406.02523, 2024

  53. [54]

    Octo: An open-source generalist robot policy,

    O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xuet al., “Octo: An open-source generalist robot policy,”arXiv preprint arXiv:2405.12213, 2024

  54. [55]

    X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model,

    J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y . Feng, Y . Zheng, J. Zou, Y . Chen, J. Zenget al., “X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model,”arXiv preprint arXiv:2510.10274, 2025

  55. [56]

    Universal actions for enhanced embodied foundation models,

    J. Zheng, J. Li, D. Liu, Y . Zheng, Z. Wang, Z. Ou, Y . Liu, J. Liu, Y .-Q. Zhang, and X. Zhan, “Universal actions for enhanced embodied foundation models,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 22 508–22 519

  56. [57]

    Learning structured output represen- tation using deep conditional generative models,

    K. Sohn, H. Lee, and X. Yan, “Learning structured output represen- tation using deep conditional generative models,”Advances in neural information processing systems, vol. 28, 2015

  57. [58]

    Apt: Action expert pretraining improves instruction generalization of vision-language-action policies,

    K. Xu, Z. Zhu, A. Chen, R. Xiong, and Y . Wang, “Apt: Action expert pretraining improves instruction generalization of vision-language-action policies,”arXiv preprint arXiv:2606.12366, 2026

  58. [59]

    Latent action pretraining from videos,

    S. Ye, J. Jang, B. Jeon, S. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y .-W. Chao, B. Y . Linet al., “Latent action pretraining from videos,” arXiv preprint arXiv:2410.11758, 2024

  59. [60]

    Neural discrete representation learning,

    A. Van Den Oord, O. Vinyalset al., “Neural discrete representation learning,”Advances in neural information processing systems, vol. 30, 2017

  60. [61]

    Igor: Image-goal representations are the atomic control units for foundation models in embodied ai,

    X. Chen, J. Guo, T. He, C. Zhang, P. Zhang, D. C. Yang, L. Zhao, and J. Bian, “Igor: Image-goal representations are the atomic control units for foundation models in embodied ai,”arXiv preprint arXiv:2411.00785, 2024

  61. [62]

    Moto: Latent motion token as the bridging language for robot manipulation,

    Y . Chen, Y . Ge, Y . Li, Y . Ge, M. Ding, Y . Shan, and X. Liu, “Moto: Latent motion token as the bridging language for robot manipulation,” arXiv preprint arXiv:2412.04445, 2024

  62. [63]

    Egodex: Learning dexterous manipulation from large-scale egocentric video,

    R. Hoque, P. Huang, D. J. Yoon, M. Sivapurapu, and J. Zhang, “Egodex: Learning dexterous manipulation from large-scale egocentric video,”

  63. [64]

    Available: https://arxiv.org/abs/2505.11709

    [Online]. Available: https://arxiv.org/abs/2505.11709

  64. [65]

    Egoverse: An egocentric human dataset for robot learning from around the world,

    R. Punamiya, S. Kareer, Z. Liu, J. Citron, R.-Z. Qiu, X. Cai, A. Gavryushin, J. Chen, D. Liconti, L. Y . Zhuet al., “Egoverse: An egocentric human dataset for robot learning from around the world,” arXiv preprint arXiv:2604.07607, 2026

  65. [66]

    Ego4d: Around the world in 3,000 hours of egocentric video,

    K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liuet al., “Ego4d: Around the world in 3,000 hours of egocentric video,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 18 995– 19 012

  66. [67]

    Vla-jepa: Enhancing vision-language-action model with latent world model,

    J. Ma, C. Wang, Y . Li, J. Liu, Y . Li, H. Huang, P. Xu, and H. Zhao, “Vla-jepa: Enhancing vision-language-action model with latent world model,”arXiv preprint arXiv:2602.10098, 2025

  67. [68]

    Latent action pretraining through world modeling,

    D. Wu, Y . Cao, Y . Fuet al., “Latent action pretraining through world modeling,”arXiv preprint arXiv:2509.18428, 2025

  68. [69]

    Adaworld: Learning adaptable world models with latent actions,

    S. Gao, S. Zhou, Y . Du, J. Zhang, and C. Gan, “Adaworld: Learning adaptable world models with latent actions,” inInternational Conference on Machine Learning (ICML), 2025

  69. [70]

    Dreamdojo: A generalist robot world model from large-scale human videos,

    S. Gao, W. Liang, K. Zheng, A. Malik, S. Ye, S. Yu, W.-C. Tseng, Y . Dong, K. Mo, C.-H. Lin, Q. Ma, S. Nah, L. Magne, J. Xiang, Y . Xie, R. Zheng, D. Niu, Y . L. Tan, K. Zentner, G. Kurian, S. Indupuru, P. Jannaty, J. Gu, J. Zhang, J. Malik, P. Abbeel, M.-Y . Liu, Y . Zhu, J. Jang, and L. J. Fan, “Dreamdojo: A generalist robot world model from large-scale...

  70. [71]

    Dynamo: In-domain dynamics pretraining for visuo,

    Z. Cui, H. Pan, A. Iyer, S. Haldar, and L. Pinto, “Dynamo: In-domain dynamics pretraining for visuo,”Motor Control, 2024

  71. [72]

    Mixture of horizons in action chunking,

    D. Jing, G. Wang, J. Liu, W. Tang, Z. Sun, Y . Yao, Z. Wei, Y . Liu, Z. Lu, and M. Ding, “Mixture of horizons in action chunking,”arXiv preprint arXiv:2511.19433, 2025

  72. [73]

    π0.5: a vision-language-action model with open-world generalization,

    P. Intelligence, “π0.5: a vision-language-action model with open-world generalization,” 2025. [Online]. Available: https://arxiv.org/abs/2504. 16054

  73. [74]

    Starvla: A lego-like codebase for vision-language-action model developing,

    S. Community, “Starvla: A lego-like codebase for vision-language-action model developing,”arXiv preprint arXiv:2604.05014, 2026

  74. [75]

    Deep unsupervised learning using nonequilibrium thermodynamics,

    J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in International conference on machine learning. pmlr, 2015, pp. 2256– 2265

  75. [76]

    Denoising diffusion probabilistic models,

    J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840– 6851, 2020

  76. [77]

    Scalable diffusion models with transformers,

    W. Peebles and S. Xie, “Scalable diffusion models with transformers,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4195–4205

  77. [78]

    Qwen2. 5 technical report,

    A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Weiet al., “Qwen2. 5 technical report,”arXiv preprint arXiv:2412.15115, 2024

  78. [79]

    Qwen3-vl technical report,

    S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y . Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y . Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang,...