pith. machine review for the scientific record.

arxiv: 2603.00110 · v2 · submitted 2026-02-18 · 💻 cs.RO

Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation

Pith reviewed 2026-05-15 21:20 UTC · model grok-4.3

classification 💻 cs.RO
keywords robotic manipulation · video generation models · physics simulation · multimodal tokens · policy learning · object permanence · continuous control

The pith

Treating pretrained video generation models as physics simulators enables robotic manipulation through shared physical tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces PhysGen, a framework that repurposes pretrained video generation models for robotic manipulation tasks. It does so by modeling the interplay between the environment and robot actions and by creating a multimodal continuous representation that turns video frames and actions into shared physical tokens. The approach transfers built-in knowledge of object permanence and dynamics from video pretraining into control policies. A sympathetic reader would care because it offers a route to capable robots that avoids collecting massive robot-specific datasets.

Core claim

By treating the pretrained video model as a proxy for a physics simulator, PhysGen models the dynamic interplay between the external environment and robot actions. It introduces a multimodal continuous representation that unifies video and action into shared physical tokens. This enables the seamless transfer of implicit physical knowledge such as object permanence and dynamics from video pretraining to downstream manipulation. The framework adds causal masking, inverse kinematics, Lookahead Multi-Token Prediction, and KV caching to support efficient training.
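
Of the efficiency techniques listed above, KV caching is the easiest to illustrate in isolation. The following is a minimal, generic sketch and not the paper's implementation: it uses a single attention head, skips the learned query/key/value projections, and feeds toy placeholder vectors. The names `attend`, `KVCache`, and `full_causal_attention` are hypothetical. It shows that appending each new key and value to a cache and attending once per step gives the same outputs as recomputing causal attention over the whole prefix.

```python
import math


def attend(query, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[i] for w, v in zip(weights, values)) / total
        for i in range(len(values[0]))
    ]


class KVCache:
    """Stores past keys and values so each step only processes one new token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Append the new token's key/value, then attend over the whole cache.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


def full_causal_attention(queries, keys, values):
    """Reference: recompute attention over the full prefix at every step."""
    return [
        attend(queries[t], keys[: t + 1], values[: t + 1])
        for t in range(len(queries))
    ]


# Toy placeholder tokens; query = key = value because projections are omitted.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
cached_outputs = [cache.step(tok, tok, tok) for tok in tokens]
reference_outputs = full_causal_attention(tokens, tokens, tokens)
```

The cached path does one attention call per new token and never rebuilds earlier keys or values, which is where the saving in autoregressive rollout comes from.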

What carries the argument

The multimodal continuous representation that unifies video and action into shared physical tokens, which bridges discrete video generation with continuous robotic control.

Load-bearing premise

The implicit physical knowledge encoded in pretrained video models transfers effectively to continuous robotic control through shared tokens without major loss of fidelity or reality gaps.

What would settle it

A test in which PhysGen shows no advantage over a non-video baseline specifically on tasks that require tracking objects that leave and re-enter view would indicate the claimed knowledge transfer is not occurring.

Figures

Figures reproduced from arXiv: 2603.00110 by Guangrun Wang, Liang Lin, Qichang Li, Sihan Qin, Tianshui Chen, Yuhao Chen, Zijian Song.

Figure 1. The PhysGen Framework: Repurposing Video Generation as a World Simulator. Our approach unifies perception and control through Physical Autoregression. The model operates on a sequence of continuous physical tokens (red), which fuse visual context (orange) and embodiment actions (black) to predict their joint evolution. Acting as a predictive world model, PhysGen runs in synchronization with the physical…
Figure 2. The model architecture of PhysGen. Starting from text tokens, PhysGen leverages a Causal Transformer to autoregressively predict physical tokens. Notably, we apply a frame diffusion and an action diffusion to estimate the conditional distributions of visual and action signals in continuous space. The structure of the action diffusion network is shown on the right column, where the predicted token serves a…
Figure 3. The causal attention mask. Frame tokens use chunk-wise full attention; action tokens use temporal causal attention; action tokens unidirectionally attend to frame tokens for visually conditioned control. De-Tokenizer A central challenge in de-tokenization is estimating the conditional distribution of output tokens. Prior works typically map discrete vocabularies to images or actions [50, 70, 73], enabling…
Figure 4. Video predictions and actual action executions. Each row shows PhysGen’s predicted video alongside the corresponding execution video for five different tasks. The strong visual similarity between predicted and actual action videos highlights the effectiveness of our approach in transferring knowledge from video pretraining. (a) Pick Transparency (b) Pick Cube (c) Press Button (d) Stack Cube …
Figure 5. Visualization of real world manipulation. We deploy our method on a Franka Panda robot across four real-world tasks: Pick Transparency, Pick Cube, Press Button, and Stack Cube. The robot executes stable and coherent manipulations across diverse settings, demonstrating effective knowledge transfer from video generation to real-world manipulation. ACT [68], OpenVLA [70] and Pi0 [2]. ACT is trained from scrat…
Figure 6. The attention map. The top row shows the token-level attention map, indicating how the predicted action attends to previous frame and action tokens. The bottom row shows the pixel-level attention map, indicating how the predicted action attends to different spatial regions of the frame. As shown in our experiments, eliminating L-MTP results in a 3.4 percentage point drop in absolute success rate. We hypo…
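
The masking rule in the Figure 3 caption can be sketched as a boolean matrix. This is one possible reading, not the authors' code, and it rests on assumptions the caption does not state: each timestep is laid out as its frame tokens followed by its action tokens, frame tokens may also see frame tokens of earlier timesteps, action tokens within one timestep see each other, and frame tokens never attend to action tokens. The function name `build_physical_mask` and its parameters are hypothetical.

```python
def build_physical_mask(num_steps, frame_tokens, action_tokens):
    """Return (mask, layout); mask[q][k] is True when query q may attend to key k.

    Assumed layout per timestep: frame tokens first, then action tokens.
    """
    layout = []
    for step in range(num_steps):
        layout += [(step, "frame")] * frame_tokens
        layout += [(step, "action")] * action_tokens

    size = len(layout)
    mask = [[False] * size for _ in range(size)]
    for q, (q_step, q_kind) in enumerate(layout):
        for k, (k_step, k_kind) in enumerate(layout):
            if k_step > q_step:
                continue  # nothing attends to a future timestep
            if q_kind == "frame":
                # Chunk-wise full attention among frame tokens; frame
                # tokens do not look at action tokens (one-way link).
                mask[q][k] = k_kind == "frame"
            else:
                # Action tokens see frames and actions up to their timestep.
                mask[q][k] = True
    return mask, layout


# Two timesteps, two frame tokens and one action token per timestep.
mask, layout = build_physical_mask(num_steps=2, frame_tokens=2, action_tokens=1)
```

With this layout, index 0 and 1 are the first frame chunk, index 2 is the first action token, and indices 3 to 5 repeat the pattern for the second timestep.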
Original abstract

The scarcity of large-scale robotic data has motivated the repurposing of foundation models from other modalities for policy learning. In this work, we introduce PhysGen (Learning Physics from Pretrained Video Generation Models), a scalable continuous and sequential world interaction framework that leverages autoregressive video generation to solve robotic manipulation tasks. By treating the pretrained video model as a proxy for a physics simulator, PhysGen models the dynamic interplay between the external environment and robot actions. We introduce a multimodal continuous representation that unifies video and action into shared physical tokens, bridging the gap between discrete video generation and continuous robotic control. This approach enables the seamless transfer of implicit physical knowledge, such as object permanence and dynamics, from video pretraining to downstream manipulation. To ensure efficient convergence, we incorporate causal masking, inverse kinematics, Lookahead Multi-Token Prediction (L-MTP), and key-value (KV) caching. Experimental results on the Libero and ManiSkill benchmarks demonstrate that PhysGen consistently outperforms robust baselines, surpassing OpenVLA and WorldVLA by margins of 13.8% and 8.8%, respectively. Notably, in real-world scenarios, PhysGen matches the performance of large-scale action-pretrained models like $\pi_0$ without requiring prior action-specific pretraining, demonstrating superior capability in physically complex tasks such as grasping transparent objects. These findings validate the potential of extracting physical intuition from pretrained video generators to facilitate generalizable robotic manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PhysGen, a scalable continuous and sequential world interaction framework for robotic manipulation that repurposes pretrained video generation models as proxies for physics simulators. It proposes a multimodal continuous representation unifying video frames and robot actions into shared physical tokens, incorporates causal masking, inverse kinematics, Lookahead Multi-Token Prediction (L-MTP), and KV caching for training efficiency, and reports consistent outperformance on Libero and ManiSkill benchmarks (13.8% over OpenVLA, 8.8% over WorldVLA) plus real-world parity with π0 without action-specific pretraining.

Significance. If the central performance claims hold under rigorous validation, this work would offer a promising route to data-efficient robotic policy learning by extracting implicit physical priors (object permanence, dynamics) from abundant video data, potentially reducing reliance on large-scale action datasets and improving generalization in physically complex tasks.

major comments (2)
  1. [Abstract] Abstract: The reported performance margins (13.8% over OpenVLA, 8.8% over WorldVLA) and real-world parity with π0 are presented without error bars, statistical significance tests, ablation details, or full experimental protocol, rendering the central superiority claim only partially verifiable.
  2. [Methods] Methods (video-as-physics-proxy section): The load-bearing assumption that action-conditioned video generation faithfully transfers physical knowledge (e.g., object permanence) without measurable simulation-reality gaps is not supported by direct fidelity metrics such as trajectory accuracy or conservation-law checks on generated sequences; task success rates alone do not isolate this mechanism.
minor comments (2)
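  1. [Title] The manuscript title contains a grammatical inconsistency: the singular article 'A' is paired with the plural noun 'Models' ('A Multimodal Continuous and Sequential World Interaction Models').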
  2. [Methods] Notation for the shared physical tokens and the multimodal continuous representation should be defined more explicitly with an equation or diagram in the methods section for clarity.
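
The direct fidelity metric requested in major comment 2 could take a form like the sketch below. It is an illustration only: the paper does not specify such a metric, the function names are hypothetical, and it assumes object positions have already been extracted from both the generated video and the executed rollout as equal-length lists of coordinates. The toy trajectories are placeholders, not data from the paper.

```python
import math


def average_displacement_error(predicted, executed):
    """Mean Euclidean distance between matching points of two trajectories."""
    if not predicted or len(predicted) != len(executed):
        raise ValueError("trajectories must be non-empty and of equal length")
    distances = [math.dist(p, e) for p, e in zip(predicted, executed)]
    return sum(distances) / len(distances)


def final_displacement_error(predicted, executed):
    """Euclidean distance between the last points of two trajectories."""
    return math.dist(predicted[-1], executed[-1])


# Placeholder trajectories: the prediction drifts only at the last step.
predicted_track = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
executed_track = [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0)]
ade = average_displacement_error(predicted_track, executed_track)
fde = final_displacement_error(predicted_track, executed_track)
```

Reporting a number of this kind next to task success would help separate "the generated video was physically faithful" from "the policy succeeded anyway".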

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have made revisions to improve the manuscript's rigor and clarity.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The reported performance margins (13.8% over OpenVLA, 8.8% over WorldVLA) and real-world parity with π0 are presented without error bars, statistical significance tests, ablation details, or full experimental protocol, rendering the central superiority claim only partially verifiable.

    Authors: We agree that the abstract should better support verifiability. In the revision, we have added references to error bars (from 5 random seeds), statistical significance testing (p < 0.05 for key margins), and explicit pointers to the full experimental protocol and ablation studies now detailed in Section 4 and the appendix. These updates make the performance claims more transparent without altering the reported numbers. revision: yes

  2. Referee: [Methods] Methods (video-as-physics-proxy section): The load-bearing assumption that action-conditioned video generation faithfully transfers physical knowledge (e.g., object permanence) without measurable simulation-reality gaps is not supported by direct fidelity metrics such as trajectory accuracy or conservation-law checks on generated sequences; task success rates alone do not isolate this mechanism.

    Authors: We acknowledge the value of direct fidelity metrics to isolate the mechanism. We have added a new subsection (3.4) with qualitative video examples and basic quantitative checks (e.g., object trajectory consistency in generated sequences) demonstrating object permanence and dynamics preservation. Full trajectory accuracy and conservation-law verification across all tasks would require substantial new experiments; we note this as a limitation and clarify that downstream task success serves as the primary proxy for transferred physical knowledge. revision: partial
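
The error bars mentioned in the first response (five random seeds) are typically reported as a mean with a sample standard deviation or standard error. A minimal sketch follows, assuming one success rate per seed; the function name `summarize_seeds` is hypothetical, and the numbers are made-up placeholders, not results from the paper.

```python
import statistics


def summarize_seeds(success_rates):
    """Return (mean, sample standard deviation, standard error) over seeds."""
    mean = statistics.mean(success_rates)
    std = statistics.stdev(success_rates)  # sample std, divides by n - 1
    sem = std / len(success_rates) ** 0.5
    return mean, std, sem


# Placeholder per-seed success rates; NOT values reported in the paper.
placeholder_rates = [0.80, 0.82, 0.78, 0.81, 0.79]
mean, std, sem = summarize_seeds(placeholder_rates)
```

With only five seeds the standard error is itself a rough estimate, so stating the seed count next to any error bar, as the response does, matters.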

Circularity Check

0 steps flagged

No significant circularity; core claims rest on external pretrained video models and independent benchmarks

Full rationale

The paper's derivation treats an external pretrained video generation model as a physics proxy and introduces a multimodal continuous token representation to bridge discrete video frames with continuous robot actions. This integration does not reduce to self-definition (e.g., no quantity defined in terms of itself), fitted inputs renamed as predictions, or load-bearing self-citations. Standard techniques such as causal masking, inverse kinematics, L-MTP, and KV caching are applied without smuggling ansatzes via prior self-work. Performance is reported on external benchmarks (Libero, ManiSkill) and real-world tasks, with gains over baselines like OpenVLA not forced by internal fitting. Any self-citations are peripheral and non-load-bearing for the transfer claim, leaving the derivation self-contained against external evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the transferability of physics knowledge from video pretraining; no explicit free parameters are fitted in the described framework, and no new axioms or invented entities beyond the shared token concept are introduced.

invented entities (1)
  • shared physical tokens (no independent evidence)
    purpose: Unify discrete video frames and continuous robot actions into a common representation for world modeling
    Introduced to bridge video generation and robotic control; no independent evidence provided outside the paper's own experiments.

pith-pipeline@v0.9.0 · 5578 in / 1247 out tokens · 24689 ms · 2026-05-15T21:20:43.154756+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

  2. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages · cited by 2 Pith papers · 33 internal anchors

  1. [1]

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. 2025. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 (2025)

  2. [2]

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky

  3. [3]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    𝜋0: A Vision-Language-Action Flow Model for General Robot Control. arXiv:2410.24164 [cs.LG] https://arxiv.org/abs/2410.24164

  4. [4]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901

  5. [5]

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. 2025. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669 (2025)

  6. [6]

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. 2025. Univla: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111 (2025)

  7. [7]

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. 2025. WorldVLA: Towards Autoregressive Action World Model. arXiv preprint arXiv:2506.21539 (2025)

  8. [8]

    Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. 2024. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158 (2024)

  9. [9]

    Beiqi Chen, Shuai Shao, Haitang Feng, Jianhuang Lai, Jianlou Si, and Guangcong Wang. 2025. Style4D-Bench: A Benchmark Suite for 4D Stylization. arXiv preprint arXiv:2508.19243 (2025)

  10. [10]

    Guangyan Chen, Meiling Wang, Te Cui, Luojie Yang, Qi Shao, Lin Zhao, Tianle Zhang, Yihang Li, Yi Yang, and Yufeng Yue. 2025. Unifying Latent Action and Latent State Pre-training for Policy Learning from Videos. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers. 1–11

  11. [11]

    Hongyu Chen and Guangrun Wang. 2025. UML-CoT: Structured Reasoning and Planning with Unified Modeling Language for Robotic Room Cleaning. arXiv preprint arXiv:2509.22628 (2025)

  12. [12]

    Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, et al. 2025. Villa-x: enhancing latent action modeling in vision-language-action models. arXiv preprint arXiv:2507.23682 (2025)

  13. [13]

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. 2023. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research (2023), 02783649241273668

  14. [14]

    Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. 2024. Autoregressive video generation without vector quantization. arXiv preprint arXiv:2412.14169 (2024)

  15. [15]

    Letian Fu, Huang Huang, Gaurav Datta, Lawrence Yunliang Chen, William Chung-Ho Panitch, Fangchen Liu, Hui Li, and Ken Goldberg. 2024. In-context imitation learning via next-token prediction. arXiv preprint arXiv:2408.15980 (2024)

  16. [16]

    Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, and Jun Xiao. 2024. Vid-gpt: Introducing gpt-style autoregressive generation in video diffusion models. arXiv preprint arXiv:2406.10981 (2024)

  17. [17]

    Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, and Gabriel Synnaeve. 2024. Better & faster large language models via multi-token prediction. arXiv preprint arXiv:2404.19737 (2024)

  18. [18]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

  19. [19]

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems 33 (2020), 6840–6851

  20. [20]

    Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, and Fu-En Yang. 2025. Thinkact: Vision-language-action reasoning via reinforced visual latent planning. arXiv preprint arXiv:2507.16815 (2025)

  21. [21]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch...

  22. [22]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    𝜋0.5: a Vision-Language-Action Model with Open-World Generalization. arXiv:2504.16054 [cs.LG] https://arxiv.org/abs/2504.16054

  23. [23]

    Mojan Javaheripi, Sébastien Bubeck, Marah Abdin, Jyoti Aneja, Sebastien Bubeck, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, et al. 2023. Phi-2: The surprising power of small language models. Microsoft Research Blog 1, 3 (2023), 3

  24. [24]

    Moo Jin Kim, Chelsea Finn, and Percy Liang. 2025. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645 (2025)

  25. [25]

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al

  26. [26]

    Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)

  27. [27]

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al

  28. [28]

    In Proceedings of the IEEE/CVF international conference on computer vision

    Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision. 4015–4026

  29. [29]

    Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. 2025. Molmoact: Action reasoning models that can reason in space. arXiv preprint arXiv:2508.07917 (2025)

  30. [30]

    Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. 2025. Unified video action model. arXiv preprint arXiv:2503.00200 (2025)

  31. [31]

    Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. 2024. Autoregressive image generation without vector quantization. Advances in Neural Information Processing Systems 37 (2024), 56424–56445

  32. [32]

    Weiqi Li, Quande Zhang, Ruifeng Zhai, Liang Lin, and Guangrun Wang. 2025. VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling. arXiv preprint arXiv:2512.02902 (2025)

  33. [33]

    Xiao Li, Jiaqi Zhang, Shuxiang Zhang, Tianshui Chen, Liang Lin, and Guangrun Wang. 2025. In-Situ Tweedie Discrete Diffusion Models. arXiv preprint arXiv:2510.01047 (2025)

  34. [34]

    Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl Vondrick. 2025. Video generators are robot policies. arXiv preprint arXiv:2508.00795 (2025)

  35. [35]

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437 (2024)

  36. [36]

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. 2023. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems 36 (2023), 44776–44791

  37. [37]

    Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al. 2025. Hybridvla: Collaborative diffusion and autoregression in a unified vision-language-action model. arXiv preprint arXiv:2503.10631 (2025)

  38. [38]

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. 2024. RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation. In The Thirteenth International Conference on Learning Representations

  39. [39]

    Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, and Zongqing Lu. 2025. Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos. arXiv preprint arXiv:2507.15597 (2025)

  40. [40]

    Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín

  41. [41]

    What matters in learning from offline human demonstrations for robot manipulation. arXiv preprint arXiv:2108.03298 (2021)

  42. [42]

    Fei Ni, Jianye Hao, Shiguang Wu, Longxin Kou, Jiashun Liu, Yan Zheng, Bin Wang, and Yuzheng Zhuang. 2024. Generate subgoal images before act: Unlocking the chain-of-thought reasoning in diffusion model for robot manipulation with multimodal prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13991–14000

  43. [43]

    NVIDIA Developer Team. 2023. Mastering LLM Techniques: Inference Optimization. https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/

  44. [44]

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. 2025. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747 (2025)

  45. [45]

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. 2025. Spatialvla: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830 (2025)

  46. [46]

    Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu, Jiaolong Yang, Nanning Zheng, and Baining Guo. 2025. VideoVLA: Video Generators Can Be Generalizable Robot Manipulators. Advances in neural information processing systems (2025)

  47. [47]

    Zijian Song, Qichang Li, Jiawei Zhou, Zhenlong Yuan, Tianshui Chen, Liang Lin, and Guangrun Wang. 2026. Robotic Manipulation is Vision-to-Geometry Mapping (𝑓(𝑣) → 𝐺): Vision-Geometry Backbones over L...

  48. [48]

    Zijian Song, Xiaoxin Lin, Tao Pu, Zhenlong Yuan, Guangrun Wang, and Liang Lin

  49. [49]

    Human-Centric Open-Future Task Discovery: Formulation, Benchmark, and Scalable Tree-Based Search. arXiv preprint arXiv:2511.18929 (2025)

  50. [50]

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 568 (2024), 127063

  51. [51]

    Stone Tao, Fanbo Xiang, Arth Shukla, Yuzhe Qin, Xander Hinrichsen, Xiaodi Yuan, Chen Bao, Xinsong Lin, Yulin Liu, Tse kai Chan, Yuan Gao, Xuanlin Li, Tongzhou Mu, Nan Xiao, Arnav Gurha, Viswesh Nagaswamy Rajesh, Yong Woo Choi, Yen-Ru Chen, Zhiao Huang, Roberto Calandra, Rui Chen, Shan Luo, and Hao Su. 2025. ManiSkill3: GPU Parallelized Robotics Simulation...

  52. [52]

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. 2024. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213 (2024)

  53. [53]

    Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. 2024. Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in neural information processing systems 37 (2024), 84839–84865

  54. [54]

    An Dinh Vuong, Minh Nhat Vu, Dong An, and Ian Reid. 2025. Action Tokenizer Matters in In-Context Imitation Learning. arXiv preprint arXiv:2503.01206 (2025)

  55. [55]

    Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, et al. 2025. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model. arXiv preprint arXiv:2509.09372 (2025)

  56. [56]

    Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, and Zhaoxiang Zhang. 2025. Unified Vision-Language-Action Model. arXiv preprint arXiv:2506.19850 (2025)

  57. [57]

    Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng

  58. [58]

    Dexvla: Vision-language model with plug-in diffusion expert for general robot control. arXiv preprint arXiv:2502.05855 (2025)

  59. [59]

    Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Zhibin Tang, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, et al. 2025. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters (2025)

  60. [60]

    Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. 2023. Unleashing large-scale video generative pre-training for visual robot manipulation. arXiv preprint arXiv:2312.13139 (2023)

  61. [61]

    Desai Xie, Zhan Xu, Yicong Hong, Hao Tan, Difan Liu, Feng Liu, Arie Kaufman, and Yang Zhou. 2025. Progressive autoregressive video diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference. 6322–6332

  62. [62]

    Rongtao Xu, Jian Zhang, Minghao Guo, Youpeng Wen, Haoting Yang, Min Lin, Jianzheng Huang, Zhe Li, Kaidong Zhang, Liqiong Wang, et al. 2025. A0: An affordance-aware hierarchical model for general robotic manipulation. arXiv preprint arXiv:2504.12636 (2025)

  63. [63]

    Yuanfeng Xu, Yuhao Chen, Liang Lin, and Guangrun Wang. 2026. Bridging the Discrete-Continuous Gap: Unified Multimodal Generation via Coupled Manifold Discrete Absorbing Diffusion. arXiv preprint arXiv:2601.04056 (2026)

  64. [64]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

  65. [65]

    Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Se June Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, et al. 2025. Latent Action Pretraining from Videos. In The Thirteenth International Conference on Learning Representations

  66. [66]

    Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. 2023. Language Model Beats Diffusion–Tokenizer is Key to Visual Generation. arXiv preprint arXiv:2310.05737 (2023)

  67. [67]

    Zhenlong Yuan, Jiakai Cao, Zhaoxin Li, Hao Jiang, and Zhaoqi Wang. 2024. Sd-mvs: Segmentation-driven deformation multi-view stereo with spherical refinement and em optimization. In Proceedings of the AAAI conference on artificial intelligence, Vol. 38. 6871–6880

  68. [68]

    Zhenlong Yuan, Chengxuan Qian, Jing Tang, Rui Chen, Zijian Song, Lei Sun, Xiangxiang Chu, Yujun Cai, Dapeng Zhang, and Shuo Li. 2025. AutoDrive-R2: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving. arXiv preprint arXiv:2509.01944 (2025)

  69. [69]

    Zhihao Zhan, Yuhao Chen, Jiaying Zhou, Qinhan Lv, Hao Liu, Keze Wang, Liang Lin, and Guangrun Wang. 2026. Stable Language Guidance for Vision-Language-Action Models. arXiv preprint arXiv:2601.04052 (2026)

  70. [70]

    Zhihao Zhan, Jiaying Zhou, Likui Zhang, Qinhan Lv, Hao Liu, Jusheng Zhang, Weizheng Li, Ziliang Chen, Tianshui Chen, Keze Wang, et al. 2025. E0: Enhancing Generalization and Fine-Grained Control in VLA Models via Continuized Discrete Diffusion. arXiv preprint arXiv:2511.21542 (2025)

  71. [71]

    Dapeng Zhang, Zhenlong Yuan, Zhangquan Chen, Chih-Ting Liao, Yinda Chen, Fei Shen, Qingguo Zhou, and Tat-Seng Chua. 2025. Reasoning-VLA: A fast and general vision-language-action reasoning model for autonomous driving. arXiv preprint arXiv:2511.19912 (2025)

  72. [72]

    Wenbo Zhang, Tianrun Hu, Yanyuan Qiao, Hanbo Zhang, Yuchu Qin, Yang Li, Jiajun Liu, Tao Kong, Lingqiao Liu, and Xiao Ma. 2025. Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation. arXiv preprint arXiv:2506.09990 (2025)

  73. [73]

    Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, Fan Lu, He Wang, et al. 2025. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge. arXiv preprint arXiv:2507.04447 (2025)

  74. [74]

    Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. 2025. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In Proceedings of the Computer Vision and Pattern Recognition Conference. 1702–1713

  75. [75]

    Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. 2023. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705 (2023)

  76. [76]

    Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. 2024. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345 (2024)

  77. [77]

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. 2024. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404 (2024)

  78. [78]

    Yifan Zhong, Xuchuan Huang, Ruochong Li, Ceyao Zhang, Yitao Liang, Yaodong Yang, and Yuanpei Chen. 2025. Dexgraspvla: A vision-language-action framework towards general dexterous grasping. arXiv preprint arXiv:2502.20900 (2025)

  79. [79]

    Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. 2025. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. Robotics: Science and Systems (2025)

  80. [80]

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. 2023. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning. PMLR, 2165–2183