Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation
Pith reviewed 2026-05-15 21:20 UTC · model grok-4.3
The pith
Treating pretrained video generation models as physics simulators enables robotic manipulation through shared physical tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By treating the pretrained video model as a proxy for a physics simulator, PhysGen models the dynamic interplay between the external environment and robot actions. It introduces a multimodal continuous representation that unifies video and action into shared physical tokens. This enables the seamless transfer of implicit physical knowledge such as object permanence and dynamics from video pretraining to downstream manipulation. The framework adds causal masking, inverse kinematics, Lookahead Multi-Token Prediction, and KV caching to support efficient training.
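Two of the efficiency mechanisms named here, causal masking and KV caching, can be illustrated at toy scale. The following is a minimal sketch, not the paper's implementation: a single-head attention in pure Python, showing that step-by-step decoding with a cache of past keys and values reproduces the output of full causal attention. The names `attend`, `causal_attention`, and `KVCache`, and the toy vectors, are hypothetical; for brevity the token vectors themselves serve as queries, keys, and values (no learned projections).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def causal_attention(tokens):
    """Full recomputation: token t attends only to tokens 0..t (causal mask)."""
    return [attend(tokens[t], tokens[: t + 1], tokens[: t + 1])
            for t in range(len(tokens))]

class KVCache:
    """Keeps past keys/values so each new token needs a single attention call."""
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, token):
        # Assumption for this toy: key == value == the token vector itself.
        self.keys.append(token)
        self.values.append(token)
        return attend(token, self.keys, self.values)

# Hypothetical toy sequence of four 2-d tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
full = causal_attention(tokens)
cache = KVCache()
incremental = [cache.step(tok) for tok in tokens]
```

The causal mask is what makes the cache valid: because no token attends to the future, outputs computed at earlier steps never need to be revisited when a new token arrives.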
What carries the argument
The multimodal continuous representation that unifies video and action into shared physical tokens, which bridges discrete video generation with continuous robotic control.
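One way to picture a shared token space is sketched below. It assumes that each frame latent and each action vector is linearly projected to a common width and that the two streams are interleaved in time order. The projection matrices, dimensions, and function names are hypothetical illustrations; the paper's actual tokenizer is not described on this page.

```python
def project(vector, weights):
    """Linear map; `weights` holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def build_sequence(frame_latents, actions, w_frame, w_action):
    """Interleave projected frames and actions as f0, a0, f1, a1, ..."""
    if len(frame_latents) != len(actions):
        raise ValueError("expected one action per frame")
    sequence = []
    for frame, action in zip(frame_latents, actions):
        sequence.append(("frame", project(frame, w_frame)))
        sequence.append(("action", project(action, w_action)))
    return sequence

# Hypothetical sizes: 3-d frame latents, 2-d actions, shared width 2.
w_frame = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
w_action = [[0.5, 0.0], [0.0, 0.5]]
frames = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
actions = [[2.0, 4.0], [6.0, 8.0]]

sequence = build_sequence(frames, actions, w_frame, w_action)
```

After projection every element of `sequence` has the same width regardless of modality, so a single autoregressive backbone could consume the stream without modality-specific branches.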
Load-bearing premise
The implicit physical knowledge encoded in pretrained video models transfers effectively to continuous robotic control through shared tokens without major loss of fidelity or reality gaps.
What would settle it
A test in which PhysGen shows no advantage over a non-video baseline specifically on tasks that require tracking objects that leave and re-enter view would indicate the claimed knowledge transfer is not occurring.
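The proposed test can be made concrete. The sketch below assumes per-episode outcomes are logged with a flag marking whether the target object leaves and re-enters view. The record format, field names, and outcomes are hypothetical placeholders, not results from the paper.

```python
def success_rate(episodes):
    """Fraction of episodes marked successful."""
    return sum(1 for e in episodes if e["success"]) / len(episodes)

def advantage_by_subset(model_eps, baseline_eps):
    """Success-rate gap (model minus baseline) on occluded vs. visible episodes."""
    gaps = {}
    for occluded in (True, False):
        model = [e for e in model_eps if e["occluded"] == occluded]
        base = [e for e in baseline_eps if e["occluded"] == occluded]
        name = "occluded" if occluded else "visible"
        gaps[name] = success_rate(model) - success_rate(base)
    return gaps

def make(occluded, outcomes):
    """Helper that builds placeholder episode records."""
    return [{"occluded": occluded, "success": s} for s in outcomes]

# Placeholder logs: NOT measurements from the paper.
model_eps = make(True, [True, True, False, True]) + make(False, [True, True, True, False])
baseline_eps = make(True, [True, False, False, False]) + make(False, [True, False, True, True])

gaps = advantage_by_subset(model_eps, baseline_eps)
```

Under the criterion stated above, a gap near zero on the `occluded` subset would count against the claimed transfer of object permanence, whatever the gap on the `visible` subset.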
Original abstract
The scarcity of large-scale robotic data has motivated the repurposing of foundation models from other modalities for policy learning. In this work, we introduce PhysGen (Learning Physics from Pretrained Video Generation Models), a scalable continuous and sequential world interaction framework that leverages autoregressive video generation to solve robotic manipulation tasks. By treating the pretrained video model as a proxy for a physics simulator, PhysGen models the dynamic interplay between the external environment and robot actions. We introduce a multimodal continuous representation that unifies video and action into shared physical tokens, bridging the gap between discrete video generation and continuous robotic control. This approach enables the seamless transfer of implicit physical knowledge, such as object permanence and dynamics, from video pretraining to downstream manipulation. To ensure efficient convergence, we incorporate causal masking, inverse kinematics, Lookahead Multi-Token Prediction (L-MTP), and key-value (KV) caching. Experimental results on the Libero and ManiSkill benchmarks demonstrate that PhysGen consistently outperforms robust baselines, surpassing OpenVLA and WorldVLA by margins of 13.8% and 8.8%, respectively. Notably, in real-world scenarios, PhysGen matches the performance of large-scale action-pretrained models like $\pi_0$ without requiring prior action-specific pretraining, demonstrating superior capability in physically complex tasks such as grasping transparent objects. These findings validate the potential of extracting physical intuition from pretrained video generators to facilitate generalizable robotic manipulation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PhysGen, a scalable continuous and sequential world interaction framework for robotic manipulation that repurposes pretrained video generation models as proxies for physics simulators. It proposes a multimodal continuous representation unifying video frames and robot actions into shared physical tokens, incorporates causal masking, inverse kinematics, Lookahead Multi-Token Prediction (L-MTP), and KV caching for training efficiency, and reports consistent outperformance on Libero and ManiSkill benchmarks (13.8% over OpenVLA, 8.8% over WorldVLA) plus real-world parity with π0 without action-specific pretraining.
Significance. If the central performance claims hold under rigorous validation, this work would offer a promising route to data-efficient robotic policy learning by extracting implicit physical priors (object permanence, dynamics) from abundant video data, potentially reducing reliance on large-scale action datasets and improving generalization in physically complex tasks.
major comments (2)
- [Abstract] Abstract: The reported performance margins (13.8% over OpenVLA, 8.8% over WorldVLA) and real-world parity with π0 are presented without error bars, statistical significance tests, ablation details, or full experimental protocol, rendering the central superiority claim only partially verifiable.
- [Methods] Methods (video-as-physics-proxy section): The load-bearing assumption that action-conditioned video generation faithfully transfers physical knowledge (e.g., object permanence) without measurable simulation-reality gaps is not supported by direct fidelity metrics such as trajectory accuracy or conservation-law checks on generated sequences; task success rates alone do not isolate this mechanism.
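The reporting requested in the first major comment can be sketched briefly. This is an illustration only: the per-seed success rates are made-up placeholders, and the exact two-sample permutation test is one reasonable choice among several, not a method attributed to the paper.

```python
import itertools
import statistics

def mean_and_std(values):
    """Sample mean and sample standard deviation across seeds."""
    return statistics.mean(values), statistics.stdev(values)

def permutation_p_value(a, b):
    """Exact one-sided permutation test for mean(a) > mean(b)."""
    observed = statistics.mean(a) - statistics.mean(b)
    pooled = a + b
    n = len(a)
    count = 0
    total = 0
    # Enumerate every way of relabelling n of the pooled runs as group "a".
    for idx in itertools.combinations(range(len(pooled)), n):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diff = statistics.mean(group_a) - statistics.mean(group_b)
        if diff >= observed - 1e-12:
            count += 1
        total += 1
    return count / total

# Placeholder success rates (%) over 5 seeds: NOT numbers from the paper.
method = [81.0, 83.0, 82.0, 84.0, 80.0]
baseline = [70.0, 72.0, 69.0, 71.0, 73.0]

method_mean, method_std = mean_and_std(method)
p_value = permutation_p_value(method, baseline)
```

With five seeds per arm there are 252 possible relabellings, so the smallest attainable one-sided p-value is 1/252. A threshold of 0.05 is therefore reachable, but only when the two arms are well separated across seeds.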
minor comments (2)
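- [Title] The manuscript title contains a number mismatch: the singular article 'A' is paired with the plural 'Models' ('A Multimodal Continuous and Sequential World Interaction Models').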
- [Methods] Notation for the shared physical tokens and the multimodal continuous representation should be defined more explicitly with an equation or diagram in the methods section for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have made revisions to improve the manuscript's rigor and clarity.
Point-by-point responses
- Referee: [Abstract] Abstract: The reported performance margins (13.8% over OpenVLA, 8.8% over WorldVLA) and real-world parity with π0 are presented without error bars, statistical significance tests, ablation details, or full experimental protocol, rendering the central superiority claim only partially verifiable.
  Authors: We agree that the abstract should better support verifiability. In the revision, we have added references to error bars (from 5 random seeds), statistical significance testing (p < 0.05 for key margins), and explicit pointers to the full experimental protocol and ablation studies now detailed in Section 4 and the appendix. These updates make the performance claims more transparent without altering the reported numbers.
  Revision: yes
- Referee: [Methods] Methods (video-as-physics-proxy section): The load-bearing assumption that action-conditioned video generation faithfully transfers physical knowledge (e.g., object permanence) without measurable simulation-reality gaps is not supported by direct fidelity metrics such as trajectory accuracy or conservation-law checks on generated sequences; task success rates alone do not isolate this mechanism.
  Authors: We acknowledge the value of direct fidelity metrics to isolate the mechanism. We have added a new subsection (3.4) with qualitative video examples and basic quantitative checks (e.g., object trajectory consistency in generated sequences) demonstrating object permanence and dynamics preservation. Full trajectory accuracy and conservation-law verification across all tasks would require substantial new experiments; we note this as a limitation and clarify that downstream task success serves as the primary proxy for transferred physical knowledge.
  Revision: partial
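A basic quantitative check of the kind mentioned in the second response could look like the following. It is a minimal sketch that assumes object centroids have already been tracked in both generated and ground-truth frames (the tracker is out of scope). The metric, the function names, and the coordinates are hypothetical and not taken from the paper.

```python
import math

def mean_trajectory_error(generated, reference):
    """Mean Euclidean distance between per-frame object centroids."""
    if len(generated) != len(reference):
        raise ValueError("trajectories must have the same number of frames")
    distances = [math.dist(g, r) for g, r in zip(generated, reference)]
    return sum(distances) / len(distances)

def permanence_violations(visible_flags, occluded_flags):
    """Count frames where the object is missing without being occluded."""
    return sum(1 for seen, hidden in zip(visible_flags, occluded_flags)
               if not seen and not hidden)

# Placeholder centroids in pixel coordinates: NOT data from the paper.
reference = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
generated = [(0.0, 0.0), (3.0, 4.0), (6.0, 5.0)]
error = mean_trajectory_error(generated, reference)

# Frame 2 loses the object although nothing occludes it.
visible = [True, False, False, True]
occluded = [False, True, False, False]
violations = permanence_violations(visible, occluded)
```

Metrics of this kind score the generated video directly, which is what separates them from task success rates: a policy can succeed at a task while the generated frames still drift or drop objects.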
Circularity Check
No significant circularity; core claims rest on external pretrained video models and independent benchmarks
Full rationale
The paper's derivation treats an external pretrained video generation model as a physics proxy and introduces a multimodal continuous token representation to bridge discrete video frames with continuous robot actions. This integration does not reduce to self-definition (e.g., no quantity defined in terms of itself), fitted inputs renamed as predictions, or load-bearing self-citations. Standard techniques such as causal masking, inverse kinematics, L-MTP, and KV caching are applied without smuggling ansatzes via prior self-work. Performance is reported on external benchmarks (Libero, ManiSkill) and real-world tasks, with gains over baselines like OpenVLA not forced by internal fitting. Any self-citations are peripheral and non-load-bearing for the transfer claim, leaving the derivation self-contained against external evidence.
Axiom & Free-Parameter Ledger
invented entities (1)
- shared physical tokens: no independent evidence
Forward citations
Cited by 2 Pith papers
- Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models
  Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
- World Action Models: The Next Frontier in Embodied AI
  The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.