WALL-WM: Carving World Action Modeling at the Event Joints

Charles Yang; Chris Pan; Colin Ye; Elise Mon; Ellie Ma; Ethan Chen; Gody Li; Hang Su; Hao Wang; Howard Lu

arxiv: 2606.01955 · v1 · pith:COEFE7A3new · submitted 2026-06-01 · 💻 cs.RO · cs.CV

Shalfun Li, Victor Yao, Charles Yang, Truth Qu, Regis Cheng, Ryan Yu, Howard Lu, Newton Von, Vincent Chen, Yohann Tang, Maeve Zhang, Ellie Ma, Gody Li, Sage Yang, Lorien Shu, J.W. Gao, Ethan Chen, Colin Ye, Yu Sun, Elise Mon, PS Zhang, Neo Li, Lily Li, James Wang, Ping Yang, Chris Pan, Lucy Liang, Hang Su, Roy Gan, Hao Wang, Qian Wang

Pith reviewed 2026-06-28 14:12 UTC · model grok-4.3

keywords: world action models · event-grounded pretraining · vision-language-action · robotic generalization · semantic events · pretraining infrastructure · variable-length execution

The pith

WALL-WM uses semantic action events as the atomic unit of pretraining to fix the granularity mismatch in world action models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing world action models force language, vision, and actions into the same fixed-length chunks, which mismatches the timescale at which each modality operates. WALL-WM instead uses semantically coherent action events as the basic learning unit, supported by event-level captions and cluster-balanced sampling. The same event-pretrained backbone supports two inference modes: an event mode for variable-length execution and a unified mode for conventional fixed-length chunks. The authors report broad generalization across language, scenes, and tasks in real-world evaluations, and pair the method with Muon-optimizer-based infrastructure for large-scale pretraining.
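The contrast between fixed-length chunks and event-aligned segments can be illustrated with a small sketch. It is not the paper's pipeline: the per-step event labels, the function names, and the 12-step trajectory are all hypothetical, and the sketch assumes event labels are already available for every timestep.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # half-open [start, end) over timestep indices


def fixed_chunks(num_steps: int, chunk_len: int) -> List[Span]:
    """Split a trajectory into fixed-length windows (the last may be shorter)."""
    return [(s, min(s + chunk_len, num_steps)) for s in range(0, num_steps, chunk_len)]


def event_chunks(event_ids: Sequence[str]) -> List[Span]:
    """Split a trajectory wherever the per-step event label changes."""
    spans: List[Span] = []
    start = 0
    for t in range(1, len(event_ids) + 1):
        # Close the current span at the end of the trajectory or on a label change.
        if t == len(event_ids) or event_ids[t] != event_ids[start]:
            spans.append((start, t))
            start = t
    return spans


def straddles_boundary(span: Span, event_ids: Sequence[str]) -> bool:
    """True if a span mixes timesteps from more than one event."""
    start, end = span
    return len(set(event_ids[start:end])) > 1


# Hypothetical per-step event labels for a 12-step trajectory (illustration only).
labels = ["reach"] * 3 + ["grasp"] * 2 + ["lift"] * 7

fixed = fixed_chunks(len(labels), chunk_len=4)
events = event_chunks(labels)
mixed = [span for span in fixed if straddles_boundary(span, labels)]

print("fixed-length chunks:", fixed)
print("event-aligned chunks:", events)
print("fixed chunks that mix events:", mixed)
```

In this toy trajectory, two of the three fixed windows cut across an event boundary, while the event-aligned spans vary in length and each covers exactly one event. That is the granularity mismatch the summary describes, reduced to index arithmetic.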

Core claim

What carries the argument

semantically coherent action events as the atomic unit of learning in event-grounded VLA pretraining

If this is right

Supports variable-length execution chunks conditioned on next-event descriptions
Enables conventional fixed-length chunk inference via unified mode with staircase decoding
Provides a practical scale-up recipe for general-purpose WAMs using large-scale pretraining infrastructure
Achieves state-of-the-art performance across language, scenes, and tasks in real-world generalization evaluation

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The event organization could extend to other sequential prediction tasks where semantic boundaries align better with goals than fixed time windows.
Cluster-balanced sampling might help mitigate long-tail issues in behavior datasets beyond robotics.
The dual inference modes suggest potential for adaptive systems that choose execution granularity based on task demands.
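One common way to realize cluster-balanced sampling is inverse-frequency weighting, sketched below. This is an assumption about what such sampling could look like, not the paper's recipe: the abstract does not say how clusters are formed or weighted, and the cluster names and helper functions here are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def cluster_balanced_weights(cluster_ids: Sequence[Hashable]) -> List[float]:
    """Per-item sampling weights that give every cluster the same total mass."""
    counts = Counter(cluster_ids)
    num_clusters = len(counts)
    # Each item gets 1 / (number of clusters * size of its own cluster).
    return [1.0 / (num_clusters * counts[c]) for c in cluster_ids]


def sample_indices(cluster_ids: Sequence[Hashable], k: int, seed: int = 0) -> List[int]:
    """Draw k item indices with replacement under cluster-balanced weights."""
    rng = random.Random(seed)
    weights = cluster_balanced_weights(cluster_ids)
    return rng.choices(range(len(cluster_ids)), weights=weights, k=k)


# Hypothetical long-tailed cluster assignment: 8 "pick", 1 "pour", 1 "fold".
clusters = ["pick"] * 8 + ["pour"] + ["fold"]
weights = cluster_balanced_weights(clusters)

# Total sampling mass per cluster; each should be about 1/3 here.
mass: Dict[Hashable, float] = {}
for cluster, weight in zip(clusters, weights):
    mass[cluster] = mass.get(cluster, 0.0) + weight

print(mass)
print(sample_indices(clusters, k=5))
```

Under uniform sampling the "pick" cluster would supply 80% of draws. With these weights each cluster contributes roughly a third, which is the long-tail mitigation the bullet above speculates about.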

Load-bearing premise

That semantically coherent action events can be reliably identified, captioned, and used as the atomic unit of supervision at scale without introducing selection bias or requiring post-hoc adjustments that affect the reported generalization gains.

What would settle it

Running the same large-scale real-world generalization evaluation but with supervision based on fixed-length segments instead of event captions, and observing no performance gains over chunk-centric baselines.

Original abstract

WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch. Language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales; forcing all three into the same fixed-length prediction window turns VLA training into short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enables variable-length execution chunks, while the unified mode uses a VLM with Staircase Decoding to condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with Muon-optimizer-based large-scale pretraining infrastructure, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

The paper reframes VLA pretraining around semantic action events instead of fixed chunks to fix a timescale mismatch, but the abstract gives no numbers or methods to check if the fix actually works.

The main thing here is the shift to event-grounded supervision for world action models. Instead of training on fixed-length action chunks, WALL-WM organizes pretraining around semantically coherent events, with event captions and cluster-balanced sampling to build the data. From that backbone it offers an event mode for variable-length execution and a unified mode that keeps standard chunk inference via Staircase Decoding.

That formulation targets a genuine problem: language talks about goals, vision runs continuously, and actions need fine control, so lumping them into the same window often reduces to short-horizon correlation. The event unit is a reasonable attempt to align the three.

The abstract claims broad generalization across language, scenes, and tasks plus SOTA on large-scale real-world evaluation. No metrics, baselines, or ablation details appear, so those claims stay unverified. The weakest link is the assumption that events can be identified and captioned reliably at scale without selection bias creeping into the reported gains; nothing in the text shows how that pipeline is built or validated.

This is for researchers already working on scalable VLA models who want to try alternatives to chunk-centric training. A reader looking for a new organizing principle might find the setup useful even before the numbers are checked.

The work deserves a serious referee once the full methods and results are in hand, because the core idea is coherent and the mismatch it names is real.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces WALL-WM, a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit. It pairs this with event-level captions, cluster-balanced sampling, two inference modes (event mode for variable-length chunks and unified mode with Staircase Decoding for fixed-length inference), and Muon-optimizer-based pretraining infrastructure. The central claim is that this resolves granularity mismatches and enables broad generalization across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.

Significance. If the experimental claims hold, the event-grounded formulation could provide a scalable alternative to fixed-chunk VLA training by better aligning supervision with semantic structure, potentially improving generalization in robotics applications. The dual inference modes and data ecosystem represent a methodological contribution that, if validated with rigorous evidence, would be of interest to the robotics and multimodal learning communities.

major comments (1)

[Abstract] Abstract: the claim that WALL-WM 'achieves state-of-the-art performance in large-scale real-world generalization evaluation' is unsupported by any metrics, baselines, dataset descriptions, ablation results, or evaluation protocol details. Without this evidence the central generalization claim cannot be assessed.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and the recommendation for major revision. We address the single major comment below.

Referee: [Abstract] Abstract: the claim that WALL-WM 'achieves state-of-the-art performance in large-scale real-world generalization evaluation' is unsupported by any metrics, baselines, dataset descriptions, ablation results, or evaluation protocol details. Without this evidence the central generalization claim cannot be assessed.

Authors: We agree that an abstract claim of this strength should be backed by visible evidence. The manuscript contains a dedicated Experiments section that reports the metrics from the large-scale real-world generalization evaluation, the baselines used, dataset descriptions, ablation results on event grounding and sampling, and the full evaluation protocol. To make this support explicit at the abstract level, we will revise the abstract to include the key quantitative results and a concise reference to the evaluation setup. These changes will appear in the revised manuscript.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation chain absent from text

The supplied abstract and manuscript description contain no equations, parameter-fitting procedures, uniqueness theorems, or derivation steps of any kind. All claims are high-level architectural and empirical (event-grounded pretraining, Staircase Decoding, cluster-balanced sampling, generalization results) with no reduction of a 'prediction' to fitted inputs or self-citation chains. The central modeling shift is presented as a design choice rather than a derived necessity, leaving the result self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be extracted or audited.

pith-pipeline@v0.9.1-grok · 5895 in / 1063 out tokens · 20487 ms · 2026-06-28T14:12:53.172496+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining
cs.CV · 2026-06 · unverdicted · novelty 7.0

X-Tokenizer creates semantic action tokens via asymmetric residual quantization and contrastive pretraining on large trajectory data, outperforming prior methods like FAST on robotic tasks.

MemoryWAM: Efficient World Action Modeling with Persistent Memory
cs.RO · 2026-06 · unverdicted · novelty 4.0

MemoryWAM is a world action model with a hybrid memory design using recent frames, anchor frames, and gist tokens for efficient long-horizon robotic manipulation.

World Action Models: A Survey
cs.RO · 2026-06 · unverdicted · novelty 3.0

A survey that clarifies boundaries and organizes World Action Models by generation requirements and predictive substrates, identifying a trend toward generating less of the future.

Reference graph

Works this paper leans on

96 extracted references · 49 linked inside Pith · cited by 3 Pith papers

[1]

Claude 3.7 sonnet.https://www.anthropic.com/news/claude-3-7-sonnet, 2025

Anthropic. Claude 3.7 sonnet.https://www.anthropic.com/news/claude-3-7-sonnet, 2025

2025
[2]

V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

Pith/arXiv arXiv 2025
[3]

Latent reasoning vla: Latent thinking and prediction for vision-language-action models.arXiv preprint arXiv:2602.01166, 2026

Shuanghao Bai, Jing Lyu, Wanqi Zhou, Zhe Li, Dakai Wang, Lei Xing, Xiaoguang Zhao, Pengwei Wang, Zhongyuan Wang, Cheng Chi, et al. Latent reasoning vla: Latent thinking and prediction for vision-language-action models.arXiv preprint arXiv:2602.01166, 2026

Pith/arXiv arXiv 2026
[4]

V-JEPA: Latent video prediction for visual representation learning, 2024.https://openreview.net/ forum?id=WFYbBOEOtv

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas. V-JEPA: Latent video prediction for visual representation learning, 2024.https://openreview.net/ forum?id=WFYbBOEOtv

2024
[5]

Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation.arXiv preprint arXiv:2409.16283, 2024

Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, and Sean Kirmani. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation.arXiv preprint arXiv:2409.16283, 2024

Pith/arXiv arXiv 2024
[6]

Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation

Homanga Bharadhwaj, Roozbeh Mottaghi, Abhinav Gupta, and Shubham Tulsiani. Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation. InEuropean Conference on Computer Vision, pages 306–324. Springer, 2024

2024
[7]

Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030, 2025

HongzheBi, HengkaiTan, ShenghaoXie, ZeyuanWang, ShuheHuang, HaitianLiu, RuowenZhao, YaoFeng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030, 2025

Pith/arXiv arXiv 2025
[8]

Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

Pith/arXiv arXiv 2025
[9]

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.𝜋0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024
[10]

Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems

Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025

Pith/arXiv arXiv 2025
[11]

Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025

Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025

Pith/arXiv arXiv 2025
[12]

Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025

Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Yuming Jiang, Jiayan Guo, Xin Li, Hao Luo, et al. Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025

Pith/arXiv arXiv 2025
[13]

Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

Pith/arXiv arXiv 2025
[14]

Gr-3 technical report.arXiv preprint arXiv:2507.15493, 2025

Chilam Cheang, Sijin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Xiao Ma, et al. Gr-3 technical report.arXiv preprint arXiv:2507.15493, 2025

Pith/arXiv arXiv 2025
[15]

Moto: Latent motion token as the bridging language for learning robot manipulation from videos

Yi Chen, Yuying Ge, Weiliang Tang, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, and Xihui Liu. Moto: Latent motion token as the bridging language for learning robot manipulation from videos. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19752–19763, 2025

2025
[16]

Diffusion policy: Visuomotor policy learning via action diffusion

Cheng Chi, Zhenjia Xu, Brian Ichter, and Avinash Shah. Diffusion policy: Visuomotor policy learning via action diffusion. InRobotics: Science and Systems (RSS), 2023

2023
[17]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

Pith/arXiv arXiv 2025
[18]

Rescalingegocentricvision: Collection, pipeline and challenges for epic-kitchens-100.International Journal of Computer Vision, 130:33–55, 2022

DimaDamen, HazelDoughty, GiovanniMariaFarinella, AntoninoFurnari, etal. Rescalingegocentricvision: Collection, pipeline and challenges for epic-kitchens-100.International Journal of Computer Vision, 130:33–55, 2022

2022
[19]

Knowledge insulating vision-language-action models: Train fast, run fast, generalize better.Advances in Neural Information Processing Systems, 38:102867–102888, 2026

Danny Driess, Jost Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, et al. Knowledge insulating vision-language-action models: Train fast, run fast, generalize better.Advances in Neural Information Processing Systems, 38:102867–102888, 2026

2026
[20]

Learning universal policies via text-guided video generation.Advances in neural information processing systems, 36: 9156–9172, 2023

Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation.Advances in neural information processing systems, 36: 9156–9172, 2023

2023
[21]

Tenenbaum, Leslie Pack Kaelbling, Andy Zeng, and Jonathan Tompson

Yilun Du, Sherry Yang, Pete Florence, Fei Xia, Ayzaan Wahid, brian ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum, Leslie Pack Kaelbling, Andy Zeng, and Jonathan Tompson. Video language planning. In The Twelfth International Conference on Learning Representations, 2024.https://openreview.net/forum?id= 9pKtcJcMP3

2024
[22]

Vidar: Embodied video diffusion model for generalist manipulation.arXiv preprint arXiv:2507.12898, 2025

Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, and Jun Zhu. Vidar: Embodied video diffusion model for generalist manipulation.arXiv preprint arXiv:2507.12898, 2025. 41

Pith/arXiv arXiv 2025
[23]

Adaworld: Learning adaptable world models with latent actions.arXiv preprint arXiv:2503.18938, 2025

Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, and Chuang Gan. Adaworld: Learning adaptable world models with latent actions.arXiv preprint arXiv:2503.18938, 2025

arXiv 2025
[24]

Think before you speak: Training language models with pause tokens

Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Sharan Narang. Think before you speak: Training language models with pause tokens. InInternational Conference on Learning Representations, 2024

2024
[25]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

2022
[26]

Prediction with action: Visual policy learning via joint denoising process.Advances in Neural Information Processing Systems, 37: 112386–112410, 2024

Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, and Jianyu Chen. Prediction with action: Visual policy learning via joint denoising process.Advances in Neural Information Processing Systems, 37: 112386–112410, 2024

2024
[27]

Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019

Pith/arXiv arXiv 1912
[28]

Learning latent dynamics for planning from pixels

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. InInternational conference on machine learning, pages 2555–2565. PMLR, 2019

2019
[29]

Mastering diverse domains through world models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

Pith/arXiv arXiv 2023
[30]

Training large language models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769, 2024

Shibo Hao, Sainbayar Gu, Haotian Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Training large language models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769, 2024

Pith/arXiv arXiv 2024
[31]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. InNeurIPS, 2020

2020
[32]

Video prediction policy: A generalist robot policy with predictive visual representations

Yucheng Hu et al. Video prediction policy: A generalist robot policy with predictive visual representations. In Proceedings of the 42nd International Conference on Machine Learning, 2025

2025
[33]

Ladi-wm: A latent diffusion-based world model for predictive manipulation.arXiv preprint arXiv:2505.11528, 2025

Yuhang Huang, Jiazhao Zhang, Shilong Zou, Xinwang Liu, Ruizhen Hu, and Kai Xu. Ladi-wm: A latent diffusion-based world model for predictive manipulation.arXiv preprint arXiv:2505.11528, 2025

arXiv 2025
[35]

How much 3d do video foundation models encode?arXiv preprint arXiv:2512.19949, 2025

Zixuan Huang, Xiang Li, Zhaoyang Lv, and James M Rehg. How much 3d do video foundation models encode?arXiv preprint arXiv:2512.19949, 2025

arXiv 2025
[36]

arXiv preprint arXiv:2504.16054, 2025

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al.𝑝𝑖0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

Pith/arXiv arXiv 2025
[37]

Dreamgen: Unlocking generalization in robot learning through video world models

Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models. arXiv preprint arXiv:2505.12705, 2025

Pith/arXiv arXiv 2025
[38]

Ladir: Latent diffusion enhances llms for text reasoning.arXiv preprint arXiv:2510.04573, 2025

Haoqiang Kang, Yizhe Zhang, Nikki Lijing Kuang, Nicklas Majamaki, Navdeep Jaitly, Yi-An Ma, and Lianhui Qin. Ladir: Latent diffusion enhances llms for text reasoning.arXiv preprint arXiv:2510.04573, 2025

Pith/arXiv arXiv 2025
[39]

Beyond mode elicitation: Diversity- preserving reinforcement learning via latent diffusion reasoner.arXiv preprint arXiv:2602.01705, 2026

Haoqiang Kang, Yizhe Zhang, Nikki Lijing Kuang, Yi-An Ma, and Lianhui Qin. Beyond mode elicitation: Diversity- preserving reinforcement learning via latent diffusion reasoner.arXiv preprint arXiv:2602.01705, 2026

Pith/arXiv arXiv 2026
[40]

Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

Pith/arXiv arXiv 2024
[41]

Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

Pith/arXiv arXiv 2024
[42]

Cosmos policy: Fine-tuning video models for visuomotor control and planning.arXiv preprint arXiv:2601.16163, 2026

Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning.arXiv preprint arXiv:2601.16163, 2026

Pith/arXiv arXiv 2026
[43]

Tenenbaum

Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, and Joshua B. Tenenbaum. Learning to act from actionless videos through dense correspondences. InProceedings of the International Conference on Learning Representations (ICLR), 2024

2024
[44]

Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

Pith/arXiv arXiv 2026
[45]

Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024

Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024

Pith/arXiv arXiv 2024
[46]

Video generators are robot policies.arXiv preprint arXiv:2508.00795, 2025

Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl Vondrick. Video generators are robot policies.arXiv preprint arXiv:2508.00795, 2025. 42

Pith/arXiv arXiv 2025
[47]

Genie envisioner: A unified world foundation platform for robotic manipulation.arXiv preprint arXiv:2508.05635, 2025

Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, et al. Genie envisioner: A unified world foundation platform for robotic manipulation.arXiv preprint arXiv:2508.05635, 2025

Pith/arXiv arXiv 2025
[48]

Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

Yaron Lipman, Ricky Chen, Heli Ben-Hamu, Max Nickel, and Manzil Zaheer Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

Pith/arXiv arXiv 2022
[49]

Last0: Latent spatio-temporal chain-of-thought for robotic vision-language-action model.arXiv preprint arXiv:2601.05248, 2026

Zhuoyang Liu, Jiaming Liu, Hao Chen, Jiale Yu, Ziyu Guo, Chengkai Hou, Chenyang Gu, Xiangju Mi, Renrui Zhang, Kun Wu, Zhengping Che, Jian Tang, Pheng-Ann Heng, and Shanghang Zhang. Last0: Latent spatio-temporal chain-of-thought for robotic vision-language-action model.arXiv preprint arXiv:2601.05248, 2026

arXiv 2026
[50]

Onevl: One-step latent reasoning and planning with vision-language explanation.arXiv preprint arXiv:2604.18486, 2026

Jinghui Lu, Jiayi Guan, Zhijian Huang, Jinlong Li, Guang Li, Lingdong Kong, Yingyan Li, Han Wang, Shaoqing Xu, Yuechen Luo, et al. Onevl: One-step latent reasoning and planning with vision-language explanation.arXiv preprint arXiv:2604.18486, 2026

Pith/arXiv arXiv 2026
[51]

Solving new tasks by adapting internet video knowledge.arXiv preprint arXiv:2504.15369, 2025

Calvin Luo, Zilai Zeng, Yilun Du, and Chen Sun. Solving new tasks by adapting internet video knowledge.arXiv preprint arXiv:2504.15369, 2025

arXiv 2025
[52]

Last-vla: Thinking in latent spatio-temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928, 2026

Yuechen Luo, Fang Li, Shaoqing Xu, Yang Ji, Zehan Zhang, Bing Wang, Yuannan Shen, Jianwei Cui, Long Chen, Guang Chen, Hangjun Ye, Zhi-Xin Yang, and Fuxi Wen. Last-vla: Thinking in latent spatio-temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928, 2026

arXiv 2026
[53]

Lda-1b: Scaling latent dynamics action model via universal embodied data ingestion.arXiv preprint arXiv:2602.12215, 2026

Jiangran Lyu, Kai Liu, Xuheng Zhang, Haoran Liao, Yusen Feng, Wenxuan Zhu, Tingrui Shen, Jiayi Chen, Jiazhao Zhang, Yifei Dong, et al. Lda-1b: Scaling latent dynamics action model via universal embodied data ingestion.arXiv preprint arXiv:2602.12215, 2026

Pith/arXiv arXiv 2026
[54]

Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels.arXiv preprint arXiv:2603.19312, 2026

Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels.arXiv preprint arXiv:2603.19312, 2026

Pith/arXiv arXiv 2026
[55]

Openvid-1m: A large-scale high-quality dataset for text-to-video generation

Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation. InInternational Conference on Learning Representations, 2025

2025
[56]

Spear-1: Scaling beyond robot demonstrations via 3d understanding.arXiv preprint arXiv:2511.17411, 2025

Nikolay Nikolov, Giuliano Albanese, Sombit Dey, Aleksandar Yanev, Luc Van Gool, Jan-Nico Zaech, and Danda Pani Paudel. Spear-1: Scaling beyond robot demonstrations via 3d understanding.arXiv preprint arXiv:2511.17411, 2025

Pith/arXiv arXiv 2025
[57]

Introducing gpt-5.https://openai.com/index/introducing-gpt-5-for-developers/, 2025

OpenAI. Introducing gpt-5.https://openai.com/index/introducing-gpt-5-for-developers/, 2025

2025
[58]

Dinov2: Learning robust visual features without supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

Pith/arXiv arXiv 2023
[59]

mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint arXiv:2512.15692, 2025

Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint arXiv:2512.15692, 2025

Pith/arXiv arXiv 2025
[60] Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025
[61] Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830, 2025
[62] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020
[63] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10901–10911, 2021
[64] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022
[65] Chenguo Shang et al. Worldarena: Benchmarking embodied video generation models as world simulators. arXiv preprint arXiv:2602.08971, 2026
[66] Mike Zheng Shou, Stan Weixian Lei, Weiyao Wang, Deepti Ghadiyaram, and Matt Feiszli. Generic event boundary detection: A benchmark for event segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8075–8084, 2021
[67] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024
[68] GigaBrain Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jie Li, Jiagang Zhu, Lv Feng, et al. Gigabrain-0: A world model-powered vision-language-action model. arXiv preprint arXiv:2510.19430, 2025
[69] MotuBrain Team, Chendong Xiang, Fan Bao, Haitian Liu, Hengkai Tan, Hongzhe Bi, James Li, Jiabao Liu, Jingrui Pang, Kiro Jing, et al. Motubrain: An advanced world action model for robot control. arXiv preprint arXiv:2604.27792, 2026
[70] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024
[71] Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. https://qwen.ai/blog?id=qwen3.5
[72] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025
[73] Junming Wang, Teng Pu, Wingmun Fung, Jindong Wang, Shanchang Wang, Yuan Deng, Shuyuan Wang, Ziwei Liu, Kunhao Pan, Ping Yang, et al. Xrzero-g0: Pushing the frontier of dexterous robotic manipulation with interfaces, quality and ratios. arXiv preprint arXiv:2604.13001, 2026
[74] Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng. Dexvla: Vision-language model with plug-in diffusion expert for general robot control. arXiv preprint arXiv:2502.05855, 2025
[75] Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Zhibin Tang, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters, 2025
[76] John Won, Kyungmin Lee, Huiwon Jang, Dongyoung Kim, and Jinwoo Shin. Dual-stream diffusion for world-model augmented vision-language-action model. arXiv preprint arXiv:2510.27607, 2025
[77] Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, et al. A pragmatic vla foundation model. arXiv preprint arXiv:2601.18692, 2026
[78] Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, Xiao Tan, and Xiang Bai. Generation models know space: Unleashing implicit 3d priors for scene understanding. arXiv preprint arXiv:2603.19235, 2026
[79] Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. Advancing high-resolution video-language representation with large-scale video transcriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5036–5045, 2022
[80] An Yang, Baosong Yang, Beichen Zhang, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024. https://arxiv.org/abs/2412.15115
[81] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In International Conference on Learning Representations, volume 2025, pages 83048–83077, 2025

[52] [53]

Lda-1b: Scaling latent dynamics action model via universal embodied data ingestion.arXiv preprint arXiv:2602.12215, 2026

Jiangran Lyu, Kai Liu, Xuheng Zhang, Haoran Liao, Yusen Feng, Wenxuan Zhu, Tingrui Shen, Jiayi Chen, Jiazhao Zhang, Yifei Dong, et al. Lda-1b: Scaling latent dynamics action model via universal embodied data ingestion.arXiv preprint arXiv:2602.12215, 2026

Pith/arXiv arXiv 2026

[53] [54]

Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels.arXiv preprint arXiv:2603.19312, 2026

Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels.arXiv preprint arXiv:2603.19312, 2026

Pith/arXiv arXiv 2026

[54] [55]

Openvid-1m: A large-scale high-quality dataset for text-to-video generation

Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation. InInternational Conference on Learning Representations, 2025

2025

[55] [56]

Spear-1: Scaling beyond robot demonstrations via 3d understanding.arXiv preprint arXiv:2511.17411, 2025

Nikolay Nikolov, Giuliano Albanese, Sombit Dey, Aleksandar Yanev, Luc Van Gool, Jan-Nico Zaech, and Danda Pani Paudel. Spear-1: Scaling beyond robot demonstrations via 3d understanding.arXiv preprint arXiv:2511.17411, 2025

Pith/arXiv arXiv 2025

[56] [57]

Introducing gpt-5.https://openai.com/index/introducing-gpt-5-for-developers/, 2025

OpenAI. Introducing gpt-5.https://openai.com/index/introducing-gpt-5-for-developers/, 2025

2025

[57] [58]

Dinov2: Learning robust visual features without supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

Pith/arXiv arXiv 2023

[58] [59]

mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint arXiv:2512.15692, 2025

Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint arXiv:2512.15692, 2025

Pith/arXiv arXiv 2025

[59] [60]

Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

Pith/arXiv arXiv 2025

[60] [61]

Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint arXiv:2501.15830, 2025

Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint arXiv:2501.15830, 2025

Pith/arXiv arXiv 2025

[61] [62]

Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learning Research, 21(140):1–67, 2020

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learning Research, 21(140):1–67, 2020

2020

[62] [63]

Common objectsin3d: Large-scalelearningandevaluationofreal-life3dcategoryreconstruction

Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objectsin3d: Large-scalelearningandevaluationofreal-life3dcategoryreconstruction. InProceedingsoftheIEEE/CVF International Conference on Computer Vision, pages 10901–10911, 2021

2021

[63] [64]

Progressive distillation for fast sampling of diffusion models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. InInternational Conference on Learning Representations, 2022

2022

[64] [65]

Worldarena: Benchmarking embodied video generation models as world simulators.arXiv preprint arXiv:2602.08971, 2026

Chenguo Shang et al. Worldarena: Benchmarking embodied video generation models as world simulators.arXiv preprint arXiv:2602.08971, 2026

arXiv 2026

[65] [66]

Generic event boundary detection: A benchmark for event segmentation

Mike Zheng Shou, Stan Weixian Lei, Weiyao Wang, Deepti Ghadiyaram, and Matt Feiszli. Generic event boundary detection: A benchmark for event segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8075–8084, 2021

2021

[66] [67]

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

2024

[67] [68]

Gigabrain-0: A world model-powered vision-language-action model.arXiv preprint arXiv:2510.19430, 2025

GigaBrain Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jie Li, Jiagang Zhu, Lv Feng, et al. Gigabrain-0: A world model-powered vision-language-action model.arXiv preprint arXiv:2510.19430, 2025

arXiv 2025

[68] [69]

Motubrain: An advanced world action model for robot control.arXiv preprint arXiv:2604.27792, 2026

MotuBrain Team, Chendong Xiang, Fan Bao, Haitian Liu, Hengkai Tan, Hongzhe Bi, James Li, Jiabao Liu, Jingrui Pang, Kiro Jing, et al. Motubrain: An advanced world action model for robot control.arXiv preprint arXiv:2604.27792, 2026. 43

Pith/arXiv arXiv 2026

[69] [70]

Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

Pith/arXiv arXiv 2024

[70] [71]

Qwen3.5: Accelerating productivity with native multimodal agents, February 2026.https://qwen

Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026.https://qwen. ai/blog?id=qwen3.5

2026

[71] [72]

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

[72] [73]

Junming Wang, Teng Pu, Wingmun Fung, Jindong Wang, Shanchang Wang, Yuan Deng, Shuyuan Wang, Ziwei Liu, Kunhao Pan, Ping Yang, et al. Xrzero-g0: Pushing the frontier of dexterous robotic manipulation with interfaces, quality and ratios. arXiv preprint arXiv:2604.13001, 2026

[73] [74]

Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng. Dexvla: Vision-language model with plug-in diffusion expert for general robot control. arXiv preprint arXiv:2502.05855, 2025

[74] [75]

Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Zhibin Tang, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters, 2025

[75] [76]

John Won, Kyungmin Lee, Huiwon Jang, Dongyoung Kim, and Jinwoo Shin. Dual-stream diffusion for world-model augmented vision-language-action model. arXiv preprint arXiv:2510.27607, 2025

[76] [77]

Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, et al. A pragmatic vla foundation model. arXiv preprint arXiv:2601.18692, 2026

[77] [78]

Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, Xiao Tan, and Xiang Bai. Generation models know space: Unleashing implicit 3d priors for scene understanding. arXiv preprint arXiv:2603.19235, 2026

[78] [79]

Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. Advancing high-resolution video-language representation with large-scale video transcriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5036–5045, 2022

[79] [80]

An Yang, Baosong Yang, Beichen Zhang, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024. https://arxiv.org/abs/2412.15115

[80] [81]

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In International Conference on Learning Representations, volume 2025, pages 83048–83077, 2025