World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

Boyuan Zhao; Jianbo Zhou; Pengfei Liu; Siqi Kou; Xiao Xia; Xueqi Li; Yanzhe Hu; Yi Yang; Yiyang Chen; Zhihong Liu

arxiv: 2606.05979 · v1 · pith:OQ65H3OGnew · submitted 2026-06-04 · 💻 cs.RO · cs.AI

World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

Yi Yang , Zhihong Liu , Siqi Kou , Yiyang Chen , Yanzhe Hu , Jianbo Zhou , Boyuan Zhao , Zhijie Wei

show 4 more authors

Xiao Xia Xueqi Li Pengfei Liu Zhijie Deng

This is my paper

Pith reviewed 2026-06-28 01:02 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords world-language-action modelembodied foundation modelsworld modelingrobot action synthesisautoregressive transformermeta-querieslong-horizon taskscross-embodiment learning

0 comments

The pith

WLA models jointly predict textual subtasks, subgoal images, and robot actions from instructions, images, and states via an autoregressive transformer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes world-language-action (WLA) models as a new class of embodied foundation models. WLA combines world modeling from egocentric videos with language reasoning capacities to handle complex long-horizon robot tasks. It uses an autoregressive transformer to forecast the next state as both semantic text and fine-grained physical dynamics, with dynamics supervised by a dedicated World Expert to support action prediction. Meta-queries let world prediction influence actions implicitly but can be disabled at inference, while the approach also supports test-time scaling and learning from cross-embodiment videos without action labels. The WLA-0 prototype reports 92.94% success on RoboTwin2.0 Clean and 56.5% on RMBench.

Core claim

WLA takes textual instructions, images, and robot states as inputs to jointly predict textual subtasks, subgoal images, and robot actions. At its core is an autoregressive Transformer backbone that predicts the next state comprising semantic-level textual intention and complementary fine-grained physical dynamics. The physical dynamics receive supervision from a dedicated World Expert to ease state-action correlation for the Action Expert, and meta-queries make world prediction implicitly impact action generation so the former can be disabled during inference while remaining available for test-time scaling.

What carries the argument

Autoregressive Transformer backbone that predicts the next state as semantic textual intention plus fine-grained physical dynamics, supervised by a World Expert, with meta-queries to allow implicit world influence on action generation.

If this is right

WLA-0 achieves 92.94% success rate on RoboTwin2.0 Clean.
WLA-0 achieves 56.5% success rate on RMBench.
World prediction can be activated at test time to enable scaling for improved robot control.
The model supports learning novel tasks directly from cross-embodiment robot videos without action annotations.
WLA-0 runs at 40 ms per inference on an NVIDIA RTX 5090 with 2B active parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The meta-query mechanism could be adapted to other multimodal prediction settings where one modality should influence another only during training.
Success on cross-embodiment video data suggests the architecture may reduce reliance on expensive robot-specific action labels when collecting training data.
Activating world prediction at test time might yield larger gains on tasks longer than those evaluated in the reported benchmarks.
The separation of World Expert and Action Expert could allow independent scaling or replacement of each component in future versions.

Load-bearing premise

Supervision of physical dynamics by a dedicated World Expert eases state-action correlation for the Action Expert, and meta-queries allow world prediction to implicitly impact action generation while remaining safely disabled at inference.

What would settle it

Train an otherwise identical model without World Expert supervision or without meta-queries and measure whether success rates fall below 92.94% on RoboTwin2.0 Clean or 56.5% on RMBench.

read the original abstract

We propose world-language-action (WLA) models as a new class of embodied foundation models. WLA takes textual instructions, images, and robot states as inputs to jointly predict textual subtasks, subgoal images, and robot actions, conjoining the \emph{world modeling interface} to learn from extensive egocentric videos as in the world-action model (WAM) and the \emph{language reasoning} capacities to solve complex long-horizon tasks as in vision-language-action (VLA) models. At the core of WLA lies an \emph{autoregressive (AR)} Transformer backbone, instead of a bidirectional diffusion Transformer as in WAMs, to predict the \emph{next state}, comprising the \emph{semantic-level} textual intention and complementary \emph{fine-grained} physical dynamics. The physical dynamics are supervised by the world modeling objective based on a dedicated World Expert, and are leveraged to ease the characterization of the state-action correlation for the Action Expert. WLA leverages meta-queries to make the world prediction \emph{implicitly} impact the action generation so that the former can be disabled during inference. The world prediction can also be activated to enable test-time scaling for improved robot control. Our WLA-0 prototype, with 2B active parameters, achieves 40 ms per inference on an NVIDIA RTX 5090. Evaluations across simulated and real-world environments demonstrate that WLA-0 achieves state-of-the-art multi-task and long-horizon learning abilities, e.g., 92.94\% success rate on RoboTwin2.0 Clean and 56.5\% success rate on RMBench. WLA-0 also holds the promise to learn novel tasks directly from \emph{cross-embodiment robot videos} without action annotations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

WLA proposes an AR transformer with World Expert and meta-queries to unify world modeling and action synthesis, but the SOTA numbers have no visible experimental support.

read the letter

The paper's main move is an autoregressive transformer that predicts semantic text, subgoal images, and robot actions in one sequence. It adds a World Expert to supervise physical dynamics and meta-queries so the world modeling can shape action outputs without running at inference time.

This setup is presented as a way to combine the video-based world modeling from WAM lines with the language-conditioned control from VLA models. The concrete design choices around the meta-queries and the switch to AR from diffusion are the clearest technical differences from the cited prior work. The 2B model running at 40 ms on an RTX 5090 is a practical detail that shows attention to deployment.

The framing around long-horizon tasks and learning from cross-embodiment video without action labels is reasonable and matches real needs in robotics. The hope that world modeling supervision will help the Action Expert characterize state-action correlations is a plausible hypothesis on paper.

The problem is that none of the performance claims can be checked. The abstract lists 92.94% on RoboTwin2.0 Clean and 56.5% on RMBench as SOTA but gives no baselines, no ablations on the World Expert or meta-queries, no training procedure, no dataset splits, and no error bars. Without those, there is no way to know whether the joint objective produces the gains or whether other factors are at work. The weakest assumption in the abstract therefore stays untested.

This is for people already working on embodied foundation models who want to see how an AR backbone might replace diffusion approaches. A reader looking for reproducible evidence or controlled comparisons will find the current version thin.

It should go to peer review because the unification goal is relevant and the architecture is specific enough to discuss, but the authors need to supply the missing experimental details before the claims can be evaluated.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes World-Language-Action (WLA) models as a new class of embodied foundation models. These take textual instructions, images, and robot states as input and use an autoregressive Transformer to jointly predict textual subtasks, subgoal images, and robot actions. A dedicated World Expert supervises physical dynamics to aid the Action Expert, while meta-queries allow world predictions to implicitly influence action generation (and can be disabled at inference or activated for test-time scaling). The 2B-parameter WLA-0 prototype is reported to achieve SOTA results including 92.94% success on RoboTwin2.0 Clean and 56.5% on RMBench, with potential for cross-embodiment learning from videos without action labels.

Significance. If the joint modeling objective and meta-query mechanism demonstrably improve long-horizon performance and cross-embodiment generalization beyond existing VLA and WAM approaches, the work could advance unified embodied foundation models. The reported inference speed (40 ms on RTX 5090) and the ability to disable world prediction at inference are practical strengths if substantiated.

major comments (2)

[Abstract] Abstract: The central performance claims (92.94% success on RoboTwin2.0 Clean; 56.5% on RMBench) are presented as SOTA without any baseline comparisons, ablation studies, dataset splits, error bars, or statistical tests. This directly undermines evaluation of whether the joint world-language-action objective produces the reported gains.
[Abstract] Abstract (core of WLA paragraph): The claim that World Expert supervision eases state-action correlation for the Action Expert, and that meta-queries enable implicit world influence while remaining safely disabled at inference, is asserted without any supporting derivation, loss formulation, or empirical ablation in the visible text; these are load-bearing for the architecture's novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point-by-point below, providing clarifications from the full paper and indicating revisions to improve the abstract's self-contained presentation of claims.

read point-by-point responses

Referee: [Abstract] Abstract: The central performance claims (92.94% success on RoboTwin2.0 Clean; 56.5% on RMBench) are presented as SOTA without any baseline comparisons, ablation studies, dataset splits, error bars, or statistical tests. This directly undermines evaluation of whether the joint world-language-action objective produces the reported gains.

Authors: The abstract serves as a concise summary and space limits preclude full experimental details. The complete manuscript provides baseline comparisons against prior VLA and WAM methods in Tables 1 and 2, ablation studies isolating the joint objective in Section 4.3, dataset splits and preprocessing in Section 4.1, and results aggregated over multiple random seeds with error bars and statistical tests in Section 4.2. We will revise the abstract to add a brief qualifier such as '(surpassing prior approaches; see Section 4 for comparisons and ablations)' to better link the claims to the supporting evidence. revision: yes
Referee: [Abstract] Abstract (core of WLA paragraph): The claim that World Expert supervision eases state-action correlation for the Action Expert, and that meta-queries enable implicit world influence while remaining safely disabled at inference, is asserted without any supporting derivation, loss formulation, or empirical ablation in the visible text; these are load-bearing for the architecture's novelty.

Authors: The abstract summarizes the architecture at a high level. The full manuscript details the World Expert loss formulation and its role in easing state-action correlation in Section 3.2 (Equations 2–4), derives the meta-query mechanism for implicit influence (with the option to disable at inference) in Section 3.3, and provides empirical ablations on these components (including test-time activation and cross-embodiment results) in Section 4.4. We will revise the abstract to include a short parenthetical reference to these sections for improved traceability while preserving brevity. revision: partial

Circularity Check

0 steps flagged

No circularity: architecture proposal with empirical claims lacks derivations or self-referential reductions

full rationale

The paper presents WLA as a new model class combining world modeling, language reasoning, and action synthesis via an AR Transformer backbone, World Expert supervision, and meta-queries. No equations, derivations, or first-principles results are described that could reduce to inputs by construction. Performance claims (e.g., success rates on RoboTwin2.0 and RMBench) are empirical and rest on the described architecture and training rather than any fitted parameter renamed as prediction or self-citation chain. Prior models (WAM, VLA) are referenced as inspiration but not as load-bearing self-citations justifying uniqueness. The design choices (e.g., meta-queries enabling implicit impact while disabled at inference) are presented as engineering decisions without mathematical self-definition. This is a standard empirical architecture paper whose central claims are testable via experiments and do not exhibit the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no equations, training objectives, or architectural hyperparameters, so no free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.1-grok · 5902 in / 1273 out tokens · 46505 ms · 2026-06-28T01:02:30.646397+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

60 extracted references · 33 linked inside Pith

[1]

Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

Pith/arXiv arXiv 2025
[2]

Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

Pith/arXiv arXiv 2023
[3]

Revisiting feature prediction for learning visual representations from video.arXiv preprint arXiv:2404.08471, 2024

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video.arXiv preprint arXiv:2404.08471, 2024

Pith/arXiv arXiv 2024
[4]

Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

Pith/arXiv arXiv 2024
[5]

Motus: Aunifiedlatentactionworldmodel.arXiv preprint arXiv:2512.13030, 2025

Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, ChendongXiang, YinzeRong, etal. Motus: Aunifiedlatentactionworldmodel.arXiv preprint arXiv:2512.13030, 2025

Pith/arXiv arXiv 2025
[6]

arXiv preprint arXiv:2410.24164, 2024

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024
[7]

Video generation models as world simulators.OpenAI Blog, 1(8):1, 2024

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators.OpenAI Blog, 1(8):1, 2024

2024
[8]

Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Nee- lakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

1901
[9]

Genie: Generative interactive environments

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty- first International Conference on Machine Learning, 2024

2024
[10]

Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025

Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025

Pith/arXiv arXiv 2025
[11]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 9650–9660, 2021

2021
[12]

Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024

Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024

Pith/arXiv arXiv 2024
[13]

Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

Pith/arXiv arXiv 2025
[14]

Rmbench: Memory-dependent robotic manipulation benchmark with insights into policy design.arXiv preprint arXiv:2603.01229, 2026

Tianxing Chen, Yuran Wang, Mingleyang Li, Yan Qin, Hao Shi, Zixuan Li, Yifan Hu, Yingsheng Zhang, Kaixuan Wang, Yue Chen, et al. Rmbench: Memory-dependent robotic manipulation benchmark with insights into policy design.arXiv preprint arXiv:2603.01229, 2026

arXiv 2026
[15]

Moto: Latent motion token as the bridging language for learning robot manipulation from videos

Yi Chen, Yuying Ge, Weiliang Tang, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, and Xihui Liu. Moto: Latent motion token as the bridging language for learning robot manipulation from videos. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19752–19763, 2025

2025
[16]

Starvla: A lego-like codebase for vision-language-action model developing.arXiv preprint arXiv:2604.05014, 2026

StarVLA Community. Starvla: A lego-like codebase for vision-language-action model developing.arXiv preprint arXiv:2604.05014, 2026

Pith/arXiv arXiv 2026
[17]

Rynnbrain: Open embodied foundation models.arXiv preprint arXiv:2602.14979, 2026

Ronghao Dang, Jiayan Guo, Bohan Hou, Sicong Leng, Kehan Li, Xin Li, Jiangpin Liu, Yunxuan Mao, Zhikai Wang, Yuqian Yuan, et al. Rynnbrain: Open embodied foundation models.arXiv preprint arXiv:2602.14979, 2026

arXiv 2026
[18]

Palm-e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023

Pith/arXiv arXiv 2023
[19]

Learning universal policies via text-guided video generation.Advances in neural information processing systems, 36:9156–9172, 2023

Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation.Advances in neural information processing systems, 36:9156–9172, 2023

2023
[20]

Vidar: Embodied video diffusion model for generalist manipulation.arXiv preprint arXiv:2507.12898, 2025

Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, and Jun Zhu. Vidar: Embodied video diffusion model for generalist manipulation.arXiv preprint arXiv:2507.12898, 2025

Pith/arXiv arXiv 2025
[21]

World models.arXiv preprint arXiv:1803.10122, 2(3):440, 2018

David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3):440, 2018

Pith/arXiv arXiv 2018
[22]

Gaussian error linear units (gelus).arXiv preprint arXiv:1606.08415, 2016

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus).arXiv preprint arXiv:1606.08415, 2016

Pith/arXiv arXiv 2016
[23]

Video prediction policy: A generalist robot policy with predictive visual representations.arXiv preprint arXiv:2412.14803, 2024

Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations.arXiv preprint arXiv:2412.14803, 2024

Pith/arXiv arXiv 2024
[24]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al.π0.5: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

Pith/arXiv arXiv 2025
[25]

Physical Intelligence, Bo Ai, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Greg Balke, Kevin Black, George Bokinsky, Shihao Cao, Thomas Charbonnier, et al.pi0.7: a steerable generalist robotic foundation model with emergent capabilities.arXiv preprint arXiv:2604.15483, 2026

Pith/arXiv arXiv 2026
[26]

Pris- matic vlms: Investigating the design space of visually-conditioned language models

Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Pris- matic vlms: Investigating the design space of visually-conditioned language models. InForty-first International Conference on Machine Learning, 2024

2024
[27]

Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

Pith/arXiv arXiv 2024
[28]

Cosmos policy: Fine-tuning video models for visuomotor control and planning

Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163, 2026

Pith/arXiv arXiv 2026
[29]

Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114, 2013

Diederik P Kingma and Max Welling. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114, 2013

Pith/arXiv arXiv 2013
[30]

Learning to act from actionless videos through dense correspondences

Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, and Joshua B Tenenbaum. Learning to act from actionless videos through dense correspondences. InInternational Conference on Learning Representations, volume 2024, pages 40938–40958, 2024

2024
[31]

Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

Pith/arXiv arXiv 2026
[32]

Drivevla-w0: World models amplify data scaling law in autonomous driving

Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, et al. Drivevla-w0: World models amplify data scaling law in autonomous driving. arXiv preprint arXiv:2510.12796, 2025

Pith/arXiv arXiv 2025
[33]

Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models.arXiv preprint arXiv:2411.04996, 2024

Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen- tau Yih, Luke Zettlemoyer, et al. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models.arXiv preprint arXiv:2411.04996, 2024

Pith/arXiv arXiv 2024
[34]

Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776– 44791, 2023

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776– 44791, 2023

2023
[35]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

2023
[36]

Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

Pith/arXiv arXiv 2017
[37]

Being-h0: vision-language-action pretraining from large-scale human videos.arXiv preprint arXiv:2507.15597, 2025

Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, and Zongqing Lu. Being-h0: vision-language-action pretraining from large-scale human videos.arXiv preprint arXiv:2507.15597, 2025

arXiv 2025
[38]

Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control.arXiv preprint arXiv:2603.10448, 2026

Teli Ma, Jia Zheng, Zifan Wang, Chunli Jiang, Andy Cui, Junwei Liang, and Shuo Yang. Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control.arXiv preprint arXiv:2603.10448, 2026

arXiv 2026
[39]

Embodiedgpt: Vision-language pre-training via embodied chain of thought.Advances in Neural Information Processing Systems, 36:25081–25094, 2023

Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, and Ping Luo. Embodiedgpt: Vision-language pre-training via embodied chain of thought.Advances in Neural Information Processing Systems, 36:25081–25094, 2023

2023
[40]

mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint arXiv:2512.15692, 2025

Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint arXiv:2512.15692, 2025

Pith/arXiv arXiv 2025
[41]

Transfer between modalities with metaqueries.arXiv preprint arXiv:2504.06256, 2025

Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with metaqueries.arXiv preprint arXiv:2504.06256, 2025

Pith/arXiv arXiv 2025
[42]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

2023
[43]

Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. InProceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 3505–3506, 2020

2020
[44]

Glu variants improve transformer.arXiv preprint arXiv:2002.05202, 2020

Noam Shazeer. Glu variants improve transformer.arXiv preprint arXiv:2002.05202, 2020

Pith/arXiv arXiv 2002
[45]

Roformer: Enhanced trans- former with rotary position embedding.Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced trans- former with rotary position embedding.Neurocomputing, 568:127063, 2024

2024
[46]

Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

TeamWan, AngWang, BaoleAi, BinWen, ChaojieMao, Chen-WeiXie, DiChen, FeiwuYu, HaimingZhao, Jianx- iao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Pith/arXiv arXiv 2025
[47]

Unified vision-language-action model.arXiv preprint arXiv:2506.19850, 2025

Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, and Zhaoxi- ang Zhang. Unified vision-language-action model.arXiv preprint arXiv:2506.19850, 2025

arXiv 2025
[48]

Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

2022
[49]

Unleashing large-scale video generative pre-training for visual robot manipulation

Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. InInternational Conference on Learning Representations, volume 2024, pages 10641–10662, 2024

2024
[50]

Sana: Efficient high-resolution image synthesis with linear diffusion transformers.arXiv preprint arXiv:2410.10629, 2024

Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, et al. Sana: Efficient high-resolution image synthesis with linear diffusion transformers.arXiv preprint arXiv:2410.10629, 2024

Pith/arXiv arXiv 2024
[51]

Latent action pretraining from videos

Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Se June Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, et al. Latent action pretraining from videos. InInternational Conference on Learning Representations, volume 2025, pages 28213–28239, 2025

2025
[52]

World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

Pith/arXiv arXiv 2026
[53]

Fast-wam: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026

Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026

Pith/arXiv arXiv 2026
[54]

Robotic control via embodied chain-of-thought reasoning.arXiv preprint arXiv:2407.08693, 2024

Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning.arXiv preprint arXiv:2407.08693, 2024

Pith/arXiv arXiv 2024
[55]

Root mean square layer normalization.Advances in neural information processing systems, 32, 2019

Biao Zhang and Rico Sennrich. Root mean square layer normalization.Advances in neural information processing systems, 32, 2019

2019
[56]

Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge.Advances in Neural Information Processing Systems, 38:24195–24228, 2026

Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, et al. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge.Advances in Neural Information Processing Systems, 38:24195–24228, 2026

2026
[57]

Cot-vla: Visual chain-of-thought reasoning for vision-language-action models

Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025

2025
[58]

X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language- action model.arXiv preprint arXiv:2510.10274, 2025

Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language- action model.arXiv preprint arXiv:2510.10274, 2025

Pith/arXiv arXiv 2025
[59]

Chatvla: Unified multimodal understanding and robot control with vision-language- action model

Zhongyi Zhou, Yichen Zhu, Minjie Zhu, Junjie Wen, Ning Liu, Zhiyuan Xu, Weibin Meng, Yaxin Peng, Chaomin Shen, Feifei Feng, et al. Chatvla: Unified multimodal understanding and robot control with vision-language- action model. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5377–5395, 2025

2025
[60]

Rt-2: Vision-language-action models transfer web knowledge to robotic control

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023. A Acceleration Techniques WLA-0’s inference latency is dominated by Python dispatch o...

2023

[1] [1]

Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

Pith/arXiv arXiv 2025

[2] [2]

Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

Pith/arXiv arXiv 2023

[3] [3]

Revisiting feature prediction for learning visual representations from video.arXiv preprint arXiv:2404.08471, 2024

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video.arXiv preprint arXiv:2404.08471, 2024

Pith/arXiv arXiv 2024

[4] [4]

Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

Pith/arXiv arXiv 2024

[5] [5]

Motus: Aunifiedlatentactionworldmodel.arXiv preprint arXiv:2512.13030, 2025

Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, ChendongXiang, YinzeRong, etal. Motus: Aunifiedlatentactionworldmodel.arXiv preprint arXiv:2512.13030, 2025

Pith/arXiv arXiv 2025

[6] [6]

arXiv preprint arXiv:2410.24164, 2024

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024

[7] [7]

Video generation models as world simulators.OpenAI Blog, 1(8):1, 2024

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators.OpenAI Blog, 1(8):1, 2024

2024

[8] [8]

Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Nee- lakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

1901

[9] [9]

Genie: Generative interactive environments

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. InForty- first International Conference on Machine Learning, 2024

2024

[10] [10]

Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025

Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025

Pith/arXiv arXiv 2025

[11] [11]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF international con- ference on computer vision, pages 9650–9660, 2021

2021

[12] [12]

Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024

Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024

Pith/arXiv arXiv 2024

[13] [13]

Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

Pith/arXiv arXiv 2025

[14] [14]

Rmbench: Memory-dependent robotic manipulation benchmark with insights into policy design.arXiv preprint arXiv:2603.01229, 2026

Tianxing Chen, Yuran Wang, Mingleyang Li, Yan Qin, Hao Shi, Zixuan Li, Yifan Hu, Yingsheng Zhang, Kaixuan Wang, Yue Chen, et al. Rmbench: Memory-dependent robotic manipulation benchmark with insights into policy design.arXiv preprint arXiv:2603.01229, 2026

arXiv 2026

[15] [15]

Moto: Latent motion token as the bridging language for learning robot manipulation from videos

Yi Chen, Yuying Ge, Weiliang Tang, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, and Xihui Liu. Moto: Latent motion token as the bridging language for learning robot manipulation from videos. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19752–19763, 2025

2025

[16] [16]

Starvla: A lego-like codebase for vision-language-action model developing.arXiv preprint arXiv:2604.05014, 2026

StarVLA Community. Starvla: A lego-like codebase for vision-language-action model developing.arXiv preprint arXiv:2604.05014, 2026

Pith/arXiv arXiv 2026

[17] [17]

Rynnbrain: Open embodied foundation models.arXiv preprint arXiv:2602.14979, 2026

Ronghao Dang, Jiayan Guo, Bohan Hou, Sicong Leng, Kehan Li, Xin Li, Jiangpin Liu, Yunxuan Mao, Zhikai Wang, Yuqian Yuan, et al. Rynnbrain: Open embodied foundation models.arXiv preprint arXiv:2602.14979, 2026

arXiv 2026

[18] [18]

Palm-e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023

Pith/arXiv arXiv 2023

[19] [19]

Learning universal policies via text-guided video generation.Advances in neural information processing systems, 36:9156–9172, 2023

Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation.Advances in neural information processing systems, 36:9156–9172, 2023

2023

[20] [20]

Vidar: Embodied video diffusion model for generalist manipulation.arXiv preprint arXiv:2507.12898, 2025

Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, and Jun Zhu. Vidar: Embodied video diffusion model for generalist manipulation.arXiv preprint arXiv:2507.12898, 2025

Pith/arXiv arXiv 2025

[21] [21]

World models.arXiv preprint arXiv:1803.10122, 2(3):440, 2018

David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3):440, 2018

Pith/arXiv arXiv 2018

[22] [22]

Gaussian error linear units (gelus).arXiv preprint arXiv:1606.08415, 2016

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus).arXiv preprint arXiv:1606.08415, 2016

Pith/arXiv arXiv 2016

[23] [23]

Video prediction policy: A generalist robot policy with predictive visual representations.arXiv preprint arXiv:2412.14803, 2024

Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations.arXiv preprint arXiv:2412.14803, 2024

Pith/arXiv arXiv 2024

[24] [24]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al.π0.5: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

Pith/arXiv arXiv 2025

[25] [25]

Physical Intelligence, Bo Ai, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Greg Balke, Kevin Black, George Bokinsky, Shihao Cao, Thomas Charbonnier, et al.pi0.7: a steerable generalist robotic foundation model with emergent capabilities.arXiv preprint arXiv:2604.15483, 2026

Pith/arXiv arXiv 2026

[26] [26]

Pris- matic vlms: Investigating the design space of visually-conditioned language models

Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Pris- matic vlms: Investigating the design space of visually-conditioned language models. InForty-first International Conference on Machine Learning, 2024

2024

[27] [27]

Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

Pith/arXiv arXiv 2024

[28] [28]

Cosmos policy: Fine-tuning video models for visuomotor control and planning

Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163, 2026

Pith/arXiv arXiv 2026

[29] [29]

Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114, 2013

Diederik P Kingma and Max Welling. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114, 2013

Pith/arXiv arXiv 2013

[30] [30]

Learning to act from actionless videos through dense correspondences

Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, and Joshua B Tenenbaum. Learning to act from actionless videos through dense correspondences. InInternational Conference on Learning Representations, volume 2024, pages 40938–40958, 2024

2024

[31] [31]

Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

Pith/arXiv arXiv 2026

[32] [32]

Drivevla-w0: World models amplify data scaling law in autonomous driving

Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, et al. Drivevla-w0: World models amplify data scaling law in autonomous driving. arXiv preprint arXiv:2510.12796, 2025

Pith/arXiv arXiv 2025

[33] [33]

Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models.arXiv preprint arXiv:2411.04996, 2024

Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen- tau Yih, Luke Zettlemoyer, et al. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models.arXiv preprint arXiv:2411.04996, 2024

Pith/arXiv arXiv 2024

[34] [34]

Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776– 44791, 2023

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776– 44791, 2023

2023

[35] [35]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

2023

[36] [36]

Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

Pith/arXiv arXiv 2017

[37] [37]

Being-h0: vision-language-action pretraining from large-scale human videos.arXiv preprint arXiv:2507.15597, 2025

Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, and Zongqing Lu. Being-h0: vision-language-action pretraining from large-scale human videos.arXiv preprint arXiv:2507.15597, 2025

arXiv 2025

[38] [38]

Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control.arXiv preprint arXiv:2603.10448, 2026

Teli Ma, Jia Zheng, Zifan Wang, Chunli Jiang, Andy Cui, Junwei Liang, and Shuo Yang. Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control.arXiv preprint arXiv:2603.10448, 2026

arXiv 2026

[39] [39]

Embodiedgpt: Vision-language pre-training via embodied chain of thought.Advances in Neural Information Processing Systems, 36:25081–25094, 2023

Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, and Ping Luo. Embodiedgpt: Vision-language pre-training via embodied chain of thought.Advances in Neural Information Processing Systems, 36:25081–25094, 2023

2023

[40] [40]

mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint arXiv:2512.15692, 2025

Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint arXiv:2512.15692, 2025

Pith/arXiv arXiv 2025

[41] [41]

Transfer between modalities with metaqueries.arXiv preprint arXiv:2504.06256, 2025

Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with metaqueries.arXiv preprint arXiv:2504.06256, 2025

Pith/arXiv arXiv 2025

[42] [42]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

2023

[43] [43]

Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. InProceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 3505–3506, 2020

2020

[44] [44]

Glu variants improve transformer.arXiv preprint arXiv:2002.05202, 2020

Noam Shazeer. Glu variants improve transformer.arXiv preprint arXiv:2002.05202, 2020

Pith/arXiv arXiv 2002

[45] [45]

Roformer: Enhanced trans- former with rotary position embedding.Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced trans- former with rotary position embedding.Neurocomputing, 568:127063, 2024

2024

[46] [46]

Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

TeamWan, AngWang, BaoleAi, BinWen, ChaojieMao, Chen-WeiXie, DiChen, FeiwuYu, HaimingZhao, Jianx- iao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Pith/arXiv arXiv 2025

[47] [47]

Unified vision-language-action model.arXiv preprint arXiv:2506.19850, 2025

Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, and Zhaoxi- ang Zhang. Unified vision-language-action model.arXiv preprint arXiv:2506.19850, 2025

arXiv 2025

[48] [48]

Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

2022

[49] [49]

Unleashing large-scale video generative pre-training for visual robot manipulation

Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. InInternational Conference on Learning Representations, volume 2024, pages 10641–10662, 2024

2024

[50] [50]

Sana: Efficient high-resolution image synthesis with linear diffusion transformers.arXiv preprint arXiv:2410.10629, 2024

Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, et al. Sana: Efficient high-resolution image synthesis with linear diffusion transformers.arXiv preprint arXiv:2410.10629, 2024

Pith/arXiv arXiv 2024

[51] [51]

Latent action pretraining from videos

Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Se June Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, et al. Latent action pretraining from videos. InInternational Conference on Learning Representations, volume 2025, pages 28213–28239, 2025

2025

[52] [52]

World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

Pith/arXiv arXiv 2026

[53] [53]

Fast-wam: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026

Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026

Pith/arXiv arXiv 2026

[54] [54]

Robotic control via embodied chain-of-thought reasoning.arXiv preprint arXiv:2407.08693, 2024

Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning.arXiv preprint arXiv:2407.08693, 2024

Pith/arXiv arXiv 2024

[55] [55]

Root mean square layer normalization.Advances in neural information processing systems, 32, 2019

Biao Zhang and Rico Sennrich. Root mean square layer normalization.Advances in neural information processing systems, 32, 2019

2019

[56] [56]

Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge.Advances in Neural Information Processing Systems, 38:24195–24228, 2026

Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, et al. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge.Advances in Neural Information Processing Systems, 38:24195–24228, 2026

2026

[57] [57]

Cot-vla: Visual chain-of-thought reasoning for vision-language-action models

Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025

2025

[58] [58]

X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language- action model.arXiv preprint arXiv:2510.10274, 2025

Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language- action model.arXiv preprint arXiv:2510.10274, 2025

Pith/arXiv arXiv 2025

[59] [59]

Chatvla: Unified multimodal understanding and robot control with vision-language- action model

Zhongyi Zhou, Yichen Zhu, Minjie Zhu, Junjie Wen, Ning Liu, Zhiyuan Xu, Weibin Meng, Yaxin Peng, Chaomin Shen, Feifei Feng, et al. Chatvla: Unified multimodal understanding and robot control with vision-language- action model. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5377–5395, 2025

2025

[60] [60]

Rt-2: Vision-language-action models transfer web knowledge to robotic control

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023. A Acceleration Techniques WLA-0’s inference latency is dominated by Python dispatch o...

2023