pith. machine review for the scientific record.

arxiv: 2605.07931 · v3 · submitted 2026-05-08 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:12 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision-language-action · world models · visual bandwidth · adaptive attention pooling · flow-matching · long-horizon planning · VLA policy

The pith

A single semantic token per frame suffices to drive long-horizon planning in world-model-augmented vision-language-action policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that the visual stream fed to a world model for VLA policies can be compressed from high-bandwidth features to one token per frame. Adaptive Attention Pooling extracts a task-relevant semantic token from each view, and the resulting latent stream is produced jointly with action trajectories under a single flow-matching objective. This setup yields higher success rates on long-horizon benchmarks while using only 14.71 million LoRA parameters on a 2-billion-parameter backbone. The result suggests that rich per-frame visual detail is not required when the world model and policy are trained together in this manner.

Core claim

OneWM-VLA compresses each view into a single semantic token per frame through Adaptive Attention Pooling and produces the resulting latent stream and the action trajectory under a single flow-matching objective. Per-frame visual bandwidth can thereby be reduced to one token without loss of long-horizon performance, as evidenced by gains in average success from 47.9 to 61.3 percent on MetaWorld MT50, from 85.2 to 95.6 percent on LIBERO-Long, and from 20 to 60 percent on the real-robot Fold Cloth task.
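For orientation, the objective named above is a flow-matching loss in the sense of Lipman et al. [31]. The paper is summarized here only as applying one such loss jointly to the pooled visual latents and the action trajectory, so the block below is a generic sketch under assumed notation (x₁ for the concatenated latent-and-action target, c for the observation and language conditioning), not the authors' exact formulation.

```latex
% Generic conditional flow-matching loss with a linear probability path.
% Assumed notation: x_1 = (z_{1:H}, a_{1:H}) concatenates the target latent
% stream and action trajectory, x_0 is Gaussian noise, c = (o, \ell) is the
% observation/language conditioning, and v_\theta is the learned velocity field.
\[
\mathcal{L}_{\mathrm{FM}}(\theta) =
  \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_1 \sim \mathcal{D},\; x_0 \sim \mathcal{N}(0, I)}
  \bigl\| v_\theta(x_t,\, t,\, c) - (x_1 - x_0) \bigr\|^2,
  \qquad x_t = (1 - t)\, x_0 + t\, x_1 .
\]
```

Treating the latent rollout and the action chunk as one target under a single loss of this kind is what replaces the separate decoder mentioned in the abstract.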

What carries the argument

Adaptive Attention Pooling that condenses each frame into one task-relevant semantic token, trained jointly with action prediction via a single flow-matching objective.
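A concrete reading of that mechanism, as a sketch: compress a grid of per-view visual tokens into one token with a learned attention query, then fuse the per-view tokens with input-dependent softmax weights. Module names, head counts, and feature sizes below are illustrative assumptions, not the paper's released implementation (which combines multi-strategy token pooling with view-level adaptive fusion).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnPool1Token(nn.Module):
    """Illustrative sketch: pool N visual tokens from one view into a single
    token using a learned query (hypothetical module, not the paper's code)."""
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * dim ** -0.5)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim), e.g. 256 patch tokens from a frozen encoder
        q = self.query.expand(tokens.shape[0], -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)   # (batch, 1, dim)
        return pooled.squeeze(1)                   # (batch, dim)

class AdaptiveViewFusion(nn.Module):
    """Fuse per-view pooled tokens into one token per frame with learned,
    input-dependent weights (again, only a sketch of 'adaptive fusion')."""
    def __init__(self, dim: int, num_views: int = 3):
        super().__init__()
        self.pools = nn.ModuleList(AttnPool1Token(dim) for _ in range(num_views))
        self.scorer = nn.Linear(dim, 1)

    def forward(self, views: list[torch.Tensor]) -> torch.Tensor:
        # views: list of (batch, num_tokens, dim) tensors, one per camera view
        pooled = torch.stack([p(v) for p, v in zip(self.pools, views)], dim=1)  # (B, V, D)
        weights = F.softmax(self.scorer(pooled), dim=1)                          # (B, V, 1)
        return (weights * pooled).sum(dim=1)                                     # (B, D): one token per frame

# Usage with assumed sizes: three camera views, 256 tokens each, 1024-dim features.
fusion = AdaptiveViewFusion(dim=1024, num_views=3)
frame_token = fusion([torch.randn(2, 256, 1024) for _ in range(3)])
print(frame_token.shape)  # torch.Size([2, 1024])
```

The point of the sketch is the bandwidth arithmetic: whatever the encoder emits per view, the world model downstream sees exactly one vector per frame.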

If this is right

  • World models attached to VLA policies can run with drastically lower per-frame visual compute.
  • Joint flow-matching removes the need for a separate decoder between latent prediction and action output.
  • The same low-bandwidth latent stream supports both simulated benchmarks and real-robot deformable manipulation.
  • LoRA fine-tuning of a 2B backbone with roughly 15 million parameters is sufficient to realize these gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same one-token compression could be tested on non-visual sensory streams to see whether bandwidth reduction generalizes across modalities.
  • If the approach holds at still longer horizons, it would lower the barrier to deploying world-model planning on embedded robot hardware.
  • An ablation that replaces Adaptive Attention Pooling with simpler uniform pooling would isolate how much the attention mechanism contributes to information preservation.

Load-bearing premise

Adaptive Attention Pooling can extract and preserve every piece of task-relevant semantic information from each frame so that the single-token latent stream remains sufficient for accurate long-horizon rollouts.

What would settle it

A controlled comparison on a new long-horizon task in which the single-token version produces measurably lower success rates than an otherwise identical high-bandwidth version would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.07931 by Bin Liu, De Ma, Gang Pan, Shengchao Yuan, Xiaoxin Bai, Zhiyuan Jin, Zuojin Tang.

Figure 1: Motivation. OneWM-VLA represents each frame by a single semantic latent, keeping the …
Figure 2: The OneWM-VLA Framework. Through Adaptive Attention Pooling (Adaptive Fusion), …
Figure 3: Adaptive attention pooling. Adaptive Attention Pooling reduces each view to a single token per frame in two stages: a token-level multi-strategy pooling and a view-level adaptive fusion. Each camera view is processed independently, with i ∈ {r, w1, w2} denoting the third-person view and the two wrist views; per-view token features are extracted with the pretrained PaliGemma [4] encoder Eϕ …
Figure 4: Evaluation suites used in this work: the LIBERO and MetaWorld MT50 simulation …
Figure 5: PCA visualization of visual features on LIBERO-Long. Top: before pooling (256 tokens, …
read the original abstract

Vision-language-action (VLA) models increasingly rely on auxiliary world modules to plan over long horizons, yet how such modules should be parameterized on top of a pretrained VLA remains an open design question. Existing world-model-augmented VLAs typically pass the per-frame visual stream into the world module at high visual bandwidth and treat its rollout as a side product of action prediction; under a constrained adaptation budget on a frozen backbone, this leaves both the per-frame representation and the latent action coupling under-examined. We introduce OneWM-VLA, which compresses each view into a single semantic token per frame through an Adaptive Attention Pooling, and produces the resulting latent stream and the action trajectory under a single flow-matching objective rather than connecting them through a separate decoder. Empirically, we find that per-frame visual bandwidth can be reduced to a single token without compromising long-horizon performance under our setup. Trained with 14.71M LoRA parameters on a $\pi_0$ (2B) backbone, OneWM-VLA improves the average success rate from 47.9% to 61.3% on MetaWorld MT50, reaches 95.6% on LIBERO-Long (vs. 85.2% for $\pi_0$), and reaches 60.0% on the long-horizon deformable task Fold Cloth on a real Piper arm (vs. 20.0% for $\pi_0$).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces OneWM-VLA, a VLA architecture that compresses each visual frame into a single semantic token via Adaptive Attention Pooling and jointly predicts the resulting latent stream and action trajectory under a single flow-matching objective on a frozen π₀ (2B) backbone with 14.71M LoRA parameters. It reports empirical success-rate gains over the base π₀ model on MetaWorld MT50 (47.9% → 61.3%), LIBERO-Long (85.2% → 95.6%), and a real-robot Fold Cloth task (20.0% → 60.0%), concluding that per-frame visual bandwidth can be reduced to one token without compromising long-horizon performance.

Significance. If the single-token compression is shown to be sufficient, the result would be significant for efficient world-model design in VLAs, demonstrating that high visual bandwidth is not required for long-horizon rollouts under constrained adaptation budgets. The multi-benchmark evaluation, including real-robot deployment, strengthens the practical relevance; however, the lack of controls isolating the token reduction from the mere addition of a world-model coupling limits attribution of the gains.

major comments (2)
  1. [Abstract / Experiments] The central claim that single-token compression comes 'without compromising long-horizon performance' is not supported by the reported comparisons, which are only against the base π₀ model (no world module) rather than against an otherwise identical multi-token (k>1) world-model variant trained under the same flow-matching objective and LoRA adaptation.
  2. [Methods / Experiments] No ablation studies, training details, or error bars are provided to isolate the contribution of Adaptive Attention Pooling and the single-token latent stream from other unstated changes to the world-model coupling or objective.
minor comments (2)
  1. [Abstract] The LoRA parameter count (14.71M) is stated without a breakdown of which modules receive adaptation or a comparison to full fine-tuning cost.
  2. [Methods] Notation for the Adaptive Attention Pooling mechanism and the flow-matching objective should be defined more explicitly with equations to allow reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important ways to strengthen the attribution of our results to the single-token compression. We address each point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract / Experiments] The central claim that single-token compression comes 'without compromising long-horizon performance' is not supported by the reported comparisons, which are only against the base π₀ model (no world module) rather than against an otherwise identical multi-token (k>1) world-model variant trained under the same flow-matching objective and LoRA adaptation.

    Authors: We agree that the current baselines do not fully isolate the effect of reducing to a single token. To directly support the claim, we will add a controlled comparison in the revised Experiments section against an otherwise identical multi-token (k=4) world-model variant trained under the exact same flow-matching objective, LoRA adaptation budget, and π₀ (2B) backbone. This will allow readers to see whether performance is preserved or degraded when moving from k>1 to k=1. revision: yes

  2. Referee: [Methods / Experiments] No ablation studies, training details, or error bars are provided to isolate the contribution of Adaptive Attention Pooling and the single-token latent stream from other unstated changes to the world-model coupling or objective.

    Authors: We will expand the Methods section with full training hyperparameters (optimizer, learning rate schedule, batch size, number of epochs, and LoRA configuration) and add ablation studies that vary the pooling mechanism while keeping the flow-matching objective and coupling fixed. We will also report mean success rates with standard deviations over three independent random seeds for all main results and ablations to quantify variability. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on direct benchmark measurements

full rationale

The paper reports success-rate improvements (e.g., 47.9% to 61.3% on MT50, 85.2% to 95.6% on LIBERO-Long) from training a LoRA-adapted model on public benchmarks and comparing against the base π0 policy. No equations, fitted parameters, or self-citations are invoked that would reduce these measured outcomes to quantities defined by the model's own inputs or prior author work. The derivation chain consists of an architectural choice (Adaptive Attention Pooling to one token) followed by end-to-end flow-matching training and empirical evaluation; the reported numbers are not forced by construction from any internal fit or self-referential premise.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claim rests on the untested premise that a single pooled token retains sufficient information for world-model rollouts and that the joint flow-matching loss is adequate to couple latent states with actions. The only explicit free parameter reported is the 14.71 M LoRA budget; the choice of exactly one token is a design decision rather than a fitted value.

free parameters (1)
  • LoRA parameter count
    14.71M trainable parameters on the frozen 2B backbone; reported as the adaptation budget.
axioms (2)
  • domain assumption: A pretrained VLA backbone can remain frozen while a lightweight world module is adapted on top.
    Stated as the constrained adaptation budget setup.
  • domain assumption: Flow-matching loss can simultaneously supervise both the latent world stream and the action trajectory.
    Core modeling choice replacing a separate decoder.
invented entities (1)
  • Adaptive Attention Pooling (no independent evidence)
    purpose: Compress each visual frame into exactly one semantic token.
    New module introduced to achieve the one-token bandwidth reduction.
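To make the adaptation-budget entry concrete: LoRA adds r·(d_in + d_out) trainable parameters for each adapted weight matrix of shape (d_out, d_in). The shapes, rank, and module selection below are illustrative assumptions rather than the paper's reported configuration; they show only how a budget on the order of the reported 14.71M arises on a frozen multi-billion-parameter backbone.

```python
def lora_param_count(shapes: list[tuple[int, int]], rank: int) -> int:
    """Trainable parameters added by LoRA: for each adapted weight W of shape
    (d_out, d_in), the low-rank factors A (rank x d_in) and B (d_out x rank)
    contribute rank * (d_in + d_out) parameters."""
    return sum(rank * (d_in + d_out) for d_out, d_in in shapes)

# Hypothetical example (not the paper's config): adapt the q/k/v/o projections
# (2048 x 2048) in each of 24 transformer blocks of a ~2B backbone, at rank 16.
per_block = [(2048, 2048)] * 4
shapes = per_block * 24
print(lora_param_count(shapes, rank=16))  # 6291456 -- same order of magnitude as ~15M
```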

pith-pipeline@v0.9.0 · 5582 in / 1442 out tokens · 71358 ms · 2026-05-15T06:12:55.574732+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 27 internal anchors

  1. [1]

    Qwen3-vl technical report, 2025

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  2. [2]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

  3. [3]

    Revisiting Feature Prediction for Learning Visual Representations from Video

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024

  4. [4]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

  5. [5]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

  6. [6]

    π0: A vision-language-action flow model for general robot control, 2026

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A visi...

  7. [7]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

  8. [8]

    UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025

  9. [9]

    Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025

    Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Yuming Jiang, Jiayan Guo, Xin Li, Hao Luo, et al. Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025

  10. [10]

    WorldVLA: Towards Autoregressive Action World Model

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

  11. [11]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024

  12. [12]

    πRL: Online RL fine-tuning for flow-based vision-language-action models. arXiv preprint arXiv:2510.25889, 2025

    Kang Chen, Zhihao Liu, Tonghe Zhang, Zhen Guo, Si Xu, Hao Lin, Hongzhi Zang, Xiang Li, Quanlu Zhang, Zhaofei Yu, et al. πRL: Online RL fine-tuning for flow-based vision-language-action models. arXiv preprint arXiv:2510.25889, 2025

  13. [13]

    Decision transformer: Reinforcement learning via sequence modeling.Advances in neural information processing systems, 34:15084–15097, 2021

    Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34:15084–15097, 2021

  14. [14]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  15. [15]

    Learning universal policies via text-guided video generation.Advances in neural information processing systems, 36:9156–9172, 2023

    Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation.Advances in neural information processing systems, 36:9156–9172, 2023

  16. [16]

    Vla-0: Building state-of-the-art vlas with zero modification.arXiv preprint arXiv:2510.13054, 2025

    Ankit Goyal, Hugo Hadfield, Xuning Yang, Valts Blukis, and Fabio Ramos. Vla-0: Building state-of-the-art vlas with zero modification.arXiv preprint arXiv:2510.13054, 2025

  17. [17]

    Prediction with action: Visual policy learning via joint denoising process. Advances in Neural Information Processing Systems, 37:112386–112410, 2024

    Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, and Jianyu Chen. Prediction with action: Visual policy learning via joint denoising process. Advances in Neural Information Processing Systems, 37:112386–112410, 2024

  18. [18]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019

  19. [19]

    Training Agents Inside of Scalable World Models

    Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models.arXiv preprint arXiv:2509.24527, 2025

  20. [20]

    Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

  21. [21]

    Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations.arXiv preprint arXiv:2412.14803, 2024

  22. [22]

    Thinkact: Vision-language-action reasoning via reinforced visual latent planning, 2025

    Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, and Fu-En Yang. Thinkact: Vision-language-action reasoning via reinforced visual latent planning.arXiv preprint arXiv:2507.16815, 2025

  23. [23]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  24. [24]

    Dreamgen: Unlocking generalization in robot learning through video world models.arXiv preprint arXiv:2505.12705, 2025

    Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models.arXiv preprint arXiv:2505.12705, 2025

  25. [25]

    RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation

    Yuming Jiang, Siteng Huang, Shengke Xue, Yaxi Zhao, Jun Cen, Sicong Leng, Kehan Li, Jiayan Guo, Kexiang Wang, Mingxiu Chen, et al. Rynnvla-001: Using human demonstrations to improve robot manipulation.arXiv preprint arXiv:2509.15212, 2025

  26. [26]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

  27. [27]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  28. [28]

    A path towards autonomous machine intelligence version 0.9

    Yann LeCun. A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review, 62(1):1–62, 2022

  29. [29]

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024

  30. [30]

    Vision-language foundation models as effective robot imitators.arXiv preprint arXiv:2311.01378, 2023

    Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378, 2023

  31. [31]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

  32. [32]

    LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.arXiv preprint arXiv:2306.03310, 2023

  33. [33]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

  34. [34]

    Meta-World+: An Improved, Standardized, RL Benchmark

    Reginald McLean, Evangelos Chatzaroulas, Luc McCutcheon, Frank Röder, Tianhe Yu, Zhanpeng He, K.R. Zentner, Ryan Julian, J K Terry, Isaac Woungang, Nariman Farsad, and Pablo Samuel Castro. Meta-world+: An improved, standardized, RL benchmark. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

  35. [35]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint arXiv:2501.15830, 2025

  36. [36]

    Masked world models for visual control

    Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, and Pieter Abbeel. Masked world models for visual control. In Conference on Robot Learning, pages 1332–1344. PMLR, 2023

  37. [37]

    SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844, 2025

  38. [38]

    Efficient and generalized end-to-end autonomous driving system with latent deep reinforcement learning and demonstrations

    Zuojin Tang, Xiaoyu Chen, Yongqiang Li, and Jianyu Chen. Efficient and generalized end-to-end autonomous driving system with latent deep reinforcement learning and demonstrations. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 179–197. Springer, 2025

  39. [39]

    Vlascd: A visual language action model for simultaneous chatting and decision making

    Zuojin Tang, Bin Hu, Chenyang Zhao, De Ma, Gang Pan, and Bin Liu. Vlascd: A visual language action model for simultaneous chatting and decision making. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 9223–9243, 2025

  40. [40]

    Gemini Robotics: Bringing AI into the Physical World

    Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing AI into the physical world. arXiv preprint arXiv:2503.20020, 2025

  41. [41]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

  42. [42]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  43. [43]

    World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training

    Junjin Xiao, Yandan Yang, Xinyuan Chang, Ronghan Chen, Feng Xiong, Mu Xu, Wei-Shi Zheng, and Qing Zhang. World-env: Leveraging world model as a virtual environment for vla post-training.arXiv preprint arXiv:2509.24948, 2025

  44. [44]

    Vla-r1: Enhancing reasoning in vision-language-action models.arXiv preprint arXiv:2510.01623, 2025

    Angen Ye, Zeyu Zhang, Boyuan Wang, Xiaofeng Wang, Dapeng Zhang, and Zheng Zhu. Vla-r1: Enhancing reasoning in vision-language-action models.arXiv preprint arXiv:2510.01623, 2025

  45. [45]

    Janusvln: Decoupling semantics and spatiality with dual implicit memory for vision-language navigation,

    Shuang Zeng, Dekang Qi, Xinyuan Chang, Feng Xiong, Shichao Xie, Xiaolong Wu, Shiyi Liang, Mu Xu, and Xing Wei. Janusvln: Decoupling semantics and spatiality with dual implicit memory for vision-language navigation. arXiv preprint arXiv:2509.22548, 2025

  46. [46]

    Up-vla: A unified understanding and prediction model for embodied agent.arXiv preprint arXiv:2501.18867, 2025

    Jianke Zhang, Yanjiang Guo, Yucheng Hu, Xiaoyu Chen, Xiang Zhu, and Jianyu Chen. Up-vla: A unified understanding and prediction model for embodied agent.arXiv preprint arXiv:2501.18867, 2025

  47. [47]

    DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

    Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, Fan Lu, He Wang, et al. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge.arXiv preprint arXiv:2507.04447, 2025

  48. [48]

    Cot-vla: Visual chain-of-thought reasoning for vision-language-action models

    Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025

  49. [49]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023

  50. [50]

    3D-VLA: A 3D Vision-Language-Action Generative World Model

    Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-vla: A 3d vision-language-action generative world model.arXiv preprint arXiv:2403.09631, 2024

  51. [51]

    X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

    Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model.arXiv preprint arXiv:2510.10274, 2025

  52. [52]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023