pith. machine review for the scientific record.

arxiv: 2605.07931 · v3 · submitted 2026-05-08 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:12 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision-language-action · world models · visual bandwidth · adaptive attention pooling · flow-matching · long-horizon planning · VLA policy

The pith

A single semantic token per frame suffices to drive long-horizon planning in world-model-augmented vision-language-action policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that the visual stream fed to a world model for VLA policies can be compressed from high-bandwidth features to one token per frame. Adaptive Attention Pooling extracts a task-relevant semantic token from each view, and the resulting latent stream is produced jointly with action trajectories under a single flow-matching objective. This setup yields higher success rates on long-horizon benchmarks while using only 14.71 million LoRA parameters on a 2-billion-parameter backbone. The result suggests that rich per-frame visual detail is not required when the world model and policy are trained together in this manner.

Core claim

OneWM-VLA compresses each view into a single semantic token per frame through Adaptive Attention Pooling and produces the resulting latent stream and the action trajectory under a single flow-matching objective. Per-frame visual bandwidth can thereby be reduced to one token without loss of long-horizon performance, as evidenced by gains in average success from 47.9 to 61.3 percent on MetaWorld MT50, from 85.2 to 95.6 percent on LIBERO-Long, and from 20 to 60 percent on the real-robot Fold Cloth task.
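For orientation, the objective named above is a flow-matching loss in the sense of Lipman et al. [31]. The paper is summarized here only as applying one such loss jointly to the pooled visual latents and the action trajectory, so the block below is a generic sketch under assumed notation (x₁ for the concatenated latent-and-action target, c for the observation and language conditioning), not the authors' exact formulation.

```latex
% Generic conditional flow-matching loss with a linear probability path.
% Assumed notation: x_1 = (z_{1:H}, a_{1:H}) concatenates the target latent
% stream and action trajectory, x_0 is Gaussian noise, c = (o, \ell) is the
% observation/language conditioning, and v_\theta is the learned velocity field.
\[
\mathcal{L}_{\mathrm{FM}}(\theta) =
  \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_1 \sim \mathcal{D},\; x_0 \sim \mathcal{N}(0, I)}
  \bigl\| v_\theta(x_t,\, t,\, c) - (x_1 - x_0) \bigr\|^2,
  \qquad x_t = (1 - t)\, x_0 + t\, x_1 .
\]
```

Treating the latent rollout and the action chunk as one target under a single loss of this kind is what replaces the separate decoder mentioned in the abstract.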

What carries the argument

Adaptive Attention Pooling that condenses each frame into one task-relevant semantic token, trained jointly with action prediction via a single flow-matching objective.
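A concrete reading of that mechanism, as a sketch: compress a grid of per-view visual tokens into one token with a learned attention query, then fuse the per-view tokens with input-dependent softmax weights. Module names, head counts, and feature sizes below are illustrative assumptions, not the paper's released implementation (which combines multi-strategy token pooling with view-level adaptive fusion).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnPool1Token(nn.Module):
    """Illustrative sketch: pool N visual tokens from one view into a single
    token using a learned query (hypothetical module, not the paper's code)."""
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * dim ** -0.5)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim), e.g. 256 patch tokens from a frozen encoder
        q = self.query.expand(tokens.shape[0], -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)   # (batch, 1, dim)
        return pooled.squeeze(1)                   # (batch, dim)

class AdaptiveViewFusion(nn.Module):
    """Fuse per-view pooled tokens into one token per frame with learned,
    input-dependent weights (again, only a sketch of 'adaptive fusion')."""
    def __init__(self, dim: int, num_views: int = 3):
        super().__init__()
        self.pools = nn.ModuleList(AttnPool1Token(dim) for _ in range(num_views))
        self.scorer = nn.Linear(dim, 1)

    def forward(self, views: list[torch.Tensor]) -> torch.Tensor:
        # views: list of (batch, num_tokens, dim) tensors, one per camera view
        pooled = torch.stack([p(v) for p, v in zip(self.pools, views)], dim=1)  # (B, V, D)
        weights = F.softmax(self.scorer(pooled), dim=1)                          # (B, V, 1)
        return (weights * pooled).sum(dim=1)                                     # (B, D): one token per frame

# Usage with assumed sizes: three camera views, 256 tokens each, 1024-dim features.
fusion = AdaptiveViewFusion(dim=1024, num_views=3)
frame_token = fusion([torch.randn(2, 256, 1024) for _ in range(3)])
print(frame_token.shape)  # torch.Size([2, 1024])
```

The point of the sketch is the bandwidth arithmetic: whatever the encoder emits per view, the world model downstream sees exactly one vector per frame.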

If this is right

  • World models attached to VLA policies can run with drastically lower per-frame visual compute.
  • Joint flow-matching removes the need for a separate decoder between latent prediction and action output.
  • The same low-bandwidth latent stream supports both simulated benchmarks and real-robot deformable manipulation.
  • LoRA fine-tuning of a 2B backbone with roughly 15 million parameters is sufficient to realize these gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same one-token compression could be tested on non-visual sensory streams to see whether bandwidth reduction generalizes across modalities.
  • If the approach holds at still longer horizons, it would lower the barrier to deploying world-model planning on embedded robot hardware.
  • An ablation that replaces Adaptive Attention Pooling with simpler uniform pooling would isolate how much the attention mechanism contributes to information preservation.

Load-bearing premise

Adaptive Attention Pooling can extract and preserve every piece of task-relevant semantic information from each frame so that the single-token latent stream remains sufficient for accurate long-horizon rollouts.

What would settle it

A controlled comparison on a new long-horizon task in which the single-token version produces measurably lower success rates than an otherwise identical high-bandwidth version would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.07931 by Bin Liu, De Ma, Gang Pan, Shengchao Yuan, Xiaoxin Bai, Zhiyuan Jin, Zuojin Tang.

Figure 1: Motivation. OneWM-VLA represents each frame by a single semantic latent, keeping the …
Figure 2: The OneWM-VLA Framework. Through Adaptive Attention Pooling (Adaptive Fusion), …
Figure 3: Adaptive attention pooling. Adaptive Attention Pooling reduces each view to a single token per frame in two stages: a token-level multi-strategy pooling and a view-level adaptive fusion. Each camera view is processed independently, with i ∈ {r, w1, w2} denoting the third-person view and the two wrist views; per-view token features are extracted with the pretrained PaliGemma [4] encoder Eϕ …
Figure 4: Evaluation suites used in this work: the LIBERO and MetaWorld MT50 simulation …
Figure 5: PCA visualization of visual features on LIBERO-Long. Top: before pooling (256 tokens, …
read the original abstract

Vision-language-action (VLA) models increasingly rely on auxiliary world modules to plan over long horizons, yet how such modules should be parameterized on top of a pretrained VLA remains an open design question. Existing world-model-augmented VLAs typically pass the per-frame visual stream into the world module at high visual bandwidth and treat its rollout as a side product of action prediction; under a constrained adaptation budget on a frozen backbone, this leaves both the per-frame representation and the latent action coupling under-examined. We introduce OneWM-VLA, which compresses each view into a single semantic token per frame through an Adaptive Attention Pooling, and produces the resulting latent stream and the action trajectory under a single flow-matching objective rather than connecting them through a separate decoder. Empirically, we find that per-frame visual bandwidth can be reduced to a single token without compromising long-horizon performance under our setup. Trained with 14.71M LoRA parameters on a $\pi_0$ (2B) backbone, OneWM-VLA improves the average success rate from 47.9% to 61.3% on MetaWorld MT50, reaches 95.6% on LIBERO-Long (vs. 85.2% for $\pi_0$), and reaches 60.0% on the long-horizon deformable task Fold Cloth on a real Piper arm (vs. 20.0% for $\pi_0$).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces OneWM-VLA, a VLA architecture that compresses each visual frame into a single semantic token via Adaptive Attention Pooling and jointly predicts the resulting latent stream and action trajectory under a single flow-matching objective on a frozen π₀ (2B) backbone with 14.71M LoRA parameters. It reports empirical success-rate gains over the base π₀ model on MetaWorld MT50 (47.9% → 61.3%), LIBERO-Long (85.2% → 95.6%), and a real-robot Fold Cloth task (20.0% → 60.0%), concluding that per-frame visual bandwidth can be reduced to one token without compromising long-horizon performance.

Significance. If the single-token compression is shown to be sufficient, the result would be significant for efficient world-model design in VLAs, demonstrating that high visual bandwidth is not required for long-horizon rollouts under constrained adaptation budgets. The multi-benchmark evaluation, including real-robot deployment, strengthens the practical relevance; however, the lack of controls isolating the token reduction from the mere addition of a world-model coupling limits attribution of the gains.

major comments (2)
  1. [Abstract / Experiments] The central claim that single-token compression comes 'without compromising long-horizon performance' is not supported by the reported comparisons, which are only against the base π₀ model (no world module) rather than against an otherwise identical multi-token (k>1) world-model variant trained under the same flow-matching objective and LoRA adaptation.
  2. [Methods / Experiments] No ablation studies, training details, or error bars are provided to isolate the contribution of Adaptive Attention Pooling and the single-token latent stream from other unstated changes to the world-model coupling or objective.
minor comments (2)
  1. [Abstract] The LoRA parameter count (14.71M) is stated without a breakdown of which modules receive adaptation or a comparison to full fine-tuning cost.
  2. [Methods] Notation for the Adaptive Attention Pooling mechanism and the flow-matching objective should be defined more explicitly with equations to allow reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important ways to strengthen the attribution of our results to the single-token compression. We address each point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract / Experiments] The central claim that single-token compression comes 'without compromising long-horizon performance' is not supported by the reported comparisons, which are only against the base π₀ model (no world module) rather than against an otherwise identical multi-token (k>1) world-model variant trained under the same flow-matching objective and LoRA adaptation.

    Authors: We agree that the current baselines do not fully isolate the effect of reducing to a single token. To directly support the claim, we will add a controlled comparison in the revised Experiments section against an otherwise identical multi-token (k=4) world-model variant trained under the exact same flow-matching objective, LoRA adaptation budget, and π₀ (2B) backbone. This will allow readers to see whether performance is preserved or degraded when moving from k>1 to k=1. revision: yes

  2. Referee: [Methods / Experiments] No ablation studies, training details, or error bars are provided to isolate the contribution of Adaptive Attention Pooling and the single-token latent stream from other unstated changes to the world-model coupling or objective.

    Authors: We will expand the Methods section with full training hyperparameters (optimizer, learning rate schedule, batch size, number of epochs, and LoRA configuration) and add ablation studies that vary the pooling mechanism while keeping the flow-matching objective and coupling fixed. We will also report mean success rates with standard deviations over three independent random seeds for all main results and ablations to quantify variability. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on direct benchmark measurements

full rationale

The paper reports success-rate improvements (e.g., 47.9% to 61.3% on MT50, 85.2% to 95.6% on LIBERO-Long) from training a LoRA-adapted model on public benchmarks and comparing against the base π0 policy. No equations, fitted parameters, or self-citations are invoked that would reduce these measured outcomes to quantities defined by the model's own inputs or prior author work. The derivation chain consists of an architectural choice (Adaptive Attention Pooling to one token) followed by end-to-end flow-matching training and empirical evaluation; the reported numbers are not forced by construction from any internal fit or self-referential premise.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claim rests on the untested premise that a single pooled token retains sufficient information for world-model rollouts and that the joint flow-matching loss is adequate to couple latent states with actions. The only explicit free parameter reported is the 14.71 M LoRA budget; the choice of exactly one token is a design decision rather than a fitted value.

free parameters (1)
  • LoRA parameter count
    14.71M trainable parameters on the frozen 2B backbone; reported as the adaptation budget.
axioms (2)
  • domain assumption: A pretrained VLA backbone can remain frozen while a lightweight world module is adapted on top.
    Stated as the constrained adaptation budget setup.
  • domain assumption: Flow-matching loss can simultaneously supervise both the latent world stream and the action trajectory.
    Core modeling choice replacing a separate decoder.
invented entities (1)
  • Adaptive Attention Pooling (no independent evidence)
    purpose: Compress each visual frame into exactly one semantic token.
    New module introduced to achieve the one-token bandwidth reduction.
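To make the adaptation-budget entry concrete: LoRA adds r·(d_in + d_out) trainable parameters for each adapted weight matrix of shape (d_out, d_in). The shapes, rank, and module selection below are illustrative assumptions rather than the paper's reported configuration; they show only how a budget on the order of the reported 14.71M arises on a frozen multi-billion-parameter backbone.

```python
def lora_param_count(shapes: list[tuple[int, int]], rank: int) -> int:
    """Trainable parameters added by LoRA: for each adapted weight W of shape
    (d_out, d_in), the low-rank factors A (rank x d_in) and B (d_out x rank)
    contribute rank * (d_in + d_out) parameters."""
    return sum(rank * (d_in + d_out) for d_out, d_in in shapes)

# Hypothetical example (not the paper's config): adapt the q/k/v/o projections
# (2048 x 2048) in each of 24 transformer blocks of a ~2B backbone, at rank 16.
per_block = [(2048, 2048)] * 4
shapes = per_block * 24
print(lora_param_count(shapes, rank=16))  # 6291456 -- same order of magnitude as ~15M
```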

pith-pipeline@v0.9.0 · 5582 in / 1442 out tokens · 71358 ms · 2026-05-15T06:12:55.574732+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 27 internal anchors

  1. [1]

    Qwen3-vl technical report, 2025

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  2. [2]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

  3. [3]

    Revisiting Feature Prediction for Learning Visual Representations from Video

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024

  4. [4]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

  5. [5]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

  6. [6]

    π0: A vision-language-action flow model for general robot control, 2026

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A visi...

  7. [7]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

  8. [8]

    UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025

  9. [9]

    Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025

    Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Yuming Jiang, Jiayan Guo, Xin Li, Hao Luo, et al. Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025

  10. [10]

    WorldVLA: Towards Autoregressive Action World Model

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

  11. [11]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024

  12. [12]

    πRL: Online RL fine-tuning for flow-based vision-language-action models. arXiv preprint arXiv:2510.25889, 2025

    Kang Chen, Zhihao Liu, Tonghe Zhang, Zhen Guo, Si Xu, Hao Lin, Hongzhi Zang, Xiang Li, Quanlu Zhang, Zhaofei Yu, et al. πRL: Online RL fine-tuning for flow-based vision-language-action models. arXiv preprint arXiv:2510.25889, 2025

  13. [13]

    Decision transformer: Reinforcement learning via sequence modeling.Advances in neural information processing systems, 34:15084–15097, 2021

    Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34:15084–15097, 2021

  14. [14]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  15. [15]

    Learning universal policies via text-guided video generation.Advances in neural information processing systems, 36:9156–9172, 2023

    Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation.Advances in neural information processing systems, 36:9156–9172, 2023

  16. [16]

    Vla-0: Building state-of-the-art vlas with zero modification.arXiv preprint arXiv:2510.13054, 2025

    Ankit Goyal, Hugo Hadfield, Xuning Yang, Valts Blukis, and Fabio Ramos. Vla-0: Building state-of-the-art vlas with zero modification.arXiv preprint arXiv:2510.13054, 2025

  17. [17]

    Prediction with action: Visual policy learning via joint denoising process. Advances in Neural Information Processing Systems, 37:112386–112410, 2024

    Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, and Jianyu Chen. Prediction with action: Visual policy learning via joint denoising process. Advances in Neural Information Processing Systems, 37:112386–112410, 2024

  18. [18]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019

  19. [19]

    Training Agents Inside of Scalable World Models

    Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models.arXiv preprint arXiv:2509.24527, 2025

  20. [20]

    Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

  21. [21]

    Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations.arXiv preprint arXiv:2412.14803, 2024

  22. [22]

    Thinkact: Vision-language-action reasoning via reinforced visual latent planning, 2025

    Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, and Fu-En Yang. Thinkact: Vision-language-action reasoning via reinforced visual latent planning.arXiv preprint arXiv:2507.16815, 2025

  23. [23]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  24. [24]

    Dreamgen: Unlocking generalization in robot learning through video world models.arXiv preprint arXiv:2505.12705, 2025

    Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models.arXiv preprint arXiv:2505.12705, 2025

  25. [25]

    RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation

    Yuming Jiang, Siteng Huang, Shengke Xue, Yaxi Zhao, Jun Cen, Sicong Leng, Kehan Li, Jiayan Guo, Kexiang Wang, Mingxiu Chen, et al. Rynnvla-001: Using human demonstrations to improve robot manipulation.arXiv preprint arXiv:2509.15212, 2025

  26. [26]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

  27. [27]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  28. [28]

    A path towards autonomous machine intelligence version 0.9

    Yann LeCun. A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review, 62(1):1–62, 2022

  29. [29]

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024

  30. [30]

    Vision-language foundation models as effective robot imitators.arXiv preprint arXiv:2311.01378, 2023

    Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378, 2023

  31. [31]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

  32. [32]

    LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.arXiv preprint arXiv:2306.03310, 2023

  33. [33]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

  34. [34]

    Meta-World+: An Improved, Standardized, RL Benchmark

    Reginald McLean, Evangelos Chatzaroulas, Luc McCutcheon, Frank Röder, Tianhe Yu, Zhanpeng He, K.R. Zentner, Ryan Julian, J K Terry, Isaac Woungang, Nariman Farsad, and Pablo Samuel Castro. Meta-world+: An improved, standardized, RL benchmark. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

  35. [35]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint arXiv:2501.15830, 2025

  36. [36]

    Masked world models for visual control

    Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, and Pieter Abbeel. Masked world models for visual control. In Conference on Robot Learning, pages 1332–1344. PMLR, 2023

  37. [37]

    SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844, 2025

  38. [38]

    Efficient and generalized end-to-end autonomous driving system with latent deep reinforcement learning and demonstrations

    Zuojin Tang, Xiaoyu Chen, Yongqiang Li, and Jianyu Chen. Efficient and generalized end-to-end autonomous driving system with latent deep reinforcement learning and demonstrations. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 179–197. Springer, 2025

  39. [39]

    Vlascd: A visual language action model for simultaneous chatting and decision making

    Zuojin Tang, Bin Hu, Chenyang Zhao, De Ma, Gang Pan, and Bin Liu. Vlascd: A visual language action model for simultaneous chatting and decision making. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 9223–9243, 2025

  40. [40]

    Gemini Robotics: Bringing AI into the Physical World

    Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing AI into the physical world. arXiv preprint arXiv:2503.20020, 2025

  41. [41]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

  42. [42]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  43. [43]

    World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training

    Junjin Xiao, Yandan Yang, Xinyuan Chang, Ronghan Chen, Feng Xiong, Mu Xu, Wei-Shi Zheng, and Qing Zhang. World-env: Leveraging world model as a virtual environment for vla post-training.arXiv preprint arXiv:2509.24948, 2025

  44. [44]

    Vla-r1: Enhancing reasoning in vision-language-action models.arXiv preprint arXiv:2510.01623, 2025

    Angen Ye, Zeyu Zhang, Boyuan Wang, Xiaofeng Wang, Dapeng Zhang, and Zheng Zhu. Vla-r1: Enhancing reasoning in vision-language-action models.arXiv preprint arXiv:2510.01623, 2025

  45. [45]

    Janusvln: Decoupling semantics and spatiality with dual implicit memory for vision-language navigation,

    Shuang Zeng, Dekang Qi, Xinyuan Chang, Feng Xiong, Shichao Xie, Xiaolong Wu, Shiyi Liang, Mu Xu, and Xing Wei. Janusvln: Decoupling semantics and spatiality with dual implicit memory for vision-language navigation. arXiv preprint arXiv:2509.22548, 2025

  46. [46]

    Up-vla: A unified understanding and prediction model for embodied agent.arXiv preprint arXiv:2501.18867, 2025

    Jianke Zhang, Yanjiang Guo, Yucheng Hu, Xiaoyu Chen, Xiang Zhu, and Jianyu Chen. Up-vla: A unified understanding and prediction model for embodied agent.arXiv preprint arXiv:2501.18867, 2025

  47. [47]

    DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

    Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, Fan Lu, He Wang, et al. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge.arXiv preprint arXiv:2507.04447, 2025

  48. [48]

    Cot-vla: Visual chain-of-thought reasoning for vision-language-action models

    Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025

  49. [49]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023

  50. [50]

    3D-VLA: A 3D Vision-Language-Action Generative World Model

    Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-vla: A 3d vision-language-action generative world model.arXiv preprint arXiv:2403.09631, 2024

  51. [51]

    X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

    Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model.arXiv preprint arXiv:2510.10274, 2025

  52. [52]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023