pith · machine review for the scientific record

arxiv: 2511.18960 · v3 · submitted 2025-11-24 · 💻 cs.LG · cs.CV · cs.RO

Recognition: no theorem link

AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 06:23 UTC · model grok-4.3

classification 💻 cs.LG · cs.CV · cs.RO

keywords vision-language-action · active visual attention · recurrent state · partially observable Markov decision process · robot manipulation · embodied policy learning · visual token reweighting

The pith

Conditioning VLA actions on a recurrent history state and active visual attention improves robotic manipulation performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most vision-language-action models process each visual observation independently at every timestep. The paper argues this design ignores the history required for real robot tasks, which are only partially observable. AVA-VLA instead maintains a recurrent neural state that approximates the agent's belief over past interactions. It then applies active visual attention to reweight visual tokens in the current view according to both the language instruction and that history. If the approach holds, it produces stronger results on standard benchmarks and better transfer to physical dual-arm tasks.

Core claim

Reformulating VLA policy learning from a partially observable Markov decision process perspective enables improved action generation in robotic sequential decision-making. The reformulation pairs a recurrent state, serving as a neural approximation to the agent's belief over task history, with active visual attention that dynamically reweights visual tokens based on the instruction and execution history.

What carries the argument

Active Visual Attention (AVA), which uses the recurrent state and instruction to dynamically reweight visual tokens from the current observation and focus on regions most relevant to the task history.

Load-bearing premise

A recurrent neural state provides a sufficient approximation to the agent's belief over task history, and dynamically reweighting visual tokens based on this state and the instruction improves action generation.
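This premise can be made concrete with a minimal sketch: a GRU-like recurrent update plus softmax reweighting of visual tokens conditioned on the belief state and instruction. All parameter names, dimensions, and the specific update rule below are illustrative, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tokens = 8, 16  # embedding dim, visual tokens per frame (toy sizes)

# Illustrative parameters (not the paper's):
W_score = 0.1 * rng.standard_normal((d, d))  # token-vs-context relevance
W_h = 0.1 * rng.standard_normal((d, d))      # recurrent transition
W_x = 0.1 * rng.standard_normal((d, d))      # attended-input projection

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ava_step(h, instr, tokens):
    """One timestep: score each visual token against the (belief, instruction)
    context, reweight the tokens, then fold the attended view into the
    recurrent state."""
    context = h + instr                       # combine belief and instruction
    weights = softmax(tokens @ W_score @ context)  # soft attention over tokens
    attended = weights @ tokens               # history-aware visual summary
    h_next = np.tanh(W_h @ h + W_x @ attended)     # recurrent belief update
    return h_next, weights

h, instr = np.zeros(d), rng.standard_normal(d)
for _ in range(5):                            # roll over a short stream
    h, w = ava_step(h, instr, rng.standard_normal((n_tokens, d)))
```

Because the weights depend on h, the same frame can be attended differently at different points in a task, which is exactly the behavior the soft-weight visualizations in the paper's figures track.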

What would settle it

A controlled ablation on the LIBERO or CALVIN benchmarks that removes the recurrent state and active visual attention and finds equal or higher success rates would show the additions do not drive the reported gains.

Figures

Figures reproduced from arXiv: 2511.18960 by Feiyang Ye, Jifeng Li, Jingjing Qian, Jing Zhang, Juntao Gao, Lei Xiao, Xiaoyuan Yu, Yan Jin, Yong Wu.

Figure 1: (a) Graphical models of the proposed AVA-VLA …
Figure 3: Comparison on the Mobile ALOHA real-robot experiments. Evaluation across four manipulation tasks …
Figure 4: Visualization of the proposed AVA-VLA's manipulation process on four long-horizon real-world tasks …
Figure 5: Visual dynamics of the soft weights during the task "put both moka pots on the stove" in two viewpoints.
Figure 6: Visual dynamics of the soft weights during the task "put yellow banana into bucket" in three viewpoints …
Figure 7: Visual dynamics of the soft weights during the task "scoop sesame into bowl" in three viewpoints …
Figure 8: Visual dynamics of the soft weights during the continuous tasks "Lift red block table" and "Place in slider" …
Figure 9: Visual dynamics of the soft weights during the task "put the black bowl in the bottom drawer of the cabinet" …
Figure 10: Visual dynamics of the soft weights during the task "put the yellow and white mug in the microwave" …
Original abstract

Vision-Language-Action (VLA) models have shown remarkable progress in embodied tasks recently, but most methods process visual observations independently at each timestep. This history-agnostic design treats robot manipulation as a Markov Decision Process, even though real-world robotic control is inherently partially observable and requires reasoning over past interactions. To address this mismatch, we reformulate VLA policy learning from a Partially Observable Markov Decision Process perspective and propose AVA-VLA, a framework that conditions action generation on a recurrent state that serves as a neural approximation to the agent's belief over task history. Built on this recurrent state, we introduce Active Visual Attention (AVA), which dynamically reweights visual tokens in the current observation to focus on regions most relevant given both the instruction and execution history. Extensive experiments show that AVA-VLA achieves state-of-the-art performance on standard robotic benchmarks, including LIBERO and CALVIN, and transfers effectively to real-world dual-arm manipulation tasks. These results demonstrate the effectiveness of temporally grounded active visual processing for improving VLA performance in robotic sequential decision-making. The project page is available at https://liauto-dsr.github.io/AVA-VLA-Page.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that by reformulating Vision-Language-Action (VLA) models as Partially Observable Markov Decision Processes (POMDPs) and using a recurrent neural state to approximate the agent's belief over task history, their proposed Active Visual Attention (AVA) mechanism can dynamically reweight visual tokens to improve action generation. They report achieving state-of-the-art performance on the LIBERO and CALVIN benchmarks and effective transfer to real-world dual-arm robotic manipulation tasks.

Significance. If the central results hold under scrutiny, this contribution would be significant as it directly tackles the partial observability issue in robotic control that standard VLA models ignore by processing each timestep independently. The combination of recurrent belief approximation and instruction-conditioned active attention represents a logical extension of current architectures, and successful real-world transfer would strengthen the case for such temporally grounded approaches in embodied AI. The work provides a clear framework that could be built upon for more complex, long-horizon tasks.

major comments (2)
  1. [§3.2] §3.2: The recurrent state is introduced as a neural approximation to the belief state in the POMDP formulation, but the manuscript provides no empirical or theoretical analysis showing that this state retains relevant history information over the multi-step horizons present in the LIBERO and CALVIN tasks; without such validation, the performance improvements attributed to AVA may not stem from the intended mechanism.
  2. [§4.2] §4.2: In the ablation studies, while removing AVA degrades performance, there is no control experiment that uses a non-recurrent memory (e.g., a fixed-size buffer of past features) with equivalent parameter count; this leaves open the possibility that gains are due to additional capacity rather than the specific recurrent belief approximation.
minor comments (2)
  1. [Figure 4] Figure 4: The real-world experiment figures would benefit from including failure cases or attention maps to provide more insight into when the method succeeds or fails.
  2. [Notation] Notation: The notation for the recurrent state h_t and the attention weights should be consistently defined across equations to avoid confusion.
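One way to run the validation requested in the first major comment is a history-truncation probe: mask observations older than a cutoff and measure how far the final recurrent state drifts from the full-history run. A toy sketch, with random weights standing in for the trained model:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # toy state/observation dimension
W_h = 0.3 * rng.standard_normal((d, d))  # illustrative recurrent weights
W_x = 0.3 * rng.standard_normal((d, d))  # illustrative input weights

def roll(obs_seq, mask_before=0):
    """Run the recurrent update, zeroing all observations earlier than
    timestep `mask_before` to simulate truncated history."""
    h = np.zeros(d)
    for t, x in enumerate(obs_seq):
        x_t = x if t >= mask_before else np.zeros(d)
        h = np.tanh(W_h @ h + W_x @ x_t)
    return h

obs = [rng.standard_normal(d) for _ in range(10)]
full = roll(obs)
# Drift of the final belief state as progressively more history is masked:
drift = [np.linalg.norm(full - roll(obs, k)) for k in range(11)]
```

If the recurrent state genuinely carries long-horizon information, drift should grow as the cutoff moves later; a flat curve would suggest the gains come from somewhere other than history retention.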

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of the work's significance and for the constructive major comments. We address each point below and commit to revisions that directly respond to the concerns raised.

Point-by-point responses
  1. Referee: [§3.2] §3.2: The recurrent state is introduced as a neural approximation to the belief state in the POMDP formulation, but the manuscript provides no empirical or theoretical analysis showing that this state retains relevant history information over the multi-step horizons present in the LIBERO and CALVIN tasks; without such validation, the performance improvements attributed to AVA may not stem from the intended mechanism.

    Authors: We agree that the manuscript would benefit from explicit validation that the recurrent state functions as an effective belief approximation over the relevant horizons. In the revision we will add an analysis section that probes the recurrent state, for example by measuring how its hidden activations change when historical observations are masked or truncated at different lengths, and by reporting task performance as a function of history length on LIBERO and CALVIN. This will provide direct evidence that the state retains task-relevant information and that the AVA improvements are tied to this mechanism. revision: yes

  2. Referee: [§4.2] §4.2: In the ablation studies, while removing AVA degrades performance, there is no control experiment that uses a non-recurrent memory (e.g., a fixed-size buffer of past features) with equivalent parameter count; this leaves open the possibility that gains are due to additional capacity rather than the specific recurrent belief approximation.

    Authors: We acknowledge that the current ablations do not include a capacity-matched non-recurrent baseline, which leaves the source of the gains partially ambiguous. We will add this control experiment in the revised manuscript: a fixed-size buffer of past visual features (with the same total parameter count as the recurrent module) that is concatenated or attended to in the same manner as the recurrent state. Performance differences between this baseline and the recurrent version will isolate the benefit attributable to the recurrent belief approximation. revision: yes
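The control the authors commit to is straightforward to specify: a fixed-size FIFO of past features, pooled into a context vector with no learned recurrence. A minimal sketch (class name and mean pooling are our choices; in practice the downstream head, not the buffer itself, would be capacity-matched to the recurrent module):

```python
from collections import deque
import numpy as np

class BufferMemory:
    """Non-recurrent control: a fixed-size FIFO of past observation
    features, pooled into a context vector in place of a recurrent state."""
    def __init__(self, size, d):
        self.buf = deque(maxlen=size)  # oldest entries evicted automatically
        self.d = d

    def update(self, feat):
        self.buf.append(feat)

    def context(self):
        if not self.buf:
            return np.zeros(self.d)
        return np.mean(self.buf, axis=0)  # parameter-free pooling

mem = BufferMemory(size=4, d=3)
for t in range(6):
    mem.update(np.full(3, float(t)))
# Only the last 4 features (t = 2..5) remain; their mean is 3.5 per dim.
```

Feeding this context into the same attention mechanism as the recurrent state would isolate how much of the gain is attributable to recurrence rather than to having any history signal at all.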

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper reformulates VLA policy learning as a POMDP and introduces a recurrent neural state as a standard approximation to the belief over task history, then builds Active Visual Attention on top of it. No equations, derivations, or self-citations are shown that reduce any central claim or prediction to fitted parameters or prior inputs by construction. Performance is evaluated on external benchmarks (LIBERO, CALVIN) and real-world tasks, rendering the approach self-contained and falsifiable without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard domain assumptions from robotics and RL literature plus the unverified empirical claim of SOTA performance.

axioms (1)
  • domain assumption Real-world robotic control is inherently partially observable and requires reasoning over past interactions.
    Explicitly stated in the abstract as the motivation for the POMDP reformulation.

pith-pipeline@v0.9.0 · 5532 in / 1124 out tokens · 25684 ms · 2026-05-17T06:23:12.819309+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · cited by 1 Pith paper · 18 internal anchors
