arxiv: 2602.20200 · v2 · pith:ZUZD3XLBnew · submitted 2026-02-22 · 💻 cs.RO · cs.AI · cs.CV

Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation

Pith reviewed 2026-05-21 13:17 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV
keywords Vision-Language-Action models · robotic manipulation · dual-memory framework · global prior memory · local consistency memory · diffusion-based policy · inference efficiency

The pith

A dual-memory system replaces random noise with retrieved task priors and adds action-history constraints to make vision-language-action policies faster and more reliable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OptimusVLA, which augments hierarchical vision-language-action models with two memories to fix bottlenecks in action generation. Global Prior Memory pulls task-level priors from semantically similar past trajectories instead of starting from Gaussian noise, which shortens the denoising path and lowers the number of function evaluations. Local Consistency Memory tracks the sequence of already-executed actions to infer progress and enforce temporal smoothness. These changes produce higher success rates on standard manipulation benchmarks and real-robot suites while delivering nearly three times faster inference.
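The efficiency argument can be illustrated with a toy that is not the paper's policy: a fixed-step refinement loop that moves a sample a constant fraction of the remaining distance toward a target at each step. Every name here (`steps_to_converge`, the step size, the tolerance, the start points) is an assumption made for illustration; the only point is that a start closer to the target needs fewer steps, and each step would cost one function evaluation of a learned model.

```python
def steps_to_converge(start, target, step_size=0.2, tol=0.05, max_steps=1000):
    """Count refinement steps until `start` is within `tol` of `target`.

    Toy stand-in for a denoising loop (an assumption, not the paper's sampler):
    each step closes a fixed fraction of the remaining gap to the target.
    """
    x = list(start)
    for n in range(max_steps + 1):
        # Largest per-coordinate error between the sample and the target.
        err = max(abs(t - xi) for xi, t in zip(x, target))
        if err <= tol:
            return n
        x = [xi + step_size * (t - xi) for xi, t in zip(x, target)]
    return max_steps


target = [0.5, -0.2, 0.1]
far_start = [3.0, -3.0, 2.5]    # stands in for a draw from isotropic noise
near_start = [0.6, -0.1, 0.2]   # stands in for a retrieved task-level prior

far_steps = steps_to_converge(far_start, target)
near_steps = steps_to_converge(near_start, target)
print(far_steps, near_steps)
```

The toy says nothing about whether retrieved priors are actually close to the target distribution in practice; that is the empirical question the paper's benchmarks are meant to answer.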

Core claim

OptimusVLA replaces the isotropic Gaussian noise prior in the generative action policy with retrieved priors from a Global Prior Memory of semantically similar trajectories and augments the policy with a Local Consistency Memory that models executed action sequences to inject learned consistency constraints, thereby reducing denoising steps, improving temporal coherence, and raising success rates on manipulation tasks.

What carries the argument

Dual-memory framework with Global Prior Memory (GPM) that retrieves task-level priors to shorten the generative path and Local Consistency Memory (LCM) that enforces temporal coherence on the action sequence.
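A minimal sketch of what retrieval-based initialization could look like, assuming cosine similarity over trajectory embeddings and a fallback to Gaussian noise when no stored trajectory is similar enough. The function names, the flat-list action representation, the threshold, and the noise scale are all hypothetical; the paper's actual embedding model, library format, and retrieval rule are not described on this page.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def initial_sample(query, library, dim, threshold=0.8, sigma=0.1, rng=None):
    """Pick the starting point for the action denoiser.

    query:   embedding of the current task and observation (hypothetical)
    library: list of (key_embedding, action_chunk) pairs, where each
             action_chunk is a flat list of `dim` floats (hypothetical format)
    Returns (sample, used_prior).
    """
    rng = rng or random.Random()
    best_sim, best_chunk = -1.0, None
    for key, chunk in library:
        sim = cosine(query, key)
        if sim > best_sim:
            best_sim, best_chunk = sim, chunk
    if best_chunk is None or best_sim < threshold:
        # Fallback: no sufficiently similar trajectory, use isotropic noise.
        return [rng.gauss(0.0, 1.0) for _ in range(dim)], False
    # Retrieved prior: stored chunk perturbed by small Gaussian noise.
    return [a + sigma * rng.gauss(0.0, 1.0) for a in best_chunk], True


library = [([1.0, 0.0], [0.5, 0.5, 0.5]), ([0.0, 1.0], [-1.0, -1.0, -1.0])]
sample, used_prior = initial_sample([0.9, 0.1], library, dim=3)
```

The threshold makes the dependence on retrieval quality explicit: a query with no close neighbor in the library falls back to the standard noise prior and would be expected to lose the efficiency benefit.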

If this is right

  • OptimusVLA reaches 98.6 percent average success on the LIBERO benchmark.
  • It improves over the pi_0 baseline by 13.5 percent on the CALVIN benchmark.
  • It attains 38 percent success on the RoboTwin 2.0 Hard suite.
  • Real-world tests rank it best on generalization and long-horizon tasks while providing a 2.9× inference speedup.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the prior memory can be grown incrementally from the robot's own experience, the same dual-memory pattern might support continual learning with limited new data collection.
  • The separation of global semantic retrieval and local temporal constraint could transfer to other generative sequence models outside robotics.
  • Performance would likely degrade on tasks whose semantic signatures have no close neighbors in the stored library, revealing dependence on retrieval quality.

Load-bearing premise

A sufficiently large library of semantically searchable prior trajectories must exist and the retrieved priors must stay relevant and safe for the current scene.

What would settle it

Evaluating the model on a novel task that has no close semantic matches in the prior memory library and measuring whether the reported gains in success rate and inference speed vanish.

Figures

Figures reproduced from arXiv: 2602.20200 by Bing Hu, Dongmei Jiang, Gongwei Chen, Jianye Hao, Liqiang Nie, Pengwei Xie, Rui Shao, Zaijing Li.

Figure 1: Top: Comparison between the standard VLA architecture (left) and our proposed OptimusVLA (right). Middle: Illustration of how GPM (blue) and LCM (green) address two key limitations of existing VLA models: (i) Low inference efficiency due to a large prior–target gap. (ii) Poor robustness to temporal dependence. Bottom: Efficiency and performance comparison. na…
Figure 2: Overview of OptimusVLA framework. Given a task and the current observation, the Vision–Language backbone first encodes …
Figure 3: Real-world task setup and evaluation results. We evaluate the performance of OptimusVLA against OpenVLA …
Figure 5: Inference efficiency comparison on LIBERO and Real …
Figure 6: Qualitative results of OptimusVLA on simulation task and Real-World task.
Figure 7: Qualitative results of OptimusVLA on Real-World. From top to bottom, we illustrate four …
Figure 8: Qualitative results of OptimusVLA on Real-World. From top to bottom, we illustrate four …
Original abstract

Hierarchical Vision-Language-Action (VLA) models have rapidly become a dominant paradigm for robotic manipulation. It typically comprising a Vision-Language backbone for perception and understanding, together with a generative policy for action generation. However, its performance is increasingly bottlenecked by the action generation proceess. (i) Low inference efficiency. A pronounced distributional gap between isotropic noise priors and target action distributions, which increases denoising steps and the incidence of infeasible samples. (ii) Poor robustness. Existing policies condition solely on the current observation, neglecting the constraint of history sequence and thus lacking awareness of task progress and temporal consistency. To address these issues, we introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories, thereby shortening the generative path and reducing the umber of function evaluations (NFE). LCM dynamically models executed action sequence to infer task progress and injects a learned consistency constraint that enforces temporal coherence and smoothness of trajectory. Across three simulation benchmarks, OptimusVLA consistently outperforms strong baselines: it achieves 98.6% average success rate on LIBERO, improves over pi_0 by 13.5% on CALVIN, and attains 38% average success rate on RoboTwin 2.0 Hard. In Real-World evaluation, OptimusVLA ranks best on Generalization and Long-horizon suites, surpassing pi_0 by 42.9% and 52.4%, respectively, while delivering 2.9x inference speedup.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces OptimusVLA, a dual-memory Vision-Language-Action framework for robotic manipulation. Global Prior Memory (GPM) replaces isotropic Gaussian noise with task-level priors retrieved from semantically similar trajectories to shorten the denoising trajectory and reduce NFE. Local Consistency Memory (LCM) maintains a dynamic model of the executed action sequence to infer task progress and enforce temporal coherence. Empirical results claim 98.6% average success on LIBERO, a 13.5% improvement over pi_0 on CALVIN, 38% on RoboTwin 2.0 Hard, best-in-class real-world generalization and long-horizon performance, and a 2.9× inference speedup.

Significance. If the retrieval mechanism reliably supplies relevant priors and the consistency constraint is effective, the work could meaningfully advance inference efficiency in generative VLA policies without sacrificing robustness on long-horizon tasks. The multi-benchmark evaluation and real-world results provide a reasonable basis for practical impact in robotics, though the magnitude of the efficiency gain remains contingent on the quality and availability of the external prior library.

major comments (3)
  1. [§3.1] §3.1 (GPM construction): the description of how the prior library is built, including embedding model, database size, similarity metric, and retrieval threshold or fallback policy, is absent. This information is load-bearing for the central efficiency claim that retrieved priors reduce the distributional gap and deliver the reported 2.9× speedup; without it the advantage cannot be isolated from an external curated resource.
  2. [§4.2] §4.2 and §4.3 (results and ablations): no ablation isolates the contribution of GPM from LCM, and no error bars or statistical tests accompany the success-rate numbers (e.g., 98.6% on LIBERO, 13.5% gain on CALVIN). These omissions prevent attribution of gains to the dual-memory design and weaken confidence in the cross-benchmark superiority claims.
  3. [§3.2] §3.2 (LCM formulation): the precise mechanism by which the learned consistency constraint is injected into the denoising process (e.g., as an additional loss term, conditioning signal, or modified sampling schedule) is not formalized with equations, making it difficult to verify that temporal coherence is enforced without introducing new failure modes.
minor comments (3)
  1. [Abstract] Abstract: correct the typos “proceess” → “process” and “umber” → “number”.
  2. [Figures/Tables] Figure captions and tables should explicitly state whether reported numbers are means over multiple seeds or single runs; the current presentation leaves this ambiguous.
  3. [Implementation details] The manuscript should clarify whether the prior library is released with the code or remains proprietary, as this directly affects reproducibility of the efficiency results.
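The reporting the referee asks for (spread across seeds, plus uncertainty on a success rate) can be sketched with the standard library alone. This is an illustrative helper, not the authors' evaluation code; the function names are hypothetical and the numbers in the usage lines are placeholders, not results from the paper.

```python
import math
import statistics


def summarize_seeds(success_rates):
    """Mean and sample standard deviation of per-seed success rates."""
    return statistics.mean(success_rates), statistics.stdev(success_rates)


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Placeholder values for illustration only.
mean_rate, std_rate = summarize_seeds([0.90, 1.00, 0.95])
low, high = wilson_interval(successes=50, trials=100)
```

The Wilson interval is used here rather than the simple normal approximation because success rates near 0 or 1, common on saturated benchmarks, make the normal approximation unreliable.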

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas for improving reproducibility, empirical rigor, and formal clarity. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.

read point-by-point responses
  1. Referee: [§3.1] §3.1 (GPM construction): the description of how the prior library is built, including embedding model, database size, similarity metric, and retrieval threshold or fallback policy, is absent. This information is load-bearing for the central efficiency claim that retrieved priors reduce the distributional gap and deliver the reported 2.9× speedup; without it the advantage cannot be isolated from an external curated resource.

    Authors: We agree that these implementation details are essential for reproducibility and to properly attribute the efficiency gains. In the revised manuscript, we will expand §3.1 with a full description of the prior library, including the embedding model, database size, similarity metric (cosine similarity on trajectory embeddings), retrieval threshold, and fallback policy to standard Gaussian noise when no sufficiently similar prior is available. revision: yes

  2. Referee: [§4.2] §4.2 and §4.3 (results and ablations): no ablation isolates the contribution of GPM from LCM, and no error bars or statistical tests accompany the success-rate numbers (e.g., 98.6% on LIBERO, 13.5% gain on CALVIN). These omissions prevent attribution of gains to the dual-memory design and weaken confidence in the cross-benchmark superiority claims.

    Authors: We acknowledge that isolating the contributions of GPM and LCM is necessary to strengthen causal claims. We will add dedicated ablation experiments evaluating each component independently. We will also report error bars (standard deviation over multiple seeds) and include statistical significance tests for the key performance differences across benchmarks. revision: yes

  3. Referee: [§3.2] §3.2 (LCM formulation): the precise mechanism by which the learned consistency constraint is injected into the denoising process (e.g., as an additional loss term, conditioning signal, or modified sampling schedule) is not formalized with equations, making it difficult to verify that temporal coherence is enforced without introducing new failure modes.

    Authors: We agree that the injection mechanism requires explicit formalization. In the revised §3.2, we will add the mathematical formulation, including the equations that define how the learned consistency constraint is incorporated into the denoising objective or sampling procedure. revision: yes
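Of the options the referee lists for §3.2, the "additional loss term" reading is the easiest to make concrete. The sketch below is a hand-written smoothness penalty on consecutive actions, including the boundary between the last executed action and the first proposed one. It is an assumption for illustration only: the paper describes a learned consistency constraint, and the name `smoothness_penalty` and the list-of-vectors format are hypothetical.

```python
def smoothness_penalty(history, chunk):
    """Sum of squared differences between consecutive actions.

    history: executed action vectors, oldest first (may be empty)
    chunk:   candidate action vectors proposed for the next steps
    The last executed action is prepended so that a jump at the
    history/chunk boundary is penalized as well.
    """
    seq = history[-1:] + chunk
    total = 0.0
    for prev, cur in zip(seq, seq[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, cur))
    return total


executed = [[0.0, 0.0], [0.1, 0.0]]
smooth_chunk = [[0.2, 0.0], [0.3, 0.0]]
jumpy_chunk = [[0.9, 0.5], [0.0, -0.5]]
print(smoothness_penalty(executed, smooth_chunk), smoothness_penalty(executed, jumpy_chunk))
```

A fixed penalty like this cannot infer task progress, which is one reason a formal statement of the learned mechanism matters for checking the temporal-coherence claim.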

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on benchmark validation rather than self-referential derivation

full rationale

The paper introduces OptimusVLA with Global Prior Memory (GPM) and Local Consistency Memory (LCM) to improve VLA action generation efficiency and robustness. GPM replaces isotropic noise with retrieved task-level priors from semantically similar trajectories, while LCM enforces temporal consistency on action sequences. These are presented as architectural innovations whose benefits are demonstrated through empirical results on LIBERO (98.6% success), CALVIN (+13.5% over pi_0), RoboTwin, and real-world suites (2.9x speedup). No equations or first-principles derivations are shown that reduce by construction to fitted parameters, self-citations, or renamed inputs; the retrieval mechanism and consistency constraint are external design choices validated experimentally rather than tautological. The derivation chain is self-contained against external benchmarks with no load-bearing self-citation or definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The approach relies on the existence of a retrievable trajectory library and on the assumption that semantic similarity in language-vision space predicts useful action priors; these are domain assumptions rather than free parameters or new entities.

axioms (2)
  • domain assumption: A library of semantically similar past trajectories can be retrieved at inference time to serve as a better starting distribution than isotropic Gaussian noise.
    Invoked when GPM replaces the standard noise prior; no details on library construction or retrieval mechanism are given in the abstract.
  • domain assumption: Enforcing consistency on the executed action sequence improves robustness by providing awareness of task progress.
    Central justification for the LCM module.
invented entities (2)
  • Global Prior Memory (GPM) · no independent evidence
    purpose: Replace Gaussian noise with retrieved task-level priors to shorten the generative path.
    New architectural component introduced to address the distributional gap.
  • Local Consistency Memory (LCM) · no independent evidence
    purpose: Model recent actions to infer progress and enforce temporal coherence.
    New architectural component introduced to address lack of history awareness.

pith-pipeline@v0.9.0 · 5854 in / 1645 out tokens · 34329 ms · 2026-05-21T13:17:53.525266+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark

    cs.RO · 2026-05 · unverdicted · novelty 6.0

    RoboMemArena is a new large-scale robotic memory benchmark with real-world tasks, and PrediMem is a dual VLA system that outperforms baselines by managing memory buffers with predictive coding.

  2. ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

    cs.RO · 2026-05 · unverdicted · novelty 6.0

    ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 2 Pith papers · 25 internal anchors

  1. [1]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.pi 0: A vision-language- action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024. 1, 2, 3, 5, 6, 7

  2. [2]

    Towards synergistic, generalized, and efficient dual-system for robotic manipulation.arXiv preprint arXiv:2410.08001,

    Qingwen Bu, Hongyang Li, Li Chen, Jisong Cai, Jia Zeng, Heming Cui, Maoqing Yao, and Yu Qiao. Towards synergis- tic, generalized, and efficient dual-system for robotic manip- ulation.arXiv preprint arXiv:2410.08001, 2024. 5

  3. [3]

    UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent ac- tions.arXiv preprint arXiv:2505.06111, 2025. 3, 5

  4. [4]

    Lion: Empowering multimodal large language model with dual-level visual knowledge

    Gongwei Chen, Leyang Shen, Rui Shao, Xiang Deng, and Liqiang Nie. Lion: Empowering multimodal large language model with dual-level visual knowledge. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26540–26550, 2024. 1, 4

  5. [5]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data gen- erator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025. 1, 2, 5, 6, 3

  6. [6]

    Diffusion policy: Visuomotor policy learning via action dif- fusion.The International Journal of Robotics Research, 44 (10-11):1684–1704, 2025

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action dif- fusion.The International Journal of Robotics Research, 44 (10-11):1684–1704, 2025. 1, 2, 5, 6, 3

  7. [7]

    GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data

    Shengliang Deng, Mi Yan, Songlin Wei, Haixin Ma, Yuxin Yang, Jiayi Chen, Zhiqi Zhang, Taoyu Yang, Xuheng Zhang, Wenhao Zhang, et al. Graspvla: a grasping foundation model pre-trained on billion-scale synthetic action data.arXiv preprint arXiv:2505.03233, 2025. 2

  8. [8]

    Interleave-vla: Enhancing robot manipulation with interleaved image-text instructions.arXiv preprint arXiv:2505.02152, 2025

    Cunxin Fan, Xiaosong Jia, Yihang Sun, Yixiao Wang, Jianglan Wei, Ziyang Gong, Xiangyu Zhao, Masayoshi Tomizuka, Xue Yang, Junchi Yan, et al. Interleave-vla: En- hancing robot manipulation with interleaved image-text in- structions.arXiv preprint arXiv:2505.02152, 2025. 2, 3

  9. [9]

    Mamba: Linear-time sequence mod- eling with selective state spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence mod- eling with selective state spaces. InFirst conference on lan- guage modeling, 2024. 5

  10. [10]

    Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A general- ist robot policy with predictive visual representations.arXiv preprint arXiv:2412.14803, 2024. 5

  11. [11]

    An Embodied Generalist Agent in 3D World

    Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world.arXiv preprint arXiv:2311.12871, 2023. 2

  12. [12]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. pi05: a vision-language-action model with open-world generaliza- tion.arXiv preprint arXiv:2504.16054, 2025. 1, 2, 3, 5, 6, 7

  13. [13]

    Vima: General robot manip- ulation with multimodal prompts

    Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandku- mar, Yuke Zhu, and Linxi Fan. Vima: General robot manip- ulation with multimodal prompts. InNeurIPS 2022 Founda- tion Models for Decision Making Workshop, 2022. 1, 2

  14. [14]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024. 1, 2, 3, 5, 7

  15. [15]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and suc- cess.arXiv preprint arXiv:2502.19645, 2025. 2, 5, 6, 7

  16. [16]

    Star: Learning diverse robot skill ab- stractions through rotation-augmented vector quantization

    Hao Li, Qi Lv, Rui Shao, Xiang Deng, Yinchuan Li, Jianye Hao, and Liqiang Nie. Star: Learning diverse robot skill ab- stractions through rotation-augmented vector quantization. InInternational Conference on Machine Learning, 2025. 1

  17. [17]

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision- language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650,

  18. [18]

    Cogvla: Cognition-aligned vision-language-action model via instruction-driven routing & sparsification.arXiv preprint arXiv:2508.21046, 2025

    Wei Li, Renshan Zhang, Rui Shao, Jie He, and Liqiang Nie. Cogvla: Cognition-aligned vision-language-action model via instruction-driven routing & sparsification.arXiv preprint arXiv:2508.21046, 2025. 1, 2

  19. [19]

    Semanticvla: Semantic-aligned sparsification and enhancement for effi- cient robotic manipulation

    Wei Li, Renshan Zhang, Rui Shao, Zhijian Fang, Kai- wen Zhou, Zhuotao Tian, and Liqiang Nie. Semanticvla: Semantic-aligned sparsification and enhancement for effi- cient robotic manipulation. InProceedings of the AAAI Con- ference on Artificial Intelligence, 2026. 1

  20. [20]

    Vision-Language Foundation Models as Effective Robot Imitators

    Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators.arXiv preprint arXiv:2311.01378,

  21. [21]

    Optimus-1: Hybrid multimodal memory empowered agents excel in long-horizon tasks.arXiv preprint arXiv:2408.03615, 2024

    Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dong- mei Jiang, and Liqiang Nie. Optimus-1: Hybrid mul- timodal memory empowered agents excel in long-horizon tasks.arXiv preprint arXiv:2408.03615, 2024. 2

  22. [22]

    Optimus-3: Towards generalist multi- modal minecraft agents with scalable task experts.arXiv preprint arXiv:2506.10357, 2025f

    Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Weili Guan, Dongmei Jiang, Yaowei Wang, and Liqiang Nie. Optimus-3: Dual-router aligned mixture-of-experts agent with dual-granularity reasoning-aware policy optimization. arXiv preprint arXiv:2506.10357, 2025. 2

  23. [23]

    Optimus-2: Multimodal minecraft agent with goal-observation-action conditioned policy

    Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, and Liqiang Nie. Optimus-2: Multimodal minecraft agent with goal-observation-action conditioned policy. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), pages 9039–9049,

  24. [24]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil- ian Nickel, and Matt Le. Flow matching for generative mod- eling.arXiv preprint arXiv:2210.02747, 2022. 3, 4

  25. [25]

    Libero: Benchmarking knowl- edge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowl- edge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023. 1, 2, 5

  26. [26]

    Visual instruction tuning.Advances in neural information processing systems, 36, 2024

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36, 2024. 1, 4

  27. [27]

    Towards generalist robot policies: What mat- ters in building vision-language-action models

    Huaping Liu, Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, and Hanbo Zhang. Towards generalist robot policies: What mat- ters in building vision-language-action models. 2025. 5

  28. [28]

    Towards generalist robot policies: What mat- ters in building vision-language-action models

    Huaping Liu, Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, and Hanbo Zhang. Towards generalist robot policies: What mat- ters in building vision-language-action models. 2025. 2, 3

  29. [29]

    HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

    Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Ren- rui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al. Hybridvla: Collaborative diffu- sion and autoregression in a unified vision-language-action model.arXiv preprint arXiv:2503.10631, 2025. 2

  30. [30]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipu- lation.arXiv preprint arXiv:2410.07864, 2024. 6, 3

  31. [31]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022. 2, 3, 4

  32. [32]

    Puma: Layer-pruned language model for efficient unified multimodal retrieval with modality-adaptive learning

    Yibo Lyu, Rui Shao, Gongwei Chen, Yijie Zhu, Weili Guan, and Liqiang Nie. Puma: Layer-pruned language model for efficient unified multimodal retrieval with modality-adaptive learning. InProceedings of the 33rd ACM International Con- ference on Multimedia, pages 7653–7662, 2025. 2

  33. [33]

    Hierarchical diffusion policy for kinematics-aware multi- task robotic manipulation

    Xiao Ma, Sumit Patidar, Iain Haughton, and Stephen James. Hierarchical diffusion policy for kinematics-aware multi- task robotic manipulation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18081–18090, 2024. 2

  34. [34]

    Calvin: A benchmark for language- conditioned policy learning for long-horizon robot manip- ulation tasks.IEEE Robotics and Automation Letters, 7(3): 7327–7334, 2022

    Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wol- fram Burgard. Calvin: A benchmark for language- conditioned policy learning for long-horizon robot manip- ulation tasks.IEEE Robotics and Automation Letters, 7(3): 7327–7334, 2022. 1, 2, 5

  35. [35]

    Vision-based framework to estimate robot configuration and kinematic constraints

    Valerio Ortenzi, Naresh Marturi, Michael Mistry, Jef- frey Kuo, and Rustam Stolkin. Vision-based framework to estimate robot configuration and kinematic constraints. IEEE/ASME Transactions on Mechatronics, 23(5):2402– 2412, 2018. 2

  36. [36]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision- language-action models.arXiv preprint arXiv:2501.09747,

  37. [37]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial represen- tations for visual-language-action model.arXiv preprint arXiv:2501.15830, 2025. 1, 2, 5

  38. [38]

    Multi-adversarial discriminative deep domain generalization for face presentation attack detection

    Rui Shao, Xiangyuan Lan, Jiawei Li, and Pong C Yuen. Multi-adversarial discriminative deep domain generalization for face presentation attack detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10023–10031, 2019. 3

  39. [39]

    Detecting and grounding multi-modal media manipulation

    Rui Shao, Tianxing Wu, and Ziwei Liu. Detecting and grounding multi-modal media manipulation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 6904–6913, 2023

  40. [40]

    Detecting and grounding multi-modal media manip- ulation and beyond.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

    Rui Shao, Tianxing Wu, Jianlong Wu, Liqiang Nie, and Zi- wei Liu. Detecting and grounding multi-modal media manip- ulation and beyond.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 3

  41. [41]

    Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

    Rui Shao, Wei Li, Lingsen Zhang, Renshan Zhang, Zhiyang Liu, Ran Chen, and Liqiang Nie. Large vlm-based vision- language-action models for robotic manipulation: A survey. arXiv preprint arXiv:2508.13073, 2025. 1

  42. [42]

    MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

    Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, and Gao Huang. Memoryvla: Perceptual-cognitive memory in vision-language-action models for robotic manipulation. arXiv preprint arXiv:2508.19236, 2025. 2, 3, 4, 5

  43. [43]

    Accelerating vision-language-action model integrated with action chunking via parallel decoding.arXiv preprint arXiv:2503.02310, 2025

    Wenxuan Song, Jiayi Chen, Pengxiang Ding, Han Zhao, Wei Zhao, Zhide Zhong, Zongyuan Ge, Jun Ma, and Haoang Li. Accelerating vision-language-action model integrated with action chunking via parallel decoding.arXiv preprint arXiv:2503.02310, 2025. 1, 2

  44. [44]

    Reconvla: Reconstructive vision-language-action model as effective robot perceiver.arXiv preprint arXiv:2508.10333,

    Wenxuan Song, Ziyang Zhou, Han Zhao, Jiayi Chen, Pengx- iang Ding, Haodong Yan, Yuxin Huang, Feilong Tang, Donglin Wang, and Haoang Li. Reconvla: Reconstructive vision-language-action model as effective robot perceiver. arXiv preprint arXiv:2508.10333, 2025. 5

  45. [45]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024.

  46. [46]

    Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation

    Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, and Jiangmiao Pang. Predictive inverse dynamics models are scalable learners for robotic manipulation. arXiv preprint arXiv:2412.15109, 2024.

  47. [47]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. page 6000–6010,

  48. [48]

    Bitvla: 1-bit vision-language-action models for robotics manipulation

    Hongyu Wang, Chuyan Xiong, Ruiping Wang, and Xilin Chen. Bitvla: 1-bit vision-language-action models for robotics manipulation. arXiv preprint arXiv:2506.07530,

  49. [49]

    Vla-adapter: An effective paradigm for tiny-scale vision-language-action model

    Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, et al. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model. arXiv preprint arXiv:2509.09372, 2025.

  50. [50]

    Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation

    Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Zhibin Tang, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters, 2025.

  51. [51]

    3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

    Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. arXiv preprint arXiv:2403.03954, 2024.

  52. [52]

    UP-VLA: A unified understanding and prediction model for embodied agent

    Jianke Zhang, Yanjiang Guo, Yucheng Hu, Xiaoyu Chen, Xiang Zhu, and Jianyu Chen. Up-vla: A unified understanding and prediction model for embodied agent. arXiv preprint arXiv:2501.18867, 2025.

  53. [53]

    Flowpolicy: Enabling fast and robust 3d flow-based policy via consistency flow matching for robot manipulation

    Qinglun Zhang, Zhen Liu, Haoqiang Fan, Guanghui Liu, Bing Zeng, and Shuaicheng Liu. Flowpolicy: Enabling fast and robust 3d flow-based policy via consistency flow matching for robot manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 14754–14762, 2025.

  54. [54]

    Falcon: Resolving visual redundancy and fragmentation in high-resolution multimodal large language models via visual registers

    Renshan Zhang, Rui Shao, Gongwei Chen, Miao Zhang, Kaiwen Zhou, Weili Guan, and Liqiang Nie. Falcon: Resolving visual redundancy and fragmentation in high-resolution multimodal large language models via visual registers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23530–23540, 2025.

  55. [55]

    DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

    Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, Fan Lu, He Wang, et al. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge. arXiv preprint arXiv:2507.04447, 2025.

  56. [56]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023.

  57. [57]

    TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

    Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345, 2024.

  58. [58]

    Hiconagent: History context-aware policy optimization for gui agents

    Xurui Zhou, Gongwei Chen, Yuquan Xie, Zaijing Li, Kaiwen Zhou, Shuai Wang, Shuo Yang, Zhuotao Tian, and Rui Shao. Hiconagent: History context-aware policy optimization for gui agents. arXiv preprint arXiv:2512.01763, 2025.

  59. [59]

    H-gar: A hierarchical interaction framework via goal-driven observation-action refinement for robotic manipulation

    Yijie Zhu, Rui Shao, Ziyang Liu, Jie He, Jizhihui Liu, Jiuru Wang, and Zitong Yu. H-gar: A hierarchical interaction framework via goal-driven observation-action refinement for robotic manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026.

  60. [60]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.