Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation
Pith reviewed 2026-05-21 13:17 UTC · model grok-4.3
The pith
A dual-memory system replaces random noise with retrieved task priors and adds action-history constraints to make vision-language-action policies faster and more reliable.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OptimusVLA replaces the isotropic Gaussian noise prior in the generative action policy with retrieved priors from a Global Prior Memory of semantically similar trajectories and augments the policy with a Local Consistency Memory that models executed action sequences to inject learned consistency constraints, thereby reducing denoising steps, improving temporal coherence, and raising success rates on manipulation tasks.
What carries the argument
Dual-memory framework with Global Prior Memory (GPM) that retrieves task-level priors to shorten the generative path and Local Consistency Memory (LCM) that enforces temporal coherence on the action sequence.
If this is right
- OptimusVLA reaches 98.6 percent average success on the LIBERO benchmark.
- It improves over the pi_0 baseline by 13.5 percent on the CALVIN benchmark.
- It attains 38 percent success on the RoboTwin 2.0 Hard suite.
- Real-world tests rank it best on generalization and long-horizon tasks while providing 2.9 times inference speedup.
Where Pith is reading between the lines
- If the prior memory can be grown incrementally from the robot's own experience, the same dual-memory pattern might support continual learning with limited new data collection.
- The separation of global semantic retrieval and local temporal constraint could transfer to other generative sequence models outside robotics.
- Performance would likely degrade on tasks whose semantic signatures have no close neighbors in the stored library, revealing dependence on retrieval quality.
Load-bearing premise
A sufficiently large library of semantically searchable prior trajectories must exist and the retrieved priors must stay relevant and safe for the current scene.
What would settle it
Evaluating the model on a novel task that has no close semantic matches in the prior memory library and measuring whether the reported gains in success rate and inference speed vanish.
Figures
read the original abstract
Hierarchical Vision-Language-Action (VLA) models have rapidly become a dominant paradigm for robotic manipulation. It typically comprising a Vision-Language backbone for perception and understanding, together with a generative policy for action generation. However, its performance is increasingly bottlenecked by the action generation proceess. (i) Low inference efficiency. A pronounced distributional gap between isotropic noise priors and target action distributions, which increases denoising steps and the incidence of infeasible samples. (ii) Poor robustness. Existing policies condition solely on the current observation, neglecting the constraint of history sequence and thus lacking awareness of task progress and temporal consistency. To address these issues, we introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories, thereby shortening the generative path and reducing the umber of function evaluations (NFE). LCM dynamically models executed action sequence to infer task progress and injects a learned consistency constraint that enforces temporal coherence and smoothness of trajectory. Across three simulation benchmarks, OptimusVLA consistently outperforms strong baselines: it achieves 98.6% average success rate on LIBERO, improves over pi_0 by 13.5% on CALVIN, and attains 38% average success rate on RoboTwin 2.0 Hard. In Real-World evaluation, OptimusVLA ranks best on Generalization and Long-horizon suites, surpassing pi_0 by 42.9% and 52.4%, respectively, while delivering 2.9x inference speedup.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces OptimusVLA, a dual-memory Vision-Language-Action framework for robotic manipulation. Global Prior Memory (GPM) replaces isotropic Gaussian noise with task-level priors retrieved from semantically similar trajectories to shorten the denoising trajectory and reduce NFE. Local Consistency Memory (LCM) maintains a dynamic model of the executed action sequence to infer task progress and enforce temporal coherence. Empirical results claim 98.6% average success on LIBERO, a 13.5% improvement over pi_0 on CALVIN, 38% on RoboTwin 2.0 Hard, best-in-class real-world generalization and long-horizon performance, and a 2.9× inference speedup.
Significance. If the retrieval mechanism reliably supplies relevant priors and the consistency constraint is effective, the work could meaningfully advance inference efficiency in generative VLA policies without sacrificing robustness on long-horizon tasks. The multi-benchmark evaluation and real-world results provide a reasonable basis for practical impact in robotics, though the magnitude of the efficiency gain remains contingent on the quality and availability of the external prior library.
major comments (3)
- [§3.1] §3.1 (GPM construction): the description of how the prior library is built, including embedding model, database size, similarity metric, and retrieval threshold or fallback policy, is absent. This information is load-bearing for the central efficiency claim that retrieved priors reduce the distributional gap and deliver the reported 2.9× speedup; without it the advantage cannot be isolated from an external curated resource.
- [§4.2] §4.2 and §4.3 (results and ablations): no ablation isolates the contribution of GPM from LCM, and no error bars or statistical tests accompany the success-rate numbers (e.g., 98.6% on LIBERO, 13.5% gain on CALVIN). These omissions prevent attribution of gains to the dual-memory design and weaken confidence in the cross-benchmark superiority claims.
- [§3.2] §3.2 (LCM formulation): the precise mechanism by which the learned consistency constraint is injected into the denoising process (e.g., as an additional loss term, conditioning signal, or modified sampling schedule) is not formalized with equations, making it difficult to verify that temporal coherence is enforced without introducing new failure modes.
minor comments (3)
- [Abstract] Abstract: correct the typos “proceess” → “process” and “umber” → “number”.
- [Figures/Tables] Figure captions and tables should explicitly state whether reported numbers are means over multiple seeds or single runs; the current presentation leaves this ambiguous.
- [Implementation details] The manuscript should clarify whether the prior library is released with the code or remains proprietary, as this directly affects reproducibility of the efficiency results.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important areas for improving reproducibility, empirical rigor, and formal clarity. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.
read point-by-point responses
-
Referee: [§3.1] §3.1 (GPM construction): the description of how the prior library is built, including embedding model, database size, similarity metric, and retrieval threshold or fallback policy, is absent. This information is load-bearing for the central efficiency claim that retrieved priors reduce the distributional gap and deliver the reported 2.9× speedup; without it the advantage cannot be isolated from an external curated resource.
Authors: We agree that these implementation details are essential for reproducibility and to properly attribute the efficiency gains. In the revised manuscript, we will expand §3.1 with a full description of the prior library, including the embedding model, database size, similarity metric (cosine similarity on trajectory embeddings), retrieval threshold, and fallback policy to standard Gaussian noise when no sufficiently similar prior is available. revision: yes
-
Referee: [§4.2] §4.2 and §4.3 (results and ablations): no ablation isolates the contribution of GPM from LCM, and no error bars or statistical tests accompany the success-rate numbers (e.g., 98.6% on LIBERO, 13.5% gain on CALVIN). These omissions prevent attribution of gains to the dual-memory design and weaken confidence in the cross-benchmark superiority claims.
Authors: We acknowledge that isolating the contributions of GPM and LCM is necessary to strengthen causal claims. We will add dedicated ablation experiments evaluating each component independently. We will also report error bars (standard deviation over multiple seeds) and include statistical significance tests for the key performance differences across benchmarks. revision: yes
-
Referee: [§3.2] §3.2 (LCM formulation): the precise mechanism by which the learned consistency constraint is injected into the denoising process (e.g., as an additional loss term, conditioning signal, or modified sampling schedule) is not formalized with equations, making it difficult to verify that temporal coherence is enforced without introducing new failure modes.
Authors: We agree that the injection mechanism requires explicit formalization. In the revised §3.2, we will add the mathematical formulation, including the equations that define how the learned consistency constraint is incorporated into the denoising objective or sampling procedure. revision: yes
Circularity Check
No significant circularity; empirical claims rest on benchmark validation rather than self-referential derivation
full rationale
The paper introduces OptimusVLA with Global Prior Memory (GPM) and Local Consistency Memory (LCM) to improve VLA action generation efficiency and robustness. GPM replaces isotropic noise with retrieved task-level priors from semantically similar trajectories, while LCM enforces temporal consistency on action sequences. These are presented as architectural innovations whose benefits are demonstrated through empirical results on LIBERO (98.6% success), CALVIN (+13.5% over pi_0), RoboTwin, and real-world suites (2.9x speedup). No equations or first-principles derivations are shown that reduce by construction to fitted parameters, self-citations, or renamed inputs; the retrieval mechanism and consistency constraint are external design choices validated experimentally rather than tautological. The derivation chain is self-contained against external benchmarks with no load-bearing self-citation or definitional loop.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption A library of semantically similar past trajectories can be retrieved at inference time to serve as a better starting distribution than isotropic Gaussian noise.
- domain assumption Enforcing consistency on the executed action sequence improves robustness by providing awareness of task progress.
invented entities (2)
-
Global Prior Memory (GPM)
no independent evidence
-
Local Consistency Memory (LCM)
no independent evidence
Forward citations
Cited by 2 Pith papers
-
RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark
RoboMemArena is a new large-scale robotic memory benchmark with real-world tasks, and PrediMem is a dual VLA system that outperforms baselines by managing memory buffers with predictive coding.
-
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
Reference graph
Works this paper leans on
-
[1]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.pi 0: A vision-language- action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024. 1, 2, 3, 5, 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[2]
Qingwen Bu, Hongyang Li, Li Chen, Jisong Cai, Jia Zeng, Heming Cui, Maoqing Yao, and Yu Qiao. Towards synergis- tic, generalized, and efficient dual-system for robotic manip- ulation.arXiv preprint arXiv:2410.08001, 2024. 5
-
[3]
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent ac- tions.arXiv preprint arXiv:2505.06111, 2025. 3, 5
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Lion: Empowering multimodal large language model with dual-level visual knowledge
Gongwei Chen, Leyang Shen, Rui Shao, Xiang Deng, and Liqiang Nie. Lion: Empowering multimodal large language model with dual-level visual knowledge. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26540–26550, 2024. 1, 4
work page 2024
-
[5]
Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data gen- erator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025. 1, 2, 5, 6, 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action dif- fusion.The International Journal of Robotics Research, 44 (10-11):1684–1704, 2025. 1, 2, 5, 6, 3
work page 2025
-
[7]
GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data
Shengliang Deng, Mi Yan, Songlin Wei, Haixin Ma, Yuxin Yang, Jiayi Chen, Zhiqi Zhang, Taoyu Yang, Xuheng Zhang, Wenhao Zhang, et al. Graspvla: a grasping foundation model pre-trained on billion-scale synthetic action data.arXiv preprint arXiv:2505.03233, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
Cunxin Fan, Xiaosong Jia, Yihang Sun, Yixiao Wang, Jianglan Wei, Ziyang Gong, Xiangyu Zhao, Masayoshi Tomizuka, Xue Yang, Junchi Yan, et al. Interleave-vla: En- hancing robot manipulation with interleaved image-text in- structions.arXiv preprint arXiv:2505.02152, 2025. 2, 3
-
[9]
Mamba: Linear-time sequence mod- eling with selective state spaces
Albert Gu and Tri Dao. Mamba: Linear-time sequence mod- eling with selective state spaces. InFirst conference on lan- guage modeling, 2024. 5
work page 2024
-
[10]
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A general- ist robot policy with predictive visual representations.arXiv preprint arXiv:2412.14803, 2024. 5
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[11]
An Embodied Generalist Agent in 3D World
Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world.arXiv preprint arXiv:2311.12871, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[12]
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. pi05: a vision-language-action model with open-world generaliza- tion.arXiv preprint arXiv:2504.16054, 2025. 1, 2, 3, 5, 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[13]
Vima: General robot manip- ulation with multimodal prompts
Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandku- mar, Yuke Zhu, and Linxi Fan. Vima: General robot manip- ulation with multimodal prompts. InNeurIPS 2022 Founda- tion Models for Decision Making Workshop, 2022. 1, 2
work page 2022
-
[14]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024. 1, 2, 3, 5, 7
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[15]
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and suc- cess.arXiv preprint arXiv:2502.19645, 2025. 2, 5, 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
Star: Learning diverse robot skill ab- stractions through rotation-augmented vector quantization
Hao Li, Qi Lv, Rui Shao, Xiang Deng, Yinchuan Li, Jianye Hao, and Liqiang Nie. Star: Learning diverse robot skill ab- stractions through rotation-augmented vector quantization. InInternational Conference on Machine Learning, 2025. 1
work page 2025
-
[17]
Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision- language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650,
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
Wei Li, Renshan Zhang, Rui Shao, Jie He, and Liqiang Nie. Cogvla: Cognition-aligned vision-language-action model via instruction-driven routing & sparsification.arXiv preprint arXiv:2508.21046, 2025. 1, 2
-
[19]
Semanticvla: Semantic-aligned sparsification and enhancement for effi- cient robotic manipulation
Wei Li, Renshan Zhang, Rui Shao, Zhijian Fang, Kai- wen Zhou, Zhuotao Tian, and Liqiang Nie. Semanticvla: Semantic-aligned sparsification and enhancement for effi- cient robotic manipulation. InProceedings of the AAAI Con- ference on Artificial Intelligence, 2026. 1
work page 2026
-
[20]
Vision-Language Foundation Models as Effective Robot Imitators
Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators.arXiv preprint arXiv:2311.01378,
work page internal anchor Pith review Pith/arXiv arXiv
-
[21]
Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dong- mei Jiang, and Liqiang Nie. Optimus-1: Hybrid mul- timodal memory empowered agents excel in long-horizon tasks.arXiv preprint arXiv:2408.03615, 2024. 2
-
[22]
Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Weili Guan, Dongmei Jiang, Yaowei Wang, and Liqiang Nie. Optimus-3: Dual-router aligned mixture-of-experts agent with dual-granularity reasoning-aware policy optimization. arXiv preprint arXiv:2506.10357, 2025. 2
-
[23]
Optimus-2: Multimodal minecraft agent with goal-observation-action conditioned policy
Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, and Liqiang Nie. Optimus-2: Multimodal minecraft agent with goal-observation-action conditioned policy. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), pages 9039–9049,
-
[24]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil- ian Nickel, and Matt Le. Flow matching for generative mod- eling.arXiv preprint arXiv:2210.02747, 2022. 3, 4
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[25]
Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowl- edge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023. 1, 2, 5
work page 2023
-
[26]
Visual instruction tuning.Advances in neural information processing systems, 36, 2024
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36, 2024. 1, 4
work page 2024
-
[27]
Towards generalist robot policies: What mat- ters in building vision-language-action models
Huaping Liu, Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, and Hanbo Zhang. Towards generalist robot policies: What mat- ters in building vision-language-action models. 2025. 5
work page 2025
-
[28]
Towards generalist robot policies: What mat- ters in building vision-language-action models
Huaping Liu, Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, and Hanbo Zhang. Towards generalist robot policies: What mat- ters in building vision-language-action models. 2025. 2, 3
work page 2025
-
[29]
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Ren- rui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al. Hybridvla: Collaborative diffu- sion and autoregression in a unified vision-language-action model.arXiv preprint arXiv:2503.10631, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[30]
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipu- lation.arXiv preprint arXiv:2410.07864, 2024. 6, 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[31]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022. 2, 3, 4
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[32]
Yibo Lyu, Rui Shao, Gongwei Chen, Yijie Zhu, Weili Guan, and Liqiang Nie. Puma: Layer-pruned language model for efficient unified multimodal retrieval with modality-adaptive learning. InProceedings of the 33rd ACM International Con- ference on Multimedia, pages 7653–7662, 2025. 2
work page 2025
-
[33]
Hierarchical diffusion policy for kinematics-aware multi- task robotic manipulation
Xiao Ma, Sumit Patidar, Iain Haughton, and Stephen James. Hierarchical diffusion policy for kinematics-aware multi- task robotic manipulation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18081–18090, 2024. 2
work page 2024
-
[34]
Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wol- fram Burgard. Calvin: A benchmark for language- conditioned policy learning for long-horizon robot manip- ulation tasks.IEEE Robotics and Automation Letters, 7(3): 7327–7334, 2022. 1, 2, 5
work page 2022
-
[35]
Vision-based framework to estimate robot configuration and kinematic constraints
Valerio Ortenzi, Naresh Marturi, Michael Mistry, Jef- frey Kuo, and Rustam Stolkin. Vision-based framework to estimate robot configuration and kinematic constraints. IEEE/ASME Transactions on Mechatronics, 23(5):2402– 2412, 2018. 2
work page 2018
-
[36]
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision- language-action models.arXiv preprint arXiv:2501.09747,
work page internal anchor Pith review Pith/arXiv arXiv
-
[37]
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial represen- tations for visual-language-action model.arXiv preprint arXiv:2501.15830, 2025. 1, 2, 5
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[38]
Multi-adversarial discriminative deep domain generalization for face presentation attack detection
Rui Shao, Xiangyuan Lan, Jiawei Li, and Pong C Yuen. Multi-adversarial discriminative deep domain generalization for face presentation attack detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10023–10031, 2019. 3
work page 2019
-
[39]
Detecting and grounding multi-modal media manipulation
Rui Shao, Tianxing Wu, and Ziwei Liu. Detecting and grounding multi-modal media manipulation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 6904–6913, 2023
work page 2023
-
[40]
Rui Shao, Tianxing Wu, Jianlong Wu, Liqiang Nie, and Zi- wei Liu. Detecting and grounding multi-modal media manip- ulation and beyond.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 3
work page 2024
-
[41]
Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey
Rui Shao, Wei Li, Lingsen Zhang, Renshan Zhang, Zhiyang Liu, Ran Chen, and Liqiang Nie. Large vlm-based vision- language-action models for robotic manipulation: A survey. arXiv preprint arXiv:2508.13073, 2025. 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[42]
MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation
Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, and Gao Huang. Memoryvla: Perceptual-cognitive memory in vision-language-action models for robotic manipulation. arXiv preprint arXiv:2508.19236, 2025. 2, 3, 4, 5
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[43]
Wenxuan Song, Jiayi Chen, Pengxiang Ding, Han Zhao, Wei Zhao, Zhide Zhong, Zongyuan Ge, Jun Ma, and Haoang Li. Accelerating vision-language-action model integrated with action chunking via parallel decoding.arXiv preprint arXiv:2503.02310, 2025. 1, 2
-
[44]
Wenxuan Song, Ziyang Zhou, Han Zhao, Jiayi Chen, Pengx- iang Ding, Haodong Yan, Yuxin Huang, Feilong Tang, Donglin Wang, and Haoang Li. Reconvla: Reconstructive vision-language-action model as effective robot perceiver. arXiv preprint arXiv:2508.10333, 2025. 5
-
[45]
Octo: An Open-Source Generalist Robot Policy
Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024. 5
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[46]
Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation
Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, and Jiangmiao Pang. Predictive inverse dynamics models are scalable learners for robotic manipulation.arXiv preprint arXiv:2412.15109, 2024. 5
work page internal anchor Pith review arXiv 2024
-
[47]
Gomez, Łukasz Kaiser, and Illia Polosukhin
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. page 6000–6010,
-
[48]
Hongyu Wang, Chuyan Xiong, Ruiping Wang, and Xilin Chen. Bitvla: 1-bit vision-language-action models for robotics manipulation.arXiv preprint arXiv:2506.07530,
-
[49]
Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, et al. Vla-adapter: An effective paradigm 10 for tiny-scale vision-language-action model.arXiv preprint arXiv:2509.09372, 2025. 1
-
[50]
Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Zhibin Tang, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, et al. Tinyvla: Towards fast, data-efficient vision- language-action models for robotic manipulation.IEEE Robotics and Automation Letters, 2025. 2
work page 2025
-
[51]
3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations
Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3d diffusion policy: Gen- eralizable visuomotor policy learning via simple 3d repre- sentations.arXiv preprint arXiv:2403.03954, 2024. 6, 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[52]
Jianke Zhang, Yanjiang Guo, Yucheng Hu, Xiaoyu Chen, Xi- ang Zhu, and Jianyu Chen. Up-vla: A unified understanding and prediction model for embodied agent.arXiv preprint arXiv:2501.18867, 2025. 5
-
[53]
Qinglun Zhang, Zhen Liu, Haoqiang Fan, Guanghui Liu, Bing Zeng, and Shuaicheng Liu. Flowpolicy: Enabling fast and robust 3d flow-based policy via consistency flow match- ing for robot manipulation. InProceedings of the AAAI Con- ference on Artificial Intelligence, pages 14754–14762, 2025. 1, 2, 4
work page 2025
-
[54]
Renshan Zhang, Rui Shao, Gongwei Chen, Miao Zhang, Kaiwen Zhou, Weili Guan, and Liqiang Nie. Falcon: Resolv- ing visual redundancy and fragmentation in high-resolution multimodal large language models via visual registers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23530–23540, 2025. 2
work page 2025
-
[55]
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xin- qiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, Fan Lu, He Wang, et al. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge.arXiv preprint arXiv:2507.04447, 2025. 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[56]
Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023. 6, 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[57]
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daum ´e III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[58]
Xurui Zhou, Gongwei Chen, Yuquan Xie, Zaijing Li, Kai- wen Zhou, Shuai Wang, Shuo Yang, Zhuotao Tian, and Rui Shao. Hiconagent: History context-aware policy optimiza- tion for gui agents.arXiv preprint arXiv:2512.01763, 2025. 2
-
[59]
Yijie Zhu, Rui Shao, Ziyang Liu, Jie He, Jizhihui Liu, Ji- uru Wang, and Zitong Yu. H-gar: A hierarchical interaction framework via goal-driven observation-action refinement for robotic manipulation. InProceedings of the AAAI Confer- ence on Artificial Intelligence, 2026. 1
work page 2026
-
[60]
Rt-2: Vision-language-action models transfer web knowledge to robotic control
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183. PMLR, 2023. 2 11 Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Actio...
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.