Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation

Bing Hu; Dongmei Jiang; Gongwei Chen; Jianye Hao; Liqiang Nie; Pengwei Xie; Rui Shao; Zaijing Li

arxiv: 2602.20200 · v2 · pith:ZUZD3XLBnew · submitted 2026-02-22 · 💻 cs.RO · cs.AI · cs.CV

Zaijing Li, Bing Hu, Rui Shao, Gongwei Chen, Dongmei Jiang, Pengwei Xie, Jianye Hao, Liqiang Nie

Pith reviewed 2026-05-21 13:17 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV

keywords: Vision-Language-Action models, robotic manipulation, dual-memory framework, global prior memory, local consistency memory, diffusion-based policy, inference efficiency

The pith

A dual-memory system replaces random noise with retrieved task priors and adds action-history constraints to make vision-language-action policies faster and more reliable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OptimusVLA, which augments hierarchical vision-language-action models with two memories to fix bottlenecks in action generation. Global Prior Memory pulls task-level priors from semantically similar past trajectories instead of starting from Gaussian noise, which shortens the denoising path and lowers the number of function evaluations. Local Consistency Memory tracks the sequence of already-executed actions to infer progress and enforce temporal smoothness. These changes produce higher success rates on standard manipulation benchmarks and real-robot suites while delivering nearly three times faster inference.
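The link between a closer starting point and fewer function evaluations can be pictured with a toy refinement loop. This is only an illustrative sketch, not the paper's sampler: the `refine` function, the fixed step size, the tolerance, and the one-dimensional "action" are all assumptions made for the example, and `target - x` stands in for one evaluation of a learned network.

```python
import numpy as np

def refine(x0, target, step=0.5, tol=1e-2, max_nfe=1000):
    """Move x toward target by a fixed fraction per step; return (x, nfe).

    Each loop iteration counts as one function evaluation (NFE).
    Hypothetical toy dynamics, not the OptimusVLA generative policy.
    """
    x = np.asarray(x0, dtype=float)
    nfe = 0
    while np.linalg.norm(x - target) > tol and nfe < max_nfe:
        velocity = target - x  # stand-in for one network evaluation
        x = x + step * velocity
        nfe += 1
    return x, nfe

target = np.array([0.0])
# A start far from the target, as a noise sample might be.
_, nfe_noise = refine(np.array([10.0]), target)
# A start near the target, as a well-matched retrieved prior might be.
_, nfe_prior = refine(np.array([0.1]), target)
print(nfe_noise, nfe_prior)
```

In this toy setting the number of steps grows with the logarithm of the initial distance, so the benefit depends entirely on how close the starting sample actually is to the target.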

Core claim

OptimusVLA replaces the isotropic Gaussian noise prior in the generative action policy with retrieved priors from a Global Prior Memory of semantically similar trajectories and augments the policy with a Local Consistency Memory that models executed action sequences to inject learned consistency constraints, thereby reducing denoising steps, improving temporal coherence, and raising success rates on manipulation tasks.

What carries the argument

Dual-memory framework with Global Prior Memory (GPM) that retrieves task-level priors to shorten the generative path and Local Consistency Memory (LCM) that enforces temporal coherence on the action sequence.

If this is right

OptimusVLA reaches 98.6 percent average success on the LIBERO benchmark.
It improves over the pi_0 baseline by 13.5 percent on the CALVIN benchmark.
It attains 38 percent success on the RoboTwin 2.0 Hard suite.
Real-world tests rank it best on generalization and long-horizon tasks while providing 2.9 times inference speedup.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the prior memory can be grown incrementally from the robot's own experience, the same dual-memory pattern might support continual learning with limited new data collection.
The separation of global semantic retrieval and local temporal constraint could transfer to other generative sequence models outside robotics.
Performance would likely degrade on tasks whose semantic signatures have no close neighbors in the stored library, revealing dependence on retrieval quality.

Load-bearing premise

A sufficiently large library of semantically searchable prior trajectories must exist and the retrieved priors must stay relevant and safe for the current scene.

What would settle it

Evaluating the model on a novel task that has no close semantic matches in the prior memory library and measuring whether the reported gains in success rate and inference speed vanish.

Figures

Figures reproduced from arXiv: 2602.20200 by Bing Hu, Dongmei Jiang, Gongwei Chen, Jianye Hao, Liqiang Nie, Pengwei Xie, Rui Shao, Zaijing Li.

**Figure 1.** Top: Comparison between the standard VLA architecture (left) and our proposed OptimusVLA (right). Middle: Illustration of how GPM (blue) and LCM (green) address two key limitations of existing VLA models: (i) Low inference efficiency due to a large prior–target gap. (ii) Poor robustness to temporal dependence. Bottom: Efficiency and performance comparison. na…

**Figure 2.** Overview of OptimusVLA framework. Given a task and the current observation, the Vision–Language backbone first encodes …

**Figure 3.** Real-world task setup and evaluation results. We evaluate the performance of OptimusVLA against OpenVLA …

**Figure 5.** Inference efficiency comparison on LIBERO and Real …

**Figure 6.** Qualitative results of OptimusVLA on simulation task and Real-World task.

**Figure 7.** Qualitative results of OptimusVLA on Real-World. From top to bottom, we illustrate four …

**Figure 8.** Qualitative results of OptimusVLA on Real-World. From top to bottom, we illustrate four …

Original abstract

Hierarchical Vision-Language-Action (VLA) models have rapidly become a dominant paradigm for robotic manipulation. It typically comprising a Vision-Language backbone for perception and understanding, together with a generative policy for action generation. However, its performance is increasingly bottlenecked by the action generation proceess. (i) Low inference efficiency. A pronounced distributional gap between isotropic noise priors and target action distributions, which increases denoising steps and the incidence of infeasible samples. (ii) Poor robustness. Existing policies condition solely on the current observation, neglecting the constraint of history sequence and thus lacking awareness of task progress and temporal consistency. To address these issues, we introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories, thereby shortening the generative path and reducing the umber of function evaluations (NFE). LCM dynamically models executed action sequence to infer task progress and injects a learned consistency constraint that enforces temporal coherence and smoothness of trajectory. Across three simulation benchmarks, OptimusVLA consistently outperforms strong baselines: it achieves 98.6% average success rate on LIBERO, improves over pi_0 by 13.5% on CALVIN, and attains 38% average success rate on RoboTwin 2.0 Hard. In Real-World evaluation, OptimusVLA ranks best on Generalization and Long-horizon suites, surpassing pi_0 by 42.9% and 52.4%, respectively, while delivering 2.9x inference speedup.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

OptimusVLA shows clear benchmark gains from swapping noise for retrieved priors and adding history consistency, but the retrieval setup stays underspecified.

The main thing to know is that this paper takes the standard hierarchical VLA setup and adds two targeted memory modules that deliver measurable lifts in success rate and inference speed on manipulation tasks. The numbers are the headline: 98.6% average on LIBERO, a 13.5% edge over pi_0 on CALVIN, 38% on the hard RoboTwin split, and a 2.9x speedup, with similar relative gains in real-world generalization and long-horizon tests. Those are the kind of practical improvements that matter for people trying to run these models on actual robots without burning too much compute per step.

The dual-memory design is the concrete addition. Global Prior Memory pulls task-level priors from semantically similar past trajectories instead of starting from isotropic Gaussian noise, which shortens the generative path. Local Consistency Memory tracks the executed action sequence to infer progress and enforce temporal smoothness. Both are straightforward extensions rather than a new paradigm, and they address the two bottlenecks the authors flag: distributional gap and lack of history awareness. The paper does a reasonable job documenting the empirical payoff across simulation suites and real-robot evaluations. The gains look consistent enough to be worth testing in follow-up work.

Where the story thins out is the supporting machinery for the efficiency claim. The central speedup rests on retrieved priors being close enough to the target distribution to actually reduce denoising steps. The description says priors come from semantically similar trajectories, yet the abstract and available details leave the library construction, embedding method, similarity function, and fallback for weak matches unspecified. If those choices are fragile or depend on a large curated store that is not released, the reported speed and robustness numbers become harder to attribute solely to the model. No error bars or component ablations appear in the summary either, which makes it tougher to isolate how much each memory contributes.

This work is aimed at researchers building or deploying vision-language-action policies for manipulation, especially those already using diffusion-style action generators. Readers who need faster inference or better temporal coherence on standard benchmarks will find the results useful to examine. It deserves a serious referee because the empirical claims are specific and the architecture is simple to understand, even if the retrieval details need expansion. I would send it for review but ask the authors to clarify the prior index and any safeguards for poor matches.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces OptimusVLA, a dual-memory Vision-Language-Action framework for robotic manipulation. Global Prior Memory (GPM) replaces isotropic Gaussian noise with task-level priors retrieved from semantically similar trajectories to shorten the denoising trajectory and reduce NFE. Local Consistency Memory (LCM) maintains a dynamic model of the executed action sequence to infer task progress and enforce temporal coherence. Empirical results claim 98.6% average success on LIBERO, a 13.5% improvement over pi_0 on CALVIN, 38% on RoboTwin 2.0 Hard, best-in-class real-world generalization and long-horizon performance, and a 2.9× inference speedup.

Significance. If the retrieval mechanism reliably supplies relevant priors and the consistency constraint is effective, the work could meaningfully advance inference efficiency in generative VLA policies without sacrificing robustness on long-horizon tasks. The multi-benchmark evaluation and real-world results provide a reasonable basis for practical impact in robotics, though the magnitude of the efficiency gain remains contingent on the quality and availability of the external prior library.

major comments (3)

[§3.1] §3.1 (GPM construction): the description of how the prior library is built, including embedding model, database size, similarity metric, and retrieval threshold or fallback policy, is absent. This information is load-bearing for the central efficiency claim that retrieved priors reduce the distributional gap and deliver the reported 2.9× speedup; without it the advantage cannot be isolated from an external curated resource.
[§4.2] §4.2 and §4.3 (results and ablations): no ablation isolates the contribution of GPM from LCM, and no error bars or statistical tests accompany the success-rate numbers (e.g., 98.6% on LIBERO, 13.5% gain on CALVIN). These omissions prevent attribution of gains to the dual-memory design and weaken confidence in the cross-benchmark superiority claims.
[§3.2] §3.2 (LCM formulation): the precise mechanism by which the learned consistency constraint is injected into the denoising process (e.g., as an additional loss term, conditioning signal, or modified sampling schedule) is not formalized with equations, making it difficult to verify that temporal coherence is enforced without introducing new failure modes.
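Of the injection options the third comment lists, the "additional loss term" reading is the easiest to picture. The sketch below is a hand-written smoothness penalty over executed and proposed actions; it is an assumption for illustration only, since the manuscript's learned consistency constraint is not formalized, and the function name, array shapes, and example values are all hypothetical.

```python
import numpy as np

def smoothness_penalty(history, proposed):
    """Mean squared step-to-step change over history followed by proposed actions.

    history:  (T_hist, action_dim) array of already-executed actions.
    proposed: (T_new, action_dim) array of candidate future actions.
    A hand-written stand-in, not the learned LCM constraint.
    """
    seq = np.concatenate([history, proposed], axis=0)
    diffs = np.diff(seq, axis=0)  # consecutive action differences
    return float(np.mean(np.sum(diffs ** 2, axis=1)))

# Illustrative values only.
history = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0]])
smooth = np.array([[0.3, 0.0], [0.4, 0.0]])   # continues the same motion
jumpy = np.array([[1.0, 0.5], [-0.5, 0.2]])   # abrupt changes of direction

print(smoothness_penalty(history, smooth))
print(smoothness_penalty(history, jumpy))
```

A term like this could be added to a training objective or used to score candidate samples; which of those, if either, matches the paper is exactly what the comment asks the authors to specify.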

minor comments (3)

[Abstract] Abstract: correct the typos “proceess” → “process” and “umber” → “number”.
[Figures/Tables] Figure captions and tables should explicitly state whether reported numbers are means over multiple seeds or single runs; the current presentation leaves this ambiguous.
[Implementation details] The manuscript should clarify whether the prior library is released with the code or remains proprietary, as this directly affects reproducibility of the efficiency results.
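The reporting asked for in the second minor comment is cheap to produce. As a minimal sketch, assuming per-seed success rates are available as a list of fractions, the summary below returns the mean, the sample standard deviation, and the seed count; the `summarize` helper is hypothetical and the numbers are placeholders, not results from the paper.

```python
import statistics

def summarize(success_rates):
    """Return (mean, sample std, number of seeds) for per-seed success rates."""
    mean = statistics.mean(success_rates)
    # Sample standard deviation is undefined for a single run; report 0.0.
    std = statistics.stdev(success_rates) if len(success_rates) > 1 else 0.0
    return mean, std, len(success_rates)

# Placeholder values for illustration only, not taken from the paper.
placeholder = [0.90, 0.92, 0.88]
mean, std, n = summarize(placeholder)
print(f"{100 * mean:.1f}% ± {100 * std:.1f} (n={n} seeds)")
```

Printing the seed count alongside the spread makes a single-run number immediately distinguishable from a multi-seed mean.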

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas for improving reproducibility, empirical rigor, and formal clarity. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.

Referee: [§3.1] §3.1 (GPM construction): the description of how the prior library is built, including embedding model, database size, similarity metric, and retrieval threshold or fallback policy, is absent. This information is load-bearing for the central efficiency claim that retrieved priors reduce the distributional gap and deliver the reported 2.9× speedup; without it the advantage cannot be isolated from an external curated resource.

Authors: We agree that these implementation details are essential for reproducibility and to properly attribute the efficiency gains. In the revised manuscript, we will expand §3.1 with a full description of the prior library, including the embedding model, database size, similarity metric (cosine similarity on trajectory embeddings), retrieval threshold, and fallback policy to standard Gaussian noise when no sufficiently similar prior is available. revision: yes
Referee: [§4.2] §4.2 and §4.3 (results and ablations): no ablation isolates the contribution of GPM from LCM, and no error bars or statistical tests accompany the success-rate numbers (e.g., 98.6% on LIBERO, 13.5% gain on CALVIN). These omissions prevent attribution of gains to the dual-memory design and weaken confidence in the cross-benchmark superiority claims.

Authors: We acknowledge that isolating the contributions of GPM and LCM is necessary to strengthen causal claims. We will add dedicated ablation experiments evaluating each component independently. We will also report error bars (standard deviation over multiple seeds) and include statistical significance tests for the key performance differences across benchmarks. revision: yes
Referee: [§3.2] §3.2 (LCM formulation): the precise mechanism by which the learned consistency constraint is injected into the denoising process (e.g., as an additional loss term, conditioning signal, or modified sampling schedule) is not formalized with equations, making it difficult to verify that temporal coherence is enforced without introducing new failure modes.

Authors: We agree that the injection mechanism requires explicit formalization. In the revised §3.2, we will add the mathematical formulation, including the equations that define how the learned consistency constraint is incorporated into the denoising objective or sampling procedure. revision: yes
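The retrieval policy named in the first response (cosine similarity on trajectory embeddings, a retrieval threshold, and a fallback to standard Gaussian noise) can be sketched in a few lines. Everything beyond those three ingredients is an assumption: the function name, the embedding dimension, the threshold value, and the shape of the stored priors are hypothetical and not taken from the paper.

```python
import numpy as np

def retrieve_prior(query, keys, priors, threshold, rng, action_shape):
    """Return (initial_sample, used_prior).

    query:  (d,) embedding of the current task or observation.
    keys:   (N, d) embeddings of stored trajectories.
    priors: (N, *action_shape) stored task-level priors.
    Falls back to standard Gaussian noise when no key is similar enough.
    """
    if len(keys) > 0:
        q = query / np.linalg.norm(query)
        k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
        sims = k @ q  # cosine similarity to every stored key
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            return priors[best].copy(), True
    return rng.standard_normal(action_shape), False

# Hypothetical two-entry library with 2-D embeddings and (1, 2) action chunks.
keys = np.array([[1.0, 0.0], [0.0, 1.0]])
priors = np.array([[[0.1, 0.2]], [[0.9, 0.8]]])
rng = np.random.default_rng(0)

hit, used_hit = retrieve_prior(np.array([1.0, 0.1]), keys, priors, 0.9, rng, (1, 2))
miss, used_miss = retrieve_prior(np.array([-1.0, -1.0]), keys, priors, 0.9, rng, (1, 2))
print(used_hit, used_miss)
```

Logging how often the fallback branch fires on each benchmark would also show how much of the reported speedup depends on library coverage.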

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on benchmark validation rather than self-referential derivation

The paper introduces OptimusVLA with Global Prior Memory (GPM) and Local Consistency Memory (LCM) to improve VLA action generation efficiency and robustness. GPM replaces isotropic noise with retrieved task-level priors from semantically similar trajectories, while LCM enforces temporal consistency on action sequences. These are presented as architectural innovations whose benefits are demonstrated through empirical results on LIBERO (98.6% success), CALVIN (+13.5% over pi_0), RoboTwin, and real-world suites (2.9x speedup). No equations or first-principles derivations are shown that reduce by construction to fitted parameters, self-citations, or renamed inputs; the retrieval mechanism and consistency constraint are external design choices validated experimentally rather than tautological. The derivation chain is self-contained against external benchmarks with no load-bearing self-citation or definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The approach relies on the existence of a retrievable trajectory library and on the assumption that semantic similarity in language-vision space predicts useful action priors; these are domain assumptions rather than free parameters or new entities.

axioms (2)

domain assumption: A library of semantically similar past trajectories can be retrieved at inference time to serve as a better starting distribution than isotropic Gaussian noise.
Invoked when GPM replaces the standard noise prior; no details on library construction or retrieval mechanism are given in the abstract.
domain assumption: Enforcing consistency on the executed action sequence improves robustness by providing awareness of task progress.
Central justification for the LCM module.

invented entities (2)

Global Prior Memory (GPM) · no independent evidence
purpose: Replace Gaussian noise with retrieved task-level priors to shorten the generative path.
New architectural component introduced to address the distributional gap.
Local Consistency Memory (LCM) · no independent evidence
purpose: Model recent actions to infer progress and enforce temporal coherence.
New architectural component introduced to address lack of history awareness.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark
cs.RO 2026-05 unverdicted novelty 6.0

RoboMemArena is a new large-scale robotic memory benchmark with real-world tasks, and PrediMem is a dual VLA system that outperforms baselines by managing memory buffers with predictive coding.

ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 6.0

ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 2 Pith papers · 25 internal anchors

[1]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.pi 0: A vision-language- action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024. 1, 2, 3, 5, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Towards synergistic, generalized, and efficient dual-system for robotic manipulation.arXiv preprint arXiv:2410.08001,

Qingwen Bu, Hongyang Li, Li Chen, Jisong Cai, Jia Zeng, Heming Cui, Maoqing Yao, and Yu Qiao. Towards synergis- tic, generalized, and efficient dual-system for robotic manip- ulation.arXiv preprint arXiv:2410.08001, 2024. 5

work page arXiv 2024
[3]

UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent ac- tions.arXiv preprint arXiv:2505.06111, 2025. 3, 5

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Lion: Empowering multimodal large language model with dual-level visual knowledge

Gongwei Chen, Leyang Shen, Rui Shao, Xiang Deng, and Liqiang Nie. Lion: Empowering multimodal large language model with dual-level visual knowledge. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26540–26550, 2024. 1, 4

work page 2024
[5]

RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data gen- erator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025. 1, 2, 5, 6, 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Diffusion policy: Visuomotor policy learning via action dif- fusion.The International Journal of Robotics Research, 44 (10-11):1684–1704, 2025

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action dif- fusion.The International Journal of Robotics Research, 44 (10-11):1684–1704, 2025. 1, 2, 5, 6, 3

work page 2025
[7]

GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data

Shengliang Deng, Mi Yan, Songlin Wei, Haixin Ma, Yuxin Yang, Jiayi Chen, Zhiqi Zhang, Taoyu Yang, Xuheng Zhang, Wenhao Zhang, et al. Graspvla: a grasping foundation model pre-trained on billion-scale synthetic action data.arXiv preprint arXiv:2505.03233, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Interleave-vla: Enhancing robot manipulation with interleaved image-text instructions.arXiv preprint arXiv:2505.02152, 2025

Cunxin Fan, Xiaosong Jia, Yihang Sun, Yixiao Wang, Jianglan Wei, Ziyang Gong, Xiangyu Zhao, Masayoshi Tomizuka, Xue Yang, Junchi Yan, et al. Interleave-vla: En- hancing robot manipulation with interleaved image-text in- structions.arXiv preprint arXiv:2505.02152, 2025. 2, 3

work page arXiv 2025
[9]

Mamba: Linear-time sequence mod- eling with selective state spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence mod- eling with selective state spaces. InFirst conference on lan- guage modeling, 2024. 5

work page 2024
[10]

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A general- ist robot policy with predictive visual representations.arXiv preprint arXiv:2412.14803, 2024. 5

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

An Embodied Generalist Agent in 3D World

Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world.arXiv preprint arXiv:2311.12871, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. pi05: a vision-language-action model with open-world generaliza- tion.arXiv preprint arXiv:2504.16054, 2025. 1, 2, 3, 5, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Vima: General robot manip- ulation with multimodal prompts

Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandku- mar, Yuke Zhu, and Linxi Fan. Vima: General robot manip- ulation with multimodal prompts. InNeurIPS 2022 Founda- tion Models for Decision Making Workshop, 2022. 1, 2

work page 2022
[14]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024. 1, 2, 3, 5, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and suc- cess.arXiv preprint arXiv:2502.19645, 2025. 2, 5, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Star: Learning diverse robot skill ab- stractions through rotation-augmented vector quantization

Hao Li, Qi Lv, Rui Shao, Xiang Deng, Yinchuan Li, Jianye Hao, and Liqiang Nie. Star: Learning diverse robot skill ab- stractions through rotation-augmented vector quantization. InInternational Conference on Machine Learning, 2025. 1

work page 2025
[17]

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision- language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Cogvla: Cognition-aligned vision-language-action model via instruction-driven routing & sparsification.arXiv preprint arXiv:2508.21046, 2025

Wei Li, Renshan Zhang, Rui Shao, Jie He, and Liqiang Nie. Cogvla: Cognition-aligned vision-language-action model via instruction-driven routing & sparsification.arXiv preprint arXiv:2508.21046, 2025. 1, 2

work page arXiv 2025
[19]

Semanticvla: Semantic-aligned sparsification and enhancement for effi- cient robotic manipulation

Wei Li, Renshan Zhang, Rui Shao, Zhijian Fang, Kai- wen Zhou, Zhuotao Tian, and Liqiang Nie. Semanticvla: Semantic-aligned sparsification and enhancement for effi- cient robotic manipulation. InProceedings of the AAAI Con- ference on Artificial Intelligence, 2026. 1

work page 2026
[20]

Vision-Language Foundation Models as Effective Robot Imitators

Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators.arXiv preprint arXiv:2311.01378,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Optimus-1: Hybrid multimodal memory empowered agents excel in long-horizon tasks.arXiv preprint arXiv:2408.03615, 2024

Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dong- mei Jiang, and Liqiang Nie. Optimus-1: Hybrid mul- timodal memory empowered agents excel in long-horizon tasks.arXiv preprint arXiv:2408.03615, 2024. 2

work page arXiv 2024
[22]

Optimus-3: Towards generalist multi- modal minecraft agents with scalable task experts.arXiv preprint arXiv:2506.10357, 2025f

Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Weili Guan, Dongmei Jiang, Yaowei Wang, and Liqiang Nie. Optimus-3: Dual-router aligned mixture-of-experts agent with dual-granularity reasoning-aware policy optimization. arXiv preprint arXiv:2506.10357, 2025. 2

work page arXiv 2025
[23]

Optimus-2: Multimodal minecraft agent with goal-observation-action conditioned policy

Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, and Liqiang Nie. Optimus-2: Multimodal minecraft agent with goal-observation-action conditioned policy. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), pages 9039–9049,

work page
[24]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil- ian Nickel, and Matt Le. Flow matching for generative mod- eling.arXiv preprint arXiv:2210.02747, 2022. 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2022
[25]

Libero: Benchmarking knowl- edge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowl- edge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023. 1, 2, 5

work page 2023
[26]

Visual instruction tuning.Advances in neural information processing systems, 36, 2024

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36, 2024. 1, 4

work page 2024
[27]

Towards generalist robot policies: What mat- ters in building vision-language-action models

Huaping Liu, Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, and Hanbo Zhang. Towards generalist robot policies: What mat- ters in building vision-language-action models. 2025. 5

work page 2025
[28]

Towards generalist robot policies: What mat- ters in building vision-language-action models

Huaping Liu, Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, and Hanbo Zhang. Towards generalist robot policies: What mat- ters in building vision-language-action models. 2025. 2, 3

work page 2025
[29]

HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Ren- rui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al. Hybridvla: Collaborative diffu- sion and autoregression in a unified vision-language-action model.arXiv preprint arXiv:2503.10631, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipu- lation.arXiv preprint arXiv:2410.07864, 2024. 6, 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022. 2, 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2022
[32]

Puma: Layer-pruned language model for efficient unified multimodal retrieval with modality-adaptive learning

Yibo Lyu, Rui Shao, Gongwei Chen, Yijie Zhu, Weili Guan, and Liqiang Nie. Puma: Layer-pruned language model for efficient unified multimodal retrieval with modality-adaptive learning. InProceedings of the 33rd ACM International Con- ference on Multimedia, pages 7653–7662, 2025. 2

work page 2025
[33]

Hierarchical diffusion policy for kinematics-aware multi- task robotic manipulation

Xiao Ma, Sumit Patidar, Iain Haughton, and Stephen James. Hierarchical diffusion policy for kinematics-aware multi- task robotic manipulation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18081–18090, 2024. 2

work page 2024
[34] Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022.
[35] Valerio Ortenzi, Naresh Marturi, Michael Mistry, Jeffrey Kuo, and Rustam Stolkin. Vision-based framework to estimate robot configuration and kinematic constraints. IEEE/ASME Transactions on Mechatronics, 23(5):2402–2412, 2018.
[36] Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. FAST: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747.
[37] Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. SpatialVLA: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830, 2025.
[38] Rui Shao, Xiangyuan Lan, Jiawei Li, and Pong C Yuen. Multi-adversarial discriminative deep domain generalization for face presentation attack detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10023–10031, 2019.
[39] Rui Shao, Tianxing Wu, and Ziwei Liu. Detecting and grounding multi-modal media manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2023.
[40] Rui Shao, Tianxing Wu, Jianlong Wu, Liqiang Nie, and Ziwei Liu. Detecting and grounding multi-modal media manipulation and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[41] Rui Shao, Wei Li, Lingsen Zhang, Renshan Zhang, Zhiyang Liu, Ran Chen, and Liqiang Nie. Large VLM-based vision-language-action models for robotic manipulation: A survey. arXiv preprint arXiv:2508.13073, 2025.
[42] Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, and Gao Huang. MemoryVLA: Perceptual-cognitive memory in vision-language-action models for robotic manipulation. arXiv preprint arXiv:2508.19236, 2025.
[43] Wenxuan Song, Jiayi Chen, Pengxiang Ding, Han Zhao, Wei Zhao, Zhide Zhong, Zongyuan Ge, Jun Ma, and Haoang Li. Accelerating vision-language-action model integrated with action chunking via parallel decoding. arXiv preprint arXiv:2503.02310, 2025.
[44] Wenxuan Song, Ziyang Zhou, Han Zhao, Jiayi Chen, Pengxiang Ding, Haodong Yan, Yuxin Huang, Feilong Tang, Donglin Wang, and Haoang Li. ReconVLA: Reconstructive vision-language-action model as effective robot perceiver. arXiv preprint arXiv:2508.10333, 2025.
[45] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024.
[46] Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, and Jiangmiao Pang. Predictive inverse dynamics models are scalable learners for robotic manipulation. arXiv preprint arXiv:2412.15109, 2024.
[47] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need, pages 6000–6010.
[48] Hongyu Wang, Chuyan Xiong, Ruiping Wang, and Xilin Chen. BitVLA: 1-bit vision-language-action models for robotics manipulation. arXiv preprint arXiv:2506.07530.
[49] Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, et al. VLA-Adapter: An effective paradigm for tiny-scale vision-language-action model. arXiv preprint arXiv:2509.09372, 2025.
[50] Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Zhibin Tang, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, et al. TinyVLA: Towards fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters, 2025.
[51] Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3D diffusion policy: Generalizable visuomotor policy learning via simple 3D representations. arXiv preprint arXiv:2403.03954, 2024.
[52] Jianke Zhang, Yanjiang Guo, Yucheng Hu, Xiaoyu Chen, Xiang Zhu, and Jianyu Chen. UP-VLA: A unified understanding and prediction model for embodied agent. arXiv preprint arXiv:2501.18867, 2025.
[53] Qinglun Zhang, Zhen Liu, Haoqiang Fan, Guanghui Liu, Bing Zeng, and Shuaicheng Liu. FlowPolicy: Enabling fast and robust 3D flow-based policy via consistency flow matching for robot manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 14754–14762, 2025.
[54] Renshan Zhang, Rui Shao, Gongwei Chen, Miao Zhang, Kaiwen Zhou, Weili Guan, and Liqiang Nie. Falcon: Resolving visual redundancy and fragmentation in high-resolution multimodal large language models via visual registers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23530–23540, 2025.
[55] Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, Fan Lu, He Wang, et al. DreamVLA: A vision-language-action model dreamed with comprehensive world knowledge. arXiv preprint arXiv:2507.04447, 2025.
[56] Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023.
[57] Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. TraceVLA: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345, 2024.
[58] Xurui Zhou, Gongwei Chen, Yuquan Xie, Zaijing Li, Kaiwen Zhou, Shuai Wang, Shuo Yang, Zhuotao Tian, and Rui Shao. HiconAgent: History context-aware policy optimization for GUI agents. arXiv preprint arXiv:2512.01763, 2025.
[59] Yijie Zhu, Rui Shao, Ziyang Liu, Jie He, Jizhihui Liu, Jiuru Wang, and Zitong Yu. H-GAR: A hierarchical interaction framework via goal-driven observation-action refinement for robotic manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026.
[60] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.

