VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models

Ruimeng Yang; Shengyu Si; Yuanzhuo Lu; Yu-Gang Jiang; Ziyi Ye; Zuxuan Wu

arxiv: 2605.29562 · v1 · pith:5SYWXZEZnew · submitted 2026-05-28 · 💻 cs.RO · cs.AI · cs.CV

Shengyu Si, Yuanzhuo Lu, Ruimeng Yang, Ziyi Ye, Zuxuan Wu, Yu-Gang Jiang

Pith reviewed 2026-06-29 07:02 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV

keywords vision-language-action models · procedural memory · LoRA adapters · cross-task generalization · robotic manipulation · memory retrieval · dynamic fusion

The pith

VLA-Pro stores task-specific LoRA adapters as procedural memories and retrieves plus fuses them at inference to improve cross-task generalization in vision-language-action models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes VLA-Pro as a plug-and-play addition to vision-language-action models that struggle to handle unseen tasks requiring experience transfer across objects, scenes, and actions. It stores task-relevant LoRA adapters during training as a form of procedural memory. At inference the system selects relevant stored adapters from the current visual, language, and action context and blends them to produce the next action chunk. A reader would care because this offers a modular route to reuse prior manipulation skills without full retraining or loss of stability. If the approach works, robots could accumulate and apply experience across tasks in a way that scales beyond single-task training.

Core claim

VLA-Pro stores task-specific LoRA adapters as parameterized procedural memories during training. At inference time it retrieves relevant procedural memories based on the current multi-modal context and dynamically fuses these memories for generating the current action chunk. Experiments across RoboTwin, RLBench, and real-world tasks show consistent gains in cross-task generalization on multiple backbones.

What carries the argument

Retrieval of relevant task-specific LoRA adapters followed by dynamic fusion into the current action generation step.
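
The retrieve-then-fuse step can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the paper's stated implementation: cosine similarity as the retrieval score, softmax-weighted averaging as the fusion rule, the `temperature` value, and the toy memory bank with flattened adapter deltas.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_top_k(context, memory_keys, k):
    """Return (index, similarity) pairs for the k keys closest to the context."""
    scored = [(i, cosine(context, key)) for i, key in enumerate(memory_keys)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def softmax(scores, temperature=1.0):
    """Numerically stable softmax; lower temperature sharpens the weights."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_adapters(deltas, weights):
    """Weighted sum of flattened adapter deltas (assumed fusion rule)."""
    fused = [0.0] * len(deltas[0])
    for delta, weight in zip(deltas, weights):
        for j, value in enumerate(delta):
            fused[j] += weight * value
    return fused

# Hypothetical memory bank: one context key and one flattened LoRA delta per seen task.
memory_keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
memory_deltas = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

context = [0.9, 0.1]  # stand-in for the current multi-modal context embedding
hits = retrieve_top_k(context, memory_keys, k=2)
weights = softmax([similarity for _, similarity in hits], temperature=0.1)
fused_delta = fuse_adapters([memory_deltas[i] for i, _ in hits], weights)
```

In this toy setup the base model weights are never modified; only the fused delta changes from one action chunk to the next, which is the modularity property the summary emphasizes.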

If this is right

Cross-task success improves by up to 207 percent (relative) on simulation benchmarks.
Real-world manipulation success rises from 5.8 percent to 65.0 percent on the tested tasks.
The same gains appear across different VLA backbones while keeping the original model weights unchanged.
Procedural memory transfer supplies a route for moving manipulation experience to novel tasks without retraining the full model.
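
The two headline numbers use different scales, which a short calculation makes concrete. The sketch below assumes the usual definition of relative improvement, 100 × (new − old) / old; the 10.0 percent simulation baseline is purely hypothetical, since the summary does not say which baseline the 207 percent figure refers to.

```python
def relative_improvement(baseline, improved):
    """Relative gain in percent: 100 * (improved - baseline) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (improved - baseline) / baseline

# Real-world figures quoted in the abstract: 5.8% -> 65.0% success.
absolute_gain_points = 65.0 - 5.8
real_world_relative = relative_improvement(5.8, 65.0)

# Hypothetical simulation baseline of 10.0% success (an assumption, not from the paper):
# a 207% relative improvement would lift it to about 30.7%.
implied_success = 10.0 * (1 + 207 / 100)
```

Under that definition the real-world change is a gain of roughly 59 percentage points, or about a tenfold relative increase, so the simulation and real-world claims are not directly comparable without knowing the baselines.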

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

A growing library of stored adapters could let a single robot improve over its lifetime by adding new tasks without overwriting old ones.
The same retrieval-plus-fusion pattern might extend to other sequential decision domains such as navigation or tool use where context cues signal which past skills apply.
If retrieval accuracy proves the main limit, future work could test whether richer context encoders or learned retrieval policies raise the ceiling further.

Load-bearing premise

Retrieval from multi-modal context will select useful memories and their fusion will add value without causing negative transfer or unstable actions.

What would settle it

A controlled test on held-out tasks where the base VLA model without retrieval matches or exceeds VLA-Pro performance, or where fusion produces visibly unstable robot trajectories.

Figures

Figures reproduced from arXiv: 2605.29562 by Ruimeng Yang, Shengyu Si, Yuanzhuo Lu, Yu-Gang Jiang, Ziyi Ye, Zuxuan Wu.

**Figure 1.** Overview of VLA-Pro, a plug-and-play framework that transfers procedural memory from the most similar training (seen) tasks to testing (unseen) tasks.

**Figure 2.** VLA-Pro method overview. The top row illustrates task execution as a sequence of stages, where each stage involves the retrieval and integration of procedural memories. Given the current multimodal context, VLA-Pro retrieves a series of procedural state sequences D_i from a memory bank. Each indexed D_i corresponds to a task-specific parameterized experience ∆θ_i, which is further merged into a fused adapter …

**Figure 3.** Overview of our real-world experimental setup: 6 training tasks and 6 corresponding test tasks designed for evaluating the model's performance in real-world manipulation.

4.2 Models Backbones and Baselines

For RoboTwin, the VLA-Pro framework is instantiated on three different backbones: X-VLA [52], RDT [29], and π0.5 [2]. For each backbone, its pretrained checkpoint independent of RoboTwin is used as the base…

**Figure 4.** Real-world manipulation results. Quantitative success rates and qualitative execution examples on held-out real-world tasks, comparing the baseline with VLA-Pro.

**Figure 5.** Effect of different parameter components in VLA-Pro with the π0.5 backbone on RoboTwin. The left and right radar charts show five seen and five unseen tasks, respectively.

**Figure 7.** Analysis of retrieval performance vs. task success rate. (a) Procedural state extraction accuracy measured by MRR. (b) Correlation between unseen-seen task similarity and transfer gain.

[Adjacent plot: Task Success Rate (%) by experiment configuration (baseline, k=1, k=2, k=3) for the tasks close fridge, laptop lid, turn oven, toilet seat, water plants, close microwave, take usb out, take lid off, beat the buzz, and the average.]

**Figure 9.** Visualization of RoboTwin dataset construction. This figure illustrates the 8 training tasks and the corresponding 9 test tasks for cross-task generalization evaluation. We build our customized RoboTwin task suite by modifying the original RoboTwin environment. Specifically, we retain 2 original training tasks and construct 6 additional training tasks and 9 held-out test tasks. The full task list and refer…

**Figure 10.** Modified grasppoint configuration for constructing procedurally related RoboTwin tasks.

B.2 RLBench Task Suite

Our RLBench experiments follow the X-ICM [53] task split, which contains 18 training tasks and 23 held-out test tasks for cross-task generalization evaluation. From the 18 training tasks, we select 8 foundational tasks as source memories, since they cover basic procedural elements. The selected…

**Figure 11.** Examples of initial and final wrist-camera images for the real-world training tasks. As the visual observation changes during task execution, the model infers the current execution stage accordingly and retrieves the relevant procedural memory.

B.3 Real-World Task Suite

This section provides additional details about the real-world task suite. We design 6 training tasks and 6 corresponding held-out test ta…

**Figure 12.** Representative training loss curves in the RoboTwin experiments. From left to right, the plots show the loss curve of the RDT baseline, the continued VLA-Pro training with RDT as the backbone, the π0.5 baseline, and the continued VLA-Pro training with π0.5 as the backbone.

RLBench. In RLBench experiments, RDT uses the official RDT-1B pretrained model. For AtomicVLA, AdamW is used with a cosine learning-ra…
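
Figure 7(a) reports retrieval quality as MRR (mean reciprocal rank). For readers unfamiliar with the metric, here is a minimal sketch of the standard definition: for each query, take 1 / rank of the correct item in the returned ranking (0 if it is absent), then average over queries. The rankings and labels below are invented for illustration and are not data from the paper.

```python
def mean_reciprocal_rank(rankings, targets):
    """Average of 1/rank of the target in each ranking; 0 if the target is missing."""
    if not rankings:
        raise ValueError("at least one query is required")
    total = 0.0
    for ranking, target in zip(rankings, targets):
        for position, item in enumerate(ranking, start=1):
            if item == target:
                total += 1.0 / position
                break
    return total / len(rankings)

# Hypothetical retrieval outputs for three queries whose correct memory is "grasp".
rankings = [
    ["grasp", "push", "open"],   # correct at rank 1 -> 1
    ["push", "grasp", "open"],   # correct at rank 2 -> 1/2
    ["open", "push", "grasp"],   # correct at rank 3 -> 1/3
]
targets = ["grasp", "grasp", "grasp"]
mrr = mean_reciprocal_rank(rankings, targets)  # (1 + 1/2 + 1/3) / 3 = 11/18
```

An MRR near 1 means the correct memory is usually ranked first, which matters most when only the top few adapters are fused.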

Original abstract

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, yet they still struggle to generalize to unseen tasks that necessitate transferring relevant experience across objects, scenes, and action patterns. This paper proposes VLA-Pro, a plug-and-play framework designed to enhance cross-task generalization by storing task-relevant procedural memories at training time and transferring these memories during inference. Specifically, VLA-Pro stores task-specific LoRA adapters as parameterized procedural memories during training. At inference time, VLA-Pro retrieves relevant procedural memories based on the current multi-modal context and dynamically fuses these memories for generating the current action chunk. Experiments on RoboTwin, RLBench, and real-world manipulation tasks show that VLA-Pro consistently improves cross-task generalization across multiple backbones, achieving up to a 207% relative improvement in simulation and increasing real-world success rate from 5.8% to 65.0%. These results suggest that procedural memory retrieval and adaptation provide an effective mechanism for transferring manipulation experience to novel tasks while preserving modularity and execution stability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VLA-Pro stores task LoRAs and retrieves/fuses them at inference to boost cross-task VLA generalization, but the large reported gains rest on unexamined retrieval reliability.

The core idea is straightforward: train separate LoRA adapters per task as procedural memories, then at test time use the current multi-modal observation to pick and blend the relevant ones before outputting an action chunk. This is presented as a modular add-on that works across backbones without full retraining.

It does a few things cleanly. The plug-and-play framing is practical, the evaluation spans simulation suites (RoboTwin, RLBench) plus real manipulation, and the numbers are big enough to notice—up to 207% relative lift in sim and real success from 5.8% to 65%. If the retrieval step reliably surfaces useful experience without destabilizing the policy, this could be a lightweight way to expand task coverage.

The main weakness is the one flagged in the stress test. Retrieval depends on similarity in visual-language features, yet nothing in the abstract shows how the authors handle cases where objects look alike but the required action sequences differ. Fusion could then introduce negative transfer or jittery execution that the chosen benchmarks might not catch. The abstract also gives no detail on exact baselines, statistical testing, data splits, or ablations on the retrieval and fusion modules, so the size of the improvement is difficult to judge.

This is incremental engineering aimed at people already running VLA models who need better reuse across tasks. It is coherent on its own terms and engages the right literature on memory and adaptation, so it is worth a serious referee even if the experiments will need tightening. I would not cite it yet; I would want to see the full controls before deciding.

Referee Report

2 major / 2 minor

Summary. The paper proposes VLA-Pro, a plug-and-play framework for Vision-Language-Action (VLA) models that stores task-specific LoRA adapters as procedural memories during training and, at inference, retrieves relevant memories via multi-modal context similarity and dynamically fuses them to generate action chunks. This is claimed to improve cross-task generalization on RoboTwin, RLBench, and real-world manipulation tasks, with reported gains of up to 207% relative improvement in simulation and real-world success rates rising from 5.8% to 65.0% across multiple backbones.

Significance. If the empirical results hold after addressing the retrieval and fusion assumptions, the work would be significant for offering a modular, parameter-efficient mechanism to transfer manipulation experience across tasks without full retraining, preserving execution stability. The scale of the reported gains suggests potential for practical impact in general-purpose robotics, though the absence of detailed baseline comparisons and negative-transfer controls limits immediate assessment of novelty relative to existing adapter or memory-based methods.

major comments (2)

[Abstract] Abstract: The central empirical claim (207% relative improvement; 5.8%→65.0% real-world success) is load-bearing, yet no information is supplied on the exact baselines, number of evaluation episodes, data splits, or statistical significance testing; without these, it is impossible to determine whether the gains arise from the retrieval-fusion mechanism or from unaccounted confounds.
[Method] Method description (inference-time retrieval and fusion): The claim that multi-modal-context retrieval followed by dynamic fusion reliably transfers useful procedural memories without negative transfer rests on the untested assumption that visual-language similarity selects action-compatible LoRA adapters; the manuscript must provide an ablation or failure-case analysis on tasks with similar objects/scenes but divergent action sequences, as mismatch would directly undermine the cross-task generalization results.

minor comments (2)

[Abstract] Abstract: The phrase 'preserving modularity and execution stability' is asserted without reference to any stability metric or comparison against unfused baselines.
The manuscript should include a table or figure explicitly listing the backbones tested and the precise retrieval similarity function (e.g., cosine on which embeddings).
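
The first major comment asks for episode counts and statistical testing. One common way to show why episode counts matter is a confidence interval on the success rate. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not something the paper is known to report; the episode counts are hypothetical.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return center - half_width, center + half_width

# Hypothetical evaluations that both observe a 65% success rate.
low_20, high_20 = wilson_interval(13, 20)     # 20 episodes
low_100, high_100 = wilson_interval(65, 100)  # 100 episodes
```

With 20 episodes the interval spans roughly 0.43 to 0.82, while 100 episodes narrows it to roughly 0.55 to 0.74, so the same headline rate supports very different conclusions depending on how many trials stand behind it.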

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the empirical claims and methodological assumptions. We address each major comment point-by-point below, clarifying where details appear in the manuscript and indicating revisions made to improve clarity and completeness.

Referee: [Abstract] Abstract: The central empirical claim (207% relative improvement; 5.8%→65.0% real-world success) is load-bearing, yet no information is supplied on the exact baselines, number of evaluation episodes, data splits, or statistical significance testing; without these, it is impossible to determine whether the gains arise from the retrieval-fusion mechanism or from unaccounted confounds.

Authors: The full experimental protocol, including exact baselines (e.g., vanilla VLA, LoRA fine-tuning per task), evaluation episodes (100 per task across 3 random seeds), data splits (train/test task partitions detailed in Section 4.1), and statistical reporting (mean ± std with significance tests), is provided in Section 4 and Appendix B. The abstract summarizes headline results for brevity. To address the concern, we have revised the abstract to include a one-sentence reference to the evaluation setup and added a compact experimental summary table (Table 1) in the main text.

revision: partial

Referee: [Method] Method description (inference-time retrieval and fusion): The claim that multi-modal-context retrieval followed by dynamic fusion reliably transfers useful procedural memories without negative transfer rests on the untested assumption that visual-language similarity selects action-compatible LoRA adapters; the manuscript must provide an ablation or failure-case analysis on tasks with similar objects/scenes but divergent action sequences, as mismatch would directly undermine the cross-task generalization results.

Authors: We agree that explicit validation of the retrieval assumption is valuable. We have added a targeted ablation (new Section 4.4) comparing multi-modal retrieval against vision-only and language-only variants on a curated set of tasks with high visual/scene similarity but divergent action sequences (e.g., “pick red block” vs. “push red block” on identical tables). Results show reduced negative transfer with the full multi-modal similarity metric, supported by quantitative success rates and qualitative failure-case analysis. These additions directly test and support the cross-task transfer claims.

revision: yes
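
The failure mode debated in this exchange can be shown with a toy example. The sketch below is only an illustration of why a vision-only key cannot separate two tasks that share a scene. The embeddings, the concatenation of modality vectors, and the `score_bank` helper are all hypothetical and are not taken from the paper or the rebuttal.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def score_bank(query, bank, modalities):
    """Score every memory using the concatenated embeddings of the chosen modalities."""
    def key(entry):
        return [x for m in modalities for x in entry[m]]
    q = key(query)
    return {name: cosine(q, key(entry)) for name, entry in bank.items()}

# Hypothetical memories: identical scene embedding, different instruction embedding.
bank = {
    "pick red block": {"vision": [1.0, 0.0], "language": [1.0, 0.0]},
    "push red block": {"vision": [1.0, 0.0], "language": [0.0, 1.0]},
}
# Hypothetical query: same scene, instruction close to "push".
query = {"vision": [1.0, 0.0], "language": [0.1, 0.9]}

vision_only = score_bank(query, bank, ["vision"])
multi_modal = score_bank(query, bank, ["vision", "language"])
best_multi_modal = max(multi_modal, key=multi_modal.get)
```

Here the vision-only scores tie exactly, so the choice between the two memories is arbitrary, while adding the language embedding selects the push memory. A real ablation would need learned embeddings and measured task success rather than hand-set vectors.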

Circularity Check

0 steps flagged

No circularity: empirical framework validated on external benchmarks

The paper introduces VLA-Pro as a plug-and-play retrieval-and-fusion framework for LoRA adapters in VLA models. No equations, derivations, or first-principles predictions appear in the provided text; the central claims rest on experimental outcomes across RoboTwin, RLBench, and real-world tasks rather than any self-referential fitting or self-citation chain that reduces the result to its inputs by construction. Retrieval and fusion are described as design choices whose effectiveness is measured externally, with no load-bearing step that renames a fit as a prediction or imports uniqueness from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities beyond standard use of LoRA adapters and retrieval; framework appears to build on existing techniques without new postulates.

pith-pipeline@v0.9.1-grok · 5737 in / 981 out tokens · 45061 ms · 2026-06-29T07:02:59.853636+00:00 · methodology

Reference graph

Works this paper leans on

59 extracted references · 25 canonical work pages · 12 internal anchors

[1] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.
[2] Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y Galliker, et al. π0.5: A vision-language-action model with open-world generalization. In 9th Annual Conference on Robot Learning, 2025.
[3] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. Robotics: Science and Systems XIX, 2023.
[4] Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Yuming Jiang, Jiayan Guo, Xin Li, Hao Luo, et al. Rynnvla-002: A unified vision-language-action and world model. arXiv preprint arXiv:2511.17502, 2025.
[5] Wei-Chia Chang and Yan-Ann Chen. Zero-shot vehicle model recognition via text-based retrieval-augmented generation. arXiv preprint arXiv:2510.18502, 2025.
[6] Nicolas Harvey Chapman, Feras Dayoub, Will Browne, and Christopher Lehnert. Queryadapter: Rapid adaptation of vision-language models in response to natural language queries. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 9606–9613. IEEE, 2025.
[7] Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088, 2025.
[8] Benjamin A Christie, Yinlong Dai, Mohammad Bararjanianbahnamiri, Simon Stepputtis, and Dylan P Losey. From local corrections to generalized skills: Improving neuro-symbolic policies with memo. arXiv preprint arXiv:2603.04560, 2026.
[9] Yinpei Dai, Hongze Fu, Jayjun Lee, Yuejiang Liu, Haoran Zhang, Jianing Yang, Chelsea Finn, Nima Fazeli, and Joyce Chai. Robomme: Benchmarking and understanding memory for robotic generalist policies. arXiv preprint arXiv:2603.04639, 2026.
[10] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: an embodied multimodal language model. In Proceedings of the 40th International Conference on Machine Learning, pages 8469–8488, 2023.
[11] Xinqi Fan, Xueli Chen, Luoxiao Yang, Chuin Hong Yap, Rizwan Qureshi, Qi Dou, Moi Hoon Yap, and Mubarak Shah. Test-time retrieval-augmented adaptation for vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8810–8819, 2025.
[12] Yiguo Fan, Shuanghao Bai, Xinyang Tong, Pengxiang Ding, Yuyang Zhu, Hongchao Lu, Fengqi Dai, Wei Zhao, Yang Liu, Siteng Huang, et al. Long-vla: Unleashing long-horizon capability of vision language action model for robot manipulation. In 9th Annual Conference on Robot Learning, 2025.
[13] Xiaolin Fang, Bo-Ruei Huang, Jiayuan Mao, Jasmine Shone, Joshua B Tenenbaum, Tomás Lozano-Pérez, and Leslie Pack Kaelbling. Kalm: Keypoint abstraction using large models for object-relative imitation learning. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8307–8314. IEEE, 2025.
[14] Yuxia Fu, Zhizhen Zhang, Yuqi Zhang, Zijian Wang, Zi Huang, and Yadan Luo. Mergevla: Cross-skill model merging toward a generalist vision-language-action agent. arXiv preprint arXiv:2511.18810, 2025.
[15] Ankit Goyal, Valts Blukis, Jie Xu, Yijie Guo, Yu-Wei Chao, and Dieter Fox. Rvt-2: Learning precise manipulation from few demonstrations. In RSS 2024 Workshop: Data Generation for Robotics, 2024.
[16] Minghao Guo, Qingyue Jiao, Zeru Shi, Yihao Quan, Boxuan Zhang, Danrui Li, Liwei Che, Wujiang Xu, Shilong Liu, Zirui Liu, Mubbasir Kapadia, Vladimir Pavlovic, Jiang Liu, Mengdi Wang, Yiyu Shi, Dimitris N. Metaxas, and Ruixiang Tang. Memeye: A visual-centric evaluation framework for multimodal agent memory, 2026.
[17] Minghao Guo, Qingcheng Zeng, Xujiang Zhao, Yanchi Liu, Wenchao Yu, Mengnan Du, Haifeng Chen, and Wei Cheng. Deepsieve: Information sieving via llm-as-a-knowledge-router. In Findings of the Association for Computational Linguistics: EACL 2026, pages 3054–3077, 2026.
[18] Xinying Guo, Chenxi Jiang, Hyun Bin Kim, Ying Sun, Yang Xiao, Yuhang Han, and Jianfei Yang. Chameleon: Episodic memory for long-horizon robotic manipulation. arXiv preprint arXiv:2603.24576, 2026.
[19] Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J Davison. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020.
[20] Yuanchen Ju, Kaizhe Hu, Guowei Zhang, Gu Zhang, Mingrun Jiang, and Huazhe Xu. Robo-abc: Affordance generalization beyond categories via semantic correspondence for robot manipulation. In European Conference on Computer Vision, pages 222–239. Springer, 2024.
[21] Donghoon Kim, Minji Bae, Unghui Nam, Gyeonghun Kim, Suyun Lee, Kyuhong Shim, and Byonghyo Shim. Adaptive capacity allocation for vision language action fine-tuning. arXiv preprint arXiv:2603.07404, 2026.
[22] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. In 8th Annual Conference on Robot Learning, 2024.
[23] Yuxuan Kuang, Junjie Ye, Haoran Geng, Jiageng Mao, Congyue Deng, Leonidas Guibas, He Wang, and Yue Wang. Ram: Retrieval-based affordance transfer for generalizable zero-shot robotic manipulation. In 8th Annual Conference on Robot Learning, 2024.
[24] Sateesh Kumar, Shivin Dass, Georgios Pavlakos, and Roberto Martín-Martín. Collage: Adaptive fusion-based retrieval for augmented policy learning. In Conference on Robot Learning, pages 4607–4624. PMLR, 2025.
[25] So Kuroki, Mai Nishimura, and Tadashi Kozuno. Multi-agent behavior retrieval: Retrieval-augmented policy training for cooperative push manipulation by mobile robots. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 12671–12678. IEEE, 2024.
[26] Youngjun Lee, Doyoung Kim, Junhyeok Kang, Jihwan Bang, Hwanjun Song, and Jae-Gil Lee. Ra-tta: Retrieval-augmented test-time adaptation for vision-language models. In The Thirteenth International Conference on Learning Representations, 2025.
[27] Zhuoran Li, Zhiyang Li, Kaijun Zhou, and Jinyu Gu. Soma: Strategic orchestration and memory-augmented system for vision-language-action model robustness via in-context adaptation. arXiv preprint arXiv:2603.24060, 2026.
[28] Yuanchang Liang, Xiaobo Wang, Kai Wang, Shuo Wang, Xiaojiang Peng, Haoyu Chen, David Kim Huat Chua, and Prahlad Vadakkepat. Adaptive action chunking at inference-time for vision-language-action models. arXiv preprint arXiv:2604.04161, 2026.
[29] Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024.
[30] Yuankai Luo, Woping Chen, Tong Liang, and Zhenguo Li. Coral: Scalable multi-task robot learning via lora experts. arXiv preprint arXiv:2603.09298, 2026.
[31] Kai Mei, Wujiang Xu, Minghao Guo, Shuhang Lin, and Yongfeng Zhang. Omnirouter: Budget and performance controllable multi-llm routing. ACM SIGKDD Explorations Newsletter, 27(2):107–116, 2025.
[32] Tushar Nagarajan and Kristen Grauman. Attributes as operators: factorizing unseen attribute-object compositions. In Proceedings of the European Conference on Computer Vision (ECCV), pages 169–185, 2018.
[33] Jingyuan Qi, Zhiyang Xu, Rulin Shao, Yang Chen, Jin Di, Yu Cheng, Qifan Wang, and Lifu Huang. Rora-vlm: Robust retrieval augmentation for vision language models. arXiv preprint arXiv:2410.08876, 2024.
[34] Moritz Reuss, Hongyi Zhou, Marcel Rühle, Ömer Erdinç Yağmurlu, Fabian Otto, and Rudolf Lioutikov. Flower: Democratizing generalist robot policies with efficient vision-language-flow models. In Conference on Robot Learning, pages 3736–3761. PMLR, 2025.
[35] Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, and Gao Huang. Memoryvla: Perceptual-cognitive memory in vision-language-action models for robotic manipulation. arXiv preprint arXiv:2508.19236, 2025.
[36] Bronislav Sidik and Dror Mizrahi. 3d-anchored lookahead planning for persistent robotic scene memory via world-model-based mcts. arXiv preprint arXiv:2604.11302, 2026.
[37] Wenxuan Song, Ziyang Zhou, Han Zhao, Jiayi Chen, Pengxiang Ding, Haodong Yan, Yuxin Huang, Feilong Tang, Donglin Wang, and Haoang Li. Reconvla: Reconstructive vision-language-action model as effective robot perceiver. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18549–18557, 2026.
[38] Kaustubh Sridhar, Souradeep Dutta, Dinesh Jayaraman, and Insup Lee. Ricl: Adding in-context adaptability to pre-trained vision-language-action models. In 9th Annual Conference on Robot Learning, 2025.
[39] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024.
[40] Sheng Wang. Roboflamingo-plus: Fusion of depth and rgb perception with vision-language models for enhanced robotic manipulation. arXiv preprint arXiv:2503.19510, 2025.
[41] Wenke Xia, Dong Wang, Xincheng Pang, Zhigang Wang, Bin Zhao, Di Hu, and Xuelong Li. Kinematic-aware prompting for generalizable articulated object manipulation with llms. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 2073–2080. IEEE, 2024.
[42]

Haozhe Xie, Beichen Wen, Jiarui Zheng, Zhaoxi Chen, Fangzhou Hong, Haiwen Diao, and Ziwei Liu. Dynamicvla: A vision-language-action model for dynamic object manipulation. arXiv preprint arXiv:2601.22153, 2026.

[43]

Yiweng Xie, Bo He, Junke Wang, Xiangyu Zheng, Ziyi Ye, and Zuxuan Wu. Fluxmem: Adaptive hierarchical memory for streaming video understanding. arXiv preprint arXiv:2603.02096, 2026.

[44]

Zilong Xie, Jingyu Gong, Xin Tan, Zhizhong Zhang, and Yuan Xie. Zero-shot robotic manipulation via 3d gaussian splatting-enhanced multimodal retrieval-augmented generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18683–18691, 2026.

[45]

Shuai Yang, Hao Li, Bin Wang, Yilun Chen, Yang Tian, Tai Wang, Hanqing Wang, Feng Zhao, Yiyi Liao, and Jiangmiao Pang. Vision-language-action instruction tuning: From understanding to manipulation. In The Fourteenth International Conference on Learning Representations, 2026.

[46]

Jinhui Ye, Fangjing Wang, Ning Gao, Junqiu Yu, Yangkun Zhu, Bin Wang, Jinyu Zhang, Weiyang Jin, Yanwei Fu, Feng Zheng, et al. St4vla: Spatially guided training for vision-language-action models. arXiv preprint arXiv:2602.10109, 2026.

[47]

Ziyi Ye, Xiangsheng Li, Qiuchi Li, Qingyao Ai, Yujia Zhou, Wei Shen, Dong Yan, and Yiqun Liu. Learning llm-as-a-judge for preference alignment. In The Thirteenth International Conference on Learning Representations, 2025.

[48]

Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. In ICRA 2024 Workshop on 3D Visual Representations for Robot Manipulation, 2024.

[49]

Likui Zhang, Tao Tang, Zhihao Zhan, Xiuwei Chen, Zisheng Chen, Jianhua Han, Jiangtong Zhu, Pei Xu, Hang Xu, Hefeng Wu, et al. Atomicvla: Unlocking the potential of atomic skill learning in robots. arXiv preprint arXiv:2603.07648, 2026.

[50]

Yang Zhang, Chenwei Wang, Ouyang Lu, Yuan Zhao, Yunfei Ge, Zhenglong Sun, Xiu Li, Chi Zhang, Chenjia Bai, and Xuelong Li. Align-then-steer: Adapting the vision-language action models through unified latent guidance. arXiv preprint arXiv:2509.02055, 2025.

[51]

Yuelin Zhang, Sijie Cheng, Chen Li, Zongzhao Li, Yuxin Huang, Yang Liu, and Wenbing Huang. Recurrent reasoning with vision-language models for estimating long-horizon embodied task progress. arXiv preprint arXiv:2603.17312, 2026.

[52]

Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-VLA: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025.

[53]

Jiaming Zhou, Ke Ye, Teli Ma, Zifan Wang, Ronghe Qiu, Kun-Yu Lin, Zhilin Zhao, Junwei Liang, et al. Exploring the limits of vision-language-action manipulation in cross-task generalization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

[54]

Zewei Zhou, Tianhui Cai, Seth Z Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

[55]

Yichen Zhu, Zhicai Ou, Xiaofeng Mou, and Jian Tang. Retrieval-augmented embodied agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17985–17995, 2024.

[56]

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In 7th Annual Conference on Robot Learning, 2023.

A Procedural Memory Storage and Retrieval

A.1 Prompt for Procedural State Extraction

We use the...

Output JSON only. No markdown, no comments, no trailing commas.

Do not add/remove keys. Keep exactly the keys shown above.

Use ONLY one of the allowed enum values.

A.2 Procedural State Schema and Matching Weights

This section summarizes the procedural state schema for RoboTwin tasks and the matching weights used in Action-Aware Procedural Matching. The free-form subtask field is only used for readability and debugging, and is excluded from similarity computation to avoid seman...
