
arxiv: 2605.29562 · v1 · pith:5SYWXZEZnew · submitted 2026-05-28 · 💻 cs.RO · cs.AI · cs.CV

VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models

Pith reviewed 2026-06-29 07:02 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV
keywords vision-language-action models · procedural memory · LoRA adapters · cross-task generalization · robotic manipulation · memory retrieval · dynamic fusion

The pith

VLA-Pro stores task-specific LoRA adapters as procedural memories and retrieves plus fuses them at inference to improve cross-task generalization in vision-language-action models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes VLA-Pro as a plug-and-play addition to vision-language-action models that struggle to handle unseen tasks requiring experience transfer across objects, scenes, and actions. It stores task-relevant LoRA adapters during training as a form of procedural memory. At inference the system selects relevant stored adapters from the current visual, language, and action context and blends them to produce the next action chunk. A reader would care because this offers a modular route to reuse prior manipulation skills without full retraining or loss of stability. If the approach works, robots could accumulate and apply experience across tasks in a way that scales beyond single-task training.

Core claim

VLA-Pro stores task-specific LoRA adapters as parameterized procedural memories during training. At inference time it retrieves relevant procedural memories based on the current multi-modal context and dynamically fuses these memories for generating the current action chunk. Experiments across RoboTwin, RLBench, and real-world tasks show consistent gains in cross-task generalization on multiple backbones.

What carries the argument

Retrieval of relevant task-specific LoRA adapters followed by dynamic fusion into the current action generation step.
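The retrieve-then-fuse step can be pictured with a small sketch. Everything here is an illustrative assumption rather than the paper's implementation: each stored memory is a (key embedding, flat weight-delta) pair standing in for a LoRA update, retrieval uses cosine similarity against the current context embedding, and fusion is a softmax-weighted average of the top-k deltas. The names `retrieve_and_fuse`, `memory_bank`, and `temperature` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve_and_fuse(context, memory_bank, k=2, temperature=0.1):
    """Pick the k memories most similar to `context` and blend their deltas.

    memory_bank: list of (key_embedding, delta) pairs, where `delta` is a flat
    list of floats used here as a stand-in for a LoRA weight update (assumption).
    Returns (fused_delta, fusion_weights).
    """
    scored = sorted(
        ((cosine(context, key), delta) for key, delta in memory_bank),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    weights = softmax([score for score, _ in scored], temperature)
    dim = len(scored[0][1])
    fused = [
        sum(w * delta[i] for w, (_, delta) in zip(weights, scored))
        for i in range(dim)
    ]
    return fused, weights


# Toy memory bank: three stored "adapters" keyed by 2-D context embeddings.
bank = [
    ([1.0, 0.0], [1.0, 1.0]),
    ([0.0, 1.0], [3.0, 3.0]),
    ([-1.0, 0.0], [9.0, 9.0]),
]
fused_delta, fusion_weights = retrieve_and_fuse([1.0, 1.0], bank, k=2, temperature=1.0)
```

In this toy case the context is equally similar to the first two keys, so the fused delta is their even blend and the dissimilar third memory is never mixed in. The frozen base weights are untouched; only the fused delta would be applied on top of them.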

If this is right

  • Cross-task success improves by up to 207 percent in relative terms on simulation benchmarks.
  • Real-world manipulation success rises from 5.8 percent to 65.0 percent on the tested tasks.
  • The same gains appear across different VLA backbones while keeping the original model weights unchanged.
  • Procedural memory transfer supplies a route for moving manipulation experience to novel tasks without retraining the full model.
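The two headline numbers above sit on different scales: the simulation figure is a relative gain, while the real-world figure is a pair of absolute success rates. A short sketch of the conversion, using a hypothetical helper name, makes the distinction concrete.

```python
def relative_improvement(baseline, improved):
    """Relative improvement in percent: (improved - baseline) / baseline * 100."""
    if baseline == 0:
        raise ValueError("relative improvement is undefined for a zero baseline")
    return (improved - baseline) / baseline * 100.0


# Real-world success rates reported in the abstract (in percent).
real_world_gain = relative_improvement(5.8, 65.0)  # about 1020.7

# A 207 percent relative improvement means the new rate is 3.07x the baseline;
# the 20.0 baseline here is a made-up value for illustration only.
illustrative_gain = relative_improvement(20.0, 61.4)  # about 207.0
```

Under this definition, the real-world jump from 5.8 to 65.0 percent is a gain of 59.2 percentage points, or roughly 1,021 percent in relative terms, so the 207 percent simulation figure and the real-world rates should not be compared directly.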

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A growing library of stored adapters could let a single robot improve over its lifetime by adding new tasks without overwriting old ones.
  • The same retrieval-plus-fusion pattern might extend to other sequential decision domains such as navigation or tool use where context cues signal which past skills apply.
  • If retrieval accuracy proves the main limit, future work could test whether richer context encoders or learned retrieval policies raise the ceiling further.

Load-bearing premise

Retrieval from multi-modal context will select useful memories and their fusion will add value without causing negative transfer or unstable actions.

What would settle it

A controlled test on held-out tasks where the base VLA model without retrieval matches or exceeds VLA-Pro performance, or where fusion produces visibly unstable robot trajectories.

Figures

Figures reproduced from arXiv: 2605.29562 by Ruimeng Yang, Shengyu Si, Yuanzhuo Lu, Yu-Gang Jiang, Ziyi Ye, Zuxuan Wu.

Figure 1: Overview of VLA-Pro. To bridge this gap, we propose VLA-Pro, a plug-and-play framework that transfers procedural memory from the most similar training (seen) tasks to testing (unseen) tasks.
Figure 2: VLA-Pro Method Overview. The top row illustrates task execution as a sequence of stages, where each stage involves the retrieval and integration of procedural memories. Given the current multimodal context, VLA-Pro retrieves a series of procedural state sequences D_i from a memory bank. Each indexed D_i corresponds to a task-specific parameterized experience Δθ_i, which is further merged into a fused adapter …
Figure 3: Overview of our real-world experimental setup. 6 training tasks and 6 corresponding test tasks designed for evaluating the model’s performance in real-world manipulation.
4.2 Models Backbones and Baselines. For RoboTwin, the VLA-Pro framework is instantiated on three different backbones: X-VLA [52], RDT [29], and π0.5 [2]. For each backbone, its pretrained checkpoint independent of RoboTwin is used as the base…
Figure 4: Real-world manipulation results. Quantitative success rates and qualitative execution examples on held-out real-world tasks, comparing the baseline with VLA-Pro.
Figure 5: Effect of different parameter components in VLA-Pro with π0.5 backbone on RoboTwin. The left and right radar charts show five seen and five unseen tasks, respectively.
Figure 7: Analysis of retrieval performance vs. task success rate. (a) Procedural state extraction accuracy measured by MRR. (b) Correlation between Unseen–Seen task similarity and transfer gain.
[Adjacent plot: Task Success Rate (%) by Experiment Configuration (baseline, k=1, k=2, k=3) for close fridge, laptop lid, turn oven, toilet seat, water plants, close microwave, take usb out, take lid off, beat the buzz, and Avg.; data values not recoverable.]
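Figure 7(a) reports retrieval quality as MRR (mean reciprocal rank). As a reminder of how that metric is computed, here is a minimal sketch; the function name and the toy rankings are illustrative and not taken from the paper.

```python
def mean_reciprocal_rank(rankings, targets):
    """Average of 1/rank of the correct item in each ranking (0 if it is absent)."""
    total = 0.0
    for ranking, target in zip(rankings, targets):
        for position, item in enumerate(ranking, start=1):
            if item == target:
                total += 1.0 / position
                break
    return total / len(rankings)


# Three toy queries: correct item ranked 1st, 2nd, and missing entirely.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
targets = ["a", "a", "x"]
mrr = mean_reciprocal_rank(rankings, targets)  # (1 + 0.5 + 0) / 3 = 0.5
```

An MRR near 1 means the correct procedural state is usually ranked first; lower values mean it tends to appear further down the retrieved list or not at all.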
Figure 9: Visualization of RoboTwin dataset construction. This figure illustrates the 8 training tasks and the corresponding 9 test tasks for cross-task generalization evaluation. We build our customized RoboTwin task suite by modifying the original RoboTwin environment. Specifically, we retain 2 original training tasks and construct 6 additional training tasks and 9 held-out test tasks. The full task list and refer…
Figure 10: Modified grasp-point configuration for constructing procedurally related RoboTwin tasks.
B.2 RLBench Task Suite. Our RLBench experiments follow the X-ICM [53] task split, which contains 18 training tasks and 23 held-out test tasks for cross-task generalization evaluation. From the 18 training tasks, we select 8 foundational tasks as source memories, since they cover basic procedural elements. The selected…
Figure 11: Examples of initial and final wrist-camera images for the real-world training tasks. As the visual observation changes during task execution, the model infers the current execution stage accordingly and retrieves the relevant procedural memory.
B.3 Real-World Task Suite. This section provides additional details about the real-world task suite. We design 6 training tasks and 6 corresponding held-out test ta…
Figure 12: Representative training loss curves in the RoboTwin experiments. From left to right, the plots show the loss curve of the RDT baseline, the continued VLA-Pro training with RDT as the backbone, the π0.5 baseline, and the continued VLA-Pro training with π0.5 as the backbone.
RLBench. In RLBench experiments, RDT uses the official RDT-1B pretrained model. For AtomicVLA, AdamW is used with a cosine learning-ra…
Original abstract

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, yet they still struggle to generalize to unseen tasks that necessitate transferring relevant experience across objects, scenes, and action patterns. This paper proposes VLA-Pro, a plug-and-play framework designed to enhance cross-task generalization by storing task-relevant procedural memories at training time and transferring these memories during inference. Specifically, VLA-Pro stores task-specific LoRA adapters as parameterized procedural memories during training. At inference time, VLA-Pro retrieves relevant procedural memories based on the current multi-modal context and dynamically fuses these memories for generating the current action chunk. Experiments on RoboTwin, RLBench, and real-world manipulation tasks show that VLA-Pro consistently improves cross-task generalization across multiple backbones, achieving up to a 207% relative improvement in simulation and increasing real-world success rate from 5.8% to 65.0%. These results suggest that procedural memory retrieval and adaptation provide an effective mechanism for transferring manipulation experience to novel tasks while preserving modularity and execution stability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes VLA-Pro, a plug-and-play framework for Vision-Language-Action (VLA) models that stores task-specific LoRA adapters as procedural memories during training and, at inference, retrieves relevant memories via multi-modal context similarity and dynamically fuses them to generate action chunks. This is claimed to improve cross-task generalization on RoboTwin, RLBench, and real-world manipulation tasks, with reported gains of up to 207% relative improvement in simulation and real-world success rates rising from 5.8% to 65.0% across multiple backbones.

Significance. If the empirical results hold after addressing the retrieval and fusion assumptions, the work would be significant for offering a modular, parameter-efficient mechanism to transfer manipulation experience across tasks without full retraining, preserving execution stability. The scale of the reported gains suggests potential for practical impact in general-purpose robotics, though the absence of detailed baseline comparisons and negative-transfer controls limits immediate assessment of novelty relative to existing adapter or memory-based methods.

major comments (2)
  1. [Abstract] Abstract: The central empirical claim (207% relative improvement; 5.8%→65.0% real-world success) is load-bearing, yet no information is supplied on the exact baselines, number of evaluation episodes, data splits, or statistical significance testing; without these, it is impossible to determine whether the gains arise from the retrieval-fusion mechanism or from unaccounted confounds.
  2. [Method] Method description (inference-time retrieval and fusion): The claim that multi-modal-context retrieval followed by dynamic fusion reliably transfers useful procedural memories without negative transfer rests on the untested assumption that visual-language similarity selects action-compatible LoRA adapters; the manuscript must provide an ablation or failure-case analysis on tasks with similar objects/scenes but divergent action sequences, as mismatch would directly undermine the cross-task generalization results.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'preserving modularity and execution stability' is asserted without reference to any stability metric or comparison against unfused baselines.
  2. The manuscript should include a table or figure explicitly listing the backbones tested and the precise retrieval similarity function (e.g., cosine on which embeddings).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the empirical claims and methodological assumptions. We address each major comment point-by-point below, clarifying where details appear in the manuscript and indicating revisions made to improve clarity and completeness.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central empirical claim (207% relative improvement; 5.8%→65.0% real-world success) is load-bearing, yet no information is supplied on the exact baselines, number of evaluation episodes, data splits, or statistical significance testing; without these, it is impossible to determine whether the gains arise from the retrieval-fusion mechanism or from unaccounted confounds.

    Authors: The full experimental protocol—including exact baselines (e.g., vanilla VLA, LoRA fine-tuning per task), evaluation episodes (100 per task across 3 random seeds), data splits (train/test task partitions detailed in Section 4.1), and statistical reporting (mean ± std with significance tests)—is provided in Section 4 and Appendix B. The abstract summarizes headline results for brevity. To address the concern, we have revised the abstract to include a one-sentence reference to the evaluation setup and added a compact experimental summary table (Table 1) in the main text. revision: partial

  2. Referee: [Method] Method description (inference-time retrieval and fusion): The claim that multi-modal-context retrieval followed by dynamic fusion reliably transfers useful procedural memories without negative transfer rests on the untested assumption that visual-language similarity selects action-compatible LoRA adapters; the manuscript must provide an ablation or failure-case analysis on tasks with similar objects/scenes but divergent action sequences, as mismatch would directly undermine the cross-task generalization results.

    Authors: We agree that explicit validation of the retrieval assumption is valuable. We have added a targeted ablation (new Section 4.4) comparing multi-modal retrieval against vision-only and language-only variants on a curated set of tasks with high visual/scene similarity but divergent action sequences (e.g., “pick red block” vs. “push red block” on identical tables). Results show reduced negative transfer with the full multi-modal similarity metric, supported by quantitative success rates and qualitative failure-case analysis. These additions directly test and support the cross-task transfer claims. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework validated on external benchmarks

full rationale

The paper introduces VLA-Pro as a plug-and-play retrieval-and-fusion framework for LoRA adapters in VLA models. No equations, derivations, or first-principles predictions appear in the provided text; the central claims rest on experimental outcomes across RoboTwin, RLBench, and real-world tasks rather than any self-referential fitting or self-citation chain that reduces the result to its inputs by construction. Retrieval and fusion are described as design choices whose effectiveness is measured externally, with no load-bearing step that renames a fit as a prediction or imports uniqueness from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities beyond standard use of LoRA adapters and retrieval; framework appears to build on existing techniques without new postulates.



Reference graph

Works this paper leans on

59 extracted references · 25 canonical work pages · 12 internal anchors

  1. [1]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  2. [2]

    π0.5: A vision-language-action model with open-world generalization

    Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y Galliker, et al. π0.5: A vision-language-action model with open-world generalization. In 9th Annual Conference on Robot Learning, 2025

  3. [3]

    Rt-1: Robotics transformer for real-world control at scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. Robotics: Science and Systems XIX, 2023

  4. [4]

    RynnVLA-002: A Unified Vision-Language-Action and World Model

    Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Yuming Jiang, Jiayan Guo, Xin Li, Hao Luo, et al. Rynnvla-002: A unified vision-language-action and world model. arXiv preprint arXiv:2511.17502, 2025

  5. [5]

    Zero-shot vehicle model recognition via text-based retrieval-augmented generation

    Wei-Chia Chang and Yan-Ann Chen. Zero-shot vehicle model recognition via text-based retrieval-augmented generation. arXiv preprint arXiv:2510.18502, 2025

  6. [6]

    Queryadapter: Rapid adaptation of vision-language models in response to natural language queries

    Nicolas Harvey Chapman, Feras Dayoub, Will Browne, and Christopher Lehnert. Queryadapter: Rapid adaptation of vision-language models in response to natural language queries. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 9606–9613. IEEE, 2025

  7. [7]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088, 2025

  8. [8]

    From Local Corrections to Generalized Skills: Improving Neuro-Symbolic Policies with MEMO

    Benjamin A Christie, Yinlong Dai, Mohammad Bararjanianbahnamiri, Simon Stepputtis, and Dylan P Losey. From local corrections to generalized skills: Improving neuro-symbolic policies with memo. arXiv preprint arXiv:2603.04560, 2026

  9. [9]

    RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

    Yinpei Dai, Hongze Fu, Jayjun Lee, Yuejiang Liu, Haoran Zhang, Jianing Yang, Chelsea Finn, Nima Fazeli, and Joyce Chai. Robomme: Benchmarking and understanding memory for robotic generalist policies. arXiv preprint arXiv:2603.04639, 2026

  10. [10]

    Palm-e: an embodied multimodal language model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: an embodied multimodal language model. In Proceedings of the 40th International Conference on Machine Learning, pages 8469–8488, 2023

  11. [11]

    Test-time retrieval-augmented adaptation for vision-language models

    Xinqi Fan, Xueli Chen, Luoxiao Yang, Chuin Hong Yap, Rizwan Qureshi, Qi Dou, Moi Hoon Yap, and Mubarak Shah. Test-time retrieval-augmented adaptation for vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8810–8819, 2025

  12. [12]

    Long-vla: Unleashing long-horizon capability of vision language action model for robot manipulation

    Yiguo Fan, Shuanghao Bai, Xinyang Tong, Pengxiang Ding, Yuyang Zhu, Hongchao Lu, Fengqi Dai, Wei Zhao, Yang Liu, Siteng Huang, et al. Long-vla: Unleashing long-horizon capability of vision language action model for robot manipulation. In 9th Annual Conference on Robot Learning, 2025

  13. [13]

    Kalm: Keypoint abstraction using large models for object-relative imitation learning

    Xiaolin Fang, Bo-Ruei Huang, Jiayuan Mao, Jasmine Shone, Joshua B Tenenbaum, Tomás Lozano-Pérez, and Leslie Pack Kaelbling. Kalm: Keypoint abstraction using large models for object-relative imitation learning. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8307–8314. IEEE, 2025

  14. [14]

    Mergevla: Cross-skill model merging toward a generalist vision-language-action agent

    Yuxia Fu, Zhizhen Zhang, Yuqi Zhang, Zijian Wang, Zi Huang, and Yadan Luo. Mergevla: Cross-skill model merging toward a generalist vision-language-action agent. arXiv preprint arXiv:2511.18810, 2025

  15. [15]

    Rvt-2: Learning precise manipulation from few demonstrations

    Ankit Goyal, Valts Blukis, Jie Xu, Yijie Guo, Yu-Wei Chao, and Dieter Fox. Rvt-2: Learning precise manipulation from few demonstrations. In RSS 2024 Workshop: Data Generation for Robotics, 2024

  16. [16]

    Memeye: A visual-centric evaluation framework for multimodal agent memory

    Minghao Guo, Qingyue Jiao, Zeru Shi, Yihao Quan, Boxuan Zhang, Danrui Li, Liwei Che, Wujiang Xu, Shilong Liu, Zirui Liu, Mubbasir Kapadia, Vladimir Pavlovic, Jiang Liu, Mengdi Wang, Yiyu Shi, Dimitris N. Metaxas, and Ruixiang Tang. Memeye: A visual-centric evaluation framework for multimodal agent memory, 2026

  17. [17]

    Deepsieve: Information sieving via llm-as-a-knowledge-router

    Minghao Guo, Qingcheng Zeng, Xujiang Zhao, Yanchi Liu, Wenchao Yu, Mengnan Du, Haifeng Chen, and Wei Cheng. Deepsieve: Information sieving via llm-as-a-knowledge-router. In Findings of the Association for Computational Linguistics: EACL 2026, pages 3054–3077, 2026

  18. [18]

    Chameleon: Control-Indexed Prospective Memory for Visuomotor Manipulation

    Xinying Guo, Chenxi Jiang, Hyun Bin Kim, Ying Sun, Yang Xiao, Yuhang Han, and Jianfei Yang. Chameleon: Episodic memory for long-horizon robotic manipulation. arXiv preprint arXiv:2603.24576, 2026

  19. [19]

    Rlbench: The robot learning benchmark & learning environment

    Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J Davison. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020

  20. [20]

    Robo-abc: Affordance generalization beyond categories via semantic correspondence for robot manipulation

    Yuanchen Ju, Kaizhe Hu, Guowei Zhang, Gu Zhang, Mingrun Jiang, and Huazhe Xu. Robo-abc: Affordance generalization beyond categories via semantic correspondence for robot manipulation. In European Conference on Computer Vision, pages 222–239. Springer, 2024

  21. [21]

    Adaptive capacity allocation for vision language action fine-tuning

    Donghoon Kim, Minji Bae, Unghui Nam, Gyeonghun Kim, Suyun Lee, Kyuhong Shim, and Byonghyo Shim. Adaptive capacity allocation for vision language action fine-tuning. arXiv preprint arXiv:2603.07404, 2026

  22. [22]

    Openvla: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. In 8th Annual Conference on Robot Learning, 2024

  23. [23]

    Ram: Retrieval-based affordance transfer for generalizable zero-shot robotic manipulation

    Yuxuan Kuang, Junjie Ye, Haoran Geng, Jiageng Mao, Congyue Deng, Leonidas Guibas, He Wang, and Yue Wang. Ram: Retrieval-based affordance transfer for generalizable zero-shot robotic manipulation. In 8th Annual Conference on Robot Learning, 2024

  24. [24]

    Collage: Adaptive fusion-based retrieval for augmented policy learning

    Sateesh Kumar, Shivin Dass, Georgios Pavlakos, and Roberto Martín-Martín. Collage: Adaptive fusion-based retrieval for augmented policy learning. In Conference on Robot Learning, pages 4607–4624. PMLR, 2025

  25. [25]

    Multi-agent behavior retrieval: Retrieval-augmented policy training for cooperative push manipulation by mobile robots

    So Kuroki, Mai Nishimura, and Tadashi Kozuno. Multi-agent behavior retrieval: Retrieval-augmented policy training for cooperative push manipulation by mobile robots. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 12671–12678. IEEE, 2024

  26. [26]

    Ra-tta: Retrieval-augmented test-time adaptation for vision-language models

    Youngjun Lee, Doyoung Kim, Junhyeok Kang, Jihwan Bang, Hwanjun Song, and Jae-Gil Lee. Ra-tta: Retrieval-augmented test-time adaptation for vision-language models. In The Thirteenth International Conference on Learning Representations, 2025

  27. [27]

    Soma: Strategic orchestration and memory-augmented system for vision-language-action model robustness via in-context adaptation

    Zhuoran Li, Zhiyang Li, Kaijun Zhou, and Jinyu Gu. Soma: Strategic orchestration and memory-augmented system for vision-language-action model robustness via in-context adaptation. arXiv preprint arXiv:2603.24060, 2026

  28. [28]

    Adaptive Action Chunking at Inference-time for Vision-Language-Action Models

    Yuanchang Liang, Xiaobo Wang, Kai Wang, Shuo Wang, Xiaojiang Peng, Haoyu Chen, David Kim Huat Chua, and Prahlad Vadakkepat. Adaptive action chunking at inference-time for vision-language-action models. arXiv preprint arXiv:2604.04161, 2026

  29. [29]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024

  30. [30]

    Coral: Scalable multi-task robot learning via lora experts

    Yuankai Luo, Woping Chen, Tong Liang, and Zhenguo Li. Coral: Scalable multi-task robot learning via lora experts. arXiv preprint arXiv:2603.09298, 2026

  31. [31]

    Omnirouter: Budget and performance controllable multi-llm routing

    Kai Mei, Wujiang Xu, Minghao Guo, Shuhang Lin, and Yongfeng Zhang. Omnirouter: Budget and performance controllable multi-llm routing. ACM SIGKDD Explorations Newsletter, 27(2):107–116, 2025

  32. [32]

    Attributes as operators: factorizing unseen attribute-object compositions

    Tushar Nagarajan and Kristen Grauman. Attributes as operators: factorizing unseen attribute-object compositions. In Proceedings of the European Conference on Computer Vision (ECCV), pages 169–185, 2018

  33. [33]

    Rora-vlm: Robust retrieval augmentation for vision language models

    Jingyuan Qi, Zhiyang Xu, Rulin Shao, Yang Chen, Jin Di, Yu Cheng, Qifan Wang, and Lifu Huang. Rora-vlm: Robust retrieval augmentation for vision language models. arXiv preprint arXiv:2410.08876, 2024

  34. [34]

    Flower: Democratizing generalist robot policies with efficient vision-language-flow models

    Moritz Reuss, Hongyi Zhou, Marcel Rühle, Ömer Erdinç Yağmurlu, Fabian Otto, and Rudolf Lioutikov. Flower: Democratizing generalist robot policies with efficient vision-language-flow models. In Conference on Robot Learning, pages 3736–3761. PMLR, 2025

  35. [35]

    MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

    Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, and Gao Huang. Memoryvla: Perceptual-cognitive memory in vision-language-action models for robotic manipulation. arXiv preprint arXiv:2508.19236, 2025

  36. [36]

    3D-Anchored Lookahead Planning for Persistent Robotic Scene Memory via World-Model-Based MCTS

    Bronislav Sidik and Dror Mizrahi. 3d-anchored lookahead planning for persistent robotic scene memory via world-model-based mcts. arXiv preprint arXiv:2604.11302, 2026

  37. [37]

    Reconvla: Reconstructive vision-language-action model as effective robot perceiver

    Wenxuan Song, Ziyang Zhou, Han Zhao, Jiayi Chen, Pengxiang Ding, Haodong Yan, Yuxin Huang, Feilong Tang, Donglin Wang, and Haoang Li. Reconvla: Reconstructive vision-language-action model as effective robot perceiver. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18549–18557, 2026

  38. [38]

    Ricl: Adding in-context adaptability to pre-trained vision-language-action models

    Kaustubh Sridhar, Souradeep Dutta, Dinesh Jayaraman, and Insup Lee. Ricl: Adding in-context adaptability to pre-trained vision-language-action models. In 9th Annual Conference on Robot Learning, 2025

  39. [39]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024

  40. [40]

    Roboflamingo-plus: Fusion of depth and rgb perception with vision-language models for enhanced robotic manipulation

    Sheng Wang. Roboflamingo-plus: Fusion of depth and rgb perception with vision-language models for enhanced robotic manipulation. arXiv preprint arXiv:2503.19510, 2025

  41. [41]

    Kinematic-aware prompting for generalizable articulated object manipulation with llms

    Wenke Xia, Dong Wang, Xincheng Pang, Zhigang Wang, Bin Zhao, Di Hu, and Xuelong Li. Kinematic-aware prompting for generalizable articulated object manipulation with llms. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 2073–2080. IEEE, 2024

  42. [42]

    Dynamicvla: A vision-language-action model for dynamic object manipulation

    Haozhe Xie, Beichen Wen, Jiarui Zheng, Zhaoxi Chen, Fangzhou Hong, Haiwen Diao, and Ziwei Liu. Dynamicvla: A vision-language-action model for dynamic object manipulation. arXiv preprint arXiv:2601.22153, 2026

  43. [43]

    Fluxmem: Adaptive hierarchical memory for streaming video understanding

    Yiweng Xie, Bo He, Junke Wang, Xiangyu Zheng, Ziyi Ye, and Zuxuan Wu. Fluxmem: Adaptive hierarchical memory for streaming video understanding. arXiv preprint arXiv:2603.02096, 2026

  44. [44]

    Zero-shot robotic manipulation via 3d gaussian splatting-enhanced multimodal retrieval-augmented generation

    Zilong Xie, Jingyu Gong, Xin Tan, Zhizhong Zhang, and Yuan Xie. Zero-shot robotic manipulation via 3d gaussian splatting-enhanced multimodal retrieval-augmented generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18683–18691, 2026

  45. [45]

    Vision-language-action instruction tuning: From understanding to manipulation

    Shuai Yang, Hao Li, Bin Wang, Yilun Chen, Yang Tian, Tai Wang, Hanqing Wang, Feng Zhao, Yiyi Liao, and Jiangmiao Pang. Vision-language-action instruction tuning: From understanding to manipulation. In The Fourteenth International Conference on Learning Representations, 2026

  46. [46]

    St4vla: Spatially guided training for vision-language-action models

    Jinhui Ye, Fangjing Wang, Ning Gao, Junqiu Yu, Yangkun Zhu, Bin Wang, Jinyu Zhang, Weiyang Jin, Yanwei Fu, Feng Zheng, et al. St4vla: Spatially guided training for vision-language-action models. arXiv preprint arXiv:2602.10109, 2026

  47. [47]

    Learning llm-as-a-judge for preference alignment

    Ziyi Ye, Xiangsheng Li, Qiuchi Li, Qingyao Ai, Yujia Zhou, Wei Shen, Dong Yan, and Yiqun Liu. Learning llm-as-a-judge for preference alignment. In The Thirteenth International Conference on Learning Representations, 2025

  48. [48]

    3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations

    Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. In ICRA 2024 Workshop on 3D Visual Representations for Robot Manipulation, 2024

  49. [49]

    Atomicvla: Unlocking the potential of atomic skill learning in robots

    Likui Zhang, Tao Tang, Zhihao Zhan, Xiuwei Chen, Zisheng Chen, Jianhua Han, Jiangtong Zhu, Pei Xu, Hang Xu, Hefeng Wu, et al. Atomicvla: Unlocking the potential of atomic skill learning in robots. arXiv preprint arXiv:2603.07648, 2026

  50. [50]

    Align-then-steer: Adapting the vision-language action models through unified latent guidance

    Yang Zhang, Chenwei Wang, Ouyang Lu, Yuan Zhao, Yunfei Ge, Zhenglong Sun, Xiu Li, Chi Zhang, Chenjia Bai, and Xuelong Li. Align-then-steer: Adapting the vision-language action models through unified latent guidance. arXiv preprint arXiv:2509.02055, 2025

  51. [51]

    Recurrent reasoning with vision-language models for estimating long-horizon embodied task progress

    Yuelin Zhang, Sijie Cheng, Chen Li, Zongzhao Li, Yuxin Huang, Yang Liu, and Wenbing Huang. Recurrent reasoning with vision-language models for estimating long-horizon embodied task progress. arXiv preprint arXiv:2603.17312, 2026

  52. [52]

    X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

    Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025

  53. [53]

    Exploring the limits of vision-language-action manipulation in cross-task generalization

    Jiaming Zhou, Ke Ye, Teli Ma, Zifan Wang, Ronghe Qiu, Kun-Yu Lin, Zhilin Zhao, Junwei Liang, et al. Exploring the limits of vision-language-action manipulation in cross-task generalization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  54. [54]

    Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning

    Zewei Zhou, Tianhui Cai, Seth Z Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  55. [55]

    Retrieval-augmented embodied agents

    Yichen Zhu, Zhicai Ou, Xiaofeng Mou, and Jian Tang. Retrieval-augmented embodied agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17985–17995, 2024

  56. [56]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In 7th Annual Conference on Robot Learning, 2023

A Procedural Memory Storage and Retrieval

A.1 Prompt for Procedural State Extraction

We use the...

  • Output JSON only. No markdown, no comments, no trailing commas
  • Do not add/remove keys. Keep exactly the keys shown above
  • Use ONLY one of the allowed enum values.

A.2 Procedural State Schema and Matching Weights

This section summarizes the procedural state schema for RoboTwin tasks and the matching weights used in Action-Aware Procedural Matching. The free-form subtask field is only used for readability and debugging, and is excluded from similarity computation to avoid seman...