pith. sign in

arxiv: 2606.08520 · v1 · pith:A3IJGP36new · submitted 2026-06-07 · 💻 cs.RO

Two Bridges, One Pathway: From VLMs to Generalizable VLAs with Embodied Trajectory-Coupled Data

Pith reviewed 2026-06-27 18:29 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-language modelsvision-language-actionrobot policiesembodied datageneralizationfine-tuningtrajectory supervision
0
0 comments X

The pith

Embodied trajectory-coupled data bridges VLMs to generalizable VLAs through gradual three-stage adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that directly fine-tuning vision-language models on robot action data forces the model across both visual perception and action prediction gaps at once, causing loss of pre-trained generalizations. It introduces embodied trajectory-coupled data drawn from the same robot scenes and trajectories but paired with language objectives as an intermediate form that shares visual context while keeping familiar supervision. A three-stage recipe first adapts the model to embodied visual-language semantics, then gradually shifts it toward action prediction while retaining representations, and finally specializes it to the target domain. Mixing task-relevant out-of-distribution ETC data with only a small amount of action data enables the resulting policy to handle novel visual-language conditions without collecting further robot demonstrations.

Core claim

Vision-language models can be turned into generalizable vision-language-action policies by using embodied trajectory-coupled data as a stepping stone that shares robot visual context while retaining language-understanding objectives; this enables a three-stage process of distribution bridging, objective bridging, and retentive adaptation, and mixing task-relevant out-of-distribution ETC data with limited action data transfers VLM generalizations into robust policies that succeed on novel conditions without additional demonstrations.

What carries the argument

Embodied trajectory-coupled (ETC) data: vision-language supervision derived from the same robot scenes and trajectories used for action learning.

If this is right

  • The model generalizes to novel visual-language conditions using only small amounts of action data.
  • Gradual bridging across distribution and objective gaps prevents degradation of VLM representations.
  • Three distinct stages are needed: first adapting to embodied visuals, then shifting objectives, then specializing to deployment.
  • Task-relevant out-of-distribution ETC data is effective for enabling generalization without new robot demonstrations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same gradual bridging approach could lower the total volume of robot demonstrations required to deploy capable policies across varied tasks.
  • Similar intermediate data forms might help adapt other pre-trained models when their output modality changes.
  • Experiments that vary how closely the ETC data matches the deployment visuals would show how domain-specific the data must be.

Load-bearing premise

That ETC data shares visual context while retaining familiar language objectives and thus acts as a natural stepping stone that preserves rather than degrades VLM representations during the transition to action prediction.

What would settle it

Training a model with the three-stage process but without mixing task-relevant out-of-distribution ETC data and checking whether it still generalizes to novel visual-language conditions at the same rate as the mixed version.

read the original abstract

Vision-language models (VLMs) are powerful general-purpose reasoners, yet converting them into robot control policies (VLAs) is surprisingly difficult. The root cause is a two-fold gap: VLMs are trained on internet-scale images with language-understanding objectives, while VLAs must perceive robot scenes and predict motor actions. Fine-tuning a VLM directly on robot action data forces the model to cross both gaps at once -- the learning curve is steep and the rich generalizations learned during pretraining tend to degrade rather than transfer. We argue that this gap can be bridged gradually with the right intermediate data. We introduce \emph{embodied trajectory-coupled (ETC) data} -- vision-language supervision derived from the same robot scenes and trajectories used for action learning. Because ETC data shares the visual context of robot operation while retaining familiar language-understanding objectives, it provides a natural stepping stone between VLM pretraining and VLA fine-tuning. Building on this, we design a three-stage training recipe. Distribution Bridging first adapts the VLM to embodied visual-language semantics. Objective Bridging then gradually shifts the model toward action prediction while preserving the acquired representations. Retentive Adaptation finally specializes the policy to the target deployment domain. We further show that mixing task-relevant out-of-distribution ETC data with a small amount of action data enables the model to generalize to novel visual-language conditions without requiring additional robot demonstrations. Simulation and real-robot experiments confirm that this gradual bridging strategy is the key to transferring VLM generalization into robust, deployable robot policies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that direct fine-tuning of VLMs on robot action data is hindered by simultaneous gaps in visual distribution and training objective, leading to degraded generalization. It introduces embodied trajectory-coupled (ETC) data—vision-language supervision from the same robot scenes and trajectories—as an intermediate bridge. A three-stage recipe is proposed: Distribution Bridging adapts the VLM to embodied visual-language semantics; Objective Bridging gradually shifts toward action prediction while preserving representations; Retentive Adaptation specializes to the target domain. Experiments show that mixing task-relevant out-of-distribution ETC data with limited action data enables generalization to novel visual-language conditions without extra robot demonstrations, confirmed in simulation and real-robot settings.

Significance. If the empirical results hold, the work provides a concrete, data-efficient recipe for transferring VLM generalization to deployable VLAs. The emphasis on gradual bridging via ETC data and the mixing strategy for OOD generalization addresses a practical bottleneck in embodied AI, potentially lowering the data requirements for robust robot policies. The staged approach is a strength if the preservation of pretrained representations is demonstrated.

major comments (2)
  1. [§3] §3 (three-stage recipe): the claim that Objective Bridging 'gradually shifts' the model while preserving representations requires explicit ablation showing that skipping this stage degrades performance relative to the full recipe; without such controls the necessity of the intermediate objective remains unproven.
  2. [Experiments] Experiments section (mixing results): the reported generalization to novel conditions relies on 'task-relevant out-of-distribution ETC data'; the definition and selection criteria for what counts as 'task-relevant' must be stated precisely, as overly broad selection could inflate the apparent benefit of mixing.
minor comments (2)
  1. [Abstract, §2] Abstract and §2: the term 'embodied trajectory-coupled (ETC) data' is introduced without a formal definition or example of the exact supervision format (e.g., caption style, trajectory encoding); a short illustrative example would improve clarity.
  2. [Figures] Figure captions (throughout): several figures comparing VLA variants lack error bars or statistical significance markers despite the text claiming 'robust' improvements; adding these would strengthen the presentation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment, recognition of the practical value of our approach, and the recommendation for minor revision. We address each major comment below and will update the manuscript accordingly.

read point-by-point responses
  1. Referee: [§3] §3 (three-stage recipe): the claim that Objective Bridging 'gradually shifts' the model while preserving representations requires explicit ablation showing that skipping this stage degrades performance relative to the full recipe; without such controls the necessity of the intermediate objective remains unproven.

    Authors: We agree that an explicit ablation is required to substantiate the necessity of the Objective Bridging stage. The revised manuscript will include a new ablation comparing the full three-stage recipe against a two-stage variant that omits Objective Bridging, reporting the resulting degradation in both in-distribution performance and out-of-distribution generalization to confirm that gradual objective shifting is essential for representation preservation. revision: yes

  2. Referee: [Experiments] Experiments section (mixing results): the reported generalization to novel conditions relies on 'task-relevant out-of-distribution ETC data'; the definition and selection criteria for what counts as 'task-relevant' must be stated precisely, as overly broad selection could inflate the apparent benefit of mixing.

    Authors: We acknowledge the need for a precise definition. In the revised Experiments section we will add an explicit subsection defining task-relevance via two quantitative criteria: (1) semantic overlap measured by cosine similarity of language embeddings between ETC trajectories and target tasks, and (2) visual distribution overlap computed via feature-space distance to the in-distribution robot scenes. This will make the selection process transparent and rule out overly broad inclusion. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method with no derivations

full rationale

The paper describes an empirical three-stage training recipe (Distribution Bridging, Objective Bridging, Retentive Adaptation) using embodied trajectory-coupled (ETC) data to transition from VLMs to VLAs. No equations, first-principles derivations, fitted parameters, or mathematical reductions are present or claimed. The central claim rests on experimental results from simulation and real-robot tests showing generalization via mixing ETC and action data, without any self-definitional loops, fitted inputs called predictions, or load-bearing self-citations. The approach is self-contained as a practical training strategy validated externally to any internal tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes that language-understanding objectives remain compatible with embodied visual contexts without further justification.

pith-pipeline@v0.9.1-grok · 5859 in / 1055 out tokens · 11984 ms · 2026-06-27T18:29:45.319278+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

41 extracted references · 15 linked inside Pith

  1. [1]

    Gemini robotics: Bring- ing ai into the physical world

    Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bring- ing ai into the physical world. arXiv preprint arXiv:2503.20020, 2025

  2. [2]

    Embodiedmidtrain: Bridging the gap between vision-language models and vision-language-action models via mid-training

    Yiyang Du, Zhanqiu Guo, Xin Ye, Liu Ren, and Chenyan Xiong. Embodiedmidtrain: Bridging the gap between vision-language models and vision-language-action models via mid-training. arXiv preprint arXiv:2604.20012, 2026

  3. [3]

    Igniting vlms toward the embodied space

    Andy Zhai, Brae Liu, Bruno Fang, Chalse Cai, Ellie Ma, Ethan Yin, Hao Wang, Hugo Zhou, James Wang, Lights Shi, et al. Igniting vlms toward the embodied space. arXiv preprint arXiv:2509.11766, 2025

  4. [4]

    Xiaomi-robotics-0: An open-sourced vision-language-action model with real-time execution

    Rui Cai, Jun Guo, Xinze He, Piaopiao Jin, Jie Li, Bingxuan Lin, Futeng Liu, Wei Liu, Fei Ma, Kun Ma, et al. Xiaomi-robotics-0: An open-sourced vision-language-action model with real-time execution. arXiv preprint arXiv:2602.12684, 2026

  5. [5]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

  6. [6]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Haus- man, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pe...

  7. [7]

    A systematic study of data modalities and strategies for co-training large behavior models for robot manipulation

    Fanqi Lin, Kushal Arora, Jean Mercat, Haruki Nishimura, Paarth Shah, Chen Xu, Mengchao Zhang, Mark Zolotas, Maya Angeles, Owen Pfannenstiehl, et al. A systematic study of data modalities and strategies for co-training large behavior models for robot manipulation. arXiv preprint arXiv:2602.01067, 2026

  8. [8]

    Chatvla: Unified multimodal understanding and robot control with vision-language-action model

    Zhongyi Zhou, Yichen Zhu, Minjie Zhu, Junjie Wen, Ning Liu, Zhiyuan Xu, Weibin Meng, Yaxin Peng, Chaomin Shen, Feifei Feng, et al. Chatvla: Unified multimodal understanding and robot control with vision-language-action model. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages 5377–5395, 2025

  9. [9]

    Palm-e: An embodied multimodal language model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery , Brian Ichter, Ayzaan Wahid, Jonathan T ompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023

  10. [10]

    Robobrain: A unified brain model for robotic manipulation from abstract to concrete

    Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. Robobrain: A unified brain model for robotic manipulation from abstract to concrete. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 1724–1734, 2025

  11. [11]

    Eo-1: An open unified embodied foundation model for general robot control, 2026

    Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Dong Wang, Xinyi Ye, Qi Lv , Modi Shi, Guanghui Ren, Cheng Ruan, Maoqing Yao, Haoran Yang, Jiacheng Bao, Bin Zhao, and Xuelong Li. Eo-1: An open unified embodied foundation model for general robot control, 2026. URL https://arxiv.org/abs/2508.21112

  12. [12]

    Gr-3 technical report

    Chilam Cheang, Sijin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Xiao Ma, et al. Gr-3 technical report. arXiv preprint arXiv:2507.15493, 2025

  13. [13]

    Galaxea open-world dataset and g0 dual-system vla model

    Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, and Hang Zhao. Galaxea open-world dataset and g0 dual-system vla model. arXiv preprint arXiv:2509.00576, 2025

  14. [14]

    Knowledge insulating vision-language-action models: Train fast, run fast, gener- alize better

    Danny Driess, Jost Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, et al. Knowledge insulating vision-language-action models: Train fast, run fast, gener- alize better. Advances in Neural Information Processing Systems, 38:102867–102888, 2026

  15. [15]

    A pragmatic vla foundation model

    Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, et al. A pragmatic vla foundation model. arXiv preprint arXiv:2601.18692, 2026. 10

  16. [16]

    𝜋0: A vision-language-action flow model for general robot control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. 𝜋0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  17. [17]

    Gr00t n1: An open foundation model for generalist humanoid robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev , Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

  18. [18]

    Faster: T oward efficient autoregressive vision language action modeling via neural action tokenization

    Yicheng Liu, Shiduo Zhang, Zibin Dong, Baijun Ye, Tianyuan Yuan, Xiaopeng Yu, Linqi Yin, Chenhao Lu, Junhao Shi, Luca Jiang-Tao Yu, et al. Faster: T oward efficient autoregressive vision language action modeling via neural action tokenization. arXiv preprint arXiv:2512.04952, 2025

  19. [19]

    Vlm4vla: Revisiting vision-language-models in vision-language-action models

    Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, and Jianyu Chen. Vlm4vla: Revisiting vision-language-models in vision-language-action models. arXiv preprint arXiv:2601.03309, 2026

  20. [20]

    Actions as language: Fine- tuning vlms into vlas without catastrophic forgetting

    Asher J Hancock, Xindi Wu, Lihan Zha, Olga Russakovsky , and Anirudha Majumdar. Actions as language: Fine- tuning vlms into vlas without catastrophic forgetting. arXiv preprint arXiv:2509.22195, 2025

  21. [21]

    Internvla-a1: Unifying understanding, generation and action for robotic manipulation

    Junhao Cai, Zetao Cai, Jiafei Cao, Yilun Chen, Zeyu He, Lei Jiang, Hang Li, Hengjie Li, Yang Li, Yufei Liu, et al. Internvla-a1: Unifying understanding, generation and action for robotic manipulation. arXiv preprint arXiv:2601.02456, 2026

  22. [22]

    Sanketi, and Ken Goldberg

    Kaiyuan Chen, Shuangyu Xie, Zehan Ma, Pannag R. Sanketi, and Ken Goldberg. Robo2VLM: Visual question answering from large-scale in-the-wild robot manipulation datasets. arXiv preprint arXiv:2505.15517, 2025

  23. [23]

    Robovqa: Multimodal long-horizon reasoning for robotics

    Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, et al. Robovqa: Multimodal long-horizon reasoning for robotics. In 2024 IEEE International Conference on Robotics and Automation (ICRA) , pages 645–652. IEEE, 2024

  24. [24]

    Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models

    Mengfei Du, Binhao Wu, Zejun Li, Xuan-Jing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 2: Short Papers), pages 346–355, 2024

  25. [25]

    Roborefer: T owards spatial referring with reasoning in vision-language models for robotics

    Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, et al. Roborefer: T owards spatial referring with reasoning in vision-language models for robotics. Advances in Neural Information Processing Systems, 38:28404–28481, 2026

  26. [26]

    Robopoint: A vision-language model for spatial affordance prediction for robotics

    Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay , Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousa- vian, and Dieter Fox. Robopoint: A vision-language model for spatial affordance prediction for robotics. arXiv preprint arXiv:2406.10721, 2024

  27. [27]

    Molmoact: Action reasoning models that can reason in space

    Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space. arXiv preprint arXiv:2508.07917 , 2025

  28. [28]

    Robobrain 2.0 technical report

    BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Xiansheng Chen, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, et al. Robobrain 2.0 technical report. arXiv preprint arXiv:2507.02029, 2025

  29. [29]

    Mimo-embodied: X-embodied foundation model technical report

    Xiaoshuai Hao, Lei Zhou, Zhijian Huang, Zhiwen Hou, Yingbo Tang, Lingfeng Zhang, Guang Li, Zheng Lu, Shuhuai Ren, Xianhui Meng, et al. Mimo-embodied: X-embodied foundation model technical report. arXiv preprint arXiv:2511.16518, 2025

  30. [30]

    Unify robot actions in camera frame, 2026

    Sicheng Xie, Lingchen Meng, Zijie Diao, Haidong Cao, Zhiying Du, Shuyuan T u, Jiaqi Leng, Qiuyue Wang, Mingsheng Li, Shuai Bai, Zuxuan Wu, and Yu-Gang Jiang. Unify robot actions in camera frame, 2026. URL https://arxiv.org/abs/2511.17001

  31. [31]

    Paligemma: A versatile 3b vlm for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov , Xiao Wang, Daniel Salz, Maxim Neu- mann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024. 11

  32. [32]

    Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

  33. [33]

    Libero: Benchmarking knowledge transfer for lifelong robot learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems , 36:44776–44791, 2023

  34. [34]

    Evaluating real-world robot manipulation policies in simulation

    Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, et al. Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941, 2024

  35. [35]

    Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks

    Shiduo Zhang, Zhe Xu, Peiju Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, et al. Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11142– 11152, 2025

  36. [36]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Rich Walke, Kevin Black, T ony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, An- dre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–1736. PMLR, 2023

  37. [37]

    Vla-os: Structuring and dissecting planning representations and paradigms in vision-language- action models

    Chongkai Gao, Zixuan Liu, Zhenghao Chi, Junshan Huang, Xin Fei, Yiwen Hou, Yuxuan Zhang, Yudi Lin, Zhirui Fang, and Lin Shao. Vla-os: Structuring and dissecting planning representations and paradigms in vision-language- action models. Advances in Neural Information Processing Systems, 38:136705–136736, 2026

  38. [38]

    A-okvqa: A benchmark for visual question answering using world knowledge

    Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In European conference on computer vision , pages 146–162. Springer, 2022

  39. [39]

    Microsoft coco: Common objects in context

    Tsung- Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision , pages 740–755. Springer, 2014

  40. [40]

    COL", positioned in the center-right of the scene, to the left of the

    Lukas Blecher. LaTeX-OCR: pix2tex – using a ViT to convert images of equations into LaTeX code. https://github. com/lukas-blecher/LaTeX-OCR , 2022. Software repository , accessed 2026-05-28. 12 A Implementation Details A.1 Co-training Strategy During both Objective Bridging and Retentive Adaptation, we co-train on ETC and action data. Each opti- mization ...

  41. [41]

    bike” → “bicycle

    The continuous score is exp(− RMSE/ 𝜏) with 𝜏 = 20 px; the example is correct if RMSE ≤ 20 px. D.1.6 COCO Joint Detection F1 For COCO, we use a dataset-specific metric that jointly requires category-label agreement and BBox overlap. Class names are normalized to the 80 COCO categories (e.g., “bike” → “bicycle”). Predictions and ground- truth objects are m...