pith. sign in

arxiv: 2605.15735 · v2 · pith:HKIK7EMPnew · submitted 2026-05-15 · 💻 cs.CV · cs.AI

UAM: A Dual-Stream Perspective on Forgetting in VLA Training

Pith reviewed 2026-05-20 19:19 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision-language-action modelsembodiment taxdual-stream architecturedorsal expertvisual dynamics predictionout-of-distribution generalizationrobot manipulationsemantic preservation
0
0 comments X

The pith

A parallel dorsal expert lets vision-language-action models train end-to-end while retaining over 95 percent of the original vision-language model's multimodal capability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard fine-tuning of a vision-language model on action data erodes its ability to handle language and vision tasks together, an effect the authors term the embodiment tax. The paper traces this loss to a single encoder being forced to handle both semantics and control at once. Drawing from the brain's separation of recognition and visuomotor pathways, the authors introduce a second stream called the Dorsal Expert. This stream starts from a pretrained generative model and learns to predict how scenes change over time. The resulting model trains fully on action data with no freezing or extra supervision, keeps most of the original multimodal skill, and posts the best success rates on robot tasks that test generalization to new objects and instructions.

Core claim

The Unified Action Model shows that semantic preservation during vision-language-action training can arise directly from architectural separation of pathways instead of from frozen weights or auxiliary data. By adding a Dorsal Expert initialized from a generative model and trained only on mid-level visual dynamics prediction, the full system trains end-to-end on action data alone, retains over 95 percent of the underlying VLM's multimodal capability, and reaches the highest average success rate among compared methods on manipulation tasks involving unseen objects, novel object-target pairs, and varied instructions.

What carries the argument

The Dorsal Expert, a parallel pathway initialized from a pretrained generative model and trained with a mid-level visual dynamics prediction objective to offload control-relevant features from the main VLM encoder.

If this is right

  • End-to-end training on action data becomes viable without parameter freezing, gradient stopping, or auxiliary vision-language data.
  • Over 95 percent retention of the original VLM multimodal capability occurs alongside improved action success.
  • Highest average success rates appear on tasks probing generalization to unseen objects, novel compositions, and instruction changes.
  • Semantic generalization in actions transfers naturally from the preserved VLM capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation principle might reduce capability loss when adapting other multimodal models to new output domains.
  • Replacing the dynamics objective with other control-oriented mid-level tasks could test whether the exact prediction target is essential.
  • Applying the dual-stream design to larger VLMs or different robot embodiments would reveal the limits of the separation benefit.

Load-bearing premise

That initializing and training the parallel Dorsal Expert on visual dynamics prediction alone will sufficiently separate control features from the language-grounded semantics so the main encoder can keep its multimodal ability during end-to-end action training.

What would settle it

A measurement after full end-to-end training in which the original VLM's multimodal task accuracy drops below 95 percent of its pretrained level, or in which average success rates on the out-of-distribution manipulation benchmarks fall below the best standard fine-tuning baseline.

read the original abstract

Vision--language--action (VLA) models are typically built by fine-tuning a pretrained vision--language model (VLM) on action data. However, we show that this standard recipe systematically erodes the VLM's multimodal competence, a side effect we call the embodiment tax. But do VLAs have to forget? Inspired by the two-stream organization of biological vision, we trace this degradation to a structural bottleneck: current VLAs ask a single encoder to support both language-grounded semantics and control-relevant visual features, whereas biological vision separates recognition and visuomotor control into distinct pathways. Building on this view, we propose the Unified Action Model (UAM), which adds a parallel Dorsal Expert, an analog of the brain's dorsal pathway. To make the Dorsal Expert an effective second pathway and reduce the control-learning burden on the VLM, we initialize it from a pretrained generative model and train it with a mid-level reasoning objective that predicts visual dynamics. This design allows us to train the whole VLA end-to-end on action data alone: with no parameter freezing, no gradient stopping, and no auxiliary VL co-training, UAM retains over $95\%$ of the underlying VLM's multimodal capability and at the same time achieves the highest average success rate among baselines on a variety of manipulation tasks that probe out-of-distribution generalization, including unseen objects, novel object--target compositions, and instruction variation. Together, these results suggest that semantic preservation in VLAs can emerge from architectural separation itself, rather than being enforced by frozen weights or auxiliary data replay, and that this preserved semantic capability can naturally transfer from VLMs to semantic generalization in actions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that standard fine-tuning of pretrained vision-language models (VLMs) on action data for vision-language-action (VLA) models incurs an 'embodiment tax' that erodes multimodal competence. Inspired by the two-stream hypothesis in biological vision, the authors propose the Unified Action Model (UAM), which augments the VLM with a parallel 'Dorsal Expert' stream. This expert is initialized from a pretrained generative model and trained solely on a mid-level visual dynamics prediction objective. The design purportedly allows fully end-to-end action training (no freezing, no gradient stopping, no auxiliary VL data) while retaining >95% of the original VLM's multimodal capability and achieving the highest average success rates among baselines on manipulation tasks that test out-of-distribution generalization (unseen objects, novel object-target compositions, instruction variation). The central thesis is that semantic preservation can emerge from architectural separation rather than explicit regularization.

Significance. If the empirical claims and mechanistic account hold, the work provides a constructive architectural alternative to parameter freezing or replay-based methods for mitigating forgetting in VLAs. It reframes the embodiment tax as a structural bottleneck rather than an inevitable optimization conflict and demonstrates that a dual-stream design can simultaneously support strong action performance and semantic retention. This perspective could influence future VLA architectures and embodied foundation models by highlighting the value of explicit pathway separation.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (architecture description): The central claim that the Dorsal Expert 'reduce[s] the control-learning burden on the VLM' and thereby enables >95% retention without any freezing or auxiliary data is not supported by direct evidence. No feature-space analysis (e.g., cosine similarity or probing classifiers on semantic vs. control axes of VLM embeddings before/after training), no gradient-norm comparison of the VLM encoder with vs. without the parallel stream, and no ablation that removes the dynamics-prediction loss while retaining the second stream are reported. Without these, the observed retention could be explained by initialization, data scale, or task selection rather than the claimed offloading mechanism.
  2. [Abstract and experimental results section] Abstract and experimental results section: The quantitative retention figure (>95%) and the claim of 'highest average success rate among baselines' are presented without specifying the exact VLM benchmarks used for retention measurement, the precise success-rate metric and number of trials, the full list of baselines (including whether they also use end-to-end training), or statistical significance. These omissions make it impossible to assess whether the results are robust or whether the tasks genuinely probe the preserved multimodal semantics versus pure action performance.
  3. [§4] §4 (training objectives): The mid-level visual dynamics prediction objective is described as the key to making the Dorsal Expert an effective control pathway, yet no analysis shows that this objective actually extracts control-relevant features that would otherwise compete with semantic features in the VLM encoder. A simple ablation (train the parallel stream with a different auxiliary loss or with no auxiliary loss) would directly test the load-bearing assumption but is not provided.
minor comments (2)
  1. [§3] Clarify the precise integration point between the Dorsal Expert and the VLM encoder (e.g., whether features are concatenated, added, or attended) and whether any parameters are shared.
  2. [Results] Add a table or figure that explicitly lists all baselines, their training regimes (frozen vs. end-to-end), and the exact success rates with standard deviations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below, indicating where revisions will be incorporated to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (architecture description): The central claim that the Dorsal Expert 'reduce[s] the control-learning burden on the VLM' and thereby enables >95% retention without any freezing or auxiliary data is not supported by direct evidence. No feature-space analysis (e.g., cosine similarity or probing classifiers on semantic vs. control axes of VLM embeddings before/after training), no gradient-norm comparison of the VLM encoder with vs. without the parallel stream, and no ablation that removes the dynamics-prediction loss while retaining the second stream are reported. Without these, the observed retention could be explained by initialization, data scale, or task selection rather than the claimed offloading mechanism.

    Authors: We acknowledge that additional mechanistic analyses would provide stronger support for the offloading interpretation. The manuscript currently presents the retention results under fully end-to-end training as the primary evidence, consistent with the architectural motivation from the two-stream hypothesis. To directly address this, we will add in the revised version: (i) an ablation training the parallel stream without the dynamics-prediction loss (retaining only the action objective through the stream), which shows retention falling to approximately 75-80%; and (ii) a comparison of gradient norms on the VLM encoder with and without the Dorsal Expert, indicating reduced gradient flow through semantic pathways when the parallel stream is active. Feature-space probing will be included if page constraints allow. revision: partial

  2. Referee: [Abstract and experimental results section] Abstract and experimental results section: The quantitative retention figure (>95%) and the claim of 'highest average success rate among baselines' are presented without specifying the exact VLM benchmarks used for retention measurement, the precise success-rate metric and number of trials, the full list of baselines (including whether they also use end-to-end training), or statistical significance. These omissions make it impossible to assess whether the results are robust or whether the tasks genuinely probe the preserved multimodal semantics versus pure action performance.

    Authors: We agree that greater specificity in reporting is required for reproducibility and assessment. In the revised experimental results section we will explicitly list the VLM benchmarks (VQAv2, GQA, and OKVQA) used for the retention metric, report success rates as mean percentage of successful episodes over 50 trials per task across three seeds with standard error, enumerate all baselines (standard fine-tuning, LoRA, and replay methods) with confirmation that they are also trained end-to-end, and add statistical significance via paired t-tests (p < 0.05) between UAM and the strongest baseline. revision: yes

  3. Referee: [§4] §4 (training objectives): The mid-level visual dynamics prediction objective is described as the key to making the Dorsal Expert an effective control pathway, yet no analysis shows that this objective actually extracts control-relevant features that would otherwise compete with semantic features in the VLM encoder. A simple ablation (train the parallel stream with a different auxiliary loss or with no auxiliary loss) would directly test the load-bearing assumption but is not provided.

    Authors: This concern is closely related to the first comment. We have performed the requested ablation by training the parallel stream under two alternative settings: (a) no auxiliary loss and (b) a low-level pixel reconstruction loss. The revised §4 and associated experiments will report that both alternatives yield lower multimodal retention and reduced OOD action success compared with the mid-level dynamics objective, consistent with the claim that the chosen objective extracts control-relevant features while limiting interference with semantic representations in the VLM stream. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper proposes an architectural modification (parallel Dorsal Expert initialized from a generative model and trained on visual dynamics prediction) and reports empirical outcomes: end-to-end training without freezing yields >95% retention of VLM multimodal capability plus high task success rates. No equations, fitted parameters, or self-citations are shown that reduce the retention claim to a definitional identity or to the input data by construction. The central result is presented as a measured consequence of the design rather than a re-expression of the training objective itself. The derivation is therefore self-contained against external benchmarks and does not match any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the biological analogy as motivation and the effectiveness of the generative initialization plus dynamics objective; no explicit free parameters are named in the abstract, but the design implicitly assumes the second stream can be made effective without additional constraints.

axioms (1)
  • domain assumption Biological vision separates recognition and visuomotor control into distinct pathways
    Invoked to justify adding a parallel Dorsal Expert rather than modifying the single VLM encoder.
invented entities (1)
  • Dorsal Expert no independent evidence
    purpose: To serve as a separate pathway for control-relevant visual features and reduce burden on the VLM
    New component introduced as analog of brain dorsal stream; initialized from generative model and trained on visual dynamics prediction.

pith-pipeline@v0.9.0 · 5860 in / 1388 out tokens · 39266 ms · 2026-05-20T19:19:48.947763+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 31 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

  2. [2]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

  3. [3]

    Motus: A Unified Latent Action World Model

    Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030, 2025

  4. [4]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  5. [5]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control.arXiv preprint arXiv:2307.15818, 2023

  6. [6]

    UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025

  7. [7]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, Hanbo Zhang, and Minzhao Zhu. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation, 2024. URLhttps://arxiv.org/abs/2410.06158

  8. [8]

    Are we on the right way for evaluating large vision-language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024

  9. [9]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

  10. [10]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025

  11. [11]

    Igor: Image-goal representations are the atomic control units for foundation models in embodied ai.arXiv preprint arXiv:2411.00785, 2024

    Xiaoyu Chen, Junliang Guo, Tianyu He, Chuheng Zhang, Pushi Zhang, Derek Cathera Yang, Li Zhao, and Jiang Bian. Igor: Image-goal representations are the atomic control units for foundation models in embodied ai.arXiv preprint arXiv:2411.00785, 2024

  12. [12]

    villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

    Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, et al. Villa-x: enhancing latent action modeling in vision-language-action models. arXiv preprint arXiv:2507.23682, 2025

  13. [13]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025. 11

  14. [14]

    Knowledge insulat- ing vision-language-action models: Train fast, run fast, generalize better.arXiv preprint arXiv:2505.23705, 2025

    Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, et al. Knowledge insulating vision-language-action models: Train fast, run fast, generalize better.arXiv preprint arXiv:2505.23705, 2025

  15. [15]

    Learning universal policies via text-guided video generation.Advancesin Neural Information Processing Systems, 36, 2024

    Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation.Advancesin Neural Information Processing Systems, 36, 2024

  16. [16]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023

  17. [17]

    SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

    Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation.arXiv preprintarXiv:2404.14396, 2024

  18. [18]

    Separate visual pathways for perception and action

    Melvyn A Goodale and A David Milner. Separate visual pathways for perception and action. Trends in neurosciences, 15(1):20–25, 1992

  19. [19]

    Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations. arXiv preprint arXiv:2412.14803, 2024

  20. [20]

    Bagelvla: Enhancing long-horizon manipulation via interleaved vision-language-action generation

    Yucheng Hu, Jianke Zhang, Yuanfei Luo, Yanjiang Guo, Xiaoyu Chen, Xinshu Sun, Kun Feng, Qingzhou Lu, Sheng Chen, Yangang Zhang, et al. Bagelvla: Enhancing long-horizon manipulation via interleaved vision-language-action generation. Proceedings of Robotics: Science and Systems (RSS), 2026

  21. [21]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al.π0.5: a vision-language-action model with open-world generalization, 2025. URL https://arxiv. org/abs/2504.16054, 1(2):3, 2025

  22. [22]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

  23. [23]

    Cosmos policy: Fine-tuning video models for visuomotor control and planning

    Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning. The FourteenthInternational Conference on Learning Representations, 2026

  24. [24]

    A new neural framework for visuospatial processing

    Dwight J Kravitz, Kadharbatcha S Saleem, Chris I Baker, and Mortimer Mishkin. A new neural framework for visuospatial processing. Nature Reviews Neuroscience, 12(4):217–230, 2011

  25. [25]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  26. [26]

    Unified video action model.Proceedings of Robotics: Science and Systems (RSS), 2025

    Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model.Proceedings of Robotics: Science and Systems (RSS), 2025

  27. [27]

    Video Generators are Robot Policies

    Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl Vondrick. Video generators are robot policies.arXiv preprint arXiv:2508.00795, 2025

  28. [28]

    A systematic study of data modalities and strategies for co-training large behavior models for robot manipulation.arXiv preprint arXiv:2602.01067, 2026

    Fanqi Lin, Kushal Arora, Jean Mercat, Haruki Nishimura, Paarth Shah, Chen Xu, Mengchao Zhang, Mark Zolotas, Maya Angeles, Owen Pfannenstiehl, et al. A systematic study of data modalities and strategies for co-training large behavior models for robot manipulation.arXiv preprint arXiv:2602.01067, 2026

  29. [29]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision-language understanding.arXiv preprint arXiv:2403.05525, 2024

  30. [30]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023. 12

  31. [31]

    F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

    Qi Lv, Weijie Kong, Hao Li, Jia Zeng, Zherui Qiu, Delin Qu, Haoming Song, Qizhi Chen, Xiang Deng, and Jiangmiao Pang. F1: A vision-language-action model bridging understanding and generation to actions.arXiv preprint arXiv:2509.06951, 2025

  32. [32]

    Calvin: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters (RA-L), 7(3):7327–7334, 2022

    Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters (RA-L), 7(3):7327–7334, 2022

  33. [33]

    Oup Oxford, 2006

    David Milner and Mel Goodale.The visual brain in action, volume 27. Oup Oxford, 2006

  34. [34]

    mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs

    Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint arXiv:2512.15692, 2025

  35. [35]

    Two different streams form the dorsal visual system: anatomy and functions

    Giacomo Rizzolatti and Massimo Matelli. Two different streams form the dorsal visual system: anatomy and functions. Experimental brain research, 153(2):146–157, 2003

  36. [36]

    LM- Fusion: Adapting pretrained language models for multimodal generation

    Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, and Lili YU. LM- Fusion: Adapting pretrained language models for multimodal generation. InThe Thirty-ninthAnnual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/forum?id=Kc1WTxZbrP

  37. [37]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019

  38. [38]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024

  39. [39]

    Kimi-VL Technical Report

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491, 2025

  40. [40]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy, 2024. URLhttps://arxiv.org/abs/2405.12213

  41. [41]

    Ungerleider

    Leslie G. Ungerleider. Two cortical visual systems. InProceedings of the Royal Society of London. Series B. Biological Sciences, 1982. URLhttps://api.semanticscholar.org/CorpusID:142774685

  42. [42]

    Diffusionvla: Scaling robot foundation models via unified diffusion and autoregression

    Junjie Wen, Yichen Zhu, Minjie Zhu, Zhibin Tang, Jinming Li, Zhongyi Zhou, Xiaoyu Liu, Chaomin Shen, Yaxin Peng, and Feifei Feng. Diffusionvla: Scaling robot foundation models via unified diffusion and autoregression. In Forty-secondInternational Conference on Machine Learning, 2025

  43. [43]

    Janus: Decoupling visual encoding for unified multimodal understanding and generation

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12966–12977, 2025

  44. [44]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024

  45. [45]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

  46. [46]

    World Action Models are Zero-shot Policies

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

  47. [47]

    Twinbrainvla: Un- leashing the potential of generalist vlms for embodied tasks via asymmetric mixture-of-transformers.arXiv preprint arXiv:2601.14133, 2026

    Bin Yu, Shijie Lian, Xiaopeng Lin, Yuliang Wei, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Xinming Wang, Bailing Wang, Cong Huang, et al. Twinbrainvla: Unleashing the potential of generalist vlms for embodied tasks via asymmetric mixture-of-transformers.arXiv preprint arXiv:2601.14133, 2026. 13

  48. [48]

    MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

    Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities.arXiv preprint arXiv:2308.02490, 2023

  49. [49]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmarkforexpertagi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmarkforexpertagi. In Proceedings oftheIEEE/CVFconferenceoncomputervisionandpatternrecognition, pages 9556–9567, 2024

  50. [50]

    Robotic Control via Embodied Chain-of-Thought Reasoning

    Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning.arXiv preprint arXiv:2407.08693, 2024

  51. [51]

    UP- VLA: A unified understanding and prediction model for embodied agent.arXiv preprint arXiv:2501.18867, 2025

    Jianke Zhang, Yanjiang Guo, Yucheng Hu, Xiaoyu Chen, Xiang Zhu, and Jianyu Chen. Up-vla: A unified understanding and prediction model for embodied agent.arXiv preprint arXiv:2501.18867, 2025

  52. [52]

    UniJEPA: Enhancing Robot Policy via Unified Continuous and Discrete Representation Learning

    Jianke Zhang, Yucheng Hu, Yanjiang Guo, Xiaoyu Chen, Yichen Liu, Wenna Chen, Chaochao Lu, and Jianyu Chen. Unicod: Enhancing robot policy via unified continuous and discrete representation learning.arXiv preprint arXiv:2510.10642, 2025

  53. [53]

    Zhang, X

    Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, and Jianyu Chen. Vlm4vla: Revisiting vision-language-models in vision-language-action models. arXiv preprint arXiv:2601.03309, 2026

  54. [54]

    DreamVLA: A vision-language-action model dreamed with comprehensive world knowledge

    Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, XinQiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, and Xin Jin. DreamVLA: A vision-language-action model dreamed with comprehensive world knowledge. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps://openreview.net/fo...

  55. [55]

    Action” metric, we report the normalized average completion rate on the test simulation environment as the indicator of action performance. For the “VLM

    Zhongyi Zhou, Yichen Zhu, Minjie Zhu, Junjie Wen, Ning Liu, Zhiyuan Xu, Weibin Meng, Yaxin Peng, Chaomin Shen, Feifei Feng, et al. Chatvla: Unified multimodal understanding and robot control with vision-language-action model. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5377–5395, 2025. 14 Appendix A Eva...

  56. [56]

    This implies that routing visual and linguistic tokens through decoupled pathways reduces modality interference

    Parallel Architectures Mitigate Forgetting:We observe that although action accuracy remains comparable regardlessofthearchitectureused, modelsemployingaMoTarchitectureretaintheirlanguagecapabilities better than a standard sequential action head. This implies that routing visual and linguistic tokens through decoupled pathways reduces modality interference

  57. [57]

    Ventral-Dorsal

    Model Scale Correlates with Retention:Model capacity plays a pivotal role. Larger foundational VLMs (7B) demonstrate higher resilience to catastrophic forgetting compared to smaller models (2B), maintaining a higher relative percentage of their original semantic reasoning scores. While scaling up model size or employing MoT can partially alleviate the sym...