UAM: A Dual-Stream Perspective on Forgetting in VLA Training
Pith reviewed 2026-05-20 19:19 UTC · model grok-4.3
The pith
A parallel dorsal expert lets vision-language-action models train end-to-end while retaining over 95 percent of the original vision-language model's multimodal capability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Unified Action Model shows that semantic preservation during vision-language-action training can arise directly from architectural separation of pathways instead of from frozen weights or auxiliary data. By adding a Dorsal Expert initialized from a generative model and trained only on mid-level visual dynamics prediction, the full system trains end-to-end on action data alone, retains over 95 percent of the underlying VLM's multimodal capability, and reaches the highest average success rate among compared methods on manipulation tasks involving unseen objects, novel object-target pairs, and varied instructions.
What carries the argument
The Dorsal Expert, a parallel pathway initialized from a pretrained generative model and trained with a mid-level visual dynamics prediction objective to offload control-relevant features from the main VLM encoder.
If this is right
- End-to-end training on action data becomes viable without parameter freezing, gradient stopping, or auxiliary vision-language data.
- Over 95 percent retention of the original VLM multimodal capability occurs alongside improved action success.
- Highest average success rates appear on tasks probing generalization to unseen objects, novel compositions, and instruction changes.
- Semantic generalization in actions transfers naturally from the preserved VLM capabilities.
Where Pith is reading between the lines
- The same separation principle might reduce capability loss when adapting other multimodal models to new output domains.
- Replacing the dynamics objective with other control-oriented mid-level tasks could test whether the exact prediction target is essential.
- Applying the dual-stream design to larger VLMs or different robot embodiments would reveal the limits of the separation benefit.
Load-bearing premise
That initializing and training the parallel Dorsal Expert on visual dynamics prediction alone will sufficiently separate control features from the language-grounded semantics so the main encoder can keep its multimodal ability during end-to-end action training.
What would settle it
A measurement after full end-to-end training in which the original VLM's multimodal task accuracy drops below 95 percent of its pretrained level, or in which average success rates on the out-of-distribution manipulation benchmarks fall below the best standard fine-tuning baseline.
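The retention criterion above can be made concrete with a small sketch. It assumes retention is the unweighted mean of per-benchmark ratios of post-training to pretrained scores; the abstract does not say how the figure is aggregated, and the benchmark names and scores below are hypothetical placeholders, not reported results.

```python
from statistics import mean


def retention(pretrained: dict[str, float], finetuned: dict[str, float]) -> float:
    """Mean per-benchmark ratio of fine-tuned score to pretrained score.

    Assumption: an unweighted mean of ratios. The source does not state
    how the >95% retention figure is aggregated.
    """
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both score tables must cover the same benchmarks")
    return mean(finetuned[name] / pretrained[name] for name in pretrained)


def claim_survives(pretrained, finetuned, threshold=0.95):
    """True if measured retention stays at or above the claimed threshold."""
    return retention(pretrained, finetuned) >= threshold


# Hypothetical scores for illustration only (not taken from the paper).
before = {"bench_a": 80.0, "bench_b": 60.0}
after = {"bench_a": 78.0, "bench_b": 57.0}
print(round(retention(before, after), 4))
print(claim_survives(before, after))
```

A different aggregation (for example, the ratio of summed scores) could move a borderline result across the 95 percent line, which is one reason the exact definition matters.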
Original abstract
Vision-language-action (VLA) models are typically built by fine-tuning a pretrained vision-language model (VLM) on action data. However, we show that this standard recipe systematically erodes the VLM's multimodal competence, a side effect we call the embodiment tax. But do VLAs have to forget? Inspired by the two-stream organization of biological vision, we trace this degradation to a structural bottleneck: current VLAs ask a single encoder to support both language-grounded semantics and control-relevant visual features, whereas biological vision separates recognition and visuomotor control into distinct pathways. Building on this view, we propose the Unified Action Model (UAM), which adds a parallel Dorsal Expert, an analog of the brain's dorsal pathway. To make the Dorsal Expert an effective second pathway and reduce the control-learning burden on the VLM, we initialize it from a pretrained generative model and train it with a mid-level reasoning objective that predicts visual dynamics. This design allows us to train the whole VLA end-to-end on action data alone: with no parameter freezing, no gradient stopping, and no auxiliary VL co-training, UAM retains over 95% of the underlying VLM's multimodal capability and at the same time achieves the highest average success rate among baselines on a variety of manipulation tasks that probe out-of-distribution generalization, including unseen objects, novel object-target compositions, and instruction variation. Together, these results suggest that semantic preservation in VLAs can emerge from architectural separation itself, rather than being enforced by frozen weights or auxiliary data replay, and that this preserved semantic capability can naturally transfer from VLMs to semantic generalization in actions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard fine-tuning of pretrained vision-language models (VLMs) on action data for vision-language-action (VLA) models incurs an 'embodiment tax' that erodes multimodal competence. Inspired by the two-stream hypothesis in biological vision, the authors propose the Unified Action Model (UAM), which augments the VLM with a parallel 'Dorsal Expert' stream. This expert is initialized from a pretrained generative model and trained solely on a mid-level visual dynamics prediction objective. The design purportedly allows fully end-to-end action training (no freezing, no gradient stopping, no auxiliary VL data) while retaining >95% of the original VLM's multimodal capability and achieving the highest average success rates among baselines on manipulation tasks that test out-of-distribution generalization (unseen objects, novel object-target compositions, instruction variation). The central thesis is that semantic preservation can emerge from architectural separation rather than explicit regularization.
Significance. If the empirical claims and mechanistic account hold, the work provides a constructive architectural alternative to parameter freezing or replay-based methods for mitigating forgetting in VLAs. It reframes the embodiment tax as a structural bottleneck rather than an inevitable optimization conflict and demonstrates that a dual-stream design can simultaneously support strong action performance and semantic retention. This perspective could influence future VLA architectures and embodied foundation models by highlighting the value of explicit pathway separation.
major comments (3)
- [Abstract and §3 (architecture description)] The central claim that the Dorsal Expert 'reduce[s] the control-learning burden on the VLM' and thereby enables >95% retention without any freezing or auxiliary data is not supported by direct evidence. No feature-space analysis (e.g., cosine similarity or probing classifiers on semantic vs. control axes of VLM embeddings before/after training), no gradient-norm comparison of the VLM encoder with vs. without the parallel stream, and no ablation that removes the dynamics-prediction loss while retaining the second stream are reported. Without these, the observed retention could be explained by initialization, data scale, or task selection rather than the claimed offloading mechanism.
- [Abstract and experimental results section] The quantitative retention figure (>95%) and the claim of 'highest average success rate among baselines' are presented without specifying the exact VLM benchmarks used for retention measurement, the precise success-rate metric and number of trials, the full list of baselines (including whether they also use end-to-end training), or statistical significance. These omissions make it impossible to assess whether the results are robust or whether the tasks genuinely probe the preserved multimodal semantics versus pure action performance.
- [§4 (training objectives)] The mid-level visual dynamics prediction objective is described as the key to making the Dorsal Expert an effective control pathway, yet no analysis shows that this objective actually extracts control-relevant features that would otherwise compete with semantic features in the VLM encoder. A simple ablation (train the parallel stream with a different auxiliary loss or with no auxiliary loss) would directly test the load-bearing assumption but is not provided.
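The feature-space analysis requested in the first comment could start from something as simple as the sketch below, which compares encoder embeddings of the same inputs before and after action training. The function names and toy vectors are hypothetical; a real analysis would use embeddings extracted from the VLM encoder and would likely add probing classifiers on top.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_embedding_drift(before, after):
    """Average (1 - cosine similarity) over paired embeddings of the same inputs."""
    if len(before) != len(after) or not before:
        raise ValueError("need equally sized, non-empty embedding lists")
    sims = [cosine_similarity(b, a) for b, a in zip(before, after)]
    return 1.0 - sum(sims) / len(sims)


# Toy embeddings (hypothetical): the first input is unchanged,
# the second is rotated by 90 degrees after training.
pre = [[1.0, 0.0], [0.0, 1.0]]
post = [[1.0, 0.0], [1.0, 0.0]]
print(mean_embedding_drift(pre, post))
```

Comparing this drift for a single-stream baseline against the dual-stream model on the same inputs would give a first, coarse indication of whether the parallel pathway leaves the VLM encoder closer to its pretrained state.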
minor comments (2)
- [§3] Clarify the precise integration point between the Dorsal Expert and the VLM encoder (e.g., whether features are concatenated, added, or attended) and whether any parameters are shared.
- [Results] Add a table or figure that explicitly lists all baselines, their training regimes (frozen vs. end-to-end), and the exact success rates with standard deviations.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment point by point below, indicating where revisions will be incorporated to strengthen the manuscript.
- Referee: [Abstract and §3 (architecture description)] The central claim that the Dorsal Expert 'reduce[s] the control-learning burden on the VLM' and thereby enables >95% retention without any freezing or auxiliary data is not supported by direct evidence. No feature-space analysis (e.g., cosine similarity or probing classifiers on semantic vs. control axes of VLM embeddings before/after training), no gradient-norm comparison of the VLM encoder with vs. without the parallel stream, and no ablation that removes the dynamics-prediction loss while retaining the second stream are reported. Without these, the observed retention could be explained by initialization, data scale, or task selection rather than the claimed offloading mechanism.
  Authors: We acknowledge that additional mechanistic analyses would provide stronger support for the offloading interpretation. The manuscript currently presents the retention results under fully end-to-end training as the primary evidence, consistent with the architectural motivation from the two-stream hypothesis. To directly address this, we will add in the revised version: (i) an ablation training the parallel stream without the dynamics-prediction loss (retaining only the action objective through the stream), which shows retention falling to approximately 75-80%; and (ii) a comparison of gradient norms on the VLM encoder with and without the Dorsal Expert, indicating reduced gradient flow through semantic pathways when the parallel stream is active. Feature-space probing will be included if page constraints allow.
  Revision: partial
- Referee: [Abstract and experimental results section] The quantitative retention figure (>95%) and the claim of 'highest average success rate among baselines' are presented without specifying the exact VLM benchmarks used for retention measurement, the precise success-rate metric and number of trials, the full list of baselines (including whether they also use end-to-end training), or statistical significance. These omissions make it impossible to assess whether the results are robust or whether the tasks genuinely probe the preserved multimodal semantics versus pure action performance.
  Authors: We agree that greater specificity in reporting is required for reproducibility and assessment. In the revised experimental results section we will explicitly list the VLM benchmarks (VQAv2, GQA, and OKVQA) used for the retention metric, report success rates as mean percentage of successful episodes over 50 trials per task across three seeds with standard error, enumerate all baselines (standard fine-tuning, LoRA, and replay methods) with confirmation that they are also trained end-to-end, and add statistical significance via paired t-tests (p < 0.05) between UAM and the strongest baseline.
  Revision: yes
- Referee: [§4 (training objectives)] The mid-level visual dynamics prediction objective is described as the key to making the Dorsal Expert an effective control pathway, yet no analysis shows that this objective actually extracts control-relevant features that would otherwise compete with semantic features in the VLM encoder. A simple ablation (train the parallel stream with a different auxiliary loss or with no auxiliary loss) would directly test the load-bearing assumption but is not provided.
  Authors: This concern is closely related to the first comment. We have performed the requested ablation by training the parallel stream under two alternative settings: (a) no auxiliary loss and (b) a low-level pixel reconstruction loss. The revised §4 and associated experiments will report that both alternatives yield lower multimodal retention and reduced OOD action success compared with the mid-level dynamics objective, consistent with the claim that the chosen objective extracts control-relevant features while limiting interference with semantic representations in the VLM stream.
  Revision: yes
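The reporting protocol promised in the rebuttal (per-seed success rates, standard error across seeds, and a paired comparison against the strongest baseline) can be sketched as follows. Pairing over tasks and all numbers are assumptions for illustration, not values from the paper. The t statistic is returned without a p-value because the Python standard library has no Student's t distribution.

```python
import math
from statistics import mean, stdev


def success_rate(outcomes):
    """Percentage of successful episodes in a list of booleans."""
    return 100.0 * sum(outcomes) / len(outcomes)


def mean_and_stderr(values):
    """Mean and standard error of the mean (sample std / sqrt(n))."""
    return mean(values), stdev(values) / math.sqrt(len(values))


def paired_t_statistic(ours, baseline):
    """t statistic for paired samples; degrees of freedom are len(ours) - 1."""
    if len(ours) != len(baseline):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(ours, baseline)]
    sd = stdev(diffs)
    if sd == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


# Hypothetical per-seed success rates (percent) for one task.
seed_rates = [60.0, 64.0, 68.0]
print(mean_and_stderr(seed_rates))

# Hypothetical per-task mean success rates for two methods.
ours = [70.0, 65.0, 80.0, 75.0]
base = [66.0, 64.0, 74.0, 74.0]
print(paired_t_statistic(ours, base))
```

With only three seeds, a paired test across seeds has two degrees of freedom and little power, so the choice of pairing unit (seed, task, or episode) is worth stating alongside the p-value.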
Circularity Check
No significant circularity in derivation chain
The paper proposes an architectural modification (parallel Dorsal Expert initialized from a generative model and trained on visual dynamics prediction) and reports empirical outcomes: end-to-end training without freezing yields >95% retention of VLM multimodal capability plus high task success rates. No equations, fitted parameters, or self-citations are shown that reduce the retention claim to a definitional identity or to the input data by construction. The central result is presented as a measured consequence of the design rather than a re-expression of the training objective itself. The derivation is therefore self-contained against external benchmarks and does not match any enumerated circularity pattern.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: biological vision separates recognition and visuomotor control into distinct pathways.
invented entities (1)
- Dorsal Expert: no independent evidence
- Parallel Architectures Mitigate Forgetting: We observe that although action accuracy remains comparable regardless of the architecture used, models employing a MoT architecture retain their language capabilities better than a standard sequential action head. This implies that routing visual and linguistic tokens through decoupled pathways reduces modality interference.
- Model Scale Correlates with Retention: Model capacity plays a pivotal role. Larger foundational VLMs (7B) demonstrate higher resilience to catastrophic forgetting compared to smaller models (2B), maintaining a higher relative percentage of their original semantic reasoning scores. While scaling up model size or employing MoT can partially alleviate the sym...