UAM: A Dual-Stream Perspective on Forgetting in VLA Training

Hongbin Xu; Jianke Zhang; Jianyu Chen; Tian Lan; Xiaoyu Chen; Yanjiang Guo; Yuanfei Luo; Yucheng Hu; Ziyang Liu

arxiv: 2605.15735 · v2 · pith:HKIK7EMPnew · submitted 2026-05-15 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-20 19:19 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords vision-language-action models · embodiment tax · dual-stream architecture · dorsal expert · visual dynamics prediction · out-of-distribution generalization · robot manipulation · semantic preservation

The pith

A parallel dorsal expert lets vision-language-action models train end-to-end while retaining over 95 percent of the original vision-language model's multimodal capability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard fine-tuning of a vision-language model on action data erodes its ability to handle language and vision tasks together, an effect the authors term the embodiment tax. The paper traces this loss to a single encoder being forced to handle both semantics and control at once. Drawing from the brain's separation of recognition and visuomotor pathways, the authors introduce a second stream called the Dorsal Expert. This stream starts from a pretrained generative model and learns to predict how scenes change over time. The resulting model trains fully on action data with no freezing or extra supervision, keeps most of the original multimodal skill, and posts the best success rates on robot tasks that test generalization to new objects and instructions.

Core claim

The Unified Action Model shows that semantic preservation during vision-language-action training can arise directly from architectural separation of pathways instead of from frozen weights or auxiliary data. By adding a Dorsal Expert initialized from a generative model and trained only on mid-level visual dynamics prediction, the full system trains end-to-end on action data alone, retains over 95 percent of the underlying VLM's multimodal capability, and reaches the highest average success rate among compared methods on manipulation tasks involving unseen objects, novel object-target pairs, and varied instructions.

What carries the argument

The Dorsal Expert, a parallel pathway initialized from a pretrained generative model and trained with a mid-level visual dynamics prediction objective to offload control-relevant features from the main VLM encoder.

If this is right

End-to-end training on action data becomes viable without parameter freezing, gradient stopping, or auxiliary vision-language data.
Over 95 percent retention of the original VLM multimodal capability occurs alongside improved action success.
Highest average success rates appear on tasks probing generalization to unseen objects, novel compositions, and instruction changes.
Semantic generalization in actions transfers naturally from the preserved VLM capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation principle might reduce capability loss when adapting other multimodal models to new output domains.
Replacing the dynamics objective with other control-oriented mid-level tasks could test whether the exact prediction target is essential.
Applying the dual-stream design to larger VLMs or different robot embodiments would reveal the limits of the separation benefit.

Load-bearing premise

That initializing and training the parallel Dorsal Expert on visual dynamics prediction alone will sufficiently separate control features from the language-grounded semantics so the main encoder can keep its multimodal ability during end-to-end action training.

What would settle it

A measurement after full end-to-end training in which the original VLM's multimodal task accuracy drops below 95 percent of its pretrained level, or in which average success rates on the out-of-distribution manipulation benchmarks fall below the best standard fine-tuning baseline.
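
The falsification test above can be written down as a small check. This is a minimal sketch, not the authors' evaluation code: the aggregation rule (mean of per-benchmark score ratios), the function names, and all numbers are assumptions made for illustration, since the abstract does not say how the 95 percent figure is computed.

```python
from statistics import mean


def retention(pre, post):
    """Mean per-benchmark ratio of post-training score to pretrained score.

    Assumption: retention is averaged over benchmarks; the paper may
    aggregate differently.
    """
    return mean(post[name] / pre[name] for name in pre)


def claim_survives(pre, post, uam_success, best_baseline_success, threshold=0.95):
    """True if neither falsifying condition is met."""
    keeps_semantics = retention(pre, post) >= threshold
    wins_on_actions = uam_success >= best_baseline_success
    return keeps_semantics and wins_on_actions


# Hypothetical scores, for illustration only.
pre = {"bench_a": 80.0, "bench_b": 60.0}
post = {"bench_a": 78.0, "bench_b": 57.0}
print(retention(pre, post))
print(claim_survives(pre, post, uam_success=0.70, best_baseline_success=0.65))
```

Either condition failing is enough to reject the claim, which is why the check uses a conjunction.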

Original abstract

Vision-language-action (VLA) models are typically built by fine-tuning a pretrained vision-language model (VLM) on action data. However, we show that this standard recipe systematically erodes the VLM's multimodal competence, a side effect we call the embodiment tax. But do VLAs have to forget? Inspired by the two-stream organization of biological vision, we trace this degradation to a structural bottleneck: current VLAs ask a single encoder to support both language-grounded semantics and control-relevant visual features, whereas biological vision separates recognition and visuomotor control into distinct pathways. Building on this view, we propose the Unified Action Model (UAM), which adds a parallel Dorsal Expert, an analog of the brain's dorsal pathway. To make the Dorsal Expert an effective second pathway and reduce the control-learning burden on the VLM, we initialize it from a pretrained generative model and train it with a mid-level reasoning objective that predicts visual dynamics. This design allows us to train the whole VLA end-to-end on action data alone: with no parameter freezing, no gradient stopping, and no auxiliary VL co-training, UAM retains over 95% of the underlying VLM's multimodal capability and at the same time achieves the highest average success rate among baselines on a variety of manipulation tasks that probe out-of-distribution generalization, including unseen objects, novel object-target compositions, and instruction variation. Together, these results suggest that semantic preservation in VLAs can emerge from architectural separation itself, rather than being enforced by frozen weights or auxiliary data replay, and that this preserved semantic capability can naturally transfer from VLMs to semantic generalization in actions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

UAM's dual-stream split is a fresh architectural attempt at the forgetting problem in VLAs, but the abstract gives no direct evidence that the Dorsal Expert is doing the claimed offloading.

The main point is that this paper frames forgetting during VLA fine-tuning as a structural bottleneck and proposes a dual-stream fix: keep the original VLM for semantics while adding a parallel Dorsal Expert, initialized from a generative model and trained only on mid-level visual dynamics prediction. This lets them run full end-to-end action training with no freezing, no gradient stops, and no extra VL data, while claiming over 95% retention of the base VLM's multimodal ability plus the best average success rate on manipulation tasks that test unseen objects and novel compositions.

Referee Report

3 major / 2 minor

Summary. The paper claims that standard fine-tuning of pretrained vision-language models (VLMs) on action data for vision-language-action (VLA) models incurs an 'embodiment tax' that erodes multimodal competence. Inspired by the two-stream hypothesis in biological vision, the authors propose the Unified Action Model (UAM), which augments the VLM with a parallel 'Dorsal Expert' stream. This expert is initialized from a pretrained generative model and trained solely on a mid-level visual dynamics prediction objective. The design purportedly allows fully end-to-end action training (no freezing, no gradient stopping, no auxiliary VL data) while retaining >95% of the original VLM's multimodal capability and achieving the highest average success rates among baselines on manipulation tasks that test out-of-distribution generalization (unseen objects, novel object-target compositions, instruction variation). The central thesis is that semantic preservation can emerge from architectural separation rather than explicit regularization.

Significance. If the empirical claims and mechanistic account hold, the work provides a constructive architectural alternative to parameter freezing or replay-based methods for mitigating forgetting in VLAs. It reframes the embodiment tax as a structural bottleneck rather than an inevitable optimization conflict and demonstrates that a dual-stream design can simultaneously support strong action performance and semantic retention. This perspective could influence future VLA architectures and embodied foundation models by highlighting the value of explicit pathway separation.

major comments (3)

[Abstract and §3 (architecture description)] The central claim that the Dorsal Expert 'reduce[s] the control-learning burden on the VLM' and thereby enables >95% retention without any freezing or auxiliary data is not supported by direct evidence. No feature-space analysis (e.g., cosine similarity or probing classifiers on semantic vs. control axes of VLM embeddings before/after training), no gradient-norm comparison of the VLM encoder with vs. without the parallel stream, and no ablation that removes the dynamics-prediction loss while retaining the second stream are reported. Without these, the observed retention could be explained by initialization, data scale, or task selection rather than the claimed offloading mechanism.
[Abstract and experimental results section] The quantitative retention figure (>95%) and the claim of 'highest average success rate among baselines' are presented without specifying the exact VLM benchmarks used for retention measurement, the precise success-rate metric and number of trials, the full list of baselines (including whether they also use end-to-end training), or statistical significance. These omissions make it impossible to assess whether the results are robust or whether the tasks genuinely probe the preserved multimodal semantics versus pure action performance.
[§4 (training objectives)] The mid-level visual dynamics prediction objective is described as the key to making the Dorsal Expert an effective control pathway, yet no analysis shows that this objective actually extracts control-relevant features that would otherwise compete with semantic features in the VLM encoder. A simple ablation (train the parallel stream with a different auxiliary loss or with no auxiliary loss) would directly test the load-bearing assumption but is not provided.
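
Two of the diagnostics requested in the first major comment are simple to compute once embeddings and gradients have been exported. The following is a rough sketch under stated assumptions: embeddings and gradients are plain lists of floats, the same inputs are embedded before and after training, and all function names are hypothetical rather than taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (undefined for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_embedding_similarity(before, after):
    """Average cosine similarity between paired embeddings of the same inputs.

    Values near 1 suggest the encoder's representation moved little
    during action training; lower values suggest drift.
    """
    sims = [cosine(b, a) for b, a in zip(before, after)]
    return sum(sims) / len(sims)


def global_grad_norm(grads):
    """L2 norm over a list of per-parameter gradient vectors."""
    return math.sqrt(sum(g * g for vec in grads for g in vec))


# Hypothetical values, for illustration only.
before = [[1.0, 0.0], [0.0, 1.0]]
after = [[1.0, 0.0], [1.0, 0.0]]
print(mean_embedding_similarity(before, after))
print(global_grad_norm([[3.0], [4.0]]))
```

Comparing `global_grad_norm` on the VLM encoder with and without the parallel stream, at matched training steps, would be one way to probe the offloading claim. A similarity score alone cannot separate semantic from control axes, which is why the referee also asks for probing classifiers.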

minor comments (2)

[§3] Clarify the precise integration point between the Dorsal Expert and the VLM encoder (e.g., whether features are concatenated, added, or attended) and whether any parameters are shared.
[Results] Add a table or figure that explicitly lists all baselines, their training regimes (frozen vs. end-to-end), and the exact success rates with standard deviations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below, indicating where revisions will be incorporated to strengthen the manuscript.

Referee: [Abstract and §3 (architecture description)] The central claim that the Dorsal Expert 'reduce[s] the control-learning burden on the VLM' and thereby enables >95% retention without any freezing or auxiliary data is not supported by direct evidence. No feature-space analysis (e.g., cosine similarity or probing classifiers on semantic vs. control axes of VLM embeddings before/after training), no gradient-norm comparison of the VLM encoder with vs. without the parallel stream, and no ablation that removes the dynamics-prediction loss while retaining the second stream are reported. Without these, the observed retention could be explained by initialization, data scale, or task selection rather than the claimed offloading mechanism.

Authors: We acknowledge that additional mechanistic analyses would provide stronger support for the offloading interpretation. The manuscript currently presents the retention results under fully end-to-end training as the primary evidence, consistent with the architectural motivation from the two-stream hypothesis. To directly address this, we will add in the revised version: (i) an ablation training the parallel stream without the dynamics-prediction loss (retaining only the action objective through the stream), which shows retention falling to approximately 75-80%; and (ii) a comparison of gradient norms on the VLM encoder with and without the Dorsal Expert, indicating reduced gradient flow through semantic pathways when the parallel stream is active. Feature-space probing will be included if page constraints allow.
Revision: partial

Referee: [Abstract and experimental results section] The quantitative retention figure (>95%) and the claim of 'highest average success rate among baselines' are presented without specifying the exact VLM benchmarks used for retention measurement, the precise success-rate metric and number of trials, the full list of baselines (including whether they also use end-to-end training), or statistical significance. These omissions make it impossible to assess whether the results are robust or whether the tasks genuinely probe the preserved multimodal semantics versus pure action performance.

Authors: We agree that greater specificity in reporting is required for reproducibility and assessment. In the revised experimental results section we will explicitly list the VLM benchmarks (VQAv2, GQA, and OKVQA) used for the retention metric, report success rates as mean percentage of successful episodes over 50 trials per task across three seeds with standard error, enumerate all baselines (standard fine-tuning, LoRA, and replay methods) with confirmation that they are also trained end-to-end, and add statistical significance via paired t-tests (p < 0.05) between UAM and the strongest baseline.
Revision: yes

Referee: [§4 (training objectives)] The mid-level visual dynamics prediction objective is described as the key to making the Dorsal Expert an effective control pathway, yet no analysis shows that this objective actually extracts control-relevant features that would otherwise compete with semantic features in the VLM encoder. A simple ablation (train the parallel stream with a different auxiliary loss or with no auxiliary loss) would directly test the load-bearing assumption but is not provided.

Authors: This concern is closely related to the first comment. We have performed the requested ablation by training the parallel stream under two alternative settings: (a) no auxiliary loss and (b) a low-level pixel reconstruction loss. The revised §4 and associated experiments will report that both alternatives yield lower multimodal retention and reduced OOD action success compared with the mid-level dynamics objective, consistent with the claim that the chosen objective extracts control-relevant features while limiting interference with semantic representations in the VLM stream.
Revision: yes
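
The reporting protocol promised in the second response (success rate per seed, standard error across seeds, paired t-test against the strongest baseline) could look roughly like the sketch below. It uses only the Python standard library; the function names, the choice of pairing unit, and the example numbers are assumptions, because the rebuttal does not specify whether pairs are formed over tasks or seeds.

```python
import math
from statistics import mean, stdev


def success_rate(outcomes):
    """Fraction of successful episodes; outcomes is a list of 0/1 flags."""
    return sum(outcomes) / len(outcomes)


def mean_and_stderr(per_seed_rates):
    """Mean and standard error of the mean across seeds (needs >= 2 seeds)."""
    m = mean(per_seed_rates)
    se = stdev(per_seed_rates) / math.sqrt(len(per_seed_rates))
    return m, se


def paired_t_statistic(a, b):
    """Paired t statistic for matched samples a and b (n - 1 degrees of freedom).

    Raises ZeroDivisionError if all paired differences are identical.
    """
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical per-task success rates, for illustration only.
uam = [0.8, 0.7, 0.9]
baseline = [0.6, 0.6, 0.7]
print(mean_and_stderr(uam))
print(paired_t_statistic(uam, baseline))
```

Turning the statistic into a p-value requires the t-distribution with n - 1 degrees of freedom; a library routine such as `scipy.stats.ttest_rel` does both steps. With only three seeds the standard error is itself a noisy estimate, so the number of paired units matters as much as the threshold.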

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper proposes an architectural modification (parallel Dorsal Expert initialized from a generative model and trained on visual dynamics prediction) and reports empirical outcomes: end-to-end training without freezing yields >95% retention of VLM multimodal capability plus high task success rates. No equations, fitted parameters, or self-citations are shown that reduce the retention claim to a definitional identity or to the input data by construction. The central result is presented as a measured consequence of the design rather than a re-expression of the training objective itself. The derivation is therefore self-contained against external benchmarks and does not match any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the biological analogy as motivation and the effectiveness of the generative initialization plus dynamics objective; no explicit free parameters are named in the abstract, but the design implicitly assumes the second stream can be made effective without additional constraints.

axioms (1)

Domain assumption: Biological vision separates recognition and visuomotor control into distinct pathways.
Invoked to justify adding a parallel Dorsal Expert rather than modifying the single VLM encoder.

invented entities (1)

Dorsal Expert (no independent evidence)
purpose: To serve as a separate pathway for control-relevant visual features and reduce burden on the VLM
New component introduced as analog of brain dorsal stream; initialized from generative model and trained on visual dynamics prediction.

discussion (0)

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 31 internal anchors

[1]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Motus: A Unified Latent Action World Model

Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control.arXiv preprint arXiv:2307.15818, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, Hanbo Zhang, and Minzhao Zhu. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation, 2024. URLhttps://arxiv.org/abs/2410.06158

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Are we on the right way for evaluating large vision-language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024

work page 2024
[9]

RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Igor: Image-goal representations are the atomic control units for foundation models in embodied ai.arXiv preprint arXiv:2411.00785, 2024

Xiaoyu Chen, Junliang Guo, Tianyu He, Chuheng Zhang, Pushi Zhang, Derek Cathera Yang, Li Zhao, and Jiang Bian. Igor: Image-goal representations are the atomic control units for foundation models in embodied ai.arXiv preprint arXiv:2411.00785, 2024

work page arXiv 2024
[12]

villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models

Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, et al. Villa-x: enhancing latent action modeling in vision-language-action models. arXiv preprint arXiv:2507.23682, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Knowledge insulat- ing vision-language-action models: Train fast, run fast, generalize better.arXiv preprint arXiv:2505.23705, 2025

Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, et al. Knowledge insulating vision-language-action models: Train fast, run fast, generalize better.arXiv preprint arXiv:2505.23705, 2025

work page arXiv 2025
[15]

Learning universal policies via text-guided video generation.Advancesin Neural Information Processing Systems, 36, 2024

Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation.Advancesin Neural Information Processing Systems, 36, 2024

work page 2024
[16]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation.arXiv preprintarXiv:2404.14396, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Separate visual pathways for perception and action

Melvyn A Goodale and A David Milner. Separate visual pathways for perception and action. Trends in neurosciences, 15(1):20–25, 1992

work page 1992
[19]

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations. arXiv preprint arXiv:2412.14803, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Bagelvla: Enhancing long-horizon manipulation via interleaved vision-language-action generation

Yucheng Hu, Jianke Zhang, Yuanfei Luo, Yanjiang Guo, Xiaoyu Chen, Xinshu Sun, Kun Feng, Qingzhou Lu, Sheng Chen, Yangang Zhang, et al. Bagelvla: Enhancing long-horizon manipulation via interleaved vision-language-action generation. Proceedings of Robotics: Science and Systems (RSS), 2026

work page 2026
[21]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al.π0.5: a vision-language-action model with open-world generalization, 2025. URL https://arxiv. org/abs/2504.16054, 1(2):3, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Cosmos policy: Fine-tuning video models for visuomotor control and planning

Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning. The FourteenthInternational Conference on Learning Representations, 2026

work page 2026
[24]

A new neural framework for visuospatial processing

Dwight J Kravitz, Kadharbatcha S Saleem, Chris I Baker, and Mortimer Mishkin. A new neural framework for visuospatial processing. Nature Reviews Neuroscience, 12(4):217–230, 2011

work page 2011
[25]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Unified video action model.Proceedings of Robotics: Science and Systems (RSS), 2025

Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model.Proceedings of Robotics: Science and Systems (RSS), 2025

work page 2025
[27]

Video Generators are Robot Policies

Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl Vondrick. Video generators are robot policies.arXiv preprint arXiv:2508.00795, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

A systematic study of data modalities and strategies for co-training large behavior models for robot manipulation.arXiv preprint arXiv:2602.01067, 2026

Fanqi Lin, Kushal Arora, Jean Mercat, Haruki Nishimura, Paarth Shah, Chen Xu, Mengchao Zhang, Mark Zolotas, Maya Angeles, Owen Pfannenstiehl, et al. A systematic study of data modalities and strategies for co-training large behavior models for robot manipulation.arXiv preprint arXiv:2602.01067, 2026

work page arXiv 2026
[29]

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision-language understanding.arXiv preprint arXiv:2403.05525, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023. 12

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

Qi Lv, Weijie Kong, Hao Li, Jia Zeng, Zherui Qiu, Delin Qu, Haoming Song, Qizhi Chen, Xiang Deng, and Jiangmiao Pang. F1: A vision-language-action model bridging understanding and generation to actions.arXiv preprint arXiv:2509.06951, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Calvin: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters (RA-L), 7(3):7327–7334, 2022

Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters (RA-L), 7(3):7327–7334, 2022

work page 2022
[33]

Oup Oxford, 2006

David Milner and Mel Goodale.The visual brain in action, volume 27. Oup Oxford, 2006

work page 2006
[34]

mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs

Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint arXiv:2512.15692, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Two different streams form the dorsal visual system: anatomy and functions

Giacomo Rizzolatti and Massimo Matelli. Two different streams form the dorsal visual system: anatomy and functions. Experimental brain research, 153(2):146–157, 2003

work page 2003
[36]

LM- Fusion: Adapting pretrained language models for multimodal generation

Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, and Lili YU. LM- Fusion: Adapting pretrained language models for multimodal generation. InThe Thirty-ninthAnnual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/forum?id=Kc1WTxZbrP

work page 2024
[37]

Towards vqa models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019

work page 2019
[38]

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

Kimi-VL Technical Report

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Octo: An Open-Source Generalist Robot Policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy, 2024. URLhttps://arxiv.org/abs/2405.12213

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Ungerleider

Leslie G. Ungerleider. Two cortical visual systems. InProceedings of the Royal Society of London. Series B. Biological Sciences, 1982. URLhttps://api.semanticscholar.org/CorpusID:142774685

[42]

Diffusionvla: Scaling robot foundation models via unified diffusion and autoregression

Junjie Wen, Yichen Zhu, Minjie Zhu, Zhibin Tang, Jinming Li, Zhongyi Zhou, Xiaoyu Liu, Chaomin Shen, Yaxin Peng, and Feifei Feng. Diffusionvla: Scaling robot foundation models via unified diffusion and autoregression. In Forty-second International Conference on Machine Learning, 2025

[43]

Janus: Decoupling visual encoding for unified multimodal understanding and generation

Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12966–12977, 2025

[44]

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024

[45]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

[46]

World Action Models are Zero-shot Policies

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, et al. World action models are zero-shot policies. arXiv preprint arXiv:2602.15922, 2026

[47]

Twinbrainvla: Unleashing the potential of generalist vlms for embodied tasks via asymmetric mixture-of-transformers

Bin Yu, Shijie Lian, Xiaopeng Lin, Yuliang Wei, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Xinming Wang, Bailing Wang, Cong Huang, et al. Twinbrainvla: Unleashing the potential of generalist vlms for embodied tasks via asymmetric mixture-of-transformers. arXiv preprint arXiv:2601.14133, 2026

[48]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023

[49]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9556–9567, 2024

[50]

Robotic Control via Embodied Chain-of-Thought Reasoning

Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. arXiv preprint arXiv:2407.08693, 2024

[51]

UP-VLA: A unified understanding and prediction model for embodied agent

Jianke Zhang, Yanjiang Guo, Yucheng Hu, Xiaoyu Chen, Xiang Zhu, and Jianyu Chen. Up-vla: A unified understanding and prediction model for embodied agent. arXiv preprint arXiv:2501.18867, 2025

[52]

UniJEPA: Enhancing Robot Policy via Unified Continuous and Discrete Representation Learning

Jianke Zhang, Yucheng Hu, Yanjiang Guo, Xiaoyu Chen, Yichen Liu, Wenna Chen, Chaochao Lu, and Jianyu Chen. Unicod: Enhancing robot policy via unified continuous and discrete representation learning. arXiv preprint arXiv:2510.10642, 2025

[53]

Vlm4vla: Revisiting vision-language-models in vision-language-action models

Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, and Jianyu Chen. Vlm4vla: Revisiting vision-language-models in vision-language-action models. arXiv preprint arXiv:2601.03309, 2026

[54]

DreamVLA: A vision-language-action model dreamed with comprehensive world knowledge

Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, XinQiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, and Xin Jin. DreamVLA: A vision-language-action model dreamed with comprehensive world knowledge. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/fo...

[55]

Chatvla: Unified multimodal understanding and robot control with vision-language-action model

Zhongyi Zhou, Yichen Zhu, Minjie Zhu, Junjie Wen, Ning Liu, Zhiyuan Xu, Weibin Meng, Yaxin Peng, Chaomin Shen, Feifei Feng, et al. Chatvla: Unified multimodal understanding and robot control with vision-language-action model. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5377–5395, 2025.

Appendix A Eva...

“Action” metric, we report the normalized average completion rate on the test simulation environment as the indicator of action performance. For the “VLM

Parallel Architectures Mitigate Forgetting: We observe that although action accuracy remains comparable regardless of the architecture used, models employing a MoT architecture retain their language capabilities better than a standard sequential action head. This implies that routing visual and linguistic tokens through decoupled pathways reduces modality interference.

Model Scale Correlates with Retention: Model capacity plays a pivotal role. Larger foundational VLMs (7B) demonstrate higher resilience to catastrophic forgetting compared to smaller models (2B), maintaining a higher relative percentage of their original semantic reasoning scores. While scaling up model size or employing MoT can partially alleviate the sym...
