
arxiv: 2606.17937 · v1 · pith:LLKPE6LPnew · submitted 2026-06-16 · 💻 cs.RO

ThinkingVLA: Interleaved Vision and Language Reasoning for Robotic Manipulation

Pith reviewed 2026-06-27 00:48 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-language-action · robotic manipulation · chain-of-thought · long-horizon tasks · autoregressive model · mixture-of-transformers · inverse dynamics

The pith

ThinkingVLA improves robotic manipulation by interleaving visual state prediction with inverse action reasoning inside one autoregressive model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most vision-language-action models map observations straight to actions and struggle with long-horizon tasks that require explicit planning. The paper claims manipulation planning decomposes into forward prediction of the next visual state and inverse dynamics that recover the actions needed to reach it. A unified autoregressive architecture that interleaves text and image tokens can carry both steps in a single generation process. ThinkingVLA realizes the decomposition through forward chain-of-thought for subgoal and image forecasting followed by inverse chain-of-thought on the predicted image to ground action reasoning. Experiments on simulation and real-world benchmarks show consistent gains over baselines, with the largest improvements on long-horizon tasks.
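One way to write the claimed decomposition down, as a sketch rather than the paper's own notation: let o_t be the current observation, \ell the instruction, c^f the forward chain-of-thought text, \hat{o}_{t+1} the predicted next image, c^i the inverse chain-of-thought text, and a_t the action. All of these symbols are assumptions introduced here for illustration. Generating the four segments in order inside one autoregressive sequence corresponds to the chain-rule factorization:

```latex
\[
\begin{aligned}
p(c^f, \hat{o}_{t+1}, c^i, a_t \mid o_t, \ell)
  ={}& p(c^f \mid o_t, \ell) \\
  &\times p(\hat{o}_{t+1} \mid c^f, o_t, \ell) \\
  &\times p(c^i \mid \hat{o}_{t+1}, c^f, o_t, \ell) \\
  &\times p(a_t \mid c^i, \hat{o}_{t+1}, c^f, o_t, \ell)
\end{aligned}
\]
```

The first two factors are the forward-prediction half and the last two are the inverse-dynamics half. The identity is only the chain rule and would hold for any ordering of the segments; the paper's argument is that this particular ordering, with all four factors carried by one model, is the useful one.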

Core claim

Manipulation planning naturally decomposes into two steps: prediction of the next visual state, and inverse dynamics, which infers the actions needed to reach that state. Bridging the two requires a unified autoregressive architecture that interleaves textual and visual reasoning. ThinkingVLA realizes this in three stages. A forward CoT identifies the immediate subgoal and guides visual forecasting. The predicted image then serves as the target state for an inverse CoT that reasons about spatial relationships and action intent. The final action is generated conditioned on the full reasoning context.
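The generation order described above can be sketched as a short loop. This is an illustrative skeleton, not the authors' code: `ReasoningTrace`, `generate_step`, the segment names, and the `stub_generate` callback are all hypothetical, and the stub stands in for whatever token decoding the real model performs.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# A segment is (kind, payload); payloads are plain strings in this toy version.
Segment = Tuple[str, str]

# Hypothetical segment names, in the order the review describes them.
GENERATED_KINDS = ("forward_cot", "predicted_image", "inverse_cot", "action")


@dataclass
class ReasoningTrace:
    """Single interleaved sequence that every later segment conditions on."""
    segments: List[Segment] = field(default_factory=list)

    def append(self, kind: str, payload: str) -> None:
        self.segments.append((kind, payload))

    def kinds(self) -> List[str]:
        return [kind for kind, _ in self.segments]


def generate_step(
    observation: str,
    instruction: str,
    generate: Callable[[str, List[Segment]], str],
) -> ReasoningTrace:
    """Run one forward-then-inverse reasoning pass and return the full trace."""
    trace = ReasoningTrace()
    trace.append("observation", observation)
    trace.append("instruction", instruction)
    for kind in GENERATED_KINDS:
        # Each segment sees everything generated before it (shared context).
        payload = generate(kind, list(trace.segments))
        trace.append(kind, payload)
    return trace


def stub_generate(kind: str, context: List[Segment]) -> str:
    """Placeholder for the model: reports how much context it was given."""
    return f"<{kind} | context={len(context)}>"
```

Calling `generate_step("img_0", "stack the blocks", stub_generate)` yields a trace whose last segment is the action, produced with the observation, instruction, forward CoT, predicted image, and inverse CoT all in its context.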

What carries the argument

Mixture-of-Transformers architecture that interleaves forward chain-of-thought for visual forecasting with inverse chain-of-thought conditioned on the predicted image to produce actions.
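The review gives no layer-level details, so the following is only a toy illustration of the general Mixture-of-Transformers idea as it is commonly described: tokens of all modalities are mixed in one shared sequence, then each token passes through weights specific to its modality. The functions `shared_mix` and `mot_layer`, the causal running mean used in place of self-attention, and the three toy experts are all assumptions made for this sketch and say nothing about ThinkingVLA's actual layout.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def shared_mix(hidden: Sequence[Vector]) -> List[Vector]:
    """Stand-in for shared causal self-attention: running mean over the prefix."""
    mixed: List[Vector] = []
    running = [0.0] * len(hidden[0])
    for count, vec in enumerate(hidden, start=1):
        running = [r + x for r, x in zip(running, vec)]
        mixed.append([r / count for r in running])
    return mixed


def mot_layer(
    hidden: Sequence[Vector],
    modalities: Sequence[str],
    experts: Dict[str, Callable[[Vector], Vector]],
) -> List[Vector]:
    """Mix across the whole sequence, then route each token by its modality."""
    mixed = shared_mix(hidden)
    return [experts[m](vec) for vec, m in zip(mixed, modalities)]


# Toy modality-specific transforms standing in for per-modality weights.
toy_experts: Dict[str, Callable[[Vector], Vector]] = {
    "text": lambda v: [2.0 * x for x in v],
    "image": lambda v: [x + 1.0 for x in v],
    "action": lambda v: [-x for x in v],
}
```

The point of the sketch is the split of responsibilities: the mixing step sees text, image, and action tokens together, which is what lets a predicted image inform later text, while the per-modality step keeps separate parameters for each token type.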

Load-bearing premise

Manipulation planning naturally decomposes into visual-state prediction followed by inverse dynamics, and the two must be bridged inside a single unified autoregressive architecture that interleaves text and images.

What would settle it

A controlled comparison in which a model without the interleaved inverse CoT step or without unified autoregressive interleaving achieves equal performance on long-horizon tasks would falsify the necessity of the claimed decomposition.
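The comparison described above can be laid out as a small ablation grid. The sketch below is hypothetical bookkeeping, not an experiment from the paper: `Variant`, `ablation_grid`, and `matching_ablations` are names invented here, and the success rates must come from real evaluations run with CoT supervision, data scale, and model capacity held fixed, which the code does not enforce.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List


@dataclass(frozen=True)
class Variant:
    unified_interleaving: bool  # one autoregressive model vs. separate modules
    inverse_cot: bool           # inverse reasoning on the predicted image

    @property
    def is_full_model(self) -> bool:
        return self.unified_interleaving and self.inverse_cot


def ablation_grid() -> List[Variant]:
    """All four on/off combinations; the first entry is the full model."""
    return [Variant(u, i) for u, i in product((True, False), repeat=2)]


def matching_ablations(
    success: Dict[Variant, float], tolerance: float
) -> List[Variant]:
    """Ablated variants whose long-horizon success is within `tolerance`
    of the full model. A non-empty result argues against necessity."""
    full = next(v for v in success if v.is_full_model)
    threshold = success[full] - tolerance
    return [
        v for v in success
        if not v.is_full_model and success[v] >= threshold
    ]
```

If `matching_ablations` returns an empty list for a reasonable tolerance, the necessity claim survives that test; if it returns the modular or no-inverse-CoT variant, the claim is weakened to a design preference.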

Original abstract

Most Vision-Language-Action (VLA) models map observations directly to actions without explicit reasoning, limiting their capacity for reasoning-intensive long-horizon tasks. To address this, existing approaches adopt Chain-of-Thought (CoT) reasoning to enable subgoal decomposition and spatial anticipation. However, those methods lack a unified architecture for effective cross-modal reasoning and fail to explicitly include inverse reasoning ability based on the target state. We argue that manipulation planning naturally decomposes into prediction, anticipating the next visual state, and inverse dynamics, inferring the actions to reach it. Bridging both requires a unified autoregressive architecture that interleaves textual and visual reasoning in a single generation process. We propose ThinkingVLA, a generative model that realizes this decomposition within a unified Mixture-of-Transformers architecture. ThinkingVLA consists of a forward CoT that identifies the immediate subgoal and guides the visual forecasting; the predicted image then serves as the target state, grounding an inverse CoT that reasons about spatial relationships and action intent based on the predicted image; and the final action is generated conditioned on this full reasoning context. Extensive experiments on simulation and real-world benchmarks demonstrate that ThinkingVLA consistently outperforms state-of-the-art baselines, with particularly large gains on long-horizon manipulation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes ThinkingVLA, a generative Vision-Language-Action (VLA) model that realizes manipulation planning as a decomposition into visual-state prediction and inverse dynamics within a single unified autoregressive Mixture-of-Transformers architecture. The model performs forward CoT to identify subgoals and guide visual forecasting, predicts the next image as target state, conducts inverse CoT to reason about spatial relationships and actions from the predicted image, and generates the final action conditioned on the full context. It claims consistent outperformance over state-of-the-art baselines on simulation and real-world benchmarks, with particularly large gains on long-horizon tasks.

Significance. If the outperformance holds and is shown to stem from the interleaved reasoning rather than other factors, the work could advance VLA models for complex robotic manipulation by demonstrating the value of explicit forward and inverse reasoning in a unified generative process.

major comments (1)
  1. [Abstract] Abstract (argument paragraph): The assertion that manipulation planning 'naturally decomposes' into visual-state prediction followed by inverse dynamics and that bridging them 'requires' a unified autoregressive architecture interleaving text and images is presented as foundational, yet the experiments report outperformance without an ablation against a modular control (separate visual forecaster + separate inverse-dynamics policy) or a variant that removes interleaving while holding CoT, data scale, and capacity fixed. This leaves the architectural necessity claim load-bearing but untested.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point-by-point below.

Point-by-point responses
  1. Referee: [Abstract] Abstract (argument paragraph): The assertion that manipulation planning 'naturally decomposes' into visual-state prediction followed by inverse dynamics and that bridging them 'requires' a unified autoregressive architecture interleaving text and images is presented as foundational, yet the experiments report outperformance without an ablation against a modular control (separate visual forecaster + separate inverse-dynamics policy) or a variant that removes interleaving while holding CoT, data scale, and capacity fixed. This leaves the architectural necessity claim load-bearing but untested.

    Authors: We appreciate the referee's observation that the abstract presents the decomposition and unified architecture as foundational without a direct ablation isolating the interleaving mechanism. The core motivation is that a single autoregressive Mixture-of-Transformers process enables the predicted image to directly ground the subsequent inverse CoT within the same token sequence, providing shared context that modular pipelines would require explicit bridging mechanisms to replicate. Our experiments demonstrate gains over existing VLA baselines that lack this unified forward-inverse structure, particularly on long-horizon tasks. Nevertheless, we acknowledge the absence of the suggested modular control ablation. We will revise the abstract to frame the unified interleaving as a design choice that facilitates the reasoning flow rather than a strict requirement, and add a dedicated paragraph in the discussion section analyzing why separate modules would likely incur information loss or require non-trivial additional components to achieve comparable cross-modal grounding.

    Revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

Full rationale

The paper's central claim is presented as an argument ('We argue that manipulation planning naturally decomposes into prediction... and inverse dynamics... Bridging both requires a unified autoregressive architecture...') rather than a derivation from equations or first principles. No self-definitional reductions, fitted inputs renamed as predictions, load-bearing self-citations, imported uniqueness theorems, smuggled ansatzes, or renamed known results appear in the provided text. The architecture is a design choice justified by the argument and validated empirically; the derivation chain does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unverified empirical assertion of outperformance.

pith-pipeline@v0.9.1-grok · 5791 in / 988 out tokens · 35561 ms · 2026-06-27T00:48:15.072582+00:00 · methodology

