ThinkingVLA: Interleaved Vision and Language Reasoning for Robotic Manipulation

Tianyi Lu, Hui Zhang, Zijie Diao, Junke Wang, Shengqi Xu, Xingyao Lin, Guojin Zhong, Ziyi Ye, Peng Wang, Zuxuan Wu, Yu-Gang Jiang

arxiv: 2606.17937 · v1 · pith:LLKPE6LPnew · submitted 2026-06-16 · 💻 cs.RO

Pith reviewed 2026-06-27 00:48 UTC · model grok-4.3

classification 💻 cs.RO

keywords: vision-language-action · robotic manipulation · chain-of-thought · long-horizon tasks · autoregressive model · mixture-of-transformers · inverse dynamics

The pith

ThinkingVLA improves robotic manipulation by interleaving visual state prediction with inverse action reasoning inside one autoregressive model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most vision-language-action models map observations straight to actions and struggle with long-horizon tasks that require explicit planning. The paper claims manipulation planning decomposes into forward prediction of the next visual state and inverse dynamics that recover the actions needed to reach it. A unified autoregressive architecture that interleaves text and image tokens can carry both steps in a single generation process. ThinkingVLA realizes the decomposition through forward chain-of-thought for subgoal and image forecasting followed by inverse chain-of-thought on the predicted image to ground action reasoning. Experiments on simulation and real-world benchmarks show consistent gains over baselines, with the largest improvements on long-horizon tasks.

Core claim

Manipulation planning naturally decomposes into prediction of the next visual state and inverse dynamics that infer the actions needed to reach that state. Bridging the two requires a unified autoregressive architecture that interleaves textual and visual reasoning. ThinkingVLA realizes this in three steps. A forward CoT identifies the immediate subgoal and guides visual forecasting. The predicted image then serves as the target state for an inverse CoT that reasons about spatial relationships and action intent. The final action is generated conditioned on the full reasoning context.
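The generation order described above can be sketched as a sequence of typed segments, each conditioned on everything before it. This is a minimal illustration, not the authors' implementation: the `Segment` type, the stage identifiers, and the stub `step` function are all hypothetical, and a real model would emit tokens rather than strings.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stage labels; the paper names the stages but not these identifiers.
STAGES = ["forward_cot", "predicted_image", "inverse_cot", "action"]

# Assumed modality of each stage's output.
MODALITY = {
    "forward_cot": "text",
    "predicted_image": "image",
    "inverse_cot": "text",
    "action": "action",
}


@dataclass
class Segment:
    stage: str     # which reasoning stage produced this segment
    modality: str  # "text", "image", or "action"
    content: str   # placeholder payload (a real model would hold tokens)


def generate_interleaved(
    observation: str,
    instruction: str,
    step: Callable[[str, List[Segment]], str],
) -> List[Segment]:
    """Run the four stages in order, each seeing the full prior context."""
    context = [
        Segment("input", "image", observation),
        Segment("input", "text", instruction),
    ]
    for stage in STAGES:
        content = step(stage, context)  # conditioned on everything so far
        context.append(Segment(stage, MODALITY[stage], content))
    return context


def stub_step(stage: str, context: List[Segment]) -> str:
    """Stand-in for the model: records how many segments it was conditioned on."""
    return f"<{stage} given {len(context)} segments>"


trace = generate_interleaved("<current frame>", "put the cup on the shelf", stub_step)
for seg in trace:
    print(f"{seg.stage:16s} {seg.modality:7s} {seg.content}")
```

The point of the single growing `context` list is that the inverse CoT and the action both see the predicted image directly, with no hand-off between separate modules.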

What carries the argument

Mixture-of-Transformers architecture that interleaves forward chain-of-thought for visual forecasting with inverse chain-of-thought conditioned on the predicted image to produce actions.

Load-bearing premise

Manipulation planning naturally decomposes into visual-state prediction followed by inverse dynamics, and the two must be bridged inside a single unified autoregressive architecture that interleaves text and images.
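One way to write this premise down is as a chain-rule factorization of a single generated sequence. The notation is an assumption introduced here, not taken from the paper: o is the current observation, \ell the instruction, c_f the forward reasoning text, \hat{o}' the predicted next image, c_i the inverse reasoning text, and a the action.

```latex
\[
\begin{aligned}
p(c_f, \hat{o}', c_i, a \mid o, \ell)
  &= p(c_f \mid o, \ell) \\
  &\quad \times p(\hat{o}' \mid o, \ell, c_f) \\
  &\quad \times p(c_i \mid o, \ell, c_f, \hat{o}') \\
  &\quad \times p(a \mid o, \ell, c_f, \hat{o}', c_i)
\end{aligned}
\]
```

The identity itself is only the chain rule and holds for any ordering of the four parts. The substantive claim is that generating them in this particular order, inside one model, is what improves manipulation performance.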

What would settle it

A controlled comparison in which a model without the interleaved inverse CoT step or without unified autoregressive interleaving achieves equal performance on long-horizon tasks would falsify the necessity of the claimed decomposition.
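The controlled comparison described here could be organized as a small grid of variants that each differ from the full model in exactly one design factor. The sketch below is illustrative only: the variant names and the `Variant` fields are hypothetical, and the source does not report any such ablation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    unified_model: bool   # one autoregressive model vs. separate modules
    image_forecast: bool  # the predicted next image is generated
    inverse_cot: bool     # inverse reasoning over the predicted image


# Hypothetical variants. Training data, parameter count, and reasoning-token
# budget are assumed to be held fixed across all rows.
VARIANTS = [
    Variant("full", unified_model=True, image_forecast=True, inverse_cot=True),
    Variant("no_inverse_cot", unified_model=True, image_forecast=True, inverse_cot=False),
    Variant("modular", unified_model=False, image_forecast=True, inverse_cot=True),
]

FACTORS = ("unified_model", "image_forecast", "inverse_cot")


def differs_in_one_factor(a: Variant, b: Variant) -> bool:
    """True when two variants differ in exactly one design factor."""
    return sum(getattr(a, f) != getattr(b, f) for f in FACTORS) == 1


full = VARIANTS[0]
controls = [v.name for v in VARIANTS[1:] if differs_in_one_factor(full, v)]
print(controls)
```

Restricting each control to a single changed factor is what lets a performance gap be attributed to that factor rather than to scale or extra reasoning tokens.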

Original abstract

Most Vision-Language-Action (VLA) models map observations directly to actions without explicit reasoning, limiting their capacity for reasoning-intensive long-horizon tasks. To address this, existing approaches adopt Chain-of-Thought (CoT) reasoning to enable subgoal decomposition and spatial anticipation. However, those methods lack a unified architecture for effective cross-modal reasoning and fail to explicitly include inverse reasoning ability based on the target state. We argue that manipulation planning naturally decomposes into prediction, anticipating the next visual state, and inverse dynamics, inferring the actions to reach it. Bridging both requires a unified autoregressive architecture that interleaves textual and visual reasoning in a single generation process. We propose ThinkingVLA, a generative model that realizes this decomposition within a unified Mixture-of-Transformers architecture. ThinkingVLA consists of a forward CoT that identifies the immediate subgoal and guides the visual forecasting; the predicted image then serves as the target state, grounding an inverse CoT that reasons about spatial relationships and action intent based on the predicted image; and the final action is generated conditioned on this full reasoning context. Extensive experiments on simulation and real-world benchmarks demonstrate that ThinkingVLA consistently outperforms state-of-the-art baselines, with particularly large gains on long-horizon manipulation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ThinkingVLA interleaves forward visual CoT, image prediction, and inverse CoT inside one MoT pass for VLA, but the experiments do not isolate whether that unified interleaving is required for the reported gains.

The paper's core move is to treat manipulation planning as forward prediction of the next visual state followed by inverse dynamics, and to implement both inside a single autoregressive Mixture-of-Transformers generation that interleaves text and images. Forward CoT identifies the subgoal and drives the image forecast; the predicted image then anchors an inverse CoT that reasons about actions to reach it. That specific forward-plus-inverse structure with an explicit image token in the middle is not something the cited prior CoT VLA papers do in one pass.

The idea is straightforward and addresses a real gap: most VLA models jump straight from observation to action without explicit cross-modal steps. Putting the predicted image in the middle gives the inverse reasoning a concrete target state to condition on, which could help on long-horizon tasks where spatial relationships matter.

The main weakness is that the abstract claims large gains on long-horizon benchmarks without showing the ablation that would make the architectural claim load-bearing. There is no comparison to a modular baseline that runs a separate visual forecaster and then a separate inverse-dynamics policy, nor an ablation that keeps CoT and total capacity fixed but removes the interleaving. If the gains come mainly from more reasoning tokens or data scale rather than the unified autoregressive bridge, the central argument about natural decomposition requiring one model does not hold up. The paper also does not report whether the image prediction step is accurate enough on its own or whether errors there propagate.
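The open question about forecast accuracy could be probed by splitting evaluation episodes by image-prediction error and comparing task success in each bucket. The following is a rough sketch under stated assumptions: the error metric, the threshold, and the episode tuples are all placeholders, and the toy values are synthetic and carry no empirical meaning.

```python
from typing import List, Tuple


def success_by_forecast_error(
    episodes: List[Tuple[float, bool]],
    threshold: float,
) -> Tuple[float, float]:
    """Return (success rate at or below threshold, success rate above it).

    Each episode is (forecast_error, task_succeeded). The error metric and
    the threshold are placeholders; the source specifies neither.
    """
    low = [ok for err, ok in episodes if err <= threshold]
    high = [ok for err, ok in episodes if err > threshold]
    if not low or not high:
        raise ValueError("threshold leaves one bucket empty")
    return sum(low) / len(low), sum(high) / len(high)


# Synthetic placeholder episodes, invented purely to exercise the function.
toy = [(0.1, True), (0.2, True), (0.3, False), (0.7, False), (0.8, True), (0.9, False)]
low_rate, high_rate = success_by_forecast_error(toy, threshold=0.5)
print(f"low-error success: {low_rate:.2f}, high-error success: {high_rate:.2f}")
```

A large gap between the two rates on real evaluation data would suggest that forecasting errors propagate into the action; a small gap would suggest the inverse CoT is robust to them or does not rely heavily on the predicted image.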

This is the kind of paper that belongs in a reading group for people working on VLA architectures and long-horizon robotic planning. The construction is concrete enough that a referee could check the implementation details and ask for the missing controls. It deserves peer review because it has a clear, testable architectural proposal and reports results on standard benchmarks, even if the current evidence does not yet pin down why the gains occur.

Referee Report

1 major / 0 minor

Summary. The paper proposes ThinkingVLA, a generative Vision-Language-Action (VLA) model that realizes manipulation planning as a decomposition into visual-state prediction and inverse dynamics within a single unified autoregressive Mixture-of-Transformers architecture. The model performs forward CoT to identify subgoals and guide visual forecasting, predicts the next image as target state, conducts inverse CoT to reason about spatial relationships and actions from the predicted image, and generates the final action conditioned on the full context. It claims consistent outperformance over state-of-the-art baselines on simulation and real-world benchmarks, with particularly large gains on long-horizon tasks.

Significance. If the outperformance holds and is shown to stem from the interleaved reasoning rather than other factors, the work could advance VLA models for complex robotic manipulation by demonstrating the value of explicit forward and inverse reasoning in a unified generative process.

major comments (1)

[Abstract] Abstract (argument paragraph): The assertion that manipulation planning 'naturally decomposes' into visual-state prediction followed by inverse dynamics and that bridging them 'requires' a unified autoregressive architecture interleaving text and images is presented as foundational, yet the experiments report outperformance without an ablation against a modular control (separate visual forecaster + separate inverse-dynamics policy) or a variant that removes interleaving while holding CoT, data scale, and capacity fixed. This leaves the architectural necessity claim load-bearing but untested.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point-by-point below.

Referee: [Abstract] Abstract (argument paragraph): The assertion that manipulation planning 'naturally decomposes' into visual-state prediction followed by inverse dynamics and that bridging them 'requires' a unified autoregressive architecture interleaving text and images is presented as foundational, yet the experiments report outperformance without an ablation against a modular control (separate visual forecaster + separate inverse-dynamics policy) or a variant that removes interleaving while holding CoT, data scale, and capacity fixed. This leaves the architectural necessity claim load-bearing but untested.

Authors: We appreciate the referee's observation that the abstract presents the decomposition and unified architecture as foundational without a direct ablation isolating the interleaving mechanism. The core motivation is that a single autoregressive Mixture-of-Transformers process enables the predicted image to directly ground the subsequent inverse CoT within the same token sequence, providing shared context that modular pipelines would require explicit bridging mechanisms to replicate. Our experiments demonstrate gains over existing VLA baselines that lack this unified forward-inverse structure, particularly on long-horizon tasks. Nevertheless, we acknowledge the absence of the suggested modular control ablation. We will revise the abstract to frame the unified interleaving as a design choice that facilitates the reasoning flow rather than a strict requirement, and add a dedicated paragraph in the discussion section analyzing why separate modules would likely incur information loss or require non-trivial additional components to achieve comparable cross-modal grounding.

revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper's central claim is presented as an argument ('We argue that manipulation planning naturally decomposes into prediction... and inverse dynamics... Bridging both requires a unified autoregressive architecture...') rather than a derivation from equations or first principles. No self-definitional reductions, fitted inputs renamed as predictions, load-bearing self-citations, imported uniqueness theorems, smuggled ansatzes, or renamed known results appear in the provided text. The architecture is a design choice justified by the argument and validated empirically; the derivation chain does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unverified empirical assertion of outperformance.

pith-pipeline@v0.9.1-grok · 5791 in / 988 out tokens · 35561 ms · 2026-06-27T00:48:15.072582+00:00 · methodology

[1] [1]

Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvĳit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

Pith/arXiv arXiv 2025

[2] [2]

Rt-h: Action hierarchies using language

Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quan Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. Rt-h: Action hierarchies using language. InRobotics: Science and Systems, 2024

2024

[3] [3]

Paligemma: A versatile 3b vlm for transfer

LucasBeyer,AndreasSteiner,AndréSusanoPinto,AlexanderKolesnikov,XiaoWang,DanielSalz,MaximNeumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprintarXiv:2407.07726, 2024

Pith/arXiv arXiv 2024

[4] [4]

Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

Pith/arXiv arXiv 2025

[5] [5]

𝜋0: Avision-language-actionflowmodelforgeneralrobotcontrol

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, KarolHausman,BrianIchter,etal. 𝜋0: Avision-language-actionflowmodelforgeneralrobotcontrol. arXivpreprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024

[6] [6]

In9th AnnualConferenceon RobotLearning, 2025

Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y Galliker, et al.𝜋0.5: a vision-language-action model with open-world generalization. In9th AnnualConferenceon RobotLearning, 2025

2025

[7] [7]

Rt-1: Robotics transformer for real-world control at scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakr- ishnan, Karol Hausman, Alexander Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. Robotics: Science andSystemsXIX, 2023

2023

[8] [8]

Worldvla: Towards autoregressive action world model.arXivpreprintarXiv:2506.21539, 2025

Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXivpreprintarXiv:2506.21539, 2025

Pith/arXiv arXiv 2025

[9] [9]

Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXivpreprintarXiv:2506.18088, 2025

Tianxing Chen, Zanxin Chen, Baĳun Chen, Zĳian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXivpreprintarXiv:2506.18088, 2025

Pith/arXiv arXiv 2025

[10] [10]

Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

Pith/arXiv arXiv 2025

[11] Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, et al. Emu3.5: Native multimodal models are world learners. arXiv preprint arXiv:2510.26583, 2025

[12] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025

[13] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

[14] Chenyang Gu, Jiaming Liu, Hao Chen, Runzhong Huang, Qingpo Wuwu, Zhuoyang Liu, Xiaoqi Li, Ying Li, Renrui Zhang, Peng Jia, et al. Manualvla: A unified vla model for chain-of-thought manual generation and robotic manipulation. arXiv preprint arXiv:2512.02013, 2025

[15] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications

[16] Yucheng Hu, Jianke Zhang, Yuanfei Luo, Yanjiang Guo, Xiaoyu Chen, Xinshu Sun, Kun Feng, Qingzhou Lu, Sheng Chen, Yangang Zhang, et al. Bagelvla: Enhancing long-horizon manipulation via interleaved vision-language-action generation. arXiv preprint arXiv:2602.09849, 2026

[17] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. In 8th Annual Conference on Robot Learning, 2024

[18] Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Boyang Li, Shuo Liu, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space. In Workshop on Making Sense of Data in Robotics: Composition, Curation, and Interpretability at Scale at CoRL 2025, 2025

[19] Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control. arXiv preprint arXiv:2601.21998, 2026

[20] Xingyao Lin, Guojin Zhong, Tianyi Lu, Ziyi Ye, Yichen Zhu, Zuxuan Wu, and Yu-Gang Jiang. Activemimic: Egocentric video pretraining with active perception. arXiv preprint arXiv:2606.06194, 2026

[21] Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. In The Thirteenth International Conference on Learning Representations, 2025

[22] Zhuoyang Liu, Jiaming Liu, Hao Chen, Ziyu Guo, Chengkai Hou, Chenyang Gu, Jiale Yu, Xiangju Mi, Renrui Zhang, Zhengping Che, et al. Last0: Latent spatio-temporal chain-of-thought for robotic vision-language-action model. arXiv preprint arXiv:2601.05248, 2026

[23] Qi Lv, Weijie Kong, Hao Li, Jia Zeng, Zherui Qiu, Delin Qu, Haoming Song, Qizhi Chen, Xiang Deng, and Jiangmiao Pang. F1: A vision-language-action model bridging understanding and generation to actions. arXiv preprint arXiv:2509.06951, 2025

[24] Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: a unified tokenizer for visual generation and understanding. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

[25] Xiangkai Ma, Lekai Xing, Han Zhang, Wenzhong Li, and Sanglu Lu. Unifying perception and action: A hybrid-modality pipeline with implicit visual chain-of-thought for robotic action generation. arXiv preprint arXiv:2511.19859, 2025

[26] Open X-Embodiment Collaboration. Open X-Embodiment: Robotic learning datasets and RT-X models. arXiv preprint arXiv:2310.08864, 2024

[27] Ali Razavi, Aaron Van Den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2. Advances in neural information processing systems, 32, 2019

[28] Fengyuan Shi, Zhuoyan Luo, Yixiao Ge, Yujiu Yang, Ying Shan, and Limin Wang. Scalable image tokenization with index backpropagation quantization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16037–16046, 2025

[29] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024

[30] Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, and Jiangmiao Pang. Predictive inverse dynamics models are scalable learners for robotic manipulation. In International Conference on Learning Representations, volume 2025, pages 92033–92052, 2025

[31] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017

[32] Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, and Yu-Gang Jiang. Omnitokenizer: A joint image-video tokenizer for visual generation. Advances in Neural Information Processing Systems, 37:28281–28295, 2024

[33] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024

[34] Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, and Zhaoxiang Zhang. Unified vision-language-action model. arXiv preprint arXiv:2506.19850, 2025

[35] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

[36] Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Zhibin Tang, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters, 2025

[37] Junjie Wen, Yichen Zhu, Minjie Zhu, Zhibin Tang, Jinming Li, Zhongyi Zhou, Xiaoyu Liu, Chaomin Shen, Yaxin Peng, and Feifei Feng. Diffusionvla: Scaling robot foundation models via unified diffusion and autoregression. In Forty-second International Conference on Machine Learning, 2025

[38] Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12966–12977, 2025

[39] Junfeng Wu, Yi Jiang, Chuofan Ma, Yuliang Liu, Hengshuang Zhao, Zehuan Yuan, Song Bai, and Xiang Bai. Liquid: Language models are scalable and unified multi-modal generators. International Journal of Computer Vision, 134(1):39, 2026

[40] Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, et al. Vila-u: a unified foundation model integrating visual understanding and generation. In The Thirteenth International Conference on Learning Representations

[41] Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. In The Thirteenth International Conference on Learning Representations

[42] Cheng Yin, Yankai Lin, Wang Xu, Sikyuen Tam, Xiangrui Zeng, Zhiyuan Liu, and Zhouping Yin. Deepthinkvla: Enhancing reasoning capability of vision-language-action models. arXiv preprint arXiv:2511.15669, 2025

[43] Lijun Yu, Jose Lezama, Nitesh Bharadwaj Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, et al. Language model beats diffusion - tokenizer is key to visual generation. In The Twelfth International Conference on Learning Representations

[44] Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. In 8th Annual Conference on Robot Learning, 2024

[45] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023

[46] Jianke Zhang, Yanjiang Guo, Yucheng Hu, Xiaoyu Chen, Xiang Zhu, and Jianyu Chen. Up-vla: A unified understanding and prediction model for embodied agent. In International Conference on Machine Learning, pages 74911–74922. PMLR, 2025

[47] Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025

[48] Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025

[49] Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. In The Thirteenth International Conference on Learning Representations

A Model Architecture Details

Table 3 provides the full architectural spec...