ThinkingVLA: Interleaved Vision and Language Reasoning for Robotic Manipulation
Pith reviewed 2026-06-27 00:48 UTC · model grok-4.3
The pith
ThinkingVLA improves robotic manipulation by interleaving visual state prediction with inverse action reasoning inside one autoregressive model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Manipulation planning decomposes naturally into two steps: predicting the next visual state, and inverse dynamics that infers the actions needed to reach that state. The paper argues that bridging the two requires a unified autoregressive architecture that interleaves textual and visual reasoning. ThinkingVLA realizes this in three stages: a forward CoT identifies the immediate subgoal and guides visual forecasting; the predicted image then serves as the target state for an inverse CoT that reasons about spatial relationships and action intent; and the final action is generated conditioned on the full reasoning context.
What carries the argument
A Mixture-of-Transformers architecture that interleaves a forward chain-of-thought, which guides visual forecasting, with an inverse chain-of-thought, which is conditioned on the predicted image and leads to the action.
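The ordering of the reasoning stages can be made concrete with a minimal sketch. It is not the paper's implementation: `StubModel`, `Context`, `plan_step`, the segment names, the 4x4 image, and the 7-dimensional action are all hypothetical placeholders, chosen only to show that each stage is generated from one growing context.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    kind: str        # which stage produced this segment
    content: object  # text, image grid, or action vector


@dataclass
class Context:
    """A single growing context shared by every generation stage."""
    segments: List[Segment] = field(default_factory=list)

    def append(self, kind, content):
        self.segments.append(Segment(kind, content))
        return content

    def kinds(self):
        return [s.kind for s in self.segments]


class StubModel:
    """Hypothetical stand-in for the unified model; returns canned outputs."""

    def generate_text(self, ctx, role):
        return f"<{role} conditioned on {len(ctx.segments)} segments>"

    def generate_image(self, ctx):
        return [[0.0] * 4 for _ in range(4)]  # assumed 4x4 placeholder image

    def generate_action(self, ctx):
        return [0.0] * 7  # assumed 7-dimensional action, for illustration only


def plan_step(model, observation, instruction):
    ctx = Context()
    ctx.append("observation", observation)
    ctx.append("instruction", instruction)
    # Forward CoT: name the immediate subgoal.
    ctx.append("forward_cot", model.generate_text(ctx, "forward_cot"))
    # Visual forecasting: predict the next image, guided by the forward CoT.
    ctx.append("predicted_image", model.generate_image(ctx))
    # Inverse CoT: reason about how to reach the predicted image.
    ctx.append("inverse_cot", model.generate_text(ctx, "inverse_cot"))
    # Action: conditioned on everything generated so far.
    action = ctx.append("action", model.generate_action(ctx))
    return action, ctx
```

Because every stage reads the same `Context`, the inverse CoT sees the predicted image without any separate hand-off, which is the property the summary attributes to the unified design.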
Load-bearing premise
Manipulation planning naturally decomposes into visual-state prediction followed by inverse dynamics, and these two steps must be bridged inside a single unified autoregressive architecture that interleaves text and images.
What would settle it
A controlled comparison in which a model without the interleaved inverse CoT step or without unified autoregressive interleaving achieves equal performance on long-horizon tasks would falsify the necessity of the claimed decomposition.
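One way to operationalize "equal performance" in such a comparison is an equivalence check on task success rates. The sketch below is an assumption about how the check could be run, not a procedure from the paper: the function names, the normal-approximation interval, and the 5-point margin are all illustrative choices.

```python
from math import sqrt


def success_rate(outcomes):
    """Fraction of successful episodes; outcomes is a list of 0/1 values."""
    return sum(outcomes) / len(outcomes)


def rate_difference_interval(full, ablated, z=1.96):
    """Normal-approximation interval for success_rate(full) - success_rate(ablated)."""
    p1, p2 = success_rate(full), success_rate(ablated)
    se = sqrt(p1 * (1 - p1) / len(full) + p2 * (1 - p2) / len(ablated))
    diff = p1 - p2
    return diff - z * se, diff + z * se


def ablation_matches_full(full, ablated, margin=0.05):
    """True only when the whole interval lies inside [-margin, +margin]."""
    low, high = rate_difference_interval(full, ablated)
    return -margin <= low and high <= margin
```

Requiring the whole interval to fall inside the margin, rather than merely failing to find a difference, keeps a small or noisy evaluation from being read as evidence of equal performance.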
Most Vision-Language-Action (VLA) models map observations directly to actions without explicit reasoning, limiting their capacity for reasoning-intensive long-horizon tasks. To address this, existing approaches adopt Chain-of-Thought (CoT) reasoning to enable subgoal decomposition and spatial anticipation. However, those methods lack a unified architecture for effective cross-modal reasoning and fail to explicitly include inverse reasoning ability based on the target state. We argue that manipulation planning naturally decomposes into prediction, anticipating the next visual state, and inverse dynamics, inferring the actions to reach it. Bridging both requires a unified autoregressive architecture that interleaves textual and visual reasoning in a single generation process. We propose ThinkingVLA, a generative model that realizes this decomposition within a unified Mixture-of-Transformers architecture. ThinkingVLA consists of a forward CoT that identifies the immediate subgoal and guides the visual forecasting; the predicted image then serves as the target state, grounding an inverse CoT that reasons about spatial relationships and action intent based on the predicted image; and the final action is generated conditioned on this full reasoning context. Extensive experiments on simulation and real-world benchmarks demonstrate that ThinkingVLA consistently outperforms state-of-the-art baselines, with particularly large gains on long-horizon manipulation tasks.
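The generation order described in the abstract can be written as a chain-rule factorization. The notation is assumed here for illustration and does not come from the paper: $o_t$ is the current observation, $\ell$ the language instruction, $c^{f}$ the forward CoT, $\hat{o}_{t+1}$ the predicted image, $c^{i}$ the inverse CoT, and $a_t$ the action.

```latex
\begin{aligned}
p\bigl(c^{f}, \hat{o}_{t+1}, c^{i}, a_t \mid o_t, \ell\bigr)
  ={}& p\bigl(c^{f} \mid o_t, \ell\bigr) \\
  &\times p\bigl(\hat{o}_{t+1} \mid o_t, \ell, c^{f}\bigr) \\
  &\times p\bigl(c^{i} \mid o_t, \ell, c^{f}, \hat{o}_{t+1}\bigr) \\
  &\times p\bigl(a_t \mid o_t, \ell, c^{f}, \hat{o}_{t+1}, c^{i}\bigr)
\end{aligned}
```

The identity itself is only the chain rule; what the abstract adds is the claim that all four factors should be produced by one model in a single generation process, so that each factor conditions on the full preceding context.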
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ThinkingVLA, a generative Vision-Language-Action (VLA) model that realizes manipulation planning as a decomposition into visual-state prediction and inverse dynamics within a single unified autoregressive Mixture-of-Transformers architecture. The model performs forward CoT to identify subgoals and guide visual forecasting, predicts the next image as target state, conducts inverse CoT to reason about spatial relationships and actions from the predicted image, and generates the final action conditioned on the full context. It claims consistent outperformance over state-of-the-art baselines on simulation and real-world benchmarks, with particularly large gains on long-horizon tasks.
Significance. If the outperformance holds and is shown to stem from the interleaved reasoning rather than other factors, the work could advance VLA models for complex robotic manipulation by demonstrating the value of explicit forward and inverse reasoning in a unified generative process.
major comments (1)
- [Abstract] Abstract (argument paragraph): The assertion that manipulation planning 'naturally decomposes' into visual-state prediction followed by inverse dynamics and that bridging them 'requires' a unified autoregressive architecture interleaving text and images is presented as foundational, yet the experiments report outperformance without an ablation against a modular control (separate visual forecaster + separate inverse-dynamics policy) or a variant that removes interleaving while holding CoT, data scale, and capacity fixed. This leaves the architectural necessity claim load-bearing but untested.
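The controlled ablation the referee asks for can be specified as a small variant table plus a check that the held-fixed factors really are fixed. This is a hypothetical sketch: the `Variant` fields, the variant names, and the `"matched"` labels are placeholders and do not describe experiments reported in the paper.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Variant:
    name: str
    architecture: str  # "unified" or "modular"
    interleaved: bool  # text and image reasoning in one generation process
    uses_cot: bool
    data_scale: str    # placeholder label, not a reported number
    capacity: str      # placeholder label, not a reported number


# Factors the referee wants held constant across variants.
CONTROLLED = ("uses_cot", "data_scale", "capacity")

FULL = Variant("full", "unified", True, True, "matched", "matched")
MODULAR = Variant("modular_control", "modular", False, True, "matched", "matched")
NO_INTERLEAVE = Variant("no_interleaving", "unified", False, True, "matched", "matched")


def differing_fields(a, b):
    """Names of the fields (other than `name`) on which two variants differ."""
    da, db = asdict(a), asdict(b)
    return sorted(k for k in da if k != "name" and da[k] != db[k])


def is_controlled(reference, variant):
    """True when none of the CONTROLLED factors differ from the reference."""
    return not any(f in CONTROLLED for f in differing_fields(reference, variant))
```

A variant that fails `is_controlled` confounds the architectural change with a change in CoT, data, or capacity, so any gap to the full model could not be attributed to interleaving alone.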
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment point-by-point below.
Referee: [Abstract] Abstract (argument paragraph): The assertion that manipulation planning 'naturally decomposes' into visual-state prediction followed by inverse dynamics and that bridging them 'requires' a unified autoregressive architecture interleaving text and images is presented as foundational, yet the experiments report outperformance without an ablation against a modular control (separate visual forecaster + separate inverse-dynamics policy) or a variant that removes interleaving while holding CoT, data scale, and capacity fixed. This leaves the architectural necessity claim load-bearing but untested.
Authors: We appreciate the referee's observation that the abstract presents the decomposition and unified architecture as foundational without a direct ablation isolating the interleaving mechanism. The core motivation is that a single autoregressive Mixture-of-Transformers process enables the predicted image to directly ground the subsequent inverse CoT within the same token sequence, providing shared context that modular pipelines would require explicit bridging mechanisms to replicate. Our experiments demonstrate gains over existing VLA baselines that lack this unified forward-inverse structure, particularly on long-horizon tasks. Nevertheless, we acknowledge the absence of the suggested modular control ablation. We will revise the abstract to frame the unified interleaving as a design choice that facilitates the reasoning flow rather than a strict requirement, and add a dedicated paragraph in the discussion section analyzing why separate modules would likely incur information loss or require non-trivial additional components to achieve comparable cross-modal grounding.
revision: yes
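The "same token sequence" argument in the rebuttal can be illustrated with a toy attention mask. The sketch assumes a plain causal mask over one interleaved sequence; the segment names and lengths are made up, and the actual model may use a different masking scheme (for example, non-causal attention inside image blocks).

```python
def build_sequence(segment_lengths):
    """Flatten (name, length) pairs into a per-token list of segment names."""
    tokens = []
    for name, length in segment_lengths:
        tokens.extend([name] * length)
    return tokens


def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def visible_segments(tokens, mask, query_index):
    """Segment names visible to the token at query_index, in sequence order."""
    seen = []
    for j, name in enumerate(tokens):
        if mask[query_index][j] and name not in seen:
            seen.append(name)
    return seen


# Illustrative layout only; the lengths are arbitrary.
LAYOUT = [
    ("observation", 4),
    ("instruction", 3),
    ("forward_cot", 5),
    ("predicted_image", 4),
    ("inverse_cot", 5),
    ("action", 2),
]
```

With this layout, the first `inverse_cot` token can attend to every `predicted_image` token, whereas a modular pipeline would need an explicit interface to pass the forecast from one module to the next.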
Circularity Check
No circularity in derivation chain
The paper's central claim is presented as an argument ('We argue that manipulation planning naturally decomposes into prediction... and inverse dynamics... Bridging both requires a unified autoregressive architecture...') rather than a derivation from equations or first principles. No self-definitional reductions, fitted inputs renamed as predictions, load-bearing self-citations, imported uniqueness theorems, smuggled ansatzes, or renamed known results appear in the provided text. The architecture is a design choice justified by the argument and validated empirically; the derivation chain does not reduce to its own inputs by construction.