Recognition: 2 theorem links · Lean theorem
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
Pith reviewed 2026-05-16 05:17 UTC · model grok-4.3
The pith
Adding explicit visual chain-of-thought reasoning, by predicting future image frames before actions, improves vision-language-action model performance on complex robot tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce CoT-VLA, a 7B-parameter vision-language-action model that performs explicit visual chain-of-thought reasoning by autoregressively predicting future image frames as visual goals and then generating the corresponding action sequence to reach those goals. This structure supplies the temporal planning and intermediate reasoning steps missing from direct input-output VLAs, allowing the model to understand and emit both visual and action tokens in one unified process.
What carries the argument
Autoregressive future-image prediction that supplies visual goals before action generation, letting the model break complex manipulations into explicit visual steps.
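A minimal sketch of how this two-stage decoding could be wired, assuming a unified tokenizer and a standard autoregressive generate call; the function names, interfaces, and token budgets below are illustrative assumptions, not the paper's actual API.

```python
def visual_cot_step(model, tokenizer, observation, instruction,
                    goal_token_budget=256, action_token_budget=70):
    """One hypothetical visual chain-of-thought step: predict a visual goal, then actions."""
    # Encode the language instruction and current camera frame into a shared token prompt
    # (assumed interface; the real model's tokenization may differ).
    prompt = tokenizer.encode_text(instruction) + tokenizer.encode_image(observation)

    # Stage 1: autoregressively predict tokens of a future frame to serve as the visual goal.
    goal_tokens = model.generate(prompt, max_new_tokens=goal_token_budget)

    # Stage 2: decode a short action chunk conditioned on the prompt plus the predicted goal.
    action_tokens = model.generate(prompt + goal_tokens, max_new_tokens=action_token_budget)

    # Map discrete action tokens back to continuous robot commands (assumed detokenizer).
    return tokenizer.decode_actions(action_tokens)
```

The point of the sketch is the ordering: the visual goal is committed to before any action token is sampled, which is what separates this from direct input-output decoding.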
If this is right
- VLAs gain multi-step planning ability without extra supervision or separate planners.
- The unified model can jointly reason over visual and action tokens in a single autoregressive pass.
- Performance improvements appear on both simulation benchmarks and physical robot manipulation tasks.
- The method makes fuller use of pretrained vision-language knowledge for downstream control.
Where Pith is reading between the lines
- The same visual-goal prediction could serve as a lightweight internal simulator for testing action sequences before execution.
- Longer or conditional prediction chains might handle even more extended tasks once the basic visual CoT step is reliable.
- Because the visual predictions are generated from the same backbone used for language, the approach could transfer planning patterns learned from non-robot image data more directly to physical control.
Load-bearing premise
Predicting future image frames autoregressively will produce reliable visual goals that actually improve the quality of the actions generated afterward for complex manipulation.
What would settle it
A controlled ablation in which the image-prediction branch is removed or replaced with random frames and task success rates show no statistically significant drop on the same real-world or simulation suites.
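As a rough illustration of how such an ablation could be scored, the sketch below applies Fisher's exact test to success/failure counts for the full model versus a variant without the image-prediction branch; the counts are placeholders, not reported results.

```python
from scipy.stats import fisher_exact

def compare_success_rates(successes_full, trials_full, successes_ablated, trials_ablated):
    """Two-sided Fisher's exact test on success/failure counts for two model variants."""
    table = [
        [successes_full, trials_full - successes_full],
        [successes_ablated, trials_ablated - successes_ablated],
    ]
    odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
    return odds_ratio, p_value

# Placeholder counts for one task, 50 trials per variant (not the paper's numbers).
odds, p = compare_success_rates(38, 50, 27, 50)
print(f"odds ratio = {odds:.2f}, p = {p:.3f}")
```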
read the original abstract
Vision-language-action models (VLAs) have shown potential in leveraging pretrained vision-language models and diverse robot demonstrations for learning generalizable sensorimotor control. While this paradigm effectively utilizes large-scale data from both robotic and non-robotic sources, current VLAs primarily focus on direct input-output mappings, lacking the intermediate reasoning steps crucial for complex manipulation tasks. As a result, existing VLAs lack temporal planning or reasoning capabilities. In this paper, we introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs) by predicting future image frames autoregressively as visual goals before generating a short action sequence to achieve these goals. We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks. Project website: https://cot-vla.github.io/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CoT-VLA, a 7B-parameter vision-language-action model that augments standard VLA architectures with explicit visual chain-of-thought reasoning: future image frames are autoregressively predicted as visual goals, after which a short action sequence is decoded to reach those goals. The central claim is that this intermediate visual planning step supplies the temporal reasoning missing from direct input-output VLAs, yielding a 17% improvement over prior state-of-the-art VLAs on real-world manipulation tasks and a 6% improvement on simulation benchmarks.
Significance. If the reported gains can be shown to arise specifically from the visual-CoT branch rather than from capacity or training differences, the work would establish a concrete mechanism for adding interpretable, multi-step visual planning to large VLAs. This could influence subsequent architectures by demonstrating that explicit future-frame prediction is a viable and effective form of reasoning for sensorimotor control, particularly for long-horizon or contact-rich manipulation.
major comments (3)
- [Abstract] Abstract: the central performance claims (17% real-world, 6% simulation) are stated without any accompanying experimental protocol, baseline specifications, trial counts, error bars, or statistical tests. Because these numbers constitute the primary evidence for the value of visual CoT, their lack of supporting detail renders the central claim unverifiable from the provided text.
- [Experiments] Experimental section (inferred from abstract claims): no ablation is described that holds model size (7B), training data, and tokenization fixed while removing the autoregressive future-frame prediction branch. Without this control, the observed deltas cannot be attributed to the visual-CoT mechanism rather than to incidental differences in capacity or optimization.
- [Abstract] Abstract / method description: the premise that autoregressively generated future frames constitute reliable visual goals is not accompanied by any quantitative fidelity metrics (pixel-level error, perceptual similarity, or short-horizon prediction accuracy) against ground-truth future observations. Such metrics are required to confirm that the generated frames remain informative over the horizon used for action decoding.
minor comments (1)
- [Abstract] The abstract mentions a project website but provides no link to code, checkpoints, or evaluation protocols, which would be needed for reproducibility of the reported benchmark numbers.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and describe the revisions we plan to make.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central performance claims (17% real-world, 6% simulation) are stated without any accompanying experimental protocol, baseline specifications, trial counts, error bars, or statistical tests. Because these numbers constitute the primary evidence for the value of visual CoT, their lack of supporting detail renders the central claim unverifiable from the provided text.
Authors: We appreciate the referee's concern regarding the verifiability of the performance claims. The full details of the experimental protocol, including baselines, number of trials, and error bars, are provided in the Experiments section and associated tables of the manuscript. To address this, we will revise the abstract to include a brief summary of the evaluation setup, such as the tasks evaluated and the number of trials, along with a note on the statistical significance of the results. revision: yes
-
Referee: [Experiments] Experimental section (inferred from abstract claims): no ablation is described that holds model size (7B), training data, and tokenization fixed while removing the autoregressive future-frame prediction branch. Without this control, the observed deltas cannot be attributed to the visual-CoT mechanism rather than to incidental differences in capacity or optimization.
Authors: We agree that a controlled ablation isolating the visual CoT component is important for attributing the performance gains. The current manuscript compares against prior VLAs of similar scale, but we will add a new ablation experiment training a 7B model without the autoregressive future-frame prediction branch, keeping all other factors fixed. The results of this ablation will be included in the revised version. revision: yes
-
Referee: [Abstract] Abstract / method description: the premise that autoregressively generated future frames constitute reliable visual goals is not accompanied by any quantitative fidelity metrics (pixel-level error, perceptual similarity, or short-horizon prediction accuracy) against ground-truth future observations. Such metrics are required to confirm that the generated frames remain informative over the horizon used for action decoding.
Authors: This is a good point. While the method section describes the autoregressive prediction, we will add quantitative fidelity metrics for the generated frames, such as pixel-level error and perceptual similarity scores compared to ground-truth, to demonstrate their reliability over the relevant horizons. These metrics will be added to the method description and referenced in the abstract if space permits. revision: yes
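A minimal sketch of the kind of frame-fidelity check the rebuttal promises, assuming predicted and ground-truth frames are available as uint8 RGB arrays; mean absolute error and PSNR are used here only as illustrative metrics, and a perceptual score such as LPIPS would require a pretrained network that this sketch omits.

```python
import numpy as np

def frame_fidelity(predicted, ground_truth):
    """Pixel-level fidelity between a predicted future frame and the observed one.

    Both inputs are H x W x 3 uint8 arrays; PSNR uses a peak value of 255.
    """
    pred = predicted.astype(np.float64)
    target = ground_truth.astype(np.float64)
    mae = np.abs(pred - target).mean()
    mse = ((pred - target) ** 2).mean()
    psnr = float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)
    return {"mae": mae, "psnr_db": psnr}
```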
Circularity Check
No significant circularity: empirical performance claims rest on external benchmarks without self-referential reductions
full rationale
The paper describes an architectural extension to VLAs that inserts autoregressive future-frame prediction as an explicit visual CoT step before action decoding. All reported gains (17% real-world, 6% simulation) are presented as outcomes of comparative experiments against prior VLA models on fixed benchmarks. No equations, fitted parameters, or self-citations are shown that would make the performance delta equivalent to the input data or to a prior result by construction. The central premise is therefore an empirical claim whose validity can be checked against external test sets rather than a derivation that collapses to its own definitions.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
Foundation/HierarchyEmergence · hierarchy_emergence_forces_phi (tagged: unclear)
The relation between the paper passage and the cited Recognition theorem is unclear.
CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
-
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
-
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...
-
UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models
UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.
-
DreamGen: Unlocking Generalization in Robot Learning through Video World Models
DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperatio...
-
FrameSkip: Learning from Fewer but More Informative Frames in VLA Training
FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.
-
Do multimodal models imagine electric sheep?
Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.
-
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.
-
VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation
VADF adds an Adaptive Loss Network for hard-negative training sampling and a Hierarchical Vision Task Segmenter for adaptive noise scheduling during inference to speed convergence and reduce timeouts in diffusion robo...
-
A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model
A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.
-
Fast-WAM: Do World Action Models Need Test-time Future Imagination?
Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.
-
World Action Models are Zero-shot Policies
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
-
mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs
mimic-video combines internet video pretraining with a flow-matching decoder to achieve state-of-the-art robotic manipulation performance with 10x better sample efficiency than vision-language-action models.
-
villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models
villa-X enhances latent action modeling in VLA models to support zero-shot action planning for unseen robot embodiments and open-vocabulary instructions, yielding better manipulation results in simulation and real-wor...
-
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning
AutoVLA unifies semantic reasoning and trajectory planning in one autoregressive VLA model for end-to-end autonomous driving by tokenizing trajectories into discrete actions and using GRPO reinforcement fine-tuning to...
-
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...
-
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.
-
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...
-
ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning
ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.
-
DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.