Recognition: 2 theorem links · Lean theorem
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
Pith reviewed 2026-05-16 05:17 UTC · model grok-4.3
The pith
Adding explicit visual chain-of-thought reasoning, by predicting future image frames before actions, improves vision-language-action model performance on complex robot tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce CoT-VLA, a 7B-parameter vision-language-action model that performs explicit visual chain-of-thought reasoning by autoregressively predicting future image frames as visual goals and then generating the corresponding action sequence to reach those goals. This structure supplies the temporal planning and intermediate reasoning steps missing from direct input-output VLAs, allowing the model to understand and emit both visual and action tokens in one unified process.
What carries the argument
Autoregressive future-image prediction that supplies visual goals before action generation, letting the model break complex manipulations into explicit visual steps.
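A minimal sketch of how this two-stage decoding could be wired, assuming a unified tokenizer and a standard autoregressive generate call; the function names, interfaces, and token budgets below are illustrative assumptions, not the paper's actual API.

```python
def visual_cot_step(model, tokenizer, observation, instruction,
                    goal_token_budget=256, action_token_budget=70):
    """One hypothetical visual chain-of-thought step: predict a visual goal, then actions."""
    # Encode the language instruction and current camera frame into a shared token prompt
    # (assumed interface; the real model's tokenization may differ).
    prompt = tokenizer.encode_text(instruction) + tokenizer.encode_image(observation)

    # Stage 1: autoregressively predict tokens of a future frame to serve as the visual goal.
    goal_tokens = model.generate(prompt, max_new_tokens=goal_token_budget)

    # Stage 2: decode a short action chunk conditioned on the prompt plus the predicted goal.
    action_tokens = model.generate(prompt + goal_tokens, max_new_tokens=action_token_budget)

    # Map discrete action tokens back to continuous robot commands (assumed detokenizer).
    return tokenizer.decode_actions(action_tokens)
```

The point of the sketch is the ordering: the visual goal is committed to before any action token is sampled, which is what separates this from direct input-output decoding.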
If this is right
- VLAs gain multi-step planning ability without extra supervision or separate planners.
- The unified model can jointly reason over visual and action tokens in a single autoregressive pass.
- Performance improvements appear on both simulation benchmarks and physical robot manipulation tasks.
- The method makes fuller use of pretrained vision-language knowledge for downstream control.
Where Pith is reading between the lines
- The same visual-goal prediction could serve as a lightweight internal simulator for testing action sequences before execution.
- Longer or conditional prediction chains might handle even more extended tasks once the basic visual CoT step is reliable.
- Because the visual predictions are generated from the same backbone used for language, the approach could transfer planning patterns learned from non-robot image data more directly to physical control.
Load-bearing premise
Predicting future image frames autoregressively will produce reliable visual goals that actually improve the quality of the actions generated afterward for complex manipulation.
What would settle it
A controlled ablation in which the image-prediction branch is removed or replaced with random frames and task success rates show no statistically significant drop on the same real-world or simulation suites.
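As a rough illustration of how such an ablation could be scored, the sketch below applies Fisher's exact test to success/failure counts for the full model versus a variant without the image-prediction branch; the counts are placeholders, not reported results.

```python
from scipy.stats import fisher_exact

def compare_success_rates(successes_full, trials_full, successes_ablated, trials_ablated):
    """Two-sided Fisher's exact test on success/failure counts for two model variants."""
    table = [
        [successes_full, trials_full - successes_full],
        [successes_ablated, trials_ablated - successes_ablated],
    ]
    odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
    return odds_ratio, p_value

# Placeholder counts for one task, 50 trials per variant (not the paper's numbers).
odds, p = compare_success_rates(38, 50, 27, 50)
print(f"odds ratio = {odds:.2f}, p = {p:.3f}")
```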
read the original abstract
Vision-language-action models (VLAs) have shown potential in leveraging pretrained vision-language models and diverse robot demonstrations for learning generalizable sensorimotor control. While this paradigm effectively utilizes large-scale data from both robotic and non-robotic sources, current VLAs primarily focus on direct input-output mappings, lacking the intermediate reasoning steps crucial for complex manipulation tasks. As a result, existing VLAs lack temporal planning or reasoning capabilities. In this paper, we introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs) by predicting future image frames autoregressively as visual goals before generating a short action sequence to achieve these goals. We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks. Project website: https://cot-vla.github.io/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CoT-VLA, a 7B-parameter vision-language-action model that augments standard VLA architectures with explicit visual chain-of-thought reasoning: future image frames are autoregressively predicted as visual goals, after which a short action sequence is decoded to reach those goals. The central claim is that this intermediate visual planning step supplies the temporal reasoning missing from direct input-output VLAs, yielding a 17% improvement over prior state-of-the-art VLAs on real-world manipulation tasks and a 6% improvement on simulation benchmarks.
Significance. If the reported gains can be shown to arise specifically from the visual-CoT branch rather than from capacity or training differences, the work would establish a concrete mechanism for adding interpretable, multi-step visual planning to large VLAs. This could influence subsequent architectures by demonstrating that explicit future-frame prediction is a viable and effective form of reasoning for sensorimotor control, particularly for long-horizon or contact-rich manipulation.
major comments (3)
- [Abstract] Abstract: the central performance claims (17% real-world, 6% simulation) are stated without any accompanying experimental protocol, baseline specifications, trial counts, error bars, or statistical tests. Because these numbers constitute the primary evidence for the value of visual CoT, their lack of supporting detail renders the central claim unverifiable from the provided text.
- [Experiments] Experimental section (inferred from abstract claims): no ablation is described that holds model size (7B), training data, and tokenization fixed while removing the autoregressive future-frame prediction branch. Without this control, the observed deltas cannot be attributed to the visual-CoT mechanism rather than to incidental differences in capacity or optimization.
- [Abstract] Abstract / method description: the premise that autoregressively generated future frames constitute reliable visual goals is not accompanied by any quantitative fidelity metrics (pixel-level error, perceptual similarity, or short-horizon prediction accuracy) against ground-truth future observations. Such metrics are required to confirm that the generated frames remain informative over the horizon used for action decoding.
minor comments (1)
- [Abstract] The abstract mentions a project website but provides no link to code, checkpoints, or evaluation protocols, which would be needed for reproducibility of the reported benchmark numbers.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and describe the revisions we plan to make.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central performance claims (17% real-world, 6% simulation) are stated without any accompanying experimental protocol, baseline specifications, trial counts, error bars, or statistical tests. Because these numbers constitute the primary evidence for the value of visual CoT, their lack of supporting detail renders the central claim unverifiable from the provided text.
Authors: We appreciate the referee's concern regarding the verifiability of the performance claims. The full details of the experimental protocol, including baselines, number of trials, and error bars, are provided in the Experiments section and associated tables of the manuscript. To address this, we will revise the abstract to include a brief summary of the evaluation setup, such as the tasks evaluated and the number of trials, along with a note on the statistical significance of the results. revision: yes
-
Referee: [Experiments] Experimental section (inferred from abstract claims): no ablation is described that holds model size (7B), training data, and tokenization fixed while removing the autoregressive future-frame prediction branch. Without this control, the observed deltas cannot be attributed to the visual-CoT mechanism rather than to incidental differences in capacity or optimization.
Authors: We agree that a controlled ablation isolating the visual CoT component is important for attributing the performance gains. The current manuscript compares against prior VLAs of similar scale, but we will add a new ablation experiment training a 7B model without the autoregressive future-frame prediction branch, keeping all other factors fixed. The results of this ablation will be included in the revised version. revision: yes
-
Referee: [Abstract] Abstract / method description: the premise that autoregressively generated future frames constitute reliable visual goals is not accompanied by any quantitative fidelity metrics (pixel-level error, perceptual similarity, or short-horizon prediction accuracy) against ground-truth future observations. Such metrics are required to confirm that the generated frames remain informative over the horizon used for action decoding.
Authors: This is a good point. While the method section describes the autoregressive prediction, we will add quantitative fidelity metrics for the generated frames, such as pixel-level error and perceptual similarity scores compared to ground-truth, to demonstrate their reliability over the relevant horizons. These metrics will be added to the method description and referenced in the abstract if space permits. revision: yes
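A minimal sketch of the kind of frame-fidelity check the rebuttal promises, assuming predicted and ground-truth frames are available as uint8 RGB arrays; mean absolute error and PSNR are used here only as illustrative metrics, and a perceptual score such as LPIPS would require a pretrained network that this sketch omits.

```python
import numpy as np

def frame_fidelity(predicted, ground_truth):
    """Pixel-level fidelity between a predicted future frame and the observed one.

    Both inputs are H x W x 3 uint8 arrays; PSNR uses a peak value of 255.
    """
    pred = predicted.astype(np.float64)
    target = ground_truth.astype(np.float64)
    mae = np.abs(pred - target).mean()
    mse = ((pred - target) ** 2).mean()
    psnr = float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)
    return {"mae": mae, "psnr_db": psnr}
```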
Circularity Check
No significant circularity: empirical performance claims rest on external benchmarks without self-referential reductions
full rationale
The paper describes an architectural extension to VLAs that inserts autoregressive future-frame prediction as an explicit visual CoT step before action decoding. All reported gains (17% real-world, 6% simulation) are presented as outcomes of comparative experiments against prior VLA models on fixed benchmarks. No equations, fitted parameters, or self-citations are shown that would make the performance delta equivalent to the input data or to a prior result by construction. The central premise is therefore an empirical claim whose validity can be checked against external test sets rather than a derivation that collapses to its own definitions.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
Foundation/HierarchyEmergence · hierarchy_emergence_forces_phi (tagged: unclear)
The relation between the paper passage and the cited Recognition theorem is unclear.
CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
-
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
-
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...
-
UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models
UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.
-
DreamGen: Unlocking Generalization in Robot Learning through Video World Models
DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperatio...
-
FrameSkip: Learning from Fewer but More Informative Frames in VLA Training
FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.
-
Do multimodal models imagine electric sheep?
Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.
-
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.
-
VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation
VADF adds an Adaptive Loss Network for hard-negative training sampling and a Hierarchical Vision Task Segmenter for adaptive noise scheduling during inference to speed convergence and reduce timeouts in diffusion robo...
-
A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model
A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.
-
Fast-WAM: Do World Action Models Need Test-time Future Imagination?
Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.
-
World Action Models are Zero-shot Policies
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
-
mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs
mimic-video combines internet video pretraining with a flow-matching decoder to achieve state-of-the-art robotic manipulation performance with 10x better sample efficiency than vision-language-action models.
-
villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models
villa-X enhances latent action modeling in VLA models to support zero-shot action planning for unseen robot embodiments and open-vocabulary instructions, yielding better manipulation results in simulation and real-wor...
-
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning
AutoVLA unifies semantic reasoning and trajectory planning in one autoregressive VLA model for end-to-end autonomous driving by tokenizing trajectories into discrete actions and using GRPO reinforcement fine-tuning to...
-
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...
-
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.
-
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...
-
ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning
ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.
-
DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.