arxiv: 2407.08693 · v3 · submitted 2024-07-11 · 💻 cs.RO · cs.LG

Robotic Control via Embodied Chain-of-Thought Reasoning

Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, Sergey Levine

Pith reviewed 2026-05-15 08:08 UTC · model grok-4.3

keywords: robotics, vision-language-action models, chain-of-thought reasoning, generalization, embodied AI, policy learning

The pith

Embodied chain-of-thought reasoning trains VLAs to output grounded plans and visual features before acting, raising OpenVLA's absolute success rate by 28 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard vision-language-action models struggle to generalize because they lack the ability to reason iteratively about tasks in a way that stays grounded in robot observations. It introduces Embodied Chain-of-Thought Reasoning, which requires the model to generate multiple explicit steps covering task plans, sub-tasks, motions, and visual details such as object boxes and end-effector locations before producing an action. A scalable synthetic data pipeline creates these traces from existing robot datasets without new collection. When applied to OpenVLA, this yields a 28 percent absolute gain in success rate on challenging generalization benchmarks and makes policy errors easier for humans to interpret and fix with language.

Core claim

Training VLAs to perform multiple steps of reasoning about plans, sub-tasks, motions, and visually grounded features like object bounding boxes and end-effector positions before predicting the robot action, using a synthetic data generation pipeline on large existing robot datasets, increases the absolute success rate of OpenVLA by 28 percent across challenging generalization tasks without any additional robot training data.

What carries the argument

Embodied Chain-of-Thought (ECoT) reasoning, the requirement that the model first produce intermediate traces on plans, sub-tasks, motions, and sensory-grounded features before the action output.
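
To make the "reasoning first, action last" ordering concrete, here is a minimal sketch of one training target. It is an illustration under stated assumptions, not the paper's format: the class `ECoTStep`, the field labels (`PLAN:`, `SUBTASK:`, `MOVE:`, and so on), the box layout, and all example values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

BBox = Tuple[int, int, int, int]  # assumed layout: (x_min, y_min, x_max, y_max) in pixels


@dataclass
class ECoTStep:
    """One hypothetical training target: reasoning fields first, action last."""
    plan: str
    subtask: str
    motion: str
    objects: Dict[str, BBox]      # object name -> bounding box
    gripper_xy: Tuple[int, int]   # end-effector position in the image
    action: List[float]           # low-level robot action

    def to_text(self) -> str:
        """Serialize so that every reasoning field precedes the action."""
        boxes = "; ".join(f"{name} {list(box)}" for name, box in self.objects.items())
        parts = [
            f"PLAN: {self.plan}",
            f"SUBTASK: {self.subtask}",
            f"MOVE: {self.motion}",
            f"VISIBLE OBJECTS: {boxes}",
            f"GRIPPER POSITION: {list(self.gripper_xy)}",
            f"ACTION: {self.action}",
        ]
        return "\n".join(parts)


# Illustrative values only; none of these come from the paper.
step = ECoTStep(
    plan="pick up the cup, then place it on the plate",
    subtask="pick up the cup",
    motion="move left and down",
    objects={"cup": (40, 60, 90, 120), "plate": (150, 80, 230, 140)},
    gripper_xy=(110, 45),
    action=[-0.02, 0.0, -0.01, 0.0, 0.0, 0.0, 1.0],
)
text = step.to_text()
print(text)
```

Putting the action on the final line means an autoregressive model has to produce the plan and grounding fields before it can emit the action, which is the property the argument relies on.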

If this is right

VLAs reach higher success rates on tasks outside their original training distribution.
Humans can more easily diagnose and correct policy failures through natural language feedback.
No new real-robot data collection is required beyond the datasets already used for the base VLA.
Reasoning stays tied to actual sensory observations and robot state rather than remaining purely abstract.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same multi-step grounded reasoning format could be tested on other large-model robot policies that currently lack explicit planning stages.
Gains might change if ECoT is applied to closed-source or much larger VLAs that already possess stronger internal reasoning.
Reducing reliance on real-world data collection could accelerate deployment in new environments once the synthetic pipeline is adapted.

Load-bearing premise

The synthetic data pipeline must generate reasoning traces that are accurate enough to supervise the model and diverse enough to improve generalization instead of causing overfitting to generation rules.

What would settle it

Apply the ECoT training procedure to OpenVLA using synthetic traces known to contain systematic inaccuracies in motion or grounding descriptions and measure whether the 28 percent success gain disappears on the same generalization tasks.
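
The proposed check could start from a controlled corruption of the grounding fields. The following is a minimal sketch, assuming traces store boxes as (x_min, y_min, x_max, y_max) pixel tuples; the function `shift_boxes`, the offsets, and the image size are hypothetical and not part of the paper's pipeline.

```python
from typing import Dict, Tuple

BBox = Tuple[int, int, int, int]  # assumed layout: (x_min, y_min, x_max, y_max) in pixels


def shift_boxes(objects: Dict[str, BBox], dx: int, dy: int,
                width: int, height: int) -> Dict[str, BBox]:
    """Apply the same pixel offset to every box, clipped to the image bounds."""
    def clip(value: int, upper: int) -> int:
        return max(0, min(value, upper))

    return {
        name: (clip(x0 + dx, width), clip(y0 + dy, height),
               clip(x1 + dx, width), clip(y1 + dy, height))
        for name, (x0, y0, x1, y1) in objects.items()
    }


# Illustrative boxes only; a systematic offset makes every grounding label wrong
# in the same direction, which is the kind of inaccuracy the test calls for.
clean = {"cup": (40, 60, 90, 120), "plate": (150, 80, 230, 140)}
corrupted = shift_boxes(clean, dx=30, dy=-70, width=256, height=256)
print(corrupted)
```

Training one model on `clean`-style traces and another on `corrupted`-style traces, then comparing both on the same generalization tasks, would be one way to run the comparison described above.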

Original abstract

A key limitation of learned robot control policies is their inability to generalize outside their training data. Recent works on vision-language-action models (VLAs) have shown that the use of large, internet pre-trained vision-language models as the backbone of learned robot policies can substantially improve their robustness and generalization ability. Yet, one of the most exciting capabilities of large vision-language models in other domains is their ability to reason iteratively through complex problems. Can that same capability be brought into robotics to allow policies to improve performance by reasoning about a given task before acting? Naive use of "chain-of-thought" (CoT) style prompting is significantly less effective with standard VLAs because of the relatively simple training examples that are available to them. Additionally, purely semantic reasoning about sub-tasks, as is common in regular CoT, is insufficient for robot policies that need to ground their reasoning in sensory observations and the robot state. To this end, we introduce Embodied Chain-of-Thought Reasoning (ECoT) for VLAs, in which we train VLAs to perform multiple steps of reasoning about plans, sub-tasks, motions, and visually grounded features like object bounding boxes and end effector positions, before predicting the robot action. We design a scalable pipeline for generating synthetic training data for ECoT on large robot datasets. We demonstrate that ECoT increases the absolute success rate of OpenVLA, the current strongest open-source VLA policy, by 28% across challenging generalization tasks, without any additional robot training data. Additionally, ECoT makes it easier for humans to interpret a policy's failures and correct its behavior using natural language.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ECoT lifts OpenVLA by 28% on generalization tasks through synthetic embodied reasoning traces, but the pipeline lacks clear accuracy checks.

The main takeaway is that training VLAs to emit explicit embodied chain-of-thought steps (plans, sub-tasks, motions, bounding boxes, and end-effector positions) before the action raises OpenVLA success rates by 28% absolute on held-out generalization tasks. The authors generate the extra supervision synthetically from existing robot datasets, so no new physical data is collected. This is a direct response to the fact that standard semantic CoT prompting falls flat on the relatively simple examples in robot training data. By forcing the model to ground its reasoning in visual features and robot state, the approach makes the intermediate steps more relevant to control.

The interpretability benefit is real too: the traces let humans read where the policy went wrong and correct it with language instructions. Building on the current strongest open-source VLA keeps the comparison straightforward and the method easy to try. The synthetic pipeline is the practical enabler here, turning large existing datasets into multi-step supervision without extra hardware.

That said, the validation of those generated traces is light. The abstract gives no numbers on human agreement, bounding-box error, motion feasibility, or diversity metrics, so it is hard to know whether the 28% reflects genuine reasoning gains or artifacts from the generation heuristics. Ablations that isolate each reasoning component would also help separate the effect of embodied reasoning from that of simple extra supervision.

This work is aimed at researchers building or extending VLAs for robotics, especially those focused on generalization without scaling data collection. It deserves peer review because the empirical delta is large and the method is concrete enough to test and extend, though reviewers should press for tighter evidence on trace quality.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Embodied Chain-of-Thought (ECoT) reasoning for vision-language-action (VLA) policies. Standard VLAs are augmented to generate multi-step reasoning traces covering plans, sub-tasks, motions, and grounded features (object bounding boxes, end-effector positions) before predicting actions. A scalable synthetic pipeline produces these traces from existing robot datasets. The central result is that fine-tuning OpenVLA on ECoT data yields a 28% absolute success-rate gain on challenging generalization tasks with no new robot demonstrations; the approach also improves human interpretability of failures.

Significance. If the empirical gains are robust, the work would demonstrate a practical route to stronger generalization in VLAs by embedding structured, embodied reasoning without extra data collection. The synthetic pipeline's efficiency is a clear strength. The result could influence VLA design toward more interpretable policies, provided the gains are shown to arise from the reasoning mechanism rather than pipeline artifacts.

major comments (3)

[Experimental Evaluation] The 28% absolute improvement is reported without error bars, statistical tests, or explicit task definitions and success criteria (experimental section). This information is load-bearing for the generalization claim and must be supplied to allow assessment of robustness.
[Methods / Synthetic Data Pipeline] The synthetic data pipeline (Methods) lacks any quantitative validation of trace accuracy or diversity, such as human agreement rates on plans/sub-tasks or grounding error statistics for bounding boxes and positions. Because the performance delta depends entirely on these traces, the absence of such checks is a central concern.
[Ablation Studies] No ablations isolate the contribution of individual ECoT components (e.g., semantic plans versus motion details versus visual grounding). Without them it is impossible to confirm that iterative embodied reasoning, rather than simply richer supervision, produces the observed gain.
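
The error bars requested in the first major comment could be computed, for example, with a Wilson score interval on each success rate. The sketch below is a minimal illustration using only the Python standard library. The function name `wilson_interval` is an illustrative choice, and the trial counts are made up so that the gap is 28 points; they are not numbers from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical counts: 20/50 = 40% versus 34/50 = 68%, a 28-point absolute gap.
baseline = wilson_interval(successes=20, trials=50)
ecot = wilson_interval(successes=34, trials=50)
print("baseline 95% interval:", baseline)
print("ECoT 95% interval:", ecot)
```

With few trials per condition the intervals are wide, which is why the report asks for trial counts and statistical tests alongside the headline number.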

minor comments (2)

[Model Architecture] Clarify the precise tokenization and separation of reasoning steps from the final action prediction in the model output format.
[Related Work] Add a short discussion of how ECoT relates to recent chain-of-thought work in multimodal models outside robotics.
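
The separation asked for in the first minor comment can be pictured as a post-processing step on the decoded output. This is a minimal sketch under a strong assumption: that the model's decoded text ends with a marker such as `ACTION:` followed by decimal values. The marker, the function name, and the example string are hypothetical; real VLAs may encode actions as special tokens rather than decimal text.

```python
from typing import List, Tuple


def split_reasoning_and_action(generated: str,
                               marker: str = "ACTION:") -> Tuple[str, List[float]]:
    """Split decoded text at the last marker into (reasoning text, action values)."""
    reasoning, sep, tail = generated.rpartition(marker)
    if not sep:
        raise ValueError(f"marker {marker!r} not found in model output")
    # Strip list punctuation so the remaining tokens parse as floats.
    cleaned = tail.replace("[", " ").replace("]", " ").replace(",", " ")
    action = [float(tok) for tok in cleaned.split()]
    return reasoning.strip(), action


# Illustrative decoded output, not taken from the paper.
decoded = "PLAN: pick up the cup\nMOVE: move left\nACTION: [0.1, -0.2, 0.0]"
reasoning, action = split_reasoning_and_action(decoded)
print(reasoning)
print(action)
```

Splitting at the last occurrence of the marker keeps the parse stable even if the word appears earlier inside the reasoning text.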

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the empirical claims.

Referee: [Experimental Evaluation] The 28% absolute improvement is reported without error bars, statistical tests, or explicit task definitions and success criteria (experimental section). This information is load-bearing for the generalization claim and must be supplied to allow assessment of robustness.

Authors: We agree that error bars, statistical tests, and explicit task definitions are essential for assessing robustness. In the revised manuscript we will add error bars computed over multiple random seeds, report p-values from paired statistical tests, and provide precise task definitions together with success criteria in the Experimental Evaluation section.
revision: yes

Referee: [Methods / Synthetic Data Pipeline] The synthetic data pipeline (Methods) lacks any quantitative validation of trace accuracy or diversity, such as human agreement rates on plans/sub-tasks or grounding error statistics for bounding boxes and positions. Because the performance delta depends entirely on these traces, the absence of such checks is a central concern.

Authors: We acknowledge the need to validate trace quality. The revised Methods section will include human agreement rates on generated plans and sub-tasks as well as quantitative error statistics for bounding-box and end-effector grounding accuracy.
revision: yes

Referee: [Ablation Studies] No ablations isolate the contribution of individual ECoT components (e.g., semantic plans versus motion details versus visual grounding). Without them it is impossible to confirm that iterative embodied reasoning, rather than simply richer supervision, produces the observed gain.

Authors: We agree that component-wise ablations are required. The revision will add ablation experiments that systematically remove or alter individual ECoT elements (semantic plans, motion details, visual grounding) to isolate their contributions and confirm that the full embodied reasoning chain drives the gains.
revision: yes
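
The grounding error statistics promised in the second response are often summarized with intersection-over-union (IoU) between a generated box and a human-labeled box. The following is a minimal sketch, assuming axis-aligned boxes in (x_min, y_min, x_max, y_max) format; the box values are illustrative, and the metric the authors would actually report is not specified in this review.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)


def iou(a: BBox, b: BBox) -> float:
    """Intersection-over-union of two axis-aligned boxes; 0.0 if they do not overlap."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


generated = (40, 60, 90, 120)   # hypothetical box from a synthetic trace
labeled = (45, 65, 95, 125)     # hypothetical human-labeled box
print(f"IoU = {iou(generated, labeled):.3f}")
```

Averaging this score over a labeled sample of traces, or reporting the fraction above a chosen threshold, would give the kind of quantitative check the referee asks for.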

Circularity Check

0 steps flagged

No circularity: central claim is an external empirical delta on held-out tasks

The paper reports an empirical result obtained by fine-tuning OpenVLA on synthetically generated ECoT traces and measuring absolute success-rate improvement on challenging generalization tasks. No equations, fitted parameters, or self-citations are invoked to derive the performance number; the 28% gain is presented as a measured outcome on data external to the training pipeline. The synthetic-data generation step is described as a scalable procedure applied to existing robot datasets, but the evaluation remains independent and does not reduce to any internal fit or self-definition. Consequently the derivation chain contains no load-bearing circular step.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the assumption that synthetic reasoning traces generated from existing robot trajectories are sufficiently high-quality and diverse to supervise useful intermediate representations; no explicit free parameters or invented physical entities are introduced beyond standard VLA training.

axioms (1)

Domain assumption: Synthetic reasoning traces generated by the pipeline are accurate and useful for supervision.
The pipeline is presented as the key enabler; its correctness is not independently verified in the abstract.

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
cs.RO · 2026-05 · unverdicted · novelty 7.0

VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...

MolmoAct2: Action Reasoning Models for Real-world Deployment
cs.RO · 2026-05 · unverdicted · novelty 7.0

MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.

CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation
cs.RO · 2026-05 · unverdicted · novelty 7.0

CoRAL lets LLMs act as adaptive cost designers for motion planners while using VLM priors and online identification to handle unknown physics, achieving over 50% higher success rates than baselines in unseen contact-r...

Thinking in Text and Images: Interleaved Vision-Language Reasoning Traces for Long-Horizon Robot Manipulation
cs.AI · 2026-05 · unverdicted · novelty 7.0

A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.

Being-H0.7: A Latent World-Action Model from Egocentric Videos
cs.RO · 2026-04 · unverdicted · novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models
cs.RO · 2026-04 · unverdicted · novelty 7.0

CodeGraphVLP uses a semantic-graph state and executable code planner to enable reliable long-horizon non-Markovian robot manipulation, improving task success and lowering latency over standard VLA baselines.

π₀.₇: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
cs.LG · 2026-04 · unverdicted · novelty 7.0

π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

MolmoAct2: Action Reasoning Models for Real-world Deployment
cs.RO · 2026-05 · unverdicted · novelty 6.0

MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...

VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model
cs.RO · 2026-05 · unverdicted · novelty 6.0

VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.

Sentinel-VLA: A Metacognitive VLA Model with Active Status Monitoring for Dynamic Reasoning and Error Recovery
cs.RO · 2026-05 · unverdicted · novelty 6.0

Sentinel-VLA introduces a metacognitive VLA model with a sentinel module for real-time status monitoring, dynamic reasoning, and error recovery, plus a self-evolving continual learning method, raising real-world task ...

LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
cs.RO · 2026-04 · unverdicted · novelty 6.0

LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.

LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
cs.RO · 2026-04 · unverdicted · novelty 6.0

LaST-R1 introduces an RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

GazeVLA: Learning Human Intention for Robotic Manipulation
cs.RO · 2026-04 · unverdicted · novelty 6.0

GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.

One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models
cs.LG · 2026-04 · unverdicted · novelty 6.0

Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies
cs.RO · 2026-04 · unverdicted · novelty 6.0

Sim-and-real co-training for robot policies is driven primarily by balanced cross-domain representation alignment and secondarily by domain-dependent action reweighting.

A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model
cs.RO · 2026-04 · unverdicted · novelty 6.0

A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.

ThermoAct: Thermal-Aware Vision-Language-Action Models for Robotic Perception and Decision-Making
cs.RO · 2026-03 · unverdicted · novelty 6.0

ThermoAct integrates thermal imaging into VLA models via a VLM planner to enable robots to perceive physical properties like heat and improve safety over vision-only systems.

InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
cs.RO · 2025-10 · unverdicted · novelty 6.0

InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.

SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
cs.RO · 2025-09 · conditional · novelty 6.0

SimpleVLA-RL applies tailored reinforcement learning to VLA models, reaching SoTA on LIBERO, outperforming π₀ on RoboTwin, and surpassing SFT in real-world tasks while reducing data needs and identifying a 'pushcut' p...

VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
cs.RO · 2026-05 · unverdicted · novelty 5.0

VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...

CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation
cs.RO · 2026-05 · unverdicted · novelty 5.0

CoRAL lets LLMs design objective functions for robot motion planners and uses vision-language models plus real-time identification to adapt to unknown physical properties, raising success rates by over 50 percent on n...

ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning
cs.RO · 2026-04 · unverdicted · novelty 5.0

ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.

CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment
cs.RO · 2026-04 · unverdicted · novelty 5.0

CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer ...

Hierarchical Prompting with Dual LLM Modules for Robotic Task and Motion Planning
cs.RO · 2026-05 · unverdicted · novelty 4.0

A dual-LLM hierarchical framework for robotic task and motion planning, integrating object detection, achieves 86% success across 24 test scenarios ranging from simple spatial commands to infeasible requests.

Reference graph

Works this paper leans on

118 extracted references · 118 canonical work pages · cited by 20 Pith papers · 2 internal anchors

[1]

Agarwal, A

A. Agarwal, A. Kumar, J. Malik, and D. Pathak. Legged locomotion in challenging terrains using egocentric vision, 2022

work page 2022
[2]

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware, 2023

work page 2023
[3]

Z. Fu, T. Z. Zhao, and C. Finn. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Bommasani, D

R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. Chatterji, A. Chen, K. Creel, J. Q. Davis, D. Demszky, C. Donahue, M. Doumbouya, E. Durmus, S. Ermon, J. Etchemendy, K. Ethayarajh, L. Fei-Fei, C. Finn, T. Gale, L. Gillespie, K. Go...

work page 2022
[5]

Brohan, N

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Herzog, J. Hsu, B. Ichter, A. Irpan, N. Joshi, R. Julian, D. Kalashnikov, Y . Kuang, I. Leal, L. Lee, T.-W. E. Lee, S. Levine, Y . Lu, H. Michalewski, I. Mordatch, K. Pe...

work page 2023
[6]

O’Neill, A

Embodiment Collaboration, A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, A. Tung, A. Bewley, A. Herzog, A. Irpan, A. Khazatsky, A. Rai, A. Gupta, A. Wang, A. Kolobov, A. Singh, A. Garg, A. Kembhavi, A. Xie, A. Brohan, A. Raffin, A. Sharma, A. Yavary, A. Jain, A. Balakrishna, A. Wahid, B. Bur...

work page 2024
[7]

M. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. Openvla: An open-source vision-language-action model. 2024

work page 2024
[8]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023

work page 2023
[9]

H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y . Tay, W. Fedus, Y . Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, A. Castro-Ros, M. Pellat, K. Robinson, D. Valter, S. Narang, G. Mishra, A. Yu, V . Zhao, Y . Huang, A. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V . Le, and...

work page 2022
[10]

Kalashnikov, A

D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakr- ishnan, V . Vanhoucke, and S. Levine. Qt-opt: Scalable deep reinforcement learning for vision- based robotic manipulation, 2018

work page 2018
[11]

Kalashnikov, J

D. Kalashnikov, J. Varley, Y . Chebotar, B. Swanson, R. Jonschkowski, C. Finn, S. Levine, and K. Hausman. Mt-opt: Continuous multi-task robotic reinforcement learning at scale, 2021

work page 2021
[12]

Ebert, Y

F. Ebert, Y . Yang, K. Schmeckpeper, B. Bucher, G. Georgakis, K. Daniilidis, C. Finn, and S. Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets, 2021

work page 2021
[13]

Walke, K

H. Walke, K. Black, A. Lee, M. J. Kim, M. Du, C. Zheng, T. Zhao, P. Hansen-Estruch, Q. Vuong, A. He, V . Myers, K. Fang, C. Finn, and S. Levine. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning (CoRL) , 2023

work page 2023
[14]

Brohan, N

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, A. Herzog, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, T. Jackson, S. Jesmonth, N. J. Joshi, R. Julian, D. Kalashnikov, Y . Kuang, I. Leal, K.-H. Lee, S. Levine, Y . Lu, U. Malla, D. Manju- nath, I. Mordatch, O. Nachum, C. Parada, J. Peralta, E. Perez, K. Pertsc...

work page 2023
[15]

Bharadhwaj, J

H. Bharadhwaj, J. Vakil, M. Sharma, A. Gupta, S. Tulsiani, and V . Kumar. Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking, 2023

work page 2023
[16]

Ghosh, H

Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, C. Xu, J. Luo, T. Kreiman, Y . Tan, L. Y . Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, 2024. 13

work page 2024
[17]

D. Shah, A. Sridhar, N. Dashora, K. Stachowicz, K. Black, N. Hirose, and S. Levine. Vint: A foundation model for visual navigation, 2023

work page 2023
[18]

Pinto and A

L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours, 2015

work page 2015
[19]

Mandlekar, Y

A. Mandlekar, Y . Zhu, A. Garg, J. Booher, M. Spero, A. Tung, J. Gao, J. Emmons, A. Gupta, E. Orbay, S. Savarese, and L. Fei-Fei. Roboturk: A crowdsourcing platform for robotic skill learning through imitation, 2018

work page 2018
[20]

Gupta, A

A. Gupta, A. Murali, D. Gandhi, and L. Pinto. Robot learning in homes: Improving generaliza- tion and reducing dataset bias, 2018

work page 2018
[21]

Dasari, F

S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn. Robonet: Large-scale multi-robot learning, 2020

work page 2020
[22]

Rosete-Beas, O

E. Rosete-Beas, O. Mees, G. Kalweit, J. Boedecker, and W. Burgard. Latent plans for task agnostic offline reinforcement learning. InProceedings of the 6th Conference on Robot Learning (CoRL), Auckland, New Zealand, 2022

work page 2022
[23]

S. Cabi, S. G. Colmenarejo, A. Novikov, K. Konyushkova, S. Reed, R. Jeong, K. Zolna, Y . Aytar, D. Budden, M. Vecerik, O. Sushkov, D. Barker, J. Scholz, M. Denil, N. de Freitas, and Z. Wang. Scaling data-driven robotics with reward sketching and batch reinforcement learning, 2020

work page 2020
[24]

E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pages 991–1002. PMLR, 2022

work page 2022
[25]

H.-S. Fang, H. Fang, Z. Tang, J. Liu, J. Wang, H. Zhu, and C. Lu. Rh20t: A robotic dataset for learning diverse skills in one-shot. In RSS 2023 Workshop on Learning for Task and Motion Planning, 2023

work page 2023
[26]

Khazatsky, K

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, P. D. Fagan, J. Hejna, M. Itkina, M. Lepert, Y . J. Ma, P. T. Miller, J. Wu, S. Belkhale, S. Dass, H. Ha, A. Jain, A. Lee, Y . Lee, M. Memmel, S. Park, I. Radosavovic, K. Wang, A. Zhan, K. Black, C. Chi, K. B. Hatch, S. Lin, J. ...

work page 2024
[27]

Bousmalis, G

K. Bousmalis, G. Vezzani, D. Rao, C. Devin, A. X. Lee, M. Bauza, T. Davchev, Y . Zhou, A. Gupta, A. Raju, A. Laurens, C. Fantacci, V . Dalibard, M. Zambelli, M. Martins, R. Pevce- viciute, M. Blokzijl, M. Denil, N. Batchelor, T. Lampe, E. Parisotto, K. ˙Zołna, S. Reed, S. G. Colmenarejo, J. Scholz, A. Abdolmaleki, O. Groth, J.-B. Regli, O. Sushkov, T. Rot...

work page 2023
[28]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision, 2021

work page 2021
[29]

S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu, and L. Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection, 2023. 14

work page 2023
[30]

Rombach, A

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models, 2022

work page 2022
[31]

H. Liu, C. Li, Q. Wu, and Y . J. Lee. Visual instruction tuning, 2023

work page 2023
[32]

J. Li, D. Li, C. Xiong, and S. Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation, 2022

work page 2022
[33]

J. Li, D. Li, S. Savarese, and S. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023

work page 2023
[34]

W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

work page 2023
[35]

Karamcheti, S

S. Karamcheti, S. Nair, A. Balakrishna, P. Liang, T. Kollar, and D. Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models, 2024

work page 2024
[36]

Black, M

K. Black, M. Nakamoto, P. Atreya, H. Walke, C. Finn, A. Kumar, and S. Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models, 2023

work page 2023
[37]

Y . Du, K. Konyushkova, M. Denil, A. Raju, J. Landon, F. Hill, N. de Freitas, and S. Cabi. Vision-language models as success detectors, 2023

work page 2023
[38]

S. A. Sontakke, J. Zhang, S. M. R. Arnold, K. Pertsch, E. Bıyık, D. Sadigh, C. Finn, and L. Itti. Roboclip: One demonstration is enough to learn robot policies, 2023

work page 2023
[39]

Y . J. Ma, W. Liang, V . Som, V . Kumar, A. Zhang, O. Bastani, and D. Jayaraman. Liv: Language- image representations and rewards for robotic control, 2023

work page 2023
[40]

S. Nair, A. Rajeswaran, V . Kumar, C. Finn, and A. Gupta. R3m: A universal visual representation for robot manipulation, 2022

work page 2022
[41]

Karamcheti, S

S. Karamcheti, S. Nair, A. S. Chen, T. Kollar, C. Finn, D. Sadigh, and P. Liang. Language-driven representation learning for robotics, 2023

work page 2023
[42]

W. Chen, O. Mees, A. Kumar, and S. Levine. Vision-language models provide promptable representations for reinforcement learning, 2024

work page 2024
[43]

Stone, T

A. Stone, T. Xiao, Y . Lu, K. Gopalakrishnan, K.-H. Lee, Q. Vuong, P. Wohlhart, S. Kirmani, B. Zitkovich, F. Xia, C. Finn, and K. Hausman. Open-world object manipulation using pre- trained vision-language models. In arXiv preprint, 2023

[44]

M. Shridhar, L. Manuelli, and D. Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Proceedings of the 6th Conference on Robot Learning (CoRL), 2022

[45]

F. Liu, K. Fang, P. Abbeel, and S. Levine. Moka: Open-vocabulary robotic manipulation through mark-based visual prompting, 2024

[46]

T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners, 2023

[47]

P. Lu, B. Peng, H. Cheng, M. Galley, K.-W. Chang, Y. N. Wu, S.-C. Zhu, and J. Gao. Chameleon: Plug-and-play compositional reasoning with large language models, 2023

[48]

W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar, P. Sermanet, N. Brown, T. Jackson, L. Luu, S. Levine, K. Hausman, and B. Ichter. Inner monologue: Embodied reasoning through planning with language models, 2022

[49]

J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng. Code as policies: Language model programs for embodied control, 2023.

[50]

A. Zeng, M. Attarian, B. Ichter, K. Choromanski, A. Wong, S. Welker, F. Tombari, A. Purohit, M. Ryoo, V. Sindhwani, J. Lee, V. Vanhoucke, and P. Florence. Socratic models: Composing zero-shot multimodal reasoning with language, 2022

[51]

O. Mees, J. Borja-Diaz, and W. Burgard. Grounding language with visual affordances over unstructured data. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 2023

[52]

P. Sharma, A. Torralba, and J. Andreas. Skill induction and planning with latent language, 2022

[53]

C. Huang, O. Mees, A. Zeng, and W. Burgard. Audio visual language maps for robot navigation. In Proceedings of the International Symposium on Experimental Robotics (ISER), Chiang Mai, Thailand, 2023

[54]

C. Huang, O. Mees, A. Zeng, and W. Burgard. Visual language maps for robot navigation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 2023

[55]

I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg. Progprompt: Generating situated robot task plans using large language models, 2022

[56]

H. Ha, P. Florence, and S. Song. Scaling up and distilling down: Language-guided robot skill acquisition, 2023

[57]

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training, 2023

[58]

M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y. Huang, S.-W. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. Dinov2: Learning robust visual features without super...

[59]

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Koren...

[60]

M. Minderer, A. Gritsenko, and N. Houlsby. Scaling open-vocabulary object detection. Advances in Neural Information Processing Systems, 36, 2024

[61]

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023

[62]

Gemini Team. Gemini: A family of highly capable multimodal models, 2024

[63]

S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, H. Palangi, and A. Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4, 2023

[64]

S. Belkhale, T. Ding, T. Xiao, P. Sermanet, Q. Vuong, J. Tompson, Y. Chebotar, D. Dwibedi, and D. Sadigh. Rt-h: Action hierarchies using language, 2024

[65]

M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.

[66]

NVIDIA. Tensorrt-llm. https://github.com/NVIDIA/TensorRT-LLM?tab=readme-ov-file, 2024

[67]

Y. Leviathan, M. Kalman, and Y. Matias. Fast inference from transformers via speculative decoding, 2023

[68]

M. Kelly, C. Sidrane, K. Driggs-Campbell, and M. J. Kochenderfer. Hg-dagger: Interactive imitation learning with human experts, 2019

[69]

P. Sharma, B. Sundaralingam, V. Blukis, C. Paxton, T. Hermans, A. Torralba, J. Andreas, and D. Fox. Correcting robot plans with natural language feedback, 2022

[70]

L. X. Shi, Z. Hu, T. Z. Zhao, A. Sharma, K. Pertsch, J. Luo, S. Levine, and C. Finn. Yell at your robot: Improving on-the-fly from language corrections, 2024

[71]

X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kirmani, S. Levine, J. Wu, C. Finn, H. Su, Q. Vuong, and T. Xiao. Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941, 2024.

A Grounding DINO Detections and Prismatic Descriptions

We provide example scene descriptions pr...

close gripper (10.8%)
move backward (2.4%)
move up, open gripper (2.1%)
move forward right (1.1%)
move up, close gripper (1.0%)
move backward left (1.0%)
move forward left (0.9%)
move left down (0.8%)
move down, close gripper (0.8%)
