pith. machine review for the scientific record.

arxiv: 2310.10639 · v1 · submitted 2023-10-16 · 💻 cs.RO

Recognition: no theorem link

Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 05:50 UTC · model grok-4.3

classification 💻 cs.RO
keywords robotic manipulation · diffusion models · subgoal generation · zero-shot generalization · image editing · goal-conditioned policy · CALVIN benchmark

The pith

A finetuned image-editing diffusion model generates subgoal images that let a low-level policy complete manipulation tasks on objects and instructions absent from robot training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that an image-editing diffusion model can be finetuned on human and robot videos to predict a future observation from the current camera view and a language command. This predicted image then serves as the target for a separately trained goal-conditioned policy that moves the robot arm. Because the diffusion model draws on large-scale visual pretraining, it supplies subgoal images that generalize to novel objects, scenes, and phrasings where direct language-to-action policies fail. The resulting system reaches state-of-the-art scores on the CALVIN benchmark and transfers to real robots while using far less robot-specific data than competing methods.
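
To make the finetuning step concrete, the conditional denoising objective plausibly takes the standard latent-diffusion form sketched below. The notation is ours: the exact conditioning interface and how the frame offset k is sampled are illustrative assumptions, not details quoted from the paper.

    \mathcal{L}(\theta) = \mathbb{E}_{(o_t,\, o_{t+k},\, \ell),\; \epsilon \sim \mathcal{N}(0, I),\; \tau}
        \Big[ \big\| \epsilon - \epsilon_\theta\big( \sqrt{\bar{\alpha}_\tau}\, \mathcal{E}(o_{t+k})
        + \sqrt{1 - \bar{\alpha}_\tau}\, \epsilon,\ \tau,\ o_t,\ \ell \big) \big\|^2 \Big]

Here (o_t, o_{t+k}) is a current/future frame pair drawn from human or robot video, ℓ is the language command, \mathcal{E} is the latent encoder, and ε_θ is the denoising network conditioned on both the current image and the instruction; sampling from the finetuned model yields the subgoal image.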

Core claim

The central claim is that a diffusion model finetuned to edit robot observations into language-specified future images produces subgoal targets that a goal-conditioned low-level policy can reliably execute, yielding zero-shot generalization to unseen objects and instructions on both simulated and physical manipulation tasks.

What carries the argument

SuSIE, a two-module architecture in which a finetuned InstructPix2Pix diffusion model converts a current observation and language command into a subgoal image that a goal-conditioned policy then reaches through closed-loop control.
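
As a rough illustration of how the two modules could interact at test time, the sketch below shows one plausible closed loop; the function names, the observation dictionary, and the subgoal-refresh interval are our own assumptions, not the paper's API.

    # Hypothetical SuSIE-style control loop (illustrative names only).
    def run_episode(env, instruction, subgoal_model, goal_policy,
                    max_steps=300, steps_per_subgoal=20):
        obs = env.reset()
        subgoal_image = None
        for step in range(max_steps):
            if step % steps_per_subgoal == 0:
                # High level: "edit" the current camera image into a predicted
                # future observation, conditioned on the language command.
                subgoal_image = subgoal_model.generate(image=obs["rgb"],
                                                       prompt=instruction)
            # Low level: the goal-conditioned policy only ever sees images,
            # so the diffusion model carries all of the language grounding.
            action = goal_policy.act(current_image=obs["rgb"],
                                     goal_image=subgoal_image)
            obs, done = env.step(action)
            if done:
                break
        return obs

Because the high level emits only an image, swapping in a different low-level controller would not require retraining the diffusion model.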

If this is right

  • Robots succeed on instructions involving objects never shown in their own training data.
  • Long-horizon sequences can be completed by repeatedly querying the diffusion model for the next subgoal image.
  • The same high-level planner works with different low-level controllers without retraining the diffusion model.
  • Performance gains appear without requiring orders of magnitude more robot data or privileged state information.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Image-generation models could replace explicit symbolic planners for many sequential robot tasks.
  • The separation of visual subgoal prediction from low-level control may scale more readily than end-to-end language-to-action training as pretraining corpora grow.
  • Similar subgoal-image generation could be tested in navigation or multi-robot coordination settings.

Load-bearing premise

The subgoal images produced by the diffusion model must stay close enough to reachable states that the low-level policy can actually execute them even when objects, lighting, or wording fall outside the finetuning videos.

What would settle it

A set of new manipulation tasks where the generated subgoal images appear plausible to humans yet the robot consistently fails to reach them would falsify the claim that the diffusion model supplies executable targets.
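
A minimal sketch of such a test, assuming human raters score subgoal plausibility and the environment exposes a success check (every name here is hypothetical):

    # Hypothetical executability audit: plausible-looking subgoals vs. reach rate.
    def executability_audit(tasks, env, subgoal_model, goal_policy,
                            trials=20, horizon=100):
        results = []
        for task in tasks:
            plausible = reached = 0
            for _ in range(trials):
                obs = env.reset(task)
                goal = subgoal_model.generate(image=obs["rgb"],
                                              prompt=task.instruction)
                plausible += int(task.rater_judges_plausible(goal))
                for _ in range(horizon):
                    obs, done = env.step(goal_policy.act(obs["rgb"], goal))
                    if done:
                        break
                reached += int(env.goal_reached(goal))
            results.append({"task": task.name,
                            "plausibility": plausible / trials,
                            "reach_rate": reached / trials})
        # High plausibility with persistently low reach_rate would falsify
        # the claim that the diffusion model supplies executable targets.
        return results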

read the original abstract

If generalist robots are to operate in truly unstructured environments, they need to be able to recognize and reason about novel objects and scenarios. Such objects and scenarios might not be present in the robot's own training data. We propose SuSIE, a method that leverages an image-editing diffusion model to act as a high-level planner by proposing intermediate subgoals that a low-level controller can accomplish. Specifically, we finetune InstructPix2Pix on video data, consisting of both human videos and robot rollouts, such that it outputs hypothetical future "subgoal" observations given the robot's current observation and a language command. We also use the robot data to train a low-level goal-conditioned policy to act as the aforementioned low-level controller. We find that the high-level subgoal predictions can utilize Internet-scale pretraining and visual understanding to guide the low-level goal-conditioned policy, achieving significantly better generalization and precision than conventional language-conditioned policies. We achieve state-of-the-art results on the CALVIN benchmark, and also demonstrate robust generalization on real-world manipulation tasks, beating strong baselines that have access to privileged information or that utilize orders of magnitude more compute and training data. The project website can be found at http://rail-berkeley.github.io/susie .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SuSIE, which finetunes the InstructPix2Pix image-editing diffusion model on a combination of human videos and robot rollouts to generate intermediate subgoal observations from the current visual input and a language command. These subgoals then condition a separately trained low-level goal-conditioned policy for robotic manipulation tasks. The method is evaluated on the CALVIN benchmark, where it reports state-of-the-art results, and on real-world tasks, where it outperforms baselines that use privileged information or substantially more compute and data, emphasizing improved zero-shot generalization to novel objects and instructions.

Significance. If the empirical claims hold under closer scrutiny, the work provides concrete evidence that large-scale pretrained diffusion models can serve as effective high-level visual planners for robotics, decoupling semantic understanding from low-level control and thereby reducing reliance on massive robot-specific datasets. This separation could generalize to other embodied tasks and highlights a practical route for incorporating internet-scale visual priors into manipulation policies.

major comments (3)
  1. [Experimental Results] Experimental Results section: the SOTA claim on CALVIN is presented without error bars, standard deviations across seeds, or explicit confirmation that the evaluation protocol matches prior work exactly (including data splits and success criteria), which is required to substantiate robustness over strong baselines.
  2. [Method and Experiments] Method and Experiments sections: no quantitative isolation of the subgoal-generation step is reported, such as success rates when the low-level policy is conditioned on ground-truth future frames versus diffusion-generated subgoals; without this, it is impossible to determine whether performance gains stem from accurate subgoal prediction or from other factors.
  3. [Real-world experiments] Real-world experiments: the paper does not include failure-case analysis or metrics on how often generated subgoals are kinematically infeasible or visually inconsistent with novel objects/lighting, which directly tests the central transfer assumption for zero-shot generalization.
minor comments (2)
  1. [Abstract and Introduction] The abstract and introduction could more clearly distinguish the finetuning data sources (human videos versus robot rollouts) and their respective contributions to the diffusion model.
  2. [Figures] Figure captions and legends should explicitly state whether results are averaged over multiple runs and include the exact number of evaluation episodes per task.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and will incorporate revisions to improve the manuscript's clarity and rigor.

read point-by-point responses
  1. Referee: [Experimental Results] Experimental Results section: the SOTA claim on CALVIN is presented without error bars, standard deviations across seeds, or explicit confirmation that the evaluation protocol matches prior work exactly (including data splits and success criteria), which is required to substantiate robustness over strong baselines.

    Authors: We agree that error bars and explicit protocol confirmation would strengthen the SOTA claims. In the revised manuscript, we will report standard deviations across multiple random seeds for all CALVIN results and explicitly state that our evaluation follows the exact protocol from prior work, including identical data splits and success criteria. revision: yes

  2. Referee: [Method and Experiments] Method and Experiments sections: no quantitative isolation of the subgoal-generation step is reported, such as success rates when the low-level policy is conditioned on ground-truth future frames versus diffusion-generated subgoals; without this, it is impossible to determine whether performance gains stem from accurate subgoal prediction or from other factors.

    Authors: This is a fair point. We will add an ablation study in the revised Experiments section that reports success rates for the low-level policy conditioned on ground-truth future frames versus our diffusion-generated subgoals. This will isolate the subgoal-generation contribution and clarify the source of performance gains. revision: yes

  3. Referee: [Real-world experiments] Real-world experiments: the paper does not include failure-case analysis or metrics on how often generated subgoals are kinematically infeasible or visually inconsistent with novel objects/lighting, which directly tests the central transfer assumption for zero-shot generalization.

    Authors: We acknowledge the value of failure analysis for validating zero-shot transfer. In the revision, we will include a dedicated subsection on real-world failure cases with quantitative metrics on the frequency of kinematically infeasible or visually inconsistent subgoals, focusing on novel objects and lighting variations. revision: yes

Circularity Check

0 steps flagged

No circularity: independent training of diffusion planner and goal-conditioned policy with external benchmark evaluation

full rationale

The method trains the InstructPix2Pix-based high-level planner via finetuning on separate video datasets (human videos plus robot rollouts) and trains the low-level goal-conditioned policy independently on robot data. The central claims of SOTA performance on CALVIN and real-world generalization are evaluated on held-out benchmarks rather than being defined in terms of any fitted parameter or self-citation chain that reduces the output to the input by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the transferability of visual knowledge from large-scale image pretraining to robot subgoal prediction and on the assumption that a single goal-conditioned policy can reliably reach any image generated by the diffusion model.

axioms (1)
  • domain assumption Finetuning InstructPix2Pix on mixed human and robot video produces subgoal images that are both visually plausible and kinematically reachable by the robot.
    Invoked when the method states that the diffusion model outputs hypothetical future observations usable by the low-level controller.

pith-pipeline@v0.9.0 · 5541 in / 1318 out tokens · 25606 ms · 2026-05-16T05:50:34.186645+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

  2. ${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

  3. UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models

    cs.RO 2026-02 unverdicted novelty 7.0

    UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.

  4. Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training

    cs.RO 2026-04 unverdicted novelty 6.0

    DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...

  5. Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training

    cs.RO 2026-04 unverdicted novelty 6.0

    Hi-WM uses human interventions inside an action-conditioned world model with rollback and branching to generate dense corrective data, raising real-world success by 37.9 points on average across three manipulation tasks.

  6. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  7. PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    PhyEdit improves physical accuracy in image object manipulation by using explicit geometric simulation as 3D-aware guidance combined with joint 2D-3D supervision.

  8. Ctrl-World: A Controllable Generative World Model for Robot Manipulation

    cs.RO 2025-10 unverdicted novelty 6.0

    A controllable world model trained on the DROID dataset generates consistent multi-view robot trajectories for over 20 seconds and improves generalist policy success rates by 44.7% via imagined trajectory fine-tuning.

  9. Video Generators are Robot Policies

    cs.RO 2025-08 conditional novelty 6.0

    Training models to generate videos of robot actions produces policies that generalize better to new objects and tasks while using far less demonstration data than standard behavior cloning.

  10. CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

    cs.CV 2025-03 unverdicted novelty 6.0

    CoT-VLA is a 7B VLA that generates future visual frames autoregressively as planning goals before actions, outperforming prior VLAs by 17% on real-world tasks and 6% in simulation.

  11. DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

    cs.RO 2025-02 unverdicted novelty 6.0

    DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.

  12. Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    cs.CV 2024-12 unverdicted novelty 6.0

    Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.

  13. GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    cs.RO 2024-10 unverdicted novelty 6.0

    GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.

  14. Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

    cs.RO 2024-09 unverdicted novelty 6.0

    Gen2Act enables generalizable robot manipulation for unseen objects and novel motions by using zero-shot human video generation from web data to condition a policy trained on an order of magnitude less robot interaction data.

  15. Octo: An Open-Source Generalist Robot Policy

    cs.RO 2024-05 unverdicted novelty 6.0

    Octo is an open-source transformer-based generalist robot policy pretrained on 800k trajectories that serves as an effective initialization for finetuning across diverse robotic platforms.

  16. StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement

    cs.RO 2026-04 unverdicted novelty 5.0

    StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...

  17. From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

    cs.RO 2026-04 accept novelty 5.0

    A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.

  18. Motus: A Unified Latent Action World Model

    cs.CV 2025-12 unverdicted novelty 5.0

    Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.

  19. World Model for Robot Learning: A Comprehensive Survey

    cs.RO 2026-04 unverdicted novelty 3.0

    A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · cited by 19 Pith papers · 16 internal anchors

  1. [1]

Is conditional generative modeling all you need for decision making?

Anurag Ajay, Yilun Du, Abhi Gupta, Joshua B. Tenenbaum, Tommi S. Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision making? In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=sP1fo2K9DFG

  2. [2]

Compositional foundation models for hierarchical planning

Anurag Ajay, Seungwook Han, Yilun Du, Shuang Li, Abhi Gupta, Tommi Jaakkola, Josh Tenenbaum, Leslie Kaelbling, Akash Srivastava, and Pulkit Agrawal. Compositional foundation models for hierarchical planning. arXiv preprint arXiv:2309.08587, 2023

  3. [3]

    Fitvid: Overfitting in pixel-level video prediction

    Mohammad Babaeizadeh, Mohammad Taghi Saffar, Suraj Nair, Sergey Levine, Chelsea Finn, and Dumitru Erhan. Fitvid: Overfitting in pixel-level video prediction. arXiv preprint arXiv:2106.13195, 2020

  4. [4]

    Robotic offline rl from internet videos via value-function pre-training

Chethan Bhateja, Derek Guo, Dibya Ghosh, Anika Singh, Manan Tomar, Quan Ho Vuong, Yevgen Chebotar, Sergey Levine, and Aviral Kumar. Robotic offline rl from internet videos via value-function pre-training. 2023. URL https://api.semanticscholar.org/CorpusID:262217278

  5. [5]

Introducing ChatGPT and Whisper APIs

Greg Brockman, Atty Eleti, Elie Georges, Joanne Jang, Logan Kilpatrick, Rachel Lim, Luke Miller, and Michelle Pokrass. Introducing ChatGPT and Whisper APIs. OpenAI Blog, 2023. URL https://openai.com/blog/introducing-chatgpt-and-whisper-apis

  6. [6]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

  7. [7]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023

  8. [8]

    Do as i can, not as i say: Grounding language in robotic affordances

Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, et al. Do as i can, not as i say: Grounding language in robotic affordances. In Conference on Robot Learning, pp. 287–318. PMLR, 2023

  9. [9]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18392–18402, 2023

  10. [10]

    Decision transformer: Reinforcement learning via sequence modeling

    Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. arXiv preprint arXiv:2106.01345, 2021

  11. [11]

    Genaug: Retargeting behaviors to unseen situations via generative augmentation

    Zoey Chen, Sho Kiami, Abhishek Gupta, and Vikash Kumar. Genaug: Retargeting behaviors to unseen situations via generative augmentation. arXiv preprint arXiv:2302.06671, 2023

  12. [12]

    Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023

  13. [13]

    Scaling Instruction-Finetuned Language Models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022

  14. [14]

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Open X-Embodiment Collaboration, Abhishek Padalkar, Acorn Pooley, Ajinkya Jain, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anikait Singh, Anthony Brohan, Antonin Raffin, Ayzaan Wahid, Ben Burgess-Limerick, Beomjoon Kim, Bernhard Schölkopf, Brian Ichter, Cewu Lu, Charles Xu, Chelsea Finn, Chenfeng Xu, Cheng Chi, Chenguang Huan...

  15. [15]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023

  16. [16]

    Learning universal policies via text-guided video generation

    Yilun Du, Mengjiao Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. arXiv preprint arXiv:2302.00111, 2023

  17. [17]

    Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control

Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, and Sergey Levine. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568, 2018

  18. [18]

    Policy adaptation from foundation model feedback

    Yuying Ge, Annabella Macaluso, Li Erran Li, Ping Luo, and Xiaolong Wang. Policy adaptation from foundation model feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19059–19069, 2023

  19. [19]

The "something something" video database for learning and evaluating visual common sense

Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pp. 5842–5850, 2017

  20. [20]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18995–19012, 2022

  21. [21]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019

  22. [22]

    Mastering Diverse Domains through World Models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

  23. [23]

    IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

    Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. Idql: Implicit q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573, 2023

  24. [24]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

  25. [25]

    Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020

  26. [26]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen video: High definition video generation with diffusion models, 2022

  27. [27]

    Video Diffusion Models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv:2204.03458, 2022

  28. [28]

    Language models as zero-shot planners: Extracting actionable knowledge for embodied agents

    Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. arXiv preprint arXiv:2201.07207, 2022

  29. [29]

    Inner Monologue: Embodied Reasoning through Planning with Language Models

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022

  30. [30]

    Bc-z: Zero-shot task generalization with robotic imitation learning

    Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pp. 991–1002. PMLR, 2022

  31. [31]

    Offline reinforcement learning as one big sequence modeling problem

    Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. In Advances in Neural Information Processing Systems , 2021

  32. [32]

    Planning with diffusion for flexible behavior synthesis

    Michael Janner, Yilun Du, Joshua Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. In International Conference on Machine Learning , 2022

  33. [33]

    Mdetr-modulated detection for end-to-end multi-modal understanding

Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1780–1790, 2021

  34. [34]

Dall-e-bot: Introducing web-scale diffusion models to robotics

Ivan Kapelyukh, Vitalis Vosylius, and Edward Johns. Dall-e-bot: Introducing web-scale diffusion models to robotics. 2023

  35. [35]

    Language-driven representation learning for robotics

    Siddharth Karamcheti, Suraj Nair, Annie S Chen, Thomas Kollar, Chelsea Finn, Dorsa Sadigh, and Percy Liang. Language-driven representation learning for robotics. arXiv preprint arXiv:2302.12766, 2023

  36. [36]

    Adam: A method for stochastic optimization

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015

  37. [37]

Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model

Alex X Lee, Anusha Nagabandi, Pieter Abbeel, and Sergey Levine. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. Advances in Neural Information Processing Systems, 33:741–752, 2020

  38. [38]

    Stochastic primal-dual q-learning

    Donghwan Lee and Niao He. Stochastic primal-dual q-learning. arXiv preprint arXiv:1810.08298, 2018

  39. [39]

    Code as policies: Language model programs for embodied control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA) , pp. 9493–9500. IEEE, 2023

  40. [40]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  41. [41]

    Language conditioned imitation learning over unstructured data

    Corey Lynch and Pierre Sermanet. Language conditioned imitation learning over unstructured data. arXiv preprint arXiv:2005.07648, 2020

  42. [42]

    Liv: Language-image representations and rewards for robotic control

    Yecheng Jason Ma, William Liang, Vaidehi Som, Vikash Kumar, Amy Zhang, Osbert Bastani, and Dinesh Jayaraman. Liv: Language-image representations and rewards for robotic control. arXiv preprint arXiv:2306.00958, 2023

  43. [43]

Cacti: A framework for scalable multi-task multi-scene visual imitation learning

Zhao Mandi, Homanga Bharadhwaj, Vincent Moens, Shuran Song, Aravind Rajeswaran, and Vikash Kumar. Cacti: A framework for scalable multi-task multi-scene visual imitation learning. arXiv preprint arXiv:2212.05711, 2022

  44. [44]

    What matters in language conditioned robotic imitation learning over unstructured data

Oier Mees, Lukas Hermann, and Wolfram Burgard. What matters in language conditioned robotic imitation learning over unstructured data. IEEE Robotics and Automation Letters, 7(4):11205–11212, 2022

  45. [45]

    Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks

Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters (RA-L), 7(3):7327–7334, 2022

  46. [46]

    Simple open-vocabulary object detection

    Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, et al. Simple open-vocabulary object detection. In European Conference on Computer Vision, pp. 728–755. Springer, 2022

  47. [47]

Goal representations for instruction following: A semi-supervised language interface to control

Vivek Myers, Andre He, Kuan Fang, Homer Walke, Philippe Hansen-Estruch, Ching-An Cheng, Mihai Jalobeanu, Andrey Kolobov, Anca Dragan, and Sergey Levine. Goal representations for instruction following: A semi-supervised language interface to control. arXiv preprint arXiv:2307.00117, 2023

  48. [48]

    Hierarchical foresight: Self-supervised learning of long-horizon tasks via visual subgoal generation

    Suraj Nair and Chelsea Finn. Hierarchical foresight: Self-supervised learning of long-horizon tasks via visual subgoal generation. arXiv preprint arXiv:1909.05829, 2019

  49. [49]

    R3M: A Universal Visual Representation for Robot Manipulation

    Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhi Gupta. R3m: A universal visual representation for robot manipulation. ArXiv, abs/2203.12601, 2022

  50. [50]

    Learning to augment synthetic images for sim2real policy transfer

Alexander Pashevich, Robin Strudel, Igor Kalevatykh, Ivan Laptev, and Cordelia Schmid. Learning to augment synthetic images for sim2real policy transfer. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2651–2657. IEEE, 2019

  51. [51]

    FiLM: Visual Reasoning with a General Conditioning Layer, December 2017

    Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual Reasoning with a General Conditioning Layer, December 2017

  52. [52]

    Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763. PMLR, 2021

  53. [53]

Offline reinforcement learning from images with latent space models

    Rafael Rafailov, Tianhe Yu, A. Rajeswaran, and Chelsea Finn. Offline reinforcement learning from images with latent space models. Learning for Decision Making and Control (L4DC) , 2021

  54. [54]

    Reinforcement learning with action-free pre-training from videos

Younggyo Seo, Kimin Lee, Stephen L James, and Pieter Abbeel. Reinforcement learning with action-free pre-training from videos. In International Conference on Machine Learning, pp. 19561–19579. PMLR, 2022

  55. [55]

    Cliport: What and where pathways for robotic manipulation

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic manipulation. In Conference on Robot Learning, pp. 894–906. PMLR, 2022

  56. [56]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

  57. [57]

Open-world object manipulation using pre-trained vision-language models

Austin Stone, Ted Xiao, Yao Lu, Keerthana Gopalakrishnan, Kuang-Huei Lee, Quan Vuong, Paul Wohlhart, Brianna Zitkovich, Fei Xia, Chelsea Finn, et al. Open-world object manipulation using pre-trained vision-language models. arXiv preprint arXiv:2303.00905, 2023

  58. [58]

    High fidelity video prediction with large stochastic recurrent neural networks

    Ruben Villegas, Arkanath Pathak, Harini Kannan, Dumitru Erhan, Quoc V Le, and Honglak Lee. High fidelity video prediction with large stochastic recurrent neural networks. Advances in Neural Information Processing Systems , 32, 2019

  59. [59]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning (CoRL) , 2023

  60. [60]

Daydreamer: World models for physical robot learning

Philipp Wu, Alejandro Escontrela, Danijar Hafner, Pieter Abbeel, and Ken Goldberg. Daydreamer: World models for physical robot learning. In Conference on Robot Learning, pp. 2226–2240. PMLR, 2023

  61. [61]

    Multilingual Universal Sentence Encoder for Semantic Retrieval, July 2019

Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, et al. Multilingual Universal Sentence Encoder for Semantic Retrieval, July 2019

  62. [62]

    Video probabilistic diffusion models in projected latent space

    Sihyun Yu, Kihyuk Sohn, Subin Kim, and Jinwoo Shin. Video probabilistic diffusion models in projected latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18456–18466, 2023

  63. [63]

    Combo: Conservative offline model-based policy optimization

    Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. Combo: Conservative offline model-based policy optimization. arXiv preprint arXiv:2102.08363, 2021

  64. [64]

Scaling robot learning with semantically imagined experience

Tianhe Yu, Ted Xiao, Austin Stone, Jonathan Tompson, Anthony Brohan, Su Wang, Jaspiar Singh, Clayton Tan, Jodilyn Peralta, Brian Ichter, et al. Scaling robot learning with semantically imagined experience. arXiv preprint arXiv:2302.11550, 2023

  65. [65]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023
