Recognition: no theorem link
Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models
Pith reviewed 2026-05-16 05:50 UTC · model grok-4.3
The pith
A finetuned image-editing diffusion model generates subgoal images that let a low-level policy complete manipulation tasks on objects and instructions absent from robot training data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a diffusion model finetuned to edit robot observations into language-specified future images produces subgoal targets that a goal-conditioned low-level policy can reliably execute, yielding zero-shot generalization to unseen objects and instructions on both simulated and physical manipulation tasks.
What carries the argument
SuSIE, a two-module architecture in which a finetuned InstructPix2Pix diffusion model converts a current observation and language command into a subgoal image that a goal-conditioned policy then reaches through closed-loop control.
If this is right
- Robots succeed on instructions involving objects never shown in their own training data.
- Long-horizon sequences can be completed by repeatedly querying the diffusion model for the next subgoal image.
- The same high-level planner works with different low-level controllers without retraining the diffusion model.
- Performance gains appear without requiring orders of magnitude more robot data or privileged state information.
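The two-module loop described above (generate a subgoal image, reach it, repeat) can be sketched as follows. This is an illustrative reconstruction, not the paper's released code: `generate_subgoal`, `goal_policy`, and the `env` interface are hypothetical stand-ins for the finetuned InstructPix2Pix model, the goal-conditioned policy, and the robot environment.

```python
# Hypothetical sketch of the SuSIE two-module control loop.
# All function and environment names are placeholders for illustration.

def susie_rollout(env, generate_subgoal, goal_policy, instruction,
                  max_subgoals=10, steps_per_subgoal=20):
    """Alternate high-level subgoal generation with low-level goal reaching."""
    obs = env.current_observation()
    for _ in range(max_subgoals):
        # High level: "edit" the current observation into a future subgoal image
        # conditioned on the language command.
        subgoal_image = generate_subgoal(obs, instruction)
        # Low level: closed-loop control toward the generated image.
        for _ in range(steps_per_subgoal):
            action = goal_policy(obs, subgoal_image)
            obs, done = env.step(action)
            if done:
                return True
    return False
```

Note how long-horizon behavior falls out of the structure: the diffusion model is simply re-queried with the latest observation each time the inner loop exhausts its step budget.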
Where Pith is reading between the lines
- Image-generation models could replace explicit symbolic planners for many sequential robot tasks.
- The separation of visual subgoal prediction from low-level control may scale more readily than end-to-end language-to-action training as pretraining corpora grow.
- Similar subgoal-image generation could be tested in navigation or multi-robot coordination settings.
Load-bearing premise
The subgoal images produced by the diffusion model must stay close enough to reachable states that the low-level policy can actually execute them even when objects, lighting, or wording fall outside the finetuning videos.
What would settle it
A set of new manipulation tasks where the generated subgoal images appear plausible to humans yet the robot consistently fails to reach them would falsify the claim that the diffusion model supplies executable targets.
Original abstract
If generalist robots are to operate in truly unstructured environments, they need to be able to recognize and reason about novel objects and scenarios. Such objects and scenarios might not be present in the robot's own training data. We propose SuSIE, a method that leverages an image-editing diffusion model to act as a high-level planner by proposing intermediate subgoals that a low-level controller can accomplish. Specifically, we finetune InstructPix2Pix on video data, consisting of both human videos and robot rollouts, such that it outputs hypothetical future "subgoal" observations given the robot's current observation and a language command. We also use the robot data to train a low-level goal-conditioned policy to act as the aforementioned low-level controller. We find that the high-level subgoal predictions can utilize Internet-scale pretraining and visual understanding to guide the low-level goal-conditioned policy, achieving significantly better generalization and precision than conventional language-conditioned policies. We achieve state-of-the-art results on the CALVIN benchmark, and also demonstrate robust generalization on real-world manipulation tasks, beating strong baselines that have access to privileged information or that utilize orders of magnitude more compute and training data. The project website can be found at http://rail-berkeley.github.io/susie .
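The abstract's finetuning recipe (current observation plus language command mapped to a hypothetical future frame) amounts to building image-editing training pairs from video. A minimal sketch, assuming a fixed horizon window: the bounds `k_min`/`k_max` and the function name are illustrative placeholders, not values or code from the paper.

```python
import random

# Illustrative sketch (not the paper's code) of building image-editing
# training pairs from video: each example maps (current frame, command)
# to a frame k steps ahead, which the diffusion model learns to "edit"
# the observation into. k_min/k_max are placeholder horizon bounds.

def make_editing_pairs(video_frames, command, k_min=10, k_max=20,
                       n_pairs=4, rng=None):
    rng = rng or random.Random(0)
    pairs = []
    for _ in range(n_pairs):
        k = rng.randint(k_min, k_max)          # subgoal horizon, inclusive
        t = rng.randrange(0, len(video_frames) - k)
        # (input image, edit instruction) -> target "subgoal" image
        pairs.append(((video_frames[t], command), video_frames[t + k]))
    return pairs
```

Because the target is only k steps ahead rather than a full task completion, the same recipe applies to action-free human video, which is what lets the model absorb non-robot data.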
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SuSIE, which finetunes the InstructPix2Pix image-editing diffusion model on a combination of human videos and robot rollouts to generate intermediate subgoal observations from the current visual input and a language command. These subgoals then condition a separately trained low-level goal-conditioned policy for robotic manipulation tasks. The method is evaluated on the CALVIN benchmark, where it reports state-of-the-art results, and on real-world tasks, where it outperforms baselines that use privileged information or substantially more compute and data, emphasizing improved zero-shot generalization to novel objects and instructions.
Significance. If the empirical claims hold under closer scrutiny, the work provides concrete evidence that large-scale pretrained diffusion models can serve as effective high-level visual planners for robotics, decoupling semantic understanding from low-level control and thereby reducing reliance on massive robot-specific datasets. This separation could generalize to other embodied tasks and highlights a practical route for incorporating internet-scale visual priors into manipulation policies.
major comments (3)
- [Experimental Results] Experimental Results section: the SOTA claim on CALVIN is presented without error bars, standard deviations across seeds, or explicit confirmation that the evaluation protocol matches prior work exactly (including data splits and success criteria), which is required to substantiate robustness over strong baselines.
- [Method and Experiments] Method and Experiments sections: no quantitative isolation of the subgoal-generation step is reported, such as success rates when the low-level policy is conditioned on ground-truth future frames versus diffusion-generated subgoals; without this, it is impossible to determine whether performance gains stem from accurate subgoal prediction or from other factors.
- [Real-world experiments] Real-world experiments: the paper does not include failure-case analysis or metrics on how often generated subgoals are kinematically infeasible or visually inconsistent with novel objects/lighting, which directly tests the central transfer assumption for zero-shot generalization.
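The ablation requested in the second major comment can be made concrete with a small evaluation harness: run the same low-level policy conditioned on ground-truth future frames and on diffusion-generated subgoals, and compare success rates. This is a hypothetical sketch; `run_episode` and the subgoal sources are placeholders, not the paper's evaluation code.

```python
# Hypothetical sketch of the requested ablation: isolate the subgoal-
# generation step by swapping what conditions the low-level policy.
# `run_episode(task, subgoal)` is a placeholder returning True on success.

def ablate_subgoal_source(tasks, run_episode, gt_subgoals, gen_subgoals):
    """Return success rates for ground-truth vs generated subgoal conditioning."""
    results = {}
    for name, source in [("ground_truth", gt_subgoals),
                         ("generated", gen_subgoals)]:
        successes = sum(run_episode(task, source(task)) for task in tasks)
        results[name] = successes / len(tasks)
    # If generated-subgoal success tracks ground-truth success closely, the
    # gains plausibly come from accurate subgoal prediction rather than from
    # the low-level policy alone.
    return results
```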
minor comments (2)
- [Abstract and Introduction] The abstract and introduction could more clearly distinguish the finetuning data sources (human videos versus robot rollouts) and their respective contributions to the diffusion model.
- [Figures] Figure captions and legends should explicitly state whether results are averaged over multiple runs and include the exact number of evaluation episodes per task.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and will incorporate revisions to improve the manuscript's clarity and rigor.
Point-by-point responses
-
Referee: [Experimental Results] Experimental Results section: the SOTA claim on CALVIN is presented without error bars, standard deviations across seeds, or explicit confirmation that the evaluation protocol matches prior work exactly (including data splits and success criteria), which is required to substantiate robustness over strong baselines.
Authors: We agree that error bars and explicit protocol confirmation would strengthen the SOTA claims. In the revised manuscript, we will report standard deviations across multiple random seeds for all CALVIN results and explicitly state that our evaluation follows the exact protocol from prior work, including identical data splits and success criteria. revision: yes
-
Referee: [Method and Experiments] Method and Experiments sections: no quantitative isolation of the subgoal-generation step is reported, such as success rates when the low-level policy is conditioned on ground-truth future frames versus diffusion-generated subgoals; without this, it is impossible to determine whether performance gains stem from accurate subgoal prediction or from other factors.
Authors: This is a fair point. We will add an ablation study in the revised Experiments section that reports success rates for the low-level policy conditioned on ground-truth future frames versus our diffusion-generated subgoals. This will isolate the subgoal-generation contribution and clarify the source of performance gains. revision: yes
-
Referee: [Real-world experiments] Real-world experiments: the paper does not include failure-case analysis or metrics on how often generated subgoals are kinematically infeasible or visually inconsistent with novel objects/lighting, which directly tests the central transfer assumption for zero-shot generalization.
Authors: We acknowledge the value of failure analysis for validating zero-shot transfer. In the revision, we will include a dedicated subsection on real-world failure cases with quantitative metrics on the frequency of kinematically infeasible or visually inconsistent subgoals, focusing on novel objects and lighting variations. revision: yes
Circularity Check
No circularity: independent training of diffusion planner and goal-conditioned policy with external benchmark evaluation
Full rationale
The method trains the InstructPix2Pix-based high-level planner via finetuning on separate video datasets (human videos plus robot rollouts) and trains the low-level goal-conditioned policy independently on robot data. The central claims of SOTA performance on CALVIN and real-world generalization are evaluated on held-out benchmarks rather than being defined in terms of any fitted parameter or self-citation chain that reduces the output to the input by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Finetuning InstructPix2Pix on mixed human and robot video produces subgoal images that are both visually plausible and kinematically reachable by the robot.
Forward citations
Cited by 19 Pith papers
-
From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
-
${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
-
UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models
UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.
-
Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training
DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...
-
Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training
Hi-WM uses human interventions inside an action-conditioned world model with rollback and branching to generate dense corrective data, raising real-world success by 37.9 points on average across three manipulation tasks.
-
Human Cognition in Machines: A Unified Perspective of World Models
The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...
-
PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing
PhyEdit improves physical accuracy in image object manipulation by using explicit geometric simulation as 3D-aware guidance combined with joint 2D-3D supervision.
-
Ctrl-World: A Controllable Generative World Model for Robot Manipulation
A controllable world model trained on the DROID dataset generates consistent multi-view robot trajectories for over 20 seconds and improves generalist policy success rates by 44.7% via imagined trajectory fine-tuning.
-
Video Generators are Robot Policies
Training models to generate videos of robot actions produces policies that generalize better to new objects and tasks while using far less demonstration data than standard behavior cloning.
-
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
CoT-VLA is a 7B VLA that generates future visual frames autoregressively as planning goals before actions, outperforming prior VLAs by 17% on real-world tasks and 6% in simulation.
-
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.
-
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
-
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.
-
Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation
Gen2Act enables generalizable robot manipulation for unseen objects and novel motions by using zero-shot human video generation from web data to condition a policy trained on an order of magnitude less robot interaction data.
-
Octo: An Open-Source Generalist Robot Policy
Octo is an open-source transformer-based generalist robot policy pretrained on 800k trajectories that serves as an effective initialization for finetuning across diverse robotic platforms.
-
StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement
StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...
-
From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data
A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.
-
Motus: A Unified Latent Action World Model
Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.
-
World Model for Robot Learning: A Comprehensive Survey
A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...
Reference graph
Works this paper leans on
-
[1]
Anurag Ajay, Yilun Du, Abhi Gupta, Joshua B. Tenenbaum, Tommi S. Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision making? In The Eleventh International Conference on Learning Representations , 2023. URL https:// openreview.net/forum?id=sP1fo2K9DFG
work page 2023
-
[2]
Compositional foundation models for hierarchical planning
Anurag Ajay, Seungwook Han, Yilun Du, Shuang Li, Abhi Gupta, Tommi Jaakkola, Josh Tenenbaum, Leslie Kaelbling, Akash Srivastava, and Pulkit Agrawal. Compositional foundation models for hierarchical planning. arXiv preprint arXiv:2309.08587, 2023
-
[3]
Fitvid: Overfitting in pixel-level video prediction
Mohammad Babaeizadeh, Mohammad Taghi Saffar, Suraj Nair, Sergey Levine, Chelsea Finn, and Dumitru Erhan. Fitvid: Overfitting in pixel-level video prediction. arXiv preprint arXiv:2106.13195, 2021
-
[4]
Robotic offline rl from internet videos via value-function pre-training
Chethan Bhateja, Derek Guo, Dibya Ghosh, Anika Singh, Manan Tomar, Quan Ho Vuong, Yevgen Chebotar, Sergey Levine, and Aviral Kumar. Robotic offline rl from internet videos via value-function pre-training. 2023. URL https://api.semanticscholar.org/ CorpusID:262217278
work page 2023
-
[5]
Introducing ChatGPT and Whisper APIs
Greg Brockman, Atty Eleti, Elie Georges, Joanne Jang, Logan Kilpatrick, Rachel Lim, Luke Miller, and Michelle Pokrass. Introducing ChatGPT and Whisper APIs. OpenAI Blog, 2023. URL https://openai.com/blog/introducing-chatgpt-and-whisper-apis
work page 2023
-
[6]
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[7]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[8]
Do as i can, not as i say: Grounding language in robotic affordances
Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, et al. Do as i can, not as i say: Grounding language in robotic affordances. In Conference on Robot Learning, pp. 287–318. PMLR, 2023
work page 2023
-
[9]
Instructpix2pix: Learning to follow image editing instructions
Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18392–18402, 2023
work page 2023
-
[10]
Decision transformer: Reinforcement learning via sequence modeling
Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. arXiv preprint arXiv:2106.01345, 2021
-
[11]
Genaug: Retargeting behaviors to unseen situations via generative augmentation
Zoey Chen, Sho Kiami, Abhishek Gupta, and Vikash Kumar. Genaug: Retargeting behaviors to unseen situations via generative augmentation. arXiv preprint arXiv:2302.06671, 2023
-
[12]
Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[13]
Scaling Instruction-Finetuned Language Models
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[14]
Open X-Embodiment Collaboration, Abhishek Padalkar, Acorn Pooley, Ajinkya Jain, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anikait Singh, Anthony Brohan, Antonin Raffin, Ayzaan Wahid, Ben Burgess-Limerick, Beomjoon Kim, Bernhard Schölkopf, Brian Ichter, Cewu Lu, Charles Xu, Chelsea Finn, Chenfeng Xu, Cheng Chi, Chenguang Huan...
work page 2023
-
[15]
PaLM-E: An Embodied Multimodal Language Model
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[16]
Learning universal policies via text-guided video generation
Yilun Du, Mengjiao Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. arXiv preprint arXiv:2302.00111, 2023
-
[17]
Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control
Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, and Sergey Levine. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[18]
Policy adaptation from foundation model feedback
Yuying Ge, Annabella Macaluso, Li Erran Li, Ping Luo, and Xiaolong Wang. Policy adaptation from foundation model feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19059–19069, 2023
work page 2023
-
[19]
The "something something" video database for learning and evaluating visual common sense
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pp. 5842–5850, 2017
work page 2017
-
[20]
Ego4d: Around the world in 3,000 hours of egocentric video
Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18995–19012, 2022
work page 2022
-
[21]
Dream to Control: Learning Behaviors by Latent Imagination
Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1912
-
[22]
Mastering Diverse Domains through World Models
Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[23]
IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies
Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. Idql: Implicit q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[24]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[25]
Denoising diffusion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Ad- vances in Neural Information Processing Systems , 33:6840–6851, 2020
work page 2020
-
[26]
Imagen video: High definition video generation with diffusion models
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen video: High definition video generation with diffusion models, 2022
work page 2022
-
[27]
Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv:2204.03458, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[28]
Language models as zero-shot planners: Extracting actionable knowledge for embodied agents
Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. arXiv preprint arXiv:2201.07207, 2022
-
[29]
Inner Monologue: Embodied Reasoning through Planning with Language Models
Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[30]
Bc-z: Zero-shot task generalization with robotic imitation learning
Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pp. 991–1002. PMLR, 2022
work page 2022
-
[31]
Offline reinforcement learning as one big sequence modeling problem
Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. In Advances in Neural Information Processing Systems , 2021
work page 2021
-
[32]
Planning with diffusion for flexible behavior synthesis
Michael Janner, Yilun Du, Joshua Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. In International Conference on Machine Learning , 2022
work page 2022
-
[33]
Mdetr-modulated detection for end-to-end multi-modal understanding
Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1780–1790, 2021
work page 2021
-
[34]
Dall-e-bot: Introducing web-scale diffusion models to robotics
Ivan Kapelyukh, Vitalis Vosylius, and Edward Johns. Dall-e-bot: Introducing web-scale diffusion models to robotics. 2023
work page 2023
-
[35]
Language-driven representation learning for robotics
Siddharth Karamcheti, Suraj Nair, Annie S Chen, Thomas Kollar, Chelsea Finn, Dorsa Sadigh, and Percy Liang. Language-driven representation learning for robotics. arXiv preprint arXiv:2302.12766, 2023
-
[36]
Adam: A method for stochastic optimization
Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015
work page 2015
-
[37]
Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model
Alex X Lee, Anusha Nagabandi, Pieter Abbeel, and Sergey Levine. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. Advances in Neural Information Processing Systems, 33:741–752, 2020
work page 2020
-
[38]
Stochastic primal-dual q-learning
Donghwan Lee and Niao He. Stochastic primal-dual q-learning. arXiv preprint arXiv:1810.08298, 2018
-
[39]
Code as policies: Language model programs for embodied control
Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA) , pp. 9493–9500. IEEE, 2023
work page 2023
-
[40]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[41]
Language conditioned imitation learning over unstructured data
Corey Lynch and Pierre Sermanet. Language conditioned imitation learning over unstructured data. arXiv preprint arXiv:2005.07648, 2020
-
[42]
Liv: Language-image representations and rewards for robotic control
Yecheng Jason Ma, William Liang, Vaidehi Som, Vikash Kumar, Amy Zhang, Osbert Bastani, and Dinesh Jayaraman. Liv: Language-image representations and rewards for robotic control. arXiv preprint arXiv:2306.00958, 2023
-
[43]
Cacti: A framework for scalable multi-task multi-scene visual imitation learning
Zhao Mandi, Homanga Bharadhwaj, Vincent Moens, Shuran Song, Aravind Rajeswaran, and Vikash Kumar. Cacti: A framework for scalable multi-task multi-scene visual imitation learning. arXiv preprint arXiv:2212.05711, 2022
-
[44]
What matters in language conditioned robotic imitation learning over unstructured data
Oier Mees, Lukas Hermann, and Wolfram Burgard. What matters in language conditioned robotic imitation learning over unstructured data. IEEE Robotics and Automation Letters , 7 (4):11205–11212, 2022
work page 2022
-
[45]
Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters (RA-L) , 7(3):7327–7334, 2022
work page 2022
-
[46]
Simple open-vocabulary object detection
Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, et al. Simple open-vocabulary object detection. In European Conference on Computer Vision, pp. 728–755. Springer, 2022
work page 2022
-
[47]
Goal representations for instruction following: A semi-supervised language interface to control
Vivek Myers, Andre He, Kuan Fang, Homer Walke, Philippe Hansen-Estruch, Ching-An Cheng, Mihai Jalobeanu, Andrey Kolobov, Anca Dragan, and Sergey Levine. Goal representations for instruction following: A semi-supervised language interface to control. arXiv preprint arXiv:2307.00117, 2023
-
[48]
Hierarchical foresight: Self-supervised learning of long-horizon tasks via visual subgoal generation
Suraj Nair and Chelsea Finn. Hierarchical foresight: Self-supervised learning of long-horizon tasks via visual subgoal generation. arXiv preprint arXiv:1909.05829, 2019
-
[49]
R3M: A Universal Visual Representation for Robot Manipulation
Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhi Gupta. R3m: A universal visual representation for robot manipulation. ArXiv, abs/2203.12601, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[50]
Learning to augment synthetic images for sim2real policy transfer
Alexander Pashevich, Robin Strudel, Igor Kalevatykh, Ivan Laptev, and Cordelia Schmid. Learning to augment synthetic images for sim2real policy transfer. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2651–2657. IEEE, 2019
work page 2019
-
[51]
FiLM: Visual Reasoning with a General Conditioning Layer, December 2017
Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual Reasoning with a General Conditioning Layer, December 2017
work page 2017
-
[52]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763. PMLR, 2021
work page 2021
-
[53]
Rafael Rafailov, Tianhe Yu, A. Rajeswaran, and Chelsea Finn. Offline reinforcement learning from images with latent space models. Learning for Decision Making and Control (L4DC) , 2021
work page 2021
-
[54]
Reinforcement learning with action-free pre-training from videos
Younggyo Seo, Kimin Lee, Stephen L James, and Pieter Abbeel. Reinforcement learning with action-free pre-training from videos. In International Conference on Machine Learning, pp. 19561–19579. PMLR, 2022
work page 2022
-
[55]
Cliport: What and where pathways for robotic manipulation
Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Cliport: What and where pathways for robotic manipulation. In Conference on Robot Learning, pp. 894–906. PMLR, 2022
work page 2022
-
[56]
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[57]
Open-world object manipulation using pre-trained vision-language models
Austin Stone, Ted Xiao, Yao Lu, Keerthana Gopalakrishnan, Kuang-Huei Lee, Quan Vuong, Paul Wohlhart, Brianna Zitkovich, Fei Xia, Chelsea Finn, et al. Open-world object manipulation using pre-trained vision-language models. arXiv preprint arXiv:2303.00905, 2023
-
[58]
High fidelity video prediction with large stochastic recurrent neural networks
Ruben Villegas, Arkanath Pathak, Harini Kannan, Dumitru Erhan, Quoc V Le, and Honglak Lee. High fidelity video prediction with large stochastic recurrent neural networks. Advances in Neural Information Processing Systems , 32, 2019
work page 2019
-
[59]
Bridgedata v2: A dataset for robot learning at scale
Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning (CoRL) , 2023
work page 2023
-
[60]
Daydreamer: World models for physical robot learning
Philipp Wu, Alejandro Escontrela, Danijar Hafner, Pieter Abbeel, and Ken Goldberg. Daydreamer: World models for physical robot learning. In Conference on Robot Learning, pp. 2226–2240. PMLR, 2023
work page 2023
-
[61]
Multilingual Universal Sentence Encoder for Semantic Retrieval, July 2019
Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, et al. Multilingual Universal Sentence Encoder for Semantic Retrieval, July 2019
work page 2019
-
[62]
Video probabilistic diffusion models in projected latent space
Sihyun Yu, Kihyuk Sohn, Subin Kim, and Jinwoo Shin. Video probabilistic diffusion models in projected latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18456–18466, 2023
work page 2023
-
[63]
Combo: Conservative offline model-based policy optimization
Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. Combo: Conservative offline model-based policy optimization. arXiv preprint arXiv:2102.08363, 2021
-
[64]
Scaling robot learning with semanti- cally imagined experience
Tianhe Yu, Ted Xiao, Austin Stone, Jonathan Tompson, Anthony Brohan, Su Wang, Jaspiar Singh, Clayton Tan, Jodilyn Peralta, Brian Ichter, et al. Scaling robot learning with semanti- cally imagined experience. arXiv preprint arXiv:2302.11550, 2023
-
[65]
Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
discussion (0)