Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation
Recognition: 3 theorem links · Lean Theorem
Pith reviewed 2026-05-13 16:25 UTC · model grok-4.3
The pith
A GPT-style transformer pre-trained on large-scale videos generalizes to multi-task language-conditioned robot manipulation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GR-1 is a unified GPT-style transformer that accepts a language instruction, a sequence of observation images, and robot states, then predicts both future images and robot actions in an end-to-end manner. When pre-trained generatively on a large-scale non-robot video dataset and subsequently fine-tuned on robot trajectories, the model outperforms prior methods on the CALVIN benchmark, lifting overall success from 88.9 percent to 94.9 percent and zero-shot unseen-scene success from 53.3 percent to 85.4 percent. Real-robot experiments likewise show improved generalization to novel scenes and objects.
What carries the argument
GR-1, the GPT-style transformer that jointly predicts robot actions and future images from language and visual state sequences.
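A minimal sketch of the interface such a model exposes, assuming hypothetical module names, feature sizes, and readout strategy (the paper does not publish this snippet, and the real GR-1 differs in detail):

```python
import torch
import torch.nn as nn

class JointActionImagePolicy(nn.Module):
    """Hypothetical sketch of a GR-1-like model: a language instruction, a
    sequence of observation-image features, and robot states go in; an action
    and a future-image latent come out. All names and sizes are illustrative."""

    def __init__(self, d_model=512, n_layers=12, n_heads=8, action_dim=7,
                 lang_dim=768, img_dim=1024):
        super().__init__()
        self.lang_proj = nn.Linear(lang_dim, d_model)   # e.g. frozen text-encoder features
        self.img_proj = nn.Linear(img_dim, d_model)     # e.g. ViT features per frame
        self.state_proj = nn.Linear(action_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # A causal attention mask would be used in practice; omitted here for brevity.
        self.trunk = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, action_dim)  # arm pose + gripper
        self.image_head = nn.Linear(d_model, img_dim)       # future-frame latent

    def forward(self, lang, imgs, states):
        # lang: (B, 1, lang_dim); imgs: (B, T, img_dim); states: (B, T, action_dim)
        tokens = torch.cat(
            [self.lang_proj(lang), self.img_proj(imgs), self.state_proj(states)], dim=1
        )
        h = self.trunk(tokens)
        last = h[:, -1]  # a real model might use dedicated learned readout tokens instead
        return self.action_head(last), self.image_head(last)
```

In the pipeline the paper describes, only the future-image prediction would be supervised during video pre-training; the action head is then trained jointly with it during robot fine-tuning.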
If this is right
- The approach raises success rates on the CALVIN benchmark from 88.9 percent to 94.9 percent across multi-task settings.
- Zero-shot generalization to unseen scenes improves from 53.3 percent to 85.4 percent success.
- Real-robot trials show stronger performance on novel scenes and objects than baselines without video pre-training.
- The flexible architecture permits direct fine-tuning from video pre-training to robot action prediction without architectural changes.
Where Pith is reading between the lines
- Abundant internet video could become a primary data source for acquiring robot manipulation skills, reducing dependence on costly robot-collected trajectories.
- The same pre-training strategy may transfer to other embodied tasks such as navigation or tool use once suitable action heads are added.
- Larger video corpora or longer training could further close the remaining gap between seen and unseen environments.
- Combining the model with existing large language models might enable more open-ended natural-language instruction following in physical settings.
Load-bearing premise
Representations learned from general video data transfer to robot manipulation tasks without a prohibitive domain gap or loss of capability during fine-tuning.
What would settle it
A controlled comparison in which a model trained from scratch on the identical robot dataset matches or exceeds GR-1's success rates on CALVIN and real-robot tests would undermine the claimed benefit of the video pre-training stage.
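A minimal sketch of what that controlled comparison could look like, written as hypothetical experiment configurations (every field name and value here is illustrative, not taken from the paper):

```python
# Two conditions that differ only in the pre-training stage; everything
# downstream (architecture, robot fine-tuning data, evaluation protocol) is held fixed.
ABLATION_CONDITIONS = [
    {
        "name": "gr1_video_pretrained",
        "architecture": "GPT-style, joint action + future-image prediction",
        "pretraining": "large-scale video generative pre-training",
        "finetuning_data": "CALVIN robot trajectories",
        "evaluation": {"benchmark": "CALVIN", "episodes": 1000, "seeds": 3},
    },
    {
        "name": "gr1_from_scratch",
        "architecture": "GPT-style, joint action + future-image prediction",
        "pretraining": None,  # trained from scratch on the identical robot dataset
        "finetuning_data": "CALVIN robot trajectories",
        "evaluation": {"benchmark": "CALVIN", "episodes": 1000, "seeds": 3},
    },
]
```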
Original abstract
Generative pre-trained models have demonstrated remarkable effectiveness in language and vision domains by learning useful representations. In this paper, we extend the scope of this effectiveness by showing that visual robot manipulation can significantly benefit from large-scale video generative pre-training. We introduce GR-1, a straightforward GPT-style model designed for multi-task language-conditioned visual robot manipulation. GR-1 takes as inputs a language instruction, a sequence of observation images, and a sequence of robot states. It predicts robot actions as well as future images in an end-to-end manner. Thanks to a flexible design, GR-1 can be seamlessly finetuned on robot data after pre-trained on a large-scale video dataset. We perform extensive experiments on the challenging CALVIN benchmark and a real robot. On CALVIN benchmark, our method outperforms state-of-the-art baseline methods and improves the success rate from 88.9% to 94.9%. In the setting of zero-shot unseen scene generalization, GR-1 improves the success rate from 53.3% to 85.4%. In real robot experiments, GR-1 also outperforms baseline methods and shows strong potentials in generalization to unseen scenes and objects. We provide inaugural evidence that a unified GPT-style transformer, augmented with large-scale video generative pre-training, exhibits remarkable generalization to multi-task visual robot manipulation. Project page: https://GR1-Manipulation.github.io
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GR-1, a GPT-style transformer for multi-task language-conditioned visual robot manipulation that takes language instructions, observation images, and robot states as input and jointly predicts actions and future images. After large-scale video generative pre-training on non-robot data, the model is fine-tuned end-to-end on robot datasets. It reports success-rate gains on the CALVIN benchmark (88.9% to 94.9%) and zero-shot unseen-scene generalization (53.3% to 85.4%), plus real-robot results showing improved generalization to unseen scenes and objects.
Significance. If the empirical gains are reproducible, the work supplies inaugural evidence that video-scale generative pre-training transfers to robot manipulation without catastrophic forgetting, supporting a unified GPT-style architecture for joint action-image prediction. This could influence future robot learning pipelines by demonstrating that non-robot video data can close part of the domain gap in visual control.
Major comments (2)
- [Abstract / Experiments] The central claim attributes the CALVIN and zero-shot gains specifically to large-scale video pre-training, yet the manuscript provides no ablation that isolates the pre-training stage from the GPT architecture or the joint image-action objective; without this comparison the transfer benefit remains correlational rather than causal.
- [Experiments] Success rates are reported as single point estimates (88.9% → 94.9%, 53.3% → 85.4%) with no mention of variance across random seeds, number of evaluation episodes, or statistical significance tests; this weakens confidence that the observed deltas are robust rather than run-specific.
Minor comments (2)
- [Method] The description of the video pre-training dataset (size, diversity, filtering) is referenced only at high level; adding a table or paragraph with exact statistics would clarify the scale and domain distance to robot data.
- [Method] Notation for the joint prediction loss (action + image) is introduced without an explicit equation; including the combined objective as Eq. (X) would improve reproducibility.
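One plausible form of that combined objective, written in our own notation rather than the manuscript's (the predicted arm action â_t, gripper ĝ_t, future frame ô_{t+Δ}, and weight λ are all assumptions here), is a weighted sum of an action term and a future-image reconstruction term:

```latex
\mathcal{L}_{\text{total}}
  = \underbrace{\mathbb{E}_t\big[\lVert \hat{a}_t - a_t \rVert_1
      + \mathrm{BCE}(\hat{g}_t, g_t)\big]}_{\text{action prediction}}
  + \lambda \,
    \underbrace{\mathbb{E}_t\big[\lVert \hat{o}_{t+\Delta} - o_{t+\Delta} \rVert_2^2\big]}_{\text{future-image prediction}}
```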
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address the two major comments below and will incorporate revisions to strengthen the manuscript's claims and reporting.
Point-by-point responses
- Referee: [Abstract / Experiments] The central claim attributes the CALVIN and zero-shot gains specifically to large-scale video pre-training, yet the manuscript provides no ablation that isolates the pre-training stage from the GPT architecture or the joint image-action objective; without this comparison the transfer benefit remains correlational rather than causal.
Authors: We agree that an explicit ablation isolating the effect of large-scale video pre-training on the identical GR-1 architecture would provide stronger causal evidence. The current comparisons are against prior state-of-the-art methods that lack both the GPT-style joint image-action prediction and video pre-training; however, to directly address this point we will add a controlled ablation in the revised manuscript: training the same GR-1 model from scratch on the robot datasets without the video pre-training stage and reporting the resulting performance drop on CALVIN and zero-shot generalization. This will clarify the incremental benefit attributable to pre-training.
Revision: yes
- Referee: [Experiments] Success rates are reported as single point estimates (88.9% → 94.9%, 53.3% → 85.4%) with no mention of variance across random seeds, number of evaluation episodes, or statistical significance tests; this weakens confidence that the observed deltas are robust rather than run-specific.
Authors: We acknowledge that single-point estimates limit assessment of robustness. The CALVIN benchmark protocol uses 1000 evaluation episodes per task setting; we will explicitly state this number, report results averaged over at least three random seeds with standard deviations, and include these details in both the main text and tables. If the performance deltas remain statistically significant under a paired t-test or similar, we will note this as well.
Revision: yes
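A minimal sketch of the kind of robustness check discussed above, assuming per-episode binary outcomes under the 1000-episode CALVIN protocol and using a two-proportion z-test as a stand-in for whatever test the authors ultimately report:

```python
import math

def two_proportion_z(successes_a, successes_b, n=1000):
    """Two-proportion z-test on CALVIN-style success counts.
    Illustrative only: assumes independent episodes and equal episode counts."""
    p_a, p_b = successes_a / n, successes_b / n
    p_pool = (successes_a + successes_b) / (2 * n)
    se = math.sqrt(p_pool * (1 - p_pool) * (2 / n))
    z = (p_a - p_b) / se
    # Two-sided p-value via the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# E.g. 949 vs. 889 successes out of 1000 episodes (the reported 94.9% vs. 88.9%).
z, p = two_proportion_z(949, 889)
print(f"z = {z:.2f}, p = {p:.2g}")
```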
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper presents empirical results from pre-training a GPT-style transformer on large-scale video data followed by fine-tuning on robot manipulation tasks, with performance gains reported on CALVIN (88.9% to 94.9%) and zero-shot settings (53.3% to 85.4%). No equations or first-principles derivations are invoked that reduce any claimed prediction to fitted parameters or self-referential definitions by construction. The architecture is described as flexible for pre-train then fine-tune, and results are positioned as direct benchmark evidence of transfer rather than a closed theoretical system. No load-bearing self-citations or ansatzes are used to justify core claims; the work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
Cost.FunctionalEquation · washburn_uniqueness_aczel · tag: unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
We introduce GR-1, a straightforward GPT-style model designed for multi-task language-conditioned visual robot manipulation. GR-1 takes as inputs a language instruction, a sequence of observation images, and a sequence of robot states. It predicts robot actions as well as future images in an end-to-end manner.
-
Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · tag: unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
We provide inaugural evidence that a unified GPT-style transformer, augmented with large-scale video generative pre-training, exhibits remarkable generalization to multi-task visual robot manipulation.
-
Foundation.DimensionForcing · dimension_forced · tag: unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
On CALVIN benchmark, our method outperforms state-of-the-art baseline methods and improves the success rate from 88.9% to 94.9%.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 28 Pith papers
-
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
-
Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models
MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.
-
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
-
NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models
NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
-
Being-H0.7: A Latent World-Action Model from Egocentric Videos
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
-
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
-
RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data
A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
-
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
-
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
-
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
-
GazeVLA: Learning Human Intention for Robotic Manipulation
GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
-
CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors
CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.
-
Human Cognition in Machines: A Unified Perspective of World Models
The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...
-
VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis
VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
-
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
-
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...
-
Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...
-
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.
-
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
-
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...
-
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.
-
StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement
StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...
-
M100: An Orchestrated Dataflow Architecture Powering General AI Computing
M100 is a tensor-based dataflow architecture that eliminates heavy caching through compiler-managed data streams, claiming higher utilization and better performance than GPGPUs for AD and LLM inference tasks.
-
R3D: Revisiting 3D Policy Learning
A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.
-
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.
-
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
HiVLA decouples VLM-based semantic planning from DiT-based motor control via structured plans and cascaded cross-attention to outperform end-to-end VLA baselines in long-horizon and fine-grained manipulation.
-
ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation
Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.
-
WorldVLA: Towards Autoregressive Action World Model
WorldVLA unifies VLA and world models in one autoregressive system, shows they boost each other, and adds an attention mask to stop error buildup when generating action chunks.
Reference graph
Works this paper leans on
-
[1]
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. Do As I Can, Not As I Say : Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[2]
Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos
Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos. Advances in Neural Information Processing Systems, 35: 24639--24654, 2022
work page 2022
-
[3]
Robotic Offline RL from Internet Videos via Value-Function Pre-Training
Chethan Bhateja, Derek Guo, Dibya Ghosh, Anikait Singh, Manan Tomar, Quan Vuong, Yevgen Chebotar, Sergey Levine, and Aviral Kumar. Robotic Offline RL from Internet Videos via Value-Function Pre-Training . arXiv preprint arXiv:2309.13041, 2023
-
[4]
RoboCat : A self-improving foundation agent for robotic manipulation
Konstantinos Bousmalis, Giulia Vezzani, Dushyant Rao, Coline Devin, Alex X Lee, Maria Bauza, Todor Davchev, Yuxiang Zhou, Agrim Gupta, Akhil Raju, et al. RoboCat : A self-improving foundation agent for robotic manipulation. arXiv preprint arXiv:2306.11706, 2023
-
[5]
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1 : Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[6]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2 : Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[7]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, pp.\ 1877--1901, 2020
work page 1901
-
[8]
Decision Transformer: Reinforcement learning via sequence modeling
Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision Transformer: Reinforcement learning via sequence modeling . Advances in Neural Information Processing Systems, 2021
work page 2021
-
[9]
Generative pretraining from pixels
Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning, pp.\ 1691--1703. PMLR, 2020
work page 2020
-
[10]
Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion Policy : Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
PaLM-E: An Embodied Multimodal Language Model
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E : An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[14]
Deep visual foresight for planning robot motion
Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp.\ 2786--2793. IEEE, 2017
work page 2017
-
[15]
Ego4D : Around the world in 3,000 hours of egocentric video
Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D : Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 18995--19012, 2022
work page 2022
-
[16]
Instruction-driven history-aware policies for robotic manipulations
Pierre-Louis Guhur, Shizhe Chen, Ricardo Garcia Pinel, Makarand Tapaswi, Ivan Laptev, and Cordelia Schmid. Instruction-driven history-aware policies for robotic manipulations. In Conference on Robot Learning, pp.\ 175--187. PMLR, 2023
work page 2023
-
[20]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000--16009, 2022
work page 2022
-
[21]
Inner Monologue: Embodied Reasoning through Planning with Language Models
Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner Monologue : Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[22]
Perceiver: General perception with iterative attention
Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. In International Conference on Machine Learning, pp.\ 4651--4664. PMLR, 2021
work page 2021
-
[23]
BC-Z : Zero-shot task generalization with robotic imitation learning
Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. BC-Z : Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, 2022
work page 2022
-
[24]
VIMA : General robot manipulation with multimodal prompts
Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. VIMA : General robot manipulation with multimodal prompts. arXiv preprint arXiv:2210.03094, 2022
-
[25]
Exploring visual pre-training for robot manipulation: Datasets, models and methods
Ya Jing, Xuelin Zhu, Xingbin Liu, Qie Sima, Taozheng Yang, Yunhai Feng, and Tao Kong. Exploring visual pre-training for robot manipulation: Datasets, models and methods. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2023
work page 2023
-
[27]
Pre-training for robots: Offline RL enables learning new tasks from a handful of trials
Aviral Kumar, Anikait Singh, Frederik Ebert, Yanlai Yang, Chelsea Finn, and Sergey Levine. Pre-training for robots: Offline RL enables learning new tasks from a handful of trials. arXiv preprint arXiv:2210.05178, 2022
-
[28]
CURL : Contrastive unsupervised representations for reinforcement learning
Michael Laskin, Aravind Srinivas, and Pieter Abbeel. CURL : Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning, pp.\ 5639--5650. PMLR, 2020
work page 2020
-
[34]
Interactive Language : Talking to robots in real time
Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive Language : Talking to robots in real time. arXiv preprint arXiv:2210.06407, 2022
-
[37]
What matters in language conditioned robotic imitation learning over unstructured data
Oier Mees, Lukas Hermann, and Wolfram Burgard. What matters in language conditioned robotic imitation learning over unstructured data. IEEE Robotics and Automation Letters, 7(4): 11205--11212, 2022b
work page 2022
-
[38]
CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters (RA-L), 7(3): 7327--7334, 2022c
work page 2022
-
[40]
R3M : A universal visual representation for robot manipulation
Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3M : A universal visual representation for robot manipulation. In 6th Annual Conference on Robot Learning, 2022
work page 2022
-
[41]
GPT-4 Technical Report
OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023
work page 2023
-
[42]
The unsurprising effectiveness of pre-trained vision models for control
Simone Parisi, Aravind Rajeswaran, Senthil Purushwalkam, and Abhinav Gupta. The unsurprising effectiveness of pre-trained vision models for control. In International Conference on Machine Learning, pp.\ 17359--17371. PMLR, 2022
work page 2022
-
[43]
Improving language understanding by generative pre-training
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018
work page 2018
-
[44]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp.\ 8748--8763, 2021
work page 2021
-
[45]
Real-world robot learning with masked visual pre-training
Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, and Trevor Darrell. Real-world robot learning with masked visual pre-training. In Conference on Robot Learning, 2022
work page 2022
-
[48]
Pretraining representations for data-efficient reinforcement learning
Max Schwarzer, Nitarshan Rajkumar, Michael Noukhovitch, Ankesh Anand, Laurent Charlin, R Devon Hjelm, Philip Bachman, and Aaron C Courville. Pretraining representations for data-efficient reinforcement learning. Advances in Neural Information Processing Systems, pp.\ 12686--12699, 2021
work page 2021
-
[49]
Masked world models for visual control
Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, and Pieter Abbeel. Masked world models for visual control. In Conference on Robot Learning, pp.\ 1332--1344. PMLR, 2023
work page 2023
-
[50]
Time-contrastive networks: Self-supervised learning from video
Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google Brain. Time-contrastive networks: Self-supervised learning from video. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp.\ 1134--1141. IEEE, 2018
work page 2018
-
[51]
Behavior Transformers : Cloning k modes with one stone
Nur Muhammad Shafiullah, Zichen Cui, Ariuntuya Arty Altanzaya, and Lerrel Pinto. Behavior Transformers : Cloning k modes with one stone. Advances in Neural Information Processing Systems, pp.\ 22955--22968, 2022
work page 2022
-
[52]
LM-Nav : Robotic navigation with large pre-trained models of language, vision, and action
Dhruv Shah, Błażej Osiński, Sergey Levine, et al. LM-Nav: Robotic navigation with large pre-trained models of language, vision, and action. In Conference on Robot Learning, pp. 492--504. PMLR, 2023
work page 2023
-
[53]
CLIPort : What and where pathways for robotic manipulation
Mohit Shridhar, Lucas Manuelli, and Dieter Fox. CLIPort : What and where pathways for robotic manipulation. In Conference on Robot Learning, pp.\ 894--906, 2022
work page 2022
-
[54]
Perceiver-Actor : A multi-task transformer for robotic manipulation
Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-Actor : A multi-task transformer for robotic manipulation. In Conference on Robot Learning, pp.\ 785--799, 2023
work page 2023
-
[55]
SMART : Self-supervised multi-task pretraining with control transformers
Yanchao Sun, Shuang Ma, Ratnesh Madaan, Rogerio Bonatti, Furong Huang, and Ashish Kapoor. SMART : Self-supervised multi-task pretraining with control transformers. arXiv preprint arXiv:2301.09816, 2023
-
[56]
PLEX : Making the most of the available data for robotic manipulation pretraining
Garrett Thomas, Ching-An Cheng, Ricky Loynd, Vibhav Vineet, Mihai Jalobeanu, and Andrey Kolobov. PLEX : Making the most of the available data for robotic manipulation pretraining. arXiv preprint arXiv:2303.08789, 2023
-
[58]
Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 2017
work page 2017
-
[60]
VideoGPT: Video Generation using VQ-VAE and Transformers
Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. VideoGPT : Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021
work page internal anchor Pith review arXiv 2021
-
[61]
Temporally consistent transformers for video generation, 2023
Wilson Yan, Danijar Hafner, Stephen James, and Pieter Abbeel. Temporally consistent transformers for video generation, 2023
work page 2023
-
[62]
Learning to see before learning to act: Visual pre-training for manipulation
Lin Yen-Chen, Andy Zeng, Shuran Song, Phillip Isola, and Tsung-Yi Lin. Learning to see before learning to act: Visual pre-training for manipulation. In IEEE International Conference on Robotics and Automation (ICRA), pp.\ 7286--7293. IEEE, 2020
work page 2020
-
[63]
Pomerleau, Dean A.
-
[64]
Attention Is All You Need
Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, Łukasz, and others. Advances in Neural Information Processing Systems
-
[65]
Decision Transformer: Reinforcement Learning via Sequence Modeling
Chen, Lili, Lu, Kevin, Rajeswaran, Aravind, Lee, Kimin, Grover, Aditya, Laskin, Misha, Abbeel, Pieter, Srinivas, Aravind, and Mordatch, Igor
-
[66]
Learning Transferable Visual Models from Natural Language Supervision
Learning transferable visual models from natural language supervision. In International Conference on Machine Learning
-
[67]
Alayrac, Jean-Baptiste, Donahue, Jeff, Luc, Pauline, Miech, Antoine, Barr, Iain, Hasson, Yana, Lenc, Karel, Mensch, Arthur, Millican, Katherine, Reynolds, Malcolm, and others
-
[68]
CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. IEEE Robotics and Automation Letters (RA-L)
-
[69]
Decoupled Weight Decay Regularization
Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101
work page internal anchor Pith review Pith/arXiv arXiv
-
[70]
Keyframe-based Learning from Demonstration: Method and Evaluation
Akgun, Baris, Cakmak, Maya, Jiang, Karl, and Thomaz, Andrea. Keyframe-based learning from demonstration: Method and evaluation. International Journal of Social Robotics
-
[71]
Robot See, Robot Do: An Overview of Robot Imitation
Bakker, Paul and Kuniyoshi, Yasuo. Robot see, robot do: An overview of robot imitation
-
[72]
Behavioral Cloning from Observation
Behavioral cloning from observation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence
-
[73]
Real-World Robot Learning with Masked Visual Pre-training
Real-world robot learning with masked visual pre-training. In Conference on Robot Learning
-
[74]
Visual Imitation Made Easy
Visual imitation made easy. In Conference on Robot Learning
-
[75]
BC-Z: Zero-shot Task Generalization with Robotic Imitation Learning
Jang, Eric, Irpan, Alex, Khansari, Mohi, Kappler, Daniel, Ebert, Frederik, Lynch, Corey, Levine, Sergey, and Finn, Chelsea
-
[76]
SMART: Self-supervised Multi-task Pretraining with Control Transformers
Sun, Yanchao, Ma, Shuang, Madaan, Ratnesh, Bonatti, Rogerio, Huang, Furong, and Kapoor, Ashish
-
[77]
R3M: A Universal Visual Representation for Robot Manipulation
Nair, Suraj, Rajeswaran, Aravind, Kumar, Vikash, Finn, Chelsea, and Gupta, Abhinav
-
[78]
Generalized Decision Transformer for Offline Hindsight Information Matching
Generalized decision transformer for offline hindsight information matching. arXiv preprint arXiv:2111.10364
-
[79]
Online Decision Transformer
Online decision transformer. In International Conference on Machine Learning
-
[80]
A Generalist Agent
A generalist agent. arXiv preprint arXiv:2205.06175
work page internal anchor Pith review Pith/arXiv arXiv
-
[81]
Masked Autoencoders Are Scalable Vision Learners
Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
-
[82]
Shridhar, Mohit, Manuelli, Lucas, and Fox, Dieter
-
[83]
RT-1: Robotics Transformer for Real-World Control at Scale
Brohan, Anthony, Brown, Noah, Carbajal, Justice, Chebotar, Yevgen, Dabis, Joseph, Finn, Chelsea, Gopalakrishnan, Keerthana, Hausman, Karol, Herzog, Alex, Hsu, Jasmine, and others
-
[84]
Improving Language Understanding by Generative Pre-training
Improving language understanding by generative pre-training. 2018
work page 2018
-
[85]
Chain-of-Thought Predictive Control
Chain-of-Thought Predictive Control. arXiv preprint arXiv:2304.00776
-
[86]
Behavior Transformers: Cloning k Modes with One Stone
Shafiullah, Nur Muhammad, Cui, Zichen, Altanzaya, Ariuntuya Arty, and Pinto, Lerrel
-
[87]
Language Models Are Few-Shot Learners
Language models are few-shot learners. Advances in Neural Information Processing Systems
-
[88]
Language Models Are Unsupervised Multitask Learners
Language models are unsupervised multitask learners. OpenAI blog
-
[89]
PaLM-E: An Embodied Multimodal Language Model
Driess, Danny, Xia, Fei, Sajjadi, Mehdi SM, Lynch, Corey, Chowdhery, Aakanksha, Ichter, Brian, Wahid, Ayzaan, Tompson, Jonathan, Vuong, Quan, Yu, Tianhe, and others
-
[90]
Language Conditioned Imitation Learning over Unstructured Data
Language conditioned imitation learning over unstructured data. arXiv preprint arXiv:2005.07648
-
[91]
What Matters in Language Conditioned Robotic Imitation Learning over Unstructured Data
What matters in language conditioned robotic imitation learning over unstructured data. IEEE Robotics and Automation Letters, 2022
work page 2022
-
[92]
Grounding Language with Visual Affordances over Unstructured Data
Grounding language with visual affordances over unstructured data. arXiv preprint arXiv:2210.01911
-
[93]
Zhang, Edwin, Lu, Yujie, Wang, William, and Zhang, Amy
-
[94]
Instruction-driven History-aware Policies for Robotic Manipulations
Instruction-driven history-aware policies for robotic manipulations. In Conference on Robot Learning, 2023
work page 2023
-
[95]
Shao, Lin, Migimatsu, Toki, Zhang, Qiang, Yang, Karen, and Bohg, Jeannette. 2021
work page 2021
-
[96]
Zeng, Andy, Florence, Pete, Tompson, Jonathan, Welker, Stefan, Chien, Jonathan, Attarian, Maria, Armstrong, Travis, Krasin, Ivan, Duong, Dan, Sindhwani, Vikas, and others. 2021
work page 2021
-
[97]
Language-conditioned Imitation Learning for Robot Manipulation Tasks
Language-conditioned imitation learning for robot manipulation tasks. Advances in Neural Information Processing Systems
-
[98]
Inner Monologue: Embodied Reasoning through Planning with Language Models
Huang, Wenlong, Xia, Fei, Xiao, Ted, Chan, Harris, Liang, Jacky, Florence, Pete, Zeng, Andy, Tompson, Jonathan, Mordatch, Igor, Chebotar, Yevgen, and others