Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation
Recognition: 3 theorem links · Lean Theorem
Pith reviewed 2026-05-13 16:25 UTC · model grok-4.3
The pith
A GPT-style transformer pre-trained on large-scale videos generalizes to multi-task language-conditioned robot manipulation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GR-1 is a unified GPT-style transformer that accepts a language instruction, a sequence of observation images, and robot states, then predicts both future images and robot actions in an end-to-end manner. When pre-trained generatively on a large-scale non-robot video dataset and subsequently fine-tuned on robot trajectories, the model outperforms prior methods on the CALVIN benchmark, lifting overall success from 88.9 percent to 94.9 percent and zero-shot unseen-scene success from 53.3 percent to 85.4 percent. Real-robot experiments likewise show improved generalization to novel scenes and objects.
What carries the argument
GR-1, the GPT-style transformer that jointly predicts robot actions and future images from language and visual state sequences.
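A minimal sketch of the interface such a model exposes, assuming hypothetical module names, feature sizes, and readout strategy (the paper does not publish this snippet, and the real GR-1 differs in detail):

```python
import torch
import torch.nn as nn

class JointActionImagePolicy(nn.Module):
    """Hypothetical sketch of a GR-1-like model: a language instruction, a
    sequence of observation-image features, and robot states go in; an action
    and a future-image latent come out. All names and sizes are illustrative."""

    def __init__(self, d_model=512, n_layers=12, n_heads=8, action_dim=7,
                 lang_dim=768, img_dim=1024):
        super().__init__()
        self.lang_proj = nn.Linear(lang_dim, d_model)   # e.g. frozen text-encoder features
        self.img_proj = nn.Linear(img_dim, d_model)     # e.g. ViT features per frame
        self.state_proj = nn.Linear(action_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # A causal attention mask would be used in practice; omitted here for brevity.
        self.trunk = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, action_dim)  # arm pose + gripper
        self.image_head = nn.Linear(d_model, img_dim)       # future-frame latent

    def forward(self, lang, imgs, states):
        # lang: (B, 1, lang_dim); imgs: (B, T, img_dim); states: (B, T, action_dim)
        tokens = torch.cat(
            [self.lang_proj(lang), self.img_proj(imgs), self.state_proj(states)], dim=1
        )
        h = self.trunk(tokens)
        last = h[:, -1]  # a real model might use dedicated learned readout tokens instead
        return self.action_head(last), self.image_head(last)
```

In the pipeline the paper describes, only the future-image prediction would be supervised during video pre-training; the action head is then trained jointly with it during robot fine-tuning.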
If this is right
- The approach raises success rates on the CALVIN benchmark from 88.9 percent to 94.9 percent across multi-task settings.
- Zero-shot generalization to unseen scenes improves from 53.3 percent to 85.4 percent success.
- Real-robot trials show stronger performance on novel scenes and objects than baselines without video pre-training.
- The flexible architecture permits direct fine-tuning from video pre-training to robot action prediction without architectural changes.
Where Pith is reading between the lines
- Abundant internet video could become a primary data source for acquiring robot manipulation skills, reducing dependence on costly robot-collected trajectories.
- The same pre-training strategy may transfer to other embodied tasks such as navigation or tool use once suitable action heads are added.
- Larger video corpora or longer training could further close the remaining gap between seen and unseen environments.
- Combining the model with existing large language models might enable more open-ended natural-language instruction following in physical settings.
Load-bearing premise
Representations learned from general video data transfer to robot manipulation tasks without a prohibitive domain gap or loss of capability during fine-tuning.
What would settle it
A controlled comparison in which a model trained from scratch on the identical robot dataset matches or exceeds GR-1's success rates on CALVIN and real-robot tests would undermine the claimed benefit of the video pre-training stage.
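A minimal sketch of what that controlled comparison could look like, written as hypothetical experiment configurations (every field name and value here is illustrative, not taken from the paper):

```python
# Two conditions that differ only in the pre-training stage; everything
# downstream (architecture, robot fine-tuning data, evaluation protocol) is held fixed.
ABLATION_CONDITIONS = [
    {
        "name": "gr1_video_pretrained",
        "architecture": "GPT-style, joint action + future-image prediction",
        "pretraining": "large-scale video generative pre-training",
        "finetuning_data": "CALVIN robot trajectories",
        "evaluation": {"benchmark": "CALVIN", "episodes": 1000, "seeds": 3},
    },
    {
        "name": "gr1_from_scratch",
        "architecture": "GPT-style, joint action + future-image prediction",
        "pretraining": None,  # trained from scratch on the identical robot dataset
        "finetuning_data": "CALVIN robot trajectories",
        "evaluation": {"benchmark": "CALVIN", "episodes": 1000, "seeds": 3},
    },
]
```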
Original abstract
Generative pre-trained models have demonstrated remarkable effectiveness in language and vision domains by learning useful representations. In this paper, we extend the scope of this effectiveness by showing that visual robot manipulation can significantly benefit from large-scale video generative pre-training. We introduce GR-1, a straightforward GPT-style model designed for multi-task language-conditioned visual robot manipulation. GR-1 takes as inputs a language instruction, a sequence of observation images, and a sequence of robot states. It predicts robot actions as well as future images in an end-to-end manner. Thanks to a flexible design, GR-1 can be seamlessly finetuned on robot data after pre-trained on a large-scale video dataset. We perform extensive experiments on the challenging CALVIN benchmark and a real robot. On CALVIN benchmark, our method outperforms state-of-the-art baseline methods and improves the success rate from 88.9% to 94.9%. In the setting of zero-shot unseen scene generalization, GR-1 improves the success rate from 53.3% to 85.4%. In real robot experiments, GR-1 also outperforms baseline methods and shows strong potentials in generalization to unseen scenes and objects. We provide inaugural evidence that a unified GPT-style transformer, augmented with large-scale video generative pre-training, exhibits remarkable generalization to multi-task visual robot manipulation. Project page: https://GR1-Manipulation.github.io
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GR-1, a GPT-style transformer for multi-task language-conditioned visual robot manipulation that takes language instructions, observation images, and robot states as input and jointly predicts actions and future images. After large-scale video generative pre-training on non-robot data, the model is fine-tuned end-to-end on robot datasets. It reports success-rate gains on the CALVIN benchmark (88.9% to 94.9%) and zero-shot unseen-scene generalization (53.3% to 85.4%), plus real-robot results showing improved generalization to unseen scenes and objects.
Significance. If the empirical gains are reproducible, the work supplies inaugural evidence that video-scale generative pre-training transfers to robot manipulation without catastrophic forgetting, supporting a unified GPT-style architecture for joint action-image prediction. This could influence future robot learning pipelines by demonstrating that non-robot video data can close part of the domain gap in visual control.
Major comments (2)
- [Abstract / Experiments] The central claim attributes the CALVIN and zero-shot gains specifically to large-scale video pre-training, yet the manuscript provides no ablation that isolates the pre-training stage from the GPT architecture or the joint image-action objective; without this comparison the transfer benefit remains correlational rather than causal.
- [Experiments] Success rates are reported as single point estimates (88.9% → 94.9%, 53.3% → 85.4%) with no mention of variance across random seeds, number of evaluation episodes, or statistical significance tests; this weakens confidence that the observed deltas are robust rather than run-specific.
Minor comments (2)
- [Method] The description of the video pre-training dataset (size, diversity, filtering) is referenced only at high level; adding a table or paragraph with exact statistics would clarify the scale and domain distance to robot data.
- [Method] Notation for the joint prediction loss (action + image) is introduced without an explicit equation; including the combined objective as Eq. (X) would improve reproducibility.
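One plausible form of that combined objective, written in our own notation rather than the manuscript's (the predicted arm action â_t, gripper ĝ_t, future frame ô_{t+Δ}, and weight λ are all assumptions here), is a weighted sum of an action term and a future-image reconstruction term:

```latex
\mathcal{L}_{\text{total}}
  = \underbrace{\mathbb{E}_t\big[\lVert \hat{a}_t - a_t \rVert_1
      + \mathrm{BCE}(\hat{g}_t, g_t)\big]}_{\text{action prediction}}
  + \lambda \,
    \underbrace{\mathbb{E}_t\big[\lVert \hat{o}_{t+\Delta} - o_{t+\Delta} \rVert_2^2\big]}_{\text{future-image prediction}}
```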
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address the two major comments below and will incorporate revisions to strengthen the manuscript's claims and reporting.
Point-by-point responses
- Referee: [Abstract / Experiments] The central claim attributes the CALVIN and zero-shot gains specifically to large-scale video pre-training, yet the manuscript provides no ablation that isolates the pre-training stage from the GPT architecture or the joint image-action objective; without this comparison the transfer benefit remains correlational rather than causal.
Authors: We agree that an explicit ablation isolating the effect of large-scale video pre-training on the identical GR-1 architecture would provide stronger causal evidence. The current comparisons are against prior state-of-the-art methods that lack both the GPT-style joint image-action prediction and video pre-training; however, to directly address this point we will add a controlled ablation in the revised manuscript: training the same GR-1 model from scratch on the robot datasets without the video pre-training stage and reporting the resulting performance drop on CALVIN and zero-shot generalization. This will clarify the incremental benefit attributable to pre-training.
Revision: yes
- Referee: [Experiments] Success rates are reported as single point estimates (88.9% → 94.9%, 53.3% → 85.4%) with no mention of variance across random seeds, number of evaluation episodes, or statistical significance tests; this weakens confidence that the observed deltas are robust rather than run-specific.
Authors: We acknowledge that single-point estimates limit assessment of robustness. The CALVIN benchmark protocol uses 1000 evaluation episodes per task setting; we will explicitly state this number, report results averaged over at least three random seeds with standard deviations, and include these details in both the main text and tables. If the performance deltas remain statistically significant under a paired t-test or similar, we will note this as well.
Revision: yes
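A minimal sketch of the kind of robustness check discussed above, assuming per-episode binary outcomes under the 1000-episode CALVIN protocol and using a two-proportion z-test as a stand-in for whatever test the authors ultimately report:

```python
import math

def two_proportion_z(successes_a, successes_b, n=1000):
    """Two-proportion z-test on CALVIN-style success counts.
    Illustrative only: assumes independent episodes and equal episode counts."""
    p_a, p_b = successes_a / n, successes_b / n
    p_pool = (successes_a + successes_b) / (2 * n)
    se = math.sqrt(p_pool * (1 - p_pool) * (2 / n))
    z = (p_a - p_b) / se
    # Two-sided p-value via the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# E.g. 949 vs. 889 successes out of 1000 episodes (the reported 94.9% vs. 88.9%).
z, p = two_proportion_z(949, 889)
print(f"z = {z:.2f}, p = {p:.2g}")
```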
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper presents empirical results from pre-training a GPT-style transformer on large-scale video data followed by fine-tuning on robot manipulation tasks, with performance gains reported on CALVIN (88.9% to 94.9%) and zero-shot settings (53.3% to 85.4%). No equations or first-principles derivations are invoked that reduce any claimed prediction to fitted parameters or self-referential definitions by construction. The architecture is described as flexible for pre-train then fine-tune, and results are positioned as direct benchmark evidence of transfer rather than a closed theoretical system. No load-bearing self-citations or ansatzes are used to justify core claims; the work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
Cost.FunctionalEquation · washburn_uniqueness_aczel · tag: unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
We introduce GR-1, a straightforward GPT-style model designed for multi-task language-conditioned visual robot manipulation. GR-1 takes as inputs a language instruction, a sequence of observation images, and a sequence of robot states. It predicts robot actions as well as future images in an end-to-end manner.
-
Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · tag: unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
We provide inaugural evidence that a unified GPT-style transformer, augmented with large-scale video generative pre-training, exhibits remarkable generalization to multi-task visual robot manipulation.
-
Foundation.DimensionForcing · dimension_forced · tag: unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
On CALVIN benchmark, our method outperforms state-of-the-art baseline methods and improves the success rate from 88.9% to 94.9%.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 28 Pith papers
-
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
-
Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models
MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.
-
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
-
NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models
NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
-
Being-H0.7: A Latent World-Action Model from Egocentric Videos
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
-
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
-
RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data
A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
-
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
-
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
-
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
-
GazeVLA: Learning Human Intention for Robotic Manipulation
GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
-
CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors
CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.
-
Human Cognition in Machines: A Unified Perspective of World Models
The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...
-
VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis
VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
-
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
-
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...
-
Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...
-
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.
-
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
-
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...
-
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.
-
StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement
StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...
-
M100: An Orchestrated Dataflow Architecture Powering General AI Computing
M100 is a tensor-based dataflow architecture that eliminates heavy caching through compiler-managed data streams, claiming higher utilization and better performance than GPGPUs for AD and LLM inference tasks.
-
R3D: Revisiting 3D Policy Learning
A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.
-
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.
-
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
HiVLA decouples VLM-based semantic planning from DiT-based motor control via structured plans and cascaded cross-attention to outperform end-to-end VLA baselines in long-horizon and fine-grained manipulation.
-
ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation
Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.
-
WorldVLA: Towards Autoregressive Action World Model
WorldVLA unifies VLA and world models in one autoregressive system, shows they boost each other, and adds an attention mask to stop error buildup when generating action chunks.
Reference graph
Works this paper leans on
-
[1]
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. Do As I Can, Not As I Say : Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[2]
Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos
Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos. Advances in Neural Information Processing Systems, 35: 24639--24654, 2022
work page 2022
-
[3]
Robotic Offline RL from Internet Videos via Value-Function Pre-Training
Chethan Bhateja, Derek Guo, Dibya Ghosh, Anikait Singh, Manan Tomar, Quan Vuong, Yevgen Chebotar, Sergey Levine, and Aviral Kumar. Robotic Offline RL from Internet Videos via Value-Function Pre-Training . arXiv preprint arXiv:2309.13041, 2023
-
[4]
RoboCat : A self-improving foundation agent for robotic manipulation
Konstantinos Bousmalis, Giulia Vezzani, Dushyant Rao, Coline Devin, Alex X Lee, Maria Bauza, Todor Davchev, Yuxiang Zhou, Agrim Gupta, Akhil Raju, et al. RoboCat : A self-improving foundation agent for robotic manipulation. arXiv preprint arXiv:2306.11706, 2023
-
[5]
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1 : Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[6]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2 : Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[7]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, pp.\ 1877--1901, 2020
work page 1901
-
[8]
Decision Transformer: Reinforcement learning via sequence modeling
Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision Transformer: Reinforcement learning via sequence modeling . Advances in Neural Information Processing Systems, 2021
work page 2021
-
[9]
Generative pretraining from pixels
Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning, pp.\ 1691--1703. PMLR, 2020
work page 2020
-
[10]
Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion Policy : Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
PaLM-E: An Embodied Multimodal Language Model
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E : An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[14]
Deep visual foresight for planning robot motion
Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp.\ 2786--2793. IEEE, 2017
work page 2017
-
[15]
Ego4D : Around the world in 3,000 hours of egocentric video
Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D : Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 18995--19012, 2022
work page 2022
-
[16]
Instruction-driven history-aware policies for robotic manipulations
Pierre-Louis Guhur, Shizhe Chen, Ricardo Garcia Pinel, Makarand Tapaswi, Ivan Laptev, and Cordelia Schmid. Instruction-driven history-aware policies for robotic manipulations. In Conference on Robot Learning, pp.\ 175--187. PMLR, 2023
work page 2023
-
[20]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000--16009, 2022
work page 2022
-
[21]
Inner Monologue: Embodied Reasoning through Planning with Language Models
Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner Monologue : Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[22]
Perceiver: General perception with iterative attention
Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. In International Conference on Machine Learning, pp.\ 4651--4664. PMLR, 2021
work page 2021
-
[23]
BC-Z : Zero-shot task generalization with robotic imitation learning
Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. BC-Z : Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, 2022
work page 2022
-
[24]
VIMA : General robot manipulation with multimodal prompts
Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. VIMA : General robot manipulation with multimodal prompts. arXiv preprint arXiv:2210.03094, 2022
-
[25]
Exploring visual pre-training for robot manipulation: Datasets, models and methods
Ya Jing, Xuelin Zhu, Xingbin Liu, Qie Sima, Taozheng Yang, Yunhai Feng, and Tao Kong. Exploring visual pre-training for robot manipulation: Datasets, models and methods. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2023
work page 2023
-
[27]
Pre-training for robots: Offline RL enables learning new tasks from a handful of trials
Aviral Kumar, Anikait Singh, Frederik Ebert, Yanlai Yang, Chelsea Finn, and Sergey Levine. Pre-training for robots: Offline RL enables learning new tasks from a handful of trials. arXiv preprint arXiv:2210.05178, 2022
-
[28]
CURL : Contrastive unsupervised representations for reinforcement learning
Michael Laskin, Aravind Srinivas, and Pieter Abbeel. CURL : Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning, pp.\ 5639--5650. PMLR, 2020
work page 2020
-
[34]
Interactive Language : Talking to robots in real time
Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive Language : Talking to robots in real time. arXiv preprint arXiv:2210.06407, 2022
-
[37]
What matters in language conditioned robotic imitation learning over unstructured data
Oier Mees, Lukas Hermann, and Wolfram Burgard. What matters in language conditioned robotic imitation learning over unstructured data. IEEE Robotics and Automation Letters, 7(4): 11205--11212, 2022b
work page 2022
-
[38]
CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters (RA-L), 7(3): 7327--7334, 2022c
work page 2022
-
[40]
R3M : A universal visual representation for robot manipulation
Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3M : A universal visual representation for robot manipulation. In 6th Annual Conference on Robot Learning, 2022
work page 2022
-
[41]
GPT-4 Technical Report
OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023
work page 2023
-
[42]
The unsurprising effectiveness of pre-trained vision models for control
Simone Parisi, Aravind Rajeswaran, Senthil Purushwalkam, and Abhinav Gupta. The unsurprising effectiveness of pre-trained vision models for control. In International Conference on Machine Learning, pp.\ 17359--17371. PMLR, 2022
work page 2022
-
[43]
Improving language understanding by generative pre-training
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018
work page 2018
-
[44]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp.\ 8748--8763, 2021
work page 2021
-
[45]
Real-world robot learning with masked visual pre-training
Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, and Trevor Darrell. Real-world robot learning with masked visual pre-training. In Conference on Robot Learning, 2022
work page 2022
-
[48]
Pretraining representations for data-efficient reinforcement learning
Max Schwarzer, Nitarshan Rajkumar, Michael Noukhovitch, Ankesh Anand, Laurent Charlin, R Devon Hjelm, Philip Bachman, and Aaron C Courville. Pretraining representations for data-efficient reinforcement learning. Advances in Neural Information Processing Systems, pp.\ 12686--12699, 2021
work page 2021
-
[49]
Masked world models for visual control
Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, and Pieter Abbeel. Masked world models for visual control. In Conference on Robot Learning, pp.\ 1332--1344. PMLR, 2023
work page 2023
-
[50]
Time-contrastive networks: Self-supervised learning from video
Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google Brain. Time-contrastive networks: Self-supervised learning from video. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp.\ 1134--1141. IEEE, 2018
work page 2018
-
[51]
Behavior Transformers : Cloning k modes with one stone
Nur Muhammad Shafiullah, Zichen Cui, Ariuntuya Arty Altanzaya, and Lerrel Pinto. Behavior Transformers : Cloning k modes with one stone. Advances in Neural Information Processing Systems, pp.\ 22955--22968, 2022
work page 2022
-
[52]
LM-Nav : Robotic navigation with large pre-trained models of language, vision, and action
Dhruv Shah, Błażej Osiński, Sergey Levine, et al. LM-Nav: Robotic navigation with large pre-trained models of language, vision, and action. In Conference on Robot Learning, pp. 492--504. PMLR, 2023
work page 2023
-
[53]
CLIPort : What and where pathways for robotic manipulation
Mohit Shridhar, Lucas Manuelli, and Dieter Fox. CLIPort : What and where pathways for robotic manipulation. In Conference on Robot Learning, pp.\ 894--906, 2022
work page 2022
-
[54]
Perceiver-Actor : A multi-task transformer for robotic manipulation
Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-Actor : A multi-task transformer for robotic manipulation. In Conference on Robot Learning, pp.\ 785--799, 2023
work page 2023
-
[55]
SMART : Self-supervised multi-task pretraining with control transformers
Yanchao Sun, Shuang Ma, Ratnesh Madaan, Rogerio Bonatti, Furong Huang, and Ashish Kapoor. SMART : Self-supervised multi-task pretraining with control transformers. arXiv preprint arXiv:2301.09816, 2023
-
[56]
PLEX : Making the most of the available data for robotic manipulation pretraining
Garrett Thomas, Ching-An Cheng, Ricky Loynd, Vibhav Vineet, Mihai Jalobeanu, and Andrey Kolobov. PLEX : Making the most of the available data for robotic manipulation pretraining. arXiv preprint arXiv:2303.08789, 2023
-
[58]
Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 2017
work page 2017
-
[60]
VideoGPT: Video Generation using VQ-VAE and Transformers
Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. VideoGPT : Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021
work page internal anchor Pith review arXiv 2021
-
[61]
Temporally consistent transformers for video generation, 2023
Wilson Yan, Danijar Hafner, Stephen James, and Pieter Abbeel. Temporally consistent transformers for video generation, 2023
work page 2023
-
[62]
Learning to see before learning to act: Visual pre-training for manipulation
Lin Yen-Chen, Andy Zeng, Shuran Song, Phillip Isola, and Tsung-Yi Lin. Learning to see before learning to act: Visual pre-training for manipulation. In IEEE International Conference on Robotics and Automation (ICRA), pp.\ 7286--7293. IEEE, 2020
work page 2020
-
[63]
Pomerleau, Dean A.
-
[64]
Attention Is All You Need
Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, Łukasz, and others. Advances in Neural Information Processing Systems
-
[65]
Decision Transformer: Reinforcement Learning via Sequence Modeling
Chen, Lili, Lu, Kevin, Rajeswaran, Aravind, Lee, Kimin, Grover, Aditya, Laskin, Misha, Abbeel, Pieter, Srinivas, Aravind, and Mordatch, Igor
-
[66]
Learning Transferable Visual Models from Natural Language Supervision
Learning transferable visual models from natural language supervision. In International Conference on Machine Learning
-
[67]
Alayrac, Jean-Baptiste, Donahue, Jeff, Luc, Pauline, Miech, Antoine, Barr, Iain, Hasson, Yana, Lenc, Karel, Mensch, Arthur, Millican, Katherine, Reynolds, Malcolm, and others
-
[68]
CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. IEEE Robotics and Automation Letters (RA-L)
-
[69]
Decoupled Weight Decay Regularization
Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101
work page internal anchor Pith review Pith/arXiv arXiv
-
[70]
Keyframe-based Learning from Demonstration: Method and Evaluation
Akgun, Baris, Cakmak, Maya, Jiang, Karl, and Thomaz, Andrea. Keyframe-based learning from demonstration: Method and evaluation. International Journal of Social Robotics
-
[71]
Robot See, Robot Do: An Overview of Robot Imitation
Bakker, Paul and Kuniyoshi, Yasuo. Robot see, robot do: An overview of robot imitation
-
[72]
Behavioral Cloning from Observation
Behavioral cloning from observation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence
-
[73]
Real-World Robot Learning with Masked Visual Pre-training
Real-world robot learning with masked visual pre-training. In Conference on Robot Learning
-
[74]
Visual Imitation Made Easy
Visual imitation made easy. In Conference on Robot Learning
-
[75]
BC-Z: Zero-shot Task Generalization with Robotic Imitation Learning
Jang, Eric, Irpan, Alex, Khansari, Mohi, Kappler, Daniel, Ebert, Frederik, Lynch, Corey, Levine, Sergey, and Finn, Chelsea
-
[76]
SMART: Self-supervised Multi-task Pretraining with Control Transformers
Sun, Yanchao, Ma, Shuang, Madaan, Ratnesh, Bonatti, Rogerio, Huang, Furong, and Kapoor, Ashish
-
[77]
R3M: A Universal Visual Representation for Robot Manipulation
Nair, Suraj, Rajeswaran, Aravind, Kumar, Vikash, Finn, Chelsea, and Gupta, Abhinav
-
[78]
Generalized Decision Transformer for Offline Hindsight Information Matching
Generalized decision transformer for offline hindsight information matching. arXiv preprint arXiv:2111.10364
-
[79]
Online Decision Transformer
Online decision transformer. In International Conference on Machine Learning
-
[80]
A Generalist Agent
A generalist agent. arXiv preprint arXiv:2205.06175
work page internal anchor Pith review Pith/arXiv arXiv
-
[81]
Masked Autoencoders Are Scalable Vision Learners
Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
-
[82]
Shridhar, Mohit, Manuelli, Lucas, and Fox, Dieter
-
[83]
RT-1: Robotics Transformer for Real-World Control at Scale
Brohan, Anthony, Brown, Noah, Carbajal, Justice, Chebotar, Yevgen, Dabis, Joseph, Finn, Chelsea, Gopalakrishnan, Keerthana, Hausman, Karol, Herzog, Alex, Hsu, Jasmine, and others
-
[84]
Improving Language Understanding by Generative Pre-training
Improving language understanding by generative pre-training. 2018
work page 2018
-
[85]
Chain-of-Thought Predictive Control
Chain-of-Thought Predictive Control. arXiv preprint arXiv:2304.00776
-
[86]
Behavior Transformers: Cloning k Modes with One Stone
Shafiullah, Nur Muhammad, Cui, Zichen, Altanzaya, Ariuntuya Arty, and Pinto, Lerrel
-
[87]
Language Models Are Few-Shot Learners
Language models are few-shot learners. Advances in Neural Information Processing Systems
-
[88]
Language Models Are Unsupervised Multitask Learners
Language models are unsupervised multitask learners. OpenAI blog
-
[89]
PaLM-E: An Embodied Multimodal Language Model
Driess, Danny, Xia, Fei, Sajjadi, Mehdi SM, Lynch, Corey, Chowdhery, Aakanksha, Ichter, Brian, Wahid, Ayzaan, Tompson, Jonathan, Vuong, Quan, Yu, Tianhe, and others
-
[90]
Language Conditioned Imitation Learning over Unstructured Data
Language conditioned imitation learning over unstructured data. arXiv preprint arXiv:2005.07648
-
[91]
What Matters in Language Conditioned Robotic Imitation Learning over Unstructured Data
What matters in language conditioned robotic imitation learning over unstructured data. IEEE Robotics and Automation Letters, 2022
work page 2022
-
[92]
Grounding Language with Visual Affordances over Unstructured Data
Grounding language with visual affordances over unstructured data. arXiv preprint arXiv:2210.01911
-
[93]
Zhang, Edwin, Lu, Yujie, Wang, William, and Zhang, Amy
-
[94]
Instruction-driven History-aware Policies for Robotic Manipulations
Instruction-driven history-aware policies for robotic manipulations. In Conference on Robot Learning, 2023
work page 2023
-
[95]
Shao, Lin, Migimatsu, Toki, Zhang, Qiang, Yang, Karen, and Bohg, Jeannette. 2021
work page 2021
-
[96]
Zeng, Andy, Florence, Pete, Tompson, Jonathan, Welker, Stefan, Chien, Jonathan, Attarian, Maria, Armstrong, Travis, Krasin, Ivan, Duong, Dan, Sindhwani, Vikas, and others. 2021
work page 2021
-
[97]
Language-conditioned Imitation Learning for Robot Manipulation Tasks
Language-conditioned imitation learning for robot manipulation tasks. Advances in Neural Information Processing Systems
-
[98]
Inner Monologue: Embodied Reasoning through Planning with Language Models
Huang, Wenlong, Xia, Fei, Xiao, Ted, Chan, Harris, Liang, Jacky, Florence, Pete, Zeng, Andy, Tompson, Jonathan, Mordatch, Igor, Chebotar, Yevgen, and others