arXiv: 2203.12601 · v3 · pith:GTXFFVUWnew · submitted 2022-03-23 · 💻 cs.RO · cs.AI · cs.CV · cs.LG

R3M: A Universal Visual Representation for Robot Manipulation

Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, Abhinav Gupta

Pith reviewed 2026-05-15 13:21 UTC · model grok-4.3

classification 💻 cs.RO cs.AI cs.CV cs.LG

keywords robot manipulation, visual representation learning, pre-training, human video, data efficiency, contrastive learning, Ego4D, Franka robot

The pith

Pre-trained visual features from human videos enable more data-efficient robot manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether visual representations learned from large-scale human video data can improve learning of robot manipulation tasks. The R3M model is pre-trained on the Ego4D dataset with time-contrastive learning, video-language alignment, and a sparsity penalty, and then serves as a frozen visual module for policy learning. Policies built on it reach over 20% higher success on 12 simulated tasks than policies trained from scratch, and over 10% higher than policies using other pre-trained representations such as CLIP and MoCo. In real-world experiments, it allows a robotic arm to learn tasks in a cluttered environment with only 20 demonstrations. A reader would care because it points to a way to leverage abundant human video data to make robot training more practical.

Core claim

R3M is a universal visual representation pre-trained on diverse human video data from the Ego4D dataset. The pre-training combines time-contrastive learning to capture temporal structure, video-language alignment for semantic understanding, and an L1 penalty to promote sparse and compact features. When frozen and used for downstream robotic policy learning, R3M boosts task success rates by more than 20% over training from scratch and by more than 10% over state-of-the-art representations such as CLIP and MoCo across 12 simulated manipulation tasks. It further enables a real Franka Emika Panda arm to acquire a variety of manipulation skills in a cluttered apartment setting using just 20 demonstrations.
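The way the three pre-training terms might combine can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the similarity function (negative Euclidean distance), the InfoNCE form of the time-contrastive term, the weight names `w_tcn`, `w_lang`, `w_l1` and their defaults are all assumptions, and the video-language term is passed in as a precomputed scalar because this page does not describe its form.

```python
import math


def neg_l2(a, b):
    """Similarity as negative Euclidean distance (an assumed choice)."""
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def info_nce(anchor, positive, negatives):
    """InfoNCE loss: -log(exp(s_pos) / (exp(s_pos) + sum(exp(s_neg))))."""
    pos = math.exp(neg_l2(anchor, positive))
    neg = sum(math.exp(neg_l2(anchor, n)) for n in negatives)
    return -math.log(pos / (pos + neg))


def l1_penalty(z):
    """Sum of absolute feature values; pushes the embedding toward sparsity."""
    return sum(abs(x) for x in z)


def combined_loss(z_t, z_near, z_far, z_other, lang_loss,
                  w_tcn=1.0, w_lang=1.0, w_l1=1e-5):
    """Weighted sum of the three terms (weights are hypothetical).

    z_t:     embedding of an anchor frame
    z_near:  embedding of a temporally close frame (positive)
    z_far:   embedding of a temporally distant frame, same video (negative)
    z_other: embedding of a frame from a different video (negative)
    lang_loss: placeholder scalar for the video-language alignment term
    """
    tcn = info_nce(z_t, z_near, [z_far, z_other])
    return w_tcn * tcn + w_lang * lang_loss + w_l1 * l1_penalty(z_t)
```

The time-contrastive term is small when the anchor sits closer to its temporal neighbour than to the negatives, and the L1 term grows with the magnitude of the embedding, which is the sense in which it favours sparse, compact features.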

What carries the argument

The R3M visual encoder, obtained by pre-training on human videos with time-contrastive, language-alignment, and sparsity objectives, acting as a frozen perception module for policy learning.
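What "frozen perception module" means in practice can be shown with a toy example. The sketch below is a stand-in under stated assumptions: `frozen_encoder` is a hypothetical fixed linear map with a ReLU rather than the real R3M network, the policy is a single linear head, and imitation is assumed to be plain squared-error regression onto demonstrated actions. None of these details come from the page.

```python
def frozen_encoder(image, weights):
    """Stand-in for a pre-trained encoder: fixed linear map followed by ReLU.

    `weights` is only read here and is never updated during policy training.
    """
    return [max(0.0, sum(w * p for w, p in zip(row, image))) for row in weights]


def train_policy_head(demos, enc_weights, lr=0.05, epochs=200):
    """Fit a linear policy head on frozen features by stochastic gradient descent.

    demos: list of (image, action) pairs, with a scalar action for simplicity.
    Only `head` is trained; the encoder weights stay untouched.
    """
    head = [0.0] * len(enc_weights)
    for _ in range(epochs):
        for image, action in demos:
            z = frozen_encoder(image, enc_weights)      # features, no gradient
            pred = sum(h * f for h, f in zip(head, z))  # predicted action
            err = pred - action
            # Gradient step on 0.5 * err**2 with respect to the head only.
            head = [h - lr * err * f for h, f in zip(head, z)]
    return head
```

Because gradients never reach the encoder, the only trainable parameters are those of the small head, which is one plausible reason a handful of demonstrations can be enough.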

If this is right

Pre-trained human video features transfer to robotic vision without adaptation.
Data efficiency in robot learning improves significantly with such representations.
Combining contrastive, language, and sparsity losses creates more effective visual features for control.
Real-world robot deployment becomes viable with small demonstration sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Scaling up the pre-training dataset could yield even stronger performance gains for a wider range of tasks.
This method might generalize to other robot embodiments or sensor modalities beyond the tested arm.
Integrating R3M with proprioception or other modalities could further enhance learning speed.

Load-bearing premise

Visual features learned from human video data will transfer effectively to robotic camera inputs and task distributions without any robot-specific fine-tuning or domain adaptation.

What would settle it

If policies using the frozen R3M encoder achieve no better success rates than random actions or from-scratch baselines across the 12 simulated manipulation tasks, the claimed transfer benefit would be falsified.
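The check described above reduces to comparing per-task success rates. A minimal sketch, assuming success is reported as one fraction per task for each method; the function names and the `min_gap` threshold are hypothetical, and no numbers from the paper are used.

```python
def mean(xs):
    return sum(xs) / len(xs)


def transfer_benefit(r3m_success, baseline_success):
    """Summarise per-task gaps between R3M and a baseline.

    Both arguments are lists of success rates in [0, 1], one entry per task,
    in the same task order.
    """
    if len(r3m_success) != len(baseline_success):
        raise ValueError("need one success rate per task for both methods")
    gaps = [a - b for a, b in zip(r3m_success, baseline_success)]
    return {
        "mean_gap": mean(gaps),
        "tasks_won": sum(g > 0 for g in gaps),
        "num_tasks": len(gaps),
    }


def claim_falsified(r3m_success, baseline_success, min_gap=0.0):
    """True when R3M shows no average improvement over the baseline."""
    return transfer_benefit(r3m_success, baseline_success)["mean_gap"] <= min_gap
```

Calling `claim_falsified` with the 12 per-task success rates for R3M and for the from-scratch baseline would return `True` only if the average gap is at or below `min_gap`. A stricter version could add a significance test across random seeds.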

Original abstract

We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of downstream robotic manipulation tasks. Concretely, we pre-train a visual representation using the Ego4D human video dataset using a combination of time-contrastive learning, video-language alignment, and an L1 penalty to encourage sparse and compact representations. The resulting representation, R3M, can be used as a frozen perception module for downstream policy learning. Across a suite of 12 simulated robot manipulation tasks, we find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo. Furthermore, R3M enables a Franka Emika Panda arm to learn a range of manipulation tasks in a real, cluttered apartment given just 20 demonstrations. Code and pre-trained models are available at https://tinyurl.com/robotr3m.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

R3M shows clear gains on simulated manipulation from Ego4D pre-training, but the real-robot results lack the same controls and leave the transfer benefit less certain.

The main thing to know is that pre-training a visual encoder on human video with time-contrastive learning, language alignment, and an L1 penalty produces features that improve policy learning on simulated robot tasks. The gains are reported as over 20% higher success than training from scratch and over 10% above CLIP and MoCo across 12 tasks, with the model frozen during downstream training. The authors also demonstrate a Franka arm acquiring several manipulation skills in a real cluttered apartment from only 20 demonstrations. Code and checkpoints are released, which is useful for anyone wanting to test the representation directly.

The work applies existing contrastive methods to a new domain rather than introducing a novel algorithm, so the contribution sits in the robot evaluation and the data-efficiency angle. The simulated results look solid because they include direct comparisons to strong baselines. The real-robot section is weaker on that front: it reports success with 20 demos but does not show matched runs against CLIP or MoCo on the same hardware and tasks. That leaves open whether the specific pre-training objectives or simply any reasonable pre-trained encoder plus limited demos is doing the work, especially given the shift from egocentric human video to typical robot camera views. Domain adaptation or fine-tuning is not used, which is an interesting choice but makes the transfer assumption central.

This paper is worth attention for groups working on visual representations for imitation learning and data-efficient robotics. The sim experiments and open models give readers something concrete to build on or replicate. A serious editor should send it to review; the claims are empirical and falsifiable, the code is available, and referees can ask for tighter real-world baselines without the work being fundamentally broken.

Referee Report

2 major / 2 minor

Summary. The paper introduces R3M, a visual encoder pre-trained on the Ego4D human video dataset via a combination of time-contrastive learning, video-language alignment, and an L1 sparsity penalty. The frozen R3M representation is then used for downstream policy learning. On 12 simulated manipulation tasks, R3M yields >20% higher success rates than training from scratch and >10% gains over CLIP and MoCo. In real-world experiments, a Franka Emika Panda arm learns several manipulation tasks in a cluttered apartment from only 20 demonstrations.

Significance. If the reported transfer holds under controlled conditions, R3M would provide a practical route to data-efficient robot learning by leveraging large-scale human video corpora. The availability of code and pre-trained models strengthens reproducibility and enables direct follow-up work on domain adaptation or fine-tuning.

major comments (2)

§4.2 (real-robot experiments): success rates are reported for R3M with 20 demonstrations, but no matched real-world baselines for CLIP, MoCo, or training from scratch are provided on the same Franka tasks. This omission prevents quantification of the transfer benefit and leaves the domain-shift assumption untested.
§3.2 (pre-training objectives): the time-contrastive and video-language losses are defined on egocentric human video; no analysis or ablation quantifies robustness to the shift to fixed third-person robot camera views, lighting, and motion statistics that appear in the real-robot evaluation.

minor comments (2)

Figure 3 and Table 2: axis labels and legend entries are too small for print; increase font size and add error bars or statistical significance markers for the 12-task averages.
§4.1: the exact number of training episodes per simulated task and the precise definition of 'success' (e.g., threshold on final pose error) should be stated explicitly rather than referenced to an appendix.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments. We address each major point below, acknowledging limitations where appropriate and outlining revisions.

Referee: §4.2 (real-robot experiments): success rates are reported for R3M with 20 demonstrations, but no matched real-world baselines for CLIP, MoCo, or training from scratch are provided on the same Franka tasks. This omission prevents quantification of the transfer benefit and leaves the domain-shift assumption untested.

Authors: We agree that matched real-world baselines would allow direct quantification of transfer gains. Real-robot experiments on the Franka are resource-intensive, which constrained our ability to run full comparisons for all methods. Simulation results already show consistent >10% gains for R3M over CLIP/MoCo and >20% over scratch. In revision we will add explicit discussion of this limitation in §4.2, note the practical success with 20 demos, and include any feasible preliminary real-world data points.
revision: partial

Referee: §3.2 (pre-training objectives): the time-contrastive and video-language losses are defined on egocentric human video; no analysis or ablation quantifies robustness to the shift to fixed third-person robot camera views, lighting, and motion statistics that appear in the real-robot evaluation.

Authors: The referee is correct that no dedicated ablation isolates robustness to viewpoint, lighting, and motion shifts. The objectives aim to learn temporally consistent and semantically aligned features expected to generalize, and this is supported by the gains we observe on both simulated and real robot camera views. In the revised manuscript we will expand §3.2 with discussion of these factors and add supporting visualizations or limited ablations where space allows.
revision: partial

standing simulated objections not resolved

Full matched real-world baselines for CLIP, MoCo, and training from scratch on the Franka tasks remain outstanding, as these require extensive additional physical robot time and resources beyond the current revision scope.

Circularity Check

0 steps flagged

No circularity: empirical pre-training objectives are independent of downstream robot-task metrics

The paper pre-trains a visual encoder on Ego4D via time-contrastive loss, video-language alignment, and L1 sparsity, none of which are defined using the 12 simulated manipulation tasks or the real Franka apartment setup. Frozen R3M features are then evaluated on separate policy-learning benchmarks against scratch, CLIP, and MoCo baselines. No equation or result reduces to a fitted parameter taken from the target success rates; no self-citation supplies a uniqueness theorem that forces the architecture; and the reported gains are measured on held-out task distributions. The central empirical chain therefore remains self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the standard assumption that contrastive objectives on human video produce features transferable to robotic vision; no new entities or fitted parameters are introduced beyond the usual hyperparameters of contrastive learning.

axioms (1)

domain assumption: Human video data contains visual features that transfer to robotic manipulation tasks.
Invoked when the pre-trained encoder is used without adaptation on robot camera streams.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel unclear

R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 32 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 7.0

RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.

RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 7.0

RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.

Multimodal Diffusion Forcing for Forceful Manipulation
cs.RO 2025-11 unverdicted novelty 7.0

Multimodal Diffusion Forcing trains a diffusion model on partially masked multimodal robot trajectories to learn temporal and cross-modal dependencies for forceful manipulation.

DreamGen: Unlocking Generalization in Robot Learning through Video World Models
cs.RO 2025-05 unverdicted novelty 7.0

DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperatio...

Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models
cs.RO 2023-10 conditional novelty 7.0

SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
cs.RO 2023-07 unverdicted novelty 7.0

VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.

VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training
cs.RO 2022-09 unverdicted novelty 7.0

VIP learns a visual embedding from human videos whose distance defines dense, smooth rewards for arbitrary goal-image robot tasks without task-specific fine-tuning.

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
cs.RO 2022-04 accept novelty 7.0

SayCan combines an LLM's high-level semantic knowledge with robot skill value functions to select only feasible actions, enabling completion of abstract natural-language instructions on a real mobile manipulator.

GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
cs.RO 2026-05 unverdicted novelty 6.0

GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.

HumanNet: Scaling Human-centric Video Learning to One Million Hours
cs.CV 2026-05 unverdicted novelty 6.0

HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.

Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 6.0

Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.

GazeVLA: Learning Human Intention for Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.

UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
cs.RO 2026-04 unverdicted novelty 6.0

UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.

WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations
cs.RO 2026-04 unverdicted novelty 6.0

WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match tele...

ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models
cs.RO 2026-04 unverdicted novelty 6.0

ProGAL-VLA uses 3D graphs, symbolic sub-goals, and a Grounding Alignment Contrastive loss to ground actions on verified embeddings, raising robustness from 30.3% to 71.5% and ambiguity AUROC to 0.81 on robotic benchmarks.

Hierarchical Planning with Latent World Models
cs.LG 2026-04 unverdicted novelty 6.0

Hierarchical planning over multi-scale latent world models enables 70% success on real robotic pick-and-place with goal-only input where flat models achieve 0%, while cutting planning compute up to 4x in simulations.

Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
cs.RO 2026-04 conditional novelty 6.0

MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.

PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
cs.RO 2026-01 unverdicted novelty 6.0

PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.

Learning to Feel the Future: DreamTacVLA for Contact-Rich Manipulation
cs.RO 2025-12 unverdicted novelty 6.0

DreamTacVLA grounds VLA models in contact physics by aligning multi-scale vision-tactile inputs and predicting future tactile states, reaching up to 95% success on contact-rich tasks.

Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning
cs.LG 2025-11 unverdicted novelty 6.0

Video diffusion models supply goal-driven rewards for RL by measuring alignment of agent trajectories with generated goal videos at both video and frame levels.

Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views
cs.CV 2025-11 unverdicted novelty 6.0

Uni-Hand forecasts 2D/3D hand waypoints, head motion, and contact states in egocentric views using vision-language fusion and dual-branch diffusion, with new benchmarks for downstream robotics and action tasks.

InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
cs.RO 2025-10 unverdicted novelty 6.0

InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
cs.RO 2025-03 unverdicted novelty 6.0

GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
cs.CV 2024-12 unverdicted novelty 6.0

Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.

DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning
cs.RO 2024-11 unverdicted novelty 6.0

DINO-WM builds world models on pre-trained DINOv2 features to enable zero-shot planning from offline data without rewards or demonstrations.

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
cs.RO 2024-10 unverdicted novelty 6.0

GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.

Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation
cs.RO 2024-09 unverdicted novelty 6.0

Gen2Act enables generalizable robot manipulation for unseen objects and novel motions by using zero-shot human video generation from web data to condition a policy trained on an order of magnitude less robot interaction data.

Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation
cs.RO 2024-01 conditional novelty 6.0

A low-cost whole-body teleoperation system enables effective imitation learning for complex bimanual mobile manipulation by co-training on mobile and static demonstration datasets.

Vision-Language Foundation Models as Effective Robot Imitators
cs.RO 2023-11 conditional novelty 6.0

RoboFlamingo adapts open-source vision-language models for robot manipulation tasks via single-step comprehension plus an explicit policy head, outperforming prior methods on benchmarks with only light fine-tuning.

Now You See That: Learning End-to-End Humanoid Locomotion from Raw Pixels
cs.RO 2026-02 unverdicted novelty 5.0

An end-to-end policy learns robust humanoid locomotion directly from noisy depth images via high-fidelity sensor simulation, vision-aware distillation from privileged maps, and terrain-specific multi-critic reward shaping.

GR-3 Technical Report
cs.RO 2025-07 unverdicted novelty 5.0

GR-3 is a VLA model that generalizes to novel objects, environments, and abstract instructions, outperforms the π0 baseline, and integrates with the new ByteMini bi-manual mobile robot.

What Matters in Building Vision-Language-Action Models for Generalist Robots
cs.RO 2024-12 unverdicted novelty 5.0

Systematic tests of VLM backbones, policy architectures, and cross-embodiment data yield RoboVLMs that set new SOTA on robot manipulation benchmarks while requiring few manual designs.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · cited by 31 Pith papers · 8 internal anchors

[1]

Levine, C

S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016

work page 2016
[2]

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009

work page 2009
[3]

Mzurikwao, M

D. Mzurikwao, M. Khan, O. Samuel, J. Cinatl, M. Wass, M. Michaelis, G. Marcelli, and C. S. Ang. Towards image-based cancer cell lines authentication using deep neural networks. Scientiﬁc Reports, 10, 11 2020. doi:10.1038/s41598-020-76670-6

work page doi:10.1038/s41598-020-76670-6 2020
[4]

Devlin, M.-W

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, Minnesota, June 2019. Association for Computational Linguistics

work page 2019
[5]

org/CorpusID:22050710

Z. Zhang, J. Liu, and N. Razavian. BERT-XML: Large scale automated ICD coding using BERT pretraining. In Proceedings of the 3rd Clinical Natural Language Processing Workshop, pages 24–34, Online, Nov. 2020. Association for Computational Linguistics. doi:10.18653/v1/ 2020.clinicalnlp-1.3. URL https://aclanthology.org/2020.clinicalnlp-1.3

work page doi:10.18653/v1/ 2020
[6]

Z. Yang, N. Garcia, C. Chu, M. Otani, Y . Nakashima, and H. Takemura. Bert representations for video question answering. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1545–1554, 2020. doi:10.1109/W ACV45572.2020.9093596

work page doi:10.1109/w 2020
[7]

Dasari, F

S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn. Robonet: Large-scale multi-robot learning. In Conference on Robot Learning, 2019

work page 2019
[8]

Mandlekar, J

A. Mandlekar, J. Booher, M. Spero, A. Tung, A. Gupta, Y . Zhu, A. Garg, S. Savarese, and L. Fei-Fei. Scaling robot supervision to hundreds of hours with roboturk: Robotic manipulation dataset through human reasoning and dexterity. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1048–1055. IEEE, 2019

work page 2019
[9]

Young, D

S. Young, D. Gandhi, S. Tulsiani, A. Gupta, P. Abbeel, and L. Pinto. Visual imitation made easy. In CoRL, 2020

work page 2020
[10]

Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets

F. Ebert, Y . Yang, K. Schmeckpeper, B. Bucher, G. Georgakis, K. Daniilidis, C. Finn, and S. Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. ArXiv, abs/2109.13396, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[11]

T. B. Brown et al. Language models are few-shot learners. arXiv:2005.14165, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2005
[12]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021

work page 2021
[13]

Goyal, Q

P. Goyal, Q. Duval, I. Seessel, M. Caron, I. Misra, L. Sagun, A. Joulin, and P. Bojanowski. Vision models are more robust and fair when pretrained on uncurated images without supervision. ArXiv, abs/2202.08360, 2022. 9

work page arXiv 2022
[14]

Goyal, S

R. Goyal, S. Ebrahimi Kahou, V . Michalski, J. Materzynska, S. Westphal, H. Kim, V . Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. The” something something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, pages 5842–5850, 2017

work page 2017
[15]

Damen, H

D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray. Scaling egocentric vision: The epic-kitchens dataset. In European Conference on Computer Vision (ECCV), 2018

work page 2018
[16]

Grauman et al

K. Grauman et al. Ego4D: Around the World in 3,000 Hours of Egocentric Video, 2021

work page 2021
[17]

L. Shao, T. Migimatsu, Q. Zhang, K. Yang, and J. Bohg. Concept2robot: Learning manipulation concepts from instructions and human demonstrations. In Proceedings of Robotics: Science and Systems (RSS), 2020

work page 2020
[18]

A. S. Chen, S. Nair, and C. Finn. Learning generalizable robotic reward functions from ”in-the-wild” human videos. ArXiv, abs/2103.16817, 2021

work page arXiv 2021
[19]

Sermanet, C

P. Sermanet, C. Lynch, Y . Chebotar, J. Hsu, E. Jang, S. Schaal, and S. Levine. Time-contrastive networks: Self-supervised learning from video. Proceedings of International Conference in Robotics and Automation (ICRA), 2018

work page 2018
[20]

Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations

A. Rajeswaran, V . Kumar, A. Gupta, J. Schulman, E. Todorov, and S. Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. ArXiv, abs/1709.10087, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[21]

Gupta, V

A. Gupta, V . Kumar, C. Lynch, S. Levine, and K. Hausman. Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning. In CoRL, 2019

work page 2019
[22]

T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, 2020

work page 2020
[23]

Parisi, A

S. Parisi, A. Rajeswaran, S. Purushwalkam, and A. K. Gupta. The unsurprising effectiveness of pre-trained vision models for control. 2022

work page 2022
[24]

K. He, H. Fan, Y . Wu, S. Xie, and R. B. Girshick. Momentum contrast for unsupervised visual representation learning. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9726–9735, 2020

work page 2020
[25]

Reinforcement learning with augmented data,

M. Laskin, K. Lee, A. Stooke, L. Pinto, P. Abbeel, and A. Srinivas. Reinforcement learning with augmented data. ArXiv, abs/2004.14990, 2020

work page arXiv 2004
[26]

Srinivas, M

A. Srinivas, M. Laskin, and P. Abbeel. Curl: Contrastive unsupervised representations for reinforcement learning. In ICML, 2020

work page 2020
[27]

Image augmentation is all you need: Regularizing deep reinforcement learning from pixels,

I. Kostrikov, D. Yarats, and R. Fergus. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. ArXiv, abs/2004.13649, 2021

work page arXiv 2004
[28]

J. Pari, N. M. M. Shaﬁullah, S. P. Arunachalam, and L. Pinto. The surprising effectiveness of representation learning for visual imitation. ArXiv, abs/2112.01511, 2021

work page arXiv 2021
[29]

DeepMDP: Learning Continuous Latent Space Models for Representation Learning

C. Gelada, S. Kumar, J. Buckman, O. Nachum, and M. G. Bellemare. Deepmdp: Learning continuous latent space models for representation learning. ArXiv, abs/1906.02736, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1906
[30]

Dream to Control: Learning Behaviors by Latent Imagination

D. Hafner, T. P. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviors by latent imagination. ArXiv, abs/1912.01603, 2020

work page internal anchor Pith review Pith/arXiv arXiv 1912
[31]

Zhang, R

A. Zhang, R. McAllister, R. Calandra, Y . Gal, and S. Levine. Learning invariant representations for reinforcement learning without reconstruction. ArXiv, abs/2006.10742, 2021. 10

work page arXiv 2006
[32]

S. Nair, S. Savarese, and C. Finn. Goal-aware prediction: Learning to model what matters. ArXiv, abs/2007.07170, 2020

work page arXiv 2007
[33]

M. Hong, K. Lee, M. Kang, W. Jung, and S. Oh. Dynamics-aware metric embedding: Metric learning in a latent space for visual planning. IEEE Robotics and Automation Letters, 2022

work page 2022
[34]

Jonschkowski and O

R. Jonschkowski and O. Brock. Learning state representations with robotic priors. Autonomous Robots, 39:407–428, 10 2015. doi:10.1007/s10514-015-9459-7

[35]

Y.-C. Lin, A. Zeng, S. Song, P. Isola, and T.-Y. Lin. Learning to see before learning to act: Visual pre-training for manipulation. 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 7286–7293, 2020

[36]

M. Shridhar, L. Manuelli, and D. Fox. Cliport: What and where pathways for robotic manipulation. In CoRL, 2021

[37]

A. Khandelwal, L. Weihs, R. Mottaghi, and A. Kembhavi. Simple but effective: Clip embeddings for embodied ai. ArXiv, abs/2111.09888, 2021

[38]

R. Shah and V. Kumar. Rrl: Resnet as representation for reinforcement learning. ArXiv, abs/2107.03380, 2021

[39]

Y. Seo, K. Lee, S. James, and P. Abbeel. Reinforcement learning with action-free pre-training from videos. ArXiv, abs/2203.13880, 2022

[40]

T. Xiao, I. Radosavovic, T. Darrell, and J. Malik. Masked visual pre-training for motor control. 2022

[41]

Y. Liu, A. Gupta, P. Abbeel, and S. Levine. Imitation from observation: Learning to imitate behaviors from raw video via context translation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1118–1125. IEEE, 2018

[42]

P. Sharma, D. Pathak, and A. Gupta. Third-person visual imitation learning via decoupled hierarchical controller. In NeurIPS, 2019

[43]

L. Smith, N. Dhawan, M. Zhang, P. Abbeel, and S. Levine. AVID: Learning Multi-Stage Tasks via Pixel-Level Translation of Human Videos. In Proceedings of Robotics: Science and Systems, Corvallis, Oregon, USA, July 2020

[44]

T. Yu, C. Finn, S. Dasari, A. Xie, T. Zhang, P. Abbeel, and S. Levine. One-shot imitation from observing humans via domain-adaptive meta-learning. In Proceedings of Robotics: Science and Systems, Pittsburgh, Pennsylvania, June 2018

[45]

K. Schmeckpeper, A. Xie, O. Rybkin, S. Tian, K. Daniilidis, S. Levine, and C. Finn. Learning predictive models from observation and interaction. In ECCV, 2020

[46]

A. D. Edwards and C. L. Isbell. Perceptual values from observation. arXiv preprint arXiv:1905.07861, 2019

[47]

K. Schmeckpeper, O. Rybkin, K. Daniilidis, S. Levine, and C. Finn. Reinforcement learning with videos: Combining offline observations with interaction. In CoRL, 2020

[48]

R. Scalise, J. Thomason, Y. Bisk, and S. Srinivasa. Improving robot success detection using static object data. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2019

[49]

S. Pirk, M. Khansari, Y. Bai, C. Lynch, and P. Sermanet. Online object representations with contrastive learning, 2019

[50]

H. Xiong, Q. Li, Y.-C. Chen, H. Bharadhwaj, S. Sinha, and A. Garg. Learning by watching: Physical imitation of manipulation skills from human videos, 2021

[51]

N. Das, S. Bechtle, T. Davchev, D. Jayaraman, A. Rai, and F. Meier. Model-based inverse reinforcement learning from visual demonstrations, 2021

[52]

K. Zakka, A. Zeng, P. Florence, J. Tompson, J. Bohg, and D. Dwibedi. Xirl: Cross-embodiment inverse reinforcement learning, 2021

[53]

S. Stepputtis, J. Campbell, M. Phielipp, S. Lee, C. Baral, and H. B. Amor. Language-conditioned imitation learning for robot manipulation tasks. ArXiv, abs/2010.12083, 2020

[54]

C. Lynch and P. Sermanet. Grounding language in play. ArXiv, abs/2005.07648, 2020

[55]

Y. Cui, S. Niekum, A. Gupta, V. Kumar, and A. Rajeswaran. Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation? In L4DC, 2022

[56]

S. Nair, E. Mitchell, K. Chen, B. Ichter, S. Savarese, and C. Finn. Learning language-conditioned robot behavior from offline data and crowd-sourced annotation. In CoRL, 2021

[57]

L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In IEEE international conference on robotics and automation (ICRA), 2016

[58]

P. Sharma, L. Mohan, L. Pinto, and A. K. Gupta. Multiple interactions made easy (mime): Large scale demonstrations data for imitation. In CoRL, 2018

[59]

E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In A. Faust, D. Hsu, and G. Neumann, editors, Proceedings of the 5th Conference on Robot Learning, volume 164 of Proceedings of Machine Learning Research, pages 991–1002. PMLR, 08–11 Nov 2022. URL ht...

[60]

X. Wang and A. K. Gupta. Unsupervised learning of visual representations using videos. 2015 IEEE International Conference on Computer Vision (ICCV), pages 2794–2802, 2015

[61]

P. Sermanet, K. Xu, and S. Levine. Unsupervised perceptual rewards for imitation learning. Proceedings of Robotics: Science and Systems (RSS), 2017

[62]

X. Wang, A. Jabri, and A. A. Efros. Learning correspondence from the cycle-consistency of time. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2561–2571, 2019

[63]

A. Jabri, A. Owens, and A. A. Efros. Space-time correspondence as a contrastive random walk. ArXiv, abs/2006.14613, 2020

[64]

M. Goyal, S. Modi, R. Goyal, and S. Gupta. Human hands as probes for interactive object understanding. In Computer Vision and Pattern Recognition (CVPR), 2022

[65]

A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2630–2640, 2019

[66]

H. Xu, G. Ghosh, P.-Y. Huang, D. Okhonko, A. Aghajanyan, F. Metze, L. Zettlemoyer, and C. Feichtenhofer. Videoclip: Contrastive pre-training for zero-shot video-text understanding. ArXiv, abs/2109.14084, 2021

[67]

A. van den Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. ArXiv, abs/1807.03748, 2018

[68]

S. Ross, G. J. Gordon, and J. A. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011

[69]

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016

[70]

I. Radosavovic, T. Xiao, S. James, P. Abbeel, J. Malik, and T. Darrell. Real-world robot learning with masked visual pre-training. CoRL, 2022

[71]

V. Sanh, L. Debut, J. Chaumond, and T. Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108, 2019

A R3M Training Details

A.1 Data Preprocessing

The Ego4D dataset consists of several hour long videos within a certain scene. Within each scene, there are many sub-clips, each with a natural language a...
