R3M: A Universal Visual Representation for Robot Manipulation
Pith reviewed 2026-05-15 13:21 UTC · model grok-4.3
The pith
Pre-trained visual features from human videos enable more data-efficient robot manipulation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
R3M is a universal visual representation pre-trained on diverse human video data from the Ego4D dataset. The pre-training combines time-contrastive learning to capture temporal structure, video-language alignment for semantic understanding, and an L1 penalty to promote sparse and compact features. When frozen and used for downstream robotic policy learning, R3M boosts task success rates by more than 20% over training from scratch and by more than 10% over state-of-the-art representations such as CLIP and MoCo across 12 simulated manipulation tasks. It further enables a real Franka Emika Panda arm to acquire a variety of manipulation skills in a cluttered apartment setting using just 20 demonstrations.
What carries the argument
The R3M visual encoder, obtained by pre-training on human videos with time-contrastive, language-alignment, and sparsity objectives, acting as a frozen perception module for policy learning.
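A minimal sketch of how the three objectives named above might be combined into one scalar loss, assuming frame embeddings have already been computed. The function names, the use of negative L2 distance as similarity, the single negative per anchor, and the weights `w_tcn`, `w_lang`, and `w_l1` are illustrative assumptions, not the paper's implementation. The language-alignment term is passed in as an already computed number because its scoring model is not described on this page.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def time_contrastive_loss(z_anchor, z_near, z_far):
    """InfoNCE-style term with one negative (an assumed simplification).

    The temporally closer frame should be more similar to the anchor
    than the temporally farther frame.
    """
    s_near = -l2_distance(z_anchor, z_near)
    s_far = -l2_distance(z_anchor, z_far)
    # -log(exp(s_near) / (exp(s_near) + exp(s_far))), computed stably.
    m = max(s_near, s_far)
    log_denominator = m + math.log(math.exp(s_near - m) + math.exp(s_far - m))
    return log_denominator - s_near


def l1_penalty(z):
    """Sum of absolute values; pushes embedding entries toward zero."""
    return sum(abs(x) for x in z)


def combined_loss(z_anchor, z_near, z_far, language_loss=0.0,
                  w_tcn=1.0, w_lang=1.0, w_l1=1e-5):
    """Weighted sum of the three terms; all weights are hypothetical."""
    tcn = time_contrastive_loss(z_anchor, z_near, z_far)
    sparsity = l1_penalty(z_anchor)
    return w_tcn * tcn + w_lang * language_loss + w_l1 * sparsity
```

With one negative, the time-contrastive term equals log 2 when both frames are equally distant from the anchor and falls below that value when the temporally closer frame is also closer in embedding space.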
If this is right
- Pre-trained human video features transfer to robotic vision without adaptation.
- Data efficiency in robot learning improves significantly with such representations.
- Combining contrastive, language, and sparsity losses creates more effective visual features for control.
- Real-world robot deployment becomes viable with small demonstration sets.
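The "frozen perception module" setup behind these implications can be illustrated with a toy sketch: features come from a fixed encoder that is never updated, and only a small policy head is fit to a handful of demonstrations. Everything here is a stand-in chosen for illustration: `frozen_encoder` is a hand-written feature function rather than R3M, the images are four-pixel lists, the expert actions are synthetic, and behavior cloning with a linear head is an assumed choice of policy learner.

```python
import random


def frozen_encoder(image):
    """Hypothetical stand-in for a pre-trained encoder; it has no trainable state."""
    n = len(image)
    half = n // 2
    mean = sum(image) / n
    contrast = sum(image[:half]) / half - sum(image[half:]) / (n - half)
    return [mean, contrast, 1.0]  # the constant 1.0 acts as a bias feature


def train_policy_head(demos, epochs=5000, lr=0.3):
    """Behavior cloning: fit a linear head on frozen features by gradient descent on MSE."""
    features = [frozen_encoder(image) for image, _ in demos]  # computed once
    actions = [action for _, action in demos]
    w = [0.0] * len(features[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for f, a in zip(features, actions):
            err = sum(wi * fi for wi, fi in zip(w, f)) - a
            for i, fi in enumerate(f):
                grad[i] += 2.0 * err * fi / len(demos)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def act(w, image):
    """Policy output for one observation: linear head on frozen features."""
    return sum(wi * fi for wi, fi in zip(w, frozen_encoder(image)))


# 20 synthetic demonstrations; the "expert" rule below is invented for the example.
rng = random.Random(0)
demos = []
for _ in range(20):
    image = [rng.random() for _ in range(4)]
    mean, contrast, _ = frozen_encoder(image)
    demos.append((image, 2.0 * contrast - mean + 0.5))

policy_weights = train_policy_head(demos)
```

Because the encoder is fixed, its features can be computed once per demonstration, and only the three head weights are learned.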
Where Pith is reading between the lines
- Scaling up the pre-training dataset could yield even stronger performance gains for a wider range of tasks.
- This method might generalize to other robot embodiments or sensor modalities beyond the tested arm.
- Integrating R3M with proprioception or other modalities could further enhance learning speed.
Load-bearing premise
Visual features learned from human video data will transfer effectively to robotic camera inputs and task distributions without any robot-specific fine-tuning or domain adaptation.
What would settle it
If policies using the frozen R3M encoder achieve no better success rates than random actions or from-scratch baselines across the 12 simulated manipulation tasks, the claimed transfer benefit would be falsified.
We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of downstream robotic manipulation tasks. Concretely, we pre-train a visual representation using the Ego4D human video dataset using a combination of time-contrastive learning, video-language alignment, and an L1 penalty to encourage sparse and compact representations. The resulting representation, R3M, can be used as a frozen perception module for downstream policy learning. Across a suite of 12 simulated robot manipulation tasks, we find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo. Furthermore, R3M enables a Franka Emika Panda arm to learn a range of manipulation tasks in a real, cluttered apartment given just 20 demonstrations. Code and pre-trained models are available at https://tinyurl.com/robotr3m.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces R3M, a visual encoder pre-trained on the Ego4D human video dataset via a combination of time-contrastive learning, video-language alignment, and an L1 sparsity penalty. The frozen R3M representation is then used for downstream policy learning. On 12 simulated manipulation tasks, R3M yields >20% higher success rates than training from scratch and >10% gains over CLIP and MoCo. In real-world experiments, a Franka Emika Panda arm learns several manipulation tasks in a cluttered apartment from only 20 demonstrations.
Significance. If the reported transfer holds under controlled conditions, R3M would provide a practical route to data-efficient robot learning by leveraging large-scale human video corpora. The availability of code and pre-trained models strengthens reproducibility and enables direct follow-up work on domain adaptation or fine-tuning.
major comments (2)
- [§4.2] §4.2 (real-robot experiments): success rates are reported for R3M with 20 demonstrations, but no matched real-world baselines for CLIP, MoCo, or training from scratch are provided on the same Franka tasks. This omission prevents quantification of the transfer benefit and leaves the domain-shift assumption untested.
- [§3.2] §3.2 (pre-training objectives): the time-contrastive and video-language losses are defined on egocentric human video; no analysis or ablation quantifies robustness to the shift to fixed third-person robot camera views, lighting, and motion statistics that appear in the real-robot evaluation.
minor comments (2)
- [Figure 3] Figure 3 and Table 2: axis labels and legend entries are too small for print; increase font size and add error bars or statistical significance markers for the 12-task averages.
- [§4.1] §4.1: the exact number of training episodes per simulated task and the precise definition of 'success' (e.g., threshold on final pose error) should be stated explicitly rather than referenced to an appendix.
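The two minor comments above ask for an explicit success definition and for error bars. A rough sketch of what that could look like follows, assuming success is defined by a threshold on final pose error. The threshold value `0.05`, the function names, and the use of a binomial standard error are assumptions for illustration, not values or methods taken from the paper.

```python
import math


def is_success(final_pose_error, threshold=0.05):
    """An episode counts as a success if its final pose error is within the threshold."""
    return final_pose_error <= threshold


def success_rate_with_stderr(final_pose_errors, threshold=0.05):
    """Return (success rate, binomial standard error) over a list of episodes."""
    n = len(final_pose_errors)
    successes = sum(is_success(e, threshold) for e in final_pose_errors)
    rate = successes / n
    stderr = math.sqrt(rate * (1.0 - rate) / n)
    return rate, stderr


# Example with made-up final pose errors from four evaluation episodes.
rate, stderr = success_rate_with_stderr([0.01, 0.02, 0.20, 0.04])
```

The standard error here could be drawn as an error bar on each per-task bar or on the 12-task average.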
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below, acknowledging limitations where appropriate and outlining revisions.
- Referee: [§4.2] §4.2 (real-robot experiments): success rates are reported for R3M with 20 demonstrations, but no matched real-world baselines for CLIP, MoCo, or training from scratch are provided on the same Franka tasks. This omission prevents quantification of the transfer benefit and leaves the domain-shift assumption untested.
  Authors: We agree that matched real-world baselines would allow direct quantification of transfer gains. Real-robot experiments on the Franka are resource-intensive, which constrained our ability to run full comparisons for all methods. Simulation results already show consistent >10% gains for R3M over CLIP/MoCo and >20% over scratch. In revision we will add explicit discussion of this limitation in §4.2, note the practical success with 20 demos, and include any feasible preliminary real-world data points. revision: partial
- Referee: [§3.2] §3.2 (pre-training objectives): the time-contrastive and video-language losses are defined on egocentric human video; no analysis or ablation quantifies robustness to the shift to fixed third-person robot camera views, lighting, and motion statistics that appear in the real-robot evaluation.
  Authors: The referee is correct that no dedicated ablation isolates robustness to viewpoint, lighting, and motion shifts. The objectives aim to learn temporally consistent and semantically aligned features that are expected to generalize, and this is supported by the transfer from human video to both simulated and real robot tasks in our results. In the revised manuscript we will expand §3.2 with discussion of these factors and add supporting visualizations or limited ablations where space allows. revision: partial
- Full matched real-world baselines for CLIP, MoCo, and training from scratch on the Franka tasks, as these require extensive additional physical robot time and resources beyond the current revision scope.
Circularity Check
No circularity: empirical pre-training objectives are independent of downstream robot-task metrics
The paper pre-trains a visual encoder on Ego4D via time-contrastive loss, video-language alignment, and L1 sparsity, none of which are defined using the 12 simulated manipulation tasks or the real Franka apartment setup. Frozen R3M features are then evaluated on separate policy-learning benchmarks against scratch, CLIP, and MoCo baselines. No equation or result reduces to a fitted parameter taken from the target success rates; no self-citation supplies a uniqueness theorem that forces the architecture; and the reported gains are measured on held-out task distributions. The central empirical chain therefore remains self-contained and externally falsifiable.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human video data contains visual features that transfer to robotic manipulation tasks.
Lean theorems connected to this paper
-
IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 32 Pith papers
-
RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.
-
RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.
-
Multimodal Diffusion Forcing for Forceful Manipulation
Multimodal Diffusion Forcing trains a diffusion model on partially masked multimodal robot trajectories to learn temporal and cross-modal dependencies for forceful manipulation.
-
DreamGen: Unlocking Generalization in Robot Learning through Video World Models
DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperatio...
-
Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models
SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.
-
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
-
VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training
VIP learns a visual embedding from human videos whose distance defines dense, smooth rewards for arbitrary goal-image robot tasks without task-specific fine-tuning.
-
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
SayCan combines an LLM's high-level semantic knowledge with robot skill value functions to select only feasible actions, enabling completion of abstract natural-language instructions on a real mobile manipulator.
-
GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.
-
HumanNet: Scaling Human-centric Video Learning to One Million Hours
HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.
-
Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation
Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.
-
GazeVLA: Learning Human Intention for Robotic Manipulation
GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
-
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.
-
WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations
WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match tele...
-
ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models
ProGAL-VLA uses 3D graphs, symbolic sub-goals, and a Grounding Alignment Contrastive loss to ground actions on verified embeddings, raising robustness from 30.3% to 71.5% and ambiguity AUROC to 0.81 on robotic benchmarks.
-
Hierarchical Planning with Latent World Models
Hierarchical planning over multi-scale latent world models enables 70% success on real robotic pick-and-place with goal-only input where flat models achieve 0%, while cutting planning compute up to 4x in simulations.
-
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
-
PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.
-
Learning to Feel the Future: DreamTacVLA for Contact-Rich Manipulation
DreamTacVLA grounds VLA models in contact physics by aligning multi-scale vision-tactile inputs and predicting future tactile states, reaching up to 95% success on contact-rich tasks.
-
Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning
Video diffusion models supply goal-driven rewards for RL by measuring alignment of agent trajectories with generated goal videos at both video and frame levels.
-
Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views
Uni-Hand forecasts 2D/3D hand waypoints, head motion, and contact states in egocentric views using vision-language fusion and dual-branch diffusion, with new benchmarks for downstream robotics and action tasks.
-
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
-
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.
-
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
-
DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning
DINO-WM builds world models on pre-trained DINOv2 features to enable zero-shot planning from offline data without rewards or demonstrations.
-
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.
-
Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation
Gen2Act enables generalizable robot manipulation for unseen objects and novel motions by using zero-shot human video generation from web data to condition a policy trained on an order of magnitude less robot interaction data.
-
Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation
A low-cost whole-body teleoperation system enables effective imitation learning for complex bimanual mobile manipulation by co-training on mobile and static demonstration datasets.
-
Vision-Language Foundation Models as Effective Robot Imitators
RoboFlamingo adapts open-source vision-language models for robot manipulation tasks via single-step comprehension plus an explicit policy head, outperforming prior methods on benchmarks with only light fine-tuning.
-
Now You See That: Learning End-to-End Humanoid Locomotion from Raw Pixels
An end-to-end policy learns robust humanoid locomotion directly from noisy depth images via high-fidelity sensor simulation, vision-aware distillation from privileged maps, and terrain-specific multi-critic reward shaping.
-
GR-3 Technical Report
GR-3 is a VLA model that generalizes to novel objects, environments, and abstract instructions, outperforms the π0 baseline, and integrates with the new ByteMini bi-manual mobile robot.
-
What Matters in Building Vision-Language-Action Models for Generalist Robots
Systematic tests of VLM backbones, policy architectures, and cross-embodiment data yield RoboVLMs that set new SOTA on robot manipulation benchmarks while requiring few manual designs.