{"paper":{"title":"R3M: A Universal Visual Representation for Robot Manipulation","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Pre-trained visual features from human videos enable more data-efficient robot manipulation.","cross_cats":["cs.AI","cs.CV","cs.LG"],"primary_cat":"cs.RO","authors_text":"Abhinav Gupta, Aravind Rajeswaran, Chelsea Finn, Suraj Nair, Vikash Kumar","submitted_at":"2022-03-23T17:55:09Z","abstract_excerpt":"We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of downstream robotic manipulation tasks. Concretely, we pre-train a visual representation using the Ego4D human video dataset using a combination of time-contrastive learning, video-language alignment, and an L1 penalty to encourage sparse and compact representations. The resulting representation, R3M, can be used as a frozen perception module for downstream policy learning. Across a suite of 12 simulated robot manipulation tasks, we find that R3M improves task success by over 20% co"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Across a suite of 12 simulated robot manipulation tasks, we find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo. 
Furthermore, R3M enables a Franka Emika Panda arm to learn a range of manipulation tasks in a real, cluttered apartment given just 20 demonstrations.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That visual features learned from human video data will transfer effectively to robotic camera inputs and task distributions without any robot-specific fine-tuning or domain adaptation.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"A visual encoder pre-trained on diverse human videos with contrastive and language objectives improves simulated robot manipulation success by over 20% versus training from scratch and enables real Franka arm tasks from 20 demonstrations.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Pre-trained visual features from human videos enable more data-efficient robot manipulation.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"e69e6b87e1dfd8cc4373ea87e5817d96964cd6194efccc7ee792bde0193223b1"},"source":{"id":"2203.12601","kind":"arxiv","version":3},"verdict":{"id":"40bb61ea-d550-45b9-99c3-e36d7d935b6d","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T13:21:26.105969Z","strongest_claim":"Across a suite of 12 simulated robot manipulation tasks, we find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo. 
Furthermore, R3M enables a Franka Emika Panda arm to learn a range of manipulation tasks in a real, cluttered apartment given just 20 demonstrations.","one_line_summary":"A visual encoder pre-trained on diverse human videos with contrastive and language objectives improves simulated robot manipulation success by over 20% versus training from scratch and enables real Franka arm tasks from 20 demonstrations.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That visual features learned from human video data will transfer effectively to robotic camera inputs and task distributions without any robot-specific fine-tuning or domain adaptation.","pith_extraction_headline":"Pre-trained visual features from human videos enable more data-efficient robot manipulation."},"references":{"count":71,"sample":[{"doi":"","year":2016,"title":"S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016","work_id":"25f290d1-bc32-4fac-8425-cbd617acd5d7","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2009,"title":"J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009","work_id":"cbccb039-8471-40e8-8444-986485f10316","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.1038/s41598-020-76670-6","year":2020,"title":"D. Mzurikwao, M. Khan, O. Samuel, J. Cinatl, M. Wass, M. Michaelis, G. Marcelli, and C. S. Ang. Towards image-based cancer cell lines authentication using deep neural networks. Scientific Reports, 10, ","work_id":"7304f89e-02c3-4b7a-9453-82d7a0e149bf","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. 
In Conference of the North American Chapter of the Association for C","work_id":"dd528a0c-a50f-417f-a715-b88d7b97c800","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.18653/v1/","year":2020,"title":"doi: 10.18653/v1/ 2024.findings-acl.586","work_id":"8d675bdd-79ca-48d6-9163-fc17ce0e8ece","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":71,"snapshot_sha256":"67f00b3640739d95e32292336d2ffa0a1645b6c4ad7966a3227435c05d64b82f","internal_anchors":8},"formal_canon":{"evidence_count":2,"snapshot_sha256":"eac073ec20631c456b36b01f0bccbdae9fe60d5fe64d39c5f6a3f861ed095222"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}