{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2022:GTXFFVUW5IAGGYDDGCDKIZJSH5","short_pith_number":"pith:GTXFFVUW","schema_version":"1.0","canonical_sha256":"34ee52d696ea006360633086a465323f42504e698f7a5184bb4ac4f8e17a2bd4","source":{"kind":"arxiv","id":"2203.12601","version":3},"attestation_state":"computed","paper":{"title":"R3M: A Universal Visual Representation for Robot Manipulation","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Pre-trained visual features from human videos enable more data-efficient robot manipulation.","cross_cats":["cs.AI","cs.CV","cs.LG"],"primary_cat":"cs.RO","authors_text":"Abhinav Gupta, Aravind Rajeswaran, Chelsea Finn, Suraj Nair, Vikash Kumar","submitted_at":"2022-03-23T17:55:09Z","abstract_excerpt":"We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of downstream robotic manipulation tasks. Concretely, we pre-train a visual representation using the Ego4D human video dataset using a combination of time-contrastive learning, video-language alignment, and an L1 penalty to encourage sparse and compact representations. The resulting representation, R3M, can be used as a frozen perception module for downstream policy learning. Across a suite of 12 simulated robot manipulation tasks, we find that R3M improves task success by over 20% co"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2203.12601","kind":"arxiv","version":3},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.RO","submitted_at":"2022-03-23T17:55:09Z","cross_cats_sorted":["cs.AI","cs.CV","cs.LG"],"title_canon_sha256":"923da15624e8a8f8846da5becb700e1523a5a0058db29deff7fbab2dd3adcab3","abstract_canon_sha256":"2e12f3f0f536b483f3c663217324ecee07bd9809f73caf8399664ec0c02148d6"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:52.455138Z","signature_b64":"9kA08c3lkE/hJ8GR57TIvG1ouKibqBqXIVxYZmyKSWqjzGvZsN6BGOz0CC8eUvP5IgusygfkvnsN8/zE1paUAQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"34ee52d696ea006360633086a465323f42504e698f7a5184bb4ac4f8e17a2bd4","last_reissued_at":"2026-05-17T23:38:52.454622Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:52.454622Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"R3M: A Universal Visual Representation for Robot Manipulation","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Pre-trained visual features from human videos enable more data-efficient robot manipulation.","cross_cats":["cs.AI","cs.CV","cs.LG"],"primary_cat":"cs.RO","authors_text":"Abhinav Gupta, Aravind Rajeswaran, Chelsea Finn, Suraj Nair, Vikash Kumar","submitted_at":"2022-03-23T17:55:09Z","abstract_excerpt":"We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of downstream robotic manipulation tasks. Concretely, we pre-train a visual representation using the Ego4D human video dataset using a combination of time-contrastive learning, video-language alignment, and an L1 penalty to encourage sparse and compact representations. The resulting representation, R3M, can be used as a frozen perception module for downstream policy learning. Across a suite of 12 simulated robot manipulation tasks, we find that R3M improves task success by over 20% co"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Across a suite of 12 simulated robot manipulation tasks, we find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo. Furthermore, R3M enables a Franka Emika Panda arm to learn a range of manipulation tasks in a real, cluttered apartment given just 20 demonstrations.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That visual features learned from human video data will transfer effectively to robotic camera inputs and task distributions without any robot-specific fine-tuning or domain adaptation.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"A visual encoder pre-trained on diverse human videos with contrastive and language objectives improves simulated robot manipulation success by over 20% versus training from scratch and enables real Franka arm tasks from 20 demonstrations.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Pre-trained visual features from human videos enable more data-efficient robot manipulation.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"e69e6b87e1dfd8cc4373ea87e5817d96964cd6194efccc7ee792bde0193223b1"},"source":{"id":"2203.12601","kind":"arxiv","version":3},"verdict":{"id":"40bb61ea-d550-45b9-99c3-e36d7d935b6d","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T13:21:26.105969Z","strongest_claim":"Across a suite of 12 simulated robot manipulation tasks, we find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo. Furthermore, R3M enables a Franka Emika Panda arm to learn a range of manipulation tasks in a real, cluttered apartment given just 20 demonstrations.","one_line_summary":"A visual encoder pre-trained on diverse human videos with contrastive and language objectives improves simulated robot manipulation success by over 20% versus training from scratch and enables real Franka arm tasks from 20 demonstrations.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That visual features learned from human video data will transfer effectively to robotic camera inputs and task distributions without any robot-specific fine-tuning or domain adaptation.","pith_extraction_headline":"Pre-trained visual features from human videos enable more data-efficient robot manipulation."},"references":{"count":71,"sample":[{"doi":"","year":2016,"title":"S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016","work_id":"25f290d1-bc32-4fac-8425-cbd617acd5d7","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2009,"title":"J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009","work_id":"cbccb039-8471-40e8-8444-986485f10316","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.1038/s41598-020-76670-6","year":2020,"title":"D. Mzurikwao, M. Khan, O. Samuel, J. Cinatl, M. Wass, M. Michaelis, G. Marcelli, and C. S. Ang. Towards image-based cancer cell lines authentication using deep neural networks. Scientiﬁc Reports, 10, ","work_id":"7304f89e-02c3-4b7a-9453-82d7a0e149bf","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for C","work_id":"dd528a0c-a50f-417f-a715-b88d7b97c800","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.18653/v1/","year":2020,"title":"doi: 10.18653/v1/ 2024.findings-acl.586","work_id":"8d675bdd-79ca-48d6-9163-fc17ce0e8ece","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":71,"snapshot_sha256":"67f00b3640739d95e32292336d2ffa0a1645b6c4ad7966a3227435c05d64b82f","internal_anchors":8},"formal_canon":{"evidence_count":2,"snapshot_sha256":"eac073ec20631c456b36b01f0bccbdae9fe60d5fe64d39c5f6a3f861ed095222"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2203.12601","created_at":"2026-05-17T23:38:52.454697+00:00"},{"alias_kind":"arxiv_version","alias_value":"2203.12601v3","created_at":"2026-05-17T23:38:52.454697+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2203.12601","created_at":"2026-05-17T23:38:52.454697+00:00"},{"alias_kind":"pith_short_12","alias_value":"GTXFFVUW5IAG","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"GTXFFVUW5IAGGYDD","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"GTXFFVUW","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":38,"internal_anchor_count":38,"sample":[{"citing_arxiv_id":"2505.07813","citing_title":"DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2507.12440","citing_title":"EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17077","citing_title":"How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17517","citing_title":"AffordVLA: Injecting Affordance Representations into Vision-Language-Action Models via Implicit Feature Alignment","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16743","citing_title":"LACE: Latent Visual Representation for Cross-Embodiment Learning","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2506.10137","citing_title":"Self-Predictive Representations for Combinatorial Generalization in Behavioral Cloning","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2511.04812","citing_title":"Multimodal Diffusion Forcing for Forceful Manipulation","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2511.12878","citing_title":"Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2412.14058","citing_title":"What Matters in Building Vision-Language-Action Models for Generalist Robots","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2411.04983","citing_title":"DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2507.15493","citing_title":"GR-3 Technical Report","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2512.00961","citing_title":"Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2311.01378","citing_title":"Vision-Language Foundation Models as Effective Robot Imitators","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2512.23864","citing_title":"Learning to Feel the Future: DreamTacVLA for Contact-Rich Manipulation","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2601.07060","citing_title":"PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation","ref_index":91,"is_internal_anchor":true},{"citing_arxiv_id":"2602.06382","citing_title":"Now You See That: Learning End-to-End Humanoid Locomotion from Raw Pixels","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2310.10639","citing_title":"Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2505.12705","citing_title":"DreamGen: Unlocking Generalization in Robot Learning through Video World Models","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2409.16283","citing_title":"Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19092","citing_title":"RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2210.00030","citing_title":"VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2401.02117","citing_title":"Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation","ref_index":60,"is_internal_anchor":true},{"citing_arxiv_id":"2510.13778","citing_title":"InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03208","citing_title":"Hierarchical Planning with Latent World Models","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03181","citing_title":"Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model","ref_index":51,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/GTXFFVUW5IAGGYDDGCDKIZJSH5","json":"https://pith.science/pith/GTXFFVUW5IAGGYDDGCDKIZJSH5.json","graph_json":"https://pith.science/api/pith-number/GTXFFVUW5IAGGYDDGCDKIZJSH5/graph.json","events_json":"https://pith.science/api/pith-number/GTXFFVUW5IAGGYDDGCDKIZJSH5/events.json","paper":"https://pith.science/paper/GTXFFVUW"},"agent_actions":{"view_html":"https://pith.science/pith/GTXFFVUW5IAGGYDDGCDKIZJSH5","download_json":"https://pith.science/pith/GTXFFVUW5IAGGYDDGCDKIZJSH5.json","view_paper":"https://pith.science/paper/GTXFFVUW","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2203.12601&json=true","fetch_graph":"https://pith.science/api/pith-number/GTXFFVUW5IAGGYDDGCDKIZJSH5/graph.json","fetch_events":"https://pith.science/api/pith-number/GTXFFVUW5IAGGYDDGCDKIZJSH5/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/GTXFFVUW5IAGGYDDGCDKIZJSH5/action/timestamp_anchor","attest_storage":"https://pith.science/pith/GTXFFVUW5IAGGYDDGCDKIZJSH5/action/storage_attestation","attest_author":"https://pith.science/pith/GTXFFVUW5IAGGYDDGCDKIZJSH5/action/author_attestation","sign_citation":"https://pith.science/pith/GTXFFVUW5IAGGYDDGCDKIZJSH5/action/citation_signature","submit_replication":"https://pith.science/pith/GTXFFVUW5IAGGYDDGCDKIZJSH5/action/replication_record"}},"created_at":"2026-05-17T23:38:52.454697+00:00","updated_at":"2026-05-17T23:38:52.454697+00:00"}