R3M: A Universal Visual Representation for Robot Manipulation

Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, Abhinav Gupta · 2022 · cs.RO · arXiv 2203.12601

Canonical reference. 93% of citing Pith papers cite this work as background.

60 Pith papers citing it

Background 93% of classified citations

abstract

We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of downstream robotic manipulation tasks. Concretely, we pre-train a visual representation using the Ego4D human video dataset using a combination of time-contrastive learning, video-language alignment, and an L1 penalty to encourage sparse and compact representations. The resulting representation, R3M, can be used as a frozen perception module for downstream policy learning. Across a suite of 12 simulated robot manipulation tasks, we find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo. Furthermore, R3M enables a Franka Emika Panda arm to learn a range of manipulation tasks in a real, cluttered apartment given just 20 demonstrations. Code and pre-trained models are available at https://tinyurl.com/robotr3m.
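The abstract names three training signals: time-contrastive learning, video-language alignment, and an L1 penalty. The sketch below illustrates only two of them (the time-contrastive term and the L1 sparsity term) on pre-computed frame embeddings, and omits video-language alignment entirely. It is not the authors' implementation: the negative-L2 similarity, the choice of negatives, the function names, and the `l1_weight` value are all assumptions made for illustration.

```python
import numpy as np

def neg_l2(a, b):
    """Similarity as negative Euclidean distance (an assumption of this sketch)."""
    return -np.linalg.norm(a - b, axis=-1)

def time_contrastive_loss(z_i, z_j, z_k, z_neg):
    """InfoNCE-style time-contrastive loss on batches of embeddings.

    z_i, z_j, z_k: embeddings of frames from the same video, with j closer
    in time to i than k is. z_neg: embeddings of frames from other videos.
    The loss is small when z_j is more similar to z_i than z_k or z_neg are.
    """
    pos = neg_l2(z_i, z_j)
    logits = np.stack([pos, neg_l2(z_i, z_k), neg_l2(z_i, z_neg)], axis=-1)
    # Numerically stable log-sum-exp over the positive and the two negatives.
    m = logits.max(axis=-1, keepdims=True)
    log_denom = m[..., 0] + np.log(np.exp(logits - m).sum(axis=-1))
    return float(np.mean(log_denom - pos))

def l1_penalty(z):
    """Mean L1 norm of the embeddings; encourages sparse representations."""
    return float(np.mean(np.abs(z).sum(axis=-1)))

def sketch_objective(z_i, z_j, z_k, z_neg, l1_weight=1e-5):
    """Time-contrastive term plus weighted L1 term (hypothetical weighting)."""
    all_z = np.concatenate([z_i, z_j, z_k, z_neg], axis=0)
    return time_contrastive_loss(z_i, z_j, z_k, z_neg) + l1_weight * l1_penalty(all_z)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z_i, z_j, z_k, z_neg = (rng.normal(size=(16, 32)) for _ in range(4))
    print(sketch_objective(z_i, z_j, z_k, z_neg))
```

In a real training loop the embeddings would come from the image encoder being trained, and the loss would be minimized by gradient descent. At downstream time the abstract describes the encoder as frozen, so only the policy on top of it would be updated.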

citation-role summary

background: 13 · method: 1

citation-polarity summary

background: 13 · use method: 1

representative citing papers

HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

cs.CV · 2026-06-18 · unverdicted · novelty 7.0

Processed egocentric human video outperforms teleoperated real-robot trajectories as pretraining data for embodied foundation models, delivering 24% lower validation loss and 52.5-90% higher task success rates under matched post-training protocols.

Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

C4G introduces compact timestamp-conditioned Gaussian query tokens that aggregate full temporal context to decode 3D Gaussians with timestamp-modulated positions for feed-forward 4D reconstruction from monocular video, plus a diffusion-based rendering module and extension to 4D feature fields.

Point Tracking Improves World Action Models

cs.RO · 2026-05-22 · unverdicted · novelty 7.0

JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.

RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

cs.RO · 2026-04-21 · unverdicted · novelty 7.0 · 2 refs

RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.

Multimodal Diffusion Forcing for Forceful Manipulation

cs.RO · 2025-11-06 · unverdicted · novelty 7.0

Multimodal Diffusion Forcing trains a diffusion model on partially masked multimodal robot trajectories to learn temporal and cross-modal dependencies for forceful manipulation.

DreamGen: Unlocking Generalization in Robot Learning through Video World Models

cs.RO · 2025-05-19 · unverdicted · novelty 7.0

DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperation dataset.

Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

cs.RO · 2023-10-16 · conditional · novelty 7.0

SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

cs.RO · 2023-07-12 · unverdicted · novelty 7.0

VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.

VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

cs.RO · 2022-09-30 · unverdicted · novelty 7.0

VIP learns a visual embedding from human videos whose distance defines dense, smooth rewards for arbitrary goal-image robot tasks without task-specific fine-tuning.

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

cs.RO · 2022-04-04 · accept · novelty 7.0

SayCan combines an LLM's high-level semantic knowledge with robot skill value functions to select only feasible actions, enabling completion of abstract natural-language instructions on a real mobile manipulator.

STEAM: Self-Supervised Temporal Ensemble Advantage Modeling for Real-World Robot Learning

cs.RO · 2026-06-29 · unverdicted · novelty 6.0

STEAM learns advantages from expert trajectories via self-supervised temporal ensemble modeling to improve policy learning on real robot tasks like bimanual folding and pick-and-place.

Supervise What Survives: Geometry-Guided VLA Adaptation from Synthetic Robot Videos

cs.RO · 2026-06-23 · unverdicted · novelty 6.0

GRA extracts 2D waypoints from synthetic videos to supervise VLA vision while restricting action training to real data, outperforming pseudo-action baselines on real-robot tasks.

OpenHLM: An Empirical Recipe for Whole-Body Humanoid Loco-Manipulation

cs.RO · 2026-06-20 · unverdicted · novelty 6.0

OpenHLM is an empirical recipe yielding a whole-body humanoid VLA model that outperforms GR00T N1.6 and Ψ0 baselines on long-horizon tasks using less than half the demonstration time.

Imitation from Heterogeneous Demonstrations using Grounded Latent-Action World Models

cs.RO · 2026-06-19 · unverdicted · novelty 6.0

GLAM learns a shared latent action space grounded in consistent future observation prediction across heterogeneous data sources to train improved behavioral cloning policies for robot manipulation tasks.

PoLAR: Factorizing Extent and Mode in Latent Actions for Robot Policy Learning

cs.RO · 2026-06-19 · unverdicted · novelty 6.0

PoLAR imposes radial structure on latent actions in hyperbolic space to factorize extent and mode, improving robot policy performance over baselines.

Do as I Do: Dexterous Manipulation Data from Everyday Human Videos

cs.RO · 2026-06-17 · unverdicted · novelty 6.0

DO AS I DO reconstructs and retargets hand-object interactions from in-the-wild monocular RGB videos to produce dexterous robot manipulation trajectories, outperforming prior methods on ground-truth and online video datasets.

EgoInfinity: A Web-Scale 4D Hand-Object Interaction Data Engine for Any-View Robot Retargeting and Video-to-Action Robot Learning

cs.RO · 2026-06-16 · unverdicted · novelty 6.0

EgoInfinity is a modular pipeline that lifts in-the-wild RGB videos into agent-agnostic 4D hand-object data with interaction-aware refinement and retargets motions to diverse robot morphologies for video-to-action learning.

Contrastive Action-Image Pre-training for Visuomotor Control

cs.RO · 2026-06-15 · unverdicted · novelty 6.0

CAIP learns action-aligned visual representations via contrastive pre-training on human hand keypoints from egocentric video, outperforming DINOv2, SigLIP, MVP, and R3M with >30% gains on real dexterous manipulation tasks.

Enabling Extensible Embodied Capabilities with Tools

cs.RO · 2026-05-26 · unverdicted · novelty 6.0

Introduces Embodied Tool Protocol and tool externalization to improve embodied AI performance on perception and cognition tasks, with measured gains but limits on execution capabilities.

How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning

cs.RO · 2026-05-16 · unverdicted · novelty 6.0

DeMiAn re-annotates robot and egocentric videos with VLM-generated dense labels across motion, scene, pose, and reasoning aspects, then uses a learned instructor to boost policy success by 5 points on RoboCasa over task-only baselines.

LACE: Latent Visual Representation for Cross-Embodiment Learning

cs.RO · 2026-05-16 · unverdicted · novelty 6.0

LACE aligns human-robot visual features via semantic distribution matching on corresponding body parts plus Gram loss, yielding 65% better zero-shot policy transfer than baseline DINO.

GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

cs.RO · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

GuidedVLA improves VLA generalization by supervising individual attention heads with manually defined auxiliary signals for three task-relevant factors.

HumanNet: Scaling Human-centric Video Learning to One Million Hours

cs.CV · 2026-05-07 · unverdicted · novelty 6.0

HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.

Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation

cs.RO · 2026-05-02 · unverdicted · novelty 6.0

Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.
