arxiv: 2310.08864 · v9 · submitted 2023-10-13 · 💻 cs.RO

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Open X-Embodiment Collaboration, Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee

Acorn Pooley Agrim Gupta Ajay Mandlekar Ajinkya Jain Albert Tung Alex Bewley Alex Herzog Alex Irpan Alexander Khazatsky Anant Rai Anchit Gupta Andrew Wang Andrey Kolobov Anikait Singh Animesh Garg Aniruddha Kembhavi Annie Xie Anthony Brohan Antonin Raffin Archit Sharma Arefeh Yavary Arhan Jain Ashwin Balakrishna Ayzaan Wahid Ben Burgess-Limerick Beomjoon Kim Bernhard Schölkopf Blake Wulfe Brian Ichter Cewu Lu Charles Xu Charlotte Le Chelsea Finn Chen Wang Chenfeng Xu Cheng Chi Chenguang Huang Christine Chan Christopher Agia Chuer Pan Chuyuan Fu Coline Devin Danfei Xu Daniel Morton Danny Driess Daphne Chen Deepak Pathak Dhruv Shah Dieter Büchler Dinesh Jayaraman Dmitry Kalashnikov Dorsa Sadigh Edward Johns Ethan Foster Fangchen Liu Federico Ceola Fei Xia Feiyu Zhao Felipe Vieira Frujeri Freek Stulp Gaoyue Zhou Gaurav S. Sukhatme Gautam Salhotra Ge Yan Gilbert Feng Giulio Schiavi Glen Berseth Gregory Kahn Guangwen Yang Guanzhi Wang Hao Su Hao-Shu Fang Haochen Shi Henghui Bao Heni Ben Amor Henrik I Christensen Hiroki Furuta Homanga Bharadhwaj Homer Walke Hongjie Fang Huy Ha Igor Mordatch Ilija Radosavovic Isabel Leal Jacky Liang Jad Abou-Chakra Jaehyung Kim Jaimyn Drake Jan Peters Jan Schneider Jasmine Hsu Jay Vakil Jeannette Bohg Jeffrey Bingham Jeffrey Wu Jensen Gao Jiaheng Hu Jiajun Wu Jialin Wu Jiankai Sun Jianlan Luo Jiayuan Gu Jie Tan Jihoon Oh Jimmy Wu Jingpei Lu Jingyun Yang Jitendra Malik João Silvério Joey Hejna Jonathan Booher Jonathan Tompson Jonathan Yang Jordi Salvador Joseph J. Lim Junhyek Han Kaiyuan Wang Kanishka Rao Karl Pertsch Karol Hausman Keegan Go Keerthana Gopalakrishnan Ken Goldberg Kendra Byrne Kenneth Oslund Kento Kawaharazuka Kevin Black Kevin Lin Kevin Zhang Kiana Ehsani Kiran Lekkala Kirsty Ellis Krishan Rana Krishnan Srinivasan Kuan Fang Kunal Pratap Singh Kuo-Hao Zeng Kyle Hatch Kyle Hsu Laurent Itti Lawrence Yunliang Chen Lerrel Pinto Li Fei-Fei Liam Tan Linxi "Jim" Fan Lionel Ott Lisa Lee Luca Weihs Magnum Chen Marion Lepert Marius Memmel Masayoshi Tomizuka Masha Itkina Mateo Guaman Castro Max Spero Maximilian Du Michael Ahn Michael C. Yip Mingtong Zhang Mingyu Ding Minho Heo Mohan Kumar Srirama Mohit Sharma Moo Jin Kim Muhammad Zubair Irshad Naoaki Kanazawa Nicklas Hansen Nicolas Heess Nikhil J Joshi Niko Suenderhauf Ning Liu Norman Di Palo Nur Muhammad Mahi Shafiullah Oier Mees Oliver Kroemer Osbert Bastani Pannag R Sanketi Patrick "Tree" Miller Patrick Yin Paul Wohlhart Peng Xu Peter David Fagan Peter Mitrano Pierre Sermanet Pieter Abbeel Priya Sundaresan Qiuyu Chen Quan Vuong Rafael Rafailov Ran Tian Ria Doshi Roberto Martín-Martín Rohan Baijal Rosario Scalise Rose Hendrix Roy Lin Runjia Qian Ruohan Zhang Russell Mendonca Rutav Shah Ryan Hoque Ryan Julian Samuel Bustamante Sean Kirmani Sergey Levine Shan Lin Sherry Moore Shikhar Bahl Shivin Dass Shubham Sonawani Shubham Tulsiani Shuran Song Sichun Xu Siddhant Haldar Siddharth Karamcheti Simeon Adebola Simon Guist Soroush Nasiriany Stefan Schaal Stefan Welker Stephen Tian Subramanian Ramamoorthy Sudeep Dasari Suneel Belkhale Sungjae Park Suraj Nair Suvir Mirchandani Takayuki Osa Tanmay Gupta Tatsuya Harada Tatsuya Matsushima Ted Xiao Thomas Kollar Tianhe Yu Tianli Ding Todor Davchev Tony Z. Zhao Travis Armstrong Trevor Darrell Trinity Chung Vidhi Jain Vikash Kumar Vincent Vanhoucke Vitor Guizilini Wei Zhan Wenxuan Zhou Wolfram Burgard Xi Chen Xiangyu Chen Xiaolong Wang Xinghao Zhu Xinyang Geng Xiyuan Liu Xu Liangwei Xuanlin Li Yansong Pang Yao Lu Yecheng Jason Ma Yejin Kim Yevgen Chebotar Yifan Zhou Yifeng Zhu Yilin Wu Ying Xu Yixuan Wang Yonatan Bisk Yongqiang Dou Yoonyoung Cho Youngwoon Lee Yuchen Cui Yue Cao Yueh-Hua Wu Yujin Tang Yuke Zhu Yunchu Zhang Yunfan Jiang Yunshuang Li Yunzhu Li Yusuke Iwasawa Yutaka Matsuo Zehan Ma Zhuo Xu Zichen Jeff Cui Zichen Zhang Zipeng Fu Zipeng Lin

Pith reviewed 2026-05-11 17:18 UTC · model grok-4.3

classification 💻 cs.RO

keywords: robotic manipulation, multi-robot learning, policy transfer, generalist policies, embodiment, datasets

The pith

A single high-capacity model trained on data from 22 robots improves task performance on multiple individual platforms through positive transfer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper assembles and standardizes a large dataset of robotic manipulation skills collected from 22 different robots across 21 institutions. It trains one high-capacity model, RT-X, on the combined set and reports that this model improves the capabilities of multiple robots by drawing on experience from other platforms. The result suggests that experience gathered on one robot platform can strengthen learning on others. A sympathetic reader would care because most robotic learning still trains a separate model for every robot and task, which limits scale and efficiency.

Core claim

We assemble a dataset from 22 different robots demonstrating 527 skills. A high-capacity model trained on this data exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms.

What carries the argument

The high-capacity model (RT-X) trained on the standardized multi-robot dataset carries the argument: it is the evidence that cross-platform data produces measurable gains on the tasks of multiple robots.

If this is right

Robots achieve higher success rates on tasks by drawing on experience collected elsewhere without new data collection on the target platform.
A single model can be adapted to new robots, tasks, and environments more efficiently than training from scratch for each case.
Robotic learning can shift away from training isolated models for every application toward shared generalist policies.
The standardized dataset format enables further experiments on cross-robot generalization in manipulation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the positive transfer effect grows with additional robots and tasks, future datasets could be pooled at even larger scale to compound the gains.
Different research groups could contribute data in the same format and immediately benefit from improved performance on their own hardware.
The approach raises the question of how far the transfer extends when new robot morphologies or entirely unseen tasks are introduced.

Load-bearing premise

The chosen standardization and particular mix of data from the 22 robots produce net positive transfer rather than interference that would reduce performance.

What would settle it

A direct comparison in which the model trained on the combined dataset performs no better than, or worse than, separate models trained only on each robot's own data for the same tasks.
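One way to make "no better" concrete is a two-proportion z-test on evaluation success counts. The sketch below is an illustration only: it assumes independent binary trial outcomes and enough trials for the normal approximation to hold, and the counts are made up, not results from the paper. The helper name `two_proportion_z` is hypothetical.

```python
import math


def two_proportion_z(successes_a, trials_a, successes_b, trials_b):
    """Two-sided z-test for a difference between two success rates."""
    p_a = successes_a / trials_a
    p_b = successes_b / trials_b
    # Pooled success rate under the null hypothesis of equal rates.
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: joint model (a) vs. single-robot baseline (b).
z, p_value = two_proportion_z(60, 100, 40, 100)
print(f"z = {z:.3f}, two-sided p = {p_value:.4f}")
```

A large p-value on every robot would be consistent with the "no better" outcome described above; a small one with a negative `z` would indicate the joint model is worse.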

Original abstract

Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website https://robotics-transformer-x.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The dataset release from 22 robots is the solid part; the positive transfer claim for RT-X is undercut by missing controls for total data volume.

The main thing here is a large standardized dataset collected across 22 robots and 21 institutions, covering 527 skills in over 160k tasks, plus an initial model called RT-X trained on the combined data. The paper shows some performance gains on individual robots when using the joint model. That scale and the open release are the concrete advances worth noting.

Prior multi-robot work existed, but nothing at this collaborative breadth with a common format and public data drop. The effort to make the data usable for others is a practical step forward for anyone trying to train generalist policies. The experiments give an existence proof that a single high-capacity model can handle multiple embodiments without immediate collapse.

The soft spot is exactly the one flagged in the stress-test. The comparisons pit the full multi-robot model against per-robot baselines trained only on their native subsets. There is no matched-volume control, such as subsampling the combined data to equal the per-robot totals. This leaves open the possibility that gains come from raw data quantity rather than any cross-embodiment knowledge transfer. The abstract states positive transfer, but without those ablations the central claim rests on weaker ground than it needs to.

The paper is aimed at robotic learning groups working on scaling and generalist models. Readers who want a ready-made large dataset for their own experiments will get immediate value from the release and format. Those looking for airtight evidence of transfer will find the results suggestive but incomplete. It deserves a serious referee because the data itself is a verifiable contribution that the community can use and extend, even if the initial modeling results require tighter controls. I would send it to peer review with a request for volume-matched baselines in the experiments.

Referee Report

1 major / 2 minor

Summary. The paper assembles a large-scale, standardized dataset of robotic manipulation tasks collected from 22 robots across 21 institutions, covering 527 skills and 160266 tasks. It introduces RT-X, a high-capacity transformer-based policy trained on the combined data, and reports experiments claiming that this model exhibits positive cross-embodiment transfer, improving task performance on multiple robots by leveraging experience from other platforms.

Significance. If the positive-transfer claim is substantiated with proper controls, the work would be significant for robotics by providing open, standardized datasets and models that facilitate research on generalist X-robot policies, analogous to foundation models in other domains. The collaborative data release itself is a substantial community resource.

major comments (1)

[§5 (Experiments)] The reported comparisons between RT-X (trained on the full 160k+ task multi-robot dataset) and per-robot baselines (trained only on native data subsets) do not control for total training data volume. Without an additional baseline that matches the data volume seen by RT-X (e.g., via subsampling the combined dataset to equal the per-robot volume or training on equivalent-scale single-robot data), performance gains cannot be unambiguously attributed to cross-embodiment transfer rather than simple scaling effects. This directly undermines the central claim that RT-X improves capabilities 'by leveraging experience from other platforms.'
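A minimal sketch of the subsampling control suggested in this comment, assuming each episode is tagged with the robot that produced it. The robot names, episode counts, and the `volume_matched_subsample` helper are hypothetical and do not come from the paper or its released code.

```python
import random
from collections import Counter


def volume_matched_subsample(episodes, target_size, seed=0):
    """Uniformly sample `target_size` episodes, without replacement, from the pooled list."""
    if target_size > len(episodes):
        raise ValueError("target_size exceeds the pooled dataset size")
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    return rng.sample(episodes, target_size)


# Hypothetical pooled dataset: (robot_name, episode_index) pairs.
pooled = (
    [("robot_a", i) for i in range(50)]
    + [("robot_b", i) for i in range(30)]
    + [("robot_c", i) for i in range(20)]
)

# Budget = number of native episodes for the robot under evaluation.
native_counts = Counter(robot for robot, _ in pooled)
budget = native_counts["robot_a"]

matched = volume_matched_subsample(pooled, budget, seed=0)
print(len(matched), Counter(robot for robot, _ in matched))
```

Training the same architecture once on `matched` and once on the native `robot_a` episodes holds data volume fixed, so any remaining gap would be attributable to the data mix rather than to quantity. Uniform sampling keeps the pooled robot proportions only in expectation; a stratified draw would be an alternative design.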

minor comments (2)

[Abstract] '160266 tasks' should be written with a comma as '160,266 tasks' for readability.
[§3 (Dataset)] The standardization procedure for heterogeneous robot data (e.g., action spaces, observation formats) is described at a high level; a more detailed table or pseudocode would help readers reproduce the exact preprocessing pipeline.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point by point below.

Referee: [§5 (Experiments)] The reported comparisons between RT-X (trained on the full 160k+ task multi-robot dataset) and per-robot baselines (trained only on native data subsets) do not control for total training data volume. Without an additional baseline that matches the data volume seen by RT-X (e.g., via subsampling the combined dataset to equal the per-robot volume or training on equivalent-scale single-robot data), performance gains cannot be unambiguously attributed to cross-embodiment transfer rather than simple scaling effects. This directly undermines the central claim that RT-X improves capabilities 'by leveraging experience from other platforms.'

Authors: We agree that an explicit control for total training data volume would strengthen the attribution of gains specifically to cross-embodiment transfer. The current per-robot baselines use only the native data available for each robot, while RT-X is trained on the full aggregated set; this is the standard comparison for demonstrating the value of multi-robot data. To isolate the effect of embodiment diversity from scaling, we will add in the revised Section 5 a new baseline that subsamples the combined multi-robot dataset to match the data volume of the largest single-robot subset and retrains a model under identical conditions. This addition will clarify whether the observed improvements exceed what would be expected from data volume alone.

Revision: yes

Circularity Check

0 steps flagged

Empirical dataset curation and model training with no derivation chain

The paper assembles a standardized multi-robot dataset (22 platforms, 527 skills) and trains RT-X, reporting positive transfer via experimental comparisons. No mathematical derivations, predictions, or uniqueness theorems are claimed; results rest on direct training and evaluation. Self-citations (if any) are not load-bearing for the central empirical claim. The skeptic's concern about data-volume confounding is a valid experimental-design issue but does not constitute circularity under the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical scaling paper; no explicit free parameters, axioms, or invented entities are introduced beyond standard practices in large-scale machine learning.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · unclear

Paper passage: We assemble a dataset from 22 different robots... RT-X exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 54 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Aligning Flow Map Policies with Optimal Q-Guidance
cs.LG 2026-05 unverdicted novelty 7.0

Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.
Dynamic Execution Commitment of Vision-Language-Action Models
cs.CV 2026-05 unverdicted novelty 7.0

A3 determines the execution horizon in VLA models as the longest prefix of actions that passes consensus-based verification and sequential consistency checks.
SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation
cs.RO 2026-05 unverdicted novelty 7.0

SABER provides 44.8K multi-representation action samples from unscripted retail environments that raise a VLA model's mean success rate on ten manipulation tasks from 13.4% to 29.3%.
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
cs.RO 2026-05 unverdicted novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
cs.RO 2026-05 unverdicted novelty 7.0

ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.
Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion
cs.RO 2026-05 unverdicted novelty 7.0

Action Agent pairs LLM-driven video generation with a flow-constrained diffusion transformer to produce velocity commands, raising video success to 86% and delivering 64.7% real-world navigation on a Unitree G1 humanoid.
OmniRobotHome: A Multi-Camera Platform for Real-Time Multiadic Human-Robot Interaction
cs.RO 2026-04 unverdicted novelty 7.0

A 48-camera residential platform delivers real-time occlusion-robust 3D perception and coordinated actuation for multi-human multi-robot interaction in a shared home workspace.
Atomic-Probe Governance for Skill Updates in Compositional Robot Policies
cs.RO 2026-04 unverdicted novelty 7.0

A cross-version swap protocol reveals dominant skills that swing composition success by up to 50 percentage points, and an atomic probe with selective revalidation governs updates at lower cost than always re-testing ...
π₀.₇: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
cs.LG 2026-04 unverdicted novelty 7.0

π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study
cs.RO 2026-04 conditional novelty 7.0

A governed capability evolution framework with interface, policy, behavioral, and recovery checks reduces unsafe activations to zero in embodied agent upgrades while preserving task success rates.
3D-VLA: A 3D Vision-Language-Action Generative World Model
cs.CV 2024-03 unverdicted novelty 7.0

3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.
Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 6.0

VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.
Reinforcing VLAs in Task-Agnostic World Models
cs.AI 2026-05 unverdicted novelty 6.0

RAW-Dream lets VLAs learn new tasks in zero-shot imagination by using a world model pre-trained only on task-free behaviors and an unmodified VLM to supply rewards, with dual-noise verification to limit hallucinations.
HumanNet: Scaling Human-centric Video Learning to One Million Hours
cs.CV 2026-05 unverdicted novelty 6.0

HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.
Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation
cs.RO 2026-05 unverdicted novelty 6.0

VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
cs.RO 2026-05 unverdicted novelty 6.0

ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.
An Efficient Metric for Data Quality Measurement in Imitation Learning
cs.RO 2026-05 unverdicted novelty 6.0

Power spectral density of trajectories ranks demonstration quality for imitation learning, enabling rollout-free curation that improves fine-tuned policy success.
Lucid-XR: An Extended-Reality Data Engine for Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

Lucid-XR uses XR-headset physics simulation and physics-guided video generation to create synthetic data that trains robot policies transferring zero-shot to unseen real-world manipulation tasks.
PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations
cs.AI 2026-04 unverdicted novelty 6.0

PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
Atomic-Probe Governance for Skill Updates in Compositional Robot Policies
cs.RO 2026-04 unverdicted novelty 6.0

Empirical study on robosuite tasks reveals a dominant-skill effect in compositions and shows that an atomic probe approximates full revalidation for skill updates at much lower cost.
M²-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills
cs.RO 2026-04 unverdicted novelty 6.0

M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.
QDTraj: Exploration of Diverse Trajectory Primitives for Articulated Objects Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

QDTraj uses Quality-Diversity algorithms with sparse rewards to produce at least five times more diverse high-performing trajectories for articulated object manipulation than compared methods, validated across 30 obje...
XRZero-G0: Pushing the Frontier of Dexterous Robotic Manipulation with Interfaces, Quality and Ratios
cs.RO 2026-04 unverdicted novelty 6.0

XRZero-G0 enables 2000-hour robot-free datasets that, when mixed 10:1 with real-robot data, match full real-robot performance at 1/20th the cost and support zero-shot transfer.
EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems
cs.RO 2026-04 unverdicted novelty 6.0

EmbodiedGovBench is a new benchmark framework that measures embodied agent systems on seven governance dimensions including policy adherence, recovery success, and upgrade safety.
WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations
cs.RO 2026-04 unverdicted novelty 6.0

WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match tele...
Zero-shot World Models Are Developmentally Efficient Learners
cs.AI 2026-04 unverdicted novelty 6.0

A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.
SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds
cs.RO 2026-04 unverdicted novelty 6.0

SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformabl...
Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study
cs.RO 2026-04 unverdicted novelty 6.0

A governed capability evolution framework for embodied agents uses four compatibility checks and a staged pipeline to achieve zero unsafe activations during upgrades while retaining comparable task success rates.
OpenRC: An Open-Source Robotic Colonoscopy Framework for Multimodal Data Acquisition and Autonomy Research
cs.RO 2026-04 unverdicted novelty 6.0

OpenRC is an open-source robotic colonoscopy platform with hardware retrofit and a multimodal dataset of nearly 1,900 episodes for autonomy and VLA research.
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
cs.RO 2025-10 unverdicted novelty 6.0

InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
cs.RO 2025-04 unverdicted novelty 6.0

Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
cs.RO 2025-02 accept novelty 6.0

OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
cs.RO 2025-02 unverdicted novelty 6.0

DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.
FAST: Efficient Action Tokenization for Vision-Language-Action Models
cs.RO 2025-01 unverdicted novelty 6.0

FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diff...
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
cs.CV 2024-12 unverdicted novelty 6.0

Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
cs.RO 2024-11 unverdicted novelty 6.0

CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...
π₀: A Vision-Language-Action Flow Model for General Robot Control
cs.LG 2024-10 unverdicted novelty 6.0

π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
cs.RO 2024-10 unverdicted novelty 6.0

GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.
OpenVLA: An Open-Source Vision-Language-Action Model
cs.RO 2024-06 unverdicted novelty 6.0

OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.
RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots
cs.RO 2024-06 unverdicted novelty 6.0

RoboCasa supplies a large-scale kitchen simulator, generative assets, 100 tasks, and automated data pipelines that produce a clear scaling trend in imitation learning for generalist robots.
Octo: An Open-Source Generalist Robot Policy
cs.RO 2024-05 unverdicted novelty 6.0

Octo is an open-source transformer-based generalist robot policy pretrained on 800k trajectories that serves as an effective initialization for finetuning across diverse robotic platforms.
Evaluating Real-World Robot Manipulation Policies in Simulation
cs.RO 2024-05 conditional novelty 6.0

SIMPLER simulated environments yield policy performance that correlates strongly with real-world robot manipulation results and captures similar sensitivity to distribution shifts.
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
cs.RO 2024-03 accept novelty 6.0

DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
cs.CV 2024-02 unverdicted novelty 6.0

DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...
Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation
cs.RO 2024-01 conditional novelty 6.0

A low-cost whole-body teleoperation system enables effective imitation learning for complex bimanual mobile manipulation by co-training on mobile and static demonstration datasets.
ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 5.0

ProcVLM learns procedure-grounded dense progress rewards for robotic manipulation via a reasoning-before-estimation VLM trained on a 60M-frame synthesized corpus from 30 embodied datasets.
MiniVLA-Nav v1: A Multi-Scene Simulation Dataset for Language-Conditioned Robot Navigation
cs.RO 2026-05 unverdicted novelty 5.0

MiniVLA-Nav v1 provides 1,174 episodes of language-instructed robot navigation in photorealistic simulations with RGB, depth, segmentation, and expert action data.
VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
cs.RO 2026-04 unverdicted novelty 5.0

VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.
ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning
cs.RO 2026-04 unverdicted novelty 5.0

ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.
Jump-Start Reinforcement Learning with Vision-Language-Action Regularization
cs.LG 2026-04 unverdicted novelty 5.0

VLAJS augments PPO with sparse annealed VLA guidance through directional regularization to cut required interactions by over 50% on manipulation tasks and enable zero-shot sim-to-real transfer.
World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
A Co-Evolutionary Theory of Human-AI Coexistence: Mutualism, Governance, and Dynamics in Complex Societies
cs.CY 2026-04 unverdicted novelty 4.0

Human-AI coexistence is best modeled as conditional mutualism under governance, formalized as a multiplex dynamical system whose simulations show stable high-coexistence equilibria only under balanced institutional oversight.
JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy
cs.RO 2026-04 unverdicted novelty 4.0

JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.
Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines
cs.RO 2026-04 unverdicted novelty 3.0

A survey of VLA robotics research identifies data infrastructure as the primary bottleneck and distills four open challenges in representation alignment, multimodal supervision, reasoning assessment, and scalable data...

Reference graph

Works this paper leans on

135 extracted references · 135 canonical work pages · cited by 51 Pith papers · 5 internal anchors

[1]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763

work page 2021
[2]

GPT-4 technical report,

OpenAI, “GPT-4 technical report,” 2023

work page 2023
[3]

PaLM 2 Technical Report

R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen et al., “PaLM 2 technical report,” arXiv preprint arXiv:2305.10403, 2023

work page internal anchor Pith review arXiv 2023
[4]

Google landmarks dataset v2 - a large-scale benchmark for instance-level recognition and retrieval,

T. Weyand, A. Araujo, B. Cao, and J. Sim, “Google landmarks dataset v2 - a large-scale benchmark for instance-level recognition and retrieval,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

work page 2020
[5]

Tencent ML-images: A large-scale multi-label image database for visual representation learning,

B. Wu, W. Chen, Y. Fan, Y. Zhang, J. Hou, J. Liu, and T. Zhang, “Tencent ML-images: A large-scale multi-label image database for visual representation learning,” IEEE Access, vol. 7, 2019

work page 2019
[6]

DBpedia - a large-scale, multilingual knowledge base extracted from wikipedia

J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer, “DBpedia - a large-scale, multilingual knowledge base extracted from wikipedia.” Semantic Web, vol. 6, no. 2, pp. 167–195, 2015. [Online]. Available: http://dblp.uni-trier.de/db/journals/semweb/semweb6.html#LehmannIJJKMHMK15

work page 2015
[7]

Web data commons - extracting structured data from two large web corpora

H. Mühleisen and C. Bizer, “Web data commons - extracting structured data from two large web corpora.” LDOW, vol. 937, pp. 133–145, 2012

work page 2012
[8]

RT-1: Robotics transformer for real-world control at scale,

A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu et al., “RT-1: Robotics transformer for real-world control at scale,” Robotics: Science and Systems (RSS), 2023

work page 2023
[9]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn et al., “RT-2: Vision-language-action models transfer web knowledge to robotic control,” arXiv preprint arXiv:2307.15818, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

Learning modular neural network policies for multi-task and multi-robot transfer,

C. Devin, A. Gupta, T. Darrell, P. Abbeel, and S. Levine, “Learning modular neural network policies for multi-task and multi-robot transfer,” in 2017 IEEE international conference on robotics and automation (ICRA). IEEE, 2017, pp. 2169–2176

work page 2017
[11]

Hardware conditioned policies for multi-robot transfer learning,

T. Chen, A. Murali, and A. Gupta, “Hardware conditioned policies for multi-robot transfer learning,” in Advances in Neural Information Processing Systems, 2018, pp. 9355–9366

work page 2018
[12]

Graph networks as learnable physics engines for inference and control,

A. Sanchez-Gonzalez, N. Heess, J. T. Springenberg, J. Merel, M. Riedmiller, R. Hadsell, and P. Battaglia, “Graph networks as learnable physics engines for inference and control,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. PMLR, 10–15 Jul 2018,...

work page 2018
[13]

Learning to control self-assembling morphologies: a study of generalization via modularity,

D. Pathak, C. Lu, T. Darrell, P. Isola, and A. A. Efros, “Learning to control self-assembling morphologies: a study of generalization via modularity,” Advances in Neural Information Processing Systems, vol. 32, 2019

work page 2019
[14]

Variable impedance control in end-effector space. an action space for reinforcement learning in contact rich tasks,

R. Martín-Martín, M. Lee, R. Gardner, S. Savarese, J. Bohg, and A. Garg, “Variable impedance control in end-effector space. an action space for reinforcement learning in contact rich tasks,” in Proceedings of the International Conference of Intelligent Robots and Systems (IROS), 2019

work page 2019
[15]

One policy to control them all: Shared modular policies for agent-agnostic control,

W. Huang, I. Mordatch, and D. Pathak, “One policy to control them all: Shared modular policies for agent-agnostic control,” in ICML, 2020

work page 2020
[16]

My body is a cage: the role of morphology in graph-based incompatible control,

V. Kurin, M. Igl, T. Rocktäschel, W. Boehmer, and S. Whiteson, “My body is a cage: the role of morphology in graph-based incompatible control,” arXiv preprint arXiv:2010.01856, 2020

work page arXiv 2020
[17]

XIRL: Cross-embodiment inverse reinforcement learning,

K. Zakka, A. Zeng, P. Florence, J. Tompson, J. Bohg, and D. Dwibedi, “XIRL: Cross-embodiment inverse reinforcement learning,” Conference on Robot Learning (CoRL), 2021

work page 2021
[18]

Bayesian meta-learning for few-shot policy adaptation across robotic platforms,

A. Ghadirzadeh, X. Chen, P. Poklukar, C. Finn, M. Björkman, and D. Kragic, “Bayesian meta-learning for few-shot policy adaptation across robotic platforms,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 1274–1280

work page 2021
[19]

Metamorph: Learning universal controllers with transformers,

A. Gupta, L. Fan, S. Ganguli, and L. Fei-Fei, “Metamorph: Learning universal controllers with transformers,” in International Conference on Learning Representations, 2021

work page 2021
[20]

A generalist dynamics model for control,

I. Schubert, J. Zhang, J. Bruce, S. Bechtle, E. Parisotto, M. Riedmiller, J. T. Springenberg, A. Byravan, L. Hasenclever, and N. Heess, “A generalist dynamics model for control,” 2023

work page 2023
[21]

GNM: A general navigation model to drive any robot,

D. Shah, A. Sridhar, A. Bhorkar, N. Hirose, and S. Levine, “GNM: A general navigation model to drive any robot,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 7226–7233

work page 2023
[22]

Modularity through attention: Efficient training and transfer of language-conditioned policies for robot manipulation,

Y. Zhou, S. Sonawani, M. Phielipp, S. Stepputtis, and H. Amor, “Modularity through attention: Efficient training and transfer of language-conditioned policies for robot manipulation,” in Proceedings of The 6th Conference on Robot Learning, ser. Proceedings of Machine Learning Research, K. Liu, D. Kulic, and J. Ichnowski, Eds., vol. 205. PMLR, 14–18 Dec...

work page 2023
[23]

RoboNet: Large-scale multi-robot learning,

S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn, “RoboNet: Large-scale multi-robot learning,” in Conference on Robot Learning (CoRL), vol. 100. PMLR, 2019, pp. 885–897

work page 2019
[24]

Know thyself: Transferable visual control policies through robot-awareness,

E. S. Hu, K. Huang, O. Rybkin, and D. Jayaraman, “Know thyself: Transferable visual control policies through robot-awareness,” in International Conference on Learning Representations, 2022

work page 2022
[25]

RoboCat: A self-improving foundation agent for robotic manipulation

K. Bousmalis, G. Vezzani, D. Rao, C. Devin, A. X. Lee, M. Bauza, T. Davchev, Y. Zhou, A. Gupta, A. Raju et al., “RoboCat: A self-improving foundation agent for robotic manipulation,” arXiv preprint arXiv:2306.11706, 2023

work page arXiv 2023
[26]

Polybot: Training one policy across robots while embracing variability,

J. Yang, D. Sadigh, and C. Finn, “Polybot: Training one policy across robots while embracing variability,” arXiv preprint arXiv:2307.03719, 2023

work page arXiv 2023
[27]

A generalist agent,

S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-maron, M. Giménez, Y. Sulsky, J. Kay, J. T. Springenberg, T. Eccles, J. Bruce, A. Razavi, A. Edwards, N. Heess, Y. Chen, R. Hadsell, O. Vinyals, M. Bordbar, and N. de Freitas, “A generalist agent,” Transactions on Machine Learning Research, 2022

work page 2022
[28]

Bridging action space mismatch in learning from demonstrations,

G. Salhotra, I.-C. A. Liu, and G. Sukhatme, “Bridging action space mismatch in learning from demonstrations,” arXiv preprint arXiv:2304.03833, 2023

work page arXiv 2023
[29]

Robot learning with sensorimotor pre-training,

I. Radosavovic, B. Shi, L. Fu, K. Goldberg, T. Darrell, and J. Malik, “Robot learning with sensorimotor pre-training,” in Conference on Robot Learning, 2023

work page 2023
[30]

UniGrasp: Learning a unified model to grasp with multifingered robotic hands,

L. Shao, F. Ferreira, M. Jorda, V. Nambiar, J. Luo, E. Solowjow, J. A. Ojea, O. Khatib, and J. Bohg, “UniGrasp: Learning a unified model to grasp with multifingered robotic hands,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 2286–2293, 2020

work page 2020
[31]

Adagrasp: Learning an adaptive gripper-aware grasping policy,

Z. Xu, B. Qi, S. Agrawal, and S. Song, “Adagrasp: Learning an adaptive gripper-aware grasping policy,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 4620–4626

work page 2021
[32]

ViNT: A Foundation Model for Visual Navigation,

D. Shah, A. Sridhar, N. Dashora, K. Stachowicz, K. Black, N. Hirose, and S. Levine, “ViNT: A Foundation Model for Visual Navigation,” in 7th Annual Conference on Robot Learning (CoRL), 2023

work page 2023
[33]

Imitation from observation: Learning to imitate behaviors from raw video via context translation,

Y. Liu, A. Gupta, P. Abbeel, and S. Levine, “Imitation from observation: Learning to imitate behaviors from raw video via context translation,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1118–1125

work page 2018
[34]

One-shot imitation from observing humans via domain-adaptive meta-learning,

T. Yu, C. Finn, S. Dasari, A. Xie, T. Zhang, P. Abbeel, and S. Levine, “One-shot imitation from observing humans via domain-adaptive meta-learning,” Robotics: Science and Systems XIV, 2018

work page 2018
[35]

Third-person visual imitation learning via decoupled hierarchical controller,

P. Sharma, D. Pathak, and A. Gupta, “Third-person visual imitation learning via decoupled hierarchical controller,” Advances in Neural Information Processing Systems, vol. 32, 2019

work page 2019
[36]

Avid: Learning multi-stage tasks via pixel-level translation of human videos

L. Smith, N. Dhawan, M. Zhang, P. Abbeel, and S. Levine, “Avid: Learning multi-stage tasks via pixel-level translation of human videos,” arXiv preprint arXiv:1912.04443, 2019

work page arXiv 2019
[37]

Learning one-shot imitation from humans without humans,

A. Bonardi, S. James, and A. J. Davison, “Learning one-shot imitation from humans without humans,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3533–3539, 2020

work page 2020
[38]

Reinforcement learning with videos: Combining offline observations with interaction,

K. Schmeckpeper, O. Rybkin, K. Daniilidis, S. Levine, and C. Finn, “Reinforcement learning with videos: Combining offline observations with interaction,” in Conference on Robot Learning. PMLR, 2021, pp. 339–354

work page 2021
[39]

Learning by watching: Physical imitation of manipulation skills from human videos,

H. Xiong, Q. Li, Y.-C. Chen, H. Bharadhwaj, S. Sinha, and A. Garg, “Learning by watching: Physical imitation of manipulation skills from human videos,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 7827–7834

work page 2021
[40]

BC-Z: Zero-shot task generalization with robotic imitation learning,

E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn, “BC-Z: Zero-shot task generalization with robotic imitation learning,” in Conference on Robot Learning (CoRL), 2021, pp. 991–1002

work page 2021
[41]

Human-to-robot imitation in the wild,

S. Bahl, A. Gupta, and D. Pathak, “Human-to-robot imitation in the wild,” Robotics: Science and Systems (RSS), 2022

work page 2022
[42]

Embodied concept learner: Self-supervised learning of concepts and mapping through instruction following,

M. Ding, Y. Xu, Z. Chen, D. D. Cox, P. Luo, J. B. Tenenbaum, and C. Gan, “Embodied concept learner: Self-supervised learning of concepts and mapping through instruction following,” in Conference on Robot Learning. PMLR, 2023, pp. 1743–1754

work page 2023
[43]

Affordances from human videos as a versatile representation for robotics,

S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak, “Affordances from human videos as a versatile representation for robotics,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 13 778–13 790

work page 2023
[44]

Unsupervised Perceptual Rewards for Imitation Learning

P. Sermanet, K. Xu, and S. Levine, “Unsupervised perceptual rewards for imitation learning,” arXiv preprint arXiv:1612.06699, 2016

work page Pith review arXiv 2016
[45]

Concept2Robot: Learning manipulation concepts from instructions and human demonstrations,

L. Shao, T. Migimatsu, Q. Zhang, K. Yang, and J. Bohg, “Concept2Robot: Learning manipulation concepts from instructions and human demonstrations,” in Proceedings of Robotics: Science and Systems (RSS), 2020

work page 2020
[46]

Learning generalizable robotic reward functions from “in-the-wild” human videos,

A. S. Chen, S. Nair, and C. Finn, “Learning generalizable robotic reward functions from “in-the-wild” human videos,” arXiv preprint arXiv:2103.16817, 2021

work page arXiv 2021
[47]

Graph inverse reinforcement learning from diverse videos,

S. Kumar, J. Zamora, N. Hansen, R. Jangir, and X. Wang, “Graph inverse reinforcement learning from diverse videos,” in Conference on Robot Learning. PMLR, 2023, pp. 55–66

work page 2023
[48]

Learning reward functions for robotic manipulation by observing humans,

M. Alakuijala, G. Dulac-Arnold, J. Mairal, J. Ponce, and C. Schmid, “Learning reward functions for robotic manipulation by observing humans,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 5006–5012

work page 2023
[49]

Manipulator-independent representations for visual imitation,

Y. Zhou, Y. Aytar, and K. Bousmalis, “Manipulator-independent representations for visual imitation,” 2021

work page 2021
[50]

Mimicplay: Long-horizon imitation learning by watching human play,

C. Wang, L. Fan, J. Sun, R. Zhang, L. Fei-Fei, D. Xu, Y. Zhu, and A. Anandkumar, “Mimicplay: Long-horizon imitation learning by watching human play,” in Conference on Robot Learning, 2023

work page 2023
[51]

Learning predictive models from observation and interaction,

K. Schmeckpeper, A. Xie, O. Rybkin, S. Tian, K. Daniilidis, S. Levine, and C. Finn, “Learning predictive models from observation and interaction,” in European Conference on Computer Vision. Springer, 2020, pp. 708–725

work page 2020
[52]

R3m: A universal visual representation for robot manipulation,

S. Nair, A. Rajeswaran, V. Kumar, C. Finn, and A. Gupta, “R3m: A universal visual representation for robot manipulation,” in CoRL, 2022

work page 2022
[53]

Masked visual pre-training for motor control

T. Xiao, I. Radosavovic, T. Darrell, and J. Malik, “Masked visual pre-training for motor control,” arXiv preprint arXiv:2203.06173, 2022

work page arXiv 2022
[54]

Real-world robot learning with masked visual pre-training,

I. Radosavovic, T. Xiao, S. James, P. Abbeel, J. Malik, and T. Darrell, “Real-world robot learning with masked visual pre-training,” in Conference on Robot Learning, 2022

work page 2022
[55]

VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

Y. J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V. Kumar, and A. Zhang, “Vip: Towards universal visual reward and representation via value-implicit pre-training,” arXiv preprint arXiv:2210.00030, 2022

work page internal anchor Pith review arXiv 2022
[56]

Where are we in the search for an artificial visual cortex for embodied intelligence?

A. Majumdar, K. Yadav, S. Arnaud, Y. J. Ma, C. Chen, S. Silwal, A. Jain, V.-P. Berges, P. Abbeel, J. Malik et al., “Where are we in the search for an artificial visual cortex for embodied intelligence?” arXiv preprint arXiv:2303.18240, 2023

work page arXiv 2023
[57]

Language-driven representation learning for robotics,

S. Karamcheti, S. Nair, A. S. Chen, T. Kollar, C. Finn, D. Sadigh, and P. Liang, “Language-driven representation learning for robotics,” Robotics: Science and Systems (RSS), 2023

work page 2023
[58]

EC2: Emergent communication for embodied control,

Y. Mu, S. Yao, M. Ding, P. Luo, and C. Gan, “EC2: Emergent communication for embodied control,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6704–6714

work page 2023
[59]

Affordances from human videos as a versatile representation for robotics,

S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak, “Affordances from human videos as a versatile representation for robotics,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13 778–13 790

work page 2023
[60]

Efficient grasping from RGBD images: Learning using a new rectangle representation,

Y. Jiang, S. Moseson, and A. Saxena, “Efficient grasping from RGBD images: Learning using a new rectangle representation,” in 2011 IEEE International conference on robotics and automation. IEEE, 2011, pp. 3304–3311

work page 2011
[61]

Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours,

L. Pinto and A. K. Gupta, “Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours,” 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 3406–3413, 2015

work page 2016
[62]

Leveraging big data for grasp planning,

D. Kappler, J. Bohg, and S. Schaal, “Leveraging big data for grasp planning,” in ICRA, 2015, pp. 4304–4311

work page 2015
[63]

Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,

J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, “Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,” in Robotics: Science and Systems (RSS), 2017

work page 2017
[64]

Jacquard: A large scale dataset for robotic grasp detection,

A. Depierre, E. Dellandréa, and L. Chen, “Jacquard: A large scale dataset for robotic grasp detection,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 3511–3516

work page 2018
[65]

Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,

S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” The International journal of robotics research, vol. 37, no. 4-5, pp. 421–436, 2018

work page 2018
[66]

QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation,

D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke et al., “QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation,” arXiv preprint arXiv:1806.10293, 2018

work page arXiv 2018
[67]

Contactdb: Analyzing and predicting grasp contact via thermal imaging,

S. Brahmbhatt, C. Ham, C. Kemp, and J. Hays, “Contactdb: Analyzing and predicting grasp contact via thermal imaging,” 04 2019

work page 2019
[68]

Graspnet-1billion: a large-scale benchmark for general object grasping,

H.-S. Fang, C. Wang, M. Gou, and C. Lu, “Graspnet-1billion: a large-scale benchmark for general object grasping,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 444–11 453

work page 2020
[69]

ACRONYM: A large-scale grasp dataset based on simulation,

C. Eppner, A. Mousavian, and D. Fox, “ACRONYM: A large-scale grasp dataset based on simulation,” in 2021 IEEE Int. Conf. on Robotics and Automation, ICRA, 2020

work page 2021
[70]

Using simulation and domain adaptation to improve efficiency of deep robotic grasping,

K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, S. Levine, and V. Vanhoucke, “Using simulation and domain adaptation to improve efficiency of deep robotic grasping,” in ICRA, 2018, pp. 4243–4250

work page 2018
[71]

Fanuc manipulation: A dataset for learning-based manipulation with fanuc mate 200iD robot,

X. Zhu, R. Tian, C. Xu, M. Huo, W. Zhan, M. Tomizuka, and M. Ding, “Fanuc manipulation: A dataset for learning-based manipulation with fanuc mate 200iD robot,” https://sites.google.com/berkeley.edu/fanuc-manipulation, 2023

work page 2023
[72]

More than a million ways to be pushed. a high-fidelity experimental dataset of planar pushing,

K.-T. Yu, M. Bauza, N. Fazeli, and A. Rodriguez, “More than a million ways to be pushed. a high-fidelity experimental dataset of planar pushing,” in 2016 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, 2016, pp. 30–37

work page 2016
[73]

Deep visual foresight for planning robot motion,

C. Finn and S. Levine, “Deep visual foresight for planning robot motion,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 2786–2793

work page 2017
[74]

Visual foresight: Model-based deep reinforcement learning for vision-based robotic control,

F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine, “Visual foresight: Model-based deep reinforcement learning for vision-based robotic control,” arXiv preprint arXiv:1812.00568, 2018

work page arXiv 2018
[75]

The princeton shape benchmark,

P. Shilane, P. Min, M. Kazhdan, and T. Funkhouser, “The princeton shape benchmark,” in Shape Modeling Applications, 2004, pp. 167–388

work page 2004
[76]

3DNet: Large-Scale Object Class Recognition from CAD Models,

W. Wohlkinger, A. Aldoma Buchaca, R. Rusu, and M. Vincze, “3DNet: Large-Scale Object Class Recognition from CAD Models,” in IEEE International Conference on Robotics and Automation (ICRA), 2012

work page 2012
[77]

The kit object models database: An object model database for object recognition, localization and manipulation in service robotics,

A. Kasper, Z. Xue, and R. Dillmann, “The kit object models database: An object model database for object recognition, localization and manipulation in service robotics,” The International Journal of Robotics Research, vol. 31, no. 8, pp. 927–934, 2012

work page 2012
[78]

BigBIRD: A large-scale 3D database of object instances,

A. Singh, J. Sha, K. S. Narayan, T. Achim, and P. Abbeel, “BigBIRD: A large-scale 3D database of object instances,” in IEEE International Conference on Robotics and Automation (ICRA), 2014, pp. 509–516

work page 2014
[79]

Benchmarking in manipulation research: Using the Yale-CMU-Berkeley object and model set,

B. Calli, A. Walsman, A. Singh, S. Srinivasa, P. Abbeel, and A. M. Dollar, “Benchmarking in manipulation research: Using the Yale-CMU-Berkeley object and model set,” IEEE Robotics & Automation Magazine, vol. 22, no. 3, pp. 36–52, 2015

work page 2015
[80]

3D ShapeNets: A deep representation for volumetric shapes,

Zhirong Wu, S. Song, A. Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and J. Xiao, “3D ShapeNets: A deep representation for volumetric shapes,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1912–1920

work page 2015

Showing first 80 references.