pith. machine review for the scientific record.

arxiv: 2310.08864 · v9 · submitted 2023-10-13 · 💻 cs.RO

Recognition: 2 theorem links · Lean Theorem

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Open X-Embodiment Collaboration, Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee
Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Annie Xie, Anthony Brohan, Antonin Raffin, Archit Sharma, Arefeh Yavary, Arhan Jain, Ashwin Balakrishna, Ayzaan Wahid, Ben Burgess-Limerick, Beomjoon Kim, Bernhard Schölkopf, Blake Wulfe, Brian Ichter, Cewu Lu, Charles Xu, Charlotte Le, Chelsea Finn, Chen Wang, Chenfeng Xu, Cheng Chi, Chenguang Huang, Christine Chan, Christopher Agia, Chuer Pan, Chuyuan Fu, Coline Devin, Danfei Xu, Daniel Morton, Danny Driess, Daphne Chen, Deepak Pathak, Dhruv Shah, Dieter Büchler, Dinesh Jayaraman, Dmitry Kalashnikov, Dorsa Sadigh, Edward Johns, Ethan Foster, Fangchen Liu, Federico Ceola, Fei Xia, Feiyu Zhao, Felipe Vieira Frujeri, Freek Stulp, Gaoyue Zhou, Gaurav S. Sukhatme, Gautam Salhotra, Ge Yan, Gilbert Feng, Giulio Schiavi, Glen Berseth, Gregory Kahn, Guangwen Yang, Guanzhi Wang, Hao Su, Hao-Shu Fang, Haochen Shi, Henghui Bao, Heni Ben Amor, Henrik I Christensen, Hiroki Furuta, Homanga Bharadhwaj, Homer Walke, Hongjie Fang, Huy Ha, Igor Mordatch, Ilija Radosavovic, Isabel Leal, Jacky Liang, Jad Abou-Chakra, Jaehyung Kim, Jaimyn Drake, Jan Peters, Jan Schneider, Jasmine Hsu, Jay Vakil, Jeannette Bohg, Jeffrey Bingham, Jeffrey Wu, Jensen Gao, Jiaheng Hu, Jiajun Wu, Jialin Wu, Jiankai Sun, Jianlan Luo, Jiayuan Gu, Jie Tan, Jihoon Oh, Jimmy Wu, Jingpei Lu, Jingyun Yang, Jitendra Malik, João Silvério, Joey Hejna, Jonathan Booher, Jonathan Tompson, Jonathan Yang, Jordi Salvador, Joseph J. Lim, Junhyek Han, Kaiyuan Wang, Kanishka Rao, Karl Pertsch, Karol Hausman, Keegan Go, Keerthana Gopalakrishnan, Ken Goldberg, Kendra Byrne, Kenneth Oslund, Kento Kawaharazuka, Kevin Black, Kevin Lin, Kevin Zhang, Kiana Ehsani, Kiran Lekkala, Kirsty Ellis, Krishan Rana, Krishnan Srinivasan, Kuan Fang, Kunal Pratap Singh, Kuo-Hao Zeng, Kyle Hatch, Kyle Hsu, Laurent Itti, Lawrence Yunliang Chen, Lerrel Pinto, Li Fei-Fei, Liam Tan, Linxi "Jim" Fan, Lionel Ott, Lisa Lee, Luca Weihs, Magnum Chen, Marion Lepert, Marius Memmel, Masayoshi Tomizuka, Masha Itkina, Mateo Guaman Castro, Max Spero, Maximilian Du, Michael Ahn, Michael C. Yip, Mingtong Zhang, Mingyu Ding, Minho Heo, Mohan Kumar Srirama, Mohit Sharma, Moo Jin Kim, Muhammad Zubair Irshad, Naoaki Kanazawa, Nicklas Hansen, Nicolas Heess, Nikhil J Joshi, Niko Suenderhauf, Ning Liu, Norman Di Palo, Nur Muhammad Mahi Shafiullah, Oier Mees, Oliver Kroemer, Osbert Bastani, Pannag R Sanketi, Patrick "Tree" Miller, Patrick Yin, Paul Wohlhart, Peng Xu, Peter David Fagan, Peter Mitrano, Pierre Sermanet, Pieter Abbeel, Priya Sundaresan, Qiuyu Chen, Quan Vuong, Rafael Rafailov, Ran Tian, Ria Doshi, Roberto Martín-Martín, Rohan Baijal, Rosario Scalise, Rose Hendrix, Roy Lin, Runjia Qian, Ruohan Zhang, Russell Mendonca, Rutav Shah, Ryan Hoque, Ryan Julian, Samuel Bustamante, Sean Kirmani, Sergey Levine, Shan Lin, Sherry Moore, Shikhar Bahl, Shivin Dass, Shubham Sonawani, Shubham Tulsiani, Shuran Song, Sichun Xu, Siddhant Haldar, Siddharth Karamcheti, Simeon Adebola, Simon Guist, Soroush Nasiriany, Stefan Schaal, Stefan Welker, Stephen Tian, Subramanian Ramamoorthy, Sudeep Dasari, Suneel Belkhale, Sungjae Park, Suraj Nair, Suvir Mirchandani, Takayuki Osa, Tanmay Gupta, Tatsuya Harada, Tatsuya Matsushima, Ted Xiao, Thomas Kollar, Tianhe Yu, Tianli Ding, Todor Davchev, Tony Z. Zhao, Travis Armstrong, Trevor Darrell, Trinity Chung, Vidhi Jain, Vikash Kumar, Vincent Vanhoucke, Vitor Guizilini, Wei Zhan, Wenxuan Zhou, Wolfram Burgard, Xi Chen, Xiangyu Chen, Xiaolong Wang, Xinghao Zhu, Xinyang Geng, Xiyuan Liu, Xu Liangwei, Xuanlin Li, Yansong Pang, Yao Lu, Yecheng Jason Ma, Yejin Kim, Yevgen Chebotar, Yifan Zhou, Yifeng Zhu, Yilin Wu, Ying Xu, Yixuan Wang, Yonatan Bisk, Yongqiang Dou, Yoonyoung Cho, Youngwoon Lee, Yuchen Cui, Yue Cao, Yueh-Hua Wu, Yujin Tang, Yuke Zhu, Yunchu Zhang, Yunfan Jiang, Yunshuang Li, Yunzhu Li, Yusuke Iwasawa, Yutaka Matsuo, Zehan Ma, Zhuo Xu, Zichen Jeff Cui, Zichen Zhang, Zipeng Fu, Zipeng Lin
Authors on Pith: no claims yet

Pith reviewed 2026-05-11 17:18 UTC · model grok-4.3

classification 💻 cs.RO
keywords robotic manipulation · multi-robot learning · policy transfer · generalist policies · embodiment · datasets
0 comments

The pith

A single high-capacity model trained on data from 22 robots improves task performance on each individual platform through positive transfer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper assembles and standardizes a large dataset of robotic manipulation skills collected from 22 different robots by multiple institutions. It trains one high-capacity model on the full combined set and demonstrates that this model performs better on each robot's tasks than models trained on that robot's data alone. The result suggests that experience gathered on one robot platform can be shared to strengthen learning on others. A sympathetic reader would care because most robotic learning still builds a separate model for every robot and task, which limits scale and efficiency.

Core claim

We assemble a dataset from 22 different robots demonstrating 527 skills. A high-capacity model trained on this data exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms.

What carries the argument

The high-capacity model trained on the standardized multi-robot dataset: the evaluations show that cross-platform data produces measurable gains on each robot's tasks.

If this is right

  • Robots achieve higher success rates on tasks by drawing on experience collected elsewhere without new data collection on the target platform.
  • A single model can be adapted to new robots, tasks, and environments more efficiently than training from scratch for each case.
  • Robotic learning can shift away from training isolated models for every application toward shared generalist policies.
  • The standardized dataset format enables further experiments on cross-robot generalization in manipulation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the positive transfer effect grows with additional robots and tasks, future datasets could be pooled at even larger scale to compound the gains.
  • Different research groups could contribute data in the same format and immediately benefit from improved performance on their own hardware.
  • The approach raises the question of how far the transfer extends when new robot morphologies or entirely unseen tasks are introduced.

Load-bearing premise

The chosen standardization and particular mix of data from the 22 robots produce net positive transfer rather than interference that would reduce performance.
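
Stated as a small formula (notation ours, not the paper's): let S_r(π) be the success rate of policy π on robot r's evaluation tasks. The premise says the transfer gain is positive on average across the 22 platforms, while interference would show up as negative gains on some of them.

```latex
% Notation introduced for illustration; the paper reports success rates directly.
\Delta_r = S_r(\pi_{\mathrm{pooled}}) - S_r(\pi_{\mathrm{native}}),
\qquad \frac{1}{22} \sum_{r=1}^{22} \Delta_r > 0.
```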

What would settle it

A direct comparison in which the model trained on the combined dataset performs no better than, or worse than, separate models trained only on each robot's own data for the same tasks.
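
A sketch of how that comparison could be scored, with hypothetical robot names and success rates standing in for the paper's evaluation data:

```python
# Minimal sketch of the settling experiment: compare a pooled-data policy
# against per-robot baselines on the same tasks. All numbers are placeholders.
from statistics import mean

# success_rate[policy][robot] -> fraction of successful evaluation episodes
success_rate = {
    "pooled": {"franka": 0.62, "widowx": 0.55, "kuka": 0.48},
    "native": {"franka": 0.51, "widowx": 0.49, "kuka": 0.50},
}

gains = {
    robot: success_rate["pooled"][robot] - success_rate["native"][robot]
    for robot in success_rate["pooled"]
}

print(gains)                                  # per-robot transfer gains
print("mean gain:", round(mean(gains.values()), 3))
# The claim would be settled against the paper if this were <= 0.
print("positive transfer:", mean(gains.values()) > 0)
```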

read the original abstract

Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website https://robotics-transformer-x.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper assembles a large-scale, standardized dataset of robotic manipulation tasks collected from 22 robots across 21 institutions, covering 527 skills and 160,266 tasks. It introduces RT-X, a high-capacity transformer-based policy trained on the combined data, and reports experiments claiming that this model exhibits positive cross-embodiment transfer, improving task performance on multiple robots by leveraging experience from other platforms.

Significance. If the positive-transfer claim is substantiated with proper controls, the work would be significant for robotics by providing open, standardized datasets and models that facilitate research on generalist X-robot policies, analogous to foundation models in other domains. The collaborative data release itself is a substantial community resource.

major comments (1)
  1. [§5 (Experiments)] The reported comparisons between RT-X (trained on the full 160k+ task multi-robot dataset) and per-robot baselines (trained only on native data subsets) do not control for total training data volume. Without an additional baseline that matches the data volume seen by RT-X (e.g., via subsampling the combined dataset to equal the per-robot volume or training on equivalent-scale single-robot data), performance gains cannot be unambiguously attributed to cross-embodiment transfer rather than simple scaling effects. This directly undermines the central claim that RT-X improves capabilities 'by leveraging experience from other platforms.'
minor comments (2)
  1. [Abstract] '160266 tasks' should be written with a comma as '160,266 tasks' for readability.
  2. [§3 (Dataset)] The standardization procedure for heterogeneous robot data (e.g., action spaces, observation formats) is described at a high level; a more detailed table or pseudocode would help readers reproduce the exact preprocessing pipeline.
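
For concreteness, the kind of pseudocode the second comment asks for might look like the sketch below. Field names, the image size, and the 7-DoF delta-pose action convention are assumptions for illustration, not details taken from the paper:

```python
# Hedged sketch of per-dataset standardization into a shared episode schema.
import numpy as np

IMAGE_SIZE = (256, 256)   # assumed common camera resolution
ACTION_DIM = 7            # assumed: xyz delta, rpy delta, gripper

def resize(image, size):
    # Dependency-free nearest-neighbor resize, enough for a sketch.
    h, w = image.shape[:2]
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return image[rows][:, cols]

def standardize_step(raw_step, action_adapter):
    """Map one robot-specific timestep into the shared schema."""
    obs = raw_step["observation"]
    return {
        "observation": {
            "image": resize(obs["image"], IMAGE_SIZE),
            "natural_language_instruction": raw_step.get("instruction", ""),
        },
        # Each source dataset supplies an adapter from its native controls
        # (joint velocities, absolute poses, ...) to the shared action space.
        "action": action_adapter(raw_step["action"]),
    }

def pad_to_common_action(native_action):
    # Example adapter: clip/pad an end-effector delta into the assumed
    # 7-dimensional convention.
    a = np.zeros(ACTION_DIM, dtype=np.float32)
    native = np.asarray(native_action, dtype=np.float32)[:ACTION_DIM]
    a[: len(native)] = native
    return a
```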

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point by point below.

read point-by-point responses
  1. Referee: [§5 (Experiments)] The reported comparisons between RT-X (trained on the full 160k+ task multi-robot dataset) and per-robot baselines (trained only on native data subsets) do not control for total training data volume. Without an additional baseline that matches the data volume seen by RT-X (e.g., via subsampling the combined dataset to equal the per-robot volume or training on equivalent-scale single-robot data), performance gains cannot be unambiguously attributed to cross-embodiment transfer rather than simple scaling effects. This directly undermines the central claim that RT-X improves capabilities 'by leveraging experience from other platforms.'

    Authors: We agree that an explicit control for total training data volume would strengthen the attribution of gains specifically to cross-embodiment transfer. The current per-robot baselines use only the native data available for each robot, while RT-X is trained on the full aggregated set; this is the standard comparison for demonstrating the value of multi-robot data. To isolate the effect of embodiment diversity from scaling, we will add in the revised Section 5 a new baseline that subsamples the combined multi-robot dataset to match the data volume of the largest single-robot subset and retrains a model under identical conditions. This addition will clarify whether the observed improvements exceed what would be expected from data volume alone. revision: yes
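
A minimal sketch of the proposed control, with the episode store and helper names hypothetical rather than drawn from the paper:

```python
# Hedged sketch: a data-volume-matched baseline for the transfer claim.
import random

def volume_matched_mixture(episodes_by_robot, seed=0):
    """Subsample the combined multi-robot pool so its total size matches
    the largest single-robot subset, preserving the embodiment mix."""
    rng = random.Random(seed)
    budget = max(len(eps) for eps in episodes_by_robot.values())
    pool = [ep for eps in episodes_by_robot.values() for ep in eps]
    rng.shuffle(pool)
    return pool[:budget]

# Training on this mixture and on the largest single-robot subset alone,
# at equal volume and under identical conditions, separates embodiment
# diversity from raw data scale: any remaining gain cannot be explained
# by simply having seen more data.
```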

Circularity Check

0 steps flagged

Empirical dataset curation and model training with no derivation chain

full rationale

The paper assembles a standardized multi-robot dataset (22 platforms, 527 skills) and trains RT-X, reporting positive transfer via experimental comparisons. No mathematical derivations, predictions, or uniqueness theorems are claimed; results rest on direct training and evaluation. Self-citations (if any) are not load-bearing for the central empirical claim. The skeptic concern about data-volume confounding is a valid experimental-design issue but does not constitute circularity under the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical scaling paper; no explicit free parameters, axioms, or invented entities are introduced beyond standard practices in large-scale machine learning.

pith-pipeline@v0.9.0 · 6817 in / 1040 out tokens · 31474 ms · 2026-05-11T17:18:32.713601+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 55 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Aligning Flow Map Policies with Optimal Q-Guidance

    cs.LG 2026-05 unverdicted novelty 7.0

    Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.

  2. Dynamic Execution Commitment of Vision-Language-Action Models

    cs.CV 2026-05 unverdicted novelty 7.0

    A3 determines the execution horizon in VLA models as the longest prefix of actions that passes consensus-based verification and sequential consistency checks.

  3. SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation

    cs.RO 2026-05 unverdicted novelty 7.0

    SABER provides 44.8K multi-representation action samples from unscripted retail environments that raise a VLA model's mean success rate on ten manipulation tasks from 13.4% to 29.3%.

  4. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  5. ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.

  6. Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion

    cs.RO 2026-05 unverdicted novelty 7.0

    Action Agent pairs LLM-driven video generation with a flow-constrained diffusion transformer to produce velocity commands, raising video success to 86% and delivering 64.7% real-world navigation on a Unitree G1 humanoid.

  7. OmniRobotHome: A Multi-Camera Platform for Real-Time Multiadic Human-Robot Interaction

    cs.RO 2026-04 unverdicted novelty 7.0

    A 48-camera residential platform delivers real-time occlusion-robust 3D perception and coordinated actuation for multi-human multi-robot interaction in a shared home workspace.

  8. Atomic-Probe Governance for Skill Updates in Compositional Robot Policies

    cs.RO 2026-04 unverdicted novelty 7.0

    A cross-version swap protocol reveals dominant skills that swing composition success by up to 50 percentage points, and an atomic probe with selective revalidation governs updates at lower cost than always re-testing ...

  9. π₀.₇: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

  10. Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study

    cs.RO 2026-04 conditional novelty 7.0

    A governed capability evolution framework with interface, policy, behavioral, and recovery checks reduces unsafe activations to zero in embodied agent upgrades while preserving task success rates.

  11. 3D-VLA: A 3D Vision-Language-Action Generative World Model

    cs.CV 2024-03 unverdicted novelty 7.0

    3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.

  12. Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.

  13. Reinforcing VLAs in Task-Agnostic World Models

    cs.AI 2026-05 unverdicted novelty 6.0

    RAW-Dream lets VLAs learn new tasks in zero-shot imagination by using a world model pre-trained only on task-free behaviors and an unmodified VLM to supply rewards, with dual-noise verification to limit hallucinations.

  14. HumanNet: Scaling Human-centric Video Learning to One Million Hours

    cs.CV 2026-05 unverdicted novelty 6.0

    HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.

  15. Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation

    cs.RO 2026-05 unverdicted novelty 6.0

    VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.

  16. ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.

  17. An Efficient Metric for Data Quality Measurement in Imitation Learning

    cs.RO 2026-05 unverdicted novelty 6.0

    Power spectral density of trajectories ranks demonstration quality for imitation learning, enabling rollout-free curation that improves fine-tuned policy success.

  18. Lucid-XR: An Extended-Reality Data Engine for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    Lucid-XR uses XR-headset physics simulation and physics-guided video generation to create synthetic data that trains robot policies transferring zero-shot to unseen real-world manipulation tasks.

  19. PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.

  20. Atomic-Probe Governance for Skill Updates in Compositional Robot Policies

    cs.RO 2026-04 unverdicted novelty 6.0

    Empirical study on robosuite tasks reveals a dominant-skill effect in compositions and shows that an atomic probe approximates full revalidation for skill updates at much lower cost.

  21. M²-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills

    cs.RO 2026-04 unverdicted novelty 6.0

    M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.

  22. QDTraj: Exploration of Diverse Trajectory Primitives for Articulated Objects Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    QDTraj uses Quality-Diversity algorithms with sparse rewards to produce at least five times more diverse high-performing trajectories for articulated object manipulation than compared methods, validated across 30 obje...

  23. XRZero-G0: Pushing the Frontier of Dexterous Robotic Manipulation with Interfaces, Quality and Ratios

    cs.RO 2026-04 unverdicted novelty 6.0

    XRZero-G0 enables 2000-hour robot-free datasets that, when mixed 10:1 with real-robot data, match full real-robot performance at 1/20th the cost and support zero-shot transfer.

  24. EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems

    cs.RO 2026-04 unverdicted novelty 6.0

    EmbodiedGovBench is a new benchmark framework that measures embodied agent systems on seven governance dimensions including policy adherence, recovery success, and upgrade safety.

  25. WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations

    cs.RO 2026-04 unverdicted novelty 6.0

    WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match tele...

  26. Zero-shot World Models Are Developmentally Efficient Learners

    cs.AI 2026-04 unverdicted novelty 6.0

    A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.

  27. SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds

    cs.RO 2026-04 unverdicted novelty 6.0

    SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformabl...

  28. Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study

    cs.RO 2026-04 unverdicted novelty 6.0

    A governed capability evolution framework for embodied agents uses four compatibility checks and a staged pipeline to achieve zero unsafe activations during upgrades while retaining comparable task success rates.

  29. OpenRC: An Open-Source Robotic Colonoscopy Framework for Multimodal Data Acquisition and Autonomy Research

    cs.RO 2026-04 unverdicted novelty 6.0

    OpenRC is an open-source robotic colonoscopy platform with hardware retrofit and a multimodal dataset of nearly 1,900 episodes for autonomy and VLA research.

  30. mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs

    cs.RO 2025-12 unverdicted novelty 6.0

    mimic-video combines internet video pretraining with a flow-matching decoder to achieve state-of-the-art robotic manipulation performance with 10x better sample efficiency than vision-language-action models.

  31. InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

    cs.RO 2025-10 unverdicted novelty 6.0

    InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.

  32. Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

    cs.RO 2025-04 unverdicted novelty 6.0

    Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...

  33. Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    cs.RO 2025-02 accept novelty 6.0

    OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.

  34. DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

    cs.RO 2025-02 unverdicted novelty 6.0

    DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.

  35. FAST: Efficient Action Tokenization for Vision-Language-Action Models

    cs.RO 2025-01 unverdicted novelty 6.0

    FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diff...

  36. Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    cs.CV 2024-12 unverdicted novelty 6.0

    Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.

  37. CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    cs.RO 2024-11 unverdicted novelty 6.0

    CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...

  38. π₀: A Vision-Language-Action Flow Model for General Robot Control

    cs.LG 2024-10 unverdicted novelty 6.0

    π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.

  39. GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    cs.RO 2024-10 unverdicted novelty 6.0

    GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.

  40. OpenVLA: An Open-Source Vision-Language-Action Model

    cs.RO 2024-06 unverdicted novelty 6.0

    OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.

  41. RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

    cs.RO 2024-06 unverdicted novelty 6.0

    RoboCasa supplies a large-scale kitchen simulator, generative assets, 100 tasks, and automated data pipelines that produce a clear scaling trend in imitation learning for generalist robots.

  42. Octo: An Open-Source Generalist Robot Policy

    cs.RO 2024-05 unverdicted novelty 6.0

    Octo is an open-source transformer-based generalist robot policy pretrained on 800k trajectories that serves as an effective initialization for finetuning across diverse robotic platforms.

  43. Evaluating Real-World Robot Manipulation Policies in Simulation

    cs.RO 2024-05 conditional novelty 6.0

    SIMPLER simulated environments yield policy performance that correlates strongly with real-world robot manipulation results and captures similar sensitivity to distribution shifts.

  44. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    cs.RO 2024-03 accept novelty 6.0

    DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.

  45. DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    cs.CV 2024-02 unverdicted novelty 6.0

    DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...

  46. Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation

    cs.RO 2024-01 conditional novelty 6.0

    A low-cost whole-body teleoperation system enables effective imitation learning for complex bimanual mobile manipulation by co-training on mobile and static demonstration datasets.

  47. ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    ProcVLM learns procedure-grounded dense progress rewards for robotic manipulation via a reasoning-before-estimation VLM trained on a 60M-frame synthesized corpus from 30 embodied datasets.

  48. MiniVLA-Nav v1: A Multi-Scene Simulation Dataset for Language-Conditioned Robot Navigation

    cs.RO 2026-05 unverdicted novelty 5.0

    MiniVLA-Nav v1 provides 1,174 episodes of language-instructed robot navigation in photorealistic simulations with RGB, depth, segmentation, and expert action data.

  49. VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 5.0

    VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.

  50. ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning

    cs.RO 2026-04 unverdicted novelty 5.0

    ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.

  51. Jump-Start Reinforcement Learning with Vision-Language-Action Regularization

    cs.LG 2026-04 unverdicted novelty 5.0

    VLAJS augments PPO with sparse annealed VLA guidance through directional regularization to cut required interactions by over 50% on manipulation tasks and enable zero-shot sim-to-real transfer.

  52. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  53. A Co-Evolutionary Theory of Human-AI Coexistence: Mutualism, Governance, and Dynamics in Complex Societies

    cs.CY 2026-04 unverdicted novelty 4.0

    Human-AI coexistence is best modeled as conditional mutualism under governance, formalized as a multiplex dynamical system whose simulations show stable high-coexistence equilibria only under balanced institutional oversight.

  54. JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy

    cs.RO 2026-04 unverdicted novelty 4.0

    JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.

  55. Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines

    cs.RO 2026-04 unverdicted novelty 3.0

    A survey of VLA robotics research identifies data infrastructure as the primary bottleneck and distills four open challenges in representation alignment, multimodal supervision, reasoning assessment, and scalable data...

Reference graph

Works this paper leans on

135 extracted references · 135 canonical work pages · cited by 52 Pith papers · 5 internal anchors
