Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Pith reviewed 2026-05-11 17:18 UTC · model grok-4.3
The pith
A single high-capacity model trained on data from 22 robots improves task performance on multiple individual platforms through positive transfer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We assemble a dataset from 22 different robots demonstrating 527 skills. A high-capacity model trained on this data exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms.
What carries the argument
The high-capacity model trained on the standardized multi-robot dataset. It carries the argument by showing that cross-platform data produces measurable gains on the tasks of multiple robots.
If this is right
- Robots achieve higher success rates on tasks by drawing on experience collected elsewhere without new data collection on the target platform.
- A single model can be adapted to new robots, tasks, and environments more efficiently than training from scratch for each case.
- Robotic learning can shift away from training isolated models for every application toward shared generalist policies.
- The standardized dataset format enables further experiments on cross-robot generalization in manipulation.
Where Pith is reading between the lines
- If the positive transfer effect grows with additional robots and tasks, future datasets could be pooled at even larger scale to compound the gains.
- Different research groups could contribute data in the same format and immediately benefit from improved performance on their own hardware.
- The approach raises the question of how far the transfer extends when new robot morphologies or entirely unseen tasks are introduced.
Load-bearing premise
The chosen standardization and particular mix of data from the 22 robots produce net positive transfer rather than interference that would reduce performance.
What would settle it
A direct comparison in which the model trained on the combined dataset performs no better than, or worse than, separate models trained only on each robot's own data for the same tasks.
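One way to read such a comparison, as a minimal sketch: count successes over evaluation trials for the jointly trained model and for the robot-specific baseline, then inspect the difference in success rate together with a rough interval. The names `transfer_delta` and `verdict`, the choice of a normal-approximation (Wald) interval, and the trial counts in the usage note are illustrative assumptions, not methods or figures from the paper.

```python
import math

def transfer_delta(joint_succ, joint_trials, solo_succ, solo_trials, z=1.96):
    """Success-rate difference (joint minus solo) with a Wald interval.

    Hypothetical helper: a textbook two-proportion interval, not a
    procedure taken from the paper.
    """
    p_joint = joint_succ / joint_trials
    p_solo = solo_succ / solo_trials
    delta = p_joint - p_solo
    # Standard error of the difference of two independent proportions.
    se = math.sqrt(p_joint * (1 - p_joint) / joint_trials
                   + p_solo * (1 - p_solo) / solo_trials)
    return delta, (delta - z * se, delta + z * se)

def verdict(interval):
    """Label the outcome from the interval on the difference."""
    low, high = interval
    if low > 0:
        return "positive transfer"
    if high < 0:
        return "interference"
    return "inconclusive"
```

With placeholder counts, `transfer_delta(30, 50, 20, 50)` gives a difference of 0.2 whose interval stays above zero, while equal counts on both sides give an interval that straddles zero. The second case is the outcome that would count against the claim.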
Original abstract
Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website https://robotics-transformer-x.github.io.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper assembles a large-scale, standardized dataset of robotic manipulation tasks collected from 22 robots across 21 institutions, covering 527 skills and 160,266 tasks. It introduces RT-X, a high-capacity transformer-based policy trained on the combined data, and reports experiments claiming that this model exhibits positive cross-embodiment transfer, improving task performance on multiple robots by leveraging experience from other platforms.
Significance. If the positive-transfer claim is substantiated with proper controls, the work would be significant for robotics by providing open, standardized datasets and models that facilitate research on generalist X-robot policies, analogous to foundation models in other domains. The collaborative data release itself is a substantial community resource.
major comments (1)
- [§5 (Experiments)] The reported comparisons between RT-X (trained on the full 160k+ task multi-robot dataset) and per-robot baselines (trained only on native data subsets) do not control for total training data volume. Without an additional baseline that matches the data volume seen by RT-X (e.g., via subsampling the combined dataset to equal the per-robot volume or training on equivalent-scale single-robot data), performance gains cannot be unambiguously attributed to cross-embodiment transfer rather than simple scaling effects. This directly undermines the central claim that RT-X improves capabilities 'by leveraging experience from other platforms.'
minor comments (2)
- [Abstract] '160266 tasks' should be written with a comma as '160,266 tasks' for readability.
- [§3 (Dataset)] The standardization procedure for heterogeneous robot data (e.g., action spaces, observation formats) is described at a high level; a more detailed table or pseudocode would help readers reproduce the exact preprocessing pipeline.
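To make the request in the last comment concrete, here is a hedged sketch of what a standardization step for heterogeneous action spaces might look like. Everything in it is an assumption made for illustration: the `StandardStep` record, the dictionary keys, and the per-robot min-max scaling to [-1, 1] are not taken from the paper, whose actual preprocessing is exactly what the referee asks to see documented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StandardStep:
    """One timestep in a hypothetical shared format (not the paper's schema)."""
    image: object       # camera observation, passed through unchanged
    instruction: str    # natural-language task string
    action: tuple       # every dimension scaled to [-1, 1]

def normalize_action(raw, low, high):
    """Min-max scale each action dimension to [-1, 1] using per-robot bounds."""
    if not (len(raw) == len(low) == len(high)):
        raise ValueError("action and bounds must have the same length")
    scaled = []
    for value, lo, hi in zip(raw, low, high):
        if hi <= lo:
            raise ValueError("each upper bound must exceed its lower bound")
        unit = 2.0 * (value - lo) / (hi - lo) - 1.0
        scaled.append(max(-1.0, min(1.0, unit)))  # clip out-of-range readings
    return tuple(scaled)

def to_standard(raw_step, low, high):
    """Map one robot-specific step (a dict with assumed keys) to StandardStep."""
    return StandardStep(
        image=raw_step["image"],
        instruction=raw_step["instruction"].strip().lower(),
        action=normalize_action(raw_step["action"], low, high),
    )
```

A table listing the bounds and key mappings per robot, in the spirit of the `low` and `high` arguments above, is the kind of detail that would let readers reproduce the pipeline.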
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment point by point below.
Point-by-point responses
-
Referee: [§5 (Experiments)] The reported comparisons between RT-X (trained on the full 160k+ task multi-robot dataset) and per-robot baselines (trained only on native data subsets) do not control for total training data volume. Without an additional baseline that matches the data volume seen by RT-X (e.g., via subsampling the combined dataset to equal the per-robot volume or training on equivalent-scale single-robot data), performance gains cannot be unambiguously attributed to cross-embodiment transfer rather than simple scaling effects. This directly undermines the central claim that RT-X improves capabilities 'by leveraging experience from other platforms.'
Authors: We agree that an explicit control for total training data volume would strengthen the attribution of gains specifically to cross-embodiment transfer. The current per-robot baselines use only the native data available for each robot, while RT-X is trained on the full aggregated set; this is the standard comparison for demonstrating the value of multi-robot data. To isolate the effect of embodiment diversity from scaling, we will add in the revised Section 5 a new baseline that subsamples the combined multi-robot dataset to match the data volume of the largest single-robot subset and retrains a model under identical conditions. This addition will clarify whether the observed improvements exceed what would be expected from data volume alone.
Revision: yes
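The volume-matched baseline described in the response could be built along the following lines. This is a sketch under stated assumptions: episodes are represented as `(robot_name, episode)` pairs, the subsample is drawn uniformly without replacement, and the function name `volume_matched_subsample` is hypothetical. The rebuttal does not say how the subsample would actually be drawn.

```python
import random
from collections import Counter

def volume_matched_subsample(episodes, seed=0):
    """Draw a subset of the pooled data sized like the largest single-robot subset.

    episodes: list of (robot_name, episode) pairs (an assumed representation).
    Uniform sampling without replacement is an assumption; it keeps the
    robot mix roughly proportional to the pooled dataset.
    """
    if not episodes:
        raise ValueError("episodes must not be empty")
    per_robot = Counter(robot for robot, _ in episodes)
    budget = max(per_robot.values())   # size of the largest single-robot subset
    rng = random.Random(seed)          # fixed seed so the baseline is repeatable
    return rng.sample(episodes, budget)
```

A stratified draw with equal episodes per robot would be an equally defensible design. The two choices test slightly different things, since uniform sampling preserves the original imbalance between robots while stratified sampling removes it.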
Circularity Check
Empirical dataset curation and model training with no derivation chain
full rationale
The paper assembles a standardized multi-robot dataset (22 platforms, 527 skills) and trains RT-X, reporting positive transfer via experimental comparisons. No mathematical derivations, predictions, or uniqueness theorems are claimed; results rest on direct training and evaluation. Self-citations (if any) are not load-bearing for the central empirical claim. The skeptic's concern about data-volume confounding is a valid experimental-design issue but does not constitute circularity under the defined patterns.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
We assemble a dataset from 22 different robots... RT-X exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 54 Pith papers
-
Aligning Flow Map Policies with Optimal Q-Guidance
Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.
-
Dynamic Execution Commitment of Vision-Language-Action Models
A3 determines the execution horizon in VLA models as the longest prefix of actions that passes consensus-based verification and sequential consistency checks.
-
SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation
SABER provides 44.8K multi-representation action samples from unscripted retail environments that raise a VLA model's mean success rate on ten manipulation tasks from 13.4% to 29.3%.
-
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
-
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.
-
Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion
Action Agent pairs LLM-driven video generation with a flow-constrained diffusion transformer to produce velocity commands, raising video success to 86% and delivering 64.7% real-world navigation on a Unitree G1 humanoid.
-
OmniRobotHome: A Multi-Camera Platform for Real-Time Multiadic Human-Robot Interaction
A 48-camera residential platform delivers real-time occlusion-robust 3D perception and coordinated actuation for multi-human multi-robot interaction in a shared home workspace.
-
Atomic-Probe Governance for Skill Updates in Compositional Robot Policies
A cross-version swap protocol reveals dominant skills that swing composition success by up to 50 percentage points, and an atomic probe with selective revalidation governs updates at lower cost than always re-testing ...
-
${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
-
Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study
A governed capability evolution framework with interface, policy, behavioral, and recovery checks reduces unsafe activations to zero in embodied agent upgrades while preserving task success rates.
-
3D-VLA: A 3D Vision-Language-Action Generative World Model
3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.
-
Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models
VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.
-
Reinforcing VLAs in Task-Agnostic World Models
RAW-Dream lets VLAs learn new tasks in zero-shot imagination by using a world model pre-trained only on task-free behaviors and an unmodified VLM to supply rewards, with dual-noise verification to limit hallucinations.
-
HumanNet: Scaling Human-centric Video Learning to One Million Hours
HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.
-
Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation
VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.
-
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.
-
An Efficient Metric for Data Quality Measurement in Imitation Learning
Power spectral density of trajectories ranks demonstration quality for imitation learning, enabling rollout-free curation that improves fine-tuned policy success.
-
Lucid-XR: An Extended-Reality Data Engine for Robotic Manipulation
Lucid-XR uses XR-headset physics simulation and physics-guided video generation to create synthetic data that trains robot policies transferring zero-shot to unseen real-world manipulation tasks.
-
PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations
PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
-
Atomic-Probe Governance for Skill Updates in Compositional Robot Policies
Empirical study on robosuite tasks reveals a dominant-skill effect in compositions and shows that an atomic probe approximates full revalidation for skill updates at much lower cost.
-
$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills
M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.
-
QDTraj: Exploration of Diverse Trajectory Primitives for Articulated Objects Robotic Manipulation
QDTraj uses Quality-Diversity algorithms with sparse rewards to produce at least five times more diverse high-performing trajectories for articulated object manipulation than compared methods, validated across 30 obje...
-
XRZero-G0: Pushing the Frontier of Dexterous Robotic Manipulation with Interfaces, Quality and Ratios
XRZero-G0 enables 2000-hour robot-free datasets that, when mixed 10:1 with real-robot data, match full real-robot performance at 1/20th the cost and support zero-shot transfer.
-
EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems
EmbodiedGovBench is a new benchmark framework that measures embodied agent systems on seven governance dimensions including policy adherence, recovery success, and upgrade safety.
-
WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations
WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match tele...
-
Zero-shot World Models Are Developmentally Efficient Learners
A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.
-
SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds
SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformabl...
-
Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study
A governed capability evolution framework for embodied agents uses four compatibility checks and a staged pipeline to achieve zero unsafe activations during upgrades while retaining comparable task success rates.
-
OpenRC: An Open-Source Robotic Colonoscopy Framework for Multimodal Data Acquisition and Autonomy Research
OpenRC is an open-source robotic colonoscopy platform with hardware retrofit and a multimodal dataset of nearly 1,900 episodes for autonomy and VLA research.
-
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
-
Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...
-
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.
-
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.
-
FAST: Efficient Action Tokenization for Vision-Language-Action Models
FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diff...
-
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
-
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...
-
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.
-
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.
-
OpenVLA: An Open-Source Vision-Language-Action Model
OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.
-
RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots
RoboCasa supplies a large-scale kitchen simulator, generative assets, 100 tasks, and automated data pipelines that produce a clear scaling trend in imitation learning for generalist robots.
-
Octo: An Open-Source Generalist Robot Policy
Octo is an open-source transformer-based generalist robot policy pretrained on 800k trajectories that serves as an effective initialization for finetuning across diverse robotic platforms.
-
Evaluating Real-World Robot Manipulation Policies in Simulation
SIMPLER simulated environments yield policy performance that correlates strongly with real-world robot manipulation results and captures similar sensitivity to distribution shifts.
-
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.
-
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...
-
Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation
A low-cost whole-body teleoperation system enables effective imitation learning for complex bimanual mobile manipulation by co-training on mobile and static demonstration datasets.
-
ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation
ProcVLM learns procedure-grounded dense progress rewards for robotic manipulation via a reasoning-before-estimation VLM trained on a 60M-frame synthesized corpus from 30 embodied datasets.
-
MiniVLA-Nav v1: A Multi-Scene Simulation Dataset for Language-Conditioned Robot Navigation
MiniVLA-Nav v1 provides 1,174 episodes of language-instructed robot navigation in photorealistic simulations with RGB, depth, segmentation, and expert action data.
-
VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.
-
ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning
ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.
-
Jump-Start Reinforcement Learning with Vision-Language-Action Regularization
VLAJS augments PPO with sparse annealed VLA guidance through directional regularization to cut required interactions by over 50% on manipulation tasks and enable zero-shot sim-to-real transfer.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
-
A Co-Evolutionary Theory of Human-AI Coexistence: Mutualism, Governance, and Dynamics in Complex Societies
Human-AI coexistence is best modeled as conditional mutualism under governance, formalized as a multiplex dynamical system whose simulations show stable high-coexistence equilibria only under balanced institutional oversight.
-
JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy
JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.
-
Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines
A survey of VLA robotics research identifies data infrastructure as the primary bottleneck and distills four open challenges in representation alignment, multimodal supervision, reasoning assessment, and scalable data...
Reference graph
Works this paper leans on
-
[1]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al. , “Learning transferable visual models from natural language supervision,” in International conference on machine learning . PMLR, 2021, pp. 8748–8763
work page 2021
- [2]
-
[3]
R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen et al. , “PaLM 2 technical report,” arXiv preprint arXiv:2305.10403, 2023
work page internal anchor Pith review arXiv 2023
-
[4]
Google landmarks dataset v2 - a large-scale benchmark for instance-level recognition and retrieval,
T. Weyand, A. Araujo, B. Cao, and J. Sim, “Google landmarks dataset v2 - a large-scale benchmark for instance-level recognition and retrieval,” in Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , June 2020
work page 2020
-
[5]
Tencent ML-images: A large-scale multi-label image database for visual representation learning,
B. Wu, W. Chen, Y . Fan, Y . Zhang, J. Hou, J. Liu, and T. Zhang, “Tencent ML-images: A large-scale multi-label image database for visual representation learning,” IEEE Access, vol. 7, 2019
work page 2019
-
[6]
DBpedia - a large-scale, multilingual knowledge base extracted from wikipedia
J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer, “DBpedia - a large-scale, multilingual knowledge base extracted from wikipedia.” Semantic Web , vol. 6, no. 2, pp. 167–195, 2015. [Online]. Available: http://dblp.uni-trier.de/db/journals/semweb/ semweb6.html#LehmannIJJKMHMK15
work page 2015
-
[7]
Web data commons- extracting structured data from two large web cor- pora
H. M ¨uhleisen and C. Bizer, “Web data commons- extracting structured data from two large web cor- pora.” LDOW, vol. 937, pp. 133–145, 2012
work page 2012
-
[8]
RT-1: Robotics transformer for real-world control at scale,
A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu et al., “RT-1: Robotics transformer for real-world control at scale,” Robotics: Science and Systems (RSS), 2023
work page 2023
-
[9]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn et al. , “RT-2: Vision-language- action models transfer web knowledge to robotic con- trol,” arXiv preprint arXiv:2307.15818 , 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[10]
Learning modular neural network policies for multi-task and multi-robot transfer,
C. Devin, A. Gupta, T. Darrell, P. Abbeel, and S. Levine, “Learning modular neural network policies for multi-task and multi-robot transfer,” in 2017 IEEE international conference on robotics and automation (ICRA). IEEE, 2017, pp. 2169–2176
work page 2017
-
[11]
Hardware con- ditioned policies for multi-robot transfer learning,
T. Chen, A. Murali, and A. Gupta, “Hardware con- ditioned policies for multi-robot transfer learning,” in Advances in Neural Information Processing Systems , 2018, pp. 9355–9366
work page 2018
-
[12]
Graph networks as learnable physics engines for inference and control,
A. Sanchez-Gonzalez, N. Heess, J. T. Springenberg, J. Merel, M. Riedmiller, R. Hadsell, and P. Battaglia, “Graph networks as learnable physics engines for inference and control,” in Proceedings of the 35th International Conference on Machine Learning , ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. PMLR, 10–15 Jul 2018,...
work page 2018
-
[13]
Learning to control self-assembling morphologies: a study of generalization via modularity,
D. Pathak, C. Lu, T. Darrell, P. Isola, and A. A. Efros, “Learning to control self-assembling morphologies: a study of generalization via modularity,” Advances in Neural Information Processing Systems, vol. 32, 2019
work page 2019
-
[14]
R. Mart ´ın-Mart´ın, M. Lee, R. Gardner, S. Savarese, J. Bohg, and A. Garg, “Variable impedance control in end-effector space. an action space for reinforcement learning in contact rich tasks,” in Proceedings of the International Conference of Intelligent Robots and Systems (IROS), 2019
work page 2019
-
[15]
One policy to control them all: Shared modular policies for agent- agnostic control,
W. Huang, I. Mordatch, and D. Pathak, “One policy to control them all: Shared modular policies for agent- agnostic control,” in ICML, 2020
work page 2020
-
[16]
V . Kurin, M. Igl, T. Rockt ¨aschel, W. Boehmer, and S. Whiteson, “My body is a cage: the role of mor- phology in graph-based incompatible control,” arXiv preprint arXiv:2010.01856, 2020
-
[17]
XIRL: Cross-embodiment inverse reinforcement learning,
K. Zakka, A. Zeng, P. Florence, J. Tompson, J. Bohg, and D. Dwibedi, “XIRL: Cross-embodiment inverse reinforcement learning,” Conference on Robot Learn- ing (CoRL), 2021
work page 2021
-
[18]
Bayesian meta-learning for few-shot policy adaptation across robotic plat- forms,
A. Ghadirzadeh, X. Chen, P. Poklukar, C. Finn, M. Bj¨orkman, and D. Kragic, “Bayesian meta-learning for few-shot policy adaptation across robotic plat- forms,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) . IEEE, 2021, pp. 1274–1280
work page 2021
-
[19]
Meta- morph: Learning universal controllers with transform- ers,
A. Gupta, L. Fan, S. Ganguli, and L. Fei-Fei, “Meta- morph: Learning universal controllers with transform- ers,” in International Conference on Learning Repre- sentations, 2021
work page 2021
-
[20]
A gen- eralist dynamics model for control,
I. Schubert, J. Zhang, J. Bruce, S. Bechtle, E. Parisotto, M. Riedmiller, J. T. Springenberg, A. Byravan, L. Hasenclever, and N. Heess, “A gen- eralist dynamics model for control,” 2023
work page 2023
-
[21]
GNM: A general navigation model to drive any robot,
D. Shah, A. Sridhar, A. Bhorkar, N. Hirose, and S. Levine, “GNM: A general navigation model to drive any robot,” in 2023 IEEE International Conference on Robotics and Automation (ICRA) . IEEE, 2023, pp. 7226–7233
work page 2023
-
[22]
Y . Zhou, S. Sonawani, M. Phielipp, S. Stepputtis, and H. Amor, “Modularity through attention: Efficient training and transfer of language-conditioned policies for robot manipulation,” in Proceedings of The 6th Conference on Robot Learning , ser. Proceedings of Machine Learning Research, K. Liu, D. Kulic, and J. Ichnowski, Eds., vol. 205. PMLR, 14– 18 Dec...
work page 2023
-
[23]
RoboNet: Large-scale multi-robot learning,
S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn, “RoboNet: Large-scale multi-robot learning,” in Con- ference on Robot Learning (CoRL), vol. 100. PMLR, 2019, pp. 885–897
work page 2019
-
[24]
Know thyself: Transferable visual control policies through robot-awareness,
E. S. Hu, K. Huang, O. Rybkin, and D. Jayaraman, “Know thyself: Transferable visual control policies through robot-awareness,” inInternational Conference on Learning Representations , 2022
work page 2022
-
[25]
RoboCat : A self-improving foundation agent for robotic manipulation
K. Bousmalis, G. Vezzani, D. Rao, C. Devin, A. X. Lee, M. Bauza, T. Davchev, Y . Zhou, A. Gupta, A. Raju et al. , “RoboCat: A self-improving founda- tion agent for robotic manipulation,” arXiv preprint arXiv:2306.11706, 2023
-
[26]
Polybot: Training one policy across robots while embracing variability,
J. Yang, D. Sadigh, and C. Finn, “Polybot: Training one policy across robots while embracing variability,” arXiv preprint arXiv:2307.03719 , 2023
-
[27]
S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-Maron, M. Giménez, Y. Sulsky, J. Kay, J. T. Springenberg, T. Eccles, J. Bruce, A. Razavi, A. Edwards, N. Heess, Y. Chen, R. Hadsell, O. Vinyals, M. Bordbar, and N. de Freitas, “A generalist agent,” Transactions on Machine Learning Research, 2022
-
[28]
Bridging action space mismatch in learning from demonstrations,
G. Salhotra, I.-C. A. Liu, and G. Sukhatme, “Bridging action space mismatch in learning from demonstrations,” arXiv preprint arXiv:2304.03833, 2023
-
[29]
Robot learning with sensorimotor pre-training,
I. Radosavovic, B. Shi, L. Fu, K. Goldberg, T. Darrell, and J. Malik, “Robot learning with sensorimotor pre-training,” in Conference on Robot Learning, 2023
-
[30]
UniGrasp: Learning a unified model to grasp with multifingered robotic hands,
L. Shao, F. Ferreira, M. Jorda, V. Nambiar, J. Luo, E. Solowjow, J. A. Ojea, O. Khatib, and J. Bohg, “UniGrasp: Learning a unified model to grasp with multifingered robotic hands,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 2286–2293, 2020
-
[31]
Adagrasp: Learning an adaptive gripper-aware grasping policy,
Z. Xu, B. Qi, S. Agrawal, and S. Song, “Adagrasp: Learning an adaptive gripper-aware grasping policy,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 4620–4626
-
[32]
ViNT: A Foundation Model for Visual Navigation,
D. Shah, A. Sridhar, N. Dashora, K. Stachowicz, K. Black, N. Hirose, and S. Levine, “ViNT: A Foundation Model for Visual Navigation,” in 7th Annual Conference on Robot Learning (CoRL), 2023
-
[33]
Imitation from observation: Learning to imitate behaviors from raw video via context translation,
Y. Liu, A. Gupta, P. Abbeel, and S. Levine, “Imitation from observation: Learning to imitate behaviors from raw video via context translation,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1118–1125
-
[34]
One-shot imitation from observing humans via domain-adaptive meta-learning,
T. Yu, C. Finn, S. Dasari, A. Xie, T. Zhang, P. Abbeel, and S. Levine, “One-shot imitation from observing humans via domain-adaptive meta-learning,” Robotics: Science and Systems XIV, 2018
-
[35]
Third-person visual imitation learning via decoupled hierarchical controller,
P. Sharma, D. Pathak, and A. Gupta, “Third-person visual imitation learning via decoupled hierarchical controller,” Advances in Neural Information Processing Systems, vol. 32, 2019
-
[36]
Avid: Learning multi-stage tasks via pixel-level translation of human videos
L. Smith, N. Dhawan, M. Zhang, P. Abbeel, and S. Levine, “Avid: Learning multi-stage tasks via pixel-level translation of human videos,” arXiv preprint arXiv:1912.04443, 2019
-
[37]
Learning one-shot imitation from humans without humans,
A. Bonardi, S. James, and A. J. Davison, “Learning one-shot imitation from humans without humans,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3533–3539, 2020
-
[38]
Reinforcement learning with videos: Combining offline observations with interaction,
K. Schmeckpeper, O. Rybkin, K. Daniilidis, S. Levine, and C. Finn, “Reinforcement learning with videos: Combining offline observations with interaction,” in Conference on Robot Learning. PMLR, 2021, pp. 339–354
-
[39]
Learning by watching: Physical imitation of manipulation skills from human videos,
H. Xiong, Q. Li, Y.-C. Chen, H. Bharadhwaj, S. Sinha, and A. Garg, “Learning by watching: Physical imitation of manipulation skills from human videos,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 7827–7834
-
[40]
BC-Z: Zero-shot task generalization with robotic imitation learning,
E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn, “BC-Z: Zero-shot task generalization with robotic imitation learning,” in Conference on Robot Learning (CoRL), 2021, pp. 991–1002
-
[41]
Human-to-robot imitation in the wild,
S. Bahl, A. Gupta, and D. Pathak, “Human-to-robot imitation in the wild,” Robotics: Science and Systems (RSS), 2022
-
[42]
M. Ding, Y. Xu, Z. Chen, D. D. Cox, P. Luo, J. B. Tenenbaum, and C. Gan, “Embodied concept learner: Self-supervised learning of concepts and mapping through instruction following,” in Conference on Robot Learning. PMLR, 2023, pp. 1743–1754
-
[43]
Affordances from human videos as a versatile representation for robotics,
S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak, “Affordances from human videos as a versatile representation for robotics,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 13 778–13 790
-
[44]
Unsupervised Perceptual Rewards for Imitation Learning
P. Sermanet, K. Xu, and S. Levine, “Unsupervised perceptual rewards for imitation learning,” arXiv preprint arXiv:1612.06699, 2016
-
[45]
Concept2Robot: Learning manipulation concepts from instructions and human demonstrations,
L. Shao, T. Migimatsu, Q. Zhang, K. Yang, and J. Bohg, “Concept2Robot: Learning manipulation concepts from instructions and human demonstrations,” in Proceedings of Robotics: Science and Systems (RSS), 2020
-
[46]
A. S. Chen, S. Nair, and C. Finn, “Learning generalizable robotic reward functions from “in-the-wild” human videos,” arXiv preprint arXiv:2103.16817, 2021
-
[47]
Graph inverse reinforcement learning from diverse videos,
S. Kumar, J. Zamora, N. Hansen, R. Jangir, and X. Wang, “Graph inverse reinforcement learning from diverse videos,” in Conference on Robot Learning. PMLR, 2023, pp. 55–66
-
[48]
Learning reward functions for robotic manipulation by observing humans,
M. Alakuijala, G. Dulac-Arnold, J. Mairal, J. Ponce, and C. Schmid, “Learning reward functions for robotic manipulation by observing humans,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 5006–5012
-
[49]
Manipulator-independent representations for visual imitation,
Y. Zhou, Y. Aytar, and K. Bousmalis, “Manipulator-independent representations for visual imitation,” 2021
-
[50]
Mimicplay: Long-horizon imitation learning by watching human play,
C. Wang, L. Fan, J. Sun, R. Zhang, L. Fei-Fei, D. Xu, Y. Zhu, and A. Anandkumar, “Mimicplay: Long-horizon imitation learning by watching human play,” in Conference on Robot Learning, 2023
-
[51]
Learning predictive models from observation and interaction,
K. Schmeckpeper, A. Xie, O. Rybkin, S. Tian, K. Daniilidis, S. Levine, and C. Finn, “Learning predictive models from observation and interaction,” in European Conference on Computer Vision. Springer, 2020, pp. 708–725
-
[52]
R3m: A universal visual representation for robot manipulation,
S. Nair, A. Rajeswaran, V. Kumar, C. Finn, and A. Gupta, “R3m: A universal visual representation for robot manipulation,” in CoRL, 2022
-
[53]
Masked visual pre-training for motor control
T. Xiao, I. Radosavovic, T. Darrell, and J. Malik, “Masked visual pre-training for motor control,” arXiv preprint arXiv:2203.06173, 2022
-
[54]
Real-world robot learning with masked visual pre-training,
I. Radosavovic, T. Xiao, S. James, P. Abbeel, J. Malik, and T. Darrell, “Real-world robot learning with masked visual pre-training,” in Conference on Robot Learning, 2022
-
[55]
VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training
Y. J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V. Kumar, and A. Zhang, “Vip: Towards universal visual reward and representation via value-implicit pre-training,” arXiv preprint arXiv:2210.00030, 2022
-
[56]
Where are we in the search for an artificial visual cortex for embodied intelligence?
A. Majumdar, K. Yadav, S. Arnaud, Y. J. Ma, C. Chen, S. Silwal, A. Jain, V.-P. Berges, P. Abbeel, J. Malik et al., “Where are we in the search for an artificial visual cortex for embodied intelligence?” arXiv preprint arXiv:2303.18240, 2023
-
[57]
Language-driven representation learning for robotics,
S. Karamcheti, S. Nair, A. S. Chen, T. Kollar, C. Finn, D. Sadigh, and P. Liang, “Language-driven representation learning for robotics,” Robotics: Science and Systems (RSS), 2023
-
[58]
EC2: Emergent communication for embodied control,
Y. Mu, S. Yao, M. Ding, P. Luo, and C. Gan, “EC2: Emergent communication for embodied control,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6704–6714
-
[59]
Affordances from human videos as a versatile representation for robotics,
S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak, “Affordances from human videos as a versatile representation for robotics,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13 778–13 790
-
[60]
Efficient grasping from RGBD images: Learning using a new rectangle representation,
Y. Jiang, S. Moseson, and A. Saxena, “Efficient grasping from RGBD images: Learning using a new rectangle representation,” in 2011 IEEE International conference on robotics and automation. IEEE, 2011, pp. 3304–3311
-
[61]
Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours,
L. Pinto and A. K. Gupta, “Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours,” 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 3406–3413, 2015
-
[62]
Leveraging big data for grasp planning,
D. Kappler, J. Bohg, and S. Schaal, “Leveraging big data for grasp planning,” in ICRA, 2015, pp. 4304–4311
-
[63]
J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, “Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,” in Robotics: Science and Systems (RSS), 2017
-
[64]
Jacquard: A large scale dataset for robotic grasp detection,
A. Depierre, E. Dellandréa, and L. Chen, “Jacquard: A large scale dataset for robotic grasp detection,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 3511–3516
-
[65]
S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” The International journal of robotics research, vol. 37, no. 4-5, pp. 421–436, 2018
-
[66]
Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation, 2018
D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke et al., “QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation,” arXiv preprint arXiv:1806.10293, 2018
-
[67]
Contactdb: Analyzing and predicting grasp contact via thermal imaging,
S. Brahmbhatt, C. Ham, C. Kemp, and J. Hays, “Contactdb: Analyzing and predicting grasp contact via thermal imaging,” 04 2019
-
[68]
Graspnet-1billion: a large-scale benchmark for general object grasping,
H.-S. Fang, C. Wang, M. Gou, and C. Lu, “Graspnet-1billion: a large-scale benchmark for general object grasping,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 444–11 453
-
[69]
ACRONYM: A large-scale grasp dataset based on simulation,
C. Eppner, A. Mousavian, and D. Fox, “ACRONYM: A large-scale grasp dataset based on simulation,” in 2021 IEEE Int. Conf. on Robotics and Automation, ICRA, 2020
-
[70]
Using simulation and domain adaptation to improve efficiency of deep robotic grasping,
K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, S. Levine, and V. Vanhoucke, “Using simulation and domain adaptation to improve efficiency of deep robotic grasping,” in ICRA, 2018, pp. 4243–4250
-
[71]
Fanuc manipulation: A dataset for learning-based manipulation with fanuc mate 200iD robot,
X. Zhu, R. Tian, C. Xu, M. Huo, W. Zhan, M. Tomizuka, and M. Ding, “Fanuc manipulation: A dataset for learning-based manipulation with fanuc mate 200iD robot,” https://sites.google.com/berkeley.edu/fanuc-manipulation, 2023
-
[72]
More than a million ways to be pushed. a high-fidelity experimental dataset of planar pushing,
K.-T. Yu, M. Bauza, N. Fazeli, and A. Rodriguez, “More than a million ways to be pushed. a high-fidelity experimental dataset of planar pushing,” in 2016 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, 2016, pp. 30–37
-
[73]
Deep visual foresight for planning robot motion,
C. Finn and S. Levine, “Deep visual foresight for planning robot motion,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 2786–2793
-
[74]
Visual foresight: Model-based deep reinforcement learning for vision-based robotic control,
F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine, “Visual foresight: Model-based deep reinforcement learning for vision-based robotic control,” arXiv preprint arXiv:1812.00568, 2018
-
[75]
The princeton shape benchmark,
P. Shilane, P. Min, M. Kazhdan, and T. Funkhouser, “The princeton shape benchmark,” in Shape Modeling Applications, 2004, pp. 167–388
-
[76]
3DNet: Large-Scale Object Class Recognition from CAD Models,
W. Wohlkinger, A. Aldoma Buchaca, R. Rusu, and M. Vincze, “3DNet: Large-Scale Object Class Recognition from CAD Models,” in IEEE International Conference on Robotics and Automation (ICRA), 2012
-
[77]
A. Kasper, Z. Xue, and R. Dillmann, “The kit object models database: An object model database for object recognition, localization and manipulation in service robotics,” The International Journal of Robotics Research, vol. 31, no. 8, pp. 927–934, 2012
-
[78]
BigBIRD: A large-scale 3D database of object instances,
A. Singh, J. Sha, K. S. Narayan, T. Achim, and P. Abbeel, “BigBIRD: A large-scale 3D database of object instances,” in IEEE International Conference on Robotics and Automation (ICRA), 2014, pp. 509–516
-
[79]
Benchmarking in manipulation research: Using the Yale-CMU-Berkeley object and model set,
B. Calli, A. Walsman, A. Singh, S. Srinivasa, P. Abbeel, and A. M. Dollar, “Benchmarking in manipulation research: Using the Yale-CMU-Berkeley object and model set,” IEEE Robotics & Automation Magazine, vol. 22, no. 3, pp. 36–52, 2015
-
[80]
3D ShapeNets: A deep representation for volumetric shapes,
Zhirong Wu, S. Song, A. Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and J. Xiao, “3D ShapeNets: A deep representation for volumetric shapes,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1912–1920