RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation
Recognition: 2 theorem links
Pith reviewed 2026-05-15 22:09 UTC · model grok-4.3
The pith
RoboMIND supplies 107k teleoperated trajectories across four robot embodiments to train generalizable manipulation policies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RoboMIND is a dataset of 107k demonstration trajectories across 479 diverse tasks involving 96 object classes, collected through human teleoperation on a unified platform that covers four distinct robotic embodiments: the Franka Emika Panda, the UR5e, the AgileX dual-arm robot, and a humanoid robot with dual dexterous hands. The dataset includes multi-view observations, proprioceptive robot state information, linguistic task descriptions, and 5k real-world failure demonstrations, each paired with a detailed cause, together with a matching digital twin in the Isaac Sim simulator.
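As a concrete illustration of the record layout described above, a single trajectory could be modeled as below; the class and field names are assumptions for illustration, not RoboMIND's published schema:

```python
from dataclasses import dataclass, field

# Hypothetical record layout for one teleoperated trajectory.
# Field names are illustrative assumptions, not RoboMIND's actual schema.
@dataclass
class TrajectoryStep:
    camera_views: dict    # view name -> image, e.g. {"front": ..., "wrist": ...}
    proprioception: list  # joint positions/velocities, gripper state
    action: list          # commanded action at this step

@dataclass
class Trajectory:
    embodiment: str       # e.g. "franka", "ur5e", "agilex", "humanoid"
    task_description: str # natural-language instruction
    steps: list = field(default_factory=list)
    success: bool = True  # False for the 5k annotated failure demonstrations
    failure_cause: str = ""  # populated only when success is False

traj = Trajectory(embodiment="franka",
                  task_description="pick up the red cup")
traj.steps.append(TrajectoryStep(camera_views={"front": None},
                                 proprioception=[0.0] * 7,
                                 action=[0.0] * 7))
```

The point of the sketch is that every trajectory carries the same four ingredients regardless of embodiment, which is what makes a unified collection protocol meaningful.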
What carries the argument
The unified data collection platform and standardized protocol that records consistent teleoperation demonstrations across multiple robotic embodiments.
Load-bearing premise
Human teleoperation demonstrations collected on one unified platform supply enough quality and coverage to train policies that generalize across robot embodiments and unseen real-world conditions.
What would settle it
A controlled test in which a VLA model trained on RoboMIND is evaluated on a previously unseen robot embodiment, or on tasks outside the 479 covered ones, and achieves success rates no higher than models trained on smaller single-embodiment datasets.
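The proposed test can be sketched as a small evaluation harness. Everything here is a hypothetical stand-in: `rollout_success_rate`, the policy dictionaries, and the success probabilities are invented for illustration, not measured results:

```python
import random

def rollout_success_rate(policy, tasks, trials=50, seed=0):
    """Hypothetical stand-in for a real robot evaluation loop: run
    `policy` on each held-out task and return the fraction of successful
    rollouts. Simulated here with an assumed per-policy success rate."""
    rng = random.Random(seed)
    successes = sum(rng.random() < policy["p_success"]
                    for _ in tasks for _ in range(trials))
    return successes / (len(tasks) * trials)

# Held-out setting: unseen embodiment or tasks outside the 479 covered ones.
unseen_tasks = ["open unseen drawer", "stack novel blocks"]
robomind_vla = {"p_success": 0.6}       # assumed value, illustration only
single_embodiment = {"p_success": 0.4}  # assumed value, illustration only

r_big = rollout_success_rate(robomind_vla, unseen_tasks)
r_small = rollout_success_rate(single_embodiment, unseen_tasks)
# The generalization claim is refuted if the RoboMIND-trained model does
# no better than the single-embodiment baseline on the held-out setting.
claim_refuted = r_big <= r_small
```

The decisive ingredient is the held-out comparison itself, not the simulator: the same harness applies whether rollouts run on real hardware or in the Isaac Sim twin.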
read the original abstract
In this paper, we introduce RoboMIND (Multi-embodiment Intelligence Normative Data for Robot Manipulation), a dataset containing 107k demonstration trajectories across 479 diverse tasks involving 96 object classes. RoboMIND is collected through human teleoperation and encompasses comprehensive robotic-related information, including multi-view observations, proprioceptive robot state information, and linguistic task descriptions. To ensure data consistency and reliability for imitation learning, RoboMIND is built on a unified data collection platform and a standardized protocol, covering four distinct robotic embodiments: the Franka Emika Panda, the UR5e, the AgileX dual-arm robot, and a humanoid robot with dual dexterous hands. Our dataset also includes 5k real-world failure demonstrations, each accompanied by detailed causes, enabling failure reflection and correction during policy learning. Additionally, we created a digital twin environment in the Isaac Sim simulator, replicating the real-world tasks and assets, which facilitates the low-cost collection of additional training data and enables efficient evaluation. To demonstrate the quality and diversity of our dataset, we conducted extensive experiments using various imitation learning methods for single-task settings and state-of-the-art Vision-Language-Action (VLA) models for multi-task scenarios. By leveraging RoboMIND, the VLA models achieved high manipulation success rates and demonstrated strong generalization capabilities. To the best of our knowledge, RoboMIND is the largest multi-embodiment teleoperation dataset collected on a unified platform, providing large-scale and high-quality robotic training data. Our project is at https://x-humanoid-robomind.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RoboMIND, a dataset of 107k teleoperated demonstration trajectories across 479 tasks, 96 object classes, and four robotic embodiments (Franka Emika Panda, UR5e, AgileX dual-arm, humanoid with dexterous hands). Collected on a unified platform with multi-view observations, proprioceptive states, and language descriptions, it also includes 5k failure cases with annotated causes and a matching Isaac Sim digital twin. Experiments with imitation learning and VLA models are said to produce high manipulation success rates and strong generalization.
Significance. If the experimental validation holds, RoboMIND would be a valuable large-scale resource for multi-embodiment imitation and VLA training. The unified collection protocol, scale, inclusion of real failure demonstrations for reflection/correction, and simulator twin are concrete strengths that could support reproducible policy development and low-cost data augmentation.
Major comments (1)
- [§4] §4 (Experiments): The VLA multi-task results claim 'high manipulation success rates and strong generalization capabilities' without reporting numerical success rates, baselines, per-embodiment breakdowns, or the cross-embodiment protocol (e.g., whether any platform was held out, how kinematics/dynamics differences were handled, or the contribution of the 5k failure cases). This directly undermines assessment of the central claim that the dataset enables cross-embodiment transfer.
Minor comments (2)
- [Abstract] Abstract: Add at least one key quantitative result (e.g., average success rate or comparison to prior datasets) to make the empirical claims concrete rather than qualitative.
- [§3] §3 (Dataset): Clarify the distribution of the 107k trajectories across the four embodiments to allow readers to judge balance and potential single-embodiment dominance.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We provide a point-by-point response to the major comment below and outline the revisions we will make to address the concerns.
read point-by-point responses
-
Referee: [§4] §4 (Experiments): The VLA multi-task results claim 'high manipulation success rates and strong generalization capabilities' without reporting numerical success rates, baselines, per-embodiment breakdowns, or the cross-embodiment protocol (e.g., whether any platform was held out, how kinematics/dynamics differences were handled, or the contribution of the 5k failure cases). This directly undermines assessment of the central claim that the dataset enables cross-embodiment transfer.
Authors: We acknowledge that the manuscript does not provide specific numerical values for the VLA success rates or detailed breakdowns in the current version. To address this, we will revise Section 4 to include quantitative results, including overall and per-embodiment success rates for the VLA models, comparisons against relevant baselines, and a clear description of the experimental protocol. Specifically, we will clarify that the multi-embodiment training was performed jointly on data from all four robots using a unified observation and action space to mitigate kinematic differences, without holding out any embodiment. We will also add results showing the impact of incorporating the 5k failure demonstrations on policy performance. These changes will allow readers to better evaluate the cross-embodiment transfer capabilities enabled by RoboMIND.
Revision: yes
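The rebuttal's "unified observation and action space" is left unspecified. One common way to merge embodiments with different degrees of freedom is to zero-pad each native action vector to a shared maximum dimension and carry a validity mask; the sketch below assumes illustrative DoF counts and is not the authors' stated implementation:

```python
# Sketch of a unified action space across embodiments with different DoF.
# DoF counts and the padding scheme are assumptions for illustration,
# not the authors' stated implementation.
EMBODIMENT_DOF = {"franka": 7, "ur5e": 6, "agilex_dual": 14, "humanoid": 26}
MAX_DOF = max(EMBODIMENT_DOF.values())

def to_unified(action, embodiment):
    """Zero-pad a native action vector to MAX_DOF and return a mask
    marking which dimensions are real for this embodiment."""
    dof = EMBODIMENT_DOF[embodiment]
    assert len(action) == dof
    padded = list(action) + [0.0] * (MAX_DOF - dof)
    mask = [1.0] * dof + [0.0] * (MAX_DOF - dof)
    return padded, mask

padded, mask = to_unified([0.1] * 6, "ur5e")
# padded and mask both have length MAX_DOF; trailing entries are padding.
```

A shared loss would then be computed only over masked-in dimensions, which is one way kinematic differences could be mitigated without holding out any embodiment.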
Circularity Check
No circularity: empirical dataset introduction with no derivation chain
full rationale
The paper introduces an empirical teleoperation dataset (107k trajectories across four embodiments) and reports standard imitation-learning and VLA experiments. No equations, fitted parameters, or self-referential reductions appear; the size/quality claim rests on the described collection protocol rather than any self-definition, fitted-input prediction, or load-bearing self-citation. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced (tag: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
By leveraging RoboMIND, the VLA models achieved high manipulation success rates and demonstrated strong generalization capabilities.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
-
RotVLA: Rotational Latent Action for Vision-Language-Action Model
RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
-
RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning
RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.
-
HRDexDB: A Large-Scale Dataset of Dexterous Human and Robotic Hand Grasps
HRDexDB is a multi-modal dataset of 1.4K human and robotic dexterous grasps across 100 objects, providing aligned 3D kinematics, high-resolution tactile data, and video streams.
-
BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination
BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.
-
Towards Generalizable Robotic Manipulation in Dynamic Environments
DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.
-
ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs
ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
-
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
-
AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models
AT-VLA introduces adaptive tactile injection and a dual-stream tactile reaction mechanism to integrate real-time tactile feedback into pretrained VLA models for contact-rich robotic manipulation.
-
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
-
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.
-
PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations
PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
-
DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks
CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.
-
ROBOGATE: Adaptive Failure Discovery for Safe Robot Policy Deployment via Two-Stage Boundary-Focused Sampling
ROBOGATE applies adaptive boundary-focused sampling in simulation to discover robot policy failure boundaries, revealing a 97.65 percentage point performance gap for a VLA model between LIBERO and industrial scenarios.
-
ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning
ABot-M0 unifies heterogeneous robot data into a 6-million-trajectory dataset and introduces Action Manifold Learning to predict stable actions on a low-dimensional manifold using a DiT backbone.
-
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
-
RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
RoboTwin 2.0 automates diverse synthetic data creation for dual-arm robots via MLLMs and five-axis domain randomization, leading to 228-367% gains in manipulation success.
-
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.
-
AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems
AgiBot World supplies over 1 million trajectories enabling GO-1 to deliver 30% average gains over Open X-Embodiment and over 60% success on complex dexterous tasks while open-sourcing everything.
-
StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement
StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...
-
Motus: A Unified Latent Action World Model
Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.
-
Towards Robotic Dexterous Hand Intelligence: A Survey
A structured survey of dexterous robotic hand research that reviews hardware, control methods, data resources, and benchmarks while identifying major limitations and future directions.
-
World Simulation with Video Foundation Models for Physical AI
Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Do as i can, not as i say: Grounding language in robotic affordances
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances. In Proceedings of The 6th Conference on Robot Learning, volume 205 of Proceedings of Machine Learning Research, pages 287–318...
work page 2023
-
[3]
Learning dexterous in-hand manipulation
OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1):3–20, 2020
work page 2020
-
[4]
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. OpenFlamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
Affordances from human videos as a versatile representation for robotics
Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13778–13790, 2023
work page 2023
-
[6]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[7]
Rt-h: Action hierarchies using language
Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quan Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. Rt-h: Action hierarchies using language. arXiv preprint arXiv:2403.01823, 2024
-
[8]
Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Abhinav Gupta, Shubham Tulsiani, and Vikash Kumar. Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 4788–4795. IEEE, 2024
work page 2024
-
[9]
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics Transformer for Real-World Control at Scale. In Proceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023
work page 2023
-
[10]
Scaling data-driven robotics with reward sketching and batch reinforcement learning
Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, et al. Scaling data-driven robotics with reward sketching and batch reinforcement learning. In Proceedings of Robotics: Science and Systems, July 2020
work page 2020
-
[11]
LeRobot: State-of-the-art machine learning for real-world robotics in PyTorch
Remi Cadene, Simon Alibert, Alexander Soare, Quentin Gallouedec, Adil Zouitine, and Thomas Wolf. LeRobot: State-of-the-art machine learning for real-world robotics in PyTorch. https://github.com/huggingface/lerobot, 2024
work page 2024
-
[12]
The Beijing Humanoid Robot Innovation Center. X-humanoid tien kung, 2024. URL https://x-humanoid.com/
work page 2024
-
[13]
Matterport3d: Learning from rgb-d data in indoor environments
Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Nießner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. In 2017 International Conference on 3D Vision (3DV), pages 667–676, 2017
work page 2017
-
[14]
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[15]
Mirage: Cross-embodiment zero-shot policy transfer with cross-painting
Lawrence Yunliang Chen, Kush Hari, Karthik Dharmarajan, Chenfeng Xu, Quan Vuong, and Ken Goldberg. Mirage: Cross-embodiment zero-shot policy transfer with cross-painting. In Proceedings of Robotics: Science and Systems, 2024
work page 2024
-
[16]
Open-television: Teleoperation with immersive active visual feedback
Xuxin Cheng, Jialong Li, Shiqi Yang, Ge Yang, and Xiaolong Wang. Open-television: Teleoperation with immersive active visual feedback. In 8th Annual Conference on Robot Learning, 2024
work page 2024
-
[17]
Diffusion policy: Visuomotor policy learning via action diffusion
Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. RSS, 2023
work page 2023
-
[18]
Deep reinforcement learning from human preferences
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30, 2017
work page 2017
-
[19]
Pybullet, a python module for physics simulation for games, robotics and machine learning, 2016
Erwin Coumans and Yunfei Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning, 2016
work page 2016
-
[20]
Scaling egocentric vision: The epic-kitchens dataset
Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European conference on computer vision (ECCV) , pages 720–736, 2018
work page 2018
-
[21]
Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100
Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision, pages 1–23, 2022
work page 2022
-
[23]
Robonet: Large-scale multi-robot learning
Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. Robonet: Large-scale multi-robot learning. In Conference on Robot Learning, pages 885–897. PMLR, 2020
work page 2020
-
[24]
Mingdi Deng, Zhijun Li, Yu Kang, CL Philip Chen, and Xiaoli Chu. A learning-based hierarchical control scheme for an exoskeleton robot in human–robot cooperative manipulation. IEEE Transactions on Cybernetics, 50(1):112–125, 2018
work page 2018
-
[25]
Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation
Ria Doshi, Homer Rich Walke, Oier Mees, Sudeep Dasari, and Sergey Levine. Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation. In 8th Annual Conference on Robot Learning, 2024
work page 2024
-
[26]
Palm-e: an embodied multimodal language model
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: an embodied multimodal language model. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023
work page 2023
-
[27]
Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets
Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets. In Proceedings of Robotics: Science and Systems, New York City, NY, USA, June 2022. doi: 10.15607/RSS.2022.XVIII.063
-
[28]
FlowBot3D: Learning 3D Articulation Flow to Manipulate Articulated Objects
Ben Eisner, Harry Zhang, and David Held. FlowBot3D: Learning 3D Articulation Flow to Manipulate Articulated Objects. In Proceedings of Robotics: Science and Systems, June 2022. doi: 10.15607/RSS.2022.XVIII.018
-
[29]
Anygrasp: Robust and efficient grasp perception in spatial and temporal domains
Hao-Shu Fang, Chenxi Wang, Hongjie Fang, Minghao Gou, Jirong Liu, Hengxu Yan, Wenhai Liu, Yichen Xie, and Cewu Lu. Anygrasp: Robust and efficient grasp perception in spatial and temporal domains. IEEE Transactions on Robotics , 2023
work page 2023
-
[30]
Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot
Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Chenxi Wang, Junbo Wang, Haoyi Zhu, and Cewu Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot. In 2024 IEEE International Conference on Robotics and Automation (ICRA) , pages 653–660. IEEE, 2024
work page 2024
-
[31]
Humanplus: Humanoid shadowing and imitation from humans
Zipeng Fu, Qingqing Zhao, Qi Wu, Gordon Wetzstein, and Chelsea Finn. Humanplus: Humanoid shadowing and imitation from humans. In 8th Annual Conference on Robot Learning , 2024
work page 2024
-
[32]
Zipeng Fu, Tony Z. Zhao, and Chelsea Finn. Mobile ALOHA: Learning bimanual mobile manipulation using low-cost whole-body teleoperation. In 8th Annual Conference on Robot Learning , 2024
work page 2024
-
[33]
Datacomp: In search of the next generation of multimodal datasets
Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems, 36, 2024
work page 2024
-
[34]
Act3d: 3d feature field transformers for multi-task robotic manipulation
Theophile Gervet, Zhou Xian, Nikolaos Gkanatsios, and Katerina Fragkiadaki. Act3d: 3d feature field transformers for multi-task robotic manipulation. In 7th Annual Conference on Robot Learning , 2023
work page 2023
-
[35]
Franka Robotics GmbH. Franka robotics, 2024. URL https://franka.de/
work page 2024
-
[36]
Rvt: Robotic view transformer for 3d object manipulation
Ankit Goyal, Jie Xu, Yijie Guo, Valts Blukis, Yu-Wei Chao, and Dieter Fox. Rvt: Robotic view transformer for 3d object manipulation. In Conference on Robot Learning, pages 694–710. PMLR, 2023
work page 2023
-
[37]
The "something something" video database for learning and evaluating visual common sense
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pages 5...
work page 2017
-
[38]
Ego4d: Around the world in 3,000 hours of egocentric video
Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022
work page 2022
-
[39]
Robot learning in homes: Improving generalization and reducing dataset bias
Abhinav Gupta, Adithyavairavan Murali, Dhiraj Prakashchand Gandhi, and Lerrel Pinto. Robot learning in homes: Improving generalization and reducing dataset bias. Advances in neural information processing systems, 31, 2018
work page 2018
-
[40]
BAKU: An efficient transformer for multi-task policy learning
Siddhant Haldar, Zhuoran Peng, and Lerrel Pinto. BAKU: An efficient transformer for multi-task policy learning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems , 2024
work page 2024
-
[41]
Dexpilot: Vision-based teleoperation of dexterous robotic hand-arm system
Ankur Handa, Karl Van Wyk, Wei Yang, Jacky Liang, Yu-Wei Chao, Qian Wan, Stan Birchfield, Nathan Ratliff, and Dieter Fox. Dexpilot: Vision-based teleoperation of dexterous robotic hand-arm system. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 9164–9170. IEEE, 2020
work page 2020
-
[42]
Denoising diffusion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 2020
work page 2020
- [43]
-
[44]
Voxposer: Composable 3d value maps for robotic manipulation with language models
Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. Voxposer: Composable 3d value maps for robotic manipulation with language models. In 7th Annual Conference on Robot Learning, 2023
work page 2023
-
[45]
Intel. Depth camera d435i. https://www.intelrealsense.com/depth-camera-d435i/, 2019
work page 2019
-
[46]
Bc-z: Zero-shot task generalization with robotic imitation learning
Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pages 991–1002. PMLR, 2022
work page 2022
-
[47]
Robotic grasping using deep reinforcement learning
Shirin Joshi, Sulabh Kumra, and Ferat Sahin. Robotic grasping using deep reinforcement learning. In 2020 IEEE 16th International Conference on Automation Science and Engineering (CASE) , pages 1461–1466. IEEE, 2020
work page 2020
-
[48]
Mt-opt: Continuous multi-task robotic reinforcement learning at scale
Dmitry Kalashnikov, Jacob Varley, Yevgen Chebotar, Benjamin Swanson, Rico Jonschkowski, Chelsea Finn, Sergey Levine, and Karol Hausman. Mt-opt: Continuous multi-task robotic reinforcement learning at scale. arXiv preprint arXiv:2104.08212, 2021
-
[49]
DEFT: Dexterous fine-tuning for hand policies
Aditya Kannan, Kenneth Shaw, Shikhar Bahl, Pragna Mannam, and Deepak Pathak. DEFT: Dexterous fine-tuning for hand policies. In 7th Annual Conference on Robot Learning, 2023
work page 2023
-
[50]
3d diffuser actor: Policy diffusion with 3d scene representations
Tsung-Wei Ke, Nikolaos Gkanatsios, and Katerina Fragkiadaki. 3d diffuser actor: Policy diffusion with 3d scene representations. In 8th Annual Conference on Robot Learning, 2024
work page 2024
-
[51]
DROID: A large-scale in-the-wild robot manipulation dataset
Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. DROID: A large-scale in-the-wild robot manipulation dataset. In RSS 2024 Workshop: Data Generation for Robotics, 2024
work page 2024
-
[52]
OpenVLA: An open-source vision-language-action model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. In 8th Annual Conference on Robot Learning, 2024
work page 2024
-
[53]
Design and use paradigms for gazebo, an open-source multi-robot simulator
Nathan Koenig and Andrew Howard. Design and use paradigms for gazebo, an open-source multi-robot simulator. In IEEE/RSJ International Conference on Intelligent Robots and Systems, volume 3, pages 2149–2154, 2004
work page 2004
-
[54]
AI2-THOR: An Interactive 3D Environment for Visual AI
Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv, 2017
work page 2017
-
[55]
Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research, 37(4-5):421–436, 2018
work page 2018
-
[56] Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, et al. DataComp-LM: In search of the next generation of training sets for language models. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.
[57] Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. CogACT: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024.
[58] Xiaoqi Li, Mingxu Zhang, Yiran Geng, Haoran Geng, Yuxing Long, Yan Shen, Renrui Zhang, Jiaming Liu, and Hao Dong. ManipLLM: Embodied multimodal large language model for object-centric robotic manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18061–18070, 2024.
[59] Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, Hang Li, and Tao Kong. Vision-language foundation models as effective robot imitators. In The Twelfth International Conference on Learning Representations, 2024.
[60] Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, and Ted Xiao. Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941, 2024.
[61] Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as Policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9493–9500. IEEE, 2023.
[62] Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5971–5984, Miami, Florida, USA, November 2024.
[63] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
[64] Huihan Liu, Soroush Nasiriany, Lance Zhang, Zhiyao Bao, and Yuke Zhu. Robot learning on the job: Human-in-the-loop autonomy and learning during deployment. The International Journal of Robotics Research, page 02783649241273901, 2022.
[65] Jiaming Liu, Mengzhen Liu, Zhenyu Wang, Pengju An, Xiaoqi Li, Kaichen Zhou, Senqiao Yang, Renrui Zhang, Yandong Guo, and Shanghang Zhang. RoboMamba: Efficient vision-language-action model for robotic reasoning and manipulation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
[66] Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. RDT-1B: A diffusion foundation model for bimanual manipulation. In The Thirteenth International Conference on Learning Representations, 2025.
[67] Zeyi Liu, Arpit Bahety, and Shuran Song. REFLECT: Summarizing robot experiences for failure explanation and correction. In 7th Annual Conference on Robot Learning, 2023.
[68] Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac Gym: High performance GPU-based physics simulation for robot learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
[69] Priyanka Mandikal and Kristen Grauman. Learning dexterous grasping with object-centric visual affordances. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 6169–6176. IEEE, 2021.
[70] Priyanka Mandikal and Kristen Grauman. DexVIP: Learning dexterous grasping with human hand pose priors from video. In Conference on Robot Learning, pages 651–661. PMLR, 2022.
[71] Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. RoboTurk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning, pages 879–893. PMLR, 2018.
[72] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20, 2016.
[73] Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022.
[74] Kaichun Mo, Leonidas J. Guibas, Mustafa Mukadam, Abhinav Gupta, and Shubham Tulsiani. Where2Act: From pixels to actions for articulated 3D objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6813–6823, 2021.
[75] Movella. Xsens. https://www.movella.com/products/xsens, 2025. Accessed: 2025-01-15.
[76] Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3M: A universal visual representation for robot manipulation. In 6th Annual Conference on Robot Learning, 2022.
[77] NVIDIA. NVIDIA Isaac Sim: Robotics simulation and synthetic data, 2023. URL https://developer.nvidia.com/isaac-sim.
[78] Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, et al. (Open X-Embodiment Collaboration). Open X-Embodiment: Robotic learning datasets and RT-X models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903, 2024.
[79] ORBBEC. Astra series: Structured light camera. https://www.orbbec.com/products/structured-light-camera/astra-series/, 2022.
[80] ORBBEC. Gemini 335: 3D vision for a 3D world. https://www.orbbec.com/products/stereo-vision-camera/gemini-335/, 2024.