AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems
Recognition: 3 Lean theorem links
Pith reviewed 2026-05-11 15:02 UTC · model grok-4.3
The pith
A dataset of over one million robot trajectories enables policies that improve by 30% over policies trained on Open X-Embodiment, on both familiar and new tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that pre-training on the AgiBot World dataset of over one million trajectories produces policies with 30% higher average performance than those trained on Open X-Embodiment, both in-domain and out-of-distribution. They further show that the GO-1 policy, which leverages latent action representations, exhibits predictable scaling with data volume and reaches over 60% success on complex dexterous and long-horizon tasks while outperforming the prior RDT method by 32%.
What carries the argument
The AgiBot World dataset of over one million trajectories paired with the GO-1 policy that uses latent action representations to maximize data utilization and enable predictable scaling.
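The abstract does not specify how these latent actions are produced. As a purely illustrative sketch, here is the nearest-neighbor lookup at the heart of the vector-quantization approach the paper's reference list points to (VQ-VAE, its ref [32], also used by Genie, its ref [30]); every name and dimension below is hypothetical, not GO-1's actual implementation:

```python
import numpy as np

# Hypothetical codebook of K discrete latent actions, each a D-dim vector.
K, D = 32, 128
rng = np.random.default_rng(0)
codebook = rng.normal(size=(K, D))

def quantize_action(z: np.ndarray) -> tuple[int, np.ndarray]:
    """Map a continuous action embedding z to its nearest codebook entry.

    This is the standard VQ-VAE nearest-neighbor step; GO-1's real
    latent-action model is not described in the abstract.
    """
    dists = np.linalg.norm(codebook - z, axis=1)  # distance to each code
    idx = int(np.argmin(dists))                   # discrete latent action id
    return idx, codebook[idx]

# Example: a policy can predict the discrete token idx instead of raw
# low-level actions, which is what makes heterogeneous data usable.
idx, z_q = quantize_action(rng.normal(size=D))
```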
Load-bearing premise
That the standardized collection pipeline with human-in-the-loop verification produces data of sufficient quality and diversity to drive the reported 30% gains, predictable scaling behavior, and 60%+ success rates on complex tasks.
What would settle it
Retraining the same policy architectures on an equally large alternative dataset collected without the human-verification step, and checking whether success rates drop and predictable scaling disappears; if neither does, the verification step is not load-bearing.
Original abstract
We explore how scalable robot data can address real-world challenges for generalized robotic manipulation. Introducing AgiBot World, a large-scale platform comprising over 1 million trajectories across 217 tasks in five deployment scenarios, we achieve an order-of-magnitude increase in data scale compared to existing datasets. Accelerated by a standardized collection pipeline with human-in-the-loop verification, AgiBot World guarantees high-quality and diverse data distribution. It is extensible from grippers to dexterous hands and visuo-tactile sensors for fine-grained skill acquisition. Building on top of data, we introduce Genie Operator-1 (GO-1), a novel generalist policy that leverages latent action representations to maximize data utilization, demonstrating predictable performance scaling with increased data volume. Policies pre-trained on our dataset achieve an average performance improvement of 30% over those trained on Open X-Embodiment, both in in-domain and out-of-distribution scenarios. GO-1 exhibits exceptional capability in real-world dexterous and long-horizon tasks, achieving over 60% success rate on complex tasks and outperforming prior RDT approach by 32%. By open-sourcing the dataset, tools, and models, we aim to democratize access to large-scale, high-quality robot data, advancing the pursuit of scalable and general-purpose intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces AgiBot World, a large-scale robot manipulation dataset with over 1 million trajectories across 217 tasks in five scenarios, collected via a standardized pipeline incorporating human-in-the-loop verification. It presents Genie Operator-1 (GO-1), a generalist policy that employs latent action representations to improve data utilization and exhibit predictable scaling with data volume. Key claims include a 30% average performance improvement for policies pre-trained on AgiBot World versus Open X-Embodiment (in both in-domain and OOD settings), GO-1 achieving over 60% success on complex dexterous and long-horizon tasks, and a 32% outperformance over the prior RDT approach. The work open-sources the dataset, tools, and models.
Significance. If the performance deltas can be shown to stem from the dataset's scale, diversity, and collection quality under controlled conditions, this would constitute a meaningful advance in scalable robot learning by supplying an order-of-magnitude larger resource than prior corpora such as Open X-Embodiment. The open-sourcing of data, code, and models, together with the emphasis on extensible hardware (grippers to dexterous hands and visuo-tactile sensors), would facilitate community progress toward generalist embodied policies. The absence of matched experimental controls and quantitative data-quality metrics, however, currently limits the strength of these conclusions.
Major comments (3)
- Abstract: The claim that policies pre-trained on AgiBot World achieve an average 30% performance improvement over those trained on Open X-Embodiment (both in-domain and OOD) does not state whether the GO-1 architecture, latent-action objective, optimizer schedule, and evaluation task suite were held identical when training the Open X-Embodiment baselines. Without explicit confirmation of matched training and evaluation protocols, the reported lift cannot be unambiguously attributed to dataset scale or the human-in-the-loop pipeline rather than confounding implementation differences.
- Dataset collection and experimental sections: The standardized collection pipeline with human-in-the-loop verification is asserted to guarantee high-quality, diverse data, yet no quantitative metrics are supplied (e.g., per-trajectory acceptance rates, inter-annotator agreement, task-coverage entropy, or diversity statistics). These metrics are load-bearing for validating the assumption that the pipeline drives the reported 30% gains and >60% success rates on complex tasks.
- Experimental results: Success rates (e.g., >60% on complex tasks) and improvement percentages (30%, 32%) are presented without error bars, number of evaluation trials, statistical significance tests, or data-exclusion criteria. This omission prevents assessment of the reliability and reproducibility of the central performance claims.
Minor comments (2)
- The acronym RDT appears without expansion on first use; provide the full name and a brief citation to the prior method being compared.
- A summary table directly juxtaposing AgiBot World statistics (trajectories, tasks, scenarios, sensor modalities) against Open X-Embodiment and other benchmarks would improve clarity and allow readers to assess the claimed order-of-magnitude scale increase.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive suggestions. We address each major comment in detail below, and we will make revisions to the manuscript to incorporate clarifications and additional information as outlined.
Point-by-point responses
- Referee: Abstract: The claim that policies pre-trained on AgiBot World achieve an average 30% performance improvement over those trained on Open X-Embodiment (both in-domain and OOD) does not state whether the GO-1 architecture, latent-action objective, optimizer schedule, and evaluation task suite were held identical when training the Open X-Embodiment baselines. Without explicit confirmation of matched training and evaluation protocols, the reported lift cannot be unambiguously attributed to dataset scale or the human-in-the-loop pipeline rather than confounding implementation differences.
  Authors: We confirm that all training and evaluation protocols were held identical across the AgiBot World and Open X-Embodiment pre-training experiments, with the sole difference being the dataset used. The GO-1 architecture, latent-action objective, optimizer, and task suite were the same. We will revise the abstract to explicitly state this matched setup, ensuring the performance gains can be attributed to the dataset. Revision: yes.
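To make the promised matched setup concrete, a minimal sketch of the single-variable control described above, where every training field is pinned except the pre-training dataset (all field names and values are hypothetical, not the authors' actual configuration):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PretrainConfig:
    # Everything except `dataset` is held fixed across the comparison.
    architecture: str = "GO-1"
    latent_action_objective: bool = True
    optimizer: str = "adamw"
    lr_schedule: str = "cosine"
    eval_suite: str = "shared-task-suite"
    dataset: str = "AgiBot-World"

base = PretrainConfig()
# The Open X-Embodiment baseline differs in the dataset field only,
# so any performance delta is attributable to the data.
baseline = replace(base, dataset="Open-X-Embodiment")
```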
- Referee: Dataset collection and experimental sections: The standardized collection pipeline with human-in-the-loop verification is asserted to guarantee high-quality, diverse data, yet no quantitative metrics are supplied (e.g., per-trajectory acceptance rates, inter-annotator agreement, task-coverage entropy, or diversity statistics). These metrics are load-bearing for validating the assumption that the pipeline drives the reported 30% gains and >60% success rates on complex tasks.
  Authors: We agree that providing quantitative metrics would better support our claims about data quality. We will add a dedicated subsection in the revised manuscript detailing metrics such as per-trajectory acceptance rates, inter-annotator agreement, task-coverage entropy, and diversity statistics. Revision: yes.
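As an illustration of how two of the promised metrics could be computed, a sketch over hypothetical pipeline logs (none of these numbers come from the paper):

```python
import numpy as np

def task_coverage_entropy(task_counts: dict[str, int]) -> float:
    """Shannon entropy (nats) of the task distribution; higher = more even coverage."""
    counts = np.array(list(task_counts.values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def acceptance_rate(accepted: int, submitted: int) -> float:
    """Fraction of collected trajectories passing human-in-the-loop verification."""
    return accepted / submitted

# Hypothetical numbers for illustration only.
print(task_coverage_entropy({"fold_shirt": 9000, "pour_water": 7000, "wipe_table": 4000}))
print(acceptance_rate(accepted=912_000, submitted=1_050_000))
```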
- Referee: Experimental results: Success rates (e.g., >60% on complex tasks) and improvement percentages (30%, 32%) are presented without error bars, number of evaluation trials, statistical significance tests, or data-exclusion criteria. This omission prevents assessment of the reliability and reproducibility of the central performance claims.
  Authors: We acknowledge the importance of statistical rigor in reporting results. We will update the experimental results section to include error bars, the number of evaluation trials, statistical significance tests, and data-exclusion criteria. Revision: yes.
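For concreteness, a sketch of the kind of uncertainty reporting the referee requests, here a 95% Wilson score interval for a binomial success rate (the trial counts are hypothetical, since the manuscript does not report them):

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial success rate."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

# e.g., ">60% success" observed as 13/20 trials still leaves a wide interval:
lo, hi = wilson_interval(13, 20)
print(f"65% success, 95% CI [{lo:.2f}, {hi:.2f}]")  # roughly [0.43, 0.82]
```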
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper's core claims consist of empirical results: a 30% average performance lift for policies pre-trained on AgiBot World versus Open X-Embodiment (in-domain and OOD), >60% success on complex tasks, and 32% outperformance versus the prior RDT method. These are presented as direct experimental comparisons to external datasets and methods rather than any closed mathematical derivation. The mention of 'predictable performance scaling with increased data volume' is framed as an observed experimental outcome from training GO-1 on the new data, not a first-principles equation or scaling law derived from the dataset itself. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or described structure. The human-in-the-loop pipeline is asserted as a quality guarantee but is not used to derive the performance numbers by construction. The chain of claims is therefore grounded in external benchmarks rather than internal circularity.
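One standard way to audit a "predictable scaling" claim of this kind is a power-law fit in log-log space; a sketch on synthetic points (the paper's actual scaling measurements are not reproduced here):

```python
import numpy as np

# Synthetic (data volume, success rate) pairs standing in for the paper's
# scaling experiments; real values would come from the manuscript.
n = np.array([1e4, 5e4, 1e5, 5e5, 1e6])
score = np.array([0.22, 0.34, 0.41, 0.55, 0.63])

# Fit score ~ a * n^b, i.e. log(score) = log(a) + b * log(n).
b, log_a = np.polyfit(np.log(n), np.log(score), deg=1)

def predict(n_new: float) -> float:
    """Score predicted by the fitted power law at a new data volume."""
    return float(np.exp(log_a) * n_new**b)

# "Predictable scaling" would mean held-out volumes land near this curve.
print(f"exponent b = {b:.3f}, predicted at 2M trajectories: {predict(2e6):.2f}")
```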
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Human-in-the-loop verification in the standardized collection pipeline guarantees high-quality and diverse data distribution.
Invented entities (1)
- Genie Operator-1 (GO-1): no independent evidence.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability.bilinear_family_forced (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "Policies pre-trained on our dataset achieve an average performance improvement of 30% over those trained on Open X-Embodiment... GO-1 exhibits exceptional capability... outperforming prior RDT approach by 32%."
- IndisputableMonolith.Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "AgiBot World... over 1 million trajectories across 217 tasks... standardized collection pipeline with human-in-the-loop verification"
- IndisputableMonolith.Cost.FunctionalEquation.washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "GO-1... leverages latent action representations to maximize data utilization, demonstrating predictable performance scaling with increased data volume."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 42 Pith papers
- Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics
  Open-H-Embodiment is the largest open multi-embodiment medical robotics dataset, used to train GR00T-H, the first open vision-language-action model that achieves end-to-end suturing completion where prior models fail.
- RotVLA: Rotational Latent Action for Vision-Language-Action Model
  RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
- π₀.₇: A Steerable Generalist Robotic Foundation Model with Emergent Capabilities
  π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
- HRDexDB: A Large-Scale Dataset of Dexterous Human and Robotic Hand Grasps
  HRDexDB is a multi-modal dataset of 1.4K human and robotic dexterous grasps across 100 objects, providing aligned 3D kinematics, high-resolution tactile data, and video streams.
- MoRight: Motion Control Done Right
  MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...
- BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination
  BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.
- HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models
  HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
- PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models
  PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.
- Unified Noise Steering for Efficient Human-Guided VLA Adaptation
  UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.
- RePO-VLA: Recovery-Driven Policy Optimization for Vision-Language-Action Models
  RePO-VLA raises average adversarial success rates in VLA manipulation from 20% to 75% by using recovery-aware initialization, a progress-aware semantic value function, and value-conditioned refinement on success and c...
- AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models
  AT-VLA introduces adaptive tactile injection and a dual-stream tactile reaction mechanism to integrate real-time tactile feedback into pretrained VLA models for contact-rich robotic manipulation.
- Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
  X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.
- Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
  X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...
- Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation
  MoT-HRA learns embodiment-agnostic human-intention priors from the HA-2.2M dataset of 2.2M human video episodes through a three-expert hierarchy to improve robotic motion plausibility and robustness under distribution shift.
- GazeVLA: Learning Human Intention for Robotic Manipulation
  GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
- Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models
  State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks...
- AnchorRefine: Synergy-Manipulation Based on Trajectory Anchor and Residual Refinement for Vision-Language-Action Models
  AnchorRefine factorizes VLA action generation into a trajectory anchor for coarse planning and residual refinement for local corrections, improving success rates by up to 7.8% in simulation and 18% on real robots acro...
- FLASH: Fast Learning via GPU-Accelerated Simulation for High-Fidelity Deformable Manipulation in Minutes
  A new GPU-accelerated deformable simulation framework trains manipulation policies in minutes using only synthetic data, achieving robust zero-shot transfer to physical robots.
- Robotic Manipulation is Vision-to-Geometry Mapping (f(v) → G): Vision-Geometry Backbones over Language and Video Models
  Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
- DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks
  CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.
- AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement
  AnySlot decouples language grounding from low-level control by inserting an explicit visual goal image, yielding better zero-shot performance on precise slot placement tasks than flat VLA policies.
- VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis
  VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
- TAMEn: Tactile-Aware Manipulation Engine for Closed-Loop Data Collection in Contact-Rich Tasks
  TAMEn supplies a cross-morphology wearable interface and pyramid-structured visuo-tactile data regime that raises bimanual manipulation success rates from 34% to 75% via closed-loop collection.
- A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model
  A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.
- ARM: Advantage Reward Modeling for Long-Horizon Manipulation
  ARM trains reward models on Progressive/Regressive/Stagnant labels to enable adaptive reweighting in offline RL, reaching 99.4% success on towel-folding with minimal human intervention.
- Fast-WAM: Do World Action Models Need Test-time Future Imagination?
  Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.
- InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
  InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
- SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
  SimpleVLA-RL applies tailored reinforcement learning to VLA models, reaching SoTA on LIBERO, outperforming π₀ on RoboTwin, and surpassing SFT in real-world tasks while reducing data needs and identifying a 'pushcut' p...
- RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
  RoboTwin 2.0 automates diverse synthetic data creation for dual-arm robots via MLLMs and five-axis domain randomization, leading to 228-367% gains in manipulation success.
- UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
  UniVLA trains cross-embodiment vision-language-action policies from unlabeled videos via a latent action model in DINO space, beating OpenVLA on benchmarks with 1/20th pretraining compute and 1/10th downstream data.
- GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
  GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.
- Embody4D: A Generalist 4D World Model for Embodied AI
  Embody4D generates high-fidelity, view-consistent novel views from monocular videos for embodied scenarios via 3D-aware data synthesis, adaptive noise injection, and interaction-aware attention.
- STARRY: Spatial-Temporal Action-Centric World Modeling for Robotic Manipulation
  STARRY uses unified diffusion to align spatial-temporal world predictions with action generation plus GASAM for geometry-aware attention, reaching 93.82%/93.30% success on 50 bimanual tasks in simulation and raising r...
- Cortex 2.0: Grounding World Models in Real-World Industrial Deployment
  Cortex 2.0 introduces world-model-based planning that generates and scores future trajectories to outperform reactive vision-language-action baselines on industrial robotic tasks including pick-and-place, sorting, and...
- StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement
  StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...
- A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning
  A progressive training framework using spatiotemporal chain-of-thought data reduces the forward-backward temporal query performance gap in VLMs from over 70% to 6.53%.
- Causal World Modeling for Robot Control
  LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.
- Motus: A Unified Latent Action World Model
  Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.
- World Action Models: The Next Frontier in Embodied AI
  The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
- JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy
  JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.
- World Model for Robot Learning: A Comprehensive Survey
  A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...
- Robot Learning from Human Videos: A Survey
  The survey organizes human-video-based robot learning into task-, observation-, and action-oriented transfer pathways, reviews associated datasets, and outlines challenges for scalable embodied AI.
Reference graph
Works this paper leans on
[1] OpenAI, "GPT-4 Technical Report," arXiv preprint arXiv:2303.08774, 2023.
[2] N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson, et al., "SAM 2: Segment anything in images and videos," arXiv preprint arXiv:2408.00714, 2024.
[3] C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song, "Diffusion Policy: Visuomotor policy learning via action diffusion," in RSS, 2023.
[4] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al., "OpenVLA: An open-source vision-language-action model," in CoRL, 2024.
[5] J. Cui and J. Trinkle, "Toward next-generation learned robot manipulation," in Science Robotics, 2021.
[6] A. Padalkar, A. Pooley, A. Jain, A. Bewley, A. Herzog, A. Irpan, A. Khazatsky, A. Rai, A. Singh, A. Brohan, et al., "Open X-Embodiment: Robotic learning datasets and RT-X models," in ICRA, 2024.
[7] A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, et al., "DROID: A large-scale in-the-wild robot manipulation dataset," in RSS, 2024.
[8] F. Lin, Y. Hu, P. Sheng, C. Wen, J. You, and Y. Gao, "Data scaling laws in imitation learning for robotic manipulation," in ICLR, 2025.
[9] T. Z. Zhao, V. Kumar, S. Levine, and C. Finn, "Learning fine-grained bimanual manipulation with low-cost hardware," in RSS, 2023.
[10] S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu, "RDT-1B: A diffusion foundation model for bimanual manipulation," in ICLR, 2025.
[11] S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn, "RoboNet: Large-scale multi-robot learning," in CoRL, 2019.
[12] F. Ebert, Y. Yang, K. Schmeckpeper, B. Bucher, G. Georgakis, K. Daniilidis, C. Finn, and S. Levine, "Bridge data: Boosting generalization of robotic skills with cross-domain datasets," in RSS, 2022.
[13] E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn, "BC-Z: Zero-shot task generalization with robotic imitation learning," in CoRL, 2022.
[14] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al., "RT-1: Robotics transformer for real-world control at scale," in RSS, 2023.
[15] H.-S. Fang, H. Fang, Z. Tang, J. Liu, J. Wang, H. Zhu, and C. Lu, "RH20T: A robotic dataset for learning diverse skills in one-shot," in RSS Workshops, 2023.
[16] H. Bharadhwaj, J. Vakil, M. Sharma, A. Gupta, S. Tulsiani, and V. Kumar, "RoboAgent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking," in ICRA, 2024.
[17] H. R. Walke, K. Black, T. Z. Zhao, Q. Vuong, C. Zheng, P. Hansen-Estruch, A. W. He, V. Myers, M. J. Kim, M. Du, et al., "BridgeData v2: A dataset for robot learning at scale," in CoRL, 2023.
[18] K. Wu, C. Hou, J. Liu, Z. Che, X. Ju, Z. Yang, M. Li, Y. Zhao, et al., "RoboMIND: Benchmark on multi-embodiment intelligence normative data for robot manipulation," arXiv preprint arXiv:2412.13877, 2024.
[19] A. Mandlekar, Y. Zhu, A. Garg, J. Booher, M. Spero, A. Tung, J. Gao, J. Emmons, A. Gupta, E. Orbay, S. Savarese, and L. Fei-Fei, "RoboTurk: A crowdsourcing platform for robotic skill learning through imitation," in CoRL, 2018.
[20] W. Pumacay, I. Singh, J. Duan, R. Krishna, J. Thomason, and D. Fox, "The Colosseum: A benchmark for evaluating generalization for robotic manipulation," in RSS, 2024.
[21] Y. Du, S. Yang, B. Dai, H. Dai, O. Nachum, J. Tenenbaum, D. Schuurmans, and P. Abbeel, "Learning universal policies via text-guided video generation," in NeurIPS, 2024.
[22] K. Black, M. Nakamoto, P. Atreya, H. R. Walke, C. Finn, A. Kumar, and S. Levine, "Zero-shot robotic manipulation with pre-trained image-editing diffusion models," in ICLR, 2024.
[23] Q. Bu, J. Zeng, L. Chen, Y. Yang, G. Zhou, J. Yan, P. Luo, H. Cui, Y. Ma, and H. Li, "Closed-loop visuomotor control with generative expectation for robotic manipulation," in NeurIPS, 2024.
[24] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, et al., "RT-2: Vision-language-action models transfer web knowledge to robotic control," in CoRL, 2023.
[25] D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, J. Luo, et al., "Octo: An open-source generalist robot policy," in RSS, 2024.
[26] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al., "π0: A vision-language-action flow model for general robot control," arXiv preprint arXiv:2410.24164, 2024.
[27] S. Ye, J. Jang, B. Jeon, S. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y.-W. Chao, B. Y. Lin, et al., "Latent action pretraining from videos," in ICLR, 2025.
[28] Z. Zhang, K. Zheng, Z. Chen, J. Jang, Y. Li, C. Wang, M. Ding, D. Fox, and H. Yao, "GRAPE: Generalizing robot policy via preference alignment," arXiv preprint arXiv:2411.19309, 2024.
[29] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, "Reflexion: Language agents with verbal reinforcement learning," in NeurIPS, 2023.
[30] J. Bruce, M. D. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, et al., "Genie: Generative interactive environments," in ICML, 2024.
[31] M. Xu, W. Dai, C. Liu, X. Gao, W. Lin, G.-J. Qi, and H. Xiong, "Spatial-temporal transformer networks for traffic flow forecasting," arXiv preprint arXiv:2001.02908, 2020.
[32] A. Van Den Oord, O. Vinyals, and K. Kavukcuoglu, "Neural discrete representation learning," in NeurIPS, 2017.
[33] Z. Chen, W. Wang, Y. Cao, Y. Liu, Z. Gao, E. Cui, J. Zhu, S. Ye, H. Tian, Z. Liu, et al., "Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling," arXiv preprint arXiv:2412.05271, 2024.
[34] Q. Bu, H. Li, L. Chen, J. Cai, J. Zeng, H. Cui, M. Yao, and Y. Qiao, "Towards synergistic, generalized, and efficient dual-system for robotic manipulation," arXiv preprint arXiv:2410.08001, 2024.