
arxiv: 2507.05331 · v1 · pith:ZAXTO6RHnew · submitted 2025-07-07 · 💻 cs.RO

A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation

Pith reviewed 2026-05-25 04:29 UTC · model grok-4.3

classification 💻 cs.RO
keywords large behavior models · multitask learning · dexterous manipulation · imitation learning · robot foundation models · diffusion policy · pretraining · sample efficiency

The pith

Multi-task pretraining makes robot policies more successful, robust, and data-efficient than single-task training for dexterous manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates large behavior models by extending the Diffusion Policy approach over a mix of simulated and real robot data for multitask dexterous manipulation. It compares these models against single-task baselines using blind, randomized trials in controlled settings. Multi-task pretraining improves success rates and robustness while allowing new complex tasks to be taught faster with far less data. Performance rises in a predictable way as the scale and diversity of the pretraining data increase. The work supplies a validated evaluation pipeline to support these comparisons with statistical confidence.

Core claim

Multi-task pretraining on a corpus of robot data produces policies that are more successful and robust than single-task policies, allow quicker teaching of new complex tasks with a fraction of the data, and show performance that improves predictably with greater pretraining scale and diversity.

What carries the argument

An evaluation pipeline that analyzes multitask policies with statistical confidence through blind randomized trials on simulated and real-world data.
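
As a rough illustration of what "blind, randomized" can mean in practice, the sketch below hides policy identities behind opaque labels and shuffles the rollout order, so that whoever scores a trial cannot tell which policy is running. The function name `blind_schedule`, the label format, and the seed handling are hypothetical and are not taken from the paper, whose actual protocol is not described on this page.

```python
import random


def blind_schedule(policies, n_trials, seed=0):
    """Build a blinded, randomized rollout schedule.

    Returns (schedule, key):
      schedule: list of opaque labels, one per rollout, in shuffled order
      key:      dict mapping each opaque label to the real policy name
                (kept away from the operator until scoring is finished)
    """
    rng = random.Random(seed)

    # Assign opaque labels to policies in a random order.
    labels = [f"policy_{i}" for i in range(len(policies))]
    shuffled = list(policies)
    rng.shuffle(shuffled)
    key = dict(zip(labels, shuffled))

    # Each label appears n_trials times; the rollout order is then shuffled.
    schedule = [label for label in labels for _ in range(n_trials)]
    rng.shuffle(schedule)
    return schedule, key


if __name__ == "__main__":
    schedule, key = blind_schedule(["single_task", "multi_task"], n_trials=5)
    print(schedule)  # the operator sees only this list
```

The design choice here is that the mapping `key` is produced once and stored separately, so outcomes are recorded against labels and unblinded only after all trials are scored.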

If this is right

  • Multi-task policies achieve higher success rates and greater robustness than single-task baselines across the evaluated tasks.
  • New tasks can be taught with substantially less data when starting from a multi-task pretrained model.
  • Performance gains continue in a predictable manner as pretraining data volume and task diversity increase.
  • The advantages appear in both simulation and real-world blind trials.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same scaling pattern may apply to other robot learning settings that currently rely on task-specific training.
  • Collecting larger and more varied robot datasets could accelerate progress toward general manipulation capabilities.
  • The results motivate experiments that combine these behavior models with language or vision inputs for further gains.
  • Future work could test whether the observed data-efficiency benefits persist when the new task lies far outside the pretraining distribution.

Load-bearing premise

The selected tasks and data composition give an unbiased test of general multitask benefits that would apply to other dexterous manipulation problems.

What would settle it

A follow-up study that trains and tests single-task and multi-task policies on a fresh corpus of tasks chosen without regard to the original data distribution and finds no advantage for multi-task pretraining.

Original abstract

Robot manipulation has seen tremendous progress in recent years, with imitation learning policies enabling successful performance of dexterous and hard-to-model tasks. Concurrently, scaling data and model size has led to the development of capable language and vision foundation models, motivating large-scale efforts to create general-purpose robot foundation models. While these models have garnered significant enthusiasm and investment, meaningful evaluation of real-world performance remains a challenge, limiting both the pace of development and inhibiting a nuanced understanding of current capabilities. In this paper, we rigorously evaluate multitask robot manipulation policies, referred to as Large Behavior Models (LBMs), by extending the Diffusion Policy paradigm across a corpus of simulated and real-world robot data. We propose and validate an evaluation pipeline to rigorously analyze the capabilities of these models with statistical confidence. We compare against single-task baselines through blind, randomized trials in a controlled setting, using both simulation and real-world experiments. We find that multi-task pretraining makes the policies more successful and robust, and enables teaching complex new tasks more quickly, using a fraction of the data when compared to single-task baselines. Moreover, performance predictably increases as pretraining scale and diversity grows. Project page: https://toyotaresearchinstitute.github.io/lbm1/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper extends the Diffusion Policy framework to train Large Behavior Models (LBMs) via multi-task pretraining on a corpus of simulated and real-world dexterous manipulation data. Through blind randomized trials with statistical analysis, it claims that multi-task pretraining yields higher success rates and robustness than single-task baselines, enables faster adaptation to novel tasks with substantially less data, and exhibits predictable performance gains as pretraining scale and diversity increase.

Significance. If the empirical claims hold, the work supplies statistically grounded evidence that multi-task pretraining confers concrete advantages in sample efficiency and robustness for robot manipulation policies. The use of blind randomized trials and controlled real-world experiments is a notable strength that reduces experimenter bias and supports reproducibility in the field.

major comments (1)
  1. [§4] §4 (Evaluation Pipeline) and the task corpus description: the central claim that multi-task benefits are general requires explicit evidence that tasks are sufficiently independent (e.g., non-overlapping state distributions or skill primitives). Without reported controls or ablation on task selection criteria, it remains possible that shared structure in the corpus favors multi-task training over single-task baselines.
minor comments (2)
  1. [Abstract] Abstract: provide one additional sentence summarizing the exact number of tasks, total demonstrations, and filtering criteria used in the pretraining corpus.
  2. [§5] Figure captions and §5: ensure all success-rate plots include the number of trials per condition and the exact statistical test employed.
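
To illustrate why the trial count matters when reading a success-rate plot, the sketch below computes a Wilson score interval for a binomial success rate. This is a standard textbook interval chosen here for illustration only; this page does not state which statistical test the paper uses, and the function name `wilson_interval` and the example counts are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to roughly 95% two-sided coverage.
    Returns (lower, upper) bounds on the underlying success rate.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p_hat + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials)
    ) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Same observed rate (80%), different trial counts (made-up numbers).
    for n in (10, 50, 200):
        low, high = wilson_interval(int(0.8 * n), n)
        print(f"n={n:4d}  interval=({low:.3f}, {high:.3f})")
```

The same observed success rate yields a much wider interval at 10 trials than at 200, which is why a plot without trial counts is hard to interpret.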

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. The concern about task independence is well-taken, and we address it directly below.

Point-by-point responses
  1. Referee: [§4] §4 (Evaluation Pipeline) and the task corpus description: the central claim that multi-task benefits are general requires explicit evidence that tasks are sufficiently independent (e.g., non-overlapping state distributions or skill primitives). Without reported controls or ablation on task selection criteria, it remains possible that shared structure in the corpus favors multi-task training over single-task baselines.

    Authors: We agree that stronger evidence of task independence would better support the generality claim. Our corpus comprises 20 tasks drawn from distinct sources (simulation and real-world) with varied objects, initial states, and skill primitives (e.g., in-hand reorientation, tool use, and bimanual coordination). Performance scaling with both dataset size and diversity (Figure 7) provides indirect support that gains are not solely due to overlap. Nevertheless, we will add to §4 an analysis of pairwise state-distribution distances (using maximum mean discrepancy on proprioceptive and visual features) across tasks, plus an ablation that removes the most similar task pairs and re-trains. These additions will appear in the revised manuscript.

    revision: yes
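
The maximum mean discrepancy mentioned in the simulated response can be sketched as follows. This is a minimal, pure-Python illustration of the biased (V-statistic) estimate of squared MMD with a Gaussian RBF kernel; the kernel choice, the fixed bandwidth, the function names, and the toy feature vectors are all assumptions for illustration and do not describe any analysis actually performed in the paper.

```python
import math


def rbf_kernel(x, y, bandwidth):
    """Gaussian RBF kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples.

    xs, ys: non-empty lists of feature vectors, e.g. per-timestep
    state features from two tasks (hypothetical inputs).
    """
    def mean_kernel(a, b):
        total = sum(rbf_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


if __name__ == "__main__":
    # Toy 2-D features standing in for states from three tasks.
    task_a = [(0.0, 0.0), (0.1, 0.0)]
    task_b = [(0.05, 0.0), (0.15, 0.0)]  # close to task_a
    task_c = [(3.0, 3.0), (3.1, 3.0)]    # far from task_a
    print(mmd_squared(task_a, task_b))   # small value: similar distributions
    print(mmd_squared(task_a, task_c))   # larger value: dissimilar distributions
```

A pairwise matrix of such values across tasks is one way to identify the "most similar task pairs" that the proposed ablation would remove; in practice the bandwidth would need to be chosen with care, since the estimate is sensitive to it.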

Circularity Check

0 steps flagged

No circularity: empirical comparison of trained policies with no derivations or self-referential reductions.

Full rationale

The paper performs direct empirical evaluation of multi-task vs. single-task policies using imitation learning on a corpus of simulated and real data, with blind randomized trials. No equations, fitted parameters, or derivations are presented that reduce reported performance gains (success rates, robustness, adaptation speed) to quantities defined by the paper's own inputs or self-citations. The central claims rest on experimental measurements rather than any self-definitional, fitted-input, or uniqueness-theorem structure. Self-citations (e.g., to Diffusion Policy) are external and not load-bearing for the comparison results. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard imitation learning assumptions and the validity of the experimental design rather than new free parameters, axioms, or invented entities.

axioms (1)
  • Domain assumption: the Diffusion Policy architecture can be extended to multitask pretraining while preserving its core learning properties.
    The paper builds directly on extending the Diffusion Policy paradigm to LBMs without additional justification in the abstract.

pith-pipeline@v0.9.0 · 6118 in / 1220 out tokens · 43791 ms · 2026-05-25T04:29:50.491496+00:00 · methodology


Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

    cs.RO 2026-04 conditional novelty 8.0

    Open-H-Embodiment is the largest open multi-embodiment medical robotics dataset, used to train GR00T-H, the first open vision-language-action model that achieves end-to-end suturing completion where prior models fail.

  2. π₀.₇: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

  3. Large Video Planner Enables Generalizable Robot Control

    cs.RO 2025-12 conditional novelty 7.0

    A video foundation model trained on human demonstrations generates zero-shot plans that convert to executable robot actions on novel scenes and tasks.

  4. Instrumentation for Imitation Learning: Enhancing Training Datasets for Clothes Hanger Insertion

    cs.RO 2026-05 unverdicted novelty 6.0

    Instrumented objects boost diffusion policy success in robotic hanger insertion by 14-25 percentage points over vision-only baselines, and augmenting datasets with instrumented expert rollouts lets a vision-only stude...

  5. Semantically Structured Mixture-of-Experts for Compositional Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    SMoDP routes action chunks in a diffusion policy to semantically specialized experts via a VLM-supervised skill predictor and dual contrastive alignment, achieving better efficiency and compositional transfer than baselines.

  6. Safe and Steerable Geometric Motion Policies for Robotic Dexterous Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    SafePBDS uses pullback control barrier functions and a task manifold action interface to generate certifiably safe, steerable motions on high-DOF robots from objectives defined on arbitrary geometric spaces.

  7. Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning

    cs.RO 2026-05 unverdicted novelty 6.0

    ZPRL adapts frozen flow-matching imitation policies via RL perturbations on a task-relevant bottleneck latent, yielding 33.7% higher average success on four real-world manipulation tasks than action-residual baselines.

  8. Distributionally Robust Control via Stein Variational Inference for Contact-Rich Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    Introduces a Stein variational inference-based deterministic formulation for distributionally robust control in contact-rich robotic manipulation, reporting up to 3x improved robustness under parametric uncertainty.

  9. From a Single Demonstration to a General Policy for Contact-Rich Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    A one-shot LfD framework abstracts a single demonstration into environmental-constraint primitives, then uses self-exploration, human corrections, and compliant recovery to produce a policy that generalizes across pos...

  10. WarmPrior: Straightening Flow-Matching Policies with Temporal Priors

    cs.LG 2026-05 unverdicted novelty 6.0

    Replacing Gaussian noise with a temporally grounded prior from recent actions straightens flow-matching paths and improves success rates in robotic manipulation and prior-space RL.

  11. Long-Horizon Manipulation via Trace-Conditioned VLA Planning

    cs.RO 2026-04 unverdicted novelty 6.0

    LoHo-Manip enables robust long-horizon robot manipulation by using a receding-horizon VLM manager to output progress-aware subtask sequences and 2D visual traces that condition a VLA executor for automatic replanning.

  12. A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies

    cs.RO 2026-04 unverdicted novelty 6.0

    Sim-and-real co-training for robot policies is driven primarily by balanced cross-domain representation alignment and secondarily by domain-dependent action reweighting.

  13. Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies

    cs.RO 2026-03 unverdicted novelty 6.0

    Q-DIG applies quality diversity optimization with vision-language models to generate diverse adversarial instructions that reveal VLA robot failures and enable robustness improvements via fine-tuning.

  14. HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations

    cs.RO 2026-03 unverdicted novelty 6.0

    HoMMI learns whole-body mobile manipulation policies from robot-free human demonstrations by augmenting UMI with egocentric sensing and bridging the embodiment gap through an agnostic visual representation, relaxed he...

  15. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  16. Learning Native Continuation for Action Chunking Flow Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    Legato trains flow-based VLA policies with schedule-shaped action-noise mixtures and randomized conditions to achieve smoother trajectories and ~10% faster task completion than real-time chunking across five real-worl...

  17. SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding

    cs.RO 2025-11 unverdicted novelty 6.0

    SPEAR-1 combines a 3D-enriched VLM with embodied control to match or exceed existing robotic foundation models using 20 times fewer robot demonstrations.

  18. Video Generators are Robot Policies

    cs.RO 2025-08 conditional novelty 6.0

    Training models to generate videos of robot actions produces policies that generalize better to new objects and tasks while using far less demonstration data than standard behavior cloning.

  19. VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 5.0

    VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.

  20. Causal World Modeling for Robot Control

    cs.CV 2026-01 unverdicted novelty 5.0

    LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.

  21. Learning Agile Striker Skills for Humanoid Soccer Robots from Noisy Sensory Input

    cs.RO 2025-12 conditional novelty 5.0

    A four-stage RL system with teacher-student distillation and online constrained adaptation enables humanoid robots to achieve robust ball-kicking accuracy under noisy perception in simulation and on physical hardware.

  22. Contact-Rich Robotic Assembly in Construction via Diffusion Policy Learning

    cs.RO 2025-11 unverdicted novelty 5.0

    Diffusion policies achieve 100% success on nominal mortise-tenon timber assembly and 75% average success under randomized 10 mm perturbations using force/torque sensing on an industrial robot.

  23. GR-3 Technical Report

    cs.RO 2025-07 unverdicted novelty 5.0

    GR-3 is a VLA model that generalizes to novel objects, environments, and abstract instructions, outperforms the π0 baseline, and integrates with the new ByteMini bi-manual mobile robot.

Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · cited by 23 Pith papers · 28 internal anchors

  1. [1]

    Diffusion policy: Visuomotor policy learning via action diffusion,

    C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” The International Journal of Robotics Research, 2024

  2. [2]

    Learning fine-grained bimanual manipulation with low-cost hardware,

    T. Z. Zhao, V. Kumar, S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hardware,” in Robotics: Science and Systems XIX . Robotics: Science and Systems Foundation, 2023. [Online]. Available: https: //roboticsproceedings.org/rss19/p078.pdf

  3. [3]

    Aloha unleashed: A simple recipe for robot dexterity,

    T. Z. Zhao, J. Tompson, D. Driess, P. Florence, K. Ghasemipour, C. Finn, and A. Wahid, “Aloha unleashed: A simple recipe for robot dexterity,” in8th Annual Conference on Robot Learning , 2024

  4. [4]

    Octo: An Open-Source Generalist Robot Policy

    O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, J. Luo, Y. L. Tan, L. Y. Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine, “Octo: An open-source generalist robot policy,” 2024. [Online]. Available: https://arxiv.org/abs/2405.12213

  5. [5]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky, “ 𝜋0: A vision-language-action flow model for general robot control,” 2024. [Online]. Availabl...

  6. [6]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    NVIDIA, :, J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. J. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y. L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y. Xie, Y. Xu, Z. Xu, S. Ye, Z. Yu...

  7. [7]

    Gemini Robotics: Bringing AI into the Physical World

    G. R. Team, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, T. Armstrong, A. Balakrishna, R. Baruch, M. Bauza, M. Blokzijl, S. Bohez, K. Bousmalis, A. Brohan, T. Buschmann, A. Byravan, S. Cabi, K. Caluwaerts, F. Casarini, O. Chang, J. E. Chen, X. Chen, H.-T. L. Chiang, K. Choromanski, D. D’ Ambrosio, S. Dasari, T. Davchev, C. Devin, N. D. Palo, T. ...

  8. [8]

    Scaling proprioceptive- visual learning with heterogeneous pre-trained transformers,

    L. Wang, X. Chen, J. Zhao, and K. He, “Scaling proprioceptive- visual learning with heterogeneous pre-trained transformers,” Advances in neural information processing systems , vol. 37, pp. 124 420–124 450, 2024

  9. [9]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu, “RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation,” Mar. 2025, arXiv:2410.07864 [cs]. [Online]. Available: http://arxiv.org/abs/2410.07864

  10. [10]

    Segment any- thing,

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Loet al., “Segment any- thing,” in2023 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2023, pp. 3992–4003

  11. [11]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al. , “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PmLR, 2021, pp. 8748–8763

  12. [12]

    Sigmoid loss for language image pre-training,

    X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” inProceedings of the IEEE/CVF international conference on computer vision , 2023, pp. 11 975– 11 986

  13. [13]

    Dinov2: Learning robust visual features without supervision,

    M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Noubyet al., “Dinov2: Learning robust visual features without supervision,” Transactions on Machine Learning Research Journal , pp. 1–31, 2024

  14. [14]

    GPT-4 Technical Report

    OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, R. Avila, I. Babuschkin, S. Balaji, V. Balcom, P. Baltescu, H. Bao, M. Bavarian, J. Belgum, I. Bello, J. Berdine, G. Bernadett-Shapiro, C. Berner, L. Bogdonoff, O. Boiko, M. Boyd, A.-L. Brakman, G. Brockman, T. Brooks, M. Brundag...

  15. [15]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azhar et al. , “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023. 19

  16. [16]

    Droid: A large-scale in-the-wild robot manipulation dataset,

    A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, P. D. Fagan, J. Hejna, M. Itkina, M. Lepert, Y. J. Ma, P. T. Miller, J. Wu, S. Belkhale, S. Dass, H. Ha, A. Jain, A. Lee, Y. Lee, M. Memmel, S. Park, I. Radosavovic, K. Wang, A. Zhan, K. Black, C. Chi, K. B. Hatch, S. Lin, J. Lu,...

  17. [17]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    E. Collaboration, A. O’Neill, A. Rehman, A. Gupta, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, A. Tung, A. Bewley, A. Herzog, A. Irpan, A. Khazatsky, A. Rai, A. Gupta, A. Wang, A. Kolobov, A. Singh, A. Garg, A. Kembhavi, A. Xie, A. Brohan, A. Raffin, A. Sharma, A. Yavary, A. Jain, A. Balakrishna, A. Wahid, B. B...

  18. [18]

    AgiBot World Colosseo: Large-scale Manipu- lation Platform for Scalable and Intelligent Embodied Systems

    T. AgiBot-World, “AgiBot World Colosseo: Large-scale Manipu- lation Platform for Scalable and Intelligent Embodied Systems.”

  19. [19]

    Openvla: An open-source vision-language-action model,

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn, “Openvla: An open-source vision-language-action model,” in 8th Annual Conference on Robot Learning

  20. [20]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Herzog, J. Hsu, B. Ichter, A. Irpan, N. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, L. Lee, T.-W. E. Lee, S. Levine, Y. Lu, H. Michalewski, I. Mordatch, K. Perts...

  21. [21]

    𝜋0.5: a vision-language-action model with open-world generalization,

    P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y. Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. Vu...

  22. [22]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    [Online]. Available: https://arxiv.org/abs/2504.16054

  23. [23]

    Magma: A foundation model for multimodal ai agents,

    J. Yang, R. Tan, Q. Wu, R. Zheng, B. Peng, Y. Liang, Y. Gu, M. Cai, S. Ye, J. Janget al., “Magma: A foundation model for multimodal ai agents,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 14 203–14 214

  24. [24]

    A generalist agent,

    S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-maron, M. Gim´enez, Y. Sulsky, J. Kay, J. T. Springenberg et al. , “A generalist agent,” Transactions on Machine Learning Research

  25. [25]

    Palm-e: An embodied multimodal language model,

    D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y. Chebotar, P. Sermanet, D. Duckworth, S. Levine, V. Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence, “Palm-e: An embodied multimodal language model,”

  26. [26]

    PaLM-E: An Embodied Multimodal Language Model

    [Online]. Available: https://arxiv.org/abs/2303.03378

  27. [27]

    Robotic Control via Embodied Chain-of-Thought Reasoning

    M. Zawalski, W. Chen, K. Pertsch, O. Mees, C. Finn, and S. Levine, “Robotic control via embodied chain-of-thought reasoning,” 2025. [Online]. Available: https://arxiv.org/abs/2407.08693

  28. [28]

    On the opportunities and risks of foundation models,

    R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. Chatterji, A. Chen, K. Creel, J. Q. Davis, D. Demszky, C. Donahue, M. Doumbouya, E. Durmus, S. Ermon, J. Etchemendy, K. Ethayarajh, L. Fei-Fei, C. Finn, T. Gale, L. Gillespie, K. Go...

  29. [29]

    On the Opportunities and Risks of Foundation Models

    [Online]. Available: https://arxiv.org/abs/2108.07258

  30. [30]

    Gemini Robotics: Bringing AI into the Physical World,

    G. R. Team, “Gemini Robotics: Bringing AI into the Physical World,” Tech. Rep., Mar. 2025. [Online]. Available: https://deepmind.google/discover/blog/ gemini-robotics-brings-ai-into-the-physical-world/

  31. [31]

    RT-1: Robotics Transformer for Real-World Control at Scale

    A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, 20 K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, T. Jackson, S. Jesmonth, N. J. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, K.-H. Lee, S. Levine, Y. Lu, U. Malla, D. Manjunath, I. Mordatch, O. Nachum, C. Parada, J. Peralta, E. Perez, K. Pertsch, J...

  32. [32]

    OpenVLA: An Open-Source Vision-Language-Action Model

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn, “OpenVLA: An Open-Source Vision-Language- Action Model,” Sep. 2024, arXiv:2406.09246 [cs]. [Online]. Available: http://arxiv.org/abs/2406.09246

  33. [33]

    Language models are few-shot learners,

    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 1877–1901

  34. [34]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshimaet al., “The pile: An 800gb dataset of diverse text for language modeling,” arXiv preprint arXiv:2101.00027, 2020

  35. [35]

    Documenting large webtext corpora: A case study on the colossal clean crawled corpus,

    J. Dodge, M. Sap, A. Marasovi ´c, W. Agnew, G. Ilharco, D. Groen- eveld, M. Mitchell, and M. Gardner, “Documenting large webtext corpora: A case study on the colossal clean crawled corpus,”arXiv preprint arXiv:2104.08758, 2021

  36. [36]

    Laion-5b: An open large-scale dataset for training next generation image-text models,

    C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman et al., “Laion-5b: An open large-scale dataset for training next generation image-text models,” Advances in neural information processing systems, vol. 35, pp. 25 278–25 294, 2022

  37. [37]

    LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    C. Schuhmann, R. Vencu, R. Beaumont, R. Kaczmarczyk, C. Mullis, A. Katta, T. Coombes, J. Jitsev, and A. Komatsuzaki, “Laion-400m: Open dataset of clip-filtered 400 million image-text pairs,” arXiv preprint arXiv:2111.02114, 2021

  38. [38]

    Visual instruction tuning,

    H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” in NeurIPS, 2023

  39. [39]

    Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets

    F. Ebert, Y. Yang, K. Schmeckpeper, B. Bucher, G. Georgakis, K. Daniilidis, C. Finn, and S. Levine, “Bridge data: Boosting generalization of robotic skills with cross-domain datasets,” 2021. [Online]. Available: https://arxiv.org/abs/2109.13396

  40. [40]

    Rh20t: A robotic dataset for learning diverse skills in one-shot,

    H.-S. Fang, H. Fang, Z. Tang, J. Liu, J. Wang, H. Zhu, and C. Lu, “Rh20t: A robotic dataset for learning diverse skills in one-shot,” in RSS 2023 Workshop on Learning for Task and Motion Planning , 2023

  41. [41]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    AgiBot-World-Contributors, Q. Bu, J. Cai, L. Chen, X. Cui, Y. Ding, S. Feng, S. Gao, X. He, X. Huang, S. Jiang, Y. Jiang, C. Jing, H. Li, J. Li, C. Liu, Y. Liu, Y. Lu, J. Luo, P. Luo, Y. Mu, Y. Niu, Y. Pan, J. Pang, Y. Qiao, G. Ren, C. Ruan, J. Shan, Y. Shen, C. Shi, M. Shi, M. Shi, C. Sima, J. Song, H. Wang, W. Wang, D. Wei, C. Xie, G. Xu, J. Yan, C. Yan...

  42. [42]

    Roboverse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learning,

    H. Geng, F. Wang, S. Wei, Y. Li, B. Wang, B. An, C. T. Cheng, H. Lou, P. Li, Y.-J. Wang, Y. Liang, D. Goetting, C. Xu, H. Chen, Y. Qian, Y. Geng, J. Mao, W. Wan, M. Zhang, J. Lyu, S. Zhao, J. Zhang, J. Zhang, C. Zhao, H. Lu, Y. Ding, R. Gong, Y. Wang, Y. Kuang, R. Wu, B. Jia, C. Sferrazza, H. Dong, S. Huang, K. Sreenath, Y. Wang, J. Malik, and P. Abbeel, ...

  43. [43]

    Orbit: A unified simulation framework for interactive robot learning environments,

    M. Mittal, C. Yu, Q. Yu, J. Liu, N. Rudin, D. Hoeller, J. L. Yuan, R. Singh, Y. Guo, H. Mazhar et al., “Orbit: A unified simulation framework for interactive robot learning environments,” IEEE Robotics and Automation Letters, vol. 8, no. 6, pp. 3740–3747, 2023

  44. [44]

    ManiSkill3: GPU Parallelized Robotics Simulation and Rendering for Generalizable Embodied AI

    S. Tao, F. Xiang, A. Shukla, Y. Qin, X. Hinrichsen, X. Yuan, C. Bao, X. Lin, Y. Liu, T. kai Chan, Y. Gao, X. Li, T. Mu, N. Xiao, A. Gurha, Z. Huang, R. Calandra, R. Chen, S. Luo, and H. Su, “Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai,” 2024. [Online]. Available: https://arxiv.org/abs/2410.00425

  45. [45]

    Robogen: Towards unleashing infinite data for automated robot learning via generative simulation,

    Y. Wang, Z. Xian, F. Chen, T.-H. Wang, Y. Wang, K. Fragkiadaki, Z. Erickson, D. Held, and C. Gan, “Robogen: Towards unleashing infinite data for automated robot learning via generative simulation,” arXiv preprint arXiv:2311.01455, 2023

  46. [46]

    RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

    S. Nasiriany, A. Maddukuri, L. Zhang, A. Parikh, A. Lo, A. Joshi, A. Mandlekar, and Y. Zhu, “Robocasa: Large-scale simulation of everyday tasks for generalist robots,” arXiv preprint arXiv:2406.02523, 2024

  47. [47]

    MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations

    A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y. Narang, L. Fan, Y. Zhu, and D. Fox, “Mimicgen: A data generation system for scalable robot learning using human demonstrations,” arXiv preprint arXiv:2310.17596, 2023

  48. [48]

    Rlbench: The robot learning benchmark & learning environment,

    S. James, Z. Ma, D. R. Arrojo, and A. J. Davison, “Rlbench: The robot learning benchmark & learning environment,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3019–3026, 2020

  49. [49]

    Empirical analysis of sim-and-real cotraining of diffusion policies for planar pushing from pixels,

    A. Wei, A. Agarwal, B. Chen, R. Bosworth, N. Pfaff, and R. Tedrake, “Empirical analysis of sim-and-real cotraining of diffusion policies for planar pushing from pixels,” 2025. [Online]. Available: https://arxiv.org/abs/2503.22634

  50. [50]

    Sim-and-real co-training: A simple recipe for vision-based robotic manipulation,

    A. Maddukuri, Z. Jiang, L. Y. Chen, S. Nasiriany, Y. Xie, Y. Fang, W. Huang, Z. Wang, Z. Xu, N. Chernyadev, S. Reed, K. Goldberg, A. Mandlekar, L. Fan, and Y. Zhu, “Sim-and-real co-training: A simple recipe for vision-based robotic manipulation,” 2025. [Online]. Available: https://arxiv.org/abs/2503.24361

  51. [51]

    Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots

    C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song, “Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots,” 2024. [Online]. Available: https://arxiv.org/abs/2402.10329

  52. [52]

    Legato: Cross-embodiment imitation using a grasping tool,

    M. Seo, H. A. Park, S. Yuan, Y. Zhu, and L. Sentis, “Legato: Cross-embodiment imitation using a grasping tool,” IEEE Robotics and Automation Letters, vol. 10, no. 3, pp. 2854–2861, Mar. 2025. [Online]. Available: http://dx.doi.org/10.1109/LRA.2025.3535182

  53. [53]

    Egomimic: Scaling imitation learning via egocentric video,

    S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu, “Egomimic: Scaling imitation learning via egocentric video,” 2024. [Online]. Available: https://arxiv.org/abs/2410.24221

  54. [54]

    Airexo: Low-cost exoskeletons for learning whole-arm manipulation in the wild,

    H. Fang, H.-S. Fang, Y. Wang, J. Ren, J. Chen, R. Zhang, W. Wang, and C. Lu, “Airexo: Low-cost exoskeletons for learning whole-arm manipulation in the wild,” 2024. [Online]. Available: https://arxiv.org/abs/2309.14975

  55. [55]

    Robot learning as an empirical science: Best practices for policy evaluation,

    H. Kress-Gazit, K. Hashimoto, N. Kuppuswamy, P. Shah, P. Horgan, G. Richardson, S. Feng, and B. Burchfiel, “Robot learning as an empirical science: Best practices for policy evaluation,” arXiv preprint arXiv:2409.09491, 2024

  56. [56]

    Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations,

    T. Mu, Z. Ling, F. Xiang, D. Yang, X. Li, S. Tao, Z. Huang, Z. Jia, and H. Su, “Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations,” 2021. [Online]. Available: https://arxiv.org/abs/2107.14483

  57. [57]

    Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning,

    T. Yu, D. Quillen, Z. He, R. Julian, A. Narayan, H. Shively, A. Bellathur, K. Hausman, C. Finn, and S. Levine, “Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning,” 2021. [Online]. Available: https://arxiv.org/abs/1910.10897

  58. [58]

    robosuite: A Modular Simulation Framework and Benchmark for Robot Learning

    Y. Zhu, J. Wong, A. Mandlekar, R. Martín-Martín, A. Joshi, S. Nasiriany, Y. Zhu, and K. Lin, “robosuite: A modular simulation framework and benchmark for robot learning,” arXiv preprint arXiv:2009.12293, 2020

  59. [59]

    Behavior: Benchmark for everyday household activities in virtual, interactive, and ecological environments,

    S. Srivastava, C. Li, M. Lingelbach, R. Martín-Martín, F. Xia, K. Vainio, Z. Lian, C. Gokmen, S. Buch, C. K. Liu, S. Savarese, H. Gweon, J. Wu, and L. Fei-Fei, “Behavior: Benchmark for everyday household activities in virtual, interactive, and ecological environments,” 2021. [Online]. Available: https://arxiv.org/abs/2108.03332

  60. [60]

    Robothor: An open simulation-to-real embodied ai platform,

    M. Deitke, W. Han, A. Herrasti, A. Kembhavi, E. Kolve, R. Mottaghi, J. Salvador, D. Schwenk, E. VanderBilt, M. Wallingford, L. Weihs, M. Yatskar, and A. Farhadi, “Robothor: An open simulation-to-real embodied ai platform,” 2020. [Online]. Available: https://arxiv.org/abs/2004.06799

  61. [61]

    Sim2real predictivity: Does evaluation in simulation predict real-world performance?

    A. Kadian, J. Truong, A. Gokaslan, A. Clegg, E. Wijmans, S. Lee, M. Savva, S. Chernova, and D. Batra, “Sim2real predictivity: Does evaluation in simulation predict real-world performance?” IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 6670–6677, Oct. 2020. [Online]. Available: http://dx.doi.org/10.1109/LRA.2020.3013848

  62. [62]

    VR-Goggles for Robots: Real-to-sim Domain Adaptation for Visual Control

    J. Zhang, L. Tai, P. Yun, Y. Xiong, M. Liu, J. Boedecker, and W. Burgard, “Vr-goggles for robots: Real-to-sim domain adaptation for visual control,” 2019. [Online]. Available: https://arxiv.org/abs/1802.00265

  63. [63]

    Evaluating Real-World Robot Manipulation Policies in Simulation

    X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kirmani, S. Levine, J. Wu, C. Finn, H. Su, Q. Vuong, and T. Xiao, “Evaluating real-world robot manipulation policies in simulation,” arXiv preprint arXiv:2405.05941, 2024

  64. [64]

    Asid: Active exploration for system identification in robotic manipulation,

    M. Memmel, A. Wagenmaker, C. Zhu, P. Yin, D. Fox, and A. Gupta, “Asid: Active exploration for system identification in robotic manipulation,” 2024. [Online]. Available: https://arxiv.org/abs/2404.12308

  65. [65]

    Scalable real2sim: Physics-aware asset generation via robotic pick-and-place setups,

    N. Pfaff, E. Fu, J. Binagia, P. Isola, and R. Tedrake, “Scalable real2sim: Physics-aware asset generation via robotic pick-and-place setups,” 2025. [Online]. Available: https://arxiv.org/abs/2503.00370

  66. [66]

    Rb2: Robotic manipulation benchmarking with a twist,

    S. Dasari, J. Wang, J. Hong, S. Bahl, Y. Lin, A. Wang, A. Thankaraj, K. Chahal, B. Calli, S. Gupta, D. Held, L. Pinto, D. Pathak, V. Kumar, and A. Gupta, “Rb2: Robotic manipulation benchmarking with a twist,” 2022. [Online]. Available: https://arxiv.org/abs/2203.08098

  67. [67]

    Benchmarking cluttered robot pick-and-place manipulation with the box and blocks test,

    A. S. Morgan, K. Hang, W. G. Bircher, F. M. Alladkani, A. Gandhi, B. Calli, and A. M. Dollar, “Benchmarking cluttered robot pick-and-place manipulation with the box and blocks test,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 454–461, 2019

  68. [68]

    Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation,

    M. Heo, Y. Lee, D. Lee, and J. J. Lim, “Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation,” The International Journal of Robotics Research, p. 02783649241304789, 2023

  69. [69]

    Benchmarking protocols for evaluating small parts robotic assembly systems,

    K. Kimble, K. Van Wyk, J. Falco, E. Messina, Y. Sun, M. Shibata, W. Uemura, and Y. Yokokohji, “Benchmarking protocols for evaluating small parts robotic assembly systems,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 883–889, 2020

  70. [70]

    Scenereplica: Benchmarking real-world robot manipulation by creating replicable scenes,

    N. Khargonkar, S. H. Allu, Y. Lu, B. Prabhakaran, Y. Xiang et al., “Scenereplica: Benchmarking real-world robot manipulation by creating replicable scenes,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 8258–8264

  71. [71]

    Benchmarking protocol for grasp planning algorithms,

    Y. Bekiroglu, N. Marturi, M. A. Roa, K. J. M. Adjigble, T. Pardi, C. Grimm, R. Balasubramanian, K. Hang, and R. Stolkin, “Benchmarking protocol for grasp planning algorithms,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 315–322, 2019

  72. [72]

    Graspa 1.0: Graspa is a robot arm grasping performance benchmark,

    F. Bottarel, G. Vezzani, U. Pattacini, and L. Natale, “Graspa 1.0: Graspa is a robot arm grasping performance benchmark,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 836–843, 2020

  73. [73]

    Benchmark for bimanual robotic manipulation of semi-deformable objects,

    K. Chatzilygeroudis, B. Fichera, I. Lauzana, F. Bu, K. Yao, F. Khadivar, and A. Billard, “Benchmark for bimanual robotic manipulation of semi-deformable objects,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 2443–2450, 2020

  74. [74]

    Ocrtoc: A cloud-based competition and benchmark for robotic grasping and manipulation,

    Z. Liu, W. Liu, Y. Qin, F. Xiang, M. Gou, S. Xin, M. A. Roa, B. Calli, H. Su, Y. Sun et al., “Ocrtoc: A cloud-based competition and benchmark for robotic grasping and manipulation,” IEEE Robotics and Automation Letters, vol. 7, no. 1, pp. 486–493, 2021

  75. [75]

    Real robot challenge: A robotics competition in the cloud,

    S. Bauer, M. Wüthrich, F. Widmaier, A. Buchholz, S. Stark, A. Goyal, T. Steinbrenner, J. Akpo, S. Joshi, V. Berenz et al., “Real robot challenge: A robotics competition in the cloud,” in NeurIPS 2021 Competitions and Demonstrations Track. PMLR, 2022, pp. 190–204

  76. [76]

    Train offline, test online: A real robot learning benchmark,

    G. Zhou, V. Dean, M. K. Srirama, A. Rajeswaran, J. Pari, K. Hatch, A. Jain, T. Yu, P. Abbeel, L. Pinto et al., “Train offline, test online: A real robot learning benchmark,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 9197–9203

  77. [77]

    Autoeval: Autonomous evaluation of generalist robot manipulation policies in the real world,

    Z. Zhou, P. Atreya, Y. L. Tan, K. Pertsch, and S. Levine, “Autoeval: Autonomous evaluation of generalist robot manipulation policies in the real world,” 2025. [Online]. Available: https://arxiv.org/abs/2503.24278

  78. [78]

    Is your imitation learning policy better than mine? policy comparison with near-optimal stopping,

    D. Snyder, A. J. Hancock, A. Badithela, E. Dixon, P. Miller, R. A. Ambrus, A. Majumdar, M. Itkina, and H. Nishimura, “Is your imitation learning policy better than mine? policy comparison with near-optimal stopping,” arXiv preprint arXiv:2503.10966, 2025

  79. [79]

    Deep reinforcement learning at the edge of the statistical precipice,

    R. Agarwal, M. Schwarzer, P. S. Castro, A. C. Courville, and M. Bellemare, “Deep reinforcement learning at the edge of the statistical precipice,” Advances in Neural Information Processing Systems, vol. 34, pp. 29304–29320, 2021

  80. [80]

    Statistical tests, p values, confidence intervals, and power: a guide to misinterpretations,

    S. Greenland, S. J. Senn, K. J. Rothman, J. B. Carlin, C. Poole, S. N. Goodman, and D. G. Altman, “Statistical tests, p values, confidence intervals, and power: a guide to misinterpretations,” European Journal of Epidemiology, vol. 31, no. 4, pp. 337–350, 2016

Showing first 80 references.