A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation
Pith reviewed 2026-05-25 04:29 UTC · model grok-4.3
The pith
Multi-task pretraining makes robot policies more successful, robust, and data-efficient than single-task training for dexterous manipulation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Multi-task pretraining on a corpus of robot data produces policies that are more successful and robust than single-task policies, allow quicker teaching of new complex tasks with a fraction of the data, and show performance that improves predictably with greater pretraining scale and diversity.
What carries the argument
An evaluation pipeline that analyzes multitask policies with statistical confidence through blind, randomized trials in both simulation and real-world experiments.
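As a rough illustration of what "blind, randomized" can mean in practice, the sketch below hides policy identities behind opaque labels and shuffles the trial order. It is not the paper's protocol, which this page does not describe. The function name `blinded_schedule`, the policy names, and the integer initial conditions are all hypothetical.

```python
import random

def blinded_schedule(policies, initial_conditions, seed=0):
    """Return (schedule, key): shuffled blinded trials and the unblinding key.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    rng = random.Random(seed)
    # Opaque labels keep the operator from knowing which policy is running.
    labels = [f"policy_{i}" for i in range(len(policies))]
    rng.shuffle(labels)
    key = dict(zip(labels, policies))
    # Every policy is run on every initial condition, in random order.
    schedule = [(label, cond) for label in labels for cond in initial_conditions]
    rng.shuffle(schedule)
    return schedule, key

# Assumed example: two policies, five initial conditions each.
schedule, key = blinded_schedule(["single_task", "lbm_finetuned"], range(5), seed=42)
for label, cond in schedule[:3]:
    print(label, cond)
```

The key would be kept away from the operator until all outcomes are recorded, so success judgments cannot be influenced by knowing which policy is under test.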
If this is right
- Multi-task policies achieve higher success rates and greater robustness than single-task baselines across the evaluated tasks.
- New tasks can be taught with substantially less data when starting from a multi-task pretrained model.
- Performance gains continue in a predictable manner as pretraining data volume and task diversity increase.
- The advantages appear in both simulation and real-world blind trials.
Where Pith is reading between the lines
- The same scaling pattern may apply to other robot learning settings that currently rely on task-specific training.
- Collecting larger and more varied robot datasets could accelerate progress toward general manipulation capabilities.
- The results motivate experiments that combine these behavior models with language or vision inputs for further gains.
- Future work could test whether the observed data-efficiency benefits persist when the new task lies far outside the pretraining distribution.
Load-bearing premise
The selected tasks and data composition give an unbiased test of general multitask benefits that would apply to other dexterous manipulation problems.
What would settle it
A follow-up study that trains and tests single-task and multi-task policies on a fresh corpus of tasks chosen without regard to the original data distribution and finds no advantage for multi-task pretraining.
Original abstract
Robot manipulation has seen tremendous progress in recent years, with imitation learning policies enabling successful performance of dexterous and hard-to-model tasks. Concurrently, scaling data and model size has led to the development of capable language and vision foundation models, motivating large-scale efforts to create general-purpose robot foundation models. While these models have garnered significant enthusiasm and investment, meaningful evaluation of real-world performance remains a challenge, limiting both the pace of development and inhibiting a nuanced understanding of current capabilities. In this paper, we rigorously evaluate multitask robot manipulation policies, referred to as Large Behavior Models (LBMs), by extending the Diffusion Policy paradigm across a corpus of simulated and real-world robot data. We propose and validate an evaluation pipeline to rigorously analyze the capabilities of these models with statistical confidence. We compare against single-task baselines through blind, randomized trials in a controlled setting, using both simulation and real-world experiments. We find that multi-task pretraining makes the policies more successful and robust, and enables teaching complex new tasks more quickly, using a fraction of the data when compared to single-task baselines. Moreover, performance predictably increases as pretraining scale and diversity grows. Project page: https://toyotaresearchinstitute.github.io/lbm1/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper extends the Diffusion Policy framework to train Large Behavior Models (LBMs) via multi-task pretraining on a corpus of simulated and real-world dexterous manipulation data. Through blind randomized trials with statistical analysis, it claims that multi-task pretraining yields higher success rates and robustness than single-task baselines, enables faster adaptation to novel tasks with substantially less data, and exhibits predictable performance gains as pretraining scale and diversity increase.
Significance. If the empirical claims hold, the work supplies statistically grounded evidence that multi-task pretraining confers concrete advantages in sample efficiency and robustness for robot manipulation policies. The use of blind randomized trials and controlled real-world experiments is a notable strength that reduces experimenter bias and supports reproducibility in the field.
major comments (1)
- [§4] §4 (Evaluation Pipeline) and the task corpus description: the central claim that multi-task benefits are general requires explicit evidence that tasks are sufficiently independent (e.g., non-overlapping state distributions or skill primitives). Without reported controls or ablation on task selection criteria, it remains possible that shared structure in the corpus favors multi-task training over single-task baselines.
minor comments (2)
- [Abstract] Abstract: provide one additional sentence summarizing the exact number of tasks, total demonstrations, and filtering criteria used in the pretraining corpus.
- [§5] Figure captions and §5: ensure all success-rate plots include the number of trials per condition and the exact statistical test employed.
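To make the last request concrete, here is a minimal sketch of one way to attach an interval to a success rate given the number of trials. It uses the Wilson score interval, which is a common choice for binomial proportions. The report does not say which test or interval the paper uses, and the counts below are made up for illustration.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 ~ 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials))
    ) / denom
    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)

# Hypothetical counts: 40 successes out of 50 trials.
low, high = wilson_interval(40, 50)
print(f"success rate 0.80, 95% interval [{low:.3f}, {high:.3f}]")
```

Reporting the trial count next to each plotted rate lets a reader recompute an interval like this and judge whether two conditions are plausibly different.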
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for minor revision. The concern about task independence is well-taken, and we address it directly below.
Point-by-point responses
Referee: [§4] §4 (Evaluation Pipeline) and the task corpus description: the central claim that multi-task benefits are general requires explicit evidence that tasks are sufficiently independent (e.g., non-overlapping state distributions or skill primitives). Without reported controls or ablation on task selection criteria, it remains possible that shared structure in the corpus favors multi-task training over single-task baselines.
Authors: We agree that stronger evidence of task independence would better support the generality claim. Our corpus comprises 20 tasks drawn from distinct sources (simulation and real-world) with varied objects, initial states, and skill primitives (e.g., in-hand reorientation, tool use, and bimanual coordination). Performance scaling with both dataset size and diversity (Figure 7) provides indirect support that gains are not solely due to overlap. Nevertheless, we will add to §4 an analysis of pairwise state-distribution distances (using maximum mean discrepancy on proprioceptive and visual features) across tasks, plus an ablation that removes the most similar task pairs and re-trains. These additions will appear in the revised manuscript.
Revision: yes
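The pairwise distance analysis proposed in the response could look roughly like the sketch below, which estimates squared maximum mean discrepancy (MMD) between two sets of feature vectors. The response names MMD but not a kernel, estimator, or bandwidth. The Gaussian (RBF) kernel, the biased V-statistic estimator, the bandwidth of 1.0, and the toy 2-D features are all assumptions made here for illustration.

```python
import math

def rbf_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples xs and ys."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)

# Toy stand-ins for per-task features (hypothetical values).
task_a = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05)]
task_b = [(0.1, 0.1), (0.0, 0.0), (0.05, 0.1)]   # close to task_a
task_c = [(2.0, 2.1), (2.1, 2.0), (1.9, 2.0)]    # far from task_a

print("a vs b:", mmd_squared(task_a, task_b))
print("a vs c:", mmd_squared(task_a, task_c))
```

Under these assumptions, task pairs with small values would be the natural candidates for the proposed ablation that removes the most similar pairs. Real proprioceptive and visual features would be high-dimensional, so the bandwidth would need tuning, for example with the median pairwise distance heuristic.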
Circularity Check
No circularity: empirical comparison of trained policies with no derivations or self-referential reductions.
The paper performs direct empirical evaluation of multi-task vs. single-task policies using imitation learning on a corpus of simulated and real data, with blind randomized trials. No equations, fitted parameters, or derivations are presented that reduce reported performance gains (success rates, robustness, adaptation speed) to quantities defined by the paper's own inputs or self-citations. The central claims rest on experimental measurements rather than any self-definitional, fitted-input, or uniqueness-theorem structure. Self-citations (e.g., to Diffusion Policy) are external and not load-bearing for the comparison results. This matches the default expectation for non-circular empirical work.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The diffusion policy architecture can be extended to multitask pretraining while preserving its core learning properties.
Forward citations
Cited by 23 Pith papers
- Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics
  Open-H-Embodiment is the largest open multi-embodiment medical robotics dataset, used to train GR00T-H, the first open vision-language-action model that achieves end-to-end suturing completion where prior models fail.
- ${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
  π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
- Large Video Planner Enables Generalizable Robot Control
  A video foundation model trained on human demonstrations generates zero-shot plans that convert to executable robot actions on novel scenes and tasks.
- Instrumentation for Imitation Learning: Enhancing Training Datasets for Clothes Hanger Insertion
  Instrumented objects boost diffusion policy success in robotic hanger insertion by 14-25 percentage points over vision-only baselines, and augmenting datasets with instrumented expert rollouts lets a vision-only stude...
- Semantically Structured Mixture-of-Experts for Compositional Robotic Manipulation
  SMoDP routes action chunks in a diffusion policy to semantically specialized experts via a VLM-supervised skill predictor and dual contrastive alignment, achieving better efficiency and compositional transfer than baselines.
- Safe and Steerable Geometric Motion Policies for Robotic Dexterous Manipulation
  SafePBDS uses pullback control barrier functions and a task manifold action interface to generate certifiably safe, steerable motions on high-DOF robots from objectives defined on arbitrary geometric spaces.
- Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning
  ZPRL adapts frozen flow-matching imitation policies via RL perturbations on a task-relevant bottleneck latent, yielding 33.7% higher average success on four real-world manipulation tasks than action-residual baselines.
- Distributionally Robust Control via Stein Variational Inference for Contact-Rich Manipulation
  Introduces a Stein variational inference-based deterministic formulation for distributionally robust control in contact-rich robotic manipulation, reporting up to 3x improved robustness under parametric uncertainty.
- From a Single Demonstration to a General Policy for Contact-Rich Manipulation
  A one-shot LfD framework abstracts a single demonstration into environmental-constraint primitives, then uses self-exploration, human corrections, and compliant recovery to produce a policy that generalizes across pos...
- WarmPrior: Straightening Flow-Matching Policies with Temporal Priors
  Replacing Gaussian noise with a temporally grounded prior from recent actions straightens flow-matching paths and improves success rates in robotic manipulation and prior-space RL.
- Long-Horizon Manipulation via Trace-Conditioned VLA Planning
  LoHo-Manip enables robust long-horizon robot manipulation by using a receding-horizon VLM manager to output progress-aware subtask sequences and 2D visual traces that condition a VLA executor for automatic replanning.
- A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies
  Sim-and-real co-training for robot policies is driven primarily by balanced cross-domain representation alignment and secondarily by domain-dependent action reweighting.
- Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies
  Q-DIG applies quality diversity optimization with vision-language models to generate diverse adversarial instructions that reveal VLA robot failures and enable robustness improvements via fine-tuning.
- HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations
  HoMMI learns whole-body mobile manipulation policies from robot-free human demonstrations by augmenting UMI with egocentric sensing and bridging the embodiment gap through an agnostic visual representation, relaxed he...
- World Action Models are Zero-shot Policies
  DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
- Learning Native Continuation for Action Chunking Flow Policies
  Legato trains flow-based VLA policies with schedule-shaped action-noise mixtures and randomized conditions to achieve smoother trajectories and ~10% faster task completion than real-time chunking across five real-worl...
- SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding
  SPEAR-1 combines a 3D-enriched VLM with embodied control to match or exceed existing robotic foundation models using 20 times fewer robot demonstrations.
- Video Generators are Robot Policies
  Training models to generate videos of robot actions produces policies that generalize better to new objects and tasks while using far less demonstration data than standard behavior cloning.
- VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
  VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.
- Causal World Modeling for Robot Control
  LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.
- Learning Agile Striker Skills for Humanoid Soccer Robots from Noisy Sensory Input
  A four-stage RL system with teacher-student distillation and online constrained adaptation enables humanoid robots to achieve robust ball-kicking accuracy under noisy perception in simulation and on physical hardware.
- Contact-Rich Robotic Assembly in Construction via Diffusion Policy Learning
  Diffusion policies achieve 100% success on nominal mortise-tenon timber assembly and 75% average success under randomized 10 mm perturbations using force/torque sensing on an industrial robot.
- GR-3 Technical Report
  GR-3 is a VLA model that generalizes to novel objects, environments, and abstract instructions, outperforms the π0 baseline, and integrates with the new ByteMini bi-manual mobile robot.
Reference graph
Works this paper leans on
-
[1]
Diffusion policy: Visuomotor policy learning via action diffusion,
C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” The International Journal of Robotics Research, 2024
work page 2024
-
[2]
Learning fine-grained bimanual manipulation with low-cost hardware,
T. Z. Zhao, V. Kumar, S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hardware,” in Robotics: Science and Systems XIX . Robotics: Science and Systems Foundation, 2023. [Online]. Available: https: //roboticsproceedings.org/rss19/p078.pdf
work page 2023
-
[3]
Aloha unleashed: A simple recipe for robot dexterity,
T. Z. Zhao, J. Tompson, D. Driess, P. Florence, K. Ghasemipour, C. Finn, and A. Wahid, “Aloha unleashed: A simple recipe for robot dexterity,” in8th Annual Conference on Robot Learning , 2024
work page 2024
-
[4]
Octo: An Open-Source Generalist Robot Policy
O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, J. Luo, Y. L. Tan, L. Y. Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine, “Octo: An open-source generalist robot policy,” 2024. [Online]. Available: https://arxiv.org/abs/2405.12213
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky, “ 𝜋0: A vision-language-action flow model for general robot control,” 2024. [Online]. Availabl...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[6]
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
NVIDIA, :, J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. J. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y. L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y. Xie, Y. Xu, Z. Xu, S. Ye, Z. Yu...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
Gemini Robotics: Bringing AI into the Physical World
G. R. Team, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, T. Armstrong, A. Balakrishna, R. Baruch, M. Bauza, M. Blokzijl, S. Bohez, K. Bousmalis, A. Brohan, T. Buschmann, A. Byravan, S. Cabi, K. Caluwaerts, F. Casarini, O. Chang, J. E. Chen, X. Chen, H.-T. L. Chiang, K. Choromanski, D. D’ Ambrosio, S. Dasari, T. Davchev, C. Devin, N. D. Palo, T. ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
Scaling proprioceptive- visual learning with heterogeneous pre-trained transformers,
L. Wang, X. Chen, J. Zhao, and K. He, “Scaling proprioceptive- visual learning with heterogeneous pre-trained transformers,” Advances in neural information processing systems , vol. 37, pp. 124 420–124 450, 2024
work page 2024
-
[9]
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu, “RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation,” Mar. 2025, arXiv:2410.07864 [cs]. [Online]. Available: http://arxiv.org/abs/2410.07864
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[10]
A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Loet al., “Segment any- thing,” in2023 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2023, pp. 3992–4003
work page 2023
-
[11]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al. , “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PmLR, 2021, pp. 8748–8763
work page 2021
-
[12]
Sigmoid loss for language image pre-training,
X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” inProceedings of the IEEE/CVF international conference on computer vision , 2023, pp. 11 975– 11 986
work page 2023
-
[13]
Dinov2: Learning robust visual features without supervision,
M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Noubyet al., “Dinov2: Learning robust visual features without supervision,” Transactions on Machine Learning Research Journal , pp. 1–31, 2024
work page 2024
-
[14]
OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, R. Avila, I. Babuschkin, S. Balaji, V. Balcom, P. Baltescu, H. Bao, M. Bavarian, J. Belgum, I. Bello, J. Berdine, G. Bernadett-Shapiro, C. Berner, L. Bogdonoff, O. Boiko, M. Boyd, A.-L. Brakman, G. Brockman, T. Brooks, M. Brundag...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[15]
LLaMA: Open and Efficient Foundation Language Models
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azhar et al. , “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023. 19
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[16]
Droid: A large-scale in-the-wild robot manipulation dataset,
A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, P. D. Fagan, J. Hejna, M. Itkina, M. Lepert, Y. J. Ma, P. T. Miller, J. Wu, S. Belkhale, S. Dass, H. Ha, A. Jain, A. Lee, Y. Lee, M. Memmel, S. Park, I. Radosavovic, K. Wang, A. Zhan, K. Black, C. Chi, K. B. Hatch, S. Lin, J. Lu,...
work page 2024
-
[17]
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
E. Collaboration, A. O’Neill, A. Rehman, A. Gupta, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, A. Tung, A. Bewley, A. Herzog, A. Irpan, A. Khazatsky, A. Rai, A. Gupta, A. Wang, A. Kolobov, A. Singh, A. Garg, A. Kembhavi, A. Xie, A. Brohan, A. Raffin, A. Sharma, A. Yavary, A. Jain, A. Balakrishna, A. Wahid, B. B...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[18]
T. AgiBot-World, “AgiBot World Colosseo: Large-scale Manipu- lation Platform for Scalable and Intelligent Embodied Systems.”
-
[19]
Openvla: An open-source vision-language-action model,
M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn, “Openvla: An open-source vision-language-action model,” in 8th Annual Conference on Robot Learning
-
[20]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Herzog, J. Hsu, B. Ichter, A. Irpan, N. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, L. Lee, T.-W. E. Lee, S. Levine, Y. Lu, H. Michalewski, I. Mordatch, K. Perts...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[21]
𝜋0.5: a vision-language-action model with open-world generalization,
P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y. Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. Vu...
-
[22]
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
[Online]. Available: https://arxiv.org/abs/2504.16054
work page internal anchor Pith review Pith/arXiv arXiv
-
[23]
Magma: A foundation model for multimodal ai agents,
J. Yang, R. Tan, Q. Wu, R. Zheng, B. Peng, Y. Liang, Y. Gu, M. Cai, S. Ye, J. Janget al., “Magma: A foundation model for multimodal ai agents,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 14 203–14 214
work page 2025
-
[24]
S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-maron, M. Gim´enez, Y. Sulsky, J. Kay, J. T. Springenberg et al. , “A generalist agent,” Transactions on Machine Learning Research
-
[25]
Palm-e: An embodied multimodal language model,
D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y. Chebotar, P. Sermanet, D. Duckworth, S. Levine, V. Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence, “Palm-e: An embodied multimodal language model,”
-
[26]
PaLM-E: An Embodied Multimodal Language Model
[Online]. Available: https://arxiv.org/abs/2303.03378
work page internal anchor Pith review Pith/arXiv arXiv
-
[27]
Robotic Control via Embodied Chain-of-Thought Reasoning
M. Zawalski, W. Chen, K. Pertsch, O. Mees, C. Finn, and S. Levine, “Robotic control via embodied chain-of-thought reasoning,” 2025. [Online]. Available: https://arxiv.org/abs/2407.08693
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[28]
On the opportunities and risks of foundation models,
R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. Chatterji, A. Chen, K. Creel, J. Q. Davis, D. Demszky, C. Donahue, M. Doumbouya, E. Durmus, S. Ermon, J. Etchemendy, K. Ethayarajh, L. Fei-Fei, C. Finn, T. Gale, L. Gillespie, K. Go...
-
[29]
On the Opportunities and Risks of Foundation Models
[Online]. Available: https://arxiv.org/abs/2108.07258
work page internal anchor Pith review Pith/arXiv arXiv
-
[30]
Gemini Robotics: Bringing AI into the Physical World,
G. R. Team, “Gemini Robotics: Bringing AI into the Physical World,” Tech. Rep., Mar. 2025. [Online]. Available: https://deepmind.google/discover/blog/ gemini-robotics-brings-ai-into-the-physical-world/
work page 2025
-
[31]
RT-1: Robotics Transformer for Real-World Control at Scale
A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, 20 K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, T. Jackson, S. Jesmonth, N. J. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, K.-H. Lee, S. Levine, Y. Lu, U. Malla, D. Manjunath, I. Mordatch, O. Nachum, C. Parada, J. Peralta, E. Perez, K. Pertsch, J...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[32]
OpenVLA: An Open-Source Vision-Language-Action Model
M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn, “OpenVLA: An Open-Source Vision-Language- Action Model,” Sep. 2024, arXiv:2406.09246 [cs]. [Online]. Available: http://arxiv.org/abs/2406.09246
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[33]
Language models are few-shot learners,
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 1877–1901
work page 2020
-
[34]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshimaet al., “The pile: An 800gb dataset of diverse text for language modeling,” arXiv preprint arXiv:2101.00027, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[35]
Documenting large webtext corpora: A case study on the colossal clean crawled corpus,
J. Dodge, M. Sap, A. Marasovi ´c, W. Agnew, G. Ilharco, D. Groen- eveld, M. Mitchell, and M. Gardner, “Documenting large webtext corpora: A case study on the colossal clean crawled corpus,”arXiv preprint arXiv:2104.08758, 2021
-
[36]
Laion-5b: An open large-scale dataset for training next generation image-text models,
C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman et al., “Laion-5b: An open large-scale dataset for training next generation image-text models,” Advances in neural information processing systems, vol. 35, pp. 25 278–25 294, 2022
work page 2022
-
[37]
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
C. Schuhmann, R. Vencu, R. Beaumont, R. Kaczmarczyk, C. Mullis, A. Katta, T. Coombes, J. Jitsev, and A. Komatsuzaki, “Laion-400m: Open dataset of clip-filtered 400 million image-text pairs,” arXiv preprint arXiv:2111.02114, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[38]
H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” in NeurIPS, 2023
work page 2023
-
[39]
Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets
F. Ebert, Y. Yang, K. Schmeckpeper, B. Bucher, G. Georgakis, K. Daniilidis, C. Finn, and S. Levine, “Bridge data: Boosting generalization of robotic skills with cross-domain datasets,” 2021. [Online]. Available: https://arxiv.org/abs/2109.13396
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[40]
Rh20t: A robotic dataset for learning diverse skills in one-shot,
H.-S. Fang, H. Fang, Z. Tang, J. Liu, J. Wang, H. Zhu, and C. Lu, “Rh20t: A robotic dataset for learning diverse skills in one-shot,” in RSS 2023 Workshop on Learning for Task and Motion Planning , 2023
work page 2023
-
[41]
AgiBot-World-Contributors, Q. Bu, J. Cai, L. Chen, X. Cui, Y. Ding, S. Feng, S. Gao, X. He, X. Huang, S. Jiang, Y. Jiang, C. Jing, H. Li, J. Li, C. Liu, Y. Liu, Y. Lu, J. Luo, P. Luo, Y. Mu, Y. Niu, Y. Pan, J. Pang, Y. Qiao, G. Ren, C. Ruan, J. Shan, Y. Shen, C. Shi, M. Shi, M. Shi, C. Sima, J. Song, H. Wang, W. Wang, D. Wei, C. Xie, G. Xu, J. Yan, C. Yan...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[42]
H. Geng, F. Wang, S. Wei, Y. Li, B. Wang, B. An, C. T. Cheng, H. Lou, P. Li, Y.-J. Wang, Y. Liang, D. Goetting, C. Xu, H. Chen, Y. Qian, Y. Geng, J. Mao, W. Wan, M. Zhang, J. Lyu, S. Zhao, J. Zhang, J. Zhang, C. Zhao, H. Lu, Y. Ding, R. Gong, Y. Wang, Y. Kuang, R. Wu, B. Jia, C. Sferrazza, H. Dong, S. Huang, K. Sreenath, Y. Wang, J. Malik, and P. Abbeel, ...
work page 2025
-
[43]
Orbit: A unified simulation framework for interactive robot learning environments,
M. Mittal, C. Yu, Q. Yu, J. Liu, N. Rudin, D. Hoeller, J. L. Yuan, R. Singh, Y. Guo, H. Mazhar et al., “Orbit: A unified simulation framework for interactive robot learning environments,” IEEE Robotics and Automation Letters , vol. 8, no. 6, pp. 3740–3747, 2023
work page 2023
-
[44]
arXiv preprint arXiv:2410.00425 (2024)
S. Tao, F. Xiang, A. Shukla, Y. Qin, X. Hinrichsen, X. Yuan, C. Bao, X. Lin, Y. Liu, T. kai Chan, Y. Gao, X. Li, T. Mu, N. Xiao, A. Gurha, Z. Huang, R. Calandra, R. Chen, S. Luo, and H. Su, “Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai,” 2024. [Online]. Available: https://arxiv.org/abs/2410.00425
-
[45]
Y. Wang, Z. Xian, F. Chen, T.-H. Wang, Y. Wang, K. Fragkiadaki, Z. Erickson, D. Held, and C. Gan, “Robogen: Towards unleashing infinite data for automated robot learning via generative simula- tion,” arXiv preprint arXiv:2311.01455, 2023
-
[46]
RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots
S. Nasiriany, A. Maddukuri, L. Zhang, A. Parikh, A. Lo, A. Joshi, A. Mandlekar, and Y. Zhu, “Robocasa: Large-scale simulation of everyday tasks for generalist robots,”arXiv preprint arXiv:2406.02523, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[47]
MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations
A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y. Narang, L. Fan, Y. Zhu, and D. Fox, “Mimicgen: A data generation system for scalable robot learning using human demonstrations,” arXiv preprint arXiv:2310.17596, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[48]
Rlbench: The robot learning benchmark & learning environment,
S. James, Z. Ma, D. R. Arrojo, and A. J. Davison, “Rlbench: The robot learning benchmark & learning environment,” IEEE Robotics and Automation Letters , vol. 5, no. 2, pp. 3019–3026, 2020
work page 2020
-
[49]
Empirical analysis of sim-and-real cotraining of diffusion policies for planar pushing from pixels,
A. Wei, A. Agarwal, B. Chen, R. Bosworth, N. Pfaff, and R. Tedrake, “Empirical analysis of sim-and-real cotraining of diffusion policies for planar pushing from pixels,” 2025. [Online]. Available: https://arxiv.org/abs/2503.22634
-
[50]
Sim-and-real co-training: A simple recipe for vision-based robotic manipulation,
A. Maddukuri, Z. Jiang, L. Y. Chen, S. Nasiriany, Y. Xie, Y. Fang, W. Huang, Z. Wang, Z. Xu, N. Chernyadev, S. Reed, K. Goldberg, A. Mandlekar, L. Fan, and Y. Zhu, “Sim-and-real co-training: A simple recipe for vision-based robotic manipulation,” 2025. [Online]. Available: https://arxiv.org/abs/2503.24361
-
[51]
Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots
C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song, “Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots,” 2024. [Online]. Available: https://arxiv.org/abs/2402.10329
-
[52]
Legato: Cross-embodiment imitation using a grasping tool,
M. Seo, H. A. Park, S. Yuan, Y. Zhu, and L. Sentis, “Legato: Cross-embodiment imitation using a grasping tool,” IEEE Robotics and Automation Letters, vol. 10, no. 3, pp. 2854–2861, Mar. 2025. [Online]. Available: http://dx.doi.org/10.1109/LRA.2025.3535182
-
[53]
Egomimic: Scaling imitation learning via egocentric video,
S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu, “Egomimic: Scaling imitation learning via egocentric video,” 2024. [Online]. Available: https://arxiv.org/abs/2410.24221
-
[54]
Airexo: Low-cost exoskeletons for learning whole-arm manipulation in the wild,
H. Fang, H.-S. Fang, Y. Wang, J. Ren, J. Chen, R. Zhang, W. Wang, and C. Lu, “Airexo: Low-cost exoskeletons for learning whole-arm manipulation in the wild,” 2024. [Online]. Available: https://arxiv.org/abs/2309.14975
-
[55]
Robot learning as an empirical science: Best practices for policy evaluation,
H. Kress-Gazit, K. Hashimoto, N. Kuppuswamy, P. Shah, P. Horgan, G. Richardson, S. Feng, and B. Burchfiel, “Robot learning as an empirical science: Best practices for policy evaluation,” arXiv preprint arXiv:2409.09491, 2024
-
[56]
Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations,
T. Mu, Z. Ling, F. Xiang, D. Yang, X. Li, S. Tao, Z. Huang, Z. Jia, and H. Su, “Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations,” 2021. [Online]. Available: https://arxiv.org/abs/2107.14483
-
[57]
Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning,
T. Yu, D. Quillen, Z. He, R. Julian, A. Narayan, H. Shively, A. Bellathur, K. Hausman, C. Finn, and S. Levine, “Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning,” 2021. [Online]. Available: https://arxiv.org/abs/1910.10897
-
[58]
robosuite: A Modular Simulation Framework and Benchmark for Robot Learning
Y. Zhu, J. Wong, A. Mandlekar, R. Martín-Martín, A. Joshi, S. Nasiriany, Y. Zhu, and K. Lin, “robosuite: A modular simulation framework and benchmark for robot learning,” arXiv preprint arXiv:2009.12293, 2020
-
[59]
S. Srivastava, C. Li, M. Lingelbach, R. Martín-Martín, F. Xia, K. Vainio, Z. Lian, C. Gokmen, S. Buch, C. K. Liu, S. Savarese, H. Gweon, J. Wu, and L. Fei-Fei, “Behavior: Benchmark for everyday household activities in virtual, interactive, and ecological environments,” 2021. [Online]. Available: https://arxiv.org/abs/2108.03332
-
[60]
Robothor: An open simulation-to-real embodied ai platform,
M. Deitke, W. Han, A. Herrasti, A. Kembhavi, E. Kolve, R. Mottaghi, J. Salvador, D. Schwenk, E. VanderBilt, M. Wallingford, L. Weihs, M. Yatskar, and A. Farhadi, “Robothor: An open simulation-to-real embodied ai platform,” 2020. [Online]. Available: https://arxiv.org/abs/2004.06799
-
[61]
Sim2real predictivity: Does evaluation in simulation predict real-world performance?
A. Kadian, J. Truong, A. Gokaslan, A. Clegg, E. Wijmans, S. Lee, M. Savva, S. Chernova, and D. Batra, “Sim2real predictivity: Does evaluation in simulation predict real-world performance?” IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 6670–6677, Oct. 2020. [Online]. Available: http://dx.doi.org/10.1109/LRA.2020.3013848
-
[62]
VR-Goggles for Robots: Real-to-sim Domain Adaptation for Visual Control
J. Zhang, L. Tai, P. Yun, Y. Xiong, M. Liu, J. Boedecker, and W. Burgard, “Vr-goggles for robots: Real-to-sim domain adaptation for visual control,” 2019. [Online]. Available: https://arxiv.org/abs/1802.00265
-
[63]
Evaluating Real-World Robot Manipulation Policies in Simulation
X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kirmani, S. Levine, J. Wu, C. Finn, H. Su, Q. Vuong, and T. Xiao, “Evaluating real-world robot manipulation policies in simulation,” arXiv preprint arXiv:2405.05941, 2024
-
[64]
Asid: Active exploration for system identification in robotic manipulation,
M. Memmel, A. Wagenmaker, C. Zhu, P. Yin, D. Fox, and A. Gupta, “Asid: Active exploration for system identification in robotic manipulation,” 2024. [Online]. Available: https://arxiv.org/abs/2404.12308
-
[65]
Scalable real2sim: Physics-aware asset generation via robotic pick-and-place setups,
N. Pfaff, E. Fu, J. Binagia, P. Isola, and R. Tedrake, “Scalable real2sim: Physics-aware asset generation via robotic pick-and-place setups,” 2025. [Online]. Available: https://arxiv.org/abs/2503.00370
-
[66]
Rb2: Robotic manipulation benchmarking with a twist,
S. Dasari, J. Wang, J. Hong, S. Bahl, Y. Lin, A. Wang, A. Thankaraj, K. Chahal, B. Calli, S. Gupta, D. Held, L. Pinto, D. Pathak, V. Kumar, and A. Gupta, “Rb2: Robotic manipulation benchmarking with a twist,” 2022. [Online]. Available: https://arxiv.org/abs/2203.08098
-
[67]
Benchmarking cluttered robot pick-and-place manipulation with the box and blocks test,
A. S. Morgan, K. Hang, W. G. Bircher, F. M. Alladkani, A. Gandhi, B. Calli, and A. M. Dollar, “Benchmarking cluttered robot pick-and-place manipulation with the box and blocks test,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 454–461, 2019
-
[68]
Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation,
M. Heo, Y. Lee, D. Lee, and J. J. Lim, “Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation,” The International Journal of Robotics Research, p. 02783649241304789, 2023
-
[69]
Benchmarking protocols for evaluating small parts robotic assembly systems,
K. Kimble, K. Van Wyk, J. Falco, E. Messina, Y. Sun, M. Shibata, W. Uemura, and Y. Yokokohji, “Benchmarking protocols for evaluating small parts robotic assembly systems,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 883–889, 2020
-
[70]
Scenereplica: Benchmarking real-world robot manipulation by creating replicable scenes,
N. Khargonkar, S. H. Allu, Y. Lu, B. Prabhakaran, Y. Xiang et al., “Scenereplica: Benchmarking real-world robot manipulation by creating replicable scenes,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 8258–8264
-
[71]
Benchmarking protocol for grasp planning algorithms,
Y. Bekiroglu, N. Marturi, M. A. Roa, K. J. M. Adjigble, T. Pardi, C. Grimm, R. Balasubramanian, K. Hang, and R. Stolkin, “Benchmarking protocol for grasp planning algorithms,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 315–322, 2019
-
[72]
Graspa 1.0: Graspa is a robot arm grasping performance benchmark,
F. Bottarel, G. Vezzani, U. Pattacini, and L. Natale, “Graspa 1.0: Graspa is a robot arm grasping performance benchmark,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 836–843, 2020
-
[73]
Benchmark for bimanual robotic manipulation of semi-deformable objects,
K. Chatzilygeroudis, B. Fichera, I. Lauzana, F. Bu, K. Yao, F. Khadivar, and A. Billard, “Benchmark for bimanual robotic manipulation of semi-deformable objects,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 2443–2450, 2020
-
[74]
Ocrtoc: A cloud-based competition and benchmark for robotic grasping and manipulation,
Z. Liu, W. Liu, Y. Qin, F. Xiang, M. Gou, S. Xin, M. A. Roa, B. Calli, H. Su, Y. Sun et al., “Ocrtoc: A cloud-based competition and benchmark for robotic grasping and manipulation,” IEEE Robotics and Automation Letters, vol. 7, no. 1, pp. 486–493, 2021
-
[75]
Real robot challenge: A robotics competition in the cloud,
S. Bauer, M. Wüthrich, F. Widmaier, A. Buchholz, S. Stark, A. Goyal, T. Steinbrenner, J. Akpo, S. Joshi, V. Berenz et al., “Real robot challenge: A robotics competition in the cloud,” in NeurIPS 2021 Competitions and Demonstrations Track. PMLR, 2022, pp. 190–204
-
[76]
Train offline, test online: A real robot learning benchmark,
G. Zhou, V. Dean, M. K. Srirama, A. Rajeswaran, J. Pari, K. Hatch, A. Jain, T. Yu, P. Abbeel, L. Pinto et al., “Train offline, test online: A real robot learning benchmark,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 9197–9203
-
[77]
Z. Zhou, P. Atreya, Y. L. Tan, K. Pertsch, and S. Levine, “Autoeval: Autonomous evaluation of generalist robot manipulation policies in the real world,” 2025. [Online]. Available: https://arxiv.org/abs/2503.24278
-
[78]
Is your imitation learning policy better than mine? policy comparison with near-optimal stopping,
D. Snyder, A. J. Hancock, A. Badithela, E. Dixon, P. Miller, R. A. Ambrus, A. Majumdar, M. Itkina, and H. Nishimura, “Is your imitation learning policy better than mine? policy comparison with near-optimal stopping,” arXiv preprint arXiv:2503.10966, 2025
-
[79]
Deep reinforcement learning at the edge of the statistical precipice,
R. Agarwal, M. Schwarzer, P. S. Castro, A. C. Courville, and M. Bellemare, “Deep reinforcement learning at the edge of the statistical precipice,” Advances in Neural Information Processing Systems, vol. 34, pp. 29304–29320, 2021
-
[80]
Statistical tests, p values, confidence intervals, and power: a guide to misinterpretations,
S. Greenland, S. J. Senn, K. J. Rothman, J. B. Carlin, C. Poole, S. N. Goodman, and D. G. Altman, “Statistical tests, p values, confidence intervals, and power: a guide to misinterpretations,” European Journal of Epidemiology, vol. 31, no. 4, pp. 337–350, 2016