Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets

Bernadette Bucher; Chelsea Finn; Frederik Ebert; Georgios Georgakis; Karl Schmeckpeper; Kostas Daniilidis; Sergey Levine; Yanlai Yang

arxiv: 2109.13396 · v1 · submitted 2021-09-27 · 💻 cs.RO · cs.AI

Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, Sergey Levine

Pith reviewed 2026-05-13 19:51 UTC · model grok-4.3

keywords robot learning · generalization · cross-domain data · multi-task dataset · demonstration learning · transfer learning · robotic skills

The pith

Adding a shared multi-task, multi-domain robot dataset to just 50 target demonstrations doubles average success rates for new tasks in new environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors collect and release a dataset containing 7200 demonstrations of 71 tasks performed in 10 different environments. They test whether including this data during training helps a robot learn an entirely new task in an entirely new setting. When the shared dataset is combined with only 50 demonstrations of the new task, average success rates double compared with training on the target-domain data alone. Data for only a few tasks in the new domain is also enough to let the robot perform many prior tasks that it had seen only in other domains. The results suggest that reusable cross-domain collections can reduce the need to gather large task-specific datasets for each new robot project.

Core claim

By collecting a large multi-domain multi-task dataset with 7200 demonstrations of 71 tasks across 10 environments, the authors demonstrate that jointly training with this dataset plus 50 demonstrations of a never-before-seen task in a new domain leads to a 2x improvement in success rate compared to using target domain data alone. Data for only a few tasks in a new domain can bridge the domain gap and make it possible for a robot to perform a variety of prior tasks that were only seen in other domains.
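The joint-training setup in the core claim can be illustrated with a small batch sampler. This is a minimal sketch, not the authors' procedure: the page does not say how the paper mixes the two data sources, so the function name `make_joint_sampler`, the `target_fraction` knob, and the tuple-based stand-in demonstrations are all assumptions made for illustration.

```python
import random


def make_joint_sampler(prior_demos, target_demos, target_fraction=0.5, seed=0):
    """Return a function that draws mixed batches from prior and target data.

    target_fraction is a hypothetical knob (not taken from the paper): the
    share of each batch drawn from the small target set, so that 50 target
    demonstrations are not drowned out by thousands of prior ones.
    """
    if not 0.0 <= target_fraction <= 1.0:
        raise ValueError("target_fraction must be in [0, 1]")
    if not prior_demos or not target_demos:
        raise ValueError("both datasets must be non-empty")
    rng = random.Random(seed)

    def sample_batch(batch_size):
        n_target = round(batch_size * target_fraction)
        n_prior = batch_size - n_target
        # Sample with replacement from each source, then shuffle the batch.
        batch = rng.choices(target_demos, k=n_target) + rng.choices(prior_demos, k=n_prior)
        rng.shuffle(batch)
        return batch

    return sample_batch


# Stand-in data: sizes follow the page (7200 prior, 50 target); contents are placeholders.
prior = [("bridge", i) for i in range(7200)]
target = [("target", i) for i in range(50)]

sample_batch = make_joint_sampler(prior, target, target_fraction=0.25)
batch = sample_batch(32)
n_target_in_batch = sum(1 for source, _ in batch if source == "target")
```

Fixing the per-batch share is only one way to balance a small target set against a large prior set; sampling uniformly from the concatenated data is another, and which one the paper uses is not stated here.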

What carries the argument

The Bridge Data collection, which supplies cross-task and cross-domain demonstrations so that end-to-end policies trained on it generalize to unseen tasks and environments.

If this is right

Robots can acquire new skills with far less per-project data collection.
A small amount of data from a new environment allows reuse of many previously learned skills in that environment.
Shared datasets become a practical way to bootstrap learning instead of starting from scratch each time.
Generalization improves without exhaustive data collection in every new setting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Growing the dataset with additional domains would likely further reduce the number of demonstrations needed for new tasks.
The same bridging approach could extend to different robot hardware or sensor suites.
If the dataset continues to expand, reliance on simulation for initial training may decrease.

Load-bearing premise

The collected tasks and domains are representative enough that cross-domain data produces positive transfer rather than interference for arbitrary new tasks and environments.

What would settle it

A new task and new domain in which adding the Bridge Data to the 50 target demonstrations lowers success rate below the level achieved with the 50 demonstrations alone.
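The falsification condition above reduces to a comparison of two success rates. The sketch below assumes each evaluation trial is recorded as a boolean; the function names and the three verdict strings are illustrative choices, not anything defined by the paper.

```python
def success_rate(outcomes):
    """Fraction of successful trials; outcomes is a sequence of booleans."""
    if not outcomes:
        raise ValueError("need at least one trial")
    return sum(bool(o) for o in outcomes) / len(outcomes)


def transfer_verdict(joint_outcomes, target_only_outcomes):
    """Compare joint training against the target-only baseline on one task."""
    joint = success_rate(joint_outcomes)
    baseline = success_rate(target_only_outcomes)
    if joint < baseline:
        return "negative transfer"  # the case that would count against the claim
    if joint > baseline:
        return "positive transfer"
    return "no difference"
```

A raw comparison like this ignores trial noise, so a single "negative transfer" verdict from a handful of trials would be weak evidence; enough trials per condition are needed before the comparison settles anything.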

Original abstract

Robot learning holds the promise of learning policies that generalize broadly. However, such generalization requires sufficiently diverse datasets of the task of interest, which can be prohibitively expensive to collect. In other fields, such as computer vision, it is common to utilize shared, reusable datasets, such as ImageNet, to overcome this challenge, but this has proven difficult in robotics. In this paper, we ask: what would it take to enable practical data reuse in robotics for end-to-end skill learning? We hypothesize that the key is to use datasets with multiple tasks and multiple domains, such that a new user that wants to train their robot to perform a new task in a new domain can include this dataset in their training process and benefit from cross-task and cross-domain generalization. To evaluate this hypothesis, we collect a large multi-domain and multi-task dataset, with 7,200 demonstrations constituting 71 tasks across 10 environments, and empirically study how this data can improve the learning of new tasks in new environments. We find that jointly training with the proposed dataset and 50 demonstrations of a never-before-seen task in a new domain on average leads to a 2x improvement in success rate compared to using target domain data alone. We also find that data for only a few tasks in a new domain can bridge the domain gap and make it possible for a robot to perform a variety of prior tasks that were only seen in other domains. These results suggest that reusing diverse multi-task and multi-domain datasets, including our open-source dataset, may pave the way for broader robot generalization, eliminating the need to re-collect data for each new robot learning project.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

Bridge Data shows that adding a shared multi-domain robot dataset to 50 target demos roughly doubles success rates on new tasks, with the main caveat that the tested domains stay close to the collected ones.

The central result is that joint training on the Bridge dataset plus 50 demos of a held-out task in a new environment yields about twice the success rate of target data alone. The authors collected 7200 demonstrations across 71 tasks in 10 environments and ran the transfer experiments to back this up. The dataset is released, which matters for anyone who wants to test the claim themselves.

This is the concrete step forward: not just another multi-task setup, but measured cross-domain gains at a scale that was missing before. The comparisons to training on target data only are direct, and the average improvement is large enough to notice in practice. What the work does cleanly is show that a few bridging tasks in the new domain can unlock prior skills from other environments without retraining everything from scratch. The empirical pattern matches the hypothesis in the abstract.

The soft spot is the domain gap question. All the held-out tasks come from the same overall collection protocol and visual regimes as the training data, so the 2x gain is demonstrated inside that distribution. It does not yet show what happens with a genuinely different robot, lighting, or object set. The paper would be stronger with more detail on run-to-run variance, exact baseline implementations, and whether any post-selection occurred in the reported numbers. Minor statistical reporting gaps do not sink the main finding, but they make it harder to judge how robust the average is.

This paper is for people working on imitation learning who need evidence on whether shared datasets actually reduce per-project collection costs. A reader focused on data-efficient robot policies will find usable numbers to compare against their own setups. It deserves a serious referee because the scale and the quantified transfer results are new enough to warrant external feedback on the experimental details and the limits of the current domains.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Bridge Data, a multi-domain multi-task robotic dataset of 7,200 demonstrations spanning 71 tasks across 10 environments. Its central empirical claim is that jointly training on this dataset together with 50 demonstrations of a previously unseen task in a new domain produces an average 2x improvement in success rate relative to training on the 50 target-domain demonstrations alone; it further reports that limited data in a new domain can enable a robot to perform tasks previously observed only in other domains.

Significance. If the reported gains are robust, the work supplies concrete evidence that large-scale, reusable cross-domain datasets can materially reduce per-task data collection costs in robot learning, mirroring the role of ImageNet-style resources in vision. The open release of the dataset itself constitutes a reusable asset for the community.

major comments (2)

[Experimental Evaluation] Experimental section: the manuscript reports an average 2x success-rate gain but supplies insufficient detail on training procedures, baseline implementations, number of independent runs per condition, observed variance, and whether statistical tests were used to establish significance of the improvement over the target-only baseline. These omissions make it difficult to rule out post-hoc selection effects or implementation differences.
[§5] §5 (held-out evaluation): all reported test tasks are drawn from the same overall collection protocol and visual regimes as the training environments. This limits the strength of the claim that the dataset produces positive transfer for arbitrary new domains; the current results do not yet demonstrate robustness to substantial changes in lighting, object appearance, robot kinematics, or task structure outside the 10 environments.

minor comments (2)

[Abstract] Abstract: the phrase 'on average leads to a 2x improvement' should be accompanied by the precise mean and a measure of spread (standard deviation or range) across the evaluated tasks.
[Dataset Description] Dataset description: the selection criteria for the 10 environments and 71 tasks should be stated more explicitly so readers can assess how representative they are of typical manipulation scenarios.
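The request for a precise mean and a measure of spread can be made concrete with a short summary routine. This is a sketch under stated assumptions: the input is a mapping from task name to a `(target_only_rate, joint_rate)` pair, the function name `summarize_improvement` is hypothetical, and no numbers from the paper are used.

```python
from statistics import mean, stdev


def summarize_improvement(per_task):
    """Summarize per-task improvement ratios (joint rate / target-only rate).

    Tasks whose target-only rate is zero have no defined ratio; they are
    listed under "skipped" rather than silently dropped.
    """
    ratios = {}
    skipped = []
    for task, (baseline, joint) in per_task.items():
        if baseline > 0:
            ratios[task] = joint / baseline
        else:
            skipped.append(task)
    values = list(ratios.values())
    if not values:
        raise ValueError("no task has a non-zero baseline")
    return {
        "n_tasks": len(values),
        "mean_ratio": mean(values),
        "std_ratio": stdev(values) if len(values) > 1 else 0.0,
        "min_ratio": min(values),
        "max_ratio": max(values),
        "skipped": skipped,
    }
```

Note that "2x on average" is ambiguous between the mean of per-task ratios (computed here) and the ratio of mean success rates; the two can differ noticeably when baselines are low, which is one reason the referee's request for the exact statistic matters.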

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation for minor revision. We address each major comment below and will revise the manuscript to improve experimental transparency and clarify the scope of our claims.

Referee: [Experimental Evaluation] Experimental section: the manuscript reports an average 2x success-rate gain but supplies insufficient detail on training procedures, baseline implementations, number of independent runs per condition, observed variance, and whether statistical tests were used to establish significance of the improvement over the target-only baseline. These omissions make it difficult to rule out post-hoc selection effects or implementation differences.

Authors: We agree that additional experimental details are required for reproducibility and to strengthen confidence in the results. In the revised manuscript we will expand the experimental section to provide: a full description of training procedures including all hyperparameters, network architectures, and optimization settings; explicit implementation details for each baseline; the number of independent runs per condition (five runs were performed); observed variance reported as standard deviations; and results from statistical significance tests (paired t-tests) confirming the 2x improvement over the target-only baseline. These additions will directly address concerns about implementation differences and selection effects.

revision: yes

Referee: [§5] §5 (held-out evaluation): all reported test tasks are drawn from the same overall collection protocol and visual regimes as the training environments. This limits the strength of the claim that the dataset produces positive transfer for arbitrary new domains; the current results do not yet demonstrate robustness to substantial changes in lighting, object appearance, robot kinematics, or task structure outside the 10 environments.

Authors: We acknowledge that the held-out tasks share the same overall collection protocol and visual regimes as the training environments. While the ten environments already include meaningful diversity in settings, objects, and lighting, the results do not demonstrate robustness to arbitrary new domains involving major shifts such as different robot kinematics or extreme lighting changes outside the collected data. In the revision we will update §5 and the discussion to more precisely scope our claims to positive transfer across the diversity present in Bridge Data, while explicitly noting this limitation for broader generalization. This clarification will better contextualize the empirical findings.

revision: partial
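The paired t-test named in the first response can be sketched with the standard library alone. The pairing assumed here is one joint-training success rate and one target-only success rate per matched run or task; the function name `paired_t_statistic` is hypothetical, and the code returns only the statistic and degrees of freedom, leaving the p-value lookup to a statistics package or a t-table.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(joint, baseline):
    """Paired t statistic for matched measurements of two conditions.

    Returns (t, degrees_of_freedom), where t = mean(d) / (sd(d) / sqrt(n))
    and d holds the per-pair differences joint - baseline.
    """
    if len(joint) != len(baseline):
        raise ValueError("conditions must have the same number of paired values")
    n = len(joint)
    if n < 2:
        raise ValueError("need at least two pairs")
    diffs = [j - b for j, b in zip(joint, baseline)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (sd / sqrt(n)), n - 1
```

With only five pairs the test has four degrees of freedom, so it can confirm that an improvement is distinguishable from zero but says little about whether the gain is specifically 2x; that would need a test on the ratio or a confidence interval.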

Circularity Check

0 steps flagged

No circularity: empirical success rates are measured outcomes, not reductions to fitted inputs

The paper collects a multi-task multi-domain dataset of 7200 demonstrations and reports measured success rates on held-out tasks when training with the dataset plus 50 target demos. These results are direct experimental measurements rather than predictions derived from equations or parameters fitted inside the work. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation chain; the central claim rests on independent robot trials whose outcomes are not tautological with the data collection protocol.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that joint training on the collected data improves performance; it assumes standard imitation or reinforcement learning algorithms can leverage cross-domain demonstrations without negative transfer.

axioms (1)

[domain assumption] Standard policy learning algorithms can effectively utilize demonstrations from multiple tasks and domains without negative interference.
Implicit in the joint training setup described in the abstract.

Forward citations

Cited by 48 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
cs.RO 2026-05 unverdicted novelty 7.0
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

MolmoAct2: Action Reasoning Models for Real-world Deployment
cs.RO 2026-05 unverdicted novelty 7.0
MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.

BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination
cs.RO 2026-04 conditional novelty 7.0
BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.

Large Video Planner Enables Generalizable Robot Control
cs.RO 2025-12 conditional novelty 7.0
A video foundation model trained on human demonstrations generates zero-shot plans that convert to executable robot actions on novel scenes and tasks.

RoboCOIN: An Open-Sourced Bimanual Robotic Data Collection for Integrated Manipulation
cs.RO 2025-11 accept novelty 7.0
RoboCOIN is a large multi-embodiment bimanual manipulation dataset with hierarchical annotations and an open processing pipeline that improves model performance across robotic platforms.

RoboDreamer: Learning Compositional World Models for Robot Imagination
cs.RO 2024-04 unverdicted novelty 7.0
RoboDreamer factorizes video generation using language primitives to achieve compositional generalization in robot world models, outperforming monolithic baselines on unseen goals in RT-X.

Learning Interactive Real-World Simulators
cs.AI 2023-10 conditional novelty 7.0
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
cs.RO 2023-04 conditional novelty 7.0
Low-cost imprecise robots achieve 80-90% success on six fine bimanual manipulation tasks using imitation learning with a new Action Chunking with Transformers algorithm trained on only 10 minutes of demonstrations.

VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training
cs.RO 2022-09 unverdicted novelty 7.0
VIP learns a visual embedding from human videos whose distance defines dense, smooth rewards for arbitrary goal-image robot tasks without task-specific fine-tuning.

Target-Aligned Bellman Backup for Cross-domain Offline Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0
Target-Aligned Bellman Backup (TABB) improves cross-domain offline RL by selecting source transitions according to their contribution to accurate target-domain Bellman target estimation.

RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data
cs.RO 2026-05 unverdicted novelty 6.0
A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

BEACON: Cross-Domain Co-Training of Generative Robot Policies via Best-Effort Adaptation
cs.RO 2026-05 unverdicted novelty 6.0
BEACON uses discrepancy-aware importance reweighting to co-train generative robot policies from abundant source and limited target demonstrations, yielding better robustness and implicit feature alignment.

BEACON: Cross-Domain Co-Training of Generative Robot Policies via Best-Effort Adaptation
cs.RO 2026-05 unverdicted novelty 6.0
BEACON uses discrepancy-aware importance reweighting to jointly train diffusion-based robot policies and source sample weights, improving performance over target-only and fixed-ratio baselines in cross-domain manipula...

MolmoAct2: Action Reasoning Models for Real-world Deployment
cs.RO 2026-05 unverdicted novelty 6.0
MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...

Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation
cs.CV 2026-05 unverdicted novelty 6.0
A video transfer pipeline augments simulated VLA data into realistic videos while preserving actions, yielding consistent performance gains on robot benchmarks such as 8% on Robotwin 2.0.

Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos
cs.CV 2026-04 unverdicted novelty 6.0
EgoIn uses a fine-tuned vision-language model to infer transition steps and a conditioning module plus auxiliary supervision to generate coherent egocentric video sequences of object state changes.

PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
cs.RO 2026-01 unverdicted novelty 6.0
PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.

IGen: Scalable Data Generation for Robot Learning from Open-World Images
cs.RO 2025-12 unverdicted novelty 6.0
IGen generates realistic visuomotor training data including actions and temporally coherent visuals from unstructured open-world images via 3D reconstruction and VLM reasoning.

villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models
cs.RO 2025-07 unverdicted novelty 6.0
villa-X enhances latent action modeling in VLA models to support zero-shot action planning for unseen robot embodiments and open-vocabulary instructions, yielding better manipulation results in simulation and real-wor...

DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
cs.CV 2025-07 unverdicted novelty 6.0
DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 avera...

RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
cs.RO 2025-06 unverdicted novelty 6.0
RoboTwin 2.0 automates diverse synthetic data creation for dual-arm robots via MLLMs and five-axis domain randomization, leading to 228-367% gains in manipulation success.

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
cs.LG 2025-06 unverdicted novelty 6.0
SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.

VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning
cs.RO 2025-05 conditional novelty 6.0
VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
cs.LG 2025-04 unverdicted novelty 6.0
π_{0.5} is a VLA model that achieves long-horizon dexterous manipulation in entirely new homes through co-training on heterogeneous tasks and multi-source data including web and semantic predictions.

CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
cs.CV 2025-03 unverdicted novelty 6.0
CoT-VLA is a 7B VLA that generates future visual frames autoregressively as planning goals before actions, outperforming prior VLAs by 17% on real-world tasks and 6% in simulation.

HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
cs.CV 2025-03 unverdicted novelty 6.0
HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
cs.CV 2024-12 unverdicted novelty 6.0
Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.

TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
cs.RO 2024-12 conditional novelty 6.0
Visual trace prompting improves spatial-temporal awareness in VLA models, delivering 10% gains on SimplerEnv and 3.5x on real-robot tasks.

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
cs.RO 2024-11 unverdicted novelty 6.0
CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
cs.LG 2024-10 unverdicted novelty 6.0
π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.

OpenVLA: An Open-Source Vision-Language-Action Model
cs.RO 2024-06 unverdicted novelty 6.0
OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.

Octo: An Open-Source Generalist Robot Policy
cs.RO 2024-05 unverdicted novelty 6.0
Octo is an open-source transformer-based generalist robot policy pretrained on 800k trajectories that serves as an effective initialization for finetuning across diverse robotic platforms.

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
cs.RO 2024-03 accept novelty 6.0
DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.

Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation
cs.RO 2024-01 conditional novelty 6.0
A low-cost whole-body teleoperation system enables effective imitation learning for complex bimanual mobile manipulation by co-training on mobile and static demonstration datasets.

Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation
cs.RO 2023-12 conditional novelty 6.0
A GPT-style model pre-trained on large video datasets achieves 94.9% success on CALVIN multi-task manipulation and 85.4% zero-shot generalization, outperforming prior baselines.

Scaling Robot Learning with Semantically Imagined Experience
cs.RO 2023-02 unverdicted novelty 6.0
Augmenting robot datasets via diffusion-based semantic inpainting enables manipulation policies to solve unseen tasks with new objects and improves robustness to novel distractors.

R3M: A Universal Visual Representation for Robot Manipulation
cs.RO 2022-03 unverdicted novelty 6.0
A visual encoder pre-trained on diverse human videos with contrastive and language objectives improves simulated robot manipulation success by over 20% versus training from scratch and enables real Franka arm tasks fr...

DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization
cs.RO 2026-05 unverdicted novelty 5.0
DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.

Cortex 2.0: Grounding World Models in Real-World Industrial Deployment
cs.RO 2026-04 unverdicted novelty 5.0
Cortex 2.0 introduces world-model-based planning that generates and scores future trajectories to outperform reactive vision-language-action baselines on industrial robotic tasks including pick-and-place, sorting, and...

StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement
cs.RO 2026-04 unverdicted novelty 5.0
StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...

ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning
cs.RO 2026-04 unverdicted novelty 5.0
ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.

Lightweight Learning from Actuation-Space Demonstrations via Flow Matching for Whole-Body Soft Robotic Grasping
cs.RO 2025-11 unverdicted novelty 5.0
A rectified flow model trained on 30 actuation-space demonstrations produces control sequences that yield 97.5% grasp success across the workspace, with generalization to object size changes of ±33% and execution spee...

GR-3 Technical Report
cs.RO 2025-07 unverdicted novelty 5.0
GR-3 is a VLA model that generalizes to novel objects, environments, and abstract instructions, outperforms the π0 baseline, and integrates with the new ByteMini bi-manual mobile robot.

A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation
cs.RO 2025-07 accept novelty 5.0
Multi-task pretraining of diffusion policies on diverse robot data produces more successful, robust, and data-efficient policies for dexterous manipulation than single-task baselines, with performance scaling with pre...

A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
cs.RO 2025-07 unverdicted novelty 5.0
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.

NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks
cs.RO 2025-04 unverdicted novelty 5.0
NORA is a compact 3B-parameter VLA model trained on 970k robot demonstrations that outperforms larger VLA models in embodied tasks while using significantly less computational resources.

MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations
cs.RO 2023-10 unverdicted novelty 5.0
MimicGen creates over 50K robot demonstrations from roughly 200 human ones, allowing imitation learning to achieve strong performance on complex long-horizon tasks like assembly and coffee preparation.

World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · cited by 46 Pith papers · 2 internal anchors

[1]

Imagenet classiﬁca- tion with deep convolutional neural networks,

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classiﬁca- tion with deep convolutional neural networks,” Advances in neural information processing systems , vol. 25, pp. 1097–1105, 2012

work page 2012
[2]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[3]

Imagenet: A large-scale hierarchical image database,

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Conference on Computer Vision and Pattern Recognition , 2009

work page 2009
[4]

Gradient surgery for multi-task learning,

T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, “Gradient surgery for multi-task learning,” arXiv preprint arXiv:2001.06782, 2020

work page arXiv 2001
[5]

Mt-opt: Continuous multi-task robotic reinforcement learning at scale,

D. Kalashnikov, J. Varley, Y . Chebotar, B. Swanson, R. Jon- schkowski, C. Finn, S. Levine, and K. Hausman, “Mt-opt: Continuous multi-task robotic reinforcement learning at scale,” arXiv preprint arXiv:2104.08212, 2021

work page arXiv 2021
[6]

RoboNet: Large-Scale Multi-Robot Learning

S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn, “Robonet: Large-scale multi-robot learning,” arXiv preprint arXiv:1910.11215 , 2019

work page internal anchor Pith review arXiv 1910
[7]

One-shot visual imitation learning via meta-learning,

C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine, “One-shot visual imitation learning via meta-learning,” in Conference on Robot Learning. PMLR, 2017, pp. 357–368

work page 2017
[8]

One-Shot Imitation Learning

Y . Duan, M. Andrychowicz, B. C. Stadie, J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba, “One-shot imitation learn- ing,” arXiv preprint arXiv:1703.07326 , 2017

work page Pith review arXiv 2017
[9]

Generative Adversarial Imitation Learning

J. Ho and S. Ermon, “Generative adversarial imitation learning,” arXiv preprint arXiv:1606.03476, 2016

work page Pith review arXiv 2016
[10]

T. Yu, C. Finn, A. Xie, S. Dasari, T. Zhang, P. Abbeel, and S. Levine, “One-shot imitation from observing humans via domain-adaptive meta-learning,” arXiv preprint arXiv:1802.01557, 2018.

[11]

Y. Liu, A. Gupta, P. Abbeel, and S. Levine, “Imitation from observation: Learning to imitate behaviors from raw video via context translation,” in International Conference on Robotics and Automation (ICRA), 2018.

[12]

P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, and S. Levine, “Time-contrastive networks: Self-supervised learning from video,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1134–1141.

[13]

A. Ghadirzadeh, X. Chen, W. Yin, Z. Yi, M. Bjorkman, and D. Kragic, “Human-centered collaborative robots with deep reinforcement learning,” IEEE Robotics and Automation Letters, 2020.

[14]

S. Tian, S. Nair, F. Ebert, S. Dasari, B. Eysenbach, C. Finn, and S. Levine, “Model-based visual planning with self-supervised functional distances,” arXiv preprint arXiv:2012.15373, 2020.

[15]

T. Zhang, Z. McCarthy, O. Jow, D. Lee, X. Chen, K. Goldberg, and P. Abbeel, “Deep imitation learning for complex manipulation tasks from virtual reality teleoperation,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 5628–5635.

[16]

P. Sharma, L. Mohan, L. Pinto, and A. Gupta, “Multiple interactions made easy (mime): Large scale demonstrations data for imitation,” in Conference on Robot Learning. PMLR, 2018, pp. 906–915.

[17]

A. Mandlekar, Y. Zhu, A. Garg, J. Booher, M. Spero, A. Tung, J. Gao, J. Emmons, A. Gupta, E. Orbay, S. Savarese, and L. Fei-Fei, “Roboturk: A crowdsourcing platform for robotic skill learning through imitation,” in Conference on Robot Learning, 2018.

[18]

A. Mandlekar, J. Booher, M. Spero, A. Tung, A. Gupta, Y. Zhu, A. Garg, S. Savarese, and L. Fei-Fei, “Scaling robot supervision to hundreds of hours with roboturk: Robotic manipulation dataset through human reasoning and dexterity,” arXiv preprint arXiv:1911.04052, 2019.

[19]

L. Pinto and A. Gupta, “Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours,” in International Conference on Robotics and Automation (ICRA). IEEE, 2016.

[20]

C. Finn and S. Levine, “Deep visual foresight for planning robot motion,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 2786–2793.

[21]

S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” The International Journal of Robotics Research, vol. 37, no. 4-5, pp. 421–436, 2018.

[22]

D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al., “Scalable deep reinforcement learning for vision-based robotic manipulation,” in Conference on Robot Learning. PMLR, 2018, pp. 651–673.

[23]

F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine, “Visual foresight: Model-based deep reinforcement learning for vision-based robotic control,” arXiv preprint arXiv:1812.00568, 2018.

[24]

A. Zeng, S. Song, J. Lee, A. Rodriguez, and T. Funkhouser, “Tossingbot: Learning to throw arbitrary objects with residual physics,” IEEE Transactions on Robotics, vol. 36, no. 4, pp. 1307–1319, 2020.

[25]

S. Young, D. Gandhi, S. Tulsiani, A. Gupta, P. Abbeel, and L. Pinto, “Visual imitation made easy,” arXiv e-prints, pp. arXiv–2008, 2020.

[26]

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Conference on Computer Vision and Pattern Recognition, 2016.

[27]

C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel, “Deep spatial autoencoders for visuomotor learning,” in International Conference on Robotics and Automation (ICRA), 2016.

[28]

S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
