World models introduce a stealthy poisoning vector into robot learning pipelines where malicious prompts or dynamics in teleoperated data activate only during synthetic trajectory generation, enabling backdoors in downstream policies.
super hub Mixed citations
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Mixed citation behavior. Most common role is background (55%).
abstract
Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website https://robotics-transformer-x.github.io.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train generalist X-robot policy that can be adapted efficiently to new robots, tasks, and enviro
authors
co-cited works
representative citing papers
BOKBO is the first conformal abstention method for K-sample VLA policies that supplies finite-sample distribution-free guarantees on executed violation rates, with global and Mondrian per-task variants.
SkiP introduces action relabeling and Motion Spectrum Keying to skip redundant steps in robot trajectories, cutting executed steps by 15-40% while maintaining success rates across 72 simulated and 3 real tasks.
Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.
SABER provides 44.8K multi-representation action samples from unscripted retail environments that raise a VLA model's mean success rate on ten manipulation tasks from 13.4% to 29.3%.
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
Action Agent pairs LLM-driven video generation with a flow-constrained diffusion transformer to produce velocity commands, raising video success to 86% and delivering 64.7% real-world navigation on a Unitree G1 humanoid.
A 48-camera residential platform delivers real-time occlusion-robust 3D perception and coordinated actuation for multi-human multi-robot interaction in a shared home workspace.
A cross-version swap protocol reveals dominant skills that swing composition success by up to 50 percentage points, and an atomic probe with selective revalidation governs updates at lower cost than always re-testing full compositions.
π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
PhysMem enables VLM-based robot planners to learn and verify physical properties through test-time interaction and hypothesis testing, raising success on a brick insertion task from 23% to 76%.
UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.
A video foundation model trained on human demonstrations generates zero-shot plans that convert to executable robot actions on novel scenes and tasks.
Averaging and temporally interpolating text latents in VLAs enables 83% success on novel task combinations in the libero-ood benchmark where SOTA models achieve under 15%.
Introduces the Kaiwu multimodal dataset and framework with 11,664 synchronized assembling demonstrations including hand motions, pressures, sounds, multi-view videos, motion capture, eye gaze, and EMG signals with timestamp-based and semantic annotations.
RoboDreamer factorizes video generation using language primitives to achieve compositional generalization in robot world models, outperforming monolithic baselines on unseen goals in RT-X.
3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.
RT-H learns robot policies by first predicting language motions as an intermediate representation and then mapping those plus the high-level task to actions, yielding more robust multi-task performance and the ability to learn from language interventions.
ATM pre-trains models to predict trajectories of any points in videos, then uses those predictions to learn strong visuomotor policies from minimal action labels, beating baselines by 80% on 130+ tasks.
PACE dynamically selects execution horizons for action chunks in robot policies by detecting low-speed transition points in predicted speed profiles, raising success rates from 57.8% to 64.2% on 50 simulation tasks and from 50.7% to 70.4% in real-robot tests.
VISUALTHINK-VLA uses visual evidence tokens and selective routing to reach top success rates on VLA benchmarks while cutting reasoning latency from multi-second to sub-second levels.
Instrumented objects boost diffusion policy success in robotic hanger insertion by 14-25 percentage points over vision-only baselines, and augmenting datasets with instrumented expert rollouts lets a vision-only student match the instrumented expert.
A compositional flow-matching model learns a dictionary of motion primitives with length masks and assembles them via sparse binary placement with geometric continuity losses, reporting SOTA results on two embodied trajectory datasets.
AVP architecture has VLM emit visual-primitive tokens to condition flow-matching action expert, yielding 27.61% higher success rate than pi_0.5 on real-robot pick-and-place tasks.
citing papers explorer
-
Large Video Planner Enables Generalizable Robot Control
A video foundation model trained on human demonstrations generates zero-shot plans that convert to executable robot actions on novel scenes and tasks.
-
RT-H: Action Hierarchies Using Language
RT-H learns robot policies by first predicting language motions as an intermediate representation and then mapping those plus the high-level task to actions, yielding more robust multi-task performance and the ability to learn from language interventions.
-
Any-point Trajectory Modeling for Policy Learning
ATM pre-trains models to predict trajectories of any points in videos, then uses those predictions to learn strong visuomotor policies from minimal action labels, beating baselines by 80% on 130+ tasks.
-
Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation
Embodied-R1 uses a pointing-centric representation and reinforced fine-tuning on a 200K dataset to achieve state-of-the-art results on embodied benchmarks plus 56.2% success in SIMPLEREnv and 87.5% on real XArm tasks without task-specific training.
-
VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning
VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.
-
Policy Contrastive Decoding for Robotic Foundation Models
PCD redirects robotic policies toward object-relevant visual features via contrastive decoding on masked inputs, improving generalization without retraining or weight access.
-
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
Visual trace prompting improves spatial-temporal awareness in VLA models, delivering 10% gains on SimplerEnv and 3.5x on real-robot tasks.
-
Evaluating Real-World Robot Manipulation Policies in Simulation
SIMPLER simulated environments yield policy performance that correlates strongly with real-world robot manipulation results and captures similar sensitivity to distribution shifts.
-
3D Diffuser Actor: Policy Diffusion with 3D Scene Representations
3D Diffuser Actor unifies diffusion policies with 3D scene features to set new state-of-the-art results on RLBench and CALVIN robot benchmarks.
-
Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation
A low-cost whole-body teleoperation system enables effective imitation learning for complex bimanual mobile manipulation by co-training on mobile and static demonstration datasets.
-
Vision-Language Foundation Models as Effective Robot Imitators
RoboFlamingo adapts open-source vision-language models for robot manipulation tasks via single-step comprehension plus an explicit policy head, outperforming prior methods on benchmarks with only light fine-tuning.