ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills

Fanbo Xiang; Hao Su; Jiayuan Gu; Pengwei Xie; Rui Chen; Stone Tao; Tongzhou Mu; Xiaodi Yuan; Xinyue Wei; Xiqiang Liu

arxiv: 2302.04659 · v1 · pith:L5JZQWOLnew · submitted 2023-02-09 · 💻 cs.RO · cs.AI

Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, Xiaodi Yuan, Pengwei Xie, Zhiao Huang, Rui Chen, Hao Su

keywords: manipulation, benchmark, environments, generalizable, maniskill2, skills, algorithms, benchmarks

Generalizable manipulation skills, which can be composed to tackle long-horizon and complex daily chores, are one of the cornerstones of Embodied AI. However, existing benchmarks, mostly composed of a suite of simulatable environments, are insufficient to push cutting-edge research works because they lack object-level topological and geometric variations, are not based on fully dynamic simulation, or are short of native support for multiple types of manipulation tasks. To this end, we present ManiSkill2, the next generation of the SAPIEN ManiSkill benchmark, to address critical pain points often encountered by researchers when using benchmarks for generalizable manipulation skills. ManiSkill2 includes 20 manipulation task families with 2000+ object models and 4M+ demonstration frames, which cover stationary/mobile-base, single/dual-arm, and rigid/soft-body manipulation tasks with 2D/3D-input data simulated by fully dynamic engines. It defines a unified interface and evaluation protocol to support a wide range of algorithms (e.g., classic sense-plan-act, RL, IL), visual observations (point cloud, RGBD), and controllers (e.g., action type and parameterization). Moreover, it empowers fast visual input learning algorithms so that a CNN-based policy can collect samples at about 2000 FPS with 1 GPU and 16 processes on a regular workstation. It implements a render server infrastructure to allow sharing rendering resources across all environments, thereby significantly reducing memory usage. We open-source all codes of our benchmark (simulator, environments, and baselines) and host an online challenge open to interdisciplinary researchers.

Forward citations

Cited by 41 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies
cs.RO 2026-04 unverdicted novelty 8.0

RoboLab is a new simulation benchmark with 120 tasks across visual, procedural, and relational axes that quantifies generalization gaps and perturbation sensitivity in task-generalist robotic policies.
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
cs.AI 2023-06 conditional novelty 8.0

LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.
ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models
cs.RO 2026-06 unverdicted novelty 7.0

ForesightSafety-VLA is a new benchmark with 13 safety categories, cumulative cost and risk exposure metrics, and controlled variations to diagnose safety failures in VLA models rather than aggregate task success.
ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models
cs.RO 2026-06 unverdicted novelty 7.0

ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language chan...
PhysEditWorld: A Large-Scale Dataset Toward Physics-Editable World Models
cs.CV 2026-06 unverdicted novelty 7.0

PhysEditWorld is a new dataset of over 60 million frames from 12 UE5 cinematic scenes with synchronized multimodal signals and explicit gravity labels, built via replay to support physics-editable world models.
Stealthy World Model Manipulation via Data Poisoning
cs.LG 2026-06 unverdicted novelty 7.0

SWAAP is the first two-stage poisoning framework that identifies a harmful target world model via bilevel optimization and realizes it through stealth-constrained gradient matching on a limited fraction of fine-tuning...
HARBOR: A Harness Framework for Agentic Robot Reinforcement Learning
cs.RO 2026-06 unverdicted novelty 7.0

HARBOR is a new agentic harness framework that automates robot RL workflows end-to-end across 16 tasks in manipulation, locomotion, and dexterous control, matching or exceeding default configurations while enabling si...
Support-Safe Variational Hybrid Filtering for Contact-Mode and Sparse-Law Recovery
cs.RO 2026-05 unverdicted novelty 7.0

VHYDRO is a support-safe variational hybrid filter that jointly recovers continuous latent states, discrete contact modes, and sparse port-Hamiltonian laws per regime while preventing loss of feasible transitions.
Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
cs.RO 2026-05 unverdicted novelty 7.0

Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.
AffordSim: A Scalable Data Generator and Benchmark for Affordance-Aware Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 7.0

AffordSim is the first simulation framework integrating open-vocabulary 3D affordance detection into scalable manipulation data generation, with a 50-task benchmark showing imitation learning succeeds on grasping but ...
ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs
cs.RO 2026-02 unverdicted novelty 7.0

ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
An Embodied Simulation Platform, Benchmark, and Data-Efficient Augmentation Framework for Wet-Lab Robotics
cs.RO 2026-06 unverdicted novelty 6.0

Pipette supplies an open wet-lab simulation platform, 11-task benchmark, and perturbation-based augmentation pipeline that raises VLA success rates on sample handling and device tasks from limited demonstrations.
FATE-VLA: Failure-aware test generation for vision-language-action models
cs.RO 2026-06 unverdicted novelty 6.0

FATE-VLA reframes VLA evaluation as active failure discovery and reports uncovering up to 29.7% more failures across four models while revealing diverse failure modes.
InfoAtlas: A Foundation Model for Zero-Shot Statistical Dependence Estimate
cs.LG 2026-05 unverdicted novelty 6.0

InfoAtlas is a pretrained neural model for zero-shot mutual information estimation that matches state-of-the-art accuracy with 100x speedup and handles varying dimensions via a single model.
RoboWits: Unexpected Challenges for Robotic Creative Problem Solving
cs.RO 2026-05 unverdicted novelty 6.0

RoboWits benchmark with 238 tasks shows pre-trained VLAs succeed on seed tasks but fail on mutated ones, highlighting brittleness in reasoning.
DexHoldem: Playing Texas Hold'em with Dexterous Embodied System
cs.RO 2026-05 unverdicted novelty 6.0

DexHoldem is a new benchmark providing 1,470 teleoperated demonstrations across 14 manipulation primitives, plus standardized tests for dexterous policy execution and agentic perception in a physical Texas Hold'em setting.
Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
cs.RO 2026-05 unverdicted novelty 6.0

Pace-and-Path Correction is a closed-form inference-time operator that decomposes a quadratic cost minimization into orthogonal pace compression and path offset channels to correct dynamics-blindness in chunked-action...
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 6.0

ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
cs.RO 2026-04 unverdicted novelty 6.0

A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.
AffordSim: A Scalable Data Generator and Benchmark for Affordance-Aware Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

AffordSim integrates open-vocabulary 3D affordance prediction into simulation trajectory generation to create a 50-task benchmark that reaches 93% of manual annotation success rates and enables 24% average zero-shot s...
RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies
cs.RO 2026-04 unverdicted novelty 6.0

RoboLab is a photorealistic simulation benchmark with 120 tasks and perturbation analysis to evaluate true generalization and robustness of robotic foundation models.
SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds
cs.RO 2026-04 unverdicted novelty 6.0

SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformabl...
VideoPhy: Evaluating Physical Commonsense for Video Generation
cs.CV 2024-06 conditional novelty 6.0

VideoPhy benchmark shows state-of-the-art text-to-video models follow physical commonsense and text prompts in only 39.6% of cases for the best model.
RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots
cs.RO 2024-06 unverdicted novelty 6.0

RoboCasa supplies a large-scale kitchen simulator, generative assets, 100 tasks, and automated data pipelines that produce a clear scaling trend in imitation learning for generalist robots.
PhysEditWorld: A Large-Scale Dataset Toward Physics-Editable World Models
cs.CV 2026-06 unverdicted novelty 5.0

PhysEditWorld supplies 12 UE5 scenes, 60+ million frames, and explicit gravity labels via a replay paradigm to support gravity-faithful and physically editable world models.
LaST-HD: Learning Latent Physical Reasoning from Scalable Human Data for Robot Manipulation
cs.RO 2026-06 unverdicted novelty 5.0

LaST-HD creates a shared latent dynamics space via a world model to transfer physical reasoning from scalable human-hand demonstrations to robots, achieving over 90% accuracy with 20 minutes of new data after mixed training.
MagicSim: A Unified Infrastructure for Executable Embodied Interaction
cs.RO 2026-06 unverdicted novelty 5.0

MagicSim is a unified embodied interaction infrastructure built on a deterministic batched runtime and shared MDP that supports diverse world construction, execution, task evaluation, automatic rollout generation, and...
LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories
cs.CL 2026-06 unverdicted novelty 5.0

LabVLA uses RoboGenesis simulation data and a two-stage FAST pretraining plus flow matching recipe on a Qwen3-VL backbone to achieve the highest success rates on LabUtopia under in- and out-of-distribution conditions.
Test-Time Scaling in Multimodal Foundation Models: A Comprehensive Survey of Generation and Reasoning
cs.CV 2026-06 unverdicted novelty 5.0

A survey of test-time scaling for multimodal foundation models that introduces a three-way taxonomy of sampling, feedback, and search approaches along with applications and benchmarks.
Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization
cs.LG 2026-05 unverdicted novelty 5.0

MBDPO reformulates policy optimization as a diffusion process over searched trajectories in latent world models to reduce misalignment between search and value learning.
Learning Structural Latent Points for Efficient Visual Representations in Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 5.0

A hybrid structural latent points representation is learned by inserting a point-wise latent VAE into a point-cloud autoencoder and regularizing toward a Gaussian prior, paired with a lightweight 3DGS rendering pipeli...
E$^2$DT: Efficient and Effective Decision Transformer with Experience-Aware Sampling for Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 5.0

E²DT couples a Decision Transformer with a k-Determinantal Point Process that scores trajectories on return-to-go quantiles, predictive uncertainty, and stage coverage to improve sample efficiency and policy quality i...
R3D: Revisiting 3D Policy Learning
cs.CV 2026-04 unverdicted novelty 5.0

A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.
EmbodiedClaw: Conversational Workflow Execution for Embodied AI Development
cs.RO 2026-04 unverdicted novelty 5.0

EmbodiedClaw automates embodied AI development workflows through conversation, reducing manual effort and improving consistency and reproducibility.
From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data
cs.RO 2026-04 accept novelty 5.0

A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.
MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations
cs.RO 2023-10 unverdicted novelty 5.0

MimicGen creates over 50K robot demonstrations from roughly 200 human ones, allowing imitation learning to achieve strong performance on complex long-horizon tasks like assembly and coffee preparation.
Co-policy: Responsive Human-Robot Co-Creation for Musical Performances
cs.RO 2026-06 unverdicted novelty 4.0

Co-policy framework uses a fine-tuned Qwen-vl planner for semantic plans and a Gaussian-mixture visuomotor policy for low-latency robot actions to enable constrained human-robot musical co-creation.
GE-Sim 2.0: A Roadmap Towards Comprehensive Closed-loop Video World Simulators for Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 4.0

GE-Sim 2.0 is a video-based closed-loop simulator for robotic manipulation that adds state expert, world judge, and acceleration modules on top of prior video generation to support policy learning and evaluation.
Silent Failures in Physical AI: A Literature Review of Runtime Action Authorization for Autonomous Systems
cs.RO 2026-05 unverdicted novelty 4.0

A literature review that defines silent physical-action failures in Physical AI and identifies the lack of complete runtime authorization boundaries across surveyed technical streams.
World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
World Action Models: A Survey
cs.RO 2026-06 unverdicted novelty 3.0

A survey that clarifies boundaries and organizes World Action Models by generation requirements and predictive substrates, identifying a trend toward generating less of the future.