ManiSkill: Generalizable Manipulation Skill Benchmark with Large-Scale Demonstrations
Object manipulation from 3D visual inputs poses many challenges for building generalizable perception and policy models. However, 3D assets in existing benchmarks mostly lack the diversity of 3D shapes that align with real-world intra-class complexity in topology and geometry. Here we propose the SAPIEN Manipulation Skill Benchmark (ManiSkill) to benchmark manipulation skills over diverse objects in a full-physics simulator. 3D assets in ManiSkill include large intra-class topological and geometric variations. Tasks are carefully chosen to cover distinct types of manipulation challenges. Recent progress in 3D vision also leads us to customize the benchmark so that the challenge is inviting to researchers working on 3D deep learning. To this end, we simulate a moving panoramic camera that returns ego-centric point clouds or RGB-D images. We also want ManiSkill to serve a broad set of researchers interested in manipulation research. Besides supporting the learning of policies from interactions, we support learning-from-demonstrations (LfD) methods by providing a large number of high-quality demonstrations (~36,000 successful trajectories, ~1.5M point cloud/RGB-D frames in total). We provide baselines using 3D deep learning and LfD algorithms. All code of our benchmark (simulator, environment, SDK, and baselines) is open-sourced, and a challenge aimed at interdisciplinary researchers will be held based on the benchmark.
Forward citations
Cited by 29 Pith papers
-
BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation
BEHAVIOR-1K introduces a benchmark of 1,000 human everyday activities in realistic simulated scenes together with the OMNIGIBSON physics simulator to evaluate embodied AI.
-
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.
-
EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies
EBench is a benchmark that evaluates generalist mobile manipulation policies on 26 tasks across 5 capability and 4 generalization dimensions, revealing distinct capability profiles among models with similar success rates.
-
SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects
SceneCode compiles natural language prompts into executable code programs that generate editable, articulated indoor scenes for physics simulation.
-
Dynamic Execution Commitment of Vision-Language-Action Models
A3 reframes dynamic action chunk commitment in VLA models as self-speculative prefix verification, accepting the longest continuous sequence of actions that satisfies consensus-ordered conditional invariance and prefi...
-
Dynamic Execution Commitment of Vision-Language-Action Models
A3 determines the execution horizon in VLA models as the longest prefix of actions that passes consensus-based verification and sequential consistency checks.
-
Action-to-Action Flow Matching
A2A flow matching starts action generation from prior proprioceptive actions in latent space to enable single-step high-quality predictions in robotic policies.
-
Rodrigues Network for Learning Robot Actions
Proposes Rodrigues Network using a learnable Neural Rodrigues Operator to add kinematic inductive biases for improved robot action learning and prediction.
-
Inference-Time Robot Behavior Steering through Physically-Aware Reconfiguration of Task-Structure
ReStruct steers robot policies at inference time by reconfiguring task structure with neural automata and synchronous products, claiming up to 25% gains over VLA models in success and preference adherence.
-
AnnotateAnything: Automatic Annotation of 3D Assets for Robot Manipulation
AnnotateAnything converts passive 3D assets into manipulation-ready assets by combining vision-language reasoning for semantics with parallel physics pipelines for executable action annotations such as grasps and arti...
-
Efficient Skill Grounding via Code Refactoring with Small Language Models
RECENT decouples skill semantics from embodiment-specific bindings via code refactoring to let small language models achieve skill grounding performance matching large language model baselines.
-
X4Val: Learning Neural Surrogates for Variance-Reduced Policy Evaluation
X4Val learns transferable neural predictors from non-paired multi-domain data and incorporates them into control-variates estimators to reduce variance in real-world robotic policy evaluation by up to 38.4%.
-
FLASH: Efficient Visuomotor Policy via Sparse Sampling
FLASH Policy uses sparse Legendre polynomial trajectory fitting and history-anchored flow matching to enable single-step inference for visuomotor control, reporting 31.4 ms per-episode latency and >=92% success on fiv...
-
Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation
VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.
-
Learning While Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
LWD is a fleet-scale offline-to-online RL framework that continually improves pretrained VLA policies using autonomous rollouts and human interventions, reaching 95% average success on real-world manipulation tasks.
-
Learning While Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
Fleet-scale RL framework improves a single generalist VLA policy from deployment data to 95% average success on eight real-world manipulation tasks with 16 dual-arm robots.
-
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.
-
Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models
State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks...
-
RoboPlayground: Democratizing Robotic Evaluation through Structured Physical Domains
RoboPlayground reframes robotic manipulation evaluation as a language-driven process over structured physical domains, letting users author varied yet reproducible tasks that reveal policy generalization failures.
-
Emergent Neural Automaton Policies: Learning Symbolic Structure from Visuomotor Trajectories
ENAP extracts an emergent Mealy automaton from visuomotor trajectories to act as a high-level planner for a low-level residual policy, yielding up to 27% higher success than end-to-end VLA policies in low-data regimes.
-
MARS Policy: Multimodality Only When It Matters
MARS policy adaptively activates multimodal generation only when beneficial in robotic tasks, claiming 16.67% higher success and 83.20% lower inference latency than baselines in real-world tests.
-
Dynamic Execution Commitment of Vision-Language-Action Models
A3 adaptively selects verifiable action prefixes in VLA models using group-sampled consensus and conditional re-decoding to balance robustness and speed without manual horizon tuning.
-
Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies
Unsupervised behavioral mode discovery combined with mutual information rewards enables RL fine-tuning of multimodal generative policies that achieves higher success rates without losing action diversity.
-
CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment
CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer ...
-
LLM-Guided Task- and Affordance-Level Exploration in Reinforcement Learning
LLM-TALE steers RL exploration using LLM-generated plans at task and affordance levels with online suboptimality correction, improving sample efficiency and success rates on pick-and-place tasks without human supervision.
-
A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation
Multi-task pretraining of diffusion policies on diverse robot data produces more successful, robust, and data-efficient policies for dexterous manipulation than single-task baselines, with performance scaling with pre...
-
MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations
MimicGen creates over 50K robot demonstrations from roughly 200 human ones, allowing imitation learning to achieve strong performance on complex long-horizon tasks like assembly and coffee preparation.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
-
Agent AI: Surveying the Horizons of Multimodal Interaction
The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.