RoboLab is a new simulation benchmark with 120 tasks across visual, procedural, and relational axes that quantifies generalization gaps and perturbation sensitivity in task-generalist robotic policies.
hub Mixed citations
Maniskill2: A unified benchmark for generalizable manipulation skills
Mixed citation behavior. Most common role is background (50%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.
ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.
PhysEditWorld is a new dataset of over 60 million frames from 12 UE5 cinematic scenes with synchronized multimodal signals and explicit gravity labels, built via replay to support physics-editable world models.
SWAAP is the first two-stage poisoning framework that identifies a harmful target world model via bilevel optimization and realizes it through stealth-constrained gradient matching on a limited fraction of fine-tuning transitions.
HARBOR is a new agentic harness framework that automates robot RL workflows end-to-end across 16 tasks in manipulation, locomotion, and dexterous control, matching or exceeding default configurations while enabling sim-to-real transfer.
VHYDRO is a support-safe variational hybrid filter that jointly recovers continuous latent states, discrete contact modes, and sparse port-Hamiltonian laws per regime while preventing loss of feasible transitions.
Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.
ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
Pipette supplies an open wet-lab simulation platform, 11-task benchmark, and perturbation-based augmentation pipeline that raises VLA success rates on sample handling and device tasks from limited demonstrations.
FATE-VLA reframes VLA evaluation as active failure discovery and reports uncovering up to 29.7% more failures across four models while revealing diverse failure modes.
InfoAtlas is a pretrained neural model for zero-shot mutual information estimation that matches state-of-the-art accuracy with 100x speedup and handles varying dimensions via a single model.
RoboWits benchmark with 238 tasks shows pre-trained VLAs succeed on seed tasks but fail on mutated ones, highlighting brittleness in reasoning.
DexHoldem is a new benchmark providing 1,470 teleoperated demonstrations across 14 manipulation primitives, plus standardized tests for dexterous policy execution and agentic perception in a physical Texas Hold'em setting.
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.
AffordSim integrates open-vocabulary 3D affordance prediction into simulation trajectory generation to create a 50-task benchmark that reaches 93% of manual annotation success rates and enables 24% average zero-shot success on a real Franka FR3.
SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformable manipulation.
VideoPhy benchmark shows state-of-the-art text-to-video models follow physical commonsense and text prompts in only 39.6% of cases for the best model.
RoboCasa supplies a large-scale kitchen simulator, generative assets, 100 tasks, and automated data pipelines that produce a clear scaling trend in imitation learning for generalist robots.
MagicSim is a unified embodied interaction infrastructure built on a deterministic batched runtime and shared MDP that supports diverse world construction, execution, task evaluation, automatic rollout generation, and interactive agent interfaces.
LabVLA uses RoboGenesis simulation data and a two-stage FAST pretraining plus flow matching recipe on a Qwen3-VL backbone to achieve the highest success rates on LabUtopia under in- and out-of-distribution conditions.
A survey of test-time scaling for multimodal foundation models that introduces a three-way taxonomy of sampling, feedback, and search approaches along with applications and benchmarks.
MBDPO reformulates policy optimization as a diffusion process over searched trajectories in latent world models to reduce misalignment between search and value learning.
citing papers explorer
-
RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies
RoboLab is a new simulation benchmark with 120 tasks across visual, procedural, and relational axes that quantifies generalization gaps and perturbation sensitivity in task-generalist robotic policies.
-
ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models
ForesightSafety-VLA creates a diagnostic benchmark for VLA safety with taxonomy across physical, language, and visual risks, showing perception and structure variations cause more safety degradation than language changes in tested models.
-
PhysEditWorld: A Large-Scale Dataset Toward Physics-Editable World Models
PhysEditWorld is a new dataset of over 60 million frames from 12 UE5 cinematic scenes with synchronized multimodal signals and explicit gravity labels, built via replay to support physics-editable world models.
-
Stealthy World Model Manipulation via Data Poisoning
SWAAP is the first two-stage poisoning framework that identifies a harmful target world model via bilevel optimization and realizes it through stealth-constrained gradient matching on a limited fraction of fine-tuning transitions.
-
HARBOR: A Harness Framework for Agentic Robot Reinforcement Learning
HARBOR is a new agentic harness framework that automates robot RL workflows end-to-end across 16 tasks in manipulation, locomotion, and dexterous control, matching or exceeding default configurations while enabling sim-to-real transfer.
-
Support-Safe Variational Hybrid Filtering for Contact-Mode and Sparse-Law Recovery
VHYDRO is a support-safe variational hybrid filter that jointly recovers continuous latent states, discrete contact modes, and sparse port-Hamiltonian laws per regime while preventing loss of feasible transitions.
-
Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.
-
ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs
ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
-
An Embodied Simulation Platform, Benchmark, and Data-Efficient Augmentation Framework for Wet-Lab Robotics
Pipette supplies an open wet-lab simulation platform, 11-task benchmark, and perturbation-based augmentation pipeline that raises VLA success rates on sample handling and device tasks from limited demonstrations.
-
FATE-VLA:Failue-aware test generation for vision-language-action models
FATE-VLA reframes VLA evaluation as active failure discovery and reports uncovering up to 29.7% more failures across four models while revealing diverse failure modes.
-
InfoAtlas: A Foundation Model for Zero-Shot Statistical Dependence Estimate
InfoAtlas is a pretrained neural model for zero-shot mutual information estimation that matches state-of-the-art accuracy with 100x speedup and handles varying dimensions via a single model.
-
RoboWits: Unexpected Challenges for Robotic Creative Problem Solving
RoboWits benchmark with 238 tasks shows pre-trained VLAs succeed on seed tasks but fail on mutated ones, highlighting brittleness in reasoning.
-
DexHoldem: Playing Texas Hold'em with Dexterous Embodied System
DexHoldem is a new benchmark providing 1,470 teleoperated demonstrations across 14 manipulation primitives, plus standardized tests for dexterous policy execution and agentic perception in a physical Texas Hold'em setting.
-
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
-
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.
-
AffordSim: A Scalable Data Generator and Benchmark for Affordance-Aware Robotic Manipulation
AffordSim integrates open-vocabulary 3D affordance prediction into simulation trajectory generation to create a 50-task benchmark that reaches 93% of manual annotation success rates and enables 24% average zero-shot success on a real Franka FR3.
-
SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds
SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformable manipulation.
-
RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots
RoboCasa supplies a large-scale kitchen simulator, generative assets, 100 tasks, and automated data pipelines that produce a clear scaling trend in imitation learning for generalist robots.
-
MagicSim: A Unified Infrastructure for Executable Embodied Interaction
MagicSim is a unified embodied interaction infrastructure built on a deterministic batched runtime and shared MDP that supports diverse world construction, execution, task evaluation, automatic rollout generation, and interactive agent interfaces.
-
LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories
LabVLA uses RoboGenesis simulation data and a two-stage FAST pretraining plus flow matching recipe on a Qwen3-VL backbone to achieve the highest success rates on LabUtopia under in- and out-of-distribution conditions.
-
Test-Time Scaling in Multimodal Foundation Models: A Comprehensive Survey of Generation and Reasoning
A survey of test-time scaling for multimodal foundation models that introduces a three-way taxonomy of sampling, feedback, and search approaches along with applications and benchmarks.
-
Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization
MBDPO reformulates policy optimization as a diffusion process over searched trajectories in latent world models to reduce misalignment between search and value learning.
-
Learning Structural Latent Points for Efficient Visual Representations in Robotic Manipulation
A hybrid structural latent points representation is learned by inserting a point-wise latent VAE into a point-cloud autoencoder and regularizing toward a Gaussian prior, paired with a lightweight 3DGS rendering pipeline, yielding gains on RLBench and ManiSkill2 benchmarks.
-
E$^2$DT: Efficient and Effective Decision Transformer with Experience-Aware Sampling for Robotic Manipulation
E²DT couples a Decision Transformer with a k-Determinantal Point Process that scores trajectories on return-to-go quantiles, predictive uncertainty, and stage coverage to improve sample efficiency and policy quality in robotic manipulation.
-
R3D: Revisiting 3D Policy Learning
A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.
-
EmbodiedClaw: Conversational Workflow Execution for Embodied AI Development
EmbodiedClaw automates embodied AI development workflows through conversation, reducing manual effort and improving consistency and reproducibility.
-
MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations
MimicGen creates over 50K robot demonstrations from roughly 200 human ones, allowing imitation learning to achieve strong performance on complex long-horizon tasks like assembly and coffee preparation.
-
GE-Sim 2.0: A Roadmap Towards Comprehensive Closed-loop Video World Simulators for Robotic Manipulation
GE-Sim 2.0 is a video-based closed-loop simulator for robotic manipulation that adds state expert, world judge, and acceleration modules on top of prior video generation to support policy learning and evaluation.
-
Silent Failures in Physical AI: A Literature Review of Runtime Action Authorization for Autonomous Systems
A literature review that defines silent physical-action failures in Physical AI and identifies the lack of complete runtime authorization boundaries across surveyed technical streams.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.