SIMPLE: Simulation-Based Policy Learning and Evaluation for Humanoid Loco-manipulation

Di Huang; Hongyi Jing; Jie Liu; Junjie Ye; Junkai Xia; Liang Heng; Michael Leong; Songlin Wei; Xiawei Liu; Yue Wang

arxiv: 2606.08278 · v1 · pith:BJUCTEEOnew · submitted 2026-06-06 · 💻 cs.RO

Songlin Wei, Zhenhao Ni, Jie Liu, Zhenyu Zhao, Junjie Ye, Hongyi Jing, Junkai Xia, Xiawei Liu, Michael Leong, Liang Heng, Di Huang, Yue Wang

Pith reviewed 2026-06-27 19:27 UTC · model grok-4.3

classification 💻 cs.RO

keywords: humanoid robotics, simulation benchmark, loco-manipulation, policy learning, sim-to-real transfer, zero-shot transfer, whole-body control, data generation pipelines

The pith

A new simulation testbed couples accurate physics with photorealistic rendering to let humanoid loco-manipulation policies transfer zero-shot to physical robots.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SIMPLE as a unified simulation environment that combines contact-rich dynamics with visual realism to benchmark and train policies for whole-body humanoid tasks. It supplies 60 tasks, 50 scenes, and over 1,000 objects, along with automated motion planning and VR teleoperation pipelines for large-scale data collection. Benchmarking of imitation networks, vision-language-action models, and world action models shows strong correlation between simulation performance and real-world results. The central result is that policies trained entirely on data from this environment can be deployed directly on physical humanoid robots in comparable settings without additional real-world training.

Core claim

SIMPLE couples accurate contact-rich dynamics with photorealistic rendering to create a large-scale testbed of 60 diverse whole-body tasks across 50 indoor scenes and more than 1,000 object assets. Data generation via motion planning and low-latency VR teleoperation supports scalable collection, and experiments with imitation networks, vision-language-action models, and world action models establish a strong correlation between policy performance in simulation and the real world. Policies trained on data collected in SIMPLE transfer zero-shot to physical humanoid robots under similar settings.

What carries the argument

The SIMPLE testbed that unifies contact-rich physics simulation with photorealistic rendering and supplies automated trajectory generation plus VR teleoperation for policy data collection and evaluation.

If this is right

Humanoid policy development can scale through simulation data collection without constant physical robot access.
Performance in the testbed predicts real-world behavior, allowing filtered evaluation before hardware deployment.
Reproducible benchmarks become available for comparing lightweight imitation networks against large vision-language-action and world action models.
Zero-shot transfer reduces the volume of real-world data needed to reach functional loco-manipulation capabilities.
A shared foundation supports community-wide progress on humanoid foundation models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the correlation holds across wider task distributions, the testbed could serve as a proxy for regulatory or safety validation of humanoid behaviors.
Extending the same coupling of dynamics and rendering to additional robot morphologies might accelerate sim-to-real work outside humanoids.
Incorporating more varied lighting, sensor noise, or deformable objects into the testbed would test the robustness of the observed transfer.
Longer-horizon tasks or multi-robot coordination could be added to probe whether the current correlation persists at greater complexity.

Load-bearing premise

The simulation environment captures contact-rich dynamics, visual conditions, and task distributions of the physical world closely enough that performance correlations and zero-shot transfers observed in tested settings will hold more generally.

What would settle it

Policies trained to high performance in SIMPLE that then exhibit substantially lower success rates when run zero-shot on physical humanoid robots executing the same loco-manipulation tasks under matched conditions.

Figures

Figures reproduced from arXiv: 2606.08278 by Di Huang, Hongyi Jing, Jie Liu, Junjie Ye, Junkai Xia, Liang Heng, Michael Leong, Songlin Wei, Xiawei Liu, Yue Wang, Zhenhao Ni, Zhenyu Zhao.

**Figure 1.** Humanoid Loco-manipulation in SIMPLE. We introduce a comprehensive simulation benchmark designed to standardize the evaluation and training of humanoid foundation models. The framework features 60 diverse whole-body tasks across 50 indoor scenes, utilizing over 1,000 object assets. By coupling MuJoCo’s robust contact physics with Isaac Sim’s photorealistic rendering, SIMPLE provides built-in data collectio…

**Figure 2.** System Pipeline. Our pipeline consists of three stages: (1) data generation in MuJoCo via motion planning and teleoperation; (2) offline replay and rendering in Isaac Sim to obtain photorealistic visual observations; and (3) policy evaluation under diverse domain-randomized settings.

3 Method. We present SIMPLE, a comprehensive simulation infrastructure designed for humanoid loco-manipulation. The framewor…

**Figure 3.** System Diagram. MuJoCo simulates physical interactions, while Isaac Sim synchronizes states and renders photorealistic images for policy inference.

SIMPLE employs a dual-simulator architecture that strictly decouples physical simulation from visual rendering. As shown in …

**Figure 4.** Automated Motion Planning. Based on task decomposition, scripted policies coordinate upper-body manipulation via motion planning and lower-body movement to generate automated demonstrations.

3.2 Large-Scale Asset Curation and Task Design. To ensure sufficient diversity for training robust vision-language-action models, we curated a massive library of objects and environments. We imported 53 objects from Gr…

**Figure 5.** Sim-to-Real Task Observations. Each task is shown as an adjacent simulation–real pair, with the simulated egocentric view on the left and the corresponding real egocentric view on the right. The left pair shows the pick-and-place task, and the right pair shows the handover task.

5 Conclusion. We presented SIMPLE, a full-stack simulation infrastructure designed to standardize the evaluation and training of …

**Figure 6.** Class Diagram of the SIMPLE Framework. BaseDualSimEnv owns two Simulator instances and a Task, and exposes a standard Gym interface. Task composes a DomainRandomizer and a Robot and drives scene randomization at each reset. The Agent hierarchy (teleoperation, motion planning, remote inference) shares a unified get action interface, while whole-body controllers (AmoWBC, SonicWBC, SonicDecoupledWBC) transla…

**Figure 7.** Sequential Diagram of a Single env.step() Call. Each invocation of env.step(action) (line 8 of Algorithm 1) triggers the following sequence: MujocoSimulator advances the physics loop at 500 Hz and writes the updated robot and object states to a shared buffer; IsaacSimSimulator reads those states, renders photorealistic RGB frames at 50 Hz, and returns the visual observation; BaseDualSim then queries the Ta…
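The dual-rate stepping in the Figure 7 caption can be sketched as follows, using the 500 Hz physics and 50 Hz rendering rates stated there. This is only an illustration: `SharedState`, `ToyPhysics`, `ToyRenderer`, and `ToyDualSimEnv` are hypothetical placeholder names, the dynamics are a 1-D toy, and no MuJoCo or Isaac Sim calls are made. It is not SIMPLE's actual API.

```python
from dataclasses import dataclass, field

PHYSICS_HZ = 500  # physics rate given in the Figure 7 caption
RENDER_HZ = 50    # rendering rate given in the Figure 7 caption
SUBSTEPS = PHYSICS_HZ // RENDER_HZ  # physics ticks per rendered frame


@dataclass
class SharedState:
    """Hypothetical stand-in for the buffer shared by the two simulators."""
    time: float = 0.0
    qpos: list = field(default_factory=lambda: [0.0])


class ToyPhysics:
    """Placeholder physics side: integrates a 1-D position."""

    def __init__(self, buffer):
        self.buffer = buffer
        self.ticks = 0

    def advance(self, action):
        dt = 1.0 / PHYSICS_HZ
        self.buffer.qpos[0] += action * dt  # action treated as a velocity command
        self.buffer.time += dt
        self.ticks += 1


class ToyRenderer:
    """Placeholder rendering side: reads the shared state, returns a fake frame."""

    def __init__(self, buffer):
        self.buffer = buffer
        self.frames = 0

    def render(self):
        self.frames += 1
        return {"rgb": f"frame@{self.buffer.time:.3f}s", "qpos": list(self.buffer.qpos)}


class ToyDualSimEnv:
    """One step() = several physics ticks, then exactly one rendered observation."""

    def __init__(self):
        self.buffer = SharedState()
        self.physics = ToyPhysics(self.buffer)
        self.renderer = ToyRenderer(self.buffer)

    def step(self, action):
        for _ in range(SUBSTEPS):
            self.physics.advance(action)
        return self.renderer.render()


env = ToyDualSimEnv()
for _ in range(50):  # 50 policy steps, about one simulated second
    obs = env.step(1.0)
print(env.physics.ticks, env.renderer.frames)
```

Keeping the physics and rendering objects separate, with a buffer between them, mirrors the decoupling the caption describes: the renderer only ever reads state and never influences the dynamics.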

**Figure 8.** Offline Object Preprocessing Pipeline. Raw mesh assets from Objaverse and GraspNet1B are processed for two purposes: (1) for physics, CoACD convex decomposition generates MuJoCo collision geometries, and stable resting poses are determined by dropping each object in simulation; (2) for rendering, meshes are converted to USD format with high-resolution PBR textures for Isaac Sim. BoDex [19] synthesizes d…

**Figure 9.** Task Distribution. Distribution of all tasks across three types: basic rigid pick-and-place, articulated object manipulation, and whole-body loco-manipulation.

… 50 Hz. Because SONIC tracks the operator’s whole-body motion as a unified trajectory rather than handling each segment independently, it tends to produce smoother and more natural whole-body demonstrations, particularly for tasks requiring coordinat…

Original abstract

Humanoid foundation models are advancing faster than we can evaluate them. While real-world testing is expensive and difficult to reproduce, existing simulation benchmarks focus primarily on table-top or wheeled robots. A scalable and reproducible benchmark for whole-body humanoid loco-manipulation remains an open problem. To this end, we present SIMPLE, a unified simulation testbed for humanoid policy learning and evaluation. SIMPLE couples the accurate contact-rich dynamics of MuJoCo with the photorealistic rendering of IsaacSim. It provides a large-scale environment comprising 60 diverse whole-body tasks, 50 indoor scenes, and over 1,000 object assets. To facilitate scalable data collection, the framework integrates two data generation pipelines: automated trajectory generation via motion planning and a low-latency VR teleoperation interface. We further integrate and benchmark mainstream humanoid policies at scale in SIMPLE, including lightweight imitation networks, large vision-language-action (VLA) models, and recent world action models (WAMs). Our experiments reveal a strong correlation between policy performance in simulation and the real world. Furthermore, we demonstrate that policies trained on data collected in SIMPLE can be transferred zero-shot to physical humanoid robots under similar settings, providing a robust and reproducible foundation for humanoid robotics research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

SIMPLE gives a scaled-up dual-sim benchmark for humanoid loco-manipulation but its zero-shot transfer claims lack the fidelity metrics needed to back them.

The paper's main contribution is a new testbed that pairs MuJoCo contact dynamics with IsaacSim rendering across 60 whole-body tasks, 50 scenes, and 1000+ objects, plus motion-planning and VR data pipelines. That scale and the explicit dual-simulator split look new relative to the narrower benchmarks it cites.

It does the practical parts well. The data-generation tools are straightforward to use at volume, and running imitation, VLA, and world-action models through the same environment lets people compare them directly. For groups already working on humanoid policy learning, having one place to generate trajectories and render visuals at this size is a clear step up from stitching separate tools.

The soft spot is the transfer story. The abstract states a strong sim-to-real correlation and successful zero-shot deployment on physical robots, yet supplies no contact-force matching numbers, friction calibration, domain-randomization schedule, or even basic error bars. In contact-rich loco-manipulation those details decide whether the sim is actually predictive; without them the claim stays untested. The stress-test note on unquantified fidelity is on target.

This is for researchers who need a shared, reproducible environment to iterate humanoid policies before hardware time. Anyone already running sim-to-real experiments or building large task suites will get immediate use from the task list and pipelines.

The work is coherent on its own terms and addresses a real gap, so it should go to peer review. The benchmark itself is worth referee time even if the transfer results need more evidence.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SIMPLE, a hybrid simulation testbed for humanoid loco-manipulation that couples MuJoCo contact dynamics with IsaacSim photorealistic rendering. It comprises 60 whole-body tasks across 50 indoor scenes and >1000 object assets, with data pipelines using motion planning and VR teleoperation. The work benchmarks imitation networks, VLAs, and world action models, claiming a strong correlation between simulated and real-world policy performance plus successful zero-shot transfer of policies trained in SIMPLE to physical humanoid robots.

Significance. If the reported correlation and zero-shot transfers are substantiated with quantitative controls, this benchmark would address a clear gap in scalable, reproducible evaluation for whole-body humanoid policies, where real-world testing is costly. The scale (tasks, scenes, assets) and dual data-generation pipelines are strengths that could accelerate research if the sim fidelity supports the transfer claims.

major comments (2)

[Abstract and Experiments] Abstract and Experiments section: the central claim of 'strong correlation' between simulation and real-world performance plus zero-shot transfer is asserted without any reported quantitative metrics (success rates, correlation coefficients, error bars), task definitions, or experimental controls, preventing assessment of whether the results support the modeling assumptions.
[Experiments] Experiments section: the zero-shot transfer claim rests on the unquantified fidelity of MuJoCo+IsaacSim for contact-rich loco-manipulation dynamics; no contact force matching metrics, friction identification procedure, or domain randomization details are supplied to validate that small contact errors do not prevent transfer in whole-body tasks.
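The correlation coefficients requested above could be computed along these lines. This is a sketch in plain Python, not the authors' analysis. The per-policy success rates are made-up placeholders, not results from the paper, and the function names are illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Placeholder success rates for five hypothetical policies (NOT from the paper).
sim_success = [0.90, 0.75, 0.60, 0.40, 0.20]
real_success = [0.85, 0.70, 0.50, 0.45, 0.10]

print(f"Pearson r    = {pearson(sim_success, real_success):.3f}")
print(f"Spearman rho = {spearman(sim_success, real_success):.3f}")
```

Reporting both is useful because Pearson measures linear agreement of the rates, while Spearman only asks whether simulation ranks the policies in the same order as the real robot, which is often the property a benchmark needs.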

minor comments (2)

[Related Work] Related Work: the discussion of prior simulation benchmarks for table-top or wheeled robots would benefit from explicit citations and a table comparing task coverage and fidelity features.
[Figures] Figure captions: several figures showing policy rollouts lack labels for success/failure criteria or quantitative performance overlays.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where quantitative support for our claims can be strengthened. We address each major comment below and will revise the manuscript to incorporate the requested details.

Referee: [Abstract and Experiments] Abstract and Experiments section: the central claim of 'strong correlation' between simulation and real-world performance plus zero-shot transfer is asserted without any reported quantitative metrics (success rates, correlation coefficients, error bars), task definitions, or experimental controls, preventing assessment of whether the results support the modeling assumptions.

Authors: We agree that the abstract and Experiments section would benefit from explicit quantitative metrics to allow full assessment of the claims. In the revised manuscript, we will update the abstract to reference key results and expand the Experiments section with tables reporting success rates (with error bars) for simulation and real-world evaluations across tasks, correlation coefficients between sim and real performance, precise task definitions, and descriptions of experimental controls.

revision: yes

Referee: [Experiments] Experiments section: the zero-shot transfer claim rests on the unquantified fidelity of MuJoCo+IsaacSim for contact-rich loco-manipulation dynamics; no contact force matching metrics, friction identification procedure, or domain randomization details are supplied to validate that small contact errors do not prevent transfer in whole-body tasks.

Authors: We acknowledge that additional validation details are needed to substantiate the simulation fidelity for contact-rich tasks. In the revised Experiments section, we will add available contact force matching metrics, describe the friction identification procedure used, and provide specifics on domain randomization to better support the zero-shot transfer results.

revision: yes
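As an illustration of the error bars on success rates discussed in the exchange above, here is a minimal sketch of a Wilson score interval for a binomial success rate. The choice of the Wilson interval is an assumption made for this example. The paper and rebuttal do not say which interval the authors would use, and the trial counts below are placeholders.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Placeholder counts (NOT from the paper): 8 successes in 10 real-robot trials.
low, high = wilson_interval(8, 10)
print(f"success rate 0.80, 95% interval [{low:.3f}, {high:.3f}]")
```

With only a handful of hardware trials per task the interval is wide, which is why sim and real success rates are hard to compare without error bars.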

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivations or fitted predictions

The paper introduces a simulation testbed (SIMPLE) coupling MuJoCo and IsaacSim, provides task/scene assets, data pipelines (motion planning and VR teleoperation), and reports empirical benchmarking of policies with observed sim-to-real correlation and zero-shot transfers. No equations, parameter fits, or derivation chains are present. The central claims rest on experimental results rather than any self-definitional, fitted-input, or self-citation reduction. This is a standard empirical benchmark paper whose content is independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract.

Reference graph

Works this paper leans on

44 extracted references · 23 canonical work pages · 15 internal anchors

[1] S. Wei, H. Jing, B. Li, Z. Zhao, J. Mao, Z. Ni, S. He, J. Liu, X. Liu, K. Kang, et al. Ψ0: An open foundation model towards universal humanoid loco-manipulation. arXiv preprint arXiv:2603.12263, 2026.
[2] P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al. π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025.
[3] S. Ye, Y. Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y. L. Tan, C. Zhu, J. Xiang, et al. World action models are zero-shot policies. arXiv preprint arXiv:2602.15922, 2026.
[4] J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025.
[5] T. Chen, Z. Chen, B. Chen, Z. Cai, Y. Liu, Z. Li, Q. Liang, X. Lin, Y. Ge, Z. Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088, 2025.
[6] H. Geng, F. Wang, S. Wei, Y. Li, B. Wang, B. An, C. T. Cheng, H. Lou, P. Li, Y.-J. Wang, et al. Roboverse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learning. arXiv preprint arXiv:2504.18904, 2025.
[7] B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023.
[8] C. Li, R. Zhang, J. Wong, C. Gokmen, S. Srivastava, R. Martín-Martín, C. Wang, G. Levine, M. Lingelbach, J. Sun, et al. Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation. In Conference on Robot Learning, pages 80–93. PMLR, 2023.
[9] P. Atreya, K. Pertsch, T. Lee, M. J. Kim, A. Jain, A. Kuramshin, C. Eppner, C. Neary, E. Hu, F. Ramos, et al. Roboarena: Distributed real-world evaluation of generalist robot policies. In Proceedings of the Conference on Robot Learning (CoRL 2025), 2025.
[10] Z. Zhou, P. Atreya, Y. L. Tan, K. Pertsch, and S. Levine. Autoeval: Autonomous evaluation of generalist robot manipulation policies in the real world. arXiv preprint arXiv:2503.24278, 2025.
[11] X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kirmani, et al. Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941, 2024.
[12] C. L. Lab. Humanoidverse: A multi-simulator framework for humanoid robot sim-to-real learning. https://github.com/LeCAR-Lab/HumanoidVerse, 2025.
[13] C. Sferrazza, D.-M. Huang, X. Lin, Y. Lee, and P. Abbeel. Humanoidbench: Simulated humanoid benchmark for whole-body locomotion and manipulation, 2024.
[14] M. Khanna, Y. Mao, H. Jiang, S. Haresh, B. Shacklett, D. Batra, A. Clegg, E. Undersander, A. X. Chang, and M. Savva. Habitat synthetic scenes dataset (hssd-200): An analysis of 3d scene scale and realism tradeoffs for objectgoal navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16384–16393, 2024.
[15] M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13142–13153, 2023.
[16] E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012.
[17] NVIDIA. Isaac Sim. URL https://github.com/isaac-sim/IsaacSim.
[18] B. Sundaralingam, A. Murali, and S. Birchfield. curobov2: Dynamics-aware motion generation with depth-fused distance fields for high-dof robots, 2026.
[19] J. Chen, Y. Ke, and H. Wang. Bodex: Scalable and efficient robotic dexterous grasp synthesis using bilevel optimization. arXiv preprint arXiv:2412.16490, 2024.
[20] J. Li, X. Cheng, T. Huang, S. Yang, R. Qiu, and X. Wang. Amo: Adaptive motion optimization for hyper-dexterous humanoid whole-body control. Robotics: Science and Systems 2025, 2025.
[21] Z. Luo, Y. Yuan, T. Wang, C. Li, S. Chen, F. Castaneda, Z.-A. Cao, J. Li, D. Minor, Q. Ben, et al. Sonic: Supersizing motion tracking for natural humanoid whole-body control. arXiv preprint arXiv:2511.07820, 2025.
[22] R. Yang, Q. Yu, Y. Wu, R. Yan, B. Li, A.-C. Cheng, X. Zou, Y. Fang, X. Cheng, R.-Z. Qiu, et al. Egovla: Learning vision-language-action models from egocentric human videos. arXiv preprint arXiv:2507.12440, 2025.
[23] H. Bi, L. Wu, T. Lin, H. Tan, Z. Su, H. Su, and J. Zhu. H-rdt: Human manipulation enhanced bimanual robotic manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18135–18143, 2026.
[24] S. Deng, M. Yan, S. Wei, H. Ma, Y. Yang, J. Chen, Z. Zhang, T. Yang, X. Zhang, W. Zhang, et al. Graspvla: a grasping foundation model pre-trained on billion-scale synthetic action data. arXiv preprint arXiv:2505.03233, 2025.
[25] J. Zhang, K. Wang, S. Wang, M. Li, H. Liu, S. Wei, Z. Wang, Z. Zhang, and H. Wang. Uni-navid: A video-based vision-language-action model for unifying embodied navigation tasks. arXiv preprint arXiv:2412.06224, 2024.
[26] H.-S. Fang, C. Wang, M. Gou, and C. Lu. Graspnet-1billion: A large-scale benchmark for general object grasping. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11444–11453, 2020.
[27] S. Wei, H. Geng, J. Chen, C. Deng, C. Wenbo, C. Zhao, X. Fang, L. Guibas, and H. Wang. D3roma: Disparity diffusion-based depth sensing for material-agnostic robotic manipulation. In ECCV 2024 Workshop on Wild 3D: 3D Modeling, Reconstruction, and Generation in the Wild, 2024.
[28] A. Yakefu, B. Xie, C. Xu, E. Zhang, E. Zhou, F. Jia, H. Yang, H. Fan, H. Zhang, H. Peng, et al. Robochallenge: Large-scale real-robot evaluation of embodied policies. arXiv preprint arXiv:2510.17950, 2025.
[29] A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, and S. D. et al. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. In Proceedings of Robotics: Science and Systems (RSS), 2024. doi:10.15607/RSS.2024.XX.120
[30] Y. R. Wang, C. Ung, G. Tannert, J. Duan, J. Li, A. Le, R. Oswal, M. Grotz, W. Pumacay, Y. Deng, et al. Roboeval: Where robotic manipulation meets structured and scalable evaluation. arXiv preprint arXiv:2507.00435, 2025.
[31] S. James, Z. Ma, D. R. Arrojo, and A. J. Davison. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020.
[32] S. Nasiriany, S. Nasiriany, A. Maddukuri, and Y. Zhu. Robocasa365: A large-scale simulation framework for training and benchmarking generalist robots. In International Conference on Learning Representations (ICLR), 2026.
[33] Y. Kim, W. Pumacay, O. Rayyan, M. Argus, W. Han, E. VanderBilt, J. Salvador, A. Deshpande, R. Hendrix, S. Jauhri, S. Liu, N. M. M. Shafiullah, M. Guru, A. Guru, A. Eftekhar, K. Farley, D. Clay, J. Duan, P. Wolters, A. Herrasti, Y.-C. Lee, G. Chalvatzaki, Y. Cui, A. Farhadi, D. Fox, and R. Krishna. Molmospaces: A large-scale open ecosystem for robot nav... 2026.
[34] A. Shukla, S. Tao, and H. Su. Maniskill-hab: A benchmark for low-level manipulation in home rearrangement tasks. arXiv preprint arXiv:2412.13211, 2024.
[35] M. Mittal, P. Roth, J. Tigue, A. Richard, O. Zhang, P. Du, A. Serrano-Munoz, X. Yao, R. Zurbrügg, N. Rudin, et al. Isaac lab: A gpu-accelerated simulation framework for multi-modal robot learning. arXiv preprint arXiv:2511.04831, 2025.
[36] Y. Chen, S. Wei, B. Xiao, J. Lyu, J. Chen, F. Zhu, and H. Wang. Robohanger: Learning generalizable robotic hanger insertion for diverse garments. IEEE Robotics and Automation Letters, 2025.
[37] V. Makoviychuk, L. Wawrzyniak, Y. Guo, M. Lu, K. Storey, M. Macklin, D. Hoeller, N. Rudin, A. Allshire, A. Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021.
[38] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
[39] X. Wei, M. Liu, Z. Ling, and H. Su. Approximate convex decomposition for 3d meshes with collision-aware concavity and tree search. ACM Transactions on Graphics (TOG), 41(4):1–18, 2022.
[40] R. Cadene, S. Aliberts, F. Capuano, M. Aractingi, A. Zouitine, P. Kooijmans, J. Choghari, M. Russi, C. Pascal, S. Palma, et al. Lerobot: An open-source library for end-to-end robot learning. arXiv preprint arXiv:2602.22818, 2026.
[41] NVIDIA. vmaterials. https://developer.nvidia.com/vmaterials, 2026. Accessed: 2026-05-24.
[42] X. Chen, Y. Chen, Y. Fu, N. Gao, J. Jia, W. Jin, H. Li, Y. Mu, J. Pang, Y. Qiao, et al. Internvla-m1: A spatially guided vision-language-action framework for generalist robot policy. arXiv preprint arXiv:2510.13778, 2025.
[43] C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025.
[44] T. Z. Zhao, V. Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023.

[1] [1]

S. Wei, H. Jing, B. Li, Z. Zhao, J. Mao, Z. Ni, S. He, J. Liu, X. Liu, K. Kang, et al.Ψ 0: An open foundation model towards universal humanoid loco-manipulation.arXiv preprint arXiv:2603.12263, 2026

work page arXiv 2026

[2] [2]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al.π 0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

S. Ye, Y . Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y . L. Tan, C. Zhu, J. Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[4] [4]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

T. Chen, Z. Chen, B. Chen, Z. Cai, Y . Liu, Z. Li, Q. Liang, X. Lin, Y . Ge, Z. Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

H. Geng, F. Wang, S. Wei, Y . Li, B. Wang, B. An, C. T. Cheng, H. Lou, P. Li, Y .-J. Wang, et al. Roboverse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learning.arXiv preprint arXiv:2504.18904, 2025

work page arXiv 2025

[7] [7]

B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone. Libero: Benchmarking knowl- edge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

2023

[8] [8]

C. Li, R. Zhang, J. Wong, C. Gokmen, S. Srivastava, R. Mart ´ın-Mart´ın, C. Wang, G. Levine, M. Lingelbach, J. Sun, et al. Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation. InConference on Robot Learning, pages 80–93. PMLR, 2023

2023

[9] [9]

Atreya, K

P. Atreya, K. Pertsch, T. Lee, M. J. Kim, A. Jain, A. Kuramshin, C. Eppner, C. Neary, E. Hu, F. Ramos, et al. Roboarena: Distributed real-world evaluation of generalist robot policies. In Proceedings of the Conference on Robot Learning (CoRL 2025), 2025

2025

[10] [10]

Z. Zhou, P. Atreya, Y . L. Tan, K. Pertsch, and S. Levine. Autoeval: Autonomous evaluation of generalist robot manipulation policies in the real world.arXiv preprint arXiv:2503.24278, 2025

work page arXiv 2025

[11]

X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kirmani, et al. Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941, 2024.

[12]

C. L. Lab. Humanoidverse: A multi-simulator framework for humanoid robot sim-to-real learning. https://github.com/LeCAR-Lab/HumanoidVerse, 2025.

[13]

C. Sferrazza, D.-M. Huang, X. Lin, Y. Lee, and P. Abbeel. Humanoidbench: Simulated humanoid benchmark for whole-body locomotion and manipulation, 2024.

[14]

M. Khanna, Y. Mao, H. Jiang, S. Haresh, B. Shacklett, D. Batra, A. Clegg, E. Undersander, A. X. Chang, and M. Savva. Habitat synthetic scenes dataset (hssd-200): An analysis of 3d scene scale and realism tradeoffs for objectgoal navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16384–16393, 2024.

[15]

M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023.

[16]

E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.

[17]

NVIDIA. Isaac Sim. URL https://github.com/isaac-sim/IsaacSim

[18]

B. Sundaralingam, A. Murali, and S. Birchfield. curobov2: Dynamics-aware motion generation with depth-fused distance fields for high-dof robots, 2026.

[19]

J. Chen, Y. Ke, and H. Wang. Bodex: Scalable and efficient robotic dexterous grasp synthesis using bilevel optimization. arXiv preprint arXiv:2412.16490, 2024.

[20]

J. Li, X. Cheng, T. Huang, S. Yang, R. Qiu, and X. Wang. Amo: Adaptive motion optimization for hyper-dexterous humanoid whole-body control. Robotics: Science and Systems 2025, 2025.

[21]

Z. Luo, Y. Yuan, T. Wang, C. Li, S. Chen, F. Castaneda, Z.-A. Cao, J. Li, D. Minor, Q. Ben, et al. Sonic: Supersizing motion tracking for natural humanoid whole-body control. arXiv preprint arXiv:2511.07820, 2025.

[22]

R. Yang, Q. Yu, Y. Wu, R. Yan, B. Li, A.-C. Cheng, X. Zou, Y. Fang, X. Cheng, R.-Z. Qiu, et al. Egovla: Learning vision-language-action models from egocentric human videos. arXiv preprint arXiv:2507.12440, 2025.

[23]

H. Bi, L. Wu, T. Lin, H. Tan, Z. Su, H. Su, and J. Zhu. H-rdt: Human manipulation enhanced bimanual robotic manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18135–18143, 2026.

[24]

S. Deng, M. Yan, S. Wei, H. Ma, Y. Yang, J. Chen, Z. Zhang, T. Yang, X. Zhang, W. Zhang, et al. Graspvla: a grasping foundation model pre-trained on billion-scale synthetic action data. arXiv preprint arXiv:2505.03233, 2025.

[25]

J. Zhang, K. Wang, S. Wang, M. Li, H. Liu, S. Wei, Z. Wang, Z. Zhang, and H. Wang. Uni-navid: A video-based vision-language-action model for unifying embodied navigation tasks. arXiv preprint arXiv:2412.06224, 2024.

[26]

H.-S. Fang, C. Wang, M. Gou, and C. Lu. Graspnet-1billion: A large-scale benchmark for general object grasping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11444–11453, 2020.

[27]

S. Wei, H. Geng, J. Chen, C. Deng, C. Wenbo, C. Zhao, X. Fang, L. Guibas, and H. Wang. D3roma: Disparity diffusion-based depth sensing for material-agnostic robotic manipulation. In ECCV 2024 Workshop on Wild 3D: 3D Modeling, Reconstruction, and Generation in the Wild, 2024.

[28]

A. Yakefu, B. Xie, C. Xu, E. Zhang, E. Zhou, F. Jia, H. Yang, H. Fan, H. Zhang, H. Peng, et al. Robochallenge: Large-scale real-robot evaluation of embodied policies. arXiv preprint arXiv:2510.17950, 2025.

[29]

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, and S. D. et al. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. In Proceedings of Robotics: Science and Systems (RSS), 2024. doi:10.15607/RSS.2024.XX.120

[30]

Y. R. Wang, C. Ung, G. Tannert, J. Duan, J. Li, A. Le, R. Oswal, M. Grotz, W. Pumacay, Y. Deng, et al. Roboeval: Where robotic manipulation meets structured and scalable evaluation. arXiv preprint arXiv:2507.00435, 2025.

[31]

S. James, Z. Ma, D. R. Arrojo, and A. J. Davison. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020.

[32]

S. Nasiriany, S. Nasiriany, A. Maddukuri, and Y. Zhu. Robocasa365: A large-scale simulation framework for training and benchmarking generalist robots. In International Conference on Learning Representations (ICLR), 2026.

[33]

Y. Kim, W. Pumacay, O. Rayyan, M. Argus, W. Han, E. VanderBilt, J. Salvador, A. Deshpande, R. Hendrix, S. Jauhri, S. Liu, N. M. M. Shafiullah, M. Guru, A. Guru, A. Eftekhar, K. Farley, D. Clay, J. Duan, P. Wolters, A. Herrasti, Y.-C. Lee, G. Chalvatzaki, Y. Cui, A. Farhadi, D. Fox, and R. Krishna. Molmospaces: A large-scale open ecosystem for robot nav...

2026

[34]

A. Shukla, S. Tao, and H. Su. Maniskill-hab: A benchmark for low-level manipulation in home rearrangement tasks. arXiv preprint arXiv:2412.13211, 2024.

[35]

M. Mittal, P. Roth, J. Tigue, A. Richard, O. Zhang, P. Du, A. Serrano-Munoz, X. Yao, R. Zurbrügg, N. Rudin, et al. Isaac lab: A gpu-accelerated simulation framework for multi-modal robot learning. arXiv preprint arXiv:2511.04831, 2025.

[36]

Y. Chen, S. Wei, B. Xiao, J. Lyu, J. Chen, F. Zhu, and H. Wang. Robohanger: Learning generalizable robotic hanger insertion for diverse garments. IEEE Robotics and Automation Letters, 2025.

[37]

V. Makoviychuk, L. Wawrzyniak, Y. Guo, M. Lu, K. Storey, M. Macklin, D. Hoeller, N. Rudin, A. Allshire, A. Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021.

[38]

G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.

[39]

X. Wei, M. Liu, Z. Ling, and H. Su. Approximate convex decomposition for 3d meshes with collision-aware concavity and tree search. ACM Transactions on Graphics (TOG), 41(4):1–18, 2022.

[40]

R. Cadene, S. Aliberts, F. Capuano, M. Aractingi, A. Zouitine, P. Kooijmans, J. Choghari, M. Russi, C. Pascal, S. Palma, et al. Lerobot: An open-source library for end-to-end robot learning. arXiv preprint arXiv:2602.22818, 2026.

[41]

NVIDIA. vmaterials. https://developer.nvidia.com/vmaterials, 2026. Accessed: 2026-05-24.

[42]

X. Chen, Y. Chen, Y. Fu, N. Gao, J. Jia, W. Jin, H. Li, Y. Mu, J. Pang, Y. Qiao, et al. Internvla-m1: A spatially guided vision-language-action framework for generalist robot policy. arXiv preprint arXiv:2510.13778, 2025.

[43]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025.

[44]

T. Z. Zhao, V. Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023.