pith. machine review for the scientific record.

arxiv: 2604.19092 · v2 · submitted 2026-04-21 · 💻 cs.RO · cs.AI

Recognition: 2 theorem links

· Lean Theorem

RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:11 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords: world models · robotic manipulation · video prediction · benchmark · embodiment · simulation · physical consistency · action execution

The pith

A new benchmark shows that visually realistic robot videos often cannot be executed as actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RoboWM-Bench to evaluate video world models by converting their generated manipulation videos into robot action sequences that are then run in physics simulation. It builds on real-to-sim scene reconstruction and diverse tasks to measure whether predicted behaviors are physically consistent enough to succeed when embodied. Testing current models reveals that high visual quality frequently does not translate into executable actions, especially in long-horizon or contact-heavy interactions, owing to issues such as spatial errors and distorted geometry. This distinction matters for robot learning because future systems may depend on these models to generate training data at scale, so visual appeal alone is not enough to guarantee usable supervision. The benchmark provides a reproducible way to quantify this gap and identify recurring failure modes.

Core claim

RoboWM-Bench converts generated human-hand and robotic manipulation videos into embodied action sequences and validates them through execution in physically grounded simulation environments built on real-to-sim reconstruction. When applied to state-of-the-art video world models, it shows that visual plausibility and embodied executability are not always aligned, with recurring problems in spatial reasoning, contact prediction, and non-physical geometric distortions, particularly in complex and long-horizon interactions.

What carries the argument

RoboWM-Bench, which translates video predictions into executable action sequences for validation in simulation.
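
To make that machinery concrete, here is a minimal sketch of the evaluation loop described above: generate a video from the initial observation and instruction, convert it into an action sequence, and score success in the reconstructed simulation. The function names and signatures (generate_video, video_to_actions, execute_in_sim) are illustrative stand-ins, not the paper's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class TaskSpec:
    scene_observation: np.ndarray  # initial RGB frame of the reconstructed scene
    instruction: str               # e.g. "put the red cube into the bowl"


def evaluate_world_model(
    tasks: List[TaskSpec],
    generate_video: Callable,    # world model: (obs, instruction) -> predicted frames
    video_to_actions: Callable,  # conversion: frames -> robot action sequence
    execute_in_sim: Callable,    # simulator: (task, actions) -> True if the task succeeds
) -> float:
    """Fraction of tasks whose generated video, once converted to actions,
    actually completes the task in the physically grounded simulation."""
    successes = 0
    for task in tasks:
        frames = generate_video(task.scene_observation, task.instruction)
        actions = video_to_actions(frames)   # pose estimation, contact detection, trajectory fitting
        if execute_in_sim(task, actions):    # embodied rollout in the real-to-sim scene
            successes += 1
    return successes / len(tasks)
```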

If this is right

  • World models for robotics need training signals that enforce physical executability beyond visual fidelity.
  • Embodied execution tests should become standard alongside visual metrics when assessing manipulation world models.
  • Complex and long-horizon tasks expose larger gaps between predicted visuals and workable actions.
  • Real-to-sim reconstruction enables scalable, standardized checks of physical consistency in generated videos.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • World model training could incorporate simulation-based feedback loops to directly optimize for action success rather than pixels alone (a toy sketch follows this list).
  • The same video-to-execution pipeline might apply to evaluating world models in other embodied settings such as navigation or assembly.
  • Pure scaling of visual pretraining may hit limits in robotics without explicit embodiment grounding.
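
To illustrate the first extension above, a toy sketch of a simulation-in-the-loop fine-tuning step that reward-weights samples by execution success rather than pixel error. It reuses the hypothetical video_to_actions and execute_in_sim helpers from the pipeline sketch, assumes a PyTorch-style model exposing a sample_with_logprob method, and is not something the paper proposes or implements.

```python
def finetune_with_execution_feedback(model, tasks, optimizer, samples_per_task=4):
    """REINFORCE-style sketch: upweight video samples whose converted actions
    succeed in simulation (reward 1) and ignore those that fail (reward 0)."""
    for task in tasks:
        for _ in range(samples_per_task):
            # hypothetical API: returns predicted frames and their log-probability (a torch scalar)
            frames, logprob = model.sample_with_logprob(task.scene_observation, task.instruction)
            actions = video_to_actions(frames)
            reward = 1.0 if execute_in_sim(task, actions) else 0.0
            loss = -reward * logprob      # maximize log-likelihood of executable rollouts
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```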

Load-bearing premise

That successful execution of converted actions in the chosen simulation environments accurately reflects the physical consistency required for real-world robot performance.

What would settle it

A video world model that achieves high visual scores yet shows near-zero success rates on RoboWM-Bench execution across multiple tasks, or the reverse pattern of low visual quality but high execution success, would strengthen the misalignment claim; consistent alignment across models would weaken it.
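
One way to operationalize that test is to check whether a visual-quality score and the RoboWM-Bench execution rate rank models the same way. A minimal sketch with made-up placeholder numbers; neither the scores nor the model names come from the paper.

```python
from scipy.stats import spearmanr

# Placeholder values for illustration only (not reported results).
visual_score = {"model_a": 0.91, "model_b": 0.84, "model_c": 0.78}  # PAI-Bench-style quality
exec_success = {"model_a": 0.12, "model_b": 0.55, "model_c": 0.49}  # RoboWM-Bench-style execution rate

models = sorted(visual_score)
rho, p = spearmanr([visual_score[m] for m in models],
                   [exec_success[m] for m in models])
# A low or negative rank correlation across many models would support the
# misalignment claim; a consistently high one would weaken it.
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")
```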

Figures

Figures reproduced from arXiv: 2604.19092 by Chen Xie, Feng Jiang, HaiFeng Wang, Jasper Lu, Kyle Xu, Ruihai Wu, Shengze Huang, Yang Chen, Yuanfei Wang, Yuchen Liu, Zhenhao Shen.

Figure 1: Overview of RoboWM-Bench. RoboWM-Bench is a manipulation-centric benchmark for evaluating video world models under embodied execution. (a) Given an initial scene observation and task description, world models generate manipulation videos with human hands or robot arms. The predicted behaviors are converted into embodied action sequences and validated in simulation through real-to-sim scene reconstruction. …

Figure 2: Pipeline of RoboWM-Bench. Given an initial scene observation, the corresponding real-world scene is reconstructed in simulation through a real-to-sim pipeline, enabling consistent and reproducible evaluation. Predicted videos are then converted into executable robot actions through two pathways: human-centric retargeting, which estimates 3D hand poses and retargets them to robot end-effector actions, and …

Figure 3: Qualitative execution results on RoboWM-Bench.

Figure 4: Comparison between PAI-Bench and RoboWM-Bench.

Figure 5: Real-to-sim consistency evaluation. Identical manipulation trajectories are executed in real-world scenes and reconstructed simulation environments, yielding consistent success and failure outcomes. For robotic videos, we compare two IDM training strategies: IDM-Real, trained directly on real-world data (50 trajectories per task), and IDM-Sim+Real, a two-stage approach consisting of simulation pretraining fo…

Figure 6: Comparison between the average quality scores in PAI-Bench and the execution accuracy in RoboWM-Bench. The left scatter plot shows human-hand tasks, and the right scatter plot shows robotic tasks.

Figure 7: Additional qualitative results on RoboWM-Bench.

Figure 8: Additional qualitative results on real-to-sim consistency evaluation.

Figure 9: Visualization of depth estimation results. (a) The predicted absolute depth shows a large discrepancy from the ground-truth. (b) Aligning relative depth with the first-frame ground-truth depth improves consistency, although non-negligible errors still remain.
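
The alignment step described in Figure 9, fitting relative depth to the first-frame ground-truth depth, is generically a least-squares scale-and-shift fit. A minimal sketch under that reading; the paper may use a different fitting procedure.

```python
import numpy as np

def fit_scale_shift(rel_depth: np.ndarray, gt_depth: np.ndarray, valid: np.ndarray):
    """Fit s, b so that s * rel_depth + b best matches the ground-truth depth
    (least squares over valid pixels of the first frame)."""
    x = rel_depth[valid].ravel()
    y = gt_depth[valid].ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(s), float(b)

# Usage: s, b = fit_scale_shift(rel_frame0, gt_frame0, gt_frame0 > 0)
# then apply metric_t = s * rel_frame_t + b to every later frame.
```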
Original abstract

Recent advances in large-scale video world models have enabled increasingly realistic future prediction, raising the prospect of using generated videos as scalable supervision for robot learning. However, for embodied manipulation, perceptual realism alone is not sufficient: generated interactions must also be physically consistent and executable by robotic agents. Existing benchmarks provide valuable assessments of visual quality and physical plausibility, but they do not systematically evaluate whether predicted behaviors can be translated into executable actions that complete manipulation tasks. We introduce RoboWM-Bench, a manipulation-centric benchmark for embodiment-grounded evaluation of video world models. RoboWM-Bench converts generated human-hand and robotic manipulation videos into embodied action sequences and validates them through execution in physically grounded simulation environments. Built on real-to-sim scene reconstruction and diverse manipulation tasks, RoboWM-Bench enables standardized, reproducible, and scalable evaluation of physical executability. Using RoboWM-Bench, we evaluate state-of-the-art video world models and observe that visual plausibility and embodied executability are not always aligned. Our analysis highlights several recurring factors that affect execution performance, including spatial reasoning, contact prediction, and non-physical geometric distortions, particularly in complex and long-horizon interactions. These findings provide a more fine-grained view of current model capabilities and underscore the value of embodiment-aware evaluation for guiding physically grounded world modeling in robotic manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RoboWM-Bench, a manipulation-centric benchmark that converts generated videos of human-hand and robotic interactions into embodied action sequences (via pose estimation, contact detection, and trajectory fitting) and evaluates their executability in physically grounded simulation environments reconstructed from real scenes. It reports evaluations of state-of-the-art video world models, finding that visual plausibility does not always align with successful task completion, and identifies recurring failure modes such as poor spatial reasoning, inaccurate contact prediction, and non-physical geometric distortions in long-horizon tasks.

Significance. If the video-to-action conversion pipeline is validated, the benchmark would offer a valuable, scalable, embodiment-grounded complement to existing visual and physical-plausibility metrics, helping steer world-model development toward predictions that are not only realistic but also directly usable for robotic manipulation.

major comments (2)
  1. [§3.2] The video-to-action conversion pipeline (pose estimation, contact detection, trajectory fitting) lacks any quantitative validation against ground-truth actions extracted from real videos; without this check, execution failures in simulation cannot be confidently attributed to the world model rather than to conversion-pipeline errors induced by the visual artifacts typical of current generators.
  2. [§4] The central claim that visual plausibility and embodied executability diverge is presented without baseline comparisons, error bars, or per-model quantitative metrics (e.g., success rates, contact accuracy); the abstract-level summary alone does not allow assessment of effect sizes or statistical significance.
minor comments (2)
  1. [Abstract] Specify the exact set of video world models evaluated and the precise simulation environments/tasks used, including any parameter settings for the real-to-sim reconstruction.
  2. [§3.1] Clarify the criteria used to select 'diverse manipulation tasks' and how long-horizon interactions are defined and segmented.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and will make revisions to address them. Our point-by-point responses are provided below.

Point-by-point responses
  1. Referee: [§3.2] The video-to-action conversion pipeline (pose estimation, contact detection, trajectory fitting) lacks any quantitative validation against ground-truth actions extracted from real videos; without this check, execution failures in simulation cannot be confidently attributed to the world model rather than to conversion-pipeline errors induced by the visual artifacts typical of current generators.

    Authors: We agree that a quantitative validation of the video-to-action pipeline against ground-truth actions from real videos would strengthen the attribution of execution failures. In the revised manuscript, we will add a dedicated validation subsection in §3.2. This will apply the full pipeline (pose estimation, contact detection, and trajectory fitting) to the original real demonstration videos, for which ground-truth actions are available from the data collection process. We will report metrics including average joint angle error, contact detection precision/recall, and end-effector trajectory RMSE to quantify pipeline fidelity under realistic visual conditions. revision: yes
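
(Editorial sketch, not from the manuscript: the proposed fidelity metrics could be computed along these lines, assuming (T, 3) end-effector trajectories and per-frame binary contact labels.)

```python
import numpy as np

def trajectory_rmse(pred_xyz: np.ndarray, gt_xyz: np.ndarray) -> float:
    """End-effector trajectory RMSE; both arrays have shape (T, 3)."""
    return float(np.sqrt(np.mean(np.sum((pred_xyz - gt_xyz) ** 2, axis=-1))))

def contact_precision_recall(pred_contact: np.ndarray, gt_contact: np.ndarray):
    """Precision/recall of per-frame binary contact predictions."""
    pred = pred_contact.astype(bool)
    gt = gt_contact.astype(bool)
    tp = np.sum(pred & gt)
    precision = tp / max(int(np.sum(pred)), 1)
    recall = tp / max(int(np.sum(gt)), 1)
    return float(precision), float(recall)
```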

  2. Referee: [§4] The central claim that visual plausibility and embodied executability diverge is presented without baseline comparisons, error bars, or per-model quantitative metrics (e.g., success rates, contact accuracy); the abstract-level summary alone does not allow assessment of effect sizes or statistical significance.

    Authors: We acknowledge that the current results section would benefit from expanded quantitative reporting to better support the central claim. While the manuscript already evaluates multiple state-of-the-art models on a range of tasks, we will revise §4 to include per-model success rates in simulation, contact accuracy scores, comparisons against baselines (e.g., ground-truth video execution and simple motion predictors), error bars computed across multiple task instances or random seeds, and basic statistical significance tests. These additions will allow clearer assessment of effect sizes and the divergence between visual and embodied metrics. revision: yes
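
(Editorial sketch, not from the manuscript: per-model success rates with bootstrap confidence intervals, one way to produce the error bars promised above.)

```python
import numpy as np

def success_rate_ci(outcomes, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap a (1 - alpha) confidence interval for a success rate from
    per-episode binary outcomes (array of 0s and 1s)."""
    rng = np.random.default_rng(seed)
    outcomes = np.asarray(outcomes, dtype=float)
    boots = rng.choice(outcomes, size=(n_boot, len(outcomes)), replace=True).mean(axis=1)
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return float(outcomes.mean()), (float(lo), float(hi))
```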

Circularity Check

0 steps flagged

No circularity: benchmark evaluation is externally grounded in simulation execution

full rationale

The paper introduces RoboWM-Bench as a new evaluation framework that converts generated videos into action sequences via pose estimation and contact detection, then executes them in independent simulation environments. The key observation—that visual plausibility and embodied executability diverge—is presented as an empirical outcome from running this pipeline on external state-of-the-art video world models, not as a quantity derived from fitted parameters, self-referential equations, or self-citation chains. No load-bearing step reduces by construction to the paper's own inputs; the conversion and execution steps are described as external validation mechanisms rather than tautological redefinitions. This satisfies the default expectation of a non-circular benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that simulation environments can serve as a valid proxy for physical executability; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Physically grounded simulation environments accurately validate the executability of actions derived from generated videos
    This assumption underpins the entire validation step described in the abstract.

pith-pipeline@v0.9.0 · 5559 in / 1207 out tokens · 57329 ms · 2026-05-15T06:11:51.626507+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 15 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Agarwal, N., Ali, A., Bala, M., Balaji, Y., Barker, E., Cai, T., Chattopadhyay, P., Chen, Y., Cui, Y., Ding, Y., et al.: Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575 (2025)

  2. [2]

    World Simulation with Video Foundation Models for Physical AI

    Ali, A., Bai, J., Bala, M., Balaji, Y., Blakeman, A., Cai, T., Cao, J., Cao, T., Cha, E., Chao, Y.W., et al.: World simulation with video foundation models for physical ai. arXiv preprint arXiv:2511.00062 (2025)

  3. [3]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., et al.: V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985 (2025)

  4. [4]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

  5. [5]

    Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos

    Baker, B., Akkaya, I., Zhokov, P., Huizinga, J., Tang, J., Ecoffet, A., Houghton, B., Sampedro, R., Clune, J.: Video PreTraining (VPT): Learning to act by watching unlabeled online videos. Advances in Neural Information Processing Systems 35, 24639–24654 (2022)

  6. [6]

    Gen2Act: Human Video Generation in Novel Scenarios Enables Generalizable Robot Manipulation

    Bharadhwaj, H., Dwibedi, D., Gupta, A., Tulsiani, S., Doersch, C., Xiao, T., Shah, D., Xia, F., Sadigh, D., Kirmani, S.: Gen2Act: Human video generation in novel scenarios enables generalizable robot manipulation. arXiv preprint arXiv:2409.16283 (2024)

  7. [7]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., et al.: Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 (2025)

  8. [8]

    Brooks, T., Peebles, W., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Ng, C., Wang, R., Ramesh, A., et al.: Video generation models as world simulators. https://openai.com/research/video-generation-models-as-world-simulators (2024), OpenAI Technical Report

  9. [9]

    Large Video Planner Enables Generalizable Robot Control

    Chen, B., Zhang, T., Geng, H., Song, K., Zhang, C., Li, P., Freeman, W.T., Malik, J., Abbeel, P., Tedrake, R., et al.: Large video planner enables generalizable robot control. arXiv preprint arXiv:2512.15840 (2025)

  10. [10]

    Video Depth Anything: Consistent Depth Estimation for Super-Long Videos

    Chen, S., Guo, H., Zhu, S., Zhang, F., Huang, Z., Feng, J., Kang, B.: Video Depth Anything: Consistent depth estimation for super-long videos. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 22831–22840 (2025)

  11. [11]

    SAM 3D: 3Dfy Anything in Images

    Chen, X., Chu, F.J., Gleize, P., Liang, K.J., Sax, A., Tang, H., Wang, W., Guo, M., Hardin, T., Li, X., et al.: Sam 3d: 3dfy anything in images. arXiv preprint arXiv:2511.16624 (2025)

  12. [12]

    WoW: Towards a World Omniscient World Model through Embodied Interaction

    Chi, X., Jia, P., Fan, C.K., Ju, X., Mi, W., Zhang, K., Qin, Z., Tian, W., Ge, K., Li, H., et al.: WoW: Towards a world omniscient world model through embodied interaction. arXiv preprint arXiv:2509.22642 (2025)

  13. [13]

    Learning Universal Policies via Text-Guided Video Generation

    Du, Y., Yang, S., Dai, B., Dai, H., Nachum, O., Tenenbaum, J., Schuurmans, D., Abbeel, P.: Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems 36, 9156–9172 (2023)

  14. [14]

    Wow, Wo, Val! A Comprehensive Embodied World Model Evaluation Turing Test

    Fan, C.K., Chi, X., Ju, X., Li, H., Bao, Y., Wang, Y.K., Chen, L., Jiang, Z., Ge, K., Li, Y., et al.: Wow, wo, val! A comprehensive embodied world model evaluation turing test. arXiv preprint arXiv:2601.04137 (2026)

  15. [15]

    TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation

    Feng, W., Li, J., Saxon, M., Fu, T.J., Chen, W., Wang, W.Y.: TC-Bench: Benchmarking temporal compositionality in text-to-video and image-to-video generation. arXiv preprint arXiv:2406.08656 (2024)

  16. [16]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Gao, Y., Guo, H., Hoang, T., Huang, W., Jiang, L., Kong, F., Li, H., Li, J., Li, L., Li, X., et al.: Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113 (2025)

  17. [17]

    Veo: A Text-to-Video Generation System (Veo-3 Technical Report)

    Google DeepMind: Veo: A text-to-video generation system (Veo-3 technical report). Tech. Rep. Veo-3-Tech-Report, Google DeepMind (2025), https://storage.googleapis.com/deepmind-media/veo/Veo-3-Tech-Report.pdf, technical report

  18. [18]

    Ego4D: Around the World in 3,000 Hours of Egocentric Video

    Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Hamburger, J., Jiang, H., Liu, M., Liu, X., et al.: Ego4D: Around the world in 3,000 hours of egocentric video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18995–19012 (2022)

  19. [19]

    LTX-Video: Realtime Video Latent Diffusion

    HaCohen, Y., Chiprut, N., Brazowski, B., Shalem, D., Moshe, D., Richardson, E., Levin, E., Shiran, G., Zabari, N., Gordon, O., et al.: Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103 (2024)

  20. [20]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Hong, W., Ding, M., Zheng, W., Liu, X., Tang, J.: CogVideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868 (2022)

  21. [21]

    EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation

    Huang, S., Chen, L., Zhou, P., Chen, S., Jiang, Z., Hu, Y., Liao, Y., Gao, P., Li, H., Yao, M., et al.: EnerVerse: Envisioning embodied future space for robotics manipulation. arXiv preprint arXiv:2501.01895 (2025)

  22. [22]

    VBench: Comprehensive Benchmark Suite for Video Generative Models

    Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: VBench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818 (2024)

  23. [23]

    DreamGen: Unlocking Generalization in Robot Learning through Video World Models

    Jang, J., Ye, S., Lin, Z., Xiang, J., Bjorck, J., Fang, Y., Hu, F., Huang, S., Kundalia, K., Lin, Y.C., et al.: DreamGen: Unlocking generalization in robot learning through video world models. arXiv preprint arXiv:2505.12705 (2025)

  24. [24]

    T2VBench: Benchmarking Temporal Dynamics for Text-to-Video Generation

    Ji, P., Xiao, C., Tai, H., Huo, M.: T2VBench: Benchmarking temporal dynamics for text-to-video generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5325–5335 (2024)

  25. [25]

    Emergence of Human to Robot Transfer in Vision-Language-Action Models

    Kareer, S., Pertsch, K., Darpinian, J., Hoffman, J., Xu, D., Levine, S., Finn, C., Nair, S.: Emergence of human to robot transfer in vision-language-action models. arXiv preprint arXiv:2512.22414 (2025)

  26. [26]

    MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare

    Labbé, Y., Manuelli, L., Mousavian, A., Tyree, S., Birchfield, S., Tremblay, J., Carpentier, J., Aubry, M., Fox, D., Sivic, J.: MegaPose: 6D pose estimation of novel objects via render & compare. arXiv preprint arXiv:2212.06870 (2022)

  27. [27]

    Masquerade: Learning from In-the-Wild Human Videos Using Data-Editing

    Lepert, M., Fang, J., Bohg, J.: Masquerade: Learning from in-the-wild human videos using data-editing. arXiv preprint arXiv:2508.09976 (2025)

  28. [28]

    Phantom: Training Robots Without Robots Using Only Human Videos

    Lepert, M., Fang, J., Bohg, J.: Phantom: Training robots without robots using only human videos. URL https://arxiv.org/abs/2503.007792 (2025)

  29. [29]

    WorldModelBench: Judging Video Generation Models as World Models

    Li, D., Fang, Y., Chen, Y., Yang, S., Cao, S., Wong, J., Luo, M., Wang, X., Yin, H., Gonzalez, J.E., et al.: WorldModelBench: Judging video generation models as world models. arXiv preprint arXiv:2502.20694 (2025)

  30. [30]

    LeHome: A Simulation Environment for Deformable Object Manipulation in Household Scenarios

    Li, Z., Yang, J., Xu, J., Xie, S., Chen, T., Wang, Y., Shen, Z., Shen, Y., Zheng, Y., Li, W., et al.: LeHome: A simulation environment for deformable object manipulation in household scenarios. In: IROS 2025 5th Workshop on Robotic Manipulation of Deformable Objects: Holistic Approaches and Challenges Forward

  31. [31]

    MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos

    Li, Z., Tucker, R., Cole, F., Wang, Q., Jin, L., Ye, V., Kanazawa, A., Holynski, A., Snavely, N.: MegaSaM: Accurate, fast and robust structure and motion from casual dynamic videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10486–10496 (2025)

  32. [32]

    VMBench: A Benchmark for Perception-Aligned Video Motion Generation

    Ling, X., Zhu, C., Wu, M., Li, H., Feng, X., Yang, C., Hao, A., Zhu, J., Wu, J., Chu, X.: VMBench: A benchmark for perception-aligned video motion generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13087–13098 (2025)

  33. [33]

    EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

    Liu, Y., Cun, X., Liu, X., Wang, X., Zhang, Y., Chen, H., Liu, Y., Zeng, T., Chan, R., Shan, Y.: EvalCrafter: Benchmarking and evaluating large video generation models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22139–22149 (2024)

  34. [34]

    Grounding Video Models to Actions through Goal Conditioned Exploration

    Luo, Y., Du, Y.: Grounding video models to actions through goal conditioned exploration. arXiv preprint arXiv:2411.07223 (2024)

  35. [35]

    VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

    Ma, Y.J., Sodhani, S., Jayaraman, D., Bastani, O., Kumar, V., Zhang, A.: Vip: Towards universal visual reward and representation via value-implicit pre-training. arXiv preprint arXiv:2210.00030 (2022)

  36. [36]

    Towards Generalist Robot Learning from Internet Video: A Survey

    McCarthy, R., Tan, D.C., Schmidt, D., Acero, F., Herr, N., Du, Y., Thuruthel, T.G., Li, Z.: Towards generalist robot learning from internet video: A survey. Journal of Artificial Intelligence Research 83 (2025)

  37. [37]

    R3M: A Universal Visual Representation for Robot Manipulation

    Nair, S., Rajeswaran, A., Kumar, V., Finn, C., Gupta, A.: R3M: A universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601 (2022)

  38. [38]

    NVIDIA Isaac Sim: High-Fidelity Simulation for Robotics

    NVIDIA Corporation: NVIDIA Isaac Sim: High-fidelity simulation for robotics. https://developer.nvidia.com/isaac-sim (2023), accessed: 2026-03-03

  39. [39]

    Reconstructing Hands in 3D with Transformers

    Pavlakos, G., Shan, D., Radosavovic, I., Kanazawa, A., Fouhey, D., Malik, J.: Reconstructing hands in 3D with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9826–9836 (2024)

  40. [40]

    EgoBridge: Domain Adaptation for Generalizable Imitation from Egocentric Human Data

    Punamiya, R., Patel, D., Aphiwetsa, P., Kuppili, P., Zhu, L.Y., Kareer, S., Hoffman, J., Xu, D.: EgoBridge: Domain adaptation for generalizable imitation from egocentric human data. In: Human to Robot: Workshop on Sensorizing, Modeling, and Learning from Humans (2025)

  41. [41]

    DexMV: Imitation Learning for Dexterous Manipulation from Human Videos

    Qin, Y., Wu, Y.H., Liu, S., Jiang, H., Yang, R., Fu, Y., Wang, X.: DexMV: Imitation learning for dexterous manipulation from human videos. In: European Conference on Computer Vision. pp. 570–587. Springer (2022)

  42. [42]

    Real-World Robot Learning with Masked Visual Pre-Training

    Radosavovic, I., Xiao, T., James, S., Abbeel, P., Malik, J., Darrell, T.: Real-world robot learning with masked visual pre-training. In: Conference on Robot Learning. pp. 416–426. PMLR (2023)

  43. [43]

    WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models

    Shang, Y., Li, Z., Ma, Y., Su, W., Jin, X., Wang, Z., Jin, L., Zhang, X., Tang, Y., Su, H., et al.: WorldArena: A unified benchmark for evaluating perception and functional utility of embodied world models. arXiv preprint arXiv:2602.08971 (2026)

  44. [44]

    Evaluating Gemini Robotics Policies in a Veo World Simulator

    Team, G.R., Choromanski, K., Devin, C., Du, Y., Dwibedi, D., Gao, R., Jindal, A., Kipf, T., Kirmani, S., Leal, I., et al.: Evaluating Gemini Robotics policies in a Veo world simulator. arXiv preprint arXiv:2512.10675 (2025)

  45. [45]

    GigaBrain-0: A World Model-Powered Vision-Language-Action Model

    Team, G., Ye, A., Wang, B., Ni, C., Huang, G., Zhao, G., Li, H., Li, J., Zhu, J., Feng, L., et al.: GigaBrain-0: A world model-powered vision-language-action model. arXiv preprint arXiv:2510.19430 (2025)

  46. [46]

    GigaWorld-0: World Models as Data Engine to Empower Embodied AI

    Team, G., Ye, A., Wang, B., Ni, C., Huang, G., Zhao, G., Li, H., Zhu, J., Li, K., Xu, M., et al.: GigaWorld-0: World models as data engine to empower embodied AI. arXiv preprint arXiv:2511.19861 (2025)

  47. [47]

    KlingAvatar 2.0 Technical Report

    Team, K., Chen, J., Ding, Y., Fang, Z., Gai, K., Gao, Y., He, K., Hua, J., Jiang, B., Lao, M., et al.: KlingAvatar 2.0 technical report. arXiv preprint arXiv:2512.13313 (2025)

  48. [48]

    Predictive Inverse Dynamics Models Are Scalable Learners for Robotic Manipulation

    Tian, Y., Yang, S., Zeng, J., Wang, P., Lin, D., Dong, H., Pang, J.: Predictive inverse dynamics models are scalable learners for robotic manipulation. arXiv preprint arXiv:2412.15109 (2024)

  49. [49]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

  50. [50]

    LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models

    Wang, Y., Chen, X., Ma, X., Zhou, S., Huang, Z., Wang, Y., Yang, C., He, Y., Yu, J., Yang, P., et al.: LaVie: High-quality video generation with cascaded latent diffusion models. International Journal of Computer Vision 133(5), 3059–3078 (2025)

  51. [51]

    FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects

    Wen, B., Yang, W., Kautz, J., Birchfield, S.: FoundationPose: Unified 6D pose estimation and tracking of novel objects. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17868–17879 (2024)

  52. [52]

    Marble: A Multimodal World Model

    World Labs: Marble: A multimodal world model (2025), https://www.worldlabs.ai/blog/marble-world-model, accessed: 2026-02

  53. [53]

    HunyuanVideo 1.5 Technical Report

    Wu, B., Zou, C., Li, C., Huang, D., Yang, F., Tan, H., Peng, J., Wu, J., Xiong, J., Jiang, J., et al.: Hunyuanvideo 1.5 technical report. arXiv preprint arXiv:2511.18870 (2025)

  54. [54]

    Foundation Feature-Driven Online End-Effector Pose Estimation: A Marker-Free and Learning-Free Approach

    Wu, T., Zhang, J., Liang, S., Han, Z., Dong, H.: Foundation feature-driven online end-effector pose estimation: A marker-free and learning-free approach. In: 2025 IEEE International Conference on Robotics and Automation (ICRA). pp. 1921–

  55. [55]

    Learning by Watching: Physical Imitation of Manipulation Skills from Human Videos

    Xiong, H., Li, Q., Chen, Y.C., Bharadhwaj, H., Sinha, S., Garg, A.: Learning by watching: Physical imitation of manipulation skills from human videos. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 7827–7834. IEEE (2021)

  56. [56]

    Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

    Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., Zhao, H.: Depth Anything: Unleashing the power of large-scale unlabeled data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10371–10381 (2024)

  57. [57]

    Learning Interactive Real-World Simulators

    Yang, M., Du, Y., Ghasemipour, K., Tompson, J., Schuurmans, D., Abbeel, P.: Learning interactive real-world simulators. arXiv preprint arXiv:2310.06114 1(2), 6 (2023)

  58. [58]

    Latent Action Pretraining from Videos

    Ye, S., Jang, J., Jeon, B., Joo, S., Yang, J., Peng, B., Mandlekar, A., Tan, R., Chao, Y.W., Lin, B.Y., et al.: Latent action pretraining from videos. arXiv preprint arXiv:2410.11758 (2024)

  59. [59]

    EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied World Models

    Yue, H., Huang, S., Liao, Y., Chen, S., Zhou, P., Chen, L., Yao, M., Ren, G.: EWMBench: Evaluating scene, motion, and semantic quality in embodied world models. arXiv preprint arXiv:2505.09694 (2025)

  60. [60]

    VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    Zheng, D., Huang, Z., Liu, H., Zou, K., He, Y., Zhang, F., Gu, L., Zhang, Y., He, J., Zheng, W.S., et al.: Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755 (2025)

  61. [61]

    PAI-Bench: A Comprehensive Benchmark for Physical AI

    Zhou, F., Huang, J., Li, J., Ramanan, D., Shi, H.: PAI-Bench: A comprehensive benchmark for physical AI. arXiv preprint arXiv:2512.01989 (2025)

  62. [62]

    Zhou, S., Du, Y., Chen, J., Li, Y., Yeung, D.Y., Gan, C.: RoboDreamer: Learning compositional world models for robot imagination. arXiv preprint arXiv:2404.12377 (2024)