pith. machine review for the scientific record.

arxiv: 2604.19092 · v2 · submitted 2026-04-21 · 💻 cs.RO · cs.AI

Recognition: 2 theorem links

· Lean Theorem

RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:11 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords: world models · robotic manipulation · video prediction · benchmark · embodiment · simulation · physical consistency · action execution

The pith

A new benchmark shows that visually realistic robot videos often cannot be executed as actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RoboWM-Bench to evaluate video world models by converting their generated manipulation videos into robot action sequences that are then run in physics simulation. It builds on real-to-sim scene reconstruction and diverse tasks to measure whether predicted behaviors are physically consistent enough to succeed when embodied. Testing current models reveals that high visual quality frequently does not translate into executable actions, especially in long-horizon or contact-heavy interactions, owing to issues such as spatial errors and distorted geometry. This distinction matters for robot learning because future systems may depend on these models to generate training data at scale, so visual appeal alone is not enough to guarantee usable supervision. The benchmark provides a reproducible way to quantify this gap and identify recurring failure modes.

Core claim

RoboWM-Bench converts generated human-hand and robotic manipulation videos into embodied action sequences and validates them through execution in physically grounded simulation environments built on real-to-sim reconstruction. When applied to state-of-the-art video world models, it shows that visual plausibility and embodied executability are not always aligned, with recurring problems in spatial reasoning, contact prediction, and non-physical geometric distortions, particularly in complex and long-horizon interactions.

What carries the argument

RoboWM-Bench, which translates video predictions into executable action sequences for validation in simulation.
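
To make that machinery concrete, here is a minimal sketch of the evaluation loop described above: generate a video from the initial observation and instruction, convert it into an action sequence, and score success in the reconstructed simulation. The function names and signatures (generate_video, video_to_actions, execute_in_sim) are illustrative stand-ins, not the paper's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class TaskSpec:
    scene_observation: np.ndarray  # initial RGB frame of the reconstructed scene
    instruction: str               # e.g. "put the red cube into the bowl"


def evaluate_world_model(
    tasks: List[TaskSpec],
    generate_video: Callable,    # world model: (obs, instruction) -> predicted frames
    video_to_actions: Callable,  # conversion: frames -> robot action sequence
    execute_in_sim: Callable,    # simulator: (task, actions) -> True if the task succeeds
) -> float:
    """Fraction of tasks whose generated video, once converted to actions,
    actually completes the task in the physically grounded simulation."""
    successes = 0
    for task in tasks:
        frames = generate_video(task.scene_observation, task.instruction)
        actions = video_to_actions(frames)   # pose estimation, contact detection, trajectory fitting
        if execute_in_sim(task, actions):    # embodied rollout in the real-to-sim scene
            successes += 1
    return successes / len(tasks)
```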

If this is right

  • World models for robotics need training signals that enforce physical executability beyond visual fidelity.
  • Embodied execution tests should become standard alongside visual metrics when assessing manipulation world models.
  • Complex and long-horizon tasks expose larger gaps between predicted visuals and workable actions.
  • Real-to-sim reconstruction enables scalable, standardized checks of physical consistency in generated videos.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • World model training could incorporate simulation-based feedback loops to directly optimize for action success rather than pixels alone (a toy sketch follows this list).
  • The same video-to-execution pipeline might apply to evaluating world models in other embodied settings such as navigation or assembly.
  • Pure scaling of visual pretraining may hit limits in robotics without explicit embodiment grounding.
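
To illustrate the first extension above, a toy sketch of a simulation-in-the-loop fine-tuning step that reward-weights samples by execution success rather than pixel error. It reuses the hypothetical video_to_actions and execute_in_sim helpers from the pipeline sketch, assumes a PyTorch-style model exposing a sample_with_logprob method, and is not something the paper proposes or implements.

```python
def finetune_with_execution_feedback(model, tasks, optimizer, samples_per_task=4):
    """REINFORCE-style sketch: upweight video samples whose converted actions
    succeed in simulation (reward 1) and ignore those that fail (reward 0)."""
    for task in tasks:
        for _ in range(samples_per_task):
            # hypothetical API: returns predicted frames and their log-probability (a torch scalar)
            frames, logprob = model.sample_with_logprob(task.scene_observation, task.instruction)
            actions = video_to_actions(frames)
            reward = 1.0 if execute_in_sim(task, actions) else 0.0
            loss = -reward * logprob      # maximize log-likelihood of executable rollouts
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```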

Load-bearing premise

That successful execution of converted actions in the chosen simulation environments accurately reflects the physical consistency required for real-world robot performance.

What would settle it

A video world model that achieves high visual scores yet shows near-zero success rates on RoboWM-Bench execution across multiple tasks, or the reverse pattern of low visual quality but high execution success, would strengthen the misalignment claim; consistent alignment across models would weaken it.
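
One way to operationalize that test is to check whether a visual-quality score and the RoboWM-Bench execution rate rank models the same way. A minimal sketch with made-up placeholder numbers; neither the scores nor the model names come from the paper.

```python
from scipy.stats import spearmanr

# Placeholder values for illustration only (not reported results).
visual_score = {"model_a": 0.91, "model_b": 0.84, "model_c": 0.78}  # PAI-Bench-style quality
exec_success = {"model_a": 0.12, "model_b": 0.55, "model_c": 0.49}  # RoboWM-Bench-style execution rate

models = sorted(visual_score)
rho, p = spearmanr([visual_score[m] for m in models],
                   [exec_success[m] for m in models])
# A low or negative rank correlation across many models would support the
# misalignment claim; a consistently high one would weaken it.
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")
```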

Figures

Figures reproduced from arXiv: 2604.19092 by Chen Xie, Feng Jiang, HaiFeng Wang, Jasper Lu, Kyle Xu, Ruihai Wu, Shengze Huang, Yang Chen, Yuanfei Wang, Yuchen Liu, Zhenhao Shen.

Figure 1: Overview of RoboWM-Bench. RoboWM-Bench is a manipulation-centric benchmark for evaluating video world models under embodied execution. (a) Given an initial scene observation and task description, world models generate manipulation videos with human hands or robot arms. The predicted behaviors are converted into embodied action sequences and validated in simulation through real-to-sim scene reconstruction. …

Figure 2: Pipeline of RoboWM-Bench. Given an initial scene observation, the corresponding real-world scene is reconstructed in simulation through a real-to-sim pipeline, enabling consistent and reproducible evaluation. Predicted videos are then converted into executable robot actions through two pathways: human-centric retargeting, which estimates 3D hand poses and retargets them to robot end-effector actions, and …

Figure 3: Qualitative execution results on RoboWM-Bench.

Figure 4: Comparison between PAI-Bench and RoboWM-Bench.

Figure 5: Real-to-sim consistency evaluation. Identical manipulation trajectories are executed in real-world scenes and reconstructed simulation environments, yielding consistent success and failure outcomes. For robotic videos, we compare two IDM training strategies: IDM-Real, trained directly on real-world data (50 trajectories per task), and IDM-Sim+Real, a two-stage approach consisting of simulation pretraining fo…

Figure 6: Comparison between the average quality scores in PAI-Bench and the execution accuracy in RoboWM-Bench. The left scatter plot shows human-hand tasks, and the right scatter plot shows robotic tasks.

Figure 7: Additional qualitative results on RoboWM-Bench.

Figure 8: Additional qualitative results on real-to-sim consistency evaluation.

Figure 9: Visualization of depth estimation results. (a) The predicted absolute depth shows a large discrepancy from the ground-truth. (b) Aligning relative depth with the first-frame ground-truth depth improves consistency, although non-negligible errors still remain.
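
The alignment step described in Figure 9, fitting relative depth to the first-frame ground-truth depth, is generically a least-squares scale-and-shift fit. A minimal sketch under that reading; the paper may use a different fitting procedure.

```python
import numpy as np

def fit_scale_shift(rel_depth: np.ndarray, gt_depth: np.ndarray, valid: np.ndarray):
    """Fit s, b so that s * rel_depth + b best matches the ground-truth depth
    (least squares over valid pixels of the first frame)."""
    x = rel_depth[valid].ravel()
    y = gt_depth[valid].ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(s), float(b)

# Usage: s, b = fit_scale_shift(rel_frame0, gt_frame0, gt_frame0 > 0)
# then apply metric_t = s * rel_frame_t + b to every later frame.
```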
Original abstract

Recent advances in large-scale video world models have enabled increasingly realistic future prediction, raising the prospect of using generated videos as scalable supervision for robot learning. However, for embodied manipulation, perceptual realism alone is not sufficient: generated interactions must also be physically consistent and executable by robotic agents. Existing benchmarks provide valuable assessments of visual quality and physical plausibility, but they do not systematically evaluate whether predicted behaviors can be translated into executable actions that complete manipulation tasks. We introduce RoboWM-Bench, a manipulation-centric benchmark for embodiment-grounded evaluation of video world models. RoboWM-Bench converts generated human-hand and robotic manipulation videos into embodied action sequences and validates them through execution in physically grounded simulation environments. Built on real-to-sim scene reconstruction and diverse manipulation tasks, RoboWM-Bench enables standardized, reproducible, and scalable evaluation of physical executability. Using RoboWM-Bench, we evaluate state-of-the-art video world models and observe that visual plausibility and embodied executability are not always aligned. Our analysis highlights several recurring factors that affect execution performance, including spatial reasoning, contact prediction, and non-physical geometric distortions, particularly in complex and long-horizon interactions. These findings provide a more fine-grained view of current model capabilities and underscore the value of embodiment-aware evaluation for guiding physically grounded world modeling in robotic manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RoboWM-Bench, a manipulation-centric benchmark that converts generated videos of human-hand and robotic interactions into embodied action sequences (via pose estimation, contact detection, and trajectory fitting) and evaluates their executability in physically grounded simulation environments reconstructed from real scenes. It reports evaluations of state-of-the-art video world models, finding that visual plausibility does not always align with successful task completion, and identifies recurring failure modes such as poor spatial reasoning, inaccurate contact prediction, and non-physical geometric distortions in long-horizon tasks.

Significance. If the video-to-action conversion pipeline is validated, the benchmark would offer a valuable, scalable, embodiment-grounded complement to existing visual and physical-plausibility metrics, helping steer world-model development toward predictions that are not only realistic but also directly usable for robotic manipulation.

major comments (2)
  1. [§3.2] The video-to-action conversion pipeline (pose estimation, contact detection, trajectory fitting) lacks any quantitative validation against ground-truth actions extracted from real videos; without this check, execution failures in simulation cannot be confidently attributed to the world model rather than to conversion-pipeline errors induced by the visual artifacts typical of current generators.
  2. [§4] The central claim that visual plausibility and embodied executability diverge is presented without baseline comparisons, error bars, or per-model quantitative metrics (e.g., success rates, contact accuracy); the abstract-level summary alone does not allow assessment of effect sizes or statistical significance.
minor comments (2)
  1. [Abstract] Specify the exact set of video world models evaluated and the precise simulation environments/tasks used, including any parameter settings for the real-to-sim reconstruction.
  2. [§3.1] Clarify the criteria used to select 'diverse manipulation tasks' and how long-horizon interactions are defined and segmented.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and will make revisions to address them. Our point-by-point responses are provided below.

Point-by-point responses
  1. Referee: [§3.2] The video-to-action conversion pipeline (pose estimation, contact detection, trajectory fitting) lacks any quantitative validation against ground-truth actions extracted from real videos; without this check, execution failures in simulation cannot be confidently attributed to the world model rather than to conversion-pipeline errors induced by the visual artifacts typical of current generators.

    Authors: We agree that a quantitative validation of the video-to-action pipeline against ground-truth actions from real videos would strengthen the attribution of execution failures. In the revised manuscript, we will add a dedicated validation subsection in §3.2. This will apply the full pipeline (pose estimation, contact detection, and trajectory fitting) to the original real demonstration videos, for which ground-truth actions are available from the data collection process. We will report metrics including average joint angle error, contact detection precision/recall, and end-effector trajectory RMSE to quantify pipeline fidelity under realistic visual conditions. revision: yes
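
(Editorial sketch, not from the manuscript: the proposed fidelity metrics could be computed along these lines, assuming (T, 3) end-effector trajectories and per-frame binary contact labels.)

```python
import numpy as np

def trajectory_rmse(pred_xyz: np.ndarray, gt_xyz: np.ndarray) -> float:
    """End-effector trajectory RMSE; both arrays have shape (T, 3)."""
    return float(np.sqrt(np.mean(np.sum((pred_xyz - gt_xyz) ** 2, axis=-1))))

def contact_precision_recall(pred_contact: np.ndarray, gt_contact: np.ndarray):
    """Precision/recall of per-frame binary contact predictions."""
    pred = pred_contact.astype(bool)
    gt = gt_contact.astype(bool)
    tp = np.sum(pred & gt)
    precision = tp / max(int(np.sum(pred)), 1)
    recall = tp / max(int(np.sum(gt)), 1)
    return float(precision), float(recall)
```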

  2. Referee: [§4] The central claim that visual plausibility and embodied executability diverge is presented without baseline comparisons, error bars, or per-model quantitative metrics (e.g., success rates, contact accuracy); the abstract-level summary alone does not allow assessment of effect sizes or statistical significance.

    Authors: We acknowledge that the current results section would benefit from expanded quantitative reporting to better support the central claim. While the manuscript already evaluates multiple state-of-the-art models on a range of tasks, we will revise §4 to include per-model success rates in simulation, contact accuracy scores, comparisons against baselines (e.g., ground-truth video execution and simple motion predictors), error bars computed across multiple task instances or random seeds, and basic statistical significance tests. These additions will allow clearer assessment of effect sizes and the divergence between visual and embodied metrics. revision: yes
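
(Editorial sketch, not from the manuscript: per-model success rates with bootstrap confidence intervals, one way to produce the error bars promised above.)

```python
import numpy as np

def success_rate_ci(outcomes, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap a (1 - alpha) confidence interval for a success rate from
    per-episode binary outcomes (array of 0s and 1s)."""
    rng = np.random.default_rng(seed)
    outcomes = np.asarray(outcomes, dtype=float)
    boots = rng.choice(outcomes, size=(n_boot, len(outcomes)), replace=True).mean(axis=1)
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return float(outcomes.mean()), (float(lo), float(hi))
```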

Circularity Check

0 steps flagged

No circularity: benchmark evaluation is externally grounded in simulation execution

full rationale

The paper introduces RoboWM-Bench as a new evaluation framework that converts generated videos into action sequences via pose estimation and contact detection, then executes them in independent simulation environments. The key observation—that visual plausibility and embodied executability diverge—is presented as an empirical outcome from running this pipeline on external state-of-the-art video world models, not as a quantity derived from fitted parameters, self-referential equations, or self-citation chains. No load-bearing step reduces by construction to the paper's own inputs; the conversion and execution steps are described as external validation mechanisms rather than tautological redefinitions. This satisfies the default expectation of a non-circular benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that simulation environments can serve as a valid proxy for physical executability; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Physically grounded simulation environments accurately validate the executability of actions derived from generated videos
    This assumption underpins the entire validation step described in the abstract.

pith-pipeline@v0.9.0 · 5559 in / 1207 out tokens · 57329 ms · 2026-05-15T06:11:51.626507+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 15 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Agarwal, N., Ali, A., Bala, M., Balaji, Y., Barker, E., Cai, T., Chattopadhyay, P., Chen, Y., Cui, Y., Ding, Y., et al.: Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575 (2025)

  2. [2]

    World Simulation with Video Foundation Models for Physical AI

    Ali, A., Bai, J., Bala, M., Balaji, Y., Blakeman, A., Cai, T., Cao, J., Cao, T., Cha, E., Chao, Y.W., et al.: World simulation with video foundation models for physical ai. arXiv preprint arXiv:2511.00062 (2025)

  3. [3]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., et al.: V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985 (2025)

  4. [4]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

  5. [5]

    Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos

    Baker, B., Akkaya, I., Zhokov, P., Huizinga, J., Tang, J., Ecoffet, A., Houghton, B., Sampedro, R., Clune, J.: Video PreTraining (VPT): Learning to act by watching unlabeled online videos. Advances in Neural Information Processing Systems 35, 24639–24654 (2022)

  6. [6]

    Gen2Act: Human Video Generation in Novel Scenarios Enables Generalizable Robot Manipulation

    Bharadhwaj, H., Dwibedi, D., Gupta, A., Tulsiani, S., Doersch, C., Xiao, T., Shah, D., Xia, F., Sadigh, D., Kirmani, S.: Gen2Act: Human video generation in novel scenarios enables generalizable robot manipulation. arXiv preprint arXiv:2409.16283 (2024)

  7. [7]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Bjorck, J., Castañeda, F., Cherniadev, N., Da, X., Ding, R., Fan, L., Fang, Y., Fox, D., Hu, F., Huang, S., et al.: Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 (2025)

  8. [8]

    Brooks, T., Peebles, W., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Ng, C., Wang, R., Ramesh, A., et al.: Video generation models as world simulators. https://openai.com/research/video-generation-models-as-world-simulators (2024), OpenAI Technical Report

  9. [9]

    Large Video Planner Enables Generalizable Robot Control

    Chen, B., Zhang, T., Geng, H., Song, K., Zhang, C., Li, P., Freeman, W.T., Malik, J., Abbeel, P., Tedrake, R., et al.: Large video planner enables generalizable robot control. arXiv preprint arXiv:2512.15840 (2025)

  10. [10]

    Video Depth Anything: Consistent Depth Estimation for Super-Long Videos

    Chen, S., Guo, H., Zhu, S., Zhang, F., Huang, Z., Feng, J., Kang, B.: Video Depth Anything: Consistent depth estimation for super-long videos. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 22831–22840 (2025)

  11. [11]

    SAM 3D: 3Dfy Anything in Images

    Chen, X., Chu, F.J., Gleize, P., Liang, K.J., Sax, A., Tang, H., Wang, W., Guo, M., Hardin, T., Li, X., et al.: Sam 3d: 3dfy anything in images. arXiv preprint arXiv:2511.16624 (2025)

  12. [12]

    WoW: Towards a World Omniscient World Model through Embodied Interaction

    Chi, X., Jia, P., Fan, C.K., Ju, X., Mi, W., Zhang, K., Qin, Z., Tian, W., Ge, K., Li, H., et al.: WoW: Towards a world omniscient world model through embodied interaction. arXiv preprint arXiv:2509.22642 (2025)

  13. [13]

    Learning Universal Policies via Text-Guided Video Generation

    Du, Y., Yang, S., Dai, B., Dai, H., Nachum, O., Tenenbaum, J., Schuurmans, D., Abbeel, P.: Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems 36, 9156–9172 (2023)

  14. [14]

    Wow, Wo, Val! A Comprehensive Embodied World Model Evaluation Turing Test

    Fan, C.K., Chi, X., Ju, X., Li, H., Bao, Y., Wang, Y.K., Chen, L., Jiang, Z., Ge, K., Li, Y., et al.: Wow, wo, val! A comprehensive embodied world model evaluation turing test. arXiv preprint arXiv:2601.04137 (2026)

  15. [15]

    TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation

    Feng, W., Li, J., Saxon, M., Fu, T.J., Chen, W., Wang, W.Y.: TC-Bench: Benchmarking temporal compositionality in text-to-video and image-to-video generation. arXiv preprint arXiv:2406.08656 (2024)

  16. [16]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Gao, Y., Guo, H., Hoang, T., Huang, W., Jiang, L., Kong, F., Li, H., Li, J., Li, L., Li, X., et al.: Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113 (2025)

  17. [17]

    Veo: A Text-to-Video Generation System (Veo-3 Technical Report)

    Google DeepMind: Veo: A text-to-video generation system (Veo-3 technical report). Tech. Rep. Veo-3-Tech-Report, Google DeepMind (2025), https://storage.googleapis.com/deepmind-media/veo/Veo-3-Tech-Report.pdf, technical report

  18. [18]

    Ego4D: Around the World in 3,000 Hours of Egocentric Video

    Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Hamburger, J., Jiang, H., Liu, M., Liu, X., et al.: Ego4D: Around the world in 3,000 hours of egocentric video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18995–19012 (2022)

  19. [19]

    LTX-Video: Realtime Video Latent Diffusion

    HaCohen, Y., Chiprut, N., Brazowski, B., Shalem, D., Moshe, D., Richardson, E., Levin, E., Shiran, G., Zabari, N., Gordon, O., et al.: Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103 (2024)

  20. [20]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Hong, W., Ding, M., Zheng, W., Liu, X., Tang, J.: CogVideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868 (2022)

  21. [21]

    EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation

    Huang, S., Chen, L., Zhou, P., Chen, S., Jiang, Z., Hu, Y., Liao, Y., Gao, P., Li, H., Yao, M., et al.: EnerVerse: Envisioning embodied future space for robotics manipulation. arXiv preprint arXiv:2501.01895 (2025)

  22. [22]

    VBench: Comprehensive Benchmark Suite for Video Generative Models

    Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: VBench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818 (2024)

  23. [23]

    DreamGen: Unlocking Generalization in Robot Learning through Video World Models

    Jang, J., Ye, S., Lin, Z., Xiang, J., Bjorck, J., Fang, Y., Hu, F., Huang, S., Kundalia, K., Lin, Y.C., et al.: DreamGen: Unlocking generalization in robot learning through video world models. arXiv preprint arXiv:2505.12705 (2025)

  24. [24]

    T2VBench: Benchmarking Temporal Dynamics for Text-to-Video Generation

    Ji, P., Xiao, C., Tai, H., Huo, M.: T2VBench: Benchmarking temporal dynamics for text-to-video generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5325–5335 (2024)

  25. [25]

    Emergence of Human to Robot Transfer in Vision-Language-Action Models

    Kareer, S., Pertsch, K., Darpinian, J., Hoffman, J., Xu, D., Levine, S., Finn, C., Nair, S.: Emergence of human to robot transfer in vision-language-action models. arXiv preprint arXiv:2512.22414 (2025)

  26. [26]

    MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare

    Labbé, Y., Manuelli, L., Mousavian, A., Tyree, S., Birchfield, S., Tremblay, J., Carpentier, J., Aubry, M., Fox, D., Sivic, J.: MegaPose: 6D pose estimation of novel objects via render & compare. arXiv preprint arXiv:2212.06870 (2022)

  27. [27]

    Masquerade: Learning from In-the-Wild Human Videos Using Data-Editing

    Lepert, M., Fang, J., Bohg, J.: Masquerade: Learning from in-the-wild human videos using data-editing. arXiv preprint arXiv:2508.09976 (2025)

  28. [28]

    Phantom: Training Robots Without Robots Using Only Human Videos

    Lepert, M., Fang, J., Bohg, J.: Phantom: Training robots without robots using only human videos. URL https://arxiv.org/abs/2503.007792 (2025)

  29. [29]

    WorldModelBench: Judging Video Generation Models as World Models

    Li, D., Fang, Y., Chen, Y., Yang, S., Cao, S., Wong, J., Luo, M., Wang, X., Yin, H., Gonzalez, J.E., et al.: WorldModelBench: Judging video generation models as world models. arXiv preprint arXiv:2502.20694 (2025)

  30. [30]

    LeHome: A Simulation Environment for Deformable Object Manipulation in Household Scenarios

    Li, Z., Yang, J., Xu, J., Xie, S., Chen, T., Wang, Y., Shen, Z., Shen, Y., Zheng, Y., Li, W., et al.: LeHome: A simulation environment for deformable object manipulation in household scenarios. In: IROS 2025 5th Workshop on Robotic Manipulation of Deformable Objects: Holistic Approaches and Challenges Forward

  31. [31]

    MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos

    Li, Z., Tucker, R., Cole, F., Wang, Q., Jin, L., Ye, V., Kanazawa, A., Holynski, A., Snavely, N.: MegaSaM: Accurate, fast and robust structure and motion from casual dynamic videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10486–10496 (2025)

  32. [32]

    VMBench: A Benchmark for Perception-Aligned Video Motion Generation

    Ling, X., Zhu, C., Wu, M., Li, H., Feng, X., Yang, C., Hao, A., Zhu, J., Wu, J., Chu, X.: VMBench: A benchmark for perception-aligned video motion generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13087–13098 (2025)

  33. [33]

    EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

    Liu, Y., Cun, X., Liu, X., Wang, X., Zhang, Y., Chen, H., Liu, Y., Zeng, T., Chan, R., Shan, Y.: EvalCrafter: Benchmarking and evaluating large video generation models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22139–22149 (2024)

  34. [34]

    Grounding Video Models to Actions through Goal Conditioned Exploration

    Luo, Y., Du, Y.: Grounding video models to actions through goal conditioned exploration. arXiv preprint arXiv:2411.07223 (2024)

  35. [35]

    VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

    Ma, Y.J., Sodhani, S., Jayaraman, D., Bastani, O., Kumar, V., Zhang, A.: Vip: Towards universal visual reward and representation via value-implicit pre-training. arXiv preprint arXiv:2210.00030 (2022)

  36. [36]

    Towards Generalist Robot Learning from Internet Video: A Survey

    McCarthy, R., Tan, D.C., Schmidt, D., Acero, F., Herr, N., Du, Y., Thuruthel, T.G., Li, Z.: Towards generalist robot learning from internet video: A survey. Journal of Artificial Intelligence Research 83 (2025)

  37. [37]

    R3M: A Universal Visual Representation for Robot Manipulation

    Nair, S., Rajeswaran, A., Kumar, V., Finn, C., Gupta, A.: R3M: A universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601 (2022)

  38. [38]

    NVIDIA Isaac Sim: High-Fidelity Simulation for Robotics

    NVIDIA Corporation: NVIDIA Isaac Sim: High-fidelity simulation for robotics. https://developer.nvidia.com/isaac-sim (2023), accessed: 2026-03-03

  39. [39]

    Reconstructing Hands in 3D with Transformers

    Pavlakos, G., Shan, D., Radosavovic, I., Kanazawa, A., Fouhey, D., Malik, J.: Reconstructing hands in 3D with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9826–9836 (2024)

  40. [40]

    EgoBridge: Domain Adaptation for Generalizable Imitation from Egocentric Human Data

    Punamiya, R., Patel, D., Aphiwetsa, P., Kuppili, P., Zhu, L.Y., Kareer, S., Hoffman, J., Xu, D.: EgoBridge: Domain adaptation for generalizable imitation from egocentric human data. In: Human to Robot: Workshop on Sensorizing, Modeling, and Learning from Humans (2025)

  41. [41]

    DexMV: Imitation Learning for Dexterous Manipulation from Human Videos

    Qin, Y., Wu, Y.H., Liu, S., Jiang, H., Yang, R., Fu, Y., Wang, X.: DexMV: Imitation learning for dexterous manipulation from human videos. In: European Conference on Computer Vision. pp. 570–587. Springer (2022)

  42. [42]

    Real-World Robot Learning with Masked Visual Pre-Training

    Radosavovic, I., Xiao, T., James, S., Abbeel, P., Malik, J., Darrell, T.: Real-world robot learning with masked visual pre-training. In: Conference on Robot Learning. pp. 416–426. PMLR (2023)

  43. [43]

    WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models

    Shang, Y., Li, Z., Ma, Y., Su, W., Jin, X., Wang, Z., Jin, L., Zhang, X., Tang, Y., Su, H., et al.: WorldArena: A unified benchmark for evaluating perception and functional utility of embodied world models. arXiv preprint arXiv:2602.08971 (2026)

  44. [44]

    Evaluating Gemini Robotics Policies in a Veo World Simulator

    Team, G.R., Choromanski, K., Devin, C., Du, Y., Dwibedi, D., Gao, R., Jindal, A., Kipf, T., Kirmani, S., Leal, I., et al.: Evaluating Gemini Robotics policies in a Veo world simulator. arXiv preprint arXiv:2512.10675 (2025)

  45. [45]

    GigaBrain-0: A World Model-Powered Vision-Language-Action Model

    Team, G., Ye, A., Wang, B., Ni, C., Huang, G., Zhao, G., Li, H., Li, J., Zhu, J., Feng, L., et al.: GigaBrain-0: A world model-powered vision-language-action model. arXiv preprint arXiv:2510.19430 (2025)

  46. [46]

    GigaWorld-0: World Models as Data Engine to Empower Embodied AI

    Team, G., Ye, A., Wang, B., Ni, C., Huang, G., Zhao, G., Li, H., Zhu, J., Li, K., Xu, M., et al.: GigaWorld-0: World models as data engine to empower embodied AI. arXiv preprint arXiv:2511.19861 (2025)

  47. [47]

    KlingAvatar 2.0 Technical Report

    Team, K., Chen, J., Ding, Y., Fang, Z., Gai, K., Gao, Y., He, K., Hua, J., Jiang, B., Lao, M., et al.: KlingAvatar 2.0 technical report. arXiv preprint arXiv:2512.13313 (2025)

  48. [48]

    Predictive Inverse Dynamics Models Are Scalable Learners for Robotic Manipulation

    Tian, Y., Yang, S., Zeng, J., Wang, P., Lin, D., Dong, H., Pang, J.: Predictive inverse dynamics models are scalable learners for robotic manipulation. arXiv preprint arXiv:2412.15109 (2024)

  49. [49]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

  50. [50]

    LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models

    Wang, Y., Chen, X., Ma, X., Zhou, S., Huang, Z., Wang, Y., Yang, C., He, Y., Yu, J., Yang, P., et al.: LaVie: High-quality video generation with cascaded latent diffusion models. International Journal of Computer Vision 133(5), 3059–3078 (2025)

  51. [51]

    FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects

    Wen, B., Yang, W., Kautz, J., Birchfield, S.: FoundationPose: Unified 6D pose estimation and tracking of novel objects. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17868–17879 (2024)

  52. [52]

    Marble: A Multimodal World Model

    World Labs: Marble: A multimodal world model (2025), https://www.worldlabs.ai/blog/marble-world-model, accessed: 2026-02

  53. [53]

    HunyuanVideo 1.5 Technical Report

    Wu, B., Zou, C., Li, C., Huang, D., Yang, F., Tan, H., Peng, J., Wu, J., Xiong, J., Jiang, J., et al.: Hunyuanvideo 1.5 technical report. arXiv preprint arXiv:2511.18870 (2025)

  54. [54]

    Foundation Feature-Driven Online End-Effector Pose Estimation: A Marker-Free and Learning-Free Approach

    Wu, T., Zhang, J., Liang, S., Han, Z., Dong, H.: Foundation feature-driven online end-effector pose estimation: A marker-free and learning-free approach. In: 2025 IEEE International Conference on Robotics and Automation (ICRA). pp. 1921–

  55. [55]

    Learning by Watching: Physical Imitation of Manipulation Skills from Human Videos

    Xiong, H., Li, Q., Chen, Y.C., Bharadhwaj, H., Sinha, S., Garg, A.: Learning by watching: Physical imitation of manipulation skills from human videos. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 7827–7834. IEEE (2021)

  56. [56]

    Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

    Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., Zhao, H.: Depth Anything: Unleashing the power of large-scale unlabeled data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10371–10381 (2024)

  57. [57]

    Learning Interactive Real-World Simulators

    Yang, M., Du, Y., Ghasemipour, K., Tompson, J., Schuurmans, D., Abbeel, P.: Learning interactive real-world simulators. arXiv preprint arXiv:2310.06114 1(2), 6 (2023)

  58. [58]

    Latent Action Pretraining from Videos

    Ye, S., Jang, J., Jeon, B., Joo, S., Yang, J., Peng, B., Mandlekar, A., Tan, R., Chao, Y.W., Lin, B.Y., et al.: Latent action pretraining from videos. arXiv preprint arXiv:2410.11758 (2024)

  59. [59]

    EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied World Models

    Yue, H., Huang, S., Liao, Y., Chen, S., Zhou, P., Chen, L., Yao, M., Ren, G.: EWMBench: Evaluating scene, motion, and semantic quality in embodied world models. arXiv preprint arXiv:2505.09694 (2025)

  60. [60]

    VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    Zheng, D., Huang, Z., Liu, H., Zou, K., He, Y., Zhang, F., Gu, L., Zhang, Y., He, J., Zheng, W.S., et al.: Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755 (2025)

  61. [61]

    PAI-Bench: A Comprehensive Benchmark for Physical AI

    Zhou, F., Huang, J., Li, J., Ramanan, D., Shi, H.: PAI-Bench: A comprehensive benchmark for physical AI. arXiv preprint arXiv:2512.01989 (2025)

  62. [62]

    Zhou, S., Du, Y., Chen, J., Li, Y., Yeung, D.Y., Gan, C.: RoboDreamer: Learning compositional world models for robot imagination. arXiv preprint arXiv:2404.12377 (2024)