RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-15 06:11 UTC · model grok-4.3
The pith
A new benchmark shows that visually realistic robot videos often cannot be executed as actions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RoboWM-Bench converts generated human-hand and robotic manipulation videos into embodied action sequences and validates them through execution in physically grounded simulation environments built on real-to-sim reconstruction. When applied to state-of-the-art video world models, it shows that visual plausibility and embodied executability are not always aligned, with recurring problems in spatial reasoning, contact prediction, and non-physical geometric distortions, particularly in complex and long-horizon interactions.
What carries the argument
RoboWM-Bench, which translates video predictions into executable action sequences for validation in simulation.
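A minimal sketch of the shape of that translation-and-validation loop, assuming hypothetical component interfaces (pose_estimator, contact_detector, trajectory_fitter, and a sim_env exposing reset/step/task_success); the paper's actual APIs are not published, so every name here is illustrative:

```python
from dataclasses import dataclass

@dataclass
class ActionStep:
    """One step of the converted action sequence: an end-effector pose
    plus a binary gripper command derived from contact detection."""
    ee_pose: tuple          # e.g. (x, y, z, qx, qy, qz, qw); format is assumed
    gripper_closed: bool

def video_to_actions(frames, pose_estimator, contact_detector, trajectory_fitter):
    """Convert a generated manipulation video into an executable action
    sequence. All three callables are hypothetical stand-ins for the
    pose-estimation / contact-detection / trajectory-fitting stages
    the benchmark describes."""
    poses = [pose_estimator(f) for f in frames]       # per-frame hand/EE pose
    contacts = [contact_detector(f) for f in frames]  # per-frame contact state
    smoothed = trajectory_fitter(poses)               # temporally consistent trajectory
    return [ActionStep(p, c) for p, c in zip(smoothed, contacts)]

def evaluate_executability(actions, sim_env):
    """Replay the converted actions in a physically grounded simulator
    and report binary task success."""
    sim_env.reset()
    for step in actions:
        sim_env.step(step.ee_pose, step.gripper_closed)
    return sim_env.task_success()
```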
If this is right
- World models for robotics need training signals that enforce physical executability beyond visual fidelity.
- Embodied execution tests should become standard alongside visual metrics when assessing manipulation world models.
- Complex and long-horizon tasks expose larger gaps between predicted visuals and workable actions.
- Real-to-sim reconstruction enables scalable, standardized checks of physical consistency in generated videos.
Where Pith is reading between the lines
- World model training could incorporate simulation-based feedback loops to directly optimize for action success rather than pixels alone.
- The same video-to-execution pipeline might apply to evaluating world models in other embodied settings such as navigation or assembly.
- Pure scaling of visual pretraining may hit limits in robotics without explicit embodiment grounding.
Load-bearing premise
That successful execution of converted actions in the chosen simulation environments accurately reflects the physical consistency required for real-world robot performance.
What would settle it
A video world model that achieves high visual scores yet shows near-zero success rates on RoboWM-Bench execution across multiple tasks, or the reverse pattern of low visual quality but high execution success, would strengthen the misalignment claim; consistent alignment across models would weaken it.
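One concrete way to check for that misalignment pattern is a rank correlation between per-model visual scores and execution success rates. A sketch under the assumption that both are available as parallel lists; all numbers below are illustrative placeholders, not the paper's results:

```python
from scipy.stats import spearmanr

# Hypothetical per-model results: a visual-quality score (e.g., a
# VBench-style aggregate) and an execution success rate on the benchmark.
visual_scores = [0.91, 0.88, 0.84, 0.79, 0.72]
success_rates = [0.12, 0.35, 0.08, 0.41, 0.30]

rho, p_value = spearmanr(visual_scores, success_rates)
# A strong positive rho across models would weaken the misalignment claim;
# a weak or negative rho would support it.
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```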
Original abstract
Recent advances in large-scale video world models have enabled increasingly realistic future prediction, raising the prospect of using generated videos as scalable supervision for robot learning. However, for embodied manipulation, perceptual realism alone is not sufficient: generated interactions must also be physically consistent and executable by robotic agents. Existing benchmarks provide valuable assessments of visual quality and physical plausibility, but they do not systematically evaluate whether predicted behaviors can be translated into executable actions that complete manipulation tasks. We introduce RoboWM-Bench, a manipulation-centric benchmark for embodiment-grounded evaluation of video world models. RoboWM-Bench converts generated human-hand and robotic manipulation videos into embodied action sequences and validates them through execution in physically grounded simulation environments. Built on real-to-sim scene reconstruction and diverse manipulation tasks, RoboWM-Bench enables standardized, reproducible, and scalable evaluation of physical executability. Using RoboWM-Bench, we evaluate state-of-the-art video world models and observe that visual plausibility and embodied executability are not always aligned. Our analysis highlights several recurring factors that affect execution performance, including spatial reasoning, contact prediction, and non-physical geometric distortions, particularly in complex and long-horizon interactions. These findings provide a more fine-grained view of current model capabilities and underscore the value of embodiment-aware evaluation for guiding physically grounded world modeling in robotic manipulation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RoboWM-Bench, a manipulation-centric benchmark that converts generated videos of human-hand and robotic interactions into embodied action sequences (via pose estimation, contact detection, and trajectory fitting) and evaluates their executability in physically grounded simulation environments reconstructed from real scenes. It reports evaluations of state-of-the-art video world models, finding that visual plausibility does not always align with successful task completion, and identifies recurring failure modes such as poor spatial reasoning, inaccurate contact prediction, and non-physical geometric distortions in long-horizon tasks.
Significance. If the video-to-action conversion pipeline is validated, the benchmark would offer a valuable, scalable, embodiment-grounded complement to existing visual and physical-plausibility metrics, helping steer world-model development toward predictions that are not only realistic but also directly usable for robotic manipulation.
major comments (2)
- [§3.2] The video-to-action conversion pipeline (pose estimation, contact detection, trajectory fitting) lacks any quantitative validation against ground-truth actions extracted from real videos; without this check, execution failures in simulation cannot be confidently attributed to the world model rather than to pipeline artifacts arising from the visual artifacts typical of current generators.
- [§4] The central claim that visual plausibility and embodied executability diverge is presented without baseline comparisons, error bars, or per-model quantitative metrics (e.g., success rates, contact accuracy); the abstract-level summary alone does not allow assessment of effect sizes or statistical significance.
minor comments (2)
- [Abstract] Specify the exact set of video world models evaluated and the precise simulation environments/tasks used, including any parameter settings for the real-to-sim reconstruction.
- [§3.1] Clarify the criteria used to select 'diverse manipulation tasks' and how long-horizon interactions are defined and segmented.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and will make revisions to address them. Our point-by-point responses are provided below.
Point-by-point responses
Referee: [§3.2] The video-to-action conversion pipeline (pose estimation, contact detection, trajectory fitting) lacks any quantitative validation against ground-truth actions extracted from real videos; without this check, execution failures in simulation cannot be confidently attributed to the world model rather than to pipeline artifacts arising from the visual artifacts typical of current generators.
Authors: We agree that a quantitative validation of the video-to-action pipeline against ground-truth actions from real videos would strengthen the attribution of execution failures. In the revised manuscript, we will add a dedicated validation subsection in §3.2. This will apply the full pipeline (pose estimation, contact detection, and trajectory fitting) to the original real demonstration videos, for which ground-truth actions are available from the data collection process. We will report metrics including average joint angle error, contact detection precision/recall, and end-effector trajectory RMSE to quantify pipeline fidelity under realistic visual conditions.
Revision: yes
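A minimal sketch of the three proposed fidelity metrics, assuming ground-truth and pipeline-recovered signals are frame-aligned NumPy arrays; function names and array shapes are illustrative, not the authors' implementation:

```python
import numpy as np

def joint_angle_error(gt_joints, pred_joints):
    """Mean absolute per-joint angle error in radians.
    Both arrays have shape (T, num_joints)."""
    return np.abs(gt_joints - pred_joints).mean()

def contact_precision_recall(gt_contact, pred_contact):
    """Precision and recall of the binary per-frame contact signal."""
    gt = np.asarray(gt_contact, dtype=bool)
    pred = np.asarray(pred_contact, dtype=bool)
    tp = np.sum(gt & pred)
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    return precision, recall

def trajectory_rmse(gt_xyz, pred_xyz):
    """RMSE of end-effector positions; both arrays have shape (T, 3)."""
    sq_dist = ((gt_xyz - pred_xyz) ** 2).sum(axis=1)  # per-frame squared distance
    return np.sqrt(sq_dist.mean())
```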
Referee: [§4] The central claim that visual plausibility and embodied executability diverge is presented without baseline comparisons, error bars, or per-model quantitative metrics (e.g., success rates, contact accuracy); the abstract-level summary alone does not allow assessment of effect sizes or statistical significance.
Authors: We acknowledge that the current results section would benefit from expanded quantitative reporting to better support the central claim. While the manuscript already evaluates multiple state-of-the-art models on a range of tasks, we will revise §4 to include per-model success rates in simulation, contact accuracy scores, comparisons against baselines (e.g., ground-truth video execution and simple motion predictors), error bars computed across multiple task instances or random seeds, and basic statistical significance tests. These additions will allow clearer assessment of effect sizes and the divergence between visual and embodied metrics.
Revision: yes
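A sketch of one way the promised error bars could be computed: a per-model success rate with a bootstrap confidence interval over episodes. The episode outcomes here are randomly generated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def success_rate_with_ci(outcomes, n_boot=10_000, alpha=0.05):
    """Success rate plus a bootstrap confidence interval.
    outcomes: binary array of per-episode task successes."""
    outcomes = np.asarray(outcomes, dtype=float)
    rate = outcomes.mean()
    # Resample episodes with replacement and recompute the mean each time.
    boots = rng.choice(outcomes, size=(n_boot, len(outcomes))).mean(axis=1)
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return rate, (lo, hi)

# Illustrative episode-level outcomes for one model on one task suite.
rate, (lo, hi) = success_rate_with_ci(rng.integers(0, 2, size=50))
print(f"success rate = {rate:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```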
Circularity Check
No circularity: benchmark evaluation is externally grounded in simulation execution
Full rationale
The paper introduces RoboWM-Bench as a new evaluation framework that converts generated videos into action sequences via pose estimation and contact detection, then executes them in independent simulation environments. The key observation—that visual plausibility and embodied executability diverge—is presented as an empirical outcome from running this pipeline on external state-of-the-art video world models, not as a quantity derived from fitted parameters, self-referential equations, or self-citation chains. No load-bearing step reduces by construction to the paper's own inputs; the conversion and execution steps are described as external validation mechanisms rather than tautological redefinitions. This satisfies the default expectation of a non-circular benchmark paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Physically grounded simulation environments accurately validate the executability of actions derived from generated videos.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
Unclear relation between the paper passage and the cited Recognition theorem.
RoboWM-Bench converts generated human-hand and robotic manipulation videos into embodied action sequences and validates them through execution in physically grounded simulation environments.
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (unclear)
Unclear relation between the paper passage and the cited Recognition theorem.
We adopt a real-to-simulation (real-to-sim) framework in which scenes and interaction dynamics are reconstructed in simulation to match their real-world counterparts.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.