RoboWorld: Fast and Reliable Neural Simulators for Generalist Robot Policy Evaluation

Byeongguk Jeon; Hyungmok Son; JaeHyeok Doo; Kimin Lee; Minjoon Seo; Seonghyeon Ye; Sungdong Kim

arxiv: 2607.01060 · v1 · pith:YY75ORKAnew · submitted 2026-07-01 · 💻 cs.RO

Byeongguk Jeon, Seonghyeon Ye, JaeHyeok Doo, Sungdong Kim, Minjoon Seo, Hyungmok Son, Kimin Lee

Pith reviewed 2026-07-02 11:19 UTC · model grok-4.3

classification 💻 cs.RO

keywords: robot policy evaluation · video world models · neural simulators · autoregressive models · vision-language models · Step Forcing · policy evaluation pipeline

The pith

RoboWorld pairs a fast autoregressive video world model with Step Forcing and task-progress scoring to evaluate robot policies at 0.989 correlation with real-world results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RoboWorld as an automated pipeline that replaces physical robot testing with video world models for evaluating generalist policies. It adds Step Forcing to produce reliable long-horizon rollouts and pairs the model with a vision-language scorer that tracks task progress. The central goal is to deliver fast, scalable evaluations that still match real-world outcomes across tasks and settings. A sympathetic reader cares because this removes hardware bottlenecks and engineering overhead from policy development. If the alignment holds, large numbers of policies can be tested and compared without physical deployment.

Core claim

By pairing a fast autoregressive video world model with Step Forcing for reliable rollouts and task-progress-aware vision-language model scoring, RoboWorld achieves strong alignment with real-world robot evaluation, with Pearson's r = 0.989 and Spearman's ρ = 0.970 across tasks and environments.
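As a rough illustration of how the two headline statistics are computed, the sketch below derives Pearson's r and Spearman's ρ for paired per-policy scores in plain Python. The score lists are made-up placeholders, not values from the paper, and the function names are assumptions chosen for this example; the paper's own evaluation code is not shown in the text here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's r: covariance normalised by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rho is Pearson's r computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical per-policy scores, NOT taken from the paper.
simulated = [0.12, 0.35, 0.41, 0.58, 0.77]
real_world = [0.10, 0.30, 0.45, 0.55, 0.80]

print(f"Pearson r  = {pearson(simulated, real_world):.3f}")
print(f"Spearman rho = {spearman(simulated, real_world):.3f}")
```

Pearson's r measures linear agreement between the scores themselves, while Spearman's ρ only checks whether the two rankings agree, which is why the two reported values can differ.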

What carries the argument

Step Forcing, which combines anchored and one-step self-forwarded contexts to reduce train-test mismatch in autoregressive world-model rollouts while preserving action-observation dynamics.
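To make the idea of mixing anchored and one-step self-forwarded contexts concrete, here is a toy sketch in Python. It is an interpretation under strong assumptions, not the paper's method: frames are plain floats, `toy_model` is a deliberately imperfect stand-in for a world model, and the `anchor_every` parameter and all function names are hypothetical. How the paper actually selects anchor steps or aligns noise schedules is not specified in the text here.

```python
def one_step_forward(model, prev_frame, action):
    """Single model call from a ground-truth frame (no chained rollout)."""
    return model(prev_frame, action)


def build_context(frames, actions, model, anchor_every=4):
    """Build a training context for one clip.

    Assumption for illustration: positions at multiples of `anchor_every`
    keep the ground-truth frame ("anchor"); every other position is replaced
    by a one-step prediction made from the previous ground-truth frame
    ("self-forwarded"), so the context contains model errors of the kind
    seen at inference without compounding them over many steps.
    """
    context, kinds = [], []
    for t, frame in enumerate(frames):
        if t % anchor_every == 0:
            context.append(frame)
            kinds.append("anchor")
        else:
            pred = one_step_forward(model, frames[t - 1], actions[t - 1])
            context.append(pred)
            kinds.append("self-forwarded")
    return context, kinds


def toy_model(prev_frame, action):
    """Hypothetical imperfect dynamics: slightly underestimates motion."""
    return prev_frame + 0.9 * action


# Toy trajectory: the true dynamics are frame[t + 1] = frame[t] + action[t].
frames = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
actions = [1.0] * 5

context, kinds = build_context(frames, actions, toy_model, anchor_every=4)
for kind, value in zip(kinds, context):
    print(f"{kind:>15}: {value:.2f}")
```

Because each self-forwarded entry starts from a ground-truth frame and the action actually taken, the context stays tied to the real action-observation pairs, unlike a long free-running rollout where errors accumulate.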

If this is right

Robot policies can be evaluated at large scale without physical hardware or deployment constraints.
Evaluation speed increases because the autoregressive model supports faster inference than real robots.
The same pipeline supports consistent measurement across diverse tasks and environments.
High correlation lets the system act as a direct proxy for real-world success rates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same combination of autoregressive rollout control and progress scoring might transfer to policy evaluation in other embodied domains such as navigation or manipulation in simulation.
If the scoring component generalizes without per-task tuning, it could reduce reliance on human-defined success criteria in automated testing loops.
Extending Step Forcing to multi-step or multi-agent rollouts could expose new limits on how far the current mismatch reduction works.

Load-bearing premise

The task-progress-aware vision-language model scoring accurately reflects true task success and does not require task-specific calibration or human oversight.

What would settle it

Running the pipeline on a fresh set of robot tasks and environments outside the original test distribution and finding Pearson correlation below 0.9 would show the alignment does not hold generally.

Figures

Figures reproduced from arXiv: 2607.01060 by Byeongguk Jeon, Hyungmok Son, JaeHyeok Doo, Kimin Lee, Minjoon Seo, Seonghyeon Ye, Sungdong Kim.

**Figure 1.** Overview of RoboWorld. RoboWorld evaluates robot policies via closed-loop rollouts in a video world model scored by a task-progress-aware VLM judge, yielding rankings that strongly correlate with real-world evaluations.

**Figure 2.** Upper left: Step Forcing shares the noise schedule between training and inference. (a) Diffusion Forcing conditions on noisy ground truth (red). (b) Self Forcing conditions on self-generated context (green) via repeated forward rollouts. (c) Step Forcing conditions on the one-step self-forwarded prior (green) or the anchor step (red).

**Figure 3.** Step Forcing shows strong action controllability while maintaining visual quality throughout the whole horizon. We use BAIR Robot Pushing [58] as a small-scale diagnostic setup for comparing different training objectives in action-conditioned autoregressive world modeling. We examine whether each objective preserves action-observation dynamics while maintaining stable long-horizon rollouts. Specifically…

**Figure 4.** Long-horizon action-conditioned video generation on RoboArena. Quality (SSIM, LPIPS, …

**Figure 5.** (a) Pearson and Spearman correlations between R…

**Figure 7.** Qualitative examples of RoboWorld in synthetic extreme environments. We transform real-world robot images into extreme environment scenes (e.g., spacecraft interiors, disaster sites), and conduct closed-loop policy evaluation.

[Bar chart: FVD ↓ (Wrist View) for Step Forcing, w/o self-forward, w/o anchor step, and w/o time schedule aligning; values in order of appearance: 231.00, 258.50 (+27.50), 294.00 (+63.00), 327.00 (+96.00).]

**Figure 6.** Component ablation of Step Forcing on DROID. Effect of the VLM Evaluation Rubric. We ablate the task-progress rubric (§4.2) by replacing RoboWorld scores with binary scores. We measure Spearman correlation between scores and the RoboArena leaderboard ranking across eight policies. The task-progress rubric achieves ρ = 0.970, while binary success rate reduces this to ρ = 0.922. The rubric's sensitivity to…

**Figure 8.** Design principles of the evaluation rubric. We design the rubric such that scores primarily reflect task progress. In addition, we assign different penalties depending on when world-model errors occur, so failures that arise earlier in the rollout receive lower scores than those that happen after substantial task progress. (1) Fine-grained evaluation rubric. We design a six-level rubric (…

**Figure 9.** Pearson correlation between RoboArena score and…

**Figure 10.** Pearson correlation between RoboArena score and…

**Figure 11.** World-model rollouts for each level of the task-progress-aware rubric. From top to…

**Figure 12.** Qualitative examples of world-model failures correctly detected by the VLM evaluator.

**Figure 13.** Qualitative examples of VLM evaluation failures. GPT-4o assigns success or near…
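The Figure 6 and Figure 8 captions contrast a six-level task-progress rubric with binary success and say that world-model errors occurring earlier in a rollout should score lower. The Python sketch below is one possible way to encode that idea, offered only as an illustration: the level numbering, the rule that progress after a world-model error is not credited, and all names (`Rollout`, `rubric_score`, `binary_score`) are assumptions, since the actual rubric definition is truncated in the captions above.

```python
from dataclasses import dataclass
from typing import Optional

NUM_LEVELS = 6  # assumed levels: 0 (no progress) .. 5 (task complete)


@dataclass
class Rollout:
    progress_level: int                # highest rubric level judged, 0..5
    error_level: Optional[int] = None  # progress level when a world-model
                                       # error appeared, if any


def rubric_score(rollout):
    """Graded score in [0, 1]; progress after a world-model error is ignored."""
    level = rollout.progress_level
    if rollout.error_level is not None:
        level = min(level, rollout.error_level)
    return level / (NUM_LEVELS - 1)


def binary_score(rollout):
    """Binary baseline: 1 only for a completed, error-free rollout."""
    done = rollout.progress_level == NUM_LEVELS - 1
    return 1.0 if done and rollout.error_level is None else 0.0


# Hypothetical rollouts: same final progress, errors at different stages.
early_error = Rollout(progress_level=4, error_level=1)
late_error = Rollout(progress_level=4, error_level=3)
clean_partial = Rollout(progress_level=4)

for name, r in [("early error", early_error),
                ("late error", late_error),
                ("clean partial", clean_partial)]:
    print(f"{name:>13}: rubric={rubric_score(r):.1f} binary={binary_score(r):.0f}")
```

Under these assumptions the binary score cannot tell the three rollouts apart, while the graded score orders them, which is the kind of sensitivity the ablation in the Figure 6 caption credits for the higher rank correlation.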

Original abstract

Video world models are emerging as a scalable alternative for evaluating generalist robot policies, bypassing the physical constraints and engineering burdens of real-world deployment. However, evaluating policies with video world models remains challenging, as world-model errors can make generated rollouts unreliable and slow inference limits large-scale throughput. We introduce RoboWorld, an automated evaluation pipeline that pairs a fast autoregressive video world model with a task-progress-aware vision-language model scoring. To enable reliable long-horizon autoregressive world-model rollouts, we propose Step Forcing, which combines anchored and one-step self-forwarded contexts to reduce train-test mismatch while preserving action-observation dynamics. Together, these components enable RoboWorld to align strongly with real-world robot evaluation across tasks and environments, achieving Pearson's r = 0.989 and Spearman's ρ = 0.970.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RoboWorld reports very high real-world alignment from a video world model trained with Step Forcing plus VLM scoring, but the text offers no validation of the VLM's accuracy as a task-success proxy.

The main point is that this paper builds an automated pipeline called RoboWorld that runs fast autoregressive video world models for robot policy rollouts and uses a task-progress VLM to score them, with a new Step Forcing trick to stabilize long sequences. The headline numbers are Pearson's r of 0.989 and Spearman's ρ of 0.970 against real-world results.

The work does a reasonable job identifying the train-test mismatch problem in autoregressive world models and offering Step Forcing as a fix that mixes anchored contexts with one-step self-forwarding. That seems like a practical engineering step for preserving action-observation dynamics without extra parameters. The overall pipeline targets a genuine bottleneck in scaling policy evaluation for generalist robots.

The soft spot is the VLM scorer. The abstract supplies no quantitative checks against human judgments, no inter-rater numbers, no calibration curves, and no cross-task tests. The stress-test concern lands: if the VLM mis-scores certain failure modes or needs task-specific prompts, the reported correlations will not carry over to new policies or environments. The abstract also omits basic experimental details such as task count, environment diversity, and error bars, so the strength of the alignment claim cannot be judged from the given text.
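One way to attach the missing error bars to a correlation computed over a handful of policies is a percentile bootstrap over policies. The Python sketch below is a generic illustration, not something the paper reports: the eight score pairs are invented placeholders, and `bootstrap_ci` and its parameters are hypothetical names for this example.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson's r, or None if either sample has fewer than two distinct values."""
    if len(set(xs)) < 2 or len(set(ys)) < 2:
        return None
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def bootstrap_ci(xs, ys, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Pearson's r, resampling policies."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson([xs[i] for i in idx], [ys[i] for i in idx])
        if r is not None:  # skip degenerate resamples
            stats.append(r)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical scores for eight policies, NOT taken from the paper.
simulated = [0.15, 0.22, 0.31, 0.40, 0.48, 0.59, 0.66, 0.81]
real_world = [0.10, 0.25, 0.28, 0.45, 0.44, 0.62, 0.70, 0.78]

point = pearson(simulated, real_world)
low, high = bootstrap_ci(simulated, real_world)
print(f"r = {point:.3f}, 95% bootstrap interval = [{low:.3f}, {high:.3f}]")
```

With so few policies the interval can be wide even when the point estimate is high, which is why the letter asks for task counts and error bars before judging the strength of the alignment claim.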

This is aimed at robotics groups that want higher-throughput evaluation than physical deployments allow. Readers working on world models or sim-to-real transfer would find the Step Forcing description and the pipeline layout useful to discuss.

The paper shows clear engagement with the practical constraints of video-based simulation. It deserves peer review because the problem is real and the proposed components are concrete, even though the VLM validation gap needs addressing in revision.

Referee Report

1 major / 0 minor

Summary. The paper introduces RoboWorld, an automated pipeline for evaluating generalist robot policies that pairs a fast autoregressive video world model (with a proposed Step Forcing technique for reliable long-horizon rollouts) and a task-progress-aware vision-language model (VLM) scorer. It claims that these components enable strong alignment with real-world robot evaluations across tasks and environments, with reported Pearson's r = 0.989 and Spearman's ρ = 0.970.

Significance. If the central alignment claim holds under proper validation, the work would provide a scalable, hardware-free method for policy evaluation that could substantially reduce the cost and engineering overhead of testing generalist robot policies in simulation.

major comments (1)

[Abstract] The headline correlations (Pearson's r = 0.989, Spearman's ρ = 0.970) are computed between RoboWorld outputs and real-world rollouts, but these outputs depend on the task-progress-aware VLM scorer. No quantitative validation of the VLM (e.g., inter-rater agreement with human ground truth, calibration curves, cross-task generalization, or error analysis on failure modes) is supplied, which is load-bearing for the claim that RoboWorld 'aligns strongly' at scale without per-task calibration or oversight.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed review and for highlighting this important aspect of our evaluation pipeline. We address the concern regarding validation of the task-progress-aware VLM scorer below.

Referee: [Abstract] The headline correlations (Pearson's r = 0.989, Spearman's ρ = 0.970) are computed between RoboWorld outputs and real-world rollouts, but these outputs depend on the task-progress-aware VLM scorer. No quantitative validation of the VLM (e.g., inter-rater agreement with human ground truth, calibration curves, cross-task generalization, or error analysis on failure modes) is supplied, which is load-bearing for the claim that RoboWorld 'aligns strongly' at scale without per-task calibration or oversight.

Authors: We agree that the VLM scorer is a critical component and that the manuscript would benefit from explicit quantitative validation of its alignment with human judgments, separate from the end-to-end pipeline correlations. The reported Pearson's r and Spearman's ρ validate the full RoboWorld system (world model + VLM), but do not isolate VLM performance. In the revised version we will add an appendix containing: (1) inter-rater agreement (Cohen's κ and percentage agreement) between the VLM and multiple human raters on a held-out set of 200 rollouts across tasks; (2) calibration curves showing VLM score vs. human-assigned progress; (3) cross-task generalization results; and (4) error analysis on failure modes where the VLM diverges from humans. These additions will be supported by the existing human-labeled data collected during real-world experiments.

Revision: yes
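For readers unfamiliar with the agreement statistics named in the response, the sketch below computes percentage agreement and Cohen's κ between two raters in plain Python. The label lists are invented rubric levels for illustration only, and the function names are assumptions; Cohen's κ is defined for a pair of raters, so a comparison against several human raters would need pairwise values or a multi-rater statistic.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which the two raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical rubric levels (0..5) for ten rollouts, NOT from the paper.
vlm_labels = [5, 4, 4, 2, 0, 5, 3, 1, 5, 2]
human_labels = [5, 4, 3, 2, 0, 5, 3, 1, 4, 2]

print(f"agreement = {percent_agreement(vlm_labels, human_labels):.2f}")
print(f"kappa     = {cohens_kappa(vlm_labels, human_labels):.2f}")
```

Percentage agreement alone can look high when one label dominates; κ discounts the agreement expected by chance, which is why the response lists both.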

Circularity Check

0 steps flagged

No circularity: correlations are direct empirical measurements against held-out real-world rollouts

The paper's central claim is an empirical correlation (Pearson's r = 0.989, Spearman's ρ = 0.970) between RoboWorld-generated scores and independent real-world policy evaluations. The abstract and provided text give no equations, fitted parameters, or self-citations that define the reported alignment by construction. Step Forcing is presented as a training technique to reduce mismatch, not as a redefinition of the evaluation metric. The VLM scoring component is described as an automated proxy, but its outputs are compared externally rather than being tautological with the real-world ground truth. No load-bearing step reduces to a self-citation chain or renames a fitted input as a prediction. This is the common case of a self-contained empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, background axioms, or new entities; ledger left empty.

Reference graph

Works this paper leans on

85 extracted references · 56 canonical work pages · 34 internal anchors

[1] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 2014.
[2] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
[3] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
[4] W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, B. Zhu, H. Zhang, M. Jordan, J. E. Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. In International Conference on Machine Learning, 2024.
[5] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
[6] B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, 2023.
[7] A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration. In IEEE International Conference on Robotics and Automation, 2024.
[8] O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024.
[9] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.
[10] J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025.
[11] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.
[12] Physical Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al. π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025.
[13] T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, 2020.
[14] S. James, Z. Ma, D. R. Arrojo, and A. J. Davison. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020.
[15] S. Nasiriany, A. Maddukuri, L. Zhang, A. Parikh, A. Lo, A. Joshi, A. Mandlekar, and Y. Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523, 2024.
[16] Y. R. Wang, C. Ung, C. Tan, G. Tannert, J. Duan, J. Li, A. Le, R. Oswal, M. Grotz, W. Pumacay, et al. Roboeval: Where robotic manipulation meets structured and scalable evaluation. arXiv preprint arXiv:2507.00435, 2025.
[17] S. Zhang, Z. Xu, P. Liu, X. Yu, Y. Li, Q. Gao, Z. Fei, Z. Yin, Z. Wu, Y.-G. Jiang, et al. Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks. In IEEE/CVF International Conference on Computer Vision, 2025.
[18] Y. Kim, W. Pumacay, O. Rayyan, M. Argus, W. Han, E. VanderBilt, J. Salvador, A. Deshpande, R. Hendrix, S. Jauhri, et al. Molmospaces: A large-scale open ecosystem for robot navigation and manipulation. arXiv preprint arXiv:2602.11337, 2026.
[19] W. Zhao, J. P. Queralta, and T. Westerlund. Sim-to-real transfer in deep reinforcement learning for robotics: a survey. In IEEE symposium series on computational intelligence (SSCI), 2020.
[20] D. Blanco-Mulero, O. Barbany, G. Alcan, A. Colomé, C. Torras, and V. Kyrki. Benchmarking the sim-to-real gap in cloth manipulation. IEEE Robotics and Automation Letters, 9(3):2981–2988, 2024.
[21] Z. Zhou, P. Atreya, Y. L. Tan, K. Pertsch, and S. Levine. Autoeval: Autonomous evaluation of generalist robot manipulation policies in the real world. arXiv preprint arXiv:2503.24278, 2025.
[22] A. Jain, M. Zhang, K. Arora, W. Chen, M. Torne, M. Z. Irshad, S. Zakharov, Y. Wang, S. Levine, C. Finn, et al. Polaris: Scalable real-to-sim evaluations for generalist robot policies. arXiv preprint arXiv:2512.16881, 2025.
[23] Y. Jangir, Y. Zhang, K. Yamazaki, C. Zhang, K.-H. Tu, T.-W. Ke, L. Ke, Y. Bisk, and K. Fragkiadaki. RobotArena∞: Scalable robot benchmarking via real-to-sim translation. arXiv preprint arXiv:2510.23571, 2025.
[24] A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
[25] T. Brooks, B. Peebles, C. Holmes, W. DePue, Y. Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman, E. Luhman, C. Ng, R. Wang, and A. Ramesh. Video generation models as world simulators. 2024.
[26] W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
[27] H. Chen, Y. Zhang, X. Cun, M. Xia, X. Wang, C. Weng, and Y. Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[28] Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y. Yang, W. Hong, X. Zhang, G. Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.
[29] A. Ali, J. Bai, M. Bala, Y. Balaji, A. Blakeman, T. Cai, J. Cao, T. Cao, E. Cha, Y.-W. Chao, et al. World simulation with video foundation models for physical ai. arXiv preprint arXiv:2511.00062, 2025.
[30] N. Agarwal, A. Ali, M. Bala, Y. Balaji, E. Barker, T. Cai, P. Chattopadhyay, Y. Chen, Y. Cui, Y. Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025.
[31] T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
[32] Y. Guo, L. X. Shi, J. Chen, and C. Finn. Ctrl-world: A controllable generative world model for robot manipulation. arXiv preprint arXiv:2510.10125, 2025.
[33] G. R. Team, C. Devin, Y. Du, D. Dwibedi, R. Gao, A. Jindal, T. Kipf, S. Kirmani, F. Liu, A. Majumdar, et al. Evaluating gemini robotics policies in a veo world simulator. arXiv preprint arXiv:2512.10675, 2025.
[34] S. Gao, W. Liang, K. Zheng, A. Malik, S. Ye, S. Yu, W.-C. Tseng, Y. Dong, K. Mo, C.-H. Lin, et al. Dreamdojo: A generalist robot world model from large-scale human videos. arXiv preprint arXiv:2602.06949, 2026.
[35] X. W. M. Team. 1x world model: Evaluating bits, not atoms. Technical report, 1X Technologies, 2025. Accessed: 2026-02-23.
[36] G. Team, B. Wang, C. Ni, G. Huang, G. Zhao, H. Li, J. Li, J. Lv, J. Liu, L. Feng, et al. Gigabrain-0.5 m*: a vla that learns from world model-based reinforcement learning. arXiv preprint arXiv:2602.12099, 2026.
[37] J. Yang, K. Lin, J. Li, W. Zhang, T. Lin, L. Wu, Z. Su, H. Zhao, Y.-Q. Zhang, L. Chen, et al. Rise: Self-improving robot policy with compositional world model. arXiv preprint arXiv:2602.11075, 2026.
[38] A. K. Sharma, Y. Sun, N. Lu, Y. Zhang, J. Liu, and S. Yang. World-gymnast: Training robots with reinforcement learning in a world model. arXiv preprint arXiv:2602.02454, 2026.
[39] J. Quevedo, A. K. Sharma, Y. Sun, V. Suryavanshi, P. Liang, and S. Yang. Worldgym: World model as an environment for policy evaluation. arXiv preprint arXiv:2506.00613, 2025.
[40] Y. Li, Y. Zhu, J. Wen, C. Shen, and Y. Xu. Worldeval: World model as real-world robot policies evaluator. arXiv preprint arXiv:2505.19017, 2025.
[41] A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024.
[42] P. Atreya, K. Pertsch, T. Lee, M. J. Kim, A. Jain, A. Kuramshin, C. Eppner, C. Neary, E. Hu, F. Ramos, et al. Roboarena: Distributed real-world evaluation of generalist robot policies. arXiv preprint arXiv:2506.18123, 2025.
[43] X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kirmani, et al. Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941, 2024.
[44] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2017.
[45] K. Zhang, S. Sha, H. Jiang, M. Loper, H. Song, G. Cai, Z. Xu, X. Hu, C. Zheng, and Y. Li. Real-to-sim robot policy evaluation with gaussian splatting simulation of soft-body interactions. arXiv preprint arXiv:2511.04665, 2025.
[46] J. Abou-Chakra, L. Sun, K. Rana, B. May, K. Schmeckpeper, N. Suenderhauf, M. V. Minniti, and L. Herlant. Real-is-sim: Bridging the sim-to-real gap with a dynamic digital twin. arXiv preprint arXiv:2504.03597, 2025.
[47] M. J. Kim, Y. Gao, T.-Y. Lin, Y.-C. Lin, Y. Ge, G. Lam, P. Liang, S. Song, M.-Y. Liu, C. Finn, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163, 2026.
[48] S. Ye, Y. Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y. L. Tan, C. Zhu, J. Xiang, et al. World action models are zero-shot policies. arXiv preprint arXiv:2602.15922, 2026.
[49] J. Jang, S. Ye, Z. Lin, J. Xiang, J. Bjorck, Y. Fang, F. Hu, S. Huang, K. Kundalia, Y.-C. Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models. arXiv preprint arXiv:2505.12705, 2025.
[50] Y. Liao, P. Zhou, S. Huang, D. Yang, S. Chen, Y. Jiang, Y. Hu, J. Cai, S. Liu, J. Luo, et al. Genie envisioner: A unified world foundation platform for robotic manipulation. arXiv preprint arXiv:2508.05635, 2025.
[51] L. Li, Q. Zhang, Y. Luo, S. Yang, R. Wang, F. Han, M. Yu, Z. Gao, N. Xue, X. Zhu, et al. Causal world modeling for robot control. arXiv preprint arXiv:2601.21998, 2026.
[52] S. Yang, Y. Du, K. Ghasemipour, J. Tompson, L. Kaelbling, D. Schuurmans, and P. Abbeel. Learning interactive real-world simulators. arXiv preprint arXiv:2310.06114, 2023.
[53] J. Wu, S. Yin, N. Feng, X. He, D. Li, J. Hao, and M. Long. ivideogpt: Interactive videogpts are scalable world models. Advances in Neural Information Processing Systems, 37:68082–68119, 2024.
[54] F. Zhu, H. Wu, S. Guo, Y. Liu, C. Cheang, and T. Kong. Irasim: Learning interactive real-robot action simulators. arXiv preprint arXiv:2406.14540, 2024.
[55] L. Wang, K. Zhao, C. Liu, and X. Chen. Learning real-world action-video dynamics with heterogeneous masked autoregression. arXiv preprint arXiv:2502.04296, 2025.
[56] W.-C. Tseng, J. Gu, Q. Zhang, H. Mao, M.-Y. Liu, F. Shkurti, and L. Yen-Chen. Scalable policy evaluation with video world models. arXiv preprint arXiv:2511.11520, 2025.
[57] Y. Jiang, S. Chen, S. Huang, L. Chen, P. Zhou, Y. Liao, X. He, C. Liu, H. Li, M. Yao, et al. Enerverse-ac: Envisioning embodied environments with action condition. arXiv preprint arXiv:2505.09723, 2025.
[58] F. Ebert, C. Finn, A. X. Lee, and S. Levine. Self-supervised visual planning with temporal skip connections. CoRL, 12(16):23, 2017.
[59] J. Bruce, M. D. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, et al. Genie: Generative interactive environments. In International Conference on Machine Learning, 2024.
[60] X. Mao, S. Lin, Z. Li, C. Li, W. Peng, T. He, J. Pang, M. Chi, Y. Qiao, and K. Zhang. Yume: An interactive world generation model. arXiv preprint arXiv:2507.17744, 2025.
[61]

Zhang, C

Y . Zhang, C. Peng, B. Wang, P. Wang, Q. Zhu, F. Kang, B. Jiang, Z. Gao, E. Li, Y . Liu, et al. Matrix-game: Interactive world foundation model.arXiv preprint arXiv:2506.18701, 2025

work page arXiv 2025
[62]

X. He, C. Peng, Z. Liu, B. Wang, Y . Zhang, Q. Cui, F. Kang, B. Jiang, M. An, Y . Ren, et al. Matrix-game 2.0: An open-source real-time and streaming interactive world model.arXiv preprint arXiv:2508.13009, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[63]

Diffusion Models Are Real-Time Game Engines

D. Valevski, Y . Leviathan, M. Arar, and S. Fruchter. Diffusion models are real-time game engines.arXiv preprint arXiv:2408.14837, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[64]

T. Yin, Q. Zhang, R. Zhang, W. T. Freeman, F. Durand, E. Shechtman, and X. Huang. From slow bidirectional to fast autoregressive video diffusion models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 12

2025
[65]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

X. Huang, Z. Li, G. He, M. Zhou, and E. Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion.arXiv preprint arXiv:2506.08009, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[66]

Y . Guo, C. Yang, H. He, Y . Zhao, M. Wei, Z. Yang, W. Huang, and D. Lin. End-to-end training for autoregressive video diffusion via self-resampling.arXiv preprint arXiv:2512.15702, 2025

[67]

Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

[68]

X. Liu, C. Gong, and Q. Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

[69]

B. Chen, D. Martí Monsó, Y. Du, M. Simchowitz, R. Tedrake, and V. Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems, 37:24081–24125, 2024

[70]

D. Zhou, Q. Sun, Y. Peng, K. Yan, R. Dong, D. Wang, Z. Ge, N. Duan, and X. Zhang. Taming teacher forcing for masked autoregressive video generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

[71]

Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004

[72]

R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018

[73]

K. Gao, J. Shi, H. Zhang, C. Wang, J. Xiao, and L. Chen. Ca2-vdm: Efficient autoregressive video diffusion model with causal generation and cache sharing. arXiv preprint arXiv:2411.16375, 2024

[74]

J. Cui, J. Wu, M. Li, T. Yang, X. Li, R. Wang, A. Bai, Y. Ban, and C.-J. Hsieh. Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283, 2025

[75]

S. Huang, J. Wu, Q. Zhou, S. Miao, and M. Long. Vid2world: Crafting video diffusion models to interactive world models. arXiv preprint arXiv:2505.14357, 2025

[76]

J. Bardhan, P. Drozdik, J. Sivic, and V. Petrik. Persistent robot world models: Stabilizing multi-step rollouts via reinforcement learning. arXiv preprint arXiv:2603.25685, 2026

[77]

T. Unterthiner, S. Van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

[78]

W. Peebles and S. Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

[79]

J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, page 127063, 2024

Appendix A Implementation Details: Action-conditioned BAIR Dataset. We use the BAIR Robot Pushing dataset [58]: 43,264 training clips and 256 test clips of 30 frames each, 64×64 RGB pixels in [−1, 1], accompanied by...

[80]

Use the fixed views (the two upper views) as the primary reference. Determine whether the robot interacts with the target object based on the fixed views
