Mask World Model: Predicting What Matters for Robust Robot Policy Learning
Pith reviewed 2026-05-10 02:10 UTC · model grok-4.3
The pith
Predicting semantic mask evolution instead of RGB frames creates a bottleneck that yields more robust robot policies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Mask World Model uses video diffusion to predict the temporal evolution of semantic masks rather than RGB pixels, thereby imposing a geometric information bottleneck that retains essential physical dynamics and contact relations while discarding visual noise, and integrates this backbone with a diffusion policy head to produce control actions.
What carries the argument
The mask dynamics backbone, which predicts semantic mask evolution to filter visual noise and retain physical essentials for policy learning.
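The backbone-plus-policy-head split described above can be sketched as a minimal interface. The class and method names here are hypothetical stubs meant only to show the data flow (mask history in, predicted masks out, actions conditioned on the prediction), not the paper's implementation:

```python
import numpy as np

class MaskDynamicsBackbone:
    """Hypothetical stand-in for the mask dynamics backbone."""

    def predict(self, mask_history: np.ndarray) -> np.ndarray:
        """Return predicted future masks (T_future, H, W).

        Stub dynamics: repeat the last observed mask four steps forward.
        """
        return np.repeat(mask_history[-1:], 4, axis=0)

class DiffusionPolicyHead:
    """Hypothetical stand-in for the diffusion policy head."""

    def act(self, predicted_masks: np.ndarray) -> np.ndarray:
        """Return an action chunk conditioned on the mask prediction.

        Stub policy: one zero action per predicted frame (e.g. 7-DoF arm).
        """
        return np.zeros((predicted_masks.shape[0], 7))

backbone, head = MaskDynamicsBackbone(), DiffusionPolicyHead()
history = np.zeros((3, 8, 8), dtype=int)      # three past semantic masks
actions = head.act(backbone.predict(history))  # shape (4, 7)
```

The point of the interface is that the policy head never sees RGB: everything it conditions on has already passed through the mask bottleneck.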
If this is right
- Policies trained on mask predictions outperform RGB-based world models on both LIBERO and RLBench benchmarks.
- The approach maintains higher success rates under real-world texture changes and random token pruning.
- End-to-end diffusion policy integration removes the need for separate perception and planning modules.
- Generalization improves because the model cannot rely on transient visual distractors.
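The random token pruning probe mentioned above can be sketched as follows. `prune_tokens`, the token count, and the latent dimension are illustrative assumptions, not the paper's evaluation protocol:

```python
import numpy as np

def prune_tokens(tokens: np.ndarray, drop_frac: float, seed: int = 0) -> np.ndarray:
    """Randomly zero a fraction of latent tokens to probe robustness.

    tokens: (num_tokens, dim) flattened latent grid; drop_frac in [0, 1].
    A robust policy should degrade gracefully as drop_frac grows.
    """
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    drop = rng.choice(n, size=int(drop_frac * n), replace=False)
    pruned = tokens.copy()
    pruned[drop] = 0.0  # pruned tokens carry no information downstream
    return pruned

tokens = np.ones((64, 128))
pruned = prune_tokens(tokens, 0.25)
# exactly 25% of the 64 token rows are now all-zero
```

Sweeping `drop_frac` and plotting task success rate against it is the natural way to turn this probe into the robustness curve the claim implies.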
Where Pith is reading between the lines
- The same bottleneck principle could be applied to other modalities such as depth or tactile signals to create comparable filtering effects.
- If masks reliably encode contact geometry, the method may reduce the sim-to-real gap for contact-rich tasks.
- Combining mask prediction with language conditioning could allow policies to reason at a more abstract level while still grounding actions in physical structure.
Load-bearing premise
Semantic masks alone contain every piece of information required for successful control without discarding details critical to object interactions or contact events.
What would settle it
A manipulation task whose success demonstrably requires fine surface texture cues that semantic masks omit, where the mask-based policy fails while an otherwise identical RGB-based policy succeeds.
Original abstract
World models derived from large-scale video generative pre-training have emerged as a promising paradigm for generalist robot policy learning. However, standard approaches often focus on high-fidelity RGB video prediction, this can result in overfitting to irrelevant factors, such as dynamic backgrounds and illumination changes. These distractions reduce the model's ability to generalize, ultimately leading to unreliable and fragile control policies. To address this, we introduce the Mask World Model (MWM), which leverages video diffusion architectures to predict the evolution of semantic masks instead of pixels. This shift imposes a geometric information bottleneck, forcing the model to capture essential physical dynamics and contact relations while filtering out visual noise. We seamlessly integrate this mask dynamics backbone with a diffusion-based policy head to enable robust end-to-end control. Extensive evaluations demonstrate the superiority of MWM on the LIBERO and RLBench simulation benchmarks, significantly outperforming the state-of-the-art RGB-based world models. Furthermore, real-world experiments and robustness evaluation (via random token pruning) reveal that MWM exhibits superior generalization capabilities and robust resilience to texture information loss.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Mask World Model (MWM), which replaces RGB video prediction in world models with semantic mask evolution using video diffusion architectures. This is claimed to impose a geometric information bottleneck that captures essential physical dynamics and contact relations while filtering visual noise. MWM is integrated with a diffusion policy head for end-to-end robot control. The authors assert that MWM significantly outperforms state-of-the-art RGB-based world models on LIBERO and RLBench, with superior generalization and robustness to texture loss demonstrated in real-world experiments and random token pruning tests.
Significance. If the superiority and robustness claims hold with supporting quantitative evidence, the work could advance robust robot policy learning by demonstrating that semantic mask prediction provides a useful inductive bias against visual distractors. The approach builds on existing video diffusion and diffusion policy techniques but reframes the prediction target; its significance hinges on showing that the bottleneck does not discard task-critical information.
Major comments (3)
- [Abstract] Abstract: The central claim that MWM 'significantly outperforming the state-of-the-art RGB-based world models' on LIBERO and RLBench is asserted without any quantitative metrics (e.g., success rates, deltas, or baseline names), error bars, or statistical tests. This absence makes it impossible to evaluate the magnitude or reliability of the reported gains.
- [Method] Method section (description of mask dynamics backbone): The assertion that semantic mask prediction 'imposes a geometric information bottleneck' that captures 'essential physical dynamics and contact relations' while filtering noise lacks supporting analysis or experiments addressing potential loss of non-geometric cues (e.g., material properties, friction, or subtle deformation) that may be required for certain control policies.
- [Experiments] Experiments section: No ablation studies on mask generation accuracy, no error analysis, and no details on how the diffusion policy head conditions on mask latents are provided. These omissions are load-bearing because the robustness claims (resilience to random token pruning and texture loss) cannot be assessed without them.
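The second comment's concern can be made concrete with a toy example: rasterizing appearance into an object-ID map is many-to-one, so two frames that differ only in texture collapse to the identical mask, and any cue encoded purely in appearance (material, friction proxies, fine deformation) is unrecoverable downstream. This is a minimal illustration with hypothetical names, not the paper's segmentation pipeline:

```python
import numpy as np

def to_semantic_mask(rgb: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Replace pixel appearance with integer object IDs.

    Geometry (object extent, boundaries, contacts) survives;
    texture, color, and illumination do not.
    """
    assert rgb.shape[:2] == labels.shape
    return labels.astype(np.int32)  # (H, W) object-ID map, single channel

h, w = 4, 4
labels = np.zeros((h, w), dtype=np.int32)
labels[1:3, 1:3] = 1  # one object occupying the center region

# Two frames with identical geometry but unrelated textures...
rgb_a = np.random.default_rng(0).random((h, w, 3))
rgb_b = np.random.default_rng(1).random((h, w, 3))
# ...map to exactly the same mask: the bottleneck is lossy by design.
same = np.array_equal(to_semantic_mask(rgb_a, labels),
                      to_semantic_mask(rgb_b, labels))
```

Whether that lossiness is a feature (filtering distractors) or a bug (discarding task-critical cues) is exactly what the requested analysis would need to settle.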
Minor comments (1)
- [Abstract] Abstract: The sentence 'However, standard approaches often focus on high-fidelity RGB video prediction, this can result in overfitting' contains a comma splice and should be rephrased for clarity.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment below and have revised the manuscript to incorporate the suggested improvements.
Point-by-point responses
- Referee: [Abstract] Abstract: The central claim that MWM 'significantly outperforming the state-of-the-art RGB-based world models' on LIBERO and RLBench is asserted without any quantitative metrics (e.g., success rates, deltas, or baseline names), error bars, or statistical tests. This absence makes it impossible to evaluate the magnitude or reliability of the reported gains.
Authors: We agree that the abstract requires quantitative support for the performance claims. The revised manuscript updates the abstract to include specific success rates on LIBERO and RLBench, the names of the RGB-based world model baselines used for comparison, performance deltas, error bars from multiple random seeds, and references to statistical tests. These additions are drawn directly from the experimental results already present in the paper body and are presented concisely. revision: yes
- Referee: [Method] Method section (description of mask dynamics backbone): The assertion that semantic mask prediction 'imposes a geometric information bottleneck' that captures 'essential physical dynamics and contact relations' while filtering noise lacks supporting analysis or experiments addressing potential loss of non-geometric cues (e.g., material properties, friction, or subtle deformation) that may be required for certain control policies.
Authors: This observation is fair; the original method description was primarily conceptual. Semantic masks focus on object geometry, boundaries, and spatial relations, which are central to the dynamics and contacts in our evaluated manipulation tasks. In the revised version, we have expanded the method section with a dedicated discussion of the information bottleneck, including why non-geometric cues such as material properties and friction are less critical for the LIBERO and RLBench benchmarks (where shape and contact suffice) and how mask sequences can still encode motion cues relevant to deformation. We also clarify the scope of the claims to the tasks studied. revision: yes
- Referee: [Experiments] Experiments section: No ablation studies on mask generation accuracy, no error analysis, and no details on how the diffusion policy head conditions on mask latents are provided. These omissions are load-bearing because the robustness claims (resilience to random token pruning and texture loss) cannot be assessed without them.
Authors: We acknowledge that these supporting details were insufficient in the original submission. The revised experiments section now includes ablation studies on mask generation accuracy (e.g., quantitative metrics such as IoU over predicted sequences), error analysis linking mask prediction quality to downstream policy performance, and explicit technical details on the conditioning of the diffusion policy head on mask latents (including the latent encoding and integration mechanism). These additions directly enable evaluation of the reported robustness to token pruning and texture loss. revision: yes
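The sequence-level IoU the authors point to can be computed in the standard way: per-class intersection over union, averaged over the classes present, across all frames of the predicted sequence. This sketch assumes integer class maps and is not taken from the paper:

```python
import numpy as np

def sequence_miou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean IoU over a predicted mask sequence vs. ground truth.

    pred, gt: (T, H, W) integer class maps; classes absent from both
    the prediction and the ground truth are skipped.
    """
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

gt = np.zeros((2, 4, 4), dtype=int)
gt[:, 1:3, 1:3] = 1            # a 2x2 object in both frames
pred = gt.copy()
pred[0, 1, 1] = 0              # one mispredicted pixel in frame 0
score = sequence_miou(pred, gt, num_classes=2)
```

Correlating this score with downstream policy success rate is the error-analysis link the referee asks for: if success degrades smoothly with mIoU, mask quality is the binding constraint.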
Circularity Check
No circularity; empirical method evaluated on external benchmarks
Full rationale
The paper introduces MWM as an architectural design choice—predicting semantic mask evolution via video diffusion instead of RGB pixels—to impose a geometric bottleneck. All performance claims rest on direct empirical comparisons against external SOTA RGB world models on LIBERO and RLBench, plus real-world trials and token-pruning robustness tests. No derivation chain reduces by construction to fitted parameters, self-citations, or self-definitions; the central results are independent measurements on standard benchmarks rather than tautological predictions.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Semantic masks capture essential physical dynamics and contact relations for robot control.
Forward citations
Cited by 1 Pith paper
- World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.