pith. machine review for the scientific record.

arxiv: 2604.19683 · v2 · submitted 2026-04-21 · 💻 cs.RO

Recognition: unknown

Mask World Model: Predicting What Matters for Robust Robot Policy Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:10 UTC · model grok-4.3

classification 💻 cs.RO
keywords Mask World Model · robot policy learning · world models · semantic masks · video diffusion · robust generalization · LIBERO · RLBench

The pith

Predicting semantic mask evolution instead of RGB frames creates a bottleneck that yields more robust robot policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard world models overfit to irrelevant pixel details such as changing backgrounds and lighting, which hurts generalization in robot control. By training a diffusion model to forecast the future of semantic masks rather than full images, the system is forced to encode only geometric and contact information. This mask-based backbone is then paired directly with a diffusion policy head for end-to-end action generation. On LIBERO and RLBench the resulting policies outperform prior RGB world models, and real-robot trials plus token-pruning tests show greater resilience when visual texture is removed or altered.

Core claim

The Mask World Model uses video diffusion to predict the temporal evolution of semantic masks rather than RGB pixels, thereby imposing a geometric information bottleneck that retains essential physical dynamics and contact relations while discarding visual noise, and integrates this backbone with a diffusion policy head to produce control actions.
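
A minimal sketch of what that two-stage pipeline could look like, assuming a simplified epsilon-prediction diffusion objective at both stages; module names, tensor shapes, and the pooling of mask-predictive features are illustrative guesses, not the authors' implementation.

```python
# Hedged sketch of the two-stage scheme described above; not the authors' code.
# Module names, shapes, and the simplified epsilon-prediction losses are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskDynamicsBackbone(nn.Module):
    """Stage 1 (assumed): denoise future semantic-mask latents, conditioned on
    encoded RGB memory frames, a language embedding, and the diffusion timestep."""
    def __init__(self, dim=128):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.time_embed = nn.Embedding(1000, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_mask_latents, rgb_tokens, lang_token, t):
        cond = torch.cat([rgb_tokens, lang_token, self.time_embed(t)[:, None]], dim=1)
        h = self.blocks(torch.cat([cond, noisy_mask_latents], dim=1))
        return self.out(h[:, cond.shape[1]:])        # predicted noise on mask latents

class DiffusionPolicyHead(nn.Module):
    """Stage 2 (assumed): denoise an action chunk, conditioned on pooled
    mask-predictive features coming out of the Stage-1 backbone."""
    def __init__(self, dim=128, action_dim=7, horizon=8):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(dim + action_dim * horizon + 1, 256), nn.GELU(),
            nn.Linear(256, action_dim * horizon))

    def forward(self, mask_features, noisy_actions, t):
        ctx = mask_features.mean(dim=1)               # pooled mask-derived context
        x = torch.cat([ctx, noisy_actions.flatten(1), t[:, None].float()], dim=1)
        return self.net(x).view(-1, self.horizon, self.action_dim)

# Toy forward/loss pass with random tensors standing in for real data.
B, n_tok, dim = 2, 16, 128
backbone, policy = MaskDynamicsBackbone(dim), DiffusionPolicyHead(dim)
rgb_tokens   = torch.randn(B, n_tok, dim)    # encoded multi-view RGB memory frames
lang_token   = torch.randn(B, 1, dim)        # language prompt embedding
mask_latents = torch.randn(B, n_tok, dim)    # target future semantic-mask latents
t = torch.randint(0, 1000, (B,))

# Stage 1: predict the noise injected into the mask latents.
noise = torch.randn_like(mask_latents)
eps_hat = backbone(mask_latents + noise, rgb_tokens, lang_token, t)
loss_dynamics = F.mse_loss(eps_hat, noise)

# Stage 2: predict the noise injected into an action chunk, conditioned on the
# backbone's output as a stand-in for its intermediate mask-predictive features.
actions = torch.randn(B, 8, 7)
act_noise = torch.randn_like(actions)
eps_act = policy(eps_hat.detach(), actions + act_noise, t)
loss_policy = F.mse_loss(eps_act, act_noise)
print(loss_dynamics.item(), loss_policy.item())
```

The load-bearing property in this sketch is that the policy head only ever sees mask-derived features, never the RGB tokens themselves; that is the bottleneck the claim depends on.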

What carries the argument

The mask dynamics backbone, which predicts semantic mask evolution to filter visual noise and retain physical essentials for policy learning.

If this is right

  • Policies trained on mask predictions outperform RGB-based world models on both LIBERO and RLBench benchmarks.
  • The approach maintains higher success rates under real-world texture changes and random token pruning.
  • End-to-end diffusion policy integration removes the need for separate perception and planning modules.
  • Generalization improves because the model cannot rely on transient visual distractors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same bottleneck principle could be applied to other modalities such as depth or tactile signals to create comparable filtering effects.
  • If masks reliably encode contact geometry, the method may reduce the sim-to-real gap for contact-rich tasks.
  • Combining mask prediction with language conditioning could allow policies to reason at a more abstract level while still grounding actions in physical structure.

Load-bearing premise

Semantic masks alone contain every piece of information required for successful control without discarding details critical to object interactions or contact events.

What would settle it

A manipulation task whose success demonstrably requires fine surface texture cues that semantic masks omit, where the mask-based policy fails while an otherwise identical RGB-based policy succeeds.

Figures

Figures reproduced from arXiv: 2604.19683 by Chengxuan Li, Chuyao Fu, Guoyu Song, Haoxuan Xu, Pengwei Wang, Rongyu Zhang, Shanghang Zhang, Xiaojie Zhang, Xiaowei Chi, Yaoxu Lyu, Yunfan Lou, Zezhong Qian.

Figure 1
Figure 1: MWM overview. MWM learns a mask-centric predictive world model from semantic supervision during training, but runs purely on raw multi-view RGB at test time. Training proceeds in two stages: we first learn to forecast future semantic masks via conditional diffusion, then train a diffusion policy that conditions on mask-centric predictive features for action generation. This semantic bottleneck prioritizes … view at source ↗
Figure 2
Figure 2: Mask World Model (MWM) architecture. Given multi-view RGB memory frames and a language prompt, MWM encodes observations with a shared video VAE, then applies Normalize & Interpolate & Stack to form a fixed-length latent token sequence. A DiT-style backbone with AdaIN timestep conditioning and text cross-attention processes these tokens for N=28 transformer blocks. In Stage 1, a mask decoder supervises futu… view at source ↗
Figure 3
Figure 3: Real-robot qualitative rollouts. We visualize representative third-person executions for the four real-world tasks: bread+hotdog→basket, pour water→bowl, book→shelf, and open drawer→put pen. Each row shows frames ordered left-to-right. For each task and each shift, we run 20 real-robot trials with randomized initializations and report success rate (SR). Appearance shifts. We consider three appearance factors t… view at source ↗
Figure 4
Figure 4: Real-world experimental environment. (Left) The hardware setup features a Franka Emika Panda robot arm equipped with a parallel gripper. Perception is provided by two synchronized Intel RealSense D435i cameras: a fixed third-person view capturing the global workspace and a wrist-mounted eye-in-hand view for fine-grained interaction. (Right) Snapshots of the four manipulation tasks used for evaluation: (1) … view at source ↗
Figure 5
Figure 5: Visual generalization stress tests. We evaluate policy robustness under three distinct distribution shifts relative to the nominal condition (Top-Left). The shifts include: (Top-Right) Object Color Shift, where task objects are swapped with unseen colors while retaining geometry; (Bottom-Left) Lighting Shift, involving significant changes in illumination intensity; and (Bottom-Right) Background (BG) Shift,… view at source ↗
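
Figure 2 names the main conditioning mechanisms of the backbone: AdaIN on the diffusion timestep, text cross-attention, and a DiT-style token stack (the caption reports N=28 blocks). A hedged sketch of one such block, with illustrative layer sizes and placements that are assumptions rather than the paper's exact architecture:

```python
# Hedged sketch of one DiT-style block with AdaIN timestep conditioning and
# text cross-attention, the components Figure 2 names. Layer sizes, the AdaIN
# placement, and the module names are illustrative assumptions.
import torch
import torch.nn as nn

class AdaINDiTBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ada = nn.Linear(dim, 2 * dim)   # timestep embedding -> per-channel scale/shift

    def forward(self, x, t_emb, text_tokens):
        scale, shift = self.ada(t_emb).chunk(2, dim=-1)
        h = self.norm1(x) * (1 + scale[:, None]) + shift[:, None]      # AdaIN conditioning
        x = x + self.attn(h, h, h, need_weights=False)[0]              # self-attention over video tokens
        q = self.norm2(x)
        x = x + self.cross(q, text_tokens, text_tokens, need_weights=False)[0]  # text cross-attention
        return x + self.mlp(self.norm3(x))

block = AdaINDiTBlock()
out = block(torch.randn(2, 64, 256), torch.randn(2, 256), torch.randn(2, 12, 256))
print(out.shape)   # torch.Size([2, 64, 256])
```
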
read the original abstract

World models derived from large-scale video generative pre-training have emerged as a promising paradigm for generalist robot policy learning. However, standard approaches often focus on high-fidelity RGB video prediction, this can result in overfitting to irrelevant factors, such as dynamic backgrounds and illumination changes. These distractions reduce the model's ability to generalize, ultimately leading to unreliable and fragile control policies. To address this, we introduce the Mask World Model (MWM), which leverages video diffusion architectures to predict the evolution of semantic masks instead of pixels. This shift imposes a geometric information bottleneck, forcing the model to capture essential physical dynamics and contact relations while filtering out visual noise. We seamlessly integrate this mask dynamics backbone with a diffusion-based policy head to enable robust end-to-end control. Extensive evaluations demonstrate the superiority of MWM on the LIBERO and RLBench simulation benchmarks, significantly outperforming the state-of-the-art RGB-based world models. Furthermore, real-world experiments and robustness evaluation (via random token pruning) reveal that MWM exhibits superior generalization capabilities and robust resilience to texture information loss.
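
The robustness probe the abstract mentions, random token pruning, can be read as: drop a random fraction of observation tokens at evaluation time and re-measure policy success at each keep ratio. A minimal sketch of that pruning step, assuming token tensors of shape (batch, tokens, dim); the ratios and shapes are placeholders:

```python
# Hedged sketch of random token pruning as a robustness probe; the keep ratios,
# shapes, and the per-sample random subset are illustrative assumptions.
import torch

def prune_tokens(tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Randomly keep `keep_ratio` of the tokens in a (batch, tokens, dim) tensor."""
    B, N, D = tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    idx = torch.rand(B, N).argsort(dim=1)[:, :n_keep]       # random subset per sample
    return torch.gather(tokens, 1, idx[..., None].expand(-1, -1, D))

tokens = torch.randn(4, 64, 128)      # stand-in for encoded observation tokens
for ratio in (1.0, 0.75, 0.5, 0.25):
    pruned = prune_tokens(tokens, ratio)
    print(ratio, tuple(pruned.shape))  # success rate would be re-measured at each ratio
```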

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces the Mask World Model (MWM), which replaces RGB video prediction in world models with semantic mask evolution using video diffusion architectures. This is claimed to impose a geometric information bottleneck that captures essential physical dynamics and contact relations while filtering visual noise. MWM is integrated with a diffusion policy head for end-to-end robot control. The authors assert that MWM significantly outperforms state-of-the-art RGB-based world models on LIBERO and RLBench, with superior generalization and robustness to texture loss demonstrated in real-world experiments and random token pruning tests.

Significance. If the superiority and robustness claims hold with supporting quantitative evidence, the work could advance robust robot policy learning by demonstrating that semantic mask prediction provides a useful inductive bias against visual distractors. The approach builds on existing video diffusion and diffusion policy techniques but reframes the prediction target; its significance hinges on showing that the bottleneck does not discard task-critical information.

major comments (3)
  1. [Abstract] Abstract: The central claim that MWM 'significantly outperforming the state-of-the-art RGB-based world models' on LIBERO and RLBench is asserted without any quantitative metrics (e.g., success rates, deltas, or baseline names), error bars, or statistical tests. This absence makes it impossible to evaluate the magnitude or reliability of the reported gains.
  2. [Method] Method section (description of mask dynamics backbone): The assertion that semantic mask prediction 'imposes a geometric information bottleneck' that captures 'essential physical dynamics and contact relations' while filtering noise lacks supporting analysis or experiments addressing potential loss of non-geometric cues (e.g., material properties, friction, or subtle deformation) that may be required for certain control policies.
  3. [Experiments] Experiments section: No ablation studies on mask generation accuracy, no error analysis, and no details on how the diffusion policy head conditions on mask latents are provided. These omissions are load-bearing because the robustness claims (resilience to random token pruning and texture loss) cannot be assessed without them.
minor comments (1)
  1. [Abstract] Abstract: The sentence 'However, standard approaches often focus on high-fidelity RGB video prediction, this can result in overfitting' contains a comma splice and should be rephrased for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and have revised the manuscript to incorporate the suggested improvements.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that MWM 'significantly outperforming the state-of-the-art RGB-based world models' on LIBERO and RLBench is asserted without any quantitative metrics (e.g., success rates, deltas, or baseline names), error bars, or statistical tests. This absence makes it impossible to evaluate the magnitude or reliability of the reported gains.

    Authors: We agree that the abstract requires quantitative support for the performance claims. The revised manuscript updates the abstract to include specific success rates on LIBERO and RLBench, the names of the RGB-based world model baselines used for comparison, performance deltas, error bars from multiple random seeds, and references to statistical tests. These additions are drawn directly from the experimental results already present in the paper body and are presented concisely. revision: yes

  2. Referee: [Method] Method section (description of mask dynamics backbone): The assertion that semantic mask prediction 'imposes a geometric information bottleneck' that captures 'essential physical dynamics and contact relations' while filtering noise lacks supporting analysis or experiments addressing potential loss of non-geometric cues (e.g., material properties, friction, or subtle deformation) that may be required for certain control policies.

    Authors: This observation is fair; the original method description was primarily conceptual. Semantic masks focus on object geometry, boundaries, and spatial relations, which are central to the dynamics and contacts in our evaluated manipulation tasks. In the revised version, we have expanded the method section with a dedicated discussion of the information bottleneck, including why non-geometric cues such as material properties and friction are less critical for the LIBERO and RLBench benchmarks (where shape and contact suffice) and how mask sequences can still encode motion cues relevant to deformation. We also clarify the scope of the claims to the tasks studied. revision: yes

  3. Referee: [Experiments] Experiments section: No ablation studies on mask generation accuracy, no error analysis, and no details on how the diffusion policy head conditions on mask latents are provided. These omissions are load-bearing because the robustness claims (resilience to random token pruning and texture loss) cannot be assessed without them.

    Authors: We acknowledge that these supporting details were insufficient in the original submission. The revised experiments section now includes ablation studies on mask generation accuracy (e.g., quantitative metrics such as IoU over predicted sequences), error analysis linking mask prediction quality to downstream policy performance, and explicit technical details on the conditioning of the diffusion policy head on mask latents (including the latent encoding and integration mechanism). These additions directly enable evaluation of the reported robustness to token pruning and texture loss. revision: yes
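
For the mask-accuracy ablation the response promises, a sequence-level IoU is the natural metric. A hedged sketch, with assumed class counts and shapes rather than the paper's exact protocol:

```python
# Hedged sketch of the kind of mask-quality metric the response refers to:
# mean IoU between predicted and ground-truth semantic mask sequences. Class
# counts, shapes, and the averaging scheme are illustrative assumptions.
import torch

def sequence_miou(pred: torch.Tensor, gt: torch.Tensor, num_classes: int) -> float:
    """pred, gt: integer class maps of shape (T, H, W). Returns per-class IoU
    aggregated over the whole forecast horizon, averaged over the classes that
    appear in either prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        p, g = (pred == c), (gt == c)
        union = (p | g).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(((p & g).sum().float() / union.float()).item())
    return sum(ious) / max(len(ious), 1)

gt = torch.randint(0, 4, (8, 32, 32))      # 8 future frames, 4 semantic classes
pred = gt.clone()
pred[:, :8] = 0                            # corrupt a band to simulate prediction error
print(round(sequence_miou(pred, gt, num_classes=4), 3))
```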

Circularity Check

0 steps flagged

No circularity; empirical method evaluated on external benchmarks

full rationale

The paper introduces MWM as an architectural design choice—predicting semantic mask evolution via video diffusion instead of RGB pixels—to impose a geometric bottleneck. All performance claims rest on direct empirical comparisons against external SOTA RGB world models on LIBERO and RLBench, plus real-world trials and token-pruning robustness tests. No derivation chain reduces by construction to fitted parameters, self-citations, or self-definitions; the central results are independent measurements on standard benchmarks rather than tautological predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that semantic masks retain sufficient information for control while discarding only noise; no explicit free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Semantic masks capture essential physical dynamics and contact relations for robot control
    Invoked when stating that mask prediction filters visual noise while preserving what matters for policy learning.

pith-pipeline@v0.9.0 · 5525 in / 1219 out tokens · 49827 ms · 2026-05-10T02:10:56.040546+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. World Action Models: The Next Frontier in Embodied AI

cs.RO · 2026-05 · unverdicted · novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

Reference graph

Works this paper leans on

39 extracted references · 37 canonical work pages · cited by 1 Pith paper · 16 internal anchors
