Network Architecture Video VAE & Tokenization. We utilize a fixed, pretrained 3D VAE to compress 256×256 RGB frames into 8×8 latent patches with channel dimension D=128

· 2048

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

Mask World Model: Predicting What Matters for Robust Robot Policy Learning

cs.RO · 2026-04-21 · unverdicted · novelty 7.0

Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization and texture robustness.

citing papers explorer

Showing 1 of 1 citing paper.

Mask World Model: Predicting What Matters for Robust Robot Policy Learning cs.RO · 2026-04-21 · unverdicted · none · ref 38
