pith. sign in

Network Architecture Video V AE & Tokenization.We utilize a fixed, pretrained 3D V AE to compress 256×256 RGB frames into 8×8 latent patches with channel dimension D=128

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

fields

cs.RO 1

years

2026 1

verdicts

UNVERDICTED 1

representative citing papers

citing papers explorer

Showing 1 of 1 citing paper.

  • Mask World Model: Predicting What Matters for Robust Robot Policy Learning cs.RO · 2026-04-21 · unverdicted · none · ref 38

    Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization and texture robustness.