MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation

Jiahang Cao; Jiaxu Wang; Jingkai Sun; Junhao He; Mingyuan Sun; Qiang Zhang; Qiming Shao; Tianlun He; Xiangyu Yue; YiCheng Jiang

arxiv: 2602.09878 · v2 · pith:42UVAIVBnew · submitted 2026-02-10 · 💻 cs.CV

Jiaxu Wang, Yicheng Jiang, Tianlun He, Jingkai Sun, Qiang Zhang, Junhao He, Jiahang Cao, Zesen Gan, Mingyuan Sun, Qiming Shao, Xiangyu Yue

keywords: model, actions, dynamics, generation, manipulation, across, action, complete

World-model-based imagine-then-act has become a promising paradigm for robotic manipulation, yet existing approaches typically support either purely image-based forecasting or reasoning over partial 3D geometry, limiting their ability to predict complete 4D scene dynamics. This work proposes a novel embodied 4D world model that enables geometrically consistent, arbitrary-view RGBD generation: given only a single-view RGBD observation as input, the model imagines the remaining viewpoints, which can then be back-projected and fused to assemble a more complete 3D structure across time. To learn multi-view, cross-modality generation efficiently, we explicitly design cross-view and cross-modality feature fusion that jointly encourages consistency between RGB and depth and enforces geometric alignment across views. Beyond prediction, converting generated futures into actions is often handled by inverse dynamics, which is ill-posed because multiple actions can explain the same transition. We address this with a test-time action optimization strategy that backpropagates through the generative model to infer a trajectory-level latent best matching the predicted future, and a residual inverse dynamics model that turns this trajectory prior into accurate executable actions. Experiments on three datasets demonstrate strong performance on both 4D scene generation and downstream manipulation, and ablations provide practical insights into the key design choices.
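
The back-project-and-fuse step mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a standard pinhole camera model, and the function names `backproject_depth` and `fuse_views`, the intrinsics tuple `(fx, fy, cx, cy)`, and the 4x4 camera-to-world matrices are hypothetical choices made only for illustration.

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to an (N, 3) point cloud in the camera frame.

    Assumes a pinhole camera; pixels with non-positive depth are dropped.
    """
    h, w = depth.shape
    # u indexes columns (x direction), v indexes rows (y direction).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def fuse_views(depths, intrinsics, cam_to_world):
    """Merge per-view point clouds into one cloud in a shared world frame.

    `intrinsics` holds one (fx, fy, cx, cy) tuple per view and `cam_to_world`
    one 4x4 homogeneous transform per view (both hypothetical inputs).
    """
    clouds = []
    for depth, (fx, fy, cx, cy), T in zip(depths, intrinsics, cam_to_world):
        pts = backproject_depth(depth, fx, fy, cx, cy)
        pts_h = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
        # Apply the rigid transform to each homogeneous point.
        clouds.append((pts_h @ T.T)[:, :3])
    return np.concatenate(clouds, axis=0)
```

In this sketch the fused cloud is a plain concatenation; a real pipeline would likely also filter outliers or merge overlapping points, which the abstract does not specify.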

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
cs.RO 2026-04 unverdicted novelty 6.0

X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.
World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.