4Real-Video-V2: Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation

Aliaksandr Siarohin; Ashkan Mirzaei; Avalon Vinella; Chaoyang Wang; Ivan Skorokhodov; Michael Vasilkovsky; Peter Wonka; Sergey Korolev; Sergey Tulyakov; Vidit Goel

arxiv: 2506.18839 · v1 · pith:DK4DZ6DNnew · submitted 2025-06-18 · 💻 cs.CV


Chaoyang Wang, Ashkan Mirzaei, Vidit Goel, Willi Menapace, Aliaksandr Siarohin, Avalon Vinella, Michael Vasilkovsky, Ivan Skorokhodov, Vladislav Shakhrai, Sergey Korolev, Sergey Tulyakov, Peter Wonka



keywords: attention, reconstruction, architecture, same, video, existing, first, fused


Abstract

We propose the first framework capable of computing a 4D spatio-temporal grid of video frames and 3D Gaussian particles for each time step using a feed-forward architecture. Our architecture has two main components, a 4D video model and a 4D reconstruction model. In the first part, we analyze current 4D video diffusion architectures that perform spatial and temporal attention either sequentially or in parallel within a two-stream design. We highlight the limitations of existing approaches and introduce a novel fused architecture that performs spatial and temporal attention within a single layer. The key to our method is a sparse attention pattern, where tokens attend to others in the same frame, at the same timestamp, or from the same viewpoint. In the second part, we extend existing 3D reconstruction algorithms by introducing a Gaussian head, a camera token replacement algorithm, and additional dynamic layers and training. Overall, we establish a new state of the art for 4D generation, improving both visual quality and reconstruction capability.
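The sparse attention pattern described in the abstract can be illustrated with a small mask-construction sketch. This is not the authors' implementation: the function name `fused_view_time_mask`, the token ordering (view, then time, then spatial index), and the use of a dense boolean mask are all assumptions made for illustration. It encodes only the stated rule that a token attends to tokens at the same timestamp or from the same viewpoint, which automatically includes every token in its own frame.

```python
from itertools import product


def fused_view_time_mask(num_views, num_times, tokens_per_frame):
    """Return mask[i][j] == True when query token i may attend to key token j.

    Hypothetical helper: tokens are enumerated as (view, time, spatial index).
    A query attends to a key if both share a viewpoint or share a timestamp.
    Tokens in the same frame share both, so they are always included.
    """
    tokens = list(
        product(range(num_views), range(num_times), range(tokens_per_frame))
    )
    mask = []
    for v_q, t_q, _ in tokens:
        row = [(v_q == v_k) or (t_q == t_k) for v_k, t_k, _ in tokens]
        mask.append(row)
    return mask


# Small example: 3 viewpoints, 4 timestamps, 2 tokens per frame (24 tokens).
mask = fused_view_time_mask(num_views=3, num_times=4, tokens_per_frame=2)
allowed = sum(mask[0])
total = len(mask[0])
print(f"token 0 attends to {allowed} of {total} tokens")
# prints "token 0 attends to 12 of 24 tokens"
```

In this toy grid each token sees 8 tokens from its own viewpoint and 6 from its own timestamp, with the 2 tokens of its own frame counted once, giving 12 of 24. A practical implementation would presumably avoid materializing a dense mask and use a sparse attention kernel instead; the sketch shows only which token pairs are connected.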



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras
cs.CV 2026-03 unverdicted novelty 7.0

SparseCam4D achieves spatio-temporally consistent high-fidelity 4D reconstruction from sparse cameras via a Spatio-Temporal Distortion Field that corrects inconsistencies in generative observations.

Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
cs.CV 2026-04 unverdicted novelty 6.0

The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...

Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction
cs.CV 2026-06 unverdicted novelty 5.0

A multi-view video diffusion model conditioned on relative camera poses via extended RoPE generates dense synchronized views from sparse inputs for 4D Gaussian splatting reconstruction, claiming SOTA results on human ...