4DNeX: Feed-Forward 4D Generative Modeling Made Easy
7 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 7
citing papers explorer
- Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models
M²-REPA decouples modality-specific features from diffusion intermediates and aligns them to complementary expert foundation models via a multi-modal alignment loss and modality-specific decoupling regularization for improved multimodal video generation.
- HiReFF: High-Resolution Feedforward Human Reconstruction from Uncalibrated Sparse-View Video
HiReFF presents a feed-forward framework for 2K human video reconstruction from uncalibrated sparse-view videos via scale-synchronized calibration, Gaussian masking, and high-resolution side-tuning.
- PointAction: 3D Points as Universal Action Representations for Robot Control
PointAction uses predicted dynamic 3D pointmaps from fine-tuned video models as an embodiment-agnostic action representation to map video predictions to executable robot actions.
- Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models
Diff4Splat is a feed-forward video latent transformer that predicts time-varying 3D Gaussian primitives from one image to produce controllable 4D scenes with appearance, geometry, and motion.
- ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation
ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.
- Syn4D: A Multiview Synthetic 4D Dataset
Syn4D is a new multiview synthetic 4D dataset supplying dense ground-truth annotations for dynamic scene reconstruction, tracking, and human pose estimation.
- World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
World-R1 applies reinforcement learning via Flow-GRPO and a text dataset to align text-to-video models with 3D constraints from pre-trained foundation models, improving consistency while keeping original visual quality.