Robust Dreamer: Deviation-Aware Latent Gaussian Memory for Action-Controlled AR Video Generation

Gim Hee Lee; Hanlin Chen; Hongdong Li; Jiaxin Wei; Pan Ji; Steve Wang; Xibin Song; Yifu Wang

arxiv: 2605.30855 · v2 · pith:K6XAF7OBnew · submitted 2026-05-29 · 💻 cs.CV

Hanlin Chen, Jiaxin Wei, Xibin Song, Yifu Wang, Steve Wang, Hongdong Li, Pan Ji, Gim Hee Lee

Pith reviewed 2026-06-28 22:55 UTC · model grok-4.3

classification 💻 cs.CV

keywords action-controlled video generation · autoregressive video · latent gaussian memory · deviation learning · 3D consistency · world simulation · gaussian splatting

The pith

Anchoring diffusion latents to Gaussian primitives, and injecting synthesized deviations into memory during training, is claimed to prevent drift in long-horizon action-controlled video generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that two mechanisms produce catastrophic drift in autoregressive action-controlled video: repeated conversion between latent and RGB spaces that loses information, and the mismatch between clean memory seen in training and the corrupted memory that appears at inference. It replaces RGB cycling with Latent Gaussian Memory that stores diffusion latents as Gaussian primitives recalled by latent-space splatting, and it closes the training-inference gap by generating realistic deviations through a one-step approximation stored in a stage-and-timestamp archive. A reader would care because the resulting memory-augmented generator sustains visual fidelity and 3D consistency over extended rollouts on indoor, outdoor, and game scenes where prior methods collapse.
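The first mechanism, loss from repeated latent-to-RGB conversion, can be illustrated with a toy round trip. This is a minimal sketch, not the paper's VAE: `decode`, `encode`, the 8-bit rounding, and the `shrink` factor that models an imperfect inverse are all assumptions chosen only to make the accumulation visible. The 35 iterations mirror the count mentioned in the Figure 2 caption.

```python
def decode(latent, gain=40.0, offset=128.0):
    """Toy 'decoder' (assumed): map latent values to clamped 8-bit pixels."""
    return [min(255, max(0, round(gain * z + offset))) for z in latent]


def encode(pixels, gain=40.0, offset=128.0, shrink=0.97):
    """Toy 'encoder' (assumed): approximate inverse of decode.

    `shrink` < 1 stands in for an encoder that does not perfectly
    invert the decoder; it is a made-up constant, not a measured one.
    """
    return [shrink * (p - offset) / gain for p in pixels]


def cycle(latent, n):
    """Apply the decode -> encode round trip n times."""
    for _ in range(n):
        latent = encode(decode(latent))
    return latent


def max_abs_error(a, b):
    """Largest absolute per-element difference between two latents."""
    return max(abs(x - y) for x, y in zip(a, b))


original = [-1.5, -0.4, 0.0, 0.7, 1.9]
err_1 = max_abs_error(original, cycle(original, 1))
err_35 = max_abs_error(original, cycle(original, 35))
# Under these toy assumptions the error after 35 cycles exceeds the
# error after a single cycle, because each pass loses a little more.
```

Keeping the memory in latent space, as the paper proposes, removes this round trip from the conditioning path entirely; the toy only shows why the round trip is worth avoiding.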

Core claim

Robust Dreamer claims that anchoring diffusion latents inherited from the generation process to Gaussian primitives and recalling them via latent-space Gaussian splatting supplies dense geometry-aware conditioning without accumulated VAE degradation, while Deviation Learning with Dynamic Deviation Archive synthesizes rollout-induced corruptions through one-step approximation, indexes them by autoregressive stage and denoising timestamp, and injects them into historical memory so the generator learns internal correction before inference.
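Recall by latent-space splatting can be pictured as the usual front-to-back alpha compositing used in Gaussian splatting, applied to latent feature vectors instead of RGB colors. The sketch below covers a single pixel and assumes the per-Gaussian opacity at that pixel is already known; the function name `composite_latents` and the `(depth, alpha, latent)` tuple layout are hypothetical, and the paper's actual renderer is not described in the text available here.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float, Sequence[float]]  # (depth, alpha, latent vector)


def composite_latents(samples: Sequence[Sample]) -> Tuple[List[float], float]:
    """Blend latent vectors front to back for one pixel.

    Each contribution is weighted by its own alpha times the
    transmittance left over from nearer samples. Returns the blended
    latent and the accumulated opacity. Requires at least one sample.
    """
    ordered = sorted(samples, key=lambda s: s[0])  # nearest first
    dim = len(ordered[0][2])
    blended = [0.0] * dim
    transmittance = 1.0
    for _, alpha, latent in ordered:
        weight = transmittance * alpha
        for c in range(dim):
            blended[c] += weight * latent[c]
        transmittance *= 1.0 - alpha
    return blended, 1.0 - transmittance


# Two overlapping primitives with 2-channel latents (illustrative values).
pixel_latent, opacity = composite_latents([
    (2.0, 0.5, [0.0, 4.0]),  # farther primitive
    (1.0, 0.5, [2.0, 0.0]),  # nearer primitive
])
```

Because the blended quantity is already a latent, the result can condition the generator directly, which is the property the core claim relies on.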

What carries the argument

Latent Gaussian Memory that anchors diffusion latents to Gaussian primitives recalled by latent-space Gaussian splatting, paired with a Dynamic Deviation Archive that stores one-step-approximated rollout deviations indexed by stage and timestamp.
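The archive half of this machinery is essentially a store keyed by (autoregressive stage, denoising timestep). A minimal sketch of such a structure follows, assuming deviations are plain differences between a predicted and a target latent; the class name `DeviationArchive`, the per-bucket capacity, the first-in-first-out eviction, and the additive injection in `corrupt` are illustrative assumptions rather than details confirmed by the source.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

Latent = List[float]


class DeviationArchive:
    """Hypothetical store of latent deviations keyed by (stage, timestep)."""

    def __init__(self, capacity_per_bucket: int = 64, seed: int = 0) -> None:
        self._buckets: Dict[Tuple[int, int], List[Latent]] = defaultdict(list)
        self._capacity = capacity_per_bucket  # assumed bound, not from the paper
        self._rng = random.Random(seed)

    def store(self, stage: int, timestep: int,
              predicted: Sequence[float], target: Sequence[float]) -> None:
        """Record predicted - target as one deviation sample."""
        bucket = self._buckets[(stage, timestep)]
        bucket.append([p - t for p, t in zip(predicted, target)])
        if len(bucket) > self._capacity:
            bucket.pop(0)  # evict the oldest sample (assumed policy)

    def sample(self, stage: int, timestep: int) -> Optional[Latent]:
        """Return a stored deviation for this key, or None if the bucket is empty."""
        bucket = self._buckets.get((stage, timestep))
        return self._rng.choice(bucket) if bucket else None

    def corrupt(self, clean: Sequence[float], stage: int, timestep: int) -> Latent:
        """Add a sampled deviation to a clean memory latent, if one exists."""
        deviation = self.sample(stage, timestep)
        if deviation is None:
            return list(clean)
        return [c + d for c, d in zip(clean, deviation)]


archive = DeviationArchive()
archive.store(stage=2, timestep=500, predicted=[1.5, 0.0], target=[1.0, 0.5])
corrupted = archive.corrupt([0.0, 0.0], stage=2, timestep=500)
untouched = archive.corrupt([1.0, 1.0], stage=3, timestep=500)
```

Indexing by both keys lets training draw a deviation that matches how far into the rollout, and how far into denoising, the corrupted memory is supposed to be.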

If this is right

Autoregressive rollouts maintain 3D consistency without progressive degradation from repeated VAE conversions.
The generator acquires the ability to correct corrupted historical memory states before they appear at inference time.
Action signals continue to produce immediate, geometrically coherent visual responses over hundreds of frames.
Performance gains appear across indoor reconstruction, outdoor driving, and synthetic game environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same Gaussian-anchored memory could be tested in other autoregressive generative domains that currently rely on RGB or token cycling.
If the one-step deviation approximation proves sufficient, full multi-step rollouts may no longer be required to collect realistic training corruptions.
The archive structure suggests a general way to store and replay distribution shifts, keyed by autoregressive stage and denoising step.

Load-bearing premise

The one-step approximation used to synthesize rollout-induced latent deviations during training is sufficient to expose the model to the actual distribution of corrupted memory states encountered at inference.

What would settle it

Train two identical generators on the same data, one with the Dynamic Deviation Archive and one without, then measure whether the version lacking the archive exhibits the same rate of 3D drift after 50-plus autoregressive steps on ScanNet or DL3DV as the baselines the paper compares against.
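One way to turn "rate of 3D drift" into a single comparable number is the least-squares slope of a per-step error metric against the autoregressive step index. The helper below is a sketch under that assumption; `drift_rate` is a hypothetical name, the choice of error metric is left open, and the example input is synthetic rather than taken from the paper.

```python
from typing import Sequence


def drift_rate(errors: Sequence[float]) -> float:
    """Least-squares slope of per-step error versus step index.

    A slope near zero means the error stays flat over the rollout;
    a positive slope means it grows with each autoregressive step.
    """
    n = len(errors)
    if n < 2:
        raise ValueError("need at least two steps to estimate a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(errors) / n
    cov = sum((i - mean_x) * (e - mean_y) for i, e in enumerate(errors))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


# Synthetic sanity check: error rising by one unit per step.
slope = drift_rate([0.0, 1.0, 2.0, 3.0])
```

Computing this slope for the run with the archive and the run without it, over the same 50-plus steps, would give the comparison the paragraph above asks for.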

Figures

Figures reproduced from arXiv: 2605.30855 by Gim Hee Lee, Hanlin Chen, Hongdong Li, Jiaxin Wei, Pan Ji, Steve Wang, Xibin Song, Yifu Wang.

**Figure 1.** Overview of the inference pipeline. Our system performs long-horizon frame-by-frame generation through a closed-loop autoregressive process. First, a user action triggers memory recall via Gaussian Splatting, rendering a viewpoint-aligned latent. This latent conditions the proposed Dreamer to generate the next frame, which is subsequently decoded into RGB. Finally, the generated latent is directly inherite…

**Figure 2.** Motivation. (a) Latent–RGB Cycling: Repeatedly decoding a latent to RGB and encoding the RGB back to latent for 35 iterations causes catastrophic signal degradation and color distortion due to accumulated quantization errors. (b) Deviation Learning: The baseline (top), trained on clean memory, i.e., memory constructed from clean training frames/latents, suffers from structural collapse due to the training–…

**Figure 3.** Overview of the training pipeline. (a) Variable-length subsequences provide historical context. (b) Latent Gaussian Memory is built from deviation-corrupted histories. (c) The Dreamer predicts velocity conditioned on clean anchor (frame 0), predecessor (previous frame), and recalled latent memory (rendered frame). (d) One-step deviations update the Dynamic Deviation Archive.

**Figure 4.** Qualitative results on ScanNet and DL3DV. For each method, we visualize the first generated frame (left) and a later frame in the rollout (right). As generation progresses, baseline methods suffer from accumulated color drift and structural degradation, whereas our approach maintains consistent geometry and appearance without noticeable misalignment.

**Figure 5.** Deviation patterns. Comparison between our synthesized deviation (left) and Gaussian noise (right).

4.3 Ablation Study. We conduct comprehensive ablation studies on the ScanNet dataset [13] to validate the effectiveness of our individual modules. The quantitative results are in Tab. 3. Effectiveness of Latent Gaussian Memory (Row A). We first analyze the impact of performing memory writing and recall direc…

**Figure 6.** Qualitative results on static long scenes from ScanNet (300 frames) and dynamic scenes from OmniWorldGame (80 frames). The top two rows show experiments on ScanNet, while the bottom four rows present comparisons on OmniWorldGame. Displayed frames are randomly sampled from the early, middle, and late stages of the sequences. Compared to the state-of-the-art baseline VMem, which also utilizes a 3D memory mec…

**Figure 7.** Qualitative results on OmniWorldGame. We visualize 64 randomly sampled frames from an 80-frame dynamic scene.

**Figure 8.** Qualitative results on OmniWorldGame. We visualize 64 randomly sampled frames from an 80-frame dynamic scene.

**Figure 9.** Qualitative results on OmniWorldGame. We visualize 64 randomly sampled frames from an 80-frame dynamic scene.

**Figure 10.** Qualitative results on ScanNet. We visualize 56 frames uniformly sampled from a 300-frame sequence of a long static scene.

**Figure 11.** Qualitative results on ScanNet. We visualize 112 frames uniformly sampled from a 300-frame sequence of a long static scene.

**Figure 12.** Qualitative results on ScanNet. We visualize 112 frames uniformly sampled from a 300-frame sequence of a long static scene.

**Figure 13.** Qualitative results on an out-of-domain scene. We visualize 48 uniformly sampled frames from an 80-frame sequence to demonstrate generalization.

**Figure 14.** Qualitative results on an out-of-domain scene. We visualize 40 uniformly sampled frames from an 80-frame sequence to demonstrate generalization.

**Figure 15.** Qualitative results on an out-of-domain dynamic scene. We visualize 64 uniformly sampled frames from an 80-frame sequence to demonstrate generalization.

Original abstract

Frame-wise action-controlled image-to-video generation is a promising paradigm for interactive world simulation, where each control signal should elicit an immediate visual response. However, maintaining visual fidelity and 3D consistency over long autoregressive rollouts remains challenging. Existing 3D-aware methods often suffer from catastrophic drift due to two impediments: information loss from Latent–RGB Cycling, where generated latents are repeatedly decoded to RGB and re-encoded for future conditioning, and the training–inference gap induced by the error-free hypothesis, where clean training memory fails to match prediction-corrupted inference memory. To address these challenges, we present Robust Dreamer, a memory-augmented framework built around how to design 3D memory and how to use it robustly. First, we introduce Latent Gaussian Memory, which anchors diffusion latents inherited from the generation process to Gaussian primitives and recalls them via latent-space Gaussian splatting. This provides dense, geometry-aware, view-aligned conditioning while avoiding accumulated degradation from repeated VAE conversion. Second, we propose Deviation Learning with Dynamic Deviation Archive, which synthesizes rollout-induced latent deviations through a one-step approximation, stores them by autoregressive stage and denoising timestamp, and injects them into historical memory during training. This exposes the generator to realistic corrupted memory states and teaches internal correction before inference. Experiments on ScanNet, DL3DV, and OmniWorldGame demonstrate state-of-the-art long-horizon performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's main move is storing diffusion latents as Gaussian primitives for view-aligned memory without RGB cycling, plus a staged deviation archive trained on one-step rollouts to close the train-inference gap, but the SOTA long-horizon claim can't be checked from the given text.

The paper identifies two practical problems in action-controlled autoregressive video: repeated latent-to-RGB-to-latent cycling that loses 3D structure, and the fact that training sees clean memory while inference sees its own accumulating errors. It proposes Latent Gaussian Memory, which keeps latents tied to Gaussian primitives and recalls them by splatting in latent space, and Deviation Learning with a Dynamic Deviation Archive that stores one-step synthesized deviations indexed by autoregressive stage and denoising step.

The Gaussian memory part looks like a direct attempt to keep geometry without the degradation cycle, and the deviation archive is a concrete way to make training see corrupted states. If the one-step synthesis produces deviations that are representative enough, it could reduce the drift that usually appears after a few dozen frames.

The obvious soft spot is whether that one-step approximation actually matches the error distribution that builds up over many steps at inference. Multi-step conditioning on already-deviated latents can produce structured, compounding shifts that a single forward pass might miss, leaving the correction under-trained for the real rollout regime. The abstract does not show how they validated the approximation or whether they compared it to multi-step simulation.
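The concern about compounding can be made concrete with a toy linear feedback model, in which each step carries over a fraction of the previous deviation and adds a fresh per-step error. This is only an illustration under stated assumptions: the linear recurrence, `feedback_gain`, and `per_step_error` are hypothetical and say nothing about how errors actually propagate in the paper's generator.

```python
def accumulated_deviation(per_step_error: float,
                          feedback_gain: float,
                          steps: int) -> float:
    """Deviation after `steps` rollout steps in a toy linear model.

    Recurrence (assumed): d_next = feedback_gain * d + per_step_error,
    starting from d = 0. With steps=1 this is just the one-step error.
    """
    deviation = 0.0
    for _ in range(steps):
        deviation = feedback_gain * deviation + per_step_error
    return deviation


one_step = accumulated_deviation(1.0, 0.5, steps=1)
three_step = accumulated_deviation(1.0, 0.5, steps=3)
```

In this toy the three-step deviation is larger than the one-step deviation whenever the gain is positive, which is the kind of gap between one-step synthesis and true rollouts that the letter asks the authors to measure.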

No tables, ablations, or baseline numbers appear in the provided text, so the state-of-the-art claim on ScanNet, DL3DV, and OmniWorldGame remains unverified. The citation pattern is not visible either.

This is for groups already working on long-horizon world models or interactive video generation who need engineering fixes for consistency. A reader looking for a new theoretical framing will not find it, but someone implementing AR video pipelines might pick up the memory and archive ideas. The mechanisms are specific enough that a referee could evaluate them against real rollouts.

I would send it to peer review so the experiments and the approximation can be checked directly.

Referee Report

1 major / 1 minor

Summary. The paper proposes Robust Dreamer, a memory-augmented framework for action-controlled autoregressive (AR) video generation. It introduces Latent Gaussian Memory, which anchors diffusion latents to Gaussian primitives and recalls them via latent-space Gaussian splatting to avoid degradation from repeated Latent-RGB Cycling. It further proposes Deviation Learning with Dynamic Deviation Archive, which synthesizes rollout-induced latent deviations via a one-step approximation, stores them by autoregressive stage and denoising timestamp, and injects them into historical memory during training to address the training-inference gap from the error-free hypothesis. Experiments on ScanNet, DL3DV, and OmniWorldGame are claimed to demonstrate state-of-the-art long-horizon performance.

Significance. If the central claims hold, the work addresses practically important impediments to long-horizon consistency in interactive world simulation and AR video generation. The Latent Gaussian Memory approach offers a geometry-aware conditioning mechanism that sidesteps VAE cycling losses, while the deviation archive provides a targeted way to close the train-inference distribution gap. These contributions could influence downstream applications in robotics simulation and controllable video synthesis if supported by rigorous quantitative validation.

major comments (1)

[Abstract, Deviation Learning with Dynamic Deviation Archive paragraph] The SOTA long-horizon claim depends on Deviation Learning successfully exposing the generator to realistic corrupted memory states. The method relies on a one-step approximation to synthesize rollout-induced latent deviations. This approximation may produce a narrower or differently structured deviation distribution than the multi-step error accumulation that occurs at inference, where generated (deviated) latents are repeatedly fed back as conditioning. Without additional analysis or experiments demonstrating that the one-step distribution is representative of the true inference regime, the internal correction mechanism may remain under-trained for the conditions that determine long-horizon performance.

minor comments (1)

[Abstract] The abstract introduces terms such as 'Latent–RGB Cycling' and 'error-free hypothesis' without citing the prior work that defines or motivates them; adding these references would improve traceability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comment regarding the Deviation Learning component in our manuscript. We provide a point-by-point response below.

Referee: Abstract (Deviation Learning with Dynamic Deviation Archive paragraph): The SOTA long-horizon claim depends on Deviation Learning successfully exposing the generator to realistic corrupted memory states. The method relies on a one-step approximation to synthesize rollout-induced latent deviations. This approximation may produce a narrower or differently structured deviation distribution than the multi-step error accumulation that occurs at inference, where generated (deviated) latents are repeatedly fed back as conditioning. Without additional analysis or experiments demonstrating that the one-step distribution is representative of the true inference regime, the internal correction mechanism may remain under-trained for the conditions that determine long-horizon performance.

Authors: We agree with the referee that the one-step approximation is a key design choice whose fidelity to the multi-step inference distribution merits further validation. In the original manuscript, we describe the one-step approximation as a practical means to generate deviations at each autoregressive stage and denoising timestamp for storage in the Dynamic Deviation Archive. To strengthen the evidence, we will add in the revision: (1) a quantitative comparison of deviation statistics (e.g., mean and variance of latent differences) between one-step synthesized deviations and those accumulated over multiple steps in short rollouts, and (2) an ablation showing long-horizon performance when training with the one-step vs. a more expensive multi-step deviation synthesis on a smaller dataset. This will directly address whether the approximation sufficiently covers the inference regime.

Revision: yes
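The statistics comparison promised in item (1) could look roughly like the following sketch, which pools all latent differences in each set and compares their mean and population variance. The function names `deviation_stats` and `stats_gap` are hypothetical, pooling across channels is a simplifying assumption, and the example values are synthetic.

```python
from statistics import fmean, pvariance
from typing import Dict, Sequence

Deviations = Sequence[Sequence[float]]  # one inner sequence per latent difference


def deviation_stats(deviations: Deviations) -> Dict[str, float]:
    """Mean and population variance over all pooled deviation values."""
    flat = [value for deviation in deviations for value in deviation]
    return {"mean": fmean(flat), "variance": pvariance(flat)}


def stats_gap(one_step: Deviations, multi_step: Deviations) -> Dict[str, float]:
    """Absolute gap between the two sets for each statistic."""
    a = deviation_stats(one_step)
    b = deviation_stats(multi_step)
    return {name: abs(a[name] - b[name]) for name in a}


# Synthetic inputs only, to show the call shape.
gap = stats_gap(one_step=[[0.0, 0.0]], multi_step=[[1.0, 3.0]])
```

Matching first and second moments would be a necessary check rather than a sufficient one, since structured, spatially correlated deviations can share a mean and variance with unstructured noise.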

Circularity Check

0 steps flagged

No circularity: engineering framework with no self-referential derivations or fitted predictions

The provided abstract and description introduce Latent Gaussian Memory (anchoring diffusion latents to Gaussian primitives) and Deviation Learning with Dynamic Deviation Archive (one-step synthesis of rollout deviations stored by stage/timestamp). These are presented as design choices to address information loss and training-inference gap, without any equations, uniqueness theorems, or derivations that reduce a claimed result to a quantity defined by the method itself. No self-citations are invoked as load-bearing premises, no ansatzes are smuggled, and no predictions are statistically forced by fitting. The approach is self-contained as an empirical engineering solution; the one-step approximation is an explicit modeling choice, not a circular redefinition of the target distribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities beyond the high-level description of the two new components; standard diffusion and VAE assumptions are implicit but not enumerated.

Reference graph

Works this paper leans on

82 extracted references · 11 canonical work pages · 3 internal anchors

[1] Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B Lindell, and Sergey Tulyakov. Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. arXiv preprint arXiv:2411.18673, 2024.
[2] Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleksander Holynski, Jiri Hron, Christos Kaplanis, Marjorie Limont, Matt McGill, Yanko Oliveira, Jack Parker-Holder, Frank Perbet, Guy Scully, Jeremy Shar, Stephen Spencer, Omer Tov, Ruben Villegas, Emma Wang, Jessica Yung, Ci... 2025.
[3] David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. PixelSplat: 3D Gaussian splats from image pairs for scalable generalizable 3D reconstruction. In CVPR, 2024.
[4] Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems, 37:24081–24125, 2025.
[5] Hanlin Chen, Chen Li, Mengqi Guo, Zhiwen Yan, and Gim Hee Lee. Gnesf: Generalizable neural semantic fields. In NeurIPS, 2023.
[6] Hanlin Chen, Chen Li, and Gim Hee Lee. Neusg: Neural implicit surface reconstruction with 3d gaussian splatting guidance. arXiv preprint arXiv:2312.00846, 2023.
[7] Hanlin Chen, Fangyin Wei, Chen Li, Tianxin Huang, Yunsong Wang, and Gim Hee Lee. Vcr-gaus: View consistent depth-normal regularizer for gaussian surface reconstruction. arXiv preprint arXiv:2406.05774, 2024.
[8] Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Ttt3r: 3d reconstruction as test-time training. arXiv preprint arXiv:2509.26645, 2025.
[9] Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. In ECCV, 2024.
[10] Yuedong Chen, Chuanxia Zheng, Haofei Xu, Bohan Zhuang, Andrea Vedaldi, Tat-Jen Cham, and Jianfei Cai. Mvsplat360: Feed-forward 360 scene synthesis from sparse views. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
[11] Ruihang Chu, Yefei He, Zhekai Chen, Shiwei Zhang, Xiaogang Xu, Bin Xia, Dingdong Wang, Hongwei Yi, Xihui Liu, Hengshuang Zhao, et al. Wan-move: Motion-controllable video generation via latent trajectory guidance. arXiv preprint arXiv:2512.08765, 2025.
[12] Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283, 2025.
[13] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017.
[14] Decart, Julian Quevedo, Quinn McIntyre, Spruce Campbell, Xinlei Chen, and Robert Wachen. Oasis: A universe in a transformer. 2024.
[15] Google DeepMind. Veo 3 technical report. Technical report, Google, 2025.
[16] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
[17] Y. Duan, S. Ren, J. Luo, Y. Chen, H. Wang, L. Zheng, and Q. Dai. 4d radiance fields with multi-scale occupancy networks for dynamic scene reconstruction. In CVPR, 2024.
[18] Yuanxing Duan, Fangyin Wei, Qiyu Dai, Yuhang He, Wenzheng Chen, and Baoquan Chen. 4d-rotor gaussian splatting: towards efficient novel view synthesis for dynamic scenes. In ACM SIGGRAPH 2024 Conference Papers, 2024.
[19] Yuwei Guo, Ceyuan Yang, Hao He, Yang Zhao, Meng Wei, Zhenheng Yang, Weilin Huang, and Dahua Lin. End-to-end training for autoregressive video diffusion via self-resampling. arXiv preprint arXiv:2512.15702, 2025.
[20] Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. In ICLR, 2025.
[21] Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, et al. Matrix-Game 2.0: An Open-Source Real-Time and Streaming Interactive World Model. arXiv preprint arXiv:2508.13009, 2025.
[22] Yicong Hong, Yiqun Mei, Chongjian Ge, Yiran Xu, Yang Zhou, Sai Bi, Yannick Hold-Geoffroy, Mike Roberts, Matthew Fisher, Eli Shechtman, et al. RELIC: Interactive Video World Model with Long-Horizon Memory. arXiv preprint arXiv:2512.04040, 2025.
[23] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2D Gaussian splatting for geometrically accurate radiance fields. In SIGGRAPH 2024 Conference Papers. ACM, 2024.
[24] Tianyu Huang, Wangguandong Zheng, Tengfei Wang, Yuhao Liu, Zhenwei Wang, Junta Wu, Jie Jiang, Hui Li, Rynson Lau, Wangmeng Zuo, et al. Voyager: Long-range and world-consistent video diffusion for explorable 3d scene generation. ACM Transactions on Graphics (TOG), 44(6):1–15, 2025.
[25] Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025.
[26] InSpatio Team. InSpatio: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling. arXiv preprint arXiv:2604.07209, 2026.
[27] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM TOG, 2023.
[28] Hanyang Kong, Xingyi Yang, Xiaoxu Zheng, and Xinchao Wang. Worldwarp: Propagating 3d geometry with asynchronous video diffusion. arXiv preprint arXiv:2512.19678, 2025.
[29] Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding image matching in 3d with mast3r. arXiv:2406.09756, 2024.
[30] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In European Conference on Computer Vision, pages 71–91. Springer, 2024.
[31] Guangyuan Li, Siming Zheng, Shuolin Xu, Jinwei Chen, Bo Li, Xiaobin Hu, Lei Zhao, and Peng-Tao Jiang. Magicworld: Interactive geometry-driven video world exploration. arXiv preprint arXiv:2511.18886, 2025.
[32] Jiaqi Li, Junshu Tang, Zhiyong Xu, Longhuang Wu, Yuan Zhou, Shuai Shao, Tianbao Yu, Zhiguo Cao, and Qinglin Lu. Hunyuan-gamecraft: High-dynamic interactive game video generation with hybrid history condition. arXiv preprint arXiv:2506.17201, 2025.
[33] Runjia Li, Philip Torr, Andrea Vedaldi, and Tomas Jakab. Vmem: Consistent interactive video scene generation with surfel-indexed view memory. arXiv preprint arXiv:2506.18903, 2025.
[34] Teng Li, Guangcong Zheng, Rui Jiang, Tao Wu, Yehao Lu, Yining Lin, Xi Li, et al. Realcam-i2v: Real-world image-to-video generation with interactive complex camera control. arXiv preprint arXiv:2502.10059, 2025.
[35] Wuyang Li, Wentao Pan, Po-Chien Luan, Yang Gao, and Alexandre Alahi. Stable video infinity: Infinite-length video generation with error recycling. arXiv preprint arXiv:2510.09212, 2025.
[36] Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. DL3DV-10K: a large-scale scene dataset for deep learning-based 3D vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024.
[37]

Infinite nature: Perpetual view generation of natural scenes from a single image

Andrew Liu, Richard Tucker, Varun Jampani, Ameesh Makadia, Noah Snavely, and Angjoo Kanazawa. Infinite nature: Perpetual view generation of natural scenes from a single image. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14458–14467, 2021

2021
[38]

Rolling forcing: Autoregressive long video diffusion in real time.arXiv preprint arXiv:2509.25161, 2025

Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time.arXiv preprint arXiv:2509.25161, 2025

Pith/arXiv arXiv 2025
[39]

See4d: Pose-free 4d generation via auto-regressive video inpainting

Dongyue Lu, Ao Liang, Tianxin Huang, Xiao Fu, Yuyang Zhao, Baorui Ma, Liang Pan, Wei Yin, Lingdong Kong, Wei Tsang Ooi, and Ziwei Liu. See4d: Pose-free 4d generation via auto-regressive video inpainting. arXiv preprint arXiv:2510.26796, 2025

arXiv 2025
[40]

Scaffold-gs: Structured 3d gaussians for view-adaptive rendering

Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, and Bo Dai. Scaffold-gs: Structured 3d gaussians for view-adaptive rendering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20654–20664, 2024

2024
[41]

Dynamic 3D Gaussians: tracking by persistent dynamic view synthesis

Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3D Gaussians: tracking by persistent dynamic view synthesis. In3DV, 2024

2024
[42] Yuta Oshima, Yusuke Iwasawa, Masahiro Suzuki, Yutaka Matsuo, and Hiroki Furuta. Worldpack: Compressed memory improves spatial consistency in video world modeling. arXiv preprint arXiv:2512.02473, 2025.

[43] Stefan Popov, Amit Raj, Michael Krainin, Yuanzhen Li, William T Freeman, and Michael Rubinstein. Camctrl3d: Single-image scene exploration with precise 3d camera control. arXiv preprint arXiv:2501.06006, 2025.

[44] Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20051–20060, 2024.

[45] Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world-consistent video generation with precise camera control. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.

[46] Junyoung Seo, Kazumi Fukuda, Takashi Shibuya, Takuya Narihira, Naoki Murata, Shoukang Hu, Chieh-Hsin Lai, Seungryong Kim, and Yuki Mitsufuji. Genwarp: Single image to novel views with semantic-preserving generative warping. In NeurIPS, 2024.

[47] Brandon Smart, Chuanxia Zheng, Iro Laina, and Victor Adrian Prisacariu. Splatt3r: Zero-shot gaussian splatting from uncalibrated image pairs. arXiv preprint arXiv:2408.13912, 2024.

[48] Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, and Vincent Sitzmann. History-guided video diffusion, 2025.

[49] Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling. arXiv preprint arXiv:2512.14614, 2025.

[50] Shengji Tang, Weicai Ye, Peng Ye, Weihao Lin, Yang Zhou, Tao Chen, and Wanli Ouyang. Hisplat: Hierarchical 3d gaussian splatting for generalizable sparse-view reconstruction. arXiv preprint arXiv:2410.06245, 2024.

[51] Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025.

[52] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T... Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

[53] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.

[54] Jing Wang et al. Error analyses of auto-regressive video diffusion models: A unified framework. arXiv preprint arXiv:2503.10704, 2025.

[55] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. arXiv preprint arXiv:2501.12387, 2025.

[56] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024.

[57] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: geometric 3D vision made easy. In CVPR, 2024.

[58] Yunsong Wang, Hanlin Chen, and Gim Hee Lee. Gov-nesf: Generalizable open-vocabulary neural semantic fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20443–20453, 2024.

[59] Yunsong Wang, Tianxin Huang, Hanlin Chen, and Gim Hee Lee. Freesplat: Generalizable 3d gaussian splatting towards free-view synthesis of indoor scenes. arXiv preprint arXiv:2405.17958, 2024.

[60] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024.

[61] G. Wu, T. Yi, J. Fang, L. Xie, X. Zhang, W. Wei, W. Liu, Q. Tian, and W. Xinggang. 4d gaussian splatting for real-time dynamic scene rendering. arXiv preprint arXiv:2310.08528, 2023.

[62] Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, and Jiang Bian. Geometry forcing: Marrying video diffusion and 3d representation for consistent world modeling. arXiv preprint arXiv:2507.07982, 2025.

[63] Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, and Gordon Wetzstein. Video world models with long-term spatial memory. arXiv preprint arXiv:2506.05284, 2025.

[64] Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, and Xingang Pan. Worldmem: Long-term consistent world simulation with memory. arXiv preprint arXiv:2504.12369, 2025.

[65] Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21924–21935, 2025.

[66] Zeyu Yang, Hongye Yang, Zijie Pan, Xiatian Zhu, and Li Zhang. Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting. arXiv preprint arXiv:2310.10642, 2023.

[67] Botao Ye, Sifei Liu, Haofei Xu, Li Xueting, Marc Pollefeys, Ming-Hsuan Yang, and Peng Songyou. No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images. arXiv preprint arXiv:2410.24207, 2024.

[68] Vickie Ye, Ruilong Li, Justin Kerr, Matias Turkulainen, Brent Yi, Zhuoyang Pan, Otto Seiskari, Jianbo Ye, Jeffrey Hu, Matthew Tancik, and Angjoo Kanazawa. gsplat: An open-source library for gaussian splatting. Journal of Machine Learning Research, 26(34):1–17, 2025.

[69] Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, and Pinar Yanardag. Infinity-rope: Action-controllable infinite video generation emerges from autoregressive self-rollout. arXiv preprint arXiv:2511.20649, 2025.

[70] Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22963–22974, 2025.

[71] Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–11, 2025.

[72] Mark YU, Wenbo Hu, Jinbo Xing, and Ying Shan. Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models. arXiv preprint arXiv:2503.05638, 2025.

[73] Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. ViewCrafter: taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048, 2024.

[74] Zehao Yu, Torsten Sattler, and Andreas Geiger. Gaussian opacity fields: Efficient adaptive surface reconstruction in unbounded scenes. ACM Transactions on Graphics, 2024.

[75] Chuanrui Zhang, Yingshuang Zou, Zhuoling Li, Minmin Yi, and Haoqian Wang. Transplat: Generalizable 3d gaussian splatting from sparse multi-view images with transformers. arXiv preprint arXiv:2408.13770, 2024.

[76] Lvmin Zhang, Shengqu Cai, Muyang Li, Gordon Wetzstein, and Maneesh Agrawala. Frame Context Packing and Drift Prevention in Next-Frame-Prediction Video Diffusion Models. In NeurIPS, 2025.

[77] Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, et al. Matrix-game: Interactive world foundation model. arXiv preprint arXiv:2506.18701, 2025.

[78] Jinjing Zhao, Fangyun Wei, Zhening Liu, Hongyang Zhang, Chang Xu, and Yan Lu. Spatia: Video Generation with Updatable Spatial Memory. arXiv preprint arXiv:2512.15716, 2025.

[79] Shunyuan Zheng, Boyao Zhou, Ruizhi Shao, Boning Liu, Shengping Zhang, Liqiang Nie, and Yebin Liu. Gps-gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19680–19690, 2024.

[80] Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, Mingyu Liu, Dingning Liu, Jiange Yang, Zhoujie Fu, Junyi Chen, Chunhua Shen, Jiangmiao Pang, Kaipeng Zhang, and Tong He. Omniworld: A multi-domain and multi-modal dataset for 4d world modeling. arXiv preprint arXiv:2509.12201, 2025.

[36] [36]

DL3DV-10K: a large-scale scene dataset for deep learning-based 3D vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. DL3DV-10K: a large-scale scene dataset for deep learning-based 3D vision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024. 11

2024

[37] [37]

Infinite nature: Perpetual view generation of natural scenes from a single image

Andrew Liu, Richard Tucker, Varun Jampani, Ameesh Makadia, Noah Snavely, and Angjoo Kanazawa. Infinite nature: Perpetual view generation of natural scenes from a single image. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14458–14467, 2021

2021

[38] [38]

Rolling forcing: Autoregressive long video diffusion in real time.arXiv preprint arXiv:2509.25161, 2025

Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time.arXiv preprint arXiv:2509.25161, 2025

Pith/arXiv arXiv 2025

[39] [39]

See4d: Pose-free 4d generation via auto-regressive video inpainting

Dongyue Lu, Ao Liang, Tianxin Huang, Xiao Fu, Yuyang Zhao, Baorui Ma, Liang Pan, Wei Yin, Lingdong Kong, Wei Tsang Ooi, and Ziwei Liu. See4d: Pose-free 4d generation via auto-regressive video inpainting. arXiv preprint arXiv:2510.26796, 2025

arXiv 2025

[40] [40]

Scaffold-gs: Structured 3d gaussians for view-adaptive rendering

Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, and Bo Dai. Scaffold-gs: Structured 3d gaussians for view-adaptive rendering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20654–20664, 2024

2024

[41] [41]

Dynamic 3D Gaussians: tracking by persistent dynamic view synthesis

Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3D Gaussians: tracking by persistent dynamic view synthesis. In3DV, 2024

2024

[42] [42]

Worldpack: Com- pressed memory improves spatial consistency in video world modeling.arXiv preprint arXiv:2512.02473, 2025

Yuta Oshima, Yusuke Iwasawa, Masahiro Suzuki, Yutaka Matsuo, and Hiroki Furuta. Worldpack: Com- pressed memory improves spatial consistency in video world modeling.arXiv preprint arXiv:2512.02473, 2025

arXiv 2025

[43] [43]

Cam- ctrl3d: Single-image scene exploration with precise 3d camera control.arXiv preprint arXiv:2501.06006, 2025

Stefan Popov, Amit Raj, Michael Krainin, Yuanzhen Li, William T Freeman, and Michael Rubinstein. Cam- ctrl3d: Single-image scene exploration with precise 3d camera control.arXiv preprint arXiv:2501.06006, 2025

arXiv 2025

[44] [44]

Langsplat: 3d language gaus- sian splatting

Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaus- sian splatting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20051–20060, 2024

2024

[45] [45]

Gen3c: 3d-informed world-consistent video generation with precise camera control

Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world-consistent video generation with precise camera control. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2025

[46] [46]

Genwarp: Single image to novel views with semantic- preserving generative warping

Junyoung Seo, Kazumi Fukuda, Takashi Shibuya, Takuya Narihira, Naoki Murata, Shoukang Hu, Chieh- Hsin Lai, Seungryong Kim, and Yuki Mitsufuji. Genwarp: Single image to novel views with semantic- preserving generative warping. InNeurIPS, 2024

2024

[47] [47]

Splatt3r: Zero-shot gaussian splatting from uncalibrated image pairs.arXiv preprint arXiv:2408.13912, 2024

Brandon Smart, Chuanxia Zheng, Iro Laina, and Victor Adrian Prisacariu. Splatt3r: Zero-shot gaussian splatting from uncalibrated image pairs.arXiv preprint arXiv:2408.13912, 2024

Pith/arXiv arXiv 2024

[48] [48]

History- guided video diffusion, 2025

Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, and Vincent Sitzmann. History- guided video diffusion, 2025

2025

[49] [49]

WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling.arXiv preprint arXiv:2512.14614, 2025

Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, and Chunchao Guo. WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling.arXiv preprint arXiv:2512.14614, 2025

Pith/arXiv arXiv 2025

[50] [50]

Hisplat: Hierar- chical 3d gaussian splatting for generalizable sparse-view reconstruction.arXiv preprint arXiv:2410.06245, 2024

Shengji Tang, Weicai Ye, Peng Ye, Weihao Lin, Yang Zhou, Tao Chen, and Wanli Ouyang. Hisplat: Hierar- chical 3d gaussian splatting for generalizable sparse-view reconstruction.arXiv preprint arXiv:2410.06245, 2024

arXiv 2024

[51] [51]

Magi-1: Autoregressive video generation at scale.arXiv preprint arXiv:2505.13211, 2025

Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale.arXiv preprint arXiv:2505.13211, 2025

Pith/arXiv arXiv 2025

[52] [52]

Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

Pith/arXiv arXiv 2025

[53] [53]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

2025

[54] [54]

Error analyses of auto-regressive video diffusion models: A unified framework.arXiv preprint arXiv:2503.10704, 2025

Jing Wang et al. Error analyses of auto-regressive video diffusion models: A unified framework.arXiv preprint arXiv:2503.10704, 2025

arXiv 2025

[55] [55]

Continuous 3d perception model with persistent state.arXiv preprint arXiv:2501.12387, 2025

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state.arXiv preprint arXiv:2501.12387, 2025

arXiv 2025

[56] [56]

Dust3r: Geometric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InCVPR, 2024

2024

[57] [57]

DUSt3R: geometric 3D vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: geometric 3D vision made easy. InCVPR, 2024

2024

[58] [58]

Gov-nesf: Generalizable open-vocabulary neural semantic fields

Yunsong Wang, Hanlin Chen, and Gim Hee Lee. Gov-nesf: Generalizable open-vocabulary neural semantic fields. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20443–20453, 2024

2024

[59] [59]

Freesplat: Generalizable 3d gaussian splatting towards free-view synthesis of indoor scenes.arXiv preprint arXiv:2405.17958, 2024

Yunsong Wang, Tianxin Huang, Hanlin Chen, and Gim Hee Lee. Freesplat: Generalizable 3d gaussian splatting towards free-view synthesis of indoor scenes.arXiv preprint arXiv:2405.17958, 2024

arXiv 2024

[60] [60]

Motionctrl: A unified and flexible motion controller for video generation

Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. InACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024

2024

[61] [61]

G. Wu, T. Yi, J. Fang, L. Xie, X. Zhang, W. Wei, W. Liu, Q. Tian, and W. Xinggang. 4d gaussian splatting for real-time dynamic scene rendering.arXiv preprint arXiv:2310.08528, 2023

arXiv 2023

[62] [62]

Geometry forcing: Marrying video diffusion and 3d representation for consistent world modeling. arXiv preprint arXiv:2507.07982, 2025

Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, and Jiang Bian. Geometry forcing: Marrying video diffusion and 3d representation for consistent world modeling. arXiv preprint arXiv:2507.07982, 2025

Pith/arXiv arXiv 2025

[63] [63]

Video world models with long-term spatial memory. arXiv preprint arXiv:2506.05284, 2025

Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, and Gordon Wetzstein. Video world models with long-term spatial memory. arXiv preprint arXiv:2506.05284, 2025

arXiv 2025

[64] [64]

Worldmem: Long-term consistent world simulation with memory. arXiv preprint arXiv:2504.12369, 2025

Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, and Xingang Pan. Worldmem: Long-term consistent world simulation with memory. arXiv preprint arXiv:2504.12369, 2025

arXiv 2025

[65] [65]

Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass

Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21924–21935, 2025

2025

[66] [66]

Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting. arXiv preprint arXiv:2310.10642, 2023

Zeyu Yang, Hongye Yang, Zijie Pan, Xiatian Zhu, and Li Zhang. Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting. arXiv preprint arXiv:2310.10642, 2023

arXiv 2023

[67] [67]

No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images. arXiv preprint arXiv:2410.24207, 2024

Botao Ye, Sifei Liu, Haofei Xu, Li Xueting, Marc Pollefeys, Ming-Hsuan Yang, and Peng Songyou. No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images. arXiv preprint arXiv:2410.24207, 2024

arXiv 2024

[68] [68]

gsplat: An open-source library for gaussian splatting

Vickie Ye, Ruilong Li, Justin Kerr, Matias Turkulainen, Brent Yi, Zhuoyang Pan, Otto Seiskari, Jianbo Ye, Jeffrey Hu, Matthew Tancik, and Angjoo Kanazawa. gsplat: An open-source library for gaussian splatting. Journal of Machine Learning Research, 26(34):1–17, 2025

2025

[69] [69]

Infinity-rope: Action-controllable infinite video generation emerges from autoregressive self-rollout. arXiv preprint arXiv:2511.20649, 2025

Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, and Pinar Yanardag. Infinity-rope: Action-controllable infinite video generation emerges from autoregressive self-rollout. arXiv preprint arXiv:2511.20649, 2025

arXiv 2025

[70] [70]

From slow bidirectional to fast autoregressive video diffusion models

Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22963–22974, 2025

2025

[71] [71]

Context as memory: Scene-consistent interactive long video generation with memory retrieval

Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–11, 2025

2025

[72] [72]

Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models. arXiv preprint arXiv:2503.05638, 2025

Mark YU, Wenbo Hu, Jinbo Xing, and Ying Shan. Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models. arXiv preprint arXiv:2503.05638, 2025

arXiv 2025

[73] [73]

ViewCrafter: taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048, 2024

Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. ViewCrafter: taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048, 2024

Pith/arXiv arXiv 2024

[74] [74]

Gaussian opacity fields: Efficient adaptive surface reconstruction in unbounded scenes. ACM Transactions on Graphics, 2024

Zehao Yu, Torsten Sattler, and Andreas Geiger. Gaussian opacity fields: Efficient adaptive surface reconstruction in unbounded scenes. ACM Transactions on Graphics, 2024

2024

[75] [75]

Transplat: Generalizable 3d gaussian splatting from sparse multi-view images with transformers. arXiv preprint arXiv:2408.13770, 2024

Chuanrui Zhang, Yingshuang Zou, Zhuoling Li, Minmin Yi, and Haoqian Wang. Transplat: Generalizable 3d gaussian splatting from sparse multi-view images with transformers. arXiv preprint arXiv:2408.13770, 2024

arXiv 2024

[76] [76]

Frame Context Packing and Drift Prevention in Next-Frame-Prediction Video Diffusion Models

Lvmin Zhang, Shengqu Cai, Muyang Li, Gordon Wetzstein, and Maneesh Agrawala. Frame Context Packing and Drift Prevention in Next-Frame-Prediction Video Diffusion Models. In NeurIPS, 2025

2025

[77] [77]

Matrix-game: Interactive world foundation model. arXiv preprint arXiv:2506.18701, 2025

Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, et al. Matrix-game: Interactive world foundation model. arXiv preprint arXiv:2506.18701, 2025

arXiv 2025

[78] [78]

Spatia: Video Generation with Updatable Spatial Memory. arXiv preprint arXiv:2512.15716, 2025

Jinjing Zhao, Fangyun Wei, Zhening Liu, Hongyang Zhang, Chang Xu, and Yan Lu. Spatia: Video Generation with Updatable Spatial Memory. arXiv preprint arXiv:2512.15716, 2025

arXiv 2025

[79] [79]

Gps-gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis

Shunyuan Zheng, Boyao Zhou, Ruizhi Shao, Boning Liu, Shengping Zhang, Liqiang Nie, and Yebin Liu. Gps-gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19680–19690, 2024

2024

[80] [80]

Omniworld: A multi-domain and multi-modal dataset for 4d world modeling. arXiv preprint arXiv:2509.12201, 2025

Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, Mingyu Liu, Dingning Liu, Jiange Yang, Zhoujie Fu, Junyi Chen, Chunhua Shen, Jiangmiao Pang, Kaipeng Zhang, and Tong He. Omniworld: A multi-domain and multi-modal dataset for 4d world modeling. arXiv preprint arXiv:2509.12201, 2025

arXiv 2025