arxiv: 2607.01202 · v1 · pith:VY5XOMATnew · submitted 2026-07-01 · 💻 cs.CV · cs.AI · cs.GR

World from Motion: Generative Dynamic Gaussian Reconstruction from Monocular Video

Pith reviewed 2026-07-02 13:19 UTC · model grok-4.3

classification 💻 cs.CV cs.AI cs.GR
keywords 4D reconstruction · dynamic Gaussian splatting · monocular video · generative video models · novel view synthesis · scene motion

The pith

A video model conditioned on 3D renderings produces consistent dynamic 3D Gaussians from monocular video.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper describes a method to generate dynamic 3D Gaussian representations from monocular videos that can be rendered from new viewpoints and times. The key step is to condition a generative video model on pixel-aligned renderings that include appearance, geometry, and motion information for both the original and new camera paths. The model uses this to fix typical monocular reconstruction problems and complete unseen parts of the scene. These outputs are then distilled into one unified 3D model. Readers might care if they want to create 4D content from everyday single-camera recordings.

Core claim

We condition a video model on dense, pixel-aligned renderings that encode appearance, geometry, and 3D scene motion along both input and target camera trajectories to correct rendering artifacts and fill in missing regions from an initial reconstruction. We construct a dataset of aligned multiview video pairs and dynamic 3DGS representations with simulated artifacts for training. At test time we distill the model's generations back into a single consistent high-quality dynamic 3DGS, improving novel-view synthesis and the underlying 3D motion. The method sets a new state of the art in 4D reconstruction and generalizes to in-the-wild videos.
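
As a rough illustration of what dense, pixel-aligned conditioning could look like, the sketch below concatenates per-pixel appearance (RGB), geometry (depth), and motion (a 3D flow vector) into one channel vector per pixel. The seven-channel layout, the function name `stack_conditioning`, and the use of plain nested lists are assumptions made for illustration; this summary does not specify the paper's actual encoding.

```python
def stack_conditioning(rgb, depth, flow):
    """Concatenate per-pixel RGB (3), depth (1), and 3D flow (3) into 7 channels.

    All inputs are H x W nested lists rendered from the same camera, so pixel
    (y, x) refers to the same ray in every map (this is the pixel alignment).
    The channel order is a hypothetical choice, not the paper's.
    """
    h, w = len(rgb), len(rgb[0])
    for name, m in (("depth", depth), ("flow", flow)):
        if len(m) != h or len(m[0]) != w:
            raise ValueError(f"{name} map must be {h}x{w}")
    return [
        [list(rgb[y][x]) + [depth[y][x]] + list(flow[y][x]) for x in range(w)]
        for y in range(h)
    ]


# Toy 1x2 frame rendered along one camera trajectory.
rgb = [[(0.1, 0.2, 0.3), (0.4, 0.5, 0.6)]]
depth = [[2.0, 3.5]]
flow = [[(0.0, 0.0, 0.1), (0.2, 0.0, 0.0)]]
cond = stack_conditioning(rgb, depth, flow)
```

Under this assumed layout, the same stacking would be applied to renderings along both the input and the target trajectory before they are passed to the video model.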

What carries the argument

A generative video model, conditioned on dense pixel-aligned renderings that encode appearance, geometry, and 3D motion, corrects artifacts and synthesizes missing regions before its outputs are distilled into consistent 3D Gaussians.

If this is right

  • The resulting 3DGS allows improved novel-view synthesis.
  • The underlying 3D motion estimates are improved.
  • The approach works on in-the-wild videos with large viewpoint changes.
  • It achieves state-of-the-art performance in 4D reconstruction tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This conditioning technique could be applied to other 3D reconstruction problems where consistency across views is needed.
  • Future work might explore using the method for longer videos or more complex motions.
  • It opens the possibility of combining generative models with explicit 3D representations for better controllability.

Load-bearing premise

The generative video model can reliably correct artifacts and synthesize missing regions in a manner that allows distillation to a single consistent 3D representation without new inconsistencies.

What would settle it

The claim would be falsified if, on a video with ground-truth multi-view captures, the full pipeline yields a 3DGS with larger rendering errors on held-out views than the initial monocular reconstruction.
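
The test described here can be sketched as a comparison of mean rendering error on held-out views. The snippet assumes mean squared error as the error measure and represents each view as a flat list of pixel values; the function names and the toy numbers are hypothetical and are not results from the paper.

```python
def mse(pred, target):
    """Mean squared error between two equally sized flat lists of pixel values."""
    if len(pred) != len(target) or not pred:
        raise ValueError("views must be non-empty and the same size")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mean_error(views, gt_views):
    """Average per-view MSE over all held-out views."""
    errors = [mse(v, g) for v, g in zip(views, gt_views)]
    return sum(errors) / len(errors)


def claim_falsified(initial_views, refined_views, gt_views):
    """True if the refined 3DGS renders held-out views worse than the initial one."""
    return mean_error(refined_views, gt_views) > mean_error(initial_views, gt_views)


# Toy data: two held-out views of two pixels each (made-up values).
gt = [[0.5, 0.5], [0.2, 0.8]]
initial = [[0.4, 0.6], [0.1, 0.9]]
refined = [[0.5, 0.45], [0.2, 0.8]]
result = claim_falsified(initial, refined, gt)
```

A real evaluation would more likely report PSNR, SSIM, or LPIPS on a benchmark with held-out cameras; MSE is used here only to keep the comparison self-contained.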

Figures

Figures reproduced from arXiv: 2607.01202 by Alex Trevithick, Amrita Mazumdar, Gordon Wetzstein, Iro Armeni, Liyuan Zhu, Shalini De Mello, Shengyu Huang, Tianye Li, Zan Gojcic.

Figure 1: World from Motion reconstructs a dynamic 3DGS world from the camera and scene motion in a single monocular video. From an input video and an initial reconstruction produced by MoSca [23], our video model generates novel views that are distilled into a refined reconstruction. Our method faithfully recovers observed structure and synthesizes plausible novel-view dynamics, in turn improving the underlying sce…
Figure 2: Qualitative results on challenging in-the-wild videos. Compared to the input reconstruction [23], our method performs visual outpainting, fixes degraded dynamics, and infers out-of-frustum dynamics while respecting the static region.
Figure 3: Overview of 4D reconstruction with World from Motion. Given a monocular video, our method first generates an initial 4D reconstruction of the input video. Along the target trajectory, our method then renders the corresponding appearance, geometry and motion to condition a video generative model. The generated samples are used to create a higher-quality 4D reconstruction.
Figure 4: Qualitative comparison on DyCheck [12]. Each row compares the predicted test view from each method with the ground-truth frame. Our method produces sharper details and consistency.
Figure 5: Ablation study of video model components (top) and re-optimization process (bottom).
Original abstract

We present World from Motion, a method for generating freely renderable dynamic 3D Gaussian representations from monocular videos. Our approach conditions a video model on dense, pixel-aligned renderings that encode appearance, geometry, and 3D scene motion along both input and target camera trajectories to correct rendering artifacts and fill in missing regions from an initial reconstruction. To train this model, we construct a dataset of aligned multiview video pairs and dynamic 3DGS representations, with simulated artifacts characteristic of monocular reconstruction. At test time, we distill the model's generations, including newly observed regions and motions, back into a single consistent, high-quality dynamic 3DGS, improving both novel-view synthesis and the underlying 3D motion. Our method sets a new state of the art in 4D reconstruction and seamlessly generalizes to in-the-wild videos with large viewpoint changes and dynamic motions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents World from Motion, a method for generating freely renderable dynamic 3D Gaussian representations from monocular videos. It conditions a video model on dense pixel-aligned renderings of appearance, geometry, and 3D scene motion along input and target trajectories to correct monocular artifacts and synthesize missing regions. The model is trained on a constructed dataset of aligned multiview video pairs and dynamic 3DGS with simulated artifacts; at test time, generations are distilled into a single consistent dynamic 3DGS via optimization, claiming new state-of-the-art performance in 4D reconstruction that generalizes to in-the-wild videos with large viewpoint changes and dynamic motions.

Significance. If the distillation step reliably produces consistent 3D representations, the approach could meaningfully advance monocular 4D reconstruction by integrating generative video priors with explicit 3D Gaussian optimization, improving novel-view synthesis and motion recovery from casual videos beyond current baselines.

major comments (1)
  1. [§4.3] The distillation is described as an optimization that incorporates new observations and motions from the conditioned generative model, but the manuscript provides no analysis of how conflicting motion fields or appearance hallucinations across overlapping target trajectories are reconciled. The optimization appears to rely only on standard rendering losses without an explicit consistency regularizer, which directly bears on whether a single high-quality dynamic 3DGS can be obtained without residual artifacts or over-smoothing.
minor comments (1)
  1. The description of the training dataset construction (aligned multiview pairs with simulated monocular artifacts) would benefit from additional detail on how the simulated artifacts are generated to ensure they match real monocular reconstruction failures.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and the specific comment on the distillation procedure in §4.3. We address the concern point-by-point below.

Point-by-point responses
  1. Referee: [§4.3] The distillation is described as an optimization that incorporates new observations and motions from the conditioned generative model, but the manuscript provides no analysis of how conflicting motion fields or appearance hallucinations across overlapping target trajectories are reconciled. The optimization appears to rely only on standard rendering losses without an explicit consistency regularizer, which directly bears on whether a single high-quality dynamic 3DGS can be obtained without residual artifacts or over-smoothing.

    Authors: We agree that the manuscript does not include an explicit analysis or ablation of conflict reconciliation across overlapping trajectories. The distillation optimizes a single dynamic 3DGS by minimizing rendering losses against the generated frames (appearance, depth, and motion) produced by the conditioned video model. Because the generative model is itself conditioned on pixel-aligned renderings of the initial 3DGS along both input and target trajectories, the generated outputs inherit a degree of geometric and motion coherence from the shared 3D prior; this implicit coupling is what allows the subsequent optimization to converge to a single representation. Nevertheless, we did not quantify how often or how severely conflicting motion fields arise, nor did we introduce or evaluate an explicit consistency regularizer. In the revision we will add (i) a description of the exact loss terms used in the distillation, (ii) an analysis of trajectory overlap and any observed inconsistencies, and (iii) a short ablation that measures the effect of adding a simple motion-consistency term. We therefore mark this comment as requiring a revision. revision: yes
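
To make the exchange concrete, the sketch below shows one way a distillation objective of this kind could be written: a weighted sum of L1 rendering losses on appearance, depth, and motion, plus a separate measure of how strongly two overlapping trajectories disagree about motion. The weights, the dictionary keys, and both function names are assumptions for illustration; the manuscript's exact loss terms are not given in this review.

```python
def l1(a, b):
    """Mean absolute difference between two equally sized flat lists."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def distillation_loss(render, target, w_rgb=1.0, w_depth=0.5, w_motion=0.5):
    """Weighted L1 rendering loss against generated frames (weights are hypothetical)."""
    return (
        w_rgb * l1(render["rgb"], target["rgb"])
        + w_depth * l1(render["depth"], target["depth"])
        + w_motion * l1(render["motion"], target["motion"])
    )


def motion_conflict(motion_a, motion_b):
    """Disagreement between motion targets that two trajectories give for the same points."""
    return l1(motion_a, motion_b)


# Toy values for a two-pixel rendering and its generated target.
render = {"rgb": [0.5, 0.5], "depth": [2.0, 2.0], "motion": [0.1, 0.0]}
target = {"rgb": [0.5, 0.7], "depth": [2.0, 3.0], "motion": [0.1, 0.2]}
loss = distillation_loss(render, target)
conflict = motion_conflict([0.1, 0.2], [0.1, 0.0])
```

A quantity like `motion_conflict` could serve either as the diagnostic promised in item (ii) or, added to the objective with its own weight, as the simple motion-consistency term of item (iii).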

Circularity Check

0 steps flagged

No circularity: method pipeline relies on external data and optimization without self-referential reductions

full rationale

The provided abstract and method outline describe a standard generative pipeline: constructing a training dataset of multiview pairs with simulated monocular artifacts, conditioning a video model on pixel-aligned renderings of appearance/geometry/motion, and distilling outputs into a dynamic 3DGS via optimization. No equations, fitted parameters renamed as predictions, or self-citation load-bearing steps are visible that would reduce any claimed result to its inputs by construction. The SOTA claim is presented as an empirical outcome of the pipeline rather than a mathematical identity, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No technical details on parameters, axioms, or new entities are provided in the abstract; the dataset construction and conditioning mechanism are described at a high level only.

pith-pipeline@v0.9.1-grok · 5718 in / 1283 out tokens · 27607 ms · 2026-07-02T13:19:46.368744+00:00 · methodology


Reference graph

Works this paper leans on

70 extracted references · 17 canonical work pages · 10 internal anchors

  1. [1]

    Lyra: Generative 3d scene reconstruction via self-distillation with video diffusion models

    Sherwin Bahmani, Tianchang Shen, Jiawei Ren, Jiahui Huang, Yifeng Jiang, Haithem Turki, Andrea Tagliasacchi, David B. Lindell, Zan Gojcic, Sanja Fidler, Huan Ling, Jun Gao, and Xuanchi Ren. Lyra: Generative 3d scene reconstruction via self-distillation with video diffusion models. In ICLR, 2026

  2. [2]

    Recammaster: Camera-controlled generative rendering from a single video

    Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. In ICCV, 2025

  3. [3]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, 2023

  4. [4]

    Recovering non-rigid 3d shape from image streams

    Christoph Bregler, Aaron Hertzmann, and Henning Biermann. Recovering non-rigid 3d shape from image streams. In CVPR, 2000

  5. [5]

    SAM 3: Segment Anything with Concepts

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...

  6. [6]

    A solution to the simultaneous localization and map building (slam) problem

    M.W.M.G. Dissanayake, P. Newman, S. Clark, H.F. Durrant-Whyte, and M. Csorba. A solution to the simultaneous localization and map building (slam) problem. IEEE Transactions on Robotics and Automation, 17(3):229–241, 2001. doi: 10.1109/70.938381

  7. [7]

    Tapir: Tracking any point with per-frame initialization and temporal refinement

    Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, and Andrew Zisserman. Tapir: Tracking any point with per-frame initialization and temporal refinement. In ICCV, 2023

  8. [8]

    BootsTAP: Bootstrapped training for tracking-any-point

    Carl Doersch, Pauline Luc, Yi Yang, Dilara Gokay, Skanda Koppula, Ankush Gupta, Joseph Heyward, Ignacio Rocco, Ross Goroshin, João Carreira, and Andrew Zisserman. BootsTAP: Bootstrapped training for tracking-any-point. ACCV, 2024

  9. [9]

    Fast dynamic radiance fields with time-aware neural voxels

    Jiemin Fang, Taoran Yi, Xinggang Wang, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Matthias Nießner, and Qi Tian. Fast dynamic radiance fields with time-aware neural voxels. In SIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022

  10. [10]

    Flowr: Flowing from sparse to dense 3d reconstructions

    Tobias Fischer, Samuel Rota Bulò, Yung-Hsu Yang, Nikhil Keetha, Lorenzo Porzi, Norman Müller, Katja Schwarz, Jonathon Luiten, Marc Pollefeys, and Peter Kontschieder. Flowr: Flowing from sparse to dense 3d reconstructions. In ICCV, 2025

  11. [11]

    Plenoptic video generation

    Xiao Fu, Shitao Tang, Min Shi, Xian Liu, Jinwei Gu, Ming-Yu Liu, Dahua Lin, and Chen-Hsuan Lin. Plenoptic video generation. In CVPR, 2026

  12. [12]

    Monocular dynamic view synthesis: A reality check

    Hang Gao, Ruilong Li, Shubham Tulsiani, Bryan Russell, and Angjoo Kanazawa. Monocular dynamic view synthesis: A reality check. NeurIPS, 2022

  13. [13]

    Veo: A text-to-video generation system

    Google DeepMind. Veo: A text-to-video generation system. Technical report, Google DeepMind, 2025. URL https://storage.googleapis.com/deepmind-media/veo/Veo-3-Tech-Report.pdf

  14. [14]

    Multiple view geometry in computer vision

    Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003

  15. [15]

    CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024

  16. [16]

    Classifier-free diffusion guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. URL https://openreview.net/forum?id=qw8AKxfYbI

  17. [17]

    ViPE: Video Pose Engine for 3D Geometric Perception

    Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, Jiawei Ren, Kevin Xie, Joydeep Biswas, Laura Leal-Taixé, and Sanja Fidler. ViPE: Video pose engine for 3d geometric perception. arXiv preprint arXiv:2508.10934, 2025

  18. [18]

    Vivid4d: Improving 4d reconstruction from monocular video by video inpainting

    Jiaxin Huang, Sheng Miao, Bangbang Yang, Yuewen Ma, and Yiyi Liao. Vivid4d: Improving 4d reconstruction from monocular video by video inpainting. In ICCV, 2025

  19. [19]

    Vace: All-in-one video creation and editing

    Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. In ICCV, 2025

  20. [20]

    Cotracker: It is better to track together

    Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker: It is better to track together. In ECCV, 2024

  21. [21]

    Any4D: Unified feed-forward metric 4d reconstruction

    Jay Karhade, Nikhil Keetha, Yuchen Zhang, Tanisha Gupta, Akash Sharma, Sebastian Scherer, and Deva Ramanan. Any4D: Unified feed-forward metric 4d reconstruction. In CVPR, 2026

  22. [22]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4), 2023

  23. [23]

    Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds

    Jiahui Lei, Yijia Weng, Adam W Harley, Leonidas Guibas, and Kostas Daniilidis. Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds. In CVPR, 2025

  24. [24]

    Vmem: Consistent interactive video scene generation with surfel-indexed view memory

    Runjia Li, Philip Torr, Andrea Vedaldi, and Tomas Jakab. Vmem: Consistent interactive video scene generation with surfel-indexed view memory. In ICCV, 2025

  25. [25]

    Neural scene flow fields for space-time view synthesis of dynamic scenes

    Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6498–6508, 2021

  26. [26]

    MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos

    Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos. In CVPR, 2025

  27. [27]

    Movies: Motion-aware 4d dynamic view synthesis in one second

    Chenguo Lin, Yuchen Lin, Panwang Pan, Yifan Yu, Tao Hu, Honglei Yan, Katerina Fragkiadaki, and Yadong Mu. Movies: Motion-aware 4d dynamic view synthesis in one second. In CVPR, 2026

  28. [28]

    Depth anything 3: Recovering the visual space from any views

    Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. In ICLR, 2026

  29. [29]

    Vista4D: Video Reshooting with 4D Point Clouds

    Kuan Heng Lin, Zhizheng Liu, Pablo Salamanca, Yash Kant, Ryan Burgert, Yuancheng Xu, Koichi Namekata, Yiwei Zhao, Bolei Zhou, Micah Goldblum, et al. Vista4d: Video reshooting with 4d point clouds. arXiv preprint arXiv:2604.21915, 2026

  30. [30]

    3dgs-enhancer: Enhancing unbounded 3d gaussian splatting with view-consistent 2d diffusion priors

    Xi Liu, Chaoyi Zhou, and Siyu Huang. 3dgs-enhancer: Enhancing unbounded 3d gaussian splatting with view-consistent 2d diffusion priors. In NeurIPS, 2024

  31. [31]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

  32. [32]

    Robust dynamic radiance fields

    Yu-Lun Liu, Chen Gao, Andreas Meuleman, Hung-Yu Tseng, Ayush Saraf, Changil Kim, Yung-Yu Chuang, Johannes Kopf, and Jia-Bin Huang. Robust dynamic radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13–23, 2023

  33. [33]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  34. [34]

    Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis

    Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. In 3DV, 2024

  35. [35]

    SDEdit: Guided image synthesis and editing with stochastic differential equations

    Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=aBsCjcPu_tE

  36. [36]

    Orb-slam: A versatile and accurate monocular slam system

    Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: A versatile and accurate monocular slam system. IEEE transactions on robotics, 31(5):1147–1163, 2015

  37. [37]

    Vidar: Video diffusion-aware 4d reconstruction from monocular inputs

    Michal Nazarczuk, Sibi Catley-Chandar, Thomas Tanay, Zhensong Zhang, Gregory Slabaugh, and Eduardo Pérez-Pellitero. Vidar: Video diffusion-aware 4d reconstruction from monocular inputs. arXiv preprint arXiv:2506.18792, 2025

  38. [38]

    Nerfies: Deformable neural radiance fields

    Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5865–5874, 2021

  39. [39]

    Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields

    Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin-Brualla, and Steven M Seitz. Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields. arXiv preprint arXiv:2106.13228, 2021

  40. [40]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023

  41. [41]

    DreamFusion: Text-to-3D using 2D Diffusion

    Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022

  42. [42]

    D-nerf: Neural radiance fields for dynamic scenes

    Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10318–10327, 2021

  43. [43]

    Gen3c: 3d-informed world-consistent video generation with precise camera control

    Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world-consistent video generation with precise camera control. In CVPR, 2025

  44. [44]

    Seyedmorteza Sadat, Otmar Hilliges, and Romann M. Weber. Eliminating oversaturation and artifacts of high guidance scales in diffusion models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=e2ONKX6qzJ

  45. [45]

    Structure-from-motion revisited

    Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016

  46. [46]

    Dynamic gaussian marbles for novel view synthesis of casual monocular videos

    Colton Stearns, Adam Harley, Mikaela Uy, Florian Dubost, Federico Tombari, Gordon Wetzstein, and Leonidas Guibas. Dynamic gaussian marbles for novel view synthesis of casual monocular videos. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024

  47. [47]

    Dreamcraft3d: Hierarchical 3d generation with bootstrapped diffusion prior

    Jingxiang Sun, Bo Zhang, Ruizhi Shao, Lizhen Wang, Wen Liu, Zhenda Xie, and Yebin Liu. Dreamcraft3d: Hierarchical 3d generation with bootstrapped diffusion prior. arXiv preprint arXiv:2310.16818, 2023

  48. [48]

    Raft: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In ECCV. Springer, 2020

  49. [49]

    Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras

    Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. NeurIPS, 2021

  50. [50]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

  51. [51]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In CVPR, 2025

  52. [52]

    Shape of motion: 4d reconstruction from a single video

    Qianqian Wang, Vickie Ye, Hang Gao, Weijia Zeng, Jake Austin, Zhengqi Li, and Angjoo Kanazawa. Shape of motion: 4d reconstruction from a single video. In ICCV, 2025

  53. [53]

    Worldtree: Towards 4d dynamic worlds from monocular video using tree-chains

    Qisen Wang, Yifan Zhao, and Jia Li. Worldtree: Towards 4d dynamic worlds from monocular video using tree-chains. arXiv preprint arXiv:2602.11845, 2026

  54. [54]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20697–20709, 2024

  55. [55]

    Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation

    Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. Advances in neural information processing systems, 36:8406–8441, 2023

  56. [56]

    Difix3d+: Improving 3d reconstructions with single-step diffusion models

    Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, and Huan Ling. Difix3d+: Improving 3d reconstructions with single-step diffusion models. In CVPR, 2025

  57. [57]

    Reconfusion: 3d reconstruction with diffusion priors

    Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P Srinivasan, Dor Verbin, Jonathan T Barron, Ben Poole, et al. Reconfusion: 3d reconstruction with diffusion priors. In CVPR, 2024

  58. [58]

    Cat4d: Create anything in 4d with multi-view video diffusion models

    Rundi Wu, Ruiqi Gao, Ben Poole, Alex Trevithick, Changxi Zheng, Jonathan T Barron, and Aleksander Holynski. Cat4d: Create anything in 4d with multi-view video diffusion models. In CVPR, 2025

  59. [59]

    Video world models with long-term spatial memory

    Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, and Gordon Wetzstein. Video world models with long-term spatial memory. arXiv preprint arXiv:2506.05284, 2025

  60. [60]

    Spatialtrackerv2: Advancing 3d point tracking with explicit camera motion

    Yuxi Xiao, Jianyuan Wang, Nan Xue, Nikita Karaev, Yuri Makarov, Bingyi Kang, Xing Zhu, Hujun Bao, Yujun Shen, and Xiaowei Zhou. Spatialtrackerv2: Advancing 3d point tracking with explicit camera motion. In ICCV, 2025

  61. [61]

    4dgt: Learning a 4d gaussian transformer using real-world monocular videos

    Zhen Xu, Zhengqin Li, Zhao Dong, Xiaowei Zhou, Richard Newcombe, and Zhaoyang Lv. 4dgt: Learning a 4d gaussian transformer using real-world monocular videos. In NeurIPS, 2025

  62. [62]

    Neoverse: Enhancing 4d world model with in-the-wild monocular videos

    Yuxue Yang, Lue Fan, Ziqi Shi, Junran Peng, Feng Wang, and Zhaoxiang Zhang. Neoverse: Enhancing 4d world model with in-the-wild monocular videos. CVPR, 2026

  63. [63]

    Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models

    Mark Yu, Wenbo Hu, Jinbo Xing, and Ying Shan. Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models. In ICCV, 2025

  64. [64]

    ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

    Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048, 2024

  65. [65]

    MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825, 2024

  66. [66]

    Wildgs-slam: Monocular gaussian splatting slam in dynamic environments

    Jianhao Zheng, Zihan Zhu, Valentin Bieri, Marc Pollefeys, Songyou Peng, and Armeni Iro. Wildgs-slam: Monocular gaussian splatting slam in dynamic environments. In CVPR, 2025

  67. [67]

    Dynpoint: Dynamic neural point for view synthesis

    Kaichen Zhou, Jia-Xing Zhong, Sangyun Shin, Kai Lu, Yiyuan Yang, Andrew Markham, and Niki Trigoni. Dynpoint: Dynamic neural point for view synthesis. Advances in Neural Information Processing Systems, 36:69532–69545, 2023

  68. [68]

    Page-4d: Disentangled pose and geometry estimation for 4d perception

    Kaichen Zhou, Yuhan Wang, Grace Chen, Xinhai Chang, Gaspard Beaudouin, Fangneng Zhan, Paul Pu Liang, and Mengyu Wang. Page-4d: Disentangled pose and geometry estimation for 4d perception. arXiv e-prints, 2025

  69. [69]

    Sparsefusion: Distilling view-conditioned diffusion for 3d reconstruction

    Zhizhuo Zhou and Shubham Tulsiani. Sparsefusion: Distilling view-conditioned diffusion for 3d reconstruction. In CVPR, 2023

  70. [70]

    Gaussfusion: Improving 3d reconstruction in the wild with a geometry-informed video generator

    Liyuan Zhu, Manjunath Narayana, Michal Stary, Will Hutchcroft, Gordon Wetzstein, and Iro Armeni. Gaussfusion: Improving 3d reconstruction in the wild with a geometry-informed video generator. arXiv preprint arXiv:2603.25053, 2026