
arXiv: 2506.18601 · v2 · submitted 2025-06-23 · 💻 cs.GR · cs.AI · cs.CV · cs.LG

BulletGen: Improving 4D Reconstruction with Bullet-Time Generation

Pith reviewed 2026-05-19 07:59 UTC · model grok-4.3

classification 💻 cs.GR · cs.AI · cs.CV · cs.LG
keywords 4D reconstruction · dynamic scenes · Gaussian splatting · diffusion models · novel view synthesis · video generation · monocular video · scene completion

The pith

BulletGen improves 4D reconstructions from monocular videos by aligning diffusion-generated frames at one frozen bullet-time step to supervise Gaussian optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BulletGen as a way to turn single-camera videos of moving scenes into more complete and accurate 4D models. It freezes the current 4D Gaussian reconstruction at one moment and uses a diffusion video model to generate matching frames that fill gaps and fix errors. Those generated frames then guide further optimization of the 4D model. The approach blends the new generative content with both the static background and moving objects without breaking consistency. This leads to stronger results on creating new viewpoints and on tracking points in 2D and 3D.

Core claim

BulletGen aligns the output of a diffusion-based video generation model with the 4D reconstruction at a single frozen bullet-time step. The generated frames are then used to supervise the optimization of the 4D Gaussian model, seamlessly blending generative content with both static and dynamic scene components.

What carries the argument

The bullet-time alignment step that matches diffusion-generated video frames to the existing 4D Gaussian reconstruction at one chosen frozen time to provide additional supervision signals.
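
To make the supervision idea concrete, here is a toy sketch, not the paper's implementation. It assumes a hypothetical weighted squared loss in which real video frames and diffusion-generated frames both constrain the same scene parameters; the names `fit_scene`, `real_obs`, `gen_obs`, and the weight `lam` are illustrative and do not come from the paper. Each list entry stands in for a scene region: regions seen in the input video get real targets, unseen regions get only generated targets.

```python
def fit_scene(real_obs, gen_obs, lam=0.5, lr=0.1, steps=200):
    """Gradient descent on a two-term squared loss (toy stand-in for 4D optimization).

    real_obs / gen_obs map a region index to a target value.
    lam is an assumed weight on the generated-frame term.
    """
    n = max(list(real_obs) + list(gen_obs), default=-1) + 1
    x = [0.0] * n
    for _ in range(steps):
        grad = [0.0] * n
        # Term 1: agreement with the real input frames.
        for i, target in real_obs.items():
            grad[i] += 2.0 * (x[i] - target)
        # Term 2: agreement with the generated frames, down-weighted by lam.
        for i, target in gen_obs.items():
            grad[i] += 2.0 * lam * (x[i] - target)
        x = [xi - lr * g for xi, g in zip(x, grad)]
    return x


# Region 0: real only. Region 1: both. Region 2: generated only.
params = fit_scene({0: 1.0, 1: 2.0}, {1: 2.4, 2: 3.0}, lam=0.5)
```

In this toy setup, region 2 is never observed, so the generated target alone determines it, while region 1 settles between the real and generated targets in proportion to `lam`.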

If this is right

  • Better novel-view synthesis for dynamic scenes from casual monocular input.
  • Improved accuracy on both 2D and 3D tracking tasks.
  • More reliable handling of unseen regions and monocular depth ambiguities.
  • Seamless integration of generative content into both static and moving parts of the scene.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alignment idea could be tested on other dynamic representations such as neural radiance fields or mesh-based models.
  • Extending the method to multiple bullet-time steps might reduce drift over long video sequences.
  • If generation speed improves, the approach could support online refinement during capture.
  • Similar supervision could help correct other monocular reconstruction failures like those in structure-from-motion pipelines.

Load-bearing premise

The diffusion video model can generate frames that match the current 4D Gaussian reconstruction at the chosen bullet-time step without adding new inconsistencies or artifacts that damage the overall optimization.

What would settle it

A side-by-side comparison of final 4D models trained with and without the bullet-time generated supervision, measured by how well novel views match held-out real frames or how accurately 3D tracks follow ground-truth motion.
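
The comparison described above can be scored with PSNR, which appears among the metrics reported with the paper's figures. The following is a minimal sketch, assuming frames are flattened lists of pixel intensities in [0, 1]; the helper names `psnr` and `psnr_gain` are hypothetical, and loading rendered and held-out frames is left out.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    if len(pred) != len(target) or not pred:
        raise ValueError("frames must be non-empty and the same size")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_val ** 2 / mse)


def psnr_gain(render_with_gen, render_without_gen, held_out):
    """PSNR difference in dB; positive means generated supervision helped."""
    return psnr(render_with_gen, held_out) - psnr(render_without_gen, held_out)
```

Averaging `psnr_gain` over all held-out test views would give one number for the with/without comparison; a tracking metric would need a separate, analogous helper.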

Figures

Figures reproduced from arXiv: 2506.18601 by Denis Rozumny, Johannes Schönberger, Jonathon Luiten, Numair Khan, Peter Kontschieder.

Figure 1: Extreme novel view synthesis of a 4D scene with generative model guidance in frozen-time instances (bullet times). The input is only a monocular video on the left. We compare to the Shape-of-Motion (SoM) [73] method on the cat and dog sequences from the iPhone dataset [14] and the skating sequence from Nvidia dataset [87]. Our method, termed BulletGen, uses a video diffusion model to generate novel views f…

Figure 2: BulletGen architecture. Starting from a monocular RGB video, we reconstruct the dynamic scene with Shape-of-Motion [73] given data-driven priors (motion masks, depths, long-term 2D tracks). Then, we generate novel views at selected frozen timesteps (bullet times) using a conditioned generative model. These generated views are localized and mapped to the current scene using an optimization based on photomet…

Figure 3: Extreme novel view synthesis across space and time. The generation for training was performed nG = 7 times for nS = 5 bullet-time stamps, i.e. 1, 45, 90, 135, 180 for a sequence with length of 180 frames. The renderings shown here are at time stamps and viewpoints that were not generated using a generative model, which shows that using only several bullet-time reconstructions is enough to reconstruct a dyn…

Figure 4: Qualitative evaluation on several benchmark datasets. The input is a monocular video as shown in the training view column. Both benchmark datasets (Nvidia [87] and DyCheck iPhone [14]) have several additional testing cameras. We compare novel view synthesis to Shape-of-Motion (SoM) [73] and zoom-in to highlight the differences. Our method is able to provide more accurate and sharper reconstructions, both i…

Figure 5: Temporal plane slices of extreme camera views. Visualizing the highlighted rows across time in xt space shows that our method suffers from fewer temporal artifacts than Shape-of-Motion (SoM) caused by floating Gaussians and exploding geometry. Our rendered view is shown on the left.

Method   PSNR↑   SSIM↑   LPIPS↓   CLIP-I↑
SoM      15.26   0.454   0.388    0.87
Ours     17.02   0.462   0.386    0.87
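
The Figure 3 caption lists five bullet-time stamps (1, 45, 90, 135, 180) for a 180-frame sequence. The captions do not state how the stamps were chosen, so the sketch below is only one assumed rule that happens to reproduce them: evenly spaced 1-based frame indices, rounded down. The function name `bullet_time_stamps` is hypothetical.

```python
def bullet_time_stamps(num_frames, num_stamps):
    """Evenly spaced 1-based frame indices, first and last frame included.

    Integer floor division keeps the result exact (no float rounding).
    """
    if num_stamps < 2 or num_frames < num_stamps:
        raise ValueError("need at least 2 stamps and no more stamps than frames")
    span = num_frames - 1
    return [1 + (i * span) // (num_stamps - 1) for i in range(num_stamps)]


print(bullet_time_stamps(180, 5))  # → [1, 45, 90, 135, 180]
```

Other spacing rules, or stamps picked by reconstruction quality, would be equally consistent with the caption; this version just makes the even spacing explicit.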
Abstract

Transforming casually captured, monocular videos into fully immersive dynamic experiences is a highly ill-posed task, and comes with significant challenges, e.g., reconstructing unseen regions, and dealing with the ambiguity in monocular depth estimation. In this work we introduce BulletGen, an approach that takes advantage of generative models to correct errors and complete missing information in a Gaussian-based dynamic scene representation. This is done by aligning the output of a diffusion-based video generation model with the 4D reconstruction at a single frozen "bullet-time" step. The generated frames are then used to supervise the optimization of the 4D Gaussian model. Our method seamlessly blends generative content with both static and dynamic scene components, achieving state-of-the-art results on both novel-view synthesis, and 2D/3D tracking tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BulletGen, a method that improves 4D Gaussian-based dynamic scene reconstruction from monocular videos by conditioning a pre-trained diffusion video generation model on a single frozen bullet-time step extracted from the current reconstruction. The generated frames are used as additional supervision signals to optimize the 4D model, with the aim of correcting reconstruction errors and completing missing information in unseen regions while handling monocular depth ambiguity. The authors claim this yields state-of-the-art results on novel-view synthesis and 2D/3D tracking tasks.

Significance. If the generated frames maintain geometric and temporal consistency with the underlying 4D structure, the approach could provide an effective way to inject generative priors into reconstruction pipelines, addressing key ill-posed aspects of casual video capture. The single-step bullet-time conditioning offers a computationally lightweight integration point between diffusion models and Gaussian representations.

major comments (2)
  1. [§3.2] §3.2 (Bullet-time conditioning): The method description provides no explicit consistency loss, reprojection check, or warping verification between the diffusion-generated frames and the 4D Gaussian splats at the chosen bullet-time step. Because the initial reconstruction is incomplete due to monocular ambiguities, any misalignment would be directly incorporated as pseudo-ground-truth during optimization, undermining the central claim of seamless blending without new artifacts.
  2. [§5] §5 (Experiments): The reported state-of-the-art performance on novel-view synthesis and tracking lacks accompanying ablation studies isolating the contribution of the generative supervision versus a baseline 4D Gaussian optimization without bullet-time generation. This makes it difficult to assess whether the claimed improvements are load-bearing on the proposed alignment mechanism.
minor comments (2)
  1. [Abstract] The abstract asserts quantitative superiority without referencing specific metrics, datasets, or baseline comparisons; moving a concise summary of key numbers to the abstract would improve accessibility.
  2. [§2] Notation for the 4D Gaussian parameters (e.g., distinguishing time-dependent deformation fields from static attributes) could be introduced earlier in §2 for clearer reading.
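
A reprojection check of the kind major comment 1 asks about could look like the sketch below. This is an illustrative assumption, not something the manuscript describes: it supposes that 3D Gaussian centers are already expressed in the generated view's camera frame, that matching 2D locations in the generated frame come from some external matcher, and that a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy` applies. The function names and the pixel threshold are hypothetical.

```python
import math


def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point; None if behind the camera."""
    x, y, z = point
    if z <= 0.0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def mean_reprojection_error(points, matches, fx, fy, cx, cy):
    """Mean pixel distance between projected points and their matched 2D locations."""
    errors = []
    for point, (u, v) in zip(points, matches):
        uv = project(point, fx, fy, cx, cy)
        if uv is None:
            continue  # skip points that do not project into the view
        errors.append(math.hypot(uv[0] - u, uv[1] - v))
    if not errors:
        raise ValueError("no points project into the view")
    return sum(errors) / len(errors)


def accept_generated_frame(points, matches, fx, fy, cx, cy, max_error_px=2.0):
    """Keep a generated frame as supervision only if it reprojects well enough."""
    return mean_reprojection_error(points, matches, fx, fy, cx, cy) <= max_error_px
```

A gate like `accept_generated_frame` would let misaligned generations be discarded before they are used as pseudo-ground-truth, which is the failure mode the comment raises.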

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of the potential impact of BulletGen. We address each major comment point by point below, providing clarifications from the manuscript and indicating where we will revise the text for the next version.

  1. Referee: [§3.2] §3.2 (Bullet-time conditioning): The method description provides no explicit consistency loss, reprojection check, or warping verification between the diffusion-generated frames and the 4D Gaussian splats at the chosen bullet-time step. Because the initial reconstruction is incomplete due to monocular ambiguities, any misalignment would be directly incorporated as pseudo-ground-truth during optimization, undermining the central claim of seamless blending without new artifacts.

    Authors: We appreciate the referee raising this point about potential misalignment. In the BulletGen pipeline, a single frame is extracted directly from the current 4D Gaussian reconstruction at the chosen bullet-time step and supplied as the conditioning input to the pre-trained video diffusion model. This conditioning anchors the entire generated sequence to the geometry and appearance present in the reconstruction at that instant. The diffusion model, having been trained on large-scale video data, produces frames that respect the provided conditioning while synthesizing plausible motion and content for other viewpoints and times. Although the manuscript does not introduce an additional explicit consistency or reprojection loss term (the generated frames serve as direct supervision), the iterative optimization loop—updating the 4D model and re-selecting a new bullet-time step—allows progressive refinement. To address the concern explicitly, we will expand the description in §3.2 to clarify the role of the conditioning mechanism in maintaining alignment at the bullet-time step and will include qualitative examples in the supplement showing the match between the conditioning frame and the generated video outputs. revision: yes

  2. Referee: [§5] §5 (Experiments): The reported state-of-the-art performance on novel-view synthesis and tracking lacks accompanying ablation studies isolating the contribution of the generative supervision versus a baseline 4D Gaussian optimization without bullet-time generation. This makes it difficult to assess whether the claimed improvements are load-bearing on the proposed alignment mechanism.

    Authors: We agree that an ablation isolating the generative supervision would strengthen the experimental section. The current results compare BulletGen against prior 4D reconstruction methods, but do not include a direct head-to-head with a 4D Gaussian baseline that omits the bullet-time diffusion component. In the revised manuscript we will add this ablation study, reporting novel-view synthesis and tracking metrics for both the full model and the baseline without generative supervision. This will allow readers to quantify the contribution of the proposed alignment and supervision mechanism. revision: yes

Circularity Check

0 steps flagged

No significant circularity in BulletGen derivation chain


The paper describes a method that conditions an external pre-trained diffusion video model on a frozen bullet-time step from an existing 4D Gaussian reconstruction, then uses the generated frames as supervision to refine the Gaussian model. This chain depends on independent components (pre-trained diffusion models and prior Gaussian splatting representations) rather than any self-referential fitting, self-citation for uniqueness, or redefinition of inputs as outputs. No equations or steps reduce predictions to inputs by construction, and the approach remains falsifiable against external benchmarks such as novel-view synthesis metrics and tracking accuracy on held-out data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are described. The method builds on existing pre-trained diffusion models and Gaussian scene representations from prior literature without introducing new postulated entities.



Reference graph

Works this paper leans on

96 extracted references · 96 canonical work pages · 4 internal anchors

  1. [1]

    Hexplane: A fast representation for dynamic scenes

    Ang Cao and Justin Johnson. Hexplane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 130–141, 2023

  2. [2]

    A survey on generative diffusion models

    Hanqun Cao, Cheng Tan, Zhangyang Gao, Yilun Xu, Guangyong Chen, Pheng-Ann Heng, and Stan Z Li. A survey on generative diffusion models. IEEE Transactions on Knowledge and Data Engineering, 2024

  3. [3]

    Plenoptic sampling

    Jin-Xiang Chai, Xin Tong, Shing-Chow Chan, and Heung-Yeung Shum. Plenoptic sampling. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pages 307–318, 2000

  4. [4]

    Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images

    Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. In European Conference on Computer Vision, pages 370–386. Springer, 2024

  5. [5]

    Generating 3d-consistent videos from unposed internet photos

    Gene Chou, Kai Zhang, Sai Bi, Hao Tan, Zexiang Xu, Fujun Luan, Bharath Hariharan, and Noah Snavely. Generating 3d-consistent videos from unposed internet photos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  6. [6]

    Diffusion models in vision: A survey

    Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9):10850–10869, 2023

  7. [7]

    Neural parametric gaussians for monocular non-rigid object reconstruction

    Devikalyan Das, Christopher Wewer, Raza Yunus, Eddy Ilg, and Jan Eric Lenssen. Neural parametric gaussians for monocular non-rigid object reconstruction. arXiv preprint arXiv:2312.01196, 2023

  8. [8]

    Unstructured light fields

    Abe Davis, Marc Levoy, and Fredo Durand. Unstructured light fields. In Computer Graphics Forum, volume 31, pages 305–314. Wiley Online Library, 2012

  9. [9]

    Depth-supervised NeRF: Fewer views and faster training for free

    Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised NeRF: Fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022

  10. [10]

    BootsTAP: Bootstrapped training for tracking-any-point

    Carl Doersch, Pauline Luc, Yi Yang, Dilara Gokay, Skanda Koppula, Ankush Gupta, Joseph Heyward, Ignacio Rocco, Ross Goroshin, João Carreira, and Andrew Zisserman. BootsTAP: Bootstrapped training for tracking-any-point. ACCV, 2024

  11. [11]

    TAPIR: Tracking any point with per-frame initialization and temporal refinement

    Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, and Andrew Zisserman. TAPIR: Tracking any point with per-frame initialization and temporal refinement. In ICCV, pages 10061–10072, 2023

  12. [12]

    4d-rotor gaussian splatting: Towards efficient novel-view synthesis for dynamic scenes

    Yuanxing Duan, Fangyin Wei, Qiyu Dai, Yuhang He, Wenzheng Chen, and Baoquan Chen. 4d-rotor gaussian splatting: Towards efficient novel-view synthesis for dynamic scenes. In Proc. SIGGRAPH, July 2024

  13. [13]

    K-planes: Explicit radiance fields in space, time, and appearance

    Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K-planes: Explicit radiance fields in space, time, and appearance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12479–12488, 2023

  14. [14]

    Monocular dynamic view synthesis: A reality check

    Hang Gao, Ruilong Li, Shubham Tulsiani, Bryan Russell, and Angjoo Kanazawa. Monocular dynamic view synthesis: A reality check. In NeurIPS, 2022

  15. [15]

    Cat3d: Create anything in 3d with multi-view diffusion models

    Ruiqi Gao*, Aleksander Holynski*, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul P. Srinivasan, Jonathan T. Barron, and Ben Poole*. Cat3d: Create anything in 3d with multi-view diffusion models. NeurIPS, 2024

  16. [16]

    Spatio-angular resolution tradeoffs in integral photography

    Todor G Georgiev, Ke Colin Zheng, Brian Curless, David Salesin, Shree K Nayar, and Chintan Intwala. Spatio-angular resolution tradeoffs in integral photography. Rendering Techniques, 2006(263-272):21, 2006

  17. [17]

    Generative adversarial nets

    Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014

  18. [18]

    The lumigraph

    Steven J Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F Cohen. The lumigraph. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 453–464. 2023

  19. [19]

    The llama 3 herd of models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, et al. The llama 3 herd of models, 2024

  20. [20]

    Single-view view synthesis in the wild with learned adaptive multiplane images

    Yuxuan Han, Ruicheng Wang, and Jiaolong Yang. Single-view view synthesis in the wild with learned adaptive multiplane images. In ACM SIGGRAPH, 2022

  21. [21]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  22. [22]

    Vivid4d: Improving 4d reconstruction from monocular video by video inpainting

    Jiaxin Huang, Sheng Miao, BangBang Yang, Yuewen Ma, and Yiyi Liao. Vivid4d: Improving 4d reconstruction from monocular video by video inpainting, 2025

  23. [23]

    Panoptic studio: A massively multiview system for social motion capture

    Hanbyul Joo, Hao Liu, Lei Tan, Lin Gui, Bart Nabbe, Iain Matthews, Takeo Kanade, Shohei Nobuhara, and Yaser Sheikh. Panoptic studio: A massively multiview system for social motion capture. In Proceedings of the IEEE international conference on computer vision, pages 3334–3342, 2015

  24. [24]

    Cotracker: It is better to track together

    Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker: It is better to track together. In ECCV, 2024

  25. [25]

    Splatam: Splat, track and map 3d gaussians for dense rgb-d slam

    Nikhil Keetha, Jay Karhade, Krishna Murthy Jatavallabhula, Gengshan Yang, Sebastian Scherer, Deva Ramanan, and Jonathon Luiten. Splatam: Splat, track and map 3d gaussians for dense rgb-d slam. In CVPR, 2024

  26. [26]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM TOG, 42(4), July 2023

  27. [27]

    Tiled multiplane images for practical 3d photography

    Numair Khan, Eric Penner, Douglas Lanman, and Lei Xiao. Tiled multiplane images for practical 3d photography. International Conference on Computer Vision (ICCV), 2023

  28. [28]

    Scene reconstruction from high spatio-angular resolution light fields

    Changil Kim, Henning Zimmer, Yael Pritch, Alexander Sorkine-Hornung, and Markus H Gross. Scene reconstruction from high spatio-angular resolution light fields. ACM Trans. Graph., 32(4):73–1, 2013

  29. [29]

    Adam: A method for stochastic optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017

  30. [30]

    Segment Anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. arXiv:2304.02643, 2023

  31. [31]

    Dynmf: Neural motion factorization for real-time dynamic view synthesis with 3d gaussian splatting

    Agelos Kratimenos, Jiahui Lei, and Kostas Daniilidis. Dynmf: Neural motion factorization for real-time dynamic view synthesis with 3d gaussian splatting. ECCV, 2024

  32. [32]

    Imagenet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger, editors, NeurIPS, volume 25. Curran Associates, Inc., 2012

  33. [33]

    Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds

    Jiahui Lei, Yijia Weng, Adam Harley, Leonidas Guibas, and Kostas Daniilidis. Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds. arXiv preprint arXiv:2405.17421, 2024

  34. [34]

    Light field rendering

    Marc Levoy and Pat Hanrahan. Light field rendering. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 441–452. 2023

  35. [35]

    Nerfacc: Efficient sampling accelerates nerfs

    Ruilong Li, Hang Gao, Matthew Tancik, and Angjoo Kanazawa. Nerfacc: Efficient sampling accelerates nerfs. arXiv preprint arXiv:2305.04966, 2023

  36. [36]

    Spacetime gaussian feature splatting for real-time dynamic view synthesis

    Zhan Li, Zhang Chen, Zhong Li, and Yi Xu. Spacetime gaussian feature splatting for real-time dynamic view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8508–8520, 2024

  37. [37]

    Neural scene flow fields for space-time view synthesis of dynamic scenes

    Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6498–6508, 2021

  38. [38]

    Megasam: Accurate, fast and robust structure and motion from casual dynamic videos

    Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. Megasam: Accurate, fast and robust structure and motion from casual dynamic videos. In arxiv, 2024

  39. [39]

    Dynibar: Neural dynamic image-based rendering

    Zhengqi Li, Qianqian Wang, Forrester Cole, Richard Tucker, and Noah Snavely. Dynibar: Neural dynamic image-based rendering. In CVPR, 2023

  40. [40]

    Feed-forward bullet-time reconstruction of dynamic scenes from monocular videos

    Hanxue Liang, Jiawei Ren, Ashkan Mirzaei, Antonio Torralba, Ziwei Liu, Igor Gilitschenski, Sanja Fidler, Cengiz Oztireli, Huan Ling, Zan Gojcic, and Jiahui Huang. Feed-forward bullet-time reconstruction of dynamic scenes from monocular videos. 2024

  41. [41]

    Gaufre: Gaussian deformation fields for real-time dynamic novel view synthesis

    Yiqing Liang, Numair Khan, Zhengqin Li, Thu Nguyen-Phuoc, Douglas Lanman, James Tompkin, and Lei Xiao. Gaufre: Gaussian deformation fields for real-time dynamic novel view synthesis. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2642–2652. IEEE, 2025

  42. [42]

    Himor: Monocular deformable gaussian reconstruction with hierarchical motion representation

    Yiming Liang, Tianhan Xu, and Yuta Kikuchi. Himor: Monocular deformable gaussian reconstruction with hierarchical motion representation, 2025

  43. [43]

    Gaussian-flow: 4d reconstruction with dynamic 3d gaussian particle

    Youtian Lin, Zuozhuo Dai, Siyu Zhu, and Yao Yao. Gaussian-flow: 4d reconstruction with dynamic 3d gaussian particle. In CVPR, pages 21136–21145, 2024

  44. [44]

    MoDGS: Dynamic gaussian splatting from casually-captured monocular videos with depth priors

    Qingming LIU, Yuan Liu, Jiepeng Wang, Xianqiang Lyu, Peng Wang, Wenping Wang, and Junhui Hou. MoDGS: Dynamic gaussian splatting from casually-captured monocular videos with depth priors. In ICLR, 2025

  45. [45]

    Zero-1-to-3: Zero-shot one image to 3d object

    Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object, 2023

  46. [46]

    3dgs-enhancer: Enhancing unbounded 3d gaussian splatting with view-consistent 2d diffusion priors

    Xi Liu, Chaoyi Zhou, and Siyu Huang. 3dgs-enhancer: Enhancing unbounded 3d gaussian splatting with view-consistent 2d diffusion priors. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  47. [47]

    SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

    Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. Syncdreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023

  48. [48]

    Wonder3d: Single image to 3d using cross-domain diffusion

    Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Single image to 3d using cross-domain diffusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9970–9980, 2024

  49. [49]

    Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis

    Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. In 3DV, 2024

  50. [50]

    Local light field fusion: Practical view synthesis with prescriptive sampling guidelines

    Ben Mildenhall, Pratul P Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (ToG), 38(4):1–14, 2019

  51. [51]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020

  52. [52]

    Light field photography with a hand-held plenoptic camera

    Ren Ng, Marc Levoy, Mathieu Brédif, Gene Duval, Mark Horowitz, and Pat Hanrahan. Light field photography with a hand-held plenoptic camera. PhD thesis, Stanford university, 2005

  53. [53]

    Holoportation: Virtual 3d teleportation in real-time

    Sergio Orts-Escolano, Christoph Rhemann, Sean Fanello, Wayne Chang, Adarsh Kowdle, Yury Degtyarev, David Kim, Philip L Davidson, Sameh Khamis, Mingsong Dou, et al. Holoportation: Virtual 3d teleportation in real-time. In Proceedings of the 29th annual symposium on user interface software and technology, pages 741–754, 2016

  54. [54]

    Nerfies: Deformable neural radiance fields

    Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5865–5874, 2021

  55. [55]

    Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields

    Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin-Brualla, and Steven M Seitz. Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields. arXiv preprint arXiv:2106.13228, 2021

  56. [56]

    Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields

    Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin-Brualla, and Steven M. Seitz. Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields. ACM TOG, 40(6), dec 2021

  57. [57]

    UniDepthV2: Universal monocular metric depth estimation made simpler

    Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. UniDepthV2: Universal monocular metric depth estimation made simpler, 2025

  58. [58]

    UniDepth: Universal monocular metric depth estimation

    Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. UniDepth: Universal monocular metric depth estimation. In CVPR, 2024

  59. [59]

    D-nerf: Neural radiance fields for dynamic scenes

    Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10318–10327, 2021

  60. [60]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, ICML, volume 139 of Proceedings of Machine Learning Researc...

  61. [61]

    Gen3c: 3d-informed world-consistent video generation with precise camera control

    Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world-consistent video generation with precise camera control. In CVPR, 2025

  62. [62]

    Dense depth priors for neural radiance fields from sparse input views

    Barbara Roessle, Jonathan T. Barron, Ben Mildenhall, Pratul P. Srinivasan, and Matthias Nießner. Dense depth priors for neural radiance fields from sparse input views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022

  63. [63]

    Structure-from-motion revisited

    Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016

  64. [64]

    Pixelwise view selection for unstructured multi-view stereo

    Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In ECCV, 2016

  65. [65]

    Tensor4d: Efficient neural 4d decomposition for high-fidelity dynamic reconstruction and rendering

    Ruizhi Shao, Zerong Zheng, Hanzhang Tu, Boning Liu, Hongwen Zhang, and Yebin Liu. Tensor4d: Efficient neural 4d decomposition for high-fidelity dynamic reconstruction and rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16632–16642, 2023

  66. [66]

    Dynamic gaussian marbles for novel view synthesis of casual monocular videos

    Colton Stearns, Adam Harley, Mikaela Uy, Florian Dubost, Federico Tombari, Gordon Wetzstein, and Leonidas Guibas. Dynamic gaussian marbles for novel view synthesis of casual monocular videos. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024

  67. [67]

    Dimensionx: Create any 3d and 4d scenes from a single image with controllable video diffusion

    Wenqiang Sun, Shuo Chen, Fangfu Liu, Zilong Chen, Yueqi Duan, Jun Zhang, and Yikai Wang. Dimensionx: Create any 3d and 4d scenes from a single image with controllable video diffusion. arXiv preprint arXiv:2411.04928, 2024

  68. [68]

    Single-view view synthesis with multiplane images

    Richard Tucker and Noah Snavely. Single-view view synthesis with multiplane images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  69. [69]

    Generative camera dolly: Extreme monocular dynamic novel view synthesis

    Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, and Carl Vondrick. Generative camera dolly: Extreme monocular dynamic novel view synthesis. 2024

  70. [70]

    Vistadream: Sampling multiview consistent images for single-view scene reconstruction

    Haiping Wang, Yuan Liu, Ziwei Liu, Zhen Dong, Wenping Wang, and Bisheng Yang. Vistadream: Sampling multiview consistent images for single-view scene reconstruction. arXiv preprint arXiv:2410.16892, 2024

  71. [71]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In CVPR, 2025

  72. [72]

    ibutter: Neural interactive bullet time generator for human free-viewpoint rendering

    Liao Wang, Ziyu Wang, Pei Lin, Yuheng Jiang, Xin Suo, Minye Wu, Lan Xu, and Jingyi Yu. ibutter: Neural interactive bullet time generator for human free-viewpoint rendering. In Proceedings of the 29th ACM International Conference on Multimedia, MM ’21, pages 4641–4650, New York, NY, USA, 2021. Association for Computing Machinery.

  73. [73]

    Shape of motion: 4d reconstruction from a single video

    Qianqian Wang, Vickie Ye, Hang Gao, Weijia Zeng, Jake Austin, Zhengqi Li, and Angjoo Kanazawa. Shape of motion: 4d reconstruction from a single video, 2024

  74. [74]

    Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision

    Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision, 2024

  75. [75]

    High performance imaging using large camera arrays

    Bennett Wilburn, Neel Joshi, Vaibhav Vaish, Eino-Ville Talvala, Emilio Antunez, Adam Barth, Andrew Adams, Mark Horowitz, and Marc Levoy. High performance imaging using large camera arrays. In ACM SIGGRAPH 2005 Papers, pages 765–776. 2005

  76. [76]

    4d gaussian splatting for real-time dynamic scene rendering

    Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. In CVPR, pages 20310–20320, June 2024

  77. [77]

    CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models

    Rundi Wu, Ruiqi Gao, Ben Poole, Alex Trevithick, Changxi Zheng, Jonathan T. Barron, and Aleksander Holynski. CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models. arXiv:2411.18613, 2024

  78. [78]

    Reconfusion: 3d reconstruction with diffusion priors

    Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P. Srinivasan, Dor Verbin, Jonathan T. Barron, Ben Poole, and Aleksander Holynski. Reconfusion: 3d reconstruction with diffusion priors. arXiv, 2023

  79. [79]

    Neural fields in visual computing and beyond

    Yiheng Xie, Towaki Takikawa, Shunsuke Saito, Or Litany, Shiqin Yan, Numair Khan, Federico Tombari, James Tompkin, Vincent Sitzmann, and Srinath Sridhar. Neural fields in visual computing and beyond. In Computer Graphics Forum, volume 41, pages 641–676. Wiley Online Library, 2022

  80. [80]

    Autoregressive models in vision: A survey

    Jing Xiong, Gongye Liu, Lun Huang, Chengyue Wu, Taiqiang Wu, Yao Mu, Yuan Yao, Hui Shen, Zhongwei Wan, Jinfa Huang, et al. Autoregressive models in vision: A survey. arXiv preprint arXiv:2411.05902, 2024

Showing first 80 references.