
arxiv: 2412.03077 · v2 · submitted 2024-12-04 · 💻 cs.CV

RoDyGS: Robust Dynamic Gaussian Splatting for Casual Videos

Pith reviewed 2026-05-23 07:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords dynamic gaussian splatting · 4D reconstruction · monocular video · novel view synthesis · spatiotemporal regularization · pose-free reconstruction · static-dynamic separation

The pith

RoDyGS reconstructs dynamic 3D scenes from casual monocular videos by separating static and dynamic elements with spatiotemporal regularization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to reconstruct dynamic scenes from casually captured monocular videos, where ambiguity in 3D geometry makes the task difficult. RoDyGS explicitly separates static and dynamic scene elements and applies spatiotemporal regularization to enforce physically plausible geometry and temporally consistent motion. This setup supports dynamic novel view synthesis without camera poses or multi-view data. Experiments show it outperforms earlier pose-free dynamic approaches while matching the rendering quality of pose-free static methods.

Core claim

RoDyGS explicitly separates static and dynamic scene elements, and applies spatiotemporal regularization to enforce physically plausible geometry and temporally consistent motion, significantly outperforming previous pose-free dynamic novel view synthesis approaches.

What carries the argument

Explicit separation of static and dynamic scene elements combined with spatiotemporal regularization applied to a Gaussian splatting representation.
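
As a rough illustration of the two ingredients named above, the sketch below splits points into static and dynamic sets using a per-point motion flag, then scores a trajectory with a generic second-difference smoothness penalty. It assumes nothing about the paper's actual losses: the names `split_by_motion_flag` and `temporal_smoothness`, the tuple-based point type, and the acceleration penalty itself are hypothetical stand-ins chosen for clarity.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def split_by_motion_flag(
    points: Sequence[Point], is_dynamic: Sequence[bool]
) -> Tuple[List[Point], List[Point]]:
    """Split points into (static, dynamic) lists using a per-point flag.

    The flag is assumed to come from some external motion mask; how the
    paper assigns it is not specified here.
    """
    if len(points) != len(is_dynamic):
        raise ValueError("points and is_dynamic must have the same length")
    static = [p for p, d in zip(points, is_dynamic) if not d]
    dynamic = [p for p, d in zip(points, is_dynamic) if d]
    return static, dynamic


def temporal_smoothness(trajectory: Sequence[Point]) -> float:
    """Mean squared second difference (discrete acceleration) of a trajectory.

    This is a generic smoothness prior, not the paper's regularizer.
    Trajectories with fewer than three time steps have no acceleration.
    """
    if len(trajectory) < 3:
        return 0.0
    total = 0.0
    for prev, cur, nxt in zip(trajectory, trajectory[1:], trajectory[2:]):
        total += sum((a - 2.0 * b + c) ** 2 for a, b, c in zip(prev, cur, nxt))
    return total / (len(trajectory) - 2)
```

A constant-velocity trajectory scores zero under this penalty, while a back-and-forth jitter is penalized, which is the general behavior a temporal-consistency term is meant to encourage.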

If this is right

  • Dynamic novel view synthesis becomes feasible from single casual videos without known camera poses.
  • Rendered outputs maintain temporally consistent motion for moving scene elements.
  • Geometry in dynamic regions satisfies physical plausibility constraints enforced by regularization.
  • The method competes in quality with static reconstruction techniques while handling motion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation step could simplify downstream tasks such as object tracking or background removal in video processing pipelines.
  • Regularization patterns developed here might transfer to other monocular reconstruction settings that face similar static-dynamic ambiguities.
  • Testing on videos with rapid camera motion or long durations would reveal whether the regularization remains stable beyond the reported cases.

Load-bearing premise

That explicit separation of static and dynamic elements combined with spatiotemporal regularization will reliably resolve the inherent ambiguity in monocular dynamic reconstruction without additional constraints or multi-view data.

What would settle it

A monocular video sequence with complex object interactions or partial occlusions where the separation produces inaccurate 3D geometry or temporally inconsistent motion across frames.

Figures

Figures reproduced from arXiv: 2412.03077 by Hoseung Choi, Junmyeong Lee, Minsu Cho, Yoonwoo Jeong.

Figure 1: Robust Dynamic Gaussian Splatting (RoDyGS). RoDyGS achieves high-fidelity rendering of novel viewpoints from casual videos, significantly outperforming RoDynRF, which struggles with blurriness during substantial camera and object movement.
Figure 2: RoDyGS Pipeline Overview. Starting with a casually captured video input, RoDyGS extracts camera poses and depths using MASt3R [32], while motion masks are derived from TAM [60]. It then separates static and dynamic Gaussians, enabling each to be independently learned for stationary background and moving objects. The primary optimization objective, Lgs, includes photometric loss and Pearson depth loss, with…
Figure 3: Qualitative results on Kubric-MRig and iPhone. Our pipeline accurately reconstructs scene geometry, produces sharp renderings, and aligns object positions well. Without GT camera poses, RoDynRF struggles to learn the scene geometry, resulting in object positions that differ from the GT. Even with GT camera poses, RoDynRF produces blurry results.
Adjacent table row (truncated): DynMF [29]: PSNR(↑) 18.92, SSIM(↑) 0.7058, LPIPS(↓) 0.3513; no r…
Figure 4: Impact of regularization terms. Our regularization effectively enhances the perceptual quality of the rendering results, leading to sharper and more realistic renderings.
Figure 5: Comparison of motion masks between TAM […
Figure 6: Samples from Kubric-MRig. Kubric-MRig is a dataset generated using Blender that contains 8 scenes. Each scene features multiple objects, some static and some in motion.
Figure 7: Novel View Synthesis on Kubric-MRig. Comparison of rendering results between RoDyGS and other dynamic neural field methods, both pose-aware [35, 43, 57, 62, 63] and pose-free [35]. In the pose-free setup, RoDyGS produces clearer rendering results than RoDynRF.
Figure 8: Novel View Synthesis on Tanks and Temples. Comparison on the Tanks and Temples dataset between RoDyGS and previous pose-free neural field methods [2, 13, 34]. RoDyGS demonstrates competitive rendering quality with CF-3DGS [13], the previous state-of-the-art pose-free neural field for static scenes.
Figure 9: Novel View Synthesis on NVIDIA Dynamic. We compare RoDyGS with RoDynRF on NVIDIA Dynamic with the pose-free setup. RoDyGS synthesizes realistic images similar to those of RoDynRF.
Figure 10: Novel View Synthesis on iPhone. We compare our RoDyGS method against both pose-aware [35, 43, 57, 62, 63] and pose-free [35] dynamic neural fields. RoDyGS achieves better visual clarity than RoDynRF under the pose-free setup.
Figure 11: Failure cases of RoDyGS. RoDyGS and other baselines struggle with large motions and occlusion in scenes.
Figure 12: Comparison of RoDyGS when leveraging SAM [45] and TAM [60] on Kubric-MRig. RoDyGS with motion masks obtained by SAM achieves competitive visual quality to RoDyGS with motion masks obtained by TAM.
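
The Figure 2 caption names a Pearson depth loss without giving its form. A common way to write such a term is one minus the Pearson correlation between rendered depth and a reference depth map, which is insensitive to a positive rescaling or constant offset of the reference, a useful property when the reference comes from a monocular depth estimator. The sketch below follows that assumption; `pearson_depth_loss` and its arguments are illustrative names, and the paper's exact formulation may differ.

```python
import math
from typing import Sequence


def pearson_depth_loss(rendered: Sequence[float], reference: Sequence[float]) -> float:
    """Return 1 - Pearson correlation between two flattened depth maps.

    Assumed form of a "Pearson depth loss": 0 when the maps are perfectly
    positively correlated, 2 when perfectly anti-correlated.
    """
    n = len(rendered)
    if n != len(reference) or n < 2:
        raise ValueError("depth maps must have equal length of at least 2")
    mean_r = sum(rendered) / n
    mean_f = sum(reference) / n
    cov = sum((r - mean_r) * (f - mean_f) for r, f in zip(rendered, reference))
    var_r = sum((r - mean_r) ** 2 for r in rendered)
    var_f = sum((f - mean_f) ** 2 for f in reference)
    denom = math.sqrt(var_r * var_f)
    if denom == 0.0:
        # A constant map carries no ordering information; treat it as
        # uncorrelated (a design choice made for this sketch only).
        return 1.0
    return 1.0 - cov / denom
```

Because correlation ignores affine rescaling, a reference depth of `2 * d + 1` yields the same loss as `d` itself.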
Original abstract

4D reconstruction from casually captured monocular videos is challenging due to inherent ambiguity in reconstructing dynamic 3D geometry. To address this challenge, we introduce Robust Dynamic Gaussian Splatting (RoDyGS), a method that reconstructs dynamic scene representation from casual monocular videos. RoDyGS explicitly separates static and dynamic scene elements, and applies spatiotemporal regularization to enforce physically plausible geometry and temporally consistent motion. Furthermore, we propose a comprehensive benchmark, Kubric-MRig, which provides extensive camera and object motion along with simultaneous multi-view capture, features that are absent in previous benchmarks. Experiments demonstrate that RoDyGS significantly outperforms previous pose-free dynamic novel view synthesis approaches and achieves competitive rendering quality compared to existing pose-free static novel view synthesis approaches. Our proejct page is available at https://rodygs.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces RoDyGS, a method for 4D reconstruction of dynamic scenes from casually captured monocular videos. It explicitly separates static and dynamic scene elements and applies spatiotemporal regularization to enforce physically plausible geometry and temporally consistent motion. The work also proposes the Kubric-MRig benchmark, which features extensive camera and object motion with simultaneous multi-view capture. Experiments are claimed to show that RoDyGS significantly outperforms prior pose-free dynamic novel view synthesis methods while achieving competitive rendering quality with pose-free static approaches.

Significance. If the central claims hold with supporting quantitative evidence, the approach would offer a practical advance in monocular dynamic reconstruction by addressing inherent ambiguities through explicit decomposition and regularization. The introduction of Kubric-MRig as a benchmark with multi-view ground truth addresses a noted gap in prior datasets and could facilitate more rigorous evaluation of pose-free dynamic methods.

major comments (1)
  1. [Abstract] The abstract asserts that RoDyGS 'significantly outperforms previous pose-free dynamic novel view synthesis approaches' and achieves 'competitive rendering quality,' yet provides no quantitative results, error metrics, ablation studies, or method details to support these claims. This absence makes it impossible to assess whether the explicit static/dynamic separation and spatiotemporal regularization actually resolve the monocular ambiguities as stated.
minor comments (1)
  1. [Abstract] The typo 'proejct page' should be corrected to 'project page'.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and the recommendation for major revision. We address the single major comment below regarding the abstract. We will revise the manuscript accordingly to strengthen the presentation of our claims.

Point-by-point responses
  1. Referee: [Abstract] The abstract asserts that RoDyGS 'significantly outperforms previous pose-free dynamic novel view synthesis approaches' and achieves 'competitive rendering quality,' yet provides no quantitative results, error metrics, ablation studies, or method details to support these claims. This absence makes it impossible to assess whether the explicit static/dynamic separation and spatiotemporal regularization actually resolve the monocular ambiguities as stated.

    Authors: We agree that the abstract, being a high-level summary, does not include specific quantitative metrics. The supporting results, including PSNR/SSIM comparisons on Kubric-MRig and other benchmarks, ablation studies on the static/dynamic decomposition and spatiotemporal regularization, and method details, are presented in Sections 4 and 5 with Tables 1-3 and Figures 3-7. To address the concern and make the claims more self-contained, we will revise the abstract to include key quantitative highlights (e.g., average PSNR gains over prior pose-free dynamic methods) while maintaining its concise nature. revision: yes
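
The rebuttal refers to PSNR comparisons. For reference, PSNR is defined as 10·log10(MAX² / MSE), where MAX is the largest possible pixel value and MSE is the mean squared error between the rendered and ground-truth images. A minimal sketch follows, assuming images are flattened into equal-length lists of floats in [0, 1]; the function name `psnr` and the `max_value` parameter are illustrative and not taken from the paper's code.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("images must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        # Identical images: PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher is better: halving the root-mean-square error raises PSNR by about 6 dB.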

Circularity Check

0 steps flagged

No significant circularity; method and experiments are self-contained

Full rationale

The paper introduces an empirical method (RoDyGS) for dynamic scene reconstruction via explicit static/dynamic separation and spatiotemporal regularization, evaluated on a new benchmark (Kubric-MRig) and compared to prior approaches. No derivation chain, equations, or first-principles predictions are present in the provided text that could reduce to fitted inputs or self-citations by construction. Claims rest on proposed architecture and experimental outcomes rather than any self-definitional or load-bearing self-referential steps, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on abstract; no explicit free parameters, axioms, or invented entities are identifiable from the provided text.



Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 1 internal anchor

  1. [1]

    Nonrigid structure from motion in trajectory space

    Ijaz Akhter, Yaser Sheikh, Sohaib Khan, and Takeo Kanade. Nonrigid structure from motion in trajectory space. Advances in neural information processing systems, 21, 2008.

  2. [2]

    Nope-nerf: Optimising neural radiance field with no pose prior

    Wenjing Bian, Zirui Wang, Kejie Li, Jia-Wang Bian, and Victor Adrian Prisacariu. Nope-nerf: Optimising neural radiance field with no pose prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4160–4169, 2023.

  3. [3]

    D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In European Conf. on Computer Vision (ECCV), pages 611–. Springer-Verlag, 2012.

  5. [5]

    Hexplane: A fast representation for dynamic scenes

    Ang Cao and Justin Johnson. Hexplane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 130–141, 2023.

  6. [6]

    Gaussianeditor: Swift and controllable 3d editing with gaussian splatting

    Yiwen Chen, Zilong Chen, Chi Zhang, Feng Wang, Xiaofeng Yang, Yikai Wang, Zhongang Cai, Lei Yang, Huaping Liu, and Guosheng Lin. Gaussianeditor: Swift and controllable 3d editing with gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21476–21485, 2024.

  7. [7]

    Gaussian activated neural radiance fields for high fidelity reconstruction and pose estimation

    Shin-Fang Chng, Sameera Ramasinghe, Jamie Sherrah, and Simon Lucey. Gaussian activated neural radiance fields for high fidelity reconstruction and pose estimation. In European Conference on Computer Vision, pages 264–280. Springer, 2022.

  8. [8]

    Cosseggaussians: Compact and swift scene segmenting 3d gaussians with dual feature fusion

    Bin Dou, Tianyu Zhang, Yongjia Ma, Zhaohui Wang, and Zejian Yuan. Cosseggaussians: Compact and swift scene segmenting 3d gaussians with dual feature fusion. CoRR,

  9. [9]

    Google scanned objects: A high-quality dataset of 3d scanned household items

    Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B McHugh, and Vincent Vanhoucke. Google scanned objects: A high-quality dataset of 3d scanned household items. In 2022 International Conference on Robotics and Automation (ICRA), pages 2553–2560. IEEE, 2022.

  10. [10]

    InstantSplat: Sparse-view gaussian splatting in seconds

    Zhiwen Fan, Wenyan Cong, Kairun Wen, Kevin Wang, Jian Zhang, Xinghao Ding, Danfei Xu, Boris Ivanovic, Marco Pavone, Georgios Pavlakos, et al. Instantsplat: Unbounded sparse-view pose-free gaussian splatting in 40 seconds. arXiv preprint arXiv:2403.20309, 2024.

  11. [11]

    Fast dynamic radiance fields with time-aware neural voxels

    Jiemin Fang, Taoran Yi, Xinggang Wang, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Matthias Nießner, and Qi Tian. Fast dynamic radiance fields with time-aware neural voxels. In SIGGRAPH Asia 2022 Conference Papers, 2022.

  12. [12]

    Plenoxels: Radiance fields without neural networks

    Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5501–5510, 2022.

  13. [13]

    K-planes: Explicit radiance fields in space, time, and appearance

    Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K-planes: Explicit radiance fields in space, time, and appearance. In CVPR, 2023.

  14. [14]

    Colmap-free 3d gaussian splatting

    Yang Fu, Sifei Liu, Amey Kulkarni, Jan Kautz, Alexei A. Efros, and Xiaolong Wang. Colmap-free 3d gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20796–20805, 2024.

  15. [15]

    Dynamic view synthesis from dynamic monocular video

    Chen Gao, Ayush Saraf, Johannes Kopf, and Jia-Bin Huang. Dynamic view synthesis from dynamic monocular video. In Proceedings of the IEEE International Conference on Computer Vision, 2021.

  16. [16]

    Monocular dynamic view synthesis: A reality check

    Hang Gao, Ruilong Li, Shubham Tulsiani, Bryan Russell, and Angjoo Kanazawa. Monocular dynamic view synthesis: A reality check. Advances in Neural Information Processing Systems, 35:33768–33780, 2022.

  17. [17]

    Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, Thomas Kipf, Abhijit Kundu, Dmitry Lagun, Issam Laradji, Hsueh-Ti (Derek) Liu, Henning Meyer, Yishu Miao, Derek Nowrouzezahrai, Cengiz Oztireli, Etienne Pot, Noha Radwan, Daniel Rebain, Sara Sabour...

  18. [18]

    Compressive sensing with un-trained neural networks: Gradient descent finds a smooth approximation

    Reinhard Heckel and Mahdi Soltanolkotabi. Compressive sensing with un-trained neural networks: Gradient descent finds a smooth approximation. In International Conference on Machine Learning, pages 4149–4158. PMLR, 2020.

  19. [19]

    Baking neural radiance fields for real-time view synthesis

    Peter Hedman, Pratul P Srinivasan, Ben Mildenhall, Jonathan T Barron, and Paul Debevec. Baking neural radiance fields for real-time view synthesis. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5875–5884, 2021.

  20. [20]

    Automatic photo pop-up

    Derek Hoiem, Alexei A Efros, and Martial Hebert. Automatic photo pop-up. In ACM SIGGRAPH 2005 Papers, pages 577–584. 2005.

  21. [21]

    Sc-gs: Sparse-controlled gaussian splatting for editable dynamic scenes

    Yi-Hua Huang, Yang-Tian Sun, Ziyi Yang, Xiaoyang Lyu, Yan-Pei Cao, and Xiaojuan Qi. Sc-gs: Sparse-controlled gaussian splatting for editable dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4220–4230, 2024.

  22. [22]

    Self-calibrating neural radiance fields

    Yoonwoo Jeong, Seokjun Ahn, Christopher Choy, Anima Anandkumar, Minsu Cho, and Jaesik Park. Self-calibrating neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5846–5854, 2021.

  23. [23]

    Perfception: Perception using radiance fields

    Yoonwoo Jeong, Seungjoo Shin, Junha Lee, Chris Choy, Anima Anandkumar, Minsu Cho, and Jaesik Park. Perfception: Perception using radiance fields. Advances in Neural Information Processing Systems, 35:26105–26121, 2022.

  24. [24]

    Cotracker: It is better to track together

    Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker: It is better to track together. arXiv preprint arXiv:2307.07635, 2023.

  25. [25]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1,

  26. [26]

    3d gaussian splatting as markov chain monte carlo

    Shakiba Kheradmand, Daniel Rebain, Gopal Sharma, Weiwei Sun, Jeff Tseng, Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, and Kwang Moo Yi. 3d gaussian splatting as markov chain monte carlo. arXiv preprint arXiv:2404.09591, 2024.

  27. [27]

    Laplacianfusion: Detailed 3d clothed-human body reconstruction

    Hyomin Kim, Hyeonseo Nam, Jungeon Kim, Jaesik Park, and Seungyong Lee. Laplacianfusion: Detailed 3d clothed-human body reconstruction. ACM Transactions on Graphics (TOG), 41(6):1–14, 2022.

  28. [28]

    Tanks and temples: Benchmarking large-scale scene reconstruction

    Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (ToG), 36(4):1–13, 2017.

  29. [29]

    Point-based neural rendering with per-view optimization

    Georgios Kopanas, Julien Philip, Thomas Leimkühler, and George Drettakis. Point-based neural rendering with per-view optimization. In Computer Graphics Forum, pages 29–. Wiley Online Library, 2021.

  31. [31]

    Dynmf: Neural motion factorization for real-time dynamic view synthesis with 3d gaussian splatting

    Agelos Kratimenos, Jiahui Lei, and Kostas Daniilidis. Dynmf: Neural motion factorization for real-time dynamic view synthesis with 3d gaussian splatting. arXiv, 2023.

  32. [32]

    Multi-body non-rigid structure-from-motion

    Suryansh Kumar, Yuchao Dai, and Hongdong Li. Multi-body non-rigid structure-from-motion. In 2016 Fourth International Conference on 3D Vision (3DV), pages 148–156. IEEE, 2016.

  33. [33]

    Fast view synthesis of casual videos

    Yao-Chih Lee, Zhoutong Zhang, Kevin Blackburn-Matzen, Simon Niklaus, Jianming Zhang, Jia-Bin Huang, and Feng Liu. Fast view synthesis of casual videos. arXiv preprint arXiv:2312.02135, 2023.

  34. [34]

    Grounding image matching in 3d with mast3r

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. arXiv preprint arXiv:2406.09756, 2024.

  35. [35]

    Neural 3d video synthesis from multi-view video

    Tianye Li, Mira Slavcheva, Michael Zollhoefer, Simon Green, Christoph Lassner, Changil Kim, Tanner Schmidt, Steven Lovegrove, Michael Goesele, Richard Newcombe, et al. Neural 3d video synthesis from multi-view video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5521–5531, 2022.

  36. [36]

    Barf: Bundle-adjusting neural radiance fields

    Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, and Simon Lucey. Barf: Bundle-adjusting neural radiance fields. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5741–5751, 2021.

  37. [37]

    Robust dynamic radiance fields

    Yu-Lun Liu, Chen Gao, Andreas Meuleman, Hung-Yu Tseng, Ayush Saraf, Changil Kim, Yung-Yu Chuang, Johannes Kopf, and Jia-Bin Huang. Robust dynamic radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13–23, 2023.

  38. [38]

    Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis

    Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. In 3DV, 2024.

  39. [39]

    Local light field fusion: Practical view synthesis with prescriptive sampling guidelines

    Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG), 2019.

  40. [40]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.

  41. [41]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

  42. [42]

    Instant neural graphics primitives with a multiresolution hash encoding

    Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM transactions on graphics (TOG), 41(4):1–15, 2022.

  43. [43]

    Nerfies: Deformable neural radiance fields

    Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5865–5874, 2021.

  44. [44]

    Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields

    Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin-Brualla, and Steven M. Seitz. Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields. ACM Trans. Graph., 40(6), 2021.

  45. [45]

    D-NeRF: Neural Radiance Fields for Dynamic Scenes

    Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-NeRF: Neural Radiance Fields for Dynamic Scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.

  46. [46]

    Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer

    René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020.

  47. [47]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.

  48. [48]

    Free view synthesis

    Gernot Riegler and Vladlen Koltun. Free view synthesis. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIX 16, pages 623–640. Springer, 2020.

  49. [49]

    Stable view synthesis

    Gernot Riegler and Vladlen Koltun. Stable view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12216–12225, 2021.

  50. [50]

    The convergence rate of neural networks for learned functions of different frequencies

    Basri Ronen, David Jacobs, Yoni Kasten, and Shira Kritchman. The convergence rate of neural networks for learned functions of different frequencies. Advances in Neural Information Processing Systems, 32, 2019.

  51. [51]

    Structure-from-motion revisited

    Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016.

  52. [52]

    Improved direct voxel grid optimization for radiance fields reconstruction

    Cheng Sun, Min Sun, and Hwann-Tzong Chen. Improved direct voxel grid optimization for radiance fields reconstruction. arXiv preprint arXiv:2206.05085, 2022.

  53. [53]

    Raft: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 402–419. Springer,

  54. [54]

    Shape of motion: 4d reconstruction from a single video

    Qianqian Wang, Vickie Ye, Hang Gao, Jake Austin, Zhengqi Li, and Angjoo Kanazawa. Shape of motion: 4d reconstruction from a single video. 2024.

  55. [55]

    Shape of motion: 4d reconstruction from a single video

    Qianqian Wang, Vickie Ye, Hang Gao, Jake Austin, Zhengqi Li, and Angjoo Kanazawa. Shape of motion: 4d reconstruction from a single video. arXiv preprint arXiv:2407.13764,

  56. [56]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024.

  57. [57]

    Gflow: Recovering 4d world from monocular video

    Shizun Wang, Xingyi Yang, Qiuhong Shen, Zhenxiang Jiang, and Xinchao Wang. Gflow: Recovering 4d world from monocular video. arXiv preprint arXiv:2405.18426, 2024.

  58. [58]

    NeRF--: Neural radiance fields without known camera parameters

    Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. Nerf--: Neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064, 2021.

  59. [59]

    4d gaussian splatting for real-time dynamic scene rendering

    Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20310–20320, 2024.

  60. [60]

    Sparsegs: Real-time 360° sparse view synthesis using gaussian splatting

    Haolin Xiong, Sairisheek Muttukuru, Rishi Upadhyay, Pradyumna Chari, and Achuta Kadambi. Sparsegs: Real-time 360° sparse view synthesis using gaussian splatting. arXiv preprint arXiv:2312.00206, 2023.

  61. [61]

    Point-nerf: Point-based neural radiance fields

    Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-nerf: Point-based neural radiance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5438–5448, 2022.

  62. [62]

    Track anything: Segment anything meets videos

    Jinyu Yang, Mingqi Gao, Zhe Li, Shang Gao, Fangjing Wang, and Feng Zheng. Track anything: Segment anything meets videos. arXiv preprint arXiv:2304.11968, 2023.

  63. [63]

    Depth anything: Unleashing the power of large-scale unlabeled data

    Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In CVPR, 2024.

  64. [64]

    Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction

    Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20331–20341, 2024.

  65. [65]

    Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting

    Zeyu Yang, Hongye Yang, Zijie Pan, and Li Zhang. Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting. In International Conference on Learning Representations (ICLR), 2024.

  66. [66]

    Absgs: Recovering fine details in 3d gaussian splatting

    Zongxin Ye, Wenyu Li, Sidun Liu, Peng Qiao, and Yong Dou. Absgs: Recovering fine details in 3d gaussian splatting. In ACM Multimedia 2024, 2024.

  67. [67]

    inerf: Inverting neural radiance fields for pose estimation

    Lin Yen-Chen, Pete Florence, Jonathan T Barron, Alberto Rodriguez, Phillip Isola, and Tsung-Yi Lin. inerf: Inverting neural radiance fields for pose estimation. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1323–1330. IEEE, 2021.

  68. [68]

    Novel view synthesis of dynamic scenes with globally coherent depths from a monocular camera

    Jae Shin Yoon, Kihwan Kim, Orazio Gallo, Hyun Soo Park, and Jan Kautz. Novel view synthesis of dynamic scenes with globally coherent depths from a monocular camera. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5336–5345, 2020.

  69. [69]

    Cor-gs: Sparse-view 3d gaussian splatting via co-regularization

    Jiawei Zhang, Jiahe Li, Xiaohan Yu, Lei Huang, Lin Gu, Jin Zheng, and Xiao Bai. Cor-gs: Sparse-view 3d gaussian splatting via co-regularization. arXiv preprint arXiv:2405.12110,

  70. [70]

    Differentiable point-based radiance fields for efficient view synthesis

    Qiang Zhang, Seung-Hwan Baek, Szymon Rusinkiewicz, and Felix Heide. Differentiable point-based radiance fields for efficient view synthesis. In SIGGRAPH Asia 2022 Conference Papers, pages 1–12, 2022.

  71. [71]

    Ewa splatting

    M. Zwicker, H. Pfister, J. van Baar, and M. Gross. Ewa splatting. IEEE Transactions on Visualization and Computer Graphics, 8(3):223–238, 2002.