arxiv: 2601.00285 · v2 · submitted 2026-01-01 · 💻 cs.CV

SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting

Pith reviewed 2026-05-16 18:10 UTC · model grok-4.3

classification 💻 cs.CV
keywords 4D reconstruction · Gaussian splatting · sparse view reconstruction · skeleton-driven deformation · dynamic scene · motion estimation · deformation field

The pith

A skeleton-driven deformation field lets Gaussian splatting reconstruct moving objects accurately from sparse camera views over time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SV-GS, a method for reconstructing dynamic scenes when both viewpoints and time samples are sparse, a setting where standard dense multi-view video methods fail. It initializes with a rough skeleton graph and a static reconstruction, then optimizes a deformation field that separates coarse, time-varying joint poses from finer, time-independent deformations. Only the joint poses change across frames, which supports smooth motion interpolation while retaining geometric detail learned from the available views. Experiments show gains of up to 34 percent in PSNR over prior sparse-observation methods on synthetic data, and performance comparable to dense monocular video techniques on real scenes despite using far fewer frames. The initial static reconstruction can also be replaced by a diffusion-based generative prior, relaxing the input requirements for practical use.

Core claim

SV-GS simultaneously estimates a deformation model and the object's motion over time under sparse observations by optimizing a skeleton-driven deformation field composed of a coarse skeleton joint pose estimator and a module for fine-grained deformations. By making only the joint pose estimator time-dependent, the model enables smooth motion interpolation while preserving learned geometric details.
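One way to picture the interpolation claim: if only the joint poses vary with time, a pose at an unobserved time can be read off between two observed ones. The sketch below uses spherical linear interpolation (slerp) between keyframe rotations as a stand-in. The paper describes a learned time-dependent joint pose estimator, not slerp, so the functions `slerp` and `pose_at`, the keyframe layout, and the sample values are illustrative assumptions only.

```python
import math

def slerp(q0, q1, u):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:  # flip one end so we travel along the shorter arc
        q1 = tuple(-b for b in q1)
        dot = -dot
    if dot > 0.9995:  # nearly parallel: plain linear blend, normalized below
        q = tuple(a + u * (b - a) for a, b in zip(q0, q1))
    else:
        theta = math.acos(dot)
        s0 = math.sin((1.0 - u) * theta) / math.sin(theta)
        s1 = math.sin(u * theta) / math.sin(theta)
        q = tuple(s0 * a + s1 * b for a, b in zip(q0, q1))
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)

def pose_at(t, keyframes):
    """keyframes: time-sorted list of (time, unit quaternion) for one joint."""
    for (t0, q0), (t1, q1) in zip(keyframes, keyframes[1:]):
        if t0 <= t <= t1:
            return slerp(q0, q1, (t - t0) / (t1 - t0))
    raise ValueError("t lies outside the observed time range")

# Hypothetical joint observed at two sparse times: identity, then 90 degrees about z.
keyframes = [
    (0.0, (1.0, 0.0, 0.0, 0.0)),
    (1.0, (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))),
]
q_mid = pose_at(0.5, keyframes)  # a 45 degree rotation about z
```

Because the fine-grained module does not depend on time in the described design, only quantities like `q_mid` would need to be estimated for an unobserved time step.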

What carries the argument

Skeleton-driven deformation field consisting of a time-dependent coarse joint pose estimator and a time-independent fine-grained deformation module that guides Gaussian splatting optimization.
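The coarse/fine split can be illustrated with a toy 2D example. This is a minimal sketch, assuming a linear-blend-skinning style coarse stage and an additive static offset for the fine stage; the paper's actual parameterization is not given in this summary, and `joint_angles`, `deform`, and all numeric values are hypothetical.

```python
import cmath
import math

def joint_angles(t):
    """Hypothetical time-dependent coarse stage: one angle per joint (radians)."""
    return [0.0, 0.5 * math.pi * t]  # joint 0 stays fixed, joint 1 swings with t

def deform(points, offsets, weights, pivots, t):
    """Blend per-joint rotations (time-dependent), then add a static fine offset."""
    angles = joint_angles(t)
    out = []
    for p, d, w in zip(points, offsets, weights):
        blended = 0j
        for w_j, c, theta in zip(w, pivots, angles):
            # Rotate the canonical point about joint pivot c by angle theta.
            blended += w_j * (c + (p - c) * cmath.exp(1j * theta))
        out.append(blended + d)  # time-independent fine-grained correction
    return out

# Canonical 2D points stored as complex numbers x + yj; all values are made up.
points = [0.5 + 0j, 1.5 + 0j]
offsets = [0.0 + 0.01j, 0.0 - 0.02j]  # static fine deformation per point
weights = [[1.0, 0.0], [0.0, 1.0]]    # skinning weights, each row sums to 1
pivots = [0j, 1.0 + 0j]               # joint positions

rest = deform(points, offsets, weights, pivots, t=0.0)
bent = deform(points, offsets, weights, pivots, t=1.0)
```

In this toy setup the only quantity that changes between `rest` and `bent` is the output of `joint_angles`; the offsets and weights are shared across all time steps, which mirrors the division of labor the summary describes.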

If this is right

  • Outperforms existing sparse-observation methods by up to 34 percent in PSNR on synthetic datasets.
  • Matches performance of dense monocular video methods on real-world data while using significantly fewer frames.
  • The initial static reconstruction input can be replaced by a diffusion-based generative prior for greater practicality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The coarse-fine separation may generalize to other dynamic representations such as neural radiance fields or meshes.
  • Existing surveillance camera networks could supply the sparse inputs needed for 4D reconstruction without new dense capture hardware.
  • Fully automatic skeleton extraction could remove the remaining manual initialization step.

Load-bearing premise

A rough skeleton graph and initial static reconstruction are available to guide motion estimation under otherwise ill-posed sparse observations.

What would settle it

Run the method on a synthetic dynamic sequence with ground-truth 4D data but supply an intentionally inaccurate or missing skeleton graph and measure whether PSNR falls below competing skeleton-free baselines.
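The comparison in this test comes down to a PSNR measurement per rendered frame. The sketch below uses the standard PSNR definition over flattened pixel values in [0, 1]; the helper `premise_is_load_bearing` and its decision rule are assumptions added for illustration, not part of the paper.

```python
import math

def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(rendered) != len(reference):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def premise_is_load_bearing(psnr_degraded_skeleton, psnr_skeleton_free_baseline):
    """Hypothetical decision rule: the premise matters if a bad skeleton
    drops the method below a baseline that never used a skeleton."""
    return psnr_degraded_skeleton < psnr_skeleton_free_baseline
```

In practice the scores would be averaged over held-out views and time steps before being compared.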

Figures

Figures reproduced from arXiv: 2601.00285 by Jun-Jee Chao, Volkan Isler.

Figure 1: We study the problem of 4D reconstruction from sparse observations. Our method takes the following as input: (a) A set of ...
Figure 2: Comparison of input configurations across dynamic re...
Figure 3: Given canonical 3D Gaussians and an input skeleton, ...
Figure 4: Qualitative results on the D-NeRF dataset [...
Figure 7: Qualitative result on the real-world ZJU-MoCap dataset.
Figure 10: Qualitative comparison of results with and without ...
Figure 9: Results on the camel scene from the in-the-wild DAVIS ...
Original abstract

Reconstructing a dynamic target moving over a large area is challenging. Standard approaches for dynamic object reconstruction require dense coverage in both the viewing space and the temporal dimension, typically relying on multi-view videos captured at each time step. However, such setups are only possible in constrained environments. In real-world scenarios, observations are often sparse over time and captured sparsely from diverse viewpoints (e.g., from security cameras), making dynamic reconstruction highly ill-posed. We present SV-GS, a framework that simultaneously estimates a deformation model and the object's motion over time under sparse observations. To initialize SV-GS, we leverage a rough skeleton graph and an initial static reconstruction as inputs to guide motion estimation. (Later, we show that this input requirement can be relaxed.) Our method optimizes a skeleton-driven deformation field composed of a coarse skeleton joint pose estimator and a module for fine-grained deformations. By making only the joint pose estimator time-dependent, our model enables smooth motion interpolation while preserving learned geometric details. Experiments on synthetic datasets show that our method outperforms existing approaches under sparse observations by up to 34% in PSNR, and achieves comparable performance to dense monocular video methods on real-world datasets despite using significantly fewer frames. Moreover, we demonstrate that the input initial static reconstruction can be replaced by a diffusion-based generative prior, making our method more practical for real-world scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SV-GS, a framework for 4D dynamic reconstruction from sparse multi-view observations. It initializes with a rough skeleton graph plus static reconstruction (later relaxable to a diffusion prior), then optimizes a skeleton-driven Gaussian Splatting deformation model that separates time-dependent coarse joint-pose estimation from static fine-grained deformations. The central empirical claim is up to 34% PSNR improvement over prior methods on synthetic sparse-view data and performance parity with dense monocular video baselines on real data despite using far fewer frames.

Significance. If the quantitative gains and the diffusion-prior relaxation hold under rigorous controls, the work would meaningfully advance practical 4D reconstruction outside controlled capture rigs. The separation of coarse time-dependent pose from static detail is a clean modeling choice that could generalize to other sparse dynamic settings.

major comments (2)
  1. [Abstract / Experiments] Abstract and Experiments section: the headline 34% PSNR gain is reported without error bars, per-scene variance, or ablation tables isolating the skeleton-graph contribution from the deformation field. Because the joint-pose stage is anchored by the skeleton input, the absence of an ablation that replaces the provided skeleton with automatic pose estimation on identical sparse inputs leaves open whether the reported numbers reflect an oracle initialization rather than a fully automatic pipeline.
  2. [Methods] Methods: the deformation model is defined as the sum of a time-dependent coarse joint-pose estimator and a static fine-grained module. No derivation or loss-term analysis is supplied showing that this split is necessary for stability under the stated sparsity levels; an ablation that makes the fine module also time-dependent (or removes the skeleton anchor entirely) would be required to substantiate the modeling claim.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'significantly fewer frames' should be quantified (exact frame counts and view counts for both SV-GS and the dense monocular baselines).
  2. [Experiments] The diffusion-prior relaxation is mentioned only qualitatively; a short table comparing PSNR when the static initialization is replaced by the generative prior versus the provided static reconstruction would strengthen the practicality claim.
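
The statistics requested in major comment 1 are simple to compute once per-scene scores exist. The following is a minimal sketch, assuming the headline percentage is a relative difference between mean PSNR values in dB (the summary does not state how the 34% figure is computed); the function names and every number used with them are placeholders, not results from the paper.

```python
import math
import statistics

def summarize(per_scene_psnr):
    """Return (mean, standard error of the mean) over per-scene PSNR values."""
    mean = statistics.mean(per_scene_psnr)
    sem = statistics.stdev(per_scene_psnr) / math.sqrt(len(per_scene_psnr))
    return mean, sem

def relative_gain_percent(ours, baseline):
    """Relative improvement of one mean PSNR over another, in percent."""
    return 100.0 * (ours - baseline) / baseline

# Placeholder per-scene scores, purely for illustration.
mean_psnr, sem_psnr = summarize([30.0, 32.0, 34.0])
gain = relative_gain_percent(26.8, 20.0)
```

Reporting the mean together with its standard error per method, and the per-scene values behind them, would let readers judge whether a headline gain is driven by a few scenes.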

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the empirical presentation and modeling justification without altering the core claims.

Point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the headline 34% PSNR gain is reported without error bars, per-scene variance, or ablation tables isolating the skeleton-graph contribution from the deformation field. Because the joint-pose stage is anchored by the skeleton input, the absence of an ablation that replaces the provided skeleton with automatic pose estimation on identical sparse inputs leaves open whether the reported numbers reflect an oracle initialization rather than a fully automatic pipeline.

    Authors: We agree that error bars, per-scene variance, and targeted ablations would strengthen the results section. In the revision we will add standard error bars to all quantitative tables, report per-scene PSNR values, and include a new ablation that substitutes the provided skeleton graph with an off-the-shelf automatic pose estimator (e.g., a recent monocular 3D pose method) while keeping all other inputs and sparsity levels identical. This will clarify that the reported gains are not solely attributable to oracle skeleton initialization. The diffusion-prior relaxation already demonstrated in the manuscript applies to the static reconstruction; the new ablation will extend the same spirit to the skeleton component. revision: yes

  2. Referee: [Methods] Methods: the deformation model is defined as the sum of a time-dependent coarse joint-pose estimator and a static fine-grained module. No derivation or loss-term analysis is supplied showing that this split is necessary for stability under the stated sparsity levels; an ablation that makes the fine module also time-dependent (or removes the skeleton anchor entirely) would be required to substantiate the modeling claim.

    Authors: We will expand the Methods section with a short derivation of the composite deformation field and an analysis of the loss terms showing why restricting time dependence to the coarse joint-pose stage improves stability under the sparsity regimes considered. We will also add two new ablations: (1) making the fine-grained module time-dependent as well, and (2) removing the skeleton anchor entirely (relying solely on the diffusion prior). These experiments will be reported with the same metrics and sparsity settings used in the main tables, allowing direct comparison to the original split. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes an optimization-based framework that takes a rough skeleton graph and initial static reconstruction (or diffusion prior) as explicit inputs to initialize and guide motion estimation under sparse views. Performance claims rest on empirical results from synthetic and real datasets rather than any closed-form derivation or prediction that reduces by construction to fitted parameters. No self-definitional steps, fitted inputs renamed as predictions, load-bearing self-citations, or ansatz smuggling appear in the provided text; the approach is a standard engineering pipeline validated externally and therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The method implicitly assumes a usable skeleton graph and static initialization exist or can be generated.

pith-pipeline@v0.9.0 · 5543 in / 1103 out tokens · 26192 ms · 2026-05-16T18:10:10.022536+00:00 · methodology


Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 3 internal anchors

  1. [1]

    Score distillation sampling with learned manifold cor- rective

    Thiemo Alldieck, Nikos Kolotouros, and Cristian Sminchis- escu. Score distillation sampling with learned manifold cor- rective. InEuropean Conference on Computer Vision, pages 1–18, 2024. 8

  2. [2]

    PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation

    Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael V oznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, An- jali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalam- barkar, Laurent Kirsch, Mich...

  3. [3]

    4d-fy: Text-to-4d generation using hybrid score dis- tillation sampling

    Sherwin Bahmani, Ivan Skorokhodov, Victor Rong, Gordon Wetzstein, Leonidas Guibas, Peter Wonka, Sergey Tulyakov, Jeong Joon Park, Andrea Tagliasacchi, and David B Lin- dell. 4d-fy: Text-to-4d generation using hybrid score dis- tillation sampling. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 7996–8006, 2024. 3

  4. [4]

    Immersive light field video with a layered mesh representation.ACM Trans- actions on Graphics (TOG), 39(4):86–1, 2020

    Michael Broxton, John Flynn, Ryan Overbeck, Daniel Erick- son, Peter Hedman, Matthew Duvall, Jason Dourgarian, Jay Busch, Matt Whalen, and Paul Debevec. Immersive light field video with a layered mesh representation.ACM Trans- actions on Graphics (TOG), 39(4):86–1, 2020. 2

  5. [5]

    Hexplane: A fast representa- tion for dynamic scenes

    Ang Cao and Justin Johnson. Hexplane: A fast representa- tion for dynamic scenes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 130–141, 2023. 1, 2

  6. [6]

    Part seg- mentation and motion estimation for articulated objects with dynamic 3d gaussians.arXiv preprint arXiv:2506.22718,

    Jun-Jee Chao, Qingyuan Jiang, and V olkan Isler. Part seg- mentation and motion estimation for articulated objects with dynamic 3d gaussians.arXiv preprint arXiv:2506.22718,

  7. [7]

    A kinematic no- tation for lower-pair mechanisms based on matrices

    Jacques Denavit and Richard S Hartenberg. A kinematic no- tation for lower-pair mechanisms based on matrices. 1955. 4

  8. [8]

    Fusion4d: Real-time performance capture of challeng- ing scenes.ACM Transactions on Graphics (ToG), 35(4): 1–13, 2016

    Mingsong Dou, Sameh Khamis, Yury Degtyarev, Philip Davidson, Sean Ryan Fanello, Adarsh Kowdle, Sergio Orts Escolano, Christoph Rhemann, David Kim, Jonathan Taylor, et al. Fusion4d: Real-time performance capture of challeng- ing scenes.ACM Transactions on Graphics (ToG), 35(4): 1–13, 2016. 2

  9. [9]

    K-planes: Explicit radiance fields in space, time, and appearance

    Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K-planes: Explicit radiance fields in space, time, and appearance. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 12479–12488, 2023. 2

  10. [10]

    Monocular dynamic view synthesis: A reality check.Advances in Neural Information Processing Systems, 35:33768–33780, 2022

    Hang Gao, Ruilong Li, Shubham Tulsiani, Bryan Russell, and Angjoo Kanazawa. Monocular dynamic view synthesis: A reality check.Advances in Neural Information Processing Systems, 35:33768–33780, 2022. 2

  11. [11]

    Forward flow for novel view synthesis of dynamic scenes

    Xiang Guo, Jiadai Sun, Yuchao Dai, Guanying Chen, Xiao- qing Ye, Xiao Tan, Errui Ding, Yumeng Zhang, and Jingdong Wang. Forward flow for novel view synthesis of dynamic scenes. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 16022–16033, 2023. 2

  12. [12]

    Sc-gs: Sparse-controlled gaussian splatting for editable dynamic scenes

    Yi-Hua Huang, Yang-Tian Sun, Ziyi Yang, Xiaoyang Lyu, Yan-Pei Cao, and Xiaojuan Qi. Sc-gs: Sparse-controlled gaussian splatting for editable dynamic scenes. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4220–4230, 2024. 1, 2, 3, 6

  13. [13]

    3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42 (4), 2023

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42 (4), 2023. 2, 3, 5

  14. [14]

    Dynmf: Neural motion factorization for real-time dynamic view synthesis with 3d gaussian splatting

    Agelos Kratimenos, Jiahui Lei, and Kostas Daniilidis. Dynmf: Neural motion factorization for real-time dynamic view synthesis with 3d gaussian splatting. InEuropean Con- ference on Computer Vision, pages 252–269. Springer, 2024. 2

  15. [15]

    Gart: Gaussian articulated template mod- els

    Jiahui Lei, Yufu Wang, Georgios Pavlakos, Lingjie Liu, and Kostas Daniilidis. Gart: Gaussian articulated template mod- els. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 19876–19887,

  16. [16]

    Articulated kinematics distillation from video diffusion models

    Xuan Li, Qianli Ma, Tsung-Yi Lin, Yongxin Chen, Chen- fanfu Jiang, Ming-Yu Liu, and Donglai Xiang. Articulated kinematics distillation from video diffusion models. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 17571–17581, 2025. 3, 4

  17. [17]

    Neural scene flow fields for space-time view synthesis of dy- namic scenes

    Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dy- namic scenes. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6498– 6508, 2021. 2

  18. [18]

    Align your gaussians: Text-to-4d with dynamic 3d gaussians and composed diffusion models

    Huan Ling, Seung Wook Kim, Antonio Torralba, Sanja Fi- dler, and Karsten Kreis. Align your gaussians: Text-to-4d with dynamic 3d gaussians and composed diffusion models. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 8576–8588, 2024. 3

  19. [19]

    Lepard: Learning explicit part dis- covery for 3d articulated shape reconstruction

    Di Liu, Anastasis Stathopoulos, Qilong Zhangli, Yunhe Gao, and Dimitris Metaxas. Lepard: Learning explicit part dis- covery for 3d articulated shape reconstruction. InAdvances in Neural Information Processing Systems, pages 54187– 54198. Curran Associates, Inc., 2023. 3

  20. [20]

    Dynamic gaus- sians mesh: Consistent mesh reconstruction from dynamic scenes

    Isabella Liu, Hao Su, and Xiaolong Wang. Dynamic gaus- sians mesh: Consistent mesh reconstruction from dynamic scenes. InThe Thirteenth International Conference on Learning Representations, 2025. 6, 7

  21. [21]

    Riganything: Template-free autoregressive rigging for diverse 3d assets

    Isabella Liu, Zhan Xu, Yifan Wang, Hao Tan, Zexiang Xu, Xiaolong Wang, Hao Su, and Zifan Shi. Riganything: Template-free autoregressive rigging for diverse 3d assets. ACM Transactions on Graphics (TOG), 44(4):1–12, 2025. 3

  22. [22]

    One-2-3-45++: Fast single image to 3d objects with consistent multi-view generation and 3d dif- fusion

    Minghua Liu, Ruoxi Shi, Linghao Chen, Zhuoyang Zhang, Chao Xu, Xinyue Wei, Hansheng Chen, Chong Zeng, Ji- 9 ayuan Gu, and Hao Su. One-2-3-45++: Fast single image to 3d objects with consistent multi-view generation and 3d dif- fusion. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 10072–10083,

  23. [23]

    MoDGS: Dy- namic gaussian splatting from casually-captured monocular videos with depth priors

    Qingming LIU, Yuan Liu, Jiepeng Wang, Xianqiang Lyu, Peng Wang, Wenping Wang, and Junhui Hou. MoDGS: Dy- namic gaussian splatting from casually-captured monocular videos with depth priors. InThe Thirteenth International Conference on Learning Representations, 2025. 2

  24. [24]

    Zero-1-to- 3: Zero-shot one image to 3d object

    Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tok- makov, Sergey Zakharov, and Carl V ondrick. Zero-1-to- 3: Zero-shot one image to 3d object. InProceedings of the IEEE/CVF international conference on computer vision, pages 9298–9309, 2023. 2, 3, 7

  25. [25]

    Build- ing rearticulable models for arbitrary 3d objects from 4d point clouds

    Shaowei Liu, Saurabh Gupta, and Shenlong Wang. Build- ing rearticulable models for arbitrary 3d objects from 4d point clouds. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21138– 21147, 2023. 1

  26. [26]

    Neural vol- umes: learning dynamic renderable volumes from images

    Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural vol- umes: learning dynamic renderable volumes from images. ACM Transactions on Graphics (TOG), 38(4):1–14, 2019. 2

  27. [27]

    Matthew Loper, Naureen Mahmood, Javier Romero, Ger- ard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model.ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, 2015. 3

  28. [28]

    Dynamic 3d gaussians: Tracking by per- sistent dynamic view synthesis

    Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3d gaussians: Tracking by per- sistent dynamic view synthesis. In2024 International Con- ference on 3D Vision (3DV), pages 800–809. IEEE, 2024. 2

  29. [29]

    Score distillation via reparametrized ddim.Advances in Neural Information Pro- cessing Systems, 37:26011–26044, 2024

    Artem Lukoianov, Haitz S’aez de Oc’ariz Borde, Kristjan Greenewald, Vitor Guizilini, Timur Bagautdinov, Vincent Sitzmann, and Justin M Solomon. Score distillation via reparametrized ddim.Advances in Neural Information Pro- cessing Systems, 37:26011–26044, 2024. 8

  30. [30]

    Joint-dependent local deformations for hand an- imation and object grasping

    Nadia Magnenat-Thalmann, Richard Laperri `ere, and Daniel Thalmann. Joint-dependent local deformations for hand an- imation and object grasping. InProceedings on Graphics interface’88, pages 26–33, 1989. 4

  31. [31]

    Jacobs, Alexei A

    David McAllister, Songwei Ge, Jia-Bin Huang, David W. Jacobs, Alexei A. Efros, Aleksander Holynski, and Angjoo Kanazawa. Rethinking score distillation as a bridge between image distributions. InAdvances in Neural Information Pro- cessing Systems, 2024. 8

  32. [32]

    Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021. 2, 4

  33. [33]

    Watch it move: Unsupervised discovery of 3d joints for re-posing of articulated objects

    Atsuhiro Noguchi, Umar Iqbal, Jonathan Tremblay, Tatsuya Harada, and Orazio Gallo. Watch it move: Unsupervised discovery of 3d joints for re-posing of articulated objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3677–3687, 2022. 3

  34. [34]

    Kingma and Jimmy Ba

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. InProceedings of the 3rd In- ternational Conference on Learning Representations (ICLR 2015), 2015. 6

  35. [35]

    Nerfies: Deformable neural radiance fields

    Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. InProceedings of the IEEE/CVF international conference on computer vision, pages 5865–5874, 2021. 2

  36. [36]

    Hypernerf: a higher- dimensional representation for topologically varying neural radiance fields.ACM Transactions on Graphics (TOG), 40 (6):1–12, 2021

    Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin- Brualla, and Steven M Seitz. Hypernerf: a higher- dimensional representation for topologically varying neural radiance fields.ACM Transactions on Graphics (TOG), 40 (6):1–12, 2021. 2

  37. [37]

    Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans

    Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9054–9063, 2021. 6, 7

  38. [38]

    A benchmark dataset and evaluation methodology for video object segmentation

    Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine- Hornung. A benchmark dataset and evaluation methodology for video object segmentation. InThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 6, 7, 8

  39. [39]

    Dreamfusion: Text-to-3d using 2d diffusion

    Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Milden- hall. Dreamfusion: Text-to-3d using 2d diffusion. InThe Eleventh International Conference on Learning Representa- tions, 2022. 3, 7

  40. [40]

    D-nerf: Neural radiance fields for dynamic scenes

    Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 10318–10327, 2021. 1, 2, 5, 6, 7

  41. [41]

    Em- bodied hands: Modeling and capturing hands and bodies to- gether.ACM Transactions on Graphics, 36(6), 2017

    Javier Romero, Dimitris Tzionas, and Michael J Black. Em- bodied hands: Modeling and capturing hands and bodies to- gether.ACM Transactions on Graphics, 36(6), 2017. 3

  42. [42]

    Structure-from-motion revisited

    Johannes Lutz Sch ¨onberger and Jan-Michael Frahm. Structure-from-motion revisited. InConference on Com- puter Vision and Pattern Recognition (CVPR), 2016. 2

  43. [43]

    Pixelwise view selection for un- structured multi-view stereo

    Johannes Lutz Sch ¨onberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for un- structured multi-view stereo. InEuropean Conference on Computer Vision (ECCV), 2016. 2

  44. [44]

    Tensor4d: Efficient neural 4d decomposition for high-fidelity dynamic reconstruction and rendering

    Ruizhi Shao, Zerong Zheng, Hanzhang Tu, Boning Liu, Hongwen Zhang, and Yebin Liu. Tensor4d: Efficient neural 4d decomposition for high-fidelity dynamic reconstruction and rendering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16632– 16642, 2023. 2

  45. [45]

    MVDream: Multi-view Diffusion for 3D Generation

    Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d gen- eration.arXiv preprint arXiv:2308.16512, 2023. 2, 3

  46. [46]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020. 7 10

  47. [47]

    DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

    Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for effi- cient 3d content creation.arXiv preprint arXiv:2309.16653,

  48. [48]

    Non- rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video

    Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollh¨ofer, Christoph Lassner, and Christian Theobalt. Non- rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 12959–12970, 2021. 2

  49. [49]

    Template-free articulated neural point clouds for reposable view synthesis.Advances in Neural Information Processing Systems, 36:31621–31637, 2023

    Lukas Uzolas, Elmar Eisemann, and Petr Kellnhofer. Template-free articulated neural point clouds for reposable view synthesis.Advances in Neural Information Processing Systems, 36:31621–31637, 2023. 1, 4, 7

  50. [50]

    Superpoint gaussian splatting for real-time high-fidelity dynamic scene recon- struction

    Diwen Wan, Ruijie Lu, and Gang Zeng. Superpoint gaussian splatting for real-time high-fidelity dynamic scene recon- struction. InInternational Conference on Machine Learning, pages 49957–49972. PMLR, 2024. 2, 3

  51. [51]

    Template-free articulated gaussian splatting for real-time re- posable dynamic view synthesis.Advances in Neural Infor- mation Processing Systems, 37:62000–62023, 2024

    Diwen Wan, Yuxiang Wang, Ruijie Lu, and Gang Zeng. Template-free articulated gaussian splatting for real-time re- posable dynamic view synthesis.Advances in Neural Infor- mation Processing Systems, 37:62000–62023, 2024. 1, 3, 5, 6, 7

  52. [52]

    Root pose decomposition towards generic non-rigid 3d re- construction with monocular videos

    Yikai Wang, Yinpeng Dong, Fuchun Sun, and Xiao Yang. Root pose decomposition towards generic non-rigid 3d re- construction with monocular videos. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 13890–13900, 2023. 1

  53. [53]

    Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Si- moncelli. Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004. 6

  54. [54]

    4d gaussian splatting for real-time dynamic scene render- ing

    Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene render- ing. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 20310– 20320, 2024. 1, 2, 5, 6, 7

  55. [55]

    Magicpony: Learning ar- ticulated 3d animals in the wild

    Shangzhe Wu, Ruining Li, Tomas Jakab, Christian Rup- precht, and Andrea Vedaldi. Magicpony: Learning ar- ticulated 3d animals in the wild. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8792–8802, 2023. 3

  56. [56]

    CASA: Category-agnostic skeletal an- imal reconstruction

    Yuefan Wu*, Zeyuan Chen*, Shaowei Liu, Zhongzheng Ren, and Shenlong Wang. CASA: Category-agnostic skeletal an- imal reconstruction. InNeural Information Processing Sys- tems (NeurIPS), 2022. 3

  57. [57]

    Space-time neural irradiance fields for free-viewpoint video

    Wenqi Xian, Jia-Bin Huang, Johannes Kopf, and Changil Kim. Space-time neural irradiance fields for free-viewpoint video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9421–9431,

  58. [58]

    Comp4d: Llm-guided compositional 4d scene generation

    Dejia Xu, Hanwen Liang, Neel P Bhatt, Hezhen Hu, Hanxue Liang, Konstantinos N Plataniotis, and Zhangyang Wang. Comp4d: Llm-guided compositional 4d scene generation. arXiv preprint arXiv:2403.16993, 2024. 3

  59. [59]

    Rignet: Neural rigging for articulated characters

    Zhan Xu, Yang Zhou, Evangelos Kalogerakis, Chris Landreth, and Karan Singh. Rignet: Neural rigging for articulated characters. ACM Trans. on Graphics, 39, 2020. 3

  60. [60]

    Instant gaussian stream: Fast and generalizable streaming of dynamic scene reconstruction via gaussian splatting

    Jinbo Yan, Rui Peng, Zhiyan Wang, Luyang Tang, Jiayu Yang, Jie Liang, Jiahao Wu, and Ronggang Wang. Instant gaussian stream: Fast and generalizable streaming of dynamic scene reconstruction via gaussian splatting. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16520–16531, 2025. 1

  61. [61]

    Banmo: Building animatable 3d neural models from many casual videos

    Gengshan Yang, Minh Vo, Natalia Neverova, Deva Ramanan, Andrea Vedaldi, and Hanbyul Joo. Banmo: Building animatable 3d neural models from many casual videos. In CVPR, 2022. 3, 4

  62. [62]

    Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction

    Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20331–20341. IEEE, 2024. 2

  63. [63]

    Lassie: Learning articulated shape from sparse image ensemble via 3d part discovery

    Chun-Han Yao, Wei-Chih Hung, Yuanzhen Li, Michael Rubinstein, Ming-Hsuan Yang, and Varun Jampani. Lassie: Learning articulated shape from sparse image ensemble via 3d part discovery. In NeurIPS, 2022. 3

  64. [64]

    Riggs: Rigging of 3d gaussians for modeling articulated objects in videos

    Yuxin Yao, Zhi Deng, and Junhui Hou. Riggs: Rigging of 3d gaussians for modeling articulated objects in videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5592–5601, 2025. 1, 3, 4, 5, 6, 7

  65. [65]

    Stag4d: Spatial-temporal anchored generative 4d gaussians

    Yifei Zeng, Yanqin Jiang, Siyu Zhu, Yuanxun Lu, Youtian Lin, Hao Zhu, Weiming Hu, Xun Cao, and Yao Yao. Stag4d: Spatial-temporal anchored generative 4d gaussians. 2024. 3

  66. [66]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018. 6

  67. [67]

    Bags: Building animatable gaussian splatting from a monocular video with diffusion priors

    Tingyang Zhang, Qingzhe Gao, Weiyu Li, Libin Liu, and Baoquan Chen. Bags: Building animatable gaussian splatting from a monocular video with diffusion priors, 2024. 3

  68. [68]

    Animate124: Animating one image to 4d dynamic scene

    Yuyang Zhao, Zhiwen Yan, Enze Xie, Lanqing Hong, Zhenguo Li, and Gim Hee Lee. Animate124: Animating one image to 4d dynamic scene. arXiv preprint arXiv:2311.14603,

  69. [69]

    A unified approach for text- and image-guided 4d scene generation

    Yufeng Zheng, Xueting Li, Koki Nagano, Sifei Liu, Otmar Hilliges, and Shalini De Mello. A unified approach for text- and image-guided 4d scene generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7300–7309, 2024. 3

  70. [70]

    3d menagerie: Modeling the 3d shape and pose of animals

    Silvia Zuffi, Angjoo Kanazawa, David W Jacobs, and Michael J Black. 3d menagerie: Modeling the 3d shape and pose of animals. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6365–6373,