SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting

Jun-Jee Chao; Volkan Isler

arxiv: 2601.00285 · v2 · submitted 2026-01-01 · 💻 cs.CV

Pith reviewed 2026-05-16 18:10 UTC · model grok-4.3

keywords: 4D reconstruction, Gaussian splatting, sparse view reconstruction, skeleton-driven deformation, dynamic scene, motion estimation, deformation field

The pith

A skeleton-driven deformation field lets Gaussian splatting reconstruct moving objects accurately from sparse camera views over time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SV-GS to reconstruct dynamic scenes when both viewpoints and time samples are sparse, a setting where standard dense multi-view video methods fail. It initializes with a rough skeleton graph and static reconstruction, then optimizes a deformation field that separates coarse, time-varying joint poses from finer, time-independent deformations. Only the joint poses change across frames, which supports smooth motion interpolation while retaining geometric detail learned from the available views. Experiments show gains of up to 34 percent PSNR over prior sparse methods on synthetic data and performance comparable to dense monocular video techniques on real scenes despite using far fewer frames. The initial static input can later be replaced by a diffusion prior, relaxing the setup for practical use.

Core claim

SV-GS simultaneously estimates a deformation model and the object's motion over time under sparse observations by optimizing a skeleton-driven deformation field composed of a coarse skeleton joint pose estimator and a module for fine-grained deformations. By making only the joint pose estimator time-dependent, the model enables smooth motion interpolation while preserving learned geometric details.
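The interpolation claim can be illustrated with a minimal sketch. This is not the authors' estimator, which is optimized from images; it is a hand-written stand-in that assumes each joint pose is a single scalar angle stored at a few sparse timestamps. The name `interpolate_pose`, the keyframe layout, and the choice of linear interpolation are all hypothetical.

```python
from bisect import bisect_right


def interpolate_pose(t, times, poses):
    """Linearly interpolate per-joint angles at time t.

    times: sorted list of observed timestamps (assumed layout).
    poses: one list of joint angles per timestamp (assumed layout).
    """
    if t <= times[0]:
        return list(poses[0])
    if t >= times[-1]:
        return list(poses[-1])
    i = bisect_right(times, t)          # first keyframe strictly after t
    t0, t1 = times[i - 1], times[i]
    a = (t - t0) / (t1 - t0)            # blend factor in [0, 1)
    return [(1 - a) * p0 + a * p1 for p0, p1 in zip(poses[i - 1], poses[i])]


# Hypothetical sparse observations: three timestamps, two joints.
times = [0.0, 1.0, 3.0]
poses = [[0.0, 0.0], [1.0, 2.0], [3.0, 2.0]]
pose_mid = interpolate_pose(2.0, times, poses)
```

Because only the pose varies with time in this sketch, querying an unobserved timestamp changes nothing about the stored geometry, which is the property the core claim relies on.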

What carries the argument

Skeleton-driven deformation field consisting of a time-dependent coarse joint pose estimator and a time-independent fine-grained deformation module that guides Gaussian splatting optimization.
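A toy version of such a field, under loud assumptions, may help fix ideas. The sketch below works in 2D, gives each joint an independent rotation about its pivot (no kinematic chain), blends the results with fixed skinning weights, and applies the time-independent fine offset in canonical space before skinning. The page does not state the paper's actual formulation, so the function names, the ordering of the two stages, and the blending rule are illustrative only.

```python
import math


def rot2d(theta):
    """Return a 2x2 rotation matrix as nested tuples."""
    c, s = math.cos(theta), math.sin(theta)
    return ((c, -s), (s, c))


def apply(R, p):
    """Multiply a 2x2 matrix by a 2D point."""
    return (R[0][0] * p[0] + R[0][1] * p[1],
            R[1][0] * p[0] + R[1][1] * p[1])


def deform(point, fine_offset, weights, pivots, angles):
    """Deform one canonical point.

    fine_offset: time-independent correction (assumed to act in canonical space).
    weights:     skinning weights, assumed to sum to 1.
    pivots:      joint positions in canonical space.
    angles:      time-dependent joint rotations for the queried frame.
    """
    px = point[0] + fine_offset[0]
    py = point[1] + fine_offset[1]
    out_x = out_y = 0.0
    for w, pivot, theta in zip(weights, pivots, angles):
        # Rotate about the joint pivot, then blend by the skinning weight.
        rx, ry = apply(rot2d(theta), (px - pivot[0], py - pivot[1]))
        out_x += w * (rx + pivot[0])
        out_y += w * (ry + pivot[1])
    return (out_x, out_y)
```

Only `angles` would change from frame to frame here; `fine_offset` and `weights` stay fixed, mirroring the coarse/fine split described above.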

If this is right

Outperforms existing sparse-observation methods by up to 34 percent PSNR on synthetic datasets.
Matches performance of dense monocular video methods on real-world data while using significantly fewer frames.
The initial static reconstruction input can be replaced by a diffusion-based generative prior for greater practicality.
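For readers who want to check what a percentage PSNR gain means, the sketch below computes PSNR from pixel values and a relative gain between two PSNR figures. The page does not say how the 34 percent figure was computed; treating it as a relative increase in decibel values is an assumption, and the numbers in the usage lines are hypothetical, not results from the paper.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for two equal-length pixel sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    mse = sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)


def relative_gain_percent(ours_db, baseline_db):
    """Relative improvement of one PSNR value over another (assumed definition)."""
    return 100.0 * (ours_db - baseline_db) / baseline_db


# Hypothetical values chosen only to illustrate the arithmetic.
gain = relative_gain_percent(26.8, 20.0)
```

Since PSNR is logarithmic, a relative gain in dB is not the same as a relative reduction in error, which is one reason exact per-scene values are more informative than a single percentage.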

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The coarse-fine separation may generalize to other dynamic representations such as neural radiance fields or meshes.
Existing surveillance camera networks could supply the sparse inputs needed for 4D reconstruction without new dense capture hardware.
Fully automatic skeleton extraction could remove the remaining manual initialization step.

Load-bearing premise

A rough skeleton graph and initial static reconstruction are available to guide motion estimation under otherwise ill-posed sparse observations.

What would settle it

Run the method on a synthetic dynamic sequence with ground-truth 4D data but supply an intentionally inaccurate or missing skeleton graph and measure whether PSNR falls below competing skeleton-free baselines.
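One way to produce the "intentionally inaccurate" skeleton for such a test is to add controlled noise to the joint positions. The helper below is a hypothetical sketch: the name `perturb_skeleton`, the representation of a skeleton as a list of 3D joint tuples, and the Gaussian noise model are assumptions, not part of the paper.

```python
import random


def perturb_skeleton(joints, sigma, seed=0):
    """Return a copy of the joints with zero-mean Gaussian noise of std sigma.

    joints: list of (x, y, z) tuples (assumed representation).
    seed:   fixed so that every method under comparison sees the same corruption.
    """
    rng = random.Random(seed)
    return [tuple(c + rng.gauss(0.0, sigma) for c in joint) for joint in joints]


# Hypothetical three-joint skeleton and a sweep of corruption levels.
skeleton = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0)]
sweep = {sigma: perturb_skeleton(skeleton, sigma) for sigma in (0.0, 0.05, 0.2)}
```

Sweeping `sigma` and recording PSNR at each level would show how quickly reconstruction quality degrades as the skeleton becomes less reliable.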

Figures

Figures reproduced from arXiv: 2601.00285 by Jun-Jee Chao, Volkan Isler.

**Figure 1.** We study the problem of 4D reconstruction from sparse observations. Our method takes the following as input: (a) A set of …

**Figure 2.** Comparison of input configurations across dynamic re…

**Figure 3.** Given canonical 3D Gaussians and an input skeleton, …

**Figure 4.** Qualitative results on the D-NeRF dataset …

**Figure 7.** Qualitative result on the real-world ZJU-MoCap dataset. …

**Figure 9.** Results on the camel scene from the in-the-wild DAVIS …

**Figure 10.** Qualitative comparison of results with and without …

Original abstract

Reconstructing a dynamic target moving over a large area is challenging. Standard approaches for dynamic object reconstruction require dense coverage in both the viewing space and the temporal dimension, typically relying on multi-view videos captured at each time step. However, such setups are only possible in constrained environments. In real-world scenarios, observations are often sparse over time and captured sparsely from diverse viewpoints (e.g., from security cameras), making dynamic reconstruction highly ill-posed. We present SV-GS, a framework that simultaneously estimates a deformation model and the object's motion over time under sparse observations. To initialize SV-GS, we leverage a rough skeleton graph and an initial static reconstruction as inputs to guide motion estimation. (Later, we show that this input requirement can be relaxed.) Our method optimizes a skeleton-driven deformation field composed of a coarse skeleton joint pose estimator and a module for fine-grained deformations. By making only the joint pose estimator time-dependent, our model enables smooth motion interpolation while preserving learned geometric details. Experiments on synthetic datasets show that our method outperforms existing approaches under sparse observations by up to 34% in PSNR, and achieves comparable performance to dense monocular video methods on real-world datasets despite using significantly fewer frames. Moreover, we demonstrate that the input initial static reconstruction can be replaced by a diffusion-based generative prior, making our method more practical for real-world scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SV-GS anchors Gaussian splatting to a skeleton graph for sparse 4D reconstruction and reports clear PSNR gains, but the skeleton input looks like the real limiter on practical use.

The paper's core move is to split the deformation model into a time-dependent coarse joint-pose estimator and a static fine-detail module, then optimize both together under sparse views. That split lets the method interpolate motion smoothly while keeping learned geometry intact, and the authors show it beats prior sparse-view baselines by up to 34% PSNR on synthetic data while matching dense monocular video results on real scenes with far fewer frames. They also demonstrate that the initial static reconstruction can be swapped for a diffusion prior, which removes one manual step and makes the pipeline more realistic for uncontrolled settings like security cameras.

Referee Report

2 major / 2 minor

Summary. The paper proposes SV-GS, a framework for 4D dynamic reconstruction from sparse multi-view observations. It initializes with a rough skeleton graph plus static reconstruction (later relaxable to a diffusion prior), then optimizes a skeleton-driven Gaussian Splatting deformation model that separates time-dependent coarse joint-pose estimation from static fine-grained deformations. The central empirical claim is up to 34% PSNR improvement over prior methods on synthetic sparse-view data and performance parity with dense monocular video baselines on real data despite using far fewer frames.

Significance. If the quantitative gains and the diffusion-prior relaxation hold under rigorous controls, the work would meaningfully advance practical 4D reconstruction outside controlled capture rigs. The separation of coarse time-dependent pose from static detail is a clean modeling choice that could generalize to other sparse dynamic settings.

major comments (2)

[Abstract / Experiments] Abstract and Experiments section: the headline 34% PSNR gain is reported without error bars, per-scene variance, or ablation tables isolating the skeleton-graph contribution from the deformation field. Because the joint-pose stage is anchored by the skeleton input, the absence of an ablation that replaces the provided skeleton with automatic pose estimation on identical sparse inputs leaves open whether the reported numbers reflect an oracle initialization rather than a fully automatic pipeline.
[Methods] Methods: the deformation model is defined as the sum of a time-dependent coarse joint-pose estimator and a static fine-grained module. No derivation or loss-term analysis is supplied showing that this split is necessary for stability under the stated sparsity levels; an ablation that makes the fine module also time-dependent (or removes the skeleton anchor entirely) would be required to substantiate the modeling claim.
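The per-scene reporting requested in the first comment amounts to a few lines of arithmetic. The sketch below shows one conventional way to turn per-scene PSNR values into a mean, a sample standard deviation, and a standard error. The scene names and values are hypothetical placeholders, not numbers from the paper.

```python
import math
import statistics


def summarize(per_scene_psnr):
    """Return (mean, sample std, standard error) over per-scene PSNR values."""
    values = list(per_scene_psnr.values())
    mean = statistics.mean(values)
    std = statistics.stdev(values)           # sample standard deviation (n - 1)
    sem = std / math.sqrt(len(values))       # standard error of the mean
    return mean, std, sem


# Hypothetical per-scene results used only to exercise the function.
example = {"scene_a": 30.0, "scene_b": 32.0, "scene_c": 34.0}
mean, std, sem = summarize(example)
```

Reporting the standard error alongside the mean makes it possible to judge whether a gap between two methods exceeds scene-to-scene variation.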

minor comments (2)

[Abstract] Abstract: the phrase 'significantly fewer frames' should be quantified (exact frame counts and view counts for both SV-GS and the dense monocular baselines).
[Experiments] The diffusion-prior relaxation is mentioned only qualitatively; a short table comparing PSNR when the static initialization is replaced by the generative prior versus the provided static mesh would strengthen the practicality claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the empirical presentation and modeling justification without altering the core claims.

Referee: [Abstract / Experiments] Abstract and Experiments section: the headline 34% PSNR gain is reported without error bars, per-scene variance, or ablation tables isolating the skeleton-graph contribution from the deformation field. Because the joint-pose stage is anchored by the skeleton input, the absence of an ablation that replaces the provided skeleton with automatic pose estimation on identical sparse inputs leaves open whether the reported numbers reflect an oracle initialization rather than a fully automatic pipeline.

Authors: We agree that error bars, per-scene variance, and targeted ablations would strengthen the results section. In the revision we will add standard error bars to all quantitative tables, report per-scene PSNR values, and include a new ablation that substitutes the provided skeleton graph with an off-the-shelf automatic pose estimator (e.g., a recent monocular 3D pose method) while keeping all other inputs and sparsity levels identical. This will clarify that the reported gains are not solely attributable to oracle skeleton initialization. The diffusion-prior relaxation already demonstrated in the manuscript applies to the static reconstruction; the new ablation will extend the same spirit to the skeleton component.

Revision: yes

Referee: [Methods] Methods: the deformation model is defined as the sum of a time-dependent coarse joint-pose estimator and a static fine-grained module. No derivation or loss-term analysis is supplied showing that this split is necessary for stability under the stated sparsity levels; an ablation that makes the fine module also time-dependent (or removes the skeleton anchor entirely) would be required to substantiate the modeling claim.

Authors: We will expand the Methods section with a short derivation of the composite deformation field and an analysis of the loss terms that shows why anchoring only the coarse joint-pose stage to time-dependent parameters improves stability under the sparsity regimes considered. We will also add two new ablations: (1) making the fine-grained module time-dependent as well, and (2) removing the skeleton anchor entirely (relying solely on the diffusion prior). These experiments will be reported with the same metrics and sparsity settings used in the main tables, allowing direct comparison to the original split.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper describes an optimization-based framework that takes a rough skeleton graph and initial static reconstruction (or diffusion prior) as explicit inputs to initialize and guide motion estimation under sparse views. Performance claims rest on empirical results from synthetic and real datasets rather than any closed-form derivation or prediction that reduces by construction to fitted parameters. No self-definitional steps, fitted inputs renamed as predictions, load-bearing self-citations, or ansatz smuggling appear in the provided text; the approach is a standard engineering pipeline validated externally and therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The method implicitly assumes a usable skeleton graph and static initialization exist or can be generated.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 3 internal anchors

[1]

Score distillation sampling with learned manifold cor- rective

Thiemo Alldieck, Nikos Kolotouros, and Cristian Sminchis- escu. Score distillation sampling with learned manifold cor- rective. InEuropean Conference on Computer Vision, pages 1–18, 2024. 8

work page 2024
[2]

PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation

Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael V oznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, An- jali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalam- barkar, Laurent Kirsch, Mich...

work page 2024
[3]

4d-fy: Text-to-4d generation using hybrid score dis- tillation sampling

Sherwin Bahmani, Ivan Skorokhodov, Victor Rong, Gordon Wetzstein, Leonidas Guibas, Peter Wonka, Sergey Tulyakov, Jeong Joon Park, Andrea Tagliasacchi, and David B Lin- dell. 4d-fy: Text-to-4d generation using hybrid score dis- tillation sampling. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 7996–8006, 2024. 3

work page 2024
[4]

Immersive light field video with a layered mesh representation.ACM Trans- actions on Graphics (TOG), 39(4):86–1, 2020

Michael Broxton, John Flynn, Ryan Overbeck, Daniel Erick- son, Peter Hedman, Matthew Duvall, Jason Dourgarian, Jay Busch, Matt Whalen, and Paul Debevec. Immersive light field video with a layered mesh representation.ACM Trans- actions on Graphics (TOG), 39(4):86–1, 2020. 2

work page 2020
[5]

Hexplane: A fast representa- tion for dynamic scenes

Ang Cao and Justin Johnson. Hexplane: A fast representa- tion for dynamic scenes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 130–141, 2023. 1, 2

work page 2023
[6]

Part seg- mentation and motion estimation for articulated objects with dynamic 3d gaussians.arXiv preprint arXiv:2506.22718,

Jun-Jee Chao, Qingyuan Jiang, and V olkan Isler. Part seg- mentation and motion estimation for articulated objects with dynamic 3d gaussians.arXiv preprint arXiv:2506.22718,

work page arXiv
[7]

A kinematic no- tation for lower-pair mechanisms based on matrices

Jacques Denavit and Richard S Hartenberg. A kinematic no- tation for lower-pair mechanisms based on matrices. 1955. 4

work page 1955
[8]

Fusion4d: Real-time performance capture of challeng- ing scenes.ACM Transactions on Graphics (ToG), 35(4): 1–13, 2016

Mingsong Dou, Sameh Khamis, Yury Degtyarev, Philip Davidson, Sean Ryan Fanello, Adarsh Kowdle, Sergio Orts Escolano, Christoph Rhemann, David Kim, Jonathan Taylor, et al. Fusion4d: Real-time performance capture of challeng- ing scenes.ACM Transactions on Graphics (ToG), 35(4): 1–13, 2016. 2

work page 2016
[9]

K-planes: Explicit radiance fields in space, time, and appearance

Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K-planes: Explicit radiance fields in space, time, and appearance. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 12479–12488, 2023. 2

work page 2023
[10]

Monocular dynamic view synthesis: A reality check.Advances in Neural Information Processing Systems, 35:33768–33780, 2022

Hang Gao, Ruilong Li, Shubham Tulsiani, Bryan Russell, and Angjoo Kanazawa. Monocular dynamic view synthesis: A reality check.Advances in Neural Information Processing Systems, 35:33768–33780, 2022. 2

work page 2022
[11]

Forward flow for novel view synthesis of dynamic scenes

Xiang Guo, Jiadai Sun, Yuchao Dai, Guanying Chen, Xiao- qing Ye, Xiao Tan, Errui Ding, Yumeng Zhang, and Jingdong Wang. Forward flow for novel view synthesis of dynamic scenes. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 16022–16033, 2023. 2

work page 2023
[12]

Sc-gs: Sparse-controlled gaussian splatting for editable dynamic scenes

Yi-Hua Huang, Yang-Tian Sun, Ziyi Yang, Xiaoyang Lyu, Yan-Pei Cao, and Xiaojuan Qi. Sc-gs: Sparse-controlled gaussian splatting for editable dynamic scenes. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4220–4230, 2024. 1, 2, 3, 6

work page 2024
[13]

3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42 (4), 2023

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42 (4), 2023. 2, 3, 5

work page 2023
[14]

Dynmf: Neural motion factorization for real-time dynamic view synthesis with 3d gaussian splatting

Agelos Kratimenos, Jiahui Lei, and Kostas Daniilidis. Dynmf: Neural motion factorization for real-time dynamic view synthesis with 3d gaussian splatting. InEuropean Con- ference on Computer Vision, pages 252–269. Springer, 2024. 2

work page 2024
[15]

Gart: Gaussian articulated template mod- els

Jiahui Lei, Yufu Wang, Georgios Pavlakos, Lingjie Liu, and Kostas Daniilidis. Gart: Gaussian articulated template mod- els. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 19876–19887,

work page
[16]

Articulated kinematics distillation from video diffusion models

Xuan Li, Qianli Ma, Tsung-Yi Lin, Yongxin Chen, Chen- fanfu Jiang, Ming-Yu Liu, and Donglai Xiang. Articulated kinematics distillation from video diffusion models. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 17571–17581, 2025. 3, 4

work page 2025
[17]

Neural scene flow fields for space-time view synthesis of dy- namic scenes

Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dy- namic scenes. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6498– 6508, 2021. 2

work page 2021
[18]

Align your gaussians: Text-to-4d with dynamic 3d gaussians and composed diffusion models

Huan Ling, Seung Wook Kim, Antonio Torralba, Sanja Fi- dler, and Karsten Kreis. Align your gaussians: Text-to-4d with dynamic 3d gaussians and composed diffusion models. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 8576–8588, 2024. 3

work page 2024
[19]

Lepard: Learning explicit part dis- covery for 3d articulated shape reconstruction

Di Liu, Anastasis Stathopoulos, Qilong Zhangli, Yunhe Gao, and Dimitris Metaxas. Lepard: Learning explicit part dis- covery for 3d articulated shape reconstruction. InAdvances in Neural Information Processing Systems, pages 54187– 54198. Curran Associates, Inc., 2023. 3

work page 2023
[20]

Dynamic gaus- sians mesh: Consistent mesh reconstruction from dynamic scenes

Isabella Liu, Hao Su, and Xiaolong Wang. Dynamic gaus- sians mesh: Consistent mesh reconstruction from dynamic scenes. InThe Thirteenth International Conference on Learning Representations, 2025. 6, 7

work page 2025
[21]

Riganything: Template-free autoregressive rigging for diverse 3d assets

Isabella Liu, Zhan Xu, Yifan Wang, Hao Tan, Zexiang Xu, Xiaolong Wang, Hao Su, and Zifan Shi. Riganything: Template-free autoregressive rigging for diverse 3d assets. ACM Transactions on Graphics (TOG), 44(4):1–12, 2025. 3

work page 2025
[22]

One-2-3-45++: Fast single image to 3d objects with consistent multi-view generation and 3d dif- fusion

Minghua Liu, Ruoxi Shi, Linghao Chen, Zhuoyang Zhang, Chao Xu, Xinyue Wei, Hansheng Chen, Chong Zeng, Ji- 9 ayuan Gu, and Hao Su. One-2-3-45++: Fast single image to 3d objects with consistent multi-view generation and 3d dif- fusion. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 10072–10083,

work page
[23]

MoDGS: Dy- namic gaussian splatting from casually-captured monocular videos with depth priors

Qingming LIU, Yuan Liu, Jiepeng Wang, Xianqiang Lyu, Peng Wang, Wenping Wang, and Junhui Hou. MoDGS: Dy- namic gaussian splatting from casually-captured monocular videos with depth priors. InThe Thirteenth International Conference on Learning Representations, 2025. 2

work page 2025
[24]

Zero-1-to- 3: Zero-shot one image to 3d object

Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tok- makov, Sergey Zakharov, and Carl V ondrick. Zero-1-to- 3: Zero-shot one image to 3d object. InProceedings of the IEEE/CVF international conference on computer vision, pages 9298–9309, 2023. 2, 3, 7

work page 2023
[25]

Build- ing rearticulable models for arbitrary 3d objects from 4d point clouds

Shaowei Liu, Saurabh Gupta, and Shenlong Wang. Build- ing rearticulable models for arbitrary 3d objects from 4d point clouds. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21138– 21147, 2023. 1

work page 2023
[26]

Neural vol- umes: learning dynamic renderable volumes from images

Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural vol- umes: learning dynamic renderable volumes from images. ACM Transactions on Graphics (TOG), 38(4):1–14, 2019. 2

work page 2019
[27]

Matthew Loper, Naureen Mahmood, Javier Romero, Ger- ard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model.ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, 2015. 3

work page 2015
[28]

Dynamic 3d gaussians: Tracking by per- sistent dynamic view synthesis

Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3d gaussians: Tracking by per- sistent dynamic view synthesis. In2024 International Con- ference on 3D Vision (3DV), pages 800–809. IEEE, 2024. 2

work page 2024
[29]

Score distillation via reparametrized ddim.Advances in Neural Information Pro- cessing Systems, 37:26011–26044, 2024

Artem Lukoianov, Haitz S’aez de Oc’ariz Borde, Kristjan Greenewald, Vitor Guizilini, Timur Bagautdinov, Vincent Sitzmann, and Justin M Solomon. Score distillation via reparametrized ddim.Advances in Neural Information Pro- cessing Systems, 37:26011–26044, 2024. 8

work page 2024
[30]

Joint-dependent local deformations for hand an- imation and object grasping

Nadia Magnenat-Thalmann, Richard Laperri `ere, and Daniel Thalmann. Joint-dependent local deformations for hand an- imation and object grasping. InProceedings on Graphics interface’88, pages 26–33, 1989. 4

work page 1989
[31]

Jacobs, Alexei A

David McAllister, Songwei Ge, Jia-Bin Huang, David W. Jacobs, Alexei A. Efros, Aleksander Holynski, and Angjoo Kanazawa. Rethinking score distillation as a bridge between image distributions. InAdvances in Neural Information Pro- cessing Systems, 2024. 8

work page 2024
[32]

Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021. 2, 4

work page 2021
[33]

Watch it move: Unsupervised discovery of 3d joints for re-posing of articulated objects

Atsuhiro Noguchi, Umar Iqbal, Jonathan Tremblay, Tatsuya Harada, and Orazio Gallo. Watch it move: Unsupervised discovery of 3d joints for re-posing of articulated objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3677–3687, 2022. 3

work page 2022
[34]

Kingma and Jimmy Ba

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. InProceedings of the 3rd In- ternational Conference on Learning Representations (ICLR 2015), 2015. 6

work page 2015
[35]

Nerfies: Deformable neural radiance fields

Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. InProceedings of the IEEE/CVF international conference on computer vision, pages 5865–5874, 2021. 2

work page 2021
[36]

Hypernerf: a higher- dimensional representation for topologically varying neural radiance fields.ACM Transactions on Graphics (TOG), 40 (6):1–12, 2021

Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin- Brualla, and Steven M Seitz. Hypernerf: a higher- dimensional representation for topologically varying neural radiance fields.ACM Transactions on Graphics (TOG), 40 (6):1–12, 2021. 2

work page 2021
[37]

Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans

Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9054–9063, 2021. 6, 7

work page 2021
[38]

A benchmark dataset and evaluation methodology for video object segmentation

Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine- Hornung. A benchmark dataset and evaluation methodology for video object segmentation. InThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 6, 7, 8

work page 2016
[39]

Dreamfusion: Text-to-3d using 2d diffusion

Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Milden- hall. Dreamfusion: Text-to-3d using 2d diffusion. InThe Eleventh International Conference on Learning Representa- tions, 2022. 3, 7

work page 2022
[40]

D-nerf: Neural radiance fields for dynamic scenes

Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 10318–10327, 2021. 1, 2, 5, 6, 7

work page 2021
[41]

Em- bodied hands: Modeling and capturing hands and bodies to- gether.ACM Transactions on Graphics, 36(6), 2017

Javier Romero, Dimitris Tzionas, and Michael J Black. Em- bodied hands: Modeling and capturing hands and bodies to- gether.ACM Transactions on Graphics, 36(6), 2017. 3

work page 2017
[42]

Structure-from-motion revisited

Johannes Lutz Sch ¨onberger and Jan-Michael Frahm. Structure-from-motion revisited. InConference on Com- puter Vision and Pattern Recognition (CVPR), 2016. 2

work page 2016
[43]

Pixelwise view selection for un- structured multi-view stereo

Johannes Lutz Sch ¨onberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for un- structured multi-view stereo. InEuropean Conference on Computer Vision (ECCV), 2016. 2

work page 2016
[44]

Tensor4d: Efficient neural 4d decomposition for high-fidelity dynamic reconstruction and rendering

Ruizhi Shao, Zerong Zheng, Hanzhang Tu, Boning Liu, Hongwen Zhang, and Yebin Liu. Tensor4d: Efficient neural 4d decomposition for high-fidelity dynamic reconstruction and rendering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16632– 16642, 2023. 2

work page 2023
[45]

MVDream: Multi-view Diffusion for 3D Generation

Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d gen- eration.arXiv preprint arXiv:2308.16512, 2023. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[46]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020. 7 10

work page internal anchor Pith review Pith/arXiv arXiv 2010
[47]

DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for effi- cient 3d content creation.arXiv preprint arXiv:2309.16653,

work page internal anchor Pith review Pith/arXiv arXiv
[48]

Non- rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video

Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollh¨ofer, Christoph Lassner, and Christian Theobalt. Non- rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 12959–12970, 2021. 2

work page 2021
[49]

Template-free articulated neural point clouds for reposable view synthesis.Advances in Neural Information Processing Systems, 36:31621–31637, 2023

Lukas Uzolas, Elmar Eisemann, and Petr Kellnhofer. Template-free articulated neural point clouds for reposable view synthesis.Advances in Neural Information Processing Systems, 36:31621–31637, 2023. 1, 4, 7

work page 2023
[50]

Superpoint gaussian splatting for real-time high-fidelity dynamic scene recon- struction

Diwen Wan, Ruijie Lu, and Gang Zeng. Superpoint gaussian splatting for real-time high-fidelity dynamic scene recon- struction. InInternational Conference on Machine Learning, pages 49957–49972. PMLR, 2024. 2, 3

work page 2024
[51]

Template-free articulated gaussian splatting for real-time re- posable dynamic view synthesis.Advances in Neural Infor- mation Processing Systems, 37:62000–62023, 2024

Diwen Wan, Yuxiang Wang, Ruijie Lu, and Gang Zeng. Template-free articulated gaussian splatting for real-time re- posable dynamic view synthesis.Advances in Neural Infor- mation Processing Systems, 37:62000–62023, 2024. 1, 3, 5, 6, 7

work page 2024
[52]

Root pose decomposition towards generic non-rigid 3d re- construction with monocular videos

Yikai Wang, Yinpeng Dong, Fuchun Sun, and Xiao Yang. Root pose decomposition towards generic non-rigid 3d re- construction with monocular videos. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 13890–13900, 2023. 1

work page 2023
[53]

Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Si- moncelli. Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004. 6

[54] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20310–20320, 2024. 1, 2, 5, 6, 7

[55] Shangzhe Wu, Ruining Li, Tomas Jakab, Christian Rupprecht, and Andrea Vedaldi. Magicpony: Learning articulated 3d animals in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8792–8802, 2023. 3

[56] Yuefan Wu*, Zeyuan Chen*, Shaowei Liu, Zhongzheng Ren, and Shenlong Wang. CASA: Category-agnostic skeletal animal reconstruction. In Neural Information Processing Systems (NeurIPS), 2022. 3

[57] Wenqi Xian, Jia-Bin Huang, Johannes Kopf, and Changil Kim. Space-time neural irradiance fields for free-viewpoint video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9421–9431,

[58] Dejia Xu, Hanwen Liang, Neel P Bhatt, Hezhen Hu, Hanxue Liang, Konstantinos N Plataniotis, and Zhangyang Wang. Comp4d: Llm-guided compositional 4d scene generation. arXiv preprint arXiv:2403.16993, 2024. 3

[59] Zhan Xu, Yang Zhou, Evangelos Kalogerakis, Chris Landreth, and Karan Singh. Rignet: Neural rigging for articulated characters. ACM Trans. on Graphics, 39, 2020. 3

[60] Jinbo Yan, Rui Peng, Zhiyan Wang, Luyang Tang, Jiayu Yang, Jie Liang, Jiahao Wu, and Ronggang Wang. Instant gaussian stream: Fast and generalizable streaming of dynamic scene reconstruction via gaussian splatting. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16520–16531, 2025. 1

[61] Gengshan Yang, Minh Vo, Natalia Neverova, Deva Ramanan, Andrea Vedaldi, and Hanbyul Joo. Banmo: Building animatable 3d neural models from many casual videos. In CVPR, 2022. 3, 4

[62] Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20331–20341. IEEE, 2024. 2

[63] Chun-Han Yao, Wei-Chih Hung, Yuanzhen Li, Michael Rubinstein, Ming-Hsuan Yang, and Varun Jampani. Lassie: Learning articulated shape from sparse image ensemble via 3d part discovery. In NeurIPS, 2022. 3

[64] Yuxin Yao, Zhi Deng, and Junhui Hou. Riggs: Rigging of 3d gaussians for modeling articulated objects in videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5592–5601, 2025. 1, 3, 4, 5, 6, 7

[65] Yifei Zeng, Yanqin Jiang, Siyu Zhu, Yuanxun Lu, Youtian Lin, Hao Zhu, Weiming Hu, Xun Cao, and Yao Yao. Stag4d: Spatial-temporal anchored generative 4d gaussians. 2024. 3

[66] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018. 6

[67] Tingyang Zhang, Qingzhe Gao, Weiyu Li, Libin Liu, and Baoquan Chen. Bags: Building animatable gaussian splatting from a monocular video with diffusion priors, 2024. 3

[68] Yuyang Zhao, Zhiwen Yan, Enze Xie, Lanqing Hong, Zhenguo Li, and Gim Hee Lee. Animate124: Animating one image to 4d dynamic scene. arXiv preprint arXiv:2311.14603,

[69] Yufeng Zheng, Xueting Li, Koki Nagano, Sifei Liu, Otmar Hilliges, and Shalini De Mello. A unified approach for text- and image-guided 4d scene generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7300–7309, 2024. 3

[70] Silvia Zuffi, Angjoo Kanazawa, David W Jacobs, and Michael J Black. 3d menagerie: Modeling the 3d shape and pose of animals. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6365–6373,


[1] Thiemo Alldieck, Nikos Kolotouros, and Cristian Sminchisescu. Score distillation sampling with learned manifold corrective. In European Conference on Computer Vision, pages 1–18, 2024. 8

[2] Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalambarkar, Laurent Kirsch, Mich... PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. 2024.

[3] Sherwin Bahmani, Ivan Skorokhodov, Victor Rong, Gordon Wetzstein, Leonidas Guibas, Peter Wonka, Sergey Tulyakov, Jeong Joon Park, Andrea Tagliasacchi, and David B Lindell. 4d-fy: Text-to-4d generation using hybrid score distillation sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7996–8006, 2024. 3

[4] Michael Broxton, John Flynn, Ryan Overbeck, Daniel Erickson, Peter Hedman, Matthew Duvall, Jason Dourgarian, Jay Busch, Matt Whalen, and Paul Debevec. Immersive light field video with a layered mesh representation. ACM Transactions on Graphics (TOG), 39(4):86–1, 2020. 2

[5] Ang Cao and Justin Johnson. Hexplane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 130–141, 2023. 1, 2

[6] Jun-Jee Chao, Qingyuan Jiang, and Volkan Isler. Part segmentation and motion estimation for articulated objects with dynamic 3d gaussians. arXiv preprint arXiv:2506.22718,

[7] Jacques Denavit and Richard S Hartenberg. A kinematic notation for lower-pair mechanisms based on matrices. 1955. 4

[8] Mingsong Dou, Sameh Khamis, Yury Degtyarev, Philip Davidson, Sean Ryan Fanello, Adarsh Kowdle, Sergio Orts Escolano, Christoph Rhemann, David Kim, Jonathan Taylor, et al. Fusion4d: Real-time performance capture of challenging scenes. ACM Transactions on Graphics (ToG), 35(4):1–13, 2016. 2

[9] Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K-planes: Explicit radiance fields in space, time, and appearance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12479–12488, 2023. 2

[10] Hang Gao, Ruilong Li, Shubham Tulsiani, Bryan Russell, and Angjoo Kanazawa. Monocular dynamic view synthesis: A reality check. Advances in Neural Information Processing Systems, 35:33768–33780, 2022. 2

[11] Xiang Guo, Jiadai Sun, Yuchao Dai, Guanying Chen, Xiaoqing Ye, Xiao Tan, Errui Ding, Yumeng Zhang, and Jingdong Wang. Forward flow for novel view synthesis of dynamic scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16022–16033, 2023. 2

[12] Yi-Hua Huang, Yang-Tian Sun, Ziyi Yang, Xiaoyang Lyu, Yan-Pei Cao, and Xiaojuan Qi. Sc-gs: Sparse-controlled gaussian splatting for editable dynamic scenes. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4220–4230, 2024. 1, 2, 3, 6

[13] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023. 2, 3, 5

[14] Agelos Kratimenos, Jiahui Lei, and Kostas Daniilidis. Dynmf: Neural motion factorization for real-time dynamic view synthesis with 3d gaussian splatting. In European Conference on Computer Vision, pages 252–269. Springer, 2024. 2

[15] Jiahui Lei, Yufu Wang, Georgios Pavlakos, Lingjie Liu, and Kostas Daniilidis. Gart: Gaussian articulated template models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19876–19887,

[16] Xuan Li, Qianli Ma, Tsung-Yi Lin, Yongxin Chen, Chenfanfu Jiang, Ming-Yu Liu, and Donglai Xiang. Articulated kinematics distillation from video diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17571–17581, 2025. 3, 4

[17] Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6498–6508, 2021. 2

[18] Huan Ling, Seung Wook Kim, Antonio Torralba, Sanja Fidler, and Karsten Kreis. Align your gaussians: Text-to-4d with dynamic 3d gaussians and composed diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8576–8588, 2024. 3

[19] Di Liu, Anastasis Stathopoulos, Qilong Zhangli, Yunhe Gao, and Dimitris Metaxas. Lepard: Learning explicit part discovery for 3d articulated shape reconstruction. In Advances in Neural Information Processing Systems, pages 54187–54198. Curran Associates, Inc., 2023. 3

[20] Isabella Liu, Hao Su, and Xiaolong Wang. Dynamic gaussians mesh: Consistent mesh reconstruction from dynamic scenes. In The Thirteenth International Conference on Learning Representations, 2025. 6, 7

[21] Isabella Liu, Zhan Xu, Yifan Wang, Hao Tan, Zexiang Xu, Xiaolong Wang, Hao Su, and Zifan Shi. Riganything: Template-free autoregressive rigging for diverse 3d assets. ACM Transactions on Graphics (TOG), 44(4):1–12, 2025. 3

[22] Minghua Liu, Ruoxi Shi, Linghao Chen, Zhuoyang Zhang, Chao Xu, Xinyue Wei, Hansheng Chen, Chong Zeng, Jiayuan Gu, and Hao Su. One-2-3-45++: Fast single image to 3d objects with consistent multi-view generation and 3d diffusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10072–10083,

[23] Qingming Liu, Yuan Liu, Jiepeng Wang, Xianqiang Lyu, Peng Wang, Wenping Wang, and Junhui Hou. MoDGS: Dynamic gaussian splatting from casually-captured monocular videos with depth priors. In The Thirteenth International Conference on Learning Representations, 2025. 2

[24] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9298–9309, 2023. 2, 3, 7

[25] Shaowei Liu, Saurabh Gupta, and Shenlong Wang. Building rearticulable models for arbitrary 3d objects from 4d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21138–21147, 2023. 1

[26] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: learning dynamic renderable volumes from images. ACM Transactions on Graphics (TOG), 38(4):1–14, 2019. 2

[27] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, 2015. 3

[28] Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. In 2024 International Conference on 3D Vision (3DV), pages 800–809. IEEE, 2024. 2

[29] Artem Lukoianov, Haitz Sáez de Ocáriz Borde, Kristjan Greenewald, Vitor Guizilini, Timur Bagautdinov, Vincent Sitzmann, and Justin M Solomon. Score distillation via reparametrized ddim. Advances in Neural Information Processing Systems, 37:26011–26044, 2024. 8

[30] Nadia Magnenat-Thalmann, Richard Laperrière, and Daniel Thalmann. Joint-dependent local deformations for hand animation and object grasping. In Proceedings on Graphics interface ’88, pages 26–33, 1989. 4

[31] David McAllister, Songwei Ge, Jia-Bin Huang, David W. Jacobs, Alexei A. Efros, Aleksander Holynski, and Angjoo Kanazawa. Rethinking score distillation as a bridge between image distributions. In Advances in Neural Information Processing Systems, 2024. 8

[32] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021. 2, 4

[33] Atsuhiro Noguchi, Umar Iqbal, Jonathan Tremblay, Tatsuya Harada, and Orazio Gallo. Watch it move: Unsupervised discovery of 3d joints for re-posing of articulated objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3677–3687, 2022. 3

[34] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), 2015. 6

[35] Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5865–5874, 2021. 2

[36] Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin-Brualla, and Steven M Seitz. Hypernerf: a higher-dimensional representation for topologically varying neural radiance fields. ACM Transactions on Graphics (TOG), 40(6):1–12, 2021. 2

[37] Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9054–9063, 2021. 6, 7

[38] Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 6, 7, 8

[39] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In The Eleventh International Conference on Learning Representations, 2022. 3, 7

[40] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10318–10327, 2021. 1, 2, 5, 6, 7

[41] Javier Romero, Dimitris Tzionas, and Michael J Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics, 36(6), 2017. 3

[42] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 2

[43] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016. 2

[44] Ruizhi Shao, Zerong Zheng, Hanzhang Tu, Boning Liu, Hongwen Zhang, and Yebin Liu. Tensor4d: Efficient neural 4d decomposition for high-fidelity dynamic reconstruction and rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16632–16642, 2023. 2

[45] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023. 2, 3

[46] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020. 7

[47] Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653,
