arxiv: 2504.19584 · v3 · pith:JBG6TWGTnew · submitted 2025-04-28 · 💻 cs.CV

ShowMak3r: Compositional TV Show Reconstruction

Pith reviewed 2026-05-22 17:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords TV show reconstruction · dynamic radiance fields · compositional 3D reconstruction · actor tracking · shot matching · face fitting · scene editing · Sitcoms3D

The pith

ShowMak3r reconstructs TV show videos into dynamic 3D radiance fields that support new camera views and actor edits at different times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ShowMak3r as a full pipeline for turning entertainment video clips into editable 3D scenes. It focuses on the practical difficulties of actor overlaps, shifting facial expressions, crowded sets, narrow camera moves, and sudden cuts that break standard reconstruction methods. A 3DLocator step uses depth information to position people on the stage and fills in missing poses through interpolation. A ShotMatcher step follows the same actors when the camera jumps. A separate face network updates expressions on the fly. The result is a radiance field that can be reassembled with fresh viewpoints and used for operations such as moving performers, inserting or deleting them, or changing their posture.
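The idea of filling in missing poses through interpolation can be pictured with a minimal sketch. It assumes each frame's pose is a flat list of floats and that occluded frames are marked None. The data layout and the name fill_missing_poses are illustrative assumptions, and plain linear blending is a stand-in, not the paper's actual interpolation scheme.

```python
from typing import List, Optional

Pose = List[float]


def fill_missing_poses(track: List[Optional[Pose]]) -> List[Optional[Pose]]:
    """Linearly interpolate None entries that lie between two observed poses.

    Hypothetical helper: leading or trailing gaps stay None because they have
    an observed pose on one side only (filling them would be extrapolation).
    """
    filled = list(track)
    known = [i for i, p in enumerate(track) if p is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            w = (i - left) / span  # 0 at the left pose, 1 at the right pose
            filled[i] = [(1 - w) * a + w * b
                         for a, b in zip(track[left], track[right])]
    return filled
```

Linear blending of raw parameters is only reasonable for short gaps; rotation parameters in particular would normally call for a rotation-aware blend.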

Core claim

ShowMak3r is a comprehensive reconstruction pipeline that allows the editing of scenes like how video clips are made in a production control room. In ShowMak3r, a 3DLocator module locates recovered actors on the stage using depth prior and estimates unseen human poses via interpolation. The proposed ShotMatcher module then tracks the actors under shot changes. Furthermore, ShowMak3r introduces a face-fitting network that dynamically recovers the actors' expressions. Experiments on Sitcoms3D dataset show that our pipeline can reassemble TV show scenes with new cameras at different timestamps.

What carries the argument

The ShowMak3r pipeline, which integrates 3DLocator for depth-based actor placement and pose interpolation, ShotMatcher for cross-shot tracking, and a dynamic face-fitting network to recover expressions, builds an editable dynamic radiance field from multi-shot TV video.
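Cross-shot tracking with a matching threshold can be sketched as follows. The sketch assumes each actor is summarized by a small feature vector and pairs actors greedily by Euclidean distance, rejecting any pair above a threshold. The feature vectors, the greedy strategy, the threshold value, and the name match_actors are illustrative assumptions, not the ShotMatcher implementation.

```python
import math
from typing import Dict, Sequence


def match_actors(
    prev_shot: Dict[str, Sequence[float]],
    next_shot: Dict[str, Sequence[float]],
    threshold: float,
) -> Dict[str, str]:
    """Greedily pair actor IDs across a cut by smallest feature distance.

    Pairs farther apart than `threshold` are never matched, so an actor who
    is absent from one of the shots simply stays unmatched.
    """
    pairs = sorted(
        (math.dist(fa, fb), a, b)
        for a, fa in prev_shot.items()
        for b, fb in next_shot.items()
    )
    matches: Dict[str, str] = {}
    used = set()
    for dist, a, b in pairs:
        if dist > threshold:
            break  # pairs are sorted, so the rest are farther apart still
        if a not in matches and b not in used:
            matches[a] = b
            used.add(b)
    return matches
```

Greedy pairing is chosen here only because it needs nothing beyond the standard library; an optimal assignment solver would be the usual alternative when many actors are close together in feature space.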

Load-bearing premise

The method assumes depth priors plus pose interpolation and shot tracking can place actors and handle occlusions and cuts without creating large errors in the final 3D field.
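To make the depth-prior premise concrete, the sketch below back-projects a single pixel with a known depth into camera coordinates using a pinhole model. The intrinsics (fx, fy, cx, cy), the idea of anchoring an actor at one pixel, and the name backproject are assumptions made for illustration; the text here does not specify how 3DLocator consumes the depth prior.

```python
from typing import Tuple


def backproject(
    u: float, v: float, depth: float,
    fx: float, fy: float, cx: float, cy: float,
) -> Tuple[float, float, float]:
    """Lift pixel (u, v) with a depth value to a 3D point in the camera frame.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy), all expressed in pixels.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)
```

Under this model any error in the depth value moves the recovered point along the viewing ray by the same amount, which is one way to see why the premise carries weight: a biased depth prior translates directly into a misplaced actor.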

What would settle it

Visible artifacts or wrong actor locations when the reconstructed field is rendered from entirely new camera angles and timestamps not present in the input clips would show the claim is incorrect.

Figures

Figures reproduced from arXiv: 2504.19584 by Daeun Lee, Jaesik Park, Sangmin Kim, Seunguk Do.

Figure 1: We present ShowMak3r++, a compositional video reconstruction…
Figure 2: Overview of our ShowMak3r++ pipeline. Given an entertainment video clip, we perform dense reconstruction of the stage (Sec. III-C), locate SMPL…
Figure 3: An example of transient object removal.
Figure 4: Effects of trajectory loss and penetration loss in…
Figure 5: Results of actor association. ShotMatcher can associate actors even when some individuals do not appear in a shot. If the distance of the matched actors is above the matching threshold, ShotMatcher identifies them as different.
Figure 6: An effect of using the proposed foreground masking.
Figure 7: Overview of our actor reconstruction module. Actor gaussians are…
Figure 8: Qualitative comparison on 'The Big Bang Theory' videos from the Sitcoms3D dataset, where each video features a single actor. Our method demonstrates…
Figure 9: Qualitative comparison of four TV show videos, each featuring multiple actors. We compare our pipeline with both point cloud representation and…
Figure 10: Qualitative result on CMU Panoptic dataset. Not only does our method produce photometric results from novel viewpoints, but it also aligns the…
Figure 11: Qualitative comparison of web videos. Our method can handle various scenarios such as dynamic action clips, dance videos, or movie clips. The…
Figure 12: Ablation study for face-fitting network.
Figure 13: The reconstructed scenes with our pipeline are…
Figure 14: The architecture of our face fitting network.
Figure 15: Visualization of aligned actors, estimated cameras, and reconstructed 3D stage.
Figure 16: Additional results of the aligned actors in controlled environments. We visualize gaussian centers from two different novel viewpoints. Green points…
Figure 17: Additional results of the web video reconstruction. We select challenging web videos with dynamic human motions or fast camera movements. We…
Original abstract

Reconstructing dynamic radiance fields from video clips is challenging, especially when entertainment videos like TV shows are given. Many challenges make the reconstruction difficult due to (1) actors occluding with each other and having diverse facial expressions, (2) cluttered stages, and (3) small baseline views or sudden shot changes. To address these issues, we present ShowMak3r, a comprehensive reconstruction pipeline that allows the editing of scenes like how video clips are made in a production control room. In ShowMak3r, a 3DLocator module locates recovered actors on the stage using depth prior and estimates unseen human poses via interpolation. The proposed ShotMatcher module then tracks the actors under shot changes. Furthermore, ShowMak3r introduces a face-fitting network that dynamically recovers the actors' expressions. Experiments on Sitcoms3D dataset show that our pipeline can reassemble TV show scenes with new cameras at different timestamps. We also demonstrate that ShowMak3r enables interesting applications such as synthetic shot-making, actor relocation, insertion, deletion, and pose manipulation. Project page : https://nstar1125.github.io/showmak3r

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents ShowMak3r, a pipeline for reconstructing dynamic radiance fields from TV show video clips. It addresses challenges of actor occlusions with diverse expressions, cluttered stages, small baselines, and abrupt shot changes via three modules: 3DLocator (depth priors plus pose interpolation for actor localization and unseen poses), ShotMatcher (actor tracking across shot changes), and a face-fitting network (dynamic expression recovery). Experiments on the Sitcoms3D dataset are said to demonstrate reassembly of scenes under novel cameras and timestamps, plus applications including synthetic shot-making, actor relocation/insertion/deletion, and pose manipulation.

Significance. If the pipeline's modules prove robust, the work would provide a practical system for compositional editing of dynamic entertainment video, extending radiance-field methods to production-style control-room operations. Targeted handling of shot changes and expressions could enable new virtual-production workflows, though the lack of supporting numbers leaves the magnitude of the advance unclear.

major comments (1)
  1. [Experiments] Experiments section: the central claim that the pipeline 'can reassemble TV show scenes with new cameras at different timestamps' and supports reliable editing applications rests on 3DLocator and ShotMatcher recovering accurate 3D positions and radiance fields. No quantitative metrics (novel-view PSNR/SSIM on held-out timestamps, 3D localization error, or ablation on depth-prior vs. interpolation) are reported to bound errors under mutual occlusions, diverse expressions, or abrupt cuts; without these the robustness assumption remains unverified and load-bearing for the editing claims.
minor comments (1)
  1. [Abstract] Abstract: the project-page URL is given without a period or proper formatting; consider moving it to a footnote or the end of the paper.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address the concern about experimental validation below and will incorporate the suggested improvements in the revision.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the central claim that the pipeline 'can reassemble TV show scenes with new cameras at different timestamps' and supports reliable editing applications rests on 3DLocator and ShotMatcher recovering accurate 3D positions and radiance fields. No quantitative metrics (novel-view PSNR/SSIM on held-out timestamps, 3D localization error, or ablation on depth-prior vs. interpolation) are reported to bound errors under mutual occlusions, diverse expressions, or abrupt cuts; without these the robustness assumption remains unverified and load-bearing for the editing claims.

    Authors: We agree that the current experiments rely primarily on qualitative demonstrations and that quantitative metrics would better support the robustness claims. In the revised manuscript we will add novel-view PSNR and SSIM scores on held-out timestamps, report 3D localization error for the 3DLocator module, and include an ablation study comparing depth priors against pose interpolation. These additions will directly address performance under mutual occlusions, diverse expressions, and abrupt shot changes.

    Revision: yes
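For reference, the PSNR metric named in this exchange can be computed as in the sketch below, using the standard definition 10 · log10(MAX² / MSE). Images are assumed to be flat sequences of intensities in [0, max_value]; the function name and data layout are illustrative assumptions, and SSIM is left out because it needs windowed local statistics.

```python
import math
from typing import Sequence


def psnr(
    rendered: Sequence[float],
    reference: Sequence[float],
    max_value: float = 1.0,
) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    if len(rendered) != len(reference) or len(rendered) == 0:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Evaluating on held-out timestamps, as the referee asks, would mean calling such a function on rendered frames that were excluded from reconstruction and comparing them with the withheld ground-truth frames.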

Circularity Check

0 steps flagged

No circularity: pipeline modules are independently specified without definitional reduction

full rationale

The paper describes a reconstruction pipeline (3DLocator using depth priors and pose interpolation, ShotMatcher for tracking under shot changes, and a face-fitting network) that maps input video clips to editable dynamic radiance fields. No equations, fitted parameters, or self-citations appear in the provided text that would make any output quantity definitionally equivalent to its inputs. The central claims rest on the empirical behavior of these proposed modules on the Sitcoms3D dataset rather than on any self-referential loop or imported uniqueness result. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities can be extracted from the given text.

pith-pipeline@v0.9.0 · 5732 in / 1102 out tokens · 45726 ms · 2026-05-22T17:51:54.625368+00:00 · methodology


Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · 2 internal anchors

  1. [1]

    Nerf: Representing scenes as neural radiance fields for view synthesis,

    B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” inEuropean Conference on Computer Vision, 2020

  2. [2]

    3d gaussian splatting for real-time radiance field rendering

    B. Kerbl, G. Kopanas, T. Leimk ¨uhler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering.”ACM Transactions on Graphics, vol. 42, no. 4, pp. 139–1, 2023

  3. [3]

    Hypernerf: a higher-dimensional representation for topologically varying neural radiance fields,

    K. Park, U. Sinha, P. Hedman, J. T. Barron, S. Bouaziz, D. B. Goldman, R. Martin-Brualla, and S. M. Seitz, “Hypernerf: a higher-dimensional representation for topologically varying neural radiance fields,”ACM Transactions on Graphics, vol. 40, no. 6, pp. 1–12, 2021

  4. [4]

    Deformable 3d gaussian splatting for animatable human avatars,

    H. Jung, N. Brasch, J. Song, E. Perez-Pellitero, Y . Zhou, Z. Li, N. Navab, and B. Busam, “Deformable 3d gaussian splatting for animatable human avatars,” inarXiv, 2023

  5. [5]

    4d gaussian splatting for real-time dynamic scene rendering,

    G. Wu, T. Yi, J. Fang, L. Xie, X. Zhang, W. Wei, W. Liu, Q. Tian, and X. Wang, “4d gaussian splatting for real-time dynamic scene rendering,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 20 310–20 320. Reference Frame (c) Actor Insertion (a) Actor Deletion (d) Pose Manipulation (b) Actor Relocation Fig. 13...

  6. [6]

    NeuMan: Neural Human Radiance Field from a Single Video,

    W. Jiang, K. M. Yi, G. Samei, O. Tuzel, and A. Ranjan, “NeuMan: Neural Human Radiance Field from a Single Video,” inEuropean Conference on Computer Vision, S. Avidan, G. Brostow, M. Ciss´e, G. M. Farinella, and T. Hassner, Eds. Cham: Springer Nature Switzerland, 2022, pp. 402–418

  7. [7]

    Hugs: Human gaussian splats,

    M. Kocabas, J.-H. R. Chang, J. Gabriel, O. Tuzel, and A. Ranjan, “Hugs: Human gaussian splats,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 505–515

  8. [8]

    Shape of motion: 4d reconstruction from a single video,

    Q. Wang, V . Ye, H. Gao, W. Zeng, J. Austin, Z. Li, and A. Kanazawa, “Shape of motion: 4d reconstruction from a single video,” inInterna- tional Conference on Computer Vision (ICCV), 2025

  9. [9]

    Monst3r: A simple approach for estimating geometry in the presence of motion,

    J. Zhang, C. Herrmann, J. Hur, V . Jampani, T. Darrell, F. Cole, D. Sun, and M.-H. Yang, “Monst3r: A simple approach for estimating geometry in the presence of motion,”International Conference on Learning Representations, 2025

  10. [10]

    Align3r: Aligned monocular depth estimation for dynamic videos,

    J. Lu, T. Huang, P. Li, Z. Dou, C. Lin, Z. Cui, Z. Dong, S.-K. Yeung, W. Wang, and Y . Liu, “Align3r: Aligned monocular depth estimation for dynamic videos,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 22 820–22 830

  11. [11]

    Continuous 3d perception model with persistent state,

    Q. Wang*, Y . Zhang*, A. Holynski, A. A. Efros, and A. Kanazawa, “Continuous 3d perception model with persistent state,” inCVPR, 2025

  12. [12]

    Megasam: Accurate, fast and robust structure and motion from casual dynamic videos,

    Z. Li, R. Tucker, F. Cole, Q. Wang, L. Jin, V . Ye, A. Kanazawa, A. Holynski, and N. Snavely, “Megasam: Accurate, fast and robust structure and motion from casual dynamic videos,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 10 486–10 496

  13. [13]

    Showmak3r: Compositional tv show reconstruction,

    S. Kim, S. Do, and J. Park, “Showmak3r: Compositional tv show reconstruction,”CVPR, 2025

  14. [14]

    The one where they reconstructed 3d humans and environments in tv shows,

    G. Pavlakos, E. Weber, M. Tancik, and A. Kanazawa, “The one where they reconstructed 3d humans and environments in tv shows,” in European Conference on Computer Vision. Springer, 2022, pp. 732– 749

  15. [15]

    Panoptic studio: A massively multiview system for social motion capture,

    H. Joo, H. Liu, L. Tan, L. Gui, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y . Sheikh, “Panoptic studio: A massively multiview system for social motion capture,” inProceedings of IEEE International Conference on Computer Vision, 2015, pp. 3334–3342

  16. [16]

    Neural 3d video synthesis from multi-view video,

    T. Li, M. Slavcheva, M. Zollhoefer, S. Green, C. Lassner, C. Kim, T. Schmidt, S. Lovegrove, M. Goesele, R. Newcombeet al., “Neural 3d video synthesis from multi-view video,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5521–5531

  17. [17]

    Dynibar: Neural dynamic image-based rendering,

    Z. Li, Q. Wang, F. Cole, R. Tucker, and N. Snavely, “Dynibar: Neural dynamic image-based rendering,” inProceedings of the IEEE/CVF PREPRINT 11 Conference on Computer Vision and Pattern Recognition, 2023, pp. 4273–4284

  18. [18]

    Nerfies: Deformable neural radiance fields,

    K. Park, U. Sinha, J. T. Barron, S. Bouaziz, D. B. Goldman, S. M. Seitz, and R. Martin-Brualla, “Nerfies: Deformable neural radiance fields,” inProceedings of IEEE International Conference on Computer Vision, 2021, pp. 5865–5874

  19. [19]

    D- nerf: Neural radiance fields for dynamic scenes,

    A. Pumarola, E. Corona, G. Pons-Moll, and F. Moreno-Noguer, “D- nerf: Neural radiance fields for dynamic scenes,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10 318–10 327

  20. [20]

    Hexplane: A fast representation for dynamic scenes,

    A. Cao and J. Johnson, “Hexplane: A fast representation for dynamic scenes,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 130–141

  21. [21]

    Neural scene flow fields for space-time view synthesis of dynamic scenes,

    Z. Li, S. Niklaus, N. Snavely, and O. Wang, “Neural scene flow fields for space-time view synthesis of dynamic scenes,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2021, pp. 6498–6508

  22. [22]

    Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds,

    J. Lei, Y . Weng, A. Harley, L. Guibas, and K. Daniilidis, “Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds,” inarXiv, 2024

  23. [23]

    Dynamic gaussian marbles for novel view synthesis of casual monocular videos,

    C. Stearns, A. Harley, M. Uy, F. Dubost, F. Tombari, G. Wetzstein, and L. Guibas, “Dynamic gaussian marbles for novel view synthesis of casual monocular videos,” inSIGGRAPH Asia 2024 Conference Papers, 2024, pp. 1–11

  24. [24]

    Gflow: Recovering 4d world from monocular video,

    S. Wang, X. Yang, Q. Shen, Z. Jiang, and X. Wang, “Gflow: Recovering 4d world from monocular video,” inAssociation for the Advancement of Artificial Intelligence, 2025

  25. [25]

    K-planes: Explicit radiance fields in space, time, and appearance,

    S. Fridovich-Keil, G. Meanti, F. R. Warburg, B. Recht, and A. Kanazawa, “K-planes: Explicit radiance fields in space, time, and appearance,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2023, pp. 12 479–12 488

  26. [26]

    Nerfplayer: A streamable dynamic scene representation with decomposed neural radiance fields,

    L. Song, A. Chen, Z. Li, Z. Chen, L. Chen, J. Yuan, Y . Xu, and A. Geiger, “Nerfplayer: A streamable dynamic scene representation with decomposed neural radiance fields,”IEEE Transactions on Visualization and Computer Graphics, vol. 29, no. 5, pp. 2732–2742, 2023

  27. [27]

    Gaussian-flow: 4d reconstruction with dynamic 3d gaussian particle,

    Y . Lin, Z. Dai, S. Zhu, and Y . Yao, “Gaussian-flow: 4d reconstruction with dynamic 3d gaussian particle,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 21 136–21 145

  28. [28]

    Bard-gs: Blur-aware re- construction of dynamic scenes via gaussian splatting,

    Y . Lu, Y . Zhou, D. Liu, T. Liang, and Y . Yin, “Bard-gs: Blur-aware re- construction of dynamic scenes via gaussian splatting,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  29. [29]

    Mem4d: Decoupling static and dynamic memory for dynamic scene reconstruction,

    Y . Wang, D. Ceylan, and L. Agapito, “Mem4d: Decoupling static and dynamic memory for dynamic scene reconstruction,”arXiv preprint arXiv:2508.07908, 2025

  30. [30]

    HumanNeRF: Free-Viewpoint Rendering of Moving People From Monocular Video,

    C.-Y . Weng, B. Curless, P. P. Srinivasan, J. T. Barron, and I. Kemelmacher-Shlizerman, “HumanNeRF: Free-Viewpoint Rendering of Moving People From Monocular Video,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2022, pp. 16 210–16 220

  31. [31]

    Humannerf: Efficiently generated human radiance field from sparse inputs,

    F. Zhao, W. Yang, J. Zhang, P. Lin, Y . Zhang, J. Yu, and L. Xu, “Humannerf: Efficiently generated human radiance field from sparse inputs,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7743–7753

  32. [32]

    Vid2Avatar: 3D Avatar Reconstruction From Videos in the Wild via Self-Supervised Scene Decomposition,

    C. Guo, T. Jiang, X. Chen, J. Song, and O. Hilliges, “Vid2Avatar: 3D Avatar Reconstruction From Videos in the Wild via Self-Supervised Scene Decomposition,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2023, pp. 12 858–12 868

  33. [33]

    MonoHuman: Animat- able Human Neural Field From Monocular Video,

    Z. Yu, W. Cheng, X. Liu, W. Wu, and K.-Y . Lin, “MonoHuman: Animat- able Human Neural Field From Monocular Video,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2023, pp. 16 943–16 953

  34. [34]

    Learning Neural V olumetric Representations of Dynamic Humans in Minutes,

    C. Geng, S. Peng, Z. Xu, H. Bao, and X. Zhou, “Learning Neural V olumetric Representations of Dynamic Humans in Minutes,” inPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2023, pp. 8759–8770

  35. [35]

    Gaussianavatar: Towards realistic human avatar modeling from a single video via animatable 3d gaussians,

    L. Hu, H. Zhang, Y . Zhang, B. Zhou, B. Liu, S. Zhang, and L. Nie, “Gaussianavatar: Towards realistic human avatar modeling from a single video via animatable 3d gaussians,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 634– 644

  36. [36]

    Gauhuman: Articulated gaussian splatting from monocular human videos,

    S. Hu, T. Hu, and Z. Liu, “Gauhuman: Articulated gaussian splatting from monocular human videos,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 20 418–20 431

  37. [37]

    HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting,

    Y . Jiang, Z. Shen, P. Wang, Z. Su, Y . Hong, Y . Zhang, J. Yu, and L. Xu, “HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2024, pp. 19 734–19 745

  38. [38]

    Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Model- ing,

    Z. Li, Z. Zheng, L. Wang, and Y . Liu, “Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Model- ing,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2024, pp. 19 711–19 722

  39. [39]

    Expressive Whole-Body 3D Gaus- sian Avatar,

    G. Moon, T. Shiratori, and S. Saito, “Expressive Whole-Body 3D Gaus- sian Avatar,” inEuropean Conference on Computer Vision, A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, and G. Varol, Eds. Cham: Springer Nature Switzerland, 2024, pp. 19–35

  40. [40]

    Human Gaussian Splatting: Real-time Rendering of Animat- able Avatars,

    A. Moreau, J. Song, H. Dhamo, R. Shaw, Y . Zhou, and E. P ´erez- Pellitero, “Human Gaussian Splatting: Real-time Rendering of Animat- able Avatars,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2024, pp. 788–798

  41. [41]

    ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering,

    H. Pang, H. Zhu, A. Kortylewski, C. Theobalt, and M. Habermann, “ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2024, pp. 1165–1175

  42. [42]

    Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians,

    S. Qian, T. Kirschstein, L. Schoneveld, D. Davoli, S. Giebenhain, and M. Nießner, “Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 20 299–20 309

  43. [43]

    3DGS- Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting,

    Z. Qian, S. Wang, M. Mihajlovic, A. Geiger, and S. Tang, “3DGS- Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2024, pp. 5020–5030

  44. [44]

    DEGAS: Detailed Expressions on Full-Body Gaussian Avatars,

    Z. Shao, D. Wang, Q.-Y . Tian, Y .-D. Yang, H. Meng, Z. Cai, B. Dong, Y . Zhang, K. Zhang, and Z. Wang, “DEGAS: Detailed Expressions on Full-Body Gaussian Avatars,” inProceedings of the International Conference on 3D Vision (3DV), 2025

  45. [45]

    SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting,

    Z. Shao, Z. Wang, Z. Li, D. Wang, X. Lin, Y . Zhang, M. Fan, and Z. Wang, “SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2024, pp. 1606–1616

  46. [46]

    Humanrf: High-fidelity neural radiance fields for humans in motion,

    M. Is ¸ık, M. R¨unz, M. Georgopoulos, T. Khakhulin, J. Starck, L. Agapito, and M. Nießner, “Humanrf: High-fidelity neural radiance fields for humans in motion,”ACM Transactions on Graphics, vol. 42, no. 4, pp. 1–12, 2023. [Online]. Available: https://doi.org/10.1145/3592415

  47. [47]

    A-nerf: Articulated neural radiance fields for learning human shape, appearance, and pose,

    S.-Y . Su, F. Yu, M. Zollh¨ofer, and H. Rhodin, “A-nerf: Articulated neural radiance fields for learning human shape, appearance, and pose,”Ad- vances in Neural Information Processing Systems, vol. 34, pp. 12 278– 12 291, 2021

  48. [48]

    Animatable neural radiance fields for modeling dynamic human bod- ies,

    S. Peng, J. Dong, Q. Wang, S. Zhang, Q. Shuai, X. Zhou, and H. Bao, “Animatable neural radiance fields for modeling dynamic human bod- ies,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14 314–14 323

  49. [49]

    Dream, lift, animate: From single images to animatable gaussian avatars,

    M. C. Buehler, Y . Yuan, X. Li, Y . Huang, K. Nagano, and U. Iqbal, “Dream, lift, animate: From single images to animatable gaussian avatars,” 2025

  50. [50]

    MoGA: 3d Gen- erative Avatar Prior for Monocular Gaussian Avatar Reconstruction,

    Z. Dong, L. Duan, J. Song, M. J. Black, and A. Geiger, “MoGA: 3d Gen- erative Avatar Prior for Monocular Gaussian Avatar Reconstruction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

  51. [51]

    Vid2avatar- pro: Authentic avatar from videos in the wild via universal prior,

    C. Guo, J. Li, Y . Kant, Y . Sheikh, S. Saito, and C. Cao, “Vid2avatar- pro: Authentic avatar from videos in the wild via universal prior,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025

  52. [52]

    ACM Trans

    M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, “Smpl: a skinned multi-person linear model,”ACM Trans. Graph., vol. 34, no. 6, Oct. 2015. [Online]. Available: https: //doi.org/10.1145/2816795.2818013

  53. [53]

    Expressive body capture: 3d hands, face, and body from a single image,

    G. Pavlakos, V . Choutas, N. Ghorbani, T. Bolkart, A. A. Osman, D. Tzionas, and M. J. Black, “Expressive body capture: 3d hands, face, and body from a single image,” inProceedings of IEEE International Conference on Computer Vision, 2019, pp. 10 975–10 985

  54. [54]

    Sherf: Generalizable human nerf from a single image,

    S. Hu, F. Hong, L. Pan, H. Mei, L. Yang, and Z. Liu, “Sherf: Generalizable human nerf from a single image,” inProceedings of IEEE International Conference on Computer Vision, 2023, pp. 9352–9364

  55. [55]

    Ghunerf: Generalizable human nerf from a monocular video,

    C. Li, J. Lin, and G. H. Lee, “Ghunerf: Generalizable human nerf from a monocular video,” in2024 International Conference on 3D Vision (3DV). IEEE, 2024, pp. 923–932

  56. [56]

    Ghnerf: Learning generalizable human features with PREPRINT 12 efficient neural radiance fields,

    A. Dey, D. Yang, R. Agaram, A. Dantcheva, A. I. Comport, S. Sridhar, and J. Martinet, “Ghnerf: Learning generalizable human features with PREPRINT 12 efficient neural radiance fields,” inProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, 2024, pp. 2812– 2821

  57. [57]

    Eg-humannerf: Efficient generalizable human nerf utilizing human prior for sparse view,

    Z. Wang, Y. Kanamori, and Y. Endo, “Eg-humannerf: Efficient generalizable human nerf utilizing human prior for sparse view,” in arXiv, 2024

  58. [58]

    Actorsnerf: Animatable few-shot human rendering with generalizable nerfs,

    J. Mu, S. Sang, N. Vasconcelos, and X. Wang, “Actorsnerf: Animatable few-shot human rendering with generalizable nerfs,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18 391–18 401

  59. [59]

    GAIA: Generative animatable interactive avatars with expression-conditioned gaussians,

    Z. Yu, T. Li, J. Sun, O. Shapira, S. Park, M. Stengel, M. Chan, X. Li, W. Wang, K. Nagano, and S. D. Mello, “GAIA: Generative animatable interactive avatars with expression-conditioned gaussians,” in ACM SIGGRAPH, 2025

  60. [60]

    High-resolution image synthesis with latent diffusion models,

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 684–10 695

  61. [61]

    Sdxl: Improving latent diffusion models for high-resolution image synthesis,

    D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, “Sdxl: Improving latent diffusion models for high-resolution image synthesis,” in International Conference on Learning Representations, 2024

  62. [62]

    Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation,

    Z. Wang, C. Lu, Y. Wang, F. Bao, C. Li, H. Su, and J. Zhu, “Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation,” in Advances in Neural Information Processing Systems, vol. 36, 2023, pp. 8406–8441

  63. [63]

    Zero-shot text-guided object generation with dream fields,

    A. Jain, B. Mildenhall, J. T. Barron, P. Abbeel, and B. Poole, “Zero-shot text-guided object generation with dream fields,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 867–876

  64. [64]

    Dreamfusion: Text-to-3d using 2d diffusion,

    B. Poole, A. Jain, J. T. Barron, and B. Mildenhall, “Dreamfusion: Text-to-3d using 2d diffusion,” in International Conference on Learning Representations, 2023

  65. [65]

    Diffusion models as plug-and-play priors,

    A. Graikos, N. Malkin, N. Jojic, and D. Samaras, “Diffusion models as plug-and-play priors,” in Advances in Neural Information Processing Systems, vol. 35, 2022, pp. 14 715–14 728

  66. [66]

    Guess the unseen: Dynamic 3d scene reconstruction from partial 2d glimpses,

    I. Lee, B. Kim, and H. Joo, “Guess the unseen: Dynamic 3d scene reconstruction from partial 2d glimpses,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1062–1071

  67. [67]

    Chrome: Clothed human reconstruction with occlusion-resilience and multiview-consistency from a single image,

    A. Dutta, M. Zheng, Z. Gao, B. Planche, A. Choudhuri, T. Chen, A. K. Roy-Chowdhury, and Z. Wu, “Chrome: Clothed human reconstruction with occlusion-resilience and multiview-consistency from a single image,” in arXiv, 2025

  68. [68]

    Scaffoldavatar: High-fidelity gaussian avatars with patch expressions,

    S. Aneja, S. Weiss, I. Baeza, P. Chandran, G. Zoss, M. Nießner, and D. Bradley, “Scaffoldavatar: High-fidelity gaussian avatars with patch expressions,” ACM Trans. Graph., vol. 44, no. 4, 2025

  69. [69]

    Nerf in the wild: Neural radiance fields for unconstrained photo collections,

    R. Martin-Brualla, N. Radwan, M. S. Sajjadi, J. T. Barron, A. Dosovitskiy, and D. Duckworth, “Nerf in the wild: Neural radiance fields for unconstrained photo collections,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7210–7219

  70. [70]

    Omnire: Omni urban scene reconstruction,

    Z. Chen, J. Yang, J. Huang, R. de Lutio, J. M. Esturo, B. Ivanovic, O. Litany, Z. Gojcic, S. Fidler, M. Pavone et al., “Omnire: Omni urban scene reconstruction,” in International Conference on Learning Representations, 2025

  71. [71]

    Ephraim katz’s the film encyclopedia,

    E. Katz, “Ephraim katz’s the film encyclopedia,” 1979

  72. [72]

    Film: An International History of the Medium

    R. Sklar, Film: An International History of the Medium. Thames and Hudson, 1990

  73. [73]

    Global structure-from-motion revisited,

    L. Pan, D. Baráth, M. Pollefeys, and J. L. Schönberger, “Global structure-from-motion revisited,” in European Conference on Computer Vision, 2024

  74. [74]

    Segment anything,

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” in Proceedings of IEEE International Conference on Computer Vision, 2023, pp. 4015–4026

  75. [75]

    Structure-from-motion revisited,

    J. L. Schonberger and J.-M. Frahm, “Structure-from-motion revisited,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 4104–4113

  76. [76]

    $\pi^3$: Permutation-Equivariant Visual Geometry Learning

    Y. Wang, J. Zhou, H. Zhu, W. Chang, Y. Zhou, Z. Li, J. Chen, J. Pang, C. Shen, and T. He, “$\pi^3$: Scalable permutation-equivariant visual geometry learning,” 2025. [Online]. Available: https://arxiv.org/abs/2507.13347

  77. [77]

    MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

    R. Wang, S. Xu, Y. Dong, Y. Deng, J. Xiang, Z. Lv, G. Sun, X. Tong, and J. Yang, “Moge-2: Accurate monocular geometry with metric scale and sharp details,” arXiv preprint arXiv:2507.02546, 2025

  78. [78]

    Dn-splatter: Depth and normal priors for gaussian splatting and meshing,

    M. Turkulainen, X. Ren, I. Melekhov, O. Seiskari, E. Rahtu, and J. Kannala, “Dn-splatter: Depth and normal priors for gaussian splatting and meshing,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025

  79. [79]

    Depth-regularized optimization for 3d gaussian splatting in few-shot images,

    J. Chung, J. Oh, and K. M. Lee, “Depth-regularized optimization for 3d gaussian splatting in few-shot images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 811–820

  80. [80]

    ObjectClear: Complete object removal via object-effect attention,

    J. Zhao, S. Zhou, Z. Wang, P. Yang, and C. C. Loy, “ObjectClear: Complete object removal via object-effect attention,” arXiv preprint arXiv:2505.22636, 2025

Showing first 80 references.