arxiv: 2605.23555 · v1 · pith:3MY4DIFTnew · submitted 2026-05-22 · 💻 cs.CV

Generator-Refiner-Examiner: A Tri-Module Data Augmentation Framework for 3D Human Avatar Learning from Monocular Videos

Pith reviewed 2026-05-25 05:06 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D human avatar · data augmentation · monocular video · diffusion refinement · pose perturbation · attention-based filtering · avatar reconstruction

The pith

A tri-module data augmentation system improves 3D human avatar reconstruction from monocular videos with limited frames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TrioMan to handle data scarcity when building photorealistic, animatable 3D human avatars from monocular videos. Current approaches combine per-subject optimization with generic human priors but lose fine details under limited training frames. TrioMan adds a Generator that creates new samples through Gaussian perturbations on pose and camera parameters, a Refiner that enhances those samples with one-step diffusion using texture and geometry guidance, and an Examiner that filters for subject consistency via dual-branch attention similarity scoring. Experiments on the X-Humans and NeuMan benchmarks indicate that this augmented training yields higher performance than prior state-of-the-art methods.

Core claim

TrioMan augments limited monocular video data for 3D avatar learning through three modules: the Generator imposes Gaussian perturbations on pose and camera to produce diverse unseen samples; the Refiner applies one-step diffusion conditioned on texture and geometry cues to raise sample quality; the Examiner uses dual-branch attention-based similarity evaluation to retain only subject-consistent examples. This process supplies additional useful training signal that improves reconstruction when real frames are scarce.
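
The Generator step can be pictured with a minimal sketch, assuming pose and camera are flat lists of floats and that noise is added independently to each entry. The function names, the sigma values, and the parameter layout are hypothetical and are not taken from the paper; adding noise directly to axis-angle values is a simplification.

```python
import random

def perturb(values, sigma, rng):
    """Return a copy of `values` with independent N(0, sigma^2) noise added."""
    return [v + rng.gauss(0.0, sigma) for v in values]

def generate_samples(pose, camera, n, pose_sigma=0.05, cam_sigma=0.02, seed=0):
    """Draw n perturbed (pose, camera) pairs around one observed frame.

    pose_sigma and cam_sigma are illustrative values, not the paper's settings.
    """
    rng = random.Random(seed)  # seeded so the draw is reproducible
    return [
        (perturb(pose, pose_sigma, rng), perturb(camera, cam_sigma, rng))
        for _ in range(n)
    ]

pose = [0.0] * 6          # hypothetical flattened joint rotations (axis-angle)
camera = [0.0, 0.0, 2.5]  # hypothetical camera translation
samples = generate_samples(pose, camera, n=4)
```

Each returned pair would then be rendered into a candidate training view; the rendering, refinement, and filtering stages are outside this sketch.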

What carries the argument

The tri-module Generator-Refiner-Examiner pipeline: the Generator perturbs pose and camera, the Refiner performs guided one-step diffusion, and the Examiner applies dual-branch attention similarity filtering.
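
The filtering idea can be illustrated with a stand-in, since the abstract does not specify how the attention-based scores are computed. The sketch below swaps in plain cosine similarity between hypothetical feature vectors, one per branch, and keeps a candidate only if both branch scores clear a threshold. The feature vectors, the threshold, and the "both branches must pass" rule are all assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def examine(candidates, ref_a, ref_b, threshold=0.8):
    """Keep candidates whose weaker branch score is at least `threshold`.

    Cosine similarity is a placeholder for the paper's attention-based score.
    """
    kept = []
    for name, feat_a, feat_b in candidates:
        score = min(cosine(feat_a, ref_a), cosine(feat_b, ref_b))
        if score >= threshold:
            kept.append(name)
    return kept

# Hypothetical reference features for the subject, one per branch.
ref_a, ref_b = [1.0, 0.0], [0.0, 1.0]
candidates = [
    ("sample_0", [0.9, 0.1], [0.1, 0.9]),  # close to the reference in both branches
    ("sample_1", [0.1, 0.9], [0.0, 1.0]),  # far from the reference in branch A
]
kept = examine(candidates, ref_a, ref_b)  # ["sample_0"]
```

Taking the minimum of the two scores is one simple way to require agreement between branches; a weighted sum would be an equally plausible choice.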

If this is right

  • Augmented samples enable capture of fine-grained details that standard per-subject optimization misses under data limits.
  • The framework outperforms existing methods on the X-Humans and NeuMan benchmarks.
  • Subject-consistent extra data reduces dependence on generic human priors for avatar quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same generate-refine-examine loop could apply to other sparse-view 3D reconstruction problems beyond human avatars.
  • If the examiner's attention scoring proves reliable, similar filtering might improve synthetic data use in related vision tasks.

Load-bearing premise

That the perturbed, diffused, and filtered samples remain subject-consistent and supply useful training signal beyond the original limited frames.

What would settle it

Running the full TrioMan pipeline on X-Humans or NeuMan videos with few frames and finding no measurable gain in avatar reconstruction metrics compared with training on the original frames alone.
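
Such a comparison could be scored as in the sketch below, assuming PSNR as the reconstruction metric (the text above does not name one) and images flattened to lists of intensities in [0, 1]. The pixel values are toy numbers chosen only to exercise the code, not results from the paper.

```python
import math

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    return math.inf if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)

def mean_gain(baseline_renders, augmented_renders, targets):
    """Average per-frame PSNR difference (augmented minus baseline)."""
    gains = [
        psnr(a, t) - psnr(b, t)
        for b, a, t in zip(baseline_renders, augmented_renders, targets)
    ]
    return sum(gains) / len(gains)

# Toy data: one held-out frame with four pixels.
targets = [[0.5, 0.5, 0.5, 0.5]]
baseline = [[0.6, 0.4, 0.6, 0.4]]       # renders from training on original frames
augmented = [[0.51, 0.49, 0.51, 0.49]]  # renders from training with augmentation
gain = mean_gain(baseline, augmented, targets)
```

A mean gain near zero on held-out frames, across subjects, would be the outcome that counts against the premise.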

Figures

Figures reproduced from arXiv: 2605.23555 by Gangjian Zhang, Hao Wang, Jian Shu, Sicheng Yu, Wenhao Shen, Yu Feng.

Figure 1: Qualitative Comparison. We use the same SMPL templates to drive the animatable 3D avatars of different SOTA
Figure 2: Method Overview. Our method, TrioMan, addresses expressive 3D human avatar learning from monocular video
Figure 4: Refiner module. We take the refinement of the
Figure 5: Examiner module. We design a dual-branch similar
Figure 6: Visual comparison with SOTA methods on Neuman. Compared to current methods, our approach can achieve better
Figure 7: Visual comparison with SOTA methods on X
Figure 8: Visual ablation of Refiner. We show the refinement effects of the Refiner module before and after incorporating
Figure 10: Visual ablation about the geometry condition in
Original abstract

This paper addresses the challenge of reconstructing photorealistic and animatable 3D human avatars from monocular videos. While existing methods rely on combining per-subject optimization with generic human priors, they often fail to capture fine-grained details when training frames are limited. To mitigate this data scarcity, we propose TrioMan, a systematic tri-module framework for augmented 3D avatar learning. Our approach comprises three synergistic components. The Generator creates diverse unseen samples by imposing Gaussian perturbations on pose and camera. The Refiner improves the quality of generated data through one-step diffusion guided by texture and geometry cues. The Examiner selects subject-consistent samples using a dual-branch attention-based similarity evaluation. Experiments on the X-Humans and NeuMan benchmarks show that TrioMan outperforms state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes TrioMan, a tri-module data augmentation framework for reconstructing photorealistic and animatable 3D human avatars from monocular videos with limited frames. The Generator creates diverse samples via Gaussian perturbations on pose and camera parameters; the Refiner enhances them using one-step diffusion conditioned on texture and geometry cues; the Examiner filters for subject consistency with a dual-branch attention-based similarity metric. The central claim is that this pipeline yields useful additional training signal and outperforms prior methods on the X-Humans and NeuMan benchmarks.

Significance. If the experimental claims hold, the framework would offer a practical route to mitigate data scarcity in per-subject avatar optimization, potentially improving fine-grained detail capture without requiring additional real captures or heavier reliance on generic human priors.

major comments (1)
  1. [Abstract / Experiments] The claim that 'Experiments on the X-Humans and NeuMan benchmarks show that TrioMan outperforms state-of-the-art methods' is unsupported by any reported metrics, tables, ablation studies, error analysis, or implementation details. This directly undermines assessment of whether the Generator-Refiner-Examiner pipeline produces subject-consistent, high-signal augmentations as assumed.
minor comments (1)
  1. [Method] The description of the dual-branch attention mechanism in the Examiner and the precise conditioning signals in the Refiner would benefit from explicit algorithmic pseudocode or equations to support reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and the identification of this critical issue with the experimental claims. We address the comment below.

Point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract / Experiments section: The claim that 'Experiments on the X-Humans and NeuMan benchmarks show that TrioMan outperforms state-of-the-art methods' is unsupported by any reported metrics, tables, ablation studies, error analysis, or implementation details. This directly undermines assessment of whether the Generator-Refiner-Examiner pipeline produces subject-consistent, high-signal augmentations as assumed.

    Authors: We agree that the claim in the abstract is currently unsupported in the manuscript. The provided text consists only of the abstract and does not contain any quantitative results, tables, ablations, error analysis, or implementation details. In the revised version we will add a full Experiments section with metrics on X-Humans and NeuMan, direct comparisons to prior methods, module ablations, and implementation details so that the performance claims can be properly evaluated.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper presents a tri-module data augmentation framework (Generator-Refiner-Examiner) for 3D avatar learning, with claims resting on empirical benchmark results rather than any mathematical derivation chain. No equations, fitted parameters, self-citations as load-bearing premises, or ansatzes are described in the provided text. The central claim (outperformance on X-Humans and NeuMan) is an experimental outcome, not a quantity that reduces to its own inputs by construction. The method is self-contained against external benchmarks with no internal reduction to tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No information is available from the abstract to populate free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5680 in / 1220 out tokens · 24056 ms · 2026-05-25T05:06:22.031685+00:00 · methodology
