Generator-Refiner-Examiner: A Tri-Module Data Augmentation Framework for 3D Human Avatar Learning from Monocular Videos

Gangjian Zhang; Hao Wang; Jian Shu; Sicheng Yu; Wenhao Shen; Yu Feng

arxiv: 2605.23555 · v1 · pith:3MY4DIFTnew · submitted 2026-05-22 · 💻 cs.CV

Pith reviewed 2026-05-25 05:06 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D human avatar · data augmentation · monocular video · diffusion refinement · pose perturbation · attention-based filtering · avatar reconstruction

The pith

A tri-module data augmentation system improves 3D human avatar reconstruction from monocular videos with limited frames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TrioMan to handle data scarcity when building photorealistic, animatable 3D human avatars from monocular videos. Current approaches combine per-subject optimization with generic human priors but lose fine details under limited training frames. TrioMan adds a Generator that creates new samples through Gaussian perturbations on pose and camera parameters, a Refiner that enhances those samples with one-step diffusion using texture and geometry guidance, and an Examiner that filters for subject consistency via dual-branch attention similarity scoring. Experiments on the X-Humans and NeuMan benchmarks indicate that this augmented training yields higher performance than prior state-of-the-art methods.

Core claim

TrioMan augments limited monocular video data for 3D avatar learning through three modules: the Generator imposes Gaussian perturbations on pose and camera to produce diverse unseen samples; the Refiner applies one-step diffusion conditioned on texture and geometry cues to raise sample quality; the Examiner uses dual-branch attention-based similarity evaluation to retain only subject-consistent examples. This process supplies additional useful training signal that improves reconstruction when real frames are scarce.
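
The Generator step described above can be illustrated with a minimal sketch. It assumes pose and camera are stored as flat lists of floats and that the noise is independent and zero-mean per parameter; the `Sample` class, the `generate` function, and the sigma values are hypothetical names and settings chosen for illustration, not taken from the paper.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    pose: List[float]    # hypothetical: flattened joint rotations (radians)
    camera: List[float]  # hypothetical: camera parameters, e.g. a translation


def perturb(values: List[float], sigma: float, rng: random.Random) -> List[float]:
    """Add independent zero-mean Gaussian noise to each value."""
    return [v + rng.gauss(0.0, sigma) for v in values]


def generate(sample: Sample, n: int, pose_sigma: float = 0.05,
             cam_sigma: float = 0.02, seed: int = 0) -> List[Sample]:
    """Return n perturbed copies; the sigmas are illustrative, not from the paper."""
    rng = random.Random(seed)
    return [
        Sample(perturb(sample.pose, pose_sigma, rng),
               perturb(sample.camera, cam_sigma, rng))
        for _ in range(n)
    ]


base = Sample(pose=[0.0] * 6, camera=[0.0, 0.0, 2.5])
augmented = generate(base, n=4)
```

Seeding the random generator keeps the augmented set reproducible across runs, which makes it easier to compare training with and without augmentation.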

What carries the argument

The tri-module Generator-Refiner-Examiner pipeline, where Generator perturbs pose and camera, Refiner performs guided one-step diffusion, and Examiner applies dual-branch attention similarity filtering.
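
The control flow of such a pipeline might look like the sketch below. The three modules are passed in as plain callables, and the function name `augment`, the choice to compare each candidate against its source frame, and the toy numeric "frames" are all assumptions made for illustration; the page does not specify how the modules are wired together.

```python
from typing import Callable, List, TypeVar

S = TypeVar("S")


def augment(frames: List[S],
            generate: Callable[[S], List[S]],
            refine: Callable[[S], S],
            examine: Callable[[S, S], bool]) -> List[S]:
    """Generate candidates per frame, refine each, keep those the examiner accepts."""
    kept: List[S] = []
    for frame in frames:
        for candidate in generate(frame):
            refined = refine(candidate)
            if examine(frame, refined):  # assumed: compare against the real source frame
                kept.append(refined)
    return frames + kept  # original frames plus accepted augmentations


# Toy stand-ins using numbers as "frames", only to show the control flow.
result = augment(
    frames=[1.0, 2.0],
    generate=lambda f: [f - 0.5, f + 0.1],
    refine=lambda c: round(c, 1),
    examine=lambda ref, c: abs(ref - c) < 0.2,
)
```

In this toy run, the candidates that drift far from their source frame are rejected and the near ones are appended to the training set.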

If this is right

Augmented samples enable capture of fine-grained details that standard per-subject optimization misses under data limits.
The framework outperforms existing methods on the X-Humans and NeuMan benchmarks.
Subject-consistent extra data reduces dependence on generic human priors for avatar quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same generate-refine-examine loop could apply to other sparse-view 3D reconstruction problems beyond human avatars.
If the examiner's attention scoring proves reliable, similar filtering might improve synthetic data use in related vision tasks.

Load-bearing premise

That the perturbed, diffused, and filtered samples remain subject-consistent and supply useful training signal beyond the original limited frames.
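
One way to picture the filtering this premise depends on is a two-score acceptance rule. The sketch below is a stand-in, not the paper's Examiner: it replaces the dual-branch attention-based similarity with plain cosine similarity over hypothetical feature vectors, and the branch names (`tex`, `geo`) and the 0.9 threshold are assumptions.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def is_consistent(ref_a: Sequence[float], cand_a: Sequence[float],
                  ref_b: Sequence[float], cand_b: Sequence[float],
                  threshold: float = 0.9) -> bool:
    """Accept a candidate only if both (hypothetical) branches score above threshold."""
    return min(cosine(ref_a, cand_a), cosine(ref_b, cand_b)) >= threshold


# Hypothetical feature vectors for a reference frame and two candidates.
ref_tex, ref_geo = [1.0, 0.0, 0.2], [0.5, 0.5, 0.0]
good = is_consistent(ref_tex, [0.9, 0.1, 0.2], ref_geo, [0.5, 0.4, 0.0])
bad = is_consistent(ref_tex, [0.0, 1.0, 0.0], ref_geo, [0.5, 0.5, 0.0])
```

Taking the minimum of the two scores means a candidate is rejected as soon as either branch disagrees, which is a conservative choice when the concern is admitting off-subject samples.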

What would settle it

Running the full TrioMan pipeline on X-Humans or NeuMan videos with few frames and finding no measurable gain in avatar reconstruction metrics compared with training on the original frames alone.
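
A comparison of this kind could be scored as in the sketch below. It assumes PSNR as the reconstruction metric, which the abstract does not name, and represents images as flattened lists of floats in [0, 1]; `mean_gain` and the toy data are hypothetical and carry no results from the paper.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float], max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for flattened images with values in [0, max_val]."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)


def mean_gain(baseline, augmented, targets) -> float:
    """Average PSNR difference (augmented minus baseline) over held-out frames."""
    gains = [psnr(a, t) - psnr(b, t) for b, a, t in zip(baseline, augmented, targets)]
    return sum(gains) / len(gains)


# Toy held-out frame: the "augmented" render is closer to the target.
targets = [[0.5, 0.5, 0.5, 0.5]]
baseline = [[0.4, 0.6, 0.4, 0.6]]
augmented = [[0.45, 0.55, 0.45, 0.55]]
gain = mean_gain(baseline, augmented, targets)
```

A mean gain that is indistinguishable from zero on held-out frames would be the outcome this section describes as settling the question against the framework.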

Figures

Figures reproduced from arXiv: 2605.23555 by Gangjian Zhang, Hao Wang, Jian Shu, Sicheng Yu, Wenhao Shen, Yu Feng.

**Figure 1.** Qualitative Comparison. We use the same SMPL templates to drive the animatable 3D avatars of different SOTA

**Figure 2.** Method Overview. Our method, TrioMan, addresses expressive 3D human avatar learning from monocular video

**Figure 4.** Refiner module. We take the refinement of the

**Figure 5.** Examiner module. We design a dual-branch similar

**Figure 6.** Visual comparison with SOTA methods on Neuman. Compared to current methods, our approach can achieve better

**Figure 7.** Visual comparison with SOTA methods on X

**Figure 8.** Visual ablation of Refiner. We show the refinement effects of the Refiner module before and after incorporating

**Figure 10.** Visual ablation about the geometry condition in

Original abstract

This paper addresses the challenge of reconstructing photorealistic and animatable 3D human avatars from monocular videos. While existing methods rely on combining per-subject optimization with generic human priors, they often fail to capture fine-grained details when training frames are limited. To mitigate this data scarcity, we propose TrioMan, a systematic tri-module framework for augmented 3D avatar learning. Our approach comprises three synergistic components. The Generator creates diverse unseen samples by imposing Gaussian perturbations on pose and camera. The Refiner improves the quality of generated data through one-step diffusion guided by texture and geometry cues. The Examiner selects subject-consistent samples using a dual-branch attention-based similarity evaluation. Experiments on the X-Humans and NeuMan benchmarks show that TrioMan outperforms state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TrioMan packages a Generator-Refiner-Examiner pipeline to create consistent augmentations for limited-frame monocular avatar reconstruction, but the abstract supplies no numbers or ablations to show whether it actually improves results.

The core idea is straightforward: when you only have a few video frames of a person, generate extra samples by perturbing pose and camera, clean them with one-step diffusion that respects texture and geometry, then keep only the ones that pass an attention-based consistency check. That three-step structure is the main thing the paper brings to the table for 3D avatar work. It is a reasonable response to the data-scarcity problem that shows up in monocular reconstruction papers, and the modules are described clearly enough that someone could try to reimplement the flow. The choice of benchmarks (X-Humans and NeuMan) is also sensible for the subfield.

Beyond that, the paper does not appear to introduce new math or a fundamentally different representation; it recombines existing tools (Gaussian perturbation, diffusion, dual-branch attention) into one pipeline aimed at subject consistency. The abstract does not mention code release or formal verification, so any reproducibility would depend on what the full manuscript supplies.

The obvious soft spot is the complete absence of numbers. The claim that TrioMan outperforms prior methods is stated without error tables, ablation results, or even rough deltas, which makes it impossible to judge whether the generated samples add signal or just noise. The central assumption, that the Examiner will reliably filter for subject-consistent data that helps downstream optimization, remains untested in the provided description.

If the full paper contains proper quantitative support and controls, the framework could be worth trying in practice for people already working on video-based avatar or human NeRF methods. Without those results, it is hard to know how much weight to give the outperformance statement. I would send this to peer review because the problem is real, the pipeline is concrete, and the benchmarks are standard; a referee can check whether the experiments actually back the claim.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes TrioMan, a tri-module data augmentation framework for reconstructing photorealistic and animatable 3D human avatars from monocular videos with limited frames. The Generator creates diverse samples via Gaussian perturbations on pose and camera parameters; the Refiner enhances them using one-step diffusion conditioned on texture and geometry cues; the Examiner filters for subject consistency with a dual-branch attention-based similarity metric. The central claim is that this pipeline yields useful additional training signal and outperforms prior methods on the X-Humans and NeuMan benchmarks.

Significance. If the experimental claims hold, the framework would offer a practical route to mitigate data scarcity in per-subject avatar optimization, potentially improving fine-grained detail capture without requiring additional real captures or heavier reliance on generic human priors.

major comments (1)

[Abstract / Experiments] The claim that 'Experiments on the X-Humans and NeuMan benchmarks show that TrioMan outperforms state-of-the-art methods' is unsupported by any reported metrics, tables, ablation studies, error analysis, or implementation details. This directly undermines assessment of whether the Generator-Refiner-Examiner pipeline produces subject-consistent, high-signal augmentations as assumed.

minor comments (1)

[Method] The description of the dual-branch attention mechanism in the Examiner and the precise conditioning signals in the Refiner would benefit from explicit algorithmic pseudocode or equations to support reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and the identification of this critical issue with the experimental claims. We address the comment below.

Referee: [Abstract / Experiments] The claim that 'Experiments on the X-Humans and NeuMan benchmarks show that TrioMan outperforms state-of-the-art methods' is unsupported by any reported metrics, tables, ablation studies, error analysis, or implementation details. This directly undermines assessment of whether the Generator-Refiner-Examiner pipeline produces subject-consistent, high-signal augmentations as assumed.

Authors: We agree that the claim in the abstract is currently unsupported in the manuscript. The provided text consists only of the abstract and does not contain any quantitative results, tables, ablations, error analysis, or implementation details. In the revised version we will add a full Experiments section with metrics on X-Humans and NeuMan, direct comparisons to prior methods, module ablations, and implementation details so that the performance claims can be properly evaluated.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper presents a tri-module data augmentation framework (Generator-Refiner-Examiner) for 3D avatar learning, with claims resting on empirical benchmark results rather than any mathematical derivation chain. No equations, fitted parameters, self-citations as load-bearing premises, or ansatzes are described in the provided text. The central claim (outperformance on X-Humans and NeuMan) is an experimental outcome, not a quantity that reduces to its own inputs by construction. The method is self-contained against external benchmarks with no internal reduction to tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No information is available from the abstract to populate free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5680 in / 1220 out tokens · 24056 ms · 2026-05-25T05:06:22.031685+00:00 · methodology


work page 2023

[11] [11]

Chen Guo, Junxuan Li, Yash Kant, Yaser Sheikh, Shunsuke Saito, and Chen Cao

work page

[12] [12]

InProceedings of the Computer Vision and Pattern Recognition Conference

Vid2avatar-pro: Authentic avatar from videos in the wild via universal prior. InProceedings of the Computer Vision and Pattern Recognition Conference. 5559–5570

work page

[13] [13]

Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic Models. arXiv:2006.11239 [cs.LG]

work page internal anchor Pith review Pith/arXiv arXiv 2020

[14] [14]

Hezhen Hu, Zhiwen Fan, Tianhao Wu, Yihan Xi, Seoyoung Lee, Georgios Pavlakos, and Zhangyang Wang. 2024. Expressive Gaussian Human Avatars from Monocular RGB Video. InNeurIPS

work page 2024

[15] [15]

Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-excitation networks. InProceed- ings of the IEEE conference on computer vision and pattern recognition. 7132–7141

work page 2018

[16] [16]

Liangxiao Hu, Hongwen Zhang, Yuxiang Zhang, Boyao Zhou, Boning Liu, Sheng- ping Zhang, and Liqiang Nie. 2024. GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

work page 2024

[17] [17]

Liangxiao Hu, Hongwen Zhang, Yuxiang Zhang, Boyao Zhou, Boning Liu, Sheng- ping Zhang, and Liqiang Nie. 2024. Gaussianavatar: Towards realistic human avatar modeling from a single video via animatable 3d gaussians. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 634–644

work page 2024

[18] [18]

Shoukang Hu, Tao Hu, and Ziwei Liu. 2024. Gauhuman: Articulated gaussian splatting from monocular human videos. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition. 20418–20431

work page 2024

[19] [19]

Yi-Hua Huang, Yang-Tian Sun, Ziyi Yang, Xiaoyang Lyu, Yan-Pei Cao, and Xiao- juan Qi. 2024. Sc-gs: Sparse-controlled gaussian splatting for editable dynamic scenes. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4220–4230

work page 2024

[20] [20]

Boyi Jiang, Yang Hong, Hujun Bao, and Juyong Zhang. 2022. Selfrecon: Self reconstruction your digital avatar from monocular video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5605–5615

work page 2022

[21] [21]

Tianjian Jiang, Xu Chen, Jie Song, and Otmar Hilliges. 2023. InstantAvatar: Learning Avatars from Monocular Video in 60 Seconds. (June 2023)

work page 2023

[22] [22]

Tianjian Jiang, Hsuan-I Ho, Manuel Kaufmann, and Jie Song. 2025. PriorAvatar: Efficient and Robust Avatar Creation from Monocular Video Using Learned Priors. InProceedings of the SIGGRAPH Asia 2025 Conference Papers (SA Conference Papers ’25). Association for Computing Machinery, New York, NY, USA, Article 31, 10 pages. doi:10.1145/3757377.3763978

work page doi:10.1145/3757377.3763978 2025

[23] [23]

Wei Jiang, Kwang Moo Yi, Golnoosh Samei, Oncel Tuzel, and Anurag Ranjan

work page

[24] [24]

InProceedings of the European conference on computer vision (ECCV)

NeuMan: Neural Human Radiance Field from a Single Video. InProceedings of the European conference on computer vision (ECCV)

work page

[25] [25]

Daisheng Jin and Ying He. 2026. MonoCloth: Reconstruction and Animation of Cloth-Decoupled Human Avatars from Monocular Videos. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 5503–5511

work page 2026

[26] [26]

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuehler, and George Drettakis

work page

[27] [27]

Graph.42, 4, Article 139 (jul 2023), 14 pages

3D Gaussian Splatting for Real-Time Radiance Field Rendering.ACM Trans. Graph.42, 4, Article 139 (jul 2023), 14 pages. doi:10.1145/3592433

work page doi:10.1145/3592433 2023

[28] [29]

Muhammed Kocabas, Jen-Hao Rick Chang, James Gabriel, Oncel Tuzel, and Anurag Ranjan. 2024. Hugs: Human gaussian splats. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 505–515

work page 2024

[29] [30]

SuBeen Lee, WonJun Moon, Hyun Seok Seong, and Jae-Pil Heo. 2024. Task- oriented channel attention for fine-grained few-shot classification.IEEE Transac- tions on Pattern Analysis and Machine Intelligence(2024)

work page 2024

[30] [31]

Jiahui Lei, Yufu Wang, Georgios Pavlakos, Lingjie Liu, and Kostas Daniilidis

work page

[31] [32]

InProceedings of the IEEE/CVF conference on computer vision and pattern recognition

Gart: Gaussian articulated template models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 19876–19887

work page

[32] [33]

Mingwei Li, Jiachen Tao, Zongxin Yang, and Yi Yang. 2023. Human101: Training 100+FPS Human Gaussians in 100s from 1 View. arXiv:2312.15258 [cs.CV]

work page arXiv 2023

[33] [34]

Mengtian Li, Shengxiang Yao, Zhifeng Xie, and Keyu Chen. 2024. Gaussian- body: Clothed human reconstruction via 3d gaussian splatting.arXiv preprint arXiv:2401.09720(2024)

work page arXiv 2024

[34] [35]

Zhan Li, Zhang Chen, Zhong Li, and Yi Xu. 2024. Spacetime gaussian feature splatting for real-time dynamic view synthesis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8508–8520

work page 2024

[35] [36]

Shanchuan Lin, Anran Wang, and Xiao Yang. 2024. Sdxl-lightning: Progressive adversarial diffusion distillation.arXiv preprint arXiv:2402.13929(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[36] [37]

Youtian Lin, Zuozhuo Dai, Siyu Zhu, and Yao Yao. 2024. Gaussian-flow: 4d reconstruction with dynamic 3d gaussian particle. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 21136–21145

work page 2024

[37] [38]

Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, Haowen Sun, Junliang Ye, Jun Zhang, and Yueqi Duan. 2024. Reconx: Reconstruct any scene from sparse views with video diffusion model.arXiv preprint arXiv:2408.16767(2024)

work page arXiv 2024

[38] [39]

Xinqi Liu and Chenming Wu. 2025. VGA: Reconstructing Vivid 3D Gaussian Avatars from Monocular Videos. InInternational Conference on Computational Visual Media. Springer, 172–193

work page 2025

[39] [40]

Xian Liu, Xiaohang Zhan, Jiaxiang Tang, Ying Shan, Gang Zeng, Dahua Lin, Xihui Liu, and Ziwei Liu. 2024. Humangaussian: Text-driven 3d human generation with gaussian splatting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6646–6657

work page 2024

[40] [41]

Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. 2023. SMPL: A skinned multi-person linear model. InSeminal Graphics Papers: Pushing the Boundaries, Volume 2. 851–866

work page 2023

[41] [42]

Srinivasan, Matthew Tancik, Jonathan T

Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. 2020. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. InECCV

work page 2020

[42] [43]

Gyeongsik Moon, Takaaki Shiratori, and Shunsuke Saito. 2024. Expressive Whole- Body 3D Gaussian Avatar. InECCV

work page 2024

[43] [44]

Gyeongsik Moon, Takaaki Shiratori, and Shunsuke Saito. 2024. Expressive whole- body 3d gaussian avatar. InEuropean Conference on Computer Vision. Springer, 19–35

work page 2024

[44] [45]

Arthur Moreau, Jifei Song, Helisa Dhamo, Richard Shaw, Yiren Zhou, and Ed- uardo Pérez-Pellitero. 2024. Human gaussian splatting: Real-time rendering of animatable avatars. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 788–798

work page 2024

[45] [46]

Jongmin Park, Minh-Quan Viet Bui, Juan Luis Gonzalez Bello, Jaeho Moon, Jihyong Oh, and Munchurl Kim. 2025. Splinegs: Robust motion-adaptive spline for real-time dynamic 3d gaussians from monocular video. InProceedings of the Computer Vision and Pattern Recognition Conference. 26866–26875

work page 2025

[46] [47]

Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. 2019. Expressive Body Capture: 3D Hands, Face, and Body from a Single Image. InProceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)

work page 2019

[47] [48]

Cheng Peng, Jingxiang Sun, Yushuo Chen, Zhaoqi Su, Zhuo Su, and Yebin Liu

work page

[48] [49]

Parametric Gaussian Human Model: Generalizable Prior for Efficient and Realistic Human Avatar Modeling.arXiv preprint arXiv:2506.06645(2025)

work page arXiv 2025

[49] [50]

Zhiyin Qian, Shaofei Wang, Marko Mihajlovic, Andreas Geiger, and Siyu Tang

work page

[50] [51]

3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting. (2024)

work page 2024

[51] [52]

Lingteng Qiu, Xiaodong Gu, Peihao Li, Qi Zuo, Weichao Shen, Junfei Zhang, Kejie Qiu, Weihao Yuan, Guanying Chen, Zilong Dong, et al. 2025. Lhm: Large animatable human reconstruction model from a single image in seconds.arXiv preprint arXiv:2503.10625(2025)

work page arXiv 2025

[52] [53]

Lingteng Qiu, Peihao Li, Qi Zuo, Xiaodong Gu, Yuan Dong, Weihao Yuan, Siyu Zhu, Xiaoguang Han, Guanying Chen, and Zilong Dong. 2025. PF-LHM: 3D Animatable Avatar Reconstruction from Pose-free Articulated Human Images. Conference’17, July 2017, Washington, DC, USA Gangjian Zhang, Jian Shu, Sicheng Yu, Wenhao Shen, Yu Feng, and Hao Wang arXiv preprint arXiv...

work page arXiv 2025

[53] [54]

Javier Romero, Dimitrios Tzionas, and Michael J Black. 2022. Embodied hands: Modeling and capturing hands and bodies together.arXiv preprint arXiv:2201.02610(2022)

work page arXiv 2022

[54] [55]

Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. 2024. Adversarial Diffusion Distillation. InComputer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LXXXVI (Milan, Italy). Springer-Verlag, Berlin, Heidelberg, 87–103. doi:10.1007/978-3- 031-73016-0_6

work page doi:10.1007/978-3- 2024

[55] [57]

Zhijing Shao, Zhaolong Wang, Zhuang Li, Duotun Wang, Xiangru Lin, Yu Zhang, Mingming Fan, and Zeyu Wang. 2024. Splattingavatar: Realistic real-time human avatars with mesh-embedded gaussian splatting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1606–1616

work page 2024

[56] [58]

Kaiyue Shen, Chen Guo, Manuel Kaufmann, Juan Zarate, Julien Valentin, Jie Song, and Otmar Hilliges. 2023. X-Avatar: Expressive Human Avatars.Computer Vision and Pattern Recognition (CVPR)

work page 2023

[57] [59]

Jian Shu, Nanjie Yao, Gangjian Zhang, Junlong Ren, Yu Feng, and Hao Wang. 2025. FastAnimate: Towards Learnable Template Construction and Pose Deformation for Fast 3D Human Avatar Animation. arXiv:2512.01444 [cs.CV] https://arxiv. org/abs/2512.01444

work page arXiv 2025

[58] [60]

Geonhee Sim and Gyeongsik Moon. 2025. PERSONA: Personalized Whole-Body 3D Avatar with Pose-Driven Deformations from a Single Image. InProceedings of the IEEE/CVF International Conference on Computer Vision. 12670–12680

work page 2025

[59] [61]

Shih-Yang Su, Timur Bagautdinov, and Helge Rhodin. 2023. Npc: Neural point characters from video. InProceedings of the IEEE/CVF International conference on computer vision. 14795–14805

work page 2023

[60] [62]

Shih-Yang Su, Frank Yu, Michael Zollhöfer, and Helge Rhodin. 2021. A-nerf: Articulated neural radiance fields for learning human shape, appearance, and pose.Advances in neural information processing systems34 (2021), 12278–12291

work page 2021

[61] [63]

David Svitov, Pietro Morerio, Lourdes Agapito, and Alessio Del Bue. 2024. Haha: Highly articulated gaussian human avatars with textured mesh prior. InProceed- ings of the Asian Conference on Computer Vision. 4051–4068

work page 2024

[62] [64]

Gusi Te, Xiu Li, Xiao Li, Jinglu Wang, Wei Hu, and Yan Lu. 2022. Neural capture of animatable 3d human from monocular video. InEuropean Conference on Computer Vision. Springer, 275–291

work page 2022

[63] [65]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.Advances in neural information processing systems30 (2017)

work page 2017

[64] [66]

Zhou Wang and Alan Conrad Bovik. 2006. Modern image quality assessment. (2006)

work page 2006

[65] [67]

Jing Wen, Xiaoming Zhao, Zhongzheng Ren, Alexander G Schwing, and Shenlong Wang. 2024. Gomavatar: Efficient animatable human modeling from monocular video using gaussians-on-mesh. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2059–2069

work page 2024

[66] [68]

Srinivasan, Jonathan T

Chung-Yi Weng, Brian Curless, Pratul P. Srinivasan, Jonathan T. Barron, and Ira Kemelmacher-Shlizerman. 2022. HumanNeRF: Free-Viewpoint Rendering of Moving People From Monocular Video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 16210–16220

work page 2022

[67] [69]

Chung-Yi Weng, Brian Curless, Pratul P Srinivasan, Jonathan T Barron, and Ira Kemelmacher-Shlizerman. 2022. Humannerf: Free-viewpoint rendering of moving people from monocular video. InProceedings of the IEEE/CVF conference on computer vision and pattern Recognition. 16210–16220

work page 2022

[68] [70]

Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 2024. 4d gaussian splatting for real-time dynamic scene rendering. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 20310–20320

work page 2024

[69] [71]

Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, and Huan Ling. 2025. DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models. InProceedings of the Computer Vision and Pattern Recognition Conference. 26024–26035

work page 2025

[70] [72]

Zhangyang Xiong, Chenghong Li, Kenkun Liu, Hongjie Liao, Jianqiao Hu, Junyi Zhu, Shuliang Ning, Lingteng Qiu, Chongjie Wang, Shijie Wang, et al . 2024. MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 19801–19811

work page 2024

[71] [73]

Jiawei Xu, Zexin Fan, Jian Yang, and Jin Xie. 2024. Grid4d: 4d decomposed hash encoding for high-fidelity dynamic gaussian splatting.Advances in Neural Information Processing Systems37 (2024), 123787–123811

work page 2024

[72] [74]

Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. 2024. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 20331–20341

work page 2024

[73] [75]

Zeyu Yang, Hongye Yang, Zijie Pan, and Li Zhang. 2023. Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting.arXiv preprint arXiv:2310.10642(2023)

work page arXiv 2023

[74] [76]

Nanjie Yao, Gangjian Zhang, Wenhao Shen, Jian Shu, Yu Feng, and Hao Wang

work page

[75] [77]

MultiGO++: Monocular 3D Clothed Human Reconstruction via Geometry- Texture Collaboration.arXiv preprint arXiv:2603.04993(2026)

work page arXiv 2026

[76] [78]

Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. 2024. One-step diffusion with distribution matching distillation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 6613–6623

work page 2024

[77] [79]

Heng Yu, Joel Julin, Zoltán Á Milacski, Koichiro Niinuma, and László A Jeni. 2024. Cogs: Controllable gaussian splatting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 21624–21633

work page 2024

[78] [80]

Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. 2024. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.arXiv preprint arXiv:2409.02048(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[79] [81]

Zhengming Yu, Wei Cheng, Xian Liu, Wayne Wu, and Kwan-Yee Lin. 2023. Mono- human: Animatable human neural field from monocular video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16943– 16953

work page 2023

[80] [82]

Gangjian Zhang, Jian Shu, Nanjie Yao, and Hao Wang. 2025. SAT: Supervisor Regularization and Animation Augmentation for Two-process Monocular Tex- ture 3D Human Reconstruction. InProceedings of the 33rd ACM International Conference on Multimedia(Dublin, Ireland)(MM ’25). Association for Computing Machinery, New York, NY, USA, 10563–10572. doi:10.1145/3746...

work page doi:10.1145/3746027.3755774 2025