arxiv: 2511.22553 · v2 · submitted 2025-11-27 · 💻 cs.CV

Bringing Your Portrait to 3D Presence

Pith reviewed 2026-05-17 04:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D avatar reconstruction · single image · animatable human · synthetic data · UV representation · proxy mesh · in-the-wild · portrait to 3D

The pith

A unified framework turns a single portrait into an animatable 3D human avatar across head, half-body, and full-body scales.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a method to reconstruct animatable 3D human avatars from just one portrait image that works consistently whether the input shows only the head, the upper body, or the entire body. It does so by solving three main issues: features that change with pose and framing, insufficient training data, and unstable initial mesh estimates. The solution rests on a Dual-UV feature mapping that sends image information to a stable canonical space, a way to generate synthetic training data that keeps both visual variety and geometric accuracy, and a tracker that keeps the mesh reliable even when parts are hidden. Because the entire system trains on synthetic half-body data alone yet generalizes to real photos and full bodies, it suggests that high-quality personalized 3D avatars can be created without expensive multi-view capture or real 3D scans.

Core claim

The framework achieves strong in-the-wild generalization by combining three components: a Dual-UV representation that maps image features to a canonical UV space through Core-UV and Shell-UV branches, removing pose and framing effects; a factorized synthetic data manifold that merges 2D generative diversity with 3D-consistent renderings, supported by a training scheme that improves realism and identity consistency; and a robust proxy-mesh tracker that stays stable under partial visibility. When trained exclusively on half-body synthetic data, the model attains state-of-the-art results for head and upper-body reconstruction while remaining competitive for full-body cases.

What carries the argument

Dual-UV representation with Core-UV and Shell-UV branches that map image features to a canonical UV space to eliminate pose- and framing-induced shifts.
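
To illustrate why a canonical UV space can remove framing effects, here is a minimal NumPy sketch; it is not the paper's implementation. It assumes each visible pixel already carries the UV coordinate of the proxy-mesh point it sees (a hypothetical input here) and averages pixel features into UV texels. The function name `scatter_to_uv`, its arguments, and the tiny default resolution are all assumptions made for the example.

```python
import numpy as np

def scatter_to_uv(features, uv, visible, uv_res=4):
    """Average per-pixel features into a uv_res x uv_res canonical UV map.

    features: (N, C) feature vectors sampled at N image pixels.
    uv:       (N, 2) UV coordinates in [0, 1) of the mesh point each pixel
              sees (hypothetical input, e.g. from rasterizing a proxy mesh).
    visible:  (N,) boolean mask, True where the pixel hits the mesh.
    """
    channels = features.shape[1]
    uv_map = np.zeros((uv_res, uv_res, channels))
    counts = np.zeros((uv_res, uv_res))
    # Convert continuous UV coordinates to integer texel indices.
    texel = np.clip((uv * uv_res).astype(int), 0, uv_res - 1)
    u, v = texel[visible, 0], texel[visible, 1]
    # np.add.at accumulates correctly when several pixels hit one texel.
    np.add.at(uv_map, (v, u), features[visible])
    np.add.at(counts, (v, u), 1)
    filled = counts > 0
    # Turn accumulated sums into means; unseen texels stay zero.
    uv_map[filled] /= counts[filled][:, None]
    return uv_map, filled
```

The texel index depends only on the surface point, not on where that point appears in the image, so cropping or re-posing the subject changes which texels are filled but not where a given surface feature lands. The Shell-UV branch, which per the Figure 2 caption uses an offset shell to capture off-surface regions such as hair and clothing, is not sketched here.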

If this is right

  • Reconstruction becomes possible from single images rather than requiring multiple views or videos.
  • The model generalizes from synthetic half-body training to real-world full-body portraits.
  • Animatable avatars can be produced at different body scales with one unified approach.
  • Proxy mesh estimation remains stable even with incomplete visibility in the input.
  • Reliance on real 3D scanned data for training is reduced through the synthetic manifold.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the Dual-UV mapping proves robust, it could be adapted for reconstructing other dynamic objects like animals or clothing from single views.
  • The factorized data approach might enable easy scaling to new identities by swapping in different generative models without retraining the full system.
  • Competitive full-body performance suggests potential for extension to complete body animation including legs and hands with minimal additional data.
  • Strong in-the-wild results imply applications in mobile apps for quick avatar creation from selfies.

Load-bearing premise

The factorized synthetic data manifold combined with the described training scheme provides enough realism and identity consistency to support strong in-the-wild generalization despite training exclusively on half-body synthetic data.

What would settle it

Run the model on a diverse set of real in-the-wild portraits with unusual poses, framings, or demographics, and measure reconstruction quality against ground-truth 3D models. Errors that exceed those on synthetic tests would falsify the generalization claim.

Figures

Figures reproduced from arXiv: 2511.22553 by Chong Li, Hao Zhu, Jiahao Li, Jiawei Zhang, Lei Chu, Xiao Li, Xun Cao, Yan Lu, Zhenyu Zang.

Figure 1. Our method uses a dual-UV formulation to represent 3D avatars, enabling reconstruction from full-body, half-body, and headshot
Figure 2. Reconstruction Pipeline. Given a reference image and its tracked proxy mesh, dense features from a frozen encoder are sampled along visible rays and scattered into canonical UV space to form the Core-UV map, while an offset shell captures off-surface regions such as hair and clothing. The Core-UV and Shell-UV tokens are fused and decoded by a lightweight transformer to reconstruct UV-space Gaussian attribu…
Figure 3. Data Curation. We build a hybrid dataset by combining geometry-anchored 3D rendering with semantics-driven generative synthesis. The synthetic rendering branch offers geometry-consistent multi-view supervision through procedural sampling of identity, pose, appearance, illumination, and cameras. The generative branch constructs a factorized appearance manifold by decomposing scene attributes, applying LLM-b…
Figure 4. Reenactment Results. Our method, trained solely on upper-body data, generalizes well to head and full-body inputs
Figure 5. Novel View Synthesis. Our method generates multi-view human renderings from a single reference image, showing comparatively more consistent appearance, especially in the head and upper-body regions
Figure 6. Editing Results. Our model supports various appearance edits from a single image, demonstrating its adaptability to diverse visual conditions
Figure 7. Multiple Input. Our model is capable of taking multiple images as input, indicating its potential flexibility in leveraging multi-view information. Dataset Scalability. We also study the impact of training data type and scale. As shown in Tab. 2 (b) and (c), model performance improves steadily as the dataset grows, highlighting the benefit of larger and more diverse supervision. When trained only on synthe…
Figure 8. A conceptual illustration of Bringing Your Portrait to 3D Presence. Our pipeline transforms everyday portrait images into fully controllable 3D avatars that can be animated via a tracked proxy mesh. The model is trained entirely on a hybrid synthetic corpus combining rendered and generative sources. Thanks to our dual-UV representation, the system robustly handles inputs of varying completeness, ranging fr…
Figure 9. UV Topology Visualization and Position Map. We visualize the modified UV topology and the corresponding position map used for sinusoidal encoding. Reconstruction loss. For each view v ∈ {ref, tgt}, we supervise image fidelity using pixel and perceptual losses: L_rec^(v) = λ_L1 …
Figure 10. Estimation Pipeline Diagram. We illustrate our proxy-mesh estimation pipeline using a single image for clarity, while noting that the pipeline naturally supports parallel processing for multi-frame inputs. Starting from an input image, we preprocess it to extract a foreground mask and apply a pretrained human mesh recovery model to obtain an initial mesh estimate. The initial estimate is subsequently refi…
Figure 11. Hands Missing Prediction. Multi-stage methods, such as PIXIE, often produce unpredictable results when hand regions are missing
Figure 12. Multi-HMR and OSX. We find that OSX, trained primarily on upper-body data, produces reasonable results when hands are not visible, whereas Multi-HMR often yields unsatisfactory predictions.
Figure 13. Synthetic Rendering Dataset. Our synthetic rendering dataset contains diverse body poses, rendered from multiple viewpoints with perfect mesh annotations, providing strong structural priors for model training.
Figure 14. Filmic Realism Regularization. The structured templates are processed by a lightweight LLM that improves linguistic fluency and resolves inconsistencies, yielding scene descriptions with enhanced realism and contextual coherence.
Figure 15. Outfit-centric Generation. Generation guided by outfit produces visually coherent and structurally consistent human images.
Figure 16. Role-centric Generation. Role-guided composition produces human images with noticeably more complex textures and styles.
Figure 17. Side/Back-view Augmentation. We leverage advanced image-editing models to supplement abundant side- and rear-view information.
Figure 18. Proxy Mesh Estimation. We showcase how our tracker, GUAVA, and LHM perform on arbitrary upper-body images, highlighting the robustness under unconstrained input conditions.

Original abstract

We present a unified framework for reconstructing animatable 3D human avatars from a single portrait across head, half-body, and full-body inputs. Our method tackles three bottlenecks: pose- and framing-sensitive feature representations, limited scalable data, and unreliable proxy-mesh estimation. We introduce a Dual-UV representation that maps image features to a canonical UV space via Core-UV and Shell-UV branches, eliminating pose- and framing-induced token shifts. We also build a factorized synthetic data manifold combining 2D generative diversity with geometry-consistent 3D renderings, supported by a training scheme that improves realism and identity consistency. A robust proxy-mesh tracker maintains stability under partial visibility. Together, these components enable strong in-the-wild generalization. Trained only on half-body synthetic data, our model achieves state-of-the-art head and upper-body reconstruction and competitive full-body results. Extensive experiments and analyses further validate the effectiveness of our approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents a unified framework for reconstructing animatable 3D human avatars from a single portrait image, applicable to head, half-body, and full-body inputs. It introduces a Dual-UV representation with Core-UV and Shell-UV branches to map features to canonical space, a factorized synthetic data manifold combining 2D generative diversity with 3D-consistent renderings, and a robust proxy-mesh tracker for stability under partial visibility. The central claim is that training exclusively on half-body synthetic data enables state-of-the-art head and upper-body reconstruction, competitive full-body results, and strong in-the-wild generalization.

Significance. If the generalization claims hold with supporting evidence, the work could meaningfully advance single-image 3D avatar reconstruction by mitigating data scarcity and proxy estimation issues through synthetic factorization and architectural innovations. The Dual-UV approach and training scheme offer a potentially reusable strategy for handling pose/framing variations. However, the overall significance is limited by the absence of direct quantitative validation for the synthetic-to-real transfer on full-body cases.

major comments (3)
  1. [Abstract and §5] Abstract and §5 (Experiments): The claim that the model 'achieves state-of-the-art head and upper-body reconstruction and competitive full-body results' when trained only on half-body synthetic data is not accompanied by any quantitative metrics, ablation studies, error bars, or baseline comparisons. Without these in the experiments, it is impossible to determine whether the data support the stated performance claims or to attribute gains to the Dual-UV branches versus the data manifold.
  2. [§4.3] §4.3 (Data manifold and training scheme): The central generalization claim—that the factorized synthetic data manifold plus Core-UV/Shell-UV training produces sufficient realism and identity consistency for in-the-wild full-body inputs—rests on an untested transfer. Half-body data inherently lacks lower-body pose/occlusion statistics, and no ablation isolates the manifold's contribution on real full-body test images; if this transfer fails, the SOTA and competitive results cannot be credited to the proposed components.
  3. [§4.4] §4.4 (Proxy-mesh tracker): The robust proxy-mesh tracker is presented as solving unreliable estimation under partial visibility, yet no quantitative evaluation (e.g., stability metrics or failure rates versus baselines on occluded full-body cases) is reported. This component is load-bearing for the full-body results but lacks the evidence needed to confirm its contribution.
minor comments (2)
  1. [Figure 2] Figure 2: The Dual-UV visualization would benefit from explicit arrows or labels clarifying how image features are mapped through the Core-UV and Shell-UV branches to the canonical space.
  2. [§3.2] §3.2: The notation for the factorized synthetic data manifold could be formalized with an equation defining the combination of 2D generative diversity and geometry-consistent 3D renderings.
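
To make the idea of a factorized manifold in the comments above more concrete, here is a small Python sketch of independent per-factor sampling followed by a structured template. It is an illustration only: the factor names, the vocabularies, and the template wording are hypothetical and are not taken from the paper, whose actual attribute decomposition is not given on this page.

```python
import random

# Hypothetical factor vocabularies, invented for this example.
FACTORS = {
    "outfit": ["denim jacket", "wool coat", "linen shirt"],
    "role": ["chef", "violinist", "hiker"],
    "lighting": ["soft window light", "overcast daylight"],
    "camera": ["front view", "three-quarter view"],
}

def sample_scene(rng):
    """Draw each factor independently of the others."""
    return {name: rng.choice(values) for name, values in FACTORS.items()}

def to_template(scene):
    """Fill a fixed structured template that a later rewriting step could polish."""
    return (
        f"A half-body portrait of a {scene['role']} wearing a {scene['outfit']}, "
        f"{scene['lighting']}, {scene['camera']}."
    )

def manifold_size():
    """Number of distinct factor combinations (product of vocabulary sizes)."""
    size = 1
    for values in FACTORS.values():
        size *= len(values)
    return size
```

Because the factors are drawn independently, the number of distinct scenes grows as the product of the vocabulary sizes, which is one plausible reading of how a factorized design scales diversity without enumerating every combination by hand.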

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We sincerely thank the referee for the constructive and detailed feedback. The comments highlight important opportunities to strengthen the quantitative support for our claims. We address each major comment point by point below and outline the revisions we will make.

  1. Referee: [Abstract and §5] The claim that the model 'achieves state-of-the-art head and upper-body reconstruction and competitive full-body results' when trained only on half-body synthetic data is not accompanied by any quantitative metrics, ablation studies, error bars, or baseline comparisons. Without these in the experiments, it is impossible to determine whether the data support the stated performance claims or to attribute gains to the Dual-UV branches versus the data manifold.

    Authors: We appreciate this observation. Our current experiments emphasize qualitative visual comparisons and in-the-wild generalization results, which we believe demonstrate the effectiveness of the approach. To provide more rigorous validation, we will add quantitative metrics (e.g., PSNR, SSIM, LPIPS) on synthetic test sets, baseline comparisons, and ablations isolating the Dual-UV and data manifold contributions, including error bars from repeated runs. These will be incorporated into the revised manuscript. revision: yes

  2. Referee: [§4.3] The central generalization claim—that the factorized synthetic data manifold plus Core-UV/Shell-UV training produces sufficient realism and identity consistency for in-the-wild full-body inputs—rests on an untested transfer. Half-body data inherently lacks lower-body pose/occlusion statistics, and no ablation isolates the manifold's contribution on real full-body test images; if this transfer fails, the SOTA and competitive results cannot be credited to the proposed components.

    Authors: The factorized data manifold combines 2D generative diversity with 3D-consistent renderings precisely to support generalization beyond the half-body training distribution, with the Dual-UV representation further mitigating pose and framing variations. We agree that an explicit ablation on real full-body inputs would strengthen attribution of the results. In the revision we will add such an ablation evaluating the manifold's isolated contribution on real full-body test cases. revision: yes

  3. Referee: [§4.4] The robust proxy-mesh tracker is presented as solving unreliable estimation under partial visibility, yet no quantitative evaluation (e.g., stability metrics or failure rates versus baselines on occluded full-body cases) is reported. This component is load-bearing for the full-body results but lacks the evidence needed to confirm its contribution.

    Authors: We acknowledge that quantitative evidence for the proxy-mesh tracker's robustness would better substantiate its role. We will add stability metrics (e.g., average vertex displacement and failure rates under occlusion) and comparisons against baseline trackers on occluded full-body cases in the experiments section of the revised manuscript. revision: yes
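
Some of the metrics named in the responses above are simple to state precisely. The NumPy sketch below shows PSNR, average vertex displacement, and a threshold-based failure rate. The function names and the idea of a fixed error threshold are assumptions made for illustration, not the authors' evaluation protocol. SSIM and LPIPS are omitted because they need windowed statistics and a pretrained network, respectively.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

def mean_vertex_displacement(verts_a, verts_b):
    """Average Euclidean distance between corresponding vertices, each (V, 3)."""
    return float(np.mean(np.linalg.norm(verts_a - verts_b, axis=1)))

def failure_rate(per_image_error, threshold):
    """Fraction of images whose error exceeds a chosen (hypothetical) threshold."""
    return float(np.mean(np.asarray(per_image_error) > threshold))
```

Mean vertex displacement only makes sense when both meshes share the same topology and vertex ordering, which holds when two trackers output the same parametric body model and would otherwise require a correspondence step first.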

Circularity Check

0 steps flagged

Novel components and data scheme presented without self-referential reductions or fitted predictions

full rationale

The paper introduces Dual-UV representation (Core-UV and Shell-UV branches), a factorized synthetic data manifold, and a robust proxy-mesh tracker as new elements to address pose/framing issues, data scalability, and proxy estimation. These are described as enabling strong in-the-wild generalization from half-body synthetic training data to head/upper-body SOTA and competitive full-body results. No equations, predictions, or central claims reduce by construction to fitted parameters, self-definitions, or self-citation chains. Extensive experiments are cited as independent validation, making the derivation self-contained against external benchmarks with only minor self-citation risk at most.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The abstract introduces the Dual-UV representation and synthetic data manifold as core new elements but does not detail numerical free parameters or external validation.

axioms (1)
  • domain assumption: A factorized synthetic data manifold can combine 2D generative diversity with geometry-consistent 3D renderings to improve realism and identity consistency.
    Invoked to support the training scheme that enables in-the-wild generalization from half-body data only.
invented entities (2)
  • Dual-UV representation (no independent evidence)
    purpose: Maps image features to a canonical UV space via Core-UV and Shell-UV branches to eliminate pose- and framing-induced token shifts.
    New representation introduced to address pose- and framing-sensitive feature representations.
  • robust proxy-mesh tracker (no independent evidence)
    purpose: Keeps proxy-mesh estimation stable under partial visibility, where it is otherwise unreliable.
    Component added to handle partial visibility cases.

pith-pipeline@v0.9.0 · 5474 in / 1317 out tokens · 54296 ms · 2026-05-17T04:33:34.263678+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. UIKA: Fast Universal Head Avatar from Pose-Free Images

    cs.CV 2026-01 conditional novelty 7.0

    UIKA is a feed-forward animatable Gaussian head model using UV-guided correspondence estimation and learnable UV tokens with dual-level attention, trained on large-scale synthetic data to handle pose-free inputs.

Reference graph

Works this paper leans on

114 extracted references · 114 canonical work pages · cited by 1 Pith paper

  1. [1]

    Gaussian shell maps for efficient 3d hu- man generation

    Rameen Abdal, Wang Yifan, Zifan Shi, Yinghao Xu, Ryan Po, Zhengfei Kuang, Qifeng Chen, Dit-Yan Yeung, and Gordon Wetzstein. Gaussian shell maps for efficient 3d hu- man generation. InCVPR, 2024. 3

  2. [2]

    Ogras, and Linjie Luo

    Sizhe An, Hongyi Xu, Yichun Shi, Guoxian Song, Umit Y . Ogras, and Linjie Luo. Panohead: Geometry-aware 3d full- head synthesis in 360deg. InCVPR, pages 20950–20959,

  3. [3]

    Multi-hmr: Multi-person whole-body hu- man mesh recovery in a single shot

    Fabien Baradel*, Matthieu Armando, Salma Galaaoui, Ro- main Br ´egier, Philippe Weinzaepfel, Gr ´egory Rogez, and Thomas Lucas*. Multi-hmr: Multi-person whole-body hu- man mesh recovery in a single shot. InECCV, 2024. 5

  4. [4]

    Jonathan T. Barron. A general and adaptive robust loss function, 2019. 7

  5. [5]

    A morphable model for the synthesis of 3d faces

    V olker Blanz and Thomas Vetter. A morphable model for the synthesis of 3d faces. InACM TOG, page 187–194, USA, 1999. ACM Press/Addison-Wesley Publishing Co. 3

  6. [6]

    B ¨uhler, Ye Yuan, Xueting Li, Yangyi Huang, Koki Nagano, and Umar Iqbal

    Marcel C. B ¨uhler, Ye Yuan, Xueting Li, Yangyi Huang, Koki Nagano, and Umar Iqbal. Dream, lift, animate: From single images to animatable gaussian avatars, 2025. 3

  7. [7]

    Hera: Hybrid explicit representation for ultra-realistic head avatars

    Hongrui Cai, Yuting Xiao, Xuan Wang, Jiafei Li, Yudong Guo, Yanbo Fan, Shenghua Gao, and Juyong Zhang. Hera: Hybrid explicit representation for ultra-realistic head avatars. InCVPR, 2025. 3

  8. [8]

    Facewarehouse: A 3d facial expression database for visual computing.IEEE Transactions on Visualization and Computer Graphics, 20(3):413–425, 2014

    Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. Facewarehouse: A 3d facial expression database for visual computing.IEEE Transactions on Visualization and Computer Graphics, 20(3):413–425, 2014. 2

  9. [9]

    Real-time facial animation with image-based dynamic avatars.ACM TOG, 35(4), 2016

    Chen Cao, Hongzhi Wu, Yanlin Weng, Tianjia Shao, and Kun Zhou. Real-time facial animation with image-based dynamic avatars.ACM TOG, 35(4), 2016. 3

  10. [10]

    pi-gan: Periodic implicit genera- tive adversarial networks for 3d-aware image synthesis

    Eric Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-gan: Periodic implicit genera- tive adversarial networks for 3d-aware image synthesis. In CVPR, 2021. 3

  11. [11]

    Chan, Connor Z

    Eric R. Chan, Connor Z. Lin, Matthew A. Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, Tero Karras, and Gordon Wetzstein. Efficient geometry-aware 3D generative adversarial networks. InCVPR, 2022. 3

  12. [12]

    Magicpose: Realistic human poses and facial expressions retargeting with identity-aware diffu- sion

    Di Chang, Yichun Shi, Quankai Gao, Hongyi Xu, Jessica Fu, Guoxian Song, Qing Yan, Yizhe Zhu, Xiao Yang, and Mohammad Soleymani. Magicpose: Realistic human poses and facial expressions retargeting with identity-aware diffu- sion. InICML, pages 6263–6285, 2024. 3

  13. [13]

    Taoavatar: Real-time lifelike full-body talking avatars for augmented reality via 3d gaussian splatting

    Jianchuan Chen, Jingchuan Hu, Gaige Wang, Zhonghua Jiang, Tiansong Zhou, Zhiwen Chen, and Chengfei Lv. Taoavatar: Real-time lifelike full-body talking avatars for augmented reality via 3d gaussian splatting. InCVPR, pages 10723–10734, 2025. 2

  14. [14]

    Synchuman: Synchronizing 2d and 3d diffusion models for single-view human reconstruction

    Wenyue Chen, Peng Li, Wangguandong Zheng, Chengfeng Zhao, Mengfei Li, Yaolong Zhu, Zhiyang Dou, Ronggang Wang, and Yuan Liu. Synchuman: Synchronizing 2d and 3d diffusion models for single-view human reconstruction. InNeurIPS, 2025. 3

  15. [15]

    Dna-rendering: A diverse neural actor repository for high-fidelity human-centric rendering

    Wei Cheng, Ruixiang Chen, Siming Fan, Wanqi Yin, Keyu Chen, Zhongang Cai, Jingbo Wang, Yang Gao, Zheng- ming Yu, Zhengyu Lin, Daxuan Ren, Lei Yang, Ziwei Liu, Chen Change Loy, Chen Qian, Wayne Wu, Dahua Lin, Bo Dai, and Kwan-Yee Lin. Dna-rendering: A diverse neural actor repository for high-fidelity human-centric rendering. InICCV, pages 19982–19993, 2023. 2

  16. [16]

    Generalizable and an- imatable gaussian head avatar

    Xuangeng Chu and Tatsuya Harada. Generalizable and an- imatable gaussian head avatar. InThe Thirty-eighth An- nual Conference on Neural Information Processing Sys- tems, 2024. 7

  17. [17]

    The light stages and their applications to photoreal digital actors.ACM TOG, 2(4):1–6, 2012

    Paul Debevec. The light stages and their applications to photoreal digital actors.ACM TOG, 2(4):1–6, 2012. 3

  18. [18]

    Black, Ot- mar Hilliges, and Andreas Geiger

    Zijian Dong, Xu Chen, Jinlong Yang, Michael J. Black, Ot- mar Hilliges, and Andreas Geiger. AG3D: Learning to gen- erate 3D avatars from 2D image collections. InICCV, 2023. 3

  19. [19]

    Tam- ing transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bj ¨orn Ommer. Tam- ing transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), pages 12873–12883,

  20. [20]

    Yao Feng, Vasileios Choutas, Timo Bolkart, Dimitrios Tzionas, and Michael J. Black. Collaborative regression of expressive bodies using moderation. InInternational Con- ference on 3D Vision (3DV), 2021. 5

  21. [21]

    Stylegan-human: A data-centric odyssey of human genera- tion

    Jianglin Fu, Shikai Li, Yuming Jiang, Kwan-Yee Lin, Chen Qian, Chen-Change Loy, Wayne Wu, and Ziwei Liu. Stylegan-human: A data-centric odyssey of human genera- tion. InECCV, pages 729–747, 2022. 3, 6

  22. [22]

    Portrait video editing em- powered by multimodal generative priors

    Xuan Gao, Haiyao Xiao, Chenglai Zhong, Shimin Hu, Yudong Guo, and Juyong Zhang. Portrait video editing em- powered by multimodal generative priors. InSIGGRAPH Asia Conference Proceedings, 2024. 3

  23. [23]

    Controlling avatar diffusion with learnable gaussian embedding

    Xuan Gao, Jingtao Zhou, Dongyu Liu, Yuqi Zhou, and Juyong Zhang. Controlling avatar diffusion with learnable gaussian embedding. InProceedings of SIGGRAPH Asia 2025, 2025. 3, 5

  24. [24]

    Talk-act: Enhance textural-awareness for 2d speaking avatar reenactment with diffusion model

    Jiazhi Guan, Quanwei Yang, Kaisiyuan Wang, Hang Zhou, Shengyi He, Zhiliang Xu, Haocheng Feng, Errui Ding, Jingdong Wang, Hongtao Xie, Youjian Zhao, and Ziwei Liu. Talk-act: Enhance textural-awareness for 2d speaking avatar reenactment with diffusion model. InSIGGRAPH Asia 2024 Conference Papers, 2024. 3

  25. [25]

    Vid2avatar: 3d avatar reconstruction from videos in the wild via self-supervised scene decomposition

    Chen Guo, Tianjian Jiang, Xu Chen, Jie Song, and Otmar Hilliges. Vid2avatar: 3d avatar reconstruction from videos in the wild via self-supervised scene decomposition. In CVPR, 2023. 3 9

  26. [26]

    Sega: Drivable 3d gaussian head avatar from a single im- age, 2025

    Chen Guo, Zhuo Su, Jian Wang, Shuang Li, Xu Chang, Zhaohu Li, Yang Zhao, Guidong Wang, and Ruqi Huang. Sega: Drivable 3d gaussian head avatar from a single im- age, 2025. 3

  27. [27]

    High-fidelity 3d hu- man digitization from single 2k resolution images

    Sang-Hun Han, Min-Gyu Park, Ju Hong Yoon, Ju-Mi Kang, Young-Jae Park, and Hae-Gon Jeon. High-fidelity 3d hu- man digitization from single 2k resolution images. In CVPR, 2023. 2

  28. [28]

    Lam: Large avatar model for one-shot animatable gaussian head

    Yisheng He, Xiaodong Gu, Xiaodan Ye, Chao Xu, Zhengyi Zhao, Yuan Dong, Weihao Yuan, Zilong Dong, and Liefeng Bo. Lam: Large avatar model for one-shot animatable gaussian head. InProceedings of SIGGRAPH, pages 1–13,

  29. [29]

    Look ma, no markers: holistic per- formance capture without the hassle.ACM TOG, 43(6),

    Charlie Hewitt, Fatemeh Saleh, Sadegh Aliakbarian, Lohit Petikam, Shideh Rezaeifar, Louis Florentin, Zafiirah Ho- senie, Thomas J Cashman, Julien Valentin, Darren Cosker, and Tadas Baltruˇsaitis. Look ma, no markers: holistic per- formance capture without the hassle.ACM TOG, 43(6),

  30. [30]

    Eva3d: Compositional 3d human generation from 2d image collections.ICLR, 2022

    Fangzhou Hong, Zhaoxi Chen, Yushi Lan, Liang Pan, and Ziwei Liu. Eva3d: Compositional 3d human generation from 2d image collections.ICLR, 2022. 3

  31. [31]

    Headnerf: A real-time nerf-based parametric head model

    Yang Hong, Bo Peng, Haiyao Xiao, Ligang Liu, and Juyong Zhang. Headnerf: A real-time nerf-based parametric head model. InCVPR, 2022. 3

  32. [32]

    Lrm: Large reconstruction model for single image to 3d

    Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. InICLR, 2024. 1, 3, 4

  33. [33]

    Adahuman: Animatable detailed 3d human genera- tion with compositional multiview diffusion

    Yangyi Huang, Ye Yuan, Xueting Li, Jan Kautz, and Umar Iqbal. Adahuman: Animatable detailed 3d human genera- tion with compositional multiview diffusion. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision (ICCV), pages 13533–13543, 2025. 3

  34. [34]

    Humanrf: High-fidelity neural radiance fields for humans in motion.ACM TOG, 42(4):1–12, 2023

    Mustafa Is ¸ık, Martin R¨unz, Markos Georgopoulos, Taras Khakhulin, Jonathan Starck, Lourdes Agapito, and Matthias Nießner. Humanrf: High-fidelity neural radiance fields for humans in motion.ACM TOG, 42(4):1–12, 2023. 2, 5

  35. [35]

    Learning high fi- delity depths of dressed humans by watching social media dance videos

    Yasamin Jafarian and Hyun Soo Park. Learning high fi- delity depths of dressed humans by watching social media dance videos. InCVPR, pages 12753–12762, 2021. 3

  36. [36]

    Dif- fuman4d: 4d consistent human view synthesis from sparse- view videos with spatio-temporal diffusion models

    Yudong Jin, Sida Peng, Xuan Wang, Tao Xie, Zhen Xu, Yi- fan Yang, Yujun Shen, Hujun Bao, and Xiaowei Zhou. Dif- fuman4d: 4d consistent human view synthesis from sparse- view videos with spatio-temporal diffusion models. In ICCV, 2025. 3

  37. [37]

    Pippo: High-resolution multi-view humans from a single image

    Yash Kant, Ethan Weber, Jin Kyu Kim, Rawal Khirodkar, Su Zhaoen, Julieta Martinez, Igor Gilitschenski, Shunsuke Saito, and Timur Bagautdinov. Pippo: High-resolution multi-view humans from a single image. InCVPR, 2025. 3, 5

  38. [38]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intel- ligence, 43(12):4217–4228, 2021. 3

  39. [39]

    3d gaussian splatting for real-time radiance field rendering.ACM TOG, 42(4), 2023

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM TOG, 42(4), 2023. 3

  40. [40]

    Sapiens: Foundation for human vision models

    Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Zhaoen Su, Austin James, Peter Selednik, Stuart Anderson, and Shunsuke Saito. Sapiens: Foundation for human vision models. InECCV, 2024. 4, 5

  41. [41]

    Kingma and Jimmy Ba

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. 6, 7

  42. [42]

    Nersemble: Multi-view radiance field reconstruction of human heads

    Tobias Kirschstein, Shenhan Qian, Simon Giebenhain, Tim Walter, and Matthias Nießner. Nersemble: Multi-view radiance field reconstruction of human heads. ACM TOG, 2023. 2, 6

  43. [43]

    GGHead: Fast and Generalizable 3D Gaussian Heads

    Tobias Kirschstein, Simon Giebenhain, Jiapeng Tang, Markos Georgopoulos, and Matthias Nießner. GGHead: Fast and Generalizable 3D Gaussian Heads. In SIGGRAPH Asia Conference Papers, 2024. 3

  44. [44]

    Dreamhuman: Animatable 3d avatars from text

    Nikos Kolotouros, Thiemo Alldieck, Andrei Zanfir, Eduard Bazavan, Mihai Fieraru, and Cristian Sminchisescu. Dreamhuman: Animatable 3d avatars from text. NeurIPS, 36:10516–10529, 2023. 3

  45. [45]

    Project starline: a high-fidelity telepresence system

    Jason Lawrence, Dan B Goldman, Supreeth Achar, Gregory Major Blascovich, Joseph G. Desloge, Tommy Fortes, Eric M. Gomez, Sascha Häberling, Hugues Hoppe, Andy Huibers, Claude Knaus, Brian Kuschak, Ricardo Martin-Brualla, Harris Nover, Andrew Ian Russell, Steven M. Seitz, and Kevin Tong. Project starline: a high-fidelity telepresence system. ACM TOG, 40(...

  46. [46]

    Spherehead: Stable 3d full-head synthesis with spherical tri-plane representation

    Heyuan Li, Ce Chen, Tianhao Shi, Yuda Qiu, Sizhe An, Guanying Chen, and Xiaoguang Han. Spherehead: Stable 3d full-head synthesis with spherical tri-plane representation. In ECCV, 2024. 3

  47. [47]

    Hyplanehead: Rethinking tri-plane-like representations in full-head image synthesis

    Heyuan Li, Kenkun Liu, Lingteng Qiu, Qi Zuo, Keru Zheng, Zilong Dong, and Xiaoguang Han. Hyplanehead: Rethinking tri-plane-like representations in full-head image synthesis. In NeurIPS, 2025. Poster. 3

  48. [48]

    Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation

    Hui Li, Mingwang Xu, Yun Zhan, Shan Mu, Jiaye Li, Kaihui Cheng, Yuxuan Chen, Tan Chen, Mao Ye, Jingdong Wang, et al. Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation. In CVPR, 2025. 3, 6

  49. [49]

    Uravatar: Universal relightable gaussian codec avatars

    Junxuan Li, Chen Cao, Gabriel Schwartz, Rawal Khirodkar, Christian Richardt, Tomas Simon, Yaser Sheikh, and Shunsuke Saito. Uravatar: Universal relightable gaussian codec avatars. In SIGGRAPH Conference Papers, 2024. 3

  50. [50]

    Pshuman: Photorealistic single-view human reconstruction using cross-scale diffusion

    Peng Li, Wangguandong Zheng, Yuan Liu, Tao Yu, Yangguang Li, Xingqun Qi, Mengfei Li, Xiaowei Chi, Siyu Xia, Wei Xue, et al. Pshuman: Photorealistic single-view human reconstruction using cross-scale diffusion. In CVPR, 2025. 3

  51. [51]

    Tianye Li, Timo Bolkart, Michael J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4D scans. ACM TOG, 36(6):194:1–194:17,

  52. [52]

    Animatable gaussians: Learning pose-dependent gaussian maps for high-fidelity human avatar modeling

    Zhe Li, Zerong Zheng, Lizhen Wang, and Yebin Liu. Animatable gaussians: Learning pose-dependent gaussian maps for high-fidelity human avatar modeling. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19711–19722, 2024. 3

  53. [53]

    Cyberhost: A one-stage diffusion framework for audio-driven talking body generation

    Gaojie Lin, Jianwen Jiang, Chao Liang, Tianyun Zhong, Jiaqi Yang, Zerong Zheng, and Yanbo Zheng. Cyberhost: A one-stage diffusion framework for audio-driven talking body generation. In ICLR, 2025. 3

  54. [54]

    One-stage 3d whole-body mesh recovery with component aware transformer

    Jing Lin, Ailing Zeng, Haoqian Wang, Lei Zhang, and Yu Li. One-stage 3d whole-body mesh recovery with component aware transformer. In CVPR, pages 21159–21168,

  55. [55]

    Tango: Co-speech gesture video reenactment with hierarchical audio motion embedding and diffusion interpolation

    Haiyang Liu, Xingchao Yang, Tomoya Akiyama, Yuantian Huang, Qiaoge Li, Shigeru Kuriyama, and Takafumi Taketomi. Tango: Co-speech gesture video reenactment with hierarchical audio motion embedding and diffusion interpolation. In ICLR, 2025. 3

  56. [56]

    Humangaussian: Text-driven 3d human generation with gaussian splatting

    Xian Liu, Xiaohang Zhan, Jiaxiang Tang, Ying Shan, Gang Zeng, Dahua Lin, Xihui Liu, and Ziwei Liu. Humangaussian: Text-driven 3d human generation with gaussian splatting. In CVPR, 2024. 3

  57. [57]

    Gas: Generative avatar synthesis from a single image

    Yixing Lu, Junting Dong, Youngjoong Kwon, Qin Zhao, Bo Dai, and Fernando De la Torre. Gas: Generative avatar synthesis from a single image. In ICCV, 2025. 3

  58. [58]

    Codec Avatar Studio: Paired Human Captures for Complete, Driveable, and Generalizable Avatars

    Julieta Martinez, Emily Kim, Javier Romero, et al. Codec Avatar Studio: Paired Human Captures for Complete, Driveable, and Generalizable Avatars. NeurIPS, 2024. 2, 5

  59. [59]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020. 3

  60. [60]

    Expressive whole-body 3D gaussian avatar

    Gyeongsik Moon, Takaaki Shiratori, and Shunsuke Saito. Expressive whole-body 3D gaussian avatar. In ECCV,

  61. [61]

    Numerical Optimization

    Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, NY, USA, second edition, 2006. 2

  62. [62]

    Introducing gpt-5

    OpenAI. Introducing gpt-5, 2025. Blog post. 5

  63. [63]

    Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patric...

  64. [64]

    Renderme-360: Large digital asset library and benchmark towards high-fidelity head avatars

    Dongwei Pan, Long Zhuo, Jingtan Piao, Huiwen Luo, Wei Cheng, Yuxin Wang, Siming Fan, Shengqi Liu, Lei Yang, Bo Dai, Ziwei Liu, Chen Change Loy, Chen Qian, Wayne Wu, Dahua Lin, and Kwan-Yee Lin. Renderme-360: Large digital asset library and benchmark towards high-fidelity head avatars. In Thirty-seventh Conference on Neural Information Processing Systems ...

  65. [65]

    Humansplat: Generalizable single-image human gaussian splatting with structure priors

    Panwang Pan, Zhuo Su, Chenguo Lin, Zhen Fan, Yongjie Zhang, Zeming Li, Tingting Shen, Yadong Mu, and Yebin Liu. Humansplat: Generalizable single-image human gaussian splatting with structure priors. In NeurIPS, 2024. 3

  66. [66]

    Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In CVPR, pages 10975–10985, 2019. 1

  67. [67]

    Reconstructing hands in 3D with transformers

    Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3D with transformers. In CVPR,

  68. [68]

    Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans

    Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In CVPR,

  69. [69]

    Dreamfusion: Text-to-3d using 2d diffusion

    Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In ICLR,

  70. [70]

    Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians

    Shenhan Qian, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Simon Giebenhain, and Matthias Nießner. Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians. In CVPR, 2023. 3

  71. [71]

    Lhm: Large animatable human reconstruction model from a single image in seconds

    Lingteng Qiu, Xiaodong Gu, Peihao Li, Qi Zuo, Weichao Shen, Junfei Zhang, Kejie Qiu, Weihao Yuan, Guanying Chen, Zilong Dong, and Liefeng Bo. Lhm: Large animatable human reconstruction model from a single image in seconds. In ICCV, 2025. 1, 3, 4, 5

  72. [72]

    Pf-lhm: 3d animatable avatar reconstruction from pose-free articulated human images

    Lingteng Qiu, Peihao Li, Qi Zuo, Xiaodong Gu, Yuan Dong, Weihao Yuan, Siyu Zhu, Xiaoguang Han, Guanying Chen, and Zilong Dong. Pf-lhm: 3d animatable avatar reconstruction from pose-free articulated human images,

  73. [73]

    Anigs: Animatable gaussian avatar from a single image with inconsistent gaussian reconstruction

    Lingteng Qiu, Shenhao Zhu, Qi Zuo, Xiaodong Gu, Yuan Dong, Junfei Zhang, Chao Xu, Zhe Li, Weihao Yuan, Liefeng Bo, et al. Anigs: Animatable gaussian avatar from a single image with inconsistent gaussian reconstruction. In CVPR, 2025. 3

  74. [74]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022. 3

  75. [75]

    Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization

    Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In ICCV, pages 2304–2313, 2019. 3

  76. [76]

    Relightable gaussian codec avatars

    Shunsuke Saito, Gabriel Schwartz, Tomas Simon, Junxuan Li, and Giljoo Nam. Relightable gaussian codec avatars. In CVPR, 2024. 3

  77. [77]

    Dreamgaussian: Generative gaussian splatting for efficient 3d content creation

    Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. In ICLR, 2024. 3

  78. [78]

    Qwen2.5 technical report

    Qwen Team. Qwen2.5 technical report, 2025. 5, 4

  79. [79]

    Qwen-image technical report

    Qwen-Image Team. Qwen-image technical report, 2025. 5, 4

  80. [80]

    Wan: Open and advanced large-scale video generative models

    Wan Team. Wan: Open and advanced large-scale video generative models, 2025. 5, 4

Showing first 80 references.