pith. sign in

arxiv: 2604.13171 · v1 · submitted 2026-04-14 · 💻 cs.CV

3DRealHead: Few-Shot Detailed Head Avatar

Pith reviewed 2026-05-10 15:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords few-shot head avatar3D Gaussian primitivesStyle U-Netmouth featuresfacial expression control3DMMNeRSemble dataset
0
0 comments X

The pith

Mouth features from driving video let few-shot 3D head avatars capture expressions beyond 3DMM limits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents 3DRealHead, a method to build detailed 3D head avatars from only a few photographs of a person. It inverts a learned prior in the form of a Style U-Net that produces 3D Gaussian primitives for rendering from new viewpoints. Animation conditions this network on standard 3D morphable model expression parameters together with additional features taken from the mouth region in a monocular video. These mouth signals are meant to recover person-specific movements that pure 3DMM control cannot represent, yielding avatars that match real facial dynamics more closely.

Core claim

3DRealHead reconstructs 3D head avatars via few-shot inversion of a Style U-Net prior that emits 3D Gaussian primitives. Animation conditions the U-Net on 3DMM expression signals augmented by mouth-region features extracted from the driving video, enabling recovery of facial expressions that cannot be represented by the 3DMM alone and producing closer resemblance to physical reality.

What carries the argument

Style U-Net that emits 3D Gaussian primitives, conditioned on 3DMM-based facial expression signals plus mouth region features extracted from monocular driving video.

If this is right

  • A small set of subject photographs suffices to create a renderable 3D head avatar.
  • The avatar can be driven by ordinary consumer webcam video.
  • Facial expressions gain expressivity through mouth-specific signals not captured by 3DMM.
  • Novel-view rendering remains possible through the emitted 3D Gaussian primitives.
  • The learned prior supports inversion for new identities without full retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mouth-augmented conditioning could be added to other 3DMM-based avatar pipelines to raise fidelity.
  • Region-specific video features might extend the approach to hands or full-body avatars.
  • Gaussian primitive output supports efficient rendering that could suit real-time virtual-reality use.

Load-bearing premise

The head prior learned on the NeRSemble dataset generalizes to unseen subjects for accurate few-shot inversion and the added mouth features integrate without artifacts or identity drift.

What would settle it

Observe whether an unseen subject avatar fails to reproduce a mouth expression visible in the driving video yet absent from 3DMM parameters, or whether mouth features produce visible identity mismatch or rendering artifacts.

Figures

Figures reproduced from arXiv: 2604.13171 by Jalees Nehvi, Justus Thies, Thabo Beeler, Timo Bolkart.

Figure 1
Figure 1. Figure 1: Given a few-shot input (1-3 images), we reconstruct a 3D head avatar that can be driven by a monocular input [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of 3DRealHead. Based on few-shot input data, an RGB texture is generated which serves as identity conditioning to the Style U-Net which predicts 3D Gaussian primitives in UV-space relative to a FLAME head mesh. To control detailed facial expressions from a driving input video, we extract 3DMM-based position maps [35] and image gradient features which are projected to the UV texture space. From thi… view at source ↗
Figure 3
Figure 3. Figure 3: We show qualitative results of our method along [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison of our approach (single target) against one-shot prior-based methods. We observe that our [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison to SynShot [61] and SOTA monocular approaches on In-the-Wild INSTA dataset [60]. It is [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Ablation experiments where we replace our gradient-based conditioning with different other types of conditioning, [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of different mouth feature conditionings. [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: Interpolation between two test identities (leftmost and [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗
Figure 11
Figure 11. Figure 11: Inversion Experiment for the No Fine-Tuning, 1-shot [PITH_FULL_IMAGE:figures/full_fig_p013_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Comparison against GAN-based baselines on In-the [PITH_FULL_IMAGE:figures/full_fig_p014_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Comparison against Avat3r [25] for different subjects [PITH_FULL_IMAGE:figures/full_fig_p014_13.png] view at source ↗
Figure 16
Figure 16. Figure 16: Qualitative results of using 3DMM-only based [PITH_FULL_IMAGE:figures/full_fig_p015_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Qualitative comparison of our method against Gaus [PITH_FULL_IMAGE:figures/full_fig_p016_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Plots of performance versus number of frames [PITH_FULL_IMAGE:figures/full_fig_p016_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Qualitative comparison of different training set sizes. [PITH_FULL_IMAGE:figures/full_fig_p016_19.png] view at source ↗
read the original abstract

The human face is central to communication. For immersive applications, the digital presence of a person should mirror the physical reality, capturing the users idiosyncrasies and detailed facial expressions. However, current 3D head avatar methods often struggle to faithfully reproduce the identity and facial expressions, despite having multi-view data or learned priors. Learning priors that capture the diversity of human appearances, especially, for regions with highly person-specific features, like the mouth and teeth region is challenging as the underlying training data is limited. In addition, many of the avatar methods are purely relying on 3D morphable model-based expression control which strongly limits expressivity. To address these challenges, we are introducing 3DRealHead, a few-shot head avatar reconstruction method with a novel expression control signal that is extracted from a monocular video stream of the subject. Specifically, the subject can take a few pictures of themselves, recover a 3D head avatar and drive it with a consumer-level webcam. The avatar reconstruction is enabled via a novel few-shot inversion process of a 3D human head prior which is represented as a Style U-Net that emits 3D Gaussian primitives which can be rendered under novel views. The prior is learned on the NeRSemble dataset. For animating the avatar, the U-Net is conditioned on 3DMM-based facial expression signals, as well as features of the mouth region extracted from the driving video. These additional mouth features allow us to recover facial expressions that cannot be represented by the 3DMM leading to a higher expressivity and closer resemblance to the physical reality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces 3DRealHead, a few-shot head avatar reconstruction method. A Style U-Net prior is learned on the NeRSemble dataset to emit 3D Gaussian primitives for novel-view rendering. Reconstruction proceeds via few-shot inversion from a small set of user images. Animation conditions the U-Net on 3DMM expression parameters together with mouth-region features extracted from monocular driving video, with the claim that the added mouth conditioning recovers expressions outside the 3DMM span and yields higher fidelity to physical reality.

Significance. If the central claims on generalization and expressivity are substantiated by quantitative evidence, the work would provide a practical route to consumer-level, high-fidelity head avatars that better capture idiosyncratic mouth and expression details than pure 3DMM pipelines. The 3D-Gaussian representation and learned prior could also support efficient rendering in immersive applications.

major comments (2)
  1. [Abstract] Abstract: the central claim that mouth-region features enable recovery of expressions outside the 3DMM span is presented without any quantitative metric, ablation isolating the mouth conditioning, or comparison against a 3DMM-only baseline; this evidence is required to support the expressivity improvement.
  2. [Method] Method section (few-shot inversion description): the assumption that the NeRSemble-trained Style U-Net prior inverts accurately for arbitrary unseen identities from a handful of images is load-bearing for the reconstruction pipeline, yet no cross-subject or cross-dataset reconstruction errors, identity metrics, or failure-case analysis are reported.
minor comments (1)
  1. [Method] The notation for the combined conditioning signal (3DMM parameters plus mouth features) inside the Style U-Net could be made more explicit to clarify how the two signals are fused without identity drift.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions we will incorporate to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that mouth-region features enable recovery of expressions outside the 3DMM span is presented without any quantitative metric, ablation isolating the mouth conditioning, or comparison against a 3DMM-only baseline; this evidence is required to support the expressivity improvement.

    Authors: We agree that the abstract's claim would benefit from explicit quantitative backing. The manuscript already includes qualitative demonstrations of improved mouth and expression fidelity via side-by-side renderings and driving-video comparisons. To directly address the concern, we will add a dedicated ablation subsection that isolates the mouth-feature conditioning, reports quantitative metrics (e.g., mouth-region landmark error and perceptual similarity scores), and compares against a 3DMM-only baseline on a held-out test set. These results will be summarized in the abstract and discussed in the experiments section. revision: yes

  2. Referee: [Method] Method section (few-shot inversion description): the assumption that the NeRSemble-trained Style U-Net prior inverts accurately for arbitrary unseen identities from a handful of images is load-bearing for the reconstruction pipeline, yet no cross-subject or cross-dataset reconstruction errors, identity metrics, or failure-case analysis are reported.

    Authors: The few-shot inversion is supported by visual results and user studies on diverse real-world subjects in the current experiments. We acknowledge that additional quantitative validation would increase confidence in generalization. We will therefore report cross-subject identity preservation metrics (e.g., cosine similarity of ArcFace embeddings) and reconstruction error statistics on held-out NeRSemble identities. A short failure-case analysis covering challenging conditions (extreme lighting, accessories) will also be added. Cross-dataset evaluation is limited by the availability of comparable multi-view head datasets; we will clarify the scope of generalization accordingly rather than claim broader transfer. revision: partial

Circularity Check

0 steps flagged

No circularity: method relies on learned priors and external conditioning without self-referential reduction

full rationale

The paper describes a Style U-Net prior trained on the external NeRSemble dataset, followed by few-shot inversion and conditioning on 3DMM parameters plus independently extracted mouth features from driving video. No equations, fitted parameters, or self-citations are presented that reduce the claimed expressivity gain to a tautological reparameterization or input fit. The derivation chain consists of standard neural rendering components and dataset-driven learning, remaining self-contained against external benchmarks without any load-bearing step that collapses by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claim rests on a learned prior from NeRSemble and the assumption that mouth features provide independent expressive signal; no explicit free parameters or invented physical entities are named in the abstract.

axioms (2)
  • domain assumption A Style U-Net trained on NeRSemble can be inverted from a few images to produce view-consistent 3D Gaussian primitives for novel subjects.
    Invoked when describing the few-shot inversion process.
  • domain assumption Mouth-region features extracted from driving video are complementary to 3DMM expression parameters and can be fused without introducing inconsistencies.
    Central to the claim of higher expressivity.
invented entities (1)
  • Style U-Net emitting 3D Gaussian primitives no independent evidence
    purpose: Represent the 3D head prior for few-shot inversion and novel-view rendering.
    Introduced as the core representation; no independent evidence outside the method is provided.

pith-pipeline@v0.9.0 · 5599 in / 1469 out tokens · 71777 ms · 2026-05-10T15:15:04.743676+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 1 internal anchor

  1. [1]

    S. An, H. Xu, Y . Shi, G. Song, U. Ogras, and L. Luo. Panohead: Geometry-aware 3d full-head synthesis in 360 ◦, 2023

  2. [2]

    Blanz and T

    V . Blanz and T. Vetter. A morphable model for the synthesis of 3d faces. InProceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’99, page 187–194, USA, 1999. ACM Press/Addison-Wesley Publishing Co

  3. [3]

    M. C. Buehler, G. Li, E. Wood, L. Helminger, X. Chen, T. Shah, D. Wang, S. Garbin, S. Orts-Escolano, O. Hilliges, D. Lagun, J. Riviere, P. Gotardo, T. Beeler, A. Meka, and K. Sarkar. Cafca: High-quality novel view synthesis of expressive faces from casual few-shot captures. InACM SIGGRAPH Asia 2024 Conference Paper. 2024

  4. [4]

    M. C. B ¨uhler, K. Sarkar, T. Shah, G. Li, D. Wang, L. Helminger, S. Orts-Escolano, D. Lagun, O. Hilliges, T. Beeler, et al. Preface: A data-driven volumetric prior for few-shot ultra high-resolution face synthesis. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 3402–3413, 2023

  5. [5]

    Caron, H

    M. Caron, H. Touvron, I. Misra, H. J ´egou, J. Mairal, P. Bo- janowski, and A. Joulin. Emerging properties in self- supervised vision transformers. InProceedings of the Inter- national Conference on Computer Vision (ICCV), 2021

  6. [6]

    E. Chan, C. Lin, M. Chan, K. Nagano, B. Pan, S. Mello, O. Gallo, L. Guibas, J. Tremblay, S. Khamis, T. Karras, and G. Wetzstein. Efficient geometry-aware 3d generative adversarial networks. pages 16102–16112, 06 2022

  7. [7]

    X. Chen, M. Mihajlovic, S. Wang, S. Prokudin, and S. Tang. Morphable diffusion: 3d-consistent diffusion for single-image avatar creation. 2024

  8. [8]

    Y . Chen, L. Wang, Q. Li, H. Xiao, S. Zhang, H. Yao, and Y . Liu. Monogaussianavatar: Monocular gaussian point-based head avatar.ACM SIGGRAPH 2024 Conference Papers, 2023

  9. [9]

    Chu and T

    X. Chu and T. Harada. Generalizable and animatable gaussian head avatar. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

  10. [10]

    X. Chu, Y . Li, A. Zeng, T. Yang, L. Lin, Y . Liu, and T. Harada. Gpavatar: Generalizable and precise head avatar from image(s), 2024

  11. [11]

    Dalal and B

    N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. InProceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - V olume 1 - V olume 01, CVPR ’05, page 886–893, USA, 2005. IEEE Computer Society

  12. [12]

    Y . Deng, D. Wang, X. Ren, X. Chen, and B. Wang. Portrait4d: Learning one-shot 4d head avatar synthesis using synthetic data, 2024

  13. [13]

    Egger, W

    B. Egger, W. A. P. Smith, A. Tewari, S. Wuhrer, M. Zollhoefer, T. Beeler, F. Bernard, T. Bolkart, A. Kortylewski, S. Romd- hani, C. Theobalt, V . Blanz, and T. Vetter. 3d morphable face models – past, present and future, 2020

  14. [14]

    Gafni, J

    G. Gafni, J. Thies, M. Zollh ¨ofer, and M. Nießner. Dynamic neural radiance fields for monocular 4d facial avatar recon- struction, 2020

  15. [15]

    I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio. Generative adversarial networks, 2014

  16. [16]

    Y . He, X. Gu, X. Ye, C. Xu, Z. Zhao, Y . Dong, W. Yuan, Z. Dong, and L. Bo. Lam: Large avatar model for one-shot animatable gaussian head, 2025

  17. [17]

    Jacobson and Y

    A. Jacobson and Y . Gingold. Skinning: real-time shape deformation. InSIGGRAPH Asia 2014 Courses, SA ’14, New York, NY , USA, 2014. Association for Computing Machinery. 9

  18. [18]

    Johnson, A

    J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution, 2016

  19. [19]

    Kabadayi, W

    B. Kabadayi, W. Zielonka, B. L. Bhatnagar, G. Pons-Moll, and J. Thies. Gan-avatar: Controllable personalized gan-based human head avatar. InInternational Conference on 3D Vision (3DV), March 2024

  20. [20]

    Kerbl, G

    B. Kerbl, G. Kopanas, T. Leimk ¨uhler, and G. Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), July 2023

  21. [21]

    D. P. Kingma and J. Ba. Adam: A method for stochastic optimization.CoRR, abs/1412.6980, 2014

  22. [22]

    Kirschstein, S

    T. Kirschstein, S. Giebenhain, and M. Nießner. Diffusiona- vatars: Deferred diffusion for high-fidelity 3d head avatars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5481–5492, 2024

  23. [23]

    Kirschstein, S

    T. Kirschstein, S. Giebenhain, J. Tang, M. Georgopoulos, and M. Nießner. GGHead: Fast and Generalizable 3D Gaussian Heads. InSIGGRAPH Asia 2024 Conference Papers, SA ’24, New York, NY , USA, 2024. Association for Computing Machinery

  24. [24]

    Kirschstein, S

    T. Kirschstein, S. Qian, S. Giebenhain, T. Walter, and M. Nießner. Nersemble: Multi-view radiance field reconstruc- tion of human heads.ACM Trans. Graph., 42(4), jul 2023

  25. [25]

    Kirschstein, J

    T. Kirschstein, J. Romero, A. Sevastopolsky, M. Nießner, and S. Saito. Avat3r: Large animatable gaussian reconstruction model for high-fidelity 3d head avatars. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12089–12100, October 2025

  26. [26]

    Y . Lan, F. Tan, D. Qiu, Q. Xu, K. Genova, Z. Huang, S. Fanello, R. Pandey, T. Funkhouser, C. C. Loy, and Y . Zhang. Gaussian3diff: 3d gaussian diffusion for 3d full head synthesis and editing. InECCV, 2024

  27. [27]

    T. Li, T. Bolkart, M. J. Black, H. Li, and J. Romero. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6):194:1–194:17, 2017

  28. [28]

    Lombardi, J

    S. Lombardi, J. Saragih, T. Simon, and Y . Sheikh. Deep appearance models for face rendering.ACM Trans. Graph., 37(4):68:1–68:13, July 2018

  29. [29]

    Lombardi, T

    S. Lombardi, T. Simon, G. Schwartz, M. Zollhoefer, Y . Sheikh, and J. Saragih. Mixture of volumetric primitives for efficient neural rendering.ACM Trans. Graph., 40(4), jul 2021

  30. [30]

    Martinez, E

    J. Martinez, E. Kim, J. Romero, T. Bagautdinov, S. Saito, S.-I. Yu, S. Anderson, M. Zollh¨ofer, T.-L. Wang, S. Bai, C. Li, S.-E. Wei, R. Joshi, W. Borsos, T. Simon, J. Saragih, P. Theodosis, A. Greene, A. Josyula, S. M. Maeta, A. I. Jewett, S. Venshtain, C. Heilman, Y .-T. Chen, S. Fu, M. E. A. Elshaer, T. Du, L. Wu, S.-C. Chen, K. Kang, M. Wu, Y . Emad, ...

  31. [31]

    Mildenhall, P

    B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. InECCV, 2020

  32. [32]

    M ¨uller, A

    T. M ¨uller, A. Evans, C. Schied, and A. Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph., 41(4):102:1–102:15, July 2022

  33. [33]

    S. Qian. Vhap: Versatile head alignment with adaptive appearance priors, sep 2024

  34. [34]

    S. Qian, T. Kirschstein, L. Schoneveld, D. Davoli, S. Gieben- hain, and M. Nießner. Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20299–20309, 2024

  35. [35]

    Retsinas, P

    G. Retsinas, P. P. Filntisis, R. Danecek, V . F. Abrevaya, A. Roussos, T. Bolkart, and P. Maragos. 3d facial expres- sions through analysis-by-neural-synthesis. InConference on Computer Vision and Pattern Recognition (CVPR), 2024

  36. [36]

    R ¨ossler, D

    A. R ¨ossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner. Faceforensics: A large-scale video dataset for forgery detection in human faces.arXiv, 2018

  37. [37]

    R ¨ossler, D

    A. R ¨ossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner. Faceforensics++: Learning to detect manipulated facial images. InICCV 2019, 2019

  38. [38]

    Saito, G

    S. Saito, G. Schwartz, T. Simon, J. Li, and G. Nam. Re- lightable gaussian codec avatars. InCVPR, 2024

  39. [39]

    doi:10.48550/arXiv.2301.09515 , urldate =

    A. Sauer, T. Karras, S. Laine, A. Geiger, and T. Aila. StyleGAN-T: Unlocking the power of GANs for fast large- scale text-to-image synthesis. volume abs/2301.09515, 2023

  40. [40]

    Z. Shao, Z. Wang, Z. Li, D. Wang, X. Lin, Y . Zhang, M. Fan, and Z. Wang. SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  41. [41]

    Sobel and G

    I. Sobel and G. M. Feldman. An isotropic 3×3 image gradient operator. 1990

  42. [42]

    J. Sun, X. Wang, L. Wang, X. Li, Y . Zhang, H. Zhang, and Y . Liu. Next3d: Generative neural texture rasterization for 3d-aware head avatars. InCVPR, 2023

  43. [43]

    Teotia, H

    K. Teotia, H. Kim, P. Garrido, M. Habermann, M. Elgharib, and C. Theobalt. Gaussianheads: End-to-end learning of driv- able gaussian head avatars from coarse-to-fine representations. ACM Transactions on Graphics (TOG), 43:1 – 12, 2024

  44. [44]

    Thies, M

    J. Thies, M. Zollh ¨ofer, and M. Nießner. Deferred neural rendering: Image synthesis using neural textures.ACM Trans- actions on Graphics 2019 (TOG), 2019

  45. [45]

    Thies, M

    J. Thies, M. Zollh ¨ofer, M. Stamminger, C. Theobalt, and M. Nießner. Face2Face: Real-time Face Capture and Reenact- ment of RGB Videos. InProc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2016

  46. [46]

    L. Wang, X. Zhao, J. Sun, Y . Zhang, H. Zhang, T. Yu, and Y . Liu. Styleavatar: Real-time photo-realistic portrait avatar from a single video. InACM SIGGRAPH 2023 Conference Proceedings, 2023

  47. [47]

    T. Wang, B. Zhang, T. Zhang, S. Gu, J. Bao, T. Baltrusaitis, J. Shen, D. Chen, F. Wen, Q. Chen, and B. Guo. Rodin: A generative model for sculpting 3d digital avatars using diffusion. pages 4563–4573, 06 2023

  48. [48]

    Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli. Image quality assessment: From error visibility to structural similar- 10 ity.Image Processing, IEEE Transactions on, 13:600 – 612, 05 2004

  49. [49]

    C.-h. Wuu, N. Zheng, S. Ardisson, R. Bali, D. Belko, E. Brockmeyer, L. Evans, T. Godisart, H. Ha, X. Huang, A. Hypes, T. Koska, S. Krenn, S. Lombardi, X. Luo, K. McPhail, L. Millerschoen, M. Perdoch, M. Pitts, A. Richard, J. Saragih, J. Saragih, T. Shiratori, T. Simon, M. Stewart, A. Trimble, X. Weng, D. Whitewolf, C. Wu, S.- I. Yu, and Y . Sheikh. Multif...

  50. [50]

    Xiang, X

    J. Xiang, X. Gao, Y . Guo, and J. Zhang. Flashavatar: High- fidelity head avatar with efficient gaussian embedding. InThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  51. [51]

    L. Xie, X. Wang, H. Zhang, C. Dong, and Y . Shan. Vfhq: A high-quality dataset and benchmark for video face super- resolution. InThe IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022

  52. [52]

    Y . Xu, B. Chen, Z. Li, H. Zhang, L. Wang, Z. Zheng, and Y . Liu. Gaussian head avatar: Ultra high-fidelity head avatar via dynamic gaussians. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  53. [53]

    Y . Xu, H. Zhang, L. Wang, X. Zhao, H. Huang, G. Qi, and Y . Liu. Latentavatar: Learning latent expression code for expressive neural head avatar. InACM SIGGRAPH 2023 Conference Proceedings, SIGGRAPH ’23, New York, NY , USA, 2023. Association for Computing Machinery

  54. [54]

    Z. Yu, Z. Bai, A. Meka, F. Tan, Q. Xu, R. Pandey, S. Fanello, H. S. Park, and Y . Zhang. One2avatar: Generative implicit head avatar for few-shot user adaptation, 2024

  55. [55]

    Zhang, Y

    B. Zhang, Y . Cheng, C. Wang, T. Zhang, J. Yang, Y . Tang, F. Zhao, D. Chen, and B. Guo. Rodinhd: High-fidelity 3d avatar generation with diffusion models. InEuropean Conference on Computer Vision, pages 465–483. Springer, 2025

  56. [56]

    X. Zhao, J. Sun, L. Wang, J. Suo, and Y . Liu. Invertavatar: In- cremental gan inversion for generalized head avatars. InACM SIGGRAPH 2024 Conference Papers, SIGGRAPH ’24, New York, NY , USA, 2024. Association for Computing Machinery

  57. [57]

    Zheng, C

    X. Zheng, C. Wen, Z. Li, W. Zhang, Z. Su, X. Chang, Y . Zhao, Z. Lv, X. Zhang, Y . Zhang, G. Wang, and L. Xu. Headgap: Few-shot 3d head avatar via generalizable gaussian priors, 2025

  58. [58]

    Zielonka, T

    W. Zielonka, T. Bagautdinov, S. Saito, M. Zollh ¨ofer, J. Thies, and J. Romero. Drivable 3d gaussian avatars. InI3DV, March 2025

  59. [59]

    Zielonka, T

    W. Zielonka, T. Bolkart, T. Beeler, and J. Thies. Gaussian eigen models for human heads. InCVPR, June 2025

  60. [60]

    Zielonka, T

    W. Zielonka, T. Bolkart, and J. Thies. Instant volumetric head avatars. InCVPR, pages 4574–4584, 2023

  61. [61]

    Zielonka, S

    W. Zielonka, S. J. Garbin, A. Lattas, G. Kopanas, P. Gotardo, T. Beeler, J. Thies, and T. Bolkart. Synthetic prior for few-shot drivable head avatar inversion. InCVPR, June 2025. 11 3DRealHead: Few-Shot Detailed Head Avatar Supplemental Document A. ARCHITECTUREDETAILS Our model employs a Style U-Net [46] with two iden- tical encoders and a common decoder....