3DRealHead: Few-Shot Detailed Head Avatar

Jalees Nehvi; Justus Thies; Thabo Beeler; Timo Bolkart

arxiv: 2604.13171 · v1 · submitted 2026-04-14 · 💻 cs.CV

3DRealHead: Few-Shot Detailed Head Avatar

Jalees Nehvi , Timo Bolkart , Thabo Beeler , Justus Thies This is my paper

Pith reviewed 2026-05-10 15:15 UTC · model grok-4.3

classification 💻 cs.CV

keywords few-shot head avatar3D Gaussian primitivesStyle U-Netmouth featuresfacial expression control3DMMNeRSemble dataset

0 comments

The pith

Mouth features from driving video let few-shot 3D head avatars capture expressions beyond 3DMM limits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents 3DRealHead, a method to build detailed 3D head avatars from only a few photographs of a person. It inverts a learned prior in the form of a Style U-Net that produces 3D Gaussian primitives for rendering from new viewpoints. Animation conditions this network on standard 3D morphable model expression parameters together with additional features taken from the mouth region in a monocular video. These mouth signals are meant to recover person-specific movements that pure 3DMM control cannot represent, yielding avatars that match real facial dynamics more closely.

Core claim

3DRealHead reconstructs 3D head avatars via few-shot inversion of a Style U-Net prior that emits 3D Gaussian primitives. Animation conditions the U-Net on 3DMM expression signals augmented by mouth-region features extracted from the driving video, enabling recovery of facial expressions that cannot be represented by the 3DMM alone and producing closer resemblance to physical reality.

What carries the argument

Style U-Net that emits 3D Gaussian primitives, conditioned on 3DMM-based facial expression signals plus mouth region features extracted from monocular driving video.

If this is right

A small set of subject photographs suffices to create a renderable 3D head avatar.
The avatar can be driven by ordinary consumer webcam video.
Facial expressions gain expressivity through mouth-specific signals not captured by 3DMM.
Novel-view rendering remains possible through the emitted 3D Gaussian primitives.
The learned prior supports inversion for new identities without full retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same mouth-augmented conditioning could be added to other 3DMM-based avatar pipelines to raise fidelity.
Region-specific video features might extend the approach to hands or full-body avatars.
Gaussian primitive output supports efficient rendering that could suit real-time virtual-reality use.

Load-bearing premise

The head prior learned on the NeRSemble dataset generalizes to unseen subjects for accurate few-shot inversion and the added mouth features integrate without artifacts or identity drift.

What would settle it

Observe whether an unseen subject avatar fails to reproduce a mouth expression visible in the driving video yet absent from 3DMM parameters, or whether mouth features produce visible identity mismatch or rendering artifacts.

Figures

Figures reproduced from arXiv: 2604.13171 by Jalees Nehvi, Justus Thies, Thabo Beeler, Timo Bolkart.

**Figure 2.** Figure 2: Overview of 3DRealHead. Based on few-shot input data, an RGB texture is generated which serves as identity conditioning to the Style U-Net which predicts 3D Gaussian primitives in UV-space relative to a FLAME head mesh. To control detailed facial expressions from a driving input video, we extract 3DMM-based position maps [35] and image gradient features which are projected to the UV texture space. From thi… view at source ↗

**Figure 3.** Figure 3: We show qualitative results of our method along [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative comparison of our approach (single target) against one-shot prior-based methods. We observe that our [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative comparison to SynShot [61] and SOTA monocular approaches on In-the-Wild INSTA dataset [60]. It is [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Ablation experiments where we replace our gradient-based conditioning with different other types of conditioning, [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Comparison of different mouth feature conditionings. [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 9.** Figure 9: Interpolation between two test identities (leftmost and [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗

**Figure 11.** Figure 11: Inversion Experiment for the No Fine-Tuning, 1-shot [PITH_FULL_IMAGE:figures/full_fig_p013_11.png] view at source ↗

**Figure 12.** Figure 12: Comparison against GAN-based baselines on In-the [PITH_FULL_IMAGE:figures/full_fig_p014_12.png] view at source ↗

**Figure 13.** Figure 13: Comparison against Avat3r [25] for different subjects [PITH_FULL_IMAGE:figures/full_fig_p014_13.png] view at source ↗

**Figure 16.** Figure 16: Qualitative results of using 3DMM-only based [PITH_FULL_IMAGE:figures/full_fig_p015_16.png] view at source ↗

**Figure 17.** Figure 17: Qualitative comparison of our method against Gaus [PITH_FULL_IMAGE:figures/full_fig_p016_17.png] view at source ↗

**Figure 18.** Figure 18: Plots of performance versus number of frames [PITH_FULL_IMAGE:figures/full_fig_p016_18.png] view at source ↗

**Figure 19.** Figure 19: Qualitative comparison of different training set sizes. [PITH_FULL_IMAGE:figures/full_fig_p016_19.png] view at source ↗

read the original abstract

The human face is central to communication. For immersive applications, the digital presence of a person should mirror the physical reality, capturing the users idiosyncrasies and detailed facial expressions. However, current 3D head avatar methods often struggle to faithfully reproduce the identity and facial expressions, despite having multi-view data or learned priors. Learning priors that capture the diversity of human appearances, especially, for regions with highly person-specific features, like the mouth and teeth region is challenging as the underlying training data is limited. In addition, many of the avatar methods are purely relying on 3D morphable model-based expression control which strongly limits expressivity. To address these challenges, we are introducing 3DRealHead, a few-shot head avatar reconstruction method with a novel expression control signal that is extracted from a monocular video stream of the subject. Specifically, the subject can take a few pictures of themselves, recover a 3D head avatar and drive it with a consumer-level webcam. The avatar reconstruction is enabled via a novel few-shot inversion process of a 3D human head prior which is represented as a Style U-Net that emits 3D Gaussian primitives which can be rendered under novel views. The prior is learned on the NeRSemble dataset. For animating the avatar, the U-Net is conditioned on 3DMM-based facial expression signals, as well as features of the mouth region extracted from the driving video. These additional mouth features allow us to recover facial expressions that cannot be represented by the 3DMM leading to a higher expressivity and closer resemblance to the physical reality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds mouth-region features to a Style U-Net Gaussian prior for few-shot head avatars, but the expressivity gains rest on unshown generalization and missing ablations.

read the letter

The paper describes a pipeline for building a detailed 3D head avatar from a few photos and animating it from monocular video. A Style U-Net trained as a prior on NeRSemble emits 3D Gaussian primitives, and the U-Net gets conditioned on both 3DMM parameters and extra features extracted from the mouth area in the driving video. The goal is to recover expressions that standard 3DMMs miss, especially around teeth and lips. This specific conditioning step is the concrete addition over prior inversion-plus-Gaussian work. The write-up lays out the motivation clearly and gives a workable description of how the few-shot inversion and animation would run in practice. That part is useful for anyone already experimenting with neural head models. The soft spots sit in the validation. No ablations appear that isolate what the mouth features contribute versus plain 3DMM conditioning, and there are no quantitative metrics on expression fidelity or identity preservation. Generalization of the NeRSemble prior to unseen subjects in the few-shot regime is asserted but not demonstrated with cross-dataset tests or reported failure cases. If the inversion drifts or the added features introduce artifacts, the claimed higher expressivity does not materialize, yet those risks are not quantified. The paper targets researchers already working on neural avatars, Gaussian splatting, or few-shot 3D reconstruction. Someone looking for a concrete recipe to combine video mouth cues with a learned head prior could extract implementation ideas, but the lack of results keeps it from being immediately actionable. It deserves a serious referee because the technical framing is coherent and the problem it targets is real, even though the current evidence is thin. Send it to review with a clear request for ablations, cross-subject metrics, and failure analysis.

Referee Report

2 major / 1 minor

Summary. The paper introduces 3DRealHead, a few-shot head avatar reconstruction method. A Style U-Net prior is learned on the NeRSemble dataset to emit 3D Gaussian primitives for novel-view rendering. Reconstruction proceeds via few-shot inversion from a small set of user images. Animation conditions the U-Net on 3DMM expression parameters together with mouth-region features extracted from monocular driving video, with the claim that the added mouth conditioning recovers expressions outside the 3DMM span and yields higher fidelity to physical reality.

Significance. If the central claims on generalization and expressivity are substantiated by quantitative evidence, the work would provide a practical route to consumer-level, high-fidelity head avatars that better capture idiosyncratic mouth and expression details than pure 3DMM pipelines. The 3D-Gaussian representation and learned prior could also support efficient rendering in immersive applications.

major comments (2)

[Abstract] Abstract: the central claim that mouth-region features enable recovery of expressions outside the 3DMM span is presented without any quantitative metric, ablation isolating the mouth conditioning, or comparison against a 3DMM-only baseline; this evidence is required to support the expressivity improvement.
[Method] Method section (few-shot inversion description): the assumption that the NeRSemble-trained Style U-Net prior inverts accurately for arbitrary unseen identities from a handful of images is load-bearing for the reconstruction pipeline, yet no cross-subject or cross-dataset reconstruction errors, identity metrics, or failure-case analysis are reported.

minor comments (1)

[Method] The notation for the combined conditioning signal (3DMM parameters plus mouth features) inside the Style U-Net could be made more explicit to clarify how the two signals are fused without identity drift.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions we will incorporate to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that mouth-region features enable recovery of expressions outside the 3DMM span is presented without any quantitative metric, ablation isolating the mouth conditioning, or comparison against a 3DMM-only baseline; this evidence is required to support the expressivity improvement.

Authors: We agree that the abstract's claim would benefit from explicit quantitative backing. The manuscript already includes qualitative demonstrations of improved mouth and expression fidelity via side-by-side renderings and driving-video comparisons. To directly address the concern, we will add a dedicated ablation subsection that isolates the mouth-feature conditioning, reports quantitative metrics (e.g., mouth-region landmark error and perceptual similarity scores), and compares against a 3DMM-only baseline on a held-out test set. These results will be summarized in the abstract and discussed in the experiments section. revision: yes
Referee: [Method] Method section (few-shot inversion description): the assumption that the NeRSemble-trained Style U-Net prior inverts accurately for arbitrary unseen identities from a handful of images is load-bearing for the reconstruction pipeline, yet no cross-subject or cross-dataset reconstruction errors, identity metrics, or failure-case analysis are reported.

Authors: The few-shot inversion is supported by visual results and user studies on diverse real-world subjects in the current experiments. We acknowledge that additional quantitative validation would increase confidence in generalization. We will therefore report cross-subject identity preservation metrics (e.g., cosine similarity of ArcFace embeddings) and reconstruction error statistics on held-out NeRSemble identities. A short failure-case analysis covering challenging conditions (extreme lighting, accessories) will also be added. Cross-dataset evaluation is limited by the availability of comparable multi-view head datasets; we will clarify the scope of generalization accordingly rather than claim broader transfer. revision: partial

Circularity Check

0 steps flagged

No circularity: method relies on learned priors and external conditioning without self-referential reduction

full rationale

The paper describes a Style U-Net prior trained on the external NeRSemble dataset, followed by few-shot inversion and conditioning on 3DMM parameters plus independently extracted mouth features from driving video. No equations, fitted parameters, or self-citations are presented that reduce the claimed expressivity gain to a tautological reparameterization or input fit. The derivation chain consists of standard neural rendering components and dataset-driven learning, remaining self-contained against external benchmarks without any load-bearing step that collapses by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claim rests on a learned prior from NeRSemble and the assumption that mouth features provide independent expressive signal; no explicit free parameters or invented physical entities are named in the abstract.

axioms (2)

domain assumption A Style U-Net trained on NeRSemble can be inverted from a few images to produce view-consistent 3D Gaussian primitives for novel subjects.
Invoked when describing the few-shot inversion process.
domain assumption Mouth-region features extracted from driving video are complementary to 3DMM expression parameters and can be fused without introducing inconsistencies.
Central to the claim of higher expressivity.

invented entities (1)

Style U-Net emitting 3D Gaussian primitives no independent evidence
purpose: Represent the 3D head prior for few-shot inversion and novel-view rendering.
Introduced as the core representation; no independent evidence outside the method is provided.

pith-pipeline@v0.9.0 · 5599 in / 1469 out tokens · 71777 ms · 2026-05-10T15:15:04.743676+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 1 internal anchor

[1]

S. An, H. Xu, Y . Shi, G. Song, U. Ogras, and L. Luo. Panohead: Geometry-aware 3d full-head synthesis in 360 ◦, 2023

work page 2023
[2]

Blanz and T

V . Blanz and T. Vetter. A morphable model for the synthesis of 3d faces. InProceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’99, page 187–194, USA, 1999. ACM Press/Addison-Wesley Publishing Co

work page 1999
[3]

M. C. Buehler, G. Li, E. Wood, L. Helminger, X. Chen, T. Shah, D. Wang, S. Garbin, S. Orts-Escolano, O. Hilliges, D. Lagun, J. Riviere, P. Gotardo, T. Beeler, A. Meka, and K. Sarkar. Cafca: High-quality novel view synthesis of expressive faces from casual few-shot captures. InACM SIGGRAPH Asia 2024 Conference Paper. 2024

work page 2024
[4]

M. C. B ¨uhler, K. Sarkar, T. Shah, G. Li, D. Wang, L. Helminger, S. Orts-Escolano, D. Lagun, O. Hilliges, T. Beeler, et al. Preface: A data-driven volumetric prior for few-shot ultra high-resolution face synthesis. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 3402–3413, 2023

work page 2023
[5]

Caron, H

M. Caron, H. Touvron, I. Misra, H. J ´egou, J. Mairal, P. Bo- janowski, and A. Joulin. Emerging properties in self- supervised vision transformers. InProceedings of the Inter- national Conference on Computer Vision (ICCV), 2021

work page 2021
[6]

E. Chan, C. Lin, M. Chan, K. Nagano, B. Pan, S. Mello, O. Gallo, L. Guibas, J. Tremblay, S. Khamis, T. Karras, and G. Wetzstein. Efficient geometry-aware 3d generative adversarial networks. pages 16102–16112, 06 2022

work page 2022
[7]

X. Chen, M. Mihajlovic, S. Wang, S. Prokudin, and S. Tang. Morphable diffusion: 3d-consistent diffusion for single-image avatar creation. 2024

work page 2024
[8]

Y . Chen, L. Wang, Q. Li, H. Xiao, S. Zhang, H. Yao, and Y . Liu. Monogaussianavatar: Monocular gaussian point-based head avatar.ACM SIGGRAPH 2024 Conference Papers, 2023

work page 2024
[9]

Chu and T

X. Chu and T. Harada. Generalizable and animatable gaussian head avatar. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[10]

X. Chu, Y . Li, A. Zeng, T. Yang, L. Lin, Y . Liu, and T. Harada. Gpavatar: Generalizable and precise head avatar from image(s), 2024

work page 2024
[11]

Dalal and B

N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. InProceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - V olume 1 - V olume 01, CVPR ’05, page 886–893, USA, 2005. IEEE Computer Society

work page 2005
[12]

Y . Deng, D. Wang, X. Ren, X. Chen, and B. Wang. Portrait4d: Learning one-shot 4d head avatar synthesis using synthetic data, 2024

work page 2024
[13]

Egger, W

B. Egger, W. A. P. Smith, A. Tewari, S. Wuhrer, M. Zollhoefer, T. Beeler, F. Bernard, T. Bolkart, A. Kortylewski, S. Romd- hani, C. Theobalt, V . Blanz, and T. Vetter. 3d morphable face models – past, present and future, 2020

work page 2020
[14]

Gafni, J

G. Gafni, J. Thies, M. Zollh ¨ofer, and M. Nießner. Dynamic neural radiance fields for monocular 4d facial avatar recon- struction, 2020

work page 2020
[15]

I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio. Generative adversarial networks, 2014

work page 2014
[16]

Y . He, X. Gu, X. Ye, C. Xu, Z. Zhao, Y . Dong, W. Yuan, Z. Dong, and L. Bo. Lam: Large avatar model for one-shot animatable gaussian head, 2025

work page 2025
[17]

Jacobson and Y

A. Jacobson and Y . Gingold. Skinning: real-time shape deformation. InSIGGRAPH Asia 2014 Courses, SA ’14, New York, NY , USA, 2014. Association for Computing Machinery. 9

work page 2014
[18]

Johnson, A

J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution, 2016

work page 2016
[19]

Kabadayi, W

B. Kabadayi, W. Zielonka, B. L. Bhatnagar, G. Pons-Moll, and J. Thies. Gan-avatar: Controllable personalized gan-based human head avatar. InInternational Conference on 3D Vision (3DV), March 2024

work page 2024
[20]

Kerbl, G

B. Kerbl, G. Kopanas, T. Leimk ¨uhler, and G. Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), July 2023

work page 2023
[21]

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization.CoRR, abs/1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[22]

Kirschstein, S

T. Kirschstein, S. Giebenhain, and M. Nießner. Diffusiona- vatars: Deferred diffusion for high-fidelity 3d head avatars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5481–5492, 2024

work page 2024
[23]

Kirschstein, S

T. Kirschstein, S. Giebenhain, J. Tang, M. Georgopoulos, and M. Nießner. GGHead: Fast and Generalizable 3D Gaussian Heads. InSIGGRAPH Asia 2024 Conference Papers, SA ’24, New York, NY , USA, 2024. Association for Computing Machinery

work page 2024
[24]

Kirschstein, S

T. Kirschstein, S. Qian, S. Giebenhain, T. Walter, and M. Nießner. Nersemble: Multi-view radiance field reconstruc- tion of human heads.ACM Trans. Graph., 42(4), jul 2023

work page 2023
[25]

Kirschstein, J

T. Kirschstein, J. Romero, A. Sevastopolsky, M. Nießner, and S. Saito. Avat3r: Large animatable gaussian reconstruction model for high-fidelity 3d head avatars. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12089–12100, October 2025

work page 2025
[26]

Y . Lan, F. Tan, D. Qiu, Q. Xu, K. Genova, Z. Huang, S. Fanello, R. Pandey, T. Funkhouser, C. C. Loy, and Y . Zhang. Gaussian3diff: 3d gaussian diffusion for 3d full head synthesis and editing. InECCV, 2024

work page 2024
[27]

T. Li, T. Bolkart, M. J. Black, H. Li, and J. Romero. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6):194:1–194:17, 2017

work page 2017
[28]

Lombardi, J

S. Lombardi, J. Saragih, T. Simon, and Y . Sheikh. Deep appearance models for face rendering.ACM Trans. Graph., 37(4):68:1–68:13, July 2018

work page 2018
[29]

Lombardi, T

S. Lombardi, T. Simon, G. Schwartz, M. Zollhoefer, Y . Sheikh, and J. Saragih. Mixture of volumetric primitives for efficient neural rendering.ACM Trans. Graph., 40(4), jul 2021

work page 2021
[30]

Martinez, E

J. Martinez, E. Kim, J. Romero, T. Bagautdinov, S. Saito, S.-I. Yu, S. Anderson, M. Zollh¨ofer, T.-L. Wang, S. Bai, C. Li, S.-E. Wei, R. Joshi, W. Borsos, T. Simon, J. Saragih, P. Theodosis, A. Greene, A. Josyula, S. M. Maeta, A. I. Jewett, S. Venshtain, C. Heilman, Y .-T. Chen, S. Fu, M. E. A. Elshaer, T. Du, L. Wu, S.-C. Chen, K. Kang, M. Wu, Y . Emad, ...

work page 2024
[31]

Mildenhall, P

B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. InECCV, 2020

work page 2020
[32]

M ¨uller, A

T. M ¨uller, A. Evans, C. Schied, and A. Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph., 41(4):102:1–102:15, July 2022

work page 2022
[33]

S. Qian. Vhap: Versatile head alignment with adaptive appearance priors, sep 2024

work page 2024
[34]

S. Qian, T. Kirschstein, L. Schoneveld, D. Davoli, S. Gieben- hain, and M. Nießner. Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20299–20309, 2024

work page 2024
[35]

Retsinas, P

G. Retsinas, P. P. Filntisis, R. Danecek, V . F. Abrevaya, A. Roussos, T. Bolkart, and P. Maragos. 3d facial expres- sions through analysis-by-neural-synthesis. InConference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[36]

R ¨ossler, D

A. R ¨ossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner. Faceforensics: A large-scale video dataset for forgery detection in human faces.arXiv, 2018

work page 2018
[37]

R ¨ossler, D

A. R ¨ossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner. Faceforensics++: Learning to detect manipulated facial images. InICCV 2019, 2019

work page 2019
[38]

Saito, G

S. Saito, G. Schwartz, T. Simon, J. Li, and G. Nam. Re- lightable gaussian codec avatars. InCVPR, 2024

work page 2024
[39]

doi:10.48550/arXiv.2301.09515 , urldate =

A. Sauer, T. Karras, S. Laine, A. Geiger, and T. Aila. StyleGAN-T: Unlocking the power of GANs for fast large- scale text-to-image synthesis. volume abs/2301.09515, 2023

work page arXiv 2023
[40]

Z. Shao, Z. Wang, Z. Li, D. Wang, X. Lin, Y . Zhang, M. Fan, and Z. Wang. SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[41]

Sobel and G

I. Sobel and G. M. Feldman. An isotropic 3×3 image gradient operator. 1990

work page 1990
[42]

J. Sun, X. Wang, L. Wang, X. Li, Y . Zhang, H. Zhang, and Y . Liu. Next3d: Generative neural texture rasterization for 3d-aware head avatars. InCVPR, 2023

work page 2023
[43]

Teotia, H

K. Teotia, H. Kim, P. Garrido, M. Habermann, M. Elgharib, and C. Theobalt. Gaussianheads: End-to-end learning of driv- able gaussian head avatars from coarse-to-fine representations. ACM Transactions on Graphics (TOG), 43:1 – 12, 2024

work page 2024
[44]

Thies, M

J. Thies, M. Zollh ¨ofer, and M. Nießner. Deferred neural rendering: Image synthesis using neural textures.ACM Trans- actions on Graphics 2019 (TOG), 2019

work page 2019
[45]

Thies, M

J. Thies, M. Zollh ¨ofer, M. Stamminger, C. Theobalt, and M. Nießner. Face2Face: Real-time Face Capture and Reenact- ment of RGB Videos. InProc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2016

work page 2016
[46]

L. Wang, X. Zhao, J. Sun, Y . Zhang, H. Zhang, T. Yu, and Y . Liu. Styleavatar: Real-time photo-realistic portrait avatar from a single video. InACM SIGGRAPH 2023 Conference Proceedings, 2023

work page 2023
[47]

T. Wang, B. Zhang, T. Zhang, S. Gu, J. Bao, T. Baltrusaitis, J. Shen, D. Chen, F. Wen, Q. Chen, and B. Guo. Rodin: A generative model for sculpting 3d digital avatars using diffusion. pages 4563–4573, 06 2023

work page 2023
[48]

Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli. Image quality assessment: From error visibility to structural similar- 10 ity.Image Processing, IEEE Transactions on, 13:600 – 612, 05 2004

work page 2004
[49]

C.-h. Wuu, N. Zheng, S. Ardisson, R. Bali, D. Belko, E. Brockmeyer, L. Evans, T. Godisart, H. Ha, X. Huang, A. Hypes, T. Koska, S. Krenn, S. Lombardi, X. Luo, K. McPhail, L. Millerschoen, M. Perdoch, M. Pitts, A. Richard, J. Saragih, J. Saragih, T. Shiratori, T. Simon, M. Stewart, A. Trimble, X. Weng, D. Whitewolf, C. Wu, S.- I. Yu, and Y . Sheikh. Multif...

work page 2022
[50]

Xiang, X

J. Xiang, X. Gao, Y . Guo, and J. Zhang. Flashavatar: High- fidelity head avatar with efficient gaussian embedding. InThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[51]

L. Xie, X. Wang, H. Zhang, C. Dong, and Y . Shan. Vfhq: A high-quality dataset and benchmark for video face super- resolution. InThe IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022

work page 2022
[52]

Y . Xu, B. Chen, Z. Li, H. Zhang, L. Wang, Z. Zheng, and Y . Liu. Gaussian head avatar: Ultra high-fidelity head avatar via dynamic gaussians. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[53]

Y . Xu, H. Zhang, L. Wang, X. Zhao, H. Huang, G. Qi, and Y . Liu. Latentavatar: Learning latent expression code for expressive neural head avatar. InACM SIGGRAPH 2023 Conference Proceedings, SIGGRAPH ’23, New York, NY , USA, 2023. Association for Computing Machinery

work page 2023
[54]

Z. Yu, Z. Bai, A. Meka, F. Tan, Q. Xu, R. Pandey, S. Fanello, H. S. Park, and Y . Zhang. One2avatar: Generative implicit head avatar for few-shot user adaptation, 2024

work page 2024
[55]

Zhang, Y

B. Zhang, Y . Cheng, C. Wang, T. Zhang, J. Yang, Y . Tang, F. Zhao, D. Chen, and B. Guo. Rodinhd: High-fidelity 3d avatar generation with diffusion models. InEuropean Conference on Computer Vision, pages 465–483. Springer, 2025

work page 2025
[56]

X. Zhao, J. Sun, L. Wang, J. Suo, and Y . Liu. Invertavatar: In- cremental gan inversion for generalized head avatars. InACM SIGGRAPH 2024 Conference Papers, SIGGRAPH ’24, New York, NY , USA, 2024. Association for Computing Machinery

work page 2024
[57]

Zheng, C

X. Zheng, C. Wen, Z. Li, W. Zhang, Z. Su, X. Chang, Y . Zhao, Z. Lv, X. Zhang, Y . Zhang, G. Wang, and L. Xu. Headgap: Few-shot 3d head avatar via generalizable gaussian priors, 2025

work page 2025
[58]

Zielonka, T

W. Zielonka, T. Bagautdinov, S. Saito, M. Zollh ¨ofer, J. Thies, and J. Romero. Drivable 3d gaussian avatars. InI3DV, March 2025

work page 2025
[59]

Zielonka, T

W. Zielonka, T. Bolkart, T. Beeler, and J. Thies. Gaussian eigen models for human heads. InCVPR, June 2025

work page 2025
[60]

Zielonka, T

W. Zielonka, T. Bolkart, and J. Thies. Instant volumetric head avatars. InCVPR, pages 4574–4584, 2023

work page 2023
[61]

Zielonka, S

W. Zielonka, S. J. Garbin, A. Lattas, G. Kopanas, P. Gotardo, T. Beeler, J. Thies, and T. Bolkart. Synthetic prior for few-shot drivable head avatar inversion. InCVPR, June 2025. 11 3DRealHead: Few-Shot Detailed Head Avatar Supplemental Document A. ARCHITECTUREDETAILS Our model employs a Style U-Net [46] with two iden- tical encoders and a common decoder....

work page 2025

[1] [1]

S. An, H. Xu, Y . Shi, G. Song, U. Ogras, and L. Luo. Panohead: Geometry-aware 3d full-head synthesis in 360 ◦, 2023

work page 2023

[2] [2]

Blanz and T

V . Blanz and T. Vetter. A morphable model for the synthesis of 3d faces. InProceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’99, page 187–194, USA, 1999. ACM Press/Addison-Wesley Publishing Co

work page 1999

[3] [3]

M. C. Buehler, G. Li, E. Wood, L. Helminger, X. Chen, T. Shah, D. Wang, S. Garbin, S. Orts-Escolano, O. Hilliges, D. Lagun, J. Riviere, P. Gotardo, T. Beeler, A. Meka, and K. Sarkar. Cafca: High-quality novel view synthesis of expressive faces from casual few-shot captures. InACM SIGGRAPH Asia 2024 Conference Paper. 2024

work page 2024

[4] [4]

M. C. B ¨uhler, K. Sarkar, T. Shah, G. Li, D. Wang, L. Helminger, S. Orts-Escolano, D. Lagun, O. Hilliges, T. Beeler, et al. Preface: A data-driven volumetric prior for few-shot ultra high-resolution face synthesis. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 3402–3413, 2023

work page 2023

[5] [5]

Caron, H

M. Caron, H. Touvron, I. Misra, H. J ´egou, J. Mairal, P. Bo- janowski, and A. Joulin. Emerging properties in self- supervised vision transformers. InProceedings of the Inter- national Conference on Computer Vision (ICCV), 2021

work page 2021

[6] [6]

E. Chan, C. Lin, M. Chan, K. Nagano, B. Pan, S. Mello, O. Gallo, L. Guibas, J. Tremblay, S. Khamis, T. Karras, and G. Wetzstein. Efficient geometry-aware 3d generative adversarial networks. pages 16102–16112, 06 2022

work page 2022

[7] [7]

X. Chen, M. Mihajlovic, S. Wang, S. Prokudin, and S. Tang. Morphable diffusion: 3d-consistent diffusion for single-image avatar creation. 2024

work page 2024

[8] [8]

Y . Chen, L. Wang, Q. Li, H. Xiao, S. Zhang, H. Yao, and Y . Liu. Monogaussianavatar: Monocular gaussian point-based head avatar.ACM SIGGRAPH 2024 Conference Papers, 2023

work page 2024

[9] [9]

Chu and T

X. Chu and T. Harada. Generalizable and animatable gaussian head avatar. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024

[10] [10]

X. Chu, Y . Li, A. Zeng, T. Yang, L. Lin, Y . Liu, and T. Harada. Gpavatar: Generalizable and precise head avatar from image(s), 2024

work page 2024

[11] [11]

Dalal and B

N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. InProceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - V olume 1 - V olume 01, CVPR ’05, page 886–893, USA, 2005. IEEE Computer Society

work page 2005

[12] [12]

Y . Deng, D. Wang, X. Ren, X. Chen, and B. Wang. Portrait4d: Learning one-shot 4d head avatar synthesis using synthetic data, 2024

work page 2024

[13] [13]

Egger, W

B. Egger, W. A. P. Smith, A. Tewari, S. Wuhrer, M. Zollhoefer, T. Beeler, F. Bernard, T. Bolkart, A. Kortylewski, S. Romd- hani, C. Theobalt, V . Blanz, and T. Vetter. 3d morphable face models – past, present and future, 2020

work page 2020

[14] [14]

Gafni, J

G. Gafni, J. Thies, M. Zollh ¨ofer, and M. Nießner. Dynamic neural radiance fields for monocular 4d facial avatar recon- struction, 2020

work page 2020

[15] [15]

I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio. Generative adversarial networks, 2014

work page 2014

[16] [16]

Y . He, X. Gu, X. Ye, C. Xu, Z. Zhao, Y . Dong, W. Yuan, Z. Dong, and L. Bo. Lam: Large avatar model for one-shot animatable gaussian head, 2025

work page 2025

[17] [17]

Jacobson and Y

A. Jacobson and Y . Gingold. Skinning: real-time shape deformation. InSIGGRAPH Asia 2014 Courses, SA ’14, New York, NY , USA, 2014. Association for Computing Machinery. 9

work page 2014

[18] [18]

Johnson, A

J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution, 2016

work page 2016

[19] [19]

Kabadayi, W

B. Kabadayi, W. Zielonka, B. L. Bhatnagar, G. Pons-Moll, and J. Thies. Gan-avatar: Controllable personalized gan-based human head avatar. InInternational Conference on 3D Vision (3DV), March 2024

work page 2024

[20] [20]

Kerbl, G

B. Kerbl, G. Kopanas, T. Leimk ¨uhler, and G. Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), July 2023

work page 2023

[21] [21]

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization.CoRR, abs/1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[22] [22]

Kirschstein, S

T. Kirschstein, S. Giebenhain, and M. Nießner. Diffusiona- vatars: Deferred diffusion for high-fidelity 3d head avatars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5481–5492, 2024

work page 2024

[23] [23]

Kirschstein, S

T. Kirschstein, S. Giebenhain, J. Tang, M. Georgopoulos, and M. Nießner. GGHead: Fast and Generalizable 3D Gaussian Heads. InSIGGRAPH Asia 2024 Conference Papers, SA ’24, New York, NY , USA, 2024. Association for Computing Machinery

work page 2024

[24] [24]

Kirschstein, S

T. Kirschstein, S. Qian, S. Giebenhain, T. Walter, and M. Nießner. Nersemble: Multi-view radiance field reconstruc- tion of human heads.ACM Trans. Graph., 42(4), jul 2023

work page 2023

[25] [25]

Kirschstein, J

T. Kirschstein, J. Romero, A. Sevastopolsky, M. Nießner, and S. Saito. Avat3r: Large animatable gaussian reconstruction model for high-fidelity 3d head avatars. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12089–12100, October 2025

work page 2025

[26] [26]

Y . Lan, F. Tan, D. Qiu, Q. Xu, K. Genova, Z. Huang, S. Fanello, R. Pandey, T. Funkhouser, C. C. Loy, and Y . Zhang. Gaussian3diff: 3d gaussian diffusion for 3d full head synthesis and editing. InECCV, 2024

work page 2024

[27] [27]

T. Li, T. Bolkart, M. J. Black, H. Li, and J. Romero. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6):194:1–194:17, 2017

work page 2017

[28] [28]

Lombardi, J

S. Lombardi, J. Saragih, T. Simon, and Y . Sheikh. Deep appearance models for face rendering.ACM Trans. Graph., 37(4):68:1–68:13, July 2018

work page 2018

[29] [29]

Lombardi, T

S. Lombardi, T. Simon, G. Schwartz, M. Zollhoefer, Y . Sheikh, and J. Saragih. Mixture of volumetric primitives for efficient neural rendering.ACM Trans. Graph., 40(4), jul 2021

work page 2021

[30] [30]

Martinez, E

J. Martinez, E. Kim, J. Romero, T. Bagautdinov, S. Saito, S.-I. Yu, S. Anderson, M. Zollh¨ofer, T.-L. Wang, S. Bai, C. Li, S.-E. Wei, R. Joshi, W. Borsos, T. Simon, J. Saragih, P. Theodosis, A. Greene, A. Josyula, S. M. Maeta, A. I. Jewett, S. Venshtain, C. Heilman, Y .-T. Chen, S. Fu, M. E. A. Elshaer, T. Du, L. Wu, S.-C. Chen, K. Kang, M. Wu, Y . Emad, ...

work page 2024

[31] [31]

Mildenhall, P

B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. InECCV, 2020

work page 2020

[32] [32]

M ¨uller, A

T. M ¨uller, A. Evans, C. Schied, and A. Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph., 41(4):102:1–102:15, July 2022

work page 2022

[33] [33]

S. Qian. Vhap: Versatile head alignment with adaptive appearance priors, sep 2024

work page 2024

[34] [34]

S. Qian, T. Kirschstein, L. Schoneveld, D. Davoli, S. Gieben- hain, and M. Nießner. Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20299–20309, 2024

work page 2024

[35] [35]

Retsinas, P

G. Retsinas, P. P. Filntisis, R. Danecek, V . F. Abrevaya, A. Roussos, T. Bolkart, and P. Maragos. 3d facial expres- sions through analysis-by-neural-synthesis. InConference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[36] [36]

R ¨ossler, D

A. R ¨ossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner. Faceforensics: A large-scale video dataset for forgery detection in human faces.arXiv, 2018

work page 2018

[37] [37]

R ¨ossler, D

A. R ¨ossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner. Faceforensics++: Learning to detect manipulated facial images. InICCV 2019, 2019

work page 2019

[38] [38]

Saito, G

S. Saito, G. Schwartz, T. Simon, J. Li, and G. Nam. Re- lightable gaussian codec avatars. InCVPR, 2024

work page 2024

[39] [39]

doi:10.48550/arXiv.2301.09515 , urldate =

A. Sauer, T. Karras, S. Laine, A. Geiger, and T. Aila. StyleGAN-T: Unlocking the power of GANs for fast large- scale text-to-image synthesis. volume abs/2301.09515, 2023

work page arXiv 2023

[40] [40]

Z. Shao, Z. Wang, Z. Li, D. Wang, X. Lin, Y . Zhang, M. Fan, and Z. Wang. SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[41] [41]

Sobel and G

I. Sobel and G. M. Feldman. An isotropic 3×3 image gradient operator. 1990

work page 1990

[42] [42]

J. Sun, X. Wang, L. Wang, X. Li, Y . Zhang, H. Zhang, and Y . Liu. Next3d: Generative neural texture rasterization for 3d-aware head avatars. InCVPR, 2023

work page 2023

[43] [43]

Teotia, H

K. Teotia, H. Kim, P. Garrido, M. Habermann, M. Elgharib, and C. Theobalt. Gaussianheads: End-to-end learning of driv- able gaussian head avatars from coarse-to-fine representations. ACM Transactions on Graphics (TOG), 43:1 – 12, 2024

work page 2024

[44] [44]

Thies, M

J. Thies, M. Zollh ¨ofer, and M. Nießner. Deferred neural rendering: Image synthesis using neural textures.ACM Trans- actions on Graphics 2019 (TOG), 2019

work page 2019

[45] [45]

Thies, M

J. Thies, M. Zollh ¨ofer, M. Stamminger, C. Theobalt, and M. Nießner. Face2Face: Real-time Face Capture and Reenact- ment of RGB Videos. InProc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2016

work page 2016

[46] [46]

L. Wang, X. Zhao, J. Sun, Y . Zhang, H. Zhang, T. Yu, and Y . Liu. Styleavatar: Real-time photo-realistic portrait avatar from a single video. InACM SIGGRAPH 2023 Conference Proceedings, 2023

work page 2023

[47] [47]

T. Wang, B. Zhang, T. Zhang, S. Gu, J. Bao, T. Baltrusaitis, J. Shen, D. Chen, F. Wen, Q. Chen, and B. Guo. Rodin: A generative model for sculpting 3d digital avatars using diffusion. pages 4563–4573, 06 2023

work page 2023

[48] [48]

Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli. Image quality assessment: From error visibility to structural similar- 10 ity.Image Processing, IEEE Transactions on, 13:600 – 612, 05 2004

work page 2004

[49] [49]

C.-h. Wuu, N. Zheng, S. Ardisson, R. Bali, D. Belko, E. Brockmeyer, L. Evans, T. Godisart, H. Ha, X. Huang, A. Hypes, T. Koska, S. Krenn, S. Lombardi, X. Luo, K. McPhail, L. Millerschoen, M. Perdoch, M. Pitts, A. Richard, J. Saragih, J. Saragih, T. Shiratori, T. Simon, M. Stewart, A. Trimble, X. Weng, D. Whitewolf, C. Wu, S.- I. Yu, and Y . Sheikh. Multif...

work page 2022

[50] [50]

Xiang, X

J. Xiang, X. Gao, Y . Guo, and J. Zhang. Flashavatar: High- fidelity head avatar with efficient gaussian embedding. InThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[51] [51]

L. Xie, X. Wang, H. Zhang, C. Dong, and Y . Shan. Vfhq: A high-quality dataset and benchmark for video face super- resolution. InThe IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022

work page 2022

[52] [52]

Y . Xu, B. Chen, Z. Li, H. Zhang, L. Wang, Z. Zheng, and Y . Liu. Gaussian head avatar: Ultra high-fidelity head avatar via dynamic gaussians. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[53] [53]

Y . Xu, H. Zhang, L. Wang, X. Zhao, H. Huang, G. Qi, and Y . Liu. Latentavatar: Learning latent expression code for expressive neural head avatar. InACM SIGGRAPH 2023 Conference Proceedings, SIGGRAPH ’23, New York, NY , USA, 2023. Association for Computing Machinery

work page 2023

[54] [54]

Z. Yu, Z. Bai, A. Meka, F. Tan, Q. Xu, R. Pandey, S. Fanello, H. S. Park, and Y . Zhang. One2avatar: Generative implicit head avatar for few-shot user adaptation, 2024

work page 2024

[55] [55]

Zhang, Y

B. Zhang, Y . Cheng, C. Wang, T. Zhang, J. Yang, Y . Tang, F. Zhao, D. Chen, and B. Guo. Rodinhd: High-fidelity 3d avatar generation with diffusion models. InEuropean Conference on Computer Vision, pages 465–483. Springer, 2025

work page 2025

[56] [56]

X. Zhao, J. Sun, L. Wang, J. Suo, and Y . Liu. Invertavatar: In- cremental gan inversion for generalized head avatars. InACM SIGGRAPH 2024 Conference Papers, SIGGRAPH ’24, New York, NY , USA, 2024. Association for Computing Machinery

work page 2024

[57] [57]

Zheng, C

X. Zheng, C. Wen, Z. Li, W. Zhang, Z. Su, X. Chang, Y . Zhao, Z. Lv, X. Zhang, Y . Zhang, G. Wang, and L. Xu. Headgap: Few-shot 3d head avatar via generalizable gaussian priors, 2025

work page 2025

[58] [58]

Zielonka, T

W. Zielonka, T. Bagautdinov, S. Saito, M. Zollh ¨ofer, J. Thies, and J. Romero. Drivable 3d gaussian avatars. InI3DV, March 2025

work page 2025

[59] [59]

Zielonka, T

W. Zielonka, T. Bolkart, T. Beeler, and J. Thies. Gaussian eigen models for human heads. InCVPR, June 2025

work page 2025

[60] [60]

Zielonka, T

W. Zielonka, T. Bolkart, and J. Thies. Instant volumetric head avatars. InCVPR, pages 4574–4584, 2023

work page 2023

[61] [61]

Zielonka, S

W. Zielonka, S. J. Garbin, A. Lattas, G. Kopanas, P. Gotardo, T. Beeler, J. Thies, and T. Bolkart. Synthetic prior for few-shot drivable head avatar inversion. InCVPR, June 2025. 11 3DRealHead: Few-Shot Detailed Head Avatar Supplemental Document A. ARCHITECTUREDETAILS Our model employs a Style U-Net [46] with two iden- tical encoders and a common decoder....

work page 2025