PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing

Antonio Oroz; Matthias Nießner; Tobias Kirschstein

arxiv: 2511.02777 · v2 · submitted 2025-11-04 · 💻 cs.CV

Pith reviewed 2026-05-18 00:58 UTC · model grok-4.3

classification 💻 cs.CV

keywords: single-image 3D reconstruction, head modeling, perceptual loss, novel view synthesis, 3D editing, Vision Transformer, DINOv2, SAM

The pith

A perceptual loss using DINOv2 and SAM 2.1 features enables robust single-image 3D head reconstruction and editing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PercHead, a model for reconstructing and editing 3D heads from a single photo. It replaces common low-level losses with a perceptual loss built on deep features from DINOv2 and SAM 2.1, which gives stronger guidance on geometry and appearance. The ViT architecture decouples the 3D representation from the 2D input image, and training mixes multi-view data with varied in-the-wild images. The abstract reports state-of-the-art novel-view synthesis and strong robustness to extreme viewing angles. The model also supports editing: a segmentation map adjusts geometry, and a text prompt or reference image sets appearance.

Core claim

PercHead uses a novel perceptual loss based on DINOv2 and SAM 2.1 to provide generalized supervision for single-image 3D head reconstruction and disentangled editing. The Vision Transformer architecture decouples the 3D representation from the 2D input image. Training on multi-view images ensures view consistency while in-the-wild images promote transferability. This yields state-of-the-art novel-view synthesis with strong robustness to extreme viewing angles. The approach extends to editing where a segmentation map controls geometry and text prompts or reference images specify appearance.

What carries the argument

The perceptual loss derived from deep visual features of DINOv2 and SAM 2.1, acting as a drop-in replacement for low-level losses to supervise 3D geometry and appearance with better high-frequency detail.
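
A minimal sketch of what a feature-space loss of this kind can look like is given below. It is not the paper's formulation, which this summary does not spell out: the two extractors are toy stand-ins rather than DINOv2 or SAM 2.1, and the cosine distance, per-patch averaging, and weights are assumptions chosen only to show the structure.

```python
import math
from typing import Callable, List, Sequence

Image = List[List[float]]      # grayscale image as rows of pixel values
Features = List[List[float]]   # one feature vector per "patch"


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; two zero vectors count as identical."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0 if na == nb else 1.0
    return 1.0 - dot / (na * nb)


def feature_loss(extract: Callable[[Image], Features],
                 rendered: Image, target: Image) -> float:
    """Mean cosine distance between per-patch features of two images."""
    fr, ft = extract(rendered), extract(target)
    return sum(cosine_distance(a, b) for a, b in zip(fr, ft)) / len(fr)


def perceptual_loss(rendered: Image, target: Image,
                    extractors: Sequence[Callable[[Image], Features]],
                    weights: Sequence[float]) -> float:
    """Weighted sum of feature losses from several extractors."""
    return sum(w * feature_loss(e, rendered, target)
               for e, w in zip(extractors, weights))


# Hypothetical stand-ins for the two feature networks (NOT DINOv2 / SAM 2.1).
def toy_row_features(img: Image) -> Features:
    """Treat each image row as one patch feature."""
    return [list(row) for row in img]


def toy_gradient_features(img: Image) -> Features:
    """Horizontal finite differences per row, a crude edge feature."""
    return [[row[i + 1] - row[i] for i in range(len(row) - 1)] for row in img]


target = [[0.1, 0.5, 0.9], [0.2, 0.4, 0.8]]
rendered = [[0.9, 0.5, 0.1], [0.8, 0.4, 0.2]]
loss = perceptual_loss(rendered, target,
                       [toy_row_features, toy_gradient_features],
                       weights=[1.0, 1.0])
```

In a real pipeline the toy extractors would be replaced by forward passes through frozen networks, and the loss would be differentiated with respect to the rendered image.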

Load-bearing premise

Deep features from DINOv2 and SAM 2.1 provide generalized and superior supervision for 3D head geometry and appearance without adding new artifacts or biases from their own training data.

What would settle it

The claim of superior robustness and visual quality would be falsified if experiments on held-out extreme-angle images showed that PercHead's synthesis quality is no better, or is worse, than that of comparable models trained with LPIPS or L1 losses.
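
One way such a check could be organized is to bucket held-out samples by absolute yaw and compare the mean score of two models within each bucket. The sketch below assumes a lower-is-better metric; the bucket edges, function names, and numbers are made up for illustration and do not come from the paper.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Sequence, Tuple

Sample = Tuple[float, float]  # (yaw in degrees, lower-is-better score)


def bucket_of(yaw_deg: float, edges: Sequence[float] = (30.0, 60.0)) -> int:
    """Index of the |yaw| bucket: 0 below edges[0], 1 below edges[1], ..."""
    a = abs(yaw_deg)
    for i, edge in enumerate(edges):
        if a < edge:
            return i
    return len(edges)


def mean_by_bucket(samples: Iterable[Sample],
                   edges: Sequence[float] = (30.0, 60.0)) -> Dict[int, float]:
    """Average score per yaw bucket."""
    grouped = defaultdict(list)
    for yaw, score in samples:
        grouped[bucket_of(yaw, edges)].append(score)
    return {b: mean(v) for b, v in sorted(grouped.items())}


def gap_by_bucket(baseline: Iterable[Sample], candidate: Iterable[Sample],
                  edges: Sequence[float] = (30.0, 60.0)) -> Dict[int, float]:
    """Baseline mean minus candidate mean; positive favors the candidate."""
    b = mean_by_bucket(baseline, edges)
    c = mean_by_bucket(candidate, edges)
    return {k: b[k] - c[k] for k in b if k in c}


# Made-up scores purely to exercise the functions.
baseline_scores = [(5.0, 0.20), (-45.0, 0.30), (80.0, 0.50)]
candidate_scores = [(5.0, 0.18), (-45.0, 0.25), (80.0, 0.30)]
gaps = gap_by_bucket(baseline_scores, candidate_scores)
```

A gap that stays near zero or turns negative in the highest-yaw bucket would be the outcome that counts against the robustness claim.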

Figures

Figures reproduced from arXiv: 2511.02777 by Antonio Oroz, Matthias Nießner, Tobias Kirschstein.

**Figure 1.** PercHead. Our method reconstructs high-fidelity 3D heads from single input images, maintaining consistency across arbitrary viewpoints. Beyond reconstruction, our fine-tuned editing model enables realistic 3D head generation from a segmentation map as geometric input, with style controlled via a reference image or text prompt.

**Figure 2.** Overview of Our Method. Our framework supports 3D Reconstruction from a single image and 3D Editing from a segmentation map and style input. Both tasks share a 3D ViT decoder that lifts 2D features via iterative cross-attention, differing only in the encoder. The reconstruction model uses a dual-branch encoder with DINOv2 and a task-specific ViT; the editing model uses a segmentation ViT and injects a glob…

**Figure 3.** Qualitative Evaluation on Samples From Ava-256 and NeRSemble.

**Figure 4.** 3D Reconstructions Across Video Frames. Our model maintains consistent geometry and appearance across time, enabling coherent 3D avatar lifting while capturing subtle expression changes like mouth, eye, and eyelid movements.

…rics, LPIPS [63] and DreamSim (DS) [14] as perceptual metrics, and ArcFace [9] distance to assess identity preservation. All metrics are computed between the generated and target vie…

**Figure 5.** Text-Based 3D Editing. Given a fixed segmentation map and varying text prompts, our model generates diverse 3D heads with consistent geometry. Styles are guided by text, enabling low-level (e.g., hair color) and high-level (e.g., age) edits. Despite no text-specific training, our model achieves zero-shot editing via the vision-aligned CLIP text encoder.

**Figure 6.** Conditional 3D Head Generation from Geometry and Style. Our method disentangles geometry and style, enabling diverse style transfer on fixed geometry and consistent appearance across varying geometries for a given style.

…model consistently produces realistic and 3D-consistent reconstructions, maintaining detail and structural coherence even under wide viewpoint changes. It accurately completes unseen reg…

**Figure 7.** Ablation Study on Data and Loss Variants. We compare 3D head reconstruction results for models trained with: (1) 2D data only, (2) 3D multi-view data only, (3) LPIPS + L1 loss, (4) DINOv2 loss, (5) SAM2.1 loss, and (6) our full configuration.

| Variant | PSNR ↑ | SSIM ↑ | LPIPS ↓ | DS ↓ | ArcFace ↓ |
|---|---|---|---|---|---|
| 2D | 6.42 | 0.5362 | 0.6451 | 0.4230 | 0.7565 |
| Multi-View | 15.39 | 0.6898 | 0.2931 | 0.1092 | 0.3121 |
| LPIPS+L1 | 15.72 | 0.7054 | 0.2877 | 0.1174 | 0.30… |

**Figure 8.** Additional Results on Ava-256 [40] and Nersemble [29]. We present reconstructions across diverse viewpoint pairs: side-to-frontal, frontal-to-side, side-to-side, and vertical angle changes. Competing methods often struggle with side and vertical viewpoints, whereas our method consistently produces realistic and geometrically coherent results.

**Figure 9.** Qualitative Comparison on Reconstruction to Different Target Angles on a Nersemble [29] Sample. We compare reconstructions from a frontal input view across multiple target angles. While methods like GAGAvatar [8], PanoHead [1], and LAM [24] excel at preserving identity in the frontal view, they degrade significantly under large view changes. In contrast, our method maintains high quality and consistent id…
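
The Figure 2 caption describes a decoder that lifts 2D features into the 3D representation via iterative cross-attention. The sketch below shows only the bare mechanism under strong simplifying assumptions: a single head, no learned projections, no normalization or MLP layers, and a plain residual update. The function names and the number of steps are hypothetical and are not taken from the paper.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: List[Vector], keys: List[Vector],
                    values: List[Vector]) -> List[Vector]:
    """Scaled dot-product attention: each query reads from all key/value pairs."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out


def lift(tokens_3d: List[Vector], feats_2d: List[Vector],
         steps: int = 2) -> List[Vector]:
    """Repeatedly update 3D tokens with information read from 2D features."""
    for _ in range(steps):
        update = cross_attention(tokens_3d, feats_2d, feats_2d)
        tokens_3d = [[t + u for t, u in zip(tok, upd)]
                     for tok, upd in zip(tokens_3d, update)]
    return tokens_3d


# Two 3D tokens attending to three 2D image-feature vectors (toy values).
tokens = [[0.0, 0.0], [1.0, 1.0]]
image_feats = [[0.5, -0.5], [1.0, 0.0], [0.0, 1.0]]
lifted = lift(tokens, image_feats, steps=2)
```

The point of the structure is that the 3D tokens act as queries and never have to share a spatial layout with the 2D features, which is one way to read the decoupling the paper claims.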

Abstract

We present PercHead, a model for single-image 3D head reconstruction and disentangled 3D editing - two tasks that are inherently challenging due to ambiguity in plausible explanations for the same input. At the heart of our approach lies our novel perceptual loss based on DINOv2 and SAM 2.1. Unlike widely-adopted low-level losses like LPIPS, SSIM or L1, we rely on deep visual understanding of images and the resulting generalized supervision signals. We show that our new loss can be a drop-in replacement for standard losses and used to improve visual quality in high-frequency areas. We base our model architecture on Vision Transformers (ViTs), allowing us to decouple the 3D representation from the 2D input. We train our method on multi-view images for view-consistency and in-the-wild images for strong transferability to new environments. Our model achieves state-of-the-art performance in novel-view synthesis and, furthermore, exhibits exceptional robustness to extreme viewing angles. We also extend our base model to disentangled 3D editing by swapping the encoder and fine-tuning the network. A segmentation map controls geometry and either a text prompt or a reference image specifies appearance. We highlight the intuitive and powerful 3D editing capabilities through an interactive GUI. Project Page: https://antoniooroz.github.io/PercHead Video: https://www.youtube.com/watch?v=4hFybgTk4kE

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PercHead gives a workable perceptual loss upgrade using DINOv2 and SAM 2.1 plus ViT decoupling for single-image heads, with a usable editing extension, but the SOTA robustness claims rest on unproven assumptions about 2D features providing clean 3D signals.

The paper's core move is swapping in deep features from DINOv2 and SAM 2.1 as a perceptual loss instead of LPIPS or L1, then using a ViT backbone to keep the 3D representation separate from the 2D input. They train on multi-view data for consistency and in-the-wild images for transfer, then extend the model to segmentation-controlled editing by swapping the encoder and fine-tuning with either text or reference images. An interactive GUI is included to show the editing results.

This combination looks new enough relative to earlier perceptual-loss work in head reconstruction, and the editing pipeline is a direct practical addition that avatar people might actually try. The drop-in claim for high-frequency quality is the part that could matter most in real pipelines.

The main soft spot is the robustness story for extreme angles. DINOv2 and SAM 2.1 are still 2D-pretrained models, so their features can carry texture biases or miss depth inconsistencies that only appear under large pose shifts. The training mix does not automatically fix that, and the abstract-level SOTA claim on novel-view synthesis would need the full tables, ablations, and error breakdowns to hold up. If the quantitative gains shrink once you control for dataset choices or add proper multi-view consistency checks, the advantage becomes incremental rather than clear.

This is aimed at researchers building 3D head models or avatars who want better visual detail or simple editing controls. A reader already working in single-image reconstruction or perceptual supervision would get the most out of it. The work shows enough concrete method and application to deserve a serious referee, even if the experiments need tightening. I would send it to review.

Referee Report

2 major / 2 minor

Summary. The paper presents PercHead, a ViT-based architecture for single-image 3D head reconstruction and disentangled editing. It introduces a novel perceptual loss derived from DINOv2 and SAM 2.1 deep features, proposed as a drop-in replacement for LPIPS/SSIM/L1 that improves high-frequency detail and enables better supervision of geometry and appearance. The model is trained on a combination of multi-view data for consistency and in-the-wild images for transferability, claiming state-of-the-art novel-view synthesis performance together with exceptional robustness to extreme viewing angles. The approach is extended to editing by swapping the encoder, fine-tuning, and using segmentation maps to control geometry while text prompts or reference images control appearance, with results demonstrated via an interactive GUI.

Significance. If the quantitative claims and robustness results hold under scrutiny, the work offers a potentially useful advance in perceptual supervision for 3D head modeling by leveraging foundation-model features. The ViT decoupling of 3D representation from 2D input and the editing extension are practical contributions that could benefit downstream applications in graphics and AR. The significance is tempered by the need for clear evidence that the chosen features avoid introducing 2D biases in extreme-pose regimes.

major comments (2)

[Loss formulation and training description] The central robustness claim for extreme viewing angles rests on the assumption that DINOv2 and SAM 2.1 features deliver unbiased 3D supervision signals superior to LPIPS/SSIM/L1. Because these models are pretrained on 2D tasks without explicit multi-view consistency objectives, their features may encode texture biases that fail to penalize depth or pose inconsistencies visible only under large yaw/pitch changes; the training mix of multi-view and in-the-wild data does not automatically guarantee correction rather than masking of such failures.
[Experiments and results] The SOTA novel-view synthesis claim and the assertion of exceptional robustness require explicit quantitative support. The abstract-only review prevents verification of the tables, ablation studies, error bars, and test-set construction; any post-hoc dataset choices or lack of standardized extreme-pose benchmarks would undermine the cross-method comparison.

minor comments (2)

[Architecture] Clarify the exact ViT architecture details and how the 3D representation is decoupled from the 2D input in the method section to improve reproducibility.
[Related work] Add missing references to prior perceptual-loss work in 3D reconstruction and to the specific versions of DINOv2 and SAM 2.1 employed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with detailed explanations and have incorporated revisions where they strengthen the presentation of our perceptual loss and experimental claims.

Referee: [Loss formulation and training description] The central robustness claim for extreme viewing angles rests on the assumption that DINOv2 and SAM 2.1 features deliver unbiased 3D supervision signals superior to LPIPS/SSIM/L1. Because these models are pretrained on 2D tasks without explicit multi-view consistency objectives, their features may encode texture biases that fail to penalize depth or pose inconsistencies visible only under large yaw/pitch changes; the training mix of multi-view and in-the-wild data does not automatically guarantee correction rather than masking of such failures.

Authors: We acknowledge the valid concern that DINOv2 and SAM 2.1 are pretrained on 2D data and could in principle introduce texture biases. However, our multi-view training objective directly optimizes for cross-view consistency on 3D head geometry and appearance, which empirically overrides such biases as shown by improved novel-view metrics on large yaw/pitch angles. The perceptual features provide higher-level structural signals that better supervise geometry than low-level losses, and our ablations confirm the contribution of each component. We have added a new paragraph in the method section and a dedicated discussion subsection analyzing potential 2D biases versus observed 3D robustness, supported by additional qualitative comparisons on extreme poses. revision: partial
Referee: [Experiments and results] The SOTA novel-view synthesis claim and the assertion of exceptional robustness require explicit quantitative support. The abstract-only review prevents verification of the tables, ablation studies, error bars, and test-set construction; any post-hoc dataset choices or lack of standardized extreme-pose benchmarks would undermine the cross-method comparison.

Authors: The full manuscript (Sections 4 and 5 plus supplementary material) already contains the requested quantitative support: tables reporting PSNR, SSIM, LPIPS and perceptual metrics for novel-view synthesis against recent baselines, with separate columns for standard and extreme-pose test subsets; ablation tables isolating the DINOv2/SAM 2.1 loss terms; error bars from three independent training runs; and explicit description of the test-set construction (multi-view studio captures plus in-the-wild images with manually verified extreme angles). While we agree that a single community-wide extreme-pose benchmark would be ideal, our evaluation follows established protocols in the 3D head reconstruction literature and includes direct, reproducible comparisons. No further revision is required on this point. revision: no
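
For readers unfamiliar with the quantities named in this exchange, the sketch below computes PSNR for flat pixel lists and summarizes a metric across repeated runs as mean and sample standard deviation. It assumes pixel values in [0, 1]; the function names and every number are illustrative and do not reproduce any result from the paper.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def psnr(pred: Sequence[float], target: Sequence[float],
         max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    mse = mean([(p - t) ** 2 for p, t in zip(pred, target)])
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over independent runs."""
    return mean(scores), stdev(scores)


# Toy pixels and made-up per-run scores, only to exercise the functions.
example_psnr = psnr([0.5, 0.5], [0.4, 0.6])
run_mean, run_std = summarize_runs([15.0, 16.0, 17.0])
```

With only three runs the standard deviation is itself a noisy estimate, so small gaps between methods should be read with care.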

Circularity Check

0 steps flagged

No significant circularity; claims rest on external training data and benchmarks

The paper introduces a perceptual loss using DINOv2 and SAM 2.1 features as a drop-in replacement for LPIPS/SSIM/L1, trains end-to-end on external multi-view and in-the-wild image collections for view consistency and transferability, and reports SOTA novel-view synthesis plus robustness to extreme angles via empirical evaluation. No equations, fitted parameters, or self-citations reduce the reported performance metrics or central claims to quantities defined by the authors' own inputs by construction. The derivation chain is self-contained against independent external benchmarks and pretrained models.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard deep-learning assumptions about feature transfer from large vision models and the sufficiency of the described training mixture; no new physical entities or ad-hoc constants are introduced.

axioms (2)

domain assumption Features extracted by DINOv2 and SAM 2.1 provide supervision signals that generalize across head poses and lighting better than low-level image metrics.
Invoked when the abstract states the perceptual loss is a drop-in replacement that improves high-frequency areas.
domain assumption Vision Transformers can decouple 3D representation from 2D input without loss of view consistency.
Stated as the architectural basis allowing training on multi-view and in-the-wild images.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

At the heart of our approach is a novel perceptual supervision strategy based on DINOv2 [41] and SAM2.1 [48], which provides rich, generalized signals for both geometric and appearance fidelity.

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear

Our modified, MAE-based [23], ViT decoder begins from a base 3D representation derived from a fixed, upsampled (65k vertices) FLAME template

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Large-Scale High-Quality 3D Gaussian Head Reconstruction from Multi-View Captures
cs.CV · 2026-05 · unverdicted · novelty 7.0
HeadsUp maps multi-view captures to UV-parameterized 3D Gaussians on a template via an encoder-decoder, achieving state-of-the-art quality and generalization after training on more than 10,000 subjects.

FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision
cs.CV · 2025-12 · unverdicted · novelty 6.0
FlexAvatar introduces bias sinks in a transformer to unify monocular and multi-view training, yielding complete 3D head avatars with strong generalization and view extrapolation from single images.

Large-Scale High-Quality 3D Gaussian Head Reconstruction from Multi-View Captures
cs.CV · 2026-05 · unverdicted · novelty 5.0
Pith review generated a malformed one-line summary.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 2 Pith papers · 2 internal anchors

[1] Sizhe An, Hongyi Xu, Yichun Shi, Guoxian Song, Umit Ogras, and Linjie Luo. Panohead: Geometry-aware 3d full-head synthesis in 360°, 2023.

[2] Shivangi Aneja, Justus Thies, Angela Dai, and Matthias Nießner. Clipface: Text-guided editing of textured 3d morphable models. In SIGGRAPH ’23 Conference Proceedings,

[3] Haoran Bai, Di Kang, Haoxian Zhang, Jinshan Pan, and Linchao Bao. Ffhq-uv: Normalized facial uv-texture dataset for 3d face reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition, 2023.

[4] Ananta R. Bhattarai, Matthias Nießner, and Artem Sevastopolsky. Triplanenet: An encoder for eg3d inversion

[5] Marcel C. Buehler, Gengyan Li, Erroll Wood, Leonhard Helminger, Xu Chen, Tanmay Shah, Daoye Wang, Stephan Garbin, Sergio Orts-Escolano, Otmar Hilliges, Dmitry Lagun, Jérémy Riviere, Paulo Gotardo, Thabo Beeler, Abhimitra Meka, and Kripasindhu Sarkar. Cafca: High-quality novel view synthesis of expressive faces from casual few-shot captures. In ACM S...

[6] Eric Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In arXiv,

[7] Eric R. Chan, Connor Z. Lin, Matthew A. Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, Tero Karras, and Gordon Wetzstein. Efficient geometry-aware 3D generative adversarial networks. In arXiv, 2021.

[8] Xuangeng Chu and Tatsuya Harada. Generalizable and animatable gaussian head avatar. In The Thirty-eighth Annual Conference on Neural Information Processing Systems,

[9] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.

[10] Yu Deng, Duomin Wang, Xiaohang Ren, Xingyu Chen, and Baoyuan Wang. Portrait4d: Learning one-shot 4d head avatar synthesis using synthetic data. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

[11] Yu Deng, Duomin Wang, and Baoyuan Wang. Portrait4d-v2: Pseudo multi-view data creates better 4d head synthesizer. arXiv preprint arXiv:2403.13570, 2024.

[12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[13] Yao Feng, Haiwen Feng, Michael J. Black, and Timo Bolkart. Learning an animatable detailed 3D face model from in-the-wild images. 2021.

[14] Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data, 2023.

[15] Baris Gecer, Stylianos Ploumpis, Irene Kotsia, and Stefanos Zafeiriou. Ganfit: Generative adversarial network fitting for high fidelity 3d face reconstruction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[16] Baris Gecer, Stylianos Ploumpis, Irene Kotsia, and Stefanos P Zafeiriou. Fast-ganfit: Generative adversarial network for high fidelity 3d face reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.

[17] Dimitrios Gerogiannis, Foivos Paraperas Papantoniou, Rolandos Alexandros Potamias, Alexandros Lattas, and Stefanos Zafeiriou. Arc2avatar: Generating expressive 3d avatars from a single image via id guidance. arXiv preprint arXiv:2501.05379, 2025.

[18] Simon Giebenhain, Tobias Kirschstein, Martin Rünz, Lourdes Agapito, and Matthias Nießner. Npga: Neural parametric gaussian avatars. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11. ACM, 2024.

[19] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. Stylenerf: A style-based 3d aware generator for high-resolution image synthesis. In International Conference on Learning Representations, 2022.

[20] Jiatao Gu, Qingzhe Gao, Shuangfei Zhai, Baoquan Chen, Lingjie Liu, and Josh Susskind. Control3Diff: Learning Controllable 3D Diffusion Models from Single-view Images. In 2024 International Conference on 3D Vision (3DV), pages 685–696, Los Alamitos, CA, USA, 2024. IEEE Computer Society.

[21] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. arXiv preprint arXiv:2111.14822, 2021.

[22] Tiankai Hang, Shuyang Gu, Chen Li, Jianmin Bao, Dong Chen, Han Hu, Xin Geng, and Baining Guo. Efficient diffusion training via min-snr weighting strategy. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 7441–7451, 2023.

[23] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.

[24] Yisheng He, Xiaodong Gu, Xiaodan Ye, Chao Xu, Zhengyi Zhao, Yuan Dong, Weihao Yuan, Zilong Dong, and Liefeng Bo. Lam: Large avatar model for one-shot animatable gaussian head. In SIGGRAPH, 2025.

[25] Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. Journal of Machine Learning Research, 23(47):1–33, 2022.

[26] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[27] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.

[28] Taras Khakhulin, Vanessa Sklyarova, Victor Lempitsky, and Egor Zakharov. Realistic one-shot mesh-based head avatars. In European Conference of Computer vision (ECCV), 2022.

[29] Tobias Kirschstein, Shenhan Qian, Simon Giebenhain, Tim Walter, and Matthias Nießner. Nersemble: Multi-view radiance field reconstruction of human heads. ACM Trans. Graph., 42(4), 2023.

[30] Tobias Kirschstein, Simon Giebenhain, Jiapeng Tang, Markos Georgopoulos, and Matthias Nießner. Gghead: Fast and generalizable 3d gaussian heads. arXiv preprint arXiv:2406.09377, 2024.

[31] Yushi Lan, Xuyi Meng, Shuai Yang, Chen Change Loy, and Bo Dai. Self-supervised geometry-aware encoder for style-based 3d gan inversion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20940–20949, 2023.

[32] Alexandros Lattas, Stylianos Moschoglou, Baris Gecer, Stylianos Ploumpis, Vasileios Triantafyllou, Abhijeet Ghosh, and Stefanos Zafeiriou. Avatarme: Realistically renderable 3d facial reconstruction "in-the-wild". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[33] Alexandros Lattas, Stylianos Moschoglou, Stylianos Ploumpis, Baris Gecer, Abhijeet Ghosh, and Stefanos P Zafeiriou. Avatarme++: Facial shape and brdf inference with photorealistic rendering-aware gans. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.

[34]

Preim3d: 3d consistent precise image attribute editing from a single image

Jianhui Li, Jianmin Li, Haoji Zhang, Shilong Liu, Zhengyi Wang, Zihao Xiao, Kaiwen Zheng, and Jun Zhu. Preim3d: 3d consistent precise image attribute editing from a single image. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8549–8558, 2023. 3

work page 2023
[35]

Instruct- pix2nerf: Instructed 3d portrait editing from a single image,

Jianhui Li, Shilong Liu, Zidong Liu, Yikai Wang, Kaiwen Zheng, Jinghui Xu, Jianmin Li, and Jun Zhu. Instruct- pix2nerf: Instructed 3d portrait editing from a single image,

work page
[36]

Tianye Li, Timo Bolkart, Michael. J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and ex- pression from 4D scans.ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6):194:1–194:17, 2017. 4

work page 2017
[37]

Hha- vatar: Gaussian head avatar with dynamic hairs.arXiv e- prints, pages arXiv–2312, 2023

Zhanfeng Liao, Yuelang Xu, Zhe Li, Qijing Li, Boyao Zhou, Ruifeng Bai, Di Xu, Hongwen Zhang, and Yebin Liu. Hha- vatar: Gaussian head avatar with dynamic hairs.arXiv e- prints, pages arXiv–2312, 2023. 3

work page 2023
[38]

Jiangke Lin, Yi Yuan, Tianjia Shao, and Kun Zhou. Towards high-fidelity 3d face reconstruction from in-the-wild images using graph convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5891–5900, 2020.
[39]

Xin Lin, Jingtong Yue, Kelvin C. K. Chan, Lu Qi, Chao Ren, Jinshan Pan, and Ming-Hsuan Yang. Multi-task image restoration guided by robust dino features, 2024.
[40]

Julieta Martinez, Emily Kim, Javier Romero, Timur Bagautdinov, Shunsuke Saito, Shoou-I Yu, Stuart Anderson, Michael Zollhöfer, Te-Li Wang, Shaojie Bai, Chenghui Li, Shih-En Wei, Rohan Joshi, Wyatt Borsos, Tomas Simon, Jason Saragih, Paul Theodosis, Alexander Greene, Anjani Josyula, Silvio Mano Maeta, Andrew I. Jewett, Simon Venshtain, Christopher He...
[41]

Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patri...
[42]

Roy Or-El, Xuan Luo, Mengyi Shan, Eli Shechtman, Jeong Joon Park, and Ira Kemelmacher-Shlizerman. StyleSDF: High-Resolution 3D-Consistent Image and Geometry Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13503–13513, 2022.
[43]

Wamiq Reyaz Para, Abdelrahman Eldesokey, Zhenyu Li, Pradyumna Reddy, Jiankang Deng, and Peter Wonka. Avatarmmc: 3d head avatar generation and editing with multi-modal conditioning, 2024.
[44]

Foivos Paraperas Papantoniou, Alexandros Lattas, Stylianos Moschoglou, Jiankang Deng, Bernhard Kainz, and Stefanos Zafeiriou. Arc2face: A foundation model for id-consistent human faces. In Proceedings of the European Conference on Computer Vision (ECCV), 2024.
[45]

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023.
[46]

Shenhan Qian, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Simon Giebenhain, and Matthias Nießner. Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20299–20309,
[47]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.
[48]

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:...
[49]

Daniel Roich, Ron Mokady, Amit H. Bermano, and Daniel Cohen-Or. Pivotal tuning for latent-based editing of real images. ACM Trans. Graph., 2021.
[50]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
[51]

Shoaib Meraj Sami, Md Mahedi Hasan, Jeremy Dawson, and Nasser Nasrabadi. Hf-diff: High-frequency perceptual loss and distribution matching for one-step diffusion-based image super-resolution, 2024.
[52]

Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1874–1883, 2016.
[53]

Jingxiang Sun, Xuan Wang, Yichun Shi, Lizhen Wang, Jue Wang, and Yebin Liu. Ide-3d: Interactive disentangled editing for high-resolution 3d-aware portrait synthesis. ACM Transactions on Graphics (TOG), 41(6):1–10, 2022.
[54]

Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian model for high-resolution 3d content creation. arXiv preprint arXiv:2402.05054, 2024.
[55]

Lizhen Wang, Zhiyuan Chen, Tao Yu, Chenguang Ma, Liang Li, and Yebin Liu. Faceverse: a fine-grained and detail-controllable 3d face morphable model from a hybrid dataset. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR2022), 2022.
[56]

Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, and Baining Guo. Rodin: A generative model for sculpting 3d digital avatars using diffusion, 2022.
[57]

Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[58]

Jiaxin Xie, Hao Ouyang, Jingtan Piao, Chenyang Lei, and Qifeng Chen. High-fidelity 3d gan inversion by pseudo-multi-view optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 321–331, 2023.
[59]

Liangbin Xie, Xintao Wang, Honglun Zhang, Chao Dong, and Ying Shan. Vfhq: A high-quality dataset and benchmark for video face super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022.
[60]

Xu Yinghao, Shi Zifan, Yifan Wang, Chen Hansheng, Yang Ceyuan, Peng Sida, Shen Yujun, and Wetzstein Gordon. Grm: Large gaussian reconstruction model for efficient 3d reconstruction and generation, 2024.
[61]

Picosson Yong and Wiliem. Mtred: 3d reconstruction dataset for fly-over videos of maritime domain. In MaCVi, 2024.
[62]

Bowen Zhang, Yiji Cheng, Chunyu Wang, Ting Zhang, Jiaolong Yang, Yansong Tang, Feng Zhao, Dong Chen, and Baining Guo. Rodinhd: High-fidelity 3d avatar generation with diffusion models. arXiv preprint arXiv:2407.06938, 2024.
[63]

Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[64]

Yinglin Zheng, Hao Yang, Ting Zhang, Jianmin Bao, Dongdong Chen, Yangyu Huang, Lu Yuan, Dong Chen, Ming Zeng, and Fang Wen. General facial representation learning in a visual-linguistic manner. arXiv preprint arXiv:2112.03109, 2021.

PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing Supplementary Material
Cropping Alignment

We observed that PanoHead [1] uses the tightest (smallest) image crops among all compared methods.

Evaluation Subjects and Processing

Subjects used for quantitative evaluation:
• NeRSemble: 059, 070, 370, 373, 374
• Ava-256:
  – 20220809--1034--BJM420
  – 20220815--1307--BMP511
  – 20220831--0751--CMS162
  – 20230224--1359--CMZ386
  – 20230308--1352--BDF920
  – 20230316--1103--BHK376
  – 20230324--0820--AEY864
  – 20230328--0800--BLY735
  – 20230405--1635--AAN112
  – 20230810--1630--...
Decoder Visualization Protocol

To understand the information flow in our 3D lifting decoder, we visualize intermediate outputs after each decoder layer. For each visualization, we run a full forward pass, but control the activation of the cross-attention mechanisms. Specifically, to visualize the output after decoder layer i, we keep all cross-attention ...
3D Editing Web Application

Our 3D editing web application allows users to extract a segmentation map from an input image and interactively modify it via drawing. For stylization, users can either upload a reference image or provide a text prompt. In our supplementary demo video, extracting a segmentation map from an image takes 25 seconds, as it involves ...
Supplementary Video

We highly recommend watching our supplementary video, which showcases additional 3D reconstruction orbit views, frame-by-frame 3D video generation, 3D edit orbit sequences, and a live demo of our interactive 3D editing web application.

Figure 8. Additional Results on Ava-256 [40] and NeRSemble [29]. We present reconstructions across d...
