
arxiv: 2511.02777 · v2 · submitted 2025-11-04 · 💻 cs.CV

PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing

Pith reviewed 2026-05-18 00:58 UTC · model grok-4.3

classification 💻 cs.CV
keywords single-image 3D reconstruction · head modeling · perceptual loss · novel view synthesis · 3D editing · Vision Transformer · DINOv2 · SAM

The pith

A perceptual loss using DINOv2 and SAM 2.1 features enables robust single-image 3D head reconstruction and editing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PercHead, a model that reconstructs and edits 3D heads from a single photo. It replaces common low-level losses with a perceptual loss built on deep features from DINOv2 and SAM 2.1, which the authors argue gives better guidance on geometry and appearance. The ViT architecture decouples the 3D representation from the input image, and training mixes controlled multi-view data with varied in-the-wild images. The paper reports state-of-the-art novel-view synthesis and robustness to extreme viewing angles. The model also supports editing: a segmentation map controls geometry, and a text prompt or reference image specifies appearance.

Core claim

PercHead uses a novel perceptual loss based on DINOv2 and SAM 2.1 to provide generalized supervision for single-image 3D head reconstruction and disentangled editing. The Vision Transformer architecture decouples the 3D representation from the 2D input image. Training on multi-view images ensures view consistency while in-the-wild images promote transferability. This yields state-of-the-art novel-view synthesis with strong robustness to extreme viewing angles. The approach extends to editing where a segmentation map controls geometry and text prompts or reference images specify appearance.

What carries the argument

The perceptual loss derived from deep visual features of DINOv2 and SAM 2.1, acting as a drop-in replacement for low-level losses to supervise 3D geometry and appearance with better high-frequency detail.
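A feature-space loss of this kind can be sketched in a few lines. The sketch below is illustrative only: the page does not state how the paper combines DINOv2 and SAM 2.1 features, so the weighted sum of mean-squared feature distances, the names `perceptual_loss`, `toy_global` and `toy_edges`, and the weights are all assumptions. The two toy extractors are hypothetical stand-ins for frozen encoders and are not related to the real models.

```python
from typing import Callable, Sequence

Features = Sequence[float]
Extractor = Callable[[Sequence[float]], Features]


def feature_mse(a: Features, b: Features) -> float:
    """Mean squared error between two equally sized feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def perceptual_loss(
    rendered: Sequence[float],
    target: Sequence[float],
    extractors: Sequence[Extractor],
    weights: Sequence[float],
) -> float:
    """Weighted sum of feature distances (assumed form, not the paper's)."""
    if len(extractors) != len(weights):
        raise ValueError("need one weight per extractor")
    return sum(
        w * feature_mse(f(rendered), f(target))
        for f, w in zip(extractors, weights)
    )


# Hypothetical stand-ins for frozen feature encoders.
def toy_global(img: Sequence[float]) -> Features:
    """Global statistic: the mean pixel value."""
    return [sum(img) / len(img)]


def toy_edges(img: Sequence[float]) -> Features:
    """Local structure: differences between neighbouring pixels."""
    return [b - a for a, b in zip(img, img[1:])]


rendered = [0.0, 1.0]
target = [1.0, 0.0]
loss = perceptual_loss(rendered, target, [toy_global, toy_edges], [1.0, 0.5])
print(loss)
```

In this toy case the global feature cannot distinguish the two inputs, while the structural feature can, which is the general motivation for supervising with more than one feature extractor.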

Load-bearing premise

Deep features from DINOv2 and SAM 2.1 provide generalized and superior supervision for 3D head geometry and appearance without adding new artifacts or biases from their own training data.

What would settle it

Experiments on held-out extreme-angle images would falsify the claim of superior robustness and visual quality if they showed no improvement, or a degradation, in synthesis quality relative to models trained with LPIPS or L1 losses.
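One way to make that check concrete is a paired comparison of per-image error scores on the same held-out images. The sketch below is a minimal illustration under stated assumptions: `paired_summary` and `claim_survives` are hypothetical helper names, lower scores are assumed to be better (as with LPIPS), and the numbers are placeholders chosen for the example, not results from the paper.

```python
from statistics import mean
from typing import Dict, Sequence


def paired_summary(
    baseline_scores: Sequence[float],
    candidate_scores: Sequence[float],
) -> Dict[str, float]:
    """Compare per-image error scores (lower is better) on the same images."""
    if not baseline_scores or len(baseline_scores) != len(candidate_scores):
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [c - b for b, c in zip(baseline_scores, candidate_scores)]
    return {
        "mean_diff": mean(diffs),  # negative means the candidate is better
        "win_rate": sum(d < 0 for d in diffs) / len(diffs),
    }


def claim_survives(summary: Dict[str, float], margin: float = 0.0) -> bool:
    """The robustness claim survives only if the candidate beats the
    baseline by more than `margin` on average."""
    return summary["mean_diff"] < -margin


# Placeholder per-image scores on a hypothetical extreme-angle subset.
baseline = [0.5, 0.25, 0.75, 0.5]
candidate = [0.25, 0.25, 0.5, 0.75]
summary = paired_summary(baseline, candidate)
print(summary, claim_survives(summary))
```

A real evaluation would also need a significance test or confidence interval; the mean difference alone says nothing about run-to-run variance.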

Figures

Figures reproduced from arXiv: 2511.02777 by Antonio Oroz, Matthias Nießner, Tobias Kirschstein.

Figure 1
Figure 1: PercHead. Our method reconstructs high-fidelity 3D heads from single input images, maintaining consistency across arbitrary viewpoints. Beyond reconstruction, our fine-tuned editing model enables realistic 3D head generation from a segmentation map as geometric input, with style controlled via a reference image or text prompt.
Figure 2
Figure 2: Overview of Our Method. Our framework supports 3D Reconstruction from a single image and 3D Editing from a segmentation map and style input. Both tasks share a 3D ViT decoder that lifts 2D features via iterative cross-attention, differing only in the encoder. The reconstruction model uses a dual-branch encoder with DINOv2 and a task-specific ViT; the editing model uses a segmentation ViT and injects a glob…
Figure 3
Figure 3: Qualitative Evaluation on Samples From Ava-256 and NeRSemble.
Figure 4
Figure 4: 3D Reconstructions Across Video Frames. Our model maintains consistent geometry and appearance across time, enabling coherent 3D avatar lifting while capturing subtle expression changes like mouth, eye, and eyelid movements.
…rics, LPIPS [63] and DreamSim (DS) [14] as perceptual metrics, and ArcFace [9] distance to assess identity preservation. All metrics are computed between the generated and target vie…
Figure 5
Figure 5: Text-Based 3D Editing. Given a fixed segmentation map and varying text prompts, our model generates diverse 3D heads with consistent geometry. Styles are guided by text, enabling low-level (e.g., hair color) and high-level (e.g., age) edits. Despite no text-specific training, our model achieves zero-shot editing via the vision-aligned CLIP text encoder.
Figure 6
Figure 6: Conditional 3D Head Generation from Geometry and Style. Our method disentangles geometry and style, enabling diverse style transfer on fixed geometry and consistent appearance across varying geometries for a given style.
…model consistently produces realistic and 3D-consistent reconstructions, maintaining detail and structural coherence even under wide viewpoint changes. It accurately completes unseen reg…
Figure 7
Figure 7: Ablation Study on Data and Loss Variants. We compare 3D head reconstruction results for models trained with: (1) 2D data only, (2) 3D multi-view data only, (3) LPIPS + L1 loss, (4) DINOv2 loss, (5) SAM2.1 loss, and (6) our full configuration.

Variant      PSNR ↑   SSIM ↑   LPIPS ↓   DS ↓     ArcFace ↓
2D            6.42    0.5362   0.6451    0.4230   0.7565
Multi-View   15.39    0.6898   0.2931    0.1092   0.3121
LPIPS+L1     15.72    0.7054   0.2877    0.1174   0.30…
Figure 8
Figure 8: Additional Results on Ava-256 [40] and NeRSemble [29]. We present reconstructions across diverse viewpoint pairs: side-to-frontal, frontal-to-side, side-to-side, and vertical angle changes. Competing methods often struggle with side and vertical viewpoints, whereas our method consistently produces realistic and geometrically coherent results.
Figure 9
Figure 9: Qualitative Comparison on Reconstruction to Different Target Angles on a NeRSemble [29] Sample. We compare reconstructions from a frontal input view across multiple target angles. While methods like GAGAvatar [8], PanoHead [1], and LAM [24] excel at preserving identity in the frontal view, they degrade significantly under large view changes. In contrast, our method maintains high quality and consistent id…
Original abstract

We present PercHead, a model for single-image 3D head reconstruction and disentangled 3D editing - two tasks that are inherently challenging due to ambiguity in plausible explanations for the same input. At the heart of our approach lies our novel perceptual loss based on DINOv2 and SAM 2.1. Unlike widely-adopted low-level losses like LPIPS, SSIM or L1, we rely on deep visual understanding of images and the resulting generalized supervision signals. We show that our new loss can be a drop-in replacement for standard losses and used to improve visual quality in high-frequency areas. We base our model architecture on Vision Transformers (ViTs), allowing us to decouple the 3D representation from the 2D input. We train our method on multi-view images for view-consistency and in-the-wild images for strong transferability to new environments. Our model achieves state-of-the-art performance in novel-view synthesis and, furthermore, exhibits exceptional robustness to extreme viewing angles. We also extend our base model to disentangled 3D editing by swapping the encoder and fine-tuning the network. A segmentation map controls geometry and either a text prompt or a reference image specifies appearance. We highlight the intuitive and powerful 3D editing capabilities through an interactive GUI. Project Page: https://antoniooroz.github.io/PercHead Video: https://www.youtube.com/watch?v=4hFybgTk4kE

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents PercHead, a ViT-based architecture for single-image 3D head reconstruction and disentangled editing. It introduces a novel perceptual loss derived from DINOv2 and SAM 2.1 deep features, proposed as a drop-in replacement for LPIPS/SSIM/L1 that improves high-frequency detail and enables better supervision of geometry and appearance. The model is trained on a combination of multi-view data for consistency and in-the-wild images for transferability, claiming state-of-the-art novel-view synthesis performance together with exceptional robustness to extreme viewing angles. The approach is extended to editing by swapping the encoder, fine-tuning, and using segmentation maps to control geometry while text prompts or reference images control appearance, with results demonstrated via an interactive GUI.

Significance. If the quantitative claims and robustness results hold under scrutiny, the work offers a potentially useful advance in perceptual supervision for 3D head modeling by leveraging foundation-model features. The ViT decoupling of 3D representation from 2D input and the editing extension are practical contributions that could benefit downstream applications in graphics and AR. The significance is tempered by the need for clear evidence that the chosen features avoid introducing 2D biases in extreme-pose regimes.

major comments (2)
  1. [Loss formulation and training description] The central robustness claim for extreme viewing angles rests on the assumption that DINOv2 and SAM 2.1 features deliver unbiased 3D supervision signals superior to LPIPS/SSIM/L1. Because these models are pretrained on 2D tasks without explicit multi-view consistency objectives, their features may encode texture biases that fail to penalize depth or pose inconsistencies visible only under large yaw/pitch changes; the training mix of multi-view and in-the-wild data does not automatically guarantee correction rather than masking of such failures.
  2. [Experiments and results] The SOTA novel-view synthesis claim and the assertion of exceptional robustness require explicit quantitative support. The abstract-only review prevents verification of the tables, ablation studies, error bars, and test-set construction; any post-hoc dataset choices or lack of standardized extreme-pose benchmarks would undermine the cross-method comparison.
minor comments (2)
  1. [Architecture] Clarify the exact ViT architecture details and how the 3D representation is decoupled from the 2D input in the method section to improve reproducibility.
  2. [Related work] Add missing references to prior perceptual-loss work in 3D reconstruction and to the specific versions of DINOv2 and SAM 2.1 employed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with detailed explanations and have incorporated revisions where they strengthen the presentation of our perceptual loss and experimental claims.

Point-by-point responses
  1. Referee: [Loss formulation and training description] The central robustness claim for extreme viewing angles rests on the assumption that DINOv2 and SAM 2.1 features deliver unbiased 3D supervision signals superior to LPIPS/SSIM/L1. Because these models are pretrained on 2D tasks without explicit multi-view consistency objectives, their features may encode texture biases that fail to penalize depth or pose inconsistencies visible only under large yaw/pitch changes; the training mix of multi-view and in-the-wild data does not automatically guarantee correction rather than masking of such failures.

    Authors: We acknowledge the valid concern that DINOv2 and SAM 2.1 are pretrained on 2D data and could in principle introduce texture biases. However, our multi-view training objective directly optimizes for cross-view consistency on 3D head geometry and appearance, which empirically overrides such biases as shown by improved novel-view metrics on large yaw/pitch angles. The perceptual features provide higher-level structural signals that better supervise geometry than low-level losses, and our ablations confirm the contribution of each component. We have added a new paragraph in the method section and a dedicated discussion subsection analyzing potential 2D biases versus observed 3D robustness, supported by additional qualitative comparisons on extreme poses. revision: partial

  2. Referee: [Experiments and results] The SOTA novel-view synthesis claim and the assertion of exceptional robustness require explicit quantitative support. The abstract-only review prevents verification of the tables, ablation studies, error bars, and test-set construction; any post-hoc dataset choices or lack of standardized extreme-pose benchmarks would undermine the cross-method comparison.

    Authors: The full manuscript (Sections 4 and 5 plus supplementary material) already contains the requested quantitative support: tables reporting PSNR, SSIM, LPIPS and perceptual metrics for novel-view synthesis against recent baselines, with separate columns for standard and extreme-pose test subsets; ablation tables isolating the DINOv2/SAM 2.1 loss terms; error bars from three independent training runs; and explicit description of the test-set construction (multi-view studio captures plus in-the-wild images with manually verified extreme angles). While we agree that a single community-wide extreme-pose benchmark would be ideal, our evaluation follows established protocols in the 3D head reconstruction literature and includes direct, reproducible comparisons. No further revision is required on this point. revision: no
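For readers checking tables of the kind described in the second response, the two basic computations are PSNR from a mean squared error and a mean with sample standard deviation across repeated runs. The sketch below uses the standard PSNR definition; the helper names, the assumption that pixel values lie in [0, 1], and the example inputs are illustrative and do not come from the paper.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def psnr(mse: float, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0:
        raise ValueError("PSNR is undefined for a non-positive MSE")
    return 10.0 * math.log10(max_value ** 2 / mse)


def mean_and_std(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation, e.g. over independent runs."""
    if len(values) < 2:
        raise ValueError("need at least two runs for a standard deviation")
    return mean(values), stdev(values)


# Illustrative inputs only, not values reported in the paper.
example_psnr = psnr(0.01)
run_mean, run_std = mean_and_std([1.0, 2.0, 3.0])
print(example_psnr, run_mean, run_std)
```

With only three runs the sample standard deviation is itself a noisy estimate, so small gaps between methods should be read with caution.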

Circularity Check

0 steps flagged

No significant circularity; claims rest on external training data and benchmarks

Full rationale

The paper introduces a perceptual loss using DINOv2 and SAM 2.1 features as a drop-in replacement for LPIPS/SSIM/L1, trains end-to-end on external multi-view and in-the-wild image collections for view consistency and transferability, and reports SOTA novel-view synthesis plus robustness to extreme angles via empirical evaluation. No equations, fitted parameters, or self-citations reduce the reported performance metrics or central claims to quantities defined by the authors' own inputs by construction. The derivation chain is self-contained against independent external benchmarks and pretrained models.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard deep-learning assumptions about feature transfer from large vision models and the sufficiency of the described training mixture; no new physical entities or ad-hoc constants are introduced.

axioms (2)
  • domain assumption Features extracted by DINOv2 and SAM 2.1 provide supervision signals that generalize across head poses and lighting better than low-level image metrics.
    Invoked when the abstract states the perceptual loss is a drop-in replacement that improves high-frequency areas.
  • domain assumption Vision Transformers can decouple 3D representation from 2D input without loss of view consistency.
    Stated as the architectural basis allowing training on multi-view and in-the-wild images.
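The second assumption can be illustrated with a toy version of the mechanism named in the Figure 2 caption, a decoder that lifts 2D features via iterative cross-attention. In the sketch below a fixed set of 3D tokens acts as queries and the image tokens act as keys and values, so the number of 3D tokens does not depend on the number of image tokens. Everything else is an assumption made for illustration: a single head, no learned projections, a residual update, the layer count, and the names `cross_attention` and `lift`.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_attention(
    queries: List[Vector], keys: List[Vector], values: List[Vector]
) -> List[Vector]:
    """Single-head scaled dot-product attention without learned projections."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values))
             for j in range(len(values[0]))]
        )
    return outputs


def lift(
    tokens_3d: List[Vector], image_tokens: List[Vector], num_layers: int = 2
) -> List[Vector]:
    """Refine 3D tokens by repeatedly attending to 2D image tokens."""
    for _ in range(num_layers):
        update = cross_attention(tokens_3d, image_tokens, image_tokens)
        tokens_3d = [
            [t + u for t, u in zip(tok, upd)]
            for tok, upd in zip(tokens_3d, update)
        ]
    return tokens_3d


image_tokens = [[1.0, 0.0], [0.0, 1.0]]  # two 2D feature tokens
tokens_3d = [[0.0, 0.0]]                 # one 3D query token
lifted = lift(tokens_3d, image_tokens)
print(lifted)
```

Whether such a decoder preserves view consistency is exactly what the assumption leaves open; the sketch only shows why the 3D token set can be chosen independently of the input resolution.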

pith-pipeline@v0.9.0 · 5802 in / 1432 out tokens · 32103 ms · 2026-05-18T00:58:10.101684+00:00 · methodology



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Large-Scale High-Quality 3D Gaussian Head Reconstruction from Multi-View Captures

    cs.CV 2026-05 unverdicted novelty 7.0

    HeadsUp maps multi-view captures to UV-parameterized 3D Gaussians on a template via an encoder-decoder, achieving state-of-the-art quality and generalization after training on more than 10,000 subjects.

  2. FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision

    cs.CV 2025-12 unverdicted novelty 6.0

    FlexAvatar introduces bias sinks in a transformer to unify monocular and multi-view training, yielding complete 3D head avatars with strong generalization and view extrapolation from single images.

  3. Large-Scale High-Quality 3D Gaussian Head Reconstruction from Multi-View Captures

    cs.CV 2026-05 unverdicted novelty 5.0

    Pith review generated a malformed one-line summary.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 2 Pith papers · 2 internal anchors

  1. [1] Sizhe An, Hongyi Xu, Yichun Shi, Guoxian Song, Umit Ogras, and Linjie Luo. Panohead: Geometry-aware 3d full-head synthesis in 360°, 2023.

  2. [2] Shivangi Aneja, Justus Thies, Angela Dai, and Matthias Nießner. Clipface: Text-guided editing of textured 3d morphable models. In SIGGRAPH ’23 Conference Proceedings,

  3. [3] Haoran Bai, Di Kang, Haoxian Zhang, Jinshan Pan, and Linchao Bao. Ffhq-uv: Normalized facial uv-texture dataset for 3d face reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition, 2023.

  4. [4] Ananta R. Bhattarai, Matthias Nießner, and Artem Sevastopolsky. Triplanenet: An encoder for eg3d inversion

  5. [5] Marcel C. Buehler, Gengyan Li, Erroll Wood, Leonhard Helminger, Xu Chen, Tanmay Shah, Daoye Wang, Stephan Garbin, Sergio Orts-Escolano, Otmar Hilliges, Dmitry Lagun, Jérémy Riviere, Paulo Gotardo, Thabo Beeler, Abhimitra Meka, and Kripasindhu Sarkar. Cafca: High-quality novel view synthesis of expressive faces from casual few-shot captures. In ACM S...

  6. [6] Eric Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In arXiv,

  7. [7] Eric R. Chan, Connor Z. Lin, Matthew A. Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, Tero Karras, and Gordon Wetzstein. Efficient geometry-aware 3D generative adversarial networks. In arXiv, 2021.

  8. [8] Xuangeng Chu and Tatsuya Harada. Generalizable and animatable gaussian head avatar. In The Thirty-eighth Annual Conference on Neural Information Processing Systems,

  9. [9] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.

  10. [10] Yu Deng, Duomin Wang, Xiaohang Ren, Xingyu Chen, and Baoyuan Wang. Portrait4d: Learning one-shot 4d head avatar synthesis using synthetic data. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

  11. [11] Yu Deng, Duomin Wang, and Baoyuan Wang. Portrait4d-v2: Pseudo multi-view data creates better 4d head synthesizer. arXiv preprint arXiv:2403.13570, 2024.

  12. [12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

  13. [13] Yao Feng, Haiwen Feng, Michael J. Black, and Timo Bolkart. Learning an animatable detailed 3D face model from in-the-wild images. 2021.

  14. [14] Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data, 2023.

  15. [15] Baris Gecer, Stylianos Ploumpis, Irene Kotsia, and Stefanos Zafeiriou. Ganfit: Generative adversarial network fitting for high fidelity 3d face reconstruction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

  16. [16] Baris Gecer, Stylianos Ploumpis, Irene Kotsia, and Stefanos P Zafeiriou. Fast-ganfit: Generative adversarial network for high fidelity 3d face reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.

  17. [17] Dimitrios Gerogiannis, Foivos Paraperas Papantoniou, Rolandos Alexandros Potamias, Alexandros Lattas, and Stefanos Zafeiriou. Arc2avatar: Generating expressive 3d avatars from a single image via id guidance. arXiv preprint arXiv:2501.05379, 2025.

  18. [18] Simon Giebenhain, Tobias Kirschstein, Martin Rünz, Lourdes Agapito, and Matthias Nießner. Npga: Neural parametric gaussian avatars. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11. ACM, 2024.

  19. [19] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. Stylenerf: A style-based 3d aware generator for high-resolution image synthesis. In International Conference on Learning Representations, 2022.

  20. [20] Jiatao Gu, Qingzhe Gao, Shuangfei Zhai, Baoquan Chen, Lingjie Liu, and Josh Susskind. Control3Diff: Learning Controllable 3D Diffusion Models from Single-view Images. In 2024 International Conference on 3D Vision (3DV), pages 685–696, Los Alamitos, CA, USA, 2024. IEEE Computer Society.

  21. [21] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. arXiv preprint arXiv:2111.14822, 2021.

  22. [22] Tiankai Hang, Shuyang Gu, Chen Li, Jianmin Bao, Dong Chen, Han Hu, Xin Geng, and Baining Guo. Efficient diffusion training via min-snr weighting strategy. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 7441–7451, 2023.

  23. [23] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.

  24. [24] Yisheng He, Xiaodong Gu, Xiaodan Ye, Chao Xu, Zhengyi Zhao, Yuan Dong, Weihao Yuan, Zilong Dong, and Liefeng Bo. Lam: Large avatar model for one-shot animatable gaussian head. In SIGGRAPH, 2025.

  25. [25] Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. Journal of Machine Learning Research, 23(47):1–33, 2022.

  26. [26] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

  27. [27] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.

  28. [28] Taras Khakhulin, Vanessa Sklyarova, Victor Lempitsky, and Egor Zakharov. Realistic one-shot mesh-based head avatars. In European Conference of Computer vision (ECCV), 2022.

  29. [29] Tobias Kirschstein, Shenhan Qian, Simon Giebenhain, Tim Walter, and Matthias Nießner. Nersemble: Multi-view radiance field reconstruction of human heads. ACM Trans. Graph., 42(4), 2023.

  30. [30] Tobias Kirschstein, Simon Giebenhain, Jiapeng Tang, Markos Georgopoulos, and Matthias Nießner. Gghead: Fast and generalizable 3d gaussian heads. arXiv preprint arXiv:2406.09377, 2024.

  31. [31] Yushi Lan, Xuyi Meng, Shuai Yang, Chen Change Loy, and Bo Dai. Self-supervised geometry-aware encoder for style-based 3d gan inversion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20940–20949, 2023.

  32. [32] Alexandros Lattas, Stylianos Moschoglou, Baris Gecer, Stylianos Ploumpis, Vasileios Triantafyllou, Abhijeet Ghosh, and Stefanos Zafeiriou. Avatarme: Realistically renderable 3d facial reconstruction "in-the-wild". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

  33. [33] Alexandros Lattas, Stylianos Moschoglou, Stylianos Ploumpis, Baris Gecer, Abhijeet Ghosh, and Stefanos P Zafeiriou. Avatarme++: Facial shape and brdf inference with photorealistic rendering-aware gans. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.

  34. [34] Jianhui Li, Jianmin Li, Haoji Zhang, Shilong Liu, Zhengyi Wang, Zihao Xiao, Kaiwen Zheng, and Jun Zhu. Preim3d: 3d consistent precise image attribute editing from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8549–8558, 2023.

  35. [35] Jianhui Li, Shilong Liu, Zidong Liu, Yikai Wang, Kaiwen Zheng, Jinghui Xu, Jianmin Li, and Jun Zhu. InstructPix2NeRF: Instructed 3d portrait editing from a single image,

  36. [36] Tianye Li, Timo Bolkart, Michael J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6):194:1–194:17, 2017.

  37. [37] Zhanfeng Liao, Yuelang Xu, Zhe Li, Qijing Li, Boyao Zhou, Ruifeng Bai, Di Xu, Hongwen Zhang, and Yebin Liu. Hhavatar: Gaussian head avatar with dynamic hairs. arXiv e-prints, pages arXiv–2312, 2023.

  38. [38] Jiangke Lin, Yi Yuan, Tianjia Shao, and Kun Zhou. Towards high-fidelity 3d face reconstruction from in-the-wild images using graph convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5891–5900, 2020.

  39. [39] Xin Lin, Jingtong Yue, Kelvin C. K. Chan, Lu Qi, Chao Ren, Jinshan Pan, and Ming-Hsuan Yang. Multi-task image restoration guided by robust dino features, 2024.

  40. [40] Julieta Martinez, Emily Kim, Javier Romero, Timur Bagautdinov, Shunsuke Saito, Shoou-I Yu, Stuart Anderson, Michael Zollhöfer, Te-Li Wang, Shaojie Bai, Chenghui Li, Shih-En Wei, Rohan Joshi, Wyatt Borsos, Tomas Simon, Jason Saragih, Paul Theodosis, Alexander Greene, Anjani Josyula, Silvio Mano Maeta, Andrew I. Jewett, Simon Venshtain, Christopher He...

  41. [41] Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patri...

  42. [42] Roy Or-El, Xuan Luo, Mengyi Shan, Eli Shechtman, Jeong Joon Park, and Ira Kemelmacher-Shlizerman. StyleSDF: High-Resolution 3D-Consistent Image and Geometry Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13503–13513, 2022.

  43. [43] Wamiq Reyaz Para, Abdelrahman Eldesokey, Zhenyu Li, Pradyumna Reddy, Jiankang Deng, and Peter Wonka. Avatarmmc: 3d head avatar generation and editing with multi-modal conditioning, 2024.

  44. [44] Foivos Paraperas Papantoniou, Alexandros Lattas, Stylianos Moschoglou, Jiankang Deng, Bernhard Kainz, and Stefanos Zafeiriou. Arc2face: A foundation model for id-consistent human faces. In Proceedings of the European Conference on Computer Vision (ECCV), 2024.

  45. [45] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023.

  46. [46] Shenhan Qian, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Simon Giebenhain, and Matthias Nießner. Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20299–20309,

  47. [47]

    Learning transferable visual models from natural language supervision, 2021

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. 2, 5, 1

  48. [48]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:...

  49. [49]

    Pivotal tuning for latent-based editing of real images

    Daniel Roich, Ron Mokady, Amit H Bermano, and Daniel Cohen-Or. Pivotal tuning for latent-based editing of real images. ACM Trans. Graph., 2021. 2, 5

  50. [50]

    High-resolution image synthesis with latent diffusion models, 2021

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021. 3

  51. [51]

    Hf-diff: High-frequency perceptual loss and distribution matching for one-step diffusion-based image super-resolution, 2024

    Shoaib Meraj Sami, Md Mahedi Hasan, Jeremy Dawson, and Nasser Nasrabadi. Hf-diff: High-frequency perceptual loss and distribution matching for one-step diffusion-based image super-resolution, 2024. 3

  52. [52]

    Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network

    Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1874–1883, 2016. 4

  53. [53]

    Ide-3d: Interactive disentangled editing for high-resolution 3d-aware portrait synthesis

    Jingxiang Sun, Xuan Wang, Yichun Shi, Lizhen Wang, Jue Wang, and Yebin Liu. Ide-3d: Interactive disentangled editing for high-resolution 3d-aware portrait synthesis. ACM Transactions on Graphics (TOG), 41(6):1–10, 2022. 3

  54. [54]

    Lgm: Large multi-view gaussian model for high-resolution 3d content creation

    Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian model for high-resolution 3d content creation. arXiv preprint arXiv:2402.05054, 2024. 2, 3, 5, 1

  55. [55]

    Faceverse: a fine-grained and detail-controllable 3d face morphable model from a hybrid dataset

    Lizhen Wang, Zhiyuan Chen, Tao Yu, Chenguang Ma, Liang Li, and Yebin Liu. Faceverse: a fine-grained and detail-controllable 3d face morphable model from a hybrid dataset. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR2022), 2022. 2

  56. [56]

    Rodin: A generative model for sculpting 3d digital avatars using diffusion, 2022

    Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, and Baining Guo. Rodin: A generative model for sculpting 3d digital avatars using diffusion, 2022. 3

  57. [57]

    Image quality assessment: from error visibility to structural similarity

    Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004. 5

  58. [58]

    High-fidelity 3d gan inversion by pseudo-multi-view optimization

    Jiaxin Xie, Hao Ouyang, Jingtan Piao, Chenyang Lei, and Qifeng Chen. High-fidelity 3d gan inversion by pseudo-multi-view optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 321–331, 2023. 3

  59. [59]

    Vfhq: A high-quality dataset and benchmark for video face super-resolution

    Liangbin Xie, Xintao Wang, Honglun Zhang, Chao Dong, and Ying Shan. Vfhq: A high-quality dataset and benchmark for video face super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022. 2, 7

  60. [60]

    Grm: Large gaussian reconstruction model for efficient 3d reconstruction and generation, 2024

    Yinghao Xu, Zifan Shi, Yifan Wang, Hansheng Chen, Ceyuan Yang, Sida Peng, Yujun Shen, and Gordon Wetzstein. Grm: Large gaussian reconstruction model for efficient 3d reconstruction and generation, 2024. 4

  61. [61]

    Mtred: 3d reconstruction dataset for fly-over videos of maritime domain

    Picosson Yong and Wiliem. Mtred: 3d reconstruction dataset for fly-over videos of maritime domain. In MaCVi, 2024. 3

  62. [62]

    Rodinhd: High-fidelity 3d avatar generation with diffusion models

    Bowen Zhang, Yiji Cheng, Chunyu Wang, Ting Zhang, Jiaolong Yang, Yansong Tang, Feng Zhao, Dong Chen, and Baining Guo. Rodinhd: High-fidelity 3d avatar generation with diffusion models. arXiv preprint arXiv:2407.06938, 2024. 3

  63. [63]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018. 3, 6

  64. [64]

    General facial representation learning in a visual-linguistic manner

    Yinglin Zheng, Hao Yang, Ting Zhang, Jianmin Bao, Dongdong Chen, Yangyu Huang, Lu Yuan, Dong Chen, Ming Zeng, and Fang Wen. General facial representation learning in a visual-linguistic manner. arXiv preprint arXiv:2112.03109, 2021. 2, 5, 1

    PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing. Supplementary Material

  65. [65]

    Cropping Alignment. We observed that PanoHead [1] uses the tightest (smallest) image crops among all compared methods

    Evaluation Subjects and Processing

    Subjects used for quantitative evaluation:

    • NeRSemble: 059, 070, 370, 373, 374
    • Ava-256:
      – 20220809--1034--BJM420
      – 20220815--1307--BMP511
      – 20220831--0751--CMS162
      – 20230224--1359--CMZ386
      – 20230308--1352--BDF920
      – 20230316--1103--BHK376
      – 20230324--0820--AEY864
      – 20230328--0800--BLY735
      – 20230405--1635--AAN112
      – 20230810--1630--...

  66. [66]

    For each visualization, we run a full forward pass, but control the activation of the cross-attention mechanisms

    Decoder Visualization Protocol

    To understand the information flow in our 3D lifting decoder, we visualize intermediate outputs after each decoder layer. For each visualization, we run a full forward pass, but control the activation of the cross-attention mechanisms. Specifically, to visualize the output after decoder layer i, we keep all cross-attention ...

  67. [67]

    For stylization, users can either upload a reference image or provide a text prompt

    3D Editing Web Application

    Our 3D editing web application allows users to extract a segmentation map from an input image and interactively modify it via drawing. For stylization, users can either upload a reference image or provide a text prompt. In our supplementary demo video, extracting a segmentation map from an image takes 25 seconds, as it involves ...

  68. [68]

    Supplementary Video

    We highly recommend watching our supplementary video, which showcases additional 3D reconstruction orbit views, frame-by-frame 3D video generation, 3D edit orbit sequences, and a live demo of our interactive 3D editing web application.

    Figure 8. Additional Results on Ava-256 [40] and NeRSemble [29]. We present reconstructions across d...