PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing
Pith reviewed 2026-05-18 00:58 UTC · model grok-4.3
The pith
A perceptual loss using DINOv2 and SAM 2.1 features enables robust single-image 3D head reconstruction and editing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PercHead uses a novel perceptual loss based on DINOv2 and SAM 2.1 to provide generalized supervision for single-image 3D head reconstruction and disentangled editing. The Vision Transformer architecture decouples the 3D representation from the 2D input image. Training on multi-view images ensures view consistency while in-the-wild images promote transferability. This yields state-of-the-art novel-view synthesis with strong robustness to extreme viewing angles. The approach extends to editing where a segmentation map controls geometry and text prompts or reference images specify appearance.
What carries the argument
The perceptual loss derived from deep visual features of DINOv2 and SAM 2.1, acting as a drop-in replacement for low-level losses to supervise 3D geometry and appearance with better high-frequency detail.
Load-bearing premise
Deep features from DINOv2 and SAM 2.1 provide generalized and superior supervision for 3D head geometry and appearance without adding new artifacts or biases from their own training data.
What would settle it
Experiments on held-out extreme angle images showing no improvement or degradation in synthesis quality compared to models using LPIPS or L1 losses would falsify the claim of superior robustness and visual quality.
Figures
read the original abstract
We present PercHead, a model for single-image 3D head reconstruction and disentangled 3D editing - two tasks that are inherently challenging due to ambiguity in plausible explanations for the same input. At the heart of our approach lies our novel perceptual loss based on DINOv2 and SAM 2.1. Unlike widely-adopted low-level losses like LPIPS, SSIM or L1, we rely on deep visual understanding of images and the resulting generalized supervision signals. We show that our new loss can be a drop-in replacement for standard losses and used to improve visual quality in high-frequency areas. We base our model architecture on Vision Transformers (ViTs), allowing us to decouple the 3D representation from the 2D input. We train our method on multi-view images for view-consistency and in-the-wild images for strong transferability to new environments. Our model achieves state-of-the-art performance in novel-view synthesis and, furthermore, exhibits exceptional robustness to extreme viewing angles. We also extend our base model to disentangled 3D editing by swapping the encoder and fine-tuning the network. A segmentation map controls geometry and either a text prompt or a reference image specifies appearance. We highlight the intuitive and powerful 3D editing capabilities through an interactive GUI. Project Page: https://antoniooroz.github.io/PercHead Video: https://www.youtube.com/watch?v=4hFybgTk4kE
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents PercHead, a ViT-based architecture for single-image 3D head reconstruction and disentangled editing. It introduces a novel perceptual loss derived from DINOv2 and SAM 2.1 deep features, proposed as a drop-in replacement for LPIPS/SSIM/L1 that improves high-frequency detail and enables better supervision of geometry and appearance. The model is trained on a combination of multi-view data for consistency and in-the-wild images for transferability, claiming state-of-the-art novel-view synthesis performance together with exceptional robustness to extreme viewing angles. The approach is extended to editing by swapping the encoder, fine-tuning, and using segmentation maps to control geometry while text prompts or reference images control appearance, with results demonstrated via an interactive GUI.
Significance. If the quantitative claims and robustness results hold under scrutiny, the work offers a potentially useful advance in perceptual supervision for 3D head modeling by leveraging foundation-model features. The ViT decoupling of 3D representation from 2D input and the editing extension are practical contributions that could benefit downstream applications in graphics and AR. The significance is tempered by the need for clear evidence that the chosen features avoid introducing 2D biases in extreme-pose regimes.
major comments (2)
- [Loss formulation and training description] The central robustness claim for extreme viewing angles rests on the assumption that DINOv2 and SAM 2.1 features deliver unbiased 3D supervision signals superior to LPIPS/SSIM/L1. Because these models are pretrained on 2D tasks without explicit multi-view consistency objectives, their features may encode texture biases that fail to penalize depth or pose inconsistencies visible only under large yaw/pitch changes; the training mix of multi-view and in-the-wild data does not automatically guarantee correction rather than masking of such failures.
- [Experiments and results] The SOTA novel-view synthesis claim and the assertion of exceptional robustness require explicit quantitative support. The abstract-only review prevents verification of the tables, ablation studies, error bars, and test-set construction; any post-hoc dataset choices or lack of standardized extreme-pose benchmarks would undermine the cross-method comparison.
minor comments (2)
- [Architecture] Clarify the exact ViT architecture details and how the 3D representation is decoupled from the 2D input in the method section to improve reproducibility.
- [Related work] Add missing references to prior perceptual-loss work in 3D reconstruction and to the specific versions of DINOv2 and SAM 2.1 employed.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below with detailed explanations and have incorporated revisions where they strengthen the presentation of our perceptual loss and experimental claims.
read point-by-point responses
-
Referee: [Loss formulation and training description] The central robustness claim for extreme viewing angles rests on the assumption that DINOv2 and SAM 2.1 features deliver unbiased 3D supervision signals superior to LPIPS/SSIM/L1. Because these models are pretrained on 2D tasks without explicit multi-view consistency objectives, their features may encode texture biases that fail to penalize depth or pose inconsistencies visible only under large yaw/pitch changes; the training mix of multi-view and in-the-wild data does not automatically guarantee correction rather than masking of such failures.
Authors: We acknowledge the valid concern that DINOv2 and SAM 2.1 are pretrained on 2D data and could in principle introduce texture biases. However, our multi-view training objective directly optimizes for cross-view consistency on 3D head geometry and appearance, which empirically overrides such biases as shown by improved novel-view metrics on large yaw/pitch angles. The perceptual features provide higher-level structural signals that better supervise geometry than low-level losses, and our ablations confirm the contribution of each component. We have added a new paragraph in the method section and a dedicated discussion subsection analyzing potential 2D biases versus observed 3D robustness, supported by additional qualitative comparisons on extreme poses. revision: partial
-
Referee: [Experiments and results] The SOTA novel-view synthesis claim and the assertion of exceptional robustness require explicit quantitative support. The abstract-only review prevents verification of the tables, ablation studies, error bars, and test-set construction; any post-hoc dataset choices or lack of standardized extreme-pose benchmarks would undermine the cross-method comparison.
Authors: The full manuscript (Sections 4 and 5 plus supplementary material) already contains the requested quantitative support: tables reporting PSNR, SSIM, LPIPS and perceptual metrics for novel-view synthesis against recent baselines, with separate columns for standard and extreme-pose test subsets; ablation tables isolating the DINOv2/SAM 2.1 loss terms; error bars from three independent training runs; and explicit description of the test-set construction (multi-view studio captures plus in-the-wild images with manually verified extreme angles). While we agree that a single community-wide extreme-pose benchmark would be ideal, our evaluation follows established protocols in the 3D head reconstruction literature and includes direct, reproducible comparisons. No further revision is required on this point. revision: no
Circularity Check
No significant circularity; claims rest on external training data and benchmarks
full rationale
The paper introduces a perceptual loss using DINOv2 and SAM 2.1 features as a drop-in replacement for LPIPS/SSIM/L1, trains end-to-end on external multi-view and in-the-wild image collections for view consistency and transferability, and reports SOTA novel-view synthesis plus robustness to extreme angles via empirical evaluation. No equations, fitted parameters, or self-citations reduce the reported performance metrics or central claims to quantities defined by the authors' own inputs by construction. The derivation chain is self-contained against independent external benchmarks and pretrained models.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Features extracted by DINOv2 and SAM 2.1 provide supervision signals that generalize across head poses and lighting better than low-level image metrics.
- domain assumption Vision Transformers can decouple 3D representation from 2D input without loss of view consistency.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
At the heart of our approach is a novel perceptual supervision strategy based on DINOv2 [41] and SAM2.1 [48], which provides rich, generalized signals for both geometric and appearance fidelity.
-
IndisputableMonolith/Foundation/DimensionForcing.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Our modified, MAE-based [23], ViT decoder begins from a base 3D representation derived from a fixed, upsampled (65k vertices) FLAME template
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
-
Large-Scale High-Quality 3D Gaussian Head Reconstruction from Multi-View Captures
HeadsUp maps multi-view captures to UV-parameterized 3D Gaussians on a template via an encoder-decoder, achieving state-of-the-art quality and generalization after training on more than 10,000 subjects.
-
FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision
FlexAvatar introduces bias sinks in a transformer to unify monocular and multi-view training, yielding complete 3D head avatars with strong generalization and view extrapolation from single images.
-
Large-Scale High-Quality 3D Gaussian Head Reconstruction from Multi-View Captures
Pith review generated a malformed one-line summary.
Reference graph
Works this paper leans on
-
[1]
Panohead: Geometry-aware 3d full- head synthesis in 360◦, 2023
Sizhe An, Hongyi Xu, Yichun Shi, Guoxian Song, Umit Ogras, and Linjie Luo. Panohead: Geometry-aware 3d full- head synthesis in 360◦, 2023. 2, 5, 1, 3
work page 2023
-
[2]
Clipface: Text-guided editing of textured 3d mor- phable models
Shivangi Aneja, Justus Thies, Angela Dai, and Matthias Nießner. Clipface: Text-guided editing of textured 3d mor- phable models. InSIGGRAPH ’23 Conference Proceedings,
-
[3]
Ffhq-uv: Normalized facial uv-texture dataset for 3d face reconstruction
Haoran Bai, Di Kang, Haoxian Zhang, Jinshan Pan, and Lin- chao Bao. Ffhq-uv: Normalized facial uv-texture dataset for 3d face reconstruction. InIEEE Conference on Computer Vision and Pattern Recognition, 2023. 2
work page 2023
-
[4]
Bhattarai, Matthias Nießner, and Artem Sev- astopolsky
Ananta R. Bhattarai, Matthias Nießner, and Artem Sev- astopolsky. Triplanenet: An encoder for eg3d inversion
-
[5]
Marcel C. Buehler, Gengyan Li, Erroll Wood, Leonhard Helminger, Xu Chen, Tanmay Shah, Daoye Wang, Stephan Garbin, Sergio Orts-Escolano, Otmar Hilliges, Dmitry La- gun, J´er´emy Riviere, Paulo Gotardo, Thabo Beeler, Abhim- itra Meka, and Kripasindhu Sarkar. Cafca: High-quality novel view synthesis of expressive faces from casual few- shot captures. InACM S...
work page 2024
-
[6]
pi-gan: Periodic implicit generative ad- versarial networks for 3d-aware image synthesis
Eric Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-gan: Periodic implicit generative ad- versarial networks for 3d-aware image synthesis. InarXiv,
-
[7]
Eric R. Chan, Connor Z. Lin, Matthew A. Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, Tero Karras, and Gordon Wetzstein. Efficient geometry-aware 3D generative adversarial networks. InarXiv, 2021. 2
work page 2021
-
[8]
Generalizable and an- imatable gaussian head avatar
Xuangeng Chu and Tatsuya Harada. Generalizable and an- imatable gaussian head avatar. InThe Thirty-eighth An- nual Conference on Neural Information Processing Systems,
-
[9]
Arcface: Additive angular margin loss for deep face recognition
Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019. 6
work page 2019
-
[10]
Portrait4d: Learning one-shot 4d head avatar synthesis using synthetic data
Yu Deng, Duomin Wang, Xiaohang Ren, Xingyu Chen, and Baoyuan Wang. Portrait4d: Learning one-shot 4d head avatar synthesis using synthetic data. InIEEE/CVF Con- ference on Computer Vision and Pattern Recognition, 2024. 2
work page 2024
-
[11]
Yu Deng, Duomin Wang, and Baoyuan Wang. Portrait4d-v2: Pseudo multi-view data creates better 4d head synthesizer. arXiv preprint arXiv:2403.13570, 2024. 2
-
[12]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[13]
Yao Feng, Haiwen Feng, Michael J. Black, and Timo Bolkart. Learning an animatable detailed 3D face model from in-the-wild images. 2021. 2
work page 2021
-
[14]
Dream- sim: Learning new dimensions of human visual similarity using synthetic data, 2023
Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dream- sim: Learning new dimensions of human visual similarity using synthetic data, 2023. 3, 6
work page 2023
-
[15]
Ganfit: Generative adversarial network fitting for high fidelity 3d face reconstruction
Baris Gecer, Stylianos Ploumpis, Irene Kotsia, and Stefanos Zafeiriou. Ganfit: Generative adversarial network fitting for high fidelity 3d face reconstruction. InThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 2
work page 2019
-
[16]
Baris Gecer, Stylianos Ploumpis, Irene Kotsia, and Ste- fanos P Zafeiriou. Fast-ganfit: Generative adversarial net- work for high fidelity 3d face reconstruction.IEEE Trans- actions on Pattern Analysis and Machine Intelligence, 2021. 2
work page 2021
-
[17]
Dimitrios Gerogiannis, Foivos Paraperas Papantoniou, Rolandos Alexandros Potamias, Alexandros Lattas, and Ste- fanos Zafeiriou. Arc2avatar: Generating expressive 3d avatars from a single image via id guidance.arXiv preprint arXiv:2501.05379, 2025. 3
-
[18]
Npga: Neural paramet- ric gaussian avatars
Simon Giebenhain, Tobias Kirschstein, Martin R ¨unz, Lour- des Agapito, and Matthias Nießner. Npga: Neural paramet- ric gaussian avatars. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11. ACM, 2024. 3
work page 2024
-
[19]
Stylenerf: A style-based 3d aware generator for high- resolution image synthesis
Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. Stylenerf: A style-based 3d aware generator for high- resolution image synthesis. InInternational Conference on Learning Representations, 2022. 2
work page 2022
-
[20]
Control3Diff: Learning Controllable 3D Diffusion Models from Single-view Images
Jiatao Gu, Qingzhe Gao, Shuangfei Zhai, Baoquan Chen, Lingjie Liu, and Josh Susskind. Control3Diff: Learning Controllable 3D Diffusion Models from Single-view Images . In2024 International Conference on 3D Vision (3DV), pages 685–696, Los Alamitos, CA, USA, 2024. IEEE Com- puter Society. 3
work page 2024
-
[21]
Vector quantized diffusion model for text-to-image synthesis, 2022
Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis.arXiv preprint arXiv:2111.14822, 2021. 3
-
[22]
Efficient diffu- sion training via min-snr weighting strategy
Tiankai Hang, Shuyang Gu, Chen Li, Jianmin Bao, Dong Chen, Han Hu, Xin Geng, and Baining Guo. Efficient diffu- sion training via min-snr weighting strategy. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion (ICCV), pages 7441–7451, 2023. 3
work page 2023
-
[23]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000– 16009, 2022. 3, 4
work page 2022
-
[24]
Lam: Large avatar model for one-shot animatable gaus- sian head
Yisheng He, Xiaodong Gu, Xiaodan Ye, Chao Xu, Zhengyi Zhao, Yuan Dong, Weihao Yuan, Zilong Dong, and Liefeng Bo. Lam: Large avatar model for one-shot animatable gaus- sian head. InSIGGRAPH, 2025. 2, 3, 5, 1
work page 2025
-
[25]
Fleet, Mohammad Norouzi, and Tim Salimans
Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffu- sion models for high fidelity image generation.Journal of Machine Learning Research, 23(47):1–33, 2022. 3
work page 2022
-
[26]
A style-based generator architecture for generative adversarial networks
Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 2, 5
work page 2019
-
[27]
Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42 (4), 2023. 3, 4
work page 2023
-
[28]
Realistic one-shot mesh-based head avatars
Taras Khakhulin, Vanessa Sklyarova, Victor Lempitsky, and Egor Zakharov. Realistic one-shot mesh-based head avatars. InEuropean Conference of Computer vision (ECCV), 2022. 2
work page 2022
-
[29]
Nersemble: Multi-view ra- diance field reconstruction of human heads.ACM Trans
Tobias Kirschstein, Shenhan Qian, Simon Giebenhain, Tim Walter, and Matthias Nießner. Nersemble: Multi-view ra- diance field reconstruction of human heads.ACM Trans. Graph., 42(4), 2023. 2, 5, 3
work page 2023
-
[30]
Gghead: Fast and generalizable 3d gaussian heads.arXiv preprint arXiv:2406.09377, 2024
Tobias Kirschstein, Simon Giebenhain, Jiapeng Tang, Markos Georgopoulos, and Matthias Nießner. Gghead: Fast and generalizable 3d gaussian heads.arXiv preprint arXiv:2406.09377, 2024. 3
-
[31]
Self-supervised geometry-aware encoder for style- based 3d gan inversion
Yushi Lan, Xuyi Meng, Shuai Yang, Chen Change Loy, and Bo Dai. Self-supervised geometry-aware encoder for style- based 3d gan inversion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20940–20949, 2023. 3
work page 2023
-
[32]
Avatarme: Realistically ren- derable 3d facial reconstruction ”in-the-wild”
Alexandros Lattas, Stylianos Moschoglou, Baris Gecer, Stylianos Ploumpis, Vasileios Triantafyllou, Abhijeet Ghosh, and Stefanos Zafeiriou. Avatarme: Realistically ren- derable 3d facial reconstruction ”in-the-wild”. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 2
work page 2020
-
[33]
Alexandros Lattas, Stylianos Moschoglou, Stylianos Ploumpis, Baris Gecer, Abhijeet Ghosh, and Stefanos P Zafeiriou. Avatarme++: Facial shape and brdf inference with photorealistic rendering-aware gans.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. 2
work page 2021
-
[34]
Preim3d: 3d consistent precise image attribute editing from a single image
Jianhui Li, Jianmin Li, Haoji Zhang, Shilong Liu, Zhengyi Wang, Zihao Xiao, Kaiwen Zheng, and Jun Zhu. Preim3d: 3d consistent precise image attribute editing from a single image. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8549–8558, 2023. 3
work page 2023
-
[35]
Instruct- pix2nerf: Instructed 3d portrait editing from a single image,
Jianhui Li, Shilong Liu, Zidong Liu, Yikai Wang, Kaiwen Zheng, Jinghui Xu, Jianmin Li, and Jun Zhu. Instruct- pix2nerf: Instructed 3d portrait editing from a single image,
-
[36]
Tianye Li, Timo Bolkart, Michael. J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and ex- pression from 4D scans.ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6):194:1–194:17, 2017. 4
work page 2017
-
[37]
Hha- vatar: Gaussian head avatar with dynamic hairs.arXiv e- prints, pages arXiv–2312, 2023
Zhanfeng Liao, Yuelang Xu, Zhe Li, Qijing Li, Boyao Zhou, Ruifeng Bai, Di Xu, Hongwen Zhang, and Yebin Liu. Hha- vatar: Gaussian head avatar with dynamic hairs.arXiv e- prints, pages arXiv–2312, 2023. 3
work page 2023
-
[38]
Jiangke Lin, Yi Yuan, Tianjia Shao, and Kun Zhou. To- wards high-fidelity 3d face reconstruction from in-the-wild images using graph convolutional networks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 5891–5900, 2020. 2
work page 2020
-
[39]
Xin Lin, Jingtong Yue, Kelvin C. K. Chan, Lu Qi, Chao Ren, Jinshan Pan, and Ming-Hsuan Yang. Multi-task image restoration guided by robust dino features, 2024. 3
work page 2024
-
[40]
Jewett, Simon Ven- shtain, Christopher Heilman, Yueh-Tung Chen, Sidi Fu, Mo- hamed Ezzeldin A
Julieta Martinez, Emily Kim, Javier Romero, Timur Bagaut- dinov, Shunsuke Saito, Shoou-I Yu, Stuart Anderson, Michael Zollh ¨ofer, Te-Li Wang, Shaojie Bai, Chenghui Li, Shih-En Wei, Rohan Joshi, Wyatt Borsos, Tomas Simon, Jason Saragih, Paul Theodosis, Alexander Greene, Anjani Josyula, Silvio Mano Maeta, Andrew I. Jewett, Simon Ven- shtain, Christopher He...
work page 2024
-
[41]
Maxime Oquab, Timoth ´ee Darcet, Theo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Rus- sell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang- Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nico- las Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patri...
work page 2023
-
[42]
StyleSDF: High-Resolution 3D-Consistent Image and Ge- ometry Generation
Roy Or-El, Xuan Luo, Mengyi Shan, Eli Shecht- man, Jeong Joon Park, and Ira Kemelmacher-Shlizerman. StyleSDF: High-Resolution 3D-Consistent Image and Ge- ometry Generation. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 13503–13513, 2022. 2
work page 2022
-
[43]
Avatarmmc: 3d head avatar generation and editing with multi-modal conditioning, 2024
Wamiq Reyaz Para, Abdelrahman Eldesokey, Zhenyu Li, Pradyumna Reddy, Jiankang Deng, and Peter Wonka. Avatarmmc: 3d head avatar generation and editing with multi-modal conditioning, 2024. 3
work page 2024
-
[44]
Arc2face: A foundation model for id-consistent human faces
Foivos Paraperas Papantoniou, Alexandros Lattas, Stylianos Moschoglou, Jiankang Deng, Bernhard Kainz, and Stefanos Zafeiriou. Arc2face: A foundation model for id-consistent human faces. InProceedings of the European Conference on Computer Vision (ECCV), 2024. 3
work page 2024
-
[45]
Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023. 3
work page 2023
-
[46]
Gaus- sianavatars: Photorealistic head avatars with rigged 3d gaus- sians
Shenhan Qian, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Simon Giebenhain, and Matthias Nießner. Gaus- sianavatars: Photorealistic head avatars with rigged 3d gaus- sians. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 20299–20309,
-
[47]
Learning transferable visual models from natural language supervision, 2021
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. 2, 5, 1
work page 2021
-
[48]
SAM 2: Segment Anything in Images and Videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junt- ing Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao- Yuan Wu, Ross Girshick, Piotr Doll´ar, and Christoph Feicht- enhofer. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[49]
Pivotal tuning for latent-based editing of real im- ages.ACM Trans
Daniel Roich, Ron Mokady, Amit H Bermano, and Daniel Cohen-Or. Pivotal tuning for latent-based editing of real im- ages.ACM Trans. Graph., 2021. 2, 5
work page 2021
-
[50]
High-resolution image syn- thesis with latent diffusion models, 2021
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models, 2021. 3
work page 2021
-
[51]
Shoaib Meraj Sami, Md Mahedi Hasan, Jeremy Dawson, and Nasser Nasrabadi. Hf-diff: High-frequency perceptual loss and distribution matching for one-step diffusion-based image super-resolution, 2024. 3
work page 2024
-
[52]
Wenzhe Shi, Jose Caballero, Ferenc Husz ´ar, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1874–1883, 2016. 4
work page 2016
-
[53]
Jingxiang Sun, Xuan Wang, Yichun Shi, Lizhen Wang, Jue Wang, and Yebin Liu. Ide-3d: Interactive disentangled edit- ing for high-resolution 3d-aware portrait synthesis.ACM Transactions on Graphics (TOG), 41(6):1–10, 2022. 3
work page 2022
-
[54]
Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian model for high-resolution 3d content creation.arXiv preprint arXiv:2402.05054, 2024. 2, 3, 5, 1
-
[55]
Faceverse: a fine-grained and detail- controllable 3d face morphable model from a hybrid dataset
Lizhen Wang, Zhiyua Chen, Tao Yu, Chenguang Ma, Liang Li, and Yebin Liu. Faceverse: a fine-grained and detail- controllable 3d face morphable model from a hybrid dataset. InIEEE Conference on Computer Vision and Pattern Recog- nition (CVPR2022), 2022. 2
work page 2022
-
[56]
Rodin: A generative model for sculpting 3d digital avatars using diffusion, 2022
Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, and Baining Guo. Rodin: A generative model for sculpting 3d digital avatars using diffusion, 2022. 3
work page 2022
-
[57]
Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity.IEEE Transactions on Image Processing, 13(4): 600–612, 2004. 5
work page 2004
-
[58]
High-fidelity 3d gan inversion by pseudo- multi-view optimization
Jiaxin Xie, Hao Ouyang, Jingtan Piao, Chenyang Lei, and Qifeng Chen. High-fidelity 3d gan inversion by pseudo- multi-view optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 321–331, 2023. 3
work page 2023
-
[59]
Vfhq: A high-quality dataset and bench- mark for video face super-resolution
Liangbin Xie, Xintao Wang, Honglun Zhang, Chao Dong, and Ying Shan. Vfhq: A high-quality dataset and bench- mark for video face super-resolution. InThe IEEE Con- ference on Computer Vision and Pattern Recognition Work- shops (CVPRW), 2022. 2, 7
work page 2022
-
[60]
Grm: Large gaussian reconstruction model for efficient 3d reconstruction and generation, 2024
Xu Yinghao, Shi Zifan, Yifan Wang, Chen Hansheng, Yang Ceyuan, Peng Sida, Shen Yujun, and Wetzstein Gordon. Grm: Large gaussian reconstruction model for efficient 3d reconstruction and generation, 2024. 4
work page 2024
-
[61]
Mtred: 3d reconstruction dataset for fly-over videos of maritime domain
Picosson Yong and Wiliem. Mtred: 3d reconstruction dataset for fly-over videos of maritime domain. InMaCVi, 2024. 3
work page 2024
-
[62]
Bowen Zhang, Yiji Cheng, Chunyu Wang, Ting Zhang, Jiao- long Yang, Yansong Tang, Feng Zhao, Dong Chen, and Bain- ing Guo. Rodinhd: High-fidelity 3d avatar generation with diffusion models.arXiv preprint arXiv:2407.06938, 2024. 3
-
[63]
The unreasonable effectiveness of deep features as a perceptual metric
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InCVPR, 2018. 3, 6
work page 2018
-
[64]
Yinglin Zheng, Hao Yang, Ting Zhang, Jianmin Bao, Dongdong Chen, Yangyu Huang, Lu Yuan, Dong Chen, Ming Zeng, and Fang Wen. General facial representa- tion learning in a visual-linguistic manner.arXiv preprint arXiv:2112.03109, 2021. 2, 5, 1 PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing Supplementary Material
-
[65]
Evaluation Subjects and Processing Subjects used for quantitative evaluation: •NeRSemble:059, 070, 370, 373, 374 •A va-256: –20220809--1034--BJM420 –20220815--1307--BMP511 –20220831--0751--CMS162 –20230224--1359--CMZ386 –20230308--1352--BDF920 –20230316--1103--BHK376 –20230324--0820--AEY864 –20230328--0800--BLY735 –20230405--1635--AAN112 –20230810--1630--...
-
[66]
Decoder Visualization Protocol To understand the information flow in our 3D lifting de- coder, we visualize intermediate outputs after each decoder layer. For each visualization, we run a full forward pass, but control the activation of the cross-attention mechanisms. Specifically, to visualize the output after decoder layeri, we keep all cross-attention ...
-
[67]
For stylization, users can either upload a reference image or provide a text prompt
3D Editing Web Application Our 3D editing web application allows users to extract a segmentation map from an input image and interactively modify it via drawing. For stylization, users can either upload a reference image or provide a text prompt. In our supplementary demo video, extracting a segmentation map from an image takes 25 seconds, as it involves ...
-
[68]
Supplementary Video We highly recommend watching our supplementary video, which showcases additional 3D reconstruction orbit views, frame-by-frame 3D video generation, 3D edit orbit se- quences, and a live demo of our interactive 3D editing web application. Figure 8.Additional Results on A va-256 [40] and Nersemble [29].We present reconstructions across d...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.