arxiv: 2606.28144 · v1 · pith:3VXVB2DVnew · submitted 2026-06-26 · 💻 cs.CV

Monocular Avatar Reconstruction via Cascaded Diffusion Priors and UV-Space Differentiable Shading

Pith reviewed 2026-06-29 04:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords monocular avatar reconstruction · diffusion priors · LoRA adaptation · UV space · PBR material decomposition · differentiable shading · relightable 3D · texture completion

The pith

Cascaded LoRAs adapt a diffusion backbone in UV space to reconstruct relightable 4K PBR avatars from a single image after training on under 100 scans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a data-efficient framework that adapts a pre-trained diffusion model through sequential Low-Rank Adaptations to handle texture completion, lighting removal, and material decomposition for monocular 3D avatar reconstruction. It operates entirely in UV space and applies a differentiable BRDF shading loss to enforce physical consistency with the rendering equation. This yields comprehensive PBR assets including albedo, normals, roughness, specular, and displacement maps. A sympathetic reader would care because the approach sidesteps the need for large proprietary 3D datasets while producing relightable results from everyday photos.

Core claim

Sequential LoRA adaptations on a unified diffusion backbone first complete occluded UV textures via inpainting, then apply light homogenization and cross-intrinsic attention to synthesize pixel-aligned PBR maps, with a UV-space differentiable BRDF shading loss ensuring the outputs satisfy the rendering equation and produce 4K assets that generalize better than prior methods when trained on fewer than 100 real scans.
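
To make the idea of a UV-space shading loss concrete, here is a minimal sketch in plain Python. It is not the paper's formulation: the abstract gives no BRDF or loss definition, so the Lambertian-plus-Blinn-Phong lobe, the roughness-to-exponent mapping, the single directional light, and every name below (`shade_texel`, `uv_shading_loss`) are assumptions made for illustration. In practice the same expression would be written in an autodiff framework so gradients reach the predicted maps.

```python
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def shade_texel(albedo, normal, roughness, specular, light_dir, view_dir, light_rgb):
    """Shade one UV texel under one directional light (stand-in BRDF, assumed)."""
    n, l, v = normalize(normal), normalize(light_dir), normalize(view_dir)
    n_dot_l = max(dot(n, l), 0.0)
    h = normalize([a + b for a, b in zip(l, v)])      # half vector
    shininess = 2.0 / max(roughness ** 2, 1e-4)       # assumed roughness mapping
    spec = specular * max(dot(n, h), 0.0) ** shininess
    # Diffuse term (albedo / pi) plus specular lobe, scaled by incoming light.
    return [li * n_dot_l * (a / math.pi + spec) for a, li in zip(albedo, light_rgb)]

def uv_shading_loss(maps, target, light_dir, view_dir, light_rgb):
    """Mean squared error between shaded texels and observed texel colors."""
    total, count = 0.0, 0
    for texel, tgt in zip(maps, target):
        pred = shade_texel(texel["albedo"], texel["normal"], texel["roughness"],
                           texel["specular"], light_dir, view_dir, light_rgb)
        total += sum((p - t) ** 2 for p, t in zip(pred, tgt))
        count += len(tgt)
    return total / count

# Usage: a one-texel "UV map" whose target was shaded from the same parameters.
texel = {"albedo": [0.6, 0.5, 0.4], "normal": [0.0, 0.0, 1.0],
         "roughness": 0.5, "specular": 0.04}
light_dir, view_dir, light_rgb = [0.0, 0.6, 0.8], [0.0, 0.0, 1.0], [3.0, 3.0, 3.0]
target = [shade_texel(light_dir=light_dir, view_dir=view_dir,
                      light_rgb=light_rgb, **texel)]
print(uv_shading_loss([texel], target, light_dir, view_dir, light_rgb))
```

Because every texel is shaded independently from its own map values, no rasterization step is involved, which is the property the abstract attributes to working in UV space.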

What carries the argument

Cascaded Low-Rank Adaptations (LoRAs) for inpainting, light-homogenization, and material decomposition in UV space, augmented by cross-intrinsic attention and supervised by a differentiable BRDF shading loss.
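
For readers unfamiliar with the mechanism, the following is a minimal sketch of a single LoRA-adapted linear layer, following the general recipe of the LoRA paper (frozen weight, trainable low-rank factors, zero-initialized `B`). How the three adapters in this paper are stacked, merged, or swapped is not stated in the abstract; the class name `LoRALinear` and the toy sizes are illustrative assumptions.

```python
import random

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

class LoRALinear:
    """y = W x + (alpha / r) * B (A x), with W frozen and only A, B trainable."""

    def __init__(self, W, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W                                    # frozen pre-trained weight
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)] # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))     # low-rank update path
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

# Usage: a 4x4 identity "backbone" weight with a rank-1 adapter.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(W, rank=1, alpha=1.0)
print(layer.forward([1.0, 2.0, 3.0, 4.0]), layer.trainable_params())
```

With `B` initialized to zero the adapted layer reproduces the backbone exactly before training, which is one reason low-rank adaptation is a plausible way to preserve a pre-trained prior when data is scarce.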

If this is right

  • The method produces complete 4K PBR maps for albedo, normal, roughness, specular, and displacement.
  • Assets exhibit superior realism and generalization compared with existing state-of-the-art pipelines.
  • Reconstruction succeeds from a single in-the-wild image without reliance on large proprietary 3D datasets.
  • Physical plausibility follows directly from enforcing the rendering equation during the decomposition stage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The cascaded adaptation pattern could be tested on other inverse rendering tasks that currently lack paired PBR data.
  • Public release of the weights may enable rapid integration into consumer 3D content tools.
  • Extending the same UV-space loss to video inputs could be a direct next experiment for temporal consistency.

Load-bearing premise

The pre-trained diffusion backbone's semantic priors stay sufficiently intact after the sequential LoRA adaptations for the generated UV textures and PBR maps to remain pixel-aligned and physically consistent under only the shading loss.

What would settle it

If renders of the output PBR assets under new lighting deviate from expected material behavior, or fail to match the input photo's appearance, the decomposition does not hold.

Figures

Figures reproduced from arXiv: 2606.28144 by Baochang Zhang, Chaojie Yang, Chongjie Ye, Guojun Lei, Hao Zhao, Hong Li, Houyuan Chen, Minqi Meng, Weiqing Xiao, Xianda Guo, Xuhui Liu, Yanjun Liang, Yanlun Peng.

Figure 1: High-Fidelity 3D Avatar Reconstruction. From a single input image, we reconstruct high-fidelity 3D geometry and PBR materials (albedo, normal, packed maps) to enable relightable avatar synthesis. As shown in the novel lighting results (right), our method accurately recovers fine details (e.g., wrinkles, moles) whilst maintaining identity consistency.
Figure 2: Overview of our high-fidelity 3D avatar reconstruction pipeline.
Figure 3: Synthetic Data Generation Pipeline. Starting from an image I_wild, we reconstruct geometry G and compute visibility mask M_vis. We assign high-quality PBR textures T_PBR to G.
[Diagram labels: Pose, SH Coefs, DINOv3, ConvNext V2, Multi-Scale Space Fusion Φ]
Figure 5: Training pipeline for texture generation.
Figure 6: Comparison of 3D face reconstruction methods
Figure 7: Texture inpainting and light homogenization.
Figure 8: Qualitative comparison of avatar relighting under novel environ
Figure 9: Comparison with ChatAvatar. Geometry (Left):
Figure 10: Texture Reconstruction ablation study.
… relying on screen-space supervision. Trained on fewer than 100 real 3D scans, our framework achieves state-of-the-art performance in recovering photorealistic, relightable, and geometrically consistent avatars. Limitations. Our method targets relightable face and head PBR assets; it does not reconstruct full bodies, dynamic hair, clothing, or accessories. Treating ha…
Original abstract

Reconstructing high-fidelity, relightable 3D avatars from a single in-the-wild image is a challenging ill-posed problem, primarily hindered by the scarcity of high-quality PBR data and the complexity of disentangling illumination from intrinsic materials. In this paper, we present a data-efficient framework that leverages the robust priors of a unified pre-trained diffusion backbone to sequentially address texture completion, delighting, and material decomposition. Unlike existing methods that rely on fragmented pipelines or extensive proprietary datasets, we utilize cascaded Low-Rank Adaptations (LoRAs) to adapt the strong generative prior of the diffusion model for each sub-task in UV space. Specifically, we first employ an Inpainting LoRA to complete missing UV textures caused by occlusion, leveraging the model's semantic understanding to generate semantically and photometrically coherent details. Subsequently, a Light-Homogenization LoRA and a novel Cross-Intrinsic Attention mechanism are introduced to remove baked-in lighting and collaboratively synthesize pixel-aligned PBR maps (Albedo, Normal, Roughness, Specular, and Displacement). To ensure physical plausibility, we impose a UV-space differentiable BRDF shading loss during the decomposition stage, forcing the generative process to adhere to the rendering equation without the artifacts typical of rasterization-based supervision. Extensive experiments demonstrate that our method, trained on fewer than 100 real 3D scans, generates comprehensive, 4K-resolution PBR assets with superior realism and generalization compared to state-of-the-art methods, and all training code and model weights will be released upon acceptance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper presents a data-efficient framework for monocular reconstruction of high-fidelity, relightable 3D avatars from a single in-the-wild image. It uses cascaded Low-Rank Adaptations (LoRAs) on a pre-trained diffusion backbone to sequentially perform UV-space texture completion (Inpainting LoRA), light homogenization, and material decomposition into PBR maps (Albedo, Normal, Roughness, Specular, Displacement) via a novel Cross-Intrinsic Attention mechanism, supervised by a UV-space differentiable BRDF shading loss to enforce adherence to the rendering equation. The method claims to generate comprehensive 4K-resolution PBR assets with superior realism and generalization when trained on fewer than 100 real 3D scans.

Significance. If the central claims hold with supporting evidence, the work would offer a notable advance in data-efficient monocular avatar reconstruction by leveraging unified diffusion priors through sequential adaptations rather than fragmented pipelines or large proprietary datasets, potentially enabling more accessible production of relightable PBR assets for graphics and vision applications.

major comments (2)
  1. [Abstract] Abstract: the claim of 'superior realism and generalization compared to state-of-the-art methods' is asserted without any quantitative metrics, baseline comparisons, ablation studies, or error analysis, directly undermining evaluation of the central claim that the cascaded LoRA pipeline succeeds with <100 scans.
  2. [Abstract] Abstract (description of decomposition stage): the physical-plausibility claim rests on the UV-space differentiable BRDF shading loss forcing adherence to the rendering equation, yet no implementation details, formulation, or verification are supplied that this loss (which penalizes only rendered appearance) can enforce mutual pixel alignment and intrinsic uniqueness across the generated PBR maps after three sequential LoRA adaptations.
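
The second comment can be illustrated with a toy calculation. The sketch below assumes a single gray texel observed under one known light and one view, and a stand-in shading model of the form light × (n·l) × (albedo/π + specular term); none of this comes from the paper, and the names `render` and `appearance_loss` are hypothetical. Under those assumptions, two different decompositions produce the same rendered value, so an appearance-only loss cannot tell them apart.

```python
import math

def render(albedo, spec_term, n_dot_l, light):
    """Stand-in shading for one gray texel: diffuse plus a fixed specular term."""
    return light * n_dot_l * (albedo / math.pi + spec_term)

def appearance_loss(pred, observed):
    return (pred - observed) ** 2

# Observation generated by a purely diffuse material.
observed = render(albedo=0.6, spec_term=0.0, n_dot_l=0.8, light=3.0)

# Two candidate decompositions: the second trades diffuse albedo for specular.
decomp_a = {"albedo": 0.6, "spec_term": 0.0}
decomp_b = {"albedo": 0.6 - math.pi * 0.05, "spec_term": 0.05}

loss_a = appearance_loss(render(n_dot_l=0.8, light=3.0, **decomp_a), observed)
loss_b = appearance_loss(render(n_dot_l=0.8, light=3.0, **decomp_b), observed)
print(loss_a, loss_b)  # both are zero up to floating-point rounding
```

Whether this ambiguity affects the paper depends on what else supervises the maps (for example ground-truth PBR textures or multiple lighting conditions), which the abstract does not say.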

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We address the major comments point-by-point below and propose revisions where they strengthen the presentation of our work.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'superior realism and generalization compared to state-of-the-art methods' is asserted without any quantitative metrics, baseline comparisons, ablation studies, or error analysis, directly undermining evaluation of the central claim that the cascaded LoRA pipeline succeeds with <100 scans.

    Authors: The abstract serves as a high-level overview. The full manuscript provides quantitative metrics, baseline comparisons, ablation studies, and error analysis in the dedicated Experiments section, supporting the data-efficiency claim with fewer than 100 scans. We will update the abstract to include specific quantitative highlights from our results. revision: yes

  2. Referee: [Abstract] Abstract (description of decomposition stage): the physical-plausibility claim rests on the UV-space differentiable BRDF shading loss forcing adherence to the rendering equation, yet no implementation details, formulation, or verification are supplied that this loss (which penalizes only rendered appearance) can enforce mutual pixel alignment and intrinsic uniqueness across the generated PBR maps after three sequential LoRA adaptations.

    Authors: Details on the loss formulation, implementation, and verification are provided in the Methods and Experiments sections of the manuscript. The loss is designed to enforce adherence to the rendering equation through differentiable shading in UV space. We acknowledge the referee's point on the potential need for more explicit verification of uniqueness and will add further analysis or clarification in the revised manuscript. revision: partial

Circularity Check

0 steps flagged

No circularity: pipeline uses external pre-trained diffusion weights and standard rendering equation

full rationale

The method adapts a pre-trained diffusion backbone with cascaded LoRAs for inpainting, light-homogenization, and decomposition, then applies a UV-space differentiable BRDF shading loss derived from the standard rendering equation. No equations or procedures define outputs in terms of the method's own fitted predictions or self-citations; training uses fewer than 100 external scans, and supervision enforces physical consistency externally rather than by construction. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the transferability of semantic priors from a large pre-trained diffusion model to the three sub-tasks via low-rank adaptation and on the rendering equation providing a sufficient supervisory signal in UV space; no free parameters, new axioms, or invented entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption: Pre-trained diffusion model semantic priors remain usable after sequential LoRA fine-tuning for texture completion, delighting, and material decomposition.
    Invoked when the abstract states that the Inpainting LoRA and Light-Homogenization LoRA leverage the model's semantic understanding.
  • standard math: The standard rendering equation can be evaluated differentiably in UV space to enforce physical consistency without rasterization artifacts.
    Invoked when the abstract describes the imposition of the UV-space differentiable BRDF shading loss.
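
For reference, the second axiom can be written out. The first line is the rendering equation as usually stated (Kajiya, 1986). The per-texel discretization and the loss that follow are assumptions made here to show what "evaluated in UV space" could mean; the symbols $u$, $K$, $L_k$, $\omega_k$, $\hat{I}$, and $\mathcal{L}_{\text{shade}}$ are illustrative and are not taken from the paper.

```latex
% Rendering equation: outgoing radiance at surface point x in direction \omega_o
L_o(x, \omega_o) = L_e(x, \omega_o)
  + \int_{\Omega} f_r(x, \omega_i, \omega_o)\, L_i(x, \omega_i)\,
    \big(n(x) \cdot \omega_i\big)\, \mathrm{d}\omega_i

% Assumed discretization: non-emissive surface, K directional lights,
% evaluated independently at each UV texel u
\hat{I}(u) = \sum_{k=1}^{K} f_r(u, \omega_k, \omega_o)\, L_k\,
  \max\big(n(u) \cdot \omega_k,\, 0\big)

% Assumed appearance loss over the set U of valid texels
\mathcal{L}_{\text{shade}} = \frac{1}{|U|} \sum_{u \in U}
  \big\lVert \hat{I}(u) - I(u) \big\rVert_2^2
```

Here $f_r$ depends on the predicted albedo, roughness, and specular maps at texel $u$, and $n(u)$ on the predicted normal map, so gradients of the loss reach every map that enters the shading.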

pith-pipeline@v0.9.1-grok · 5860 in / 1550 out tokens · 34114 ms · 2026-06-29T04:27:05.462659+00:00 · methodology

