arxiv: 2304.09479 · v5 · submitted 2023-04-19 · 💻 cs.CV · cs.GR · cs.LG

DiFaReli++: Diffusion Face Relighting with Consistent Cast Shadows

Pith reviewed 2026-05-24 08:40 UTC · model grok-4.3

classification 💻 cs.CV · cs.GR · cs.LG
keywords: face relighting, diffusion models, cast shadows, single-view, in-the-wild, DDIM conditioning, shadow map

The pith

A conditional diffusion model relights single-view faces with consistent cast shadows by modulating DDIM steps with rendered shading and an inferred shadow map.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a face relighting method that avoids explicit decomposition into shape, albedo, and lighting. Instead, it feeds off-the-shelf 3D and identity encodings plus a light code into a DDIM whose denoising is spatially guided by a rendered shading image and a simple shadow map. The approach trains on ordinary 2D photographs alone, without light-stage captures or paired relit data. If successful, it produces temporally consistent cast shadows across lighting changes and reaches state-of-the-art scores on the Multi-PIE benchmark plus top user-study rankings.

Core claim

By conditioning a DDIM decoder on a disentangled light encoding together with a rendered shading reference and an inferred shadow map, the model can synthesize relit face images that preserve identity and geometry while adding realistic, temporally consistent cast shadows, all from a single network pass and without any ground-truth lighting supervision.

What carries the argument

Conditional DDIM whose spatial modulation is performed by a rendered shading reference combined with a shadow map inferred from the input geometry.
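As a rough illustration of what "spatial modulation" could look like, the sketch below scales a feature map per pixel by a gate computed from the concatenated conditioning images (shading reference and shadow map). The 1x1 projection, the sigmoid gate, and every name here (`spatial_modulate`, `proj`) are illustrative assumptions, not the paper's Modulator architecture.

```python
import numpy as np

def spatial_modulate(features, shading, shadow, proj):
    """Scale a feature map per pixel using conditioning images.

    features: (C, H, W) activations from a denoiser layer (hypothetical).
    shading:  (3, H, W) rendered shading reference.
    shadow:   (1, H, W) shadow map in [0, 1].
    proj:     (C, 4) weights of an assumed 1x1 projection.
    """
    cond = np.concatenate([shading, shadow], axis=0)   # (4, H, W)
    # A 1x1 convolution is a per-pixel matrix multiply over channels.
    gate = np.einsum("ck,khw->chw", proj, cond)        # (C, H, W)
    gate = 1.0 / (1.0 + np.exp(-gate))                 # sigmoid, values in (0, 1)
    return features * gate

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16, 16))
shading = rng.uniform(size=(3, 16, 16))
shadow = rng.uniform(size=(1, 16, 16))
proj = rng.normal(size=(8, 4))
out = spatial_modulate(feats, shading, shadow, proj)
```

Because the gate depends on pixel position, a dark region in the shadow map can suppress or pass features locally, which is the property a non-spatial (vector) condition cannot provide.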

If this is right

  • Relighting no longer requires light-stage data, relit pairs, or multi-view images.
  • A single forward pass produces the relit image once pre-processing is done.
  • Cast shadows remain consistent when the same face is shown under changing target lights.
  • The distilled single-shot model outperforms its teacher model on all reported metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same conditioning trick could be tested on non-face objects once reliable shape estimators exist.
  • If the shadow-map step generalizes, it may reduce the need for full global-illumination simulation in other diffusion relighting tasks.
  • Single-pass operation opens the possibility of applying the model to short video clips without per-frame retraining.

Load-bearing premise

Off-the-shelf 3D shape and facial-identity estimators supply inputs accurate enough that the simple shadow-map modulation does not introduce large visible errors in the final output.

What would settle it

Run the method on Multi-PIE test sequences using the same off-the-shelf estimators; if the generated cast shadows fail to match ground-truth shadow boundaries or show temporal flicker, the claim is falsified.
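The two checks named above can be made concrete with simple measures. The sketch below is one possible operationalization, not a protocol from the paper: `shadow_iou` compares binary shadow masks, and `temporal_flicker` is the mean absolute change between consecutive frames. How ground-truth shadow masks would be obtained is left open, and both function names are hypothetical.

```python
import numpy as np

def shadow_iou(pred_mask, gt_mask):
    """Intersection over union of two binary shadow masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred, gt).sum() / union)

def temporal_flicker(frames):
    """Mean absolute change between consecutive frames of shape (T, H, W)."""
    frames = np.asarray(frames, dtype=np.float64)
    return float(np.abs(np.diff(frames, axis=0)).mean())

iou = shadow_iou([[1, 1, 0], [0, 0, 0]], [[1, 0, 0], [0, 0, 0]])
steady = temporal_flicker(np.zeros((3, 2, 2)))
blinking = temporal_flicker(np.stack([np.zeros((2, 2)), np.ones((2, 2)), np.zeros((2, 2))]))
```

A shadow that moves smoothly with the light also changes between frames, so the flicker value is only meaningful relative to the same measure computed on a reference sequence under the same light path.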

Figures

Figures reproduced from arXiv: 2304.09479 by Nontawat Tritrong, Puntawat Ponglertnapakorn, Supasorn Suwajanakorn.

Figure 1: Our method addresses one of the most challenging relighting scenarios where input images contain strong highlights and …
Figure 2: Overview of DiFaReli++. We use off-the-shelf estimators to derive various encodings from the input image: segmentation masks, shadow map, (light, shape, camera) parameters, and face embedding. These encodings are then fed into a conditional DDIM via “spatial” and “non-spatial” conditioning techniques. For spatial conditioning, a shading reference, shadow map, and segmentation masks are concatenated and fed…
Figure 3: Computing the shadow map for training. We used a pretrained DiFaReli model to generate stronger and reduced versions of the input image, then identify shadow areas through pixel differences. Our process produces more accurate and spatially aligned shadow maps compared to ray-traced maps shown in red, which suffer from inaccurate lighting and geometry estimation.
Figure 4: Modifications of the Modulator’s input in DiFaReli++. The input is a concatenation of the shadow map, the shading reference, and segmentation masks (see all masks in …
Figure 5: Single-shot face relighting framework involves a) using DiFaReli++ to generate supervised relit pairs and b) training a single-shot relighting network with the same architecture as DiFaReli++ using the training pairs with a simple L2 loss. [Panel labels: Input - Reference Pandey et al. (SIGGRAPH’21) Hou et al. (CVPR’21) Hou et al. (CVPR’22) IC-Light (Github’24) DiFaReli (ICCV’23) DiFaReli++ss Ours]
Figure 6: Relit results on FFHQ [29]. The FFHQ dataset contains diverse face images captured in real-world environments. Our method produces more realistic relit images, as well as cast shadows, which can be controlled via the shadow map in the rightmost column. It effectively removes existing cast shadows and adds new ones. Additionally, it can relight non-facial parts (e.g., hats, hoodies, or shirts) to match the …
Figure 7: Relighting results on Multi-PIE [20] when the target lighting is taken from the same person (first row) and from a different person (second row). [Panel labels: - Shadow Input + Shadow]
Figure 8: Varying intensities of cast shadow. DiFaReli’s ability to change the intensity of cast shadows by adjusting the scalar c and decode the modified feature vector (more in …
Figure 9: Relighting with consistent cast shadows. Compared to four recent state-of-the-art methods [95], [27], [26], [50], DiFaReli++ effectively removes input cast shadows and synthesizes new ones in a realistic and consistent manner. The bottom row shows our shading references and shadow maps. Additional results are in Appendix (…
Figure 10: Results using various acceleration techniques with different sampling steps on an input image from FFHQ. While these techniques can reduce the sampling steps, they introduce artifacts and blurriness. In contrast, our distilled version of DiFaReli++ (DiFaReli++ss) delivers the highest quality, the least noisy output, and runs in just 0.07 seconds.
Figure 11: Trade-off between runtime and relighting performance of different acceleration techniques measured on three metrics: DSSIM, MSE, and LPIPS. The first row shows results on the test set where the target lighting is taken from the same subject, while the second row uses target lighting from a different subject. The red dashed line represents our single-shot face relighting score (DiFaReli++ss), and the magen…
Figure 12: Background conditioning ablation. Without background conditioning, non-facial regions like hats may disappear. Conditioning on raw pixels in DiFaReli preserves the hat, while conditioning on segmentation masks in DiFaReli++ss not only preserves it but also enables its relighting.
Figure 13: Improvements over DiFaReli’s failure cases. DiFaReli++ss better removes shadows cast by external objects (top) and better preserves sunglasses (bottom). [Panel labels: DiFaReli++ss Input a) Fails to handle cast shadows on hat/shirt b) Does not create shadows cast by external objects c) Mistakenly relights object covering the face]
Figure 14: Failure cases. Our method a) may fail to add or remove cast shadows on non-facial parts (e.g., hats, clothing), b) may not produce shadows cast by external objects, or c) may mistakenly relight objects occluding the face (e.g., hands), leading to unrealistic relighting in some cases.
Figure 15: Diagram of one of the residual blocks inside the first …
Figure 16: Diagram of one of the 3-layer MLPs in the non-spatial …
Figure 17: Comparison of DiFaReli and DiFaReli++ pipelines. Differences are highlighted with red borders. Key changes are: 1) Background conditioning: replacing the background image with a concatenation of segmentation masks to enable relighting of non-facial parts. 2) Shadow estimator: using a shadow map with an encoded shadow scalar for improved consistency in cast shadows generation. 3) The cast shadow scalar c i…
Figure 18: Comparison against HoloRelighting [42] and visual analysis of our limitations. a) Our method better preserves fine details, such as hair and teeth, compared to HoloRelighting results, taken directly from their paper due to the lack of source code. Note that our target lighting was estimated using DECA [15] from the target image. b) The overall lighting in our result lacks the strong orange shading present…
Figure 19: Comparison against SwitchLight [31] and visual analysis of our limitations. SwitchLight’s results were taken directly from their paper due to the lack of source code. Our method addresses SwitchLight’s limitations: a) our method effectively removes hard cast shadows and better preserves makeup details, and b) produces sharper details. c) Our results appear less consistent with the target lighting, lacking…
Figure 20: Ablation study of the light conditioning (Section 4.3A in the main text). [Panel labels: Ground truth Used as non-spatial Ours No Modulator Ours (DiFaReli) Input]
Figure 21: Ablation study of the non-spatial conditioning variable (Section 4.3B in the main text)
Figure 22: Relit results under rotating light around the forward axis (roll) on the FFHQ test set [29]. The order of results for each task is shuffled when displayed to each participant. Instructions and criteria for making selections are provided at the top of the page
Figure 23: Relit results under rotating light around the forward axis (roll) on the FFHQ test set [29]
Figure 24: Relit results under rotating light around the forward axis (roll) on the FFHQ test set [29]
Figure 25: Relit results under rotating light around the up axis (yaw) on the FFHQ test set [29]
Figure 26: Relit results under rotating light around the up axis (yaw) on the FFHQ test set [29]
Figure 27: Relit results on the FFHQ test set [29]
Figure 28: Relit results on the FFHQ test set [29]
Figure 29: Relit results on the FFHQ test set [29]
Figure 30: Relit results on the FFHQ test set [29]
Figure 31: Improved DDIM sampling with mean-matching. We show a qualitative comparison between “with” and “without” mean-matching. Our mean-matching technique helps correct the overall brightness in both the inversion output and relit image
Figure 32: Varying the intensities of cast shadows on FFHQ [29]
Figure 33: Poor results from using ray-traced shadow maps for inversion. Using ray-traced shadow maps for DDIM inversion, the top result shows that non-shadow areas are over-brightened (highlighted with a red circle), while the bottom result shows a failure to remove shadows and closely follow the conditioning shadow map. [Panel labels: Hair Face skin Eyes Eyeballs Glasses Ears Nose Inside mouth Upper lip Lower lip Neck Cloth Hat …]
Figure 34: All segmentation masks used as conditioning inputs in DiFaReli++ (Section IV-B in the main text)
Figure 35: Examples of proxy background images that serve as target lighting for IC-Light […
Figure 36: User interface for the relighting user study of facial and non-facial parts (Section V-C1 in the main text)
Figure 37: User interface for the relighting user study on relighting quality of controllable cast shadows (Section V-C2 in the main text). In the interface, these results are videos that play simultaneously
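The Figure 3 caption describes obtaining a training shadow map from pixel differences between two renders of the same face. A minimal sketch of that idea follows, reading "stronger" and "reduced" as referring to shadow intensity. The grayscale averaging, the clipping, and the max-normalization are assumptions made for illustration, and `shadow_map_from_pair` is a hypothetical name; the paper's actual procedure may differ.

```python
import numpy as np

def shadow_map_from_pair(reduced, stronger, eps=1e-8):
    """Estimate a shadow map from two renders of the same image.

    reduced, stronger: float arrays of shape (H, W, 3) with values in [0, 1],
    assumed to be the shadow-reduced and shadow-strengthened versions.
    """
    reduced = np.asarray(reduced, dtype=np.float64)
    stronger = np.asarray(stronger, dtype=np.float64)
    # Shadowed pixels get darker in the 'stronger' render, so the
    # per-pixel brightness drop is largest there.
    diff = reduced.mean(axis=-1) - stronger.mean(axis=-1)
    diff = np.clip(diff, 0.0, None)       # ignore pixels that got brighter
    return diff / (diff.max() + eps)      # normalize to roughly [0, 1]

reduced = np.full((4, 4, 3), 0.8)
stronger = reduced.copy()
stronger[1:3, 1:3] = 0.2                  # a 2x2 patch that darkens
shadow = shadow_map_from_pair(reduced, stronger)
```

Because both renders come from the same input image, the resulting map is aligned with the photo by construction, which is the advantage the caption claims over ray-traced maps that depend on estimated geometry.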
Original abstract

We introduce a novel approach to single-view face relighting in the wild, addressing challenges such as global illumination and cast shadows. A common scheme in recent methods involves intrinsically decomposing an input image into 3D shape, albedo, and lighting, then recomposing it with the target lighting. However, estimating these components is error-prone and requires many training examples with ground-truth lighting to generalize well. Our work bypasses the need for accurate intrinsic estimation and can be trained solely on 2D images without any light stage data, relit pairs, multi-view images, or lighting ground truth. Our key idea is to leverage a conditional diffusion implicit model (DDIM) for decoding a disentangled light encoding along with other encodings related to 3D shape and facial identity inferred from off-the-shelf estimators. We propose a novel conditioning technique that simplifies modeling the complex interaction between light and geometry. It uses a rendered shading reference along with a shadow map, inferred using a simple and effective technique, to spatially modulate the DDIM. Moreover, we propose a single-shot relighting framework that requires just one network pass, given pre-processed data, and even outperforms the teacher model across all metrics. Our method realistically relights in-the-wild images with temporally consistent cast shadows under varying lighting conditions. We achieve state-of-the-art performance on the standard benchmark Multi-PIE and rank highest in user studies. Please visit our page: https://diffusion-face-relighting-pp.github.io
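For readers unfamiliar with the sampler the abstract builds on, the sketch below shows one deterministic DDIM update (the eta = 0 case) in its standard form. It is not code from the paper: in the paper's setting the noise prediction `eps_pred` would come from the conditional network, whereas here it is passed in directly, and `alpha_bar_*` follow the common cumulative-product notation.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update from step t to the previous step."""
    # Clean-image estimate implied by the current noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Re-noise the estimate to the previous (less noisy) level.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 4))
eps = rng.normal(size=(4, 4))
a_t, a_prev = 0.5, 0.8
x_t = np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * eps
x_prev = ddim_step(x_t, eps, a_t, a_prev)
```

With a perfect noise prediction the step lands exactly on the same image at the lower noise level. The update contains no random term, so it can also be run in the opposite direction, which is the property DDIM inversion relies on.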

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces DiFaReli++, a single-view face relighting method for in-the-wild images that uses a conditional DDIM decoder on disentangled light, 3D shape, and identity encodings obtained from off-the-shelf estimators. It proposes a conditioning technique that spatially modulates the DDIM via a rendered shading reference plus an inferred shadow map (obtained by a 'simple and effective technique') to model light-geometry interactions, including cast shadows. The method is trained only on 2D images without light-stage data, relit pairs, or lighting ground truth; it claims a single-shot inference pass, SOTA quantitative results on Multi-PIE, highest user-study rankings, and temporally consistent cast shadows under varying lighting.

Significance. If the conditioning technique and error propagation from off-the-shelf estimators can be shown to be robust, the approach would be significant for enabling realistic relighting with consistent shadows without requiring accurate intrinsic decomposition or specialized training data. The single-pass inference and avoidance of light-stage supervision are practical strengths.

major comments (3)
  1. [Abstract, §3] Abstract and §3 (Method): the central claim that the rendered shading + inferred shadow map 'simplifies modeling the complex interaction between light and geometry' and produces 'temporally consistent cast shadows' rests on unverified accuracy of the shadow map when derived from off-the-shelf 3D estimators; no quantitative propagation analysis, no ablation with ground-truth geometry, and no failure-case study on pose/expression variation (where shape errors are known to be large) are provided, leaving the link between 2D-only training and output consistency unsecured.
  2. [§4] §4 (Experiments): the assertion of 'state-of-the-art performance on the standard benchmark Multi-PIE' and 'outperforms the teacher model across all metrics' is stated without any reported quantitative tables, error bars, or per-metric comparisons in the abstract and is not accompanied by ablation details on the shadow-map component, which is load-bearing for the consistency claim.
  3. [§3.2] §3.2 (Conditioning technique): the shadow-map inference is described as 'simple and effective' yet no explicit formulation, pseudocode, or sensitivity analysis to input shape error (e.g., angular or depth deviation) is given, so it is impossible to assess whether the DDIM can correct misaligned shadows or merely propagates them.
minor comments (2)
  1. [Abstract] The abstract states results on Multi-PIE and user studies but does not cite the exact table or figure numbers where these appear; adding explicit cross-references would improve readability.
  2. [§3] Notation for the 'disentangled light encoding' and 'other encodings related to 3D shape and facial identity' should be defined with symbols or a diagram in §3 to avoid ambiguity when describing the DDIM conditioning.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, proposing revisions where appropriate to strengthen the paper while remaining faithful to the presented work.

Point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (Method): the central claim that the rendered shading + inferred shadow map 'simplifies modeling the complex interaction between light and geometry' and produces 'temporally consistent cast shadows' rests on unverified accuracy of the shadow map when derived from off-the-shelf 3D estimators; no quantitative propagation analysis, no ablation with ground-truth geometry, and no failure-case study on pose/expression variation (where shape errors are known to be large) are provided, leaving the link between 2D-only training and output consistency unsecured.

    Authors: We agree that additional analysis would strengthen the claims regarding robustness. The current manuscript does not include quantitative error propagation analysis or ablations with ground-truth geometry, as our approach is explicitly designed for 2D-only training without such supervision. In revision, we will add a failure-case study examining performance under pose and expression variations to better illustrate behavior with shape estimation errors. We maintain that the Multi-PIE results and user study provide supporting evidence for consistency, but acknowledge the value of the suggested additions. revision: partial

  2. Referee: [§4] §4 (Experiments): the assertion of 'state-of-the-art performance on the standard benchmark Multi-PIE' and 'outperforms the teacher model across all metrics' is stated without any reported quantitative tables, error bars, or per-metric comparisons in the abstract and is not accompanied by ablation details on the shadow-map component, which is load-bearing for the consistency claim.

    Authors: Quantitative tables with per-metric comparisons on Multi-PIE, including outperformance over the teacher model, are reported in Section 4 of the manuscript. Abstracts are summaries and do not typically contain full tables or error bars. To address the concern, we will add error bars to the existing tables and include a dedicated ablation study on the shadow-map component in the revised experiments section. revision: yes

  3. Referee: [§3.2] §3.2 (Conditioning technique): the shadow-map inference is described as 'simple and effective' yet no explicit formulation, pseudocode, or sensitivity analysis to input shape error (e.g., angular or depth deviation) is given, so it is impossible to assess whether the DDIM can correct misaligned shadows or merely propagates them.

    Authors: We will revise Section 3.2 to include the explicit formulation of the shadow-map inference technique, accompanying pseudocode, and a sensitivity analysis to input shape errors (such as angular or depth deviations) to the extent possible with available data. This will allow readers to better evaluate the conditioning mechanism. revision: yes
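To make the requested sensitivity analysis concrete, the toy sketch below measures how much a rendered shading image changes when the light direction is off by a given angle. It uses Lambertian shading under a single directional light on a synthetic hemisphere, which is a stand-in chosen for brevity; the paper's renderer, light model, and geometry source may differ, and all function names are hypothetical.

```python
import numpy as np

def lambert_shading(normals, light_dir):
    """normals: (H, W, 3) unit vectors; light_dir: (3,) direction vector."""
    l = np.asarray(light_dir, dtype=np.float64)
    l = l / np.linalg.norm(l)
    return np.clip(normals @ l, 0.0, None)

def rotate_about_y(v, degrees):
    """Rotate a 3-vector about the vertical (y) axis."""
    t = np.deg2rad(degrees)
    rot = np.array([[np.cos(t), 0.0, np.sin(t)],
                    [0.0, 1.0, 0.0],
                    [-np.sin(t), 0.0, np.cos(t)]])
    return rot @ np.asarray(v, dtype=np.float64)

def shading_sensitivity(normals, light_dir, degrees):
    """Mean absolute shading change caused by an angular light error."""
    base = lambert_shading(normals, light_dir)
    perturbed = lambert_shading(normals, rotate_about_y(light_dir, degrees))
    return float(np.abs(base - perturbed).mean())

# Toy geometry: a hemisphere of unit normals facing the camera (+z).
ys, xs = np.mgrid[-1:1:32j, -1:1:32j]
r2 = xs**2 + ys**2
zs = np.sqrt(np.clip(1.0 - r2, 0.0, None))
normals = np.dstack([xs, ys, zs])
normals[r2 > 1.0] = [0.0, 0.0, 1.0]       # flat background outside the disc
errors = [shading_sensitivity(normals, [0.0, 0.0, 1.0], d) for d in (0, 5, 15, 30)]
```

The same loop could perturb the normals instead of the light to mimic shape error. Either way, the output is only the error entering the conditioning image; whether the DDIM corrects or propagates it would still need to be measured on the model's outputs.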

standing simulated objections not resolved
  • Quantitative ablation studies using ground-truth geometry for error propagation analysis, as the method is trained solely on 2D images and does not have access to such ground-truth data.

Circularity Check

0 steps flagged

No circularity: method uses external estimators and standard DDIM conditioning without self-referential reductions

full rationale

The paper's approach relies on off-the-shelf 3D shape and identity estimators plus a simple shadow map inference to condition a standard DDIM, with training solely on 2D images. No equations, predictions, or derivations in the abstract or described framework reduce outputs to quantities defined by the method's own fitted parameters or self-citations. The conditioning technique is presented as a practical modulation step rather than a tautological construction, and claims rest on benchmark performance and user studies rather than internal self-definition. This is a standard empirical method paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Only the abstract is available, so the ledger is limited to premises stated or implied there; no free parameters, axioms, or invented entities are explicitly quantified.

axioms (2)
  • domain assumption: Off-the-shelf 3D shape and identity estimators supply inputs accurate enough for the downstream diffusion conditioning to succeed.
    The abstract states that shape and identity encodings are inferred from off-the-shelf estimators and used directly in the DDIM conditioning.
  • ad hoc to paper: A simple shadow-map inference technique combined with rendered shading can spatially modulate the DDIM to model light-geometry interactions.
    The abstract presents this as the novel conditioning technique that simplifies the complex interaction.

pith-pipeline@v0.9.0 · 5814 in / 1411 out tokens · 21382 ms · 2026-05-24T08:40:22.561784+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

    Paper passage: "Our key idea is to leverage a conditional diffusion implicit model (DDIM) for decoding a disentangled light encoding along with other encodings related to 3D shape and facial identity inferred from off-the-shelf estimators. We propose a novel conditioning technique that simplifies modeling the complex interaction between light and geometry. It uses a rendered shading reference along with a shadow map, inferred using a simple and effective technique, to spatially modulate the DDIM."

  • IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · relation: unclear

    Paper passage: "We achieve state-of-the-art performance on the standard benchmark Multi-PIE and rank highest in user studies."

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · 10 internal anchors

  1. [1]

    Segdiff: Image segmentation with diffusion probabilistic models

    Tomer Amit, Eliya Nachmani, Tal Shaharbany, and Lior Wolf. Segdiff: Image segmentation with diffusion probabilistic models. arXiv:2112.00390, 2021.

  2. [2]

    eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

    Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv:2211.01324, 2022.

  3. [3]

    Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models

    Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. arXiv:2201.06503, 2022.

  4. [4]

    Label-efficient semantic segmentation with diffusion models

    Dmitry Baranchuk, Ivan Rubachev, Andrey Voynov, Valentin Khrulkov, and Artem Babenko. Label-efficient semantic segmentation with diffusion models. arXiv:2112.03126, 2021.

  5. [5]

    Shape, illumination, and reflectance from shading

    Jonathan T Barron and Jitendra Malik. Shape, illumination, and reflectance from shading. IEEE transactions on pattern analysis and machine intelligence, 37(8):1670–1687, 2014.

  6. [6]

    A morphable model for the synthesis of 3d faces

    Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3d faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pages 187–194, 1999.

  7. [7]

    Efficient geometry-aware 3d generative adversarial networks

    Eric R. Chan, Connor Z. Lin, Matthew A. Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, Tero Karras, and Gordon Wetzstein. Efficient geometry-aware 3d generative adversarial networks, 2022.

  8. [8]

    Denoising likelihood score matching for conditional score-based data generation

    Chen-Hao Chao, Wei-Fang Sun, Bo-Wun Cheng, Yi-Chen Lo, Chia-Che Chang, Yu-Lun Liu, Yu-Lin Chang, Chia-Ping Chen, and Chun-Yi Lee. Denoising likelihood score matching for conditional score-based data generation. arXiv:2203.14206, 2022.

  9. [9]

    Diffusion models in vision: A survey

    Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. arXiv:2209.04747, 2022.

  10. [10]

    Arcface: Additive angular margin loss for deep face recognition

    Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019.

  11. [11]

    Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set

    Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.

  12. [12]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.

  13. [13]

    Sigmoid-weighted linear units for neural network function approximation in reinforcement learning

    Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11, 2018.

  14. [14]

    Near perfect gan inversion

    Qianli Feng, Viraj Shah, Raghudeep Gadde, Pietro Perona, and Aleix Martinez. Near perfect gan inversion. arXiv:2202.11833, 2022.

  15. [15]

    Learning an animatable detailed 3D face model from in-the-wild images

    Yao Feng, Haiwen Feng, Michael J. Black, and Timo Bolkart. Learning an animatable detailed 3D face model from in-the-wild images. volume 40, 2021.

  16. [16]

    Learning an animatable detailed 3d face model from in-the-wild images

    Yao Feng, Haiwen Feng, Michael J Black, and Timo Bolkart. Learning an animatable detailed 3d face model from in-the-wild images. ACM Transactions on Graphics (ToG), 40(4):1–13, 2021.

  17. [17]

    Controllable light diffusion for portraits

    David Futschik, Kelvin Ritland, James Vecore, Sean Fanello, Sergio Orts-Escolano, Brian Curless, Daniel Sýkora, and Rohit Pandey. Controllable light diffusion for portraits, 2023.

  18. [18]

    Unsupervised training for 3d morphable model regression

    Kyle Genova, Forrester Cole, Aaron Maschinot, Aaron Sarna, Daniel Vlasic, and William T Freeman. Unsupervised training for 3d morphable model regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8377–8386, 2018.

  19. [19]

    Generative adversarial networks

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.

  20. [20]

    Multi-pie

    Ralph Gross, Iain Matthews, Jeffrey Cohn, Takeo Kanade, and Simon Baker. Multi-pie. Image and vision computing , 2010. 3, 9, 10, 16

  21. [21]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2016. 5

  22. [22]

    Diffrelight: Diffusion-based facial performance relighting

    Mingming He, Pascal Clausen, Ahmet Levent Tas ¸el, Li Ma, Oliver Pilarski, Wenqi Xian, Laszlo Rikker, Xueming Yu, Ryan Burgert, Ning Yu, et al. Diffrelight: Diffusion-based facial performance relighting. In SIGGRAPH Asia 2024 Conference Papers , pages 1–12, 2024. 4

  23. [23]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020. 5, 6, 17

  24. [24]

    Cascaded diffusion models for high fidelity image generation

    Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res., 23:47–1, 2022. 17

  25. [25]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv:2207.12598, 2022. 17

  26. [26]

    Face relighting with geometrically consistent shadows

    Andrew Hou, Michel Sarkis, Ning Bi, Yiying Tong, and Xiaoming Liu. Face relighting with geometrically consistent shadows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4217–4226, 2022. 1, 3, 4, 7, 8, 9, 10, 11, 12, 16, 18

  27. [27]

    Towards high fidelity face relighting with realistic shadows

    Andrew Hou, Ze Zhang, Michel Sarkis, Ning Bi, Yiying Tong, and Xiaoming Liu. Towards high fidelity face relighting with realistic shadows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14719–14728, 2021. 1, 3, 4, 8, 9, 10, 11, 12, 18

  28. [28]

    3d face reconstruction with geometry details from a single image

    Luo Jiang, Juyong Zhang, Bailin Deng, Hao Li, and Ligang Liu. 3d face reconstruction with geometry details from a single image. IEEE Transactions on Image Processing, 27(10):4756–4770, 2018. 17

  29. [29]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019. 4, 9, 12, 16, 22, 23, 24, 25, 26, 27, 28, 29, 30, 32

  30. [30]

    Zhanghan Ke, Chunyi Sun, Lei Zhu, Ke Xu, and Rynson W.H. Lau. Harmonizer: Learning to perform white-box image and video harmonization. In European Conference on Computer Vision, 2022. 4

  31. [31]

    Switchlight: Co-design of physics-driven architecture and pre-training framework for human portrait relighting, 2024

    Hoon Kim, Minje Jang, Wonjun Yoon, Jisoo Lee, Donghyun Na, and Sanghyun Woo. Switchlight: Co-design of physics-driven architecture and pre-training framework for human portrait relighting, 2024. 1, 3, 4, 9, 17, 18, 20

  32. [32]

    Illumination-invariant face recognition with deep relit face images

    Ha A Le and Ioannis A Kakadiaris. Illumination-invariant face recognition with deep relit face images. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2019. 1, 3

  33. [33]

    Tianye Li, Timo Bolkart, Michael J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 2017. 5, 6, 17

  34. [34]

    A closed-form solution to photorealistic image stylization

    Yijun Li, Ming-Yu Liu, Xueting Li, Ming-Hsuan Yang, and Jan Kautz. A closed-form solution to photorealistic image stylization. In Proceedings of the European Conference on Computer Vision (ECCV), 2018. 4

  35. [35]

    Feature-preserving detailed 3d face reconstruction from a single image

    Yue Li, Liqian Ma, Haoqiang Fan, and Kenny Mitchell. Feature-preserving detailed 3d face reconstruction from a single image. In Proceedings of the 15th ACM SIGGRAPH European Conference on Visual Media Production, pages 1–9, 2018. 17

  36. [36]

    Targeting Ultimate Accuracy: Face Recognition via Deep Embedding

    Jingtuo Liu, Yafeng Deng, Tao Bai, Zhengping Wei, and Chang Huang. Targeting ultimate accuracy: Face recognition via deep embedding. arXiv:1506.07310, 2015. 17

  37. [37]

    Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022. 4, 10

  38. [38]

    DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv:2211.01095, 2022. 4, 10

  39. [39]

    Deep photo style transfer

    Fujun Luan, Sylvain Paris, Eli Shechtman, and Kavita Bala. Deep photo style transfer. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4990–4998, 2017. 4

  40. [40]

    Photoapp: Photorealistic appearance editing of head portraits

    BR Mallikarjun, Ayush Tewari, Abdallah Dib, Tim Weyrich, Bernd Bickel, Hans Peter Seidel, Hanspeter Pfister, Wojciech Matusik, Louis Chevallier, Mohamed A Elgharib, et al. Photoapp: Photorealistic appearance editing of head portraits. ACM Transactions on Graphics ,

  41. [41]

    Face-specific data augmentation for unconstrained face recognition

    Iacopo Masi, Anh Tuân Trần, Tal Hassner, Gozde Sahin, and Gérard Medioni. Face-specific data augmentation for unconstrained face recognition. International Journal of Computer Vision, 127, 2019. 17

  42. [42]

    Holo-relighting: Controllable volumetric portrait relighting from a single image

    Yiqun Mei, Yu Zeng, He Zhang, Zhixin Shu, Xuaner Zhang, Sai Bi, Jianming Zhang, HyunJoon Jung, and Vishal M Patel. Holo-relighting: Controllable volumetric portrait relighting from a single image. arXiv:2403.09632, 2024. 1, 3, 4, 9, 17, 18, 19

  43. [43]

    Sdedit: Guided image synthesis and editing with stochastic differential equations

    Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2021. 17

  44. [44]

    Learning physics-guided face relighting under directional light

    Thomas Nestmeyer, Jean-François Lalonde, Iain Matthews, and Andreas Lehrmann. Learning physics-guided face relighting under directional light. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5124–5133, 2020. 1, 3, 4, 8, 9, 10, 18

  45. [45]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv:2112.10741, 2021. 6, 17

  46. [46]

    Vaes meet diffusion models: Efficient and high-fidelity generation

    Kushagra Pandey, Avideep Mukherjee, Piyush Rai, and Abhishek Kumar. Vaes meet diffusion models: Efficient and high-fidelity generation. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. 17

  47. [47]

    Total relighting: learning to relight portraits for background replacement

    Rohit Pandey, Sergio Orts Escolano, Chloe Legendre, Christian Haene, Sofien Bouaziz, Christoph Rhemann, Paul Debevec, and Sean Fanello. Total relighting: learning to relight portraits for background replacement. ACM Transactions on Graphics (TOG), 40(4):1–21, 2021. 1, 3, 4, 9, 10, 11, 17, 18

  48. [48]

    Relightify: Relightable 3d faces from a single image via diffusion models

    Foivos Paraperas Papantoniou, Alexandros Lattas, Stylianos Moschoglou, and Stefanos Zafeiriou. Relightify: Relightable 3d faces from a single image via diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023. 3

  49. [49]

    Deep face recognition

    Omkar M Parkhi, Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. 2015. 17

  50. [50]

    DiFaReli++: Diffusion Face Relighting with Consistent Cast Shadows

    Puntawat Ponglertnapakorn, Nontawat Tritrong, and Supasorn Suwajanakorn. Difareli: Diffusion face relighting. arXiv:2304.09479, 2023. 2, 3, 7, 8, 9, 11, 12, 18

  51. [51]

    DreamFusion: Text-to-3D using 2D Diffusion

    Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv:2209.14988, 2022. 17

  52. [52]

    Diffusion autoencoders: Toward a meaningful and decodable representation

    Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. Diffusion autoencoders: Toward a meaningful and decodable representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10619–10629, 2022. 2, 5, 6, 7, 17, 19

  53. [53]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 17

  54. [54]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020. 17

  55. [55]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv:2204.06125, 2022. 17

  56. [56]

    Facelit: Neural 3d relightable faces, 2023

    Anurag Ranjan, Kwang Moo Yi, Jen-Hao Rick Chang, and Oncel Tuzel. Facelit: Neural 3d relightable faces, 2023. 4

  57. [57]

    Relightful harmonization: Lighting-aware portrait background replacement, 2023

    Mengwei Ren, Wei Xiong, Jae Shin Yoon, Zhixin Shu, Jianming Zhang, HyunJoon Jung, Guido Gerig, and He Zhang. Relightful harmonization: Lighting-aware portrait background replacement, 2023. 3, 4, 9

  58. [58]

    Encoding in style: a stylegan encoder for image-to-image translation

    Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: a stylegan encoder for image-to-image translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021. 4

  59. [59]

    Pivotal tuning for latent-based editing of real images

    Daniel Roich, Ron Mokady, Amit H Bermano, and Daniel Cohen-Or. Pivotal tuning for latent-based editing of real images. ACM Transactions on Graphics (TOG), 42(1):1–13, 2022. 4

  60. [60]

    High-resolution image synthesis with latent diffusion models, 2021

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021. 6

  61. [61]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 17

  62. [62]

    Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv:2208.12242, 2022. 17

  63. [63]

    Palette: Image-to-image diffusion models

    Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022. 17

  64. [64]

    Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv:2205.11487,

  65. [65]

    Relightable gaussian codec avatars

    Shunsuke Saito, Gabriel Schwartz, Tomas Simon, Junxuan Li, and Giljoo Nam. Relightable gaussian codec avatars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 130–141, 2024. 4

  66. [66]

    Adversarial diffusion distillation

    Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. arXiv:2311.17042, 2023. 3, 4

  67. [67]

    Facenet: A unified embedding for face recognition and clustering

    Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015. 17

  68. [68]

    Sfsnet: Learning shape, reflectance and illuminance of faces in the wild

    Soumyadip Sengupta, Angjoo Kanazawa, Carlos D Castillo, and David W Jacobs. Sfsnet: Learning shape, reflectance and illuminance of faces in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018. 1, 3, 10, 17, 18

  69. [69]

    Style transfer for headshot portraits

    YiChang Shih, Sylvain Paris, Connelly Barnes, William T Freeman, and Frédo Durand. Style transfer for headshot portraits. 2014. 4

  70. [70]

    Portrait lighting transfer using a mass transport approach

    Zhixin Shu, Sunil Hadap, Eli Shechtman, Kalyan Sunkavalli, Sylvain Paris, and Dimitris Samaras. Portrait lighting transfer using a mass transport approach. ACM Transactions on Graphics (TOG), 2017. 4, 17

  71. [71]

    Neural face editing with intrinsic image disentangling

    Zhixin Shu, Ersin Yumer, Sunil Hadap, Kalyan Sunkavalli, Eli Shechtman, and Dimitris Samaras. Neural face editing with intrinsic image disentangling. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5541–5550, 2017. 1, 3

  72. [72]

    D2c: Diffusion-decoding models for few-shot conditional generation

    Abhishek Sinha, Jiaming Song, Chenlin Meng, and Stefano Ermon. D2c: Diffusion-decoding models for few-shot conditional generation. Advances in Neural Information Processing Systems, 34, 2021. 17

  73. [73]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, pages 2256–2265. PMLR, 2015. 5, 17

  74. [74]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021. 2, 5, 6

  75. [75]

    Consistency Models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv:2303.01469, 2023. 3, 4

  76. [76]

    Generative modeling by estimating gradients of the data distribution

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. 5, 17

  77. [77]

    Single image portrait relighting

    Tiancheng Sun, Jonathan T Barron, Yun-Ta Tsai, Zexiang Xu, Xueming Yu, Graham Fyffe, Christoph Rhemann, Jay Busch, Paul E Debevec, and Ravi Ramamoorthi. Single image portrait relighting. ACM Trans. Graph., 38(4):79–1, 2019. 3, 10, 18

  78. [78]

    Deepface: Closing the gap to human-level performance in face verification

    Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1701–1708, 2014. 17

  79. [79]

    Pie: Portrait image embedding for semantic control

    Ayush Tewari, Mohamed Elgharib, Florian Bernard, Hans-Peter Seidel, Patrick Pérez, Michael Zollhöfer, and Christian Theobalt. Pie: Portrait image embedding for semantic control. ACM Transactions on Graphics (TOG), 39(6):1–14, 2020. 4

  80. [80]

    Stylerig: Rigging stylegan for 3d control over portrait images

    Ayush Tewari, Mohamed Elgharib, Gaurav Bharaj, Florian Bernard, Hans-Peter Seidel, Patrick Pérez, Michael Zollhöfer, and Christian Theobalt. Stylerig: Rigging stylegan for 3d control over portrait images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6142–6151, 2020. 4

Showing first 80 references.