pith. machine review for the scientific record.

arxiv: 2603.16570 · v2 · submitted 2026-03-17 · 💻 cs.CV

Recognition: 1 theorem link

· Lean Theorem

Face2Scene: Using Facial Degradation as an Oracle for Diffusion-Based Scene Restoration

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 09:39 UTC · model grok-4.3

classification 💻 cs.CV
keywords image restoration · face restoration · diffusion models · degradation estimation · reference-based restoration · scene restoration · conditional diffusion · identity references

The pith

Facial degradation extracted from restored faces can condition diffusion models to restore entire degraded scenes including body and background.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a degraded face can serve as a perceptual oracle for estimating degradation across an entire image. It first restores the face using reference-based face restoration, then derives a degradation code from the restored-degraded face pair that captures attributes such as noise, blur, and compression. This code is converted into multi-scale tokens that condition a diffusion model, enabling single-step restoration of the full scene. The method addresses the gap where face-focused restorers ignore non-facial regions and full-scene methods produce artifacts by ignoring degradation cues.
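As a structural sketch only (the paper's code is not public; every callable below is a hypothetical stand-in for the components the review names: Ref-FR, FaDeX, MapNet, and the one-step diffusion restorer), the two-stage flow reads:

```python
# Hypothetical sketch of the Face2Scene pipeline as summarized above.
# The component models are passed in as callables because no public
# interfaces exist; names mirror the paper's labels, not a released API.

def face2scene_restore(lq_image, face_bbox, identity_refs,
                       ref_fr, fadex, mapnet, restorer):
    """Two-stage restoration: face first, then the full scene."""
    # Stage I: reference-based face restoration on the cropped LQ face.
    y0, x0, y1, x1 = face_bbox
    lq_face = lq_image[y0:y1, x0:x1]
    hq_face = ref_fr(lq_face, identity_refs)

    # Stage II: the restored-degraded pair serves as the degradation oracle.
    deg_code = fadex(lq_face, hq_face)   # face-derived degradation code
    tokens = mapnet(deg_code)            # multi-scale degradation-aware tokens

    # Single-step diffusion restoration of the entire image (face, body,
    # background), conditioned on the degradation-aware tokens.
    return restorer(lq_image, cond=tokens)
```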

Core claim

Given a degraded image and one or more identity references, apply a Ref-FR model to reconstruct high-quality facial details. From the restored-degraded face pair, extract a face-derived degradation code that captures degradation attributes such as noise, blur, and compression, which is then transformed into multi-scale degradation-aware tokens. These tokens condition a diffusion model to restore the full scene in a single step, including the body and background.

What carries the argument

Face-derived degradation code extracted from the restored-degraded face pair and transformed into multi-scale degradation-aware tokens that condition the diffusion model for full-scene restoration.
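A hedged illustration of what "multi-scale degradation-aware tokens" could look like in practice; the dimensions, token counts, and module name below are assumptions for the sketch, not the paper's MapNet specification:

```python
import torch
import torch.nn as nn

# Illustrative mapper from a single degradation code to per-scale token sets.
# Each set would feed a cross-attention layer at the matching UNet depth.

class MultiScaleTokenMapper(nn.Module):
    def __init__(self, code_dim=512, token_dim=768, tokens_per_scale=4, n_scales=3):
        super().__init__()
        # One small projection head per resolution level of the diffusion backbone.
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(code_dim, token_dim * tokens_per_scale), nn.GELU())
            for _ in range(n_scales)
        ])
        self.tokens_per_scale = tokens_per_scale
        self.token_dim = token_dim

    def forward(self, deg_code):  # deg_code: (B, code_dim)
        # Returns a list of (B, tokens_per_scale, token_dim) tensors, one per scale.
        return [h(deg_code).view(-1, self.tokens_per_scale, self.token_dim)
                for h in self.heads]
```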

Load-bearing premise

The degradation attributes captured from the face are representative of the degradation present across the entire scene, including the body and background.

What would settle it

A set of test images in which the face receives one type of degradation while the background and body receive a clearly different type, such as added blur only to the face, would show whether the method applies the wrong restoration to non-facial regions.
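One way to build such a diagnostic set, sketched under the assumption that face bounding boxes are available; the box convention and blur strength are arbitrary illustrative choices:

```python
import cv2
import numpy as np

# Construct a face/background degradation mismatch: blur only the face region
# so the face "oracle" reports a degradation the rest of the scene lacks.

def face_only_blur(image: np.ndarray, face_bbox, ksize=21, sigma=6.0):
    y0, x0, y1, x1 = face_bbox
    out = image.copy()
    # ksize must be odd for cv2.GaussianBlur.
    out[y0:y1, x0:x1] = cv2.GaussianBlur(image[y0:y1, x0:x1], (ksize, ksize), sigma)
    return out

# Usage: run the restorer on face_only_blur(img, bbox) and inspect whether the
# already-sharp background is needlessly "deblurred" (oversharpened or hallucinated).
```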

Figures

Figures reproduced from arXiv: 2603.16570 by Alex Levinshtein, Amanpreet Walia, Amirhossein Kazerouni, Babak Taati, Iqbal Mohomed, Konstantinos G. Derpanis, Maitreya Suin, Sina Honari, Tristan Aumentado-Armstrong.

Figure 1
Figure 1. Overview of Face2Scene. We infer a face-derived degradation code from identity references and the observed LQ face, then use it as oracle guidance to restore the full scene (face, body, background) with a one-step diffusion restorer. view at source ↗
Figure 2
Figure 2. Overview of the Face2Scene pipeline. In stage I, we leverage a set of reference faces to restore the LQ face crop. In stage II, we use the pair of LQ and HQ faces to extract a guided degradation with FaDeX and inject it into a one-step diffusion model using MapNet. The diffusion model then reconstructs the full-scene image. view at source ↗
Figure 3
Figure 3. Visual comparison of Face2Scene with the three top-performing methods from the quantitative results (zoom in to see details). view at source ↗
Figure 4
Figure 4. Cosine similarity analysis. We show the cosine similarity across embeddings of image pairs with different degradations. (Left) similarities per degradation type (averaged over images). (Right) similarities per image, averaged over degradation types. Shaded area shows standard deviation. This confirms FaDeX isolates degradation from image content. view at source ↗
Figure 5
Figure 5. Scene diversity visualization. Word-cloud representation of the semantic tags extracted from our synthetic dataset using the RAM++ [65] image-tagging model. Larger words indicate tags that occur more frequently across generated images, highlighting the broad diversity of scenes captured in our dataset. view at source ↗
Figure 6
Figure 6. Four representative images of our synthetic InScene dataset. Each sample is shown alongside its structured prompt. We also show the stored metadata, including facial landmarks, bounding boxes, and associated identity labels. view at source ↗
Figure 7
Figure 7. Real dataset preprocessing pipeline. Our real-world images are standardized through a three-stage filtering process. Stage 1 removes images with insufficient resolution and ensures a detectable face is present. Stage 2 enforces content quality by verifying that the face region is sufficiently large and the image is not overly blurred (e.g., due to defocus blur in the background, as in the example image sho… view at source ↗
Figure 8
Figure 8. Overview of FaDeX contrastive learning. Momentum-encoder contrastive training setup with dual encoders and a queue of negatives to learn degradation-discriminative face embeddings. view at source ↗
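For readers unfamiliar with the setup the caption names, a generic MoCo-style objective looks like the following. The positive-pair definition (same degradation applied to different content, as the caption implies) and the hyperparameters are assumptions, not FaDeX's published recipe:

```python
import torch
import torch.nn.functional as F

def info_nce(q, k_pos, queue, temperature=0.07):
    """q, k_pos: (B, D) L2-normalized embeddings; queue: (K, D) negative keys."""
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)  # (B, 1) positive logits
    l_neg = q @ queue.t()                          # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    # The positive key sits at index 0 for every query.
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)

@torch.no_grad()
def momentum_update(key_encoder, query_encoder, m=0.999):
    # Exponential moving average of the query encoder into the key encoder.
    for pk, pq in zip(key_encoder.parameters(), query_encoder.parameters()):
        pk.mul_(m).add_(pq, alpha=1 - m)
```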
Figure 9
Figure 9. Model complexity comparison. We compare inference time across different diffusion-based restoration models. For Face2Scene, we show the separate time costs of the reference-based preprocessing and our final restoration as two different shades. While our method is not as efficient as standard one-step models, which do not require an initial face restoration step, it is still considerably faster than curren… view at source ↗
Figure 10
Figure 10. Effect of CFG scaling. We vary the CFG scaling factor from 1.00 to 1.16 on the InScene synthetic validation set. The top row shows no-reference metrics and the bottom row shows reference-based metrics, illustrating the trade-off between perceptual quality and fidelity. In particular, although there is some noise in the trends, one can see that the no-reference quality metrics tend to improve as the CFG we… view at source ↗
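For reference, the standard classifier-free guidance combination that such a CFG sweep varies (the paper's exact parameterization may differ):

```latex
\hat{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing)
  + w \,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr),
\qquad w \in [1.00,\, 1.16]
```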
Figure 11
Figure 11. Visual comparison on InScene Synthetic dataset across seven methods. view at source ↗
Figure 12
Figure 12. Visual comparison on InScene Synthetic dataset across seven methods. view at source ↗
Figure 13
Figure 13. Visual comparison on InScene Synthetic dataset across seven methods. view at source ↗
Figure 14
Figure 14. Visual comparison on InScene Synthetic dataset across seven methods. view at source ↗
Figure 15
Figure 15. Visual comparison on InScene Synthetic dataset across seven methods. view at source ↗
Figure 16
Figure 16. Visual comparison on InScene Synthetic dataset across seven methods. view at source ↗
Figure 17
Figure 17. Visual comparison on real validation across seven methods. view at source ↗
Figure 18
Figure 18. Visual comparison on real validation across seven methods. view at source ↗
Figure 19
Figure 19. Visual comparison across restoration methods on real test samples. view at source ↗
Figure 20
Figure 20. Visual comparison across restoration methods on real test samples. view at source ↗
Figure 21
Figure 21. Visual comparison across restoration methods on real test samples. view at source ↗
Figure 22
Figure 22. Failure Cases. Our method, like other SD 2.1-based restorers, struggles to recover fine text and very small faces, reflecting the backbone’s limited text rendering ability and difficulty preserving details in tiny spatial regions. More advanced backbones such as FLUX or SD3.5, which provide higher native resolution and stronger text/face modeling, could alleviate these issues. view at source ↗
read the original abstract

Recent advances in image restoration have enabled high-fidelity recovery of faces from degraded inputs using reference-based face restoration models (Ref-FR). However, such methods focus solely on facial regions, neglecting degradation across the full scene, including body and background, which limits practical usability. Meanwhile, full-scene restorers often ignore degradation cues entirely, leading to underdetermined predictions and visual artifacts. In this work, we propose Face2Scene, a two-stage restoration framework that leverages the face as a perceptual oracle to estimate degradation and guide the restoration of the entire image. Given a degraded image and one or more identity references, we first apply a Ref-FR model to reconstruct high-quality facial details. From the restored-degraded face pair, we extract a face-derived degradation code that captures degradation attributes (e.g., noise, blur, compression), which is then transformed into multi-scale degradation-aware tokens. These tokens condition a diffusion model to restore the full scene in a single step, including the body and background. Extensive experiments demonstrate the superior effectiveness of the proposed method compared to state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces Face2Scene, a two-stage framework for full-scene image restoration. Given a degraded input and identity references, a Ref-FR model first restores the face; a degradation code is then extracted from the restored-degraded face pair, converted to multi-scale tokens, and used to condition a diffusion model that restores the entire image (including body and background) in one step. The central claim is that this face-derived oracle yields superior restoration quality over existing methods, supported by extensive experiments.

Significance. If the face-to-scene degradation transfer holds, the approach offers a practical way to supply degradation cues to otherwise underdetermined scene restorers, potentially improving usability in real-world settings where faces are salient. The integration of Ref-FR outputs with diffusion conditioning is a coherent extension of prior work and could influence hybrid restoration pipelines.

major comments (1)
  1. [Method overview] The claim, made in the abstract and method description, that a single face-derived degradation code suffices to condition restoration of the full scene rests on the untested assumption that degradation attributes (noise, blur, compression) extracted from the face are representative of spatially variant degradation elsewhere. No ablation or analysis quantifies failure modes under localized motion blur or texture-dependent artifacts, which directly undermines the superiority claim.
minor comments (1)
  1. [Abstract] The abstract states 'extensive experiments' without naming datasets, metrics, or baselines; adding these details in the abstract or a dedicated table would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed review and constructive comments on our paper. We address the major comment point by point below. We agree that additional analysis on spatially variant degradations would enhance the manuscript and will incorporate it in the revision.

read point-by-point responses
  1. Referee: [Method overview] The claim, made in the abstract and method description, that a single face-derived degradation code suffices to condition restoration of the full scene rests on the untested assumption that degradation attributes (noise, blur, compression) extracted from the face are representative of spatially variant degradation elsewhere. No ablation or analysis quantifies failure modes under localized motion blur or texture-dependent artifacts, which directly undermines the superiority claim.

    Authors: We appreciate the referee pointing out this potential limitation. Our experiments across multiple datasets with various real-world degradations demonstrate that the face-derived degradation code effectively guides the diffusion model to restore the full scene with superior quality compared to baselines. This suggests that the degradation attributes are sufficiently representative in practice for the types of degradations considered. Nevertheless, we recognize the importance of evaluating under localized degradations. In the revised version, we will include an ablation study using synthetically generated localized motion blur and texture-specific artifacts to quantify the failure modes and discuss the conditions under which the assumption holds. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation builds on external Ref-FR and diffusion components

full rationale

The paper's core pipeline applies an existing reference-based face restoration (Ref-FR) model to a degraded input, extracts a degradation code from the resulting restored-degraded face pair, converts that code into multi-scale tokens, and uses the tokens to condition a diffusion model for full-scene restoration. None of these steps reduce by construction to the target output or rely on a self-citation chain whose validity is presupposed by the present work. The extraction of the degradation code is presented as a direct computation from the face pair rather than a fitted parameter renamed as a prediction, and no uniqueness theorem or ansatz is imported from prior work by the same authors to force the architecture. The method therefore remains self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that face-derived degradation is a good proxy for scene degradation, which is not independently evidenced in the abstract.

axioms (1)
  • domain assumption: Reference-based face restoration models can accurately recover high-quality facial details from degraded inputs given identity references.
    This is invoked in the first stage of the framework.

pith-pipeline@v0.9.0 · 5533 in / 1225 out tokens · 32233 ms · 2026-05-15T09:39:23.395581+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    From the restored–degraded face pair, we extract a face-derived degradation code that captures degradation attributes (e.g., noise, blur, compression), which is then transformed into multi-scale degradation-aware tokens.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 1 internal anchor

  1. [1] Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. Flux.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv e-prints, arXiv:2506.

  2. [2] Chunyang Bi, Xin Luo, Sheng Shen, Mengxi Zhang, Huanjing Yue, and Jingyu Yang. DeeDSR: Towards real-world image super-resolution via degradation-aware stable diffusion. arXiv preprint arXiv:2404.00661, 2024.

  3. [3] Yochai Blau and Tomer Michaeli. The perception-distortion tradeoff. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6228–6237, 2018.

  4. [4] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.

  5. [5] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021.

  6. [6] Pradyumna Chari, Sizhuo Ma, Daniil Ostashev, Achuta Kadambi, Gurunandan Krishnan, Jian Wang, and Kfir Aberman. Personalized restoration via dual-pivot tuning. IEEE Transactions on Image Processing, 2025.

  7. [7] Chaofeng Chen, Jiadi Mo, Jingwen Hou, Haoning Wu, Liang Liao, Wenxiu Sun, Qiong Yan, and Weisi Lin. TOPIQ: A top-down approach from semantics to distortions for image quality assessment. IEEE Transactions on Image Processing, 33:2404–2418, 2024.

  8. [8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.

  9. [9] Zheng Chen, Yulun Zhang, Jinjin Gu, Xin Yuan, Linghe Kong, Guihai Chen, and Xiaokang Yang. Image super-resolution with text prompt diffusion. arXiv preprint arXiv:2311.14282, 2023.

  10. [10] Min Jin Chong, Dejia Xu, Yi Zhang, Zhangyang Wang, David Forsyth, Gurunandan Krishnan, Yicheng Wu, and Jian Wang. Copy or not? Reference-based face image restoration with fine details. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 9660–9669. IEEE, 2025.

  11. [11] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.

  12. [12] Keyan Ding, Kede Ma, Shiqi Wang, and Eero P. Simoncelli. Image quality assessment: Unifying structure and texture similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(5):2567–2581, 2020.

  13. [13] Zheng Ding, Xuaner Zhang, Zhuowen Tu, and Zhihao Xia. Restoration by generation with constrained priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2567–2577, 2024.

  14. [14] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.

  15. [15] Jue Gong, Jingkai Wang, Zheng Chen, Xing Liu, Hong Gu, Yulun Zhang, and Xiaokang Yang. Human body restoration with one-step diffusion model and a new benchmark. In International Conference on Machine Learning, 2025.

  16. [16] Jue Gong, Tingyu Yang, Jingkai Wang, Zheng Chen, Xing Liu, Hong Gu, Yulun Zhang, and Xiaokang Yang. HAODiff: Human-aware one-step diffusion via dual-prompt guidance. arXiv preprint arXiv:2505.19742, 2025.

  17. [17] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Conference on Computer Vision and Pattern Recognition, 2020.

  18. [18] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

  19. [19] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

  20. [20] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

  21. [21] Chi-Wei Hsiao, Yu-Lun Liu, Cheng-Kun Yang, Sheng-Po Kuo, Kevin Jou, and Chia-Ping Chen. Ref-LDM: A latent diffusion model for reference-based face image restoration. Advances in Neural Information Processing Systems, 37:74840–74867, 2024.

  22. [22] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.

  23. [23] Liming Jiang, Qing Yan, Yumin Jia, Zichuan Liu, Hao Kang, and Xin Lu. InfiniteYou: Flexible photo recrafting while preserving your identity. In International Conference on Computer Vision (ICCV), 2025.

  24. [24] Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. MUSIQ: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5148–5157, 2021.

  25. [25] Dehong Kong, Fan Li, Zhixin Wang, Jiaqi Xu, Renjing Pei, Wenbo Li, and WenQi Ren. Dual prompting image restoration with diffusion transformers. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12809–12819, 2025.

  26. [26] Nupur Kumari, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Ensembling off-the-shelf models for GAN training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10651–10662, 2022.

  27. [27] Xiaoming Li, Chaofeng Chen, Xianhui Lin, Wangmeng Zuo, and Lei Zhang. From face to natural image: Learning real degradation for blind image super-resolution. In European Conference on Computer Vision, pages 376–392. Springer, 2022.

  28. [28] Xiaoming Li, Shiguang Zhang, Shangchen Zhou, Lei Zhang, and Wangmeng Zuo. Learning dual memory dictionaries for blind face restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(5):5904–5917, 2022.

  29. [29] Esther Y. H. Lin, Zhecheng Wang, Rebecca Lin, Daniel Miau, Florian Kainz, Jiawen Chen, Xuaner Zhang, David B. Lindell, and Kiriakos N. Kutulakos. Learning lens blur fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

  30. [30] Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Bo Dai, Fanghua Yu, Yu Qiao, Wanli Ouyang, and Chao Dong. DiffBIR: Toward blind image restoration with generative diffusion prior. In European Conference on Computer Vision, pages 430–448. Springer, 2024.

  31. [31] Siyu Liu, Zheng-Peng Duan, Jia OuYang, Jiayi Fu, Hyunhee Park, Zikun Liu, Chun-Le Guo, and Chongyi Li. FaceMe: Robust blind face restoration with personal identification. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5567–5575, 2025.

  32. [32] Yunan Liu, Shanshan Zhang, Jie Xu, Jian Yang, and Yu-Wing Tai. An accurate and lightweight method for human body image super-resolution. IEEE Transactions on Image Processing, 30:2888–2897, 2021.

  33. [33] Yingqi Liu, Jingwen He, Yihao Liu, Xinqi Lin, Fanghua Yu, Jinfan Hu, Yu Qiao, and Chao Dong. AdaptBIR: Adaptive blind image restoration with latent diffusion prior for higher fidelity. Pattern Recognition, 155:110659, 2024.

  34. [34] Chong Mou, Yanze Wu, Xintao Wang, Chao Dong, Jian Zhang, and Ying Shan. Metric learning based interactive modulation for real-world super-resolution. In European Conference on Computer Vision, pages 723–740. Springer, 2022.

  35. [35] Chenyang Qi, Zhengzhong Tu, Keren Ye, Mauricio Delbracio, Peyman Milanfar, Qifeng Chen, and Hossein Talebi. TIP: Text-driven image processing with semantic and restoration instructions. CoRR, 2023.

  36. [36] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.

  37. [37] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

  38. [38] Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. In European Conference on Computer Vision, pages 87–103. Springer, 2024.

  39. [39] Keda Tao, Jinjin Gu, Yulun Zhang, Xiucheng Wang, and Nan Cheng. Overcoming false illusions in real-world face restoration with multi-modal guided diffusion model. arXiv preprint arXiv:2410.04161, 2024.

  40. [40] Keda Tao, Jinjin Gu, Yulun Zhang, Xiucheng Wang, and Nan Cheng. Overcoming false illusions in real-world face restoration with multi-modal guided diffusion model. In The Thirteenth International Conference on Learning Representations, 2025.

  41. [41] Tuomas Varanka, Tapani Toivonen, Soumya Tripathy, Guoying Zhao, and Erman Acar. PFStorer: Personalized face restoration and super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2372–2381, 2024.

  42. [42] Jianyi Wang, Kelvin C. K. Chan, and Chen Change Loy. Exploring CLIP for assessing the look and feel of images. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2555–2563, 2023.

  43. [43] Jingkai Wang, Jue Gong, Lin Zhang, Zheng Chen, Xing Liu, Hong Gu, Yutong Liu, Yulun Zhang, and Xiaokang Yang. OSDFace: One-step diffusion model for face restoration. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12626–12636, 2025.

  44. [44] Jingkai Wang, Wu Miao, Jue Gong, Zheng Chen, Xing Liu, Hong Gu, Yutong Liu, and Yulun Zhang. HonestFace: Towards honest face restoration with one-step diffusion model. arXiv preprint arXiv:2505.18469, 2025.

  45. [45] Longguang Wang, Yingqian Wang, Xiaoyu Dong, Qingyu Xu, Jungang Yang, Wei An, and Yulan Guo. Unsupervised degradation representation learning for blind super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10581–10590, 2021.

  46. [46] Simiao Wang, Yu Sang, Yunan Liu, Chunpeng Wang, Mingyu Lu, and Jinguang Sun. Prior based pyramid residual clique network for human body image super-resolution. Pattern Recognition, 150:110352, 2024.

  47. [47] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. ESRGAN: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.

  48. [48] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1905–1914, 2021.

  49. [49] Yufei Wang, Wenhan Yang, Xinyuan Chen, Yaohui Wang, Lanqing Guo, Lap-Pui Chau, Ziwei Liu, Yu Qiao, Alex C. Kot, and Bihan Wen. SinSR: Diffusion-based image super-resolution in a single step. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25796–25805, 2024.

  50. [50] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

  51. [51] Zhouxia Wang, Jiawei Zhang, Tianshui Chen, Wenping Wang, and Ping Luo. RestoreFormer++: Towards real-world blind face restoration from undegraded key-value pairs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15462–15476, 2023.

  52. [52] Rongyuan Wu, Lingchen Sun, Zhiyuan Ma, and Lei Zhang. One-step effective diffusion network for real-world image super-resolution. Advances in Neural Information Processing Systems, 37:92529–92553, 2024.

  53. [53] Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. MANIQA: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1191–1200, 2022.

  54. [54] Tao Yang, Rongyuan Wu, Peiran Ren, Xuansong Xie, and Lei Zhang. Pixel-aware stable diffusion for realistic image super-resolution and personalized stylization. In European Conference on Computer Vision, pages 74–91. Springer, 2024.

  55. [55] Zhicun Yin, Junjie Chen, Ming Liu, Zhixin Wang, Fan Li, Renjing Pei, Xiaoming Li, Rynson W. H. Lau, and Wangmeng Zuo. RefSTAR: Blind facial image restoration with reference selection, transfer, and reconstruction. arXiv preprint arXiv:2507.10470, 2025.

  56. [56] Jiacheng Ying, Mushui Liu, Zhe Wu, Runming Zhang, Zhu Yu, Siming Fu, Si-Yuan Cao, Chao Wu, Yunlong Yu, and Hui-Liang Shen. RestoreRID: Towards tuning-free face restoration with ID preservation. arXiv preprint arXiv:2411.14125, 2024.

  57. [57] Fanghua Yu, Jinjin Gu, Zheyuan Li, Jinfan Hu, Xiangtao Kong, Xintao Wang, Jingwen He, Yu Qiao, and Chao Dong. Scaling up to excellence: Practicing model scaling for photo-realistic image restoration in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25669–25680, 2024.

  58. [58] Zongsheng Yue, Jianyi Wang, and Chen Change Loy. Efficient diffusion model for image restoration by residual shifting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

  59. [59] Zongsheng Yue, Kang Liao, and Chen Change Loy. Arbitrary-steps image super-resolution via diffusion inversion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23153–23163, 2025.

  60. [60] Aiping Zhang, Zongsheng Yue, Renjing Pei, Wenqi Ren, and Xiaochun Cao. Degradation-guided one-step image super-resolution with diffusion priors. arXiv preprint arXiv:2409.17058, 2024.

  61. [61] Howard Zhang, Yuval Alaluf, Sizhuo Ma, Achuta Kadambi, Jian Wang, and Kfir Aberman. InstantRestore: Single-step personalized face restoration with shared-image attention. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques (SIGGRAPH) Conference Papers, pages 1–10, 2025.

  62. [62] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.

  63. [63] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.

  64. [64] Weixia Zhang, Guangtao Zhai, Ying Wei, Xiaokang Yang, and Kede Ma. Blind image quality assessment via vision-language correspondence: A multitask learning perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14071–14081, 2023.

  65. [65] Youcai Zhang, Xinyu Huang, Jinyu Ma, Zhaoyang Li, Zhaochuan Luo, Yanchun Xie, Yuzhuo Qin, Tong Luo, Yaqian Li, Shilong Liu, et al. Recognize anything: A strong image tagging model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1724–1732, 2024.

  66. [66] Yiming Zhang, Zhe Wang, Xinjie Li, Yunchen Yuan, Chengsong Zhang, Xiao Sun, Zhihang Zhong, and Jian Wang. DiffBody: Human body restoration by imagining with generative diffusion prior. arXiv preprint arXiv:2404.03642, 2024.

  67. [67] Mo Zhou, Keren Ye, Viraj Shah, Kangfu Mei, Mauricio Delbracio, Peyman Milanfar, Vishal M. Patel, and Hossein Talebi. Reference-guided identity preserving face restoration. arXiv preprint arXiv:2505.21905, 2025.

  68. [68] Shangchen Zhou, Kelvin Chan, Chongyi Li, and Chen Change Loy. Towards robust blind face restoration with codebook lookup transformer. Advances in Neural Information Processing Systems, 35:30599–30611, 2022.