pith. machine review for the scientific record.

arxiv: 2603.16570 · v2 · submitted 2026-03-17 · 💻 cs.CV

Recognition: 1 theorem link

· Lean Theorem

Face2Scene: Using Facial Degradation as an Oracle for Diffusion-Based Scene Restoration

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 09:39 UTC · model grok-4.3

classification 💻 cs.CV
keywords image restoration · face restoration · diffusion models · degradation estimation · reference-based restoration · scene restoration · conditional diffusion · identity references

The pith

Facial degradation extracted from restored faces can condition diffusion models to restore entire degraded scenes including body and background.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a degraded face can serve as a perceptual oracle for estimating degradation across an entire image. It first restores the face using reference-based face restoration, then derives a degradation code from the restored-degraded face pair that captures attributes such as noise, blur, and compression. This code is converted into multi-scale tokens that condition a diffusion model, enabling single-step restoration of the full scene. The method addresses the gap where face-focused restorers ignore non-facial regions and full-scene methods produce artifacts by ignoring degradation cues.
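As a structural sketch only (the paper's code is not public; every callable below is a hypothetical stand-in for the components the review names: Ref-FR, FaDeX, MapNet, and the one-step diffusion restorer), the two-stage flow reads:

```python
# Hypothetical sketch of the Face2Scene pipeline as summarized above.
# The component models are passed in as callables because no public
# interfaces exist; names mirror the paper's labels, not a released API.

def face2scene_restore(lq_image, face_bbox, identity_refs,
                       ref_fr, fadex, mapnet, restorer):
    """Two-stage restoration: face first, then the full scene."""
    # Stage I: reference-based face restoration on the cropped LQ face.
    y0, x0, y1, x1 = face_bbox
    lq_face = lq_image[y0:y1, x0:x1]
    hq_face = ref_fr(lq_face, identity_refs)

    # Stage II: the restored-degraded pair serves as the degradation oracle.
    deg_code = fadex(lq_face, hq_face)   # face-derived degradation code
    tokens = mapnet(deg_code)            # multi-scale degradation-aware tokens

    # Single-step diffusion restoration of the entire image (face, body,
    # background), conditioned on the degradation-aware tokens.
    return restorer(lq_image, cond=tokens)
```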

Core claim

Given a degraded image and one or more identity references, apply a Ref-FR model to reconstruct high-quality facial details. From the restored-degraded face pair, extract a face-derived degradation code that captures degradation attributes such as noise, blur, and compression, which is then transformed into multi-scale degradation-aware tokens. These tokens condition a diffusion model to restore the full scene in a single step, including the body and background.

What carries the argument

Face-derived degradation code extracted from the restored-degraded face pair and transformed into multi-scale degradation-aware tokens that condition the diffusion model for full-scene restoration.
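A hedged illustration of what "multi-scale degradation-aware tokens" could look like in practice; the dimensions, token counts, and module name below are assumptions for the sketch, not the paper's MapNet specification:

```python
import torch
import torch.nn as nn

# Illustrative mapper from a single degradation code to per-scale token sets.
# Each set would feed a cross-attention layer at the matching UNet depth.

class MultiScaleTokenMapper(nn.Module):
    def __init__(self, code_dim=512, token_dim=768, tokens_per_scale=4, n_scales=3):
        super().__init__()
        # One small projection head per resolution level of the diffusion backbone.
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(code_dim, token_dim * tokens_per_scale), nn.GELU())
            for _ in range(n_scales)
        ])
        self.tokens_per_scale = tokens_per_scale
        self.token_dim = token_dim

    def forward(self, deg_code):  # deg_code: (B, code_dim)
        # Returns a list of (B, tokens_per_scale, token_dim) tensors, one per scale.
        return [h(deg_code).view(-1, self.tokens_per_scale, self.token_dim)
                for h in self.heads]
```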

Load-bearing premise

The degradation attributes captured from the face are representative of the degradation present across the entire scene, including the body and background.

What would settle it

A set of test images in which the face receives one type of degradation while the background and body receive a clearly different type, such as added blur only to the face, would show whether the method applies the wrong restoration to non-facial regions.
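One way to build such a diagnostic set, sketched under the assumption that face bounding boxes are available; the box convention and blur strength are arbitrary illustrative choices:

```python
import cv2
import numpy as np

# Construct a face/background degradation mismatch: blur only the face region
# so the face "oracle" reports a degradation the rest of the scene lacks.

def face_only_blur(image: np.ndarray, face_bbox, ksize=21, sigma=6.0):
    y0, x0, y1, x1 = face_bbox
    out = image.copy()
    # ksize must be odd for cv2.GaussianBlur.
    out[y0:y1, x0:x1] = cv2.GaussianBlur(image[y0:y1, x0:x1], (ksize, ksize), sigma)
    return out

# Usage: run the restorer on face_only_blur(img, bbox) and inspect whether the
# already-sharp background is needlessly "deblurred" (oversharpened or hallucinated).
```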

Figures

Figures reproduced from arXiv: 2603.16570 by Alex Levinshtein, Amanpreet Walia, Amirhossein Kazerouni, Babak Taati, Iqbal Mohomed, Konstantinos G. Derpanis, Maitreya Suin, Sina Honari, Tristan Aumentado-Armstrong.

Figure 1
Figure 1. Overview of Face2Scene. We infer a face-derived degradation code from identity references and the observed LQ face, then use it as oracle guidance to restore the full scene (face, body, background) with a one-step diffusion restorer. view at source ↗
Figure 2
Figure 2. Overview of the Face2Scene pipeline. In stage I, we leverage a set of reference faces to restore the LQ face crop. In stage II, we use the pair of LQ and HQ faces to extract a guided degradation with FaDeX and inject it into a one-step diffusion model using MapNet. The diffusion model then reconstructs the full-scene image. view at source ↗
Figure 3
Figure 3. Visual comparison of Face2Scene with the three top-performing methods from the quantitative results (zoom in to see details). view at source ↗
Figure 4
Figure 4. Cosine similarity analysis. We show the cosine similarity across embeddings of image pairs with different degradations. (Left) similarities per degradation type (averaged over images). (Right) similarities per image, averaged over degradation types. Shaded area shows standard deviation. This confirms FaDeX isolates degradation from image content. view at source ↗
Figure 5
Figure 5. Scene diversity visualization. Word-cloud representation of the semantic tags extracted from our synthetic dataset using the RAM++ [65] image-tagging model. Larger words indicate tags that occur more frequently across generated images, highlighting the broad diversity of scenes captured in our dataset. view at source ↗
Figure 6
Figure 6. Four representative images of our synthetic InScene dataset. Each sample is shown alongside its structured prompt. We also show the stored metadata, including facial landmarks, bounding boxes, and associated identity labels. view at source ↗
Figure 7
Figure 7. Real dataset preprocessing pipeline. Our real-world images are standardized through a three-stage filtering process. Stage 1 removes images with insufficient resolution and ensures a detectable face is present. Stage 2 enforces content quality by verifying that the face region is sufficiently large and the image is not overly blurred (e.g., due to defocus blur in the background, as in the example image sho… view at source ↗
Figure 8
Figure 8. Overview of FaDeX contrastive learning. Momentum-encoder contrastive training setup with dual encoders and a queue of negatives to learn degradation-discriminative face embeddings. view at source ↗
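For readers unfamiliar with the setup the caption names, a generic MoCo-style objective looks like the following. The positive-pair definition (same degradation applied to different content, as the caption implies) and the hyperparameters are assumptions, not FaDeX's published recipe:

```python
import torch
import torch.nn.functional as F

def info_nce(q, k_pos, queue, temperature=0.07):
    """q, k_pos: (B, D) L2-normalized embeddings; queue: (K, D) negative keys."""
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)  # (B, 1) positive logits
    l_neg = q @ queue.t()                          # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    # The positive key sits at index 0 for every query.
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)

@torch.no_grad()
def momentum_update(key_encoder, query_encoder, m=0.999):
    # Exponential moving average of the query encoder into the key encoder.
    for pk, pq in zip(key_encoder.parameters(), query_encoder.parameters()):
        pk.mul_(m).add_(pq, alpha=1 - m)
```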
Figure 9
Figure 9. Model complexity comparison. We compare inference time across different diffusion-based restoration models. For Face2Scene, we show the separate time costs of the reference-based preprocessing and our final restoration as two different shades. While our method is not as efficient as standard one-step models, which do not require an initial face restoration step, it is still considerably faster than curren… view at source ↗
Figure 10
Figure 10. Effect of CFG scaling. We vary the CFG scaling factor from 1.00 to 1.16 on the InScene synthetic validation set. The top row shows no-reference metrics and the bottom row shows reference-based metrics, illustrating the trade-off between perceptual quality and fidelity. In particular, although there is some noise in the trends, one can see that the no-reference quality metrics tend to improve as the CFG we… view at source ↗
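For reference, the standard classifier-free guidance combination that such a CFG sweep varies (the paper's exact parameterization may differ):

```latex
\hat{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing)
  + w \,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr),
\qquad w \in [1.00,\, 1.16]
```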
Figure 11
Figure 11. Visual comparison on InScene Synthetic dataset across seven methods. view at source ↗
Figure 12
Figure 12. Visual comparison on InScene Synthetic dataset across seven methods. view at source ↗
Figure 13
Figure 13. Visual comparison on InScene Synthetic dataset across seven methods. view at source ↗
Figure 14
Figure 14. Visual comparison on InScene Synthetic dataset across seven methods. view at source ↗
Figure 15
Figure 15. Visual comparison on InScene Synthetic dataset across seven methods. view at source ↗
Figure 16
Figure 16. Visual comparison on InScene Synthetic dataset across seven methods. view at source ↗
Figure 17
Figure 17. Visual comparison on real validation across seven methods. view at source ↗
Figure 18
Figure 18. Visual comparison on real validation across seven methods. view at source ↗
Figure 19
Figure 19. Visual comparison across restoration methods on real test samples. view at source ↗
Figure 20
Figure 20. Visual comparison across restoration methods on real test samples. view at source ↗
Figure 21
Figure 21. Visual comparison across restoration methods on real test samples. view at source ↗
Figure 22
Figure 22. Failure Cases. Our method, like other SD 2.1-based restorers, struggles to recover fine text and very small faces, reflecting the backbone’s limited text rendering ability and difficulty preserving details in tiny spatial regions. More advanced backbones such as FLUX or SD3.5, which provide higher native resolution and stronger text/face modeling, could alleviate these issues. view at source ↗
read the original abstract

Recent advances in image restoration have enabled high-fidelity recovery of faces from degraded inputs using reference-based face restoration models (Ref-FR). However, such methods focus solely on facial regions, neglecting degradation across the full scene, including body and background, which limits practical usability. Meanwhile, full-scene restorers often ignore degradation cues entirely, leading to underdetermined predictions and visual artifacts. In this work, we propose Face2Scene, a two-stage restoration framework that leverages the face as a perceptual oracle to estimate degradation and guide the restoration of the entire image. Given a degraded image and one or more identity references, we first apply a Ref-FR model to reconstruct high-quality facial details. From the restored-degraded face pair, we extract a face-derived degradation code that captures degradation attributes (e.g., noise, blur, compression), which is then transformed into multi-scale degradation-aware tokens. These tokens condition a diffusion model to restore the full scene in a single step, including the body and background. Extensive experiments demonstrate the superior effectiveness of the proposed method compared to state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces Face2Scene, a two-stage framework for full-scene image restoration. Given a degraded input and identity references, a Ref-FR model first restores the face; a degradation code is then extracted from the restored-degraded face pair, converted to multi-scale tokens, and used to condition a diffusion model that restores the entire image (including body and background) in one step. The central claim is that this face-derived oracle yields superior restoration quality over existing methods, supported by extensive experiments.

Significance. If the face-to-scene degradation transfer holds, the approach offers a practical way to supply degradation cues to otherwise underdetermined scene restorers, potentially improving usability in real-world settings where faces are salient. The integration of Ref-FR outputs with diffusion conditioning is a coherent extension of prior work and could influence hybrid restoration pipelines.

major comments (1)
  1. [Method overview] The claim, made in the abstract and method description, that a single face-derived degradation code suffices to condition restoration of the full scene rests on the untested assumption that degradation attributes (noise, blur, compression) extracted from the face are representative of spatially variant degradation elsewhere. No ablation or analysis quantifies failure modes under localized motion blur or texture-dependent artifacts, which directly undermines the superiority claim.
minor comments (1)
  1. [Abstract] The abstract states 'extensive experiments' without naming datasets, metrics, or baselines; adding these details in the abstract or a dedicated table would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed review and constructive comments on our paper. We address the major comment point by point below. We agree that additional analysis on spatially variant degradations would enhance the manuscript and will incorporate it in the revision.

read point-by-point responses
  1. Referee: [Method overview] The claim, made in the abstract and method description, that a single face-derived degradation code suffices to condition restoration of the full scene rests on the untested assumption that degradation attributes (noise, blur, compression) extracted from the face are representative of spatially variant degradation elsewhere. No ablation or analysis quantifies failure modes under localized motion blur or texture-dependent artifacts, which directly undermines the superiority claim.

    Authors: We appreciate the referee pointing out this potential limitation. Our experiments across multiple datasets with various real-world degradations demonstrate that the face-derived degradation code effectively guides the diffusion model to restore the full scene with superior quality compared to baselines. This suggests that the degradation attributes are sufficiently representative in practice for the types of degradations considered. Nevertheless, we recognize the importance of evaluating under localized degradations. In the revised version, we will include an ablation study using synthetically generated localized motion blur and texture-specific artifacts to quantify the failure modes and discuss the conditions under which the assumption holds. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation builds on external Ref-FR and diffusion components

full rationale

The paper's core pipeline applies an existing reference-based face restoration (Ref-FR) model to a degraded input, extracts a degradation code from the resulting restored-degraded face pair, converts that code into multi-scale tokens, and uses the tokens to condition a diffusion model for full-scene restoration. None of these steps reduce by construction to the target output or rely on a self-citation chain whose validity is presupposed by the present work. The extraction of the degradation code is presented as a direct computation from the face pair rather than a fitted parameter renamed as a prediction, and no uniqueness theorem or ansatz is imported from prior work by the same authors to force the architecture. The method therefore remains self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that face-derived degradation is a good proxy for scene degradation, which is not independently evidenced in the abstract.

axioms (1)
  • domain assumption: Reference-based face restoration models can accurately recover high-quality facial details from degraded inputs given identity references.
    This is invoked in the first stage of the framework.

pith-pipeline@v0.9.0 · 5533 in / 1225 out tokens · 32233 ms · 2026-05-15T09:39:23.395581+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    From the restored–degraded face pair, we extract a face-derived degradation code that captures degradation attributes (e.g., noise, blur, compression), which is then transformed into multi-scale degradation-aware tokens.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 1 internal anchor

  1. [1] Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. Flux.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv e-prints, arXiv:2506.

  2. [2] Chunyang Bi, Xin Luo, Sheng Shen, Mengxi Zhang, Huanjing Yue, and Jingyu Yang. DeeDSR: Towards real-world image super-resolution via degradation-aware stable diffusion. arXiv preprint arXiv:2404.00661, 2024.

  3. [3] Yochai Blau and Tomer Michaeli. The perception-distortion tradeoff. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6228–6237, 2018.

  4. [4] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.

  5. [5] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021.

  6. [6] Pradyumna Chari, Sizhuo Ma, Daniil Ostashev, Achuta Kadambi, Gurunandan Krishnan, Jian Wang, and Kfir Aberman. Personalized restoration via dual-pivot tuning. IEEE Transactions on Image Processing, 2025.

  7. [7] Chaofeng Chen, Jiadi Mo, Jingwen Hou, Haoning Wu, Liang Liao, Wenxiu Sun, Qiong Yan, and Weisi Lin. TOPIQ: A top-down approach from semantics to distortions for image quality assessment. IEEE Transactions on Image Processing, 33:2404–2418, 2024.

  8. [8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.

  9. [9] Zheng Chen, Yulun Zhang, Jinjin Gu, Xin Yuan, Linghe Kong, Guihai Chen, and Xiaokang Yang. Image super-resolution with text prompt diffusion. arXiv preprint arXiv:2311.14282, 2023.

  10. [10] Min Jin Chong, Dejia Xu, Yi Zhang, Zhangyang Wang, David Forsyth, Gurunandan Krishnan, Yicheng Wu, and Jian Wang. Copy or not? Reference-based face image restoration with fine details. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 9660–9669. IEEE, 2025.

  11. [11] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.

  12. [12] Keyan Ding, Kede Ma, Shiqi Wang, and Eero P. Simoncelli. Image quality assessment: Unifying structure and texture similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(5):2567–2581, 2020.

  13. [13] Zheng Ding, Xuaner Zhang, Zhuowen Tu, and Zhihao Xia. Restoration by generation with constrained priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2567–2577, 2024.

  14. [14] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.

  15. [15] Jue Gong, Jingkai Wang, Zheng Chen, Xing Liu, Hong Gu, Yulun Zhang, and Xiaokang Yang. Human body restoration with one-step diffusion model and a new benchmark. In International Conference on Machine Learning, 2025.

  16. [16] Jue Gong, Tingyu Yang, Jingkai Wang, Zheng Chen, Xing Liu, Hong Gu, Yulun Zhang, and Xiaokang Yang. HAODiff: Human-aware one-step diffusion via dual-prompt guidance. arXiv preprint arXiv:2505.19742, 2025.

  17. [17] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Conference on Computer Vision and Pattern Recognition, 2020.

  18. [18] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

  19. [19] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

  20. [20] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

  21. [21] Chi-Wei Hsiao, Yu-Lun Liu, Cheng-Kun Yang, Sheng-Po Kuo, Kevin Jou, and Chia-Ping Chen. Ref-LDM: A latent diffusion model for reference-based face image restoration. Advances in Neural Information Processing Systems, 37:74840–74867, 2024.

  22. [22] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.

  23. [23] Liming Jiang, Qing Yan, Yumin Jia, Zichuan Liu, Hao Kang, and Xin Lu. InfiniteYou: Flexible photo recrafting while preserving your identity. In International Conference on Computer Vision (ICCV), 2025.

  24. [24] Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. MUSIQ: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5148–5157, 2021.

  25. [25] Dehong Kong, Fan Li, Zhixin Wang, Jiaqi Xu, Renjing Pei, Wenbo Li, and WenQi Ren. Dual prompting image restoration with diffusion transformers. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12809–12819, 2025.

  26. [26] Nupur Kumari, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Ensembling off-the-shelf models for GAN training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10651–10662, 2022.

  27. [27] Xiaoming Li, Chaofeng Chen, Xianhui Lin, Wangmeng Zuo, and Lei Zhang. From face to natural image: Learning real degradation for blind image super-resolution. In European Conference on Computer Vision, pages 376–392. Springer, 2022.

  28. [28] Xiaoming Li, Shiguang Zhang, Shangchen Zhou, Lei Zhang, and Wangmeng Zuo. Learning dual memory dictionaries for blind face restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(5):5904–5917, 2022.

  29. [29] Esther Y. H. Lin, Zhecheng Wang, Rebecca Lin, Daniel Miau, Florian Kainz, Jiawen Chen, Xuaner Zhang, David B. Lindell, and Kiriakos N. Kutulakos. Learning lens blur fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

  30. [30] Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Bo Dai, Fanghua Yu, Yu Qiao, Wanli Ouyang, and Chao Dong. DiffBIR: Toward blind image restoration with generative diffusion prior. In European Conference on Computer Vision, pages 430–448. Springer, 2024.

  31. [31] Siyu Liu, Zheng-Peng Duan, Jia OuYang, Jiayi Fu, Hyunhee Park, Zikun Liu, Chun-Le Guo, and Chongyi Li. FaceMe: Robust blind face restoration with personal identification. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5567–5575, 2025.

  32. [32] Yunan Liu, Shanshan Zhang, Jie Xu, Jian Yang, and Yu-Wing Tai. An accurate and lightweight method for human body image super-resolution. IEEE Transactions on Image Processing, 30:2888–2897, 2021.

  33. [33] Yingqi Liu, Jingwen He, Yihao Liu, Xinqi Lin, Fanghua Yu, Jinfan Hu, Yu Qiao, and Chao Dong. AdaptBIR: Adaptive blind image restoration with latent diffusion prior for higher fidelity. Pattern Recognition, 155:110659, 2024.

  34. [34] Chong Mou, Yanze Wu, Xintao Wang, Chao Dong, Jian Zhang, and Ying Shan. Metric learning based interactive modulation for real-world super-resolution. In European Conference on Computer Vision, pages 723–740. Springer, 2022.

  35. [35] Chenyang Qi, Zhengzhong Tu, Keren Ye, Mauricio Delbracio, Peyman Milanfar, Qifeng Chen, and Hossein Talebi. TIP: Text-driven image processing with semantic and restoration instructions. CoRR, 2023.

  36. [36] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.

  37. [37] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

  38. [38] Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. In European Conference on Computer Vision, pages 87–103. Springer, 2024.

  39. [39] Keda Tao, Jinjin Gu, Yulun Zhang, Xiucheng Wang, and Nan Cheng. Overcoming false illusions in real-world face restoration with multi-modal guided diffusion model. arXiv preprint arXiv:2410.04161, 2024.

  40. [40] Keda Tao, Jinjin Gu, Yulun Zhang, Xiucheng Wang, and Nan Cheng. Overcoming false illusions in real-world face restoration with multi-modal guided diffusion model. In The Thirteenth International Conference on Learning Representations, 2025.

  41. [41] Tuomas Varanka, Tapani Toivonen, Soumya Tripathy, Guoying Zhao, and Erman Acar. PFStorer: Personalized face restoration and super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2372–2381, 2024.

  42. [42] Jianyi Wang, Kelvin C. K. Chan, and Chen Change Loy. Exploring CLIP for assessing the look and feel of images. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2555–2563, 2023.

  43. [43] Jingkai Wang, Jue Gong, Lin Zhang, Zheng Chen, Xing Liu, Hong Gu, Yutong Liu, Yulun Zhang, and Xiaokang Yang. OSDFace: One-step diffusion model for face restoration. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12626–12636, 2025.

  44. [44] Jingkai Wang, Wu Miao, Jue Gong, Zheng Chen, Xing Liu, Hong Gu, Yutong Liu, and Yulun Zhang. HonestFace: Towards honest face restoration with one-step diffusion model. arXiv preprint arXiv:2505.18469, 2025.

  45. [45] Longguang Wang, Yingqian Wang, Xiaoyu Dong, Qingyu Xu, Jungang Yang, Wei An, and Yulan Guo. Unsupervised degradation representation learning for blind super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10581–10590, 2021.

  46. [46] Simiao Wang, Yu Sang, Yunan Liu, Chunpeng Wang, Mingyu Lu, and Jinguang Sun. Prior based pyramid residual clique network for human body image super-resolution. Pattern Recognition, 150:110352, 2024.

  47. [47] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. ESRGAN: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.

  48. [48] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1905–1914, 2021.

  49. [49] Yufei Wang, Wenhan Yang, Xinyuan Chen, Yaohui Wang, Lanqing Guo, Lap-Pui Chau, Ziwei Liu, Yu Qiao, Alex C. Kot, and Bihan Wen. SinSR: Diffusion-based image super-resolution in a single step. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25796–25805, 2024.

  50. [50] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

  51. [51] Zhouxia Wang, Jiawei Zhang, Tianshui Chen, Wenping Wang, and Ping Luo. RestoreFormer++: Towards real-world blind face restoration from undegraded key-value pairs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):15462–15476, 2023.

  52. [52] Rongyuan Wu, Lingchen Sun, Zhiyuan Ma, and Lei Zhang. One-step effective diffusion network for real-world image super-resolution. Advances in Neural Information Processing Systems, 37:92529–92553, 2024.

  53. [53] Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. MANIQA: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1191–1200, 2022.

  54. [54] Tao Yang, Rongyuan Wu, Peiran Ren, Xuansong Xie, and Lei Zhang. Pixel-aware stable diffusion for realistic image super-resolution and personalized stylization. In European Conference on Computer Vision, pages 74–91. Springer, 2024.

  55. [55] Zhicun Yin, Junjie Chen, Ming Liu, Zhixin Wang, Fan Li, Renjing Pei, Xiaoming Li, Rynson W. H. Lau, and Wangmeng Zuo. RefSTAR: Blind facial image restoration with reference selection, transfer, and reconstruction. arXiv preprint arXiv:2507.10470, 2025.

  56. [56] Jiacheng Ying, Mushui Liu, Zhe Wu, Runming Zhang, Zhu Yu, Siming Fu, Si-Yuan Cao, Chao Wu, Yunlong Yu, and Hui-Liang Shen. RestoreRID: Towards tuning-free face restoration with ID preservation. arXiv preprint arXiv:2411.14125, 2024.

  57. [57] Fanghua Yu, Jinjin Gu, Zheyuan Li, Jinfan Hu, Xiangtao Kong, Xintao Wang, Jingwen He, Yu Qiao, and Chao Dong. Scaling up to excellence: Practicing model scaling for photo-realistic image restoration in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25669–25680, 2024.

  58. [58] Zongsheng Yue, Jianyi Wang, and Chen Change Loy. Efficient diffusion model for image restoration by residual shifting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

  59. [59] Zongsheng Yue, Kang Liao, and Chen Change Loy. Arbitrary-steps image super-resolution via diffusion inversion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23153–23163, 2025.

  60. [60] Aiping Zhang, Zongsheng Yue, Renjing Pei, Wenqi Ren, and Xiaochun Cao. Degradation-guided one-step image super-resolution with diffusion priors. arXiv preprint arXiv:2409.17058, 2024.

  61. [61] Howard Zhang, Yuval Alaluf, Sizhuo Ma, Achuta Kadambi, Jian Wang, and Kfir Aberman. InstantRestore: Single-step personalized face restoration with shared-image attention. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques (SIGGRAPH) Conference Papers, pages 1–10, 2025.

  62. [62] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.

  63. [63] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.

  64. [64] Weixia Zhang, Guangtao Zhai, Ying Wei, Xiaokang Yang, and Kede Ma. Blind image quality assessment via vision-language correspondence: A multitask learning perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14071–14081, 2023.

  65. [65] Youcai Zhang, Xinyu Huang, Jinyu Ma, Zhaoyang Li, Zhaochuan Luo, Yanchun Xie, Yuzhuo Qin, Tong Luo, Yaqian Li, Shilong Liu, et al. Recognize anything: A strong image tagging model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1724–1732, 2024.

  66. [66] Yiming Zhang, Zhe Wang, Xinjie Li, Yunchen Yuan, Chengsong Zhang, Xiao Sun, Zhihang Zhong, and Jian Wang. DiffBody: Human body restoration by imagining with generative diffusion prior. arXiv preprint arXiv:2404.03642, 2024.

  67. [67] Mo Zhou, Keren Ye, Viraj Shah, Kangfu Mei, Mauricio Delbracio, Peyman Milanfar, Vishal M. Patel, and Hossein Talebi. Reference-guided identity preserving face restoration. arXiv preprint arXiv:2505.21905, 2025.

  68. [68] Shangchen Zhou, Kelvin Chan, Chongyi Li, and Chen Change Loy. Towards robust blind face restoration with codebook lookup transformer. Advances in Neural Information Processing Systems, 35:30599–30611, 2022.