
arxiv: 2605.23451 · v1 · pith:6LMTFGM5new · submitted 2026-05-22 · 💻 cs.CV

Efficient One-Step Diffusion Restoration Model with Compact Token Compression and Linear Attention

Pith reviewed 2026-05-25 04:39 UTC · model grok-4.3

classification 💻 cs.CV
keywords real-world image super-resolution · diffusion models · token compression · linear attention · efficient inference · image restoration · DiT · one-step generation

The pith

SANA-SR restores real-world images via 32x token compression and linear-attention DiT in a single diffusion step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies excessive token redundancy and quadratic-cost token interactions as the core barrier to practical high-resolution real-world image super-resolution. It counters this by first applying a deep compression autoencoder with a 32x compression ratio, which sharply reduces the number of latent tokens while keeping restoration-relevant structures, and then running a linear-attention diffusion transformer with LoRA fine-tuning on that compact space. The resulting one-step model is reported to match or exceed existing methods on standard benchmarks in both quantitative scores and visual texture realism. A reader would care because the pruned version is reported to deliver this quality in 0.019 seconds with 407.95G MACs and 344M parameters, which the authors present as opening the door to mobile deployment.

Core claim

SANA-SR is an efficient one-step restoration framework that employs a deep compression autoencoder with a 32x compression ratio to drastically reduce latent tokens while preserving restoration-relevant structures and textures. On top of this compact latent space, a linear-attention DiT with LoRA fine-tuning performs high-resolution restoration with linear-complexity token mixing. Extensive experiments on all benchmark datasets show that SANA-SR achieves highly competitive and often superior quantitative performance against existing methods while restoring clearer and more realistic textures, and the deployed model runs in 0.019s with 407.95G MACs and 344M parameters.

What carries the argument

Deep compression autoencoder at 32x ratio combined with linear-attention DiT for token mixing.
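
To make the token-reduction argument concrete, here is a minimal sketch of the arithmetic. It assumes that "32x" is a per-side spatial downsampling factor and compares against an 8x autoencoder as a stand-in for a conventional dense latent space; the visible text does not define the ratio, so both choices are assumptions, and `latent_tokens` is a hypothetical helper.

```python
def latent_tokens(height, width, downsample, patch=1):
    """Count latent tokens for an autoencoder with per-side downsampling.

    Assumption: `downsample` applies to each spatial side, and `patch` is
    the patch size of the transformer's patch embedding.
    """
    stride = downsample * patch
    if height % stride or width % stride:
        raise ValueError("image size must be divisible by downsample * patch")
    return (height // stride) * (width // stride)


if __name__ == "__main__":
    # The 8x baseline is an illustrative comparison point, not a paper figure.
    for size in (512, 1024, 2048):
        dense = latent_tokens(size, size, downsample=8)
        compact = latent_tokens(size, size, downsample=32)
        print(f"{size}x{size}: {dense} tokens at 8x, {compact} at 32x, "
              f"{dense // compact}x fewer")
```

Under this per-side reading, moving from 8x to 32x cuts the token count by 16x at every resolution, and any quadratic-cost attention over those tokens would shrink by 256x. If the paper instead defines the ratio differently, the numbers change accordingly.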

If this is right

  • The model matches or exceeds existing Real-ISR methods on quantitative metrics across all tested benchmarks.
  • Restored images exhibit clearer and more realistic textures than prior generative approaches.
  • After pruning, inference completes in 0.019 seconds using 407.95G MACs and 344M parameters.
  • The linear-complexity design removes the unfavorable scaling of computation and memory with image resolution.
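
The linear-complexity claim in the last bullet rests on reordering the attention product. The sketch below illustrates the general kernelized linear-attention idea in NumPy; the ReLU feature map and the function names are assumptions for illustration, since the visible text does not specify the paper's exact formulation.

```python
import numpy as np


def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention computed in O(N * d * d_v) time.

    q, k: (N, d) arrays; v: (N, d_v) array.
    Assumption: ReLU is used as the feature map.
    """
    phi_q, phi_k = np.maximum(q, 0.0), np.maximum(k, 0.0)
    kv = phi_k.T @ v              # (d, d_v) summary, independent of N
    z = phi_k.sum(axis=0)         # (d,) normalizer summary
    numerator = phi_q @ kv        # (N, d_v)
    denominator = phi_q @ z + eps # (N,)
    return numerator / denominator[:, None]


def same_attention_quadratic_order(q, k, v, eps=1e-6):
    """Same result, but materializing the (N, N) similarity matrix."""
    phi_q, phi_k = np.maximum(q, 0.0), np.maximum(k, 0.0)
    similarity = phi_q @ phi_k.T  # (N, N): the quadratic-cost step
    return (similarity @ v) / (similarity.sum(axis=-1, keepdims=True) + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((256, 32)) for _ in range(3))
    fast = linear_attention(q, k, v)
    slow = same_attention_quadratic_order(q, k, v)
    print(np.allclose(fast, slow))
```

Both functions compute the same output; the first never builds an N-by-N matrix, which is why its cost and memory grow linearly with the number of tokens. Note that this kernelized form is not numerically identical to softmax attention, so whether it suffices for restoration is an empirical question.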

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same compression-plus-linear-attention pattern could be tested on related tasks such as real-world denoising or deblurring.
  • Further increases in compression ratio beyond 32x could be explored if the autoencoder continues to retain high-frequency texture cues.
  • The LoRA fine-tuning step on the linear DiT suggests a route for adapting the model to new degradation distributions without full retraining.
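
For readers unfamiliar with the LoRA step mentioned in the last bullet, the following is a minimal NumPy sketch of a low-rank adapter on a single frozen linear layer. The class name, rank, scaling, and initialization are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (sketch)."""

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                             # frozen base weight
        self.a = rng.normal(0.0, 0.01, (rank, d_in))     # trainable down-projection
        self.b = np.zeros((d_out, rank))                 # trainable up-projection, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # Base output plus the scaled low-rank correction.
        return x @ self.weight.T + self.scale * (x @ self.a.T) @ self.b.T

    def merged_weight(self):
        # The adapter can be folded into the base weight for deployment.
        return self.weight + self.scale * self.b @ self.a

    def trainable_parameters(self):
        return self.a.size + self.b.size


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    base = rng.standard_normal((64, 64))
    layer = LoRALinear(base, rank=4)
    x = rng.standard_normal((8, 64))
    print(np.allclose(layer(x), x @ base.T))             # unchanged at initialization
    print(layer.trainable_parameters(), base.size)
```

Because the up-projection starts at zero, the adapted layer initially reproduces the frozen model, and only the two small matrices need training. Adapting to a new degradation distribution would then mean training a new adapter pair while leaving the base weights untouched.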

Load-bearing premise

The 32x compression autoencoder preserves all restoration-relevant structures and textures without introducing artifacts that later stages cannot correct.

What would settle it

Running SANA-SR on the standard Real-ISR benchmark suites and finding either lower perceptual quality scores or visible uncorrectable artifacts compared with quadratic-attention baselines would falsify the performance claim.

Figures

Figures reproduced from arXiv: 2605.23451 by Bingtian Qiao, Guangtao Zhai, Jiezhang Cao, Yingjie Zhou, Yong Guo, Yue Shi.

Figure 1
Figure 1: SANA-SR achieves a strong quality–efficiency trade-off for real-world image super-resolution. Left: qualitative comparison on a real LR input against seven baselines; the yellow box marks the zoomed region. Right: DRealSR scatter of normalized perceptual score versus inference time; marker color encodes method family and marker size scales with parameter count. SANA-SR yields the best perceptual score at the lowest latency.
Figure 2
Figure 2: Overview of SANA-SR. Given an LQ input, SANA-SR first maps the image into a compact latent space with a frozen deep-compression VAE, then restores the latent with a prompt-conditioned one-step LinearDiT adapted by LoRA. Training is regularized by frozen-prior alignment and adapter consistency, and the final model is further compressed by prompt-aware structured pruning for efficient deployment.
Figure 3
Figure 3: Qualitative comparison on challenging examples from DRealSR.
Figure 5
Figure 5: Additional qualitative comparison on examples from DIV2K-Val, RealSR, and DRealSR.

Original abstract

Real-world image super-resolution aims to recover high-quality images from complex and unknown real-world degradations. However, existing generative Real-ISR methods largely inherit the dense latent representations and quadratic-cost global modeling paradigm developed for high-resolution image synthesis, causing computation, memory usage, and inference latency to scale unfavorably with resolution and thus limiting practical deployment. We argue that the key bottleneck lies not in insufficient restoration priors, but in excessive token redundancy and costly token interactions during high-resolution restoration. Motivated by this observation, we revisit Real-ISR from the perspectives of compact latent representation and linear-complexity modeling, and propose SANA-SR, an efficient one-step restoration framework. Specifically, SANA-SR employs a deep compression autoencoder with a 32x compression ratio to drastically reduce latent tokens while preserving restoration-relevant structures and textures. On top of this compact latent space, we introduce a linear-attention DiT with LoRA fine-tuning, enabling efficient high-resolution restoration with linear-complexity token mixing. Extensive experiments on all benchmark datasets demonstrate that SANA-SR achieves highly competitive and often superior quantitative performance against existing methods, while restoring clearer and more realistic textures. Moreover, after pruning, the deployed model runs in 0.019s with 407.95G MACs and 344M parameters, highlighting its strong potential for practical mobile deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes SANA-SR, an efficient one-step diffusion-based framework for real-world image super-resolution. It identifies token redundancy and quadratic attention costs as the primary bottlenecks in existing generative Real-ISR methods and addresses them via a deep compression autoencoder (32x ratio) to produce compact latent tokens while preserving structures and textures, followed by a linear-attention DiT backbone with LoRA fine-tuning for linear-complexity token mixing. The authors report that the resulting model achieves highly competitive or superior quantitative performance (PSNR/SSIM/perceptual metrics) on standard benchmarks, restores clearer textures, and after pruning runs at 0.019 s inference with 407.95 G MACs and 344 M parameters.

Significance. If the central claims hold, the work would be significant for enabling practical, mobile deployment of high-quality generative Real-ISR by demonstrating that extreme latent compression combined with linear attention can maintain restoration fidelity at dramatically reduced compute and latency. The explicit focus on token redundancy rather than prior insufficiency, together with the reported efficiency numbers, offers a concrete path toward scalable restoration models.
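
As a quick sanity check on the reported efficiency figures, the sketch below derives the sustained throughput they imply. The hardware is not stated in the visible text, and the fp16 storage assumption and helper names are hypothetical.

```python
def implied_throughput_tmacs(macs, latency_s):
    """Sustained multiply-accumulate rate, in tera-MACs per second."""
    return macs / latency_s / 1e12


def weight_megabytes(params, bytes_per_param=2):
    """Weight storage in decimal megabytes; 2 bytes per weight assumes fp16."""
    return params * bytes_per_param / 1e6


if __name__ == "__main__":
    # Figures as reported in the abstract for the pruned model.
    macs, latency_s, params = 407.95e9, 0.019, 344e6
    print(f"{implied_throughput_tmacs(macs, latency_s):.2f} TMAC/s")
    print(f"{weight_megabytes(params):.0f} MB of weights at fp16")
```

The reported numbers imply roughly 21.5 TMAC/s of sustained throughput, so the 0.019 s latency should be read together with whatever device the full paper names for that measurement.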

major comments (2)
  1. [Abstract, §3] Abstract and §3 (method): The claim that the 32x deep compression autoencoder 'preserves restoration-relevant structures and textures' is load-bearing for both the performance and efficiency assertions, yet no ablation quantifies information loss at this ratio or demonstrates that degradation-specific high-frequency cues remain recoverable by the linear-attention DiT (even with LoRA). If the autoencoder discards unrecoverable details, the reported competitive metrics and 0.019 s latency cannot simultaneously hold.
  2. [§4] §4 (experiments): The abstract asserts 'highly competitive and often superior quantitative performance' and 'clearer and more realistic textures' but supplies no tables, baselines, or error bars in the visible text; without these, the cross-method comparison and the claim that linear attention suffices cannot be evaluated.
minor comments (1)
  1. [§3] Notation for the linear-attention mechanism and the precise definition of the 32x compression ratio should be introduced with equations in the method section for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to strengthen the paper where the concerns are valid.

Point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (method): The claim that the 32x deep compression autoencoder 'preserves restoration-relevant structures and textures' is load-bearing for both the performance and efficiency assertions, yet no ablation quantifies information loss at this ratio or demonstrates that degradation-specific high-frequency cues remain recoverable by the linear-attention DiT (even with LoRA). If the autoencoder discards unrecoverable details, the reported competitive metrics and 0.019 s latency cannot simultaneously hold.

    Authors: We agree that an explicit ablation quantifying information loss at the 32x ratio and demonstrating recoverability of degradation-specific cues would strengthen the central claim. In the revised manuscript we will add: (1) a compression-ratio ablation (8x/16x/32x) reporting reconstruction PSNR/SSIM on both clean and degraded inputs, (2) latent-space visualizations and high-frequency energy spectra before/after encoding, and (3) a controlled study measuring how much of the final restoration quality is attributable to the autoencoder versus the DiT. These additions will directly address whether the linear-attention DiT can recover the necessary cues. revision: yes

  2. Referee: [§4] §4 (experiments): The abstract asserts 'highly competitive and often superior quantitative performance' and 'clearer and more realistic textures' but supplies no tables, baselines, or error bars in the visible text; without these, the cross-method comparison and the claim that linear attention suffices cannot be evaluated.

    Authors: Section 4 of the full manuscript contains multiple tables comparing SANA-SR against recent Real-ISR baselines on standard benchmarks (PSNR, SSIM, LPIPS, MUSIQ, etc.), together with qualitative results. We will ensure all tables are clearly referenced from the abstract and §3, add standard-error bars from three independent runs where they were omitted, and include an additional table isolating the contribution of linear attention versus quadratic attention under identical latent tokens. If any tables were missing from the reviewed version due to rendering, we apologize and will correct the submission. revision: partial
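
The first response proposes reporting reconstruction PSNR and high-frequency energy spectra before and after encoding. A minimal sketch of both measurements follows; `toy_codec` is a hypothetical stand-in (block averaging followed by nearest-neighbor upsampling), not the paper's autoencoder, and the frequency cutoff is an arbitrary assumption.

```python
import numpy as np


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB for arrays on the same scale."""
    mse = np.mean((reference - estimate) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))


def high_freq_energy_ratio(image, cutoff=0.25):
    """Fraction of spectral energy above `cutoff` cycles per pixel (radial)."""
    power = np.abs(np.fft.fft2(image)) ** 2
    fy = np.fft.fftfreq(image.shape[0])[:, None]
    fx = np.fft.fftfreq(image.shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    return float(power[radius > cutoff].sum() / power.sum())


def toy_codec(image, factor=4):
    """Hypothetical lossy round trip: block-average, then upsample by repetition."""
    h, w = image.shape
    coarse = image.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return np.kron(coarse, np.ones((factor, factor)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.standard_normal((64, 64))   # synthetic texture-like input
    recon = toy_codec(image)
    print("PSNR (dB):", psnr(image, recon))
    print("high-frequency share before:", high_freq_energy_ratio(image))
    print("high-frequency share after: ", high_freq_energy_ratio(recon))
```

Running the same two measurements on the real autoencoder at 8x, 16x, and 32x, on both clean and degraded inputs, would give the kind of evidence the referee's first major comment asks for.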

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper's central proposal rests on an architectural argument (token redundancy as the primary bottleneck) followed by a design choice (32x deep compression autoencoder + linear-attention DiT with LoRA) and empirical reporting on external benchmarks. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citation chains appear in the provided text that would reduce the performance claims to the inputs by construction. The autoencoder fidelity assumption is stated explicitly as a design premise rather than derived from prior self-work, and the reported metrics (PSNR/SSIM, latency, MACs) are positioned as measured outcomes on standard datasets. This satisfies the criteria for a self-contained derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities can be extracted beyond the stated design choices.

free parameters (1)
  • compression ratio
    32x ratio is presented as a design choice to reduce tokens while preserving structures.

pith-pipeline@v0.9.0 · 5786 in / 1028 out tokens · 19315 ms · 2026-05-25T04:39:16.013079+00:00 · methodology

