
arXiv: 2506.00433 · v4 · submitted 2025-05-31 · cs.CV · cs.LG · eess.IV

Latent Wavelet Diffusion For Ultra-High-Resolution Image Synthesis

Pith reviewed 2026-05-19 11:56 UTC · model grok-4.3

classification: cs.CV · cs.LG · eess.IV
keywords: high-resolution image synthesis, latent diffusion, wavelet transforms, frequency-aware masking, VAE objective, generative modeling, detail fidelity, perceptual quality

The pith

Wavelet energy maps create dynamic masks that focus diffusion training on detail-rich latent regions for better ultra-high-resolution images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Latent Wavelet Diffusion as a training framework that adds a frequency-aware masking step and a scale-consistent objective to existing latent diffusion pipelines. Wavelet energy maps derived from the latent space identify regions containing fine details, and the loss is then concentrated on those areas while a new VAE term enforces consistency across resolution scales. The approach requires no model architecture changes and imposes no extra cost when the model is later used to generate images. A reader would care because ultra-high-resolution synthesis has been limited by the difficulty of preserving textures without exploding compute budgets or redesigning networks from scratch.

Core claim

Latent Wavelet Diffusion (LWD) is a lightweight training framework built on two components: a frequency-aware masking strategy, derived from wavelet energy maps, that dynamically focuses training on detail-rich regions of the latent space, and a scale-consistent VAE objective intended to ensure high spectral fidelity. The paper claims that LWD consistently improves perceptual quality and FID scores across baselines, with no architectural modifications and zero additional inference cost.

What carries the argument

Frequency-aware masking strategy derived from wavelet energy maps that dynamically focuses training on detail-rich regions of the latent space.
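As an illustration of what a wavelet energy map over a latent might look like, here is a minimal NumPy sketch. It assumes a single-level Haar transform, detail energy summed over channels, and min-max normalization; the function names `haar_dwt2` and `wavelet_energy_mask` are hypothetical, and the paper's actual wavelet family, normalization, and timestep handling are not recoverable from this page.

```python
import numpy as np

def haar_dwt2(z):
    """Single-level orthonormal 2D Haar DWT of a (C, H, W) array (H, W even).

    Returns (approx, row_detail, col_detail, diag_detail),
    each of shape (C, H/2, W/2).
    """
    a = z[:, 0::2, 0::2]  # top-left sample of each 2x2 block
    b = z[:, 0::2, 1::2]  # top-right
    c = z[:, 1::2, 0::2]  # bottom-left
    d = z[:, 1::2, 1::2]  # bottom-right
    approx = (a + b + c + d) / 2.0
    row_detail = (a + b - c - d) / 2.0   # change between the two rows
    col_detail = (a - b + c - d) / 2.0   # change between the two columns
    diag_detail = (a - b - c + d) / 2.0  # diagonal change
    return approx, row_detail, col_detail, diag_detail

def wavelet_energy_mask(z, eps=1e-8):
    """Map a latent (C, H, W) to a detail mask in [0, 1] of shape (H/2, W/2)."""
    _, row_detail, col_detail, diag_detail = haar_dwt2(z)
    # Detail-band energy, summed over channels (an assumed aggregation).
    energy = (row_detail**2 + col_detail**2 + diag_detail**2).sum(axis=0)
    lo, hi = energy.min(), energy.max()
    # Min-max normalization (an assumption, not taken from the paper).
    return (energy - lo) / (hi - lo + eps)
```

A perfectly flat latent yields an all-zero mask under this sketch, while a block containing an isolated spike receives a weight near 1, which is the behavior the masking idea relies on.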

Load-bearing premise

The wavelet energy maps derived from the latent space accurately and stably identify detail-rich regions such that the resulting dynamic masking improves fidelity without introducing training artifacts or losing global coherence.

What would settle it

Training the same baseline model with and without the wavelet masking and scale-consistent VAE objective on a fixed 4K dataset and finding no consistent gain in FID or perceptual metrics would falsify the central claim.

Figures

Figures reproduced from arXiv: 2506.00433 by Danilo Comminiello, Luigi Sigillo, Shengfeng He.

Figure 1: We propose Latent Wavelet Diffusion, achieving 4K image synthesis without architectural …

Figure 2: (a) Temporal evolution of latent z_t, wavelet energy maps A_wavelet, and attention map M_t across diffusion timesteps. (b) Our wavelet-masked flow matching objective at a timestep t. The model computes a wavelet attention map M_t from latent z_t to modulate the prediction error between target velocity field (ϵ − z_0) and predicted velocity v_Θ(z_t, t, y). This focuses optimization on high-frequency regions with gr…

Figure 3: Normalized DCT amplitudes over zigzag frequency indices. VAE trained with the multi-scale loss reduces high-frequency energy in latents, aligning their spectrum with that of RGB images.
(Adjacent body text captured with the caption:) To guide spatial supervision based on structural complexity, we extract saliency maps from latent representations using localized frequency analysis. Given a latent tensor z ∈ ℝ^{C×H×W}, we apply a single-level Discrete Wav…

Figure 4: Visual comparison of 2K image generations. LWD demonstrates improved detail preserva…

Figure 5: 4K images generated by LWD with different architectures.

Figure 6: Images generated at 4K resolution with LWD+SANA.

Figure 7: Images generated at 4K resolution with LWD+URAE.

Figure 8: Images generated at 4K resolution with LWD+URAE.

Figure 9: Visual comparison of 4K image generations from LWD and competing baselines.

Figure 10: 4K generation of URAE vs LWD + URAE. Upper caption: "Eiffel Tower was Made up of more than 2 million translucent straws to look like a cloud, with the bell tower at the top of the building, Michel installed huge foam-making machines in the forest to blow huge amounts of unpredictable wet clouds in the building’s classic architecture.". Lower caption: "Barbarian woman riding a red dragon, holding a broadsw…

Figure 11: 2K generation of PixArt-Sigma-XL vs LWD + PixArt-Sigma-XL.

Figure 12: 4K generation of Sana vs LWD + Sana. Upper caption: "A litter of golden retriever puppies playing in the snow. Their heads pop out of the snow, covered in.". Lower caption: "A curvy timber house near a sea, designed by Zaha Hadid, represent the image of a cold, modern architecture, at night, white lighting, highly detailed."

Figure 13: 2K generation of SD3-Diff4k-F16 vs LWD + SD3-F16.
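The Figure 3 caption refers to normalized DCT amplitudes plotted over zigzag frequency indices. A rough sketch of how such a curve can be computed for a single 2D channel follows. The helper names are hypothetical, the DCT is built from an explicit orthonormal DCT-II matrix to keep the example NumPy-only, and the peak normalization is an assumption; the paper's exact normalization and averaging are not given on this page.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]  # frequency index
    i = np.arange(n)[None, :]  # sample index
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0, :] /= np.sqrt(2.0)    # rescale the DC row so all rows have unit norm
    return d

def zigzag_indices(h, w):
    """(row, col) pairs ordered along anti-diagonals, JPEG-style."""
    pairs = [(r, c) for r in range(h) for c in range(w)]
    # Odd diagonals run top-to-bottom, even diagonals bottom-to-top.
    return sorted(
        pairs,
        key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else -p[0]),
    )

def zigzag_dct_amplitudes(x):
    """|DCT| of a 2D array, in zigzag (low-to-high frequency) order, peak-normalized."""
    h, w = x.shape
    coeffs = dct_matrix(h) @ x @ dct_matrix(w).T  # separable 2D DCT-II
    amps = np.abs(np.array([coeffs[r, c] for r, c in zigzag_indices(h, w)]))
    return amps / (amps.max() + 1e-12)  # peak normalization (assumed)
```

Applying `zigzag_dct_amplitudes` to a latent channel and to a grayscale image of matching size gives two curves that can be compared in the spirit of the figure: a latent with excess high-frequency energy would show a slower decay toward the high zigzag indices.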
Original abstract

High-resolution image synthesis remains a core challenge in generative modeling, particularly in balancing computational efficiency with the preservation of fine-grained visual detail. We present Latent Wavelet Diffusion (LWD), a lightweight training framework that significantly improves detail and texture fidelity in ultra-high-resolution (2K-4K) image synthesis. LWD introduces a novel, frequency-aware masking strategy derived from wavelet energy maps, which dynamically focuses the training process on detail-rich regions of the latent space. This is complemented by a scale-consistent VAE objective to ensure high spectral fidelity. The primary advantage of our approach is its efficiency: LWD requires no architectural modifications and adds zero additional cost during inference, making it a practical solution for scaling existing models. Across multiple strong baselines, LWD consistently improves perceptual quality and FID scores, demonstrating the power of signal-driven supervision as a principled and efficient path toward high-resolution generative modeling. The code is available at https://github.com/LuigiSigillo/LatentWaveletDiffusion

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Latent Wavelet Diffusion (LWD), a lightweight training framework for ultra-high-resolution (2K-4K) image synthesis. It proposes a frequency-aware masking strategy derived from wavelet energy maps on VAE latents to dynamically focus training on detail-rich regions, complemented by a scale-consistent VAE objective for spectral fidelity. The approach requires no architectural modifications to existing diffusion models and adds zero inference cost, while claiming consistent gains in FID scores and perceptual quality across strong baselines.

Significance. If the empirical improvements prove robust, LWD could offer a practical, signal-processing-inspired route to better detail preservation in high-resolution generative models without runtime penalties. The public code release at https://github.com/LuigiSigillo/LatentWaveletDiffusion supports reproducibility and is a clear strength.

major comments (2)
  1. Abstract: the claims of consistent FID and perceptual gains are stated without any quantitative tables, error bars, ablation studies, or dataset details, so the strength of support for the central claim cannot be verified from the given text.
  2. Method section on wavelet energy maps: the frequency-aware masking strategy assumes these maps (computed on standard VAE latents) accurately and stably identify detail-rich regions. Because VAEs attenuate high-frequency content, the maps may misidentify or under-weight true details, risking ineffective masking or training artifacts that could offset the scale-consistent VAE objective; this assumption is load-bearing for attributing reported gains to the proposed mechanism.
minor comments (1)
  1. Abstract: the term 'signal-driven supervision' would benefit from a short definition or pointer to related literature on wavelet-based supervision in generative models.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive overall assessment of Latent Wavelet Diffusion. We address each major comment below and have prepared revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: Abstract: the claims of consistent FID and perceptual gains are stated without any quantitative tables, error bars, ablation studies, or dataset details, so the strength of support for the central claim cannot be verified from the given text.

    Authors: We agree that the abstract, as a high-level summary, does not include the supporting numbers or references. The full manuscript contains the requested quantitative evidence in the Experiments section, including FID tables with error bars from multiple seeds, ablation studies on the masking strategy, and dataset specifications. In the revised version we will update the abstract to briefly cite the magnitude of the observed gains and explicitly direct readers to the relevant tables and figures. revision: yes

  2. Referee: Method section on wavelet energy maps: the frequency-aware masking strategy assumes these maps (computed on standard VAE latents) accurately and stably identify detail-rich regions. Because VAEs attenuate high-frequency content, the maps may misidentify or under-weight true details, risking ineffective masking or training artifacts that could offset the scale-consistent VAE objective; this assumption is load-bearing for attributing reported gains to the proposed mechanism.

    Authors: This is a substantive concern. While standard VAEs do attenuate high frequencies, the latent representations retain multi-scale structural information that our wavelet energy maps exploit to locate detail-rich regions. Ablation experiments in the manuscript demonstrate that wavelet-based masking outperforms random and uniform alternatives, and the scale-consistent VAE objective is designed to counteract spectral loss. We will add a dedicated discussion paragraph in the Method section, supported by additional visualizations of the energy maps and their alignment with high-detail areas in decoded images, to make the rationale and empirical grounding explicit. revision: yes

Circularity Check

0 steps flagged

No circularity: LWD masking and VAE objective are derived from external wavelet transforms and standard latent representations

full rationale

The paper's central mechanism computes wavelet energy maps directly on VAE latents to produce a frequency-aware mask, then applies this mask during training alongside a scale-consistent VAE loss. Neither step defines the mask or loss in terms of the final FID/perceptual gains, nor does any equation reduce the reported improvement to a fitted parameter or prior self-citation. The derivation remains self-contained: wavelet energy is an independent signal-processing operation, the VAE is a fixed pretrained component, and empirical gains are presented as outcomes of this supervision rather than tautological redefinitions of the inputs.
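To make the masking step concrete, the sketch below weights a flow-matching error by a spatial mask, using the target velocity (ϵ − z_0) named in the Figure 2 caption. It is a minimal illustration under stated assumptions: the function name `masked_flow_matching_loss` is hypothetical, the mask is taken to already match the latent's spatial size, and the plain multiplicative weighting is a guess at the form rather than the paper's exact objective.

```python
import numpy as np

def masked_flow_matching_loss(v_pred, z0, noise, mask):
    """Mean squared velocity error, weighted per spatial location by `mask`.

    v_pred, z0, noise: arrays of shape (C, H, W).
    mask: array of shape (H, W) with values in [0, 1].
    """
    target = noise - z0                    # target velocity (eps - z0)
    sq_err = (v_pred - target) ** 2        # per-element squared error
    weighted = mask[None, :, :] * sq_err   # broadcast the mask over channels
    return float(weighted.mean())
```

With an all-ones mask this reduces to the ordinary unmasked mean squared error, so the mask is the only place the wavelet signal enters. Note that a mask with exact zeros removes supervision from smooth regions entirely, which is one way the global-coherence concern raised in the load-bearing premise could show up in practice.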

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach builds on standard latent diffusion and wavelet transforms without introducing new postulated entities.

pith-pipeline@v0.9.0 · 5704 in / 1020 out tokens · 46144 ms · 2026-05-19T11:56:11.343096+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Spectral Progressive Diffusion for Efficient Image and Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Spectral Progressive Diffusion accelerates image and video generation in pretrained diffusion models by progressively growing resolution along the denoising trajectory using spectral noise expansion and a power spectr...

  2. PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset

    cs.CV 2026-05 unverdicted novelty 5.0

    PixVerve introduces a 95K ultra-high-resolution image-text dataset and training strategies that enable native 100-megapixel text-to-image generation together with a new evaluation benchmark.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · cited by 2 Pith papers · 4 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025

  2. [2]

    Quality-aware image-text alignment for opinion-unaware image quality assessment.arXiv preprint arXiv:2403.11176,

    Lorenzo Agnolucci, Leonardo Galteri, and Marco Bertini. Quality-aware image-text alignment for opinion-unaware image quality assessment. arXiv preprint arXiv:2403.11176, 2024

  3. [3]

    A Wavelet Diffusion GAN for Image Super-Resolution

    Lorenzo Aloisi, Luigi Sigillo, Aurelio Uncini, and Danilo Comminiello. A wavelet diffusion gan for image super-resolution. arXiv preprint arXiv:2410.17966, 2024

  4. [4]

    MultiDiffusion: Fusing diffusion paths for controlled image generation

    Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. MultiDiffusion: Fusing diffusion paths for controlled image generation. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learni...

  5. [5]

    Simpler is better: Spectral regularization and up-sampling techniques for variational autoencoders

    Sara Björk, Jonas Nordhaug Myhre, and Thomas Haugland Johansen. Simpler is better: Spectral regularization and up-sampling techniques for variational autoencoders. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3778–3782, 2022

  6. [6]

    Training diffusion models with reinforcement learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. In The Twelfth International Conference on Learning Representations, 2024

  7. [7]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023

  8. [8]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

  9. [9]

    Pixart- σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation

    Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart- σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. In European Conference on Computer Vision, pages 74–91. Springer, 2024

  10. [10]

    Uses of Complex Wavelets in Deep Convolutional Neural Networks

    Fergal Cotter. Uses of Complex Wavelets in Deep Convolutional Neural Networks. PhD thesis, Apollo - University of Cambridge Repository, 2019

  11. [11]

    Demofusion: Democratising high-resolution image generation with no $$$

    Ruoyi Du, Dongliang Chang, Timothy Hospedales, Yi-Zhe Song, and Zhanyu Ma. Demofusion: Democratising high-resolution image generation with no $$$. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6159–6168, 2024

  12. [12]

    I-max: Maximize the resolution potential of pre-trained rectified flow transformers with projected flow, 2024

    Ruoyi Du, Dongyang Liu, Le Zhuo, Qin Qi, Hongsheng Li, Zhanyu Ma, and Peng Gao. I-max: Maximize the resolution potential of pre-trained rectified flow transformers with projected flow, 2024

  13. [13]

    Scaling rectified flow trans- formers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024

  14. [14]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

  15. [15]

    Spectral image tokenizer

    Carlos Esteves, Mohammed Suhail, and Ameesh Makadia. Spectral image tokenizer. arXiv preprint arXiv:2412.09607, 2024. 10

  16. [16]

    Susskind, and Navdeep Jaitly

    Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Joshua M. Susskind, and Navdeep Jaitly. Matryoshka diffusion models. In The Twelfth International Conference on Learning Representations, 2024

  17. [17]

    Make a cheap scaling: A self-cascade diffusion model for higher-resolution adaptation

    Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, et al. Make a cheap scaling: A self-cascade diffusion model for higher-resolution adaptation. In European Conference on Computer Vision, pages 39–55. Springer, 2024

  18. [18]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richard- son, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024

  19. [19]

    Isometric representation learning for disentangled latent space of diffusion models

    Jaehoon Hahm, Junho Lee, Sunghyun Kim, and Joonseok Lee. Isometric representation learning for disentangled latent space of diffusion models. In Forty-first International Conference on Machine Learning, 2024

  20. [20]

    Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models

    Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, and Ying Shan. Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models. In The Twelfth International Conference on Learning Representations, 2023

  21. [21]

    Clipscore: A reference-free evaluation metric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. In EMNLP (1), 2021

  22. [22]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  23. [23]

    Cascaded diffusion models for high fidelity image generation

    Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. Journal of Machine Learning Research, 23(47):1–33, 2022

  24. [24]

    Fouriscale: A frequency perspective on training-free high-resolution image synthesis

    Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, and Hongsheng Li. Fouriscale: A frequency perspective on training-free high-resolution image synthesis. In European Conference on Computer Vision, pages 196–212. Springer, 2024

  25. [25]

    Wavedm: Wavelet-based diffusion models for image restoration

    Yi Huang, Jiancheng Huang, Jianzhuang Liu, Mingfu Yan, Yu Dong, Jiaxi Lv, Chaoqi Chen, and Shifeng Chen. Wavedm: Wavelet-based diffusion models for image restoration. IEEE Transactions on Multimedia, 26:7058–7073, 2024

  26. [26]

    Latent space super-resolution for higher-resolution image generation with diffusion models

    Jinho Jeong, Sangmin Han, Jinwoo Kim, and Seon Joo Kim. Latent space super-resolution for higher-resolution image generation with diffusion models. arXiv preprint arXiv:2503.18446, 2025

  27. [27]

    Low-light image enhancement with wavelet-based diffusion models

    Hai Jiang, Ao Luo, Haoqiang Fan, Songchen Han, and Shuaicheng Liu. Low-light image enhancement with wavelet-based diffusion models. ACM Trans. Graph., 42(6), December 2023

  28. [28]

    Diffusehigh: Training- free progressive high-resolution image synthesis through structure guidance

    Younghyun Kim, Geunmin Hwang, Junyu Zhang, and Eunbyung Park. Diffusehigh: Training- free progressive high-resolution image synthesis through structure guidance. In Proceedings of the AAAI conference on artificial intelligence, volume 39, pages 4338–4346, 2025

  29. [29]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:36652–36663, 2023

  30. [30]

    Eq-vae: Equivariance regularized latent space for improved generative image modeling

    Theodoros Kouzelis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. Eq-vae: Equivariance regularized latent space for improved generative image modeling. arXiv preprint arXiv:2502.09509, 2025

  31. [31]

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024

  32. [32]

    Syncdiffusion: Coherent montage via synchronized joint diffusions

    Yuseung Lee, Kunho Kim, Hyunjin Kim, and Minhyuk Sung. Syncdiffusion: Coherent montage via synchronized joint diffusions. Advances in Neural Information Processing Systems, 36:50648–50660, 2023. 11

  33. [33]

    Open-Sora Plan: Open-Source Large Video Generation Model

    Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, et al. Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131, 2024

  34. [34]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023

  35. [35]

    Xing, and Zhiting Hu

    Guangyi Liu, Yu Wang, Zeyu Feng, Qiyu Wu, Liping Tang, Yuan Gao, Zhen Li, Shuguang Cui, Julian McAuley, Zichao Yang, Eric P. Xing, and Zhiting Hu. Unified generation, reconstruction, and representation: Generalized diffusion with adaptive latent encoding-decoding. In Forty-first International Conference on Machine Learning, 2024

  36. [36]

    Sora: A review on background, technology, limitations, and opportunities of large vision models, 2024

    Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, Lifang He, and Lichao Sun. Sora: A review on background, technology, limitations, and opportunities of large vision models, 2024

  37. [37]

    Guess what i think: Streamlined eeg-to-image generation with latent diffusion models

    Eleonora Lopez, Luigi Sigillo, Federica Colonnese, Massimo Panella, and Danilo Comminiello. Guess what i think: Streamlined eeg-to-image generation with latent diffusion models. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2025

  38. [38]

    Singularity detection and processing with wavelets

    Stephane Mallat and Wen Liang Hwang. Singularity detection and processing with wavelets. IEEE transactions on information theory, 38(2):617–643, 1992

  39. [39]

    SDEdit: Guided image synthesis and editing with stochastic differential equations

    Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022

  40. [40]

    Moser, Stanislav Frolov, Federico Raue, Sebastian Palacio, and Andreas Dengel

    Brian B. Moser, Stanislav Frolov, Federico Raue, Sebastian Palacio, and Andreas Dengel. Dynamic Attention-Guided Diffusion for Image Super-Resolution . In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 451–460, Los Alamitos, CA, USA, March 2025. IEEE Computer Society

  41. [41]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  42. [42]

    Wavelet diffusion models are fast and scalable image generators

    Hao Phung, Quan Dao, and Anh Tran. Wavelet diffusion models are fast and scalable image generators. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10199–10208, 2023

  43. [43]

    SDXL: Improving latent diffusion models for high-resolution image synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, 2024

  44. [44]

    Boosting diffusion models with moving average sampling in frequency domain

    Yurui Qian, Qi Cai, Yingwei Pan, Yehao Li, Ting Yao, Qibin Sun, and Tao Mei. Boosting diffusion models with moving average sampling in frequency domain. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8911–8920, 2024

  45. [45]

    Lumina-image 2.0: A unified and efficient image generative framework

    Qi Qin, Le Zhuo, Yi Xin, Ruoyi Du, Zhen Li, Bin Fu, Yiting Lu, Jiakang Yuan, Xinyue Li, Dongyang Liu, et al. Lumina-image 2.0: A unified and efficient image generative framework. arXiv preprint arXiv:2503.21758, 2025

  46. [46]

    Ultrapixel: Advancing ultra high-resolution image synthesis to new peaks

    Jingjing Ren, Wenbo Li, Haoyu Chen, Renjing Pei, Bin Shao, Yong Guo, Long Peng, Fenglong Song, and Lei Zhu. Ultrapixel: Advancing ultra high-resolution image synthesis to new peaks. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  47. [47]

    High- resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  48. [48]

    Image super-resolution via iterative refinement

    Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE transactions on pattern analysis and machine intelligence, 45(4):4713–4726, 2022. 12

  49. [49]

    Adversarial diffusion distillation

    Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. In European Conference on Computer Vision, pages 87–103. Springer, 2024

  50. [50]

    Laion-5b: an open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: an open large-scale dataset for training next generation image-text models....

  51. [51]

    Efficient diffusion models: A survey

    Hui Shen, Jingxuan Zhang, Boning Xiong, Rui Hu, Shoufa Chen, Zhongwei Wan, Xin Wang, Yu Zhang, Zixuan Gong, Guangyin Bao, Chaofan Tao, Yongfeng Huang, Ye Yuan, and Mi Zhang. Efficient diffusion models: A survey. Transactions on Machine Learning Research, 2025

  52. [52]

    Res- master: Mastering high-resolution image generation via structural and fine-grained guidance

    Shuwei Shi, Wenbo Li, Yuechen Zhang, Jingwen He, Biao Gong, and Yinqiang Zheng. Res- master: Mastering high-resolution image generation via structural and fine-grained guidance. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 6887–6895, 2025

  53. [53]

    Quaternion wavelet- conditioned diffusion models for image super-resolution

    Luigi Sigillo, Christian Bianchi, Aurelio Uncini, and Danilo Comminiello. Quaternion wavelet- conditioned diffusion models for image super-resolution. In2025 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2025

  54. [54]

    Ship in sight: Diffusion models for ship-image super resolution

    Luigi Sigillo, Riccardo Fosco Gramaccioni, Alessandro Nicolosi, and Danilo Comminiello. Ship in sight: Diffusion models for ship-image super resolution. In 2024 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2024

  55. [55]

    Improving the diffusability of autoencoders, 2025

    Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, and Aliaksandr Siarohin. Improving the diffusability of autoencoders, 2025

  56. [56]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021

  57. [57]

    Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021

  58. [58]

    HQ-V AE: Hierarchical discrete representation learning with variational bayes.Transactions on Machine Learning Research, 2024

    Yuhta Takida, Yukara Ikemiya, Takashi Shibuya, Kazuki Shimada, Woosung Choi, Chieh-Hsin Lai, Naoki Murata, Toshimitsu Uesaka, Kengo Uchida, Wei-Hsiang Liao, and Yuki Mitsufuji. HQ-V AE: Hierarchical discrete representation learning with variational bayes.Transactions on Machine Learning Research, 2024

  59. [59]

    Vidtok: A versatile and open-source video tokenizer

    Anni Tang, Tianyu He, Junliang Guo, Xinle Cheng, Li Song, and Jiang Bian. Vidtok: A versatile and open-source video tokenizer. arXiv preprint arXiv:2412.13061, 2024

  60. [60]

    Nvae: A deep hierarchical variational autoencoder

    Arash Vahdat and Jan Kautz. Nvae: A deep hierarchical variational autoencoder. Advances in neural information processing systems, 33:19667–19679, 2020

  61. [61]

    Sinsr: diffusion-based image super-resolution in a single step

    Yufei Wang, Wenhan Yang, Xinyuan Chen, Yaohui Wang, Lanqing Guo, Lap-Pui Chau, Ziwei Liu, Yu Qiao, Alex C Kot, and Bihan Wen. Sinsr: diffusion-based image super-resolution in a single step. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25796–25805, 2024

  62. [62]

    Multiscale structural similarity for image quality assessment

    Z. Wang, E.P. Simoncelli, and A.C. Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402 Vol.2, 2003

  63. [63]

    Designdiffusion: High-quality text-to-design image generation with diffusion models, 2025

    Zhendong Wang, Jianmin Bao, Shuyang Gu, Dong Chen, Wengang Zhou, and Houqiang Li. Designdiffusion: High-quality text-to-design image generation with diffusion models, 2025

  64. [64]

    Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis, 2023

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis, 2023

  65. [65]

    SANA: Efficient high-resolution text-to-image synthesis with linear diffusion transformers

    Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, and Song Han. SANA: Efficient high-resolution text-to-image synthesis with linear diffusion transformers. In The Thirteenth International Conference on Learning Representations, 2025

  66. [66]

    Difffit: Unlocking transferability of large diffusion models via simple parameter-efficient fine-tuning

    Enze Xie, Lewei Yao, Han Shi, Zhili Liu, Daquan Zhou, Zhaoqiang Liu, Jiawei Li, and Zhenguo Li. Difffit: Unlocking transferability of large diffusion models via simple parameter-efficient fine-tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4230–4239, 2023

  67. [67]

    Maniqa: Multi-dimension attention network for no-reference image quality assessment

    Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. Maniqa: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1191–1200, 2022

  68. [68]

    Diffusion probabilistic model made slim

    Xingyi Yang, Daquan Zhou, Jiashi Feng, and Xinchao Wang. Diffusion probabilistic model made slim. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22552–22562, 2023

  69. [69]

    Ultra-resolution adaptation with ease

    Ruonan Yu, Songhua Liu, Zhenxiong Tan, and Xinchao Wang. Ultra-resolution adaptation with ease. International Conference on Machine Learning, 2025

  70. [70]

    Conditional image synthesis with diffusion models: A survey. arXiv preprint arXiv:2409.19365

    Zheyuan Zhan, Defang Chen, Jian-Ping Mei, Zhenghe Zhao, Jiawei Chen, Chun Chen, Siwei Lyu, and Can Wang. Conditional image synthesis with diffusion models: A survey. CoRR, abs/2409.19365, 2024

  71. [71]

    Diffusion-4k: Ultra-high-resolution image synthesis with latent diffusion models

    Jinjin Zhang, Qiuyu Huang, Junjie Liu, Xiefan Guo, and Di Huang. Diffusion-4k: Ultra-high-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  72. [72]

    Fsim: A feature similarity index for image quality assessment

    Lin Zhang, Lei Zhang, Xuanqin Mou, and David Zhang. Fsim: A feature similarity index for image quality assessment. IEEE Transactions on Image Processing, 20(8):2378–2386, 2011

  73. [73]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023

  74. [74]

    Wavelet-based fourier information interaction with frequency diffusion adjustment for underwater image restoration

    Chen Zhao, Weiling Cai, Chenyu Dong, and Chengwei Hu. Wavelet-based fourier information interaction with frequency diffusion adjustment for underwater image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8281–8291, 2024

  75. [75]

    Any-size-diffusion: Toward efficient text-driven synthesis for any-size hd images

    Qingping Zheng, Yuanfan Guo, Jiankang Deng, Jianhua Han, Ying Li, Songcen Xu, and Hang Xu. Any-size-diffusion: Toward efficient text-driven synthesis for any-size hd images. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 7571–7578, 2024.

A Wavelet-Based Relevance Maps for Latent Space Analysis

A.1 Discrete Wavelet Trans...