Latent Wavelet Diffusion For Ultra-High-Resolution Image Synthesis
Pith reviewed 2026-05-19 11:56 UTC · model grok-4.3
The pith
Wavelet energy maps create dynamic masks that focus diffusion training on detail-rich latent regions for better ultra-high-resolution images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Latent Wavelet Diffusion (LWD) is a lightweight training framework that derives a frequency-aware mask from wavelet energy maps and uses it to focus training on detail-rich regions of the latent space. A scale-consistent VAE objective complements the mask to preserve spectral fidelity. The authors report consistent improvements in perceptual quality and FID across baselines, with no architectural modifications and no additional inference cost.
What carries the argument
Frequency-aware masking strategy derived from wavelet energy maps that dynamically focuses training on detail-rich regions of the latent space.
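A minimal sketch of how such an energy map might be computed, assuming a one-level Haar transform on a single latent channel. The paper's actual wavelet family, number of decomposition levels, and normalization are not given in this summary, so the function names `haar_detail_energy` and `normalize_to_mask` and the min-max rescaling are illustrative assumptions rather than the authors' implementation.

```python
def haar_detail_energy(latent):
    """One-level 2D Haar transform of a single-channel latent (list of rows).

    Returns an (H/2) x (W/2) map with the summed squared detail coefficients
    of each 2x2 block. H and W are assumed to be even.
    """
    energy = []
    for i in range(0, len(latent), 2):
        row = []
        for j in range(0, len(latent[0]), 2):
            a, b = latent[i][j], latent[i][j + 1]
            c, d = latent[i + 1][j], latent[i + 1][j + 1]
            top_vs_bottom = (a + b - c - d) / 2.0   # change between the two rows
            left_vs_right = (a - b + c - d) / 2.0   # change between the two columns
            diagonal = (a - b - c + d) / 2.0        # diagonal change
            row.append(top_vs_bottom ** 2 + left_vs_right ** 2 + diagonal ** 2)
        energy.append(row)
    return energy


def normalize_to_mask(energy, eps=1e-8):
    """Rescale an energy map to [0, 1] so it can act as a soft mask (assumed scheme)."""
    flat = [v for row in energy for v in row]
    lo, hi = min(flat), max(flat)
    return [[(v - lo) / (hi - lo + eps) for v in row] for row in energy]


# Toy latent: flat everywhere except a checkerboard patch in the top-right block.
latent = [
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
energy = haar_detail_energy(latent)
mask = normalize_to_mask(energy)
```

In this toy case only the top-right block carries detail energy, so the mask is close to 1 there and 0 elsewhere.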
Load-bearing premise
The wavelet energy maps derived from the latent space accurately and stably identify detail-rich regions such that the resulting dynamic masking improves fidelity without introducing training artifacts or losing global coherence.
What would settle it
Training the same baseline model with and without the wavelet masking and scale-consistent VAE objective on a fixed 4K dataset and finding no consistent gain in FID or perceptual metrics would falsify the central claim.
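One way to read "no consistent gain" is as a per-seed paired comparison. The sketch below assumes that FID has already been measured for matched seeds with and without LWD. The function name `paired_fid_summary` and the two summary statistics are illustrative choices, not a protocol taken from the paper.

```python
from statistics import mean


def paired_fid_summary(fid_baseline, fid_lwd):
    """Summarize matched-seed FID scores; lower FID is better.

    fid_baseline[k] and fid_lwd[k] must come from the same seed and dataset.
    """
    if not fid_baseline or len(fid_baseline) != len(fid_lwd):
        raise ValueError("need two non-empty lists of equal length")
    gains = [b - l for b, l in zip(fid_baseline, fid_lwd)]  # > 0 means LWD improved
    return {
        "mean_gain": mean(gains),
        "win_rate": sum(g > 0 for g in gains) / len(gains),
    }
```

A mean gain near zero together with a win rate near one half would point toward the falsifying outcome described above.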
Original abstract
High-resolution image synthesis remains a core challenge in generative modeling, particularly in balancing computational efficiency with the preservation of fine-grained visual detail. We present Latent Wavelet Diffusion (LWD), a lightweight training framework that significantly improves detail and texture fidelity in ultra-high-resolution (2K-4K) image synthesis. LWD introduces a novel, frequency-aware masking strategy derived from wavelet energy maps, which dynamically focuses the training process on detail-rich regions of the latent space. This is complemented by a scale-consistent VAE objective to ensure high spectral fidelity. The primary advantage of our approach is its efficiency: LWD requires no architectural modifications and adds zero additional cost during inference, making it a practical solution for scaling existing models. Across multiple strong baselines, LWD consistently improves perceptual quality and FID scores, demonstrating the power of signal-driven supervision as a principled and efficient path toward high-resolution generative modeling. The code is available at https://github.com/LuigiSigillo/LatentWaveletDiffusion
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Latent Wavelet Diffusion (LWD), a lightweight training framework for ultra-high-resolution (2K-4K) image synthesis. It proposes a frequency-aware masking strategy derived from wavelet energy maps on VAE latents to dynamically focus training on detail-rich regions, complemented by a scale-consistent VAE objective for spectral fidelity. The approach requires no architectural modifications to existing diffusion models and adds zero inference cost, while claiming consistent gains in FID scores and perceptual quality across strong baselines.
Significance. If the empirical improvements prove robust, LWD could offer a practical, signal-processing-inspired route to better detail preservation in high-resolution generative models without runtime penalties. The public code release at https://github.com/LuigiSigillo/LatentWaveletDiffusion supports reproducibility and is a clear strength.
major comments (2)
- Abstract: the claims of consistent FID and perceptual gains are stated without any quantitative tables, error bars, ablation studies, or dataset details, so the strength of support for the central claim cannot be verified from the given text.
- Method section on wavelet energy maps: the frequency-aware masking strategy assumes these maps (computed on standard VAE latents) accurately and stably identify detail-rich regions. Because VAEs attenuate high-frequency content, the maps may misidentify or under-weight true details, risking ineffective masking or training artifacts that could offset the scale-consistent VAE objective; this assumption is load-bearing for attributing reported gains to the proposed mechanism.
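The attenuation concern in the second comment can be illustrated with a toy experiment. This sketch uses a 3x3 box filter as a crude stand-in for a lossy encoder, which is an assumption made purely for illustration. A real VAE is a learned, nonlinear map and may behave differently.

```python
def detail_energy(img):
    """Total squared one-level Haar detail energy over all 2x2 blocks."""
    total = 0.0
    for i in range(0, len(img), 2):
        for j in range(0, len(img[0]), 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            total += ((a + b - c - d) / 2.0) ** 2
            total += ((a - b + c - d) / 2.0) ** 2
            total += ((a - b - c + d) / 2.0) ** 2
    return total


def smooth(img):
    """3x3 box filter with edge replication (hypothetical low-pass proxy)."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii = min(max(i + di, 0), h - 1)
                    jj = min(max(j + dj, 0), w - 1)
                    acc += img[ii][jj]
            row.append(acc / 9.0)
        out.append(row)
    return out


# Checkerboard texture: the highest-frequency pattern a pixel grid can hold.
texture = [[float((i + j) % 2) for j in range(8)] for i in range(8)]
before = detail_energy(texture)
after = detail_energy(smooth(texture))
```

Here `after` is much smaller than `before`, so an energy map computed on the smoothed signal would rank this fully textured region as low in detail. That is the failure mode the referee describes.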
minor comments (1)
- Abstract: the term 'signal-driven supervision' would benefit from a short definition or pointer to related literature on wavelet-based supervision in generative models.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive overall assessment of Latent Wavelet Diffusion. We address each major comment below and have prepared revisions to strengthen the manuscript.
Point-by-point responses
- Referee: Abstract: the claims of consistent FID and perceptual gains are stated without any quantitative tables, error bars, ablation studies, or dataset details, so the strength of support for the central claim cannot be verified from the given text.
  Authors: We agree that the abstract, as a high-level summary, does not include the supporting numbers or references. The full manuscript contains the requested quantitative evidence in the Experiments section, including FID tables with error bars from multiple seeds, ablation studies on the masking strategy, and dataset specifications. In the revised version we will update the abstract to briefly cite the magnitude of the observed gains and explicitly direct readers to the relevant tables and figures.
  Revision: yes
- Referee: Method section on wavelet energy maps: the frequency-aware masking strategy assumes these maps (computed on standard VAE latents) accurately and stably identify detail-rich regions. Because VAEs attenuate high-frequency content, the maps may misidentify or under-weight true details, risking ineffective masking or training artifacts that could offset the scale-consistent VAE objective; this assumption is load-bearing for attributing reported gains to the proposed mechanism.
  Authors: This is a substantive concern. While standard VAEs do attenuate high frequencies, the latent representations retain multi-scale structural information that our wavelet energy maps exploit to locate detail-rich regions. Ablation experiments in the manuscript demonstrate that wavelet-based masking outperforms random and uniform alternatives, and the scale-consistent VAE objective is designed to counteract spectral loss. We will add a dedicated discussion paragraph in the Method section, supported by additional visualizations of the energy maps and their alignment with high-detail areas in decoded images, to make the rationale and empirical grounding explicit.
  Revision: yes
Circularity Check
No circularity: LWD masking and VAE objective are derived from external wavelet transforms and standard latent representations
full rationale
The paper's central mechanism computes wavelet energy maps directly on VAE latents to produce a frequency-aware mask, then applies this mask during training alongside a scale-consistent VAE loss. Neither step defines the mask or loss in terms of the final FID/perceptual gains, nor does any equation reduce the reported improvement to a fitted parameter or prior self-citation. The derivation remains self-contained: wavelet energy is an independent signal-processing operation, the VAE is a fixed pretrained component, and empirical gains are presented as outcomes of this supervision rather than tautological redefinitions of the inputs.
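The rationale above treats the mask as a fixed input to the training loss. A minimal sketch of that idea follows, assuming a per-position weighted mean squared error. The `floor` parameter, which keeps low-energy regions from being ignored entirely, is a hypothetical choice and is not taken from the paper.

```python
def masked_mse(pred, target, mask, floor=0.1):
    """Weighted MSE with per-position weight floor + (1 - floor) * mask.

    pred, target, and mask are lists of rows with identical shapes;
    mask values are expected to lie in [0, 1].
    """
    total, weight_sum = 0.0, 0.0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            w = floor + (1.0 - floor) * m
            total += w * (p - t) ** 2
            weight_sum += w
    return total / weight_sum


# The same prediction error counts more when it falls in a high-mask position.
pred, target = [[1.0, 0.0]], [[0.0, 0.0]]
loss_in_detail_region = masked_mse(pred, target, mask=[[1.0, 0.0]])
loss_in_flat_region = masked_mse(pred, target, mask=[[0.0, 1.0]])
```

Because the weights depend only on the wavelet energy of the latent and not on any evaluation metric, a loss of this shape is consistent with the no-circularity reading. A mask computed at reduced resolution would first need to be resized to match `pred`.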
Forward citations
Cited by 2 Pith papers
- Spectral Progressive Diffusion for Efficient Image and Video Generation
  Spectral Progressive Diffusion accelerates image and video generation in pretrained diffusion models by progressively growing resolution along the denoising trajectory using spectral noise expansion and a power spectr...
- PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset
  PixVerve introduces a 95K ultra-high-resolution image-text dataset and training strategies that enable native 100-megapixel text-to-image generation together with a new evaluation benchmark.