RectifiedHR: Enable Efficient High-Resolution Synthesis via Energy Rectification
Pith reviewed 2026-05-23 01:31 UTC · model grok-4.3
The pith
RectifiedHR lets diffusion models synthesize high-resolution images without any retraining by refreshing noise and tuning guidance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RectifiedHR shows that a noise refresh strategy, combined with a classifier-free guidance scale chosen through average latent energy analysis, enables efficient high-resolution synthesis in pre-trained diffusion models without any additional training.
What carries the argument
Noise refresh strategy that re-initializes the denoising trajectory at the target resolution, paired with average latent energy analysis to select an effective classifier-free guidance value.
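The summary above does not spell out the refresh procedure, so the following is only a minimal sketch of what one refresh step could look like under the common DDPM-style parameterization x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps. The function names, the nearest-neighbour resize, the `alpha_bar` argument, and the use of plain Python lists as a single-channel latent are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random

def predict_x0(x_t, eps_pred, alpha_bar):
    """Invert x_t = sqrt(a) * x0 + sqrt(1 - a) * eps for x0, element by element."""
    sa, sn = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [[(x - sn * e) / sa for x, e in zip(row_x, row_e)]
            for row_x, row_e in zip(x_t, eps_pred)]

def upsample_nearest(grid, scale):
    """Nearest-neighbour upsampling of an H x W grid (assumed resize choice)."""
    return [[v for v in row for _ in range(scale)]
            for row in grid for _ in range(scale)]

def noise_refresh(x_t, eps_pred, alpha_bar, scale, rng=None):
    """Hypothetical refresh: estimate x0, upsample it, re-noise with fresh noise."""
    rng = rng or random.Random()
    x0_up = upsample_nearest(predict_x0(x_t, eps_pred, alpha_bar), scale)
    sa, sn = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    # Fresh Gaussian noise is drawn at the new resolution, at the same noise level.
    return [[sa * v + sn * rng.gauss(0.0, 1.0) for v in row] for row in x0_up]
```

In this sketch the denoising loop would continue from the returned latent at the larger size. Where in the schedule the refresh happens, and how the resize is done, are choices the paper itself would have to supply.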
If this is right
- Pre-trained diffusion models can generate usable images at resolutions above their training scale without retraining or architectural changes.
- The same procedure improves efficiency compared with prior training-based or multi-stage high-resolution approaches.
- The method can be combined with editing, customization, and video pipelines that already rely on the underlying diffusion model.
- Quantitative comparisons indicate higher visual quality and lower compute cost than existing baselines on the same models.
Where Pith is reading between the lines
- The energy measurement step could be inserted into other diffusion workflows to detect and correct scale-dependent degradation without changing the model weights.
- Because the fix is post-training, practitioners could apply RectifiedHR to any publicly released diffusion checkpoint to obtain higher-resolution output immediately.
- The observation that latent energy tracks blurriness may motivate new monitoring tools for diagnosing generation failures at different resolutions.
Load-bearing premise
Energy decay during high-resolution denoising is the primary cause of blurriness and can be reliably corrected by adjusting the classifier-free guidance hyperparameter alone.
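To make the premise concrete, here is a small sketch of the standard classifier-free guidance combination and a mean-square "energy" proxy evaluated at several guidance scales. The helper names and the idea of measuring the proxy on the guided prediction are illustrative assumptions. The paper measures energy on latents during sampling, which this toy does not reproduce.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: eps_u + scale * (eps_c - eps_u)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def mean_square(values):
    """Average squared magnitude, used here as a simple energy proxy."""
    return sum(v * v for v in values) / len(values)

def energy_sweep(eps_uncond, eps_cond, scales):
    """Energy proxy of the guided prediction at each candidate guidance scale."""
    return {s: mean_square(cfg_combine(eps_uncond, eps_cond, s)) for s in scales}
```

Because the guided prediction is linear in the scale, its mean square is quadratic in the scale and grows once the scale is large enough. Whether that growth offsets the observed decay without side effects is exactly what the premise assumes.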
What would settle it
High-resolution outputs that remain blurry or acquire new artifacts even after the noise refresh step and the energy-guided guidance adjustment would show the method does not solve the core problem.
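One simple way to put a number on "remain blurry" is the variance of the Laplacian, a widely used sharpness heuristic. The sketch below is an assumption about how such a check could be run on a grayscale image stored as a list of rows. It is not a metric the paper is stated to use.

```python
def laplacian_variance(img):
    """Variance of the 4-neighbour Laplacian over interior pixels of an H x W grid.

    Requires H >= 3 and W >= 3. Lower values suggest a smoother (blurrier) image.
    """
    h, w = len(img), len(img[0])
    lap = [img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1] - 4 * img[y][x]
           for y in range(1, h - 1) for x in range(1, w - 1)]
    mean = sum(lap) / len(lap)
    return sum((v - mean) ** 2 for v in lap) / len(lap)
```

Comparing this score before and after the correction, at a fixed prompt and seed, would give one coarse signal of whether blurriness actually drops.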
Original abstract
Diffusion models have achieved remarkable progress across various visual generation tasks. However, their performance significantly declines when generating content at resolutions higher than those used during training. Although numerous methods have been proposed to enable high-resolution generation, they all suffer from inefficiency. In this paper, we propose RectifiedHR, a straightforward and efficient solution for training-free high-resolution synthesis. Specifically, we propose a noise refresh strategy that unlocks the model's training-free high-resolution synthesis capability and improves efficiency. Additionally, we are the first to observe the phenomenon of energy decay, which may cause image blurriness during the high-resolution synthesis process. To address this issue, we introduce average latent energy analysis and find that tuning the classifier-free guidance hyperparameter can significantly improve generation performance. Our method is entirely training-free and demonstrates efficient performance. Furthermore, we show that RectifiedHR is compatible with various diffusion model techniques, enabling advanced features such as image editing, customized generation, and video synthesis. Extensive comparisons with numerous baseline methods validate the superior effectiveness and efficiency of RectifiedHR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes RectifiedHR, a training-free method for high-resolution synthesis with pre-trained diffusion models. It introduces a noise refresh strategy to unlock high-res capability and reports an observed 'energy decay' phenomenon in latent space during the process, which is hypothesized to cause blurriness. Average latent energy analysis is used to motivate tuning the classifier-free guidance (CFG) scale as a correction. The method is presented as efficient, compatible with editing/customization/video tasks, and superior in effectiveness and efficiency to prior baselines.
Significance. If the causal link between energy decay and blurriness is validated and the CFG correction shown to be robust without side effects, the approach could provide a lightweight, training-free route to high-resolution generation that avoids the cost of resolution-specific fine-tuning. Compatibility with other diffusion techniques would further increase its practical value.
major comments (3)
- Abstract and §3 (energy decay observation): the manuscript states that energy decay 'may cause image blurriness' and that CFG tuning 'can significantly improve generation performance,' yet supplies no controlled ablations that isolate energy decay from other factors (noise accumulation, UNet resolution mismatch) while holding sampling steps, scheduler, and prompt fixed. Without such isolation or secondary metrics (saturation histograms, diversity scores, artifact counts), the causal claim remains unverified.
- §4 (RectifiedHR pipeline) and experimental results: superiority is asserted via 'extensive comparisons,' but the text provides neither error bars across multiple seeds, statistical significance tests, nor quantitative tables reporting FID/CLIP scores on standard high-res benchmarks (e.g., 1024×1024 or 2048×2048). The absence of these load-bearing metrics prevents assessment of whether the reported gains exceed baseline variance.
- §4.2 (CFG tuning via average latent energy): the claim that simply increasing the CFG scale reliably corrects energy decay without introducing new artifacts (oversaturation, mode collapse, detail loss) is not supported by any reported artifact-detection experiments or diversity metrics. A single hyperparameter sweep without negative controls leaves the 'reliable fix' assertion open to question.
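The request for error bars and significance tests can be illustrated with a short sketch. It assumes each method has one score per random seed (for example FID or CLIP score), and the helper names are hypothetical. It computes the per-method mean and sample standard deviation, plus Welch's t statistic for comparing two methods.

```python
import math
import statistics

def summarize(scores):
    """Mean and sample standard deviation of one method's per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va / len(a) + vb / len(b))
```

Turning the statistic into a p-value also needs the Welch-Satterthwaite degrees of freedom and a t distribution. SciPy's `scipy.stats.ttest_ind(a, b, equal_var=False)` handles both if that dependency is acceptable.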
minor comments (2)
- [§3] Notation for 'average latent energy' is introduced without an explicit equation; adding a short definition (e.g., E_avg = (1/N) Σ ||z_t||^2) would improve reproducibility.
- [Figures] Figure captions should explicitly state the resolution, CFG scale, and number of sampling steps used for each visual comparison.
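One way the missing definition could be written, assuming the latent at step t is a tensor x_t with C channels and spatial size H × W. This follows the per-element normalization quoted further down this page, but the index notation is an assumption and the paper's exact symbols may differ:

```latex
\mathbb{E}\left[x_t^2\right] = \frac{1}{C H W} \sum_{c=1}^{C} \sum_{h=1}^{H} \sum_{w=1}^{W} x_t[c, h, w]^2
```

With N = CHW this equals (1/N) ||x_t||^2 for a single latent, so it agrees with the form the referee suggests up to the choice of symbol for the latent.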
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our work. We provide point-by-point responses to the major comments below. We agree that the suggested additions will improve the manuscript and plan to incorporate them.
- Referee: Abstract and §3 (energy decay observation): the manuscript states that energy decay 'may cause image blurriness' and that CFG tuning 'can significantly improve generation performance,' yet supplies no controlled ablations that isolate energy decay from other factors (noise accumulation, UNet resolution mismatch) while holding sampling steps, scheduler, and prompt fixed. Without such isolation or secondary metrics (saturation histograms, diversity scores, artifact counts), the causal claim remains unverified.
  Authors: We agree that controlled ablations are necessary to strengthen the causal claim. In the revised manuscript, we will include experiments that isolate energy decay by controlling for noise accumulation and UNet resolution mismatch, while keeping sampling steps, scheduler, and prompt fixed. We will also report secondary metrics including saturation histograms, diversity scores, and artifact counts to verify the link to blurriness.
  Revision: yes
- Referee: §4 (RectifiedHR pipeline) and experimental results: superiority is asserted via 'extensive comparisons,' but the text provides neither error bars across multiple seeds, statistical significance tests, nor quantitative tables reporting FID/CLIP scores on standard high-res benchmarks (e.g., 1024×1024 or 2048×2048). The absence of these load-bearing metrics prevents assessment of whether the reported gains exceed baseline variance.
  Authors: We acknowledge the value of rigorous quantitative evaluation. In the revised version, we will add error bars from multiple seeds, statistical significance tests, and tables with FID and CLIP scores on 1024×1024 and 2048×2048 benchmarks to allow proper assessment of the gains.
  Revision: yes
- Referee: §4.2 (CFG tuning via average latent energy): the claim that simply increasing the CFG scale reliably corrects energy decay without introducing new artifacts (oversaturation, mode collapse, detail loss) is not supported by any reported artifact-detection experiments or diversity metrics. A single hyperparameter sweep without negative controls leaves the 'reliable fix' assertion open to question.
  Authors: We agree that additional validation is required. We will expand §4.2 with artifact-detection experiments, diversity metrics, and negative controls for the CFG scale tuning to demonstrate that it corrects energy decay without the listed side effects.
  Revision: yes
Circularity Check
No circularity; empirical observation and hyperparameter tuning
The paper's core contributions are a noise refresh strategy and CFG tuning informed by the energy decay observed through average latent energy analysis. These are presented as empirical findings, with no claimed derivation, first-principles prediction, or fitted parameter that reduces to its own inputs by construction. The provided text invokes no load-bearing self-citations, uniqueness theorems, or ansatzes. The method is explicitly training-free and is validated through comparisons, so the chain of reasoning is self-contained and checked against external benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel / Jcost_pos_of_ne_one (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  we are the first to observe the phenomenon of energy decay, which may cause image blurriness... average latent energy analysis and find that tuning the classifier-free guidance hyperparameter can significantly improve generation performance... energy rectification
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_high_calibrated_iff (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  E[x²_t] = sum x² / (C H W) ... as ω increases, the energy exhibits a gradually increasing trend
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.