RectifiedHR: Enable Efficient High-Resolution Synthesis via Energy Rectification

Guibao Shen; Liang Hou; Luozhou Wang; Minyang Li; Mushui Liu; Xin Tao; Ying-Cong Chen; Zhen Yang

arxiv: 2503.02537 · v4 · submitted 2025-03-04 · 💻 cs.CV · cs.AI

RectifiedHR: Enable Efficient High-Resolution Synthesis via Energy Rectification

Zhen Yang , Guibao Shen , Minyang Li , Liang Hou , Mushui Liu , Luozhou Wang , Xin Tao , Ying-Cong Chen This is my paper

Pith reviewed 2026-05-23 01:31 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords diffusion modelshigh-resolution synthesistraining-free methodsenergy decayclassifier-free guidanceimage generationcomputer vision

0 comments

The pith

RectifiedHR lets diffusion models synthesize high-resolution images without any retraining by refreshing noise and tuning guidance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a simple training-free procedure can restore high-resolution synthesis capability in diffusion models that were trained only at lower resolutions. It introduces a noise refresh step that re-enables the model’s native behavior at higher scales and identifies energy decay in the latent space as the source of the resulting blur. The authors show that measuring average latent energy and adjusting the classifier-free guidance scale corrects the blur while keeping the process efficient. Because the method requires no parameter updates and works alongside existing techniques, it directly addresses the practical barrier that high-resolution generation has posed for diffusion pipelines.

Core claim

RectifiedHR shows that a noise refresh strategy combined with classifier-free guidance tuned via average latent energy analysis restores efficient high-resolution synthesis in pre-trained diffusion models without any additional training.

What carries the argument

Noise refresh strategy that re-initializes the denoising trajectory at the target resolution, paired with average latent energy analysis to select an effective classifier-free guidance value.

If this is right

Pre-trained diffusion models can generate usable images at resolutions above their training scale without retraining or architectural changes.
The same procedure improves efficiency compared with prior training-based or multi-stage high-resolution approaches.
The method can be combined with editing, customization, and video pipelines that already rely on the underlying diffusion model.
Quantitative comparisons indicate higher visual quality and lower compute cost than existing baselines on the same models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The energy measurement step could be inserted into other diffusion workflows to detect and correct scale-dependent degradation without changing the model weights.
Because the fix is post-training, practitioners could apply RectifiedHR to any publicly released diffusion checkpoint to obtain higher-resolution output immediately.
The observation that latent energy tracks blurriness may motivate new monitoring tools for diagnosing generation failures at different resolutions.

Load-bearing premise

Energy decay during high-resolution denoising is the primary cause of blurriness and can be reliably corrected by adjusting the classifier-free guidance hyperparameter alone.

What would settle it

High-resolution outputs that remain blurry or acquire new artifacts even after the noise refresh step and the energy-guided guidance adjustment would show the method does not solve the core problem.

Figures

Figures reproduced from arXiv: 2503.02537 by Guibao Shen, Liang Hou, Luozhou Wang, Minyang Li, Mushui Liu, Xin Tao, Ying-Cong Chen, Zhen Yang.

**Figure 1.** Figure 1: Generated images by RectifiedHR. The training-free RectifiedHR enables diffusion models (SDXL is shown in the figure) to synthesize images at resolutions exceeding their original training resolution. Please zoom in for a closer view. Abstract Diffusion models have achieved remarkable progress across various visual generation tasks. However, their performance significantly declines when generating content … view at source ↗

**Figure 2.** Figure 2: The visualization images corresponding to “predicted [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: (a) The x-axis denotes the timesteps of the sampling process, and the y-axis indicates the average latent energy. The blue [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Overview of RectifiedHR. (a) The original sampling process and its pseudocode. (b) The sampling process and pseudocode of our method. The orange components in the pseudocode and modules correspond to Noise Refresh, while the purple components represent Energy Rectification. ϵ denotes Gaussian random noise, whose shape adapts to that of p˜ t x0 . The definitions of other symbols used in the pseudocode can b… view at source ↗

**Figure 5.** Figure 5: Qualitative comparison between our method and [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative results of the ablation studies at [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Qualitative comparison across three different resolutions between our method and other training-free methods. The red box [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Applications. (a) Results of integrating [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

**Figure 9.** Figure 9: The trend of the “predicted x0” at different timesteps t, denoted as p t x0 , evaluated on 100 random prompts. (a) The average MSE between p t x0 and p t−1 x0 . The x-axis represents the sampling timestep, and the y-axis denotes the average MSE. It can be observed that after approximately 30 steps, the rate of change in p t x0 slows significantly. (b) The trend of the average CLIP Score between p t x0 and … view at source ↗

**Figure 10.** Figure 10: Qualitative comparison on Stable Diffusion 3 at [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗

**Figure 11.** Figure 11: The image illustrates the ablation study of [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗

**Figure 12.** Figure 12: The image illustrates the ablation study of [PITH_FULL_IMAGE:figures/full_fig_p017_12.png] view at source ↗

**Figure 13.** Figure 13: The image illustrates the ablation study of [PITH_FULL_IMAGE:figures/full_fig_p017_13.png] view at source ↗

**Figure 15.** Figure 15: The image illustrates the ablation study of [PITH_FULL_IMAGE:figures/full_fig_p017_15.png] view at source ↗

**Figure 16.** Figure 16: The image illustrates the ablation study of [PITH_FULL_IMAGE:figures/full_fig_p017_16.png] view at source ↗

**Figure 17.** Figure 17: The image illustrates the ablation study of [PITH_FULL_IMAGE:figures/full_fig_p018_17.png] view at source ↗

**Figure 18.** Figure 18: The image illustrates the ablation study of [PITH_FULL_IMAGE:figures/full_fig_p018_18.png] view at source ↗

**Figure 19.** Figure 19: Visualization of the average latent energy curve following energy rectification. [PITH_FULL_IMAGE:figures/full_fig_p019_19.png] view at source ↗

read the original abstract

Diffusion models have achieved remarkable progress across various visual generation tasks. However, their performance significantly declines when generating content at resolutions higher than those used during training. Although numerous methods have been proposed to enable high-resolution generation, they all suffer from inefficiency. In this paper, we propose RectifiedHR, a straightforward and efficient solution for training-free high-resolution synthesis. Specifically, we propose a noise refresh strategy that unlocks the model's training-free high-resolution synthesis capability and improves efficiency. Additionally, we are the first to observe the phenomenon of energy decay, which may cause image blurriness during the high-resolution synthesis process. To address this issue, we introduce average latent energy analysis and find that tuning the classifier-free guidance hyperparameter can significantly improve generation performance. Our method is entirely training-free and demonstrates efficient performance. Furthermore, we show that RectifiedHR is compatible with various diffusion model techniques, enabling advanced features such as image editing, customized generation, and video synthesis. Extensive comparisons with numerous baseline methods validate the superior effectiveness and efficiency of RectifiedHR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RectifiedHR gives a simple training-free high-res tweak via noise refresh and CFG tuning, but the abstract leaves the causal claims unbacked by numbers or ablations.

read the letter

RectifiedHR claims a straightforward training-free route to high-resolution diffusion synthesis. The new elements are the reported energy decay observation and the noise refresh step that supposedly unlocks the capability without retraining. They also add average latent energy analysis to pick the right classifier-free guidance scale and show the method works with editing, customization, and video extensions. That compatibility and the no-training requirement are the practical upsides; they address a common complaint that existing high-res fixes are slow or heavy.

Referee Report

3 major / 2 minor

Summary. The paper proposes RectifiedHR, a training-free method for high-resolution synthesis with pre-trained diffusion models. It introduces a noise refresh strategy to unlock high-res capability and reports an observed 'energy decay' phenomenon in latent space during the process, which is hypothesized to cause blurriness. Average latent energy analysis is used to motivate tuning the classifier-free guidance (CFG) scale as a correction. The method is presented as efficient, compatible with editing/customization/video tasks, and superior in effectiveness and efficiency to prior baselines.

Significance. If the causal link between energy decay and blurriness is validated and the CFG correction shown to be robust without side effects, the approach could provide a lightweight, training-free route to high-resolution generation that avoids the cost of resolution-specific fine-tuning. Compatibility with other diffusion techniques would further increase its practical value.

major comments (3)

[Abstract and §3] Abstract and §3 (energy decay observation): the manuscript states that energy decay 'may cause image blurriness' and that CFG tuning 'can significantly improve generation performance,' yet supplies no controlled ablations that isolate energy decay from other factors (noise accumulation, UNet resolution mismatch) while holding sampling steps, scheduler, and prompt fixed. Without such isolation or secondary metrics (saturation histograms, diversity scores, artifact counts), the causal claim remains unverified.
[§4 and experimental results] §4 (RectifiedHR pipeline) and experimental results: superiority is asserted via 'extensive comparisons,' but the text provides neither error bars across multiple seeds, statistical significance tests, nor quantitative tables reporting FID/CLIP scores on standard high-res benchmarks (e.g., 1024×1024 or 2048×2048). The absence of these load-bearing metrics prevents assessment of whether the reported gains exceed baseline variance.
[§4.2] §4.2 (CFG tuning via average latent energy): the claim that simply increasing the CFG scale reliably corrects energy decay without introducing new artifacts (oversaturation, mode collapse, detail loss) is not supported by any reported artifact-detection experiments or diversity metrics. A single hyperparameter sweep without negative controls leaves the 'reliable fix' assertion open to question.

minor comments (2)

[§3] Notation for 'average latent energy' is introduced without an explicit equation; adding a short definition (e.g., E_avg = (1/N) Σ ||z_t||^2) would improve reproducibility.
[Figures] Figure captions should explicitly state the resolution, CFG scale, and number of sampling steps used for each visual comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our work. We provide point-by-point responses to the major comments below. We agree that the suggested additions will improve the manuscript and plan to incorporate them.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (energy decay observation): the manuscript states that energy decay 'may cause image blurriness' and that CFG tuning 'can significantly improve generation performance,' yet supplies no controlled ablations that isolate energy decay from other factors (noise accumulation, UNet resolution mismatch) while holding sampling steps, scheduler, and prompt fixed. Without such isolation or secondary metrics (saturation histograms, diversity scores, artifact counts), the causal claim remains unverified.

Authors: We agree that controlled ablations are necessary to strengthen the causal claim. In the revised manuscript, we will include experiments that isolate energy decay by controlling for noise accumulation and UNet resolution mismatch, while keeping sampling steps, scheduler, and prompt fixed. We will also report secondary metrics including saturation histograms, diversity scores, and artifact counts to verify the link to blurriness. revision: yes
Referee: [§4 and experimental results] §4 (RectifiedHR pipeline) and experimental results: superiority is asserted via 'extensive comparisons,' but the text provides neither error bars across multiple seeds, statistical significance tests, nor quantitative tables reporting FID/CLIP scores on standard high-res benchmarks (e.g., 1024×1024 or 2048×2048). The absence of these load-bearing metrics prevents assessment of whether the reported gains exceed baseline variance.

Authors: We acknowledge the value of rigorous quantitative evaluation. In the revised version, we will add error bars from multiple seeds, statistical significance tests, and tables with FID and CLIP scores on 1024×1024 and 2048×2048 benchmarks to allow proper assessment of the gains. revision: yes
Referee: [§4.2] §4.2 (CFG tuning via average latent energy): the claim that simply increasing the CFG scale reliably corrects energy decay without introducing new artifacts (oversaturation, mode collapse, detail loss) is not supported by any reported artifact-detection experiments or diversity metrics. A single hyperparameter sweep without negative controls leaves the 'reliable fix' assertion open to question.

Authors: We agree that additional validation is required. We will expand §4.2 with artifact-detection experiments, diversity metrics, and negative controls for the CFG scale tuning to demonstrate that it corrects energy decay without the listed side effects. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical observation and hyperparameter tuning

full rationale

The paper's core contributions are a noise refresh strategy and CFG tuning informed by observed energy decay via average latent energy analysis. These are presented as empirical findings without any claimed derivation, first-principles prediction, or fitted parameter that reduces to its own inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked in the provided text. The method is explicitly training-free and validated through comparisons, making the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.0 · 5731 in / 919 out tokens · 22462 ms · 2026-05-23T01:31:15.174074+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel / Jcost_pos_of_ne_one echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

we are the first to observe the phenomenon of energy decay, which may cause image blurriness... average latent energy analysis and find that tuning the classifier-free guidance hyperparameter can significantly improve generation performance... energy rectification
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean costAlphaLog_high_calibrated_iff echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

E[x²_t] = sum x² / (C H W) ... as ω increases, the energy exhibits a gradually increasing trend

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 13 internal anchors

[1]

Text2live: Text-driven layered image and video editing

Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kas- ten, and Tali Dekel. Text2live: Text-driven layered image and video editing. In ECCV, pages 707–723. Springer, 2022. 1

work page 2022
[2]

Multidiffusion: Fusing diffusion paths for controlled image generation

Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. 2023. 1, 2, 3

work page 2023
[3]

Demystifying MMD GANs

Mikołaj Bi´nkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans. arXiv preprint arXiv:1801.01401, 2018. 19

work page internal anchor Pith review Pith/arXiv arXiv 2018
[4]

Tim Brooks, Aleksander Holynski, and Alexei A. Efros. In- structpix2pix: Learning to follow image editing instructions. In CVPR, pages 18392–18402, 2023. 1

work page 2023
[5]

Ap-ldm: Attentive and progressive latent diffusion model for training-free high-resolution image generation

Boyuan Cao, Jiaxin Ye, Yujie Wei, and Hongming Shan. Ap-ldm: Attentive and progressive latent diffusion model for training-free high-resolution image generation. arXiv preprint arXiv:2410.06055, 2024. 2, 3

work page arXiv 2024
[6]

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart-alpha: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Pixart-sigma: Weak-to-strong train- ing of diffusion transformer for 4k text-to-image generation

Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-sigma: Weak-to-strong train- ing of diffusion transformer for 4k text-to-image generation. In European Conference on Computer Vision, pages 74–91. Springer, 2025. 2 10

work page 2025
[8]

Diffedit: Diffusion-based semantic image editing with mask guidance

Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. Diffedit: Diffusion-based semantic image editing with mask guidance. 2022. 1

work page 2022
[9]

Freecustom: Tuning- free customized image generation for multi-concept compo- sition

Ganggui Ding, Canyu Zhao, Wen Wang, Zhen Yang, Zide Liu, Hao Chen, and Chunhua Shen. Freecustom: Tuning- free customized image generation for multi-concept compo- sition. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition , pages 9089–9098,

work page
[10]

Demofusion: Democratising high- resolution image generation with no

Ruoyi Du, Dongliang Chang, Timothy Hospedales, Yi-Zhe Song, and Zhanyu Ma. Demofusion: Democratising high- resolution image generation with no. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6159–6168, 2024. 2, 3, 6, 19

work page 2024
[11]

Scaling recti- fied flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learn- ing, 2024. 1, 2, 6

work page 2024
[12]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patash- nik, Amit H Bermano, Gal Chechik, and Daniel Cohen- Or. An image is worth one word: Personalizing text-to- image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022. 1

work page internal anchor Pith review Pith/arXiv arXiv 2022
[13]

Make a cheap scaling: A self-cascade diffusion model for higher-resolution adapta- tion

Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xin- tao Wang, Qifeng Chen, et al. Make a cheap scaling: A self-cascade diffusion model for higher-resolution adapta- tion. In European Conference on Computer Vision , pages 39–55. Springer, 2024. 2

work page 2024
[14]

Elasticdiffusion: Training-free arbitrary size image genera- tion through global-local content separation

Moayed Haji-Ali, Guha Balakrishnan, and Vicente Ordonez. Elasticdiffusion: Training-free arbitrary size image genera- tion through global-local content separation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 6603–6612, 2024. 3, 6

work page 2024
[15]

Scalecrafter: Tuning-free higher- resolution visual generation with diffusion models

Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, and Ying Shan. Scalecrafter: Tuning-free higher- resolution visual generation with diffusion models. In The Twelfth International Conference on Learning Representa- tions, 2023. 2, 3, 6

work page 2023
[16]

Clipscore: A reference-free evaluation met- ric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation met- ric for image captioning. 2021. 14

work page 2021
[17]

Gans trained by a two time-scale update rule converge to a local nash equilib- rium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium. Advances in neural information processing systems , 30, 2017. 19

work page 2017
[18]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022. 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2022
[19]

Denoising dif- fusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020. 2

work page 2020
[20]

sim- ple diffusion: End-to-end diffusion for high resolution im- ages

Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. sim- ple diffusion: End-to-end diffusion for high resolution im- ages. In International Conference on Machine Learning , pages 13213–13232. PMLR, 2023. 6, 15

work page 2023
[21]

Fouriscale: A frequency perspective on training-free high-resolution im- age synthesis

Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, and Hongsheng Li. Fouriscale: A frequency perspective on training-free high-resolution im- age synthesis. In European Conference on Computer Vision, pages 196–212. Springer, 2025. 2, 3, 6

work page 2025
[22]

Upsample guidance: Scale up diffusion models without training

Juno Hwang, Yong-Hyun Park, and Junghyo Jo. Upsample guidance: Scale up diffusion models without training. arXiv preprint arXiv:2404.01709, 2024. 2, 3, 15

work page arXiv 2024
[23]

Pyramidal flow matching for efficient video generative modeling

Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954 ,

work page arXiv
[24]

Training- free diffusion model adaptation for variable-sized text-to- image synthesis

Zhiyu Jin, Xuli Shen, Bin Li, and Xiangyang Xue. Training- free diffusion model adaptation for variable-sized text-to- image synthesis. Advances in Neural Information Processing Systems, 36:70847–70860, 2023. 2, 3

work page 2023
[25]

Elucidating the design space of diffusion-based generative models

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in neural information processing systems, 35:26565–26577, 2022. 2

work page 2022
[26]

Imagic: Text-based real image editing with diffusion models

Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In CVPR, pages 6007–6017, 2023. 1

work page 2023
[27]

Diffusehigh: Training-free progressive high- resolution image synthesis through structure guidance.arXiv preprint arXiv:2406.18459, 2024

Younghyun Kim, Geunmin Hwang, Junyu Zhang, and Eun- byung Park. Diffusehigh: Training-free progressive high- resolution image synthesis through structure guidance.arXiv preprint arXiv:2406.18459, 2024. 2, 3, 6

work page arXiv 2024
[28]

Black Forest Labs. Flux. https://github.com/ black-forest-labs/flux, 2023. 1, 2

work page 2023
[29]

Syncdiffusion: Coherent montage via synchronized joint diffusions

Yuseung Lee, Kunho Kim, Hyunjin Kim, and Minhyuk Sung. Syncdiffusion: Coherent montage via synchronized joint diffusions. Advances in Neural Information Processing Systems, 36:50648–50660, 2023. 2, 3

work page 2023
[30]

Blip-diffusion: Pre- trained subject representation for controllable text-to-image generation and editing

Dongxu Li, Junnan Li, and Steven Hoi. Blip-diffusion: Pre- trained subject representation for controllable text-to-image generation and editing. Advances in Neural Information Pro- cessing Systems, 36, 2024. 1

work page 2024
[31]

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chi- nese understanding. arXiv preprint arXiv:2405.08748, 2024. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Cutdiffusion: A simple, fast, cheap, and strong diffusion extrapolation method

Mingbao Lin, Zhihang Lin, Wengyi Zhan, Liujuan Cao, and Rongrong Ji. Cutdiffusion: A simple, fast, cheap, and strong diffusion extrapolation method. arXiv preprint arXiv:2404.15141, 2024. 2, 3, 6

work page arXiv 2024
[33]

Accdiffusion: An accurate method for higher-resolution im- age generation

Zhihang Lin, Mingbao Lin, Meng Zhao, and Rongrong Ji. Accdiffusion: An accurate method for higher-resolution im- age generation. In European Conference on Computer Vi- sion, pages 38–53. Springer, 2025. 2, 3, 6, 19

work page 2025
[34]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil- ian Nickel, and Matt Le. Flow matching for generative mod- eling. arXiv preprint arXiv:2210.02747, 2022. 2 11

work page internal anchor Pith review Pith/arXiv arXiv 2022
[35]

Llm4gen: Leveraging semantic representation of llms for text-to-image generation

Mushui Liu, Yuhang Ma, Yang Zhen, Jun Dan, Yunlong Yu, Zeng Zhao, Zhipeng Hu, Bai Liu, and Changjie Fan. Llm4gen: Leveraging semantic representation of llms for text-to-image generation. arXiv preprint arXiv:2407.00737,

work page arXiv
[36]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[37]

Hiprompt: Tuning-free higher-resolution gen- eration with hierarchical mllm prompts

Xinyu Liu, Yingqing He, Lanqing Guo, Xiang Li, Bu Jin, Peng Li, Yan Li, Chi-Min Chan, Qifeng Chen, Wei Xue, et al. Hiprompt: Tuning-free higher-resolution gen- eration with hierarchical mllm prompts. arXiv preprint arXiv:2409.02919, 2024. 2, 3

work page arXiv 2024
[38]

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high- resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jia- jun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equa- tions. arXiv preprint arXiv:2108.01073, 2021. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2021
[40]

Negative-prompt inversion: Fast image inversion for editing with text-guided diffusion models

Daiki Miyake, Akihiro Iohara, Yu Saito, and Toshiyuki Tanaka. Negative-prompt inversion: Fast image inversion for editing with text-guided diffusion models. 2023. 1

work page 2023
[41]

Null-text inversion for editing real images using guided diffusion models

Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In CVPR, pages 6038–6047,

work page
[42]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion mod- els for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023. 1, 2, 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 19

work page 2021
[44]

Ultrapixel: Advancing ultra-high-resolution image synthesis to new peaks

Jingjing Ren, Wenbo Li, Haoyu Chen, Renjing Pei, Bin Shao, Yong Guo, Long Peng, Fenglong Song, and Lei Zhu. Ultrapixel: Advancing ultra-high-resolution image synthesis to new peaks. arXiv preprint arXiv:2407.02158, 2024. 2

work page arXiv 2024
[45]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1, 2

work page 2022
[46]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 22500– 22510, 2023. 18

work page 2023
[47]

Hyperdreambooth: Hypernet- works for fast personalization of text-to-image models,

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Wei Wei, Tingbo Hou, Yael Pritch, Neal Wadhwa, Michael Rubinstein, and Kfir Aberman. Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models. arXiv preprint arXiv:2307.06949, 2023. 1

work page arXiv 2023
[48]

Improved techniques for training gans

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016. 19

work page 2016
[49]

Laion-5b: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- man, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural In- formation Processing Systems, 35:25278–25294, 2022. 14, 19

work page 2022
[50]

Resmaster: Mastering high- resolution image generation via structural and fine-grained guidance

Shuwei Shi, Wenbo Li, Yuechen Zhang, Jingwen He, Biao Gong, and Yinqiang Zheng. Resmaster: Mastering high- resolution image generation via structural and fine-grained guidance. arXiv preprint arXiv:2406.16476, 2024. 2, 3

work page arXiv 2024
[51]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020. 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2010
[52]

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equa- tions. arXiv preprint arXiv:2011.13456, 2020. 2

work page internal anchor Pith review Pith/arXiv arXiv 2011
[53]

Relay diffusion: Unifying diffusion process across resolutions for image syn- thesis

Jiayan Teng, Wendi Zheng, Ming Ding, Wenyi Hong, Jian- qiao Wangni, Zhuoyi Yang, and Jie Tang. Relay diffusion: Unifying diffusion process across resolutions for image syn- thesis. arXiv preprint arXiv:2309.03350, 2023. 2

work page arXiv 2023
[54]

Key-locked rank one editing for text-to-image personaliza- tion

Yoad Tewel, Rinon Gal, Gal Chechik, and Yuval Atzmon. Key-locked rank one editing for text-to-image personaliza- tion. In ACM SIGGRAPH 2023 Conference Proceedings ,

work page 2023
[55]

Plug-and-play diffusion features for text-driven image-to-image translation

Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In CVPR, pages 1921–1930,

work page 1921
[56]

Wan: Open and Advanced Large-Scale Video Generative Models

Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, et al. Wan: Open and advanced large-scale video gen- erative models. arXiv preprint arXiv:2503.20314, 2025. 6, 10, 16

work page internal anchor Pith review Pith/arXiv arXiv 2025
[57]

Megafusion: Extend dif- fusion models towards higher-resolution image generation without further tuning

Haoning Wu, Shaocheng Shen, Qiang Hu, Xiaoyun Zhang, Ya Zhang, and Yanfeng Wang. Megafusion: Extend dif- fusion models towards higher-resolution image generation without further tuning. arXiv preprint arXiv:2408.11001 ,

work page arXiv
[58]

Object-aware inver- sion and reassembly for image editing

Zhen Yang, Ganggui Ding, Wen Wang, Hao Chen, Bo- han Zhuang, and Chunhua Shen. Object-aware inver- sion and reassembly for image editing. arXiv preprint arXiv:2310.12149, 2023. 1, 17

work page arXiv 2023
[59]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023. 10, 18

work page 2023
[60]

Hidiffusion: Unlocking high-resolution creativity and effi- 12 ciency in low-resolution trained diffusion models

Shen Zhang, Zhaowei Chen, Zhenyu Zhao, Zhenyuan Chen, Yao Tang, Yuhao Chen, Wengang Cao, and Jiajun Liang. Hidiffusion: Unlocking high-resolution creativity and effi- 12 ciency in low-resolution trained diffusion models. arXiv preprint arXiv:2311.17528, 2023. 2, 3, 6

work page arXiv 2023
[61]

Frecas: Efficient higher-resolution image generation via frequency- aware cascaded sampling

Zhengqiang Zhang, Ruihuang Li, and Lei Zhang. Frecas: Efficient higher-resolution image generation via frequency- aware cascaded sampling. arXiv preprint arXiv:2410.18410,

work page arXiv
[62]

Lumina-next: Making lumina-t2x stronger and faster with next-dit

Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Lirui Zhao, Fu-Yun Wang, Zhanyu Ma, et al. Lumina-next: Making lumina-t2x stronger and faster with next-dit. arXiv preprint arXiv:2406.18583 ,

work page arXiv
[63]

Predicted x0

Supplementary 7.1. Quantitative Analysis of “Predicted x0” To quantitatively validate this observation, as shown in Fig.9, we conduct additional experiments on the generation of pt x0 using 100 random prompts sampled from LAION-5B [49], and analyze the CLIP Score [16] and Mean Squared Error (MSE). From Fig. 9a, we observe that after 30 denoising steps, th...

work page 2048

[1] [1]

Text2live: Text-driven layered image and video editing

Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kas- ten, and Tali Dekel. Text2live: Text-driven layered image and video editing. In ECCV, pages 707–723. Springer, 2022. 1

work page 2022

[2] [2]

Multidiffusion: Fusing diffusion paths for controlled image generation

Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. 2023. 1, 2, 3

work page 2023

[3] [3]

Demystifying MMD GANs

Mikołaj Bi´nkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans. arXiv preprint arXiv:1801.01401, 2018. 19

work page internal anchor Pith review Pith/arXiv arXiv 2018

[4] [4]

Tim Brooks, Aleksander Holynski, and Alexei A. Efros. In- structpix2pix: Learning to follow image editing instructions. In CVPR, pages 18392–18402, 2023. 1

work page 2023

[5] [5]

Ap-ldm: Attentive and progressive latent diffusion model for training-free high-resolution image generation

Boyuan Cao, Jiaxin Ye, Yujie Wei, and Hongming Shan. Ap-ldm: Attentive and progressive latent diffusion model for training-free high-resolution image generation. arXiv preprint arXiv:2410.06055, 2024. 2, 3

work page arXiv 2024

[6] [6]

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart-alpha: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[7] [7]

Pixart-sigma: Weak-to-strong train- ing of diffusion transformer for 4k text-to-image generation

Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-sigma: Weak-to-strong train- ing of diffusion transformer for 4k text-to-image generation. In European Conference on Computer Vision, pages 74–91. Springer, 2025. 2 10

work page 2025

[8] [8]

Diffedit: Diffusion-based semantic image editing with mask guidance

Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. Diffedit: Diffusion-based semantic image editing with mask guidance. 2022. 1

work page 2022

[9] [9]

Freecustom: Tuning- free customized image generation for multi-concept compo- sition

Ganggui Ding, Canyu Zhao, Wen Wang, Zhen Yang, Zide Liu, Hao Chen, and Chunhua Shen. Freecustom: Tuning- free customized image generation for multi-concept compo- sition. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition , pages 9089–9098,

work page

[10] [10]

Demofusion: Democratising high- resolution image generation with no

Ruoyi Du, Dongliang Chang, Timothy Hospedales, Yi-Zhe Song, and Zhanyu Ma. Demofusion: Democratising high- resolution image generation with no. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6159–6168, 2024. 2, 3, 6, 19

work page 2024

[11] [11]

Scaling recti- fied flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learn- ing, 2024. 1, 2, 6

work page 2024

[12] [12]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patash- nik, Amit H Bermano, Gal Chechik, and Daniel Cohen- Or. An image is worth one word: Personalizing text-to- image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022. 1

work page internal anchor Pith review Pith/arXiv arXiv 2022

[13] [13]

Make a cheap scaling: A self-cascade diffusion model for higher-resolution adapta- tion

Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xin- tao Wang, Qifeng Chen, et al. Make a cheap scaling: A self-cascade diffusion model for higher-resolution adapta- tion. In European Conference on Computer Vision , pages 39–55. Springer, 2024. 2

work page 2024

[14] [14]

Elasticdiffusion: Training-free arbitrary size image genera- tion through global-local content separation

Moayed Haji-Ali, Guha Balakrishnan, and Vicente Ordonez. Elasticdiffusion: Training-free arbitrary size image genera- tion through global-local content separation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 6603–6612, 2024. 3, 6

work page 2024

[15] [15]

Scalecrafter: Tuning-free higher- resolution visual generation with diffusion models

Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, and Ying Shan. Scalecrafter: Tuning-free higher- resolution visual generation with diffusion models. In The Twelfth International Conference on Learning Representa- tions, 2023. 2, 3, 6

work page 2023

[16] [16]

Clipscore: A reference-free evaluation met- ric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation met- ric for image captioning. 2021. 14

work page 2021

[17] [17]

Gans trained by a two time-scale update rule converge to a local nash equilib- rium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium. Advances in neural information processing systems , 30, 2017. 19

work page 2017

[18] [18]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022. 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2022

[19] [19]

Denoising dif- fusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020. 2

work page 2020

[20] [20]

sim- ple diffusion: End-to-end diffusion for high resolution im- ages

Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. sim- ple diffusion: End-to-end diffusion for high resolution im- ages. In International Conference on Machine Learning , pages 13213–13232. PMLR, 2023. 6, 15

work page 2023

[21] [21]

Fouriscale: A frequency perspective on training-free high-resolution im- age synthesis

Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, and Hongsheng Li. Fouriscale: A frequency perspective on training-free high-resolution im- age synthesis. In European Conference on Computer Vision, pages 196–212. Springer, 2025. 2, 3, 6

work page 2025

[22] [22]

Upsample guidance: Scale up diffusion models without training

Juno Hwang, Yong-Hyun Park, and Junghyo Jo. Upsample guidance: Scale up diffusion models without training. arXiv preprint arXiv:2404.01709, 2024. 2, 3, 15

work page arXiv 2024

[23] [23]

Pyramidal flow matching for efficient video generative modeling

Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954 ,

work page arXiv

[24] [24]

Training- free diffusion model adaptation for variable-sized text-to- image synthesis

Zhiyu Jin, Xuli Shen, Bin Li, and Xiangyang Xue. Training- free diffusion model adaptation for variable-sized text-to- image synthesis. Advances in Neural Information Processing Systems, 36:70847–70860, 2023. 2, 3

work page 2023

[25] [25]

Elucidating the design space of diffusion-based generative models

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in neural information processing systems, 35:26565–26577, 2022. 2

work page 2022

[26] [26]

Imagic: Text-based real image editing with diffusion models

Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In CVPR, pages 6007–6017, 2023. 1

work page 2023

[27] [27]

Diffusehigh: Training-free progressive high- resolution image synthesis through structure guidance.arXiv preprint arXiv:2406.18459, 2024

Younghyun Kim, Geunmin Hwang, Junyu Zhang, and Eun- byung Park. Diffusehigh: Training-free progressive high- resolution image synthesis through structure guidance.arXiv preprint arXiv:2406.18459, 2024. 2, 3, 6

work page arXiv 2024

[28] [28]

Black Forest Labs. Flux. https://github.com/ black-forest-labs/flux, 2023. 1, 2

work page 2023

[29] [29]

Syncdiffusion: Coherent montage via synchronized joint diffusions

Yuseung Lee, Kunho Kim, Hyunjin Kim, and Minhyuk Sung. Syncdiffusion: Coherent montage via synchronized joint diffusions. Advances in Neural Information Processing Systems, 36:50648–50660, 2023. 2, 3

work page 2023

[30] [30]

Blip-diffusion: Pre- trained subject representation for controllable text-to-image generation and editing

Dongxu Li, Junnan Li, and Steven Hoi. Blip-diffusion: Pre- trained subject representation for controllable text-to-image generation and editing. Advances in Neural Information Pro- cessing Systems, 36, 2024. 1

work page 2024

[31] [31]

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chi- nese understanding. arXiv preprint arXiv:2405.08748, 2024. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024

[32] [32]

Cutdiffusion: A simple, fast, cheap, and strong diffusion extrapolation method

Mingbao Lin, Zhihang Lin, Wengyi Zhan, Liujuan Cao, and Rongrong Ji. Cutdiffusion: A simple, fast, cheap, and strong diffusion extrapolation method. arXiv preprint arXiv:2404.15141, 2024. 2, 3, 6

work page arXiv 2024

[33] [33]

Accdiffusion: An accurate method for higher-resolution im- age generation

Zhihang Lin, Mingbao Lin, Meng Zhao, and Rongrong Ji. Accdiffusion: An accurate method for higher-resolution im- age generation. In European Conference on Computer Vi- sion, pages 38–53. Springer, 2025. 2, 3, 6, 19

work page 2025

[34] [34]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil- ian Nickel, and Matt Le. Flow matching for generative mod- eling. arXiv preprint arXiv:2210.02747, 2022. 2 11

work page internal anchor Pith review Pith/arXiv arXiv 2022

[35] [35]

Llm4gen: Leveraging semantic representation of llms for text-to-image generation

Mushui Liu, Yuhang Ma, Yang Zhen, Jun Dan, Yunlong Yu, Zeng Zhao, Zhipeng Hu, Bai Liu, and Changjie Fan. Llm4gen: Leveraging semantic representation of llms for text-to-image generation. arXiv preprint arXiv:2407.00737,

work page arXiv

[36] [36]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022

[37] [37]

Hiprompt: Tuning-free higher-resolution gen- eration with hierarchical mllm prompts

Xinyu Liu, Yingqing He, Lanqing Guo, Xiang Li, Bu Jin, Peng Li, Yan Li, Chi-Min Chan, Qifeng Chen, Wei Xue, et al. Hiprompt: Tuning-free higher-resolution gen- eration with hierarchical mllm prompts. arXiv preprint arXiv:2409.02919, 2024. 2, 3

work page arXiv 2024

[38] [38]

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high- resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[39] [39]

SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jia- jun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equa- tions. arXiv preprint arXiv:2108.01073, 2021. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2021

[40] [40]

Negative-prompt inversion: Fast image inversion for editing with text-guided diffusion models

Daiki Miyake, Akihiro Iohara, Yu Saito, and Toshiyuki Tanaka. Negative-prompt inversion: Fast image inversion for editing with text-guided diffusion models. 2023. 1

work page 2023

[41] [41]

Null-text inversion for editing real images using guided diffusion models

Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In CVPR, pages 6038–6047,

work page

[42] [42]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion mod- els for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023. 1, 2, 6

work page internal anchor Pith review Pith/arXiv arXiv 2023

[43] [43]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 19

work page 2021

[44] [44]

Ultrapixel: Advancing ultra-high-resolution image synthesis to new peaks

Jingjing Ren, Wenbo Li, Haoyu Chen, Renjing Pei, Bin Shao, Yong Guo, Long Peng, Fenglong Song, and Lei Zhu. Ultrapixel: Advancing ultra-high-resolution image synthesis to new peaks. arXiv preprint arXiv:2407.02158, 2024. 2

work page arXiv 2024

[45] [45]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1, 2

work page 2022

[46] [46]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 22500– 22510, 2023. 18

work page 2023

[47] [47]

Hyperdreambooth: Hypernet- works for fast personalization of text-to-image models,

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Wei Wei, Tingbo Hou, Yael Pritch, Neal Wadhwa, Michael Rubinstein, and Kfir Aberman. Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models. arXiv preprint arXiv:2307.06949, 2023. 1

work page arXiv 2023

[48] [48]

Improved techniques for training gans

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016. 19

work page 2016

[49] [49]

Laion-5b: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- man, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural In- formation Processing Systems, 35:25278–25294, 2022. 14, 19

work page 2022

[50] [50]

Resmaster: Mastering high- resolution image generation via structural and fine-grained guidance

Shuwei Shi, Wenbo Li, Yuechen Zhang, Jingwen He, Biao Gong, and Yinqiang Zheng. Resmaster: Mastering high- resolution image generation via structural and fine-grained guidance. arXiv preprint arXiv:2406.16476, 2024. 2, 3

work page arXiv 2024

[51] [51]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020. 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2010

[52] [52]

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equa- tions. arXiv preprint arXiv:2011.13456, 2020. 2

work page internal anchor Pith review Pith/arXiv arXiv 2011

[53] [53]

Relay diffusion: Unifying diffusion process across resolutions for image syn- thesis

Jiayan Teng, Wendi Zheng, Ming Ding, Wenyi Hong, Jian- qiao Wangni, Zhuoyi Yang, and Jie Tang. Relay diffusion: Unifying diffusion process across resolutions for image syn- thesis. arXiv preprint arXiv:2309.03350, 2023. 2

work page arXiv 2023

[54] [54]

Key-locked rank one editing for text-to-image personaliza- tion

Yoad Tewel, Rinon Gal, Gal Chechik, and Yuval Atzmon. Key-locked rank one editing for text-to-image personaliza- tion. In ACM SIGGRAPH 2023 Conference Proceedings ,

work page 2023

[55] [55]

Plug-and-play diffusion features for text-driven image-to-image translation

Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In CVPR, pages 1921–1930,

work page 1921

[56] [56]

Wan: Open and Advanced Large-Scale Video Generative Models

Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, et al. Wan: Open and advanced large-scale video gen- erative models. arXiv preprint arXiv:2503.20314, 2025. 6, 10, 16

work page internal anchor Pith review Pith/arXiv arXiv 2025

[57] [57]

Megafusion: Extend dif- fusion models towards higher-resolution image generation without further tuning

Haoning Wu, Shaocheng Shen, Qiang Hu, Xiaoyun Zhang, Ya Zhang, and Yanfeng Wang. Megafusion: Extend dif- fusion models towards higher-resolution image generation without further tuning. arXiv preprint arXiv:2408.11001 ,

work page arXiv

[58] [58]

Object-aware inver- sion and reassembly for image editing

Zhen Yang, Ganggui Ding, Wen Wang, Hao Chen, Bo- han Zhuang, and Chunhua Shen. Object-aware inver- sion and reassembly for image editing. arXiv preprint arXiv:2310.12149, 2023. 1, 17

work page arXiv 2023

[59] [59]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023. 10, 18

work page 2023

[60] [60]

Hidiffusion: Unlocking high-resolution creativity and effi- 12 ciency in low-resolution trained diffusion models

Shen Zhang, Zhaowei Chen, Zhenyu Zhao, Zhenyuan Chen, Yao Tang, Yuhao Chen, Wengang Cao, and Jiajun Liang. Hidiffusion: Unlocking high-resolution creativity and effi- 12 ciency in low-resolution trained diffusion models. arXiv preprint arXiv:2311.17528, 2023. 2, 3, 6

work page arXiv 2023

[61] [61]

Frecas: Efficient higher-resolution image generation via frequency- aware cascaded sampling

Zhengqiang Zhang, Ruihuang Li, and Lei Zhang. Frecas: Efficient higher-resolution image generation via frequency- aware cascaded sampling. arXiv preprint arXiv:2410.18410,

work page arXiv

[62] [62]

Lumina-next: Making lumina-t2x stronger and faster with next-dit

Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Lirui Zhao, Fu-Yun Wang, Zhanyu Ma, et al. Lumina-next: Making lumina-t2x stronger and faster with next-dit. arXiv preprint arXiv:2406.18583 ,

work page arXiv

[63] [63]

Predicted x0

Supplementary 7.1. Quantitative Analysis of “Predicted x0” To quantitatively validate this observation, as shown in Fig.9, we conduct additional experiments on the generation of pt x0 using 100 random prompts sampled from LAION-5B [49], and analyze the CLIP Score [16] and Mean Squared Error (MSE). From Fig. 9a, we observe that after 30 denoising steps, th...

work page 2048