pith. sign in

arxiv: 2605.22668 · v1 · pith:WAXUB2LLnew · submitted 2026-05-21 · 💻 cs.CV

SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers

Pith reviewed 2026-05-22 06:06 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion transformersresolution extrapolationrotary position embeddingsattention scalinghigh-resolution generationtraining-free methodsspectral energy guidanceSEGA
0
0 comments X

The pith

SEGA dynamically scales RoPE attention in diffusion transformers using the latent's spatial-frequency structure at each denoising step to improve high-resolution synthesis beyond training ranges.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SEGA as a training-free technique that adjusts attention scaling across different frequency components of Rotary Position Embeddings based on the current latent's spectral energy. Existing uniform scaling methods create a compromise between keeping overall image structure and adding fine details when generating at unseen resolutions. By making the scaling adaptive to the latent content at every step, SEGA aims to reduce artifacts and produce more coherent high-resolution outputs. A sympathetic reader would care because this approach could extend the usable range of existing diffusion models without the cost of retraining them on higher-resolution data.

Core claim

SEGA computes the spatial-frequency structure of the latent representation during denoising and uses it to guide per-component scaling of RoPE in the attention layers of Diffusion Transformers. This replaces the content-agnostic uniform scaling used in prior extrapolation methods, allowing the model to better preserve global structure while recovering fine details at target resolutions outside the training distribution.

What carries the argument

SEGA, the spectral-energy guided attention mechanism that derives dynamic scaling factors for RoPE components from the latent's frequency content at each denoising timestep.

If this is right

  • High-resolution synthesis improves consistently across multiple target resolutions without retraining.
  • Both structural coherence and fine-detail fidelity increase compared with uniform RoPE extrapolation.
  • The method outperforms existing training-free baselines on standard high-resolution generation tasks.
  • No additional training or model modification is required beyond the inference-time attention adjustment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same spectral-energy guidance could be tested on other position-embedding schemes or attention variants in generative models.
  • If the method generalizes, it may reduce the data and compute needed to train models that support variable output resolutions.
  • A natural extension would be to apply the guidance across multiple denoising stages or to related tasks such as image editing at extrapolated sizes.

Load-bearing premise

The latent's spatial-frequency structure at each denoising step provides a reliable signal for scaling RoPE components without introducing new inconsistencies or artifacts in the final image.

What would settle it

Apply SEGA and a uniform-scaling baseline to the same DiT at a resolution twice the training size; if the frequency-guided version produces measurably more artifacts or lower perceptual quality on a standard benchmark, the adaptive scaling claim is falsified.

Figures

Figures reproduced from arXiv: 2605.22668 by Babak Taati, David B. Lindell, Javad Rajabi, Kimia Shaban, Koorosh Roohi.

Figure 1
Figure 1. Figure 1: Gallery of SEGA. SEGA unlocks the high-resolution generation capabilities of pre-trained T2I models (Flux [1] and Qwen [2]), producing high-quality images. Best viewed zoomed in. Abstract Diffusion transformers (DiTs) have emerged as a dominant architecture for text￾to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches… view at source ↗
Figure 2
Figure 2. Figure 2: Trade-offs in attention scaling at 40962 . RoPE components are coupled to spatial frequencies: low-frequency components support coarse detail and structure, whereas high-frequency components support fine detail and texture. Static scaling fails to balance this trade-off, leading to different failure modes in (a)–(c). SEGA (d) resolves them by dynamically allocating scaling according to spectral energy. Gre… view at source ↗
Figure 3
Figure 3. Figure 3: SEGA scaling maps at 40962 . For two representative prompts, the scaling maps show how the horizontal-axis scaling magnitudes md change across RoPE dimensions over denoising time. geometric mean to the arithmetic mean of a power spectrum. Applied to Eiso, this yields SF(Eiso) = exp 1 n (iso) bins Pn (iso) bins −1 b=0 ln Eiso[b]  1 n (iso) bins Pn (iso) bins −1 b=0 Eiso[b] ∈ (0, 1], (5) where n (iso) bins… view at source ↗
Figure 4
Figure 4. Figure 4: Impact on Attention Evolution. Visual comparison of attention maps for the center latent token in YaRN and SEGA across multiple denoising steps, evaluated on Flux at 40962 . YaRN DyPE UltraImage SEGA Flux "A woman with glasses and a sword poses confidently beside a muscular man in traditional attire holding a sword, while a small creature with a large hat waves enthusiastically…" "A boy in a cap sits relax… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison. Results on two representative prompts for Qwen and Flux at 40962 resolution show that SEGA improves structural coherence and fine detail over other methods. redistributes scaling across RoPE dimension d to sharpen focus at under-resolved spatial frequency bands while softening it at over-emphasized ones. This content-aware spectral redistribution directly impacts the attention mecha… view at source ↗
Figure 6
Figure 6. Figure 6: Content-Aware Spectral Evolution. The 2D power spectrum of the intermediate latents across the denoising process for two distinct prompts. The spectral energy distribution varies depend￾ing on the image content, demonstrating the necessity of a content-aware approach. Furthermore, the shifting concentration of energy, particularly in low-frequency bands where static over-scaling introduces structural artif… view at source ↗
Figure 7
Figure 7. Figure 7: Attention Entropy. The delta of attention entropy value between different methods and the baseline image generated at 10242 resolution on Flux. A smaller difference indicates a closer attention structure to the baseline image generated without any RoPE extrapolation and scaling methods. direct empirical support for this premise. Each heatmap shows the normalized 2D power spectrum of the intermediate latent… view at source ↗
Figure 8
Figure 8. Figure 8: Impact on Attention Evolution (Other Tokens). Further visual comparison of attention maps for the top-center, middle-left, and bottom-center latent tokens in YaRN and SEGA across multiple denoising steps, evaluated on Flux at 40962 . 22 [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Qualitative comparison (non-square resolutions). Results on two non-square resolutions (2048 × 4096 and 4096 × 2048) on Qwen and Flux show that SEGA’s ability to preserve the shape of contents in different aspect ratio. 23 [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Qualitative comparison (Zero-Shot Dataset). Results on prompts from the zero-shot dataset for Qwen and Flux at 40962 resolution show that SEGA handles complex environments, objects and areas with reflection, contents with challenging lighting, and preserves the shapes of the objects. 24 [PITH_FULL_IMAGE:figures/full_fig_p024_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Qualitative comparison (with guidance-based approaches). Results on two representa￾tive prompts for Flux at 40962 resolution in comparison with top guidance-based approaches show that SEGA is not limited to the synthesized image at base resolution and provides fine details and high-quality textures. DyPE UltraImage SEGA Flux "An elderly man with braided hair wears a striped headband and purple clothing, h… view at source ↗
Figure 12
Figure 12. Figure 12: Qualitative comparison (at 51202 resolution). Results on two representative prompts for Qwen and Flux at 51202 resolution show that SEGA elaborates on coarse and fine details as the resolution of the images increases. 25 [PITH_FULL_IMAGE:figures/full_fig_p025_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Qualitative comparison (at 61442 resolution). Results on two representative prompts for Qwen and Flux at 61442 resolution show that SEGA makes image synthesis at this resolution possible while baselines struggle with noise and collapse of global structures. 26 [PITH_FULL_IMAGE:figures/full_fig_p026_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Visualizing Fine-Grained Details at Extreme Resolutions. Sample generated at 61442 resolution by SEGA on Qwen. The model successfully preserves high-frequency local textures and sharp structural boundaries without experiencing structural collapse or repetition artifacts typical of long-context length extrapolation. 27 [PITH_FULL_IMAGE:figures/full_fig_p027_14.png] view at source ↗
read the original abstract

Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying inference-time attention behavior, often through Rotary Position Embeddings (RoPE) extrapolation combined with attention scaling. However, these strategies apply a uniform and content-agnostic scaling across RoPE components with distinct frequency characteristics, inducing a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent's spatial-frequency structure at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently improves high-resolution synthesis across multiple target resolutions, outperforming state-of-the-art training-free baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SEGA, a training-free method for resolution extrapolation in Diffusion Transformers. It dynamically scales attention across RoPE frequency components according to the spatial-frequency energy distribution extracted from the latent at each denoising step, with the goal of resolving the global-structure versus fine-detail trade-off that arises from uniform, content-agnostic scaling. The abstract claims that this adaptive rule yields consistent improvements in high-resolution synthesis over existing training-free baselines across multiple target resolutions.

Significance. If the central claim is substantiated, the work would be significant for practical high-resolution deployment of pre-trained DiT models, as it supplies a content-adaptive, frequency-analysis-based alternative to fixed extrapolation heuristics without requiring retraining. The training-free character and the attempt to ground scaling in per-step spectral properties are clear strengths that could be adopted in production pipelines if the reliability of the signal is demonstrated.

major comments (2)
  1. [Abstract] Abstract: the claim that SEGA 'consistently improves high-resolution synthesis' and 'outperforms state-of-the-art training-free baselines' is asserted without any quantitative metrics, ablation tables, or error analysis. Because the central claim rests on empirical superiority, the absence of these data in the abstract (and the reader's note that the full experimental section is required) makes it impossible to evaluate effect size or robustness.
  2. [Method / Experiments] Method description (and Experiments): the load-bearing premise that the latent's spatial-frequency structure supplies a stable, content-adaptive signal at every denoising step is not yet shown to survive the low-SNR regime. Early timesteps contain essentially Gaussian noise whose power spectrum is flat; any spectral-energy estimate extracted then is dominated by sampling variance. If SEGA modulates RoPE scaling with these noisy estimates, the resulting per-component factors can fluctuate across steps or seeds, potentially re-introducing the very inconsistencies the method is intended to avoid. Explicit stability analysis or timestep-wise ablation is needed to confirm the assumption holds.
minor comments (2)
  1. [Method] Clarify the precise definition of 'spectral-energy' (e.g., which frequency bins, normalization, or aggregation across channels) and how it is mapped to per-component scaling factors; an equation or pseudocode block would remove ambiguity.
  2. [Figures] Figure captions and axis labels should explicitly state the target resolutions, baseline methods, and whether results are averaged over multiple seeds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Revisions have been made to the manuscript to incorporate the referee's suggestions where they strengthen the presentation of our results and analysis.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that SEGA 'consistently improves high-resolution synthesis' and 'outperforms state-of-the-art training-free baselines' is asserted without any quantitative metrics, ablation tables, or error analysis. Because the central claim rests on empirical superiority, the absence of these data in the abstract (and the reader's note that the full experimental section is required) makes it impossible to evaluate effect size or robustness.

    Authors: The abstract is intended as a concise summary, while the full quantitative evidence—including FID, CLIP scores, user studies, and ablation tables comparing against training-free baselines across multiple resolutions—is provided in the Experiments section. We acknowledge that embedding a few key numerical results directly in the abstract would make the central claim easier to evaluate at a glance. The revised manuscript therefore updates the abstract to report the average improvement margins observed in our evaluations. revision: yes

  2. Referee: [Method / Experiments] Method description (and Experiments): the load-bearing premise that the latent's spatial-frequency structure supplies a stable, content-adaptive signal at every denoising step is not yet shown to survive the low-SNR regime. Early timesteps contain essentially Gaussian noise whose power spectrum is flat; any spectral-energy estimate extracted then is dominated by sampling variance. If SEGA modulates RoPE scaling with these noisy estimates, the resulting per-component factors can fluctuate across steps or seeds, potentially re-introducing the very inconsistencies the method is intended to avoid. Explicit stability analysis or timestep-wise ablation is needed to confirm the assumption holds.

    Authors: We agree that early timesteps are noise-dominated and that spectral estimates carry higher variance in that regime. Our existing experiments already show consistent gains across random seeds and target resolutions, indicating that any early-step fluctuations do not degrade final output quality. Nevertheless, to directly substantiate stability, the revised manuscript adds a dedicated analysis subsection that reports the timestep-wise variance of the extracted spectral-energy vectors and includes an ablation that disables SEGA guidance for the first k steps. These results confirm that the adaptive scaling remains beneficial throughout the trajectory without introducing noticeable inconsistencies. revision: yes

Circularity Check

0 steps flagged

No circularity: adaptive scaling rule derived from explicit frequency analysis of latents, independent of fitted parameters or self-citations.

full rationale

The provided abstract and context describe SEGA as a training-free method that computes per-component scaling directly from the latent's spatial-frequency structure at each denoising step. No equations or claims reduce a prediction to a fitted input by construction, nor does the central premise rest on a self-citation chain or imported uniqueness theorem. The derivation appears self-contained against external benchmarks of frequency content, with no evidence of renaming known results or smuggling ansatzes via prior work. This matches the default expectation of an honest non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; a full audit requires the methods and equations sections.

pith-pipeline@v0.9.0 · 5685 in / 925 out tokens · 32828 ms · 2026-05-22T06:06:51.008888+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 8 internal anchors

  1. [1]

    Flux.https://github.com/black-forest-labs/flux, 2024

    Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024

  2. [2]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

  3. [3]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  4. [4]

    All are worth words: A vit backbone for diffusion models

    Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22669–22679, 2023

  5. [5]

    Hiflow: Training-free high-resolution image generation with flow-aligned guidance.arXiv preprint arXiv:2504.06232, 2025

    Jiazi Bu, Pengyang Ling, Yujie Zhou, Pan Zhang, Tong Wu, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Hiflow: Training-free high-resolution image generation with flow-aligned guidance.arXiv preprint arXiv:2504.06232, 2025

  6. [6]

    I-max: Maximize the resolution potential of pre-trained rectified flow transformers with projected flow, 2024

    Ruoyi Du, Dongyang Liu, Le Zhuo, Qin Qi, Hongsheng Li, Zhanyu Ma, and Peng Gao. I-max: Maximize the resolution potential of pre-trained rectified flow transformers with projected flow, 2024

  7. [7]

    Latent Wavelet Diffusion For Ultra-High-Resolution Image Synthesis

    Luigi Sigillo, Shengfeng He, and Danilo Comminiello. Latent wavelet diffusion for ultra-high-resolution image synthesis.arXiv preprint arXiv:2506.00433, 2025

  8. [8]

    simple diffusion: End-to-end diffusion for high resolution images

    Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. InInternational Conference on Machine Learning, pages 13213–13232. PMLR, 2023

  9. [9]

    Make a cheap scaling: A self-cascade diffusion model for higher-resolution adaptation

    Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, et al. Make a cheap scaling: A self-cascade diffusion model for higher-resolution adaptation. InEuropean conference on computer vision, pages 39–55. Springer, 2024

  10. [10]

    Demofusion: Democratising high-resolution image generation with no $$$

    Ruoyi Du, Dongliang Chang, Timothy Hospedales, Yi-Zhe Song, and Zhanyu Ma. Demofusion: Democratising high-resolution image generation with no $$$. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 6159–6168, 2024

  11. [11]

    Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models

    Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, and Ying Shan. Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models. InThe Twelfth International Conference on Learning Representations, 2023

  12. [12]

    Training-free diffusion model adaptation for variable- sized text-to-image synthesis.Advances in Neural Information Processing Systems, 36:70847–70860, 2023

    Zhiyu Jin, Xuli Shen, Bin Li, and Xiangyang Xue. Training-free diffusion model adaptation for variable- sized text-to-image synthesis.Advances in Neural Information Processing Systems, 36:70847–70860, 2023

  13. [13]

    Diffusehigh: Training-free progressive high-resolution image synthesis through structure guidance

    Younghyun Kim, Geunmin Hwang, Junyu Zhang, and Eunbyung Park. Diffusehigh: Training-free progressive high-resolution image synthesis through structure guidance. InProceedings of the AAAI conference on artificial intelligence, volume 39, pages 4338–4346, 2025

  14. [14]

    Ultraimage: Rethinking resolution extrapolation in image diffusion transformers.arXiv preprint arXiv:2512.04504, 2025

    Min Zhao, Bokai Yan, Xue Yang, Hongzhou Zhu, Jintao Zhang, Shilong Liu, Chongxuan Li, and Jun Zhu. Ultraimage: Rethinking resolution extrapolation in image diffusion transformers.arXiv preprint arXiv:2512.04504, 2025

  15. [15]

    Dype: Dynamic position extrapolation for ultra high resolution diffusion.arXiv preprint arXiv:2510.20766, 2025

    Noam Issachar, Guy Yariv, Sagie Benaim, Yossi Adi, Dani Lischinski, and Raanan Fattal. Dype: Dynamic position extrapolation for ultra high resolution diffusion.arXiv preprint arXiv:2510.20766, 2025

  16. [16]

    Fit: Flexible vision transformer for diffusion model.arXiv preprint arXiv:2402.12376, 2024

    Zeyu Lu, Zidong Wang, Di Huang, Chengyue Wu, Xihui Liu, Wanli Ouyang, and Lei Bai. Fit: Flexible vision transformer for diffusion model.arXiv preprint arXiv:2402.12376, 2024

  17. [17]

    Boosting resolution generalization of diffusion transformers with randomized positional encodings

    Liang Hou, Cong Liu, Mingwu Zheng, Xin Tao, Pengfei Wan, Di Zhang, and Kun Gai. Boosting resolution generalization of diffusion transformers with randomized positional encodings. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 4762–4770, 2026

  18. [18]

    Freescale: Unleashing the resolution of diffusion models via tuning-free scale fusion

    Haonan Qiu, Shiwei Zhang, Yujie Wei, Ruihang Chu, Hangjie Yuan, Xiang Wang, Yingya Zhang, and Ziwei Liu. Freescale: Unleashing the resolution of diffusion models via tuning-free scale fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16893–16903, 2025

  19. [19]

    Frecas: Efficient higher-resolution image generation via frequency-aware cascaded sampling.arXiv preprint arXiv:2410.18410, 2024

    Zhengqiang Zhang, Ruihuang Li, and Lei Zhang. Frecas: Efficient higher-resolution image generation via frequency-aware cascaded sampling.arXiv preprint arXiv:2410.18410, 2024. 10

  20. [20]

    Diffusion-4k: Ultra-high-resolution im- age synthesis with latent diffusion models

    Jinjin Zhang, Qiuyu Huang, Junjie Liu, Xiefan Guo, and Di Huang. Diffusion-4k: Ultra-high-resolution im- age synthesis with latent diffusion models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 23464–23473, 2025

  21. [21]

    Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

  22. [22]

    YaRN: Efficient Context Window Extension of Large Language Models

    Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models.arXiv preprint arXiv:2309.00071, 2023

  23. [23]

    Cascaded diffusion models for high fidelity image generation.Journal of Machine Learning Research, 23(47):1–33, 2022

    Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation.Journal of Machine Learning Research, 23(47):1–33, 2022

  24. [24]

    Matryoshka diffusion models

    Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Joshua M Susskind, and Navdeep Jaitly. Matryoshka diffusion models. InThe Twelfth International Conference on Learning Representations, 2023

  25. [25]

    Hierarchical patch diffusion models for high-resolution video generation

    Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, and Sergey Tulyakov. Hierarchical patch diffusion models for high-resolution video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7569–7579, 2024

  26. [26]

    Improving progressive generation with decomposable flow matching.arXiv preprint arXiv:2506.19839, 2025

    Moayed Haji-Ali, Willi Menapace, Ivan Skorokhodov, Arpit Sahni, Sergey Tulyakov, Vicente Ordonez, and Aliaksandr Siarohin. Improving progressive generation with decomposable flow matching.arXiv preprint arXiv:2506.19839, 2025

  27. [27]

    Latent space super-resolution for higher- resolution image generation with diffusion models

    Jinho Jeong, Sangmin Han, Jinwoo Kim, and Seon Joo Kim. Latent space super-resolution for higher- resolution image generation with diffusion models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2355–2365, 2025

  28. [28]

    Megafusion: Extend diffusion models towards higher-resolution image generation without further tuning

    Haoning Wu, Shaocheng Shen, Qiang Hu, Xiaoyun Zhang, Ya Zhang, and Yanfeng Wang. Megafusion: Extend diffusion models towards higher-resolution image generation without further tuning. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3944–3953. IEEE, 2025

  29. [29]

    Accdiffusion: An accurate method for higher- resolution image generation

    Zhihang Lin, Mingbao Lin, Meng Zhao, and Rongrong Ji. Accdiffusion: An accurate method for higher- resolution image generation. InEuropean Conference on Computer Vision, pages 38–53. Springer, 2024

  30. [30]

    Fouriscale: A frequency perspective on training-free high-resolution image synthesis

    Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, and Hongsheng Li. Fouriscale: A frequency perspective on training-free high-resolution image synthesis. InEuropean conference on computer vision, pages 196–212. Springer, 2024

  31. [31]

    Scalediff: Higher- resolution image synthesis via efficient and model-agnostic diffusion.arXiv preprint arXiv:2510.25818, 2025

    Sungho Koh, SeungJu Cha, Hyunwoo Oh, Kwanyoung Lee, and Dong-Jin Kim. Scalediff: Higher- resolution image synthesis via efficient and model-agnostic diffusion.arXiv preprint arXiv:2510.25818, 2025

  32. [32]

    LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

    Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. Longrope: Extending llm context window beyond 2 million tokens.arXiv preprint arXiv:2402.13753, 2024

  33. [33]

    Pepe: Long-context extension for large language models via periodic extrapolation positional encodings

    Jikun Hu, Dongsheng Guo, Yuli Liu, Qingyao Ai, Lixuan Wang, Xuebing Sun, Qilei Zhang, Quan Zhou, and Cheng Luo. Pepe: Long-context extension for large language models via periodic extrapolation positional encodings. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 21075–21085, 2025

  34. [34]

    Extending Context Window of Large Language Models via Positional Interpolation

    Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation.arXiv preprint arXiv:2306.15595, 2023

  35. [35]

    Ntk-aware scaled rope allows llama models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation, 2023

    Bowen Peng and Jeffrey Quesnelle. Ntk-aware scaled rope allows llama models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation, 2023

  36. [36]

    Ultravico: Breaking extrapolation limits in video diffusion transformers.arXiv preprint arXiv:2511.20123, 2025

    Min Zhao, Hongzhou Zhu, Yingze Wang, Bokai Yan, Jintao Zhang, Guande He, Ling Yang, Chongxuan Li, and Jun Zhu. Ultravico: Breaking extrapolation limits in video diffusion transformers.arXiv preprint arXiv:2511.20123, 2025

  37. [37]

    Riflex: A free lunch for length extrapolation in video diffusion transformers.arXiv preprint arXiv:2502.15894, 2025a

    Min Zhao, Guande He, Yixiao Chen, Hongzhou Zhu, Chongxuan Li, and Jun Zhu. Riflex: A free lunch for length extrapolation in video diffusion transformers.arXiv preprint arXiv:2502.15894, 2025

  38. [38]

    Rotary position embedding for vision transformer

    Byeongho Heo, Song Park, Dongyoon Han, and Sangdoo Yun. Rotary position embedding for vision transformer. InEuropean Conference on Computer Vision, pages 289–305. Springer, 2024. 11

  39. [39]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

  40. [40]

    Musiq: Multi-scale image quality transformer

    Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. InProceedings of the IEEE/CVF international conference on computer vision, pages 5148–5157, 2021

  41. [41]

    Exploring clip for assessing the look and feel of images

    Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of images. InProceedings of the AAAI conference on artificial intelligence, volume 37, pages 2555–2563, 2023

  42. [42]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  43. [43]

    Clipscore: A reference-free evaluation metric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 7514–7528, 2021

  44. [44]

    Imagereward: Learning and evaluating human preferences for text-to-image generation.Advances in Neural Information Processing Systems, 36:15903–15935, 2023

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation.Advances in Neural Information Processing Systems, 36:15903–15935, 2023

  45. [45]

    Pick-a- pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652–36663, 2023

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a- pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652–36663, 2023

  46. [46]

    Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341, 2023

  47. [47]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

  48. [48]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 12 Appendix A Detailed Related Work and Preliminaries A.1 High-Resolution Image Synthesis Training-Based A...