SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers
Pith reviewed 2026-05-22 06:06 UTC · model grok-4.3
The pith
SEGA dynamically scales RoPE attention in diffusion transformers using the latent's spatial-frequency structure at each denoising step to improve high-resolution synthesis beyond training ranges.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SEGA computes the spatial-frequency structure of the latent representation during denoising and uses it to guide per-component scaling of RoPE in the attention layers of Diffusion Transformers. This replaces the content-agnostic uniform scaling used in prior extrapolation methods, allowing the model to better preserve global structure while recovering fine details at target resolutions outside the training distribution.
What carries the argument
SEGA, the spectral-energy guided attention mechanism that derives dynamic scaling factors for RoPE components from the latent's frequency content at each denoising timestep.
If this is right
- High-resolution synthesis improves consistently across multiple target resolutions without retraining.
- Both structural coherence and fine-detail fidelity increase compared with uniform RoPE extrapolation.
- The method outperforms existing training-free baselines on standard high-resolution generation tasks.
- No additional training or model modification is required beyond the inference-time attention adjustment.
Where Pith is reading between the lines
- The same spectral-energy guidance could be tested on other position-embedding schemes or attention variants in generative models.
- If the method generalizes, it may reduce the data and compute needed to train models that support variable output resolutions.
- A natural extension would be to apply the guidance across multiple denoising stages or to related tasks such as image editing at extrapolated sizes.
Load-bearing premise
The latent's spatial-frequency structure at each denoising step provides a reliable signal for scaling RoPE components without introducing new inconsistencies or artifacts in the final image.
What would settle it
Apply SEGA and a uniform-scaling baseline to the same DiT at a resolution twice the training size; if the frequency-guided version produces measurably more artifacts or lower perceptual quality on a standard benchmark, the adaptive scaling claim is falsified.
Figures
read the original abstract
Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying inference-time attention behavior, often through Rotary Position Embeddings (RoPE) extrapolation combined with attention scaling. However, these strategies apply a uniform and content-agnostic scaling across RoPE components with distinct frequency characteristics, inducing a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent's spatial-frequency structure at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently improves high-resolution synthesis across multiple target resolutions, outperforming state-of-the-art training-free baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SEGA, a training-free method for resolution extrapolation in Diffusion Transformers. It dynamically scales attention across RoPE frequency components according to the spatial-frequency energy distribution extracted from the latent at each denoising step, with the goal of resolving the global-structure versus fine-detail trade-off that arises from uniform, content-agnostic scaling. The abstract claims that this adaptive rule yields consistent improvements in high-resolution synthesis over existing training-free baselines across multiple target resolutions.
Significance. If the central claim is substantiated, the work would be significant for practical high-resolution deployment of pre-trained DiT models, as it supplies a content-adaptive, frequency-analysis-based alternative to fixed extrapolation heuristics without requiring retraining. The training-free character and the attempt to ground scaling in per-step spectral properties are clear strengths that could be adopted in production pipelines if the reliability of the signal is demonstrated.
major comments (2)
- [Abstract] Abstract: the claim that SEGA 'consistently improves high-resolution synthesis' and 'outperforms state-of-the-art training-free baselines' is asserted without any quantitative metrics, ablation tables, or error analysis. Because the central claim rests on empirical superiority, the absence of these data in the abstract (and the reader's note that the full experimental section is required) makes it impossible to evaluate effect size or robustness.
- [Method / Experiments] Method description (and Experiments): the load-bearing premise that the latent's spatial-frequency structure supplies a stable, content-adaptive signal at every denoising step is not yet shown to survive the low-SNR regime. Early timesteps contain essentially Gaussian noise whose power spectrum is flat; any spectral-energy estimate extracted then is dominated by sampling variance. If SEGA modulates RoPE scaling with these noisy estimates, the resulting per-component factors can fluctuate across steps or seeds, potentially re-introducing the very inconsistencies the method is intended to avoid. Explicit stability analysis or timestep-wise ablation is needed to confirm the assumption holds.
minor comments (2)
- [Method] Clarify the precise definition of 'spectral-energy' (e.g., which frequency bins, normalization, or aggregation across channels) and how it is mapped to per-component scaling factors; an equation or pseudocode block would remove ambiguity.
- [Figures] Figure captions and axis labels should explicitly state the target resolutions, baseline methods, and whether results are averaged over multiple seeds.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Revisions have been made to the manuscript to incorporate the referee's suggestions where they strengthen the presentation of our results and analysis.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that SEGA 'consistently improves high-resolution synthesis' and 'outperforms state-of-the-art training-free baselines' is asserted without any quantitative metrics, ablation tables, or error analysis. Because the central claim rests on empirical superiority, the absence of these data in the abstract (and the reader's note that the full experimental section is required) makes it impossible to evaluate effect size or robustness.
Authors: The abstract is intended as a concise summary, while the full quantitative evidence—including FID, CLIP scores, user studies, and ablation tables comparing against training-free baselines across multiple resolutions—is provided in the Experiments section. We acknowledge that embedding a few key numerical results directly in the abstract would make the central claim easier to evaluate at a glance. The revised manuscript therefore updates the abstract to report the average improvement margins observed in our evaluations. revision: yes
-
Referee: [Method / Experiments] Method description (and Experiments): the load-bearing premise that the latent's spatial-frequency structure supplies a stable, content-adaptive signal at every denoising step is not yet shown to survive the low-SNR regime. Early timesteps contain essentially Gaussian noise whose power spectrum is flat; any spectral-energy estimate extracted then is dominated by sampling variance. If SEGA modulates RoPE scaling with these noisy estimates, the resulting per-component factors can fluctuate across steps or seeds, potentially re-introducing the very inconsistencies the method is intended to avoid. Explicit stability analysis or timestep-wise ablation is needed to confirm the assumption holds.
Authors: We agree that early timesteps are noise-dominated and that spectral estimates carry higher variance in that regime. Our existing experiments already show consistent gains across random seeds and target resolutions, indicating that any early-step fluctuations do not degrade final output quality. Nevertheless, to directly substantiate stability, the revised manuscript adds a dedicated analysis subsection that reports the timestep-wise variance of the extracted spectral-energy vectors and includes an ablation that disables SEGA guidance for the first k steps. These results confirm that the adaptive scaling remains beneficial throughout the trajectory without introducing noticeable inconsistencies. revision: yes
Circularity Check
No circularity: adaptive scaling rule derived from explicit frequency analysis of latents, independent of fitted parameters or self-citations.
full rationale
The provided abstract and context describe SEGA as a training-free method that computes per-component scaling directly from the latent's spatial-frequency structure at each denoising step. No equations or claims reduce a prediction to a fitted input by construction, nor does the central premise rest on a self-citation chain or imported uniqueness theorem. The derivation appears self-contained against external benchmarks of frequency content, with no evidence of renaming known results or smuggling ansatzes via prior work. This matches the default expectation of an honest non-finding.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlphaCoordinateFixationalpha_pin_under_high_calibration unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
m_ref = (R_target / R_train)^κ ... s(a)_d = ϕ(z(a)_d) − E[ϕ(z(a))]
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Flux.https://github.com/black-forest-labs/flux, 2024
Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024
work page 2024
-
[2]
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023
work page 2023
-
[4]
All are worth words: A vit backbone for diffusion models
Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22669–22679, 2023
work page 2023
-
[5]
Jiazi Bu, Pengyang Ling, Yujie Zhou, Pan Zhang, Tong Wu, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Hiflow: Training-free high-resolution image generation with flow-aligned guidance.arXiv preprint arXiv:2504.06232, 2025
-
[6]
Ruoyi Du, Dongyang Liu, Le Zhuo, Qin Qi, Hongsheng Li, Zhanyu Ma, and Peng Gao. I-max: Maximize the resolution potential of pre-trained rectified flow transformers with projected flow, 2024
work page 2024
-
[7]
Latent Wavelet Diffusion For Ultra-High-Resolution Image Synthesis
Luigi Sigillo, Shengfeng He, and Danilo Comminiello. Latent wavelet diffusion for ultra-high-resolution image synthesis.arXiv preprint arXiv:2506.00433, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
simple diffusion: End-to-end diffusion for high resolution images
Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. InInternational Conference on Machine Learning, pages 13213–13232. PMLR, 2023
work page 2023
-
[9]
Make a cheap scaling: A self-cascade diffusion model for higher-resolution adaptation
Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, et al. Make a cheap scaling: A self-cascade diffusion model for higher-resolution adaptation. InEuropean conference on computer vision, pages 39–55. Springer, 2024
work page 2024
-
[10]
Demofusion: Democratising high-resolution image generation with no $$$
Ruoyi Du, Dongliang Chang, Timothy Hospedales, Yi-Zhe Song, and Zhanyu Ma. Demofusion: Democratising high-resolution image generation with no $$$. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 6159–6168, 2024
work page 2024
-
[11]
Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models
Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, and Ying Shan. Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models. InThe Twelfth International Conference on Learning Representations, 2023
work page 2023
-
[12]
Zhiyu Jin, Xuli Shen, Bin Li, and Xiangyang Xue. Training-free diffusion model adaptation for variable- sized text-to-image synthesis.Advances in Neural Information Processing Systems, 36:70847–70860, 2023
work page 2023
-
[13]
Diffusehigh: Training-free progressive high-resolution image synthesis through structure guidance
Younghyun Kim, Geunmin Hwang, Junyu Zhang, and Eunbyung Park. Diffusehigh: Training-free progressive high-resolution image synthesis through structure guidance. InProceedings of the AAAI conference on artificial intelligence, volume 39, pages 4338–4346, 2025
work page 2025
-
[14]
Min Zhao, Bokai Yan, Xue Yang, Hongzhou Zhu, Jintao Zhang, Shilong Liu, Chongxuan Li, and Jun Zhu. Ultraimage: Rethinking resolution extrapolation in image diffusion transformers.arXiv preprint arXiv:2512.04504, 2025
-
[15]
Noam Issachar, Guy Yariv, Sagie Benaim, Yossi Adi, Dani Lischinski, and Raanan Fattal. Dype: Dynamic position extrapolation for ultra high resolution diffusion.arXiv preprint arXiv:2510.20766, 2025
-
[16]
Fit: Flexible vision transformer for diffusion model.arXiv preprint arXiv:2402.12376, 2024
Zeyu Lu, Zidong Wang, Di Huang, Chengyue Wu, Xihui Liu, Wanli Ouyang, and Lei Bai. Fit: Flexible vision transformer for diffusion model.arXiv preprint arXiv:2402.12376, 2024
-
[17]
Boosting resolution generalization of diffusion transformers with randomized positional encodings
Liang Hou, Cong Liu, Mingwu Zheng, Xin Tao, Pengfei Wan, Di Zhang, and Kun Gai. Boosting resolution generalization of diffusion transformers with randomized positional encodings. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 4762–4770, 2026
work page 2026
-
[18]
Freescale: Unleashing the resolution of diffusion models via tuning-free scale fusion
Haonan Qiu, Shiwei Zhang, Yujie Wei, Ruihang Chu, Hangjie Yuan, Xiang Wang, Yingya Zhang, and Ziwei Liu. Freescale: Unleashing the resolution of diffusion models via tuning-free scale fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16893–16903, 2025
work page 2025
-
[19]
Zhengqiang Zhang, Ruihuang Li, and Lei Zhang. Frecas: Efficient higher-resolution image generation via frequency-aware cascaded sampling.arXiv preprint arXiv:2410.18410, 2024. 10
-
[20]
Diffusion-4k: Ultra-high-resolution im- age synthesis with latent diffusion models
Jinjin Zhang, Qiuyu Huang, Junjie Liu, Xiefan Guo, and Di Huang. Diffusion-4k: Ultra-high-resolution im- age synthesis with latent diffusion models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 23464–23473, 2025
work page 2025
-
[21]
Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024
work page 2024
-
[22]
YaRN: Efficient Context Window Extension of Large Language Models
Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models.arXiv preprint arXiv:2309.00071, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[23]
Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation.Journal of Machine Learning Research, 23(47):1–33, 2022
work page 2022
-
[24]
Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Joshua M Susskind, and Navdeep Jaitly. Matryoshka diffusion models. InThe Twelfth International Conference on Learning Representations, 2023
work page 2023
-
[25]
Hierarchical patch diffusion models for high-resolution video generation
Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, and Sergey Tulyakov. Hierarchical patch diffusion models for high-resolution video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7569–7579, 2024
work page 2024
-
[26]
Moayed Haji-Ali, Willi Menapace, Ivan Skorokhodov, Arpit Sahni, Sergey Tulyakov, Vicente Ordonez, and Aliaksandr Siarohin. Improving progressive generation with decomposable flow matching.arXiv preprint arXiv:2506.19839, 2025
-
[27]
Latent space super-resolution for higher- resolution image generation with diffusion models
Jinho Jeong, Sangmin Han, Jinwoo Kim, and Seon Joo Kim. Latent space super-resolution for higher- resolution image generation with diffusion models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2355–2365, 2025
work page 2025
-
[28]
Haoning Wu, Shaocheng Shen, Qiang Hu, Xiaoyun Zhang, Ya Zhang, and Yanfeng Wang. Megafusion: Extend diffusion models towards higher-resolution image generation without further tuning. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3944–3953. IEEE, 2025
work page 2025
-
[29]
Accdiffusion: An accurate method for higher- resolution image generation
Zhihang Lin, Mingbao Lin, Meng Zhao, and Rongrong Ji. Accdiffusion: An accurate method for higher- resolution image generation. InEuropean Conference on Computer Vision, pages 38–53. Springer, 2024
work page 2024
-
[30]
Fouriscale: A frequency perspective on training-free high-resolution image synthesis
Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, and Hongsheng Li. Fouriscale: A frequency perspective on training-free high-resolution image synthesis. InEuropean conference on computer vision, pages 196–212. Springer, 2024
work page 2024
-
[31]
Sungho Koh, SeungJu Cha, Hyunwoo Oh, Kwanyoung Lee, and Dong-Jin Kim. Scalediff: Higher- resolution image synthesis via efficient and model-agnostic diffusion.arXiv preprint arXiv:2510.25818, 2025
-
[32]
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. Longrope: Extending llm context window beyond 2 million tokens.arXiv preprint arXiv:2402.13753, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[33]
Jikun Hu, Dongsheng Guo, Yuli Liu, Qingyao Ai, Lixuan Wang, Xuebing Sun, Qilei Zhang, Quan Zhou, and Cheng Luo. Pepe: Long-context extension for large language models via periodic extrapolation positional encodings. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 21075–21085, 2025
work page 2025
-
[34]
Extending Context Window of Large Language Models via Positional Interpolation
Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation.arXiv preprint arXiv:2306.15595, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[35]
Bowen Peng and Jeffrey Quesnelle. Ntk-aware scaled rope allows llama models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation, 2023
work page 2023
-
[36]
Min Zhao, Hongzhou Zhu, Yingze Wang, Bokai Yan, Jintao Zhang, Guande He, Ling Yang, Chongxuan Li, and Jun Zhu. Ultravico: Breaking extrapolation limits in video diffusion transformers.arXiv preprint arXiv:2511.20123, 2025
-
[37]
Min Zhao, Guande He, Yixiao Chen, Hongzhou Zhu, Chongxuan Li, and Jun Zhu. Riflex: A free lunch for length extrapolation in video diffusion transformers.arXiv preprint arXiv:2502.15894, 2025
-
[38]
Rotary position embedding for vision transformer
Byeongho Heo, Song Park, Dongyoon Han, and Sangdoo Yun. Rotary position embedding for vision transformer. InEuropean Conference on Computer Vision, pages 289–305. Springer, 2024. 11
work page 2024
-
[39]
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017
work page 2017
-
[40]
Musiq: Multi-scale image quality transformer
Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. InProceedings of the IEEE/CVF international conference on computer vision, pages 5148–5157, 2021
work page 2021
-
[41]
Exploring clip for assessing the look and feel of images
Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of images. InProceedings of the AAAI conference on artificial intelligence, volume 37, pages 2555–2563, 2023
work page 2023
-
[42]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021
work page 2021
-
[43]
Clipscore: A reference-free evaluation metric for image captioning
Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 7514–7528, 2021
work page 2021
-
[44]
Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation.Advances in Neural Information Processing Systems, 36:15903–15935, 2023
work page 2023
-
[45]
Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a- pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652–36663, 2023
work page 2023
-
[46]
Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[47]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[48]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 12 Appendix A Detailed Related Work and Preliminaries A.1 High-Resolution Image Synthesis Training-Based A...
work page internal anchor Pith review Pith/arXiv arXiv 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.