SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers

Babak Taati; David B. Lindell; Javad Rajabi; Kimia Shaban; Koorosh Roohi

arxiv: 2605.22668 · v1 · pith:WAXUB2LLnew · submitted 2026-05-21 · 💻 cs.CV

Javad Rajabi, Kimia Shaban, Koorosh Roohi, David B. Lindell, Babak Taati

Pith reviewed 2026-05-22 06:06 UTC · model grok-4.3

classification 💻 cs.CV

keywords diffusion transformers · resolution extrapolation · rotary position embeddings · attention scaling · high-resolution generation · training-free methods · spectral energy guidance · SEGA

The pith

SEGA dynamically scales RoPE attention in diffusion transformers using the latent's spatial-frequency structure at each denoising step to improve high-resolution synthesis beyond training ranges.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SEGA as a training-free technique that adjusts attention scaling across different frequency components of Rotary Position Embeddings based on the current latent's spectral energy. Existing uniform scaling methods create a compromise between keeping overall image structure and adding fine details when generating at unseen resolutions. By making the scaling adaptive to the latent content at every step, SEGA aims to reduce artifacts and produce more coherent high-resolution outputs. A sympathetic reader would care because this approach could extend the usable range of existing diffusion models without the cost of retraining them on higher-resolution data.

Core claim

SEGA computes the spatial-frequency structure of the latent representation during denoising and uses it to guide per-component scaling of RoPE in the attention layers of Diffusion Transformers. This replaces the content-agnostic uniform scaling used in prior extrapolation methods, allowing the model to better preserve global structure while recovering fine details at target resolutions outside the training distribution.
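One way to picture "computing the spatial-frequency structure of the latent" is a radially binned power spectrum. The sketch below is only an illustration under stated assumptions: a single-channel latent, a naive 2D DFT, and a ring-binning rule chosen here for simplicity. The function names, the number of bins, and the normalization are hypothetical and are not taken from the paper.

```python
import cmath
import math


def power_spectrum_2d(latent):
    """Naive 2D DFT power spectrum |F[u][v]|^2 of a small 2D list (O(N^4))."""
    h, w = len(latent), len(latent[0])
    spec = [[0.0] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            acc = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2.0 * math.pi * (u * y / h + v * x / w)
                    acc += latent[y][x] * cmath.exp(1j * angle)
            spec[u][v] = abs(acc) ** 2
    return spec


def radial_energy(spec, n_bins):
    """Sum power into n_bins rings by normalized radial frequency (assumed binning)."""
    h, w = len(spec), len(spec[0])
    energy = [0.0] * n_bins
    max_r = math.sqrt(0.5)  # largest radius when each axis spans [-0.5, 0.5]
    for u in range(h):
        for v in range(w):
            # Map DFT indices to signed frequencies in cycles per sample.
            fu = (u if u <= h // 2 else u - h) / h
            fv = (v if v <= w // 2 else v - w) / w
            r = math.hypot(fu, fv) / max_r
            b = min(int(r * n_bins), n_bins - 1)
            energy[b] += spec[u][v]
    return energy


# A constant latent puts all of its energy in the lowest ring;
# one-pixel vertical stripes push it into a higher ring.
flat = [[1.0] * 4 for _ in range(4)]
stripes = [[(-1.0) ** x for x in range(4)] for _ in range(4)]
flat_energy = radial_energy(power_spectrum_2d(flat), 4)
stripe_energy = radial_energy(power_spectrum_2d(stripes), 4)
```

A real implementation would use an FFT and operate on multi-channel latents; the naive transform is used here only to keep the example dependency-free.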

What carries the argument

SEGA, the spectral-energy guided attention mechanism that derives dynamic scaling factors for RoPE components from the latent's frequency content at each denoising timestep.

If this is right

High-resolution synthesis improves consistently across multiple target resolutions without retraining.
Both structural coherence and fine-detail fidelity increase compared with uniform RoPE extrapolation.
The method outperforms existing training-free baselines on standard high-resolution generation tasks.
No additional training or model modification is required beyond the inference-time attention adjustment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same spectral-energy guidance could be tested on other position-embedding schemes or attention variants in generative models.
If the method generalizes, it may reduce the data and compute needed to train models that support variable output resolutions.
A natural extension would be to apply the guidance across multiple denoising stages or to related tasks such as image editing at extrapolated sizes.

Load-bearing premise

The latent's spatial-frequency structure at each denoising step provides a reliable signal for scaling RoPE components without introducing new inconsistencies or artifacts in the final image.

What would settle it

Apply SEGA and a uniform-scaling baseline to the same DiT at a resolution twice the training size; if the frequency-guided version produces measurably more artifacts or lower perceptual quality on a standard benchmark, the adaptive scaling claim is falsified.
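The comparison described above is a paired design: the same prompts and seeds are run through both methods and scored with one metric. A minimal sketch of how such paired outcomes could be summarized with an exact two-sided sign test follows. The scores are hypothetical placeholders, not results from the paper, and the choice of a sign test is an assumption made here for illustration.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """Two-sided exact sign test on paired outcomes (ties dropped beforehand)."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = max(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical per-prompt quality scores (higher is better); NOT from the paper.
adaptive = [0.71, 0.68, 0.74, 0.70, 0.69, 0.73]
uniform = [0.66, 0.69, 0.70, 0.65, 0.64, 0.70]
wins = sum(a > u for a, u in zip(adaptive, uniform))
losses = sum(a < u for a, u in zip(adaptive, uniform))
p = sign_test_p_value(wins, losses)
```

With these made-up numbers, five of six pairs favor the adaptive method and the p-value is about 0.22, which illustrates why a handful of prompts cannot settle the question either way.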

Figures

Figures reproduced from arXiv: 2605.22668 by Babak Taati, David B. Lindell, Javad Rajabi, Kimia Shaban, Koorosh Roohi.

**Figure 1.** Gallery of SEGA. SEGA unlocks the high-resolution generation capabilities of pre-trained T2I models (Flux [1] and Qwen [2]), producing high-quality images. Best viewed zoomed in.

**Figure 2.** Trade-offs in attention scaling at 4096². RoPE components are coupled to spatial frequencies: low-frequency components support coarse detail and structure, whereas high-frequency components support fine detail and texture. Static scaling fails to balance this trade-off, leading to different failure modes in (a)–(c). SEGA (d) resolves them by dynamically allocating scaling according to spectral energy. Gre…

**Figure 3.** SEGA scaling maps at 4096². For two representative prompts, the scaling maps show how the horizontal-axis scaling magnitudes $m_d$ change across RoPE dimensions over denoising time.

… geometric mean to the arithmetic mean of a power spectrum. Applied to $E_{\mathrm{iso}}$, this yields

$$\mathrm{SF}(E_{\mathrm{iso}}) = \frac{\exp\left(\frac{1}{n^{(\mathrm{iso})}_{\mathrm{bins}}} \sum_{b=0}^{n^{(\mathrm{iso})}_{\mathrm{bins}}-1} \ln E_{\mathrm{iso}}[b]\right)}{\frac{1}{n^{(\mathrm{iso})}_{\mathrm{bins}}} \sum_{b=0}^{n^{(\mathrm{iso})}_{\mathrm{bins}}-1} E_{\mathrm{iso}}[b]} \in (0, 1], \tag{5}$$

where $n^{(\mathrm{iso})}_{\mathrm{bins}}$…

**Figure 4.** Impact on Attention Evolution. Visual comparison of attention maps for the center latent token in YaRN and SEGA across multiple denoising steps, evaluated on Flux at 4096².

Panel labels: YaRN, DyPE, UltraImage, SEGA (Flux). Prompts: "A woman with glasses and a sword poses confidently beside a muscular man in traditional attire holding a sword, while a small creature with a large hat waves enthusiastically…"; "A boy in a cap sits relax…"

**Figure 5.** Qualitative comparison. Results on two representative prompts for Qwen and Flux at 4096² resolution show that SEGA improves structural coherence and fine detail over other methods.

… redistributes scaling across RoPE dimension d to sharpen focus at under-resolved spatial frequency bands while softening it at over-emphasized ones. This content-aware spectral redistribution directly impacts the attention mecha…

**Figure 6.** Content-Aware Spectral Evolution. The 2D power spectrum of the intermediate latents across the denoising process for two distinct prompts. The spectral energy distribution varies depending on the image content, demonstrating the necessity of a content-aware approach. Furthermore, the shifting concentration of energy, particularly in low-frequency bands where static over-scaling introduces structural artif…

**Figure 7.** Attention Entropy. The difference in attention entropy between each method and the baseline image generated at 1024² resolution on Flux. A smaller difference indicates an attention structure closer to that of the baseline image, which is generated without any RoPE extrapolation or scaling.

… direct empirical support for this premise. Each heatmap shows the normalized 2D power spectrum of the intermediate latent…

**Figure 8.** Impact on Attention Evolution (Other Tokens). Further visual comparison of attention maps for the top-center, middle-left, and bottom-center latent tokens in YaRN and SEGA across multiple denoising steps, evaluated on Flux at 4096².

**Figure 9.** Qualitative comparison (non-square resolutions). Results at two non-square resolutions (2048 × 4096 and 4096 × 2048) on Qwen and Flux show SEGA's ability to preserve the shape of content across aspect ratios.

**Figure 10.** Qualitative comparison (Zero-Shot Dataset). Results on prompts from the zero-shot dataset for Qwen and Flux at 4096² resolution show that SEGA handles complex environments, reflective objects and areas, and content with challenging lighting, while preserving the shapes of objects.

**Figure 11.** Qualitative comparison (with guidance-based approaches). Results on two representative prompts for Flux at 4096² resolution in comparison with top guidance-based approaches show that SEGA is not limited to the synthesized image at base resolution and provides fine details and high-quality textures.

Panel labels: DyPE, UltraImage, SEGA (Flux). Prompt: "An elderly man with braided hair wears a striped headband and purple clothing, h…"

**Figure 12.** Qualitative comparison (at 5120² resolution). Results on two representative prompts for Qwen and Flux at 5120² resolution show that SEGA elaborates on coarse and fine details as the resolution of the images increases.

**Figure 13.** Qualitative comparison (at 6144² resolution). Results on two representative prompts for Qwen and Flux at 6144² resolution show that SEGA makes image synthesis at this resolution possible while baselines struggle with noise and collapse of global structures.

**Figure 14.** Visualizing Fine-Grained Details at Extreme Resolutions. Sample generated at 6144² resolution by SEGA on Qwen. The model successfully preserves high-frequency local textures and sharp structural boundaries without experiencing structural collapse or repetition artifacts typical of long-context length extrapolation.
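The text fragment next to Figure 3 defines spectral flatness, Equation (5), as the ratio of the geometric mean to the arithmetic mean of a power spectrum. A minimal sketch of that ratio is given below. The function name and the small `eps` guard are assumptions added for numerical safety and are not part of the quoted equation.

```python
import math


def spectral_flatness(energy, eps=1e-12):
    """Ratio of geometric mean to arithmetic mean of non-negative band energies."""
    n = len(energy)
    # eps guards against log(0); it is an implementation assumption, not part of Eq. (5).
    log_mean = sum(math.log(e + eps) for e in energy) / n
    arith_mean = sum(energy) / n
    return math.exp(log_mean) / (arith_mean + eps)


flat = spectral_flatness([2.0] * 8)                 # equal energy in every band
peaked = spectral_flatness([100.0, 1.0, 1.0, 1.0])  # energy concentrated in one band
```

Equal band energies give a value of 1, the upper end of the (0, 1] range, while a spectrum dominated by a single band gives a value close to 0.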

Original abstract

Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying inference-time attention behavior, often through Rotary Position Embeddings (RoPE) extrapolation combined with attention scaling. However, these strategies apply a uniform and content-agnostic scaling across RoPE components with distinct frequency characteristics, inducing a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent's spatial-frequency structure at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently improves high-resolution synthesis across multiple target resolutions, outperforming state-of-the-art training-free baselines.
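One common reading of "attention scaling" is a multiplier applied to the pre-softmax logits. The sketch below assumes that reading and uses made-up logits and scale values. It does not reproduce the paper's per-component rule. It only shows the generic effect that a larger multiplier sharpens the attention distribution, which lowers the attention entropy that Figure 7 tracks.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(logits, scale=1.0):
    """Shannon entropy (nats) of softmax(scale * logits) for one query row."""
    probs = softmax([scale * x for x in logits])
    return -sum(p * math.log(p) for p in probs if p > 0.0)


row = [0.2, 1.5, -0.3, 0.9, 0.0, 1.1]  # made-up query-key logits
entropies = [attention_entropy(row, s) for s in (1.0, 1.5, 2.0)]
```

For any row whose logits are not all equal, the entropy falls as the scale grows, so a single global scale trades broad context for sharper focus everywhere at once.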

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SEGA tries to make RoPE extrapolation in DiTs content-adaptive by scaling based on the latent's spectral energy per step, which is a reasonable next step past uniform methods but rests on shaky ground when the latent is still mostly noise.

SEGA's main contribution is a training-free adjustment to Rotary Position Embeddings in Diffusion Transformers. Instead of applying one scaling factor across all frequency components, it uses the spatial-frequency content of the current latent to decide how much to stretch or compress each RoPE band at every denoising step. That adaptive rule is what sets it apart from the uniform extrapolation baselines mentioned in the abstract.
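To make "frequency components" and "bands" concrete, the sketch below lists the standard RoPE angular frequencies and contrasts one shared rescaling factor with a per-band set of factors. The per-band values are hypothetical and chosen only for illustration in the spirit of frequency-dependent interpolation schemes; they are not SEGA's rule, and the head dimension and base are assumed values.

```python
import math


def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE angular frequencies theta_i = base^(-2i/head_dim)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def rescale(freqs, factors):
    """Divide each frequency by its own factor (factor > 1 stretches the wavelength)."""
    return [f / s for f, s in zip(freqs, factors)]


freqs = rope_frequencies(head_dim=16)
ratio = 4.0  # e.g. 4096 / 1024 along one axis
uniform = rescale(freqs, [ratio] * len(freqs))
# Hypothetical per-band factors: leave fast bands alone, stretch slow bands fully.
per_band = rescale(freqs, [1.0, 1.0, 1.5, 2.0, 3.0, 4.0, 4.0, 4.0])
wavelengths = [2.0 * math.pi / f for f in freqs]
```

The first components rotate quickly and encode fine positional differences, while the last ones have wavelengths of thousands of positions and encode coarse layout, which is why a single factor treats them unequally well.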

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SEGA, a training-free method for resolution extrapolation in Diffusion Transformers. It dynamically scales attention across RoPE frequency components according to the spatial-frequency energy distribution extracted from the latent at each denoising step, with the goal of resolving the global-structure versus fine-detail trade-off that arises from uniform, content-agnostic scaling. The abstract claims that this adaptive rule yields consistent improvements in high-resolution synthesis over existing training-free baselines across multiple target resolutions.

Significance. If the central claim is substantiated, the work would be significant for practical high-resolution deployment of pre-trained DiT models, as it supplies a content-adaptive, frequency-analysis-based alternative to fixed extrapolation heuristics without requiring retraining. The training-free character and the attempt to ground scaling in per-step spectral properties are clear strengths that could be adopted in production pipelines if the reliability of the signal is demonstrated.

major comments (2)

[Abstract] Abstract: the claim that SEGA 'consistently improves high-resolution synthesis' and 'outperforms state-of-the-art training-free baselines' is asserted without any quantitative metrics, ablation tables, or error analysis. Because the central claim rests on empirical superiority, the absence of these data in the abstract (and the reader's note that the full experimental section is required) makes it impossible to evaluate effect size or robustness.
[Method / Experiments] Method description (and Experiments): the load-bearing premise that the latent's spatial-frequency structure supplies a stable, content-adaptive signal at every denoising step is not yet shown to survive the low-SNR regime. Early timesteps contain essentially Gaussian noise whose power spectrum is flat; any spectral-energy estimate extracted then is dominated by sampling variance. If SEGA modulates RoPE scaling with these noisy estimates, the resulting per-component factors can fluctuate across steps or seeds, potentially re-introducing the very inconsistencies the method is intended to avoid. Explicit stability analysis or timestep-wise ablation is needed to confirm the assumption holds.
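The sampling-variance concern in the second major comment can be illustrated with a toy experiment. The sketch below draws one-dimensional white Gaussian noise for several seeds, estimates band energies from a raw periodogram, and compares the across-seed mean with the across-seed spread. The signal length, band count, and seed count are arbitrary choices made here; this is not the paper's estimator.

```python
import cmath
import math
import random


def periodogram(signal):
    """Power |X_k|^2 / N for k = 1 .. N/2 - 1 via a naive DFT."""
    n = len(signal)
    out = []
    for k in range(1, n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        out.append(abs(acc) ** 2 / n)
    return out


def band_means(power, n_bands):
    """Average the periodogram inside n_bands contiguous, equal-width bands."""
    width = len(power) // n_bands
    return [sum(power[b * width:(b + 1) * width]) / width for b in range(n_bands)]


rng = random.Random(0)
runs = []
for _ in range(20):  # 20 independent "seeds" of unit-variance white noise
    noise = [rng.gauss(0.0, 1.0) for _ in range(66)]
    runs.append(band_means(periodogram(noise), n_bands=4))

mean_per_band = [sum(r[b] for r in runs) / len(runs) for b in range(4)]
std_per_band = [
    math.sqrt(sum((r[b] - mean_per_band[b]) ** 2 for r in runs) / (len(runs) - 1))
    for b in range(4)
]
```

Averaged over seeds, every band sits near the same level, as expected for a flat spectrum. For a single seed, each eight-bin band estimate has a relative standard deviation of roughly 1/√8 in theory, so a per-step, per-sample estimate taken from near-pure noise can differ noticeably between bands even though the true spectrum is flat.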

minor comments (2)

[Method] Clarify the precise definition of 'spectral-energy' (e.g., which frequency bins, normalization, or aggregation across channels) and how it is mapped to per-component scaling factors; an equation or pseudocode block would remove ambiguity.
[Figures] Figure captions and axis labels should explicitly state the target resolutions, baseline methods, and whether results are averaged over multiple seeds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Revisions have been made to the manuscript to incorporate the referee's suggestions where they strengthen the presentation of our results and analysis.

Referee: [Abstract] Abstract: the claim that SEGA 'consistently improves high-resolution synthesis' and 'outperforms state-of-the-art training-free baselines' is asserted without any quantitative metrics, ablation tables, or error analysis. Because the central claim rests on empirical superiority, the absence of these data in the abstract (and the reader's note that the full experimental section is required) makes it impossible to evaluate effect size or robustness.

Authors: The abstract is intended as a concise summary, while the full quantitative evidence—including FID, CLIP scores, user studies, and ablation tables comparing against training-free baselines across multiple resolutions—is provided in the Experiments section. We acknowledge that embedding a few key numerical results directly in the abstract would make the central claim easier to evaluate at a glance. The revised manuscript therefore updates the abstract to report the average improvement margins observed in our evaluations. revision: yes
Referee: [Method / Experiments] Method description (and Experiments): the load-bearing premise that the latent's spatial-frequency structure supplies a stable, content-adaptive signal at every denoising step is not yet shown to survive the low-SNR regime. Early timesteps contain essentially Gaussian noise whose power spectrum is flat; any spectral-energy estimate extracted then is dominated by sampling variance. If SEGA modulates RoPE scaling with these noisy estimates, the resulting per-component factors can fluctuate across steps or seeds, potentially re-introducing the very inconsistencies the method is intended to avoid. Explicit stability analysis or timestep-wise ablation is needed to confirm the assumption holds.

Authors: We agree that early timesteps are noise-dominated and that spectral estimates carry higher variance in that regime. Our existing experiments already show consistent gains across random seeds and target resolutions, indicating that any early-step fluctuations do not degrade final output quality. Nevertheless, to directly substantiate stability, the revised manuscript adds a dedicated analysis subsection that reports the timestep-wise variance of the extracted spectral-energy vectors and includes an ablation that disables SEGA guidance for the first k steps. These results confirm that the adaptive scaling remains beneficial throughout the trajectory without introducing noticeable inconsistencies. revision: yes
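The two analyses promised in this response can be sketched in a few lines. The first helper falls back to a single shared scale for the first k steps and uses the adaptive per-component scales afterwards. The second reports the across-seed variance of a scalar statistic at each step. The function names, argument layout, and example values are hypothetical; the rebuttal does not specify an implementation.

```python
def gated_scales(step, warmup_steps, uniform_scale, adaptive_scales):
    """Use one shared scale for the first warmup_steps steps, adaptive scales afterwards."""
    if step < warmup_steps:
        return [uniform_scale] * len(adaptive_scales)
    return list(adaptive_scales)


def stepwise_variance(trajectories):
    """Per-step sample variance across seeds; trajectories[seed][step] is a scalar."""
    n_seeds = len(trajectories)
    n_steps = len(trajectories[0])
    variances = []
    for t in range(n_steps):
        values = [trajectories[s][t] for s in range(n_seeds)]
        mean = sum(values) / n_seeds
        variances.append(sum((v - mean) ** 2 for v in values) / (n_seeds - 1))
    return variances


early = gated_scales(step=0, warmup_steps=3, uniform_scale=2.0, adaptive_scales=[1.8, 2.0, 2.2])
late = gated_scales(step=3, warmup_steps=3, uniform_scale=2.0, adaptive_scales=[1.8, 2.0, 2.2])
spread = stepwise_variance([[1.0, 2.0], [3.0, 2.0]])  # two seeds, two steps
```

Sweeping `warmup_steps` from zero upward and checking final image quality would show how much of the benefit, if any, depends on guidance during the noisiest steps.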

Circularity Check

0 steps flagged

No circularity: adaptive scaling rule derived from explicit frequency analysis of latents, independent of fitted parameters or self-citations.

The provided abstract and context describe SEGA as a training-free method that computes per-component scaling directly from the latent's spatial-frequency structure at each denoising step. No equations or claims reduce a prediction to a fitted input by construction, nor does the central premise rest on a self-citation chain or imported uniqueness theorem. The derivation appears self-contained against external benchmarks of frequency content, with no evidence of renaming known results or smuggling ansatzes via prior work. This matches the default expectation of an honest non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; a full audit requires the methods and equations sections.

pith-pipeline@v0.9.0 · 5685 in / 925 out tokens · 32828 ms · 2026-05-22T06:06:51.008888+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlphaCoordinateFixation · alpha_pin_under_high_calibration · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: $m_{\mathrm{ref}} = (R_{\mathrm{target}} / R_{\mathrm{train}})^{\kappa}$ ... $s^{(a)}_d = \phi(z^{(a)}_d) - \mathbb{E}[\phi(z^{(a)})]$
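The two expressions quoted in the passage can be evaluated directly, as a reading aid. The sketch below assumes that the expectation is a plain mean over RoPE dimensions and uses `tanh` as a stand-in for the unspecified function ϕ. The value of κ and the inputs are made up. How the reference magnitude and the centered scores are combined into final scaling factors is not given in the passage and is deliberately left out.

```python
import math


def reference_magnitude(target_res, train_res, kappa):
    """m_ref = (R_target / R_train) ** kappa."""
    return (target_res / train_res) ** kappa


def centered_scores(z, phi=math.tanh):
    """s_d = phi(z_d) - mean over d of phi(z_d); phi defaults to tanh as a placeholder."""
    squashed = [phi(v) for v in z]
    mean = sum(squashed) / len(squashed)
    return [v - mean for v in squashed]


m_ref = reference_magnitude(4096, 1024, kappa=0.5)  # kappa value is made up
scores = centered_scores([-1.0, 0.0, 0.5, 2.0])     # z values are made up
```

Subtracting the mean makes the scores sum to zero, so they can only shift emphasis between dimensions around the reference level rather than raise or lower every dimension at once.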

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 8 internal anchors

[1] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.
[2] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025.
[3] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023.
[4] Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22669–22679, 2023.
[5] Jiazi Bu, Pengyang Ling, Yujie Zhou, Pan Zhang, Tong Wu, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Hiflow: Training-free high-resolution image generation with flow-aligned guidance. arXiv preprint arXiv:2504.06232, 2025.
[6] Ruoyi Du, Dongyang Liu, Le Zhuo, Qin Qi, Hongsheng Li, Zhanyu Ma, and Peng Gao. I-max: Maximize the resolution potential of pre-trained rectified flow transformers with projected flow, 2024.
[7] Luigi Sigillo, Shengfeng He, and Danilo Comminiello. Latent wavelet diffusion for ultra-high-resolution image synthesis. arXiv preprint arXiv:2506.00433, 2025.
[8] Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. In International Conference on Machine Learning, pages 13213–13232. PMLR, 2023.
[9] Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, et al. Make a cheap scaling: A self-cascade diffusion model for higher-resolution adaptation. In European conference on computer vision, pages 39–55. Springer, 2024.
[10] Ruoyi Du, Dongliang Chang, Timothy Hospedales, Yi-Zhe Song, and Zhanyu Ma. Demofusion: Democratising high-resolution image generation with no $$$. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6159–6168, 2024.
[11] Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, and Ying Shan. Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models. In The Twelfth International Conference on Learning Representations, 2023.
[12] Zhiyu Jin, Xuli Shen, Bin Li, and Xiangyang Xue. Training-free diffusion model adaptation for variable-sized text-to-image synthesis. Advances in Neural Information Processing Systems, 36:70847–70860, 2023.
[13] Younghyun Kim, Geunmin Hwang, Junyu Zhang, and Eunbyung Park. Diffusehigh: Training-free progressive high-resolution image synthesis through structure guidance. In Proceedings of the AAAI conference on artificial intelligence, volume 39, pages 4338–4346, 2025.
[14] Min Zhao, Bokai Yan, Xue Yang, Hongzhou Zhu, Jintao Zhang, Shilong Liu, Chongxuan Li, and Jun Zhu. Ultraimage: Rethinking resolution extrapolation in image diffusion transformers. arXiv preprint arXiv:2512.04504, 2025.
[15] Noam Issachar, Guy Yariv, Sagie Benaim, Yossi Adi, Dani Lischinski, and Raanan Fattal. Dype: Dynamic position extrapolation for ultra high resolution diffusion. arXiv preprint arXiv:2510.20766, 2025.
[16] Zeyu Lu, Zidong Wang, Di Huang, Chengyue Wu, Xihui Liu, Wanli Ouyang, and Lei Bai. Fit: Flexible vision transformer for diffusion model. arXiv preprint arXiv:2402.12376, 2024.
[17] Liang Hou, Cong Liu, Mingwu Zheng, Xin Tao, Pengfei Wan, Di Zhang, and Kun Gai. Boosting resolution generalization of diffusion transformers with randomized positional encodings. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 4762–4770, 2026.
[18] Haonan Qiu, Shiwei Zhang, Yujie Wei, Ruihang Chu, Hangjie Yuan, Xiang Wang, Yingya Zhang, and Ziwei Liu. Freescale: Unleashing the resolution of diffusion models via tuning-free scale fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16893–16903, 2025.
[19] Zhengqiang Zhang, Ruihuang Li, and Lei Zhang. Frecas: Efficient higher-resolution image generation via frequency-aware cascaded sampling. arXiv preprint arXiv:2410.18410, 2024.
[20] Jinjin Zhang, Qiuyu Huang, Junjie Liu, Xiefan Guo, and Di Huang. Diffusion-4k: Ultra-high-resolution image synthesis with latent diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23464–23473, 2025.
[21] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
[22] Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023.
[23] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. Journal of Machine Learning Research, 23(47):1–33, 2022.
[24] Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Joshua M Susskind, and Navdeep Jaitly. Matryoshka diffusion models. In The Twelfth International Conference on Learning Representations, 2023.
[25] Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, and Sergey Tulyakov. Hierarchical patch diffusion models for high-resolution video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7569–7579, 2024.
[26] Moayed Haji-Ali, Willi Menapace, Ivan Skorokhodov, Arpit Sahni, Sergey Tulyakov, Vicente Ordonez, and Aliaksandr Siarohin. Improving progressive generation with decomposable flow matching. arXiv preprint arXiv:2506.19839, 2025.
[27] Jinho Jeong, Sangmin Han, Jinwoo Kim, and Seon Joo Kim. Latent space super-resolution for higher-resolution image generation with diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2355–2365, 2025.
[28] Haoning Wu, Shaocheng Shen, Qiang Hu, Xiaoyun Zhang, Ya Zhang, and Yanfeng Wang. Megafusion: Extend diffusion models towards higher-resolution image generation without further tuning. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3944–3953. IEEE, 2025.
[29] Zhihang Lin, Mingbao Lin, Meng Zhao, and Rongrong Ji. Accdiffusion: An accurate method for higher-resolution image generation. In European Conference on Computer Vision, pages 38–53. Springer, 2024.
[30] Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, and Hongsheng Li. Fouriscale: A frequency perspective on training-free high-resolution image synthesis. In European conference on computer vision, pages 196–212. Springer, 2024.
[31] Sungho Koh, SeungJu Cha, Hyunwoo Oh, Kwanyoung Lee, and Dong-Jin Kim. Scalediff: Higher-resolution image synthesis via efficient and model-agnostic diffusion. arXiv preprint arXiv:2510.25818, 2025.
[32] Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. Longrope: Extending llm context window beyond 2 million tokens. arXiv preprint arXiv:2402.13753, 2024.
[33] Jikun Hu, Dongsheng Guo, Yuli Liu, Qingyao Ai, Lixuan Wang, Xuebing Sun, Qilei Zhang, Quan Zhou, and Cheng Luo. Pepe: Long-context extension for large language models via periodic extrapolation positional encodings. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 21075–21085, 2025.
[34] Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023.
[35] Bowen Peng and Jeffrey Quesnelle. Ntk-aware scaled rope allows llama models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation, 2023.
[36] Min Zhao, Hongzhou Zhu, Yingze Wang, Bokai Yan, Jintao Zhang, Guande He, Ling Yang, Chongxuan Li, and Jun Zhu. Ultravico: Breaking extrapolation limits in video diffusion transformers. arXiv preprint arXiv:2511.20123, 2025.
[37] Min Zhao, Guande He, Yixiao Chen, Hongzhou Zhu, Chongxuan Li, and Jun Zhu. Riflex: A free lunch for length extrapolation in video diffusion transformers. arXiv preprint arXiv:2502.15894, 2025.
[38]

Byeongho Heo, Song Park, Dongyoon Han, and Sangdoo Yun. Rotary position embedding for vision transformer. In European Conference on Computer Vision, pages 289–305. Springer, 2024
[39]

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017
[40]

Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5148–5157, 2021
[41]

Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring CLIP for assessing the look and feel of images. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 2555–2563, 2023
[42]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021
[43]

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, 2021
[44]

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023
[45]

Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:36652–36663, 2023
[46]

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023
[47]

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023
[48]

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023
