
arxiv: 2511.19365 · v2 · submitted 2025-11-24 · 💻 cs.CV · cs.AI

DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation

Pith reviewed 2026-05-17 05:44 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords pixel diffusion · frequency decoupling · image generation · diffusion transformer · flow matching · ImageNet · text-to-image

The pith

DeCo decouples frequencies in pixel diffusion so the DiT models semantics while a lightweight decoder adds details.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to make end-to-end pixel diffusion competitive with latent methods by splitting the workload along frequency lines. A diffusion transformer focuses on low-frequency semantic structure, while a small decoder produces high-frequency visual details from that semantic guidance alone. This specialization is reinforced by a frequency-aware flow-matching loss that weights important frequencies more heavily. The result is faster training and inference plus better image quality than earlier single-network pixel diffusion approaches.

Core claim

DeCo decouples the generation of high-frequency details from low-frequency semantics in pixel space. The DiT specializes in modeling low-frequency content and supplies semantic guidance to a lightweight pixel decoder that synthesizes the high-frequency components. A frequency-aware flow-matching loss further directs attention to visually salient frequencies. On ImageNet, DeCo reports FID scores of 1.62 at 256x256 and 2.22 at 512x512, which the paper describes as superior among pixel diffusion models, and the text-to-image variant reaches an overall GenEval score of 0.86.

What carries the argument

The frequency-DeCoupled pixel diffusion framework that routes low-frequency semantics through a DiT and high-frequency details through a lightweight decoder conditioned on the DiT output.
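The intuition behind this routing can be illustrated with a toy decomposition. The sketch below is not the paper's architecture: it only shows, under simple assumptions, how an image splits into a low-frequency part that survives downsampling (the paper's Figure 3 caption notes that the DiT operates on downsampled inputs) and a high-frequency residual that a second module would have to supply. The function names `downsample`, `upsample`, and `split_frequencies`, the average-pooling filter, and the factor `k` are all hypothetical choices made for illustration.

```python
def downsample(img, k):
    """Average-pool a 2D list of numbers by an integer factor k.

    Assumes both image dimensions are divisible by k.
    """
    h, w = len(img), len(img[0])
    return [
        [
            sum(img[i * k + di][j * k + dj] for di in range(k) for dj in range(k))
            / (k * k)
            for j in range(w // k)
        ]
        for i in range(h // k)
    ]


def upsample(img, k):
    """Nearest-neighbour upsampling by an integer factor k."""
    h, w = len(img), len(img[0])
    return [[img[i // k][j // k] for j in range(w * k)] for i in range(h * k)]


def split_frequencies(img, k=2):
    """Return (low, high) such that low + high equals img up to float rounding.

    low  : what remains after a downsample/upsample round trip
    high : the residual detail lost by that round trip
    """
    low = upsample(downsample(img, k), k)
    high = [[p - q for p, q in zip(row, low_row)] for row, low_row in zip(img, low)]
    return low, high


if __name__ == "__main__":
    image = [[0, 0, 4, 4], [0, 0, 4, 4], [1, 3, 2, 2], [3, 1, 2, 2]]
    low, high = split_frequencies(image, k=2)
    print(low)   # coarse structure
    print(high)  # fine detail; zero wherever a 2x2 patch is flat
```

In this toy setting the residual is nonzero only where a patch contains variation finer than the downsampling grid, which is the kind of content the summary above assigns to the lightweight decoder.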

If this is right

  • Pixel diffusion models can train and sample faster because the main transformer no longer expends capacity on high-frequency signals.
  • End-to-end pixel-space generation becomes competitive with two-stage latent diffusion without relying on a VAE bottleneck.
  • The frequency-aware loss produces images with better perceptual quality by suppressing insignificant frequency bands.
  • The same pretrained backbone delivers leading system-level performance on text-to-image benchmarks such as GenEval.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same conditioning pattern could be tested on video or 3D diffusion to reduce compute while preserving fine detail.
  • Making the frequency split learned rather than fixed might further improve results on diverse datasets.
  • The approach suggests a general principle: separate semantic and perceptual modeling early in the generative pipeline.

Load-bearing premise

A lightweight pixel decoder can reliably synthesize accurate high-frequency details when given only semantic conditioning from the DiT without reintroducing artifacts or requiring joint optimization.

What would settle it

Train an ablated version of DeCo that removes the separate decoder and forces the DiT to model all frequencies; if the FID on ImageNet 256x256 rises above 3.0 or visible high-frequency artifacts appear in generated images, the decoupling premise is falsified.

Figures

Figures reproduced from arXiv: 2511.19365 by Longhui Wei, Qi Tian, Shiliang Zhang, Shuai Wang, Zehong Ma.

Figure 1: Qualitative results of text-to-image generation of DeCo. All images are 512 …
Figure 2: Illustration of our frequency-decoupled (DeCo) framework. In (a), traditional baseline models rely on a single DiT to jointly …
Figure 3: Overview of the proposed frequency-decoupled (DeCo) framework. The DiT operates on downsampled inputs to model low …
Figure 4: DCT energy distribution of DiT outputs and predicted …
Figure 5: FID comparison between our DeCo and baseline. DeCo …
Figure 6: Qualitative results of class-to-image generation of DeCo. All images are 256 …
Figure 7: Base and Scaled Quantization Tables. … factor q to create new scaled quantization tables Q_cur for different compression levels. Since a smaller quantization step implies that a frequency component is more significant to human perception, we use the normalized reciprocal of the scaled quantization tables as adaptive weights, i.e., 1/Q_cur with normalization. This allows us to assign a higher weight to the f…
Figure 8: More Qualitative results of text-to-image generation at a 512 …
Figure 9: More qualitative results of class-to-image generation at a 256 …
Figure 10: Qualitative results of class-to-image generation at a 512 …

Original abstract

Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This approach avoids the limitations of VAE in the two-stage latent diffusion, offering higher model capacity. Existing pixel diffusion models suffer from slow training and inference, as they usually model both high-frequency signals and low-frequency semantics within a single diffusion transformer (DiT). To pursue a more efficient pixel diffusion paradigm, we propose the frequency-DeCoupled pixel diffusion framework. With the intuition to decouple the generation of high and low frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT. This thus frees the DiT to specialize in modeling low-frequency semantics. In addition, we introduce a frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones. Extensive experiments show that DeCo achieves superior performance among pixel diffusion models, attaining FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet, closing the gap with latent diffusion methods. Furthermore, our pretrained text-to-image model achieves a leading overall score of 0.86 on GenEval in system-level comparison. Codes are publicly available at https://github.com/Zehong-Ma/DeCo.
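The frequency-aware loss mentioned in the abstract can be sketched from the Figure 7 caption, which describes using the normalized reciprocal of scaled JPEG quantization tables as adaptive weights. The following is a minimal illustration under several assumptions that are not confirmed by the text here: the base table is the standard JPEG luminance table from ITU-T T.81, the quality scaling follows the IJG libjpeg convention, the weights are normalized to mean 1, and the error is measured on a single 8x8 block in the DCT domain. The names `scaled_table`, `frequency_weights`, and `frequency_weighted_loss` are hypothetical.

```python
import math

N = 8  # JPEG-style 8x8 blocks

# Standard JPEG luminance quantization table (ITU-T T.81, Annex K).
BASE_TABLE = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]


def scaled_table(quality):
    """Scale BASE_TABLE for quality in 1..100 (IJG convention; assumed here)."""
    scale = 5000 // quality if quality < 50 else 200 - 2 * quality
    return [[min(255, max(1, (q * scale + 50) // 100)) for q in row]
            for row in BASE_TABLE]


def frequency_weights(quality):
    """Reciprocal of the scaled table, normalized to mean 1 (assumed choice)."""
    recip = [[1.0 / q for q in row] for row in scaled_table(quality)]
    mean = sum(sum(row) for row in recip) / (N * N)
    return [[r / mean for r in row] for row in recip]


def dct2(block):
    """Orthonormal 2D DCT-II of an 8x8 block given as nested lists."""
    cos = [[math.cos((2 * i + 1) * u * math.pi / (2 * N)) for i in range(N)]
           for u in range(N)]
    alpha = [math.sqrt((1.0 if u == 0 else 2.0) / N) for u in range(N)]
    return [
        [
            alpha[u] * alpha[v] * sum(
                block[i][j] * cos[u][i] * cos[v][j]
                for i in range(N) for j in range(N)
            )
            for v in range(N)
        ]
        for u in range(N)
    ]


def frequency_weighted_loss(pred, target, quality=50):
    """Mean of weight * squared DCT-domain error between two 8x8 blocks."""
    w = frequency_weights(quality)
    diff = dct2([[p - t for p, t in zip(pr, tr)] for pr, tr in zip(pred, target)])
    return sum(w[u][v] * diff[u][v] ** 2 for u in range(N) for v in range(N)) / (N * N)
```

In a flow-matching setting, `pred` and `target` would stand for blocks of the predicted and reference velocity. Because low-frequency entries of the quantization table are small, their reciprocals are large, so errors in those coefficients are penalized more than errors of equal size in the highest frequencies.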

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents DeCo, a frequency-decoupled pixel diffusion framework for end-to-end image generation. It uses a DiT to specialize in low-frequency semantics while a lightweight pixel decoder generates high-frequency details conditioned on DiT guidance, combined with a frequency-aware flow-matching loss that emphasizes salient frequencies. Experiments report FID scores of 1.62 (256×256) and 2.22 (512×512) on ImageNet, closing the gap with latent diffusion models, and a text-to-image variant achieves an overall score of 0.86 on GenEval.

Significance. If the decoupling is effective, the approach could enable more efficient pixel-space diffusion with higher capacity than VAE-based latent methods by avoiding compression artifacts and allowing component specialization. The public code release at the provided GitHub link is a clear strength supporting reproducibility.

major comments (2)
  1. [Method (§3)] The central claim that frequency decoupling succeeds (DiT models only low-frequency semantics while the decoder produces high-frequency content from guidance alone without artifacts or re-coupling via joint optimization) is load-bearing but unsupported by direct evidence. No frequency-spectrum analysis, high-frequency error maps, or conditioning diagrams are provided to verify specialization.
  2. [Experiments (§4)] No ablations isolate the contribution of the frequency-aware flow-matching loss or the lightweight decoder design. Without these, it is unclear whether the reported FID gains (1.62 at 256²) stem from true decoupling or from other unisolated factors such as training schedule or architecture scale.
minor comments (2)
  1. [Abstract] The abstract states that the decoder is 'lightweight' but does not quantify parameter count or FLOPs relative to the DiT, which would clarify the efficiency claim.
  2. [Figures] Figure captions and diagrams could more explicitly label the frequency separation path and loss weighting to improve readability for readers unfamiliar with the split.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address each major comment below with clarifications and proposed revisions to strengthen the manuscript.

  1. Referee: [Method (§3)] The central claim that frequency decoupling succeeds (DiT models only low-frequency semantics while the decoder produces high-frequency content from guidance alone without artifacts or re-coupling via joint optimization) is load-bearing but unsupported by direct evidence. No frequency-spectrum analysis, high-frequency error maps, or conditioning diagrams are provided to verify specialization.

    Authors: We agree that additional direct evidence would better substantiate the specialization claim. In the revised manuscript we will add frequency-spectrum analysis comparing the DiT output and final decoder output, high-frequency error maps relative to ground truth, and a conditioning diagram that illustrates the guidance pathway from DiT to decoder. These additions will be placed in Section 3 and the supplementary material. revision: yes

  2. Referee: [Experiments (§4)] No ablations isolate the contribution of the frequency-aware flow-matching loss or the lightweight decoder design. Without these, it is unclear whether the reported FID gains (1.62 at 256²) stem from true decoupling or from other unisolated factors such as training schedule or architecture scale.

    Authors: We acknowledge that the current experiments do not contain targeted ablations for these two components. We will add two new ablation studies in the revised Section 4: (1) a comparison of the frequency-aware flow-matching loss against a standard flow-matching baseline while keeping all other elements fixed, and (2) an ablation replacing the lightweight decoder with a deeper variant to isolate its contribution. These results will be reported alongside the existing FID numbers. revision: yes
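The frequency-spectrum analysis discussed in the first response could take a form like the sketch below, which measures how a block's energy is distributed across DCT frequency bands (the paper's Figure 4 is captioned as a DCT energy distribution). This is an illustrative assumption about what such an analysis might compute, not the authors' procedure: the grouping of coefficients by `u + v` and the function names `dct2` and `dct_energy_by_band` are hypothetical.

```python
import math


def dct2(block):
    """Orthonormal 2D DCT-II of a square block given as nested lists."""
    n = len(block)
    cos = [[math.cos((2 * i + 1) * u * math.pi / (2 * n)) for i in range(n)]
           for u in range(n)]
    alpha = [math.sqrt((1.0 if u == 0 else 2.0) / n) for u in range(n)]
    return [
        [
            alpha[u] * alpha[v] * sum(
                block[i][j] * cos[u][i] * cos[v][j]
                for i in range(n) for j in range(n)
            )
            for v in range(n)
        ]
        for u in range(n)
    ]


def dct_energy_by_band(block):
    """Fraction of total energy in each band u + v (0 is lowest frequency).

    Returns a list of length 2n - 1. If the block has zero energy,
    all fractions are returned as 0.0.
    """
    n = len(block)
    coeffs = dct2(block)
    bands = [0.0] * (2 * n - 1)
    for u in range(n):
        for v in range(n):
            bands[u + v] += coeffs[u][v] ** 2
    total = sum(bands)
    return [b / total for b in bands] if total > 0 else bands


if __name__ == "__main__":
    flat = [[1.0] * 8 for _ in range(8)]
    checker = [[(-1.0) ** (i + j) for j in range(8)] for i in range(8)]
    print(dct_energy_by_band(flat)[0])      # nearly all energy in the lowest band
    print(dct_energy_by_band(checker)[14])  # most energy in the highest band
```

Applied separately to the DiT output and to the decoder's final prediction, a profile of this kind would show whether the former concentrates in low bands while the latter carries the high-band energy, which is the specialization the referee asks to see verified.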

Circularity Check

0 steps flagged

No circularity: architectural design choices and empirical results are independent of inputs

The paper presents DeCo as an empirical framework consisting of a proposed frequency-decoupled architecture (DiT for low-frequency semantics plus lightweight pixel decoder for high-frequency details) and a frequency-aware flow-matching loss. These are introduced as design decisions motivated by intuition about frequency separation, not derived from equations or prior results that reduce back to the same inputs by construction. Reported FID scores (1.62 at 256x256, 2.22 at 512x512) and GenEval score arise from standard benchmark evaluations on ImageNet, which are external to the model definition. No self-citations, uniqueness theorems, fitted parameters renamed as predictions, or ansatzes smuggled via prior work are used to justify the central claims. The derivation chain is therefore self-contained as an engineering proposal validated experimentally.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard diffusion-model assumptions plus the paper-specific premise that frequency components can be cleanly separated and handled by separate modules without significant information loss or optimization conflicts.

axioms (2)
  • [standard math] Standard assumptions of flow-matching or diffusion processes in image generation (e.g., gradual noise addition and reversal)
    Invoked implicitly when describing the DiT and flow-matching loss.
  • [domain assumption] High-frequency details can be generated reliably by a lightweight decoder conditioned solely on low-frequency semantic features from the DiT
    Core design choice stated in the abstract; if false the decoupling benefit disappears.

pith-pipeline@v0.9.0 · 5531 in / 1447 out tokens · 38364 ms · 2026-05-17T05:44:10.296781+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Coevolving Representations in Joint Image-Feature Diffusion

    cs.CV 2026-04 unverdicted novelty 7.0

    CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample ...

  2. HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion

    cs.CV 2026-05 unverdicted novelty 6.0

    HyperDiT achieves FID 1.56 on ImageNet 256x256 in pixel space via hyper-connected cross-scale interactions, cross-attention, SA-RoPE, and VFM registers.

  3. L2P: Unlocking Latent Potential for Pixel Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.

  4. FREPix: Frequency-Heterogeneous Flow Matching for Pixel-Space Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    FREPix achieves competitive FID scores on ImageNet by decomposing image generation into separate low- and high-frequency paths within a flow matching framework.

  5. CoD-Lite: Real-Time Diffusion-Based Generative Image Compression

    cs.CV 2026-04 unverdicted novelty 6.0

    CoD-Lite delivers real-time generative image compression via a lightweight convolution-based diffusion codec with compression-oriented pre-training and distillation, achieving substantial bitrate savings.

  6. PixelGen: Improving Pixel Diffusion with Perceptual Supervision

    cs.CV 2026-02 accept novelty 6.0

    PixelGen augments pixel diffusion with gated perceptual supervision to reach FID 5.11 on ImageNet-256 and GenEval 0.79 in text-to-image, narrowing the gap to latent methods without VAEs.

  7. PixIE: Prompted Pixel-Space Low-Light Image Enhancement

    cs.CV 2026-05 unverdicted novelty 5.0

    PixIE proposes a pixel-space low-light image enhancement framework using DINO-prompted blocks, spatial-channel compaction, and multi-receptive-field embeddings, reporting PSNR gains of 1.9-15.0% and LPIPS reductions o...

  8. FrequencyBooster: Full-Frequency Modeling for High-Fidelity Pixel Diffusion

    cs.CV 2026-05 unverdicted novelty 5.0

    FrequencyBooster reports state-of-the-art FID scores of 1.60 at 256x256 and 1.69 at 512x512 for pixel diffusion by using a specialized decoder for full-frequency modeling.

  9. Why Do DiT Editors Drift? Plug-and-Play Low Frequency Alignment in VAE Latent Space

    cs.CV 2026-05 unverdicted novelty 4.0

    VAE-LFA suppresses semantic drift in multi-turn DiT image editing by low-pass filtering latent discrepancies and aligning low-frequency components to an EMA of previous rounds in VAE space.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · cited by 9 Pith papers · 14 internal anchors

  1. [1]

    All are worth words: A vit backbone for diffusion models

    Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 22669–22679, 2023. 3

  2. [2]

    Improving image gener- ation with better captions

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image gener- ation with better captions. OpenAI Technical Report, 2023. 8

  3. [3]

    Pixart-α: Fast training of dif- fusion transformer for photorealistic text-to-image synthesis,

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of dif- fusion transformer for photorealistic text-to-image synthesis,

  4. [4]

    Deep compression autoencoder for efficient high-resolution diffusion models.arXiv preprint arXiv:2410.10733, 2024

    Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, and Song Han. Deep compression autoencoder for efficient high-resolution diffu- sion models.arXiv preprint arXiv:2410.10733, 2024. 3

  5. [5]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Sil- vio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025. 8, 3

  6. [6]

    arXiv preprint arXiv:2504.07963 (2025)

    Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, and Ping Luo. Pixelflow: Pixel-space generative models with flow.arXiv preprint arXiv:2504.07963, 2025. 2, 3, 4, 5, 6, 7, 8, 1

  7. [7]

    Vision transformer adapter for dense predictions

    Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. InThe Eleventh International Conference on Learning Representations, 2023. 2

  8. [8]

    Deep generative image models using a laplacian pyramid of adver- sarial networks.Advances in neural information processing systems, 28, 2015

    Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a laplacian pyramid of adver- sarial networks.Advances in neural information processing systems, 28, 2015. 2

  9. [9]

    Diffusion models beat gans on image synthesis.Advances in neural informa- tion processing systems, 34:8780–8794, 2021

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis.Advances in neural informa- tion processing systems, 34:8780–8794, 2021. 1, 2, 3, 7

  10. [10]

    Sigmoid- weighted linear units for neural network function approxima- tion in reinforcement learning.Neural networks, 107:3–11,

    Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid- weighted linear units for neural network function approxima- tion in reinforcement learning.Neural networks, 107:3–11,

  11. [11]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis.arXiv preprint arXiv:2403.03206, 2024. 8

  12. [12]

    Fluid: Scaling autoregressive text-to-image generative models with continuous tokens

    Lijie Fan, Tianhong Li, Siyang Qin, Yuanzhen Li, Chen Sun, Michael Rubinstein, Deqing Sun, Kaiming He, and Yonglong Tian. Fluid: Scaling autoregressive text-to-image generative models with continuous tokens.arXiv preprint arXiv:2410.13863, 2024. 1

  13. [13]

    Seedream 3.0 Technical Report

    Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report.arXiv preprint arXiv:2504.11346, 2025. 1

  14. [14]

    Geneval: An object-focused framework for evaluating text- to-image alignment.Advances in Neural Information Pro- cessing Systems, 36:52132–52152, 2023

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text- to-image alignment.Advances in Neural Information Pro- cessing Systems, 36:52132–52152, 2023. 6, 8

  15. [15]

    Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model

    Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, et al. Seedream 2.0: A native chinese-english bilin- gual image generation foundation model.arXiv preprint arXiv:2503.07703, 2025. 1

  16. [16]

    Neue methoden zur approximativen integration der differentialgleichungen einer unabh ¨angigen ver¨anderlichen.Z

    Karl Heun et al. Neue methoden zur approximativen integration der differentialgleichungen einer unabh ¨angigen ver¨anderlichen.Z. Math. Phys, 45:23–38, 1900. 7, 8

  17. [17]

    Gans trained by a two time-scale update rule converge to a local nash equilib- rium.Advances in neural information processing systems, 30, 2017

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium.Advances in neural information processing systems, 30, 2017. 6

  18. [18]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022. 6, 1

  19. [19]

    Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 1

  20. [20]

    sim- ple diffusion: End-to-end diffusion for high resolution im- ages

    Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. sim- ple diffusion: End-to-end diffusion for high resolution im- ages. InInternational Conference on Machine Learning, pages 13213–13232. PMLR, 2023. 2, 3

  21. [21]

    ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 2024. 6, 8

  22. [22]

    Scalable adaptive computation for iterative generation

    Allan Jabri, David Fleet, and Ting Chen. Scalable adap- tive computation for iterative generation.arXiv preprint arXiv:2212.11972, 2022. 7

  23. [23]

    Information technology — digital compression and coding of continuous-tone still images: Requirements and guidelines

    Joint Photographic Experts Group. Information technology — digital compression and coding of continuous-tone still images: Requirements and guidelines. Technical Report ITU-T T.81, International Telecommunication Union (ITU- T), 1992. 2, 4, 5

  24. [24]

    Progressive growing of gans for improved quality, stability, and variation

    Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. InInternational Conference on Learning Rep- resentations, 2018. 2

  25. [25]

    Analyzing and improving the training dynamics of diffusion models

    Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24174–24184, 2024. 3

  26. [26]

    Understanding diffu- sion objectives as the elbo with simple data augmentation

    Diederik Kingma and Ruiqi Gao. Understanding diffu- sion objectives as the elbo with simple data augmentation. Advances in Neural Information Processing Systems, 36: 65484–65516, 2023. 7

  27. [27]

    Understanding diffu- sion objectives as the elbo with simple data augmentation

    Diederik Kingma and Ruiqi Gao. Understanding diffu- sion objectives as the elbo with simple data augmentation. Advances in Neural Information Processing Systems, 36: 65484–65516, 2023. 2, 3 7

  28. [28]

    Improved precision and recall met- ric for assessing generative models.Advances in neural in- formation processing systems, 32, 2019

    Tuomas Kynk ¨a¨anniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall met- ric for assessing generative models.Advances in neural in- formation processing systems, 32, 2019. 6

  29. [29]

    Applying guidance in a limited interval improves sample and distribution quality in diffusion models

    Tuomas Kynk ¨a¨anniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, and Jaakko Lehtinen. Applying guidance in a limited interval improves sample and distribution quality in diffusion models.arXiv preprint arXiv:2404.07724, 2024. 7, 1

  30. [30]

    Flux.https://github.com/ black-forest-labs/flux, 2024

    Black Forest Labs. Flux.https://github.com/ black-forest-labs/flux, 2024. 1, 8

  31. [31]

    Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers.arXiv preprint arXiv:2504.10483, 2025

    Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers. arXiv preprint arXiv:2504.10483, 2025. 1, 3

  32. [32]

    Back to basics: Let denoising generative models denoise, 2025

    Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise, 2025. 2, 3, 6, 7

  33. [33]

    Fractal generative models

    Tianhong Li, Qinyi Sun, Lijie Fan, and Kaiming He. Fractal generative models.arXiv preprint arXiv:2502.17437, 2025. 3, 7

  34. [34]

    Exploring the effect of high-frequency components in gans training.ACM Trans

    Ziqiang Li, Pengfei Xia, Xue Rui, and Bin Li. Exploring the effect of high-frequency components in gans training.ACM Trans. Multimedia Comput. Commun. Appl., 19(5), 2023. 2

  35. [35]

    Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

    Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, and Weilin Huang. Mogao: An omni foundation model for interleaved multi-modal generation.arXiv preprint arXiv:2505.05472, 2025. 1

  36. [36]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maxim- ilian Nickel, and Matthew Le. Flow matching for genera- tive modeling. InThe Eleventh International Conference on Learning Representations, 2023. 3

  37. [37]

    Decoupled weight de- cay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight de- cay regularization. InInternational Conference on Learning Representations, 2019. 1

  38. [38]

    Latent consistency models: Synthesizing high- resolution images with few-step inference, 2024

    Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high- resolution images with few-step inference, 2024. 3

  39. [39]

    Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

    Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Explor- ing flow and diffusion-based generative models with scalable interpolant transformers.arXiv preprint arXiv:2401.08740,

  40. [40]

    Generating images with sparse representations

    Charlie Nash, Jacob Menick, Sander Dieleman, and Peter W Battaglia. Generating images with sparse representations. arXiv preprint arXiv:2103.03841, 2021. 6

  41. [41]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 3

  42. [42]

    How do vision transformers work? InInternational Conference on Learning Represen- tations, 2022

    Namuk Park and Songkuk Kim. How do vision transformers work? InInternational Conference on Learning Represen- tations, 2022. 2

  43. [43]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 4195–4205,

  44. [44]

    1, 3, 4, 5, 6, 7, 8, 9

  45. [45]

    Springer Science & Busi- ness Media, 1992

    William B Pennebaker and Joan L Mitchell.JPEG: Still im- age data compression standard. Springer Science & Busi- ness Media, 1992. 2

  46. [46]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1, 3

  47. [47]

    U- net: Convolutional networks for biomedical image segmen- tation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- net: Convolutional networks for biomedical image segmen- tation. InInternational Conference on Medical image com- puting and computer-assisted intervention, pages 234–241. Springer, 2015. 9

  48. [48]

    Improved techniques for training gans.Advances in neural information processing systems, 29, 2016

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans.Advances in neural information processing systems, 29, 2016. 6

  49. [49]

    Inception transformer.Advances in Neural Information Processing Systems, 35:23495–23509,

    Chenyang Si, Weihao Yu, Pan Zhou, Yichen Zhou, Xinchao Wang, and Shuicheng Yan. Inception transformer.Advances in Neural Information Processing Systems, 35:23495–23509,

  50. [50]

    Improving the diffusability of autoen- coders

    Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Mena- pace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, and Ali- aksandr Siarohin. Improving the diffusability of autoen- coders. InForty-second International Conference on Ma- chine Learning, 2025. 2

  51. [51]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois- ing diffusion implicit models.arXiv:2010.02502, 2020. 1

  52. [52]

    Dmm: Building a versatile image generation model via distillation-based model merging

    Tianhui Song, Weixin Feng, Shuai Wang, Xubin Li, Tiezheng Ge, Bo Zheng, and Limin Wang. Dmm: Building a versatile image generation model via distillation-based model merging. arXiv preprint arXiv:2504.12364, 2025. 3

  53. [53]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063,

  54. [54]

    Relay diffusion: Unifying diffusion process across resolutions for image synthesis

    Jiayan Teng, Wendi Zheng, Ming Ding, Wenyi Hong, Jianqiao Wangni, Zhuoyi Yang, and Jie Tang. Relay diffusion: Unifying diffusion process across resolutions for image synthesis. arXiv preprint arXiv:2309.03350, 2023. 2, 3, 7

  55. [55]

    Dim: Diffusion mamba for efficient high-resolution image synthesis

    Yao Teng, Yue Wu, Han Shi, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, and Xihui Liu. Dim: Diffusion mamba for efficient high-resolution image synthesis. arXiv preprint arXiv:2405.14224, 2024. 3

  56. [56]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 1

  57. [57]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 1

  58. [58]

    Jetformer: An autoregressive generative model of raw images and text

    Michael Tschannen, André Susano Pinto, and Alexander Kolesnikov. Jetformer: An autoregressive generative model of raw images and text. arXiv preprint arXiv:2411.19722,

  59. [59]

    High-frequency component helps explain the generalization of convolutional neural networks

    Haohan Wang, Xindi Wu, Zeyi Huang, and Eric P Xing. High-frequency component helps explain the generalization of convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8684–8694, 2020. 2

  60. [60]

    Exploring dcn-like architecture for fast image generation with arbitrary resolution

    Shuai Wang, Zexian Li, Tianhui Song, Xubin Li, Tiezheng Ge, Bo Zheng, and Limin Wang. Exploring dcn-like architecture for fast image generation with arbitrary resolution. Advances in Neural Information Processing Systems, 37:87959–87977, 2024. 3

  61. [61]

    Pixnerd: Pixel neural field diffusion

    Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, and Limin Wang. Pixnerd: Pixel neural field diffusion. arXiv preprint arXiv:2507.23268, 2025. 1, 3, 6, 7, 8

  62. [62]

    Ddt: Decoupled diffusion transformer

    Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. Decoupled diffusion transformer. arXiv preprint arXiv:2504.05741, 2025. 3, 6, 1

  63. [63]

    Fregan: Exploiting frequency components for training gans under limited data

    Zhe Wang, Ziqiu Chi, Yanbing Zhang, et al. Fregan: Exploiting frequency components for training gans under limited data. Advances in Neural Information Processing Systems, 35:33387–33399, 2022. 2

  64. [64]

    Native-resolution image synthesis

    Zidong Wang, Lei Bai, Xiangyu Yue, Wanli Ouyang, and Yiyuan Zhang. Native-resolution image synthesis. arXiv preprint arXiv:2506.03131, 2025. 1

  65. [65]

    OmniGen2: Towards Instruction-Aligned Multimodal Generation

    Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, and Zheng Liu. Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.1887...

  66. [66]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. 8, 1

  67. [67]

    Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models

    Jingfeng Yao and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. arXiv preprint arXiv:2501.01423, 2025. 1, 3

  68. [68]

    Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024. 3, 4, 6, 7, 1, 2

  69. [69]

    Diffusion models need visual priors for image generation

    Xiaoyu Yue, Zidong Wang, Zeyu Lu, Shuyang Sun, Meng Wei, Wanli Ouyang, Lei Bai, and Luping Zhou. Diffusion models need visual priors for image generation. arXiv preprint arXiv:2410.08531, 2024. 3

  70. [70]

    Normalizing flows are capable generative models

    Shuangfei Zhai, Ruixiang Zhang, Preetum Nakkiran, David Berthelot, Jiatao Gu, Huangjie Zheng, Tianrong Chen, Miguel Angel Bautista, Navdeep Jaitly, and Josh Susskind. Normalizing flows are capable generative models. arXiv preprint arXiv:2412.06329, 2024. 3

  71. [71]

    Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation

    Mingyuan Zhou, Huangjie Zheng, Zhendong Wang, Mingzhang Yin, and Hai Huang. Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation. In Forty-first International Conference on Machine Learning, 2024. 3, 7 9