DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation

Longhui Wei; Qi Tian; Shiliang Zhang; Shuai Wang; Zehong Ma

arXiv: 2511.19365 · v2 · submitted 2025-11-24 · cs.CV · cs.AI

Zehong Ma, Longhui Wei, Shuai Wang, Shiliang Zhang, Qi Tian

Pith reviewed 2026-05-17 05:44 UTC · model grok-4.3

keywords: pixel diffusion · frequency decoupling · image generation · diffusion transformer · flow matching · ImageNet · text-to-image

The pith

DeCo decouples frequencies in pixel diffusion so the DiT models semantics while a lightweight decoder adds details.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to make end-to-end pixel diffusion competitive with latent methods by splitting the workload along frequency lines. A diffusion transformer focuses on low-frequency semantic structure, while a small decoder produces high-frequency visual details from that semantic guidance alone. This specialization is reinforced by a frequency-aware flow-matching loss that weights important frequencies more heavily. The result is faster training and inference plus better image quality than earlier single-network pixel diffusion approaches.

Core claim

DeCo decouples the generation of high-frequency details from low-frequency semantics in pixel space. The DiT specializes in modeling low-frequency content and supplies semantic guidance to a lightweight pixel decoder that synthesizes the high-frequency components. A frequency-aware flow-matching loss further directs attention to visually salient frequencies. The paper reports FID scores of 1.62 (256x256) and 2.22 (512x512) on ImageNet, which it describes as superior among pixel diffusion models, and a GenEval score of 0.86 for the text-to-image variant.

What carries the argument

The frequency-DeCoupled pixel diffusion framework that routes low-frequency semantics through a DiT and high-frequency details through a lightweight decoder conditioned on the DiT output.
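
As a toy illustration of what a low/high frequency split means at the signal level, the sketch below separates an image into a block-averaged low-frequency part and a residual high-frequency part. This is only an analogy under stated assumptions: in DeCo both roles are played by learned networks, and the helper names (`downsample`, `upsample`, `split_frequencies`) and the average-pooling choice are hypothetical, not taken from the paper.

```python
def downsample(img, s):
    """Average-pool a square 2D list by factor s (keeps coarse structure)."""
    n = len(img) // s
    return [[sum(img[i * s + di][j * s + dj]
                 for di in range(s) for dj in range(s)) / (s * s)
             for j in range(n)] for i in range(n)]


def upsample(img, s):
    """Nearest-neighbour upsample a square 2D list by factor s."""
    size = len(img) * s
    return [[img[i // s][j // s] for j in range(size)] for i in range(size)]


def split_frequencies(img, s):
    """Return (low, high) with low = upsampled block means, high = residual."""
    low = upsample(downsample(img, s), s)
    n = len(img)
    high = [[img[i][j] - low[i][j] for j in range(n)] for i in range(n)]
    return low, high


# Toy 4x4 "image": a smooth ramp plus a fine checkerboard pattern.
img = [[float(i + j) + (1.0 if (i + j) % 2 == 0 else -1.0)
        for j in range(4)] for i in range(4)]
low, high = split_frequencies(img, 2)

# The two parts add back up to the original image.
recon = [[low[i][j] + high[i][j] for j in range(4)] for i in range(4)]
```

By construction the two parts sum back to the input, and the residual averages to zero within each pooled block, so the coarse structure lives entirely in `low`. The paper's premise is the learned counterpart of this: a network working at reduced resolution handles the coarse part, and a smaller module fills in the residual detail.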

If this is right

Pixel diffusion models can train and sample faster because the main transformer no longer expends capacity on high-frequency signals.
End-to-end pixel-space generation becomes competitive with two-stage latent diffusion without relying on a VAE bottleneck.
The frequency-aware loss produces images with better perceptual quality by suppressing insignificant frequency bands.
The same pretrained backbone delivers leading system-level performance on text-to-image benchmarks such as GenEval.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same conditioning pattern could be tested on video or 3D diffusion to reduce compute while preserving fine detail.
Making the frequency split learned rather than fixed might further improve results on diverse datasets.
The approach suggests a general principle: separate semantic and perceptual modeling early in the generative pipeline.

Load-bearing premise

A lightweight pixel decoder can reliably synthesize accurate high-frequency details when given only semantic conditioning from the DiT without reintroducing artifacts or requiring joint optimization.

What would settle it

Train an ablated version of DeCo that removes the separate decoder and forces the DiT to model all frequencies; if the FID on ImageNet 256x256 rises above 3.0 or visible high-frequency artifacts appear in generated images, the decoupling premise is falsified.

Figures

Figures reproduced from arXiv: 2511.19365 by Longhui Wei, Qi Tian, Shiliang Zhang, Shuai Wang, Zehong Ma.

**Figure 2.** Illustration of our frequency-decoupled (DeCo) framework. In (a), traditional baseline models rely on a single DiT to jointly …

**Figure 3.** Overview of the proposed frequency-decoupled (DeCo) framework. The DiT operates on downsampled inputs to model low …

**Figure 4.** DCT energy distribution of DiT outputs and predicted …

**Figure 5.** FID comparison between our DeCo and baseline. DeCo …

**Figure 6.** Qualitative results of class-to-image generation of DeCo. All images are 256 …

**Figure 7.** Base and Scaled Quantization Tables. Accompanying text from the paper: "… factor q to create new scaled quantization tables Q_cur for different compression levels. Since a smaller quantization step implies that a frequency component is more significant to human perception, we use the normalized reciprocal of the scaled quantization tables as adaptive weights, i.e., 1/Q_cur with normalization. This allows us to assign a higher weight to the f…"

**Figure 8.** More qualitative results of text-to-image generation at a 512 …

**Figure 9.** More qualitative results of class-to-image generation at a 256 …

**Figure 10.** Qualitative results of class-to-image generation at a 512 …
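
The text captured with Figure 7 describes turning scaled JPEG quantization tables into per-frequency weights via a normalized reciprocal. A minimal sketch of that idea follows. Several details are assumptions rather than facts from the paper: the base table is taken to be the standard JPEG luminance table (ITU-T T.81, Annex K), the scaling follows the widely used libjpeg quality convention, the weights are normalized to sum to 1, and the function names `scaled_table` and `frequency_weights` are hypothetical.

```python
# Standard JPEG luminance quantization table (ITU-T T.81, Annex K).
# Assumed here as the base table; the paper's exact choice is not shown above.
BASE_LUMA = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]


def scaled_table(quality):
    """Scale the base table for a quality in 1..100 (libjpeg convention, assumed)."""
    if not 1 <= quality <= 100:
        raise ValueError("quality must be in 1..100")
    scale = 5000 // quality if quality < 50 else 200 - 2 * quality
    # Round to nearest integer and clamp each step to the range 1..255.
    return [[min(255, max(1, (b * scale + 50) // 100)) for b in row]
            for row in BASE_LUMA]


def frequency_weights(quality):
    """Normalized reciprocal of the scaled table: small step -> large weight."""
    q_cur = scaled_table(quality)
    inv = [[1.0 / step for step in row] for row in q_cur]
    total = sum(sum(row) for row in inv)
    return [[v / total for v in row] for row in inv]  # sums to 1 (assumed)


weights = frequency_weights(50)
```

At quality 50 the scaled table equals the base table under this convention, so the largest weight falls on the coefficient with the smallest quantization step and the weights shrink toward the high-frequency corner of the 8x8 grid.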

Original abstract

Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This approach avoids the limitations of VAE in the two-stage latent diffusion, offering higher model capacity. Existing pixel diffusion models suffer from slow training and inference, as they usually model both high-frequency signals and low-frequency semantics within a single diffusion transformer (DiT). To pursue a more efficient pixel diffusion paradigm, we propose the frequency-DeCoupled pixel diffusion framework. With the intuition to decouple the generation of high and low frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT. This thus frees the DiT to specialize in modeling low-frequency semantics. In addition, we introduce a frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones. Extensive experiments show that DeCo achieves superior performance among pixel diffusion models, attaining FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet, closing the gap with latent diffusion methods. Furthermore, our pretrained text-to-image model achieves a leading overall score of 0.86 on GenEval in system-level comparison. Codes are publicly available at https://github.com/Zehong-Ma/DeCo.
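
To make the idea of a frequency-aware flow-matching loss concrete, here is a hedged sketch of one way such a loss could be computed on a single 8x8 block: take the error between predicted and target velocity, move it into the DCT domain, and weight each squared coefficient. The velocity convention (target = data minus noise), the single-block setup, and all function names are assumptions for illustration; the paper's actual formulation may differ.

```python
import math
import random


def dct_matrix(n):
    """Orthonormal DCT-II matrix; row k holds the k-th cosine basis vector."""
    return [[math.sqrt((1 if k == 0 else 2) / n)
             * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
             for i in range(n)] for k in range(n)]


def matmul(a, b):
    """Plain-Python matrix product of two 2D lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def dct2(block):
    """2D DCT of a square block, computed as C @ block @ C^T."""
    c = dct_matrix(len(block))
    c_t = [list(col) for col in zip(*c)]
    return matmul(matmul(c, block), c_t)


def frequency_weighted_loss(v_pred, v_target, weights):
    """Mean of weights[u][v] * DCT(v_pred - v_target)[u][v]**2 over one block."""
    n = len(v_pred)
    err = [[v_pred[i][j] - v_target[i][j] for j in range(n)] for i in range(n)]
    coef = dct2(err)
    return sum(weights[u][v] * coef[u][v] ** 2
               for u in range(n) for v in range(n)) / (n * n)


# Toy example: the velocity target is data minus noise (assumed convention).
random.seed(0)
n = 8
data = [[random.random() for _ in range(n)] for _ in range(n)]
noise = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
v_target = [[data[i][j] - noise[i][j] for j in range(n)] for i in range(n)]
# Stand-in for a network prediction: the target plus a small perturbation.
v_pred = [[v_target[i][j] + 0.1 * random.gauss(0.0, 1.0) for j in range(n)]
          for i in range(n)]

uniform = [[1.0] * n for _ in range(n)]
loss_uniform = frequency_weighted_loss(v_pred, v_target, uniform)
mse = sum((v_pred[i][j] - v_target[i][j]) ** 2
          for i in range(n) for j in range(n)) / (n * n)
```

With all-ones weights the orthonormal DCT preserves energy, so the loss reduces to the ordinary mean squared error. Non-uniform weights, such as ones derived from quantization tables, re-balance which frequencies dominate the objective.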

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents DeCo, a frequency-decoupled pixel diffusion framework for end-to-end image generation. It uses a DiT to specialize in low-frequency semantics while a lightweight pixel decoder generates high-frequency details conditioned on DiT guidance, combined with a frequency-aware flow-matching loss that emphasizes salient frequencies. Experiments report FID scores of 1.62 (256×256) and 2.22 (512×512) on ImageNet, closing the gap with latent diffusion models, and a text-to-image variant achieves an overall score of 0.86 on GenEval.

Significance. If the decoupling is effective, the approach could enable more efficient pixel-space diffusion with higher capacity than VAE-based latent methods by avoiding compression artifacts and allowing component specialization. The public code release at the provided GitHub link is a clear strength supporting reproducibility.

major comments (2)

[Method (§3)] The central claim that frequency decoupling succeeds (DiT models only low-frequency semantics while the decoder produces high-frequency content from guidance alone without artifacts or re-coupling via joint optimization) is load-bearing but unsupported by direct evidence. No frequency-spectrum analysis, high-frequency error maps, or conditioning diagrams are provided to verify specialization.
[Experiments (§4)] No ablations isolate the contribution of the frequency-aware flow-matching loss or the lightweight decoder design; without these, it is unclear whether the reported FID gains (1.62 at 256²) stem from true decoupling or from other unisolated factors such as training schedule or architecture scale.

minor comments (2)

[Abstract] The abstract states that the decoder is 'lightweight' but does not quantify parameter count or FLOPs relative to the DiT, which would clarify the efficiency claim.
[Figures] Figure captions and diagrams could more explicitly label the frequency separation path and loss weighting to improve readability for readers unfamiliar with the split.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address each major comment below with clarifications and proposed revisions to strengthen the manuscript.

Point-by-point responses

Referee: [Method (§3)] The central claim that frequency decoupling succeeds (DiT models only low-frequency semantics while the decoder produces high-frequency content from guidance alone without artifacts or re-coupling via joint optimization) is load-bearing but unsupported by direct evidence. No frequency-spectrum analysis, high-frequency error maps, or conditioning diagrams are provided to verify specialization.

Authors: We agree that additional direct evidence would better substantiate the specialization claim. In the revised manuscript we will add frequency-spectrum analysis comparing the DiT output and final decoder output, high-frequency error maps relative to ground truth, and a conditioning diagram that illustrates the guidance pathway from DiT to decoder. These additions will be placed in Section 3 and the supplementary material.
Revision: yes

Referee: [Experiments (§4)] No ablations isolate the contribution of the frequency-aware flow-matching loss or the lightweight decoder design; without these, it is unclear whether the reported FID gains (1.62 at 256²) stem from true decoupling or from other unisolated factors such as training schedule or architecture scale.

Authors: We acknowledge that the current experiments do not contain targeted ablations for these two components. We will add two new ablation studies in the revised Section 4: (1) a comparison of the frequency-aware flow-matching loss against a standard flow-matching baseline while keeping all other elements fixed, and (2) an ablation replacing the lightweight decoder with a deeper variant to isolate its contribution. These results will be reported alongside the existing FID numbers.
Revision: yes
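
A frequency-spectrum comparison of the kind discussed in the first exchange could be sketched roughly as follows: compute the 2D DCT of a block and report the share of energy at high-frequency coefficients. The cutoff rule (`u + v >= cutoff`), the 8x8 block size, and the function names are assumptions for illustration, not the paper's or the referee's protocol.

```python
import math


def dct2(block):
    """Orthonormal 2D DCT-II of a square block, via the direct double sum."""
    n = len(block)

    def alpha(k):
        return math.sqrt((1 if k == 0 else 2) / n)

    return [[alpha(u) * alpha(v) * sum(
                block[i][j]
                * math.cos(math.pi * (2 * i + 1) * u / (2 * n))
                * math.cos(math.pi * (2 * j + 1) * v / (2 * n))
                for i in range(n) for j in range(n))
             for v in range(n)] for u in range(n)]


def high_freq_energy_fraction(block, cutoff):
    """Share of DCT energy at coefficients with u + v >= cutoff."""
    coef = dct2(block)
    n = len(block)
    total = sum(c * c for row in coef for c in row)
    if total == 0:
        return 0.0
    high = sum(coef[u][v] ** 2
               for u in range(n) for v in range(n) if u + v >= cutoff)
    return high / total


# Two toy signals: a smooth ramp and a pixel-level checkerboard.
smooth = [[float(i + j) for j in range(8)] for i in range(8)]
checker = [[1.0 if (i + j) % 2 == 0 else -1.0 for j in range(8)]
           for i in range(8)]

smooth_share = high_freq_energy_fraction(smooth, 4)
checker_share = high_freq_energy_fraction(checker, 4)
```

Applied to a backbone's output and to the final decoded image, a statistic like this would show whether the backbone output is concentrated at low frequencies while the decoder supplies the rest. On the toy inputs, the ramp puts almost all of its energy below the cutoff and the checkerboard almost all of it above.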

Circularity Check

0 steps flagged

No circularity: architectural design choices and empirical results are independent of inputs

Full rationale

The paper presents DeCo as an empirical framework consisting of a proposed frequency-decoupled architecture (DiT for low-frequency semantics plus lightweight pixel decoder for high-frequency details) and a frequency-aware flow-matching loss. These are introduced as design decisions motivated by intuition about frequency separation, not derived from equations or prior results that reduce back to the same inputs by construction. Reported FID scores (1.62 at 256x256, 2.22 at 512x512) and GenEval score arise from standard benchmark evaluations on ImageNet, which are external to the model definition. No self-citations, uniqueness theorems, fitted parameters renamed as predictions, or ansatzes smuggled via prior work are used to justify the central claims. The derivation chain is therefore self-contained as an engineering proposal validated experimentally.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard diffusion-model assumptions plus the paper-specific premise that frequency components can be cleanly separated and handled by separate modules without significant information loss or optimization conflicts.

axioms (2)

standard math: Standard assumptions of flow-matching or diffusion processes in image generation (e.g., gradual noise addition and reversal). Invoked implicitly when describing the DiT and flow-matching loss.
domain assumption: High-frequency details can be generated reliably by a lightweight decoder conditioned solely on low-frequency semantic features from the DiT. Core design choice stated in the abstract; if false, the decoupling benefit disappears.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT... frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones... DCT... JPEG quantization tables

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
DiT to specialize in modeling low-frequency semantics... 8-tick period never mentioned; no golden-ratio or reciprocal-cost identities

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Coevolving Representations in Joint Image-Feature Diffusion
cs.CV · 2026-04 · unverdicted · novelty 7.0
CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample ...

HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion
cs.CV · 2026-05 · unverdicted · novelty 6.0
HyperDiT achieves FID 1.56 on ImageNet 256x256 in pixel space via hyper-connected cross-scale interactions, cross-attention, SA-RoPE, and VFM registers.

L2P: Unlocking Latent Potential for Pixel Generation
cs.CV · 2026-05 · unverdicted · novelty 6.0
L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.

FREPix: Frequency-Heterogeneous Flow Matching for Pixel-Space Image Generation
cs.CV · 2026-05 · unverdicted · novelty 6.0
FREPix achieves competitive FID scores on ImageNet by decomposing image generation into separate low- and high-frequency paths within a flow matching framework.

CoD-Lite: Real-Time Diffusion-Based Generative Image Compression
cs.CV · 2026-04 · unverdicted · novelty 6.0
CoD-Lite delivers real-time generative image compression via a lightweight convolution-based diffusion codec with compression-oriented pre-training and distillation, achieving substantial bitrate savings.

PixelGen: Improving Pixel Diffusion with Perceptual Supervision
cs.CV · 2026-02 · accept · novelty 6.0
PixelGen augments pixel diffusion with gated perceptual supervision to reach FID 5.11 on ImageNet-256 and GenEval 0.79 in text-to-image, narrowing the gap to latent methods without VAEs.

PixIE: Prompted Pixel-Space Low-Light Image Enhancement
cs.CV · 2026-05 · unverdicted · novelty 5.0
PixIE proposes a pixel-space low-light image enhancement framework using DINO-prompted blocks, spatial-channel compaction, and multi-receptive-field embeddings, reporting PSNR gains of 1.9-15.0% and LPIPS reductions o...

FrequencyBooster: Full-Frequency Modeling for High-Fidelity Pixel Diffusion
cs.CV · 2026-05 · unverdicted · novelty 5.0
FrequencyBooster reports state-of-the-art FID scores of 1.60 at 256x256 and 1.69 at 512x512 for pixel diffusion by using a specialized decoder for full-frequency modeling.

Why Do DiT Editors Drift? Plug-and-Play Low Frequency Alignment in VAE Latent Space
cs.CV · 2026-05 · unverdicted · novelty 4.0
VAE-LFA suppresses semantic drift in multi-turn DiT image editing by low-pass filtering latent discrepancies and aligning low-frequency components to an EMA of previous rounds in VAE space.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · cited by 9 Pith papers · 14 internal anchors

[1]

All are worth words: A vit backbone for diffusion models

Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 22669–22679, 2023. 3

work page 2023
[2]

Improving image gener- ation with better captions

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image gener- ation with better captions. OpenAI Technical Report, 2023. 8

work page 2023
[3]

Pixart-α: Fast training of dif- fusion transformer for photorealistic text-to-image synthesis,

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of dif- fusion transformer for photorealistic text-to-image synthesis,

work page
[4]

Deep compression autoencoder for efficient high-resolution diffusion models.arXiv preprint arXiv:2410.10733, 2024

Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, and Song Han. Deep compression autoencoder for efficient high-resolution diffu- sion models.arXiv preprint arXiv:2410.10733, 2024. 3

work page arXiv 2024
[5]

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Sil- vio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025. 8, 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

arXiv preprint arXiv:2504.07963 (2025)

Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, and Ping Luo. Pixelflow: Pixel-space generative models with flow.arXiv preprint arXiv:2504.07963, 2025. 2, 3, 4, 5, 6, 7, 8, 1

work page arXiv 2025
[7]

Vision transformer adapter for dense predictions

Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. InThe Eleventh International Conference on Learning Representations, 2023. 2

work page 2023
[8]

Deep generative image models using a laplacian pyramid of adver- sarial networks.Advances in neural information processing systems, 28, 2015

Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a laplacian pyramid of adver- sarial networks.Advances in neural information processing systems, 28, 2015. 2

work page 2015
[9]

Diffusion models beat gans on image synthesis.Advances in neural informa- tion processing systems, 34:8780–8794, 2021

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis.Advances in neural informa- tion processing systems, 34:8780–8794, 2021. 1, 2, 3, 7

work page 2021
[10]

Sigmoid- weighted linear units for neural network function approxima- tion in reinforcement learning.Neural networks, 107:3–11,

Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid- weighted linear units for neural network function approxima- tion in reinforcement learning.Neural networks, 107:3–11,

work page
[11]

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis.arXiv preprint arXiv:2403.03206, 2024. 8

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Fluid: Scaling autoregressive text-to-image generative models with continuous tokens

Lijie Fan, Tianhong Li, Siyang Qin, Yuanzhen Li, Chen Sun, Michael Rubinstein, Deqing Sun, Kaiming He, and Yonglong Tian. Fluid: Scaling autoregressive text-to-image generative models with continuous tokens.arXiv preprint arXiv:2410.13863, 2024. 1

work page arXiv 2024
[13]

Seedream 3.0 Technical Report

Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report.arXiv preprint arXiv:2504.11346, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Geneval: An object-focused framework for evaluating text- to-image alignment.Advances in Neural Information Pro- cessing Systems, 36:52132–52152, 2023

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text- to-image alignment.Advances in Neural Information Pro- cessing Systems, 36:52132–52152, 2023. 6, 8

work page 2023
[15]

Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model

Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, et al. Seedream 2.0: A native chinese-english bilin- gual image generation foundation model.arXiv preprint arXiv:2503.07703, 2025. 1

work page internal anchor Pith review arXiv 2025
[16]

Neue methoden zur approximativen integration der differentialgleichungen einer unabh ¨angigen ver¨anderlichen.Z

Karl Heun et al. Neue methoden zur approximativen integration der differentialgleichungen einer unabh ¨angigen ver¨anderlichen.Z. Math. Phys, 45:23–38, 1900. 7, 8

work page 1900
[17]

Gans trained by a two time-scale update rule converge to a local nash equilib- rium.Advances in neural information processing systems, 30, 2017

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium.Advances in neural information processing systems, 30, 2017. 6

work page 2017
[18]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022. 6, 1

work page internal anchor Pith review Pith/arXiv arXiv 2022
[19]

Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 1

work page 2020
[20]

sim- ple diffusion: End-to-end diffusion for high resolution im- ages

Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. sim- ple diffusion: End-to-end diffusion for high resolution im- ages. InInternational Conference on Machine Learning, pages 13213–13232. PMLR, 2023. 2, 3

work page 2023
[21]

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 2024. 6, 8

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

Scalable adaptive computation for iterative generation

Allan Jabri, David Fleet, and Ting Chen. Scalable adap- tive computation for iterative generation.arXiv preprint arXiv:2212.11972, 2022. 7

work page arXiv 2022
[23]

Information technology — digital compression and coding of continuous-tone still images: Requirements and guidelines

Joint Photographic Experts Group. Information technology — digital compression and coding of continuous-tone still images: Requirements and guidelines. Technical Report ITU-T T.81, International Telecommunication Union (ITU- T), 1992. 2, 4, 5

work page 1992
[24]

Progressive growing of gans for improved quality, stability, and variation

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. InInternational Conference on Learning Rep- resentations, 2018. 2

work page 2018
[25]

Analyzing and improving the training dynamics of diffusion models

Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24174–24184, 2024. 3

work page 2024
[26]

Understanding diffu- sion objectives as the elbo with simple data augmentation

Diederik Kingma and Ruiqi Gao. Understanding diffu- sion objectives as the elbo with simple data augmentation. Advances in Neural Information Processing Systems, 36: 65484–65516, 2023. 7

work page 2023
[27]

Understanding diffu- sion objectives as the elbo with simple data augmentation

Diederik Kingma and Ruiqi Gao. Understanding diffu- sion objectives as the elbo with simple data augmentation. Advances in Neural Information Processing Systems, 36: 65484–65516, 2023. 2, 3 7

work page 2023
[28]

Improved precision and recall met- ric for assessing generative models.Advances in neural in- formation processing systems, 32, 2019

Tuomas Kynk ¨a¨anniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall met- ric for assessing generative models.Advances in neural in- formation processing systems, 32, 2019. 6

work page 2019
[29]

Applying guidance in a limited interval improves sample and distribution quality in diffusion models

Tuomas Kynk ¨a¨anniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, and Jaakko Lehtinen. Applying guidance in a limited interval improves sample and distribution quality in diffusion models.arXiv preprint arXiv:2404.07724, 2024. 7, 1

work page arXiv 2024
[30]

Flux.https://github.com/ black-forest-labs/flux, 2024

Black Forest Labs. Flux.https://github.com/ black-forest-labs/flux, 2024. 1, 8

work page 2024
[31]

Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers.arXiv preprint arXiv:2504.10483, 2025

Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers. arXiv preprint arXiv:2504.10483, 2025. 1, 3

work page arXiv 2025
[32]

Back to basics: Let denoising generative models denoise, 2025

Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise, 2025. 2, 3, 6, 7

work page 2025
[33]

Fractal generative models

Tianhong Li, Qinyi Sun, Lijie Fan, and Kaiming He. Fractal generative models.arXiv preprint arXiv:2502.17437, 2025. 3, 7

work page arXiv 2025
[34]

Exploring the effect of high-frequency components in gans training.ACM Trans

Ziqiang Li, Pengfei Xia, Xue Rui, and Bin Li. Exploring the effect of high-frequency components in gans training.ACM Trans. Multimedia Comput. Commun. Appl., 19(5), 2023. 2

work page 2023
[35]

Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, and Weilin Huang. Mogao: An omni foundation model for interleaved multi-modal generation.arXiv preprint arXiv:2505.05472, 2025. 1

work page internal anchor Pith review arXiv 2025
[36]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maxim- ilian Nickel, and Matthew Le. Flow matching for genera- tive modeling. InThe Eleventh International Conference on Learning Representations, 2023. 3

work page 2023
[37]

Decoupled weight de- cay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight de- cay regularization. InInternational Conference on Learning Representations, 2019. 1

work page 2019
[38]

Latent consistency models: Synthesizing high- resolution images with few-step inference, 2024

Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high- resolution images with few-step inference, 2024. 3

work page 2024
[39]

Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Explor- ing flow and diffusion-based generative models with scalable interpolant transformers.arXiv preprint arXiv:2401.08740,

work page arXiv
[40]

Generating images with sparse representations

Charlie Nash, Jacob Menick, Sander Dieleman, and Peter W Battaglia. Generating images with sparse representations. arXiv preprint arXiv:2103.03841, 2021. 6

work page arXiv 2021
[41]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

How do vision transformers work? InInternational Conference on Learning Represen- tations, 2022

Namuk Park and Songkuk Kim. How do vision transformers work? InInternational Conference on Learning Represen- tations, 2022. 2

work page 2022
[43]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 4195–4205,

work page
[44]

1, 3, 4, 5, 6, 7, 8, 9

work page
[45]

Springer Science & Busi- ness Media, 1992

William B Pennebaker and Joan L Mitchell.JPEG: Still im- age data compression standard. Springer Science & Busi- ness Media, 1992. 2

work page 1992
[46]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1, 3

work page 2022
[47]

U- net: Convolutional networks for biomedical image segmen- tation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- net: Convolutional networks for biomedical image segmen- tation. InInternational Conference on Medical image com- puting and computer-assisted intervention, pages 234–241. Springer, 2015. 9

work page 2015
[48]

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in Neural Information Processing Systems, 29, 2016.

[49]

Chenyang Si, Weihao Yu, Pan Zhou, Yichen Zhou, Xinchao Wang, and Shuicheng Yan. Inception transformer. Advances in Neural Information Processing Systems, 35:23495–23509,

[50]

Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, and Aliaksandr Siarohin. Improving the diffusability of autoencoders. In Forty-second International Conference on Machine Learning, 2025.

[51]

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.

[52]

Tianhui Song, Weixin Feng, Shuai Wang, Xubin Li, Tiezheng Ge, Bo Zheng, and Limin Wang. Dmm: Building a versatile image generation model via distillation-based model merging. arXiv preprint arXiv:2504.12364, 2025.

[53]

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063,

[54]

Jiayan Teng, Wendi Zheng, Ming Ding, Wenyi Hong, Jianqiao Wangni, Zhuoyi Yang, and Jie Tang. Relay diffusion: Unifying diffusion process across resolutions for image synthesis. arXiv preprint arXiv:2309.03350, 2023.

[55]

Yao Teng, Yue Wu, Han Shi, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, and Xihui Liu. Dim: Diffusion mamba for efficient high-resolution image synthesis. arXiv preprint arXiv:2405.14224, 2024.

[56]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[57]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

[58]

Michael Tschannen, André Susano Pinto, and Alexander Kolesnikov. Jetformer: An autoregressive generative model of raw images and text. arXiv preprint arXiv:2411.19722, 2024.

[59]

Haohan Wang, Xindi Wu, Zeyi Huang, and Eric P Xing. High-frequency component helps explain the generalization of convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8684–8694, 2020.

[60]

Shuai Wang, Zexian Li, Tianhui Song, Xubin Li, Tiezheng Ge, Bo Zheng, and Limin Wang. Exploring dcn-like architecture for fast image generation with arbitrary resolution. Advances in Neural Information Processing Systems, 37:87959–87977, 2024.

[61]

Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, and Limin Wang. Pixnerd: Pixel neural field diffusion. arXiv preprint arXiv:2507.23268, 2025.

[62]

Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. Ddt: Decoupled diffusion transformer. arXiv preprint arXiv:2504.05741, 2025.

[63]

Zhe Wang, Ziqiu Chi, Yanbing Zhang, et al. Fregan: Exploiting frequency components for training gans under limited data. Advances in Neural Information Processing Systems, 35:33387–33399, 2022.

[64]

Zidong Wang, Lei Bai, Xiangyu Yue, Wanli Ouyang, and Yiyuan Zhang. Native-resolution image synthesis. arXiv preprint arXiv:2506.03131, 2025.

[65]

Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, and Zheng Liu. OmniGen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.1887...

[66]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[67]

Jingfeng Yao and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. arXiv preprint arXiv:2501.01423, 2025.

[68]

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024.

[69]

Xiaoyu Yue, Zidong Wang, Zeyu Lu, Shuyang Sun, Meng Wei, Wanli Ouyang, Lei Bai, and Luping Zhou. Diffusion models need visual priors for image generation. arXiv preprint arXiv:2410.08531, 2024.

[70]

Shuangfei Zhai, Ruixiang Zhang, Preetum Nakkiran, David Berthelot, Jiatao Gu, Huangjie Zheng, Tianrong Chen, Miguel Angel Bautista, Navdeep Jaitly, and Josh Susskind. Normalizing flows are capable generative models. arXiv preprint arXiv:2412.06329, 2024.

[71]

Mingyuan Zhou, Huangjie Zheng, Zhendong Wang, Mingzhang Yin, and Hai Huang. Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation. In Forty-first International Conference on Machine Learning, 2024.
