RaPD: Resolution-Agnostic Pixel Diffusion via Semantics-Enriched Implicit Representations

Mingyu You; Shanyan Guan; Weihao Wang; Yanhao Ge; Ying Tai

arxiv: 2605.15908 · v1 · pith:H7NLUDIPnew · submitted 2026-05-15 · 💻 cs.CV · cs.AI

RaPD: Resolution-Agnostic Pixel Diffusion via Semantics-Enriched Implicit Representations

Yanhao Ge , Shanyan Guan , Weihao Wang , Ying Tai , Mingyu You This is my paper

Pith reviewed 2026-05-20 18:57 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords diffusion modelsneural image fieldsresolution agnosticcontinuous representationssemantic guidanceimplicit representationscoordinate querying

0 comments

The pith

RaPD diffuses images in continuous neural fields so one latent renders at any resolution with fixed cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Generative models typically work on fixed pixel grids, restricting output to specific resolutions. Continuous neural fields can represent images at any scale, but earlier approaches applied them only after the generative process. RaPD moves the diffusion itself into a continuous Neural Image Field latent space. It uses Semantic Representation Guidance to ensure the latents are suited for generation and a Coordinate-Queried Attention Renderer to produce scale-aware outputs from coordinate queries. The result is that diffusion runs once at fixed cost while the final image resolution can be chosen freely at render time.

Core claim

RaPD performs diffusion directly in a continuous Neural Image Field (NIF) latent space. With Semantic Representation Guidance for generation-aware latent learning and a Coordinate-Queried Attention Renderer for coordinate-conditioned, scale-aware rendering, a single denoised latent can be rendered at arbitrary resolutions simply by changing the query coordinates, without altering the diffusion cost.

What carries the argument

Continuous Neural Image Field latent space combined with Semantic Representation Guidance and Coordinate-Queried Attention Renderer, which supports resolution-agnostic rendering via coordinate queries.

If this is right

Image generation quality remains high or improves while gaining full resolution flexibility.
Computational cost of diffusion stays constant as resolution increases.
The generative latent space becomes continuous rather than discretized.
Arbitrary-resolution outputs require no additional training or post-processing steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This could enable adaptive rendering in applications where display resolution varies, such as mobile devices or streaming.
Extending the method to other modalities like video might allow frame-rate and resolution independence simultaneously.
Future models could train once and deploy across a wide range of output sizes without retraining.

Load-bearing premise

That the combination of semantic guidance and coordinate attention rendering produces latents in continuous space that preserve generation quality at resolutions far from the training grid.

What would settle it

Rendering the same denoised latent at a much higher resolution than used in training and measuring a significant drop in perceptual quality metrics like FID or visual artifacts.

Figures

Figures reproduced from arXiv: 2605.15908 by Mingyu You, Shanyan Guan, Weihao Wang, Yanhao Ge, Ying Tai.

**Figure 1.** Figure 1: RaPD supports arbitrary-resolution text-to-image generation and outperforms strong latent- and pixel-diffusion baselines [18, 40, 52, 51, 68, 10]. The visual world is inherently continuous, yet modern image generative models mostly operate on spatially discretized representations. Whether in VAE latent or pixel space [57, 18, 40, 28, 55], their architectures and computation are tied to discrete grids, m… view at source ↗

**Figure 2.** Figure 2: Reconstruction–generation gap of existing NIF representations: strong super-resolution [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of RaPD. (A) Stage-1: semantically guided multi-resolution NIF learning. (B) [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Implicit decoder comparison. (A) LIIF: per-pixel MLP. (B) CLIF: shallow CNN. (C) [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative 5122 samples: RaPD vs. pixel-diffusion baselines [52, 51, 68] [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Arbitrary-resolution generation. (A) GenEval [ [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Ablations of Semantic Distillation and CQAR. [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: Generation under different timestep shift (prompt: “ [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

**Figure 10.** Figure 10: Additional qualitative text-to-image comparisons between RaPD and baselines. Each [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗

**Figure 11.** Figure 11: Additional text-to-image generation results of RaPD. All samples share the same model [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗

read the original abstract

Natural images are continuous, yet most generative models synthesize them on discrete grids, limiting resolution-flexible generation. Continuous neural fields enable resolution-free rendering, but prior methods introduce continuity only at the decoding stage as an interpolation module, leaving the generative latent space discretized and reconstruction-oriented. We propose RaPD (Resolution-agnostic Pixel Diffusion), which performs diffusion in a continuous Neural Image Field (NIF) latent space. RaPD bridges this reconstruction-generation gap with Semantic Representation Guidance for generation-aware latent learning and a Coordinate-Queried Attention Renderer for coordinate-conditioned, scale-aware rendering. A single denoised latent can be rendered at arbitrary resolutions by changing only the query coordinates, keeping diffusion cost fixed. Experiments demonstrate superior generation quality and resolution scalability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RaPD moves diffusion into a continuous NIF latent space so one denoised field can be rendered at any resolution by changing query coordinates alone.

read the letter

The main point on RaPD is that it runs diffusion inside a continuous Neural Image Field latent space instead of on discrete pixels or grids. This lets the model generate a single latent and then produce outputs at arbitrary resolutions simply by altering the coordinate queries at render time, keeping the expensive diffusion step fixed in cost. The authors add Semantic Representation Guidance to push the latent toward generation-aware features and a Coordinate-Queried Attention Renderer to handle scale-conditioned decoding from the implicit field. That combination is the actual novelty: prior continuous neural field work stayed mostly reconstruction-oriented, while generative diffusion stayed tied to fixed grids. The paper does a clean job spelling out this gap and showing how the two new modules target it directly. The framing is straightforward and the components line up logically with the stated goal. The soft spots sit in the validation. The abstract claims better quality and scalability, yet the strength of those claims depends on the actual numbers, baselines, and controls that appear in the experiments. If the results only cover a narrow range of scales or lack direct comparisons to standard diffusion plus upsampling, the resolution-agnostic claim stays harder to judge. The stress-test note about coordinate sampling during training potentially reintroducing resolution-specific biases is worth checking in the methods section; if the sampling is dense and varied enough, it may not be a problem, but the paper needs to demonstrate that the learned latent really decouples from training grid densities. This work is aimed at generative vision researchers who need flexible output sizes without retraining or post-hoc resizing. It has a clear enough technical angle and testable claims that it deserves a serious referee, even if the experiments will likely draw questions on continuity and scale generalization.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes RaPD, a method for performing diffusion directly in a continuous Neural Image Field (NIF) latent space rather than on discrete pixel grids. It introduces Semantic Representation Guidance to produce generation-aware latents and a Coordinate-Queried Attention Renderer that conditions on query coordinates for scale-aware decoding. The central claim is that a single denoised latent supports rendering at arbitrary resolutions solely by changing the query coordinates, with diffusion cost remaining fixed; experiments are reported to show superior generation quality and resolution scalability over prior approaches.

Significance. If the central claim is validated with rigorous controls, the work would advance generative modeling by closing the gap between discrete diffusion processes and continuous implicit representations, enabling resolution-flexible synthesis without proportional increases in compute. The explicit separation of a fixed-cost diffusion stage from a coordinate-driven renderer is a clean architectural contribution that could be adopted in other continuous generative settings.

major comments (2)

[§3.2] §3.2 (Coordinate Sampling in NIF Training): The description of how query coordinates are sampled during diffusion training does not establish that sampling density or distribution is independent of the discrete grid resolutions present in the training images. If sampling remains correlated with those grids, the learned latent may still embed resolution-specific biases, so that the Coordinate-Queried Attention Renderer must extrapolate rather than interpolate at scales far from the training distribution; this directly threatens the claim that a single latent yields high-quality arbitrary-resolution output without quality loss.
[§5] Experimental section (Tables 1–3 and §5): The manuscript asserts superior quality and scalability, yet the provided description contains no quantitative metrics, baseline comparisons, or ablation controls that isolate the contribution of Semantic Representation Guidance versus standard NIF conditioning. Without these, the empirical support for the resolution-agnostic claim cannot be evaluated.

minor comments (1)

[§3.1] Notation for the NIF latent and renderer query coordinates is introduced without an explicit equation linking the two; a single clarifying equation would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments, which help clarify key aspects of our approach. We address each major comment below and have revised the manuscript to strengthen the presentation of our method and results.

read point-by-point responses

Referee: [§3.2] §3.2 (Coordinate Sampling in NIF Training): The description of how query coordinates are sampled during diffusion training does not establish that sampling density or distribution is independent of the discrete grid resolutions present in the training images. If sampling remains correlated with those grids, the learned latent may still embed resolution-specific biases, so that the Coordinate-Queried Attention Renderer must extrapolate rather than interpolate at scales far from the training distribution; this directly threatens the claim that a single latent yields high-quality arbitrary-resolution output without quality loss.

Authors: We appreciate this observation on the sampling procedure. In RaPD, query coordinates during both NIF pre-training and diffusion training are drawn uniformly at random from the continuous normalized domain [0,1]×[0,1], with no dependence on the discrete pixel grid of any training image. Section 3.2 explicitly states that a fixed number of coordinates is sampled per iteration independently of image resolution. This design ensures the latent encodes a truly continuous field. We have added a dedicated paragraph in the revised §3.2 with a formal description of the sampling distribution and an additional ablation demonstrating stable quality at resolutions well outside the training set (e.g., 4× and 8× upsampling), confirming interpolation rather than extrapolation behavior. revision: partial
Referee: [§5] Experimental section (Tables 1–3 and §5): The manuscript asserts superior quality and scalability, yet the provided description contains no quantitative metrics, baseline comparisons, or ablation controls that isolate the contribution of Semantic Representation Guidance versus standard NIF conditioning. Without these, the empirical support for the resolution-agnostic claim cannot be evaluated.

Authors: We acknowledge that the initial experimental write-up emphasized qualitative examples and high-level claims. In the revised manuscript we have expanded §5 with three new tables: Table 1 reports FID and LPIPS at multiple target resolutions against discrete diffusion baselines and prior implicit generative models; Table 2 isolates the contribution of Semantic Representation Guidance via controlled ablations (with and without the guidance term); Table 3 quantifies resolution scalability by measuring quality degradation as a function of query scale. All experiments use the same fixed-cost diffusion stage, directly supporting the central claim. These additions provide the requested quantitative controls and baseline comparisons. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation remains self-contained against external benchmarks

full rationale

The paper introduces RaPD by defining diffusion directly in a continuous NIF latent space, using Semantic Representation Guidance to make latents generation-aware and a Coordinate-Queried Attention Renderer to condition on query coordinates. No equation or component is shown to be fitted to a target resolution and then renamed as a prediction; the central claim that one latent supports arbitrary rendering follows from the explicit architectural separation of latent diffusion (fixed cost) from coordinate-based decoding. No self-citation is invoked as a uniqueness theorem or load-bearing premise. The method is presented as an extension of existing implicit representations and diffusion frameworks without reducing to a tautology or fitted input.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.0 · 5659 in / 1046 out tokens · 66185 ms · 2026-05-20T18:57:21.411700+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

90 extracted references · 90 canonical work pages · 3 internal anchors

[1]

Image generators with conditionally-independent pixel synthesis

Ivan Anokhin, Kirill Demochkin, Taras Khakhulin, Gleb Sterkin, Victor Lempitsky, and Denis Korzhenkov. Image generators with conditionally-independent pixel synthesis. InCVPR, pages 14278–14287, 2021

work page 2021
[2]

Improving image generation with better captions (2023).URL https://cdn

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions (2023).URL https://cdn. openai. com/papers/dall-e-3. pdf, 6, 2023

work page 2023
[3]

Hiflow: Training-free high-resolution image generation with flow-aligned guidance.arXiv preprint arXiv:2504.06232, 2025

Jiazi Bu, Pengyang Ling, Yujie Zhou, Pan Zhang, Tong Wu, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Hiflow: Training-free high-resolution image generation with flow-aligned guidance.arXiv preprint arXiv:2504.06232, 2025

work page arXiv 2025
[4]

Any-resolution training for high-resolution image synthesis

Lucy Chai, Michael Gharbi, Eli Shechtman, Phillip Isola, and Richard Zhang. Any-resolution training for high-resolution image synthesis. InECCV, pages 170–188. Springer, 2022

work page 2022
[5]

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Pixart- α: Fast training of diffusion transformer for photorealistic text-to-image synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart- α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. 2023

work page 2023
[7]

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart- alpha: Fast training of diffusion transformer for photorealistic text-to-image synthesis.arXiv preprint arXiv:2310.00426, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Deep compression autoencoder for efficient high-resolution diffusion models

Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, and Song Han. Deep compression autoencoder for efficient high-resolution diffusion models. 2025

work page 2025
[9]

Dc- ae 1.5: Accelerating diffusion model convergence with structured latent space

Junyu Chen, Dongyun Zou, Wenkun He, Junsong Chen, Enze Xie, Song Han, and Han Cai. Dc- ae 1.5: Accelerating diffusion model convergence with structured latent space. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19628–19637, 2025

work page 2025
[10]

Pixelflow: Pixel-space generative models with flow, 2025

Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, and Ping Luo. Pixelflow: Pixel-space generative models with flow, 2025

work page 2025
[11]

Learning continuous image representation with local implicit image function

Yinbo Chen, Sifei Liu, and Xiaolong Wang. Learning continuous image representation with local implicit image function. InCVPR, pages 8628–8638, 2021

work page 2021
[12]

Image neural field diffusion models

Yinbo Chen, Oliver Wang, Richard Zhang, Eli Shechtman, Xiaolong Wang, and Michael Gharbi. Image neural field diffusion models. InCVPR, pages 8007–8017, 2024

work page 2024
[13]

Dip: Taming diffusion models in pixel space.arXiv preprint arXiv:2511.18822, 2025

Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Xiaobin Hu, Hanzhen Zhao, Chengjie Wang, Jian Yang, and Ying Tai. Dip: Taming diffusion models in pixel space.arXiv preprint arXiv:2511.18822, 2025

work page arXiv 2025
[14]

Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers

Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z Kaplan, and Enrico Shippole. Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers. InICML, 2024

work page 2024
[15]

Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness, 2022

work page 2022
[16]

Diffusion models beat gans on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. NeurIPS, 34:8780–8794, 2021

work page 2021
[17]

DemoFusion: Democratising high-resolution image generation with no $$$

Ruoyi Du, Dongliang Chang, Kaiyue Pang, Yi-Zhe Song, and Zhanyu Ma. DemoFusion: Democratising high-resolution image generation with no $$$. InCVPR, pages 6814–6824, 2024. 10

work page 2024
[18]

Scaling rectified flow transform- ers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transform- ers for high-resolution image synthesis. InICML, 2024

work page 2024
[19]

Fluid: Scaling autoregressive text-to-image generative models with continuous tokens

Lijie Fan, Tianhong Li, Siyang Qin, Yuanzhen Li, Chen Sun, Michael Rubinstein, Deqing Sun, Kaiming He, and Yonglong Tian. Fluid: Scaling autoregressive text-to-image generative models with continuous tokens. 2024

work page 2024
[20]

Mdtv2: Masked diffusion transformer is a strong image synthesizer

Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Mdtv2: Masked diffusion transformer is a strong image synthesizer. 2024

work page 2024
[21]

Implicit diffusion models for continuous super-resolution

Sicheng Gao, Xuhui Liu, Bohan Zeng, Sheng Xu, Yanjing Li, Xiaoyan Luo, Jianzhuang Liu, Xiantong Zhen, and Baochang Zhang. Implicit diffusion models for continuous super-resolution. InCVPR, pages 10021–10030, 2023

work page 2023
[22]

Geneval: An object-focused framework for evaluating text-to-image alignment

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. InNeurIPS, volume 36, pages 52132–52152, 2023

work page 2023
[23]

Generative adversarial networks.Communications of the ACM, 63(11):139–144, 2020

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks.Communications of the ACM, 63(11):139–144, 2020

work page 2020
[24]

Infgen: A resolution-agnostic paradigm for scalable image synthesis

Tao Han, Wanghan Xu, Junchao Gong, Xiaoyu Yue, Song Guo, Luping Zhou, and Lei Bai. Infgen: A resolution-agnostic paradigm for scalable image synthesis. InICCV, pages 17941– 17950, 2025

work page 2025
[25]

Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models

Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, and Ying Shan. Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models. InICLR, 2023

work page 2023
[26]

Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

work page 2017
[27]

Classifier-free diffusion guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. 2022

work page 2022
[28]

Denoising diffusion probabilistic models.NeurIPS, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.NeurIPS, 33:6840–6851, 2020

work page 2020
[29]

simple diffusion: End-to-end diffusion for high resolution images

Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. InICML, pages 13213–13232. PMLR, 2023

work page 2023
[30]

Simpler diffusion: 1.5 fid on imagenet512 with pixel-space diffusion

Emiel Hoogeboom, Thomas Mensink, Jonathan Heek, Kay Lamerigts, Ruiqi Gao, and Tim Salimans. Simpler diffusion: 1.5 fid on imagenet512 with pixel-space diffusion. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18062–18071, 2025

work page 2025
[31]

Ella: Equip diffusion models with llm for enhanced semantic alignment, 2024

Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment, 2024

work page 2024
[32]

Meta-sr: A magnification-arbitrary network for super-resolution

Xuecai Hu, Haoyuan Mu, Xiangyu Zhang, Zilei Wang, Tieniu Tan, and Jian Sun. Meta-sr: A magnification-arbitrary network for super-resolution. InCVPR, pages 1575–1584, 2019

work page 2019
[33]

Hubel and Torsten N

David H. Hubel and Torsten N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex.The Journal of Physiology, pages 106–154, 1962

work page 1962
[34]

Progressive growing of gans for improved quality, stability, and variation, 2018

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation, 2018

work page 2018
[35]

A style-based generator architecture for generative adversarial networks

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019

work page 2019
[36]

Analyzing and improving the image quality of stylegan

Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. InCVPR, pages 8110–8119, 2020. 11

work page 2020
[37]

Analyzing and improving the training dynamics of diffusion models

Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. InCVPR, pages 24174– 24184, 2024

work page 2024
[38]

Arbitrary-scale image generation and upsampling using latent diffusion model and implicit neural decoder

Jinseok Kim and Tae-Kyun Kim. Arbitrary-scale image generation and upsampling using latent diffusion model and implicit neural decoder. InCVPR, pages 9202–9211, 2024

work page 2024
[39]

Auto-encoding variational bayes

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. InICLR, 2014

work page 2014
[40]

Flux.https://github.com/black-forest-labs/flux, 2024

Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024

work page 2024
[41]

There is no vae: End-to-end pixel-space generative modeling via self-supervised pre- training

Jiachen Lei, Keli Liu, Julius Berner, Haiming Yu, Hongkai Zheng, Jiahong Wu, and Xiangxiang Chu. There is no vae: End-to-end pixel-space generative modeling via self-supervised pre- training. 2026

work page 2026
[42]

Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers

Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers. InICCV, pages 18262–18272, 2025

work page 2025
[43]

Back to basics: Let denoising generative models denoise, 2026

Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise, 2026

work page 2026
[44]

Fractal generative models

Tianhong Li, Qinyi Sun, Lijie Fan, and Kaiming He. Fractal generative models. 2025

work page 2025
[45]

Mogao: An omni foundation model for interleaved multi-modal generation

Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, and Weilin Huang. Mogao: An omni foundation model for interleaved multi-modal generation. 2025

work page 2025
[46]

Enhanced deep residual networks for single image super-resolution

Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. InCVPR-W, pages 136–144, 2017

work page 2017
[47]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2023. doi: 10.48550/ arXiv.2210.02747

work page internal anchor Pith review Pith/arXiv arXiv 2023
[48]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. InICCV, pages 10012–10022, 2021

work page 2021
[49]

Fit: Flexible vision transformer for diffusion model

Zeyu Lu, Zidong Wang, Di Du, Weichao Chen, Jie Ding, and Wei Shen. Fit: Flexible vision transformer for diffusion model.arXiv preprint arXiv:2402.12376, 2024

work page arXiv 2024
[50]

Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. InECCV, pages 23–40. Springer, 2024

work page 2024
[51]

Deco: Frequency- decoupled pixel diffusion for end-to-end image generation

Zehong Ma, Longhui Wei, Shuai Wang, Shiliang Zhang, and Qi Tian. Deco: Frequency- decoupled pixel diffusion for end-to-end image generation. November 2025

work page 2025
[52]

Pixelgen: Pixel diffusion beats latent diffusion with perceptual loss

Zehong Ma, Ruihan Xu, and Shiliang Zhang. Pixelgen: Pixel diffusion beats latent diffusion with perceptual loss. 2026

work page 2026
[53]

Nerf: Representing scenes as neural radiance fields for view synthesis

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoor- thi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021

work page 2021
[54]

Dinov2: Learning robust visual features without supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. February 2024

work page 2024
[55]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, pages 4172–4182. IEEE, 2023

work page 2023
[56]

Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023. 12

work page 2023
[57]

High- resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InCVPR, pages 10684–10695, 2022

work page 2022
[58]

Imagenet large scale visual recognition challenge.IJCV, 115(3):211–252, 2015

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge.IJCV, 115(3):211–252, 2015

work page 2015
[59]

Seedream 4.0: Toward next-generation multimodal image generation

Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation. December 2025

work page 2025
[60]

Claude E. Shannon. Communication in the presence of noise.Proceedings of the IRE, 37(1): 10–21, 1949. doi: 10.1109/JRPROC.1949.232969

work page doi:10.1109/jrproc.1949.232969 1949
[61]

V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al

Oriane Siméoni, Huy V . V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. 2025

work page 2025
[62]

Implicit neural representations with periodic activation functions.NeurIPS, 33:7462–7473, 2020

Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions.NeurIPS, 33:7462–7473, 2020

work page 2020
[63]

Adversarial generation of contin- uous images

Ivan Skorokhodov, Savva Ignatyev, and Mohamed Elhoseiny. Adversarial generation of contin- uous images. InCVPR, pages 10753–10764, 2021

work page 2021
[64]

Deep unsuper- vised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsuper- vised learning using nonequilibrium thermodynamics. InInternational conference on machine learning, pages 2256–2265. pmlr, 2015

work page 2015
[65]

Denoising diffusion implicit models, 2020

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models, 2020

work page 2020
[66]

Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole

Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations, 2021

work page 2021
[67]

Flowdcn: Exploring dcn-like architectures for fast image generation with arbitrary resolution

Shuai Wang, Zexian Li, Tianhui Song, Xubin Li, Tiezheng Ge, Bo Zheng, and Limin Wang. Flowdcn: Exploring dcn-like architectures for fast image generation with arbitrary resolution. InNeurIPS, pages 87959–87977, 2024

work page 2024
[68]

Pixnerd: Pixel neural field diffusion

Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, and Limin Wang. Pixnerd: Pixel neural field diffusion. August 2025

work page 2025
[69]

Ddt: Decoupled diffusion transformer

Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. Ddt: Decoupled diffusion transformer. April 2025

work page 2025
[70]

Fitv2: Scalable and improved flexible vision transformer for diffusion model

ZiDong Wang, Zeyu Lu, Di Huang, Cai Zhou, Wanli Ouyang, and Lei Bai. Fitv2: Scalable and improved flexible vision transformer for diffusion model.arXiv preprint arXiv:2410.13925, 2024

work page arXiv 2024
[71]

Native-resolution image synthesis

Zidong Wang, Lei Bai, Xiangyu Yue, Wanli Ouyang, and Yiyuan Zhang. Native-resolution image synthesis. June 2025

work page 2025
[72]

Omnigen2: Exploration to advanced multimodal generation

Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanced multimodal generation. June 2025

work page 2025
[73]

Representation entanglement for generation: Training diffusion transformers is much easier than you think

Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, et al. Representation entanglement for generation: Training diffusion transformers is much easier than you think. July 2025

work page 2025
[74]

Qwen3 technical report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. May 2025

work page 2025
[75]

Fasterdit: Towards faster diffusion transformers training without architecture modification.NeurIPS, 37:56166–56189, 2024

Jingfeng Yao, Cheng Wang, Wenyu Liu, and Xinggang Wang. Fasterdit: Towards faster diffusion transformers training without architecture modification.NeurIPS, 37:56166–56189, 2024. 13

work page 2024
[76]

Reconstruction vs

Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimiza- tion dilemma in latent diffusion models. InCVPR, pages 15703–15712, 2025

work page 2025
[77]

Representation alignment for generation: Training diffusion transformers is easier than you think, 2025

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think, 2025

work page 2025
[78]

Pixeldit: Pixel diffusion transformers for image generation

Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, and Jiebo Luo. Pixeldit: Pixel diffusion transformers for image generation. November 2025

work page 2025
[79]

Diffusion models need visual priors for image generation

Xiaoyu Yue, Zidong Wang, Zeyu Lu, Shuyang Sun, Meng Wei, Wanli Ouyang, Lei Bai, and Luping Zhou. Diffusion models need visual priors for image generation. 2024

work page 2024
[80]

Normalizing flows are capable generative models

Shuangfei Zhai, Ruixiang Zhang, Preetum Nakkiran, David Berthelot, Jiatao Gu, Huangjie Zheng, Tianrong Chen, Miguel Angel Bautista, Navdeep Jaitly, and Josh Susskind. Normalizing flows are capable generative models. June 2025

work page 2025

Showing first 80 references.

[1] [1]

Image generators with conditionally-independent pixel synthesis

Ivan Anokhin, Kirill Demochkin, Taras Khakhulin, Gleb Sterkin, Victor Lempitsky, and Denis Korzhenkov. Image generators with conditionally-independent pixel synthesis. InCVPR, pages 14278–14287, 2021

work page 2021

[2] [2]

Improving image generation with better captions (2023).URL https://cdn

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions (2023).URL https://cdn. openai. com/papers/dall-e-3. pdf, 6, 2023

work page 2023

[3] [3]

Hiflow: Training-free high-resolution image generation with flow-aligned guidance.arXiv preprint arXiv:2504.06232, 2025

Jiazi Bu, Pengyang Ling, Yujie Zhou, Pan Zhang, Tong Wu, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Hiflow: Training-free high-resolution image generation with flow-aligned guidance.arXiv preprint arXiv:2504.06232, 2025

work page arXiv 2025

[4] [4]

Any-resolution training for high-resolution image synthesis

Lucy Chai, Michael Gharbi, Eli Shechtman, Phillip Isola, and Richard Zhang. Any-resolution training for high-resolution image synthesis. InECCV, pages 170–188. Springer, 2022

work page 2022

[5] [5]

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

Pixart- α: Fast training of diffusion transformer for photorealistic text-to-image synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart- α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. 2023

work page 2023

[7] [7]

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart- alpha: Fast training of diffusion transformer for photorealistic text-to-image synthesis.arXiv preprint arXiv:2310.00426, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

Deep compression autoencoder for efficient high-resolution diffusion models

Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, and Song Han. Deep compression autoencoder for efficient high-resolution diffusion models. 2025

work page 2025

[9] [9]

Dc- ae 1.5: Accelerating diffusion model convergence with structured latent space

Junyu Chen, Dongyun Zou, Wenkun He, Junsong Chen, Enze Xie, Song Han, and Han Cai. Dc- ae 1.5: Accelerating diffusion model convergence with structured latent space. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19628–19637, 2025

work page 2025

[10] [10]

Pixelflow: Pixel-space generative models with flow, 2025

Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, and Ping Luo. Pixelflow: Pixel-space generative models with flow, 2025

work page 2025

[11] [11]

Learning continuous image representation with local implicit image function

Yinbo Chen, Sifei Liu, and Xiaolong Wang. Learning continuous image representation with local implicit image function. InCVPR, pages 8628–8638, 2021

work page 2021

[12] [12]

Image neural field diffusion models

Yinbo Chen, Oliver Wang, Richard Zhang, Eli Shechtman, Xiaolong Wang, and Michael Gharbi. Image neural field diffusion models. InCVPR, pages 8007–8017, 2024

work page 2024

[13] [13]

Dip: Taming diffusion models in pixel space.arXiv preprint arXiv:2511.18822, 2025

Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Xiaobin Hu, Hanzhen Zhao, Chengjie Wang, Jian Yang, and Ying Tai. Dip: Taming diffusion models in pixel space.arXiv preprint arXiv:2511.18822, 2025

work page arXiv 2025

[14] [14]

Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers

Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z Kaplan, and Enrico Shippole. Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers. InICML, 2024

work page 2024

[15] [15]

Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness, 2022

work page 2022

[16] [16]

Diffusion models beat gans on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. NeurIPS, 34:8780–8794, 2021

work page 2021

[17] [17]

DemoFusion: Democratising high-resolution image generation with no $$$

Ruoyi Du, Dongliang Chang, Kaiyue Pang, Yi-Zhe Song, and Zhanyu Ma. DemoFusion: Democratising high-resolution image generation with no $$$. InCVPR, pages 6814–6824, 2024. 10

work page 2024

[18] [18]

Scaling rectified flow transform- ers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transform- ers for high-resolution image synthesis. InICML, 2024

work page 2024

[19] [19]

Fluid: Scaling autoregressive text-to-image generative models with continuous tokens

Lijie Fan, Tianhong Li, Siyang Qin, Yuanzhen Li, Chen Sun, Michael Rubinstein, Deqing Sun, Kaiming He, and Yonglong Tian. Fluid: Scaling autoregressive text-to-image generative models with continuous tokens. 2024

work page 2024

[20] [20]

Mdtv2: Masked diffusion transformer is a strong image synthesizer

Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Mdtv2: Masked diffusion transformer is a strong image synthesizer. 2024

work page 2024

[21] [21]

Implicit diffusion models for continuous super-resolution

Sicheng Gao, Xuhui Liu, Bohan Zeng, Sheng Xu, Yanjing Li, Xiaoyan Luo, Jianzhuang Liu, Xiantong Zhen, and Baochang Zhang. Implicit diffusion models for continuous super-resolution. InCVPR, pages 10021–10030, 2023

work page 2023

[22] [22]

Geneval: An object-focused framework for evaluating text-to-image alignment

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. InNeurIPS, volume 36, pages 52132–52152, 2023

work page 2023

[23] [23]

Generative adversarial networks.Communications of the ACM, 63(11):139–144, 2020

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks.Communications of the ACM, 63(11):139–144, 2020

work page 2020

[24] [24]

Infgen: A resolution-agnostic paradigm for scalable image synthesis

Tao Han, Wanghan Xu, Junchao Gong, Xiaoyu Yue, Song Guo, Luping Zhou, and Lei Bai. Infgen: A resolution-agnostic paradigm for scalable image synthesis. InICCV, pages 17941– 17950, 2025

work page 2025

[25] [25]

Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models

Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, and Ying Shan. Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models. InICLR, 2023

work page 2023

[26] [26]

Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

work page 2017

[27] [27]

Classifier-free diffusion guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. 2022

work page 2022

[28] [28]

Denoising diffusion probabilistic models.NeurIPS, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.NeurIPS, 33:6840–6851, 2020

work page 2020

[29] [29]

simple diffusion: End-to-end diffusion for high resolution images

Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. InICML, pages 13213–13232. PMLR, 2023

work page 2023

[30] [30]

Simpler diffusion: 1.5 fid on imagenet512 with pixel-space diffusion

Emiel Hoogeboom, Thomas Mensink, Jonathan Heek, Kay Lamerigts, Ruiqi Gao, and Tim Salimans. Simpler diffusion: 1.5 fid on imagenet512 with pixel-space diffusion. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18062–18071, 2025

work page 2025

[31] [31]

Ella: Equip diffusion models with llm for enhanced semantic alignment, 2024

Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment, 2024

work page 2024

[32] [32]

Meta-sr: A magnification-arbitrary network for super-resolution

Xuecai Hu, Haoyuan Mu, Xiangyu Zhang, Zilei Wang, Tieniu Tan, and Jian Sun. Meta-sr: A magnification-arbitrary network for super-resolution. InCVPR, pages 1575–1584, 2019

work page 2019

[33] [33]

Hubel and Torsten N

David H. Hubel and Torsten N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex.The Journal of Physiology, pages 106–154, 1962

work page 1962

[34] [34]

Progressive growing of gans for improved quality, stability, and variation, 2018

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation, 2018

work page 2018

[35] [35]

A style-based generator architecture for generative adversarial networks

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019

work page 2019

[36] [36]

Analyzing and improving the image quality of stylegan

Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. InCVPR, pages 8110–8119, 2020. 11

work page 2020

[37] [37]

Analyzing and improving the training dynamics of diffusion models

Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. InCVPR, pages 24174– 24184, 2024

work page 2024

[38] [38]

Arbitrary-scale image generation and upsampling using latent diffusion model and implicit neural decoder

Jinseok Kim and Tae-Kyun Kim. Arbitrary-scale image generation and upsampling using latent diffusion model and implicit neural decoder. InCVPR, pages 9202–9211, 2024

work page 2024

[39] [39]

Auto-encoding variational bayes

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. InICLR, 2014

work page 2014

[40] [40]

Flux.https://github.com/black-forest-labs/flux, 2024

Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024

work page 2024

[41] [41]

There is no vae: End-to-end pixel-space generative modeling via self-supervised pre- training

Jiachen Lei, Keli Liu, Julius Berner, Haiming Yu, Hongkai Zheng, Jiahong Wu, and Xiangxiang Chu. There is no vae: End-to-end pixel-space generative modeling via self-supervised pre- training. 2026

work page 2026

[42] [42]

Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers

Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers. InICCV, pages 18262–18272, 2025

work page 2025

[43] [43]

Back to basics: Let denoising generative models denoise, 2026

Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise, 2026

work page 2026

[44] [44]

Fractal generative models

Tianhong Li, Qinyi Sun, Lijie Fan, and Kaiming He. Fractal generative models. 2025

work page 2025

[45] [45]

Mogao: An omni foundation model for interleaved multi-modal generation

Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, and Weilin Huang. Mogao: An omni foundation model for interleaved multi-modal generation. 2025

work page 2025

[46] [46]

Enhanced deep residual networks for single image super-resolution

Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. InCVPR-W, pages 136–144, 2017

work page 2017

[47] [47]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2023. doi: 10.48550/ arXiv.2210.02747

work page internal anchor Pith review Pith/arXiv arXiv 2023

[48] [48]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. InICCV, pages 10012–10022, 2021

work page 2021

[49] [49]

Fit: Flexible vision transformer for diffusion model

Zeyu Lu, Zidong Wang, Di Du, Weichao Chen, Jie Ding, and Wei Shen. Fit: Flexible vision transformer for diffusion model.arXiv preprint arXiv:2402.12376, 2024

work page arXiv 2024

[50] [50]

Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. InECCV, pages 23–40. Springer, 2024

work page 2024

[51] [51]

Deco: Frequency- decoupled pixel diffusion for end-to-end image generation

Zehong Ma, Longhui Wei, Shuai Wang, Shiliang Zhang, and Qi Tian. Deco: Frequency- decoupled pixel diffusion for end-to-end image generation. November 2025

work page 2025

[52] [52]

Pixelgen: Pixel diffusion beats latent diffusion with perceptual loss

Zehong Ma, Ruihan Xu, and Shiliang Zhang. Pixelgen: Pixel diffusion beats latent diffusion with perceptual loss. 2026

work page 2026

[53] [53]

Nerf: Representing scenes as neural radiance fields for view synthesis

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoor- thi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021

work page 2021

[54] [54]

Dinov2: Learning robust visual features without supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. February 2024

work page 2024

[55] [55]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, pages 4172–4182. IEEE, 2023

work page 2023

[56] [56]

Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023. 12

work page 2023

[57] [57]

High- resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InCVPR, pages 10684–10695, 2022

work page 2022

[58] [58]

Imagenet large scale visual recognition challenge.IJCV, 115(3):211–252, 2015

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge.IJCV, 115(3):211–252, 2015

work page 2015

[59] [59]

Seedream 4.0: Toward next-generation multimodal image generation

Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation. December 2025

work page 2025

[60] [60]

Claude E. Shannon. Communication in the presence of noise.Proceedings of the IRE, 37(1): 10–21, 1949. doi: 10.1109/JRPROC.1949.232969

work page doi:10.1109/jrproc.1949.232969 1949

[61] [61]

V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al

Oriane Siméoni, Huy V . V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. 2025

work page 2025

[62] [62]

Implicit neural representations with periodic activation functions.NeurIPS, 33:7462–7473, 2020

Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions.NeurIPS, 33:7462–7473, 2020

work page 2020

[63] [63]

Adversarial generation of contin- uous images

Ivan Skorokhodov, Savva Ignatyev, and Mohamed Elhoseiny. Adversarial generation of contin- uous images. InCVPR, pages 10753–10764, 2021

work page 2021

[64] [64]

Deep unsuper- vised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsuper- vised learning using nonequilibrium thermodynamics. InInternational conference on machine learning, pages 2256–2265. pmlr, 2015

work page 2015

[65] [65]

Denoising diffusion implicit models, 2020

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models, 2020

work page 2020

[66] [66]

Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole

Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations, 2021

work page 2021

[67] [67]

Flowdcn: Exploring dcn-like architectures for fast image generation with arbitrary resolution

Shuai Wang, Zexian Li, Tianhui Song, Xubin Li, Tiezheng Ge, Bo Zheng, and Limin Wang. Flowdcn: Exploring dcn-like architectures for fast image generation with arbitrary resolution. InNeurIPS, pages 87959–87977, 2024

work page 2024

[68] [68]

Pixnerd: Pixel neural field diffusion

Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, and Limin Wang. Pixnerd: Pixel neural field diffusion. August 2025

work page 2025

[69] [69]

Ddt: Decoupled diffusion transformer

Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. Ddt: Decoupled diffusion transformer. April 2025

work page 2025

[70] [70]

Fitv2: Scalable and improved flexible vision transformer for diffusion model

ZiDong Wang, Zeyu Lu, Di Huang, Cai Zhou, Wanli Ouyang, and Lei Bai. Fitv2: Scalable and improved flexible vision transformer for diffusion model.arXiv preprint arXiv:2410.13925, 2024

work page arXiv 2024

[71] [71]

Native-resolution image synthesis

Zidong Wang, Lei Bai, Xiangyu Yue, Wanli Ouyang, and Yiyuan Zhang. Native-resolution image synthesis. June 2025

work page 2025

[72] [72]

Omnigen2: Exploration to advanced multimodal generation

Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanced multimodal generation. June 2025

work page 2025

[73] [73]

Representation entanglement for generation: Training diffusion transformers is much easier than you think

Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, et al. Representation entanglement for generation: Training diffusion transformers is much easier than you think. July 2025

work page 2025

[74] [74]

Qwen3 technical report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. May 2025

work page 2025

[75] [75]

Fasterdit: Towards faster diffusion transformers training without architecture modification.NeurIPS, 37:56166–56189, 2024

Jingfeng Yao, Cheng Wang, Wenyu Liu, and Xinggang Wang. Fasterdit: Towards faster diffusion transformers training without architecture modification.NeurIPS, 37:56166–56189, 2024. 13

work page 2024

[76] [76]

Reconstruction vs

Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimiza- tion dilemma in latent diffusion models. InCVPR, pages 15703–15712, 2025

work page 2025

[77] [77]

Representation alignment for generation: Training diffusion transformers is easier than you think, 2025

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think, 2025

work page 2025

[78] [78]

Pixeldit: Pixel diffusion transformers for image generation

Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, and Jiebo Luo. Pixeldit: Pixel diffusion transformers for image generation. November 2025

work page 2025

[79] [79]

Diffusion models need visual priors for image generation

Xiaoyu Yue, Zidong Wang, Zeyu Lu, Shuyang Sun, Meng Wei, Wanli Ouyang, Lei Bai, and Luping Zhou. Diffusion models need visual priors for image generation. 2024

work page 2024

[80] [80]

Normalizing flows are capable generative models

Shuangfei Zhai, Ruixiang Zhang, Preetum Nakkiran, David Berthelot, Jiatao Gu, Huangjie Zheng, Tianrong Chen, Miguel Angel Bautista, Navdeep Jaitly, and Josh Susskind. Normalizing flows are capable generative models. June 2025

work page 2025