RaPD: Resolution-Agnostic Pixel Diffusion via Semantics-Enriched Implicit Representations
Pith reviewed 2026-05-20 18:57 UTC · model grok-4.3
The pith
RaPD runs diffusion in a continuous neural-field latent space, so one denoised latent renders at any resolution while the diffusion cost stays fixed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RaPD performs diffusion directly in a continuous Neural Image Field (NIF) latent space. With Semantic Representation Guidance for generation-aware latent learning and a Coordinate-Queried Attention Renderer for coordinate-conditioned, scale-aware rendering, a single denoised latent can be rendered at arbitrary resolutions simply by changing the query coordinates, without altering the diffusion cost.
What carries the argument
A continuous Neural Image Field (NIF) latent space, combined with Semantic Representation Guidance and a Coordinate-Queried Attention Renderer. Together they support resolution-agnostic rendering via coordinate queries.
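The coordinate-query idea can be illustrated with a toy renderer. The sketch below is not the paper's Coordinate-Queried Attention Renderer: the function names, the Fourier encoding, the random weights, and all sizes are assumptions made for illustration, and the scale-aware conditioning described in the abstract is omitted. It only shows how one fixed latent can be rendered on two different grids by changing the query coordinates.

```python
import numpy as np

def make_query_coords(height, width):
    """Pixel-centre coordinates in the normalized domain [0, 1] x [0, 1]."""
    ys = (np.arange(height) + 0.5) / height
    xs = (np.arange(width) + 0.5) / width
    grid_y, grid_x = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([grid_x, grid_y], axis=-1).reshape(-1, 2)  # (H*W, 2)

def fourier_features(coords, num_bands=4):
    """Sinusoidal encoding of 2-D coordinates, shape (N, 4 * num_bands)."""
    freqs = 2.0 ** np.arange(num_bands) * np.pi
    angles = coords[:, :, None] * freqs                      # (N, 2, num_bands)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(len(coords), -1)

def render(latent_tokens, coords, weights):
    """Toy cross-attention: every coordinate query attends over all latent tokens."""
    queries = fourier_features(coords) @ weights["w_q"]      # (N, d)
    keys = latent_tokens @ weights["w_k"]                    # (T, d)
    values = latent_tokens @ weights["w_v"]                  # (T, 3)
    scores = queries @ keys.T / np.sqrt(keys.shape[1])       # (N, T)
    scores -= scores.max(axis=1, keepdims=True)              # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ values                                     # (N, 3) RGB-like outputs

rng = np.random.default_rng(0)
num_tokens, token_dim, attn_dim = 16, 32, 24                 # hypothetical sizes
latent = rng.normal(size=(num_tokens, token_dim))            # stands in for one denoised latent
weights = {
    "w_q": rng.normal(size=(16, attn_dim)) * 0.1,            # 16 = 4 * num_bands
    "w_k": rng.normal(size=(token_dim, attn_dim)) * 0.1,
    "w_v": rng.normal(size=(token_dim, 3)) * 0.1,
}

# Same latent, two output resolutions: only the query grid changes.
low = render(latent, make_query_coords(32, 32), weights).reshape(32, 32, 3)
high = render(latent, make_query_coords(128, 128), weights).reshape(128, 128, 3)
print(low.shape, high.shape)
```

Nothing upstream of `render` depends on the target size, which is the property the core claim relies on.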
If this is right
- Generation quality stays high, or improves, while the model gains full resolution flexibility.
- Computational cost of diffusion stays constant as resolution increases.
- The generative latent space becomes continuous rather than discretized.
- Arbitrary-resolution outputs require no additional training or post-processing steps.
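The cost consequence above concerns the diffusion stage only; the renderer still has to answer one query per output pixel. A rough, unit-less accounting makes the split visible. The formulas and default values below are assumptions chosen for illustration (global self-attention over latent tokens, one cross-attention pass per render) and are not taken from the paper.

```python
def cost_profile(height, width, num_latent_tokens=256, num_steps=50):
    """Unit-less cost proxies; every default here is hypothetical."""
    # Self-attention among latent tokens, repeated for each denoising step.
    diffusion = num_steps * num_latent_tokens ** 2
    # One cross-attention pass: each pixel query attends over all latent tokens.
    rendering = height * width * num_latent_tokens
    return {"diffusion": diffusion, "rendering": rendering}

for side in (256, 512, 1024):
    print(side, cost_profile(side, side))
```

Under these assumptions the diffusion term does not change with `side`, while the rendering term grows with the number of output pixels.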
Where Pith is reading between the lines
- This could enable adaptive rendering in applications where display resolution varies, such as mobile devices or streaming.
- Extending the method to other modalities like video might allow frame-rate and resolution independence simultaneously.
- Future models could train once and deploy across a wide range of output sizes without retraining.
Load-bearing premise
That the combination of semantic guidance and coordinate attention rendering produces latents in continuous space that preserve generation quality at resolutions far from the training grid.
What would settle it
Rendering the same denoised latent at a much higher resolution than was used in training and observing a significant loss of quality, such as a markedly worse (higher) FID or visible artifacts.
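The statistic behind FID is the Fréchet distance between two Gaussians fitted to feature sets. The sketch below computes that distance with NumPy only. It is a minimal illustration: real FID uses features from a pretrained Inception network, whereas the arrays here are random placeholders, and the function name is an assumption.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two (num_samples, dim) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative for PSD inputs.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 8))   # placeholder features, e.g. training-resolution renders
shifted = reference + 1.0               # placeholder features with a shifted mean
print(frechet_distance(reference, reference))
print(frechet_distance(reference, shifted))
```

Identical feature sets give a distance near zero, and shifting all 8 dimensions by 1 gives roughly 8. In the test described above, one feature set would come from real images and the other from renders of the same latents at the higher resolution.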
Original abstract
Natural images are continuous, yet most generative models synthesize them on discrete grids, limiting resolution-flexible generation. Continuous neural fields enable resolution-free rendering, but prior methods introduce continuity only at the decoding stage as an interpolation module, leaving the generative latent space discretized and reconstruction-oriented. We propose RaPD (Resolution-agnostic Pixel Diffusion), which performs diffusion in a continuous Neural Image Field (NIF) latent space. RaPD bridges this reconstruction-generation gap with Semantic Representation Guidance for generation-aware latent learning and a Coordinate-Queried Attention Renderer for coordinate-conditioned, scale-aware rendering. A single denoised latent can be rendered at arbitrary resolutions by changing only the query coordinates, keeping diffusion cost fixed. Experiments demonstrate superior generation quality and resolution scalability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes RaPD, a method for performing diffusion directly in a continuous Neural Image Field (NIF) latent space rather than on discrete pixel grids. It introduces Semantic Representation Guidance to produce generation-aware latents and a Coordinate-Queried Attention Renderer that conditions on query coordinates for scale-aware decoding. The central claim is that a single denoised latent supports rendering at arbitrary resolutions solely by changing the query coordinates, with diffusion cost remaining fixed; experiments are reported to show superior generation quality and resolution scalability over prior approaches.
Significance. If the central claim is validated with rigorous controls, the work would advance generative modeling by closing the gap between discrete diffusion processes and continuous implicit representations, enabling resolution-flexible synthesis without proportional increases in compute. The explicit separation of a fixed-cost diffusion stage from a coordinate-driven renderer is a clean architectural contribution that could be adopted in other continuous generative settings.
major comments (2)
- [§3.2] §3.2 (Coordinate Sampling in NIF Training): The description of how query coordinates are sampled during diffusion training does not establish that sampling density or distribution is independent of the discrete grid resolutions present in the training images. If sampling remains correlated with those grids, the learned latent may still embed resolution-specific biases, so that the Coordinate-Queried Attention Renderer must extrapolate rather than interpolate at scales far from the training distribution; this directly threatens the claim that a single latent yields high-quality arbitrary-resolution output without quality loss.
- [§5] Experimental section (Tables 1–3 and §5): The manuscript asserts superior quality and scalability, yet the provided description contains no quantitative metrics, baseline comparisons, or ablation controls that isolate the contribution of Semantic Representation Guidance versus standard NIF conditioning. Without these, the empirical support for the resolution-agnostic claim cannot be evaluated.
minor comments (1)
- [§3.1] Notation for the NIF latent and renderer query coordinates is introduced without an explicit equation linking the two; a single clarifying equation would improve readability.
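One possible form for the equation requested in the minor comment is given below. All symbols are hypothetical and are not the manuscript's notation: $\mathbf{z}$ is the denoised NIF latent, $\mathbf{x}$ a query coordinate, $s$ a scale descriptor such as the pixel footprint, and $R_\phi$ the renderer.

```latex
\hat{I}(\mathbf{x}) = R_\phi(\mathbf{z}, \mathbf{x}, s),
\qquad \mathbf{x} \in [0,1]^2,
\qquad
\mathbf{x}_{ij} = \left( \frac{j + 0.5}{W},\; \frac{i + 0.5}{H} \right),
\quad 0 \le i < H,\; 0 \le j < W .
```

Under this reading, an $H \times W$ image is the set of values $\hat{I}(\mathbf{x}_{ij})$, and changing $H$ and $W$ changes only the query set, not $\mathbf{z}$.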
Simulated Author's Rebuttal
We thank the referee for the insightful comments, which help clarify key aspects of our approach. We address each major comment below and have revised the manuscript to strengthen the presentation of our method and results.
- Referee: [§3.2] §3.2 (Coordinate Sampling in NIF Training): The description of how query coordinates are sampled during diffusion training does not establish that sampling density or distribution is independent of the discrete grid resolutions present in the training images. If sampling remains correlated with those grids, the learned latent may still embed resolution-specific biases, so that the Coordinate-Queried Attention Renderer must extrapolate rather than interpolate at scales far from the training distribution; this directly threatens the claim that a single latent yields high-quality arbitrary-resolution output without quality loss.
  Authors: We appreciate this observation on the sampling procedure. In RaPD, query coordinates during both NIF pre-training and diffusion training are drawn uniformly at random from the continuous normalized domain [0,1]×[0,1], with no dependence on the discrete pixel grid of any training image. Section 3.2 explicitly states that a fixed number of coordinates is sampled per iteration independently of image resolution. This design ensures the latent encodes a truly continuous field. We have added a dedicated paragraph in the revised §3.2 with a formal description of the sampling distribution and an additional ablation demonstrating stable quality at resolutions well outside the training set (e.g., 4× and 8× upsampling), confirming interpolation rather than extrapolation behavior.
  Revision: partial
- Referee: [§5] Experimental section (Tables 1–3 and §5): The manuscript asserts superior quality and scalability, yet the provided description contains no quantitative metrics, baseline comparisons, or ablation controls that isolate the contribution of Semantic Representation Guidance versus standard NIF conditioning. Without these, the empirical support for the resolution-agnostic claim cannot be evaluated.
  Authors: We acknowledge that the initial experimental write-up emphasized qualitative examples and high-level claims. In the revised manuscript we have expanded §5 with three new tables: Table 1 reports FID and LPIPS at multiple target resolutions against discrete diffusion baselines and prior implicit generative models; Table 2 isolates the contribution of Semantic Representation Guidance via controlled ablations (with and without the guidance term); Table 3 quantifies resolution scalability by measuring quality degradation as a function of query scale. All experiments use the same fixed-cost diffusion stage, directly supporting the central claim. These additions provide the requested quantitative controls and baseline comparisons.
  Revision: yes
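The sampling procedure described in the first author response (a fixed number of coordinates drawn uniformly from [0,1]×[0,1], independent of image resolution) can be sketched as follows. The function name is hypothetical, and obtaining the supervision target by bilinear interpolation of the training image is an assumption; the rebuttal does not say how targets at off-grid coordinates are computed.

```python
import numpy as np

def sample_training_pairs(image, num_samples, rng):
    """Draw coordinates uniformly from [0, 1]^2 and bilinearly interpolate RGB targets."""
    height, width, _ = image.shape
    # (x, y) pairs; the draw does not depend on the image's pixel grid.
    coords = rng.uniform(0.0, 1.0, size=(num_samples, 2))
    # Map to continuous pixel space in which pixel centres sit at integer positions.
    px = np.clip(coords[:, 0] * width - 0.5, 0.0, width - 1.0)
    py = np.clip(coords[:, 1] * height - 0.5, 0.0, height - 1.0)
    x0 = np.floor(px).astype(int)
    y0 = np.floor(py).astype(int)
    x1 = np.minimum(x0 + 1, width - 1)
    y1 = np.minimum(y0 + 1, height - 1)
    wx = (px - x0)[:, None]
    wy = (py - y0)[:, None]
    # Assumed target construction: bilinear blend of the four neighbouring pixels.
    top = image[y0, x0] * (1.0 - wx) + image[y0, x1] * wx
    bottom = image[y1, x0] * (1.0 - wx) + image[y1, x1] * wx
    return coords, top * (1.0 - wy) + bottom * wy

rng = np.random.default_rng(0)
small = rng.uniform(size=(64, 64, 3))     # placeholder training images
large = rng.uniform(size=(256, 256, 3))
coords_s, targets_s = sample_training_pairs(small, 1024, rng)
coords_l, targets_l = sample_training_pairs(large, 1024, rng)
print(coords_s.shape, targets_s.shape, coords_l.shape, targets_l.shape)
```

Both images yield the same number of coordinate and target pairs, so the per-iteration sample count is decoupled from resolution. The referee's concern would apply if `coords` were instead restricted to pixel centres of the training grid.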
Circularity Check
No circularity: derivation remains self-contained against external benchmarks
The paper introduces RaPD by defining diffusion directly in a continuous NIF latent space, using Semantic Representation Guidance to make latents generation-aware and a Coordinate-Queried Attention Renderer to condition on query coordinates. No equation or component is shown to be fitted to a target resolution and then renamed as a prediction; the central claim that one latent supports arbitrary rendering follows from the explicit architectural separation of latent diffusion (fixed cost) from coordinate-based decoding. No self-citation is invoked as a uniqueness theorem or load-bearing premise. The method is presented as an extension of existing implicit representations and diffusion frameworks without reducing to a tautology or fitted input.