DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
Pith reviewed 2026-05-17 05:44 UTC · model grok-4.3
The pith
DeCo decouples frequencies in pixel diffusion so the DiT models semantics while a lightweight decoder adds details.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeCo decouples the generation of high-frequency details from low-frequency semantics in pixel space. The DiT specializes in modeling low-frequency content and supplies semantic guidance to a lightweight pixel decoder that synthesizes the high-frequency components. A frequency-aware flow-matching loss further directs attention to visually salient frequencies. The paper reports FID scores of 1.62 at 256x256 and 2.22 at 512x512 on ImageNet, which it presents as the best among pixel diffusion models, and an overall GenEval score of 0.86 for the text-to-image variant.
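To make the frequency-aware loss more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes the loss is a weighted squared error on 8x8 block DCT coefficients of the velocity error, with weights taken as the normalized reciprocal of the standard JPEG luminance quantization table (ITU-T T.81, Annex K). The function names, the block size, and the exact weighting rule are illustrative assumptions.

```python
import numpy as np

# Standard JPEG luminance quantization table (ITU-T T.81, Annex K).
# Larger entries mark frequencies that JPEG treats as less visually important.
JPEG_LUMA_Q = np.array([
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
], dtype=np.float64)


def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    d[0, :] /= np.sqrt(2.0)
    return d


def blockwise_dct(x, n=8):
    """2D DCT of each n x n block; height and width must be multiples of n."""
    h, w = x.shape
    blocks = x.reshape(h // n, n, w // n, n).transpose(0, 2, 1, 3)
    d = dct_matrix(n)
    return d @ blocks @ d.T


def frequency_weights(q=JPEG_LUMA_Q):
    """Hypothetical weighting: reciprocal of the table, normalized to mean 1."""
    w = 1.0 / q
    return w / w.mean()


def frequency_aware_loss(v_pred, v_target):
    """Weighted mean squared error in the block-DCT domain (assumed form)."""
    err = blockwise_dct(v_pred - v_target)
    return float(np.mean(frequency_weights() * err ** 2))
```

Under this assumed weighting, an error of fixed energy costs more when it sits in a low-frequency coefficient than in the highest-frequency one, which matches the stated goal of emphasizing visually salient frequencies.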
What carries the argument
The frequency-DeCoupled pixel diffusion framework that routes low-frequency semantics through a DiT and high-frequency details through a lightweight decoder conditioned on the DiT output.
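As a rough illustration of what "low-frequency" and "high-frequency" content means here, the sketch below splits an image into a coarse component and a residual. It does not reproduce DeCo's networks or its conditioning path. The average-pool low-pass filter, the `factor` parameter, and the function name are assumptions chosen only for the example.

```python
import numpy as np


def split_frequencies(img, factor=4):
    """Split a 2D image into a coarse (low-frequency) part and a residual.

    The low-frequency part is an average-pooled copy upsampled back by
    nearest-neighbour repetition; the high-frequency part is what remains.
    Height and width must be multiples of `factor`.
    """
    h, w = img.shape
    pooled = img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    low = np.repeat(np.repeat(pooled, factor, axis=0), factor, axis=1)
    high = img - low
    return low, high
```

By construction the two parts sum back to the input, and the residual averages to zero within each pooled cell. In DeCo's framing, the DiT would be responsible for something like the coarse part and the pixel decoder for something like the residual.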
If this is right
- Pixel diffusion models can train and sample faster because the main transformer no longer expends capacity on high-frequency signals.
- End-to-end pixel-space generation becomes competitive with two-stage latent diffusion without relying on a VAE bottleneck.
- The frequency-aware loss produces images with better perceptual quality by suppressing insignificant frequency bands.
- The same pretrained backbone delivers leading system-level performance on text-to-image benchmarks such as GenEval.
Where Pith is reading between the lines
- The same conditioning pattern could be tested on video or 3D diffusion to reduce compute while preserving fine detail.
- Making the frequency split learned rather than fixed might further improve results on diverse datasets.
- The approach suggests a general principle: separate semantic and perceptual modeling early in the generative pipeline.
Load-bearing premise
A lightweight pixel decoder can reliably synthesize accurate high-frequency details when given only semantic conditioning from the DiT without reintroducing artifacts or requiring joint optimization.
What would settle it
Train an ablated version of DeCo that removes the separate decoder and forces the DiT to model all frequencies; if the FID on ImageNet 256x256 rises above 3.0 or visible high-frequency artifacts appear in generated images, the decoupling premise is falsified.
Original abstract
Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This approach avoids the limitations of VAE in the two-stage latent diffusion, offering higher model capacity. Existing pixel diffusion models suffer from slow training and inference, as they usually model both high-frequency signals and low-frequency semantics within a single diffusion transformer (DiT). To pursue a more efficient pixel diffusion paradigm, we propose the frequency-DeCoupled pixel diffusion framework. With the intuition to decouple the generation of high and low frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT. This thus frees the DiT to specialize in modeling low-frequency semantics. In addition, we introduce a frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones. Extensive experiments show that DeCo achieves superior performance among pixel diffusion models, attaining FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet, closing the gap with latent diffusion methods. Furthermore, our pretrained text-to-image model achieves a leading overall score of 0.86 on GenEval in system-level comparison. Codes are publicly available at https://github.com/Zehong-Ma/DeCo.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents DeCo, a frequency-decoupled pixel diffusion framework for end-to-end image generation. It uses a DiT to specialize in low-frequency semantics while a lightweight pixel decoder generates high-frequency details conditioned on DiT guidance, combined with a frequency-aware flow-matching loss that emphasizes salient frequencies. Experiments report FID scores of 1.62 (256×256) and 2.22 (512×512) on ImageNet, closing the gap with latent diffusion models, and a text-to-image variant achieves an overall score of 0.86 on GenEval.
Significance. If the decoupling is effective, the approach could enable more efficient pixel-space diffusion with higher capacity than VAE-based latent methods by avoiding compression artifacts and allowing component specialization. The public code release at the provided GitHub link is a clear strength supporting reproducibility.
major comments (2)
- [Method (§3)] The central claim that frequency decoupling succeeds (DiT models only low-frequency semantics while the decoder produces high-frequency content from guidance alone without artifacts or re-coupling via joint optimization) is load-bearing but unsupported by direct evidence. No frequency-spectrum analysis, high-frequency error maps, or conditioning diagrams are provided to verify specialization.
- [Experiments (§4)] No ablations isolate the contribution of the frequency-aware flow-matching loss or the lightweight decoder design. Without them, it is unclear whether the reported FID gains (1.62 at 256×256) stem from true decoupling or from other uncontrolled factors such as training schedule or architecture scale.
minor comments (2)
- [Abstract] The abstract states that the decoder is 'lightweight' but does not quantify parameter count or FLOPs relative to the DiT, which would clarify the efficiency claim.
- [Figures] Figure captions and diagrams could more explicitly label the frequency separation path and loss weighting to improve readability for readers unfamiliar with the split.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. We address each major comment below with clarifications and proposed revisions to strengthen the manuscript.
- Referee: [Method (§3)] The central claim that frequency decoupling succeeds (DiT models only low-frequency semantics while the decoder produces high-frequency content from guidance alone without artifacts or re-coupling via joint optimization) is load-bearing but unsupported by direct evidence. No frequency-spectrum analysis, high-frequency error maps, or conditioning diagrams are provided to verify specialization.
  Authors: We agree that additional direct evidence would better substantiate the specialization claim. In the revised manuscript we will add frequency-spectrum analysis comparing the DiT output and final decoder output, high-frequency error maps relative to ground truth, and a conditioning diagram that illustrates the guidance pathway from DiT to decoder. These additions will be placed in Section 3 and the supplementary material.
  Revision: yes
- Referee: [Experiments (§4)] No ablations isolate the contribution of the frequency-aware flow-matching loss or the lightweight decoder design. Without them, it is unclear whether the reported FID gains (1.62 at 256×256) stem from true decoupling or from other uncontrolled factors such as training schedule or architecture scale.
  Authors: We acknowledge that the current experiments do not contain targeted ablations for these two components. We will add two new ablation studies in the revised Section 4: (1) a comparison of the frequency-aware flow-matching loss against a standard flow-matching baseline while keeping all other elements fixed, and (2) an ablation replacing the lightweight decoder with a deeper variant to isolate its contribution. These results will be reported alongside the existing FID numbers.
  Revision: yes
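The frequency-spectrum analysis requested in the first comment could take roughly the following form. This is a sketch under assumptions, not the analysis the authors will run: it treats each output as a single-channel 2D array, and the function names and the `cutoff` radius are hypothetical.

```python
import numpy as np


def _power_and_radius(img):
    """Centered 2D power spectrum and each bin's distance from the DC term."""
    h, w = img.shape
    power = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h // 2, xx - w // 2)
    return power, radius


def radial_power_spectrum(img):
    """Mean spectral power in integer-radius rings around the DC term."""
    power, radius = _power_and_radius(img)
    bins = radius.astype(int).ravel()
    total = np.bincount(bins, weights=power.ravel())
    count = np.bincount(bins)
    return total / np.maximum(count, 1)


def high_frequency_fraction(img, cutoff):
    """Share of total spectral power at radius >= cutoff (in frequency bins)."""
    power, radius = _power_and_radius(img)
    return float(power[radius >= cutoff].sum() / power.sum())
```

If the specialization claim holds, one would expect `high_frequency_fraction` to be noticeably smaller for the DiT's intermediate output than for the final decoder output at the same cutoff. Similar values for both would suggest that the DiT is still modeling fine detail.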
Circularity Check
No circularity: architectural design choices and empirical results are independent of inputs
The paper presents DeCo as an empirical framework consisting of a proposed frequency-decoupled architecture (DiT for low-frequency semantics plus lightweight pixel decoder for high-frequency details) and a frequency-aware flow-matching loss. These are introduced as design decisions motivated by intuition about frequency separation, not derived from equations or prior results that reduce back to the same inputs by construction. Reported FID scores (1.62 at 256x256, 2.22 at 512x512) and GenEval score arise from standard benchmark evaluations on ImageNet, which are external to the model definition. No self-citations, uniqueness theorems, fitted parameters renamed as predictions, or ansatzes smuggled via prior work are used to justify the central claims. The derivation chain is therefore self-contained as an engineering proposal validated experimentally.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: Standard assumptions of flow-matching or diffusion processes in image generation (e.g., gradual noise addition and reversal)
- domain assumption: High-frequency details can be generated reliably by a lightweight decoder conditioned solely on low-frequency semantic features from the DiT
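For readers unfamiliar with the first axiom, a common flow-matching setup can be sketched as follows. The linear interpolation path and the time convention (t = 0 is pure noise, t = 1 is data) are assumptions made for illustration. The paper may use a different parameterization, and the function name is hypothetical.

```python
import numpy as np


def flow_matching_pair(x_data, noise, t):
    """Return the noisy sample x_t and the velocity target on a linear path.

    Assumed convention: x_t = (1 - t) * noise + t * x_data, so the
    velocity d x_t / d t is constant and equals x_data - noise.
    """
    x_t = (1.0 - t) * noise + t * x_data
    v_target = x_data - noise
    return x_t, v_target
```

A network trained under this convention predicts `v_target` from `x_t` and `t`. Following the true velocity from any point on the path reaches the data sample, since `x_t + (1 - t) * v_target` equals `x_data`.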
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT... frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones... DCT... JPEG quantization tables
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  DiT to specialize in modeling low-frequency semantics... 8-tick period never mentioned; no golden-ratio or reciprocal-cost identities
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 9 Pith papers
- Coevolving Representations in Joint Image-Feature Diffusion
  CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample ...
- HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion
  HyperDiT achieves FID 1.56 on ImageNet 256x256 in pixel space via hyper-connected cross-scale interactions, cross-attention, SA-RoPE, and VFM registers.
- L2P: Unlocking Latent Potential for Pixel Generation
  L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.
- FREPix: Frequency-Heterogeneous Flow Matching for Pixel-Space Image Generation
  FREPix achieves competitive FID scores on ImageNet by decomposing image generation into separate low- and high-frequency paths within a flow matching framework.
- CoD-Lite: Real-Time Diffusion-Based Generative Image Compression
  CoD-Lite delivers real-time generative image compression via a lightweight convolution-based diffusion codec with compression-oriented pre-training and distillation, achieving substantial bitrate savings.
- PixelGen: Improving Pixel Diffusion with Perceptual Supervision
  PixelGen augments pixel diffusion with gated perceptual supervision to reach FID 5.11 on ImageNet-256 and GenEval 0.79 in text-to-image, narrowing the gap to latent methods without VAEs.
- PixIE: Prompted Pixel-Space Low-Light Image Enhancement
  PixIE proposes a pixel-space low-light image enhancement framework using DINO-prompted blocks, spatial-channel compaction, and multi-receptive-field embeddings, reporting PSNR gains of 1.9-15.0% and LPIPS reductions o...
- FrequencyBooster: Full-Frequency Modeling for High-Fidelity Pixel Diffusion
  FrequencyBooster reports state-of-the-art FID scores of 1.60 at 256x256 and 1.69 at 512x512 for pixel diffusion by using a specialized decoder for full-frequency modeling.
- Why Do DiT Editors Drift? Plug-and-Play Low Frequency Alignment in VAE Latent Space
  VAE-LFA suppresses semantic drift in multi-turn DiT image editing by low-pass filtering latent discrepancies and aligning low-frequency components to an EMA of previous rounds in VAE space.