PixelDiT: Pixel Diffusion Transformers for Image Generation
Pith reviewed 2026-05-17 04:26 UTC · model grok-4.3
The pith
PixelDiT runs diffusion directly in pixel space using a dual-level transformer that splits global semantics from local texture refinement.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PixelDiT is a single-stage, fully transformer-based diffusion model that performs the entire diffusion process directly in pixel space through a dual-level architecture: a patch-level DiT captures global semantics and a pixel-level DiT refines texture details, eliminating the pretrained autoencoder and its associated reconstruction losses while enabling efficient high-resolution training.
What carries the argument
Dual-level transformer design that combines a patch-level DiT for coarse semantics with a pixel-level DiT for detail refinement, allowing direct pixel-space diffusion without dimensionality reduction.
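The split between the two levels can be pictured as two views of the same pixel grid. The following is a minimal sketch of that data layout only, not the authors' implementation: the `patchify` helper, the 4×4 toy image, and the patch size of 2 are all hypothetical choices made for illustration, and the source does not state the patch size PixelDiT uses.

```python
def patchify(image, patch_size):
    """Split an H x W grid (list of rows) into flattened patches, row-major.

    Hypothetical helper: each returned patch would be one token for a
    patch-level model, while the entries inside a patch are the pixels a
    pixel-level model would refine.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ])
    return patches


# Toy 4x4 "image" whose pixel values are just their row-major indices.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = patchify(image, patch_size=2)

print(len(patches), "patch tokens,", len(patches[0]), "pixels per patch")
for index, patch in enumerate(patches):
    print(index, patch)
```

In this toy case the coarse level sees 4 tokens while the fine level still has access to all 16 pixel values, which is the property the dual-level description relies on.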
If this is right
- Removes error accumulation from autoencoder reconstruction in the generation pipeline.
- Enables true end-to-end optimization of the diffusion process without frozen pretrained components.
- Supports direct pixel-space training at 1024 resolution for text-to-image tasks.
- Surpasses prior pixel-space generative models on standard ImageNet FID benchmarks.
- Approaches the performance of the strongest latent diffusion models on GenEval and DPG-bench.
Where Pith is reading between the lines
- The approach may simplify deployment by removing the need to train or maintain a separate autoencoder.
- It could make high-resolution generation more straightforward in domains where defining a good latent space is difficult.
- The split between patch and pixel levels suggests a general pattern for scaling transformers when operating on high-dimensional raw signals.
Load-bearing premise
The dual-level patch and pixel transformers can handle both global structure and fine details at full resolution without the compression that an autoencoder provides.
What would settle it
Measure whether PixelDiT at 1024 resolution requires substantially more compute or memory than a comparable latent diffusion model while delivering lower or equal FID and GenEval scores.
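Before running such a measurement, a back-of-the-envelope token count helps frame what "more compute" could mean. The sketch below rests on assumptions that are not taken from the paper: a latent pipeline with 8× autoencoder downsampling and 2×2 latent patches (a common latent DiT configuration), and a hypothetical pixel-space patch size of 16. It counts only sequence length and pairwise attention interactions, and it ignores whatever the pixel-level pathway costs on top.

```python
def latent_tokens(image_size, vae_downsample=8, latent_patch=2):
    """Tokens for a latent model: assumed 8x downsampling, then 2x2 patches."""
    side = image_size // (vae_downsample * latent_patch)
    return side * side


def pixel_patch_tokens(image_size, patch_size=16):
    """Tokens for a pixel-space patch-level model (patch size is an assumption)."""
    side = image_size // patch_size
    return side * side


def attention_pairs(num_tokens):
    """Self-attention compares every token with every token."""
    return num_tokens ** 2


for size in (256, 512, 1024):
    n_latent = latent_tokens(size)
    n_pixel = pixel_patch_tokens(size)
    print(
        f"{size}px: latent={n_latent} tokens, pixel-patch={n_pixel} tokens, "
        f"attention pairs={attention_pairs(n_pixel)}"
    )
```

Under these particular assumptions the patch-level sequence length equals the latent one, so any extra cost would have to come from the pixel-level refinement. That is exactly the quantity the proposed measurement would need to isolate.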
Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. PixelDiT achieves 1.61 FID on ImageNet 256 and 1.81 FID on ImageNet 512, surpassing existing pixel generative models. We further extend PixelDiT to text-to-image generation and pretrain it at the 1024² resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models. Code: https://github.com/NVlabs/PixelDiT
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PixelDiT, a single-stage end-to-end pixel-space Diffusion Transformer that removes the latent autoencoder stage. It uses a dual-level architecture with a patch-level DiT to capture global semantics and a pixel-level DiT to refine fine texture details. The model is evaluated on class-conditional ImageNet generation at 256 and 512 resolutions and extended to text-to-image at 1024 resolution, reporting FID scores of 1.61 and 1.81 on ImageNet plus 0.74 on GenEval and 83.5 on DPG-bench.
Significance. If the empirical results prove robust, this would represent a meaningful advance by showing that direct pixel-space modeling with a dual-level transformer can match or exceed latent diffusion models without autoencoder-induced reconstruction loss or two-stage training. The approach could simplify generative pipelines and enable fully joint optimization, with the dual-level design offering a practical way to handle native-resolution inputs.
major comments (3)
- [Abstract and Experiments] The reported FID scores (1.61 at 256, 1.81 at 512) and text-to-image metrics lack error bars, standard deviations across runs, or statistical tests, which weakens assessment of whether the gains over prior pixel-space models are reliable and reproducible.
- [Method and Experiments] No ablation studies isolate the contribution of the pixel-level DiT versus a patch-only or single-level baseline, which is load-bearing for the central claim that the dual-level design efficiently models both semantics and details at full pixel dimensionality without latent compression.
- [Implementation or Experiments] The manuscript provides no parameter counts, FLOPs, memory usage, or scaling curves for the pixel-level transformer at native resolutions, leaving the efficiency and practicality claims relative to latent models unverified despite the higher input dimensionality.
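The variability reporting requested in the first comment could look like the minimal sketch below. The scores are placeholder values invented for illustration and are not results from the paper; `summarize_runs` is a hypothetical helper, and the sample standard deviation is one reasonable choice among several.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(scores), stdev(scores)


# Placeholder FID values for three seeds (NOT measurements from the paper).
placeholder_fid = [2.0, 2.2, 2.4]

fid_mean, fid_std = summarize_runs(placeholder_fid)
print(f"FID {fid_mean:.2f} +/- {fid_std:.2f} over {len(placeholder_fid)} seeds")
```

With only a handful of seeds the standard deviation is itself a noisy estimate, so a gap between two methods is more convincing when it is several standard deviations wide.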
minor comments (1)
- [Abstract and Method] The abstract and method sections would benefit from explicit statements on training stability measures and any data filtering applied, to support the extension to 1024-resolution text-to-image pretraining.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will incorporate to strengthen the presentation of our results and claims.
- Referee: [Abstract and Experiments] The reported FID scores (1.61 at 256, 1.81 at 512) and text-to-image metrics lack error bars, standard deviations across runs, or statistical tests, which weakens assessment of whether the gains over prior pixel-space models are reliable and reproducible.
  Authors: We agree that reporting variability would improve assessment of reproducibility. The original results were obtained from single training runs per configuration. In the revised manuscript we will add results from at least three independent runs with different random seeds and report mean FID together with standard deviation for the ImageNet 256 and 512 settings. For the text-to-image metrics we will clarify the evaluation protocol and include any available run-to-run variation. These additions will allow readers to better judge the reliability of the reported improvements over prior pixel-space models.
  Revision: partial
- Referee: [Method and Experiments] No ablation studies isolate the contribution of the pixel-level DiT versus a patch-only or single-level baseline, which is load-bearing for the central claim that the dual-level design efficiently models both semantics and details at full pixel dimensionality without latent compression.
  Authors: We concur that explicit ablations would more directly support the value of the dual-level architecture. While the current manuscript demonstrates overall performance through comparisons against existing pixel-space generators, we will add targeted ablation experiments in the revision. These will include a patch-only DiT baseline and a single-level full-pixel transformer, with corresponding FID and qualitative results, to quantify the separate contributions of the patch-level semantic modeling and the pixel-level detail refinement.
  Revision: yes
- Referee: [Implementation or Experiments] The manuscript provides no parameter counts, FLOPs, memory usage, or scaling curves for the pixel-level transformer at native resolutions, leaving the efficiency and practicality claims relative to latent models unverified despite the higher input dimensionality.
  Authors: We thank the referee for highlighting the need for concrete efficiency metrics. In the revised manuscript we will include a dedicated table reporting parameter counts for the patch-level and pixel-level components, estimated FLOPs for forward passes at 256 and 512 resolutions, and peak memory usage during training and inference. We will also add a brief scaling discussion relating model size to performance at native resolution. These quantitative details will enable direct comparison with latent diffusion models and substantiate the practicality claims.
  Revision: yes
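The kind of efficiency table discussed above usually starts from a rough per-block estimate. The sketch below uses the standard approximation for a plain transformer block (four attention projections plus a two-layer MLP) and counts one multiply-add as two FLOPs. It deliberately ignores biases, normalization, conditioning layers such as AdaLN, and embeddings. The width of 1024 and the 256-token sequence are hypothetical inputs, not PixelDiT's configuration.

```python
def block_params(d_model, mlp_ratio=4):
    """Approximate weight count of one transformer block (no biases or norms)."""
    attention = 4 * d_model * d_model            # Q, K, V and output projections
    mlp = 2 * mlp_ratio * d_model * d_model      # up- and down-projection
    return attention + mlp


def block_forward_flops(num_tokens, d_model, mlp_ratio=4):
    """Approximate forward FLOPs of one block for a single sequence."""
    # Every weight is used once per token; a multiply-add counts as 2 FLOPs.
    linear = 2 * num_tokens * block_params(d_model, mlp_ratio)
    # QK^T scores and the attention-weighted sum over V: 2 * N^2 * d each.
    attention = 4 * num_tokens * num_tokens * d_model
    return linear + attention


# Hypothetical configuration, chosen only to exercise the formulas.
d_model, num_tokens = 1024, 256
print("params per block:", block_params(d_model))
print("forward FLOPs per block:", block_forward_flops(num_tokens, d_model))
```

Because the attention term grows with the square of the token count while the linear term grows only linearly, the balance between the two shifts quickly as resolution rises, which is why the referee asks for numbers at native resolution rather than at a single setting.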
Circularity Check
No significant circularity; results are empirical measurements on external benchmarks.
The paper proposes PixelDiT, a dual-level patch-plus-pixel transformer architecture for direct pixel-space diffusion without an autoencoder. Its central claims consist of architectural design choices and reported performance numbers (FID on ImageNet, GenEval, DPG-bench) obtained via training and evaluation against standard external datasets. No mathematical derivation, prediction, or first-principles result is presented that reduces by construction to fitted parameters, self-definitions, or self-citation chains. The performance figures are measured outcomes rather than tautological outputs of the model definition itself, rendering the derivation chain self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details... pixel-wise AdaLN modulation... pixel token compaction mechanism
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: We propose PixelDiT, a single-stage, fully transformer-based pixel-space diffusion model... efficient pixel modeling via pixel-wise AdaLN and token compaction
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 17 Pith papers
- Matérn Noise for Triangulation-Agnostic Flow Matching on Meshes
  Proposes discretized Matérn process noise for triangulation-agnostic flow matching on meshes with PoissonNet denoiser, tested on elastic states and humanoid poses for meshes exceeding one million triangles.
- Cast3: Translating numerical weather prediction principles into data-driven forecasting
  Cast3 translates NWP principles into a data-driven model using cubed-sphere grids, super-ensembles, and generative nudging to achieve state-of-the-art ensemble predictions that outperform baselines.
- PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion
  PiD is a pixel diffusion decoder that performs latent-to-pixel conversion and 4-8x upsampling in one generative step, enabling early stopping of latent diffusion and achieving sub-second 2048x2048 decoding with claime...
- RiT: Vanilla Diffusion Transformers Suffice in Representation Space
  A vanilla Diffusion Transformer trained via x-prediction on frozen DINOv2 features reaches FID 1.14 on ImageNet 256x256 with fewer parameters and faster sampling than prior DiT variants.
- Registers Matter for Pixel-Space Diffusion Transformers
  Register tokens enhance pixel-space DiT training and output quality via cleaner high-noise feature maps, and a dual-stream design adds further gains with little overhead.
- HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion
  HyperDiT achieves FID 1.56 on ImageNet 256x256 in pixel space via hyper-connected cross-scale interactions, cross-attention, SA-RoPE, and VFM registers.
- L2P: Unlocking Latent Potential for Pixel Generation
  L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.
- BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion
  BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.
- FREPix: Frequency-Heterogeneous Flow Matching for Pixel-Space Image Generation
  FREPix achieves competitive FID scores on ImageNet by decomposing image generation into separate low- and high-frequency paths within a flow matching framework.
- Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
  Tuna-2 shows that direct pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive generation and stronger understanding at scale.
- PixelGen: Improving Pixel Diffusion with Perceptual Supervision
  PixelGen augments pixel diffusion with gated perceptual supervision to reach FID 5.11 on ImageNet-256 and GenEval 0.79 in text-to-image, narrowing the gap to latent methods without VAEs.
- PixIE: Prompted Pixel-Space Low-Light Image Enhancement
  PixIE proposes a pixel-space low-light image enhancement framework using DINO-prompted blocks, spatial-channel compaction, and multi-receptive-field embeddings, reporting PSNR gains of 1.9-15.0% and LPIPS reductions o...
- FrequencyBooster: Full-Frequency Modeling for High-Fidelity Pixel Diffusion
  FrequencyBooster reports state-of-the-art FID scores of 1.60 at 256x256 and 1.69 at 512x512 for pixel diffusion by using a specialized decoder for full-frequency modeling.
- SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
  SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
- CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation
  CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...
- Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
  Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.
- UniCSG: Unified High-Fidelity Content-Constrained Style-Driven Generation via Staged Semantic and Frequency Disentanglement
  UniCSG adds staged semantic disentanglement and frequency-aware reconstruction to DiT diffusion models to improve content preservation and style fidelity in both text- and reference-guided generation.
A. Architecture and System Details
A.1. Summary of Model Size
To study the impact of model size, we evaluate the base (B), large (L), and extra-large (XL) variants of PixelDiT on ImageNet 256×256. Tables 6 and 7 summarize the ...