
arxiv: 2511.20645 · v2 · submitted 2025-11-25 · 💻 cs.CV

PixelDiT: Pixel Diffusion Transformers for Image Generation

Pith reviewed 2026-05-17 04:26 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion transformers · pixel-space generation · image synthesis · dual-level architecture · text-to-image · ImageNet benchmarks · end-to-end training

The pith

PixelDiT runs diffusion directly in pixel space using a dual-level transformer that splits global semantics from local texture refinement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that diffusion transformers can be trained end-to-end on raw pixels instead of compressed latents. A patch-level transformer first models overall image structure while a pixel-level transformer sharpens fine details, removing the need for a separate autoencoder stage. This single-stage design avoids reconstruction errors and supports joint optimization across the full pipeline. The resulting model reaches 1.61 FID on ImageNet at 256 resolution and 1.81 FID at 512 resolution, and extends to text-to-image at 1024 resolution with scores close to leading latent models.
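
For readers unfamiliar with the metric, the FID figures quoted here follow the standard Fréchet Inception Distance, which compares Gaussian fits to Inception features of real and generated images. The notation below is the conventional one and is not taken from the paper:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Here $\mu_r, \Sigma_r$ are the feature mean and covariance for real images and $\mu_g, \Sigma_g$ those for generated images. Lower values indicate a closer match.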

Core claim

PixelDiT is a single-stage, fully transformer-based diffusion model that performs the entire diffusion process directly in pixel space through a dual-level architecture: a patch-level DiT captures global semantics and a pixel-level DiT refines texture details, eliminating the pretrained autoencoder and its associated reconstruction losses while enabling efficient high-resolution training.

What carries the argument

Dual-level transformer design that combines a patch-level DiT for coarse semantics with a pixel-level DiT for detail refinement, allowing direct pixel-space diffusion without dimensionality reduction.
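
To make the split concrete, the token counts at each level can be worked out with simple arithmetic. The sketch below assumes a patch size of 16, a hypothetical value chosen for illustration since this summary does not state the paper's patch size, and the helper name `token_counts` is invented here.

```python
def token_counts(resolution: int, patch_size: int = 16) -> dict:
    """Count tokens for a square image at patch level and at pixel level.

    patch_size=16 is an illustrative assumption, not a value from the paper.
    """
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by patch_size")
    side = resolution // patch_size
    patch_tokens = side * side
    pixel_tokens = resolution * resolution
    return {
        "patch_tokens": patch_tokens,
        "pixel_tokens": pixel_tokens,
        "pixels_per_patch": patch_size * patch_size,
        # Global self-attention compares every token pair, so its cost
        # scales with the square of the sequence length.
        "attention_pair_ratio": (pixel_tokens // patch_tokens) ** 2,
    }


# Resolutions mentioned in the summary above.
for res in (256, 512, 1024):
    print(res, token_counts(res))
```

Under that assumption, global attention over raw pixels would involve 65,536 times as many token pairs as attention over patches at every resolution, which is the kind of gap any pixel-level stage has to work around.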

If this is right

  • Removes error accumulation from autoencoder reconstruction in the generation pipeline.
  • Enables true end-to-end optimization of the diffusion process without frozen pretrained components.
  • Supports direct pixel-space training at 1024 resolution for text-to-image tasks.
  • Surpasses prior pixel-space generative models on standard ImageNet FID benchmarks.
  • Approaches the performance of the strongest latent diffusion models on GenEval and DPG-bench.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may simplify deployment by removing the need to train or maintain a separate autoencoder.
  • It could make high-resolution generation more straightforward in domains where defining a good latent space is difficult.
  • The split between patch and pixel levels suggests a general pattern for scaling transformers when operating on high-dimensional raw signals.

Load-bearing premise

The dual-level patch and pixel transformers can handle both global structure and fine details at full resolution without the compression that an autoencoder provides.

What would settle it

Measure whether PixelDiT at 1024 resolution requires substantially more compute or memory than a comparable latent diffusion model while delivering lower or equal FID and GenEval scores.

Original abstract

Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. PixelDiT achieves 1.61 FID on ImageNet 256 and 1.81 FID on ImageNet 512, surpassing existing pixel generative models. We further extend PixelDiT to text-to-image generation and pretrain it at the 1024² resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models. Code: https://github.com/NVlabs/PixelDiT

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes PixelDiT, a single-stage end-to-end pixel-space Diffusion Transformer that removes the latent autoencoder stage. It uses a dual-level architecture with a patch-level DiT to capture global semantics and a pixel-level DiT to refine fine texture details. The model is evaluated on class-conditional ImageNet generation at 256 and 512 resolutions and extended to text-to-image at 1024 resolution, reporting FID scores of 1.61 and 1.81 on ImageNet plus 0.74 on GenEval and 83.5 on DPG-bench.

Significance. If the empirical results prove robust, this would represent a meaningful advance by showing that direct pixel-space modeling with a dual-level transformer can match or exceed latent diffusion models without autoencoder-induced reconstruction loss or two-stage training. The approach could simplify generative pipelines and enable fully joint optimization, with the dual-level design offering a practical way to handle native-resolution inputs.

major comments (3)
  1. [Abstract and Experiments] The reported FID scores (1.61 at 256, 1.81 at 512) and text-to-image metrics lack error bars, standard deviations across runs, or statistical tests, which weakens assessment of whether the gains over prior pixel-space models are reliable and reproducible.
  2. [Method and Experiments] No ablation studies isolate the contribution of the pixel-level DiT versus a patch-only or single-level baseline, which is load-bearing for the central claim that the dual-level design efficiently models both semantics and details at full pixel dimensionality without latent compression.
  3. [Implementation or Experiments] The manuscript provides no parameter counts, FLOPs, memory usage, or scaling curves for the pixel-level transformer at native resolutions, leaving the efficiency and practicality claims relative to latent models unverified despite the higher input dimensionality.
minor comments (1)
  1. [Abstract and Method] The abstract and method sections would benefit from explicit statements on training stability measures and any data filtering applied, to support the extension to 1024-resolution text-to-image pretraining.
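
Major comment 1 above asks for run-to-run variability. A minimal sketch of that kind of reporting, using only the Python standard library, might look like the following. The function name and the input values are placeholders invented for illustration; they are not results from the paper.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) for per-seed metric values."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return statistics.mean(scores), statistics.stdev(scores)


# Placeholder values for illustration only; these are NOT results from the paper.
placeholder_fids = [2.0, 2.5, 3.0]
mean, std = summarize_runs(placeholder_fids)
print(f"FID: {mean:.2f} ± {std:.2f} over {len(placeholder_fids)} seeds")
```

With only a handful of seeds the sample standard deviation is itself a noisy estimate, so the number of runs belongs next to any reported spread.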

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will incorporate to strengthen the presentation of our results and claims.

Point-by-point responses
  1. Referee: [Abstract and Experiments] The reported FID scores (1.61 at 256, 1.81 at 512) and text-to-image metrics lack error bars, standard deviations across runs, or statistical tests, which weakens assessment of whether the gains over prior pixel-space models are reliable and reproducible.

    Authors: We agree that reporting variability would improve assessment of reproducibility. The original results were obtained from single training runs per configuration. In the revised manuscript we will add results from at least three independent runs with different random seeds and report mean FID together with standard deviation for the ImageNet 256 and 512 settings. For the text-to-image metrics we will clarify the evaluation protocol and include any available run-to-run variation. These additions will allow readers to better judge the reliability of the reported improvements over prior pixel-space models.
    revision: partial

  2. Referee: [Method and Experiments] No ablation studies isolate the contribution of the pixel-level DiT versus a patch-only or single-level baseline, which is load-bearing for the central claim that the dual-level design efficiently models both semantics and details at full pixel dimensionality without latent compression.

    Authors: We concur that explicit ablations would more directly support the value of the dual-level architecture. While the current manuscript demonstrates overall performance through comparisons against existing pixel-space generators, we will add targeted ablation experiments in the revision. These will include a patch-only DiT baseline and a single-level full-pixel transformer, with corresponding FID and qualitative results, to quantify the separate contributions of the patch-level semantic modeling and the pixel-level detail refinement.
    revision: yes

  3. Referee: [Implementation or Experiments] The manuscript provides no parameter counts, FLOPs, memory usage, or scaling curves for the pixel-level transformer at native resolutions, leaving the efficiency and practicality claims relative to latent models unverified despite the higher input dimensionality.

    Authors: We thank the referee for highlighting the need for concrete efficiency metrics. In the revised manuscript we will include a dedicated table reporting parameter counts for the patch-level and pixel-level components, estimated FLOPs for forward passes at 256 and 512 resolutions, and peak memory usage during training and inference. We will also add a brief scaling discussion relating model size to performance at native resolution. These quantitative details will enable direct comparison with latent diffusion models and substantiate the practicality claims.
    revision: yes
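
The efficiency figures requested in the third exchange can be approximated on paper before any profiling. The sketch below counts weights and forward FLOPs for a generic transformer block (four attention projections plus a two-layer MLP). It is an assumption-laden estimate and not PixelDiT's actual block: it ignores biases, normalization, and any conditioning layers, and the function names, hidden size, and token count are hypothetical.

```python
def block_params(hidden: int, mlp_ratio: int = 4) -> int:
    """Weights in one generic transformer block (no biases, norms, or conditioning)."""
    attention = 4 * hidden * hidden          # Q, K, V and output projections
    mlp = 2 * mlp_ratio * hidden * hidden    # hidden -> r*hidden -> hidden
    return attention + mlp


def block_forward_flops(hidden: int, tokens: int, mlp_ratio: int = 4) -> int:
    """Approximate forward FLOPs for one block, counting a multiply-add as 2 FLOPs."""
    linear = 2 * tokens * block_params(hidden, mlp_ratio)  # all weight matmuls
    attention = 4 * tokens * tokens * hidden               # QK^T plus weighted sum of V
    return linear + attention


# Hypothetical configuration, chosen only to exercise the functions.
hidden, tokens = 1024, 256
print("params per block:", block_params(hidden))
print("forward FLOPs per block:", block_forward_flops(hidden, tokens))
```

The linear term grows proportionally with the token count while the attention term grows with its square, so the balance between the two shifts as resolution increases.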

Circularity Check

0 steps flagged

No significant circularity; results are empirical measurements on external benchmarks.

Full rationale

The paper proposes PixelDiT, a dual-level patch-plus-pixel transformer architecture for direct pixel-space diffusion without an autoencoder. Its central claims consist of architectural design choices and reported performance numbers (FID on ImageNet, GenEval, DPG-bench) obtained via training and evaluation against standard external datasets. No mathematical derivation, prediction, or first-principles result is presented that reduces by construction to fitted parameters, self-definitions, or self-citation chains. The performance figures are measured outcomes rather than tautological outputs of the model definition itself, rendering the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical performance of a new transformer architecture trained end-to-end in pixel space. No explicit free parameters, axioms, or invented physical entities are introduced beyond standard deep-learning hyperparameters and the assumption that the dual-level design is sufficient.

pith-pipeline@v0.9.0 · 5511 in / 1087 out tokens · 43749 ms · 2026-05-17T04:26:15.180121+00:00 · methodology



Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Matérn Noise for Triangulation-Agnostic Flow Matching on Meshes

    cs.GR 2026-05 unverdicted novelty 7.0

    Proposes discretized Matérn process noise for triangulation-agnostic flow matching on meshes with PoissonNet denoiser, tested on elastic states and humanoid poses for meshes exceeding one million triangles.

  2. Cast3: Translating numerical weather prediction principles into data-driven forecasting

    physics.ao-ph 2026-05 unverdicted novelty 7.0

    Cast3 translates NWP principles into a data-driven model using cubed-sphere grids, super-ensembles, and generative nudging to achieve state-of-the-art ensemble predictions that outperform baselines.

  3. PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion

    cs.CV 2026-05 unverdicted novelty 6.0

    PiD is a pixel diffusion decoder that performs latent-to-pixel conversion and 4-8x upsampling in one generative step, enabling early stopping of latent diffusion and achieving sub-second 2048x2048 decoding with claime...

  4. RiT: Vanilla Diffusion Transformers Suffice in Representation Space

    cs.CV 2026-05 conditional novelty 6.0

    A vanilla Diffusion Transformer trained via x-prediction on frozen DINOv2 features reaches FID 1.14 on ImageNet 256x256 with fewer parameters and faster sampling than prior DiT variants.

  5. Registers Matter for Pixel-Space Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    Register tokens enhance pixel-space DiT training and output quality via cleaner high-noise feature maps, and a dual-stream design adds further gains with little overhead.

  6. HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion

    cs.CV 2026-05 unverdicted novelty 6.0

    HyperDiT achieves FID 1.56 on ImageNet 256x256 in pixel space via hyper-connected cross-scale interactions, cross-attention, SA-RoPE, and VFM registers.

  7. L2P: Unlocking Latent Potential for Pixel Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.

  8. BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion

    cs.CL 2026-05 unverdicted novelty 6.0

    BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.

  9. FREPix: Frequency-Heterogeneous Flow Matching for Pixel-Space Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    FREPix achieves competitive FID scores on ImageNet by decomposing image generation into separate low- and high-frequency paths within a flow matching framework.

  10. Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Tuna-2 shows that direct pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive generation and stronger understanding at scale.

  11. PixelGen: Improving Pixel Diffusion with Perceptual Supervision

    cs.CV 2026-02 accept novelty 6.0

    PixelGen augments pixel diffusion with gated perceptual supervision to reach FID 5.11 on ImageNet-256 and GenEval 0.79 in text-to-image, narrowing the gap to latent methods without VAEs.

  12. PixIE: Prompted Pixel-Space Low-Light Image Enhancement

    cs.CV 2026-05 unverdicted novelty 5.0

    PixIE proposes a pixel-space low-light image enhancement framework using DINO-prompted blocks, spatial-channel compaction, and multi-receptive-field embeddings, reporting PSNR gains of 1.9-15.0% and LPIPS reductions o...

  13. FrequencyBooster: Full-Frequency Modeling for High-Fidelity Pixel Diffusion

    cs.CV 2026-05 unverdicted novelty 5.0

    FrequencyBooster reports state-of-the-art FID scores of 1.60 at 256x256 and 1.69 at 512x512 for pixel diffusion by using a specialized decoder for full-frequency modeling.

  14. SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

    cs.CV 2026-05 unverdicted novelty 5.0

    SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

  15. CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation

    physics.ins-det 2026-05 unverdicted novelty 5.0

    CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...

  16. Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.

  17. UniCSG: Unified High-Fidelity Content-Constrained Style-Driven Generation via Staged Semantic and Frequency Disentanglement

    cs.CV 2026-04 unverdicted novelty 5.0

    UniCSG adds staged semantic disentanglement and frequency-aware reconstruction to DiT diffusion models to improve content preservation and style fidelity in both text- and reference-guided generation.
