PixelDiT: Pixel Diffusion Transformers for Image Generation

Jiebo Luo; Shiqiu Liu; Weili Nie; Wei Xiong; Yichen Sheng; Yongsheng Yu

arxiv: 2511.20645 · v2 · submitted 2025-11-25 · 💻 cs.CV

Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, Jiebo Luo

Pith reviewed 2026-05-17 04:26 UTC · model grok-4.3

classification 💻 cs.CV

keywords diffusion transformers · pixel-space generation · image synthesis · dual-level architecture · text-to-image · ImageNet benchmarks · end-to-end training

The pith

PixelDiT runs diffusion directly in pixel space using a dual-level transformer that splits global semantics from local texture refinement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that diffusion transformers can be trained end-to-end on raw pixels instead of compressed latents. A patch-level transformer first models overall image structure while a pixel-level transformer sharpens fine details, removing the need for a separate autoencoder stage. This single-stage design avoids reconstruction errors and supports joint optimization across the full pipeline. The resulting model reaches 1.61 FID on ImageNet at 256 resolution and 1.81 FID at 512 resolution, and extends to text-to-image at 1024 resolution with scores close to leading latent models.

Core claim

PixelDiT is a single-stage, fully transformer-based diffusion model that performs the entire diffusion process directly in pixel space through a dual-level architecture: a patch-level DiT captures global semantics and a pixel-level DiT refines texture details, eliminating the pretrained autoencoder and its associated reconstruction losses while enabling efficient high-resolution training.

What carries the argument

Dual-level transformer design that combines a patch-level DiT for coarse semantics with a pixel-level DiT for detail refinement, allowing direct pixel-space diffusion without dimensionality reduction.
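To make the patch/pixel split concrete, here is a minimal Python sketch of how one image grid can be viewed at two granularities: as a short sequence of patch tokens and as the full set of per-pixel values. The function name `patchify`, the toy 4x4 single-channel image, and the patch size of 2 are illustrative assumptions; this page does not state the patch size or tokenization PixelDiT actually uses.

```python
from typing import List

Image = List[List[float]]  # toy single-channel image as rows of pixel values


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split a square image into non-overlapping patch x patch tiles.

    Each returned inner list holds the pixels of one tile in row-major
    order, so a patch-level model would see len(result) tokens while a
    pixel-level model would see every individual value.
    """
    size = len(image)
    if any(len(row) != size for row in image):
        raise ValueError("image must be square")
    if size % patch != 0:
        raise ValueError("image size must be divisible by patch size")

    tiles = []
    for top in range(0, size, patch):
        for left in range(0, size, patch):
            tile = [
                image[top + dy][left + dx]
                for dy in range(patch)
                for dx in range(patch)
            ]
            tiles.append(tile)
    return tiles


# Hypothetical 4x4 image whose pixel values are just their flat indices.
toy = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
tiles = patchify(toy, patch=2)
num_patch_tokens = len(tiles)                  # coarse sequence length
num_pixel_tokens = sum(len(t) for t in tiles)  # full-resolution length
```

In this toy case the coarse view has 4 tokens and the fine view has 16 values, which is the kind of gap a dual-level design has to bridge at real resolutions.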

If this is right

Removes error accumulation from autoencoder reconstruction in the generation pipeline.
Enables true end-to-end optimization of the diffusion process without frozen pretrained components.
Supports direct pixel-space training at 1024 resolution for text-to-image tasks.
Surpasses prior pixel-space generative models on standard ImageNet FID benchmarks.
Approaches the performance of the strongest latent diffusion models on GenEval and DPG-bench.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may simplify deployment by removing the need to train or maintain a separate autoencoder.
It could make high-resolution generation more straightforward in domains where defining a good latent space is difficult.
The split between patch and pixel levels suggests a general pattern for scaling transformers when operating on high-dimensional raw signals.

Load-bearing premise

The dual-level patch and pixel transformers can handle both global structure and fine details at full resolution without the compression that an autoencoder provides.

What would settle it

Measure whether PixelDiT at 1024 resolution requires substantially more compute or memory than a comparable latent diffusion model while delivering lower or equal FID and GenEval scores.
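A first-order way to frame this measurement is to compare token counts, since full self-attention cost grows with the square of sequence length. The sketch below is only a back-of-the-envelope proxy: the 8x autoencoder downsampling, the patch sizes, and the helper names are assumptions chosen for illustration, not configurations reported on this page, and real compute also depends on width, depth, and any token compaction.

```python
def sequence_length(resolution: int, downsample: int, patch: int) -> int:
    """Tokens per image for a square input.

    downsample: spatial reduction applied before the transformer
                (1 for pixel space, e.g. 8 for a hypothetical autoencoder).
    patch:      side length of each token's patch on that grid.
    """
    stride = downsample * patch
    if resolution % stride != 0:
        raise ValueError("resolution must be divisible by downsample * patch")
    side = resolution // stride
    return side * side


def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by one full self-attention layer."""
    return num_tokens * num_tokens


# Assumed settings for a 1024x1024 image; none of these come from the paper.
per_pixel = sequence_length(1024, downsample=1, patch=1)   # one token per pixel
pixel_patch16 = sequence_length(1024, downsample=1, patch=16)
latent_8x_p2 = sequence_length(1024, downsample=8, patch=2)

cost_ratio = attention_pairs(per_pixel) // attention_pairs(pixel_patch16)
```

Under these assumed settings the patch-level sequence in pixel space is as long as the latent one (4,096 tokens each), while attending over every pixel individually would score 65,536 times more pairs per layer. That is why the cost of the pixel-level pathway is the quantity worth measuring directly.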

Original abstract

Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. PixelDiT achieves 1.61 FID on ImageNet 256 and 1.81 FID on ImageNet 512, surpassing existing pixel generative models. We further extend PixelDiT to text-to-image generation and pretrain it at the 1024² resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models. Code: https://github.com/NVlabs/PixelDiT

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

PixelDiT makes a direct pixel-space DiT work with a patch-plus-pixel split and posts competitive FID numbers, but the efficiency and ablation gaps leave the scaling advantage unproven.

The key takeaway is that PixelDiT trains a full diffusion transformer directly on raw pixels by splitting the work between a patch-level component for global structure and a pixel-level component for fine detail. This single-stage approach avoids the reconstruction loss introduced by an autoencoder and still reaches 1.61 FID on ImageNet at 256 resolution and 1.81 at 512, improving on previous pixel-space generators. The authors also show that it scales to text-to-image generation at 1024 resolution with competitive scores on GenEval and DPG-bench.

The dual-level transformer is what sets this work apart from the latent DiT papers it references. Keeping everything in pixel space permits joint optimization without the two-stage bottleneck. The paper presents these results well and releases code, which lets others verify the claims.

Where it falls short is in the supporting evidence for the design choices. There are no ablations that isolate the contribution of the pixel-level DiT, no error bars on the FID values, and no information on training stability or on how memory and compute scale. The concern that the higher dimensionality of pixel space may create efficiency problems is valid here, because the abstract provides no FLOPs or parameter comparisons against latent baselines.

Overall, the paper is aimed at readers in the generative modeling community who are interested in end-to-end pixel diffusion transformers. A reader who follows DiT developments would find the architectural split worth studying. I would send it for peer review because the performance numbers are competitive and the core idea is clear enough that referees can evaluate the missing controls.

Referee Report

3 major / 1 minor

Summary. The paper proposes PixelDiT, a single-stage end-to-end pixel-space Diffusion Transformer that removes the latent autoencoder stage. It uses a dual-level architecture with a patch-level DiT to capture global semantics and a pixel-level DiT to refine fine texture details. The model is evaluated on class-conditional ImageNet generation at 256 and 512 resolutions and extended to text-to-image at 1024 resolution, reporting FID scores of 1.61 and 1.81 on ImageNet plus 0.74 on GenEval and 83.5 on DPG-bench.

Significance. If the empirical results prove robust, this would represent a meaningful advance by showing that direct pixel-space modeling with a dual-level transformer can match or exceed latent diffusion models without autoencoder-induced reconstruction loss or two-stage training. The approach could simplify generative pipelines and enable fully joint optimization, with the dual-level design offering a practical way to handle native-resolution inputs.

major comments (3)

[Abstract and Experiments] The reported FID scores (1.61 at 256, 1.81 at 512) and text-to-image metrics lack error bars, standard deviations across runs, or statistical tests, which weakens assessment of whether the gains over prior pixel-space models are reliable and reproducible.
[Method and Experiments] No ablation studies isolate the contribution of the pixel-level DiT versus a patch-only or single-level baseline, which is load-bearing for the central claim that the dual-level design efficiently models both semantics and details at full pixel dimensionality without latent compression.
[Implementation or Experiments] The manuscript provides no parameter counts, FLOPs, memory usage, or scaling curves for the pixel-level transformer at native resolutions, leaving the efficiency and practicality claims relative to latent models unverified despite the higher input dimensionality.
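
The run-to-run reporting requested in the first major comment could take a form like the following sketch. It uses only Python's standard library; the function name `summarize_runs` and the input values are placeholders for illustration and are not results from the paper or from any real training run.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, sample std, standard error) for per-seed metric values."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    avg = mean(scores)
    spread = stdev(scores)                    # sample standard deviation (n - 1)
    stderr = spread / math.sqrt(len(scores))  # uncertainty of the mean itself
    return avg, spread, stderr


# Placeholder values only, NOT results from the paper or any real run.
placeholder_scores = [1.0, 2.0, 3.0]
avg, spread, stderr = summarize_runs(placeholder_scores)
report = f"{avg:.2f} ± {spread:.2f} (n={len(placeholder_scores)})"
```

Reporting the number of seeds alongside the spread matters because a standard deviation estimated from very few runs is itself quite uncertain.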

minor comments (1)

[Abstract and Method] The abstract and method sections would benefit from explicit statements on training stability measures and any data filtering applied, to support the extension to 1024-resolution text-to-image pretraining.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will incorporate to strengthen the presentation of our results and claims.

Referee: [Abstract and Experiments] The reported FID scores (1.61 at 256, 1.81 at 512) and text-to-image metrics lack error bars, standard deviations across runs, or statistical tests, which weakens assessment of whether the gains over prior pixel-space models are reliable and reproducible.

Authors: We agree that reporting variability would improve assessment of reproducibility. The original results were obtained from single training runs per configuration. In the revised manuscript we will add results from at least three independent runs with different random seeds and report mean FID together with standard deviation for the ImageNet 256 and 512 settings. For the text-to-image metrics we will clarify the evaluation protocol and include any available run-to-run variation. These additions will allow readers to better judge the reliability of the reported improvements over prior pixel-space models.
revision: partial

Referee: [Method and Experiments] No ablation studies isolate the contribution of the pixel-level DiT versus a patch-only or single-level baseline, which is load-bearing for the central claim that the dual-level design efficiently models both semantics and details at full pixel dimensionality without latent compression.

Authors: We concur that explicit ablations would more directly support the value of the dual-level architecture. While the current manuscript demonstrates overall performance through comparisons against existing pixel-space generators, we will add targeted ablation experiments in the revision. These will include a patch-only DiT baseline and a single-level full-pixel transformer, with corresponding FID and qualitative results, to quantify the separate contributions of the patch-level semantic modeling and the pixel-level detail refinement.
revision: yes

Referee: [Implementation or Experiments] The manuscript provides no parameter counts, FLOPs, memory usage, or scaling curves for the pixel-level transformer at native resolutions, leaving the efficiency and practicality claims relative to latent models unverified despite the higher input dimensionality.

Authors: We thank the referee for highlighting the need for concrete efficiency metrics. In the revised manuscript we will include a dedicated table reporting parameter counts for the patch-level and pixel-level components, estimated FLOPs for forward passes at 256 and 512 resolutions, and peak memory usage during training and inference. We will also add a brief scaling discussion relating model size to performance at native resolution. These quantitative details will enable direct comparison with latent diffusion models and substantiate the practicality claims.
revision: yes
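
An efficiency table of the kind described in the third response could start from a rough analytical estimate such as the one below. It assumes a vanilla transformer block with a 4x MLP expansion and ignores biases, normalization, conditioning parameters (such as AdaLN modulation), and embeddings. The width, depth, and token count are hypothetical and are not PixelDiT's configuration; profiler measurements should replace such estimates in any real report.

```python
def block_params(width: int, mlp_ratio: int = 4) -> int:
    """Weights in one vanilla transformer block (no biases or norms)."""
    attention = 4 * width * width        # Q, K, V and output projections
    mlp = 2 * mlp_ratio * width * width  # up and down projections
    return attention + mlp


def block_forward_flops(width: int, tokens: int, mlp_ratio: int = 4) -> int:
    """Approximate forward FLOPs for one block on a single sequence.

    Counts 2 FLOPs per multiply-accumulate. Dense layers cost
    2 * params * tokens; the attention scores and the weighted sum of
    values each cost 2 * tokens^2 * width.
    """
    dense = 2 * block_params(width, mlp_ratio) * tokens
    attention_maps = 4 * tokens * tokens * width
    return dense + attention_maps


# Hypothetical configuration for illustration only.
width, depth, tokens = 1024, 24, 4096
total_params = depth * block_params(width)
total_flops = depth * block_forward_flops(width, tokens)
```

The quadratic `attention_maps` term is the part that grows fastest with resolution, so it is the natural place to look when comparing a pixel-space model against a latent baseline.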

Circularity Check

0 steps flagged

No significant circularity; results are empirical measurements on external benchmarks.

The paper proposes PixelDiT, a dual-level patch-plus-pixel transformer architecture for direct pixel-space diffusion without an autoencoder. Its central claims consist of architectural design choices and reported performance numbers (FID on ImageNet, GenEval, DPG-bench) obtained via training and evaluation against standard external datasets. No mathematical derivation, prediction, or first-principles result is presented that reduces by construction to fitted parameters, self-definitions, or self-citation chains. The performance figures are measured outcomes rather than tautological outputs of the model definition itself, rendering the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical performance of a new transformer architecture trained end-to-end in pixel space. No explicit free parameters, axioms, or invented physical entities are introduced beyond standard deep-learning hyperparameters and the assumption that the dual-level design is sufficient.

pith-pipeline@v0.9.0 · 5511 in / 1087 out tokens · 43749 ms · 2026-05-17T04:26:15.180121+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Paper passage: PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details... pixel-wise AdaLN modulation... pixel token compaction mechanism

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: We propose PixelDiT, a single-stage, fully transformer-based pixel-space diffusion model... efficient pixel modeling via pixel-wise AdaLN and token compaction

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Matérn Noise for Triangulation-Agnostic Flow Matching on Meshes
cs.GR 2026-05 unverdicted novelty 7.0

Proposes discretized Matérn process noise for triangulation-agnostic flow matching on meshes with PoissonNet denoiser, tested on elastic states and humanoid poses for meshes exceeding one million triangles.
Cast3: Translating numerical weather prediction principles into data-driven forecasting
physics.ao-ph 2026-05 unverdicted novelty 7.0

Cast3 translates NWP principles into a data-driven model using cubed-sphere grids, super-ensembles, and generative nudging to achieve state-of-the-art ensemble predictions that outperform baselines.
PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion
cs.CV 2026-05 unverdicted novelty 6.0

PiD is a pixel diffusion decoder that performs latent-to-pixel conversion and 4-8x upsampling in one generative step, enabling early stopping of latent diffusion and achieving sub-second 2048x2048 decoding with claime...
RiT: Vanilla Diffusion Transformers Suffice in Representation Space
cs.CV 2026-05 conditional novelty 6.0

A vanilla Diffusion Transformer trained via x-prediction on frozen DINOv2 features reaches FID 1.14 on ImageNet 256x256 with fewer parameters and faster sampling than prior DiT variants.
Registers Matter for Pixel-Space Diffusion Transformers
cs.CV 2026-05 unverdicted novelty 6.0

Register tokens enhance pixel-space DiT training and output quality via cleaner high-noise feature maps, and a dual-stream design adds further gains with little overhead.
HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion
cs.CV 2026-05 unverdicted novelty 6.0

HyperDiT achieves FID 1.56 on ImageNet 256x256 in pixel space via hyper-connected cross-scale interactions, cross-attention, SA-RoPE, and VFM registers.
L2P: Unlocking Latent Potential for Pixel Generation
cs.CV 2026-05 unverdicted novelty 6.0

L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.
BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion
cs.CL 2026-05 unverdicted novelty 6.0

BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.
FREPix: Frequency-Heterogeneous Flow Matching for Pixel-Space Image Generation
cs.CV 2026-05 unverdicted novelty 6.0

FREPix achieves competitive FID scores on ImageNet by decomposing image generation into separate low- and high-frequency paths within a flow matching framework.
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
cs.CV 2026-04 unverdicted novelty 6.0

Tuna-2 shows that direct pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive generation and stronger understanding at scale.
PixelGen: Improving Pixel Diffusion with Perceptual Supervision
cs.CV 2026-02 accept novelty 6.0

PixelGen augments pixel diffusion with gated perceptual supervision to reach FID 5.11 on ImageNet-256 and GenEval 0.79 in text-to-image, narrowing the gap to latent methods without VAEs.
PixIE: Prompted Pixel-Space Low-Light Image Enhancement
cs.CV 2026-05 unverdicted novelty 5.0

PixIE proposes a pixel-space low-light image enhancement framework using DINO-prompted blocks, spatial-channel compaction, and multi-receptive-field embeddings, reporting PSNR gains of 1.9-15.0% and LPIPS reductions o...
FrequencyBooster: Full-Frequency Modeling for High-Fidelity Pixel Diffusion
cs.CV 2026-05 unverdicted novelty 5.0

FrequencyBooster reports state-of-the-art FID scores of 1.60 at 256x256 and 1.69 at 512x512 for pixel diffusion by using a specialized decoder for full-frequency modeling.
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
cs.CV 2026-05 unverdicted novelty 5.0

SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation
physics.ins-det 2026-05 unverdicted novelty 5.0

CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
cs.CV 2026-04 unverdicted novelty 5.0

Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.
UniCSG: Unified High-Fidelity Content-Constrained Style-Driven Generation via Staged Semantic and Frequency Disentanglement
cs.CV 2026-04 unverdicted novelty 5.0

UniCSG adds staged semantic disentanglement and frequency-aware reconstruction to DiT diffusion models to improve content preservation and style fidelity in both text- and reference-guided generation.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 16 Pith papers · 8 internal anchors

[1]

Flowedit: Inversion-free text-based editing using pre-trained flow models

Vladimir Kulikov, Matan Kleiner, Inbar Huberman- Spiegelglas, and Tomer Michaeli. Flowedit: Inversion-free text-based editing using pre-trained flow models. InICCV, pages 19721–19730, 2025

work page 2025
[2]

Black Forest Labs. Flux. https://github.com/black- forest-labs/flux, 2024

work page 2024
[3]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yan- nik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. InICML, 2024

work page 2024
[4]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InCVPR, 2022

work page 2022
[5]

Scalable diffusion mod- els with transformers

William Peebles and Saining Xie. Scalable diffusion mod- els with transformers. InICCV, 2023

work page 2023
[6]

Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. InECCV, 2024

work page 2024
[7]

Reconstruc- tion vs

Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruc- tion vs. generation: Taming optimization dilemma in latent diffusion models. InCVPR, 2025

work page 2025
[8]

Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers

Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers. InICCV, 2025

work page 2025
[9]

Playground v3: Improving text-to-image alignment with deep-fusion large language models

Bingchen Liu, Ehsan Akhgari, Alexander Visheratin, Aleks Kamko, Linmiao Xu, Shivam Shrirao, Joao Souza, Suhail Doshi, and Daiqing Li. Playground v3: Improving text-to- image alignment with deep-fusion large language models. arXiv preprint arXiv:2409.10695, 2024

work page arXiv 2024
[10]

Dalle-3, 2023

OpenAI. Dalle-3, 2023

work page 2023
[11]

Diffusion Transformers with Representation Autoencoders

Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoen- coders.arXiv preprint arXiv:2510.11690, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Pixnerd: Pixel neural field diffusion.arXiv preprint arXiv:2507.23268, 2025

Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, and Limin Wang. Pixnerd: Pixel neural field diffusion.arXiv preprint arXiv:2507.23268, 2025

work page arXiv 2025
[13]

Jetformer: An autoregressive generative model of raw images and text

Michael Tschannen, André Susano Pinto, and Alexander Kolesnikov. Jetformer: An autoregressive generative model of raw images and text. InICLR, 2025

work page 2025
[14]

FARMER: Flow autoregressive transformer over pixels.arXiv preprint arXiv:2510.23588, 2025

Guangting Zheng, Qinyu Zhao, Tao Yang, Fei Xiao, Zhijie Lin, Jie Wu, Jiajun Deng, Yanyong Zhang, and Rui Zhu. Farmer: Flow autoregressive transformer over pixels.arXiv preprint arXiv:2510.23588, 2025

work page arXiv 2025
[15]

Advancing end- to-end pixel space generative modeling via self-supervised pre-training.arXiv preprint arXiv:2510.12586, 2025

Jiachen Lei, Keli Liu, Julius Berner, Haiming Yu, Hongkai Zheng, Jiahong Wu, and Xiangxiang Chu. Advancing end- to-end pixel space generative modeling via self-supervised pre-training.arXiv preprint arXiv:2510.12586, 2025

work page arXiv 2025
[16]

Back to Basics: Let Denoising Generative Models Denoise

Kaiming He Tianhong Li. Back to basics: Let de- noising generative models denoise.arXiv preprint arXiv:2511.13720, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

arXiv preprint arXiv:2504.07963 (2025)

Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, and Ping Luo. Pixelflow: Pixel-space generative models with flow.arXiv preprint arXiv:2504.07963, 2025. 9 PixelDiT: Pixel Diffusion Transformers for Image Generation

work page arXiv 2025
[18]

Simple diffusion: End-to-end diffusion for high resolution images

Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. Simple diffusion: End-to-end diffusion for high resolution images. InICML, 2023

work page 2023
[19]

Simpler diffusion (sid2): 1.5 fid on imagenet512 with pixel-space diffusion

Emiel Hoogeboom, Thomas Mensink, Jonathan Heek, Kay Lamerigts, Ruiqi Gao, and Tim Salimans. Simpler diffusion (sid2): 1.5 fid on imagenet512 with pixel-space diffusion. InCVPR, 2025

work page 2025
[20]

Cascaded diffusion models for high fidelity image generation.Journal of Machine Learning Research, 23(47):1–33, 2022

Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation.Journal of Machine Learning Research, 23(47):1–33, 2022

work page 2022
[21]

Fractal generative models

Tianhong Li, Qinyi Sun, Lijie Fan, and Kaiming He. Fractal generative models.arXiv preprint arXiv:2502.17437, 2025

work page arXiv 2025
[22]

Deep compression autoencoder for efficient high-resolution diffusion models

Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, and Song Han. Deep compression autoencoder for efficient high-resolution diffusion models. InICLR, 2025

work page 2025
[23]

Dc-ae 1.5: Accelerating diffusion model convergence with structured latent space, 2025

Junyu Chen, Dongyun Zou, Wenkun He, Junsong Chen, Enze Xie, Song Han, and Han Cai. Dc-ae 1.5: Accelerating diffusion model convergence with structured latent space, 2025

work page 2025
[24]

Masked autoencoders are effective tok- enizers for diffusion models

Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, and Bhiksha Raj. Masked autoencoders are effective tok- enizers for diffusion models. InICML, 2025

work page 2025
[25]

Zipir: Latent pyramid diffusion transformer for high-resolution image restoration

Yongsheng Yu, Haitian Zheng, Zhifei Zhang, Jianming Zhang, Yuqian Zhou, Connelly Barnes, Yuchen Liu, Wei Xiong, Zhe Lin, and Jiebo Luo. Zipir: Latent pyramid diffusion transformer for high-resolution image restoration. arXiv preprint arXiv:2504.08591, 2025

work page arXiv 2025
[26]

Latent diffusion model without variational autoencoder.arXiv preprint arXiv:2510.15301, 2025

Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational au- toencoder.arXiv preprint arXiv:2510.15301, 2025

work page arXiv 2025
[27]

Diffusion models beat GANs on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. InNeurIPS, 2021

work page 2021
[28]

Ddt: Decoupled diffusion transformer, 2025

Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. Ddt: Decoupled diffusion transformer, 2025

work page 2025
[29]

Richter, Christo- pher Pal, and Marc Aubreville

Pablo Pernias, Dominic Rampas, Mats L. Richter, Christo- pher Pal, and Marc Aubreville. Würstchen: An efficient architecture for large-scale text-to-image diffusion models. InICLR, 2024

work page 2024
[30]

Fast training of diffusion models with masked transformers.TMLR, 2023

Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers.TMLR, 2023

work page 2023
[31]

Representation alignment for generation: Training diffu- sion transformers is easier than you think

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffu- sion transformers is easier than you think. InICLR, 2025

work page 2025
[32]

Stylegan- xl: Scaling stylegan to large diverse datasets

Axel Sauer, Katja Schwarz, and Andreas Geiger. Stylegan- xl: Scaling stylegan to large diverse datasets. InSIG- GRAPH, 2022

work page 2022
[33]

Scalable adaptive computation for iterative generation

Allan Jabri, David Fleet, and Ting Chen. Scalable adaptive computation for iterative generation. InICML, 2023

work page 2023
[34]

Understanding diffu- sion objectives as the elbo with simple data augmentation

Diederik Kingma and Ruiqi Gao. Understanding diffu- sion objectives as the elbo with simple data augmentation. NeurIPS, 36, 2024

work page 2024
[35]

Gemma 2: Improving Open Language Models at a Practical Size

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Sana: Efficient high-resolution text-to-image synthesis with linear diffusion transformers

Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, et al. Sana: Efficient high-resolution text-to-image synthesis with linear diffusion transformers. InICLR, 2025

work page 2025
[37]

Flow straight and fast: Learning to generate and transfer data with rectified flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InICLR, 2023

work page 2023
[38]

Dinov2: Learning robust visual features without supervi- sion

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervi- sion. InTMLR, 2023

work page 2023
[39]

Imagenet large scale visual recognition challenge.International journal of computer vision, 115(3):211–252, 2015

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpa- thy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge.International journal of computer vision, 115(3):211–252, 2015

work page 2015
[40]

Applying guid- ance in a limited interval improves sample and distribution quality in diffusion models

Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, and Jaakko Lehtinen. Applying guid- ance in a limited interval improves sample and distribution quality in diffusion models. InNeurIPS, 2024

work page 2024
[41]

Geneval: An object-focused framework for evaluating text- to-image alignment.Advances in Neural Information Pro- cessing Systems, 36, 2024

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text- to-image alignment.Advances in Neural Information Pro- cessing Systems, 36, 2024

work page 2024
[42]

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 2024

[43] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022.

[44] Junsong Chen, YU Jincheng, GE Chongjian, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-𝛼: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In ICLR, 2024.

[45] Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-𝜎: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. In ECCV, 2024.

[46] Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Lirui Zhao, Fu-Yun Wang, Zhanyu Ma, et al. Lumina-next: Making lumina-t2x stronger and faster with next-dit. arXiv preprint arXiv:2406.18583, 2024.

[47] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.

[48] Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024.

[49] Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding. arXiv preprint arXiv:2405.08748, 2024.

[50] I Loshchilov. Decoupled weight decay regularization. In ICLR, 2019.

A. Architecture and System Details

A.1. Summary of Model Size

To study the impact of model size, we evaluate the base (B), large (L), and extra-large (XL) variants of PixelDiT on ImageNet 256×256. Tables 6 and 7 summarize the ...
