
PixelGen: Improving Pixel Diffusion with Perceptual Supervision

12 Pith papers cite this work. Polarity classification is still indexing.

abstract

Pixel diffusion generates images directly in pixel space, avoiding the VAE artifacts and representational bottlenecks of two-stage latent diffusion. Recent JiT further simplifies pixel diffusion with x-prediction, where the model predicts clean images rather than velocity. However, the standard pixel-wise diffusion loss treats all pixels equally, spending model capacity on perceptually insignificant signals and often leading to blurry samples. We propose PixelGen, an end-to-end pixel diffusion framework that augments x-prediction with perceptual supervision. Specifically, PixelGen introduces two complementary perceptual losses on top of x-prediction: an LPIPS loss for local textures and a P-DINO loss for global semantics. To preserve sample coverage, PixelGen further proposes a noise-gating strategy that applies these losses only at lower-noise timesteps. On ImageNet-256 without classifier-free guidance, PixelGen achieves an FID of 5.11 in 80 training epochs, surpassing the latent diffusion baselines. Moreover, PixelGen scales efficiently to text-to-image generation, reaching a GenEval score of 0.79 with only 6 days of training on 8xH800 GPUs. These results show that perceptual supervision substantially narrows the gap between pixel and latent diffusion while preserving a simple one-stage pipeline. Code is available at https://github.com/Zehong-Ma/PixelGen.
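
The noise-gating idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `gate_threshold` value, the loss weights, the convention that `noise_level` runs from 0 (clean) to 1 (pure noise), and the toy perceptual term are all assumptions made for illustration. A real setup would pass actual LPIPS and P-DINO losses as the perceptual terms.

```python
from typing import Callable, Sequence

Image = Sequence[float]  # flattened image; a stand-in for a real tensor
PerceptualTerm = Callable[[Image, Image], float]


def mse(pred: Image, target: Image) -> float:
    """Pixel-wise mean squared error between the x-prediction and the clean image."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def gated_loss(
    pred: Image,
    target: Image,
    noise_level: float,                       # assumed convention: 0 = clean, 1 = pure noise
    perceptual_terms: Sequence[PerceptualTerm],
    weights: Sequence[float],                 # hypothetical weights, one per term
    gate_threshold: float = 0.5,              # hypothetical cutoff, not from the paper
) -> float:
    """Pixel loss everywhere; perceptual terms only at lower-noise timesteps."""
    loss = mse(pred, target)
    if noise_level < gate_threshold:          # the gate: skip perceptual terms at high noise
        for term, weight in zip(perceptual_terms, weights):
            loss += weight * term(pred, target)
    return loss


def toy_perceptual(pred: Image, target: Image) -> float:
    """Placeholder for LPIPS or P-DINO: mean absolute difference."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


if __name__ == "__main__":
    pred, target = [0.0, 0.0], [1.0, 1.0]
    terms, weights = [toy_perceptual, toy_perceptual], [0.5, 0.25]
    print(gated_loss(pred, target, 0.9, terms, weights))  # high noise: pixel loss only
    print(gated_loss(pred, target, 0.1, terms, weights))  # low noise: perceptual terms added
```

A hard threshold is only one possible gate; a smooth weighting over noise levels would fit the same description in the abstract.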

citation-role summary

background 1

citation-polarity summary

years

2026 12


representative citing papers

Asymmetric Flow Models

cs.CV · 2026-05-13 · unverdicted · novelty 7.0 · 2 refs

AsymFlow uses rank-asymmetric velocity prediction to reach 1.57 FID on ImageNet 256x256 and enables finetuning of latent flow models into superior pixel-space text-to-image generators.

PixelU: A U-Shaped Transformer for Efficient End-to-End Pixel Diffusion

cs.CV · 2026-06-26 · unverdicted · novelty 6.0

PixelU is a minimalist U-shaped Diffusion Transformer for pixel-space diffusion that decouples frequencies with zero-cost skip connections and constant-channel downsampling, outperforming baselines like JiT-G at 1/3 the compute cost with FID 1.63 on ImageNet 256x256.

DiffusionBench: On Holistic Evaluation of Diffusion Transformers

cs.CV · 2026-06-23 · conditional · novelty 6.0

NanoGen unifies DiT training on ImageNet and T2I, reveals negative Pearson correlations (-0.377 to -0.580) in method rankings across metrics from 21 models, and motivates DiffusionBench for holistic evaluation.

L2P: Unlocking Latent Potential for Pixel Generation

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.

Spectral Progressive Diffusion for Efficient Image and Video Generation

cs.CV · 2026-05-18 · unverdicted · novelty 5.0 · 2 refs

Spectral Progressive Diffusion progressively grows resolution during denoising of pretrained diffusion models via spectral noise expansion and a power-spectrum-derived schedule, enabling training-free speedups and a fine-tuning recipe.
