PixelGen: Improving Pixel Diffusion with Perceptual Supervision

8 Pith papers cite this work. Polarity classification is still indexing.

abstract

Pixel diffusion generates images directly in pixel space, avoiding the VAE artifacts and representational bottlenecks of two-stage latent diffusion. Recent JiT further simplifies pixel diffusion with x-prediction, where the model predicts clean images rather than velocity. However, the standard pixel-wise diffusion loss treats all pixels equally, spending model capacity on perceptually insignificant signals and often leading to blurry samples. We propose PixelGen, an end-to-end pixel diffusion framework that augments x-prediction with perceptual supervision. Specifically, PixelGen introduces two complementary perceptual losses on top of x-prediction: an LPIPS loss for local textures and a P-DINO loss for global semantics. To preserve sample coverage, PixelGen further proposes a noise-gating strategy that applies these losses only at lower-noise timesteps. On ImageNet-256 without classifier-free guidance, PixelGen achieves an FID of 5.11 in 80 training epochs, surpassing the latent diffusion baselines. Moreover, PixelGen scales efficiently to text-to-image generation, reaching a GenEval score of 0.79 with only 6 days of training on 8xH800 GPUs. These results show that perceptual supervision substantially narrows the gap between pixel and latent diffusion while preserving a simple one-stage pipeline. Code is available at https://github.com/Zehong-Ma/PixelGen.
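
The noise-gating idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the hard threshold, the `noise_threshold` value, the loss weights, and the convention that `noise_level` lies in [0, 1] with 1 meaning pure noise are all assumptions made for illustration. The two perceptual terms are passed in as callables, and `l1_stand_in` is a hypothetical placeholder used only so the example runs without LPIPS or DINO models.

```python
from typing import Callable, Sequence

Image = Sequence[float]  # a flattened image, for illustration only


def mse(pred: Image, target: Image) -> float:
    """Pixel-wise mean squared error between the x-prediction and the clean image."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def l1_stand_in(pred: Image, target: Image) -> float:
    """Hypothetical placeholder for a perceptual loss (NOT LPIPS or P-DINO)."""
    return sum(abs(p - q) for p, q in zip(pred, target)) / len(target)


def gated_loss(
    pred: Image,
    target: Image,
    noise_level: float,
    local_loss: Callable[[Image, Image], float],
    global_loss: Callable[[Image, Image], float],
    noise_threshold: float = 0.5,  # assumed value, not from the paper
    w_local: float = 1.0,          # assumed weight
    w_global: float = 1.0,         # assumed weight
) -> float:
    """Pixel loss at every timestep; perceptual terms only at lower noise."""
    total = mse(pred, target)
    if noise_level < noise_threshold:  # assumed hard gate
        total += w_local * local_loss(pred, target)
        total += w_global * global_loss(pred, target)
    return total


pred, target = [0.0, 1.0], [0.0, 0.0]
high_noise = gated_loss(pred, target, 0.9, l1_stand_in, l1_stand_in)
low_noise = gated_loss(pred, target, 0.1, l1_stand_in, l1_stand_in)
print(high_noise, low_noise)
```

In this sketch the high-noise call returns only the pixel-wise term, while the low-noise call adds both stand-in perceptual terms. In practice the two callables would wrap an LPIPS network and a DINO-feature distance.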

citation-role summary

background 1

citation-polarity summary

fields

cs.CV 8

years

2026 8

verdicts

UNVERDICTED 8

polarities

background 1

representative citing papers

Asymmetric Flow Models

cs.CV · 2026-05-13 · unverdicted · novelty 7.0 · 2 refs

AsymFlow uses rank-asymmetric velocity prediction to reach 1.57 FID on ImageNet 256x256 and enables finetuning of latent flow models into superior pixel-space text-to-image generators.

PixelU: A U-Shaped Transformer for Efficient End-to-End Pixel Diffusion

cs.CV · 2026-06-26 · unverdicted · novelty 6.0

PixelU is a minimalist U-shaped Diffusion Transformer for pixel-space diffusion that decouples frequencies with zero-cost skip connections and constant-channel downsampling, outperforming baselines like JiT-G at 1/3 the compute cost with FID 1.63 on ImageNet 256x256.

L2P: Unlocking Latent Potential for Pixel Generation

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.

Spectral Progressive Diffusion for Efficient Image and Video Generation

cs.CV · 2026-05-18 · unverdicted · novelty 5.0 · 2 refs

Spectral Progressive Diffusion progressively grows resolution during denoising of pretrained diffusion models via spectral noise expansion and a power-spectrum-derived schedule, enabling training-free speedups and a fine-tuning recipe.

citing papers explorer

Showing 8 of 8 citing papers.

  • Asymmetric Flow Models · cs.CV · 2026-05-13 · unverdicted · none · ref 48 · 2 links · internal anchor

    AsymFlow uses rank-asymmetric velocity prediction to reach 1.57 FID on ImageNet 256x256 and enables finetuning of latent flow models into superior pixel-space text-to-image generators.

  • Structure-Adaptive Sparse Diffusion in Voxel Space for 3D Medical Image Enhancement · cs.CV · 2026-04-20 · unverdicted · none · ref 18 · internal anchor

    A sparse voxel-space diffusion method with structure-adaptive modulation achieves up to 10x training speedup and state-of-the-art results for 3D medical image denoising and super-resolution.

  • PixelU: A U-Shaped Transformer for Efficient End-to-End Pixel Diffusion · cs.CV · 2026-06-26 · unverdicted · none · ref 30 · internal anchor

    PixelU is a minimalist U-shaped Diffusion Transformer for pixel-space diffusion that decouples frequencies with zero-cost skip connections and constant-channel downsampling, outperforming baselines like JiT-G at 1/3 the compute cost with FID 1.63 on ImageNet 256x256.

  • GPIC: A Giant Permissive Image Corpus for Visual Generation · cs.CV · 2026-05-28 · unverdicted · none · ref 34 · internal anchor

    GPIC is a new 28-trillion-pixel permissively licensed image corpus with 100M training examples for visual generative modeling.

  • L2P: Unlocking Latent Potential for Pixel Generation · cs.CV · 2026-05-12 · unverdicted · none · ref 16 · internal anchor

    L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.

  • Spectral Progressive Diffusion for Efficient Image and Video Generation · cs.CV · 2026-05-18 · unverdicted · none · ref 56 · 2 links · internal anchor

    Spectral Progressive Diffusion progressively grows resolution during denoising of pretrained diffusion models via spectral noise expansion and a power-spectrum-derived schedule, enabling training-free speedups and a fine-tuning recipe.

  • FrequencyBooster: Full-Frequency Modeling for High-Fidelity Pixel Diffusion · cs.CV · 2026-05-18 · unverdicted · none · ref 16 · internal anchor

    FrequencyBooster reports state-of-the-art FID scores of 1.60 at 256x256 and 1.69 at 512x512 for pixel diffusion by using a specialized decoder for full-frequency modeling.

  • HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion · cs.CV · 2026-05-15 · unverdicted · none · ref 41 · 2 links · internal anchor

    HyperDiT reports FID 1.56 on ImageNet 256x256 using hyper-connected cross-scale attention, SA-RoPE, and VFM registers in pixel space.