Scaling text-to-image diffusion transformers with representation autoencoders

Shengbang Tong, Boyang Zheng, Ziteng Wang, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, Saining Xie · 2026 · arXiv 2601.16208

Canonical reference. 88% of citing Pith papers cite this work as background.

14 Pith papers citing it

citation-role summary

background: 7 · baseline: 1

citation-polarity summary

background: 7 · baseline: 1

representative citing papers

Coevolving Representations in Joint Image-Feature Diffusion

cs.CV · 2026-04-19 · unverdicted · novelty 7.0

CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample quality than fixed-representation baselines.

Setting-Matched and Semantics-Scaled Benchmarking of One-Step Generative Models Against Multistep Diffusion and Flow Models

cs.CV · 2026-03-15 · unverdicted · novelty 7.0

Matched benchmarking reveals FID misleads in few-step regimes under CFG, prompting CLIP-scaled and PickScore-scaled FID and IS variants for better semantic evaluation of one-step image generators.

RiT: Vanilla Diffusion Transformers Suffice in Representation Space

cs.CV · 2026-05-21 · conditional · novelty 6.0

A vanilla Diffusion Transformer trained via x-prediction on frozen DINOv2 features reaches FID 1.14 on ImageNet 256x256 with fewer parameters and faster sampling than prior DiT variants.

Spatial Gram Alignment for Ultra-High-Resolution Image Synthesis

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

Spatial Gram Alignment aligns internal self-similarities of LDM features with foundation priors to reconcile global structure and fine details in ultra-high-resolution text-to-image synthesis.

Improved Baselines with Representation Autoencoders

cs.CV · 2026-05-18 · conditional · novelty 6.0

RAE v2 reaches gFID 1.06 on ImageNet-256 in 80 epochs by combining multi-layer encoder sums, complementary REPA targets, and free guidance via output reparameterization.

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.

ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

ViTok-v2 is a 5B-parameter native-resolution image autoencoder using NaFlex and DINOv3 loss that matches or exceeds prior tokenizers at 256p and outperforms them at 512p and above while advancing the Pareto frontier in joint scaling with generators.

Taming Outlier Tokens in Diffusion Transformers

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.

Latent Denoising Improves Visual Alignment in Large Multimodal Models

cs.CV · 2026-04-23 · unverdicted · novelty 6.0

A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.

Continuous Adversarial Flow Models

cs.LG · 2026-04-13 · unverdicted · novelty 6.0

Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-image benchmarks.

Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models

cs.CV · 2026-05-07 · unverdicted · novelty 5.0

Semantic latent spaces from pretrained encoders outperform reconstruction-based spaces for robotic world models on planning and downstream policy performance.

Steering Visual Generation in Unified Multimodal Models with Understanding Supervision

cs.CV · 2026-05-07 · unverdicted · novelty 5.0

Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.

Video Generation with Predictive Latents

cs.CV · 2026-05-04 · unverdicted · novelty 5.0

PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.

Asymmetric Flow Models

cs.CV · 2026-05-13

citing papers explorer

Showing 14 of 14 citing papers.

Coevolving Representations in Joint Image-Feature Diffusion · cs.CV · 2026-04-19 · unverdicted · none · ref 42
Setting-Matched and Semantics-Scaled Benchmarking of One-Step Generative Models Against Multistep Diffusion and Flow Models · cs.CV · 2026-03-15 · unverdicted · none · ref 24
RiT: Vanilla Diffusion Transformers Suffice in Representation Space · cs.CV · 2026-05-21 · conditional · none · ref 35
Spatial Gram Alignment for Ultra-High-Resolution Image Synthesis · cs.CV · 2026-05-20 · unverdicted · none · ref 34
Improved Baselines with Representation Autoencoders · cs.CV · 2026-05-18 · conditional · none · ref 51
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion · cs.CV · 2026-05-08 · unverdicted · none · ref 80
ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters · cs.CV · 2026-05-06 · unverdicted · none · ref 13
Taming Outlier Tokens in Diffusion Transformers · cs.CV · 2026-05-06 · unverdicted · none · ref 31
Latent Denoising Improves Visual Alignment in Large Multimodal Models · cs.CV · 2026-04-23 · unverdicted · none · ref 81
Continuous Adversarial Flow Models · cs.LG · 2026-04-13 · unverdicted · none · ref 69
Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models · cs.CV · 2026-05-07 · unverdicted · none · ref 60
Steering Visual Generation in Unified Multimodal Models with Understanding Supervision · cs.CV · 2026-05-07 · unverdicted · none · ref 49
Video Generation with Predictive Latents · cs.CV · 2026-05-04 · unverdicted · none · ref 44
Asymmetric Flow Models · cs.CV · 2026-05-13 · unreviewed · ref 64
