Masked diffusion transformer is a strong image synthesizer
9 Pith papers cite this work.
citing papers explorer
- Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
  A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
- RiT: Vanilla Diffusion Transformers Suffice in Representation Space
  A vanilla Diffusion Transformer trained via x-prediction on frozen DINOv2 features reaches FID 1.14 on ImageNet 256x256 with fewer parameters and faster sampling than prior DiT variants.
- Beyond Point-Wise Matching: Structural Representation Alignment for Accelerating Diffusion Transformers
  sREPA enforces structural consistency in relational geometry of pre-trained vision features to accelerate DiT training and improve generation quality.
- Frequency-Aware Flow Matching for High-Quality Image Generation
  FreqFlow introduces frequency-aware conditioning and a two-branch architecture to flow matching, reaching FID 1.38 on ImageNet-256 and outperforming DiT and SiT.
- Diagnosing and Improving Diffusion Models by Estimating the Optimal Loss Value
  Derives closed-form optimal loss for unified diffusion models, provides variance-controlled estimators, and shows improved diagnosis, training schedules, and power-law scaling after subtracting the optimal value.
- Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
  Aligning noisy hidden states in diffusion transformers to clean features from pretrained visual encoders speeds up training over 17x and reaches FID 1.42.
- Elucidating Representation Degradation Problem in Diffusion Model Training
  Diffusion models suffer representation degradation at high noise due to recoverability mismatch; ERD mitigates this by dynamic optimization reallocation, accelerating convergence across backbones.
- Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
  The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.
- SRC-Flow: Compact Semantic Representations Enable Normalizing Flows for Image Generation