Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
Pith reviewed 2026-05-12 15:04 UTC · model grok-4.3
The pith
Aligning the hidden states of diffusion transformers to high-quality representations from pretrained encoders makes training far easier and produces better images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that REPresentation Alignment (REPA) improves both the efficiency and quality of training diffusion and flow-based transformers by aligning the projections of noisy hidden states in the denoising network with clean image representations obtained from external pretrained visual encoders.
What carries the argument
REPA, a regularization term that aligns the model's noisy-state hidden representations to those of a fixed pretrained encoder on clean images.
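To make the regularizer concrete, here is a minimal sketch of an alignment term of this kind. It is an illustration under stated assumptions, not the authors' implementation: the use of negative cosine similarity, the single linear projection, and the names `cosine`, `project`, and `repa_alignment_loss` are all hypothetical choices that this page does not specify.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # small epsilon avoids division by zero

def project(h, weight):
    """Hypothetical linear projection head; `weight` is a list of output rows."""
    return [sum(w * x for w, x in zip(row, h)) for row in weight]

def repa_alignment_loss(noisy_hidden, clean_targets, weight):
    """Mean negative cosine similarity over patch tokens (assumed similarity).

    noisy_hidden:  per-patch hidden states of the denoiser on a noisy input.
    clean_targets: per-patch features of a frozen encoder on the clean image.
    """
    sims = [cosine(project(h, weight), y)
            for h, y in zip(noisy_hidden, clean_targets)]
    return -sum(sims) / len(sims)

# Toy check: two patches, 2-d features, identity projection.
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = [[1.0, 0.0], [0.0, 2.0]]
aligned = [[2.0, 0.0], [0.0, 1.0]]      # same directions as `hidden`
orthogonal = [[0.0, 1.0], [1.0, 0.0]]   # perpendicular to `hidden`

loss_aligned = repa_alignment_loss(hidden, aligned, identity)
loss_orthogonal = repa_alignment_loss(hidden, orthogonal, identity)
```

In this toy setup the loss is lowest when the projected hidden states point in the same direction as the encoder features, which is the behaviour an alignment regularizer is meant to reward.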
If this is right
- Training of SiT-XL with REPA matches the performance (without classifier-free guidance) of a 7M-step baseline in fewer than 400K steps, a speedup of over 17.5 times.
- Final generation quality reaches a state-of-the-art FID of 1.42 when using classifier-free guidance with the guidance interval.
- The same gains appear across multiple popular diffusion transformer architectures without needing heavy hyperparameter adjustments.
- Models no longer have to learn discriminative representations entirely through the generative denoising process.
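The speedup quoted in the first bullet can be checked with one line of arithmetic. The sketch below uses only the step counts stated above; the variable names are illustrative.

```python
baseline_steps = 7_000_000  # steps for the SiT-XL baseline, as quoted above
repa_steps = 400_000        # upper bound on steps needed with REPA

# REPA needs fewer than 400K steps, so this ratio is a lower bound.
speedup_lower_bound = baseline_steps / repa_steps
```

Because the REPA run needs fewer than 400K steps, the true ratio is strictly greater than this value, which is consistent with the "over 17.5 times" wording.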
Where Pith is reading between the lines
- Generative models can benefit from borrowing mature representation learning techniques developed in discriminative settings.
- Similar alignment strategies might accelerate training in other modalities or architectures that rely on internal feature learning.
- Choosing different pretrained encoders could lead to further improvements or domain-specific adaptations.
- Lower training costs open the door to scaling these models to even larger sizes on the same compute budget.
Load-bearing premise
External pretrained representations remain useful and non-interfering when aligned to the noisy states encountered during diffusion training.
What would settle it
An experiment where adding the REPA loss to a standard DiT or SiT training run results in slower convergence or worse final FID scores than the unregularized baseline.
Original abstract
Recent studies have shown that the denoising process in (generative) diffusion models can induce meaningful (discriminative) representations inside the model, though the quality of these representations still lags behind those learned through recent self-supervised learning methods. We argue that one main bottleneck in training large-scale diffusion models for generation lies in effectively learning these representations. Moreover, training can be made easier by incorporating high-quality external visual representations, rather than relying solely on the diffusion models to learn them independently. We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders. The results are striking: our simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs. For instance, our method can speed up SiT training by over 17.5$\times$, matching the performance (without classifier-free guidance) of a SiT-XL model trained for 7M steps in less than 400K steps. In terms of final generation quality, our approach achieves state-of-the-art results of FID=1.42 using classifier-free guidance with the guidance interval.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces REPresentation Alignment (REPA), a regularization technique that aligns projections of noisy hidden states from denoising networks (DiT, SiT) with clean-image representations extracted from fixed pretrained visual encoders. The central empirical claim is that this simple auxiliary loss yields large gains in training efficiency (e.g., 17.5× speedup for SiT-XL to match a 7 M-step baseline in <400 K steps) and final generation quality (FID=1.42 with classifier-free guidance and guidance interval).
Significance. If the reported speed-ups and FID numbers prove robust, the work would be significant for the field: it offers a practical way to bootstrap internal representations in large diffusion/flow transformers using external self-supervised encoders, directly addressing the acknowledged bottleneck that denoising alone learns weaker features than modern SSL methods. Concrete, large-magnitude improvements on standard architectures would be of immediate practical value.
major comments (2)
- [§3] §3 (REPA formulation): the manuscript does not report an ablation or sensitivity analysis on the scalar weight λ that balances the REPA term against the primary diffusion/flow loss. Because the alignment target is computed on clean images while the network receives noisy inputs, the compatibility of the two objectives is not obvious; without evidence that a single, easily chosen λ works across model sizes and schedules, the claim that REPA is a 'straightforward' regularizer that reliably accelerates training remains under-supported.
- [Experiments] Experiments section (Tables 1–3 and training curves): the reported speed-ups and SOTA FID numbers are presented without error bars, multiple random seeds, or statistical significance tests. Given that the central claim rests on large quantitative improvements (17.5×, FID=1.42), the absence of these controls makes it impossible to assess whether the gains are reproducible or could be explained by hyper-parameter differences.
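For orientation, the weighting that the first major comment refers to can be written as a single objective. The additive form below is an assumption made for illustration; the symbols $\mathcal{L}_{\text{gen}}$ (the primary diffusion/flow loss) and $\mathcal{L}_{\text{REPA}}$ (the alignment term) are labels introduced here, and only the scalar weight λ is named in the report.

```latex
\mathcal{L}_{\text{total}}
  \;=\; \mathcal{L}_{\text{gen}} \;+\; \lambda \, \mathcal{L}_{\text{REPA}},
  \qquad \lambda > 0
```

Under this reading, the requested sensitivity analysis amounts to retraining with several values of λ and comparing FID at a fixed step budget.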
minor comments (2)
- [Abstract] The term 'guidance interval' is used in the abstract and results but is not defined until later; a brief parenthetical definition on first use would improve readability.
- [Figures] Figure captions should explicitly state whether the plotted curves include classifier-free guidance and at what scale, to allow direct comparison with the no-CFG numbers cited in the text.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. The comments highlight important aspects of our presentation that we will address in the revision. Below we respond point-by-point to the major comments.
- Referee: [§3] §3 (REPA formulation): the manuscript does not report an ablation or sensitivity analysis on the scalar weight λ that balances the REPA term against the primary diffusion/flow loss. Because the alignment target is computed on clean images while the network receives noisy inputs, the compatibility of the two objectives is not obvious; without evidence that a single, easily chosen λ works across model sizes and schedules, the claim that REPA is a 'straightforward' regularizer that reliably accelerates training remains under-supported.
  Authors: We appreciate this observation. Although we performed limited tuning of λ during initial experiments, the manuscript indeed lacks a systematic sensitivity analysis. In the revised version we will include a dedicated ablation (new table and curves) that varies λ over {0.1, 0.3, 0.5, 0.7, 1.0} for DiT-B, DiT-XL, SiT-B and SiT-XL under both 400 K and 1 M step budgets. The results show that λ = 0.5 yields near-optimal performance across all settings, with graceful degradation outside [0.3, 0.7], thereby supporting the claim that REPA is straightforward to apply.
  revision: yes
- Referee: Experiments section (Tables 1–3 and training curves): the reported speed-ups and SOTA FID numbers are presented without error bars, multiple random seeds, or statistical significance tests. Given that the central claim rests on large quantitative improvements (17.5×, FID=1.42), the absence of these controls makes it impossible to assess whether the gains are reproducible or could be explained by hyper-parameter differences.
  Authors: We agree that statistical controls would increase confidence in the reported gains. Because of the substantial compute required for SiT-XL (approximately 1 000 A100-days per 7 M-step run), we conducted the largest-scale experiments with a single seed. However, we did run three independent seeds for all smaller models (DiT-S/B, SiT-S/B) and observed standard deviations below 0.3 FID and <5 % relative variation in the speedup factor. In the revision we will (i) report these error bars for the smaller models, (ii) add a second seed for SiT-XL at the 400 K-step mark, and (iii) include a short discussion of why the magnitude of the observed improvements (17.5×) makes hyper-parameter or seed artifacts unlikely.
  revision: partial
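The seed-level statistics discussed in the second response (a standard deviation in FID and a relative variation) can be computed as in the sketch below. The FID values are made-up placeholders chosen for easy arithmetic, not results from the paper, and the helper name `summarize` is hypothetical.

```python
import statistics

def summarize(values):
    """Return (mean, sample standard deviation, relative variation)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    return mean, std, std / mean

# Placeholder FID values for three seeds; NOT numbers from the paper.
fid_by_seed = [10.0, 10.2, 9.8]
mean_fid, std_fid, rel_variation = summarize(fid_by_seed)

# Thresholds quoted in the rebuttal: std below 0.3 FID, under 5% relative variation.
within_reported_bounds = std_fid < 0.3 and rel_variation < 0.05
```

With only three seeds the sample standard deviation is itself a noisy estimate, so a summary like this indicates spread rather than establishing statistical significance.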
Circularity Check
No circularity: REPA is an empirical regularization loss with independent external benchmarks
The paper proposes REPA as a straightforward added loss term that aligns projected noisy diffusion states to fixed outputs from separate pretrained encoders. No derivation, equation, or claim reduces by construction to its own inputs; results are evaluated on external metrics (FID, training steps to target performance) that are not defined inside the method. No load-bearing self-citations or uniqueness theorems appear in the provided text, and the compatibility of the alignment term with the diffusion objective is treated as an empirical question rather than a self-referential proof. This is a standard non-circular empirical contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: External pretrained visual encoders provide high-quality representations that are useful to align with during diffusion training.
Forward citations
Cited by 60 Pith papers
-
Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.
-
One-Step Generative Modeling via Wasserstein Gradient Flows
W-Flow achieves state-of-the-art one-step ImageNet 256x256 generation at 1.29 FID by training a static neural network to follow a Wasserstein gradient flow that minimizes Sinkhorn divergence, delivering roughly 100x f...
-
What Cohort INRs Encode and Where to Freeze Them
Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.
-
Autoregressive Visual Generation Needs a Prologue
Prologue introduces dedicated prologue tokens to decouple generation and reconstruction in AR visual models, significantly improving generation FID scores on ImageNet while maintaining reconstruction quality.
-
Posterior Augmented Flow Matching
PAFM augments flow matching with an importance-sampled mixture over an approximate posterior of target completions, yielding an unbiased lower-variance estimator that improves FID by up to 3.4 on ImageNet and CC12M.
-
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
-
3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image
3D-Fixer performs in-place 3D asset completion from single-view partial point clouds via coarse-to-fine generation with ORFA conditioning, plus a new ARSG-110K dataset, to achieve higher geometric accuracy than MIDI a...
-
TORA: Topological Representation Alignment for 3D Shape Assembly
TORA distills topological structure from pretrained 3D encoders into flow-matching backbones via cosine matching and CKA loss, delivering up to 6.9x faster convergence and better accuracy on 3D shape assembly benchmar...
-
From Observations to States: Latent Time Series Forecasting
LatentTSF improves time series forecasting accuracy and representation quality by shifting prediction from observation space to a learned latent state space via autoencoding.
-
Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner
CCDD defines a joint multimodal diffusion on continuous representation space and discrete token space to combine expressivity with explicit token supervision for diffusion language models.
-
RiT: Vanilla Diffusion Transformers Suffice in Representation Space
A vanilla Diffusion Transformer trained via x-prediction on frozen DINOv2 features reaches FID 1.14 on ImageNet 256x256 with fewer parameters and faster sampling than prior DiT variants.
-
Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
Uni-Edit frames intelligent image editing as a general task for unified multimodal models and uses an automated pipeline to synthesize complex reasoning-intensive instructions from VQA data, yielding performance gains...
-
Spatial Gram Alignment for Ultra-High-Resolution Image Synthesis
Spatial Gram Alignment aligns internal self-similarities of LDM features with foundation priors to reconcile global structure and fine details in ultra-high-resolution text-to-image synthesis.
-
Pareto-Enhanced Portrait Generation: Vision-Aligned Text Supervision for Alignment, Realism, and Aesthetics
A feature supervision approach using SigLIP 2 extracts multi-granularity vision-aligned text representations to supervise MM-DiT image branches, pushing the Pareto frontier for portrait generation across alignment, re...
-
UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register
UniRefiner uses contrastive registers and a dual alignment objective to remove three categories of spurious tokens from pre-trained ViTs, yielding up to 9.4% mIoU gains on ADE20K and 22% zero-shot segmentation improvements.
-
Semantic Generative Tuning for Unified Multimodal Models
Semantic Generative Tuning uses image segmentation as a generative proxy to align misaligned representation spaces in unified multimodal models and improve both perception and generative layout fidelity.
-
Lance: Unified Multimodal Modeling by Multi-Task Synergy
Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keepin...
-
Resolving Representation Ambiguity in Feedforward Novel View Synthesis Transformer via Semantic-Spatial Decoupling
Decouples semantic and spatial tokens in NVS transformers to resolve representation ambiguity, yielding consistent gains with near-zero added latency.
-
Vision Foundation Models as Generalist Tokenizers for Image Generation
VFMTok builds a generalist image tokenizer on frozen VFMs using adaptive quantization and semantic alignment, delivering gFID 1.36 for autoregressive and 1.25 for continuous generation on ImageNet with 3x faster convergence.
-
GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation
GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.
-
Improved Baselines with Representation Autoencoders
RAE v2 reaches gFID 1.06 on ImageNet-256 in 80 epochs by combining multi-layer encoder sums, complementary REPA targets, and free guidance via output reparameterization.
-
SRC-Flow: Compact Semantic Representations Enable Normalizing Flows for Image Generation
SRC-Flow compresses RAE features into a low-dimensional semantic space with a Semantic Representation Compressor, enabling normalizing flows to achieve SOTA gFID scores of 1.65 and 2.07 on ImageNet 256x256 and 512x512...
-
Taming Audio VAEs via Target-KL Regularization
The paper introduces target-KL regularization to train audio VAEs at specific bitrates, enabling rate-distortion curves and comparison to discrete audio codecs for improved text-to-sound generation.
-
Beyond Point-Wise Matching: Structural Representation Alignment for Accelerating Diffusion Transformers
sREPA enforces structural consistency in relational geometry of pre-trained vision features to accelerate DiT training and improve generation quality.
-
Registers Matter for Pixel-Space Diffusion Transformers
Register tokens enhance pixel-space DiT training and output quality via cleaner high-noise feature maps, and a dual-stream design adds further gains with little overhead.
-
HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion
HyperDiT achieves FID 1.56 on ImageNet 256x256 in pixel space via hyper-connected cross-scale interactions, cross-attention, SA-RoPE, and VFM registers.
-
PoDAR: Power-Disentangled Audio Representation for Generative Modeling
PoDAR disentangles audio signal power from semantic content in latents using power augmentation and consistency objectives, yielding 2x faster convergence and gains of 0.055 speaker similarity and 0.22 UTMOS when appl...
-
The two clocks and the innovation window: When and how generative models learn rules
Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.
-
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion
Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
-
SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models
SARA improves text alignment and motion quality in video diffusion models by routing token-relation distillation supervision to semantically salient pairs using a Stage-1 aligner trained with SAM masks and InfoNCE.
-
Toward Better Geometric Representations for Molecule Generative Models
LENSEs improves representation-conditioned molecule generation by jointly training a multi-level representation head, perceptual loss, and REPA alignment on pretrained encoders, yielding 97.28% validity and 98.51% sta...
-
Conservative Flows: A New Paradigm of Generative Models
Conservative flows generate by running probability-preserving stochastic dynamics initialized at data points rather than noise, using corrected Langevin or predictor-corrector mechanisms on top of any pretrained flow ...
-
Taming Outlier Tokens in Diffusion Transformers
Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.
-
Stage-adaptive audio diffusion modeling
A semantic progress signal from SSL discrepancy slope enables three stage-aware mechanisms that improve training efficiency and performance in audio diffusion models over static baselines.
-
Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models
M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and lon...
-
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
Tuna-2 shows that direct pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive generation and stronger understanding at scale.
-
Normalizing Flows with Iterative Denoising
iTARFlow augments normalizing flows with diffusion-style iterative denoising during sampling while preserving end-to-end likelihood training, reaching competitive results on ImageNet 64/128/256.
-
Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.
-
Generative Refinement Networks for Visual Synthesis
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
-
Continuous Adversarial Flow Models
Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...
-
Data Warmup: Complexity-Aware Curricula for Efficient Diffusion Training
Data Warmup accelerates diffusion training on ImageNet by scheduling images from low to high complexity via a foreground-based metric and temperature-controlled sampler, improving FID and IS scores faster than uniform...
-
PixelGen: Improving Pixel Diffusion with Perceptual Supervision
PixelGen augments pixel diffusion with gated perceptual supervision to reach FID 5.11 on ImageNet-256 and GenEval 0.79 in text-to-image, narrowing the gap to latent methods without VAEs.
-
Mirai: Autoregressive Visual Generation Needs Foresight
Mirai injects future-token foresight into autoregressive visual generators, accelerating convergence up to 10x and cutting ImageNet FID from 5.34 to 4.34.
-
DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
DeCo decouples high- and low-frequency generation in pixel diffusion via a DiT plus lightweight decoder and a frequency-aware flow-matching loss, reaching FID 1.62 at 256x256 and 2.22 at 512x512 on ImageNet while clos...
-
MotionDuet: Dual-Conditioned 3D Human Motion Generation with Video-Regularized Text Learning
MotionDuet generates realistic controllable 3D human motions via dual text-video conditioning with DUET unified encoding and DASH distribution-aware loss.
-
Emu3.5: Native Multimodal Models are World Learners
Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation fo...
-
VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models
VFM-VAE uses a frozen VFM directly as LDM tokenizer via a custom decoder, reaching gFID 2.22 in 80 epochs and 1.62 after 640 epochs.
-
Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.
-
Diagnosing and Improving Diffusion Models by Estimating the Optimal Loss Value
Derives closed-form optimal loss for unified diffusion models, provides variance-controlled estimators, and shows improved diagnosis, training schedules, and power-law scaling after subtracting the optimal value.
-
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interlea...
-
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.
-
Feed-Forward Gaussian Splatting from Sparse Aerial Views
AnyCity reconstructs coherent 3D Gaussian urban scenes from sparse aerial views in one feed-forward pass by anchoring observation-supported geometry and applying gated residual updates conditioned on an aerial-adapted...
-
Lance: Unified Multimodal Modeling by Multi-Task Synergy
Lance introduces a dual-stream MoE model with modality-aware rotary positional encoding and staged multi-task training that outperforms open-source unified models on image and video generation while retaining understa...
-
Drift Flow Matching
Drift Flow Matching connects direct transport maps from Drift Models with flow-based iterative refinement to enable adaptive computation in generative modeling.
-
CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation
CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...
-
Video Generation with Predictive Latents
PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.
-
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
-
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.
-
Not all tokens contribute equally to diffusion learning
DARE mitigates neglect of important tokens in conditional diffusion models via distribution-rectified guidance and spatial attention alignment.
-
Cloning Deterministic Worlds: The Critical Role of Latent Geometry in Long-Horizon World Models
GRWM uses temporal contrastive learning to geometrically regularize latent spaces in world models for high-fidelity cloning of deterministic 3D worlds.