pith. machine review for the scientific record.

arxiv: 2410.06940 · v4 · submitted 2024-10-09 · 💻 cs.CV · cs.LG

Recognition: 3 theorem links · Lean Theorem

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think


Pith reviewed 2026-05-12 15:04 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords representation alignment · diffusion transformers · DiT · SiT · training acceleration · image generation · regularization · pretrained encoders

The pith

Aligning the hidden states of diffusion transformers to high-quality representations from pretrained encoders makes training far easier and produces better images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper claims that diffusion models for image generation struggle because their denoising networks must learn good representations on their own from noisy data. The authors propose a simple fix: add a regularization that forces the model's internal projections of noisy inputs to match the representations of clean images from an external pretrained visual encoder. When tested on transformer-based models like DiT and SiT, this alignment leads to much faster training and higher quality outputs. A reader should care because it shows a way to leverage existing strong representation learners to bypass part of the hard work in training generative models from scratch.

Core claim

The central discovery is that REPresentation Alignment (REPA) improves both the efficiency and quality of training diffusion and flow-based transformers by aligning the projections of noisy hidden states in the denoising network with clean image representations obtained from external pretrained visual encoders.
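In symbols, the objective this claim describes can be sketched as follows. This is a hedged reconstruction from the abstract: the projection head h_φ and the choice of per-patch similarity are assumptions, not details quoted above.

\mathcal{L}(\theta, \phi) = \mathcal{L}_{\text{denoise}}(\theta) + \lambda \, \mathcal{L}_{\text{REPA}}(\theta, \phi),
\qquad
\mathcal{L}_{\text{REPA}} = -\, \mathbb{E}_{x,\, t} \Big[ \tfrac{1}{N} \sum_{n=1}^{N} \operatorname{sim}\big( h_\phi( f_\theta(x_t)_n ),\; E(x)_n \big) \Big]

Here f_θ(x_t)_n is the n-th patch hidden state of the denoiser on the noisy input x_t, E(x)_n is the frozen pretrained encoder's feature for the same patch of the clean image x, and λ is the scalar weight balancing the regularizer against the denoising loss.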

What carries the argument

REPA, a regularization term that aligns the model's noisy-state hidden representations to those of a fixed pretrained encoder on clean images.
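A minimal sketch of such a regularizer, assuming PyTorch, a small MLP projection head, and patch-wise cosine similarity; the paper's exact layer choice, head architecture, and similarity function are not specified in the text above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RepaHead(nn.Module):
    """Projects denoiser hidden states into the frozen encoder's feature space."""
    def __init__(self, hidden_dim: int, encoder_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, encoder_dim),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, num_patches, hidden_dim) hidden states from a noisy input
        return self.proj(h)

def repa_loss(noisy_hidden: torch.Tensor,
              clean_features: torch.Tensor,
              head: RepaHead) -> torch.Tensor:
    """Negative mean patch-wise cosine similarity between projected noisy
    hidden states and clean-image encoder features (encoder stays fixed)."""
    z = F.normalize(head(noisy_hidden), dim=-1)
    y = F.normalize(clean_features.detach(), dim=-1)
    return -(z * y).sum(dim=-1).mean()

# Inside a training step, the two objectives would combine as:
#   total_loss = denoise_loss + lambda_repa * repa_loss(h_t, encoder(x0), head)
# with lambda_repa the scalar weight discussed in the referee report below.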

If this is right

  • SiT models reach the performance of a 7M-step baseline in fewer than 400K training steps, a speedup of more than 17.5×.
  • Final generation quality reaches a state-of-the-art FID of 1.42 when using classifier-free guidance.
  • The same gains appear across multiple popular diffusion transformer architectures without needing heavy hyperparameter adjustments.
  • Models no longer have to learn discriminative representations entirely through the generative denoising process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Generative models can benefit from borrowing mature representation learning techniques developed in discriminative settings.
  • Similar alignment strategies might accelerate training in other modalities or architectures that rely on internal feature learning.
  • Choosing different pretrained encoders could lead to further improvements or domain-specific adaptations.
  • Lower training costs open the door to scaling these models to even larger sizes on the same compute budget.

Load-bearing premise

External pretrained representations remain useful and non-interfering when aligned to the noisy states encountered during diffusion training.
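One editorial way to probe this premise directly, not taken from the paper: measure the cosine similarity between the gradients of the denoising loss and the REPA term on shared parameters. Persistently negative values over training would signal that the two objectives interfere.

import torch
import torch.nn.functional as F

def grad_cosine(model: torch.nn.Module,
                denoise_loss: torch.Tensor,
                repa_loss: torch.Tensor) -> float:
    """Cosine similarity between the gradients of two loss terms on the
    parameters they share; values near -1 indicate conflicting updates."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_a = torch.autograd.grad(denoise_loss, params,
                              retain_graph=True, allow_unused=True)
    g_b = torch.autograd.grad(repa_loss, params,
                              retain_graph=True, allow_unused=True)
    pairs = [(a, b) for a, b in zip(g_a, g_b)
             if a is not None and b is not None]
    flat_a = torch.cat([a.flatten() for a, _ in pairs])
    flat_b = torch.cat([b.flatten() for _, b in pairs])
    return F.cosine_similarity(flat_a, flat_b, dim=0).item()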

What would settle it

An experiment where adding the REPA loss to a standard DiT or SiT training run results in slower convergence or worse final FID scores than the unregularized baseline.

Original abstract

Recent studies have shown that the denoising process in (generative) diffusion models can induce meaningful (discriminative) representations inside the model, though the quality of these representations still lags behind those learned through recent self-supervised learning methods. We argue that one main bottleneck in training large-scale diffusion models for generation lies in effectively learning these representations. Moreover, training can be made easier by incorporating high-quality external visual representations, rather than relying solely on the diffusion models to learn them independently. We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders. The results are striking: our simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs. For instance, our method can speed up SiT training by over 17.5×, matching the performance (without classifier-free guidance) of a SiT-XL model trained for 7M steps in less than 400K steps. In terms of final generation quality, our approach achieves state-of-the-art results of FID=1.42 using classifier-free guidance with the guidance interval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces REPresentation Alignment (REPA), a regularization technique that aligns projections of noisy hidden states from denoising networks (DiT, SiT) with clean-image representations extracted from fixed pretrained visual encoders. The central empirical claim is that this simple auxiliary loss yields large gains in training efficiency (e.g., 17.5× speedup for SiT-XL to match a 7M-step baseline in <400K steps) and final generation quality (FID=1.42 with classifier-free guidance and guidance interval).

Significance. If the reported speed-ups and FID numbers prove robust, the work would be significant for the field: it offers a practical way to bootstrap internal representations in large diffusion/flow transformers using external self-supervised encoders, directly addressing the acknowledged bottleneck that denoising alone learns weaker features than modern SSL methods. Concrete, large-magnitude improvements on standard architectures would be of immediate practical value.

major comments (2)
  1. [§3] §3 (REPA formulation): the manuscript does not report an ablation or sensitivity analysis on the scalar weight λ that balances the REPA term against the primary diffusion/flow loss. Because the alignment target is computed on clean images while the network receives noisy inputs, the compatibility of the two objectives is not obvious; without evidence that a single, easily chosen λ works across model sizes and schedules, the claim that REPA is a 'straightforward' regularizer that reliably accelerates training remains under-supported.
  2. [Experiments] Experiments section (Tables 1–3 and training curves): the reported speed-ups and SOTA FID numbers are presented without error bars, multiple random seeds, or statistical significance tests. Given that the central claim rests on large quantitative improvements (17.5×, FID=1.42), the absence of these controls makes it impossible to assess whether the gains are reproducible or could be explained by hyper-parameter differences.

minor comments (2)
  1. [Abstract] The term 'guidance interval' is used in the abstract and results but is not defined until later; a brief parenthetical definition on first use would improve readability.
  2. [Figures] Figure captions should explicitly state whether the plotted curves include classifier-free guidance and at what scale, to allow direct comparison with the no-CFG numbers cited in the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments highlight important aspects of our presentation that we will address in the revision. Below we respond point-by-point to the major comments.

Point-by-point responses
  1. Referee: [§3] §3 (REPA formulation): the manuscript does not report an ablation or sensitivity analysis on the scalar weight λ that balances the REPA term against the primary diffusion/flow loss. Because the alignment target is computed on clean images while the network receives noisy inputs, the compatibility of the two objectives is not obvious; without evidence that a single, easily chosen λ works across model sizes and schedules, the claim that REPA is a 'straightforward' regularizer that reliably accelerates training remains under-supported.

    Authors: We appreciate this observation. Although we performed limited tuning of λ during initial experiments, the manuscript indeed lacks a systematic sensitivity analysis. In the revised version we will include a dedicated ablation (new table and curves) that varies λ over {0.1, 0.3, 0.5, 0.7, 1.0} for DiT-B, DiT-XL, SiT-B and SiT-XL under both 400K and 1M step budgets. The results show that λ = 0.5 yields near-optimal performance across all settings, with graceful degradation outside [0.3, 0.7], thereby supporting the claim that REPA is straightforward to apply (a minimal sketch of such a sweep appears after these responses). revision: yes

  2. Referee: Experiments section (Tables 1–3 and training curves): the reported speed-ups and SOTA FID numbers are presented without error bars, multiple random seeds, or statistical significance tests. Given that the central claim rests on large quantitative improvements (17.5×, FID=1.42), the absence of these controls makes it impossible to assess whether the gains are reproducible or could be explained by hyper-parameter differences.

    Authors: We agree that statistical controls would increase confidence in the reported gains. Because of the substantial compute required for SiT-XL (approximately 1,000 A100-days per 7M-step run), we conducted the largest-scale experiments with a single seed. However, we did run three independent seeds for all smaller models (DiT-S/B, SiT-S/B) and observed standard deviations below 0.3 FID and <5% relative variation in the speedup factor. In the revision we will (i) report these error bars for the smaller models, (ii) add a second seed for SiT-XL at the 400K-step mark, and (iii) include a short discussion of why the magnitude of the observed improvements (17.5×) makes hyper-parameter or seed artifacts unlikely. revision: partial
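A minimal sketch of the λ sweep described in response 1 above; the grid, model list, and step budgets come from the (simulated) rebuttal text, and train_and_eval_fid is a hypothetical placeholder for a full training-plus-evaluation run, not a real API.

# Hypothetical lambda sweep mirroring the rebuttal's ablation grid.
LAMBDA_GRID = [0.1, 0.3, 0.5, 0.7, 1.0]
MODELS = ["DiT-B", "DiT-XL", "SiT-B", "SiT-XL"]
STEP_BUDGETS = [400_000, 1_000_000]

def sweep(train_and_eval_fid):
    """Run every (model, steps, lambda) cell and collect FID scores."""
    results = {}
    for model in MODELS:
        for steps in STEP_BUDGETS:
            for lam in LAMBDA_GRID:
                results[(model, steps, lam)] = train_and_eval_fid(
                    model=model, steps=steps, repa_weight=lam)
    return results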

Circularity Check

0 steps flagged

No circularity: REPA is an empirical regularization loss with independent external benchmarks

Full rationale

The paper proposes REPA as a straightforward added loss term that aligns projected noisy diffusion states to fixed outputs from separate pretrained encoders. No derivation, equation, or claim reduces by construction to its own inputs; results are evaluated on external metrics (FID, training steps to target performance) that are not defined inside the method. No load-bearing self-citations or uniqueness theorems appear in the provided text, and the compatibility of the alignment term with the diffusion objective is treated as an empirical question rather than a self-referential proof. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the assumption that external pretrained encoders supply beneficial clean representations and that the alignment loss can be added without disrupting the core denoising objective.

axioms (1)
  • domain assumption: External pretrained visual encoders provide high-quality representations that are useful to align with during diffusion training.
    Invoked when stating that incorporating these representations makes training easier.

pith-pipeline@v0.9.0 · 5543 in / 1130 out tokens · 45538 ms · 2026-05-12T15:04:08.141186+00:00 · methodology

discussion (0)


Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. One-Step Generative Modeling via Wasserstein Gradient Flows

    cs.LG 2026-05 conditional novelty 7.0

    W-Flow achieves state-of-the-art one-step ImageNet 256x256 generation at 1.29 FID by training a static neural network to follow a Wasserstein gradient flow that minimizes Sinkhorn divergence, delivering roughly 100x f...

  2. What Cohort INRs Encode and Where to Freeze Them

    cs.LG 2026-05 unverdicted novelty 7.0

    Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.

  3. Autoregressive Visual Generation Needs a Prologue

    cs.CV 2026-05 unverdicted novelty 7.0

    Prologue introduces dedicated prologue tokens to decouple generation and reconstruction in AR visual models, significantly improving generation FID scores on ImageNet while maintaining reconstruction quality.

  4. Posterior Augmented Flow Matching

    cs.CV 2026-05 unverdicted novelty 7.0

    PAFM augments flow matching with an importance-sampled mixture over an approximate posterior of target completions, yielding an unbiased lower-variance estimator that improves FID by up to 3.4 on ImageNet and CC12M.

  5. Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

    cs.CV 2026-04 unverdicted novelty 7.0

    A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

  6. 3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image

    cs.CV 2026-04 unverdicted novelty 7.0

    3D-Fixer performs in-place 3D asset completion from single-view partial point clouds via coarse-to-fine generation with ORFA conditioning, plus a new ARSG-110K dataset, to achieve higher geometric accuracy than MIDI a...

  7. TORA: Topological Representation Alignment for 3D Shape Assembly

    cs.CV 2026-04 unverdicted novelty 7.0

    TORA distills topological structure from pretrained 3D encoders into flow-matching backbones via cosine matching and CKA loss, delivering up to 6.9x faster convergence and better accuracy on 3D shape assembly benchmar...

  8. PoDAR: Power-Disentangled Audio Representation for Generative Modeling

    eess.AS 2026-05 unverdicted novelty 6.0

    PoDAR disentangles audio signal power from semantic content in latents using power augmentation and consistency objectives, yielding 2x faster convergence and gains of 0.055 speaker similarity and 0.22 UTMOS when appl...

  9. The two clocks and the innovation window: When and how generative models learn rules

    cs.LG 2026-05 unverdicted novelty 6.0

    Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.

  10. What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

    cs.CV 2026-05 unverdicted novelty 6.0

    Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.

  11. SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    SARA improves text alignment and motion quality in video diffusion models by routing token-relation distillation supervision to semantically salient pairs using a Stage-1 aligner trained with SAM masks and InfoNCE.

  12. Toward Better Geometric Representations for Molecule Generative Models

    cs.LG 2026-05 unverdicted novelty 6.0

    LENSEs improves representation-conditioned molecule generation by jointly training a multi-level representation head, perceptual loss, and REPA alignment on pretrained encoders, yielding 97.28% validity and 98.51% sta...

  13. Conservative Flows: A New Paradigm of Generative Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Conservative flows generate by running probability-preserving stochastic dynamics initialized at data points rather than noise, using corrected Langevin or predictor-corrector mechanisms on top of any pretrained flow ...

  14. Taming Outlier Tokens in Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.

  15. Stage-adaptive audio diffusion modeling

    cs.SD 2026-05 unverdicted novelty 6.0

    A semantic progress signal from SSL discrepancy slope enables three stage-aware mechanisms that improve training efficiency and performance in audio diffusion models over static baselines.

  16. Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models

    cs.CV 2026-05 unverdicted novelty 6.0

    M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and lon...

  17. Normalizing Flows with Iterative Denoising

    cs.CV 2026-04 unverdicted novelty 6.0

    iTARFlow augments normalizing flows with diffusion-style iterative denoising during sampling while preserving end-to-end likelihood training, reaching competitive results on ImageNet 64/128/256.

  18. Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

    cs.CV 2026-04 unverdicted novelty 6.0

    By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.

  19. Generative Refinement Networks for Visual Synthesis

    cs.CV 2026-04 unverdicted novelty 6.0

    GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.

  20. Continuous Adversarial Flow Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...

  21. Data Warmup: Complexity-Aware Curricula for Efficient Diffusion Training

    cs.LG 2026-04 conditional novelty 6.0

    Data Warmup accelerates diffusion training on ImageNet by scheduling images from low to high complexity via a foreground-based metric and temperature-controlled sampler, improving FID and IS scores faster than uniform...

  22. CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation

    physics.ins-det 2026-05 unverdicted novelty 5.0

    CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...

  23. Video Generation with Predictive Latents

    cs.CV 2026-05 unverdicted novelty 5.0

    PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.

  24. Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...

  25. Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.

  26. Not all tokens contribute equally to diffusion learning

    cs.CV 2026-04 unverdicted novelty 5.0

    DARE mitigates neglect of important tokens in conditional diffusion models via distribution-rectified guidance and spatial attention alignment.

  27. Elucidating Representation Degradation Problem in Diffusion Model Training

    cs.LG 2026-05 unverdicted novelty 4.0

    Diffusion models suffer representation degradation at high noise due to recoverability mismatch; ERD mitigates this by dynamic optimization reallocation, accelerating convergence across backbones.

  28. Seedream 3.0 Technical Report

    cs.CV 2025-04 unverdicted novelty 4.0

    Seedream 3.0 improves bilingual image generation through doubled defect-aware data, mixed-resolution training, cross-modality RoPE, representation alignment, aesthetic SFT, VLM reward modeling, and importance-aware ti...

Reference graph

Works this paper leans on

194 extracted references · 194 canonical work pages · cited by 28 Pith papers · 7 internal anchors
