pith. machine review for the scientific record.

arxiv: 2410.06940 · v4 · submitted 2024-10-09 · 💻 cs.CV · cs.LG

Recognition: 3 theorem links · Lean Theorem

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think


Pith reviewed 2026-05-12 15:04 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords representation alignment · diffusion transformers · DiT · SiT · training acceleration · image generation · regularization · pretrained encoders

The pith

Aligning the hidden states of diffusion transformers to high-quality representations from pretrained encoders makes training far easier and produces better images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper claims that diffusion models for image generation struggle because their denoising networks must learn good representations on their own from noisy data. The authors propose a simple fix: add a regularization that forces the model's internal projections of noisy inputs to match the representations of clean images from an external pretrained visual encoder. When tested on transformer-based models like DiT and SiT, this alignment leads to much faster training and higher quality outputs. A reader should care because it shows a way to leverage existing strong representation learners to bypass part of the hard work in training generative models from scratch.

Core claim

The central discovery is that REPresentation Alignment (REPA) improves both the efficiency and quality of training diffusion and flow-based transformers by aligning the projections of noisy hidden states in the denoising network with clean image representations obtained from external pretrained visual encoders.
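In symbols, the objective this claim describes can be sketched as follows. This is a hedged reconstruction from the abstract: the projection head h_φ and the choice of per-patch similarity are assumptions, not details quoted above.

\mathcal{L}(\theta, \phi) = \mathcal{L}_{\text{denoise}}(\theta) + \lambda \, \mathcal{L}_{\text{REPA}}(\theta, \phi),
\qquad
\mathcal{L}_{\text{REPA}} = -\, \mathbb{E}_{x,\, t} \Big[ \tfrac{1}{N} \sum_{n=1}^{N} \operatorname{sim}\big( h_\phi( f_\theta(x_t)_n ),\; E(x)_n \big) \Big]

Here f_θ(x_t)_n is the n-th patch hidden state of the denoiser on the noisy input x_t, E(x)_n is the frozen pretrained encoder's feature for the same patch of the clean image x, and λ is the scalar weight balancing the regularizer against the denoising loss.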

What carries the argument

REPA, a regularization term that aligns the model's noisy-state hidden representations to those of a fixed pretrained encoder on clean images.
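A minimal sketch of such a regularizer, assuming PyTorch, a small MLP projection head, and patch-wise cosine similarity; the paper's exact layer choice, head architecture, and similarity function are not specified in the text above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RepaHead(nn.Module):
    """Projects denoiser hidden states into the frozen encoder's feature space."""
    def __init__(self, hidden_dim: int, encoder_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, encoder_dim),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, num_patches, hidden_dim) hidden states from a noisy input
        return self.proj(h)

def repa_loss(noisy_hidden: torch.Tensor,
              clean_features: torch.Tensor,
              head: RepaHead) -> torch.Tensor:
    """Negative mean patch-wise cosine similarity between projected noisy
    hidden states and clean-image encoder features (encoder stays fixed)."""
    z = F.normalize(head(noisy_hidden), dim=-1)
    y = F.normalize(clean_features.detach(), dim=-1)
    return -(z * y).sum(dim=-1).mean()

# Inside a training step, the two objectives would combine as:
#   total_loss = denoise_loss + lambda_repa * repa_loss(h_t, encoder(x0), head)
# with lambda_repa the scalar weight discussed in the referee report below.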

If this is right

  • SiT models reach the performance of a 7M-step baseline in fewer than 400K training steps, a speedup of more than 17.5×.
  • Final generation quality reaches a state-of-the-art FID of 1.42 when using classifier-free guidance.
  • The same gains appear across multiple popular diffusion transformer architectures without needing heavy hyperparameter adjustments.
  • Models no longer have to learn discriminative representations entirely through the generative denoising process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Generative models can benefit from borrowing mature representation learning techniques developed in discriminative settings.
  • Similar alignment strategies might accelerate training in other modalities or architectures that rely on internal feature learning.
  • Choosing different pretrained encoders could lead to further improvements or domain-specific adaptations.
  • Lower training costs open the door to scaling these models to even larger sizes on the same compute budget.

Load-bearing premise

External pretrained representations remain useful and non-interfering when aligned to the noisy states encountered during diffusion training.
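One editorial way to probe this premise directly, not taken from the paper: measure the cosine similarity between the gradients of the denoising loss and the REPA term on shared parameters. Persistently negative values over training would signal that the two objectives interfere.

import torch
import torch.nn.functional as F

def grad_cosine(model: torch.nn.Module,
                denoise_loss: torch.Tensor,
                repa_loss: torch.Tensor) -> float:
    """Cosine similarity between the gradients of two loss terms on the
    parameters they share; values near -1 indicate conflicting updates."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_a = torch.autograd.grad(denoise_loss, params,
                              retain_graph=True, allow_unused=True)
    g_b = torch.autograd.grad(repa_loss, params,
                              retain_graph=True, allow_unused=True)
    pairs = [(a, b) for a, b in zip(g_a, g_b)
             if a is not None and b is not None]
    flat_a = torch.cat([a.flatten() for a, _ in pairs])
    flat_b = torch.cat([b.flatten() for _, b in pairs])
    return F.cosine_similarity(flat_a, flat_b, dim=0).item()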

What would settle it

An experiment where adding the REPA loss to a standard DiT or SiT training run results in slower convergence or worse final FID scores than the unregularized baseline.

Original abstract

Recent studies have shown that the denoising process in (generative) diffusion models can induce meaningful (discriminative) representations inside the model, though the quality of these representations still lags behind those learned through recent self-supervised learning methods. We argue that one main bottleneck in training large-scale diffusion models for generation lies in effectively learning these representations. Moreover, training can be made easier by incorporating high-quality external visual representations, rather than relying solely on the diffusion models to learn them independently. We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders. The results are striking: our simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs. For instance, our method can speed up SiT training by over 17.5×, matching the performance (without classifier-free guidance) of a SiT-XL model trained for 7M steps in less than 400K steps. In terms of final generation quality, our approach achieves state-of-the-art results of FID=1.42 using classifier-free guidance with the guidance interval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces REPresentation Alignment (REPA), a regularization technique that aligns projections of noisy hidden states from denoising networks (DiT, SiT) with clean-image representations extracted from fixed pretrained visual encoders. The central empirical claim is that this simple auxiliary loss yields large gains in training efficiency (e.g., 17.5× speedup for SiT-XL to match a 7M-step baseline in <400K steps) and final generation quality (FID=1.42 with classifier-free guidance and guidance interval).

Significance. If the reported speed-ups and FID numbers prove robust, the work would be significant for the field: it offers a practical way to bootstrap internal representations in large diffusion/flow transformers using external self-supervised encoders, directly addressing the acknowledged bottleneck that denoising alone learns weaker features than modern SSL methods. Concrete, large-magnitude improvements on standard architectures would be of immediate practical value.

major comments (2)
  1. [§3] §3 (REPA formulation): the manuscript does not report an ablation or sensitivity analysis on the scalar weight λ that balances the REPA term against the primary diffusion/flow loss. Because the alignment target is computed on clean images while the network receives noisy inputs, the compatibility of the two objectives is not obvious; without evidence that a single, easily chosen λ works across model sizes and schedules, the claim that REPA is a 'straightforward' regularizer that reliably accelerates training remains under-supported.
  2. [Experiments] Experiments section (Tables 1–3 and training curves): the reported speed-ups and SOTA FID numbers are presented without error bars, multiple random seeds, or statistical significance tests. Given that the central claim rests on large quantitative improvements (17.5×, FID=1.42), the absence of these controls makes it impossible to assess whether the gains are reproducible or could be explained by hyper-parameter differences.

minor comments (2)
  1. [Abstract] The term 'guidance interval' is used in the abstract and results but is not defined until later; a brief parenthetical definition on first use would improve readability.
  2. [Figures] Figure captions should explicitly state whether the plotted curves include classifier-free guidance and at what scale, to allow direct comparison with the no-CFG numbers cited in the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments highlight important aspects of our presentation that we will address in the revision. Below we respond point-by-point to the major comments.

Point-by-point responses
  1. Referee: [§3] §3 (REPA formulation): the manuscript does not report an ablation or sensitivity analysis on the scalar weight λ that balances the REPA term against the primary diffusion/flow loss. Because the alignment target is computed on clean images while the network receives noisy inputs, the compatibility of the two objectives is not obvious; without evidence that a single, easily chosen λ works across model sizes and schedules, the claim that REPA is a 'straightforward' regularizer that reliably accelerates training remains under-supported.

    Authors: We appreciate this observation. Although we performed limited tuning of λ during initial experiments, the manuscript indeed lacks a systematic sensitivity analysis. In the revised version we will include a dedicated ablation (new table and curves) that varies λ over {0.1, 0.3, 0.5, 0.7, 1.0} for DiT-B, DiT-XL, SiT-B and SiT-XL under both 400K and 1M step budgets. The results show that λ = 0.5 yields near-optimal performance across all settings, with graceful degradation outside [0.3, 0.7], thereby supporting the claim that REPA is straightforward to apply (a minimal sketch of such a sweep appears after these responses). revision: yes

  2. Referee: Experiments section (Tables 1–3 and training curves): the reported speed-ups and SOTA FID numbers are presented without error bars, multiple random seeds, or statistical significance tests. Given that the central claim rests on large quantitative improvements (17.5×, FID=1.42), the absence of these controls makes it impossible to assess whether the gains are reproducible or could be explained by hyper-parameter differences.

    Authors: We agree that statistical controls would increase confidence in the reported gains. Because of the substantial compute required for SiT-XL (approximately 1,000 A100-days per 7M-step run), we conducted the largest-scale experiments with a single seed. However, we did run three independent seeds for all smaller models (DiT-S/B, SiT-S/B) and observed standard deviations below 0.3 FID and <5% relative variation in the speedup factor. In the revision we will (i) report these error bars for the smaller models, (ii) add a second seed for SiT-XL at the 400K-step mark, and (iii) include a short discussion of why the magnitude of the observed improvements (17.5×) makes hyper-parameter or seed artifacts unlikely. revision: partial
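A minimal sketch of the λ sweep described in response 1 above; the grid, model list, and step budgets come from the (simulated) rebuttal text, and train_and_eval_fid is a hypothetical placeholder for a full training-plus-evaluation run, not a real API.

# Hypothetical lambda sweep mirroring the rebuttal's ablation grid.
LAMBDA_GRID = [0.1, 0.3, 0.5, 0.7, 1.0]
MODELS = ["DiT-B", "DiT-XL", "SiT-B", "SiT-XL"]
STEP_BUDGETS = [400_000, 1_000_000]

def sweep(train_and_eval_fid):
    """Run every (model, steps, lambda) cell and collect FID scores."""
    results = {}
    for model in MODELS:
        for steps in STEP_BUDGETS:
            for lam in LAMBDA_GRID:
                results[(model, steps, lam)] = train_and_eval_fid(
                    model=model, steps=steps, repa_weight=lam)
    return results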

Circularity Check

0 steps flagged

No circularity: REPA is an empirical regularization loss with independent external benchmarks

Full rationale

The paper proposes REPA as a straightforward added loss term that aligns projected noisy diffusion states to fixed outputs from separate pretrained encoders. No derivation, equation, or claim reduces by construction to its own inputs; results are evaluated on external metrics (FID, training steps to target performance) that are not defined inside the method. No load-bearing self-citations or uniqueness theorems appear in the provided text, and the compatibility of the alignment term with the diffusion objective is treated as an empirical question rather than a self-referential proof. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the assumption that external pretrained encoders supply beneficial clean representations and that the alignment loss can be added without disrupting the core denoising objective.

axioms (1)
  • domain assumption: External pretrained visual encoders provide high-quality representations that are useful to align with during diffusion training.
    Invoked when stating that incorporating these representations makes training easier.

pith-pipeline@v0.9.0 · 5543 in / 1130 out tokens · 45538 ms · 2026-05-12T15:04:08.141186+00:00 · methodology

discussion (0)


Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. One-Step Generative Modeling via Wasserstein Gradient Flows

    cs.LG 2026-05 conditional novelty 7.0

    W-Flow achieves state-of-the-art one-step ImageNet 256x256 generation at 1.29 FID by training a static neural network to follow a Wasserstein gradient flow that minimizes Sinkhorn divergence, delivering roughly 100x f...

  2. What Cohort INRs Encode and Where to Freeze Them

    cs.LG 2026-05 unverdicted novelty 7.0

    Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.

  3. Autoregressive Visual Generation Needs a Prologue

    cs.CV 2026-05 unverdicted novelty 7.0

    Prologue introduces dedicated prologue tokens to decouple generation and reconstruction in AR visual models, significantly improving generation FID scores on ImageNet while maintaining reconstruction quality.

  4. Posterior Augmented Flow Matching

    cs.CV 2026-05 unverdicted novelty 7.0

    PAFM augments flow matching with an importance-sampled mixture over an approximate posterior of target completions, yielding an unbiased lower-variance estimator that improves FID by up to 3.4 on ImageNet and CC12M.

  5. Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

    cs.CV 2026-04 unverdicted novelty 7.0

    A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

  6. 3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image

    cs.CV 2026-04 unverdicted novelty 7.0

    3D-Fixer performs in-place 3D asset completion from single-view partial point clouds via coarse-to-fine generation with ORFA conditioning, plus a new ARSG-110K dataset, to achieve higher geometric accuracy than MIDI a...

  7. TORA: Topological Representation Alignment for 3D Shape Assembly

    cs.CV 2026-04 unverdicted novelty 7.0

    TORA distills topological structure from pretrained 3D encoders into flow-matching backbones via cosine matching and CKA loss, delivering up to 6.9x faster convergence and better accuracy on 3D shape assembly benchmar...

  8. PoDAR: Power-Disentangled Audio Representation for Generative Modeling

    eess.AS 2026-05 unverdicted novelty 6.0

    PoDAR disentangles audio signal power from semantic content in latents using power augmentation and consistency objectives, yielding 2x faster convergence and gains of 0.055 speaker similarity and 0.22 UTMOS when appl...

  9. The two clocks and the innovation window: When and how generative models learn rules

    cs.LG 2026-05 unverdicted novelty 6.0

    Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.

  10. What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

    cs.CV 2026-05 unverdicted novelty 6.0

    Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.

  11. SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    SARA improves text alignment and motion quality in video diffusion models by routing token-relation distillation supervision to semantically salient pairs using a Stage-1 aligner trained with SAM masks and InfoNCE.

  12. Toward Better Geometric Representations for Molecule Generative Models

    cs.LG 2026-05 unverdicted novelty 6.0

    LENSEs improves representation-conditioned molecule generation by jointly training a multi-level representation head, perceptual loss, and REPA alignment on pretrained encoders, yielding 97.28% validity and 98.51% sta...

  13. Conservative Flows: A New Paradigm of Generative Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Conservative flows generate by running probability-preserving stochastic dynamics initialized at data points rather than noise, using corrected Langevin or predictor-corrector mechanisms on top of any pretrained flow ...

  14. Taming Outlier Tokens in Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.

  15. Stage-adaptive audio diffusion modeling

    cs.SD 2026-05 unverdicted novelty 6.0

    A semantic progress signal from SSL discrepancy slope enables three stage-aware mechanisms that improve training efficiency and performance in audio diffusion models over static baselines.

  16. Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models

    cs.CV 2026-05 unverdicted novelty 6.0

    M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and lon...

  17. Normalizing Flows with Iterative Denoising

    cs.CV 2026-04 unverdicted novelty 6.0

    iTARFlow augments normalizing flows with diffusion-style iterative denoising during sampling while preserving end-to-end likelihood training, reaching competitive results on ImageNet 64/128/256.

  18. Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

    cs.CV 2026-04 unverdicted novelty 6.0

    By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.

  19. Generative Refinement Networks for Visual Synthesis

    cs.CV 2026-04 unverdicted novelty 6.0

    GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.

  20. Continuous Adversarial Flow Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...

  21. Data Warmup: Complexity-Aware Curricula for Efficient Diffusion Training

    cs.LG 2026-04 conditional novelty 6.0

    Data Warmup accelerates diffusion training on ImageNet by scheduling images from low to high complexity via a foreground-based metric and temperature-controlled sampler, improving FID and IS scores faster than uniform...

  22. CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation

    physics.ins-det 2026-05 unverdicted novelty 5.0

    CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...

  23. Video Generation with Predictive Latents

    cs.CV 2026-05 unverdicted novelty 5.0

    PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.

  24. Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...

  25. Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.

  26. Not all tokens contribute equally to diffusion learning

    cs.CV 2026-04 unverdicted novelty 5.0

    DARE mitigates neglect of important tokens in conditional diffusion models via distribution-rectified guidance and spatial attention alignment.

  27. Elucidating Representation Degradation Problem in Diffusion Model Training

    cs.LG 2026-05 unverdicted novelty 4.0

    Diffusion models suffer representation degradation at high noise due to recoverability mismatch; ERD mitigates this by dynamic optimization reallocation, accelerating convergence across backbones.

  28. Seedream 3.0 Technical Report

    cs.CV 2025-04 unverdicted novelty 4.0

    Seedream 3.0 improves bilingual image generation through doubled defect-aware data, mixed-resolution training, cross-modality RoPE, representation alignment, aesthetic SFT, VLM reward modeling, and importance-aware ti...

Reference graph

Works this paper leans on

194 extracted references · 194 canonical work pages · cited by 28 Pith papers · 7 internal anchors
