MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training

Changqian Yu; Cheng Da; Huan Yang; Kun Gai; Lianyu Pang; Song Guo; Tianlin Pan; Wenhan Luo

arxiv: 2606.08788 · v1 · pith:QZL7APBBnew · submitted 2026-06-07 · 💻 cs.CV

Lianyu Pang, Tianlin Pan, Cheng Da, Changqian Yu, Huan Yang, Kun Gai, Song Guo, Wenhan Luo

Pith reviewed 2026-06-27 18:42 UTC · model grok-4.3

classification 💻 cs.CV

keywords diffusion models · representation alignment · token subset · efficient training · vision transformers · masking · self-supervised encoders

The pith

By aligning representations only on random token subsets, MaskAlign makes diffusion training less dependent on complete clean-image token sets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion transformer training can be sped up by aligning intermediate features with those of pretrained vision encoders. This introduces a mismatch: the diffusion model processes noisy images, while the reference features come from clean ones. The paper observes that under full-token alignment, the tokens with the largest alignment-gradient norms keep a stable spatial preference, so the objective does not affect all tokens uniformly. MaskAlign counters this by applying alignment to a different randomly sampled token subset at each training step. A pre-mask mixing block shares information across tokens before masking, which limits what is lost with the dropped tokens.

Core claim

Under full-token representation alignment, tokens with large alignment-gradient norms show a stable spatial preference, indicating that the objective encourages reliance on the complete set of clean-image tokens. MaskAlign addresses this by applying alignment to randomly sampled token subsets and using a lightweight pre-mask token mixing block to share information across tokens before masking, thereby reducing dependence on the full token set and encouraging more stable alignment under perturbations.
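The observation about alignment-gradient norms can be probed with a small diagnostic. The sketch below is illustrative only: it assumes the alignment loss is a per-token negative cosine similarity (the page does not state the loss), and the helper names `cosine_grad_norm`, `top_k_positions`, and `jaccard` are hypothetical. For that loss, the gradient norm with respect to a token feature h has the closed form sqrt(1 - cos^2) / |h|, so no autodiff is needed. A Jaccard overlap near 1 between the top-k positions at different training steps would correspond to a "stable spatial preference".

```python
import math

def cosine_grad_norm(h, y):
    """Norm of d(-cos(h, y))/dh for one token: sqrt(1 - cos^2) / |h|.

    Assumes a negative-cosine alignment loss, which is an assumption here.
    """
    nh = math.sqrt(sum(a * a for a in h))
    ny = math.sqrt(sum(b * b for b in y))
    cos = sum(a * b for a, b in zip(h, y)) / (nh * ny)
    return math.sqrt(max(0.0, 1.0 - cos * cos)) / nh

def top_k_positions(feats, targets, k):
    """Indices of the k tokens with the largest alignment-gradient norm."""
    norms = [cosine_grad_norm(h, y) for h, y in zip(feats, targets)]
    order = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    return set(order[:k])

def jaccard(a, b):
    """Overlap between two index sets, in [0, 1]."""
    return len(a & b) / len(a | b)

# Toy snapshots of 4 tokens at two training steps (made-up numbers, not data).
targets = [[1.0, 0.0]] * 4
step_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.1]]
step_b = [[1.0, 0.1], [0.1, 1.0], [1.0, 0.9], [1.0, 0.0]]
overlap = jaccard(top_k_positions(step_a, targets, 2),
                  top_k_positions(step_b, targets, 2))
```

Averaging such overlaps over many steps and images, and repeating the measurement with a different encoder or noise schedule, is one way to separate the explanations that the referee report contrasts.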

What carries the argument

MaskAlign, which performs representation alignment on randomly sampled token subsets during training, supported by a pre-mask token mixing block.
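A minimal sketch of token-subset alignment, under stated assumptions, may help fix ideas. It assumes a negative-cosine alignment loss and uniform sampling without replacement; `keep_ratio`, `sample_subset`, and `subset_alignment_loss` are hypothetical names, and the pre-mask mixing block is omitted. It is not the authors' implementation.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def sample_subset(num_tokens, keep_ratio, rng):
    """Draw a uniformly random subset of token indices.

    keep_ratio is a hypothetical hyperparameter; the page gives no value.
    """
    k = max(1, int(num_tokens * keep_ratio))
    return sorted(rng.sample(range(num_tokens), k))

def subset_alignment_loss(feats, targets, keep_idx):
    """Mean negative cosine similarity over the kept tokens only."""
    total = sum(cosine(feats[i], targets[i]) for i in keep_idx)
    return -total / len(keep_idx)

# A fresh subset is drawn at every step, so no fixed set of positions
# receives the alignment signal throughout training.
rng = random.Random(0)
subsets = [sample_subset(16, 0.25, rng) for _ in range(3)]
```

Setting `keep_ratio` to 1.0 recovers full-token alignment, which gives a direct baseline for the comparison the page calls for.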

Load-bearing premise

The stable spatial preference of high-norm tokens under full alignment means the model relies on the complete clean token set, and random subset sampling will yield robust alignment without major information loss.

What would settle it

Compare convergence speed and final FID scores of diffusion models trained with full-token alignment versus MaskAlign on standard datasets like ImageNet, checking if MaskAlign achieves similar or better results with fewer iterations.
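The comparison described above can be reduced to a single number: the ratio of training iterations each method needs to reach a given FID. The helpers below are a sketch with hypothetical names; they take FID-versus-step curves measured by the reader and contain no results from the paper.

```python
def first_step_at_or_below(curve, target_fid):
    """curve: list of (training_step, fid) pairs sorted by step.

    Returns the first step whose FID is at or below target_fid, else None.
    """
    for step, fid in curve:
        if fid <= target_fid:
            return step
    return None

def iteration_speedup(baseline_curve, method_curve, target_fid):
    """Baseline steps divided by method steps to reach target_fid.

    Returns None if either run never reaches the target.
    """
    base = first_step_at_or_below(baseline_curve, target_fid)
    ours = first_step_at_or_below(method_curve, target_fid)
    if base is None or ours is None:
        return None
    return base / ours
```

A value above 1 would mean the method reaches the target FID in fewer iterations than the baseline. Final FID at a matched step count should be reported alongside it, since a faster start does not guarantee a better end point.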

Original abstract

Representation alignment with pretrained vision models has recently shown strong potential for accelerating diffusion transformer training. By aligning intermediate diffusion features with clean-image representations from self-supervised vision encoders, existing methods improve convergence and generation quality. However, such alignment also introduces a non-trivial constraint: diffusion models operate on noisy inputs whose usable information varies across timesteps, while the reference features are extracted from clean images. In this paper, we revisit this mismatch from a token-level perspective. We find that, under full-token representation alignment, tokens with large alignment-gradient norms exhibit a stable spatial preference, suggesting that the alignment objective does not affect all tokens uniformly and may encourage the model to rely on the complete set of clean-image tokens. To address this issue, we propose MaskAlign, a token-subset representation alignment method that applies alignment to randomly sampled token subsets during training. By exposing the model to different token subsets across iterations, MaskAlign reduces the dependence of representation alignment on the complete token set and encourages alignment behavior that is more stable under token-subset perturbations. To mitigate the information loss caused by directly dropping tokens, we further introduce a lightweight pre-mask token mixing block that shares information across tokens before masking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

MaskAlign's token-subset sampling plus pre-mask mixing is a reasonable engineering response to a known alignment mismatch, but the inference from stable spatial preference to reliance on the full token set is not secured.

The key point on this paper is that MaskAlign proposes token-subset sampling for representation alignment plus a pre-mask mixing block, but the central motivation linking stable spatial preferences to reliance on the full token set is not strongly supported by the given evidence.

What is new is the specific approach of randomly sampling token subsets during training to reduce dependence on the complete clean-image token set, combined with the mixing block to handle information loss. This seems like a targeted response to the mismatch between noisy diffusion features and clean references at the token level.

The paper does well in spotting that alignment gradients are not uniform across tokens and that some show stable spatial preferences. That's an interesting empirical finding that could guide future work on alignment methods.

The soft spots are in the interpretation and the lack of supporting data. The suggestion that stable preference indicates the objective encourages reliance on all tokens could be off; as noted in the stress test, it might reflect properties of the pretrained encoder or the noising process instead. If so, the subset method might not deliver the intended robustness and could just change the bias. Since only the abstract is detailed here, there are no ablations, no quantitative results on training speed or quality, and no checks on whether the mixing block actually preserves necessary information. That makes it hard to assess if the method works as claimed.

This kind of paper is aimed at practitioners training large diffusion models who want to cut compute costs through better alignment. Readers working on efficient generative model training would find it relevant if the experiments hold up. It deserves a serious referee because the underlying problem of efficient alignment is important and the proposal is concrete, even though the current writeup leaves the motivation and results open to question.

I'd recommend sending it for peer review with the expectation that reviewers will probe the motivation and ask for more detailed experiments and comparisons.

Referee Report

2 major / 1 minor

Summary. The paper claims that full-token representation alignment between noisy diffusion features and clean-image features from pretrained vision encoders encourages over-reliance on the complete token set, as evidenced by stable spatial preferences in high-gradient-norm tokens. To address this, it proposes MaskAlign, which performs alignment on randomly sampled token subsets during training, combined with a pre-mask token mixing block to reduce information loss, thereby producing more robust and stable alignment behavior under token perturbations for faster and higher-quality diffusion transformer training.

Significance. If the core observation and proposed fix hold under rigorous validation, MaskAlign could meaningfully improve the efficiency and stability of representation-alignment-based diffusion training by reducing dependence on full clean-image token sets. This would be a practical advance in accelerating DiT-style models, with potential for broader applicability in conditional generation tasks. The approach is empirically motivated and introduces a lightweight architectural addition, but its significance depends on whether the token-subset strategy demonstrably outperforms full alignment without hidden costs in convergence or sample quality.

major comments (2)

[Abstract] Abstract (motivation paragraph): The inference that 'tokens with large alignment-gradient norms exhibit a stable spatial preference' under full-token alignment directly indicates that 'the alignment objective ... may encourage the model to rely on the complete set of clean-image tokens' is not secured. Stable preference could arise from intrinsic properties of the pretrained vision encoder or the diffusion noising schedule rather than from dependence on the full token set; without a controlled ablation or gradient analysis isolating this link, the motivation for random subset sampling remains under-supported.
[Abstract] Abstract (proposed method): The pre-mask token mixing block is introduced to 'share information across tokens before masking' and mitigate information loss, but the description provides no architectural details, parameter count, or analysis showing that this block does not reintroduce cross-token dependencies equivalent to the original full-set alignment. If the mixing operation effectively restores full-set information flow, the claimed reduction in dependence on the complete token set would be undermined.
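The second major comment can be made concrete with a toy leakage check. The sketch below is an assumption-laden stand-in: it replaces the paper's mixing block with a simple blend of each token and the mean over all tokens (`mean_mix`, with a hypothetical strength `alpha`), then measures how much the kept tokens change when a dropped token is perturbed. It does not model the authors' block.

```python
def mean_mix(tokens, alpha):
    """Stand-in mixer (an assumption): blend each token with the global mean."""
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[(1 - alpha) * t[d] + alpha * mean[d] for d in range(dim)]
            for t in tokens]

def kept_after_mask(tokens, keep_idx, alpha):
    """Mix first, then keep only the sampled token positions."""
    mixed = mean_mix(tokens, alpha)
    return [mixed[i] for i in keep_idx]

def leakage(tokens, keep_idx, dropped_i, alpha, eps=1.0):
    """Max absolute change in kept tokens when one dropped token shifts by eps."""
    perturbed = [list(t) for t in tokens]
    perturbed[dropped_i] = [x + eps for x in perturbed[dropped_i]]
    before = kept_after_mask(tokens, keep_idx, alpha)
    after = kept_after_mask(perturbed, keep_idx, alpha)
    return max(abs(a - b) for u, v in zip(before, after) for a, b in zip(u, v))
```

With this stand-in, leakage is `alpha * eps / N` for N tokens: zero when no mixing is applied and growing with the mixing strength. The same perturbation test applied to the real block would show how much dropped-token information reaches the aligned subset.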

minor comments (1)

[Abstract] The abstract would benefit from at least one quantitative result (e.g., FID improvement, training speedup, or stability metric) to ground the claimed gains in convergence and generation quality.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below, clarifying the motivation and method details while proposing targeted revisions to the abstract and main text where appropriate.

Referee: [Abstract] Abstract (motivation paragraph): The inference that 'tokens with large alignment-gradient norms exhibit a stable spatial preference' under full-token alignment directly indicates that 'the alignment objective ... may encourage the model to rely on the complete set of clean-image tokens' is not secured. Stable preference could arise from intrinsic properties of the pretrained vision encoder or the diffusion noising schedule rather than from dependence on the full token set; without a controlled ablation or gradient analysis isolating this link, the motivation for random subset sampling remains under-supported.

Authors: We acknowledge the referee's concern that the observed stable spatial preference could stem from the vision encoder or noising schedule rather than full-token dependence. The abstract uses 'suggesting' to frame this as an empirical observation motivating the approach, not a definitive causal proof. In the full manuscript (Section 3.2), gradient norm visualizations and perturbation experiments demonstrate that full alignment produces stable preferences while MaskAlign yields more uniform and perturbation-stable behavior. To further isolate the link, we will revise the abstract to emphasize the empirical motivation and add a brief note on the perturbation analysis as supporting evidence. A dedicated controlled ablation isolating the encoder and schedule would strengthen the claim but is not currently present; we can include it as additional analysis if requested.

revision: partial

Referee: [Abstract] Abstract (proposed method): The pre-mask token mixing block is introduced to 'share information across tokens before masking' and mitigate information loss, but the description provides no architectural details, parameter count, or analysis showing that this block does not reintroduce cross-token dependencies equivalent to the original full-set alignment. If the mixing operation effectively restores full-set information flow, the claimed reduction in dependence on the complete token set would be undermined.

Authors: We agree the abstract lacks architectural specifics on the pre-mask mixing block. The full manuscript (Section 3.3) describes it as a lightweight single-layer transformer with shared weights (approximately 0.5M parameters) applied before random token masking. Ablation studies show that omitting the block degrades performance due to information loss, while including it preserves MaskAlign's improved stability under token-subset perturbations compared to full alignment. The mixing occurs prior to subset sampling and does not condition on the complete token set during alignment, avoiding restoration of full-set dependencies. We will revise the abstract to briefly note its lightweight design and cross-reference the main text for details and ablations. If the referee requires explicit information-flow analysis (e.g., via attention maps), we can add it during revision.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical observation drives method proposal without self-referential derivations

The paper presents an empirical observation about token gradient norms under full alignment and proposes MaskAlign as a practical response via random subset sampling and pre-mask mixing. No equations, fitted parameters, or predictions are defined in terms of themselves; the central claim rests on an observed training dynamic rather than a quantity constructed from the proposed fix. No self-citations are invoked as load-bearing uniqueness theorems, and the method is not renamed from prior results. The derivation chain is therefore self-contained as an engineering intervention motivated by data, warranting score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, background axioms, or postulated entities beyond naming the new method and block.

pith-pipeline@v0.9.1-grok · 5752 in / 1031 out tokens · 16682 ms · 2026-06-27T18:42:50.333115+00:00 · methodology

Reference graph

Works this paper leans on

29 extracted references · 6 linked inside Pith

[1] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu. ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022

[2] Pierre Baldi and Peter J Sadowski. Understanding dropout. Advances in neural information processing systems, 26, 2013

[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009

[4] Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Mdtv2: Masked diffusion transformer is a strong image synthesizer. arXiv preprint arXiv:2303.14389, 2023

[5] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017

[6] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020

[7] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

[8] Theodoros Kouzelis, Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. Boosting generative image modeling via joint image-feature synthesis. arXiv preprint arXiv:2504.16064, 2025

[9] Felix Krause, Timy Phan, Ming Gui, Stefan Andreas Baumann, Vincent Tao Hu, and Björn Ommer. Tread: Token routing for efficient architecture-agnostic diffusion training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15703–15713, 2025

[10] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. Advances in neural information processing systems, 32, 2019

[11] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024

[12] Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18262–18272, 2025

[13] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

[14] Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40. Springer, 2024

[15] Charlie Nash, Jacob Menick, Sander Dieleman, and Peter W Battaglia. Generating images with sparse representations. arXiv preprint arXiv:2103.03841, 2021

[16] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021

[17] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

[18] Giorgos Petsangourakis, Christos Sgouropoulos, Bill Psomas, Theodoros Giannakopoulos, Giorgos Sfikas, and Ioannis Kakogeorgiou. Reglue your latents with global and local semantics for entangled diffusion. arXiv preprint arXiv:2512.16636, 2025

[19] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022

[20] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022

[21] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016

[22] Vikash Sehwag, Xianghao Kong, Jingtao Li, Michael Spranger, and Lingjuan Lyu. Stretching each dollar: Diffusion training from scratch on a micro-budget. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28596–28608, 2025

[23] Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, and Saining Xie. What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794, 2025

[24] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

[25] Stefan Wager, Sida Wang, and Percy S Liang. Dropout training as adaptive regularization. Advances in neural information processing systems, 26, 2013

[26] Ziqiao Wang, Wangbo Zhao, Yuhao Zhou, Zekai Li, Zhiyuan Liang, Mingjia Shi, Xuanlei Zhao, Pengfei Zhou, Kaipeng Zhang, Zhangyang Wang, et al. Repa works until it doesn’t: Early-stopped, holistic alignment supercharges diffusion training. arXiv preprint arXiv:2505.16792, 2025

[27] Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, et al. Representation entanglement for generation: Training diffusion transformers is much easier than you think. arXiv preprint arXiv:2507.01467, 2025

[28] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024

[29] Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers. arXiv preprint arXiv:2306.09305, 2023

A Experimental Setup

Table 7 summarizes the hyperparameter settings of MaskAlign for SiT-B/2 and SiT-XL/2. Following the experimental protocol of REPA, we train models in the latent space with v...
