arxiv: 2605.16949 · v1 · pith:YAGX4P3Dnew · submitted 2026-05-16 · 💻 cs.CV

Beyond Point-Wise Matching: Structural Representation Alignment for Accelerating Diffusion Transformers

Pith reviewed 2026-05-19 20:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords Diffusion Transformers, Representation Alignment, Structural Alignment, Image Generation, Training Acceleration, Feature Geometry, Generative Models

The pith

Structural alignment of relational geometry in features accelerates Diffusion Transformer training and improves sample quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that aligning noisy latent states with pre-trained semantic features in Diffusion Transformers works best when it captures spatial relationships rather than matching points individually. It introduces sREPA to enforce consistency in the relational geometry of feature maps from vision foundation models. This helps the diffusion model internalize holistic layouts and structural correlations during training. If correct, the approach would deliver faster, more stable convergence and higher-quality generated images than prior alignment strategies. Readers interested in practical high-fidelity image generation would care because current DiT training remains slow despite recent progress.

Core claim

By formulating alignment as an explicit structural constraint on the relational geometry of feature maps rather than point-wise matching, sREPA transfers spatial topology from pre-trained representations more effectively, producing faster and more stable convergence together with improved sample quality in Diffusion Transformers.

What carries the argument

sREPA, which enforces consistency in relational geometry across feature maps instead of matching individual points.
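
To make the distinction concrete, here is a minimal sketch in plain Python. It assumes a cosine-based point-wise loss and, for the structural side, a mean squared difference between token-token cosine-similarity matrices. This page does not state the paper's actual sREPA objective, so both loss functions and every name below (`pointwise_loss`, `structural_loss`, the toy `teacher` tokens) are illustrative assumptions rather than the authors' method.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))

def pointwise_loss(student, teacher):
    """Assumed point-wise objective: mean (1 - cosine) over matching tokens."""
    return sum(1.0 - cosine(s, t) for s, t in zip(student, teacher)) / len(teacher)

def similarity_matrix(tokens):
    """Token-token cosine similarities, i.e. the relational geometry."""
    return [[cosine(a, b) for b in tokens] for a in tokens]

def structural_loss(student, teacher):
    """Assumed structural objective: mean squared gap between similarity matrices."""
    S, T = similarity_matrix(student), similarity_matrix(teacher)
    n = len(T)
    return sum((S[i][j] - T[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

# Toy teacher tokens and a student that is the teacher rotated by 90 degrees.
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rotated = [[-y, x] for x, y in teacher]

print(pointwise_loss(rotated, teacher))   # large: every token is orthogonal to its target
print(structural_loss(rotated, teacher))  # about zero: pairwise relations are preserved
```

The rotated student keeps every pairwise relation of the teacher while matching none of its tokens individually, so the two objectives measure different things. The toy says nothing about which objective trains a diffusion model better.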

Load-bearing premise

Point-wise matching objectives are insufficient to capture rich spatial topology and an explicit structural constraint on relational geometry will transfer this topology more effectively.

What would settle it

A controlled comparison in which a carefully tuned point-wise baseline reaches the same convergence speed and FID scores as sREPA on standard DiT benchmarks would show the structural constraint is not necessary.

Figures

Figures reproduced from arXiv: 2605.16949 by Houqiang Li, Litong Gong, Shaodong Xu, Tiezheng Ge, Wengang Zhou, Zexian Li, Zhendong Wang.

Figure 1. Structural representation alignment improves diffusion model training beyond point-wise alignment. Compared with the standard REPA training framework, the proposed sREPA integrates explicit structural supervision to align relational distributions between teacher and student representations, resulting in faster convergence and consistently better generation quality.
Figure 2. Effectiveness of Spatial Structural Supervision. Comparison of similarity maps on the DINOv2 features and diffusion features of models trained with point-wise alignment and further integrated structural alignment. Point-wise only alignment is insufficient to mimic the token relationship of teacher features. Adding explicit structural supervision results in more concentrated token similarity and better ima…
Figure 3. Samples are generated on ImageNet 256×256 with the sREPA (SiT-XL/2 model). Classifier-free guidance is applied with scale w = 4.0.
Figure 4. sREPA improves visual scaling. We observe that sREPA produces higher-quality images at …
Figures 5-16. The visualization results of SiT-XL/2 + sREPA utilize Classifier-Free Guidance (CFG) with …
Original abstract

Recent advances in Diffusion Transformers (DiTs) demonstrate that aligning noisy latent states with well-trained semantic features-as pioneered by Representation Alignment (REPA)-can substantially accelerate training and improve generation fidelity. Subsequent analysis(e.g., iREPA) suggests that these gains arise primarily from transferring spatial structure contained in pre-trained vision representations. However, mostly existing alignment methods employ point-wise matching objectives or rely on implicit architectural tweaks, which fail to explicitly model the spatial relational geometry inherent in vision foundation models. We argue that such element-wise supervision is insufficient to capture the rich spatial topology of visual representations, and that effective alignment for generation should instead be formulated as an explicit structural constraint. To this end, we propose sREPA, a structural REPresentation Alignment framework to enforce consistency in the relational geometry of feature maps, rather than merely matching individual feature points. By encouraging the model to internalize holistic spatial layouts and structural correlations from pre-trained features, sREPA achieves faster and more stable convergence, along with improved sample quality, compared to state-of-the-art alignment strategies. Our code and models will be released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes sREPA, a structural Representation Alignment framework for Diffusion Transformers. It argues that existing methods such as REPA rely on point-wise matching objectives that fail to capture the spatial relational geometry in pre-trained vision features, and instead introduces an explicit structural constraint on the relational geometry of feature maps to encourage internalization of holistic spatial layouts, claiming this yields faster and more stable convergence along with improved sample quality over state-of-the-art alignment strategies. Code and models are promised for release.

Significance. If the empirical claims hold after proper validation, sREPA could advance efficient training of DiTs by supplying a relational inductive bias that better transfers spatial topology from foundation models than point-wise supervision alone. This would extend recent representation-alignment techniques with a more topology-aware formulation, potentially improving both training speed and generation fidelity in large-scale diffusion models.

major comments (3)
  1. Abstract: The abstract asserts performance gains in convergence and sample quality but supplies no quantitative results, ablation studies, or experimental details. All claims rest on the future code release rather than evidence presented in the manuscript.
  2. Method section: The central claim that point-wise matching is insufficient to capture spatial topology is asserted without a supporting derivation, comparison, or isolation experiment showing that the proposed relational-geometry constraint supplies a distinct inductive bias beyond the mere addition of an extra alignment term.
  3. Experiments section: No ablation is described that holds the alignment target and total loss budget fixed while swapping only the structural versus point-wise formulation. Without this control, any observed gains cannot be causally attributed to modeling relational geometry rather than confounding factors such as weighting or architectural side-effects.
minor comments (2)
  1. Abstract: 'However, mostly existing alignment methods' is grammatically awkward and should read 'However, most existing alignment methods'.
  2. Abstract: Missing space in 'analysis(e.g., iREPA)'.
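
The control requested in major comment 3 can be sketched as a small run plan. This is a hypothetical illustration: the `AblationRun` class, the run names, the `"dinov2"` teacher label, and the budget value of 0.5 are all assumptions, not settings taken from the paper. It also treats "total loss budget" as the sum of the alignment loss weights, which is only one possible reading; equal weights do not guarantee equal gradient magnitudes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AblationRun:
    """One training configuration in the controlled comparison (hypothetical)."""
    name: str
    teacher: str              # alignment target, held fixed across runs
    pointwise_weight: float   # weight on the point-wise alignment term
    structural_weight: float  # weight on the structural alignment term

    @property
    def total_alignment_weight(self) -> float:
        return self.pointwise_weight + self.structural_weight

def build_controlled_runs(teacher: str = "dinov2", budget: float = 0.5):
    """Vary only the formulation; keep the teacher and total weight fixed."""
    return [
        AblationRun("pointwise_only", teacher, budget, 0.0),
        AblationRun("structural_only", teacher, 0.0, budget),
        AblationRun("mixed", teacher, budget / 2, budget / 2),
    ]

runs = build_controlled_runs()
for run in runs:
    print(run.name, run.teacher, run.total_alignment_weight)
```

Every run shares the same teacher and the same total alignment weight, so a difference in outcome between `pointwise_only` and `structural_only` could not be explained by either of those two factors.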

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our submission. We provide detailed responses to each major comment and indicate the revisions we intend to make.

Point-by-point responses
  1. Referee: Abstract: The abstract asserts performance gains in convergence and sample quality but supplies no quantitative results, ablation studies, or experimental details. All claims rest on the future code release rather than evidence presented in the manuscript.

    Authors: We agree that the abstract would benefit from including specific quantitative results to support the claims. In the revised manuscript, we will update the abstract to include key metrics, such as improvements in FID scores and training convergence rates compared to baselines. The detailed experimental results, ablations, and comparisons are already presented in the Experiments section, and we will ensure the abstract provides a concise summary of these findings rather than relying solely on the code release. revision: yes

  2. Referee: Method section: The central claim that point-wise matching is insufficient to capture spatial topology is asserted without a supporting derivation, comparison, or isolation experiment showing that the proposed relational-geometry constraint supplies a distinct inductive bias beyond the mere addition of an extra alignment term.

    Authors: We appreciate this point. While the Method section motivates the structural constraint by highlighting the limitations of point-wise objectives in preserving relational geometry, we acknowledge that a more formal derivation could strengthen the argument. We will revise the Method section to include a clearer mathematical derivation of how the structural alignment differs from point-wise matching and provide additional analysis to isolate the effect of the relational geometry constraint. revision: yes

  3. Referee: Experiments section: No ablation is described that holds the alignment target and total loss budget fixed while swapping only the structural versus point-wise formulation. Without this control, any observed gains cannot be causally attributed to modeling relational geometry rather than confounding factors such as weighting or architectural side-effects.

    Authors: This is a valid concern regarding causal attribution. To address it, we will conduct and include an additional ablation study in the revised manuscript that maintains the same alignment target and total loss budget, varying only the structural versus point-wise formulation. This will help demonstrate that the gains are due to the modeling of relational geometry. revision: yes
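
One way the distinction raised in the second exchange could be formalized is sketched below. The notation and both loss definitions are assumptions chosen for illustration (unit-normalized student features $s_i$ and teacher features $t_i$ for $N$ tokens); they are not the derivation the authors promise, and the paper's actual objective is not given on this page.

```latex
% Assumed definitions, with \|s_i\| = \|t_i\| = 1 for i = 1, \dots, N.
\mathcal{L}_{\text{point}} = \frac{1}{N} \sum_{i=1}^{N} \bigl( 1 - s_i^{\top} t_i \bigr),
\qquad
\mathcal{L}_{\text{struct}} = \frac{1}{N^{2}} \sum_{i=1}^{N} \sum_{j=1}^{N}
  \bigl( s_i^{\top} s_j - t_i^{\top} t_j \bigr)^{2}.

% Point-wise optimality implies structural optimality:
\mathcal{L}_{\text{point}} = 0
\;\Longrightarrow\; s_i = t_i \ \ \forall i
\;\Longrightarrow\; \mathcal{L}_{\text{struct}} = 0.

% The converse fails. For any orthogonal Q, set s_i = Q t_i:
(Q t_i)^{\top} (Q t_j) = t_i^{\top} Q^{\top} Q \, t_j = t_i^{\top} t_j
\;\Longrightarrow\; \mathcal{L}_{\text{struct}} = 0,
\qquad
\mathcal{L}_{\text{point}} = \frac{1}{N} \sum_{i=1}^{N} \bigl( 1 - t_i^{\top} Q^{\top} t_i \bigr) > 0
\ \text{ unless } Q t_i = t_i \ \ \forall i.
```

Under these assumed definitions the structural term is invariant to a global orthogonal transform of the student features, so it constrains relations between tokens and nothing else. Both losses vanish at a perfect point-wise match, so any distinct benefit has to come from how the structural term shapes optimization away from that optimum, which is what the requested isolation experiment would test.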

Circularity Check

0 steps flagged

No significant circularity; proposal is an independent structural reformulation

Full rationale

The paper motivates sREPA by arguing that point-wise objectives are insufficient for spatial topology and proposes enforcing relational geometry consistency as an explicit structural constraint. No equations, derivations, or fitted parameters are shown that reduce the claimed faster convergence or improved sample quality to the inputs by construction. The argument draws on prior REPA/iREPA work for context but presents the new framework as a distinct reformulation without self-citation load-bearing on the central claim or any renaming of known results. The derivation chain is self-contained as a methodological proposal backed by empirical comparisons rather than tautological reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the assumption that pre-trained vision features contain transferable spatial relational geometry that point-wise matching fails to exploit; no free parameters or new entities are specified in the abstract.

axioms (1)
  • Domain assumption: Pre-trained vision foundation models encode rich spatial relational geometry in their feature maps that can be transferred to diffusion models.
    Invoked to justify moving from point-wise to structural alignment, based on analysis from prior iREPA work.

pith-pipeline@v0.9.0 · 5742 in / 1146 out tokens · 54611 ms · 2026-05-19T20:46:10.172474+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 14 internal anchors

  1. [1]

    Self-supervised learning from images with a joint-embedding predictive architecture

    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023

  2. [2]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024

  3. [3]

    An empirical study of training self-supervised vision transformers

    Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9640–9649, 2021

  4. [4]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009

  5. [5]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021

  6. [6]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024

  7. [7]

    Mdtv2: Masked diffusion transformer is a strong image synthesizer

    Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Mdtv2: Masked diffusion transformer is a strong image synthesizer. arXiv preprint arXiv:2303.14389, 2023

  8. [8]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022

  9. [9]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017

  10. [10]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  11. [11]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

  12. [12]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  13. [13]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  14. [14]

    Normalizing flows: An introduction and review of current methods

    Ivan Kobyzev, Simon JD Prince, and Marcus A Brubaker. Normalizing flows: An introduction and review of current methods. IEEE transactions on pattern analysis and machine intelligence, 43(11):3964–3979, 2020

  15. [15]

    Boosting generative image modeling via joint image-feature synthesis

    Theodoros Kouzelis, Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. Boosting generative image modeling via joint image-feature synthesis. arXiv preprint arXiv:2504.16064, 2025

  16. [16]

    Improved precision and recall metric for assessing generative models

    Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. Advances in neural information processing systems, 32, 2019

  17. [17]

    Flux

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2023

  18. [18]

    Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers

    Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers. arXiv preprint arXiv:2504.10483, 2025

  19. [19]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  20. [20]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

  21. [21]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  22. [22]

    Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

    Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40. Springer, 2024

  23. [23]

    Cmd: Self-supervised 3d action representation learning with cross-modal mutual distillation

    Yunyao Mao, Wengang Zhou, Zhenbo Lu, Jiajun Deng, and Houqiang Li. Cmd: Self-supervised 3d action representation learning with cross-modal mutual distillation. In European Conference on Computer Vision (ECCV), 2022

  24. [24]

    Generating images with sparse representations

    Charlie Nash, Jacob Menick, Sander Dieleman, and Peter W Battaglia. Generating images with sparse representations. arXiv preprint arXiv:2103.03841, 2021

  25. [25]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  26. [26]

    Relational knowledge distillation

    Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3967–3976, 2019

  27. [27]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  28. [28]

    Correlation congruence for knowledge distillation

    Baoyun Peng, Xiao Jin, Jiaheng Liu, Dongsheng Li, Yichao Wu, Yu Liu, Shunfeng Zhou, and Zhaoning Zhang. Correlation congruence for knowledge distillation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5007–5016, 2019

  29. [29]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023

  30. [30]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  31. [31]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  32. [32]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015

  33. [33]

    Improved techniques for training gans

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016

  34. [34]

    DINOv3

    Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025

  35. [35]

    What matters for representation alignment: Global information or spatial structure?

    Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, and Saining Xie. What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794, 2025

  36. [36]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

  37. [37]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020

  38. [38]

    Isd: Self-supervised learning by iterative similarity distillation

    Ajinkya Tejankar, Soroush Abbasi Koohpayegani, Vipin Pillai, Paolo Favaro, and Hamed Pirsiavash. Isd: Self-supervised learning by iterative similarity distillation. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9589–9598, 2020. URL https://api.semanticscholar.org/CorpusID:229297747

  39. [39]

    Similarity-preserving knowledge distillation

    Frederick Tung and Greg Mori. Similarity-preserving knowledge distillation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1365–1374, 2019

  40. [40]

    Ddt: Decoupled diffusion transformer

    Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. Ddt: Decoupled diffusion transformer. arXiv preprint arXiv:2504.05741, 2025

  41. [41]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025

  42. [42]

    Representation entanglement for generation: Training diffusion transformers is much easier than you think

    Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, et al. Representation entanglement for generation: Training diffusion transformers is much easier than you think. arXiv preprint arXiv:2507.01467, 2025

  43. [43]

    Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models

    Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15703–15712, 2025

  44. [44]

    Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024

  45. [45]

    Diffusion Transformers with Representation Autoencoders

    Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025

  46. [46]

    Fast training of diffusion models with masked transformers

    Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers. arXiv preprint arXiv:2306.09305, 2023

A Implementation Details

A.1 Training Details

We follow the same experimental setup as in REPA [44]. All training experiments are conducted on the ImageNet [4] training split. For preprocessing,...