Beyond Point-Wise Matching: Structural Representation Alignment for Accelerating Diffusion Transformers

Houqiang Li; Litong Gong; Shaodong Xu; Tiezheng Ge; Wengang Zhou; Zexian Li; Zhendong Wang

arxiv: 2605.16949 · v1 · pith:YAGX4P3Dnew · submitted 2026-05-16 · 💻 cs.CV

Shaodong Xu, Zhendong Wang, Litong Gong, Zexian Li, Wengang Zhou, Tiezheng Ge, Houqiang Li

Pith reviewed 2026-05-19 20:46 UTC · model grok-4.3

classification 💻 cs.CV

keywords Diffusion Transformers · Representation Alignment · Structural Alignment · Image Generation · Training Acceleration · Feature Geometry · Generative Models

The pith

Structural alignment of relational geometry in features accelerates Diffusion Transformer training and improves sample quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that aligning noisy latent states with pre-trained semantic features in Diffusion Transformers works best when it captures spatial relationships rather than matching points individually. It introduces sREPA to enforce consistency in the relational geometry of feature maps from vision foundation models. This helps the diffusion model internalize holistic layouts and structural correlations during training. If correct, the approach would deliver faster, more stable convergence and higher-quality generated images than prior alignment strategies. Readers interested in practical high-fidelity image generation would care because current DiT training remains slow despite recent progress.

Core claim

By formulating alignment as an explicit structural constraint on the relational geometry of feature maps rather than point-wise matching, sREPA transfers spatial topology from pre-trained representations more effectively, producing faster and more stable convergence together with improved sample quality in Diffusion Transformers.

What carries the argument

sREPA, which enforces consistency in relational geometry across feature maps instead of matching individual points.
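As a rough illustration of what "consistency in relational geometry" can mean in practice, the sketch below implements the two loss fragments quoted further down this page (an MSE over off-diagonal similarity entries and a softmax KL over the same entries). It is a minimal NumPy sketch, not the authors' implementation: the use of cosine similarity, the temperature `tau`, the teacher-to-student KL direction, and all function names are assumptions made for illustration.

```python
import numpy as np


def cosine_similarity_matrix(feats):
    """Pairwise cosine similarities between N token features: (N, D) -> (N, N)."""
    unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    return unit @ unit.T


def structural_mse(teacher, student):
    """Mean squared gap between teacher and student similarities, off-diagonal only."""
    s_t = cosine_similarity_matrix(teacher)
    s_s = cosine_similarity_matrix(student)
    n = s_t.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    # Normalise by the N(N-1) off-diagonal pairs, as in the quoted formula.
    return float(np.sum((s_t - s_s)[off_diag] ** 2) / (n * (n - 1)))


def _log_softmax(x):
    """Numerically stable row-wise log-softmax."""
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))


def structural_kl(teacher, student, tau=0.1):
    """Row-wise KL(teacher || student) over softmaxed off-diagonal similarities.

    tau and the KL direction are assumptions; the source only says
    "softmax KL on off-diagonal entries".
    """
    n = teacher.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    # Boolean indexing drops the diagonal and leaves N-1 entries per row.
    s_t = cosine_similarity_matrix(teacher)[off_diag].reshape(n, n - 1) / tau
    s_s = cosine_similarity_matrix(student)[off_diag].reshape(n, n - 1) / tau
    log_p, log_q = _log_softmax(s_t), _log_softmax(s_s)
    return float(np.mean(np.sum(np.exp(log_p) * (log_p - log_q), axis=1)))


# Hypothetical shapes: 16 tokens, teacher width 768, student width 1152.
rng = np.random.default_rng(0)
teacher_tokens = rng.normal(size=(16, 768))
student_tokens = rng.normal(size=(16, 1152))
print(structural_mse(teacher_tokens, student_tokens))
print(structural_kl(teacher_tokens, student_tokens))
```

One property of this formulation is that both similarity matrices are N × N regardless of feature width, so in this sketch the teacher and student features never have to share a dimension. Whether sREPA still applies a projection head before computing similarities is not stated on this page.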

Load-bearing premise

Point-wise matching objectives are insufficient to capture rich spatial topology and an explicit structural constraint on relational geometry will transfer this topology more effectively.
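A small constructed example makes the premise concrete. Two hypothetical student feature sets sit at exactly the same cosine distance from their teacher tokens, yet only one of them preserves the teacher's token-to-token similarity. The sketch assumes the student features have already been projected into the teacher's space and uses `1 - mean cosine` as a stand-in for a REPA-style point-wise loss; all names and vectors are illustrative. It shows only that a nonzero point-wise loss does not determine relational structure, not that training actually reaches such states.

```python
import numpy as np


def _unit(x):
    """Normalise each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def pointwise_loss(teacher, student):
    """1 minus the mean cosine similarity between matching tokens."""
    return float(1.0 - np.mean(np.sum(_unit(teacher) * _unit(student), axis=1)))


def relational_gap(teacher, student):
    """Mean squared difference of off-diagonal token-to-token cosine similarities."""
    t, s = _unit(teacher), _unit(student)
    diff = t @ t.T - s @ s.T
    n = diff.shape[0]
    return float(np.sum(diff[~np.eye(n, dtype=bool)] ** 2) / (n * (n - 1)))


theta = np.pi / 3
c, s_ = np.cos(theta), np.sin(theta)

# Two orthogonal teacher tokens in a 4-dimensional feature space.
teacher = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0]])

# Both student tokens drift along the same extra axis: they become correlated.
student_shared = np.array([[c, 0.0, s_, 0.0],
                           [0.0, c, s_, 0.0]])

# Each student token drifts along its own axis: they stay orthogonal.
student_split = np.array([[c, 0.0, s_, 0.0],
                          [0.0, c, 0.0, s_]])

for name, student in [("shared drift", student_shared), ("split drift", student_split)]:
    print(name, pointwise_loss(teacher, student), relational_gap(teacher, student))
```

Both students report the same point-wise loss, while the relational gap is nonzero only for the shared-drift case. At a point-wise loss of exactly zero the two criteria agree, so the distinction matters only while the point-wise objective is imperfectly satisfied.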

What would settle it

A controlled comparison in which a carefully tuned point-wise baseline reaches the same convergence speed and FID scores as sREPA on standard DiT benchmarks would show the structural constraint is not necessary.

Figures

Figures reproduced from arXiv: 2605.16949 by Houqiang Li, Litong Gong, Shaodong Xu, Tiezheng Ge, Wengang Zhou, Zexian Li, Zhendong Wang.

**Figure 1.** Structural representation alignment improves diffusion model training beyond point-wise alignment. Compared with the standard REPA training framework, the proposed sREPA integrates explicit structural supervision to align relational distributions between teacher and student representations, resulting in faster convergence and consistently better generation quality.

**Figure 2.** Effectiveness of spatial structural supervision. Similarity maps computed on DINOv2 features and on diffusion features from models trained with point-wise alignment alone and with structural alignment added. Point-wise-only alignment is insufficient to mimic the token relationships of the teacher features. Adding explicit structural supervision results in more concentrated token similarity and better ima…

**Figure 3.** Samples generated on ImageNet 256×256 with sREPA (SiT-XL/2 model). Classifier-free guidance is applied with scale w = 4.0.

**Figure 4.** sREPA improves visual scaling. We observe that sREPA produces higher-quality images at …

**Figure 5.** The visualization results of SiT-XL/2 + sREPA utilize Classifier-Free Guidance (CFG) with …

**Figure 6.** The visualization results of SiT-XL/2 + sREPA utilize Classifier-Free Guidance (CFG) with …

**Figure 7.** The visualization results of SiT-XL/2 + sREPA utilize Classifier-Free Guidance (CFG) with …

**Figure 8.** The visualization results of SiT-XL/2 + sREPA utilize Classifier-Free Guidance (CFG) with …

**Figure 9.** The visualization results of SiT-XL/2 + sREPA utilize Classifier-Free Guidance (CFG) with …

**Figure 10.** The visualization results of SiT-XL/2 + sREPA utilize Classifier-Free Guidance (CFG) …

**Figure 11.** The visualization results of SiT-XL/2 + sREPA utilize Classifier-Free Guidance (CFG) …

**Figure 12.** The visualization results of SiT-XL/2 + sREPA utilize Classifier-Free Guidance (CFG) …

**Figure 13.** The visualization results of SiT-XL/2 + sREPA utilize Classifier-Free Guidance (CFG) …

**Figure 14.** The visualization results of SiT-XL/2 + sREPA utilize Classifier-Free Guidance (CFG) …

**Figure 15.** The visualization results of SiT-XL/2 + sREPA utilize Classifier-Free Guidance (CFG) …

**Figure 16.** The visualization results of SiT-XL/2 + sREPA utilize Classifier-Free Guidance (CFG) …

Original abstract

Recent advances in Diffusion Transformers (DiTs) demonstrate that aligning noisy latent states with well-trained semantic features-as pioneered by Representation Alignment (REPA)-can substantially accelerate training and improve generation fidelity. Subsequent analysis(e.g., iREPA) suggests that these gains arise primarily from transferring spatial structure contained in pre-trained vision representations. However, mostly existing alignment methods employ point-wise matching objectives or rely on implicit architectural tweaks, which fail to explicitly model the spatial relational geometry inherent in vision foundation models. We argue that such element-wise supervision is insufficient to capture the rich spatial topology of visual representations, and that effective alignment for generation should instead be formulated as an explicit structural constraint. To this end, we propose sREPA, a structural REPresentation Alignment framework to enforce consistency in the relational geometry of feature maps, rather than merely matching individual feature points. By encouraging the model to internalize holistic spatial layouts and structural correlations from pre-trained features, sREPA achieves faster and more stable convergence, along with improved sample quality, compared to state-of-the-art alignment strategies. Our code and models will be released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

sREPA adds an explicit relational-geometry term to REPA-style alignment for DiTs, but the abstract gives no numbers or ablations so the claimed gains stay unverified.

The main point is that this paper takes the REPA line of work and replaces point-wise feature matching with a structural constraint that tries to preserve relational geometry across the feature map. That is a clear next step if the spatial topology really is what matters for faster DiT training. They lay out the argument cleanly: existing methods rely on element-wise losses that miss higher-order correlations, so an explicit structural term should transfer layouts more effectively and produce quicker, more stable convergence plus better samples. The framing is direct and stays close to the cited priors without overclaiming novelty beyond the formulation change.

The soft spot is exactly the one the stress-test flags. The abstract asserts performance improvements but shows zero quantitative results, no ablation tables, and no description of how the structural loss is computed or balanced against the base objective. Without a controlled comparison that keeps the alignment target and total loss budget fixed while swapping only the point-wise versus relational term, any observed speed-up could come from extra supervision, different hyperparameters, or side effects rather than the geometry modeling itself. The paper will stand or fall on whether the full version supplies those controls and the code.

This is aimed at people who train or fine-tune diffusion transformers under compute constraints and want to squeeze more out of pre-trained vision backbones. A reader who already knows REPA will get the most from seeing the concrete loss definition and the experimental isolation. It deserves peer review because the direction is practical, the literature grounding is solid, and the central hypothesis is testable; the referees can insist on the missing ablations and numbers.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes sREPA, a structural Representation Alignment framework for Diffusion Transformers. It argues that existing methods such as REPA rely on point-wise matching objectives that fail to capture the spatial relational geometry in pre-trained vision features, and instead introduces an explicit structural constraint on the relational geometry of feature maps to encourage internalization of holistic spatial layouts, claiming this yields faster and more stable convergence along with improved sample quality over state-of-the-art alignment strategies. Code and models are promised for release.

Significance. If the empirical claims hold after proper validation, sREPA could advance efficient training of DiTs by supplying a relational inductive bias that better transfers spatial topology from foundation models than point-wise supervision alone. This would extend recent representation-alignment techniques with a more topology-aware formulation, potentially improving both training speed and generation fidelity in large-scale diffusion models.

major comments (3)

Abstract: The abstract asserts performance gains in convergence and sample quality but supplies no quantitative results, ablation studies, or experimental details. All claims rest on the future code release rather than evidence presented in the manuscript.
Method section: The central claim that point-wise matching is insufficient to capture spatial topology is asserted without a supporting derivation, comparison, or isolation experiment showing that the proposed relational-geometry constraint supplies a distinct inductive bias beyond the mere addition of an extra alignment term.
Experiments section: No ablation is described that holds the alignment target and total loss budget fixed while swapping only the structural versus point-wise formulation. Without this control, any observed gains cannot be causally attributed to modeling relational geometry rather than confounding factors such as weighting or architectural side-effects.
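One way to picture the control requested in the third major comment is a single objective with a fixed alignment weight that is only redistributed between the two terms. The formula below is a hypothetical sketch, not taken from the paper: the symbols λ (total alignment weight), α (mixing coefficient), and the loss names are assumptions introduced for illustration.

```latex
\mathcal{L}(\alpha)
  = \mathcal{L}_{\mathrm{diff}}
  + \lambda \Bigl[ (1 - \alpha)\, \mathcal{L}_{\mathrm{point}}
                 + \alpha\, \mathcal{L}_{\mathrm{struc}} \Bigr],
  \qquad \alpha \in [0, 1]
```

With the teacher encoder, the aligned layer, and λ held fixed, α = 0 gives a point-wise-only run and α = 1 a structural-only run. The two terms may differ in scale, so equal weights do not by themselves guarantee equal gradient magnitude; a careful ablation would likely need to report or normalize that as well.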

minor comments (2)

Abstract: 'However, mostly existing alignment methods' is grammatically awkward and should read 'However, most existing alignment methods'.
Abstract: Missing space in 'analysis(e.g., iREPA)'.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our submission. We provide detailed responses to each major comment and indicate the revisions we intend to make.

Referee: Abstract: The abstract asserts performance gains in convergence and sample quality but supplies no quantitative results, ablation studies, or experimental details. All claims rest on the future code release rather than evidence presented in the manuscript.

Authors: We agree that the abstract would benefit from including specific quantitative results to support the claims. In the revised manuscript, we will update the abstract to include key metrics, such as improvements in FID scores and training convergence rates compared to baselines. The detailed experimental results, ablations, and comparisons are already presented in the Experiments section, and we will ensure the abstract provides a concise summary of these findings rather than relying solely on the code release.

revision: yes

Referee: Method section: The central claim that point-wise matching is insufficient to capture spatial topology is asserted without a supporting derivation, comparison, or isolation experiment showing that the proposed relational-geometry constraint supplies a distinct inductive bias beyond the mere addition of an extra alignment term.

Authors: We appreciate this point. While the Method section motivates the structural constraint by highlighting the limitations of point-wise objectives in preserving relational geometry, we acknowledge that a more formal derivation could strengthen the argument. We will revise the Method section to include a clearer mathematical derivation of how the structural alignment differs from point-wise matching and provide additional analysis to isolate the effect of the relational geometry constraint.

revision: yes

Referee: Experiments section: No ablation is described that holds the alignment target and total loss budget fixed while swapping only the structural versus point-wise formulation. Without this control, any observed gains cannot be causally attributed to modeling relational geometry rather than confounding factors such as weighting or architectural side-effects.

Authors: This is a valid concern regarding causal attribution. To address it, we will conduct and include an additional ablation study in the revised manuscript that maintains the same alignment target and total loss budget, varying only the structural versus point-wise formulation. This will help demonstrate that the gains are due to the modeling of relational geometry.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; proposal is an independent structural reformulation

The paper motivates sREPA by arguing that point-wise objectives are insufficient for spatial topology and proposes enforcing relational geometry consistency as an explicit structural constraint. No equations, derivations, or fitted parameters are shown that reduce the claimed faster convergence or improved sample quality to the inputs by construction. The argument draws on prior REPA/iREPA work for context but presents the new framework as a distinct reformulation without self-citation load-bearing on the central claim or any renaming of known results. The derivation chain is self-contained as a methodological proposal backed by empirical comparisons rather than tautological reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the assumption that pre-trained vision features contain transferable spatial relational geometry that point-wise matching fails to exploit; no free parameters or new entities are specified in the abstract.

axioms (1)

Domain assumption: Pre-trained vision foundation models encode rich spatial relational geometry in their feature maps that can be transferred to diffusion models.
Invoked to justify moving from point-wise to structural alignment, based on analysis from prior iREPA work.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We propose sREPA … by matching their similarity distributions … $\mathcal{L}^{\mathrm{MSE}}_{\mathrm{struc}} = \frac{1}{N(N-1)} \sum_{i \neq j} \lVert S^{T}_{ij} - S^{S}_{ij} \rVert^{2}$ … $\mathcal{L}^{\mathrm{KL}}_{\mathrm{struc}}$ via softmax KL on off-diagonal entries

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt
Relation between the paper passage and the cited Recognition theorem: echoes
echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Paper passage: explicit structural supervision … relational geometry of feature maps

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: point-wise matching objectives are insufficient to capture the rich spatial topology

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 14 internal anchors

[1] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023.

[2] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024.

[3] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9640–9649, 2021.

[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.

[5] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.

[6] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024.

[7] Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Mdtv2: Masked diffusion transformer is a strong image synthesizer. arXiv preprint arXiv:2303.14389, 2023.

[8] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.

[9] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.

[10] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[11] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

[12] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.

[13] Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[14] Ivan Kobyzev, Simon JD Prince, and Marcus A Brubaker. Normalizing flows: An introduction and review of current methods. IEEE transactions on pattern analysis and machine intelligence, 43(11):3964–3979, 2020.

[15] Theodoros Kouzelis, Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. Boosting generative image modeling via joint image-feature synthesis. arXiv preprint arXiv:2504.16064, 2025.

[16] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. Advances in neural information processing systems, 32, 2019.

[17] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2023.

[18] Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers. arXiv preprint arXiv:2504.10483, 2025.

[19] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

[20] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

[21] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

[22] Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40. Springer, 2024.

[23] Yunyao Mao, Wengang Zhou, Zhenbo Lu, Jiajun Deng, and Houqiang Li. Cmd: Self-supervised 3d action representation learning with cross-modal mutual distillation. In European Conference on Computer Vision (ECCV), 2022.

[24] Charlie Nash, Jacob Menick, Sander Dieleman, and Peter W Battaglia. Generating images with sparse representations. arXiv preprint arXiv:2103.03841, 2021.

[25] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

[26] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3967–3976, 2019.

[27] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023.

[28] Baoyun Peng, Xiao Jin, Jiaheng Liu, Dongsheng Li, Yichao Wu, Yu Liu, Shunfeng Zhou, and Zhaoning Zhang. Correlation congruence for knowledge distillation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5007–5016, 2019.

[29] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.

[30] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

[31] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.

[32] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.

[33] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.

[34] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025.

[35] Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, and Saining Xie. What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794, 2025.

[36] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.

[37] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

[38] Ajinkya Tejankar, Soroush Abbasi Koohpayegani, Vipin Pillai, Paolo Favaro, and Hamed Pirsiavash. Isd: Self-supervised learning by iterative similarity distillation. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9589–9598, 2020. URL https://api.semanticscholar.org/CorpusID:229297747.

[39] Frederick Tung and Greg Mori. Similarity-preserving knowledge distillation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1365–1374, 2019.

[40] Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. Ddt: Decoupled diffusion transformer. arXiv preprint arXiv:2504.05741, 2025.

[41] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025.

[42] Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, et al. Representation entanglement for generation: Training diffusion transformers is much easier than you think. arXiv preprint arXiv:2507.01467, 2025.

[43] Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15703–15712, 2025.

[44] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024.

[45] Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025.

[46] Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers. arXiv preprint arXiv:2306.09305, 2023.

A Implementation Details

A.1 Training Details

We follow the same experimental setup as in REPA [44]. All training experiments are conducted on the ImageNet [4] training split. For preprocessing,...

[1] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023.

[2] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024.

[3] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9640–9649, 2021.

[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[5] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.

[6] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.

[7] Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. MDTv2: Masked diffusion transformer is a strong image synthesizer. arXiv preprint arXiv:2303.14389, 2023.

[8] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.

[9] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

[10] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[11] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

[12] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

[13] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[14] Ivan Kobyzev, Simon J. D. Prince, and Marcus A. Brubaker. Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):3964–3979, 2020.

[15] Theodoros Kouzelis, Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. Boosting generative image modeling via joint image-feature synthesis. arXiv preprint arXiv:2504.16064, 2025.

[16] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. Advances in Neural Information Processing Systems, 32, 2019.

[17] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2023.

[18] Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. REPA-E: Unlocking VAE for end-to-end tuning with latent diffusion transformers. arXiv preprint arXiv:2504.10483, 2025.

[19] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

[20] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

[21] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

[22] Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, and Saining Xie. SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40. Springer, 2024.

[23] Yunyao Mao, Wengang Zhou, Zhenbo Lu, Jiajun Deng, and Houqiang Li. CMD: Self-supervised 3D action representation learning with cross-modal mutual distillation. In European Conference on Computer Vision (ECCV), 2022.

[24] Charlie Nash, Jacob Menick, Sander Dieleman, and Peter W. Battaglia. Generating images with sparse representations. arXiv preprint arXiv:2103.03841, 2021.

[25] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

[26] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3967–3976, 2019.

[27] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.

[28] Baoyun Peng, Xiao Jin, Jiaheng Liu, Dongsheng Li, Yichao Wu, Yu Liu, Shunfeng Zhou, and Zhaoning Zhang. Correlation congruence for knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5007–5016, 2019.

[29] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.

[30] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

[31] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

[32] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

[33] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.

[34] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.

[35] Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, and Saining Xie. What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794, 2025.

[36] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.

[37] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

[38] Ajinkya Tejankar, Soroush Abbasi Koohpayegani, Vipin Pillai, Paolo Favaro, and Hamed Pirsiavash. ISD: Self-supervised learning by iterative similarity distillation. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9589–9598, 2020. URL https://api.semanticscholar.org/CorpusID:229297747.

[39] Frederick Tung and Greg Mori. Similarity-preserving knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1365–1374, 2019.

[40] Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. DDT: Decoupled diffusion transformer. arXiv preprint arXiv:2504.05741, 2025.

[41] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-Image technical report. arXiv preprint arXiv:2508.02324, 2025.

[42] Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, et al. Representation entanglement for generation: Training diffusion transformers is much easier than you think. arXiv preprint arXiv:2507.01467, 2025.

[43] Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15703–15712, 2025.

[44] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024.

[45] Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025.

[46] Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers. arXiv preprint arXiv:2306.09305, 2023.

A Implementation Details

A.1 Training Details

We follow the same experimental setup as in REPA [44]. All training experiments are conducted on the ImageNet [4] training split. For preprocessing,...