pith. machine review for the scientific record.

arxiv: 2604.17492 · v1 · submitted 2026-04-19 · 💻 cs.CV

Recognition: unknown

Coevolving Representations in Joint Image-Feature Diffusion

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion models · representation learning · image synthesis · joint modeling · feature adaptation · generative models · VAE latents

The pith

The semantic space for guiding diffusion can improve by coevolving with the model through a jointly learned linear projection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that fixed high-level semantic features limit diffusion models because they are not tuned to the generation task. Instead, by learning a lightweight linear projection on these features together with the diffusion process, the representation space adapts to better complement the image latents. Stability is ensured by using stop-gradient targets, normalization, and regularization to avoid collapse. This coevolution leads to faster training convergence and higher quality samples in both latent and pixel diffusion settings. A sympathetic reader would care because it reframes representation learning as dynamic and task-integrated rather than static preprocessing.

Core claim

The central claim is that the representation space guiding diffusion should adapt to the generative task: it evolves during training via a linear projection on pre-trained semantic features, learned jointly with the diffusion model and stabilized against degeneracy, and this coevolution improves generative performance over fixed-representation approaches.

What carries the argument

The CoReDi mechanism, consisting of a learnable linear projection optimized jointly with the diffusion model, supported by stop-gradient targets, normalization, and regularization to maintain stable coevolution of the semantic representation space.
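Read literally, the mechanism amounts to a few lines of pseudocode. A minimal numpy sketch, with illustrative shapes and names throughout (the random stand-in for frozen encoder features, the projection `W`, and the copy-as-stop-gradient are editorial assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: frozen encoder features (batch, d_enc) -> targets (batch, d_feat)
batch, d_enc, d_feat = 32, 768, 8
encoder_feats = rng.normal(size=(batch, d_enc))            # stand-in for frozen encoder output
W = rng.normal(scale=d_enc ** -0.5, size=(d_enc, d_feat))  # learnable linear projection g_phi

def coevolving_target(feats, W, eps=1e-6):
    """Project, batch-normalize per dimension, and detach as a regression target."""
    z = feats @ W                                           # lightweight linear projection
    z = (z - z.mean(axis=0)) / (z.std(axis=0) + eps)        # batch normalization
    return z.copy()  # numpy has no autograd; the copy marks the stop-gradient

target = coevolving_target(encoder_feats, W)
# The diffusion backbone sees noisy image latents plus noisy versions of `target`
# and is trained to predict both; W is updated only through the joint loss and the
# anti-collapse regularizer, never through the detached target itself.
```

The stop-gradient is what makes the target a moving but stable objective: the projection shapes the target space, yet the prediction error never pulls the target toward the prediction.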

If this is right

  • Faster convergence during training of the diffusion model.
  • Higher sample quality in generated images.
  • Improved complementarity between semantic features and low-level image latents.
  • Effective for both VAE-based latent diffusion and direct pixel-space diffusion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method suggests that similar coevolution strategies could benefit other joint modeling tasks where one modality or feature type guides another.
  • It implies that pre-trained encoders might be better used as starting points rather than fixed oracles in generative settings.
  • Testing this on larger scale models or different pre-trained feature extractors could reveal the limits of the adaptation.

Load-bearing premise

That the proposed stabilization techniques of stop-gradient targets, normalization, and targeted regularization are necessary and sufficient to keep the evolving linear projection from collapsing into degenerate solutions during joint optimization.
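The collapse risk behind this premise can be made concrete with a toy calculation. Assuming (an editorial simplification, not the paper's loss) that denoising a feature dimension of standard deviation s under Gaussian noise sigma leaves an irreducible error of sigma² s² / (s² + sigma²), minimizing that term alone rewards s → 0, i.e. feature collapse; a VICReg-style hinge penalty max(0, 1 − s)², the kind of targeted regularization the paper invokes, restores an interior optimum:

```python
# Toy gradient descent on the per-dimension feature scale s.
# Irreducible denoising error: sigma^2 * s^2 / (s^2 + sigma^2)  (decreasing s always helps)
# Hinge variance penalty:      max(0, 1 - s)^2                   (pushes s back toward 1)
sigma = 0.5

def denoise_loss_grad(s):
    return 2 * s * sigma**4 / (s**2 + sigma**2) ** 2

def hinge_grad(s):
    return -2 * max(0.0, 1.0 - s)

def descend(use_reg, s=1.5, lr=0.05, steps=4000):
    for _ in range(steps):
        g = denoise_loss_grad(s) + (hinge_grad(s) if use_reg else 0.0)
        s -= lr * g
    return s

print(round(descend(use_reg=False), 3))  # ~0.0: the feature scale collapses
print(round(descend(use_reg=True), 3))   # settles just below 1: collapse prevented
```

Without the penalty the scale decays geometrically to zero; with it, descent stops where the inward denoising force balances the outward hinge force, which is the qualitative behavior the premise asserts for the full method.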

What would settle it

An experiment showing that a diffusion model using the coevolving projection without the stabilization techniques achieves comparable or better results than the stabilized version, or that the stabilized version shows no gain over fixed representations, would challenge the central claim.

Figures

Figures reproduced from arXiv: 2604.17492 by Nikos Komodakis, Spyros Gidaris, Theodoros Kouzelis.

Figure 1. Evolution of the representations throughout CoReDi training. As training progresses, the coevolving representations develop increasingly structured and semantically meaningful spatial organization.
Figure 2. (Left) Comparison of fixed PCA and learned CoReDi representations for DINOv2 and MOCOv3. The learned projections yield cleaner, more structured representations with coherent spatial organization, while the fixed PCA projections produce noisier, less semantically meaningful activations. (Right) By jointly adapting the representation space alongside the generative model, CoReDi consistently speeds up convergence.
Figure 3. Overview of CoReDi. Given an input image, a frozen pretrained visual encoder extracts semantic features, which are projected to a lower-dimensional space via a learnable projection gϕ, followed by batch normalization and a regularization loss to prevent collapse. Both the noisy image tokens and the noisy coevolving feature tokens are passed as input to a diffusion backbone, which jointly predicts the image and feature tokens.
Figure 4. Regularization Prevents Feature Collapse.
Figure 5. Spatial structure of coevolving representations during training.
Figure 6. FID score as a function of CFG weight. VAE-only Classifier-Free Guidance: following ReDi [22], we apply Classifier-Free Guidance exclusively to the VAE latents rather than across both the image latents and the features, as this strategy consistently yields superior generation quality and greater robustness to CFG weight variations (see Section 4.4 in [22] for more details).
Figure 7. Spatial structure of coevolving representations in pixel space.
Figure 8. Selected samples from our CoReDi-XL/2 trained for 1M steps on ImageNet 256 × 256. Images and visual representations are jointly generated by our model. We use Classifier-Free Guidance with w = 4.0.
Figure 9. Qualitative comparison of feature visualizations. For each image, we show PCA visualizations of DINOv2, MOCOv3, SigLIPv2 and MAE features. For each feature, we visualize PCA vs CoReDi's learned projection.
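The VAE-only guidance strategy noted in the Figure 6 caption reduces to a one-line restriction of standard classifier-free guidance. A hedged sketch, assuming a token layout where latent tokens precede feature tokens (the split and the function name are illustrative; only the extrapolation formula is the standard CFG rule):

```python
import numpy as np

def vae_only_cfg(pred_cond, pred_uncond, n_latent_tokens, w=4.0):
    """Apply CFG to the image-latent tokens only; feature tokens stay conditional."""
    guided = pred_cond.copy()
    lat_c = pred_cond[:n_latent_tokens]
    lat_u = pred_uncond[:n_latent_tokens]
    # standard CFG extrapolation, restricted to the VAE-latent tokens
    guided[:n_latent_tokens] = lat_u + w * (lat_c - lat_u)
    return guided

cond = np.ones((6, 4))     # toy predictions: 4 latent tokens + 2 feature tokens
uncond = np.zeros((6, 4))
out = vae_only_cfg(cond, uncond, n_latent_tokens=4, w=4.0)
```

With w = 1 this degenerates to the conditional prediction everywhere; larger w amplifies only the latent branch, which is why the caption reports greater robustness to CFG weight variations.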
The original abstract

Joint image-feature generative modeling has recently emerged as an effective strategy for improving diffusion training by coupling low-level VAE latents with high-level semantic features extracted from pre-trained visual encoders. However, existing approaches rely on a fixed representation space, constructed independently of the generative objective and kept unchanged during training. We argue that the representation space guiding diffusion should itself adapt to the generative task. To this end, we propose Coevolving Representation Diffusion (CoReDi), a framework in which the semantic representation space evolves during training by learning a lightweight linear projection jointly with the diffusion model. While naively optimizing this projection leads to degenerate solutions, we show that stable coevolution can be achieved through a combination of stop-gradient targets, normalization, and targeted regularization that prevents feature collapse. This formulation enables the semantic space to progressively specialize to the needs of image synthesis, improving its complementarity with image latents. We apply CoReDi to both VAE latent diffusion and pixel-space diffusion, demonstrating that adaptive semantic representations improve generative modeling across both settings. Experiments show that CoReDi achieves faster convergence and higher sample quality compared to joint diffusion models operating in fixed representation spaces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 4 minor

Summary. The paper claims that fixed semantic representation spaces are suboptimal for joint image-feature diffusion models. It introduces Coevolving Representation Diffusion (CoReDi), which jointly optimizes a lightweight linear projection on semantic features extracted from pre-trained encoders together with the diffusion model. Stable coevolution is achieved via stop-gradient targets, normalization, and targeted regularization to avoid feature collapse, allowing the semantic space to specialize and improve complementarity with VAE latents. The method is applied to both latent diffusion and pixel-space diffusion, with experiments asserting faster convergence and higher sample quality versus fixed-representation baselines.

Significance. If the stabilization succeeds and the reported gains hold, the work provides a practical mechanism for adapting high-level features to the generative objective rather than treating them as static priors. This could meaningfully advance joint diffusion pipelines by increasing the utility of semantic conditioning without heavy architectural changes. The reliance on standard stabilizers (stop-gradient, normalization) is a strength, as is the dual application to latent and pixel spaces.

minor comments (4)
  1. Abstract: states that experiments demonstrate faster convergence and higher sample quality but provides no numerical metrics, datasets, or baseline comparisons; adding 1-2 key quantitative results would make the claim more informative.
  2. §3 (Method): the linear projection is described as 'lightweight' but its exact dimensionality, initialization, and interaction with the diffusion loss are not fully specified in the high-level description; a short equation or pseudocode block would improve reproducibility.
  3. Figures: convergence plots and sample-quality comparisons should include error bars or multiple runs to support the 'faster convergence' claim; captions could explicitly state the metrics used (e.g., FID, precision/recall).
  4. Related work: the discussion of prior joint diffusion models could more explicitly contrast the fixed vs. adaptive representation distinction with the closest baselines.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary and recommendation of minor revision. The recognition that CoReDi offers a practical way to adapt semantic features to the generative objective without heavy architectural changes is appreciated. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity in the coevolution framework

full rationale

The paper introduces CoReDi as an empirical training procedure: a lightweight linear projection is optimized jointly with the diffusion model, with collapse prevented by stop-gradient targets, normalization, and regularization. These stabilizers are standard techniques whose effectiveness is demonstrated experimentally rather than derived from prior fitted quantities or self-referential definitions. No load-bearing step reduces by construction to its own inputs, no uniqueness theorem is imported from self-citations, and the central claims of faster convergence and higher sample quality rest on comparative experiments across latent and pixel-space diffusion, not on tautological renaming or ansatz smuggling. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that the described stabilization techniques (stop-gradient targets, normalization, targeted regularization) suffice to prevent feature collapse when a linear projection is optimized jointly with the diffusion model; no additional free parameters, axioms, or invented entities are explicitly introduced beyond standard diffusion training components.

pith-pipeline@v0.9.0 · 5502 in / 1085 out tokens · 41761 ms · 2026-05-10T06:27:09.453870+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

52 extracted references · 26 canonical work pages · 9 internal anchors

  1. [1] Albergo, M., Boffi, N.M., Vanden-Eijnden, E.: Stochastic interpolants: A unifying framework for flows and diffusions. Journal of Machine Learning Research 26(209), 1–80 (2025)

  2. [2] Bardes, A., Ponce, J., LeCun, Y.: VICReg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906 (2021)

  3. [3] Bardes, A., Ponce, J., LeCun, Y.: VICRegL: Self-supervised learning of local visual features. Advances in Neural Information Processing Systems 35, 8799–8810 (2022)

  4. [4] Black Forest Labs: FLUX.2: Analyzing and enhancing the latent space of FLUX – representation comparison (2025), https://bfl.ai/research/representation-comparison

  5. [5] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9630–9640 (2021)

  6. [6] Chen, B., Bi, S., Tan, H., Zhang, H., Zhang, T., Li, Z., Xiong, Y., Zhang, J., Zhang, K.: Aligning visual foundation encoders to tokenizers for diffusion models. ICLR (2026)

  7. [7] Chen, S., Ge, C., Zhang, S., Sun, P., Luo, P.: PixelFlow: Pixel-space generative models with flow. arXiv preprint arXiv:2504.07963 (2025)

  8. [8] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)

  9. [9] Chen, X., He, K.: Exploring simple siamese representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750–15758 (2021)

  10. [10] Chen, X., Xie, S., He, K.: An empirical study of training self-supervised vision transformers. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9620–9629 (2021)

  11. [11] Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems 34, 8780–8794 (2021)

  12. [12] Ermolov, A., Siarohin, A., Sangineto, E., Sebe, N.: Whitening for self-supervised representation learning. In: International Conference on Machine Learning, pp. 3015–…

  13. [13] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first International Conference on Machine Learning (2024)

  14. [14] Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)

  15. [15] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)

  16. [16] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 30 (2017)

  17. [17] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851 (2020)

  18. [18] Huang, J., Kumar, R., Mitra, M., Zhu, W.J., Zabih, R.: Image indexing using color correlograms. Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 762–768 (1997)

  19. [19] Karypidis, E., Kakogeorgiou, I., Gidaris, S., Komodakis, N.: DINO-Foresight: Looking into the future with DINO. arXiv preprint arXiv:2412.11673 (2024)

  20. [20] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  21. [21] Kouzelis, T., Kakogeorgiou, I., Gidaris, S., Komodakis, N.: EQ-VAE: Equivariance regularized latent space for improved generative image modeling. In: Forty-second International Conference on Machine Learning (2025), https://openreview.net/forum?id=UWhW5YYLo64

  22. [22] Kouzelis, T., Karypidis, E., Kakogeorgiou, I., Gidaris, S., Komodakis, N.: Boosting generative image modeling via joint image-feature synthesis. arXiv preprint arXiv:2504.16064 (2025)

  23. [23] Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., Aila, T.: Improved precision and recall metric for assessing generative models. Advances in Neural Information Processing Systems 32 (2019)

  24. [24] Leng, X., Singh, J., Hou, Y., Xing, Z., Xie, S., Zheng, L.: REPA-E: Unlocking VAE for end-to-end tuning with latent diffusion transformers. arXiv preprint arXiv:2504.10483 (2025)

  25. [25] Li, T., He, K.: Back to basics: Let denoising generative models denoise. arXiv preprint arXiv:2511.13720 (2025)

  26. [26] Li, T., Sun, Q., Fan, L., He, K.: Fractal generative models. arXiv preprint arXiv:2502.17437 (2025)

  27. [27] Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

  28. [28] Ma, N., Goldstein, M., Albergo, M.S., Boffi, N.M., Vanden-Eijnden, E., Xie, S.: SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In: European Conference on Computer Vision, pp. 23–40. Springer (2024)

  29. [29] Ma, Z., Wei, L., Wang, S., Zhang, S., Tian, Q.: DeCo: Frequency-decoupled pixel diffusion for end-to-end image generation. arXiv preprint arXiv:2511.19365 (2025)

  30. [30] Nash, C., Menick, J., Dieleman, S., Battaglia, P.W.: Generating images with sparse representations. arXiv preprint arXiv:2103.03841 (2021)

  31. [31] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

  32. [32] Pan, Y., Feng, R., Dai, Q., Wang, Y., Lin, W., Guo, M., Luo, C., Zheng, N.: Semantics lead the way: Harmonizing semantic and texture modeling with asynchronous latent diffusion. arXiv preprint arXiv:2512.04926 (2025)

  33. [33] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205 (2023)

  34. [34] Petsangourakis, G., Sgouropoulos, C., Psomas, B., Giannakopoulos, T., Sfikas, G., Kakogeorgiou, I.: Reglue your latents with global and local semantics for entangled diffusion. arXiv preprint arXiv:2512.16636 (2025)

  35. [35] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695 (2022)

  36. [36] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Springer (2015)

  37. [37] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. Advances in Neural Information Processing Systems 29 (2016)

  38. [39] Shi, M., Wang, H., Zheng, W., Yuan, Z., Wu, X., Wang, X., Wan, P., Zhou, J., Lu, J.: Latent diffusion model without variational autoencoder. ICLR (2026), arXiv:2510.15301

  39. [40] Singh, J., Leng, X., Wu, Z., Zheng, L., Zhang, R., Shechtman, E., Xie, S.: What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794 (2025)

  40. [41] Teng, J., Zheng, W., Ding, M., Hong, W., Wangni, J., Yang, Z., Tang, J.: Relay diffusion: Unifying diffusion process across resolutions for image synthesis. arXiv preprint arXiv:2309.03350 (2023)

  41. [42] Tong, S., Zheng, B., Wang, Z., Tang, B., Ma, N., Brown, E., Yang, J., Fergus, R., LeCun, Y., Xie, S.: Scaling text-to-image diffusion transformers with representation autoencoders. arXiv preprint arXiv:2601.16208 (2026)

  42. [43] Tschannen, M., Gritsenko, A., Wang, X., Naeem, M.F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., et al.: SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786 (2025)

  43. [44] Venkataramanan, S., Pariza, V., Salehi, M., Knobel, L., Gidaris, S., Ramzi, E., Bursuc, A., Asano, Y.M.: Franca: Nested Matryoshka clustering for scalable visual representation learning. arXiv preprint arXiv:2507.14137 (2025)

  44. [45] Wang, S., Gao, Z., Zhu, C., Huang, W., Wang, L.: PixNerd: Pixel neural field diffusion. arXiv preprint arXiv:2507.23268 (2025)

  45. [46] Wang, S., Tian, Z., Huang, W., Wang, L.: DDT: Decoupled diffusion transformer. arXiv preprint arXiv:2504.05741 (2025)

  46. [47] Wu, G., Zhang, S., Shi, R., Gao, S., Chen, Z., Wang, L., Chen, Z., Gao, H., Tang, Y., Yang, J., et al.: Representation entanglement for generation: Training diffusion transformers is much easier than you think. arXiv preprint arXiv:2507.01467 (2025)

  47. [48] Yao, J., Yang, B., Wang, X.: Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 15703–15712 (2025)

  48. [49] Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., Xie, S.: Representation alignment for generation: Training diffusion transformers is easier than you think. In: International Conference on Learning Representations (2025)

  49. [50] Zbontar, J., Jing, L., Misra, I., LeCun, Y., Deny, S.: Barlow Twins: Self-supervised learning via redundancy reduction. In: International Conference on Machine Learning, pp. 12310–12320. PMLR (2021)

  50. [51] Zhai, S., Zhang, R., Nakkiran, P., Berthelot, D., Gu, J., Zheng, H., Chen, T., Bautista, M.A., Jaitly, N., Susskind, J.: Normalizing flows are capable generative models. arXiv preprint arXiv:2412.06329 (2024)

  51. [52] Zheng, B., Ma, N., Tong, S., Xie, S.: Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690 (2025)

  52. [53] Zheng, K., Chen, Y., Mao, H., Liu, M.Y., Zhu, J., Zhang, Q.: Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. arXiv preprint arXiv:2409.02908 (2024)