pith. machine review for the scientific record.

arxiv: 2604.17492 · v1 · submitted 2026-04-19 · 💻 cs.CV

Recognition: unknown

Coevolving Representations in Joint Image-Feature Diffusion

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion models · representation learning · image synthesis · joint modeling · feature adaptation · generative models · VAE latents

The pith

The semantic space for guiding diffusion can improve by coevolving with the model through a jointly learned linear projection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that fixed high-level semantic features limit diffusion models because they are not tuned to the generation task. Instead, by learning a lightweight linear projection on these features together with the diffusion process, the representation space adapts to better complement the image latents. Stability is ensured by using stop-gradient targets, normalization, and regularization to avoid collapse. This coevolution leads to faster training convergence and higher quality samples in both latent and pixel diffusion settings. A sympathetic reader would care because it reframes representation learning as dynamic and task-integrated rather than static preprocessing.

Core claim

The central claim is that the representation space guiding diffusion should adapt to the generative task: it evolves during training via a linear projection on pre-trained semantic features, learned jointly with the diffusion model and stabilized against degeneracy, and this coevolution improves generative performance over fixed-representation approaches.

What carries the argument

The CoReDi mechanism, consisting of a learnable linear projection optimized jointly with the diffusion model, supported by stop-gradient targets, normalization, and regularization to maintain stable coevolution of the semantic representation space.
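Read literally, the mechanism amounts to a few lines of pseudocode. A minimal numpy sketch, with illustrative shapes and names throughout (the random stand-in for frozen encoder features, the projection `W`, and the copy-as-stop-gradient are editorial assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: frozen encoder features (batch, d_enc) -> targets (batch, d_feat)
batch, d_enc, d_feat = 32, 768, 8
encoder_feats = rng.normal(size=(batch, d_enc))            # stand-in for frozen encoder output
W = rng.normal(scale=d_enc ** -0.5, size=(d_enc, d_feat))  # learnable linear projection g_phi

def coevolving_target(feats, W, eps=1e-6):
    """Project, batch-normalize per dimension, and detach as a regression target."""
    z = feats @ W                                           # lightweight linear projection
    z = (z - z.mean(axis=0)) / (z.std(axis=0) + eps)        # batch normalization
    return z.copy()  # numpy has no autograd; the copy marks the stop-gradient

target = coevolving_target(encoder_feats, W)
# The diffusion backbone sees noisy image latents plus noisy versions of `target`
# and is trained to predict both; W is updated only through the joint loss and the
# anti-collapse regularizer, never through the detached target itself.
```

The stop-gradient is what makes the target a moving but stable objective: the projection shapes the target space, yet the prediction error never pulls the target toward the prediction.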

If this is right

  • Faster convergence during training of the diffusion model.
  • Higher sample quality in generated images.
  • Improved complementarity between semantic features and low-level image latents.
  • Effective for both VAE-based latent diffusion and direct pixel-space diffusion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method suggests that similar coevolution strategies could benefit other joint modeling tasks where one modality or feature type guides another.
  • It implies that pre-trained encoders might be better used as starting points rather than fixed oracles in generative settings.
  • Testing this on larger scale models or different pre-trained feature extractors could reveal the limits of the adaptation.

Load-bearing premise

That the proposed stabilization techniques of stop-gradient targets, normalization, and targeted regularization are necessary and sufficient to keep the evolving linear projection from collapsing into degenerate solutions during joint optimization.
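The collapse risk behind this premise can be made concrete with a toy calculation. Assuming (an editorial simplification, not the paper's loss) that denoising a feature dimension of standard deviation s under Gaussian noise sigma leaves an irreducible error of sigma² s² / (s² + sigma²), minimizing that term alone rewards s → 0, i.e. feature collapse; a VICReg-style hinge penalty max(0, 1 − s)², the kind of targeted regularization the paper invokes, restores an interior optimum:

```python
# Toy gradient descent on the per-dimension feature scale s.
# Irreducible denoising error: sigma^2 * s^2 / (s^2 + sigma^2)  (decreasing s always helps)
# Hinge variance penalty:      max(0, 1 - s)^2                   (pushes s back toward 1)
sigma = 0.5

def denoise_loss_grad(s):
    return 2 * s * sigma**4 / (s**2 + sigma**2) ** 2

def hinge_grad(s):
    return -2 * max(0.0, 1.0 - s)

def descend(use_reg, s=1.5, lr=0.05, steps=4000):
    for _ in range(steps):
        g = denoise_loss_grad(s) + (hinge_grad(s) if use_reg else 0.0)
        s -= lr * g
    return s

print(round(descend(use_reg=False), 3))  # ~0.0: the feature scale collapses
print(round(descend(use_reg=True), 3))   # settles just below 1: collapse prevented
```

Without the penalty the scale decays geometrically to zero; with it, descent stops where the inward denoising force balances the outward hinge force, which is the qualitative behavior the premise asserts for the full method.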

What would settle it

An experiment showing that a diffusion model using the coevolving projection without the stabilization techniques achieves comparable or better results than the stabilized version, or that the stabilized version shows no gain over fixed representations, would challenge the central claim.

Figures

Figures reproduced from arXiv: 2604.17492 by Nikos Komodakis, Spyros Gidaris, Theodoros Kouzelis.

Figure 1. Evolution of the representations throughout CoReDi training. As training progresses, the coevolving representations develop increasingly structured and semantically meaningful spatial organization.
Figure 2. (Left) Comparison of fixed PCA and learned CoReDi representations for DINOv2 and MOCOv3. The learned projections yield cleaner, more structured representations with coherent spatial organization, while the fixed PCA projections produce noisier, less semantically meaningful activations. (Right) By jointly adapting the representation space alongside the generative model, CoReDi consistently speeds up convergence.
Figure 3. Overview of CoReDi. Given an input image, a frozen pretrained visual encoder extracts semantic features, which are projected to a lower-dimensional space via a learnable projection gϕ, followed by batch normalization and a regularization loss to prevent collapse. Both the noisy image tokens and the noisy coevolving feature tokens are passed as input to a diffusion backbone, which jointly predicts the image and feature tokens.
Figure 4. Regularization Prevents Feature Collapse.
Figure 5. Spatial structure of coevolving representations during training.
Figure 6. FID score as a function of CFG weight. VAE-only Classifier-Free Guidance: following ReDi [22], we apply Classifier-Free Guidance exclusively to the VAE latents rather than across both the image latents and the features, as this strategy consistently yields superior generation quality and greater robustness to CFG weight variations (see Section 4.4 in [22] for more details).
Figure 7. Spatial structure of coevolving representations in pixel space.
Figure 8. Selected samples from our CoReDi-XL/2 trained for 1M steps on ImageNet 256 × 256. Images and visual representations are jointly generated by our model. We use Classifier-Free Guidance with w = 4.0.
Figure 9. Qualitative comparison of feature visualizations. For each image, we show PCA visualizations of DINOv2, MOCOv3, SigLIPv2 and MAE features. For each feature, we visualize PCA vs CoReDi's learned projection.
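The VAE-only guidance strategy noted in the Figure 6 caption reduces to a one-line restriction of standard classifier-free guidance. A hedged sketch, assuming a token layout where latent tokens precede feature tokens (the split and the function name are illustrative; only the extrapolation formula is the standard CFG rule):

```python
import numpy as np

def vae_only_cfg(pred_cond, pred_uncond, n_latent_tokens, w=4.0):
    """Apply CFG to the image-latent tokens only; feature tokens stay conditional."""
    guided = pred_cond.copy()
    lat_c = pred_cond[:n_latent_tokens]
    lat_u = pred_uncond[:n_latent_tokens]
    # standard CFG extrapolation, restricted to the VAE-latent tokens
    guided[:n_latent_tokens] = lat_u + w * (lat_c - lat_u)
    return guided

cond = np.ones((6, 4))     # toy predictions: 4 latent tokens + 2 feature tokens
uncond = np.zeros((6, 4))
out = vae_only_cfg(cond, uncond, n_latent_tokens=4, w=4.0)
```

With w = 1 this degenerates to the conditional prediction everywhere; larger w amplifies only the latent branch, which is why the caption reports greater robustness to CFG weight variations.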
The original abstract

Joint image-feature generative modeling has recently emerged as an effective strategy for improving diffusion training by coupling low-level VAE latents with high-level semantic features extracted from pre-trained visual encoders. However, existing approaches rely on a fixed representation space, constructed independently of the generative objective and kept unchanged during training. We argue that the representation space guiding diffusion should itself adapt to the generative task. To this end, we propose Coevolving Representation Diffusion (CoReDi), a framework in which the semantic representation space evolves during training by learning a lightweight linear projection jointly with the diffusion model. While naively optimizing this projection leads to degenerate solutions, we show that stable coevolution can be achieved through a combination of stop-gradient targets, normalization, and targeted regularization that prevents feature collapse. This formulation enables the semantic space to progressively specialize to the needs of image synthesis, improving its complementarity with image latents. We apply CoReDi to both VAE latent diffusion and pixel-space diffusion, demonstrating that adaptive semantic representations improve generative modeling across both settings. Experiments show that CoReDi achieves faster convergence and higher sample quality compared to joint diffusion models operating in fixed representation spaces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 4 minor

Summary. The paper claims that fixed semantic representation spaces are suboptimal for joint image-feature diffusion models. It introduces Coevolving Representation Diffusion (CoReDi), which jointly optimizes a lightweight linear projection on semantic features extracted from pre-trained encoders together with the diffusion model. Stable coevolution is achieved via stop-gradient targets, normalization, and targeted regularization to avoid feature collapse, allowing the semantic space to specialize and improve complementarity with VAE latents. The method is applied to both latent diffusion and pixel-space diffusion, with experiments asserting faster convergence and higher sample quality versus fixed-representation baselines.

Significance. If the stabilization succeeds and the reported gains hold, the work provides a practical mechanism for adapting high-level features to the generative objective rather than treating them as static priors. This could meaningfully advance joint diffusion pipelines by increasing the utility of semantic conditioning without heavy architectural changes. The reliance on standard stabilizers (stop-gradient, normalization) is a strength, as is the dual application to latent and pixel spaces.

minor comments (4)
  1. Abstract: states that experiments demonstrate faster convergence and higher sample quality but provides no numerical metrics, datasets, or baseline comparisons; adding 1-2 key quantitative results would make the claim more informative.
  2. §3 (Method): the linear projection is described as 'lightweight' but its exact dimensionality, initialization, and interaction with the diffusion loss are not fully specified in the high-level description; a short equation or pseudocode block would improve reproducibility.
  3. Figures: convergence plots and sample-quality comparisons should include error bars or multiple runs to support the 'faster convergence' claim; captions could explicitly state the metrics used (e.g., FID, precision/recall).
  4. Related work: the discussion of prior joint diffusion models could more explicitly contrast the fixed vs. adaptive representation distinction with the closest baselines.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary and recommendation of minor revision. The recognition that CoReDi offers a practical way to adapt semantic features to the generative objective without heavy architectural changes is appreciated. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity in the coevolution framework

full rationale

The paper introduces CoReDi as an empirical training procedure: a lightweight linear projection is optimized jointly with the diffusion model, with collapse prevented by stop-gradient targets, normalization, and regularization. These stabilizers are standard techniques whose effectiveness is demonstrated experimentally rather than derived from prior fitted quantities or self-referential definitions. No load-bearing step reduces by construction to its own inputs, no uniqueness theorem is imported from self-citations, and the central claims of faster convergence and higher sample quality rest on comparative experiments across latent and pixel-space diffusion, not on tautological renaming or ansatz smuggling. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that the described stabilization techniques (stop-gradient targets, normalization, targeted regularization) suffice to prevent feature collapse when a linear projection is optimized jointly with the diffusion model; no additional free parameters, axioms, or invented entities are explicitly introduced beyond standard diffusion training components.

pith-pipeline@v0.9.0 · 5502 in / 1085 out tokens · 41761 ms · 2026-05-10T06:27:09.453870+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

52 extracted references · 26 canonical work pages · 9 internal anchors

  1. [1] Albergo, M., Boffi, N.M., Vanden-Eijnden, E.: Stochastic interpolants: A unifying framework for flows and diffusions. Journal of Machine Learning Research 26(209), 1–80 (2025)

  2. [2] Bardes, A., Ponce, J., LeCun, Y.: VICReg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906 (2021)

  3. [3] Bardes, A., Ponce, J., LeCun, Y.: VICRegL: Self-supervised learning of local visual features. Advances in Neural Information Processing Systems 35, 8799–8810 (2022)

  4. [4] Black Forest Labs: FLUX.2: Analyzing and enhancing the latent space of FLUX – representation comparison (2025), https://bfl.ai/research/representation-comparison

  5. [5] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9630–9640 (2021)

  6. [6] Chen, B., Bi, S., Tan, H., Zhang, H., Zhang, T., Li, Z., Xiong, Y., Zhang, J., Zhang, K.: Aligning visual foundation encoders to tokenizers for diffusion models. ICLR (2026)

  7. [7] Chen, S., Ge, C., Zhang, S., Sun, P., Luo, P.: PixelFlow: Pixel-space generative models with flow. arXiv preprint arXiv:2504.07963 (2025)

  8. [8] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)

  9. [9] Chen, X., He, K.: Exploring simple siamese representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750–15758 (2021)

  10. [10] Chen, X., Xie, S., He, K.: An empirical study of training self-supervised vision transformers. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9620–9629 (2021)

  11. [11] Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems 34, 8780–8794 (2021)

  12. [12] Ermolov, A., Siarohin, A., Sangineto, E., Sebe, N.: Whitening for self-supervised representation learning. In: International Conference on Machine Learning, pp. 3015–…

  13. [13] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first International Conference on Machine Learning (2024)

  14. [14] Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)

  15. [15] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)

  16. [16] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 30 (2017)

  17. [17] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851 (2020)

  18. [18] Huang, J., Kumar, R., Mitra, M., Zhu, W.J., Zabih, R.: Image indexing using color correlograms. Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 762–768 (1997)

  19. [19] Karypidis, E., Kakogeorgiou, I., Gidaris, S., Komodakis, N.: DINO-Foresight: Looking into the future with DINO. arXiv preprint arXiv:2412.11673 (2024)

  20. [20] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  21. [21] Kouzelis, T., Kakogeorgiou, I., Gidaris, S., Komodakis, N.: EQ-VAE: Equivariance regularized latent space for improved generative image modeling. In: Forty-second International Conference on Machine Learning (2025), https://openreview.net/forum?id=UWhW5YYLo64

  22. [22] Kouzelis, T., Karypidis, E., Kakogeorgiou, I., Gidaris, S., Komodakis, N.: Boosting generative image modeling via joint image-feature synthesis. arXiv preprint arXiv:2504.16064 (2025)

  23. [23] Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., Aila, T.: Improved precision and recall metric for assessing generative models. Advances in Neural Information Processing Systems 32 (2019)

  24. [24] Leng, X., Singh, J., Hou, Y., Xing, Z., Xie, S., Zheng, L.: REPA-E: Unlocking VAE for end-to-end tuning with latent diffusion transformers. arXiv preprint arXiv:2504.10483 (2025)

  25. [25] Li, T., He, K.: Back to basics: Let denoising generative models denoise. arXiv preprint arXiv:2511.13720 (2025)

  26. [26] Li, T., Sun, Q., Fan, L., He, K.: Fractal generative models. arXiv preprint arXiv:2502.17437 (2025)

  27. [27] Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)

  28. [28] Ma, N., Goldstein, M., Albergo, M.S., Boffi, N.M., Vanden-Eijnden, E., Xie, S.: SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In: European Conference on Computer Vision, pp. 23–40. Springer (2024)

  29. [29] Ma, Z., Wei, L., Wang, S., Zhang, S., Tian, Q.: DeCo: Frequency-decoupled pixel diffusion for end-to-end image generation. arXiv preprint arXiv:2511.19365 (2025)

  30. [30] Nash, C., Menick, J., Dieleman, S., Battaglia, P.W.: Generating images with sparse representations. arXiv preprint arXiv:2103.03841 (2021)

  31. [31] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

  32. [32] Pan, Y., Feng, R., Dai, Q., Wang, Y., Lin, W., Guo, M., Luo, C., Zheng, N.: Semantics lead the way: Harmonizing semantic and texture modeling with asynchronous latent diffusion. arXiv preprint arXiv:2512.04926 (2025)

  33. [33] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205 (2023)

  34. [34] Petsangourakis, G., Sgouropoulos, C., Psomas, B., Giannakopoulos, T., Sfikas, G., Kakogeorgiou, I.: Reglue your latents with global and local semantics for entangled diffusion. arXiv preprint arXiv:2512.16636 (2025)

  35. [35] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695 (2022)

  36. [36] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Springer (2015)

  37. [37] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. Advances in Neural Information Processing Systems 29 (2016)

  38. [39] Shi, M., Wang, H., Zheng, W., Yuan, Z., Wu, X., Wang, X., Wan, P., Zhou, J., Lu, J.: Latent diffusion model without variational autoencoder. ICLR (2026), arXiv:2510.15301

  39. [40] Singh, J., Leng, X., Wu, Z., Zheng, L., Zhang, R., Shechtman, E., Xie, S.: What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794 (2025)

  40. [41] Teng, J., Zheng, W., Ding, M., Hong, W., Wangni, J., Yang, Z., Tang, J.: Relay diffusion: Unifying diffusion process across resolutions for image synthesis. arXiv preprint arXiv:2309.03350 (2023)

  41. [42] Tong, S., Zheng, B., Wang, Z., Tang, B., Ma, N., Brown, E., Yang, J., Fergus, R., LeCun, Y., Xie, S.: Scaling text-to-image diffusion transformers with representation autoencoders. arXiv preprint arXiv:2601.16208 (2026)

  42. [43] Tschannen, M., Gritsenko, A., Wang, X., Naeem, M.F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., et al.: SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786 (2025)

  43. [44] Venkataramanan, S., Pariza, V., Salehi, M., Knobel, L., Gidaris, S., Ramzi, E., Bursuc, A., Asano, Y.M.: Franca: Nested Matryoshka clustering for scalable visual representation learning. arXiv preprint arXiv:2507.14137 (2025)

  44. [45] Wang, S., Gao, Z., Zhu, C., Huang, W., Wang, L.: PixNerd: Pixel neural field diffusion. arXiv preprint arXiv:2507.23268 (2025)

  45. [46] Wang, S., Tian, Z., Huang, W., Wang, L.: DDT: Decoupled diffusion transformer. arXiv preprint arXiv:2504.05741 (2025)

  46. [47] Wu, G., Zhang, S., Shi, R., Gao, S., Chen, Z., Wang, L., Chen, Z., Gao, H., Tang, Y., Yang, J., et al.: Representation entanglement for generation: Training diffusion transformers is much easier than you think. arXiv preprint arXiv:2507.01467 (2025)

  47. [48] Yao, J., Yang, B., Wang, X.: Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 15703–15712 (2025)

  48. [49] Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., Xie, S.: Representation alignment for generation: Training diffusion transformers is easier than you think. In: International Conference on Learning Representations (2025)

  49. [50] Zbontar, J., Jing, L., Misra, I., LeCun, Y., Deny, S.: Barlow Twins: Self-supervised learning via redundancy reduction. In: International Conference on Machine Learning, pp. 12310–12320. PMLR (2021)

  50. [51] Zhai, S., Zhang, R., Nakkiran, P., Berthelot, D., Gu, J., Zheng, H., Chen, T., Bautista, M.A., Jaitly, N., Susskind, J.: Normalizing flows are capable generative models. arXiv preprint arXiv:2412.06329 (2024)

  51. [52] Zheng, B., Ma, N., Tong, S., Xie, S.: Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690 (2025)

  52. [53] Zheng, K., Chen, Y., Mao, H., Liu, M.Y., Zhu, J., Zhang, Q.: Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling. arXiv preprint arXiv:2409.02908 (2024)