How Neural Losses Shape VAE Latents

Emanuele Rodolà; Giorgio Strano; Luca Cerovaz; Michele Mancusi; Tommaso Mencattini

arxiv: 2606.00635 · v1 · pith:V6OZ4GRVnew · submitted 2026-05-30 · 💻 cs.LG

Giorgio Strano, Luca Cerovaz, Michele Mancusi, Tommaso Mencattini, Emanuele Rodolà

Pith reviewed 2026-06-28 18:52 UTC · model grok-4.3

classification 💻 cs.LG

keywords VAE · latent space geometry · perceptual loss · adversarial loss · rate-distortion tradeoff · posterior variance · isotropic representations · neural reconstruction

The pith

Augmenting VAE reconstruction with perceptual and adversarial losses reduces information stored in the latent representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the choice of reconstruction loss in VAEs fundamentally alters the rate-distortion optimization. Adding neural terms such as perceptual and adversarial objectives leads to lower information content in the latents compared to pointwise likelihood alone. These losses also reshape the latent geometry, producing more isotropic representations where uncertainty is spread more evenly across dimensions. A reader would care because this explains observed differences in VAE behavior that are not visible from output quality or standard rate-distortion analysis.

Core claim

Augmenting pointwise reconstruction with neural terms reduces the amount of information stored in the latent representations. Neural reconstruction losses systematically change the geometry of the latent space: they make representations more isotropic and distribute uncertainty more evenly across latent dimensions, producing different posterior variance profiles. The rate-distortion tradeoff is not a comprehensive lens to understand VAE behavior.

What carries the argument

The rate-distortion optimization problem, reshaped by the choice of distortion metric from pointwise to neural reconstruction losses.
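The objective described here can be made concrete with a small numerical sketch. It is a minimal illustration and not the authors' implementation: it assumes a diagonal-Gaussian posterior with a standard-normal prior, an additive combination of the pointwise and neural terms with weight `lam`, and a hypothetical `feature_map` standing in for a perceptual network such as LPIPS or DINOv2 features.

```python
import numpy as np

def gaussian_kl_rate(mu, logvar):
    """Rate: KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over latent
    dimensions and averaged over the batch (in nats)."""
    kl_per_dim = 0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar)
    return kl_per_dim.sum(axis=1).mean()

def pointwise_distortion(x, x_hat):
    """Squared error summed over features, averaged over the batch."""
    return ((x - x_hat) ** 2).sum(axis=1).mean()

def neural_distortion(x, x_hat, feature_map):
    """Squared distance between features of x and x_hat.
    `feature_map` is a hypothetical stand-in for a perceptual network."""
    return ((feature_map(x) - feature_map(x_hat)) ** 2).sum(axis=1).mean()

def augmented_objective(x, x_hat, mu, logvar, beta, lam, feature_map):
    """Distortion + beta * rate, where the distortion is the pointwise term
    plus lam times the neural term (additive weighting is an assumption)."""
    rate = gaussian_kl_rate(mu, logvar)
    distortion = pointwise_distortion(x, x_hat) + lam * neural_distortion(
        x, x_hat, feature_map
    )
    return distortion + beta * rate, rate, distortion
```

Setting `lam = 0` recovers the pointwise objective, so sweeping `lam` at fixed `beta` changes only the distortion metric while the rate term stays the same function of the posterior.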

If this is right

Neural losses produce different posterior variance profiles than pointwise reconstruction.
The standard rate-distortion lens fails to capture how distortion metric choice affects learned representations.
A mechanistic investigation of how each distortion metric reshapes the optimization is required instead.
Latent space properties can be steered by loss choice without changing the model architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This suggests that practitioners could select losses to achieve desired latent properties like isotropy for downstream tasks such as interpolation.
Similar effects may appear in other generative models that combine reconstruction with perceptual objectives.
The findings motivate experiments that isolate the contribution of each neural loss term to the observed geometry changes.

Load-bearing premise

The observed changes in information content and latent geometry are caused by the neural losses altering the rate-distortion problem rather than by optimizer dynamics, regularization schedules, or other training details.

What would settle it

Training identical VAE architectures with neural losses but under controlled optimizer and schedule conditions that match the pointwise baseline, then measuring whether the reduction in latent information and increase in isotropy still occur.
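One way to organize such a comparison is a run grid in which only the distortion term changes between runs. The sketch below is illustrative only: `RunConfig`, its field names, and every default value are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, replace
from itertools import product

@dataclass(frozen=True)
class RunConfig:
    # Only `distortion` and `lam` vary across loss variants; every other
    # field is held fixed. All names and defaults here are hypothetical.
    distortion: str = "pixel"
    lam: float = 0.0
    beta: float = 1.0
    optimizer: str = "adam"
    learning_rate: float = 1e-4
    batch_size: int = 64
    kl_warmup_steps: int = 0
    seed: int = 0

SHARED_FIELDS = ("beta", "optimizer", "learning_rate", "batch_size",
                 "kl_warmup_steps", "seed")

def matched_grid(base, distortions, lams, seeds):
    """Return pointwise baselines plus neural-loss runs that share every
    training setting with the baseline of the same seed."""
    runs = [replace(base, seed=s) for s in seeds]
    for d, lam, s in product(distortions, lams, seeds):
        runs.append(replace(base, distortion=d, lam=lam, seed=s))
    return runs

def differs_only_in_loss(a, b):
    """True when two runs agree on every shared training setting."""
    return all(getattr(a, k) == getattr(b, k) for k in SHARED_FIELDS)
```

Comparing each neural-loss run against the baseline with the same seed then attributes any change in rate or posterior variance profile to the loss term alone, up to run-to-run noise.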

Figures

Figures reproduced from arXiv: 2606.00635 by Emanuele Rodolà, Giorgio Strano, Luca Cerovaz, Michele Mancusi, Tommaso Mencattini.

**Figure 2.** Rate-Distortion curve traced by β. If the decoder likelihood is Gaussian with fixed variance, the distortion term reduces to scaled squared error $\lVert x - p(x|z) \rVert_2^2$. More broadly, a chosen likelihood family induces a particular reconstruction loss. In modern practice, this term is often augmented or replaced with perceptual or adversarial losses, breaking the standard pixel-squared-error assumption that Secti…

**Figure 3.** Experimental results supporting the claim in Section 3 for the …

**Figure 4.** Rate reached at convergence as a function of λ for the pythae VAE trained on CelebA with adversarial loss. To test robustness, we vary model architecture (a traditional VAE from pythae [6] and AutoencoderKL from diffusers [33]), dataset (CelebA [24] and Tiny-ImageNet [12]), and the family of neural distortions (LPIPS [36] and DINOv2 features [28] as perceptual losses, and a PatchGAN hinge loss with featu…

**Figure 5.** Rate-matched training of the pythae VAE on CelebA, using the perceptual loss LPIPS …

**Figure 6.** Equivalence classes under pixel vs. perceptual distortion. Dashed curves enclose equivalent perturbations under each loss. We repeat the experiment across two dataset-model combinations, two neural losses (perceptual and discriminative), and 4 fixed target rate values. For each combination, we sweep λ and measure the average per-sample posterior anisotropy $A_{\mathrm{post}}$ …

**Figure 7.** Toy model at matched rate: posterior standard deviation per latent dimension. The mechanism behind this counter-intuitive direction can be read off the water-filling formula (8) combined with how neural losses act on pixel space. Natural-image datasets are highly anisotropic in pixel space: a small number of principal components capture most of the dataset’s variance. Since pixel SSE penalizes every pixel …

**Figure 8.** Experimental results supporting the claim in Section 3.2 for the AutoencoderKL architecture, …

**Figure 9.** Experimental results supporting the claim in Section 3 for the AutoencoderKL architecture …

**Figure 10.** Experimental results supporting the claim in Section 3 for the AutoencoderKL architecture …

**Figure 11.** Rate-matched training of the AutoencoderKL architecture on Tiny-ImageNet using the …

**Figure 12.** Reconstructions of the same input as the weight …

**Figure 13.** Toy linear-Gaussian model: per-dimension certainty at matched rate. We instantiate the construction of Section D.2 with $D = 16$, $W = I_D$ and compare the optimal posterior variances $s_i^\star(M, \beta)$ obtained for the isotropic metric $M_{\mathrm{iso}}$ and the anisotropic metric $M_{\mathrm{aniso}}$ at a common variance-KL operating point, i.e., for $\beta_{\mathrm{iso}}$ and $\beta_{\mathrm{aniso}}$ such that $\sum_i g(s_i^\star)$ matches the same target level. For the isotropic met…

Original abstract

Modern VAEs are rarely trained with the pointwise likelihood implied by the standard $\beta$-VAE objective. In practice, pointwise reconstruction is often combined with perceptual and adversarial losses, despite a lack of understanding of how this changes the latent dynamics of the model. We show that the choice of reconstruction loss reshapes the rate-distortion problem itself, altering both the information content and the geometry of the learned latent space in ways that may be invisible from reconstructions alone. First, we prove and verify empirically that augmenting pointwise reconstruction with neural terms, such as perceptual and adversarial objectives, reduces the amount of information stored in the latent representations. Second, we show that neural reconstruction losses systematically change the geometry of the latent space: they make representations more isotropic and distribute uncertainty more evenly across latent dimensions, producing different posterior variance profiles. These findings highlight how the rate-distortion tradeoff is not a comprehensive lens to understand the behavior of VAEs, and we propose a more mechanistic approach to investigate how the choice of a distortion metric reshapes the optimization problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Neural losses reduce latent info and push isotropy in VAEs, but the link to rate-distortion reshaping versus training dynamics is not yet isolated.

The core finding here is that swapping pointwise reconstruction for perceptual or adversarial terms in VAEs lowers the mutual information in the latents and produces more isotropic posteriors with flatter variance profiles. They prove the first part on the modified objective and show the geometry shift in experiments.

What stands out is the direct comparison to standard beta-VAE rate-distortion analysis. The work makes clear that modern training practices alter the optimization landscape in ways the usual ELBO lens does not capture, and the proof plus the variance-profile plots give a concrete handle on that.

The weaker part is the attribution. The abstract and stress-test note do not describe matched runs that hold optimizer, schedule, and regularization fixed while only swapping the distortion term. Without those controls it remains possible that some of the isotropy comes from how the neural losses interact with Adam or the KL annealing rather than from the rate-distortion change itself. That gap is real but not fatal; it is the sort of thing referees can ask for.

The paper is aimed at people who train VAEs on images or other high-dimensional data and already use perceptual losses. Anyone who cares about what the latent actually encodes will get something usable from the mechanistic angle. It is coherent on its own terms and shows honest engagement with the literature, so it deserves a serious referee rather than a desk reject.

Referee Report

2 major / 1 minor

Summary. The paper claims that replacing or augmenting pointwise reconstruction in VAEs with neural losses (perceptual, adversarial) reshapes the underlying rate-distortion objective. It proves that such augmentation reduces the mutual information stored in the latent variables and empirically demonstrates that the resulting posteriors become more isotropic with flatter variance profiles across dimensions. The work concludes that the standard rate-distortion lens is insufficient and advocates a mechanistic view of how the distortion metric alters optimization.

Significance. If the theoretical reduction in latent information and the geometric effects are robustly isolated from training artifacts, the result would be significant: it supplies both a proof and concrete empirical signatures (isotropy, variance profiles) showing that widely used perceptual/adversarial objectives change VAE latents in ways invisible to reconstruction metrics alone. This would motivate new analysis tools beyond β-VAE theory and affect how reconstruction losses are chosen in practice.

major comments (2)

[Empirical verification sections] The central empirical claim—that observed isotropy and even uncertainty distribution arise from rate-distortion reshaping rather than optimizer dynamics, regularization schedules, or implementation details—lacks the necessary isolation experiments. No description of matched hyperparameter sweeps, fixed-optimizer ablations, or controlled training procedures is referenced, leaving open the possibility that the geometry changes are artifacts of those factors rather than the modified objective.
[Theoretical proof section] The proof that neural augmentation of the distortion term reduces latent mutual information is load-bearing for the first claim. Without the explicit derivation steps, assumptions on the form of the neural loss, and verification that the reduction holds independently of the variational family or optimization path, it is impossible to assess whether the result is parameter-free or relies on implicit regularizers introduced by the neural terms.

minor comments (1)

Notation for the augmented distortion term and the precise definition of 'neural reconstruction loss' should be introduced early and used consistently to avoid ambiguity between perceptual, adversarial, and other neural objectives.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the empirical isolation and theoretical presentation. We address each major comment below and will incorporate revisions to improve clarity and robustness.

Referee: [Empirical verification sections] The central empirical claim—that observed isotropy and even uncertainty distribution arise from rate-distortion reshaping rather than optimizer dynamics, regularization schedules, or implementation details—lacks the necessary isolation experiments. No description of matched hyperparameter sweeps, fixed-optimizer ablations, or controlled training procedures is referenced, leaving open the possibility that the geometry changes are artifacts of those factors rather than the modified objective.

Authors: We agree that additional controls are needed to more rigorously isolate the contribution of the modified distortion metric. In the revised version, we will add a dedicated subsection detailing matched hyperparameter sweeps (e.g., identical learning rates, batch sizes, and optimizer settings across loss variants), fixed-optimizer ablations, and explicit descriptions of the controlled training procedures used. These will demonstrate that the isotropy and variance profile changes persist under matched conditions. revision: yes
Referee: [Theoretical proof section] The proof that neural augmentation of the distortion term reduces latent mutual information is load-bearing for the first claim. Without the explicit derivation steps, assumptions on the form of the neural loss, and verification that the reduction holds independently of the variational family or optimization path, it is impossible to assess whether the result is parameter-free or relies on implicit regularizers introduced by the neural terms.

Authors: We acknowledge that the proof section would benefit from greater explicitness. The manuscript currently presents a high-level argument; the revision will include the full step-by-step derivation, state the assumptions on the neural loss (specifically that it depends on the reconstruction output but introduces no direct latent dependence beyond the decoder), and add a short verification argument showing the mutual information reduction holds under the variational bound independently of the optimization trajectory. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims derived from modified objective and independent verification

The paper derives its central results from the standard VAE rate-distortion objective after augmenting the distortion term with neural losses, proving reduced mutual information directly from the modified objective and verifying geometry changes empirically. No load-bearing steps reduce by construction to fitted parameters, self-citations, or renamed inputs; the proof and observations are self-contained against standard VAE theory without circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on standard VAE theory and information-theoretic concepts without introducing new free parameters, axioms beyond background math, or invented entities.

axioms (1)

standard math The standard beta-VAE ELBO and rate-distortion formulation applies as the baseline for comparison.
The abstract frames all claims relative to the pointwise reconstruction implied by the beta-VAE objective.

pith-pipeline@v0.9.1-grok · 5721 in / 1225 out tokens · 22337 ms · 2026-06-28T18:52:40.238542+00:00 · methodology

Reference graph

Works this paper leans on

43 extracted references · 22 canonical work pages · 13 internal anchors

[1] Alexander A. Alemi, Ben Poole, Ian Fischer, Joshua V. Dillon, Rif A. Saurous, and Kevin Murphy. Fixing a broken ELBO. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pa...
[2] Juhan Bae, Michael R. Zhang, Michael Ruan, Eric Wang, So Hasegawa, Jimmy Ba, and Roger Grosse. Multi-rate vae: Train once, get the full rate-distortion curve, 2023. URL https://arxiv.org/abs/2212.03905
[3] Yochai Blau and Tomer Michaeli. The perception-distortion tradeoff. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6228–6237, 2018. doi: 10.1109/CVPR.2018.00652
[4] Yochai Blau and Tomer Michaeli. Rethinking lossy compression: The rate-distortion-perception tradeoff, 2019. URL https://arxiv.org/abs/1901.07821
[5] Christopher P. Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-vae, 2018. URL https://arxiv.org/abs/1804.03599
[6] Clément Chadebec, Louis Vincent, and Stephanie Allassonniere. Pythae: Unifying generative autoencoders in python - a benchmarking use case. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 21575–21589. Curran Associates, Inc., 2022
[7] Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, and Bhiksha Raj. Masked autoencoders are effective tokenizers for diffusion models. In Forty-second International Conference on Machine Learning, 2025
[8] Ricky T. Q. Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders, 2019. URL https://arxiv.org/abs/1802.04942
[9] Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder, 2017. URL https://arxiv.org/abs/1611.02731
[10] Thomas Cover and Joy Thomas. Elements of Information Theory. Wiley, 2nd edition, 2009
[11] Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022
[12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009
[13] Sander Dieleman. Generative modelling in latent space, 2025. URL https://sander.ai/2025/04/15/latents.html
[14] Leo D’Amato, Gian Luca Lancia, and Giovanni Pezzulo. The geometry of efficient codes: How rate-distortion trade-offs distort the latent representations of generative models. PLOS Computational Biology, 21(5):1–30, 05 2025. doi: 10.1371/journal.pcbi.1012952. URL https://doi.org/10.1371/journal.pcbi.1012952
[15] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations. URL https://openreview.net/forum?id=Sy2fzU9gl
[16] Xianxu Hou, Linlin Shen, Ke Sun, and Guoping Qiu. Deep feature consistent variational autoencoder. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1133–1141, 2017. doi: 10.1109/WACV.2017.131
[17] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks, 2018. URL https://arxiv.org/abs/1611.07004
[18] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution, 2016. URL https://arxiv.org/abs/1603.08155
[19] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In Yoshua Bengio and Yann LeCun, editors, 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014. URL http://arxiv.org/abs/1312.6114
[20] Theodoros Kouzelis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. Eq-vae: Equivariance regularized latent space for improved generative image modeling, 2025. URL https://arxiv.org/abs/2502.09509
[21] Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1kG7GZAW
[22] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In Maria-Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and ...
[23] Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18262–18272, 2025
[24] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015
[25] Emile Mathieu, Tom Rainforth, N. Siddharth, and Yee Whye Teh. Disentangling disentanglement in variational autoencoders, 2019. URL https://arxiv.org/abs/1812.02833
[26] Fabian Mentzer, George Toderici, Michael Tschannen, and Eirikur Agustsson. High-fidelity generative image compression, 2020. URL https://arxiv.org/abs/2006.09965
[27] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks, 2018. URL https://arxiv.org/abs/1802.05957
[28] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023
[29] Danilo Jimenez Rezende and Fabio Viola. Taming vaes. CoRR, abs/1810.00597, 2018. URL http://arxiv.org/abs/1810.00597
[30] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022. URL https://arxiv.org/abs/2112.10752
[31] C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948. doi: 10.1002/j.1538-7305.1948.tb01338.x
[32] Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, and Aliaksandr Siarohin. Improving the diffusability of autoencoders, 2025. URL https://arxiv.org/abs/2502.14831
[33] Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, Dhruv Nair, Sayak Paul, William Berman, Yiyi Xu, Steven Liu, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022
[34] Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15703–15712, 2025
[35] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024
[36] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric, 2018. URL https://arxiv.org/abs/1801.03924

A Additional plots

[Plot: Rate (KL Divergence) versus Perceptual Loss Weight, x-axis from 0.0 to 1.0, y-axis from 0 to 140; color-coded series at values 0, 0.25, 0.5, 0.75, 1 (symbol lost in extraction); remainder of the legend truncated ...]
Multiply by $w_\ell$ and sum over $\ell \in L$. This pointwise domination implies that any reconstruction rule meeting a pixel-MSE budget also meets an appropriately rescaled feature-matching budget. The corresponding RD ordering is an immediate application of Theorem 11. Corollary 28 (RD ordering for feature matching vs. pixel MSE). Under Assumption 26, for all $\Delta \ge 0$, RdF...

For $Z \sim \mathcal{N}(\mu, \mathrm{diag}(s))$,

$$\mathbb{E}[d_M(x, Z)] = (x - W\mu)^\top G\,(x - W\mu) + \sum_{i=1}^{D} c_i s_i = \mathrm{const}(x, \mu) + \sum_{i=1}^{D} c_i s_i. \tag{44}$$

The objective can be written as

$$L_{M,\beta}(s) = \mathrm{const}(x, \mu) + \sum_{i=1}^{D} \ell_i(s_i), \qquad \ell_i(s) := c_i s + \beta g(s). \tag{45}$$

Each $\ell_i$ is strictly convex on $(0, \infty)$, hence $L_{M,\beta}$ has a unique minimizer $s^*(M, \beta) \in (0, \infty)^D$. The minimizer is given in closed form by

$$s_i^*(M, \beta) = \frac{1}{1 + 2c_i/\beta} = \frac{\beta}{\beta + 2c_i}, \qquad i = 1, \dots, D. \tag{46}$$

Proof. Write $Z = \mu + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \mathrm{diag}(s))$ and expand $d_M(x, Z) = (x - W\mu - W\varepsilon)^\top G\,(x - W\mu - W\varepsilon)$. The cross term vanishes in expectation, and $\mathbb{E}[\varepsilon^\top B \varepsilon] = \mathrm{tr}(B\,\mathrm{diag}(s)) = \sum_i c_i s_i$, which yields (44) and the separable form (45). Since $g''(s) = 1/(2s^2) > 0$ for $s > 0$, each $\ell_i$ is ...

If all $c_i$ are equal, then all $s_i^*(M, \beta)$ coincide and $A_{\mathrm{post}}(s^*(M, \beta)) = 0$ (isotropic posterior).

If the coefficients $c_i$ are not all equal, then the entries of $s^*(M, \beta)$ are not all equal and $A_{\mathrm{post}}(s^*(M, \beta)) > 0$ (anisotropic posterior). We now show that, for a fixed distortion $M$, every positive target value of the variance part of the KL can be obtained by a suitable choice of $\beta$. Theorem 41 (Surjectivity of the variance-KL map). Fix $W, M$ and assume at least on...
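The closed form in (46) and the matching of a variance-KL target by choice of β can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's code: it takes g(s) = ½(s − 1 − log s), which is consistent with the stated g''(s) = 1/(2s²) but is not given explicitly in the extracted text, and the coefficient vectors and the bisection helper `beta_for_target` are hypothetical.

```python
import numpy as np

def g(s):
    """Per-dimension variance part of the KL to N(0, 1).
    Assumed form; it satisfies g''(s) = 1 / (2 s^2)."""
    return 0.5 * (s - 1.0 - np.log(s))

def optimal_variances(c, beta):
    """Closed-form minimizer s*_i = beta / (beta + 2 c_i), as in Eq. (46)."""
    c = np.asarray(c, dtype=float)
    return beta / (beta + 2.0 * c)

def variance_kl(c, beta):
    """Sum over dimensions of g evaluated at the optimal variances."""
    return g(optimal_variances(c, beta)).sum()

def beta_for_target(c, target, lo=1e-8, hi=1e8, iters=200):
    """Bisect in log-space for the beta whose variance-KL equals `target`.
    For positive c_i the variance-KL decreases as beta grows, because each
    s*_i rises toward 1, where g vanishes."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if variance_kl(c, mid) > target:
            lo = mid  # too much KL: a larger beta is needed
        else:
            hi = mid
    return np.sqrt(lo * hi)

# Hypothetical coefficients for D = 16: equal (isotropic) vs. spread out.
c_iso = np.ones(16)
c_aniso = np.logspace(1, -1, 16)
target = 8.0  # common variance-KL level in nats (illustrative value)

s_iso = optimal_variances(c_iso, beta_for_target(c_iso, target))
s_aniso = optimal_variances(c_aniso, beta_for_target(c_aniso, target))
```

With equal coefficients every entry of `s_iso` coincides, while `s_aniso` spreads across dimensions even though both settings sit at the same variance-KL level, which mirrors the isotropic and anisotropic cases stated above.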

[1] [1]

Alemi, Ben Poole, Ian Fischer, Joshua V

Alexander A. Alemi, Ben Poole, Ian Fischer, Joshua V . Dillon, Rif A. Saurous, and Kevin Murphy. Fixing a broken ELBO. In Jennifer G. Dy and Andreas Krause, editors,Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 ofProceedings of Machine Learning Research, pa...

2018

[2] [2]

Zhang, Michael Ruan, Eric Wang, So Hasegawa, Jimmy Ba, and Roger Grosse

Juhan Bae, Michael R. Zhang, Michael Ruan, Eric Wang, So Hasegawa, Jimmy Ba, and Roger Grosse. Multi-rate vae: Train once, get the full rate-distortion curve, 2023. URL https://arxiv.org/abs/2212.03905

work page arXiv 2023

[3] [3]

The perception-distortion tradeoff

Yochai Blau and Tomer Michaeli. The perception-distortion tradeoff. In2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6228–6237, 2018. doi: 10. 1109/CVPR.2018.00652

work page arXiv 2018

[4] [4]

Rethinking lossy compression: The rate-distortion-perception tradeoff, 2019

Yochai Blau and Tomer Michaeli. Rethinking lossy compression: The rate-distortion-perception tradeoff, 2019. URLhttps://arxiv.org/abs/1901.07821

work page arXiv 2019

[5] [5]

Understanding disentangling in $\beta$-VAE

Christopher P. Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-vae, 2018. URL https://arxiv.org/abs/1804.03599

work page internal anchor Pith review Pith/arXiv arXiv 2018

[6] [6]

Pythae: Unifying generative autoencoders in python - a benchmarking use case

Clément Chadebec, Louis Vincent, and Stephanie Allassonniere. Pythae: Unifying generative autoencoders in python - a benchmarking use case. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems, volume 35, pages 21575–21589. Curran Associates, Inc., 2022

2022

[7] [7]

Masked autoencoders are effective tokenizers for diffusion models

Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, and Bhiksha Raj. Masked autoencoders are effective tokenizers for diffusion models. InForty-second International Conference on Machine Learning, 2025

2025

[8] [8]

Ricky T. Q. Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders, 2019. URL https://arxiv.org/abs/1802. 04942

2019

[9]

Variational Lossy Autoencoder

Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder, 2017. URL https://arxiv.org/abs/1611.02731

[10]

Elements of Information Theory

Thomas Cover and Joy Thomas. Elements of Information Theory. Wiley, 2nd edition, 2009

[11]

High Fidelity Neural Audio Compression

Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022

[12]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009

[13]

Generative modelling in latent space

Sander Dieleman. Generative modelling in latent space, 2025. URL https://sander.ai/2025/04/15/latents.html

[14]

The geometry of efficient codes: How rate-distortion trade-offs distort the latent representations of generative models

Leo D’Amato, Gian Luca Lancia, and Giovanni Pezzulo. The geometry of efficient codes: How rate-distortion trade-offs distort the latent representations of generative models. PLOS Computational Biology, 21(5):1–30, 05 2025. doi: 10.1371/journal.pcbi.1012952. URL https://doi.org/10.1371/journal.pcbi.1012952

[15]

beta-VAE: Learning basic visual concepts with a constrained variational framework

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Sy2fzU9gl

[17]

Deep feature consistent variational autoencoder

Xianxu Hou, Linlin Shen, Ke Sun, and Guoping Qiu. Deep feature consistent variational autoencoder. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1133–1141, 2017. doi: 10.1109/WACV.2017.131

[18]

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks, 2018. URL https://arxiv.org/abs/1611.07004

[19]

Perceptual Losses for Real-Time Style Transfer and Super-Resolution

Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution, 2016. URL https://arxiv.org/abs/1603.08155

[20]

Auto-Encoding Variational Bayes

Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In Yoshua Bengio and Yann LeCun, editors, 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014. URL http://arxiv.org/abs/1312.6114

[21]

Eq-vae: Equivariance regularized latent space for improved generative image modeling

Theodoros Kouzelis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. Eq-vae: Equivariance regularized latent space for improved generative image modeling, 2025. URL https://arxiv.org/abs/2502.09509

[22]

Variational inference of disentangled latent concepts from unlabeled observations

Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1kG7GZAW

[23]

Autoencoding beyond pixels using a learned similarity metric

Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In Maria-Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and ...

[24]

Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers

Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18262–18272, 2025

[25]

Deep learning face attributes in the wild

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015

[26]

Disentangling Disentanglement in Variational Autoencoders

Emile Mathieu, Tom Rainforth, N. Siddharth, and Yee Whye Teh. Disentangling disentanglement in variational autoencoders, 2019. URL https://arxiv.org/abs/1812.02833

[27]

High-fidelity generative image compression

Fabian Mentzer, George Toderici, Michael Tschannen, and Eirikur Agustsson. High-fidelity generative image compression, 2020. URL https://arxiv.org/abs/2006.09965

[28]

Spectral Normalization for Generative Adversarial Networks

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks, 2018. URL https://arxiv.org/abs/1802.05957

[29]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

[30]

Taming VAEs

Danilo Jimenez Rezende and Fabio Viola. Taming VAEs. CoRR, abs/1810.00597, 2018. URL http://arxiv.org/abs/1810.00597

[31]

High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022. URL https://arxiv.org/abs/2112.10752

[32]

C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948. doi: 10.1002/j.1538-7305.1948.tb01338.x

[33]

Improving the diffusability of autoencoders

Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, and Aliaksandr Siarohin. Improving the diffusability of autoencoders, 2025. URL https://arxiv.org/abs/2502.14831

[34]

Diffusers: State-of-the-art diffusion models

Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, Dhruv Nair, Sayak Paul, William Berman, Yiyi Xu, Steven Liu, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022

[35]

Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models

Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15703–15712, 2025

[36]

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024

[37]

The Unreasonable Effectiveness of Deep Features as a Perceptual Metric

Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric, 2018. URL https://arxiv.org/abs/1801.03924

A Additional plots

[Figure: Rate (KL Divergence), axis 0 to 140, against Perceptual Loss Weight, axis 0.0 to 1.0; series for the values 0, 0.25, 0.5, 0.75, 1 distinguished by color ...]

[38]

variance-KL budget

Multiply by $w_\ell$ and sum over $\ell \in L$. This pointwise domination implies that any reconstruction rule meeting a pixel-MSE budget also meets an appropriately rescaled feature-matching budget. The corresponding RD ordering is an immediate application of Theorem 11.

Corollary 28 (RD ordering for feature matching vs. pixel MSE). Under Assumption 26, for all $\Delta \geq 0$, RdF...

[39]

For $Z \sim \mathcal{N}(\mu, \operatorname{diag}(s))$,

$$\mathbb{E}[d_M(x, Z)] = (x - W\mu)^\top G\,(x - W\mu) + \sum_{i=1}^{D} c_i s_i = \mathrm{const}(x, \mu) + \sum_{i=1}^{D} c_i s_i. \tag{44}$$

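[40]

The objective can be written as

$$\mathcal{L}_{M,\beta}(s) = \mathrm{const}(x, \mu) + \sum_{i=1}^{D} \ell_i(s_i), \qquad \ell_i(s) := c_i s + \beta g(s). \tag{45}$$

Each $\ell_i$ is strictly convex on $(0, \infty)$, hence $\mathcal{L}_{M,\beta}$ has a unique minimizer $s^*(M, \beta) \in (0, \infty)^D$.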

[41]

The minimizer is given in closed form by

$$s_i^*(M, \beta) = \frac{1}{1 + 2c_i/\beta} = \frac{\beta}{\beta + 2c_i}, \qquad i = 1, \dots, D. \tag{46}$$

Proof. Write $Z = \mu + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \operatorname{diag}(s))$ and expand $d_M(x, Z) = (x - W\mu - W\varepsilon)^\top G\,(x - W\mu - W\varepsilon)$. The cross term vanishes in expectation, and $\mathbb{E}[\varepsilon^\top B \varepsilon] = \operatorname{tr}(B \operatorname{diag}(s)) = \sum_i c_i s_i$, which yields (44) and the separable form (45). Since $g''(s) = 1/(2s^2) > 0$ for $s > 0$, each $\ell_i$ is ...

[42]

If all $c_i$ are equal, then all $s_i^*(M, \beta)$ coincide and $A_{\mathrm{post}}(s^*(M, \beta)) = 0$ (isotropic posterior).
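The closed form (46) and the isotropy statement can be checked numerically. The following is a minimal sketch, not the authors' code. It assumes $g(s) = \tfrac{1}{2}(s - 1 - \log s)$, the usual per-dimension variance term of the KL to a standard normal, which is consistent with $g''(s) = 1/(2s^2)$ quoted in the proof but is not stated in this excerpt. The names `g`, `ell`, `s_star` and `grid_argmin`, and the values of `beta` and the coefficients, are hypothetical.

```python
import math

def g(s):
    """ASSUMED variance part of the KL per dimension: 0.5 * (s - 1 - log s)."""
    return 0.5 * (s - 1.0 - math.log(s))

def ell(s, ci, beta):
    """Per-dimension objective from (45): c_i * s + beta * g(s)."""
    return ci * s + beta * g(s)

def s_star(c, beta):
    """Closed-form minimizer from (46): beta / (beta + 2 * c_i)."""
    return [beta / (beta + 2.0 * ci) for ci in c]

def grid_argmin(ci, beta, n=20000):
    """Brute-force minimizer of ell on a uniform grid over (0, 2]."""
    grid = [2.0 * (k + 1) / n for k in range(n)]
    return min(grid, key=lambda s: ell(s, ci, beta))

beta = 0.5                      # illustrative value
c_equal = [1.0, 1.0, 1.0]       # equal coefficients
c_mixed = [0.1, 1.0, 10.0]      # unequal coefficients

s_equal = s_star(c_equal, beta)  # all entries coincide
s_mixed = s_star(c_mixed, beta)  # entries differ; larger c_i gives smaller s_i

for ci, si in zip(c_mixed, s_mixed):
    # The grid search should land within one grid step of the closed form.
    print(ci, si, grid_argmin(ci, beta))
```

With equal coefficients every posterior variance is the same, while unequal coefficients shrink the variance most along the dimensions with the largest $c_i$.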

[43]

Target $\sum_i g(s_i^*)$

If the coefficients $c_i$ are not all equal, then the entries of $s^*(M, \beta)$ are not all equal and $A_{\mathrm{post}}(s^*(M, \beta)) > 0$ (anisotropic posterior). We now show that, for a fixed distortion $M$, every positive target value of the variance part of the KL can be obtained by a suitable choice of $\beta$.

Theorem 41 (Surjectivity of the variance-KL map). Fix $W$, $M$ and assume at least on...
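The claim that any positive variance-KL target is reachable by tuning $\beta$ can be illustrated with a small root search. This is a sketch under stated assumptions, not the construction used in the proof: it assumes $g(s) = \tfrac{1}{2}(s - 1 - \log s)$, nonnegative coefficients with at least one $c_i > 0$, and the closed form $s_i^* = \beta/(\beta + 2c_i)$ from (46). Under those assumptions $\sum_i g(s_i^*)$ decreases in $\beta$, so bisection on a log scale is enough. The function names and the bracket `[1e-9, 1e9]` are hypothetical choices.

```python
import math

def g(s):
    """ASSUMED variance part of the KL per dimension: 0.5 * (s - 1 - log s)."""
    return 0.5 * (s - 1.0 - math.log(s))

def variance_kl(c, beta):
    """Sum of g(s_i*) with s_i* = beta / (beta + 2 * c_i), as in (46)."""
    return sum(g(beta / (beta + 2.0 * ci)) for ci in c)

def beta_for_target(c, target, lo=1e-9, hi=1e9, iters=200):
    """Find beta with variance_kl(c, beta) close to target by log-scale bisection."""
    if target <= 0 or min(c) < 0 or max(c) <= 0:
        raise ValueError("need target > 0, all c_i >= 0, and some c_i > 0")
    if not (variance_kl(c, hi) <= target <= variance_kl(c, lo)):
        raise ValueError("target outside the searched bracket")
    for _ in range(iters):
        mid = math.sqrt(lo * hi)          # geometric midpoint
        if variance_kl(c, mid) > target:  # KL too large: increase beta
            lo = mid
        else:                             # KL too small: decrease beta
            hi = mid
    return math.sqrt(lo * hi)

c = [0.1, 1.0, 10.0]                      # illustrative coefficients
for target in (0.1, 1.0, 5.0):
    b = beta_for_target(c, target)
    print(target, b, variance_kl(c, b))
```

Larger targets are met by smaller values of $\beta$, since a weaker KL weight lets the optimal variances move further from 1.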