Entropic Auto-Encoding via Implicit Free-Energy Minimization

Greg van Anders; Hazhir Aliahmadi; Irina Babayan

arxiv: 2605.16164 · v1 · pith:L7GWIRLZnew · submitted 2026-05-15 · 💻 cs.LG

Pith reviewed 2026-05-20 20:57 UTC · model grok-4.3

keywords: entropic autoencoders · posterior collapse · free energy minimization · variational autoencoders · latent distributions · multimodal latents · generative modeling

The pith

Entropic autoencoders avoid posterior collapse by letting an ensemble of encoders minimize free energy to create an implicit prior from entropy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Entropic Autoencoders to fix posterior collapse in variational autoencoders, a problem where explicit priors cause the model to ignore latent variables and produce uninformative representations. EAEs keep reconstruction loss as the only explicit objective and rely on entropy to shape the prior implicitly through a free-energy-minimizing ensemble of encoders. This ensemble biases learning toward high-volume regions of near-optimal solutions while decoder updates steer the process toward informative latent representations. Experiments show the approach yields non-Gaussian multimodal distributions that support diverse generations and preserve data structures ranging from physical dynamics to image hierarchies. A sympathetic reader would care because it offers a route to more reliable generative modeling without manual prior design.

Core claim

Entropic Autoencoders mitigate posterior collapse by learning non-Gaussian, multimodal latent distributions that yield diverse, data-consistent generations and preserve different forms of underlying structure in the data. Reconstruction loss serves as the sole explicit objective while entropy generates the latent prior implicitly through a free energy-minimizing ensemble of encoders; this ensemble biases learning toward high-volume regions of near-optimal solutions and decoder updates direct trajectories toward informative representations.

What carries the argument

The free-energy-minimizing ensemble of encoders, which implicitly generates the prior via entropy while decoder updates guide search trajectories to informative latent representations.
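As a rough illustration of why a free-energy objective can favor high-volume regions, the toy calculation below compares two basins of encoder configurations. The basin sizes, the loss values, and the helper name `free_energy` are invented for this sketch and are not taken from the paper; the only ingredient borrowed from the source is the general form of a free energy as mean reconstruction loss minus entropy divided by β.

```python
import math

def free_energy(losses, beta):
    """Free energy F = -(1/beta) * log(sum_i exp(-beta * L_i)) of one basin.

    `losses` holds the reconstruction loss of each encoder configuration
    in the basin (hypothetical values, not from the paper).
    """
    lowest = min(losses)
    # Shift by the lowest loss before exponentiating for numerical stability.
    shifted_sum = sum(math.exp(-beta * (loss - lowest)) for loss in losses)
    return lowest - math.log(shifted_sum) / beta

# Two invented basins: one sharp and slightly better, one broad and slightly worse.
narrow_basin = [0.10] * 2    # 2 configurations at loss 0.10
wide_basin = [0.15] * 200    # 200 configurations at loss 0.15

for beta in (1.0, 10.0, 1000.0):
    f_narrow = free_energy(narrow_basin, beta)
    f_wide = free_energy(wide_basin, beta)
    preferred = "wide" if f_wide < f_narrow else "narrow"
    print(f"beta={beta:g}: F_narrow={f_narrow:.4f}, F_wide={f_wide:.4f}, preferred={preferred}")
```

When every configuration in a basin has the same loss L, this reduces to F = L − (1/β) ln Ω, with Ω the number of configurations. In this toy setting the broad basin has the lower free energy at moderate β, and the sharp basin wins only at very large β (low temperature).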

If this is right

EAEs produce diverse generations that remain consistent with the training data.
The latent representations capture different forms of underlying structure, including low-dimensional dynamics in reaction-diffusion processes.
Implicit categorical distinctions emerge in the latent space for datasets such as MNIST.
A hierarchical understanding of features appears in more complex data such as facial images from CelebA.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The ensemble approach might extend to other generative models that suffer from similar collapse problems when explicit priors are used.
Reducing reliance on prior tuning could simplify hyperparameter search in latent variable models more broadly.
Applying the same implicit entropy mechanism to sequential or structured data could test whether structure preservation holds beyond the image and physics examples shown.

Load-bearing premise

A free-energy-minimizing ensemble of encoders will automatically bias learning toward high-volume regions of near-optimal solutions while decoder updates direct trajectories to informative latent representations, without any explicit prior term.

What would settle it

Training an EAE on a dataset where standard VAEs collapse and observing that the learned latent distributions remain Gaussian and unimodal with generations that ignore latent variables would falsify the claim that the implicit mechanism mitigates posterior collapse.
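A first-pass check of the "Gaussian and unimodal" half of this test could use a moment-based heuristic on each latent dimension. The sketch below computes a bimodality coefficient from population moments; the function name, the synthetic samples, and the 5/9 reference value (the coefficient of a uniform distribution) are assumptions of this illustration, not a procedure from the paper.

```python
import random

def bimodality_coefficient(samples):
    """Moment-based heuristic: (skewness^2 + 1) / kurtosis, using population moments.

    About 1/3 for a Gaussian and 5/9 for a uniform distribution; values
    above 5/9 are often read as a hint of bimodality (heuristic, not a test).
    """
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((x - mean) ** 2 for x in samples) / n
    m3 = sum((x - mean) ** 3 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # non-excess kurtosis (3 for a Gaussian)
    return (skewness ** 2 + 1.0) / kurtosis

rng = random.Random(0)
# Synthetic stand-ins for one latent dimension (not data from the paper).
unimodal = [rng.gauss(0.0, 1.0) for _ in range(20000)]
bimodal = [
    rng.gauss(-3.0, 1.0) if rng.random() < 0.5 else rng.gauss(3.0, 1.0)
    for _ in range(20000)
]

print(f"unimodal: {bimodality_coefficient(unimodal):.3f}")
print(f"bimodal:  {bimodality_coefficient(bimodal):.3f}")
```

A coefficient near 1/3 on every latent dimension would be consistent with the falsifying outcome described above. It would not settle the question alone, since one would still need to check whether generations actually depend on the latent variables.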

Figures

Figures reproduced from arXiv: 2605.16164 by Greg van Anders, Hazhir Aliahmadi, Irina Babayan.

**Figure 2.** Analysis of interpolation capability (panel a) and learned latent distributions (panel b) of …

**Figure 3.** On the CelebA dataset [23], an EAE can exhibit two different generative morphologies – an “all-human” face (panel b), and a variety of data-consistent image generations (panel c) (see Appendix A.5.3). The model corresponding to the “all-human” face (pink, red and orange training curves and upper x-axis, panel a) was trained with a higher temperature parameter, resulting in a more generic but still data-…

**Figure 4.** Corresponding to coefficient distributions learned by the EAE, and tabulated in …

**Figure 5.** In addition to Fig. 2b, we include six more randomly chosen non-collapsed dimensions …

Original abstract

Despite their ubiquity, variational autoencoders (VAEs) inherently suffer from posterior collapse, a failure mode in which latent variables are effectively ignored. This failure arises because explicit prior imposition drives optimization toward loss landscape regions corresponding to uninformative latent representations. Here, we introduce Entropic Autoencoders (EAEs), a framework in which reconstruction loss is the only explicit objective, and entropy generates the latent variables' prior implicitly through a free energy-minimizing ensemble of encoders. This ensemble biases learning toward high-volume regions of near-optimal solutions, while decoder updates direct the search trajectories toward informative latent representations. We demonstrate that EAEs mitigate posterior collapse by learning non-Gaussian, multimodal latent distributions that yield diverse, data-consistent generations and preserve different forms of underlying structure in the data. As a proof-of-concept, we show that an EAE captures a superposition of the known low-dimensional dynamics of a reaction-diffusion process. Then, we show that an EAE identifies implicit categorical distinctions in MNIST latent representations, and displays a hierarchical understanding of facial structure on the CelebA dataset, from an "all-human" face to individual-dependent features.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EAEs give a clean implicit-prior route around posterior collapse, but the ensemble construction and free-energy update still need explicit rules before the claim is secure.

The core move here is to drop the explicit prior entirely and let entropy do the work through a free-energy-minimizing ensemble of encoders, with reconstruction as the only stated loss. That is genuinely different from the usual KL-regularized VAE family and from most recent fixes that still add some auxiliary term. The experiments back the idea up at the level of qualitative behavior: the reaction-diffusion example captures a superposition of known low-dimensional dynamics, MNIST latents separate implicit categories, and CelebA shows a clear hierarchy from generic faces down to individual features. Those results suggest the method can preserve structure without forcing a Gaussian or unimodal posterior, which is the practical payoff they are after.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Entropic Autoencoders (EAEs) as an alternative to VAEs for mitigating posterior collapse. Reconstruction loss is the sole explicit objective; an implicit prior on the latent variables is generated by minimizing free energy over an ensemble of encoders. This construction is claimed to bias optimization toward high-volume regions of near-optimal solutions while decoder gradients steer toward informative representations, yielding non-Gaussian multimodal posteriors. Proof-of-concept results are presented on a reaction-diffusion system (capturing superposed low-dimensional dynamics), MNIST (implicit categorical distinctions), and CelebA (hierarchical facial structure from global to individual features).

Significance. If the free-energy ensemble mechanism can be shown to produce the claimed multimodal posteriors without explicit regularization, the approach would offer a principled route to prior-free autoencoding that preserves diverse data structure. The reaction-diffusion and hierarchical-feature demonstrations suggest potential utility in scientific modeling and representation learning, provided quantitative evidence of collapse mitigation is supplied.

major comments (2)

[Abstract] Abstract: The central claim that a free-energy-minimizing ensemble of encoders automatically biases learning toward high-volume near-optimal regions (while decoder updates alone suffice to avoid uninformative latents) is stated without any equations defining the free energy, the ensemble construction (multiple independent networks, shared parameters with stochasticity, or alternating optimization), or the approximation used during training. This mechanism is load-bearing for the assertion that the implicit prior mitigates posterior collapse.
[Experiments / Results] Demonstrations: The paper asserts that EAEs learn non-Gaussian multimodal distributions that preserve underlying structure, yet no quantitative collapse metrics (e.g., KL divergence to prior, number of active latent units, or mutual information between latents and data) or direct comparisons against standard VAEs are reported. Without these, the claim that the method yields diverse, data-consistent generations cannot be evaluated.
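Two of the collapse metrics named in the second comment are simple to compute once encoder outputs are available. The sketch below assumes diagonal Gaussian encoder outputs given as plain Python lists; the function names, the example numbers, and the 1e-2 variance threshold for counting a unit as active are assumptions of this illustration. The KL helper is only meaningful for a VAE baseline with a standard normal prior, since the EAE described here has no explicit prior.

```python
import math
from statistics import pvariance

def active_units(posterior_means, threshold=1e-2):
    """Indices of latent dimensions whose encoded mean varies across the data.

    posterior_means: one row per data point, one column per latent dimension.
    threshold: variance cutoff; 1e-2 is a common convention, assumed here.
    """
    columns = zip(*posterior_means)  # transpose rows into per-dimension columns
    return [i for i, column in enumerate(columns) if pvariance(column) > threshold]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

# Hypothetical encoder outputs for four inputs and two latent dimensions.
means = [[-1.0, 0.001], [1.0, -0.001], [0.5, 0.0], [-0.5, 0.0]]
print(active_units(means))  # dimension 1 barely moves across inputs
print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))
```

A dimension that is inactive under this count and carries near-zero KL in the baseline is the usual signature of posterior collapse, which is why reporting both for the VAE, and the active-unit count for the EAE, would make the comparison concrete.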

minor comments (2)

[Methods] Notation for the ensemble and free-energy terms should be introduced with explicit definitions and update rules in the methods section to allow reproducibility.
[Figures] Figure captions for the CelebA and MNIST visualizations should include quantitative measures of diversity or structure preservation to support the qualitative claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify how to strengthen the presentation of our work. We respond to each major comment below.

Referee: [Abstract] Abstract: The central claim that a free-energy-minimizing ensemble of encoders automatically biases learning toward high-volume near-optimal regions (while decoder updates alone suffice to avoid uninformative latents) is stated without any equations defining the free energy, the ensemble construction (multiple independent networks, shared parameters with stochasticity, or alternating optimization), or the approximation used during training. This mechanism is load-bearing for the assertion that the implicit prior mitigates posterior collapse.

Authors: The abstract is written as a high-level, non-technical summary to remain accessible within length limits. The explicit definitions of the free-energy functional, the ensemble construction (multiple independent encoders whose parameters are optimized to minimize free energy), and the implicit approximation used at training time are provided in full in Section 2 (Model) and Section 3 (Training Procedure) of the manuscript. We agree that a brief pointer to these definitions would improve the abstract and will add one sentence referencing the free-energy ensemble and its implicit prior in the revised abstract.
revision: partial

Referee: [Experiments / Results] Demonstrations: The paper asserts that EAEs learn non-Gaussian multimodal distributions that preserve underlying structure, yet no quantitative collapse metrics (e.g., KL divergence to prior, number of active latent units, or mutual information between latents and data) or direct comparisons against standard VAEs are reported. Without these, the claim that the method yields diverse, data-consistent generations cannot be evaluated.

Authors: We accept that quantitative metrics would allow a more direct evaluation of collapse mitigation. The present manuscript emphasizes qualitative demonstrations of structure preservation on the reaction-diffusion, MNIST, and CelebA tasks as a proof-of-concept. In the revision we will add (i) the number of active latent dimensions, (ii) estimates of mutual information between latents and data, and (iii) side-by-side comparisons against a standard VAE baseline using the same architecture and data splits.
revision: yes

Circularity Check

0 steps flagged

No circularity: EAE framework defines implicit prior via design choice without reducing claims to fitted inputs or self-referential equations

The provided abstract and description introduce EAEs as a new framework with reconstruction loss as the sole explicit objective and an implicit prior arising from free-energy minimization over an encoder ensemble. No equations, derivations, or steps are shown that define a quantity in terms of itself, rename a fitted parameter as a prediction, or rely on a load-bearing self-citation whose content reduces to the target result. The claims about non-Gaussian multimodal posteriors and mitigation of posterior collapse are presented as outcomes of the proposed architecture rather than tautological re-expressions of the reconstruction objective. The derivation chain therefore remains self-contained against external benchmarks and does not meet the criteria for any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that reconstruction loss plus implicit entropy is sufficient to produce informative latents; no free parameters or invented entities are identifiable from the abstract alone.

axioms (1)

domain assumption: Reconstruction loss alone, combined with an entropy-driven free-energy ensemble, is sufficient to avoid posterior collapse and produce multimodal latents.
Stated directly as the core of the EAE framework in the abstract.

pith-pipeline@v0.9.0 · 5732 in / 1286 out tokens · 45047 ms · 2026-05-20T20:57:35.822395+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

entropy generates the latent variables' prior implicitly through a free energy-minimizing ensemble of encoders... Ω(θ, Y) acts as an implicit prior over collective variables: values of θ supported by many encoder configurations receive greater weight

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

$F_\beta(\theta, \vartheta_k, Y) \approx \langle L_{\mathrm{rec}}(\phi, \vartheta_k) \rangle_\theta - \frac{1}{\beta} S(\theta, Y)$... decoder updates reshape the collective-variable free energy only through the conditional reconstruction term
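One way to read the two quoted passages together is sketched below. It assumes, which the excerpts do not state, that the entropy is the logarithm of the configuration count, S(θ, Y) = ln Ω(θ, Y).

```latex
\begin{aligned}
F_\beta(\theta, \vartheta_k, Y)
  &\approx \langle L_{\mathrm{rec}}(\phi, \vartheta_k) \rangle_\theta
     - \frac{1}{\beta}\, S(\theta, Y), \\
e^{-\beta F_\beta(\theta, \vartheta_k, Y)}
  &\approx \Omega(\theta, Y)\,
     e^{-\beta \langle L_{\mathrm{rec}}(\phi, \vartheta_k) \rangle_\theta}
  \qquad \text{if } S(\theta, Y) = \ln \Omega(\theta, Y).
\end{aligned}
```

Under that assumption Ω(θ, Y) multiplies a likelihood-like factor, which is the sense in which it can act as a prior weight over θ, while the decoder parameters ϑ_k enter only through the reconstruction term.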

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

[1] Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif A. Saurous, and Kevin Murphy. Fixing a Broken ELBO. In Proceedings of the 35th International Conference on Machine Learning, pages 159–168. PMLR, July 2018.

[2] Irina Babayan, Hazhir Aliahmadi, and Greg van Anders. Simmering: Sufficient training of neural networks in Python. Zenodo, November 2025.

[3] Irina Babayan, Hazhir Aliahmadi, and Greg van Anders. Sufficient is better than optimal for training neural networks. Nature Communications, 17(1):271, December 2025. ISSN 2041-1723. doi: 10.1038/s41467-025-66983-3.

[4] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. Generating Sentences from a Continuous Space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 10–21, Berlin, Germany. Association for Computational Linguistics. doi: 10.18653/v1/K16-1002.

[5] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance Weighted Autoencoders, 2015.

[6] Kathleen Champion, Bethany Lusch, J. Nathan Kutz, and Steven L. Brunton. Data-driven discovery of coordinates and governing equations. Proceedings of the National Academy of Sciences, 116(45):22445–22451, November 2019. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.1906995116.

[7] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys*. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124018, December 2019. ISSN 1742-5468. doi: 10.1088/1742-5468/ab39d9.

[8] Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational Lossy Autoencoder, 2016.

[9] Jen-Tzung Chien and Chih-Jung Tsai. Amortized Mixture Prior for Variational Sequence Generation. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–6, July 2020. doi: 10.1109/IJCNN48605.2020.9206667.

[10] Tim R. Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M. Tomczak. Hyperspherical Variational Auto-Encoders, September 2022.

[11] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters, B195:216–222, 1987. doi: 10.1016/0370-2693(87)91197-X.

[12] Hao Fu, Chunyuan Li, Xiaodong Liu, Jianfeng Gao, Asli Celikyilmaz, and Lawrence Carin. Cyclical Annealing Schedule: A Simple Approach to Mitigating KL Vanishing, June 2019.

[13] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. Beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In International Conference on Learning Representations, February 2017.

[14] G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786):504–507, July 2006. doi: 10.1126/science.1127647.

[15] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6):417–441, September 1933. ISSN 1939-2176, 0022-0663. doi: 10.1037/h0071325.

[16] E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620–630, May 1957. doi: 10.1103/PhysRev.106.620.

[17] Edwin Jaynes. Prior Probabilities. IEEE Transactions on Systems Science and Cybernetics, 4(3):227–241, 1968. ISSN 0536-1567. doi: 10.1109/TSSC.1968.300117.

[18] Yiding Jiang*, Behnam Neyshabur*, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic Generalization Measures and Where to Find Them. In International Conference on Learning Representations, September 2019.

[19] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes, December 2022.

[20] Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved Variational Inference with Inverse Autoregressive Flow. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.

[21] Alexej Klushyn, Nutan Chen, Richard Kurle, Botond Cseke, and Patrick van der Smagt. Learning Hierarchical Priors in VAEs. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

[22] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. ATT Labs [Online], 2, 2010.

[23] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep Learning Face Attributes in the Wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.

[24] James Lucas, George Tucker, Roger B Grosse, and Mohammad Norouzi. Don't Blame the ELBO! A Linear VAE Perspective on Posterior Collapse. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

[25] Xuezhe Ma, Chunting Zhou, and Eduard Hovy. MAE: Mutual Posterior-Divergence Regularization for Variational AutoEncoders. In International Conference on Learning Representations, September 2018.

[26] David J. C. MacKay. Bayesian Interpolation. Neural Computation, 4(3):415–447, May 1992. ISSN 0899-7667. doi: 10.1162/neco.1992.4.3.415.

[27] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murra...

[28] James Dickson Murray. Mathematical Biology. Number 19 in Biomathematics. Springer, Berlin New York Paris, 1990. ISBN 978-3-540-19460-6 978-0-387-19460-8.

[29] Radford M. Neal. Bayesian Learning for Neural Networks, volume 118 of Lecture Notes in Statistics. Springer, New York, NY, 1996. ISBN 978-0-387-94724-2 978-1-4612-0745-0. doi: 10.1007/978-1-4612-0745-0.

[30] Alban Petit and Caio Corro. Preventing posterior collapse in variational autoencoders for text generation via decoder regularization, October 2021.

[31] Henning Petzka, Michael Kamp, Linara Adilova, Cristian Sminchisescu, and Mario Boley. Relative Flatness and Generalization. In Advances in Neural Information Processing Systems, volume 34, pages 18420–18432. Curran Associates, Inc., 2021.

[32] Sam Roweis. Sam Roweis: Data. https://cs.nyu.edu/home/people/in_memoriam/roweis/data.html.

[33] Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. A Hybrid Convolutional Variational Autoencoder for Text Generation. In Martha Palmer, Rebecca Hwa, and Sebastian Riedel, editors, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 627–637, Copenhagen, Denmark, September 2017. Association for Computatio... doi: 10.18653/v1/d17-1066.

[34] Huajie Shao, Shuochao Yao, Dachun Sun, Aston Zhang, Shengzhong Liu, Dongxin Liu, Jun Wang, and Tarek Abdelzaher. ControlVAE: Controllable Variational Autoencoder, June 2020.

[35] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder Variational Autoencoders. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.

[36] Tianbao Song, Jingbo Sun, Xin Liu, and Weiming Peng. Scale-VAE: Preventing Posterior Collapse in Variational Autoencoder. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (L...

[37] Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional Image Generation with PixelCNN Decoders, 2016.

[38] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural Discrete Representation Learning. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

[39] Yixin Wang, David Blei, and John Cunningham. Posterior Collapse and Latent Variable Non-identifiability. In Advances in Neural Information Processing Systems, volume 34, pages 5443–5455. Curran Associates, Inc., 2021.

[40] Florian Wenzel, Kevin Roth, Bastiaan Veeling, Jakub Swiatkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, and Sebastian Nowozin. How Good is the Bayes Posterior in Deep Neural Networks Really? In Proceedings of the 37th International Conference on Machine Learning, pages 10248–10259. PMLR, November 2020.

[41] Shengjia Zhao, Jiaming Song, and Stefano Ermon. InfoVAE: Balancing Learning and Inference in Variational Autoencoders. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):5885–5892, July 2019. ISSN 2374-3468, 2159-5399. doi: 10.1609/aaai.v33i01.33015885.

[42] Tiancheng Zhao, Kyusong Lee, and Maxine Eskenazi. Unsupervised Discrete Sentence Representation Learning for Interpretable Neural Dialog Generation. In Iryna Gurevych and Yusuke Miyao, editors, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1098–1107, Melbourne, Australia, July 2018.... doi: 10.18653/v1/p18-1101.

[43] Huangjie Zheng, Jiangchao Yao, Ya Zhang, Ivor W. Tsang, and Jia Wang. Understanding VAEs in Fisher-Shannon Plane. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):5917–5924, July 2019. ISSN 2374-3468, 2159-5399. doi: 10.1609/aaai.v33i01.33015917.

A Appendix / supplemental material

A.1 Collective-variable free energy of the encoder en...
published on Zenodo [2] (released under a Creative Commons 4.0 Attribution International license). Any hyperparameters not mentioned below can be assumed to be set to the default values of the open-source Simmering implementation.

A.5.1 Setup: Recovery of meaningful low-dimensional representations with an EAE

This experimental setup description pertains to exp...

but remove the constant term as we do not specify the initial conditions for the latent variables in the objective function. The objective function is also based on the SINDy autoencoder objective function in [6] but with two key differences: we remove the “SINDy regularization” term (applying an L1 norm regularization on basis coefficients), and we sampl...

were the linear (z1 and z2) terms and the sine terms (sin z1, sin z2). Appropriate combinations of these basis functions correspond to linear or non-linear oscillation dynamics. Analysis (Fig. 4) of coefficient correlations shows that, beyond displaying the expected strong correlation between coefficient combinations corresponding to descriptions of linea...

[1] [1]

Saurous, and Kevin Murphy

Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif A. Saurous, and Kevin Murphy. Fixing a Broken ELBO. InProceedings of the 35th International Conference on Machine Learning, pages 159–168. PMLR, July 2018

work page 2018

[2] [2]

Simmering: Sufficient training of neural networks in Python

Irina Babayan, Hazhir Aliahmadi, and Greg van Anders. Simmering: Sufficient training of neural networks in Python. Zenodo, November 2025

work page 2025

[3] [3]

Sufficient is better than optimal for training neural networks.Nature Communications, 17(1):271, December 2025

Irina Babayan, Hazhir Aliahmadi, and Greg Van Anders. Sufficient is better than optimal for training neural networks.Nature Communications, 17(1):271, December 2025. ISSN 2041-1723. doi: 10.1038/s41467-025-66983-3

work page doi:10.1038/s41467-025-66983-3 2025

[4] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. Generating Sentences from a Continuous Space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 10–21, Berlin, Germany, Association for Computational Linguistics. doi: 10.18653/v1/K16-1002

[6] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance Weighted Autoencoders, 2015

[7] Kathleen Champion, Bethany Lusch, J. Nathan Kutz, and Steven L. Brunton. Data-driven discovery of coordinates and governing equations. Proceedings of the National Academy of Sciences, 116(45):22445–22451, November 2019. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.1906995116

[8] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys*. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124018, December 2019. ISSN 1742-5468. doi: 10.1088/1742-5468/ab39d9

[9] Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational Lossy Autoencoder, 2016

[10] Jen-Tzung Chien and Chih-Jung Tsai. Amortized Mixture Prior for Variational Sequence Generation. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–6, July 2020. doi: 10.1109/IJCNN48605.2020.9206667

[11] Tim R. Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M. Tomczak. Hyperspherical Variational Auto-Encoders, September 2022

[12] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters, B195:216–222, 1987. doi: 10.1016/0370-2693(87)91197-X

[13] Hao Fu, Chunyuan Li, Xiaodong Liu, Jianfeng Gao, Asli Celikyilmaz, and Lawrence Carin. Cyclical Annealing Schedule: A Simple Approach to Mitigating KL Vanishing, June 2019

[14] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. Beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In International Conference on Learning Representations, February 2017

[15] G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786):504–507, July 2006. doi: 10.1126/science.1127647

[16] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6):417–441, September 1933. ISSN 1939-2176, 0022-0663. doi: 10.1037/h0071325

[17] E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620–630, May 1957. doi: 10.1103/PhysRev.106.620

[18] Edwin Jaynes. Prior Probabilities. IEEE Transactions on Systems Science and Cybernetics, 4(3):227–241, 1968. ISSN 0536-1567. doi: 10.1109/TSSC.1968.300117

[19] Yiding Jiang*, Behnam Neyshabur*, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic Generalization Measures and Where to Find Them. In International Conference on Learning Representations, September 2019

[20] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes, December 2022

[21] Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved Variational Inference with Inverse Autoregressive Flow. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016

[22] Alexej Klushyn, Nutan Chen, Richard Kurle, Botond Cseke, and Patrick van der Smagt. Learning Hierarchical Priors in VAEs. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019

[23] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. ATT Labs [Online], 2, 2010

[24] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep Learning Face Attributes in the Wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015

[25] James Lucas, George Tucker, Roger B Grosse, and Mohammad Norouzi. Don’t Blame the ELBO! A Linear VAE Perspective on Posterior Collapse. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019

[26] Xuezhe Ma, Chunting Zhou, and Eduard Hovy. MAE: Mutual Posterior-Divergence Regularization for Variational AutoEncoders. In International Conference on Learning Representations, September 2018

[27] David J. C. MacKay. Bayesian Interpolation. Neural Computation, 4(3):415–447, May 1992. ISSN 0899-7667. doi: 10.1162/neco.1992.4.3.415

[28] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murra...

[29] James Dickson Murray. Mathematical Biology. Number 19 in Biomathematics. Springer, Berlin New York Paris, 1990. ISBN 978-3-540-19460-6 978-0-387-19460-8

[30] Radford M. Neal. Bayesian Learning for Neural Networks, volume 118 of Lecture Notes in Statistics. Springer, New York, NY, 1996. ISBN 978-0-387-94724-2 978-1-4612-0745-0. doi: 10.1007/978-1-4612-0745-0

[31] Alban Petit and Caio Corro. Preventing posterior collapse in variational autoencoders for text generation via decoder regularization, October 2021

[32] Henning Petzka, Michael Kamp, Linara Adilova, Cristian Sminchisescu, and Mario Boley. Relative Flatness and Generalization. In Advances in Neural Information Processing Systems, volume 34, pages 18420–18432. Curran Associates, Inc., 2021

[33] Sam Roweis. Sam Roweis: Data. https://cs.nyu.edu/home/people/in_memoriam/roweis/data.html

[34] Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. A Hybrid Convolutional Variational Autoencoder for Text Generation. In Martha Palmer, Rebecca Hwa, and Sebastian Riedel, editors, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 627–637, Copenhagen, Denmark, September 2017. Association for Computatio...

[35] Huajie Shao, Shuochao Yao, Dachun Sun, Aston Zhang, Shengzhong Liu, Dongxin Liu, Jun Wang, and Tarek Abdelzaher. ControlVAE: Controllable Variational Autoencoder, June 2020

[36] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder Variational Autoencoders. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016

[37] Tianbao Song, Jingbo Sun, Xin Liu, and Weiming Peng. Scale-VAE: Preventing Posterior Collapse in Variational Autoencoder. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (L...

[38] Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional Image Generation with PixelCNN Decoders, 2016

[39] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural Discrete Representation Learning. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

[40] Yixin Wang, David Blei, and John Cunningham. Posterior Collapse and Latent Variable Non-identifiability. In Advances in Neural Information Processing Systems, volume 34, pages 5443–5455. Curran Associates, Inc., 2021

[41] Florian Wenzel, Kevin Roth, Bastiaan Veeling, Jakub Swiatkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, and Sebastian Nowozin. How Good is the Bayes Posterior in Deep Neural Networks Really? In Proceedings of the 37th International Conference on Machine Learning, pages 10248–10259. PMLR, November 2020

[42] Shengjia Zhao, Jiaming Song, and Stefano Ermon. InfoVAE: Balancing Learning and Inference in Variational Autoencoders. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):5885–5892, July 2019. ISSN 2374-3468, 2159-5399. doi: 10.1609/aaai.v33i01.33015885

[43] Tiancheng Zhao, Kyusong Lee, and Maxine Eskenazi. Unsupervised Discrete Sentence Representation Learning for Interpretable Neural Dialog Generation. In Iryna Gurevych and Yusuke Miyao, editors, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1098–1107, Melbourne, Australia, July 2018....

[44] Huangjie Zheng, Jiangchao Yao, Ya Zhang, Ivor W. Tsang, and Jia Wang. Understanding VAEs in Fisher-Shannon Plane. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):5917–5924, July 2019. ISSN 2374-3468, 2159-5399. doi: 10.1609/aaai.v33i01.33015917

A Appendix / supplemental material

A.1 Collective-variable free energy of the encoder en...

published on Zenodo [2] (released under a Creative Commons Attribution 4.0 International license). Any hyperparameters not mentioned below can be assumed to be set to the default values of the open-source Simmering implementation.

A.5.1 Setup: Recovery of meaningful low-dimensional representations with an EAE

This experimental setup description pertains to exp...
