arXiv: 1503.03585 · v8 · submitted 2015-03-12 · cs.LG · cond-mat.dis-nn · q-bio.NC · stat.ML

Recognition: 3 theorem links


Deep Unsupervised Learning using Nonequilibrium Thermodynamics

Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, Surya Ganguli


Pith reviewed 2026-05-13 20:07 UTC · model grok-4.3


keywords: diffusion process, generative models, unsupervised learning, nonequilibrium thermodynamics, deep networks, sampling, inference, probability evaluation


The pith

Gradually destroying structure in data via iterative diffusion and learning a neural network to reverse the process yields tractable deep generative models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a generative modeling method that starts by applying a forward diffusion process to slowly add noise and erase structure from training data. It then trains a reverse diffusion process, parameterized by a neural network, to iteratively restore that structure and generate new samples. This physics-inspired construction keeps sampling, probability evaluation, and inference computationally feasible even when the model has thousands of time steps or layers. The resulting models support both unconditional generation and tasks like computing conditional or posterior probabilities over the data.

Core claim

A forward diffusion process systematically destroys structure in a data distribution through iterative noise addition; learning a parameterized reverse diffusion process that restores structure produces a flexible generative model in which learning, sampling, and probability evaluation remain tractable for architectures with thousands of layers.

What carries the argument

The forward diffusion process that gradually corrupts data toward a simple noise distribution, together with the neural-network-parameterized reverse diffusion process that reconstructs data from noise.
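As a minimal sketch of the forward half of this machinery, assume the Gaussian kernel x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * noise with a constant per-step variance beta. The function names, the bimodal toy data, and the values of beta and the step count are illustrative choices, not taken from the paper. The snippet diffuses one-dimensional data until its mean and variance are close to those of a standard normal:

```python
import math
import random


def forward_step(x, beta, rng):
    """One Gaussian diffusion step: shrink the signal, add noise of variance beta."""
    return math.sqrt(1.0 - beta) * x + math.sqrt(beta) * rng.gauss(0.0, 1.0)


def diffuse(samples, beta, num_steps, rng):
    """Apply forward_step num_steps times to every sample."""
    out = []
    for x in samples:
        for _ in range(num_steps):
            x = forward_step(x, beta, rng)
        out.append(x)
    return out


def mean_and_variance(values):
    """Sample mean and (biased) sample variance."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v


rng = random.Random(0)
# Hypothetical toy data: two tight clusters at -2 and +2.
data = [rng.choice([-2.0, 2.0]) + 0.1 * rng.gauss(0.0, 1.0) for _ in range(1000)]
noised = diffuse(data, beta=0.02, num_steps=500, rng=rng)

data_mean, data_var = mean_and_variance(data)
noised_mean, noised_var = mean_and_variance(noised)
print(f"data:   mean={data_mean:.2f} var={data_var:.2f}")
print(f"noised: mean={noised_mean:.2f} var={noised_var:.2f}")
```

The two clusters give the data a variance near 4, while the diffused samples should land near mean 0 and variance 1, which is the "simple noise distribution" the reverse process starts from.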

If this is right

Deep generative models with thousands of layers become learnable in practice.
Sampling from the model and evaluating its probabilities become rapid operations.
Conditional and posterior probabilities under the model can be computed directly.
The same framework applies to both continuous and discrete data distributions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may scale to higher-dimensional data such as images or audio by extending the diffusion schedule without changing the core training procedure.
It offers a route to generative models whose likelihoods remain well-defined, potentially aiding tasks that require uncertainty quantification.
Adjusting the forward diffusion rate could produce models specialized for different data modalities while retaining tractability.

Load-bearing premise

A neural network can accurately parameterize the reverse diffusion steps, and the chosen forward noise schedule keeps the reverse process Markovian and computationally tractable.

What would settle it

If samples drawn from the learned reverse process fail to match the statistical properties of held-out data or if log-probabilities assigned to test examples diverge systematically from those computed by exact methods on smaller problems, the central claim would be refuted.
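One way to picture a "smaller problem" with an exact answer: when the data are themselves Gaussian, every reverse conditional is exactly Gaussian and can be written in closed form, so it can serve as ground truth for a learned reverse process. The sketch below works under that toy assumption only. The names, the target N(2, 0.25), and the constant schedule are hypothetical and do not come from the paper.

```python
import math
import random


def marginal_moments(mean0, var0, betas):
    """Mean and variance of x_t for t = 0..T when x_0 ~ N(mean0, var0)."""
    means, variances = [mean0], [var0]
    for beta in betas:
        means.append(math.sqrt(1.0 - beta) * means[-1])
        variances.append((1.0 - beta) * variances[-1] + beta)
    return means, variances


def sample_reverse(mean0, var0, betas, rng):
    """Ancestral sample from the exact reverse chain, started at N(0, 1)."""
    means, variances = marginal_moments(mean0, var0, betas)
    x = rng.gauss(0.0, 1.0)
    for t in range(len(betas), 0, -1):
        beta = betas[t - 1]
        var_t = variances[t]
        # Gaussian conditioning of x_{t-1} on x_t for jointly Gaussian variables.
        gain = math.sqrt(1.0 - beta) * variances[t - 1] / var_t
        cond_mean = means[t - 1] + gain * (x - means[t])
        cond_var = variances[t - 1] * beta / var_t
        x = cond_mean + math.sqrt(cond_var) * rng.gauss(0.0, 1.0)
    return x


rng = random.Random(0)
betas = [0.03] * 200  # illustrative constant schedule
draws = [sample_reverse(2.0, 0.25, betas, rng) for _ in range(2000)]
sample_mean = sum(draws) / len(draws)
sample_var = sum((x - sample_mean) ** 2 for x in draws) / len(draws)
print(f"target mean=2.00 var=0.25; sampled mean={sample_mean:.2f} var={sample_var:.2f}")
```

A learned reverse process trained on the same Gaussian data could be compared step by step against `cond_mean` and `cond_var`; a systematic gap would be the kind of divergence described above.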

Original abstract

A central problem in machine learning involves modeling complex data-sets using highly flexible families of probability distributions in which learning, sampling, inference, and evaluation are still analytically or computationally tractable. Here, we develop an approach that simultaneously achieves both flexibility and tractability. The essential idea, inspired by non-equilibrium statistical physics, is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process. We then learn a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data. This approach allows us to rapidly learn, sample from, and evaluate probabilities in deep generative models with thousands of layers or time steps, as well as to compute conditional and posterior probabilities under the learned model. We additionally release an open source reference implementation of the algorithm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper shows how to turn a fixed forward diffusion process into a trainable generative model by learning the reverse steps, and the derivation plus released code make the claims verifiable.


The main thing here is that the authors define a gradual forward process that adds Gaussian noise over many steps until the structure in the data is destroyed, then train a neural net to reverse it and generate samples. This gives a flexible model that can handle thousands of steps while keeping sampling and probability evaluation tractable through a variational bound that breaks into per-step KL terms. The construction comes from nonequilibrium thermodynamics but is applied directly to ML data, and the reverse process stays Markovian by Bayes rule on the forward chain. When the noise per step is small, the reverse conditional is approximately Gaussian, so the net only needs to predict the mean and covariance of that Gaussian at each step.

Experiments on 2D mixtures, MNIST, and CIFAR-10 show the model recovers structure and produces reasonable samples, and the open-source code lets anyone check the implementation. The training loss matches the derived objective exactly, with no hidden fitting or circular targets. One soft spot is that the forward diffusion schedule itself is a free parameter that must be chosen to keep the reverse tractable; their choice works on the tested datasets but would likely need retuning for new data. That is a practical detail rather than a flaw in the central argument.

The paper is aimed at researchers building deep generative models who need a principled way to scale beyond what VAEs or other methods allowed at the time. Anyone working on density estimation or sampling from high-dimensional distributions would get concrete value from the math and the released reference implementation. I would send it for peer review; the idea is fresh enough, the evidence from the derivation and experiments is sharp enough, and the code makes it reproducible.
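To make the "Bayes rule on the forward chain" step concrete, here is a hedged sketch of the Gaussian forward posterior q(x_{t-1} | x_t, x_0) for scalar data. It assumes the kernel q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t). The helper names and the `alpha_bar` notation are assumptions borrowed from later diffusion-model literature and are not the paper's own symbols.

```python
import math


def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta_s) for s = 1..t, with alpha_bar[0] = 1."""
    alpha_bar = [1.0]
    for beta in betas:
        alpha_bar.append(alpha_bar[-1] * (1.0 - beta))
    return alpha_bar


def forward_posterior(x0, xt, t, betas):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for t >= 2."""
    alpha_bar = cumulative_alphas(betas)
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    prior_var = 1.0 - alpha_bar[t - 1]  # variance of q(x_{t-1} | x_0)
    # Bayes rule for Gaussians: precisions add, means are precision-weighted.
    precision = alpha_t / beta_t + 1.0 / prior_var
    variance = 1.0 / precision
    mean = variance * (
        math.sqrt(alpha_t) * xt / beta_t
        + math.sqrt(alpha_bar[t - 1]) * x0 / prior_var
    )
    return mean, variance


betas = [0.02] * 10  # illustrative constant schedule
mean, variance = forward_posterior(x0=1.0, xt=0.5, t=5, betas=betas)
print(f"posterior mean={mean:.4f}, variance={variance:.5f}")
```

The posterior variance is always smaller than beta_t, which is one way to see why small forward steps leave a narrow, nearly Gaussian target for the reverse model.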

Referee Report

0 major / 3 minor

Summary. The manuscript introduces a generative modeling framework inspired by nonequilibrium thermodynamics. A fixed forward Markov diffusion process iteratively adds Gaussian noise to destroy data structure over many time steps; a neural network is then trained to parameterize the reverse Markov process that restores structure. The resulting model supports tractable learning via a variational objective consisting of KL divergences between true and approximate reverse conditionals, sampling, and probability evaluation even for models with thousands of layers, with additional support for conditional and posterior inference. Experiments on 2-D mixtures, MNIST, and CIFAR-10 are presented along with an open-source implementation.
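Each KL term in such an objective compares two Gaussians, so it has a closed form. A minimal scalar sketch follows; the function name is hypothetical, and the multivariate, network-parameterized case used in practice is not shown.

```python
import math


def gaussian_kl(mean_q, var_q, mean_p, var_p):
    """KL(N(mean_q, var_q) || N(mean_p, var_p)) for scalar Gaussians, in nats."""
    return 0.5 * (
        math.log(var_p / var_q)
        + (var_q + (mean_q - mean_p) ** 2) / var_p
        - 1.0
    )


print(gaussian_kl(0.0, 1.0, 0.0, 1.0))  # identical distributions → 0.0
print(gaussian_kl(0.0, 1.0, 1.0, 1.0))  # unit shift in the mean → 0.5
```

Because the divergence is available analytically, no sampling is needed to evaluate an individual step's contribution once the two means and variances are known.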

Significance. If the central claims hold, the work supplies a scalable, physics-motivated route to flexible deep generative models whose training objective remains analytically tractable at large numbers of diffusion steps. The explicit construction of a Markovian reverse process via Bayes rule on the forward chain, the open-source reference code, and the empirical recovery of structure on standard benchmarks constitute concrete strengths that distinguish the contribution from contemporaneous variational auto-encoder approaches.

minor comments (3)

[§3.2] The statement that the reverse conditional is 'approximately Gaussian' when the forward noise variance is small would benefit from an explicit bound or reference to the relevant lemma establishing the approximation error.
[Figure 4] The caption does not specify the number of samples drawn per class or the precise temperature used for the CIFAR-10 generations, making direct reproduction of the visual results more difficult.
[Experiments] The forward diffusion schedule is listed as a free parameter in the axiom ledger; a short discussion of its sensitivity (or lack thereof) on the reported MNIST and CIFAR results would strengthen the reproducibility section.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the manuscript, recognition of its strengths relative to contemporaneous variational approaches, and recommendation to accept. We are pleased that the tractability of the variational objective at large numbers of diffusion steps, the explicit Markovian reverse-process construction, and the open-source implementation were highlighted as distinguishing contributions.

Circularity Check

0 steps flagged

No significant circularity in the derivation chain


The forward diffusion process is defined explicitly as a Markov chain of Gaussian transitions q(x_t | x_{t-1}) governed by a noise schedule beta_t. The schedule is a design choice that may be fixed or optimized (see the free-parameter ledger below), and the forward chain contains no neural-network parameters. The reverse process is parameterized by a neural network that outputs the mean and covariance of each Gaussian reverse step and is optimized via a variational lower bound. That bound expands directly into a sum of KL divergences between the true reverse conditionals (obtained analytically from the forward process via Bayes rule) and the approximate reverse conditionals; this expansion is shown in the paper's equations without any fitted input being relabeled as a prediction. No self-citations are load-bearing for the core tractability claims, no uniqueness theorems are imported from prior author work, and no ansatz is smuggled in. The entire chain remains self-contained, with the learned model evaluated against held-out data likelihood and sampling fidelity.
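For orientation, a bound of this kind can be written as follows. This is the standard form in the notation common to later diffusion-model literature, given as a sketch and not copied from the paper's own equations; the expectation is taken over the forward chain q(x_{1:T} | x_0), and \theta denotes the reverse-process parameters.

```latex
\log p_\theta(x_0) \;\ge\;
\mathbb{E}_{q}\Big[
  \log p_\theta(x_0 \mid x_1)
  \;-\; \sum_{t=2}^{T} D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big)
  \;-\; D_{\mathrm{KL}}\big(q(x_T \mid x_0) \,\|\, p(x_T)\big)
\Big]
```

Every term inside the sum compares a forward posterior that follows from the fixed forward chain with a learned reverse conditional, so the training target is determined by the forward process and the data, not by the model's own outputs.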

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the existence of a Markov forward diffusion process whose reverse can be parameterized and learned; the noise schedule is an additional design choice.

free parameters (1)

forward diffusion schedule
The per-step variance or noise level schedule must be chosen or optimized to make the reverse process tractable.
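A rough way to see why the schedule matters: the product of (1 - beta_t) over all steps is the factor by which the data's variance survives in the final state, so it measures how close the end of the forward chain is to pure noise. The sketch below uses a linear schedule whose endpoints (1e-4 and 0.02) are illustrative assumptions, not values from the paper; the function names are likewise hypothetical.

```python
def linear_schedule(beta_start, beta_end, num_steps):
    """Per-step variances rising linearly from beta_start to beta_end."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def signal_retained(betas):
    """Factor multiplying the data variance after all forward steps."""
    retained = 1.0
    for beta in betas:
        retained *= 1.0 - beta
    return retained


# Hypothetical endpoints; only the number of steps changes between rows.
for num_steps in (10, 100, 1000):
    betas = linear_schedule(1e-4, 0.02, num_steps)
    print(num_steps, f"{signal_retained(betas):.3g}")
```

With few steps most of the signal survives and the final state is far from the noise prior; with many small steps the signal is almost entirely erased while each individual step stays small.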

axioms (1)

domain assumption: The forward process is a Markov chain with known Gaussian transition kernels.
Invoked to ensure the reverse process remains Markovian and analytically characterizable.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation · Jcost_cosh_identity · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Passage: "The essential idea, inspired by non-equilibrium statistical physics, is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process. We then learn a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data."

Foundation.DAlembert.TriangulatedProof · gates_connected · echoes

Passage: "the reverse diffusion process can be accurately parameterized by a neural network"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Autoregressive Learning in Joint KL: Sharp Oracle Bounds and Lower Bounds
cs.LG 2026-05 unverdicted novelty 8.0

Joint KL yields horizon-free approximation but an information-theoretic lower bound of order Omega(H) for estimation error in autoregressive learning, with matching computationally efficient upper bounds.
Generative models on phase space
hep-ph 2026-04 unverdicted novelty 8.0

Generative diffusion and flow models are constructed to remain exactly on the Lorentz-invariant massless N-particle phase space manifold during sampling for particle physics applications.
Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
cs.LG 2015-11 accept novelty 8.0

DCGANs with architectural constraints learn a hierarchy of representations from object parts to scenes in both generator and discriminator across image datasets.
Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning
cs.LG 2022-08 unverdicted novelty 7.0

Diffusion-QL uses conditional diffusion models as expressive policies in offline RL by coupling behavior cloning with Q-value maximization, achieving SOTA on most D4RL tasks.
Hierarchical Text-Conditional Image Generation with CLIP Latents
cs.CV 2022-04 accept novelty 7.0

A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.
High-Resolution Image Synthesis with Latent Diffusion Models
cs.CV 2021-12 conditional novelty 7.0

Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrai...
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
cs.CV 2021-12 accept novelty 7.0

A 3.5-billion-parameter diffusion model with classifier-free guidance generates images preferred over DALL-E by human raters and can be fine-tuned for text-guided inpainting.
SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
cs.CV 2021-08 conditional novelty 7.0

SDEdit performs guided image synthesis and editing by adding noise to inputs and refining them via denoising with a diffusion model's SDE prior, outperforming GAN methods in human studies without task-specific training.
Diffusion Models Beat GANs on Image Synthesis
cs.LG 2021-05 accept novelty 7.0

Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.
PG-3DGS: Optimizing 3D Gaussian Splatting to Satisfy Physics Objectives
cs.CV 2026-05 unverdicted novelty 6.0

PG-3DGS couples 3D Gaussian Splatting with differentiable physics so that optimized shapes satisfy both visual fidelity and physical objectives such as pouring and aerodynamic lift, with real-world 3D-printed validation.
Diffusion model for SU(N) gauge theories
hep-lat 2026-05 unverdicted novelty 6.0

Implicit score matching trains diffusion models that successfully sample SU(3) Wilson gauge configurations on lattices, with a Hamiltonian-dynamics corrector needed for strong coupling.
GCCM: Enhancing Generative Graph Prediction via Contrastive Consistency Model
cs.AI 2026-05 unverdicted novelty 6.0

GCCM prevents shortcut collapse in consistency models for graph prediction by using contrastive negative pairs and input feature perturbation, leading to better performance than deterministic baselines.
Breaking Watermarks in the Frequency Domain: A Modulated Diffusion Attack Framework
cs.CV 2026-04 unverdicted novelty 6.0

FMDiffWA uses frequency-domain modulation inside diffusion sampling to neutralize watermarks in images while preserving visual quality and generalizing across watermarking schemes.
Deepfake Detection Generalization with Diffusion Noise
cs.CV 2026-04 unverdicted novelty 6.0

ANL uses diffusion noise prediction and attention to regularize deepfake detectors for better generalization to unseen synthesis methods without added inference cost.
Discrete Flow Maps
stat.ML 2026-04 unverdicted novelty 6.0

Discrete Flow Maps recast flow map training for discrete domains using simplex geometry to enable single-step text generation from noise and outperform prior discrete flow models.
MuPPet: Multi-person 2D-to-3D Pose Lifting
cs.CV 2026-04 unverdicted novelty 6.0

MuPPet introduces person encoding, permutation augmentation, and dynamic multi-person attention to outperform prior single- and multi-person 2D-to-3D pose lifting methods on group interaction datasets while improving ...
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
cs.CV 2024-03 conditional novelty 6.0

Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
cs.CV 2023-11 conditional novelty 6.0

Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
cs.CV 2023-07 conditional novelty 6.0

SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...
VideoGPT: Video Generation using VQ-VAE and Transformers
cs.CV 2021-04 accept novelty 6.0

VideoGPT generates competitive natural videos by learning discrete latents with VQ-VAE and modeling them autoregressively with a transformer.
Mesh Based Simulations with Spatial and Temporal awareness
cs.LG 2026-05 unverdicted novelty 5.0

A unified training framework for mesh-based ML surrogates in CFD improves accuracy and long-horizon stability by enforcing spatial derivative consistency via multi-node prediction, using temporal cross-attention corre...
Generating Satellite Imagery Data for Wildfire Detection through Mask-Conditioned Generative AI
cs.CV 2026-04 conditional novelty 4.0

A pre-trained Earth Observation diffusion model generates realistic post-wildfire Sentinel-2 imagery from burn masks via inpainting, achieving Burn IoU 0.456 and improved saliency over full generation.
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
cs.CV 2024-02 unverdicted novelty 2.0

The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.
