pith. machine review for the scientific record.

arXiv: 1503.03585 · v8 · submitted 2015-03-12 · 💻 cs.LG · cond-mat.dis-nn · q-bio.NC · stat.ML

Deep Unsupervised Learning using Nonequilibrium Thermodynamics

Pith reviewed 2026-05-13 20:07 UTC · model grok-4.3

keywords: diffusion process, generative models, unsupervised learning, nonequilibrium thermodynamics, deep networks, sampling, inference, probability evaluation

The pith

Gradually destroying structure in data via iterative diffusion and learning a neural network to reverse the process yields tractable deep generative models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a generative modeling method that starts by applying a forward diffusion process to slowly add noise and erase structure from training data. It then trains a reverse diffusion process, parameterized by a neural network, to iteratively restore that structure and generate new samples. This physics-inspired construction keeps sampling, probability evaluation, and inference computationally feasible even when the model has thousands of time steps or layers. The resulting models support both unconditional generation and tasks like computing conditional or posterior probabilities over the data.

Core claim

A forward diffusion process systematically destroys structure in a data distribution through iterative noise addition; learning a parameterized reverse diffusion process that restores structure produces a flexible generative model in which learning, sampling, and probability evaluation remain tractable for architectures with thousands of layers.

What carries the argument

The forward diffusion process that gradually corrupts data toward a simple noise distribution, together with the neural-network-parameterized reverse diffusion process that reconstructs data from noise.
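
As a minimal sketch of the forward half of this construction, the Python snippet below repeatedly applies a Gaussian transition of the form q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) x_{t-1}, beta_t) to a toy bimodal dataset. The toy data, the constant beta of 0.01, the 1000 steps, and the function name `forward_diffuse` are illustrative assumptions, not values or code from the paper or its reference implementation.

```python
import numpy as np

def forward_diffuse(x0, betas, rng):
    """Run the forward chain x_0 -> x_T and return every intermediate state."""
    xs = [x0]
    x = x0
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        # One Gaussian transition: shrink the signal, then add fresh noise.
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
        xs.append(x)
    return np.stack(xs)

rng = np.random.default_rng(0)
# Hypothetical toy data: two tight clusters at -2 and +2.
x0 = np.concatenate([rng.normal(-2.0, 0.1, 1000), rng.normal(2.0, 0.1, 1000)])
betas = np.full(1000, 0.01)  # assumed constant schedule
trajectory = forward_diffuse(x0, betas, rng)
x_T = trajectory[-1]
print("start: mean=%.2f std=%.2f" % (x0.mean(), x0.std()))
print("end:   mean=%.2f std=%.2f" % (x_T.mean(), x_T.std()))
```

Under these assumptions the two clusters should merge into something close to a standard normal by the final step, which is the "simple noise distribution" the reverse process would start from.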

If this is right

  • Deep generative models with thousands of layers become learnable in practice.
  • Sampling from the model and evaluating its probabilities become rapid operations.
  • Conditional and posterior probabilities under the model can be computed directly.
  • The same framework applies to both continuous and discrete data distributions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may scale to higher-dimensional data such as images or audio by extending the diffusion schedule without changing the core training procedure.
  • It offers a route to generative models whose likelihoods remain well-defined, potentially aiding tasks that require uncertainty quantification.
  • Adjusting the forward diffusion rate could produce models specialized for different data modalities while retaining tractability.

Load-bearing premise

A neural network can accurately parameterize the reverse diffusion steps, and the chosen forward noise schedule keeps the reverse process Markovian and computationally tractable.

What would settle it

If samples drawn from the learned reverse process fail to match the statistical properties of held-out data, or if log-probabilities assigned to test examples diverge systematically from those computed by exact methods on smaller problems, the central claim would be refuted.

Original abstract

A central problem in machine learning involves modeling complex data-sets using highly flexible families of probability distributions in which learning, sampling, inference, and evaluation are still analytically or computationally tractable. Here, we develop an approach that simultaneously achieves both flexibility and tractability. The essential idea, inspired by non-equilibrium statistical physics, is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process. We then learn a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data. This approach allows us to rapidly learn, sample from, and evaluate probabilities in deep generative models with thousands of layers or time steps, as well as to compute conditional and posterior probabilities under the learned model. We additionally release an open source reference implementation of the algorithm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces a generative modeling framework inspired by nonequilibrium thermodynamics. A fixed forward Markov diffusion process iteratively adds Gaussian noise to destroy data structure over many time steps; a neural network is then trained to parameterize the reverse Markov process that restores structure. Learning uses a variational objective built from KL divergences between the true and approximate reverse conditionals. Learning, sampling, and probability evaluation all remain tractable even for models with thousands of layers, and the model additionally supports conditional and posterior inference. Experiments on a 2-D swiss roll, MNIST, and CIFAR-10 are presented along with an open-source implementation.
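
To make the KL terms in that objective concrete: when both the true reverse conditional and the model's reverse conditional are Gaussian, each term has a closed form. The sketch below shows the scalar case only. The function name `kl_gaussian` and the example inputs are hypothetical and are not taken from the paper's reference implementation.

```python
import math

def kl_gaussian(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for scalar Gaussians."""
    return 0.5 * (
        math.log(var_p / var_q)
        + (var_q + (mu_q - mu_p) ** 2) / var_p
        - 1.0
    )

# Identical distributions: the divergence vanishes.
print(kl_gaussian(0.0, 1.0, 0.0, 1.0))
# Same variance, means one unit apart.
print(kl_gaussian(0.0, 1.0, 1.0, 1.0))
```

In a full model the means and variances of the approximate conditionals would come from the network, and the per-step divergences would be summed over time steps and averaged over data.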

Significance. If the central claims hold, the work supplies a scalable, physics-motivated route to flexible deep generative models whose training objective remains analytically tractable at large numbers of diffusion steps. The explicit construction of a Markovian reverse process via Bayes rule on the forward chain, the open-source reference code, and the empirical recovery of structure on standard benchmarks constitute concrete strengths that distinguish the contribution from contemporaneous variational auto-encoder approaches.

minor comments (3)
  1. [§3.2] The statement that the reverse conditional is 'approximately Gaussian' when the forward noise variance is small would benefit from an explicit bound or reference to the relevant lemma establishing the approximation error.
  2. [Figure 4] The caption does not specify the number of samples drawn per class or the precise temperature used for the CIFAR-10 generations, making direct reproduction of the visual results more difficult.
  3. [Experiments] The forward diffusion schedule is listed as a free parameter in the axiom ledger; a short discussion of its sensitivity (or lack thereof) on the reported MNIST and CIFAR results would strengthen the reproducibility section.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the manuscript, recognition of its strengths relative to contemporaneous variational approaches, and recommendation to accept. We are pleased that the tractability of the variational objective at large numbers of diffusion steps, the explicit Markovian reverse-process construction, and the open-source implementation were highlighted as distinguishing contributions.

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The forward diffusion process is defined explicitly from first principles as a fixed Markov chain of Gaussian transitions q(x_t | x_{t-1}) with a pre-specified noise schedule beta_t, independent of any learned parameters or data. The reverse process is parameterized by a neural network and trained by maximizing a variational lower bound on the log-likelihood. That bound expands directly into a sum of KL divergences between the true reverse conditionals (obtained analytically from the forward process via Bayes rule) and the approximate reverse conditionals; the expansion is shown in the paper's equations without any fitted input being relabeled as a prediction. No self-citations are load-bearing for the core tractability claims, no uniqueness theorems are imported from prior author work, and no ansatz is smuggled in. The entire chain remains self-contained, with the learned model evaluated against held-out data likelihood and sampling fidelity.
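
The Bayes-rule step can be checked numerically in the scalar Gaussian case. The sketch below uses the closed-form posterior q(x_{t-1} | x_t, x_0) in the cumulative-product notation common in later diffusion literature (alpha_t = 1 - beta_t, abar_t = product of alpha_s up to t), which is an assumption of this example rather than the paper's own notation. The schedule, the test point, and the helper names `normal_pdf` and `posterior_params` are likewise hypothetical.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Density of a scalar Gaussian N(mean, var) at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def posterior_params(x_t, x_0, betas, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for 1-indexed steps t >= 2."""
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)  # abar[k-1] = prod of alpha_s for s <= k
    a_t, ab_t, ab_prev = alphas[t - 1], abar[t - 1], abar[t - 2]
    mean = (np.sqrt(ab_prev) * betas[t - 1] * x_0
            + np.sqrt(a_t) * (1.0 - ab_prev) * x_t) / (1.0 - ab_t)
    var = (1.0 - ab_prev) / (1.0 - ab_t) * betas[t - 1]
    return mean, var

betas = np.linspace(1e-3, 5e-2, 50)  # assumed schedule
abar = np.cumprod(1.0 - betas)
t, x_0, x_prev, x_t = 10, 0.7, 0.4, 0.1  # arbitrary test point

# Bayes rule: q(x_t | x_{t-1}) q(x_{t-1} | x_0) = q(x_{t-1} | x_t, x_0) q(x_t | x_0)
lhs = (normal_pdf(x_t, np.sqrt(1.0 - betas[t - 1]) * x_prev, betas[t - 1])
       * normal_pdf(x_prev, np.sqrt(abar[t - 2]) * x_0, 1.0 - abar[t - 2]))
mean, var = posterior_params(x_t, x_0, betas, t)
rhs = (normal_pdf(x_prev, mean, var)
       * normal_pdf(x_t, np.sqrt(abar[t - 1]) * x_0, 1.0 - abar[t - 1]))
print(np.isclose(lhs, rhs))
```

The check only confirms that the closed-form posterior is consistent with the forward kernels at one point; it says nothing about how well a trained network approximates it.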

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the existence of a Markov forward diffusion process whose reverse can be parameterized and learned; the noise schedule is an additional design choice.

free parameters (1)
  • forward diffusion schedule
    The per-step variance or noise level schedule must be chosen or optimized to make the reverse process tractable.
axioms (1)
  • domain assumption: The forward process is a Markov chain with known Gaussian transition kernels
    Invoked to ensure the reverse process remains Markovian and analytically characterizable.
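
One way to see why the schedule matters is to compute how much of the original data amplitude survives to the final step, which for Gaussian transitions is sqrt(prod(1 - beta_t)). The sketch below compares three hypothetical schedules; none of them is taken from the paper, and `remaining_signal` is an illustrative helper name.

```python
import numpy as np

def remaining_signal(betas):
    """Fraction sqrt(prod(1 - beta_t)) of x_0's amplitude left after the last step."""
    return float(np.sqrt(np.prod(1.0 - betas)))

T = 1000
# Assumed example schedules, chosen only to contrast slow and fast diffusion.
schedules = {
    "constant 1e-4": np.full(T, 1e-4),
    "constant 1e-2": np.full(T, 1e-2),
    "linear 1e-4 to 2e-2": np.linspace(1e-4, 2e-2, T),
}
for name, betas in schedules.items():
    print(f"{name}: remaining signal = {remaining_signal(betas):.4f}")
```

A schedule that leaves most of the signal intact at the last step means the terminal distribution still carries data structure, so the reverse process could not start from simple noise.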

pith-pipeline@v0.9.0 · 5453 in / 1200 out tokens · 41859 ms · 2026-05-13T20:07:30.598354+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation Jcost_cosh_identity echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    The essential idea, inspired by non-equilibrium statistical physics, is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process. We then learn a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data.

  • Foundation.DAlembert.TriangulatedProof gates_connected echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    the reverse diffusion process can be accurately parameterized by a neural network

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Autoregressive Learning in Joint KL: Sharp Oracle Bounds and Lower Bounds

    cs.LG 2026-05 unverdicted novelty 8.0

    Joint KL yields horizon-free approximation but an information-theoretic lower bound of order Omega(H) for estimation error in autoregressive learning, with matching computationally efficient upper bounds.

  2. Generative models on phase space

    hep-ph 2026-04 unverdicted novelty 8.0

    Generative diffusion and flow models are constructed to remain exactly on the Lorentz-invariant massless N-particle phase space manifold during sampling for particle physics applications.

  3. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

    cs.LG 2015-11 accept novelty 8.0

    DCGANs with architectural constraints learn a hierarchy of representations from object parts to scenes in both generator and discriminator across image datasets.

  4. Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning

    cs.LG 2022-08 unverdicted novelty 7.0

    Diffusion-QL uses conditional diffusion models as expressive policies in offline RL by coupling behavior cloning with Q-value maximization, achieving SOTA on most D4RL tasks.

  5. Hierarchical Text-Conditional Image Generation with CLIP Latents

    cs.CV 2022-04 accept novelty 7.0

    A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.

  6. High-Resolution Image Synthesis with Latent Diffusion Models

    cs.CV 2021-12 conditional novelty 7.0

    Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrai...

  7. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    cs.CV 2021-12 accept novelty 7.0

    A 3.5-billion-parameter diffusion model with classifier-free guidance generates images preferred over DALL-E by human raters and can be fine-tuned for text-guided inpainting.

  8. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

    cs.CV 2021-08 conditional novelty 7.0

    SDEdit performs guided image synthesis and editing by adding noise to inputs and refining them via denoising with a diffusion model's SDE prior, outperforming GAN methods in human studies without task-specific training.

  9. Diffusion Models Beat GANs on Image Synthesis

    cs.LG 2021-05 accept novelty 7.0

    Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.

  10. PG-3DGS: Optimizing 3D Gaussian Splatting to Satisfy Physics Objectives

    cs.CV 2026-05 unverdicted novelty 6.0

    PG-3DGS couples 3D Gaussian Splatting with differentiable physics so that optimized shapes satisfy both visual fidelity and physical objectives such as pouring and aerodynamic lift, with real-world 3D-printed validation.

  11. Diffusion model for SU(N) gauge theories

    hep-lat 2026-05 unverdicted novelty 6.0

    Implicit score matching trains diffusion models that successfully sample SU(3) Wilson gauge configurations on lattices, with a Hamiltonian-dynamics corrector needed for strong coupling.

  12. GCCM: Enhancing Generative Graph Prediction via Contrastive Consistency Model

    cs.AI 2026-05 unverdicted novelty 6.0

    GCCM prevents shortcut collapse in consistency models for graph prediction by using contrastive negative pairs and input feature perturbation, leading to better performance than deterministic baselines.

  13. Breaking Watermarks in the Frequency Domain: A Modulated Diffusion Attack Framework

    cs.CV 2026-04 unverdicted novelty 6.0

    FMDiffWA uses frequency-domain modulation inside diffusion sampling to neutralize watermarks in images while preserving visual quality and generalizing across watermarking schemes.

  14. Deepfake Detection Generalization with Diffusion Noise

    cs.CV 2026-04 unverdicted novelty 6.0

    ANL uses diffusion noise prediction and attention to regularize deepfake detectors for better generalization to unseen synthesis methods without added inference cost.

  15. Discrete Flow Maps

    stat.ML 2026-04 unverdicted novelty 6.0

    Discrete Flow Maps recast flow map training for discrete domains using simplex geometry to enable single-step text generation from noise and outperform prior discrete flow models.

  16. MuPPet: Multi-person 2D-to-3D Pose Lifting

    cs.CV 2026-04 unverdicted novelty 6.0

    MuPPet introduces person encoding, permutation augmentation, and dynamic multi-person attention to outperform prior single- and multi-person 2D-to-3D pose lifting methods on group interaction datasets while improving ...

  17. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    cs.CV 2024-03 conditional novelty 6.0

    Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.

  18. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    cs.CV 2023-11 conditional novelty 6.0

    Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...

  19. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    cs.CV 2023-07 conditional novelty 6.0

    SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...

  20. VideoGPT: Video Generation using VQ-VAE and Transformers

    cs.CV 2021-04 accept novelty 6.0

    VideoGPT generates competitive natural videos by learning discrete latents with VQ-VAE and modeling them autoregressively with a transformer.

  21. Mesh Based Simulations with Spatial and Temporal awareness

    cs.LG 2026-05 unverdicted novelty 5.0

    A unified training framework for mesh-based ML surrogates in CFD improves accuracy and long-horizon stability by enforcing spatial derivative consistency via multi-node prediction, using temporal cross-attention corre...

  22. Generating Satellite Imagery Data for Wildfire Detection through Mask-Conditioned Generative AI

    cs.CV 2026-04 conditional novelty 4.0

    A pre-trained Earth Observation diffusion model generates realistic post-wildfire Sentinel-2 imagery from burn masks via inpainting, achieving Burn IoU 0.456 and improved saliency over full generation.

  23. Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    cs.CV 2024-02 unverdicted novelty 2.0

    The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.
