arxiv: 1907.05600 · v3 · pith:IZHUHFCYnew · submitted 2019-07-12 · 💻 cs.LG · stat.ML

Generative Modeling by Estimating Gradients of the Data Distribution

Yang Song, Stefano Ermon

Pith reviewed 2026-05-17 13:56 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords generative modeling · score matching · Langevin dynamics · gradient estimation · image generation · denoising · sampling methods

The pith

A generative model learns gradients of noisy data distributions to drive annealed Langevin dynamics and produce samples without adversarial training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a generative approach that estimates the gradient of the log-density of the data, known as the score, by applying score matching to versions of the data corrupted by different amounts of Gaussian noise. This perturbation ensures the gradients remain well-defined even when the underlying data lies on a lower-dimensional surface. Samples are then drawn by running an annealed version of Langevin dynamics that begins at high noise and steadily reduces the noise level while following the learned gradients. The method uses ordinary neural networks and a regression-style loss, so it requires no sampling inside the training loop and no discriminator. If the central claim holds, the result is a stable training procedure that yields image samples comparable to those from GANs and representations that support tasks such as filling in missing pixels.

Core claim

The central claim is that one can build an effective generative model by jointly estimating the score functions of Gaussian-perturbed data distributions at multiple noise scales and then using those scores inside an annealed Langevin dynamics sampler that gradually lowers noise to produce points near the original data manifold.
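
To make the "jointly estimating the score functions ... at multiple noise scales" step concrete, here is a minimal sketch of a noise-weighted denoising score matching loss, estimated by Monte Carlo on a toy 1-D dataset. The toy data (two points at -1 and +1), the function names, and the sigma^2 weighting per level are assumptions chosen for illustration, not the authors' code. The closed-form `perturbed_score` stands in for a trained network.

```python
import math
import random

def perturbed_score(x, sigma):
    """Exact score d/dx log q_sigma(x) for toy data at -1 and +1
    after adding N(0, sigma^2) noise (an equal-weight two-Gaussian mixture)."""
    return (-x + math.tanh(x / sigma**2)) / sigma**2

def dsm_loss(score_fn, data, sigmas, n_draws=2000, rng=None):
    """Monte Carlo estimate of a denoising score matching loss,
    averaged over noise levels with an assumed sigma^2 weight per level."""
    rng = rng or random.Random(0)
    total = 0.0
    for sigma in sigmas:
        level = 0.0
        for _ in range(n_draws):
            x = rng.choice(data)                    # clean sample
            x_tilde = x + sigma * rng.gauss(0, 1)   # noisy sample
            target = -(x_tilde - x) / sigma**2      # score of N(x_tilde; x, sigma^2)
            level += 0.5 * (score_fn(x_tilde, sigma) - target) ** 2
        total += sigma**2 * level / n_draws         # weight the level by sigma^2
    return total / len(sigmas)

data = [-1.0, 1.0]
sigmas = [1.0, 0.5, 0.1]
loss_exact = dsm_loss(perturbed_score, data, sigmas)
loss_zero = dsm_loss(lambda x, sigma: 0.0, data, sigmas)  # a deliberately bad "model"
```

The regression target depends only on the clean sample and the injected noise, which is why no sampling from the model is needed during training. On this toy problem the exact perturbed score should score a lower loss than the all-zero baseline.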

What carries the argument

Score functions of noise-perturbed distributions, estimated by score matching and used to guide steps in annealed Langevin dynamics.
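
The sampling loop can be sketched on the same kind of toy problem. The example below runs annealed Langevin dynamics in 1-D, using a closed-form score for data at -1 and +1 in place of a learned network. The toy data, the uniform initialisation, and all names are assumptions for illustration. The step-size rule alpha_i = eps * sigma_i^2 / sigma_L^2 and the values eps = 2e-5 with 100 steps per level reflect our reading of the paper's sampler and should be checked against it.

```python
import math
import random

def perturbed_score(x, sigma):
    """Exact score of toy data at -1 and +1 after N(0, sigma^2) noise."""
    return (-x + math.tanh(x / sigma**2)) / sigma**2

def annealed_langevin(score_fn, sigmas, eps=2e-5, steps=100, rng=None):
    """Run Langevin dynamics at each noise level, largest sigma first,
    carrying the final state of one level into the next."""
    rng = rng or random.Random(0)
    x = rng.uniform(-2.0, 2.0)                  # assumed initialisation
    for sigma in sigmas:
        alpha = eps * sigma**2 / sigmas[-1]**2  # step size scaled to the noise level
        for _ in range(steps):
            z = rng.gauss(0, 1)
            x = x + 0.5 * alpha * score_fn(x, sigma) + math.sqrt(alpha) * z
    return x

# Geometric schedule from 1.0 down to 0.01 (10 levels, largest first).
sigmas = [10 ** (-2 * i / 9) for i in range(10)]
rng = random.Random(0)
samples = [annealed_langevin(perturbed_score, sigmas, rng=rng) for _ in range(200)]
```

Starting at a large noise level lets chains move between the two modes; by the final level each chain should have settled close to one of the two data points.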

If this is right

Training reduces to a single regression objective that can be used for direct model comparison without separate validation sampling.
Flexible neural architectures can be plugged in directly because no adversarial loss or inner sampling loop is required.
The same learned scores support conditional tasks such as image inpainting by guiding the dynamics toward observed pixels.
Sampling quality can be traded against speed by changing the noise schedule or the number of dynamics steps.
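
The inpainting consequence listed above can be illustrated with a small sketch. It uses a toy 2-D dataset with two points, (1, 1) and (-1, -1), observes the first coordinate, and fills in the second by clamping the known coordinate to a noisy copy of the observation after every Langevin step. This is one plausible reading of "guiding the dynamics toward observed pixels"; the dataset, names, and clamping details are assumptions, and the paper's exact procedure may differ.

```python
import math
import random

def perturbed_score(x, sigma):
    """Exact score of the two-point dataset {(1,1), (-1,-1)} after
    isotropic N(0, sigma^2 I) noise."""
    t = math.tanh((x[0] + x[1]) / sigma**2)   # difference of mode responsibilities
    return [(-x[0] + t) / sigma**2, (-x[1] + t) / sigma**2]

def inpaint(observed, mask, sigmas, eps=2e-5, steps=100, rng=None):
    """Annealed Langevin where coordinates with mask == 1 are clamped to a
    noisy copy of `observed`; coordinates with mask == 0 are filled in."""
    rng = rng or random.Random(0)
    x = [rng.uniform(-2.0, 2.0) for _ in observed]
    for sigma in sigmas:
        alpha = eps * sigma**2 / sigmas[-1]**2
        # Perturb the observation so it matches the current noise level.
        y = [o + sigma * rng.gauss(0, 1) for o in observed]
        for _ in range(steps):
            s = perturbed_score(x, sigma)
            x = [xi + 0.5 * alpha * si + math.sqrt(alpha) * rng.gauss(0, 1)
                 for xi, si in zip(x, s)]
            x = [yi if m else xi for xi, yi, m in zip(x, y, mask)]
    return x

sigmas = [10 ** (-2 * i / 9) for i in range(10)]   # 1.0 down to 0.01
rng = random.Random(0)
# Observe first coordinate = 1.0; the second entry of `observed` is a placeholder.
results = [inpaint([1.0, 0.0], [1, 0], sigmas, rng=rng) for _ in range(50)]
```

With the first coordinate pinned near 1, the filled-in second coordinate should end up near 1 in most runs, since (1, 1) is the only data point consistent with the observation. Reducing `steps` or the number of levels is the speed-versus-quality knob mentioned in the last bullet.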

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same gradient fields might be used to interpolate between data points by following the estimated trajectories at low noise.
Replacing Gaussian noise with other structured corruptions could adapt the method to non-image domains such as sequences or graphs.
Because the objective directly penalizes gradient error, it may naturally encourage better coverage of multiple modes than minimax objectives.

Load-bearing premise

Adding multiple levels of Gaussian noise renders the otherwise ill-defined gradients on low-dimensional data manifolds both estimable and sufficient to drive high-quality sampling.

What would settle it

If running the annealed Langevin sampler on the learned scores produces samples whose quality metrics on CIFAR-10 fall well below the reported inception score or whose inpainting results remain incoherent, the central claim would be falsified.

Original abstract

We introduce a new generative model where samples are produced via Langevin dynamics using gradients of the data distribution estimated with score matching. Because gradients can be ill-defined and hard to estimate when the data resides on low-dimensional manifolds, we perturb the data with different levels of Gaussian noise, and jointly estimate the corresponding scores, i.e., the vector fields of gradients of the perturbed data distribution for all noise levels. For sampling, we propose an annealed Langevin dynamics where we use gradients corresponding to gradually decreasing noise levels as the sampling process gets closer to the data manifold. Our framework allows flexible model architectures, requires no sampling during training or the use of adversarial methods, and provides a learning objective that can be used for principled model comparisons. Our models produce samples comparable to GANs on MNIST, CelebA and CIFAR-10 datasets, achieving a new state-of-the-art inception score of 8.87 on CIFAR-10. Additionally, we demonstrate that our models learn effective representations via image inpainting experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper shows how to generate images by jointly learning scores across noise scales with denoising score matching and then sampling via annealed Langevin dynamics, delivering GAN-level results on standard benchmarks without adversarial training.

The main takeaway is that perturbing data at multiple Gaussian noise levels, training one network to estimate the score for every level, and then running Langevin dynamics while annealing the noise produces samples comparable to those of contemporary GANs on MNIST, CelebA, and CIFAR-10. The reported inception score of 8.87 on CIFAR-10 was a new state of the art at the time, and the inpainting experiments indicate the learned scores capture useful structure for downstream tasks.

The training objective is straightforward regression and requires no sampling or adversaries, which is a practical advantage over many alternatives then available. Model architecture is also flexible since the loss is just a weighted squared error on the score estimates. The paper spells out the equivalence to denoising score matching and gives a clear description of the geometric noise schedule and the discrete annealed sampler. Those pieces are reproducible from the text and the reported numbers line up with what the method should deliver when the schedule is followed.

The central limitation is that the argument for why the finite-step, approximate-score chain ends up close to the data distribution stays mostly heuristic. There is no explicit bound on total variation or Wasserstein distance that accounts for score error accumulating across the annealing stages, so it is not obvious how sensitive the final sample quality is to the particular choice of step sizes and noise levels. The empirical results are strong enough that this does not sink the contribution, but a reader would still want to see more analysis or ablation on schedule robustness.

This paper is for people working on generative models who are looking for non-adversarial baselines or who already care about score matching and Langevin methods. It is worth a serious referee because the framework is new, the experiments are solid on the benchmarks used, and the idea has clear follow-on potential even if the convergence story could be tightened.

Referee Report

1 major / 2 minor

Summary. The paper introduces a generative modeling approach that estimates the score function (gradients of the log-density) of data distributions perturbed by Gaussian noise at multiple scales via denoising score matching. Samples are then produced using annealed Langevin dynamics that progressively decreases the noise level. The authors report that the resulting models generate samples comparable to GANs on MNIST, CelebA, and CIFAR-10, achieving a new state-of-the-art Inception Score of 8.87 on CIFAR-10, while also learning representations useful for image inpainting.

Significance. If the empirical results hold, the work supplies a non-adversarial alternative to GANs that uses a well-defined objective, avoids sampling during training, and handles manifold-supported data through explicit multi-scale perturbation. The reported Inception Score and inpainting results constitute concrete evidence of practical performance on standard image benchmarks.

major comments (1)

[4.2] §4.2: The annealed Langevin sampler is specified with a geometric noise schedule (σ_1 > … > σ_L) and per-level step sizes α_i, yet no convergence bound or total-variation/Wasserstein guarantee is given that relates the finite-step, approximate-score trajectory to the target data distribution. Because the central claim of GAN-comparable sample quality rests on this sampler producing high-fidelity outputs, the absence of such analysis leaves open the possibility that the observed IS of 8.87 is schedule-dependent rather than a general property of the method.

minor comments (2)

[5] The experimental protocol for the Inception Score computation (number of samples, classifier details, and comparison baselines) should be stated explicitly in §5 to allow direct reproduction.
[3] Notation for the noise-conditional score network s_θ(x, σ) could be introduced earlier and used consistently when describing the joint training objective in §3.
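
For reference on the first minor comment, the Inception Score is usually defined as exp(E_x[KL(p(y|x) || p(y))]), where p(y|x) is a classifier's predicted class distribution for a generated sample and p(y) is the average of those predictions. The sketch below computes that quantity from a list of probability vectors. The function name is hypothetical, and the sample count, classifier, and any averaging over splits are exactly the protocol details the comment asks the authors to state, so they are left out here.

```python
import math

def inception_score(probs):
    """Compute exp(mean KL(p(y|x) || p(y))) from per-sample class
    probabilities. `probs` is a list of equal-length probability lists."""
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y): average of the per-sample predictions.
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    kl_sum = 0.0
    for p in probs:
        # Terms with p_j == 0 contribute 0 to the KL divergence by convention.
        kl_sum += sum(pj * math.log(pj / marginal[j])
                      for j, pj in enumerate(p) if pj > 0)
    return math.exp(kl_sum / n)
```

The score is 1 when every prediction equals the marginal and reaches the number of classes when predictions are one-hot and evenly spread across classes, which is why both confident and diverse samples are needed for a high value.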

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and valuable comments on our manuscript. We address the major comment regarding the convergence analysis of the annealed Langevin sampler below.

Referee: [4.2] §4.2: The annealed Langevin sampler is specified with a geometric noise schedule (σ_1 > … > σ_L) and per-level step sizes α_i, yet no convergence bound or total-variation/Wasserstein guarantee is given that relates the finite-step, approximate-score trajectory to the target data distribution. Because the central claim of GAN-comparable sample quality rests on this sampler producing high-fidelity outputs, the absence of such analysis leaves open the possibility that the observed IS of 8.87 is schedule-dependent rather than a general property of the method.

Authors: We agree that the manuscript does not supply a formal convergence bound or total-variation/Wasserstein guarantee for the finite-step annealed Langevin trajectory under approximate scores. Deriving such guarantees is technically difficult because the scores are learned rather than exact and the number of steps is finite. Our primary contribution lies in the multi-scale denoising score matching objective that enables stable score estimation on manifold-supported data; the annealed Langevin procedure is presented as a practical sampling strategy that progressively reduces noise. We validate this strategy empirically across MNIST, CelebA, and CIFAR-10, obtaining sample quality on par with GANs and a new state-of-the-art Inception Score of 8.87. While we acknowledge that the results could in principle be schedule-dependent, the geometric schedule is chosen to span the relevant noise range and the reported performance is consistent across datasets. We are prepared to add a brief discussion of the empirical behavior of the sampler and the rationale for the chosen schedule in a revised Section 4.2.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained with empirical validation

The paper derives a generative procedure from score matching on noise-perturbed data distributions followed by annealed Langevin dynamics. The central objective in §3.2 is shown to equal denoising score matching by algebraic rearrangement of the standard score-matching loss, which is an independent derivation rather than a self-definition or fitted input renamed as prediction. Sampling in §4.2 applies the estimated scores at decreasing noise levels without reducing the reported sample quality or IS=8.87 to any input parameter by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to justify the core claims; performance is measured against external benchmarks (GANs, Inception Score) on standard datasets. The derivation chain therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The approach assumes the validity of score matching for gradient estimation and the utility of multi-scale noise for handling manifold issues, with no new entities postulated. The central claim rests on these domain assumptions rather than new free parameters or invented entities.

free parameters (1)

Gaussian noise levels
Multiple noise levels are used, but the selection criteria are not detailed in the abstract.
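
As a hedged illustration of what this free parameter looks like in practice, the sketch below builds a geometric noise schedule and the matching per-level Langevin step sizes. The defaults (10 levels from 1.0 down to 0.01, eps = 2e-5) are the values we recall from the paper's experiments and should be treated as assumptions to verify. The function names are hypothetical, and levels are indexed here so that the first entry is the largest.

```python
def geometric_sigmas(sigma_max=1.0, sigma_min=0.01, num_levels=10):
    """Noise levels with a constant ratio between neighbours, largest first."""
    ratio = (sigma_min / sigma_max) ** (1.0 / (num_levels - 1))
    return [sigma_max * ratio**i for i in range(num_levels)]

def step_sizes(sigmas, eps=2e-5):
    """Per-level step sizes alpha_i = eps * sigma_i^2 / sigma_min^2,
    where sigma_min is the last (smallest) entry of `sigmas`."""
    return [eps * (s / sigmas[-1]) ** 2 for s in sigmas]

sigmas = geometric_sigmas()
alphas = step_sizes(sigmas)
```

Scaling the step size with sigma^2 keeps the ratio alpha_i / sigma_i^2 the same at every level, so the drift term stays in roughly fixed proportion to the injected noise as the schedule anneals. The endpoints, the number of levels, and eps remain choices the user has to make.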

axioms (2)

domain assumption: Score matching can be used to estimate gradients of the data distribution
Central to estimating the vector fields for Langevin dynamics.
domain assumption: Perturbing data with Gaussian noise makes gradients well-defined on manifolds
Addressed in the abstract as the reason for perturbation.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.DimensionForcing dimension_forced unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

achieving a new state-of-the-art inception score of 8.87 on CIFAR-10

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Generative models on phase space
hep-ph 2026-04 unverdicted novelty 8.0

Generative diffusion and flow models are constructed to remain exactly on the Lorentz-invariant massless N-particle phase space manifold during sampling for particle physics applications.
Inferring Active Neural Circuits Using Diffusion Scores
q-bio.NC 2026-05 unverdicted novelty 7.0

SBTG recovers the Jacobian of the nonlinear transition map between brain states by multiplying cross-block scores from denoising models, enabling inference of lag-specific directed interactions in neural population da...
VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion
cs.AI 2026-04 unverdicted novelty 7.0

FVD applies Fleming-Viot population dynamics to diffusion model sampling at inference time to reduce diversity collapse while improving reward alignment and FID scores.
pop-cosmos: Star formation over 12 Gyr from generative modelling of a deep infrared-selected galaxy catalogue
astro-ph.GA 2025-09 unverdicted novelty 7.0

A score-based diffusion generative model on deep infrared galaxy photometry yields a star formation rate density peaking at z=1.3 and shows distinct non-parametric star formation histories plus AGN activity peaking du...
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
cs.LG 2025-02 unverdicted novelty 7.0

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
Diffusion Models Beat GANs on Image Synthesis
cs.LG 2021-05 accept novelty 7.0

Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.
PG-3DGS: Optimizing 3D Gaussian Splatting to Satisfy Physics Objectives
cs.CV 2026-05 unverdicted novelty 6.0

PG-3DGS couples 3D Gaussian Splatting with differentiable physics so that optimized shapes satisfy both visual fidelity and physical objectives such as pouring and aerodynamic lift, with real-world 3D-printed validation.
Diffusion model for SU(N) gauge theories
hep-lat 2026-05 unverdicted novelty 6.0

Implicit score matching trains diffusion models that successfully sample SU(3) Wilson gauge configurations on lattices, with a Hamiltonian-dynamics corrector needed for strong coupling.
A unified perspective on fine-tuning and sampling with diffusion and flow models
stat.ML 2026-04 unverdicted novelty 6.0

A unified framework for exponential tilting in diffusion and flow models that includes bias-variance decompositions showing finite gradient variance for some methods, norm bounds on adjoint ODEs, and adapted losses wi...
VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion
cs.AI 2026-04 unverdicted novelty 6.0

VASR separates continuation and residual variance in reward-guided diffusion SMC, using optimal mass allocation and systematic resampling to achieve up to 26% better FID scores and faster runtimes than prior SMC and M...
Adjoint Matching through the Lens of the Stochastic Maximum Principle in Optimal Control
math.OC 2026-03 unverdicted novelty 6.0

Adjoint matching objectives derived from the Stochastic Maximum Principle have critical points satisfying HJB stationarity conditions for SOC problems with control-dependent drift and diffusion.
MIOFlow 2.0: A unified framework for inferring cellular stochastic dynamics from single cell and spatial transcriptomics data
cs.LG 2026-03 unverdicted novelty 6.0

MIOFlow 2.0 learns stochastic cellular trajectories from transcriptomics data via neural SDEs, unbalanced optimal transport for growth, and a joint latent space unifying gene expression with spatial features.
A probabilistic framework for crystal structure denoising, phase classification, and order parameters
cond-mat.mtrl-sci 2025-12 unverdicted novelty 6.0

A unified probabilistic model uses per-atom logits over crystal prototypes to denoise atomic configurations, classify phases, and derive order parameters from a single differentiable scalar field.
EnScale: Temporally-consistent multivariate generative downscaling via proper scoring rules
physics.ao-ph 2025-09 unverdicted novelty 6.0

EnScale emulates high-resolution regional climate model outputs from global circulation models for multiple variables using a two-step generative process with sparse local stochastic layers and energy score optimizati...
Shap-E: Generating Conditional 3D Implicit Functions
cs.CV 2023-05 accept novelty 6.0

Shap-E encodes 3D assets into implicit function parameters then uses a conditional diffusion model to generate new ones from text, enabling fast multi-representation 3D asset creation.
HuggingFace's Transformers: State-of-the-art Natural Language Processing
cs.CL 2019-10 accept novelty 6.0

Hugging Face releases an open-source Python library that supplies a unified API and pretrained weights for major Transformer architectures used in natural language processing.
Scaling Properties of Continuous Diffusion Spoken Language Models
cs.CL 2026-04 unverdicted novelty 5.0

Continuous diffusion spoken language models follow scaling laws for loss and phoneme divergence and generate emotive multi-speaker speech at 16B scale, though long-form coherence stays difficult.
Rethinking the Diffusion Model from a Langevin Perspective
cs.LG 2026-04 unverdicted novelty 5.0

Diffusion models are reorganized under a Langevin perspective that unifies ODE and SDE formulations and shows flow matching is equivalent to denoising under maximum likelihood.
A Review of Diffusion-based Simulation-Based Inference: Foundations and Applications in Non-Ideal Data Scenarios
cs.LG 2025-12 accept novelty 2.0

A synthesis of diffusion-based simulation-based inference methods that address model misspecification, irregular observations, and missing data in scientific applications.
