pith. machine review for the scientific record.

arxiv: 1907.05600 · v3 · pith:IZHUHFCYnew · submitted 2019-07-12 · 💻 cs.LG · stat.ML

Generative Modeling by Estimating Gradients of the Data Distribution

Pith reviewed 2026-05-17 13:56 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords generative modeling · score matching · Langevin dynamics · gradient estimation · image generation · denoising · sampling methods

The pith

A generative model learns gradients of noisy data distributions to drive annealed Langevin dynamics and produce samples without adversarial training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a generative approach that estimates the gradient of the log-density of the data, known as the score, by applying score matching to versions of the data corrupted by different amounts of Gaussian noise. This perturbation ensures the gradients remain well-defined even when the underlying data lies on a lower-dimensional surface. Samples are then drawn by running an annealed version of Langevin dynamics that begins at high noise and steadily reduces the noise level while following the learned gradients. The method uses ordinary neural networks and a regression-style loss, so it requires no sampling inside the training loop and no discriminator. If the central claim holds, the result is a stable training procedure that yields image samples comparable to those from GANs and representations that support tasks such as filling in missing pixels.
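The regression-style loss described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `score_fn` is a hypothetical stand-in for the neural network, and the `sigma**2` weighting per noise level and the toy data are assumptions made here.

```python
import numpy as np

def dsm_loss(score_fn, x, sigmas, rng):
    """Denoising score matching loss averaged over several noise levels.

    score_fn(x_noisy, sigma) is a hypothetical model returning an array
    shaped like x_noisy. Each level is weighted by sigma**2 (an assumption)
    so that the terms have comparable scale.
    """
    total = 0.0
    for sigma in sigmas:
        noise = rng.standard_normal(x.shape)
        x_noisy = x + sigma * noise
        # Score of the Gaussian perturbation kernel N(x_noisy; x, sigma^2 I).
        target = -(x_noisy - x) / sigma**2
        residual = score_fn(x_noisy, sigma) - target
        per_sample = 0.5 * np.sum(residual**2, axis=1)
        total += sigma**2 * per_sample.mean()
    return total / len(sigmas)


# Toy check: if every data point equals x0, the perturbed distribution is
# N(x0, sigma^2 I), so its exact score should attain zero loss.
x0 = np.array([1.0, -2.0])
data = np.tile(x0, (256, 1))

def exact_score(x_noisy, sigma):
    return -(x_noisy - x0) / sigma**2

def zero_score(x_noisy, sigma):
    return np.zeros_like(x_noisy)

rng = np.random.default_rng(0)
sigmas = [1.0, 0.5, 0.1]
loss_exact = dsm_loss(exact_score, data, sigmas, rng)
loss_zero = dsm_loss(zero_score, data, sigmas, rng)
print(loss_exact, loss_zero)
```

Note that the loss only needs noisy samples and a closed-form target, which is why no sampling from the model appears inside the training loop.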

Core claim

The central claim is that one can build an effective generative model by jointly estimating the score functions of Gaussian-perturbed data distributions at multiple noise scales and then using those scores inside an annealed Langevin dynamics sampler that gradually lowers noise to produce points near the original data manifold.

What carries the argument

Score functions of noise-perturbed distributions, estimated by score matching and used to guide steps in annealed Langevin dynamics.
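A minimal sketch of such a sampler follows, with an exact Gaussian score standing in for a learned network. The step-size rule `alpha = eps * (sigma / sigma_min)**2`, the values of `eps` and `steps`, the noise range, and the toy target are all assumptions chosen for illustration.

```python
import numpy as np

def annealed_langevin(score_fn, x_init, sigmas, eps=2e-5, steps=100, rng=None):
    """Run Langevin updates at each noise level, from largest sigma to smallest."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x_init, dtype=float)
    sigma_min = sigmas[-1]
    for sigma in sigmas:
        # Step size shrinks with the noise level (assumed scaling rule).
        alpha = eps * (sigma / sigma_min) ** 2
        for _ in range(steps):
            z = rng.standard_normal(x.shape)
            x = x + 0.5 * alpha * score_fn(x, sigma) + np.sqrt(alpha) * z
    return x


# Toy target: 1-D Gaussian data N(mu, s^2). Adding N(0, sigma^2) noise gives
# N(mu, s^2 + sigma^2), whose score is known in closed form.
mu, s = 3.0, 0.5

def gaussian_score(x, sigma):
    return -(x - mu) / (s**2 + sigma**2)

sigmas = np.geomspace(10.0, 0.01, num=10)  # geometric and decreasing
rng = np.random.default_rng(0)
x_init = rng.uniform(-1.0, 1.0, size=(5000, 1))
samples = annealed_langevin(gaussian_score, x_init, sigmas, rng=rng)
sample_mean = float(samples.mean())
sample_std = float(samples.std())
print(sample_mean, sample_std)
```

In this toy setting the early, high-noise levels move the chains to the right region quickly, while the late, low-noise levels only refine them, which is the intuition behind annealing.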

If this is right

  • Training reduces to a single regression objective that can be used for direct model comparison without separate validation sampling.
  • Flexible neural architectures can be plugged in directly because no adversarial loss or inner sampling loop is required.
  • The same learned scores support conditional tasks such as image inpainting by guiding the dynamics toward observed pixels.
  • Sampling quality can be traded against speed by changing the noise schedule or the number of dynamics steps.
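The inpainting point in the list above can be illustrated with a toy example. The sketch below overwrites the observed coordinates after every Langevin step with a noised copy of the observation. Whether this matches the paper's exact procedure is an assumption, and the two-dimensional correlated Gaussian, the mask layout, and all hyperparameters are hypothetical.

```python
import numpy as np

def inpaint(score_fn, x_obs, mask, sigmas, eps=2e-5, steps=100, rng=None):
    """Annealed Langevin sampling that keeps observed entries tied to x_obs.

    mask is 1 where a coordinate is observed and 0 where it must be filled in.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = rng.uniform(-1.0, 1.0, size=x_obs.shape)
    sigma_min = sigmas[-1]
    for sigma in sigmas:
        alpha = eps * (sigma / sigma_min) ** 2
        # Observed values, noised to match the current noise level.
        y = x_obs + sigma * rng.standard_normal(x_obs.shape)
        for _ in range(steps):
            z = rng.standard_normal(x.shape)
            x = x + 0.5 * alpha * score_fn(x, sigma) + np.sqrt(alpha) * z
            x = x * (1 - mask) + y * mask  # overwrite the observed entries
    return x


# Toy "image" with two pixels drawn from a correlated Gaussian.
rho = 0.9
cov = np.array([[1.0, rho], [rho, 1.0]])

def gaussian_score(x, sigma):
    # Exact score of N(0, cov + sigma^2 I); the precision matrix is symmetric.
    precision = np.linalg.inv(cov + sigma**2 * np.eye(2))
    return -x @ precision

observed_value = 2.0
x_obs = np.tile([observed_value, 0.0], (2000, 1))  # second pixel is unknown
mask = np.tile([1.0, 0.0], (2000, 1))
sigmas = np.geomspace(10.0, 0.01, num=10)
filled = inpaint(gaussian_score, x_obs, mask, sigmas, rng=np.random.default_rng(0))
print(filled[:, 1].mean())
```

For this Gaussian the exact conditional mean of the missing pixel is `rho * observed_value`, so the printed average gives a rough sense of how closely a finite schedule tracks it.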

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same gradient fields might be used to interpolate between data points by following the estimated trajectories at low noise.
  • Replacing Gaussian noise with other structured corruptions could adapt the method to non-image domains such as sequences or graphs.
  • Because the objective directly penalizes gradient error, it may naturally encourage better coverage of multiple modes than minimax objectives.

Load-bearing premise

Adding multiple levels of Gaussian noise renders the otherwise ill-defined gradients on low-dimensional data manifolds both estimable and sufficient to drive high-quality sampling.
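A small numerical sketch may make this premise concrete. Assume, purely for illustration, data that lies on a line in the plane: points (u, 0) with u drawn from N(0, 1). After adding Gaussian noise of scale sigma, the distribution is N(0, diag(1 + sigma^2, sigma^2)) and its score is finite everywhere, but the component pointing off the line grows like 1/sigma^2 as the noise vanishes.

```python
import numpy as np

def perturbed_line_score(x, sigma):
    """Exact score of N(0, diag(1 + sigma^2, sigma^2)).

    This is the distribution of (u, 0) + sigma * noise with u ~ N(0, 1), i.e.
    data supported on the horizontal axis, blurred by Gaussian noise.
    """
    x = np.asarray(x, dtype=float)
    along = -x[..., 0] / (1.0 + sigma**2)  # component along the data line
    across = -x[..., 1] / sigma**2         # component off the data line
    return np.stack([along, across], axis=-1)

point = np.array([0.5, 0.2])  # a location slightly off the line
scores = {sigma: perturbed_line_score(point, sigma) for sigma in (1.0, 0.1, 0.01)}
for sigma, value in scores.items():
    print(f"sigma={sigma:<5} score={value}")
```

At sigma = 0 the density off the line is zero and the score there is undefined, which is the situation the multi-level perturbation is meant to avoid.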

What would settle it

If running the annealed Langevin sampler on the learned scores produces samples whose quality metrics on CIFAR-10 fall well below the reported inception score or whose inpainting results remain incoherent, the central claim would be falsified.

read the original abstract

We introduce a new generative model where samples are produced via Langevin dynamics using gradients of the data distribution estimated with score matching. Because gradients can be ill-defined and hard to estimate when the data resides on low-dimensional manifolds, we perturb the data with different levels of Gaussian noise, and jointly estimate the corresponding scores, i.e., the vector fields of gradients of the perturbed data distribution for all noise levels. For sampling, we propose an annealed Langevin dynamics where we use gradients corresponding to gradually decreasing noise levels as the sampling process gets closer to the data manifold. Our framework allows flexible model architectures, requires no sampling during training or the use of adversarial methods, and provides a learning objective that can be used for principled model comparisons. Our models produce samples comparable to GANs on MNIST, CelebA and CIFAR-10 datasets, achieving a new state-of-the-art inception score of 8.87 on CIFAR-10. Additionally, we demonstrate that our models learn effective representations via image inpainting experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces a generative modeling approach that estimates the score function (gradients of the log-density) of data distributions perturbed by Gaussian noise at multiple scales via denoising score matching. Samples are then produced using annealed Langevin dynamics that progressively decreases the noise level. The authors report that the resulting models generate samples comparable to GANs on MNIST, CelebA, and CIFAR-10, achieving a new state-of-the-art Inception Score of 8.87 on CIFAR-10, while also learning representations useful for image inpainting.

Significance. If the empirical results hold, the work supplies a non-adversarial alternative to GANs that uses a well-defined objective, avoids sampling during training, and handles manifold-supported data through explicit multi-scale perturbation. The reported Inception Score and inpainting results constitute concrete evidence of practical performance on standard image benchmarks.

major comments (1)
  1. [4.2] §4.2: The annealed Langevin sampler is specified with a geometric noise schedule (σ_L > … > σ_1) and fixed step sizes α_t, yet no convergence bound or total-variation/Wasserstein guarantee is given that relates the finite-step, approximate-score trajectory to the target data distribution. Because the central claim of GAN-comparable sample quality rests on this sampler producing high-fidelity outputs, the absence of such analysis leaves open the possibility that the observed IS of 8.87 is schedule-dependent rather than a general property of the method.
minor comments (2)
  1. [5] The experimental protocol for the Inception Score computation (number of samples, classifier details, and comparison baselines) should be stated explicitly in §5 to allow direct reproduction.
  2. [3] Notation for the noise-conditional score network s_θ(x, σ) could be introduced earlier and used consistently when describing the joint training objective in §3.
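The schedule-dependence concern in the major comment can be made concrete in a case where everything is computable. For a zero-mean Gaussian target with variance `v` and an exact score, the discretized Langevin update is a linear recursion, so both its stationary variance and its mixing speed have closed forms. The sketch below is an illustration under those assumptions only and says nothing about learned scores on images.

```python
import numpy as np

def langevin_stationary_variance(v, alpha):
    """Stationary variance of x <- x - (alpha / (2 v)) x + sqrt(alpha) z.

    This is the discretized Langevin chain for a zero-mean Gaussian target
    with variance v. The formula is valid for 0 < alpha < 4 v.
    """
    return v / (1.0 - alpha / (4.0 * v))

def steps_to_shrink(v, alpha, factor=0.01):
    """Steps needed for the dependence on the starting point to shrink by `factor`."""
    contraction = abs(1.0 - alpha / (2.0 * v))
    return int(np.ceil(np.log(factor) / np.log(contraction)))

v = 0.25  # hypothetical target variance
for alpha in (1e-1, 1e-2, 1e-3):
    bias = langevin_stationary_variance(v, alpha) / v - 1.0
    print(f"alpha={alpha:g}  relative variance bias={bias:.4f}  "
          f"steps to forget start={steps_to_shrink(v, alpha)}")
```

Larger steps mix faster but inflate the variance, and smaller steps do the opposite, so even in this idealized case the output of a finite-step sampler depends on the chosen step sizes.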

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and valuable comments on our manuscript. We address the major comment regarding the convergence analysis of the annealed Langevin sampler below.

Point-by-point responses
  1. Referee: [4.2] §4.2: The annealed Langevin sampler is specified with a geometric noise schedule (σ_L > … > σ_1) and fixed step sizes α_t, yet no convergence bound or total-variation/Wasserstein guarantee is given that relates the finite-step, approximate-score trajectory to the target data distribution. Because the central claim of GAN-comparable sample quality rests on this sampler producing high-fidelity outputs, the absence of such analysis leaves open the possibility that the observed IS of 8.87 is schedule-dependent rather than a general property of the method.

    Authors: We agree that the manuscript does not supply a formal convergence bound or total-variation/Wasserstein guarantee for the finite-step annealed Langevin trajectory under approximate scores. Deriving such guarantees is technically difficult because the scores are learned rather than exact and the number of steps is finite. Our primary contribution lies in the multi-scale denoising score matching objective that enables stable score estimation on manifold-supported data; the annealed Langevin procedure is presented as a practical sampling strategy that progressively reduces noise. We validate this strategy empirically across MNIST, CelebA, and CIFAR-10, obtaining sample quality on par with GANs and a new state-of-the-art Inception Score of 8.87. While we acknowledge that the results could in principle be schedule-dependent, the geometric schedule is chosen to span the relevant noise range and the reported performance is consistent across datasets. We are prepared to add a brief discussion of the empirical behavior of the sampler and the rationale for the chosen schedule in a revised Section 4.2.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained with empirical validation

full rationale

The paper derives a generative procedure from score matching on noise-perturbed data distributions followed by annealed Langevin dynamics. The central objective in §3.2 is shown to equal denoising score matching by algebraic rearrangement of the standard score-matching loss, which is an independent derivation rather than a self-definition or fitted input renamed as prediction. Sampling in §4.2 applies the estimated scores at decreasing noise levels without reducing the reported sample quality or IS=8.87 to any input parameter by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to justify the core claims; performance is measured against external benchmarks (GANs, Inception Score) on standard datasets. The derivation chain therefore remains independent of its own outputs.
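The rearrangement referred to above can be sketched as follows. This is one standard route to the equivalence (the argument usually attributed to Vincent, 2011), written in notation assumed here rather than copied from the paper: p is the data distribution, q_sigma(x~ | x) is the Gaussian perturbation kernel, and C_1, C_2 collect terms that do not depend on theta.

```latex
% Assumed notation: q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x}; x, \sigma^2 I),
% q_\sigma(\tilde{x}) = \int p(x)\, q_\sigma(\tilde{x} \mid x)\, dx.
\begin{align*}
J_{\mathrm{ESM}}(\theta)
  &= \tfrac{1}{2}\, \mathbb{E}_{q_\sigma(\tilde{x})}
     \big\| s_\theta(\tilde{x}, \sigma) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}) \big\|^2 \\
  &= \tfrac{1}{2}\, \mathbb{E}_{q_\sigma(\tilde{x})} \big\| s_\theta(\tilde{x}, \sigma) \big\|^2
     - \mathbb{E}_{q_\sigma(\tilde{x})}
       \big\langle s_\theta(\tilde{x}, \sigma), \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}) \big\rangle
     + C_1 . \\[4pt]
\mathbb{E}_{q_\sigma(\tilde{x})}
  \big\langle s_\theta(\tilde{x}, \sigma), \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}) \big\rangle
  &= \int \big\langle s_\theta(\tilde{x}, \sigma), \nabla_{\tilde{x}} q_\sigma(\tilde{x}) \big\rangle \, d\tilde{x} \\
  &= \iint p(x)\, q_\sigma(\tilde{x} \mid x)\,
     \big\langle s_\theta(\tilde{x}, \sigma), \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \big\rangle
     \, dx \, d\tilde{x} . \\[4pt]
J_{\mathrm{ESM}}(\theta)
  &= \tfrac{1}{2}\, \mathbb{E}_{p(x)\, q_\sigma(\tilde{x} \mid x)}
     \big\| s_\theta(\tilde{x}, \sigma) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \big\|^2 + C_2 \\
  &= \tfrac{1}{2}\, \mathbb{E}_{p(x)\, q_\sigma(\tilde{x} \mid x)}
     \Big\| s_\theta(\tilde{x}, \sigma) + \frac{\tilde{x} - x}{\sigma^2} \Big\|^2 + C_2 .
\end{align*}
```

The last line uses the Gaussian kernel's score, which is -(x~ - x) / sigma^2, so the regression target is available in closed form even though the score of the perturbed data distribution is not.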

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The approach assumes the validity of score matching for gradient estimation and the utility of multi-scale noise for handling manifold issues, with no new entities postulated. The central claim rests on these domain assumptions rather than new free parameters or invented entities.

free parameters (1)
  • Gaussian noise levels
    Multiple noise levels are used, but the selection criteria are not detailed in the abstract.
axioms (2)
  • domain assumption: Score matching can be used to estimate gradients of the data distribution
    Central to estimating the vector fields for Langevin dynamics.
  • domain assumption: Perturbing data with Gaussian noise makes gradients well-defined on manifolds
    Addressed in the abstract as the reason for perturbation.

pith-pipeline@v0.9.0 · 5468 in / 1542 out tokens · 72418 ms · 2026-05-17T13:56:41.052084+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Generative models on phase space

    hep-ph 2026-04 unverdicted novelty 8.0

    Generative diffusion and flow models are constructed to remain exactly on the Lorentz-invariant massless N-particle phase space manifold during sampling for particle physics applications.

  2. Inferring Active Neural Circuits Using Diffusion Scores

    q-bio.NC 2026-05 unverdicted novelty 7.0

    SBTG recovers the Jacobian of the nonlinear transition map between brain states by multiplying cross-block scores from denoising models, enabling inference of lag-specific directed interactions in neural population da...

  3. VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion

    cs.AI 2026-04 unverdicted novelty 7.0

    FVD applies Fleming-Viot population dynamics to diffusion model sampling at inference time to reduce diversity collapse while improving reward alignment and FID scores.

  4. pop-cosmos: Star formation over 12 Gyr from generative modelling of a deep infrared-selected galaxy catalogue

    astro-ph.GA 2025-09 unverdicted novelty 7.0

    A score-based diffusion generative model on deep infrared galaxy photometry yields a star formation rate density peaking at z=1.3 and shows distinct non-parametric star formation histories plus AGN activity peaking du...

  5. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    cs.LG 2025-02 unverdicted novelty 7.0

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  6. Diffusion Models Beat GANs on Image Synthesis

    cs.LG 2021-05 accept novelty 7.0

    Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.

  7. PG-3DGS: Optimizing 3D Gaussian Splatting to Satisfy Physics Objectives

    cs.CV 2026-05 unverdicted novelty 6.0

    PG-3DGS couples 3D Gaussian Splatting with differentiable physics so that optimized shapes satisfy both visual fidelity and physical objectives such as pouring and aerodynamic lift, with real-world 3D-printed validation.

  8. Diffusion model for SU(N) gauge theories

    hep-lat 2026-05 unverdicted novelty 6.0

    Implicit score matching trains diffusion models that successfully sample SU(3) Wilson gauge configurations on lattices, with a Hamiltonian-dynamics corrector needed for strong coupling.

  9. A unified perspective on fine-tuning and sampling with diffusion and flow models

    stat.ML 2026-04 unverdicted novelty 6.0

    A unified framework for exponential tilting in diffusion and flow models that includes bias-variance decompositions showing finite gradient variance for some methods, norm bounds on adjoint ODEs, and adapted losses wi...

  10. VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion

    cs.AI 2026-04 unverdicted novelty 6.0

    VASR separates continuation and residual variance in reward-guided diffusion SMC, using optimal mass allocation and systematic resampling to achieve up to 26% better FID scores and faster runtimes than prior SMC and M...

  11. Adjoint Matching through the Lens of the Stochastic Maximum Principle in Optimal Control

    math.OC 2026-03 unverdicted novelty 6.0

    Adjoint matching objectives derived from the Stochastic Maximum Principle have critical points satisfying HJB stationarity conditions for SOC problems with control-dependent drift and diffusion.

  12. MIOFlow 2.0: A unified framework for inferring cellular stochastic dynamics from single cell and spatial transcriptomics data

    cs.LG 2026-03 unverdicted novelty 6.0

    MIOFlow 2.0 learns stochastic cellular trajectories from transcriptomics data via neural SDEs, unbalanced optimal transport for growth, and a joint latent space unifying gene expression with spatial features.

  13. A probabilistic framework for crystal structure denoising, phase classification, and order parameters

    cond-mat.mtrl-sci 2025-12 unverdicted novelty 6.0

    A unified probabilistic model uses per-atom logits over crystal prototypes to denoise atomic configurations, classify phases, and derive order parameters from a single differentiable scalar field.

  14. EnScale: Temporally-consistent multivariate generative downscaling via proper scoring rules

    physics.ao-ph 2025-09 unverdicted novelty 6.0

    EnScale emulates high-resolution regional climate model outputs from global circulation models for multiple variables using a two-step generative process with sparse local stochastic layers and energy score optimizati...

  15. Shap-E: Generating Conditional 3D Implicit Functions

    cs.CV 2023-05 accept novelty 6.0

    Shap-E encodes 3D assets into implicit function parameters then uses a conditional diffusion model to generate new ones from text, enabling fast multi-representation 3D asset creation.

  16. HuggingFace's Transformers: State-of-the-art Natural Language Processing

    cs.CL 2019-10 accept novelty 6.0

    Hugging Face releases an open-source Python library that supplies a unified API and pretrained weights for major Transformer architectures used in natural language processing.

  17. Scaling Properties of Continuous Diffusion Spoken Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    Continuous diffusion spoken language models follow scaling laws for loss and phoneme divergence and generate emotive multi-speaker speech at 16B scale, though long-form coherence stays difficult.

  18. Rethinking the Diffusion Model from a Langevin Perspective

    cs.LG 2026-04 unverdicted novelty 5.0

    Diffusion models are reorganized under a Langevin perspective that unifies ODE and SDE formulations and shows flow matching is equivalent to denoising under maximum likelihood.

  19. A Review of Diffusion-based Simulation-Based Inference: Foundations and Applications in Non-Ideal Data Scenarios

    cs.LG 2025-12 accept novelty 2.0

    A synthesis of diffusion-based simulation-based inference methods that address model misspecification, irregular observations, and missing data in scientific applications.
