pith. machine review for the scientific record.

arxiv: 1801.01401 · v5 · submitted 2018-01-04 · 📊 stat.ML · cs.LG

Demystifying MMD GANs

Pith reviewed 2026-05-15 01:03 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords MMD GAN · Wasserstein GAN · gradient bias · unbiased estimators · Kernel Inception Distance · integral probability metrics · generative adversarial networks · kernel choice

The pith

Gradient estimators for MMD GANs and Wasserstein GANs are unbiased, but finite-sample discriminators bias the generator updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper clarifies the bias picture in MMD-based generative adversarial networks and related Wasserstein models. It establishes that the gradient estimators applied during optimization remain unbiased for both the MMD critic and the Wasserstein critic. At the same time, training the discriminator itself on finite samples produces biased gradients with respect to the generator parameters. This distinction matters for practitioners because it explains sources of instability and points toward simpler network choices that still match performance. The work further shows that MMD GANs can adopt smaller critic networks than Wasserstein GANs, yielding faster training, and introduces the Kernel Inception Distance as a convergence diagnostic that can adapt learning rates on the fly.

Core claim

We show that gradient estimators used in the optimization process for both MMD GANs and Wasserstein GANs are unbiased, but learning a discriminator based on samples leads to biased gradients for the generator parameters. We also discuss the issue of kernel choice for the MMD critic, and characterize the kernel corresponding to the energy distance used for the Cramer GAN critic. Being an integral probability metric, the MMD benefits from training strategies recently developed for Wasserstein GANs. In experiments, the MMD GAN is able to employ a smaller critic network than the Wasserstein GAN, resulting in a simpler and faster-training algorithm with matching performance. We also propose an improved measure of GAN convergence, the Kernel Inception Distance, and show how to use it to dynamically adapt learning rates during GAN training.

What carries the argument

The MMD critic whose gradient estimators are shown to be unbiased when the kernel is fixed, together with the finite-sample bias that appears once the discriminator is learned from data.
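The contrast between a fixed kernel and a kernel chosen from the same finite sample can be illustrated with a toy experiment. The sketch below is not the paper's experiment: the Gaussian kernel, the bandwidth grid, and the sample sizes are all assumptions made for illustration, and it looks at the value of the estimate rather than its gradient. Both samples come from the same distribution, so the true squared MMD is zero; the fixed-bandwidth U-statistic averages to roughly zero, while picking the largest estimate over bandwidths on each sample pushes the average upward.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    # Pairwise Gaussian kernel matrix between the rows of a and the rows of b.
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def mmd2_unbiased(x, y, bandwidth):
    # U-statistic estimate of MMD^2 for a FIXED kernel (diagonal terms excluded).
    m, n = len(x), len(y)
    kxx = gaussian_kernel(x, x, bandwidth)
    kyy = gaussian_kernel(y, y, bandwidth)
    kxy = gaussian_kernel(x, y, bandwidth)
    term_xx = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_yy = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * kxy.mean()

rng = np.random.default_rng(0)
bandwidths = [0.25, 0.5, 1.0, 2.0, 4.0]  # hypothetical grid, standing in for a learned critic
fixed, selected = [], []
for _ in range(300):
    x = rng.normal(size=(32, 2))
    y = rng.normal(size=(32, 2))  # same distribution as x, so the true MMD^2 is 0
    estimates = [mmd2_unbiased(x, y, b) for b in bandwidths]
    fixed.append(estimates[2])        # kernel chosen before seeing the data
    selected.append(max(estimates))   # kernel chosen on the same sample

mean_fixed = float(np.mean(fixed))
mean_selected = float(np.mean(selected))
print(mean_fixed, mean_selected)
```

Selecting among five bandwidths is a crude stand-in for training a critic network, but the mechanism is the same: optimizing the critic on the sample that is also used to evaluate it breaks the unbiasedness that holds for a fixed kernel.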

If this is right

  • MMD GANs can use smaller critic networks than Wasserstein GANs while achieving matching performance.
  • Training strategies developed for Wasserstein GANs transfer directly to MMD GANs because both rely on integral probability metrics.
  • The Kernel Inception Distance can serve as a dynamic learning-rate scheduler during GAN training.
  • The kernel corresponding to the energy distance is explicitly characterized, allowing direct comparison between Cramer GAN and MMD GAN critics.
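
The link between the energy distance and the MMD can be checked numerically. The sketch below assumes the kernel in question is the distance-induced kernel k(x, y) = 0.5 * (rho(x, z0) + rho(y, z0) - rho(x, y)) of Sejdinovic et al. (2013) with rho the Euclidean distance; the base point `z0`, the function names, and the sample data are hypothetical. With plug-in estimates on both sides, the energy distance equals twice the squared MMD for any choice of `z0`.

```python
import numpy as np

def pairwise_dist(a, b):
    # Euclidean distances rho(a_i, b_j) between the rows of a and the rows of b.
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt(np.sum(diff**2, axis=-1))

def distance_kernel(a, b, z0):
    # Distance-induced kernel: k(a, b) = 0.5 * (rho(a, z0) + rho(b, z0) - rho(a, b)).
    da = np.linalg.norm(a - z0, axis=1)[:, None]
    db = np.linalg.norm(b - z0, axis=1)[None, :]
    return 0.5 * (da + db - pairwise_dist(a, b))

def mmd2_plugin(x, y, z0):
    # Plug-in (V-statistic) estimate of MMD^2 under the distance-induced kernel.
    return (distance_kernel(x, x, z0).mean()
            + distance_kernel(y, y, z0).mean()
            - 2.0 * distance_kernel(x, y, z0).mean())

def energy_distance_plugin(x, y):
    # Plug-in energy distance: 2 E rho(X, Y) - E rho(X, X') - E rho(Y, Y').
    return (2.0 * pairwise_dist(x, y).mean()
            - pairwise_dist(x, x).mean()
            - pairwise_dist(y, y).mean())

rng = np.random.default_rng(1)
x = rng.normal(size=(50, 3))
y = rng.normal(loc=0.5, size=(60, 3))
z0 = np.array([2.0, -1.0, 0.3])  # arbitrary base point; the identity does not depend on it

lhs = energy_distance_plugin(x, y)
rhs = 2.0 * mmd2_plugin(x, y, z0)
print(lhs, rhs)
```

The terms involving `z0` cancel in the MMD, which is why the base point can be chosen freely.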

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The sample-induced bias identified here may be one concrete mechanism behind the well-known instability of many GAN training runs.
  • Similar unbiasedness proofs could be attempted for other integral probability metric critics, potentially unifying design rules across a wider family of GAN variants.
  • Adaptive use of the Kernel Inception Distance might improve convergence monitoring in non-image generative tasks where FID-style metrics are unavailable.

Load-bearing premise

The theoretical unbiasedness of the critic gradients assumes the kernel is fixed and positive definite, and that any remaining finite-sample bias does not dominate other optimization difficulties.

What would settle it

Train an MMD GAN critic on an effectively infinite data set and verify whether the observed generator gradients exactly match the closed-form unbiased estimator derived in the paper.

Original abstract

We investigate the training and performance of generative adversarial networks using the Maximum Mean Discrepancy (MMD) as critic, termed MMD GANs. As our main theoretical contribution, we clarify the situation with bias in GAN loss functions raised by recent work: we show that gradient estimators used in the optimization process for both MMD GANs and Wasserstein GANs are unbiased, but learning a discriminator based on samples leads to biased gradients for the generator parameters. We also discuss the issue of kernel choice for the MMD critic, and characterize the kernel corresponding to the energy distance used for the Cramer GAN critic. Being an integral probability metric, the MMD benefits from training strategies recently developed for Wasserstein GANs. In experiments, the MMD GAN is able to employ a smaller critic network than the Wasserstein GAN, resulting in a simpler and faster-training algorithm with matching performance. We also propose an improved measure of GAN convergence, the Kernel Inception Distance, and show how to use it to dynamically adapt learning rates during GAN training.
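
A minimal sketch of the Kernel Inception Distance follows, assuming it is computed as an unbiased squared-MMD estimate with the cubic polynomial kernel k(x, y) = (x·y / d + 1)^3 on Inception features, where d is the feature dimension. The feature arrays here are random stand-ins rather than real Inception activations, the function names are hypothetical, and any averaging over subsets of the data is omitted.

```python
import numpy as np

def polynomial_kernel(a, b):
    # Cubic polynomial kernel, scaled by the feature dimension d.
    d = a.shape[1]
    return (a @ b.T / d + 1.0) ** 3

def kid(real_feats, fake_feats):
    # Unbiased (U-statistic) estimate of MMD^2 between two sets of feature vectors.
    m, n = len(real_feats), len(fake_feats)
    k_rr = polynomial_kernel(real_feats, real_feats)
    k_ff = polynomial_kernel(fake_feats, fake_feats)
    k_rf = polynomial_kernel(real_feats, fake_feats)
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_ff = (k_ff.sum() - np.trace(k_ff)) / (n * (n - 1))
    return term_rr + term_ff - 2.0 * k_rf.mean()

rng = np.random.default_rng(2)
real = rng.normal(size=(200, 16))               # stand-in for features of real images
fake_same = rng.normal(size=(200, 16))          # generator matching the data distribution
fake_shifted = rng.normal(size=(200, 16)) + 1.0  # generator with a shifted feature mean

kid_same = kid(real, fake_same)
kid_shifted = kid(real, fake_shifted)
print(kid_same, kid_shifted)
```

Because the estimator is unbiased, it can come out slightly negative when the two distributions match, which is expected behaviour rather than a bug.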

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper investigates MMD GANs and provides a theoretical clarification that gradient estimators for both MMD GANs and Wasserstein GANs are unbiased when the critic is fixed, while learning a discriminator from samples induces bias in the generator gradients. It discusses kernel choice for the MMD critic, characterizes the kernel for the energy distance used in Cramer GANs, and proposes the Kernel Inception Distance (KID) as an improved convergence measure that can be used to adapt learning rates dynamically. Experiments show that MMD GANs match the performance of WGANs while using smaller critic networks, resulting in simpler and faster training.

Significance. If the central distinction between population-level unbiasedness (via U-statistics for fixed positive-definite kernels) and finite-sample bias holds, the work offers a useful clarification of gradient issues in integral probability metric GANs, extending prior WGAN results with an independent derivation. The empirical finding that smaller critics suffice and the introduction of KID for practical training provide concrete value for the field.

major comments (2)
  1. [Abstract and theoretical analysis] Abstract and theoretical section: the claim that gradient estimators are unbiased for fixed-critic MMD relies on interchanging gradient and expectation under a fixed positive-definite kernel. When the critic is a neural network, the effective kernel depends on critic parameters; the paper should explicitly state whether critic parameters are held fixed during the generator gradient computation and provide the precise conditions under which the interchange remains valid.
  2. [Experiments] Experiments section: the claim of matching performance with smaller networks is central to the practical contribution, yet no variance across random seeds, multiple runs, or statistical significance tests are reported. This makes it difficult to assess whether the observed equivalence is robust or could be due to training variability.
minor comments (2)
  1. [Abstract] The abstract introduces KID without a one-sentence definition; adding a brief parenthetical description would improve readability.
  2. [Kernel discussion] In the kernel characterization for the energy distance, ensure the final kernel expression is numbered as an equation and the derivation steps are clearly separated from surrounding text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. We address each major comment below and will incorporate clarifications and additional reporting in the revised version.

Point-by-point responses
  1. Referee: [Abstract and theoretical analysis] Abstract and theoretical section: the claim that gradient estimators are unbiased for fixed-critic MMD relies on interchanging gradient and expectation under a fixed positive-definite kernel. When the critic is a neural network, the effective kernel depends on critic parameters; the paper should explicitly state whether critic parameters are held fixed during the generator gradient computation and provide the precise conditions under which the interchange remains valid.

    Authors: We thank the referee for this observation. In the standard alternating optimization used for MMD GANs (and WGANs), the critic parameters are held fixed during the generator update step; only the generator parameters are optimized while the kernel induced by the current critic remains constant. Under this fixed-kernel regime the interchange of gradient and expectation is justified by the dominated convergence theorem for the bounded continuous functions arising from a positive-definite kernel. We will add an explicit paragraph in the theoretical section stating these conditions and confirming that the critic is frozen during generator gradient computation. revision: yes

  2. Referee: [Experiments] Experiments section: the claim of matching performance with smaller networks is central to the practical contribution, yet no variance across random seeds, multiple runs, or statistical significance tests are reported. This makes it difficult to assess whether the observed equivalence is robust or could be due to training variability.

    Authors: We agree that the absence of variance estimates and statistical tests weakens the empirical claim. Although the reported runs were performed with multiple random seeds and produced qualitatively consistent results, we did not include standard deviations or significance tests in the original manuscript. In the revision we will add error bars computed over at least five independent seeds for the key FID/KID curves and include a brief discussion of statistical significance for the observed performance parity between the smaller MMD critic and the larger WGAN critic. revision: yes

Circularity Check

0 steps flagged

No significant circularity: independent derivation of bias properties

Full rationale

The paper's central claims rest on standard properties of U-statistics for the MMD estimator and the ability to interchange gradient and expectation when the kernel is fixed and positive definite. The distinction between population-level unbiasedness of the gradient estimator and finite-sample bias induced by learning the critic is derived directly from these properties without reducing to fitted parameters, self-definitions, or load-bearing self-citations. Training strategies are borrowed from WGAN literature (non-overlapping authors) but the MMD-specific bias analysis is presented as an independent contribution. No step in the provided derivation chain collapses by construction to its inputs.
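
The two properties named in the rationale can be written out as follows. This is a sketch in assumed notation, not taken from the paper: x_i are m real samples from P, y_j = G_\theta(z_j) are n generator samples from Q_\theta, and k is a fixed kernel.

```latex
% Unbiased (U-statistic) estimator of the squared MMD for a fixed kernel k
\widehat{\mathrm{MMD}}_u^2(X, Y)
  = \frac{1}{m(m-1)} \sum_{i \neq j} k(x_i, x_j)
  + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j)
  - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(x_i, y_j),
\qquad
\mathbb{E}\!\left[\widehat{\mathrm{MMD}}_u^2(X, Y)\right]
  = \mathrm{MMD}^2(P, Q_\theta).

% If differentiation and expectation can be interchanged (critic held fixed),
% the sample gradient is an unbiased estimate of the population gradient:
\mathbb{E}\!\left[\nabla_\theta \widehat{\mathrm{MMD}}_u^2(X, Y)\right]
  = \nabla_\theta \, \mathbb{E}\!\left[\widehat{\mathrm{MMD}}_u^2(X, Y)\right]
  = \nabla_\theta \, \mathrm{MMD}^2(P, Q_\theta).
```

The second line holds only with the kernel held fixed; once the kernel itself is fitted to the same samples, the expectation no longer factors this way.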

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The work relies on standard properties of MMD as an integral probability metric and prior results on Wasserstein GAN training strategies; kernel choice is discussed but treated as a design decision rather than a fitted parameter.

free parameters (1)
  • kernel bandwidth or choice
    Discussed as important for MMD critic performance but no specific fitted values reported in abstract.
axioms (1)
  • domain assumption: MMD is an integral probability metric benefiting from WGAN training strategies
    Invoked to justify using existing WGAN techniques with MMD GANs.
invented entities (1)
  • Kernel Inception Distance (no independent evidence)
    purpose: Measure of GAN convergence using kernel on Inception features
    Newly proposed metric to dynamically adapt learning rates.
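
How a convergence metric might drive the learning rate can be sketched with a simple patience rule. The abstract does not spell out the adaptation rule, so everything here is an assumption and may differ from the authors' procedure: the function name, the `patience` and `decay` values, and the use of a plain threshold rather than any statistical test are all hypothetical. The rule decays the learning rate when the most recent KID evaluations fail to improve on the best earlier value.

```python
def adapt_learning_rate(lr, kid_history, patience=3, decay=0.8, min_improvement=0.0):
    """Return a possibly decayed learning rate given past KID evaluations.

    kid_history holds KID values in evaluation order (lower is better).
    patience, decay, and min_improvement are illustrative defaults only.
    """
    # Not enough evaluations yet to judge whether training has stalled.
    if len(kid_history) <= patience:
        return lr
    best_before = min(kid_history[:-patience])
    recent = kid_history[-patience:]
    # Decay only if none of the recent evaluations beat the earlier best.
    if all(k >= best_before - min_improvement for k in recent):
        return lr * decay
    return lr

# KID still improving: learning rate is left alone.
lr_improving = adapt_learning_rate(1e-4, [0.5, 0.4, 0.3, 0.2])
# KID stalled for three evaluations: learning rate is decayed.
lr_stalled = adapt_learning_rate(1e-4, [0.5, 0.2, 0.25, 0.3, 0.22])
print(lr_improving, lr_stalled)
```

A raw comparison like this is sensitive to noise in the KID estimate; a rule based on a significance test would be less prone to decaying on random fluctuations.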

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI

    cs.CV 2021-09 accept novelty 8.0

    HM3D offers 1000 building-scale 3D environments that are larger and higher-fidelity than existing datasets, enabling better-performing embodied AI agents for tasks like PointGoal navigation.

  2. DirectTryOn: One-Step Virtual Try-On via Straightened Conditional Transport

    cs.CV 2026-05 unverdicted novelty 7.0

    DirectTryOn achieves state-of-the-art one-step virtual try-on performance by applying pure conditional transport, garment preservation loss, and self-consistency loss to straighten trajectories in pretrained generativ...

  3. STRIDE: Training-Free Diversity Guidance via PCA-Directed Feature Perturbation in Single-Step Diffusion Models

    cs.CV 2026-05 unverdicted novelty 7.0

    STRIDE boosts diversity in one-step diffusion models by injecting PCA-aligned pink noise into transformer features while preserving text alignment and quality.

  4. Active Sampling for Ultra-Low-Bit-Rate Video Compression via Conditional Controlled Diffusion

    cs.CV 2026-05 unverdicted novelty 7.0

    ActDiff-VC achieves up to 64.6% bitrate reduction at matched NIQE and improves perceptual metrics like KID and FID by using content-adaptive keyframe selection and budget-aware sparse trajectory selection to condition...

  5. Faithful Extreme Image Rescaling with Learnable Reversible Transformation and Semantic Priors

    cs.CV 2026-05 unverdicted novelty 7.0

    FaithEIR combines learnable reversible latent transformations, an adaptive high-frequency detail prior, and semantic conditioning to outperform prior methods in fidelity and perceptual quality for extreme image rescaling.

  6. OccDirector: Language-Guided Behavior and Interaction Generation in 4D Occupancy Space

    cs.CV 2026-04 unverdicted novelty 7.0

    OccDirector uses a VLM-guided Spatio-Temporal MMDiT model with history anchoring to generate physically plausible 4D occupancy from language scripts, supported by the new OccInteract-85k dataset.

  7. FIT: A Large-Scale Dataset for Fit-Aware Virtual Try-On

    cs.CV 2026-04 unverdicted novelty 7.0

    FIT is a large-scale dataset of 1.13M try-on triplets with exact size data plus a synthetic generation pipeline that enables training of virtual try-on models capable of depicting realistic garment fit including ill-f...

  8. Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off

    cs.CV 2026-03 unverdicted novelty 7.0

    Dress-ED is the first large-scale benchmark unifying virtual try-on, try-off, and text-guided garment editing with 146k verified samples plus a multimodal diffusion baseline.

  9. Diffusion Posterior Sampling for General Noisy Inverse Problems

    stat.ML 2022-09 unverdicted novelty 7.0

    Diffusion models solve noisy (non)linear inverse problems via approximated posterior sampling that blends diffusion steps with manifold gradients without strict consistency projection.

  10. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

    cs.CV 2021-08 conditional novelty 7.0

    SDEdit performs guided image synthesis and editing by adding noise to inputs and refining them via denoising with a diffusion model's SDE prior, outperforming GAN methods in human studies without task-specific training.

  11. TOPOS: High-Fidelity and Efficient Industry-Grade 3D Head Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    TOPOS creates high-fidelity 3D heads with fixed industry topology from single images via a specialized VAE with Perceiver Resampler and a rectified flow transformer.

  12. CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis

    cs.CV 2026-05 unverdicted novelty 6.0

    CRAFT adapts diffusion models to medical images via clinical reward alignment from LLMs and VLMs, improving alignment scores and cutting low-quality generations by 20.4% on average across modalities.

  13. Score-Based Generative Modeling through Anisotropic Stochastic Partial Differential Equations

    cs.CE 2026-05 unverdicted novelty 6.0

    Anisotropic SPDEs preserve geometric data structure over longer timescales in score-based generative modeling, yielding better image quality than standard SDE baselines and flow matching in unconditional and condition...

  14. Stylistic Attribute Control in Latent Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    A technique for parametric stylistic control in latent diffusion models learns disentangled directions from synthetic datasets and applies them via guidance composition while preserving semantics.

  15. InpaintSLat: Inpainting Structured 3D Latents via Initial Noise Optimization

    cs.CV 2026-05 unverdicted novelty 6.0

    Optimizing initial noise via backpropagation approximation and spectral parameterization in structured 3D latent diffusion yields higher contextual consistency and prompt alignment in training-free inpainting.

  16. FashionStylist: An Expert Knowledge-enhanced Multimodal Dataset for Fashion Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    FashionStylist is an expert-annotated benchmark dataset that unifies outfit-to-item grounding, completion, and evaluation tasks for multimodal large language models in fashion.

  17. One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control

    cs.CV 2026-03 unverdicted novelty 6.0

    O2MAG generates high-fidelity text-guided anomalies from a single image without training by manipulating self-attention in diffusion models with anomaly masks and dual enhancements.

  18. CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation

    physics.ins-det 2026-05 unverdicted novelty 5.0

    CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...

  19. SAMIC: A Lightweight Semantic-Aware Mamba for Efficient Perceptual Image Compression

    cs.CV 2026-05 unverdicted novelty 5.0

    SAMIC introduces semantic-aware Mamba blocks and SVD-based redundancy reduction to achieve efficient perceptual image compression with improved rate-distortion-perception tradeoffs.

  20. Learning to Emulate Chaos: Adversarial Optimal Transport Regularization

    stat.ML 2026-04 unverdicted novelty 5.0

    Adversarial optimal transport objectives train neural emulators with improved long-term statistical fidelity on chaotic systems.

  21. LoRaQ: Optimized Low Rank Approximation for 4-bit Quantization

    cs.LG 2026-04 unverdicted novelty 5.0

    LoRaQ enables fully sub-16-bit quantized diffusion models by optimizing low-rank error compensation in a data-free way, outperforming prior methods at equal memory cost on Pixart-Σ and SANA while supporting mixed low-...

  22. Protecting and Preserving Protest Dynamics for Responsible Analysis

    cs.CV 2026-04 unverdicted novelty 5.0

    A responsible computing framework substitutes real protest imagery with labeled synthetic reproductions from conditional image synthesis to enable privacy-aware analysis of collective action patterns.
