Large Scale GAN Training for High Fidelity Natural Image Synthesis
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-13 12:07 UTC · model grok-4.3
The pith
Scaling GAN training with orthogonal regularization and a truncation trick yields new state-of-the-art results on class-conditional ImageNet synthesis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors train Generative Adversarial Networks at the largest scale yet attempted and find that orthogonal regularization applied to the generator makes the model amenable to a truncation trick that reduces the variance of its latent input. This enables explicit control over the fidelity-variety trade-off and yields new state-of-the-art class-conditional image synthesis, with BigGANs reaching an Inception Score of 166.5 and a Frechet Inception Distance of 7.4 on 128x128 ImageNet images.
What carries the argument
The truncation trick, which reduces variance of the generator's latent input to improve fidelity at the cost of variety, enabled by orthogonal regularization on the generator weights.
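Per the abstract, the trick reduces the variance of the generator's input by resampling latent components whose magnitude exceeds a threshold. A minimal numpy sketch of that rejection-resampling scheme; the function and parameter names are ours, not the paper's:

```python
import numpy as np

def truncated_z(batch_size, dim, threshold, rng=None):
    """Sample z ~ N(0, I), resampling every component whose magnitude
    exceeds `threshold`. Smaller thresholds shrink latent variance,
    trading sample variety for fidelity."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal((batch_size, dim))
    mask = np.abs(z) > threshold
    while mask.any():
        z[mask] = rng.standard_normal(mask.sum())
        mask = np.abs(z) > threshold
    return z
```

Feeding such truncated noise to a generator trained on untruncated noise is the post-training control the review highlights; orthogonal regularization is what keeps the generator well-behaved on these lower-variance inputs.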
If this is right
- Class-conditional image generation at 128x128 resolution reaches substantially higher fidelity than earlier methods.
- Orthogonal regularization stabilizes large-scale GAN training enough to allow simple post-training control of output statistics (a sketch of the regularizer follows this list).
- The same scaling and regularization approach can be applied to other datasets and resolutions while preserving the truncation control.
- Higher metric scores translate into visibly more realistic and varied samples from complex distributions.
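On that regularizer: the paper's variant penalizes only the off-diagonal entries of the filter Gram matrix, R_β(W) = β ‖WᵀW ⊙ (1 − I)‖²_F, so filters decorrelate without their norms being pinned. A minimal PyTorch sketch under the assumption that conv kernels are flattened to one row per output filter; this is our rendering, not the authors' code:

```python
import torch

def ortho_penalty(weight: torch.Tensor, beta: float = 1e-4) -> torch.Tensor:
    """Off-diagonal orthogonal regularization: penalize correlations
    between distinct filters while leaving their norms unconstrained."""
    w = weight.reshape(weight.size(0), -1)  # one row per output filter
    gram = w @ w.t()                        # pairwise filter inner products
    mask = 1.0 - torch.eye(w.size(0), device=w.device, dtype=w.dtype)
    return beta * (gram * mask).pow(2).sum()
```

The penalty is summed over generator weights and added to the GAN loss; the paper reports a small coefficient (β = 1e-4) as the setting that worked at scale.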
Where Pith is reading between the lines
- The truncation mechanism may generalize to unconditional generation or other generative architectures if similar regularization is applied.
- Controlled trade-offs between fidelity and diversity could support targeted use cases such as data augmentation for downstream vision tasks.
- Further increases in model scale combined with the same regularization might continue to improve results until new instabilities appear.
Load-bearing premise
The reported Inception Score and Frechet Inception Distance on standard ImageNet splits reliably reflect human-perceived image quality and diversity.
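For concreteness, the FID in this premise is the Frechet distance between two Gaussians fitted to Inception-pool activations of real and generated images. A self-contained numpy/scipy sketch, assuming the activation means and covariances are precomputed:

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)): the FID formula
    applied to Gaussian fits of real vs. generated Inception activations."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    covmean = covmean.real  # drop negligible imaginary parts from sqrtm
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Lower is better; the premise is that this distance, together with the Inception Score, tracks human judgments of quality and diversity.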
What would settle it
A direct human preference study in which raters consistently judge samples from prior GANs as higher quality than BigGAN samples at the same resolution.
read the original abstract
Despite recent progress in generative image modeling, successfully generating high-resolution, diverse samples from complex datasets such as ImageNet remains an elusive goal. To this end, we train Generative Adversarial Networks at the largest scale yet attempted, and study the instabilities specific to such scale. We find that applying orthogonal regularization to the generator renders it amenable to a simple "truncation trick," allowing fine control over the trade-off between sample fidelity and variety by reducing the variance of the Generator's input. Our modifications lead to models which set the new state of the art in class-conditional image synthesis. When trained on ImageNet at 128x128 resolution, our models (BigGANs) achieve an Inception Score (IS) of 166.5 and Frechet Inception Distance (FID) of 7.4, improving over the previous best IS of 52.52 and FID of 18.6.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents BigGAN, a class-conditional GAN trained at unprecedented scale on ImageNet. Architectural scaling, orthogonal regularization applied to the generator, and a truncation trick on the latent input are introduced to mitigate scale-induced instabilities; the resulting models report a new state-of-the-art Inception Score of 166.5 and an FID of 7.4 at 128×128 resolution, improving on prior bests of 52.52 and 18.6.
Significance. If the empirical results hold, the work demonstrates that targeted scaling combined with simple stabilization techniques can substantially advance high-fidelity, diverse image synthesis on complex datasets, providing both new benchmarks and practical insights into training dynamics at large scale. The ablations on regularization and truncation add concrete value for the community.
major comments (2)
- Experiments section (Table 1 and associated text): the IS of 166.5 and FID of 7.4 are reported from single training runs, without error bars, standard deviations, or results across multiple random seeds; given the well-known stochasticity of GAN optimization, this weakens the strict SOTA claim.
- §3.2 (truncation trick): the quantitative trade-off curves between fidelity and diversity are shown only for a single truncation threshold schedule; additional analysis of sensitivity to the precise threshold value, and of its interaction with the orthogonal regularization term, is needed to confirm the robustness of the reported gains.
minor comments (2)
- [Abstract and §4] The abstract and §4 could explicitly state the total parameter count and approximate compute (GPU-hours) used for the largest models to contextualize the 'largest scale yet attempted' claim.
- [Figures 3-5] Figure captions for generated samples should include the exact truncation threshold and class-conditioning details used for each panel to improve reproducibility of the visual results.
Simulated Author's Rebuttal
We thank the referee for their positive evaluation and recommendation for minor revision. We address the major comments point by point below, with proposed revisions where appropriate.
read point-by-point responses
- Referee: Experiments section (Table 1 and associated text): the IS of 166.5 and FID of 7.4 are reported from single training runs, without error bars, standard deviations, or results across multiple random seeds; given the well-known stochasticity of GAN optimization, this weakens the strict SOTA claim.
  Authors: We acknowledge that reporting results from single training runs limits the ability to quantify variability, and that multiple seeds would provide stronger evidence for the SOTA claims. Training at this scale is extremely resource-intensive, which constrained our ability to run extensive replicates. However, the reported gains (IS from 52.52 to 166.5, FID from 18.6 to 7.4) are large relative to prior work, and our ablations demonstrate consistent benefits from the proposed techniques. In the revised manuscript we will add a paragraph in the Experiments section explicitly noting the single-run limitation and the potential for stochastic variation in GAN training.
  Revision: partial.
- Referee: §3.2 (truncation trick): the quantitative trade-off curves between fidelity and diversity are shown only for a single truncation threshold schedule; additional analysis of sensitivity to the precise threshold value, and of its interaction with the orthogonal regularization term, is needed to confirm the robustness of the reported gains.
  Authors: We agree that further analysis would strengthen the presentation of the truncation trick. The curves in §3.2 use the schedule we found most effective after initial tuning, but we can expand the section to include additional threshold values and a short discussion of how orthogonal regularization interacts with truncation (the regularization stabilizes the generator under reduced-variance inputs). We will revise §3.2 to incorporate sensitivity plots and a brief robustness note.
  Revision: yes.
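The sensitivity question has a simple quantitative core: the truncation threshold directly determines how much latent variance survives. A self-contained scipy sketch of that relationship (the threshold values here are illustrative, not the paper's schedule):

```python
from scipy.stats import truncnorm

# Standard deviation of a standard normal truncated to [-t, t]:
# the quantity the truncation threshold controls in §3.2.
for t in (0.3, 0.5, 1.0, 1.5, 2.0):
    print(f"threshold={t:.1f}  latent std={truncnorm.std(-t, t):.3f}")
```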
Circularity Check
No significant circularity
full rationale
The paper is an empirical study of large-scale GAN training rather than a derivation of new theoretical results. It reports experimental outcomes from training BigGAN models on ImageNet, including measured IS of 166.5 and FID of 7.4, supported by ablations on orthogonal regularization and the truncation trick. These metrics are standard, externally defined, and evaluated on fixed dataset splits independent of the paper's own equations or fitted parameters. No load-bearing step reduces by construction to self-definition, fitted inputs renamed as predictions, or self-citation chains; the central claims rest on reproducible training runs and standard evaluation protocols.
Axiom & Free-Parameter Ledger
free parameters (2)
- truncation threshold
- generator and discriminator channel multipliers
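For illustration, the two ledger entries are the kind of knobs that would surface in a training or sampling configuration; the names and values below are hypothetical, not taken from the authors' code:

```python
# Hypothetical configuration; names and values are illustrative only.
config = {
    "truncation_threshold": 0.5,  # latent truncation applied at sampling time
    "channel_multiplier": 96,     # width of G and D blocks (capacity knob)
}
```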
Lean theorems connected to this paper
- IndisputableMonolith.Cost.FunctionalEquation.washburn_uniqueness_aczel (tagged: unclear)
  Unclear: relation between the paper passage and the cited Recognition theorem. Linked passage:
  "Our modifications lead to models which set the new state of the art in class-conditional image synthesis. When trained on ImageNet at 128x128 resolution, our models (BigGANs) achieve an Inception Score (IS) of 166.5 and Frechet Inception Distance (FID) of 7.4"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 28 Pith papers
- Prompt-to-Prompt Image Editing with Cross Attention Control
  Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.
- One-Step Generative Modeling via Wasserstein Gradient Flows
  W-Flow achieves state-of-the-art one-step ImageNet 256x256 generation at 1.29 FID by training a static neural network to follow a Wasserstein gradient flow that minimizes Sinkhorn divergence, delivering roughly 100x f...
- SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation
  SurFITR is a new collection of 137k+ surveillance-style forged images that causes existing detectors to degrade while enabling substantial gains when used for training in both in-domain and cross-domain settings.
- Less is More: Recursive Reasoning with Tiny Networks
  TRM with 7M parameters achieves 45% accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, surpassing most LLMs with under 0.01% of their parameters.
- One Step Diffusion via Shortcut Models
  Shortcut models enable high-quality single or few-step sampling in diffusion models with one network and training phase by conditioning on desired step size.
- Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
  Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
  Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
  A 3.5-billion-parameter diffusion model with classifier-free guidance generates images preferred over DALL-E by human raters and can be fine-tuned for text-guided inpainting.
- Diffusion Models Beat GANs on Image Synthesis
  Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.
- Reduce the Artifacts Bias for More Generalizable AI-Generated Image Detection
  SEF introduces GAN upsampling for diverse artifacts and expert fusion to reduce domain interference, yielding stronger generalization on 13 benchmarks for AI-generated image detection.
- Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition
  Fashion130K dataset and UMC framework align text and visual prompts to generate more consistent fashion outfits than prior state-of-the-art methods.
- Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition
  Fashion130K dataset and UMC framework align text and visual prompts with embedding refiner, Fusion Transformer, and redesigned attention to generate more consistent outfits than prior methods.
- Reference-based Category Discovery: Unsupervised Object Detection with Category Awareness
  RefCD enables unsupervised category-aware object detection by using feature similarity between predicted objects and unlabeled reference images to guide category learning.
- Intermediate Representations are Strong AI-Generated Image Detectors
  Intermediate layer embedding sensitivity to perturbations distinguishes AI-generated images from real ones, yielding higher AUROC on GenImage and Forensics Small benchmarks than prior methods.
- Efficient Diffusion Distillation via Embedding Loss
  Embedding Loss aligns feature distributions via MMD in random network embeddings to boost one-step diffusion distillation, reaching SOTA FID of 1.475 on CIFAR-10 unconditional generation.
- Pairing Regularization for Mitigating Many-to-One Collapse in GANs
  Pairing regularization mitigates intra-mode collapse in GANs by penalizing redundant latent-to-sample mappings, improving recall under collapse-prone conditions or precision under stabilized training.
- Frequency-Aware Flow Matching for High-Quality Image Generation
  FreqFlow introduces frequency-aware conditioning and a two-branch architecture to flow matching, reaching FID 1.38 on ImageNet-256 and outperforming DiT and SiT.
- ELT: Elastic Looped Transformers for Visual Generation
  Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.
- Multimodal Large Language Models for Multi-Subject In-Context Image Generation
  MUSIC is the first MLLM for multi-subject in-context image generation that uses an automatic data pipeline, vision chain-of-thought reasoning, and semantics-driven spatial layout planning to outperform prior methods o...
- MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model
  MPDiT uses a hierarchical multi-patch design in transformers to lower computation in diffusion models by handling coarse global features first then fine local details, plus faster-converging embeddings.
- Language Models (Mostly) Know What They Know
  Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
- A General Language Assistant as a Laboratory for Alignment
  Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
- VideoGPT: Video Generation using VQ-VAE and Transformers
  VideoGPT generates competitive natural videos by learning discrete latents with VQ-VAE and modeling them autoregressively with a transformer.
- Information theoretic underpinning of self-supervised learning by clustering
  SSL clustering is derived as KL-divergence optimization where a teacher-distribution constraint normalizes via inverse cluster priors and simplifies to batch centering by Jensen's inequality.
- Micro-Defects Expose Macro-Fakes: Detecting AI-Generated Images via Local Distributional Shifts
  MDMF detects AI-generated images by learning patch-level forensic signatures and quantifying their distributional discrepancies with MMD, yielding larger separation than global methods when micro-defects are present.
- Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
  Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
- Frequency-Aware Semantic Fusion with Gated Injection for AI-generated Image Detection
  FGINet uses a band-masked frequency encoder and layer-wise gated injection to fuse frequency artifacts with vision foundation model semantics, plus hyperspherical compactness learning, to achieve better generalization...
- Adaptive Forensic Feature Refinement via Intrinsic Importance Perception
  I2P adaptively selects the most discriminative layers from visual foundation models for synthetic image detection and constrains task updates to low-sensitivity parameter subspaces to improve specificity without harmi...
Reference graph
Works this paper leans on
- [1] Shane Barratt and Rishi Sharma. A note on the Inception Score. arXiv preprint arXiv:1801.01973, 2018.
- [2] Marc G. Bellemare, Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, and Rémi Munos. The Cramer distance as a solution to biased Wasserstein gradients. arXiv preprint arXiv:1705.10743, 2017.
- [3] William Fedus, Mihaela Rosca, Balaji Lakshminarayanan, Andrew M. Dai, Shakir Mohamed, and Ian Goodfellow. Many paths to equilibrium: GANs do not need to decrease a divergence at every step. In ICLR, 2018.
- [4] Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. On convergence and stability of GANs. arXiv preprint arXiv:1705.07215, 2017.
- [5] Jae Hyun Lim and Jong Chul Ye. Geometric GAN. arXiv preprint arXiv:1705.02894, 2017.
- [6]
- [7] Marco Marchesi. Megapixel size image creation using generative adversarial networks. arXiv preprint arXiv:1706.00082, 2017.
- [8] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
- [9] Augustus Odena, Jacob Buckman, Catherine Olsson, Tom B. Brown, Christopher Olah, Colin Raffel, and Ian Goodfellow. Is generator conditioning causally related to GAN performance? In ICML, 2018.
- [10] Mathijs Pieters and Marco Wiering. Comparing generative adversarial network techniques for image creation and modification. arXiv preprint arXiv:1803.09093, 2018.
- [11] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15:1929-1958, 2014.
- [12] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.
- [13] Yasin Yazıcı, Chuan-Sheng Foo, Stefan Winkler, Kim-Hui Yap, Georgios Piliouras, and Vijay Chandrasekhar. The unusual effectiveness of averaging in GAN training. arXiv preprint arXiv:1806.04498, 2018.
- [14] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.